• Feeds

  • Posts Tagged ‘friendfeed’


    Friendfeed的MySQL key/value存储

    这是一篇2009年初的资料How FriendFeed uses MySQL to store schema-less data,相信大部分人已经看过了。如Fenng的中文介绍FriendFeed 使用 MySQL 的经验。本文从不同的角度再补充下。作者几个月前也曾经在广州技术沙龙作过一次Key value store漫谈的演讲,许多参会人员对key value方向存在强烈的使用意愿,但同时也对完全抛弃MySQL存在疑虑,本文介绍的方案也可以给这些人员一些架构参考。

    需求

    250M entities, entities表共有2.5亿条记录,当然是分库的。

    典型解决方案:RDBMS

    问题:由于业务需要不定期更改表结构,但是在2.5亿记录的表上增删字段、修改索引需要锁表,最长需要1小时到1天以上。

    Key value方案

    评估Document类型数据库,如CouchDB
    CouchDB问题: Performance? 广泛使用? 稳定性? 抗压性?

    MySQL方案

    MySQL相比Document store优点:

    • 不用担心丢数据或数据损坏
    • Replication
    • 非常熟悉它的特性及不足,知道如何解决

    结论

    综合取舍,使用MySQL来存储key/value(schema-less)数据,value中可以放:
    Python dict
    JSON object

    实际friendfeed存放的是zlib压缩的Python dict数据,当然这种绑定一种语言的做法具有争议性。

    表结构及Index设计模式

    feed数据基本上都存在entities表中,它的结构为

    mysql> desc entities;
    +----------+------------+------+-----+-------------------+----------------+
    | Field    | Type       | Null | Key | Default           | Extra          |
    +----------+------------+------+-----+-------------------+----------------+
    | added_id | int(11)    | NO   | PRI | NULL              | auto_increment |
    | id       | binary(16) | NO   | UNI |                   |                |
    | updated  | timestamp  | YES  | MUL | CURRENT_TIMESTAMP |                |
    | body     | mediumblob | YES  |     | NULL              |                |
    +----------+------------+------+-----+-------------------+----------------+

    假如里面存的数据如下

    {
    "id": "71f0c4d2291844cca2df6f486e96e37c",
    "user_id": "f48b0440ca0c4f66991c4d5f6a078eaf",
    "feed_id": "f48b0440ca0c4f66991c4d5f6a078eaf",
    "title": "We just launched a new backend system for FriendFeed!",
    "link": "http://friendfeed.com/e/71f0c4d2-2918-44cc-a2df-6f486e96e37c",
    "published": 1235697046,
    "updated": 1235697046,
    }

    如果要对link字段进行索引,则用另外一个表来存储。

    mysql> desc index_link;
    +-----------+--------------+------+-----+---------+-------+
    | Field     | Type         | Null | Key | Default | Extra |
    +-----------+--------------+------+-----+---------+-------+
    | link      | varchar(255) | NO   | PRI |         |       |
    | entity_id | binary(16)   | NO   | PRI |         |       |
    +-----------+--------------+------+-----+---------+-------+
    2 rows in set (0.00 sec)

    优点是

    • 增加索引时候只需要 1. CREATE TABLE,2.更新程序
    • 删除索引时候只需要 1. 程序停止写索引表(实际就是一个普通表),2. DROP TABLE 索引表

    这种索引方式也是一种值得借鉴的设计模式,特别是key value类型的数据需要索引其中的内容时。

    Ideas for creating a friendfeed like feed aggregator system

    I was invited to join a meeting about implement a friendfeed like feed system. Here are some ideas about requirement and architecture, which I typed on my BlackBerry during the meeting.

    1. Like the friendfeed, The product can import external RSS, so we separate the system into two parts. The rss crawl system and the feed pubsub system. The pubsub system has no responsbility to grab the feed from source. The feed itself only save feed summary and url.
    2. We decided to use the INBOX approach, which will push the published feed to be saved in all subscriber’s data table. More information of how this work can refer to Scaling a Microblogging Service – Part I.
    3. User’s homepage is an aggregation result. We choose to return a limited recent real-time date, no infinity pagination. But the user’s own feed(user’s profile page) may have a bigger date range.
    4. The unsubscibe logic have two choice, delete or keep the history data from one’s inbox. We decide to keep them.
    5. If the feed source had been deleted, do we need to delete all references in all subscriber’s inbox?  If need delete, each feed push to the pubsub system need to have a unique resource id. Another problem is after the source updated whether to publish a new feed or update the current feed?
    6. How to manage the group(QUN in Chinese) feed, deliver to all member’s inbox? Or share a group inbox?
    7. How to impl the feed comment logic, publish the comment to feed system or design a standalone comment system. We prefer to use a standalone comment system which doesn’t publish the comment back to the feed system.
    8. Every feed has a media type, such as text, video, image so the subscribe API can only retrieve a limited media type (text for mobile device). And a feed may have tags.
    9. The read/unread count is easy to implement. But the load is heavy. (QQ / QQzone may has such logic.)
    10. Need open API for 3rd party client(like twitter client), and RSS feed, may have OAuth integration.
    11. The storage may like friendfeed’s mysql schema (see How FriendFeed uses MySQL to store schema-less data) or use Amazon simpledb.
    12. May add support for XMPP Publish-Subscribe, or PEP(Personal Eventing Protocol) for pushing realtime time to users in the future.