• Feeds

  • Ideas for creating a friendfeed like feed aggregator system

    I was invited to a meeting about implementing a FriendFeed-like feed system. Here are some ideas about requirements and architecture, which I typed on my BlackBerry during the meeting.

    1. Like FriendFeed, the product can import external RSS, so we separate the system into two parts: the RSS crawl system and the feed pubsub system. The pubsub system has no responsibility for fetching feeds from their sources. A feed entry itself stores only a summary and a URL.
    2. We decided to use the INBOX approach, which pushes each published feed into the data table of every subscriber. For more information on how this works, see Scaling a Microblogging Service – Part I.
    3. The user’s homepage is an aggregation result. We chose to return a limited range of recent real-time data, with no infinite pagination. The user’s own feed (the profile page) may cover a bigger date range.
    4. The unsubscribe logic has two options: delete the history data from the user’s inbox, or keep it. We decided to keep it.
    5. If the feed source has been deleted, do we need to delete all references in every subscriber’s inbox? If so, each feed pushed to the pubsub system needs a unique resource ID. Another problem: after the source is updated, do we publish a new feed or update the current one?
    6. How do we manage group (QUN in Chinese) feeds: deliver to every member’s inbox, or share a group inbox?
    7. How do we implement the feed comment logic: publish comments to the feed system, or design a standalone comment system? We prefer a standalone comment system that doesn’t publish comments back to the feed system.
    8. Every feed has a media type, such as text, video, or image, so the subscribe API can retrieve only a limited set of media types (for example, text for mobile devices). A feed may also have tags.
    9. The read/unread count is easy to implement, but the load is heavy. (QQ / QQzone may have such logic.)
    10. We need an open API for third-party clients (like Twitter clients) and RSS feeds, possibly with OAuth integration.
    11. The storage may resemble FriendFeed’s MySQL schema (see How FriendFeed uses MySQL to store schema-less data) or use Amazon SimpleDB.
    12. We may add support for XMPP Publish-Subscribe, or PEP (Personal Eventing Protocol), to push real-time updates to users in the future.
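
    The INBOX approach and a few of the decisions above (the push on publish, the limited homepage, keeping history after an unsubscribe, and the media type filter) can be sketched as a small in-memory model. This is only an illustration under assumptions: the class InboxFeedSketch, its method names, and the in-memory maps are hypothetical stand-ins for the real subscriber data tables, not the design we settled on.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical in-memory model of the INBOX (push) approach; not production code.
class InboxFeedSketch {
    /** A feed entry stores only a summary and a URL, plus a unique id and a media type. */
    static final class FeedItem {
        final long id;
        final String publisher;
        final String summary;
        final String url;
        final String mediaType;

        FeedItem(long id, String publisher, String summary, String url, String mediaType) {
            this.id = id;
            this.publisher = publisher;
            this.summary = summary;
            this.url = url;
            this.mediaType = mediaType;
        }
    }

    private final Map<String, Set<String>> subscribers = new HashMap<>();  // publisher -> users
    private final Map<String, Deque<FeedItem>> inboxes = new HashMap<>();  // user -> inbox, newest first
    private long nextId = 1;

    void subscribe(String user, String publisher) {
        subscribers.computeIfAbsent(publisher, k -> new LinkedHashSet<>()).add(user);
    }

    // Unsubscribing stops future deliveries but keeps what is already in the inbox.
    void unsubscribe(String user, String publisher) {
        Set<String> users = subscribers.get(publisher);
        if (users != null) {
            users.remove(user);
        }
    }

    // Push model: copy the item into every current subscriber's inbox at publish time.
    FeedItem publish(String publisher, String summary, String url, String mediaType) {
        FeedItem item = new FeedItem(nextId++, publisher, summary, url, mediaType);
        for (String user : subscribers.getOrDefault(publisher, Collections.<String>emptySet())) {
            inboxes.computeIfAbsent(user, k -> new ArrayDeque<>()).addFirst(item);
        }
        return item;
    }

    // Homepage: at most `limit` recent items; a null mediaType means "all types".
    List<FeedItem> homepage(String user, int limit, String mediaType) {
        List<FeedItem> page = new ArrayList<>();
        Deque<FeedItem> inbox = inboxes.get(user);
        if (inbox == null) {
            return page;
        }
        for (FeedItem item : inbox) {
            if (page.size() >= limit) {
                break;
            }
            if (mediaType == null || mediaType.equals(item.mediaType)) {
                page.add(item);
            }
        }
        return page;
    }

    public static void main(String[] args) {
        InboxFeedSketch feeds = new InboxFeedSketch();
        feeds.subscribe("alice", "bob");
        feeds.publish("bob", "first post", "http://example.com/1", "text");
        for (FeedItem item : feeds.homepage("alice", 20, "text")) {
            System.out.println(item.id + " " + item.summary + " " + item.url);
        }
    }
}
```

    Reads stay cheap because the homepage is a single inbox scan; the cost moves to publish time, which grows with the number of subscribers. The unique id on each item is what deleting or updating a source entry later would need.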

    Thrift and Protocol Buffers performance in Java

    I’ve used Thrift for some log clients in our system, and I’m going to use Protocol Buffers as the internal communication protocol between our XMPP servers. But I find it hard to believe the result of the Thrift and Protocol Buffers Python performance comparison, which says that Protocol Buffers is 4-10 times slower than Thrift. So I’m going to run some similar tests in Java.

    The test is very similar to the Python test. The .proto and .thrift files are copied from the above Python test.

    The .thrift content:

    struct dns_record {
    1: string key,
    2: string value,
    3: string type = 'A',
    4: i32 ttl = 86400,
    5: string first,
    6: string last
    }
    
    typedef list<dns_record> biglist
    
    struct dns_response {
    1: biglist records
    }
    
    service PassiveDns {
    biglist search_question(1:string q);
    biglist search_answer(1:string q);
    }

    The .proto content:

    package passive_dns;
    
    message DnsRecord {
    required string key = 1;
    required string value = 2;
    required string first = 3;
    required string last = 4;
    optional string type = 5 [default = "A"];
    optional int32  ttl = 6 [default = 86400];
    }
    
    message DnsResponse {
    repeated DnsRecord records = 1;
    }

    From the documentation, I learned that optional fields and default values are one of the benefits of both serialization libraries: a field whose value matches the default does not need to be included in the serialized output.
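
    To illustrate why skipping default values saves bytes, here is a toy encoder. It is an assumption-laden sketch: the class DefaultSkippingCodec and its tag/length layout are made up for this example and are not the wire format of either Thrift or Protocol Buffers. Only the defaults (type "A", ttl 86400) come from the schemas above.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Toy format: each field is written as a 1-byte field number followed by its value.
// Hypothetical illustration only; not the Thrift or Protocol Buffers wire format.
class DefaultSkippingCodec {
    static final String DEFAULT_TYPE = "A";
    static final int DEFAULT_TTL = 86400;

    static byte[] encode(String key, String type, int ttl) {
        ByteBuffer buf = ByteBuffer.allocate(1024);
        putString(buf, 1, key);                  // key has no default: always written
        if (!DEFAULT_TYPE.equals(type)) {
            putString(buf, 2, type);             // written only when it differs from the default
        }
        if (ttl != DEFAULT_TTL) {
            buf.put((byte) 3);
            buf.putInt(ttl);
        }
        byte[] out = new byte[buf.position()];
        buf.flip();
        buf.get(out);
        return out;
    }

    private static void putString(ByteBuffer buf, int field, String value) {
        byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
        buf.put((byte) field);
        buf.putShort((short) bytes.length);
        buf.put(bytes);
    }

    // Decoding starts from the defaults, so omitted fields come back with their default values.
    static String decode(byte[] data) {
        ByteBuffer buf = ByteBuffer.wrap(data);
        String key = null;
        String type = DEFAULT_TYPE;
        int ttl = DEFAULT_TTL;
        while (buf.hasRemaining()) {
            int field = buf.get();
            if (field == 3) {
                ttl = buf.getInt();
                continue;
            }
            byte[] bytes = new byte[buf.getShort()];
            buf.get(bytes);
            String value = new String(bytes, StandardCharsets.UTF_8);
            if (field == 1) {
                key = value;
            } else if (field == 2) {
                type = value;
            } else {
                throw new IllegalArgumentException("unknown field " + field);
            }
        }
        return key + " " + type + " " + ttl;
    }

    public static void main(String[] args) {
        byte[] withDefaults = encode("example.com", "A", 86400);
        byte[] withoutDefaults = encode("example.com", "MX", 300);
        System.out.println(withDefaults.length + " bytes vs " + withoutDefaults.length + " bytes");
        System.out.println(decode(withDefaults));
    }
}
```

    In this toy layout a record that uses both defaults takes 14 bytes, while one that overrides both takes 24, and the decoder still returns the default values for the fields that were left out.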

    I wrote a simple test program to compare Thrift and Protocol Buffers. I tested serialization and deserialization together, because this is the most frequently called path in most scenarios.

    Test 1: 10,000,000 times

    ProtoBuf Loop  : 10,000,000
    Get object     : 15,130msec
    Serdes protobuf: 68,600msec
    Objs per second: 145,772
    Total bytes    : 829,996,683
    
    Thrift Loop    : 10,000,000
    Get object     : 12,651msec
    Serdes thrift  : 36,904msec
    Objs per second: 270,973
    Total bytes    : 1,130,000,000

    Test 2: 1,000,000 times

    ProtoBuf Loop  : 1,000,000
    Get object     : 1,094msec
    Serdes protobuf: 7,467msec
    Objs per second: 133,922
    Total bytes    : 83,000,419
    
    Thrift Loop    : 1,000,000
    Get object     : 524msec
    Serdes thrift  : 5,969msec
    Objs per second: 167,532
    Total bytes    : 113,000,000

    The Serdes rows show the time needed to serialize and deserialize the Java object to and from a byte[].
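
    The shape of such a measurement loop can be sketched as follows. This is not the downloadable test code: SerdesTimer, Result, and run are hypothetical names, and a plain UTF-8 string codec stands in for the Thrift and Protocol Buffers calls so that the sketch runs without either library.

```java
import java.nio.charset.StandardCharsets;
import java.util.function.Function;

// Hypothetical timing harness: serialize and deserialize one object in a loop,
// then report elapsed time and total serialized bytes.
class SerdesTimer {
    static final class Result {
        final long elapsedMillis;
        final long totalBytes;

        Result(long elapsedMillis, long totalBytes) {
            this.elapsedMillis = elapsedMillis;
            this.totalBytes = totalBytes;
        }

        // Same arithmetic as the "Objs per second" rows: loops divided by elapsed seconds.
        long objectsPerSecond(int loops) {
            return elapsedMillis == 0 ? 0 : loops * 1000L / elapsedMillis;
        }
    }

    static <T> Result run(int loops, T object,
                          Function<T, byte[]> serialize,
                          Function<byte[], T> deserialize) {
        long totalBytes = 0;
        long start = System.nanoTime();
        for (int i = 0; i < loops; i++) {
            byte[] bytes = serialize.apply(object);
            T copy = deserialize.apply(bytes);
            if (copy == null) {
                throw new IllegalStateException("deserialize returned null");
            }
            totalBytes += bytes.length;
        }
        long elapsedMillis = (System.nanoTime() - start) / 1_000_000;
        return new Result(elapsedMillis, totalBytes);
    }

    public static void main(String[] args) {
        int loops = 1_000_000;
        // Stand-in codec; swap in the Thrift or Protocol Buffers calls to compare them.
        Result r = run(loops, "example.com",
                s -> s.getBytes(StandardCharsets.UTF_8),
                b -> new String(b, StandardCharsets.UTF_8));
        System.out.println("Serdes         : " + r.elapsedMillis + "msec");
        System.out.println("Objs per second: " + r.objectsPerSecond(loops));
        System.out.println("Total bytes    : " + r.totalBytes);
    }
}
```

    A loop like this measures wall-clock time only and includes JIT warm-up, which may be one reason the 1,000,000 and 10,000,000 runs give different ratios.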

    The result in Java is that Protocol Buffers is 1.2-2 times slower than Thrift (in the Python test it was 4-10 times), and the PB binary size is smaller than Thrift’s. I think this is acceptable, and Google may improve Protocol Buffers performance in future versions.
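
    The ratios can be checked directly from the numbers in the two tables. The helper class BenchmarkRatios below is hypothetical and only redoes the arithmetic; the inputs are the Serdes times and Total bytes reported above.

```java
// Hypothetical helper that recomputes the ratios from the reported figures.
class BenchmarkRatios {
    // How many times slower Protocol Buffers was than Thrift for the same loop count.
    static double slowdown(long protobufMillis, long thriftMillis) {
        return (double) protobufMillis / thriftMillis;
    }

    // Average serialized size of one object.
    static double bytesPerObject(long totalBytes, long loops) {
        return (double) totalBytes / loops;
    }

    public static void main(String[] args) {
        System.out.printf("Test 1 slowdown: %.2f%n", slowdown(68_600, 36_904));
        System.out.printf("Test 2 slowdown: %.2f%n", slowdown(7_467, 5_969));
        System.out.printf("PB bytes/object: %.1f%n", bytesPerObject(829_996_683L, 10_000_000L));
        System.out.printf("Thrift bytes/object: %.1f%n", bytesPerObject(1_130_000_000L, 10_000_000L));
    }
}
```

    That works out to roughly 1.86 times slower in Test 1 and 1.25 times slower in Test 2, with about 83 bytes per object for Protocol Buffers against 113 for Thrift.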

    Download my test code in Java: thrift-protocol-buffers-java.tgz.

    More information about thrift and protocol buffers: Thrift, Protocol Buffers installation and Java code howto

    Update: There is another Thrift vs. Protocol Buffers comparison that covers non-performance factors.

    UPDATE 2 (Apr 17): Protocol Buffers has a performance tuning parameter, optimize_for = SPEED (thanks, Steve Chu). Please see my next performance test, Thrift and Protocol Buffers performance in Java Round 2.

    Thrift, Protocol Buffers installation and Java code howto

    I. Thrift installation and Java code

    1. Build and install Thrift

    # install boost
    cd <boost_root>/tools/jam
    ./build_dist.sh
    # the linux* directory name depends on the platform
    cp stage/bin.linux*/bjam <boost_root>
    # build boost; using bjam is faster
    cd <boost_root>
    ./configure --without-icu --prefix=/usr/local/boost
    ./bjam --toolset=gcc --build-type=release install --prefix=/usr/local/boost

    # build thrift
    ./bootstrap.sh
    ./configure --with-boost=/usr/local
    make
    make install

    2. Build Thrift java library

    install apache ant if necessary

    cd lib/java/
    ant

    The build produces libthrift.jar.

    3. Create a .thrift file and generate Java code

    (See http://wiki.apache.org/thrift/Tutorial for more .thrift tutorial info)
    tim.thrift

    struct dns_record {
    1: string key,
    2: string value,
    3: string type = 'A',
    4: i32 ttl = 86400,
    5: string first,
    6: string last
    }
    
    service TestDns {
    dns_record test(1:string q);
    }

    <thrift_root>/bin/thrift --gen java tim.thrift

    The code is generated in gen-java/*.java

    4. Write Java code

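    import org.apache.thrift.TDeserializer;
    import org.apache.thrift.TSerializer;
    import org.apache.thrift.protocol.TBinaryProtocol;
    // new object
    dns_record dr = new dns_record(key, value, type, ttl, first, last);
    // serialize to a byte[]
    TSerializer serializer = new TSerializer(new TBinaryProtocol.Factory());
    byte[] bytes = serializer.serialize(dr);
    // deserialize into an empty object
    TDeserializer deserializer = new TDeserializer(new TBinaryProtocol.Factory());
    dns_record dr2 = new dns_record();
    deserializer.deserialize(dr2, bytes);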

    see also: http://wiki.apache.org/thrift/ThriftUsageJava

    II. Protocol Buffers installation and Java code

    1. Build and install Protocol buffers

    ./configure
    make
    make install

    2. Build protobuf Java library

    install maven if necessary

    cd protobuf/java
    mvn test
    mvn package

    The jar is built at target/protobuf-java-x.x.x.jar

    3. Create a .proto file and generate Java code

    tim.proto

    package dns;
    
    message DnsRecord {
    required string key = 1;
    required string value = 2;
    required string first = 3;
    required string last = 4;
    optional string type = 5 [default = "A"];
    optional int32  ttl = 6 [default = 86400];
    }
    
    message DnsResponse {
    repeated DnsRecord records = 1;
    }

    bin/protoc --java_out=. tim.proto

    4. Write Java code

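    // Protocol Buffers needs a builder to create an object.
    // Dns is the generated outer class. By default protoc names it after the
    // .proto file, so this name assumes java_outer_classname is set to "Dns".
    Dns.DnsRecord.Builder b = Dns.DnsRecord.newBuilder();
    b.setKey("key");
    b.setValue("value...");
    ...
    Dns.DnsRecord dr = b.build();
    
    // serialize to a byte[], then parse it back
    byte[] bytes = dr.toByteArray();
    Dns.DnsRecord dr2 = Dns.DnsRecord.parseFrom(bytes);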

    III. Resources

    Thrift: http://incubator.apache.org/thrift/

    Protocol Buffers: http://code.google.com/apis/protocolbuffers/

    Tim’s Blog: http://timyang.net/