<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Tim [Backend Technology] &#187; performance</title>
	<atom:link href="http://timyang.net/tag/performance/feed/" rel="self" type="application/rss+xml" />
	<link>http://timyang.net</link>
	<description>Tim&#039;s blog on backend architecture, Internet technology, distributed systems, large-scale web applications, NoSQL, key-value stores, and more</description>
	<lastBuildDate>Mon, 02 Aug 2010 15:34:40 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.1</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Paginating Web Data with Twitter-Style Cursors</title>
		<link>http://timyang.net/web/pagination/</link>
		<comments>http://timyang.net/web/pagination/#comments</comments>
		<pubDate>Tue, 19 Jan 2010 14:16:58 +0000</pubDate>
		<dc:creator>Tim</dc:creator>
				<category><![CDATA[Web]]></category>
		<category><![CDATA[mysql]]></category>
		<category><![CDATA[performance]]></category>
		<category><![CDATA[twitter]]></category>

		<guid isPermaLink="false">http://timyang.net/?p=532</guid>
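		<description><![CDATA[This post compares the performance of different ways to implement pagination in a web application.

Taking MySQL as an example, a typical pager is implemented as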
select * from msgs where thread_id = ? limit ?, ?  -- offset (page * count), then count
Yet when reading the Twitter API, we find that many endpoints use a cursor rather than the more intuitive page and count parameters. Take the followers ids endpoint:
URL:
http://twitter.com/followers/ids.format
Returns an array of numeric IDs for every user following the specified user.
Parameters:
* cursor. Required. Breaks the results into pages. Provide a value of -1 to begin paging. Provide values as returned to in the response body&#8217;s next_cursor [...]]]></description>
			<content:encoded><![CDATA[<p>This post compares the performance of different ways to implement pagination in a web application.<br />
<a href="http://timyang.net/blog/wp-content/uploads/2010/01/pagination.png"><img class="alignnone size-full wp-image-533" title="pagination" src="http://timyang.net/blog/wp-content/uploads/2010/01/pagination.png" alt="" width="376" height="148" /></a><br />
Taking MySQL as an example, the pager shown above is typically implemented as</p>
<pre>select * from msgs where thread_id = ? limit ?, ?  -- offset (page * count), then count</pre>
<p>Yet when reading the Twitter API, we find that many endpoints use a cursor rather than the more intuitive page and count parameters. Take the followers ids endpoint:</p>
<blockquote><p><strong>URL:</strong></p>
<p>http://twitter.com/followers/ids.format</p>
<p>Returns an array of numeric IDs for every user following the specified user.</p>
<p><strong>Parameters:</strong><br />
* cursor. Required. Breaks the results into pages. Provide a value of -1 to begin paging. Provide values as returned to in the response body&#8217;s next_cursor and previous_cursor attributes to page back and forth in the list.<br />
o Example: http://twitter.com/followers/ids/barackobama.xml?cursor=-1<br />
o Example: http://twitter.com/followers/ids/barackobama.xml?cursor=-1300794057949944903</p></blockquote>
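<p>The paging contract in the quoted documentation can be sketched as a small client loop: start with cursor=-1 and keep following next_cursor. This is only an illustration. <code>fetch_page</code> is a hypothetical in-memory stand-in for the HTTP call, the tiny page size is made up, and treating a next_cursor of 0 as "no more pages" is an assumption that the quoted text does not state.</p>

```python
from typing import Callable, Iterator, List

# Hypothetical in-memory data standing in for one user's follower IDs.
FOLLOWER_IDS: List[int] = list(range(1001, 1013))  # 12 IDs
PAGE_SIZE = 5  # made-up value, chosen small so the loop runs several times


def fetch_page(cursor: int) -> dict:
    """Hypothetical stand-in for a followers/ids request with ?cursor=...

    Cursor -1 means "start from the beginning". Any other value is treated
    as an opaque token; here it is simply a list position, which is an
    assumption of this sketch and not how Twitter encodes cursors.
    """
    start = 0 if cursor == -1 else cursor
    end = start + PAGE_SIZE
    # Assumed convention: next_cursor == 0 signals that no pages remain.
    next_cursor = end if end < len(FOLLOWER_IDS) else 0
    return {"ids": FOLLOWER_IDS[start:end], "next_cursor": next_cursor}


def iter_follower_ids(fetch: Callable[[int], dict]) -> Iterator[int]:
    """Yield every ID by following next_cursor until it reports the end."""
    cursor = -1
    while True:
        page = fetch(cursor)
        yield from page["ids"]
        cursor = page["next_cursor"]
        if cursor == 0:
            break


all_ids = list(iter_follower_ids(fetch_page))
```

<p>The client never computes a page number. It only hands back the token it received, which is what leaves the server free to choose how a cursor is encoded.</p>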
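<p>As the description above shows, a call to http://twitter.com/followers/ids.xml paginates by passing a cursor parameter instead of the traditional url?page=n&amp;count=n form. What is the advantage? Does each cursor hold a snapshot of the data set as it was at that moment, so that real-time changes to the result set cannot produce duplicate entries across pages?<br />
In the Google Groups thread <a href="http://groups.google.com/group/twitter-development-talk/browse_thread/thread/712b4765028d527c/3f444faee8f8d7ef">Cursor Expiration</a>, Twitter architect <a href="http://twitter.com/jkalucki">John Kalucki</a> explains:</p>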
<blockquote><p>A cursor is an opaque deletion-tolerant index into a Btree keyed by source<br />
userid and modification time. It brings you to a point in time in the<br />
reverse chron sorted list. So, since you can&#8217;t change the past, other than<br />
erasing it, it&#8217;s effectively stable. (Modifications bubble to the top.) But<br />
you have to deal with additions at the list head and also block shrinkage<br />
due to deletions, so your blocks begin to overlap quite a bit as the data<br />
ages. (If you cache cursors and read much later, you&#8217;ll see the first few<br />
rows of cursor[n+1]&#8217;s block as duplicates of the last rows of cursor[n]&#8217;s<br />
block. The intersection cardinality is equal to the number of deletions in<br />
cursor[n]&#8217;s block). Still, there may be value in caching these cursors and<br />
then heuristically rebalancing them when the overlap proportion crosses some<br />
threshold.</p></blockquote>
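<p>The overlap effect John describes can be reproduced with a toy model. This is a sketch under stated assumptions, not Twitter's implementation: rows are represented only by integer modification timestamps, a cursor is modelled as the timestamp at which a block starts, and the block size of 5 is made up.</p>

```python
from typing import List, Set

BLOCK_SIZE = 5  # made-up value for illustration


def read_block(rows: List[int], cursor: int) -> List[int]:
    """Return up to BLOCK_SIZE rows at or before `cursor`, newest first.

    `rows` holds modification timestamps sorted newest first. Modelling the
    cursor as a timestamp is an assumption of this toy model.
    """
    return [ts for ts in rows if ts <= cursor][:BLOCK_SIZE]


# Twenty rows with timestamps 20..1, newest first.
rows = list(range(20, 0, -1))

# First pass: cache the cursor at which each block started.
cursors = [rows[i] for i in range(0, len(rows), BLOCK_SIZE)]  # 20, 15, 10, 5

# Much later, two rows that belonged to block 0 are deleted.
deleted: Set[int] = {18, 17}
rows_after = [ts for ts in rows if ts not in deleted]

# Re-reading from the cached cursors: block 0 now reaches into block 1.
block0 = read_block(rows_after, cursors[0])
block1 = read_block(rows_after, cursors[1])
overlap = sorted(set(block0) & set(block1), reverse=True)
```

<p>In this model block 0 comes back as 20, 19, 16, 15, 14 and block 1 as 15, 14, 13, 12, 11, so the two rows they share match the two deletions in block 0, as the quote predicts.</p>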
<p>In another thread, <a href="http://groups.google.com/group/twitter-development-talk/browse_thread/thread/cfccfa4302ff9729/66d6b91f9a6bf96d">new cursor-based pagination not multithread-friendly</a>, John adds:</p>
<blockquote><p>The page based approach does not scale with large sets. We can no<br />
longer support this kind of API without throwing a painful number of<br />
503s.</p>
<p>Working with row-counts forces the data store to recount rows in an O<br />
(n^2) manner. Cursors avoid this issue by allowing practically<br />
constant time access to the next block. The cost becomes O(n/<br />
block_size) which, yes, is O(n), but a graceful one given n &lt; 10^7 and<br />
a block_size of 5000. The cursor approach provides a more complete and<br />
consistent result set.</p>
<p>Proportionally, very few users require multiple page fetches with a<br />
page size of 5,000.</p>
<p>Also, scraping the social graph repeatedly at high speed is could<br />
often be considered a low-value, borderline abusive use of the social<br />
graph API.</p></blockquote>
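<p>To put rough numbers on the quoted complexity argument, the sketch below uses a deliberately simple cost model that is my assumption, not something stated in the thread: with row-count paging, serving page k means walking past every earlier row before reading the block, while a cursor lets the store seek directly to the block. The figures n = 10^7 and block_size = 5000 come from the quote.</p>

```python
def offset_rows_walked(n: int, block_size: int) -> int:
    """Total rows walked to fetch the whole list with row-count paging.

    Assumed cost model: page k walks the k * block_size rows before it and
    then its own block, so the total grows roughly as n^2 / (2 * block_size).
    """
    pages = -(-n // block_size)  # ceiling division
    return sum(min(n, (k + 1) * block_size) for k in range(pages))


def cursor_requests(n: int, block_size: int) -> int:
    """Number of block fetches with cursors: n / block_size, rounded up."""
    return -(-n // block_size)


n, block_size = 10**7, 5000
requests = cursor_requests(n, block_size)              # 2000 block fetches
walked_by_offset = offset_rows_walked(n, block_size)   # 10,005,000,000 rows
walked_by_cursor = n                                   # each row is read once
```

<p>Under this model, reading ten million IDs takes 2,000 requests either way, but row-count paging walks about a thousand times more rows in total than the cursor approach.</p>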
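<p>These two passages make it clear: for large result sets, the main purpose of cursors is to improve performance dramatically. Taking MySQL as the example again, paging to around row 100,000 without a cursor requires the following SQL:</p>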
<pre>select * from msgs limit 100000, 100</pre>
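<p>On a table with a million rows, the first execution of this statement takes more than 5 seconds.<br />
If we use the table's primary key value as cursor_id, the SQL for cursor-based paging can be optimized to</p>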
<pre>select * from msgs where id &gt; cursor_id order by id limit 100;</pre>
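<p>On the same table this usually takes under 100 ms, dozens of times faster. For more on MySQL LIMIT performance, see a rough article I wrote three years ago, <a href="http://hi.baidu.com/jabber/blog/item/67485b43379290119313c6b5.html">Performance Problems with MySQL LIMIT</a>.</p>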
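<p>The two query shapes can be tried side by side in a small runnable sketch. It uses Python's built-in sqlite3 module with an in-memory table instead of MySQL, purely so that the example is self-contained. The table contents and the helper names <code>page_by_offset</code> and <code>page_by_cursor</code> are hypothetical, and the table is far too small to reproduce the timings quoted above.</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE msgs (id INTEGER PRIMARY KEY, body TEXT)")
conn.executemany(
    "INSERT INTO msgs (id, body) VALUES (?, ?)",
    ((i, f"msg {i}") for i in range(1, 1001)),
)


def page_by_offset(offset: int, count: int) -> list:
    """Offset paging: the engine steps over `offset` rows before returning any."""
    sql = "SELECT id FROM msgs ORDER BY id LIMIT ? OFFSET ?"
    return [row[0] for row in conn.execute(sql, (count, offset))]


def page_by_cursor(cursor_id: int, count: int) -> list:
    """Cursor paging: seek on the primary key to the first id after the cursor."""
    sql = "SELECT id FROM msgs WHERE id > ? ORDER BY id LIMIT ?"
    return [row[0] for row in conn.execute(sql, (cursor_id, count))]


# The same page reached two ways: rows 501..510.
by_offset = page_by_offset(500, 10)
by_cursor = page_by_cursor(500, 10)

# The next cursor is simply the last id of the page just returned.
next_cursor = by_cursor[-1]
following_page = page_by_cursor(next_cursor, 10)
```

<p>Both calls return the same rows here only because the ids are contiguous. Once rows are deleted, an offset no longer corresponds to a fixed id, while the cursor query still resumes exactly after the last row seen.</p>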
<h3>Conclusion</h3>
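<p>For web applications, <strong>paging through large data sets with this cursor approach is recommended</strong>. Its drawback is that pages must be traversed sequentially; you cannot jump to an arbitrary page.</p>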
]]></content:encoded>
			<wfw:commentRss>http://timyang.net/web/pagination/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
	</channel>
</rss>
