This article discusses how data pagination is implemented in web applications, and the performance differences between the different implementation approaches.
Taking MySQL as an example, a paging feature like the one shown in the screenshot above is typically implemented as:
select * from msgs where thread_id = ? limit page * count, count
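As a quick illustration of how the offset grows with the page number (the thread_id value and page size below are made up, not from the original example), page 500 of a 20-per-page list already forces MySQL to read and discard 10,000 rows before returning anything:
-- page = 0:   limit 0, 20
-- page = 500: limit 10000, 20  (10,000 rows are read and thrown away)
select * from msgs where thread_id = 123 limit 10000, 20;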
However, when reading the Twitter API we noticed that quite a few endpoints page with a cursor rather than the more intuitive page/count form, for example the followers ids endpoint:
URL:
http://twitter.com/followers/ids.format
Returns an array of numeric IDs for every user following the specified user.
Parameters:
* cursor. Required. Breaks the results into pages. Provide a value of -1 to begin paging. Provide values as returned in the response body's next_cursor and previous_cursor attributes to page back and forth in the list.
  * Example: http://twitter.com/followers/ids/barackobama.xml?cursor=-1
  * Example: http://twitter.com/followers/ids/barackobama.xml?cursor=-1300794057949944903
From the description above we can see that the http://twitter.com/followers/ids.xml call takes a cursor parameter for paging rather than the traditional url?page=n&count=n form. What is the advantage of doing it this way? Does each cursor hold a snapshot of the data set at that moment in time, so that duplicates do not appear when the result set changes while you page through it?
In the Google Groups thread Cursor Expiration, Twitter architect John Kalucki explains:
A cursor is an opaque deletion-tolerant index into a Btree keyed by source
userid and modification time. It brings you to a point in time in the
reverse chron sorted list. So, since you can’t change the past, other than
erasing it, it’s effectively stable. (Modifications bubble to the top.) But
you have to deal with additions at the list head and also block shrinkage
due to deletions, so your blocks begin to overlap quite a bit as the data
ages. (If you cache cursors and read much later, you’ll see the first few
rows of cursor[n+1]’s block as duplicates of the last rows of cursor[n]’s
block. The intersection cardinality is equal to the number of deletions in
cursor[n]’s block). Still, there may be value in caching these cursors and
then heuristically rebalancing them when the overlap proportion crosses some
threshold.
In another thread, new cursor-based pagination not multithread-friendly, John adds:
The page based approach does not scale with large sets. We can no longer support this kind of API without throwing a painful number of 503s.
Working with row-counts forces the data store to recount rows in an O(n^2) manner. Cursors avoid this issue by allowing practically constant time access to the next block. The cost becomes O(n/block_size) which, yes, is O(n), but a graceful one given n < 10^7 and a block_size of 5000. The cursor approach provides a more complete and consistent result set.
Proportionally, very few users require multiple page fetches with a page size of 5,000.
Also, scraping the social graph repeatedly at high speed could often be considered a low-value, borderline abusive use of the social graph API.
These two passages make the point clear: for large result sets, the purpose of the cursor approach is a dramatic performance gain (with John's numbers, n < 10^7 and a block_size of 5000 works out to at most about 2,000 block fetches to walk the entire list). Using MySQL as the example again, paging to around the 100,000th record without a cursor gives this SQL:
select * from msgs limit 100000, 100
On a table with a million rows, the first execution of this statement takes more than 5 seconds.
If we use the table's primary key value as cursor_id, the SQL for cursor-based paging can be optimized to:
select * from msgs where id > cursor_id limit 100;
On the same table this usually takes under 100ms, a speedup of several dozen times. For more on the performance differences of MySQL LIMIT, see the (admittedly rough) article I wrote three years ago, MySQL LIMIT 的性能问题.
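As a rough sketch of the complete pattern (the literal ids below are invented, and an explicit order by id is added to make the ordering deterministic), the application remembers the id of the last row of each page and sends it back as the cursor for the next page:
-- first page: start from a cursor of 0
select * from msgs where id > 0 order by id limit 100;
-- suppose the last row returned has id = 1532; the client passes it back
-- as the cursor, and the next page is a simple range seek on the primary
-- key that touches only the rows it returns
select * from msgs where id > 1532 order by id limit 100;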
Conclusion
For paging through large data sets in web applications, the cursor approach is recommended. Its drawback is that pages must be read sequentially; you cannot jump straight to an arbitrary page.
Comments
In real applications the problem usually lies in filter conditions sitting between the where and the limit, such as status = pass; once the matching rows are no longer contiguous, the cursor is not nearly as effective.
I think the previous comment is really about the order by, i.e. the case where the list is not sorted by id.
Jumping pages can be handled with one extra lookup to obtain a suitable cursor; that is probably still fairly simple.
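One way to read that suggestion (the offsets below assume a page size of 100 and are only illustrative): first fetch just the id at the boundary of the skipped page, then page from it as usual:
-- the 100th id after the current cursor (offset 99, one row) is the last
-- row of the page being skipped, so it becomes the cursor for the page after next
select id from msgs where id > cursor_id order by id limit 99, 1;
-- then page from the value returned above (new_cursor is a placeholder)
select * from msgs where id > new_cursor order by id limit 100;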
Have a look at one of my blog posts, http://www.fuchaoqun.com/2009/04/efficient-pagination-using-mysql/, which uses a cursor while still allowing arbitrary page jumps.
What should we do if the ordering is not by the primary key id?
The performance of LIMIT really is terrifying...
This solves the database side of the problem, but how should the corresponding list cache be designed?
A list-cache implementation (using Sina Weibo as an example):
A counter records my current number of posts (wcount=1000), and Redis keeps the post ids in a sorted set with the publish time as the score. When flipping to page 2 (30 per page):
if there are no new posts (wcount unchanged), get mylist 30 30;
if there are 15 new posts, fetching page 2 should be get mylist 30+15 30.
This kind of list cache is not easy to implement with memcache.
Actually, jumping pages is possible too, it just takes a bit more care in the design. For example, to jump two pages ahead the SQL would be:
select * from msgs where id > cursor_id limit 200, 100;
Is there a better way when the ids are not consecutive?
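For the recurring question about ordering by something other than the primary key, a common sketch is to make the cursor a composite of the sort column plus the id as a tie-breaker (the created_at column and the index on (created_at, id) are assumptions for illustration, not something from the original post):
-- last_created_at / last_id come from the final row of the previous page;
-- the id tie-breaker keeps the cursor unambiguous when created_at repeats
select * from msgs
where created_at < last_created_at
   or (created_at = last_created_at and id < last_id)
order by created_at desc, id desc
limit 100;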