
Looking into the future with Cassandra

126 points by bnmrrs, over 15 years ago

8 comments

rarrrrrr, over 15 years ago
In SQL systems, you can easily shift computation cost from read time to write time. That's what triggers are for. When new data comes in that changes a result you know you need quickly, a trigger can automatically add to a work queue to update a result table.

Since the trigger has automatic access to the contents of the new data (and the old data, in the case of an update or delete), the computation to update the results table can often be made much faster.

Every situation is different, of course, but it's overreaching to say that SQL systems have no options beyond read-time computation of results.
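A minimal sketch of that pattern in modern PostgreSQL syntax (the diggs, friends, and friend_digg_counts tables are hypothetical, not Digg's actual schema). The trigger fans each new digg out to a per-follower results row at write time, so the read side becomes a single primary-key lookup:

    -- Hypothetical results table, maintained at write time rather than read time.
    CREATE TABLE friend_digg_counts (
        itemid BIGINT  NOT NULL,
        userid BIGINT  NOT NULL,   -- the follower who will see the badge
        diggs  INTEGER NOT NULL DEFAULT 0,
        PRIMARY KEY (itemid, userid)
    );

    -- On every new digg, fan the write out to each follower's row.
    CREATE FUNCTION bump_friend_digg_counts() RETURNS trigger AS $$
    BEGIN
        INSERT INTO friend_digg_counts (itemid, userid, diggs)
        SELECT NEW.itemid, f.userid, 1
          FROM friends f
         WHERE f.friendid = NEW.userid
        ON CONFLICT (itemid, userid)
        DO UPDATE SET diggs = friend_digg_counts.diggs + 1;
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER diggs_after_insert
        AFTER INSERT ON diggs
        FOR EACH ROW EXECUTE FUNCTION bump_friend_digg_counts();

This updates the results table inline; the work-queue variant the comment mentions would instead insert a row into a queue table here and let a background worker do the fan-out.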
mcav, over 15 years ago
Wow:

> *For this feature, the fully denormalized Cassandra dataset weighs in at 3 terabytes and 76 billion columns.*
mikeryan, over 15 years ago
This is weird to me.

"We started thinking seriously about deploying Cassandra in production around three weeks ago. After looking at the site for something that would be a good fit, we settled on green badges."

It seems completely baffling to me that someone would go out and compare different db solutions, pick one, and THEN try to find a way to fit it into their site architecture.
gcv, over 15 years ago
I read in several blog posts that Cassandra has its share of data-corrupting bugs. http://blog.evanweaver.com/articles/2009/07/06/up-and-running-with-cassandra/ mentioned that even Facebook does not use it as a system of record. Wonder how Digg deals with that.
justin_vanw, over 15 years ago
SELECT `digdate`, `id` FROM `Diggs` WHERE `userid` IN (59, 9006, 15989, 16045, 29183, 30220, 62511, 75212, 79006) AND itemid = 13084479 ORDER BY `digdate` DESC, `id` DESC LIMIT 4;

Ok, how do we optimize this query?

Step 1: Keep all dugg items in memcached for the last N days, where N is determined by when you run out of memory. Then, your query becomes:

SELECT `digdate`, `id` FROM `Diggs` WHERE `userid` IN (59, 9006, 15989, 16045, 29183, 30220, 62511, 75212, 79006) AND digdate < now() - interval '5 days' AND itemid = 13084479 ORDER BY `digdate` DESC, `id` DESC LIMIT 4; /* Excuse the postgresql syntax */

If your database is properly clustered, this will mean you are only running the query against partitions holding old diggs, which are probably not as hot as the more recent stuff. Additionally, I strongly suspect that you see recent articles more than old ones; if the article is less than 5 days old you need no SQL at all, just the memcache lookup. For example, if you are looking at the homepage and there are 15 articles on it, you have to do a single memcached get request for all the pairs like (article_id, friend_id), so if you have 100 friends that is 100 * 15 keys to request. This is large, but who cares: you can add memcached servers and webservers until you puke, and this will keep scaling without limit. When browsing old articles the db will get hit heavily, but only the partitions holding old data, and I would guess that this is a very, very small fraction of their overall use.

Step 2: When a user is actively using the site, like they have viewed 2 pages in the last 10 minutes or something, shove all their old (article_id, friend_id) pairs into memcached as well. Once a user has reached the 'activity threshold' and the cache is filled, no SQL is necessary to find all their friends' dugg articles. As a bonus, no weirdo software like 'cassandra', which may or may not continue to exist in 1 year, is necessary.

For step 1 you need very little effort: just put a key into memcached every time a user diggs something, with a 5-day timeout on that key. This is 1 line of code in whatever handles the http request representing a 'digg'. Then you have to build up the list of friends and keep it somewhere when a user logs in to the site (or returns with a cookie that has them logged in). This would take one memcache request when the user logs in/comes back to see if their friends list is in memcached, one sql statement if it is not, and a line in the area that handles adding friends to invalidate the key if their friends list changes (you could try updating it, but why, just let it be regenerated on their next http request). Finally, you have to generate the keys for the (article_id, friend_id) pairs on each page view, and do a multi_get from memcached.

Step 2 would require an asynchronous process, so it would be more complex.

I could implement step 1 in an hour or so if familiar with the digg codebase, and step 2 in perhaps 2 days; however, if they have other async processes that occur when a user logs in that this could be integrated with, it could take as little as an hour or two as well, since the logic is dead simple; it is the mechanics of running a process to do it that is time consuming.

Finally, you would have to figure out how much memory you would need to store N days of diggs (users with no friends do not count in this). I believe it would not be very much.
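For concreteness, here is roughly what "properly clustered" could look like using PostgreSQL's declarative range partitioning (a feature that shipped well after this comment was written; the diggs schema below is a guess, not Digg's actual one). With this layout, the date predicate in the rewritten query above lets the planner skip the recent, hot partitions entirely:

    -- Hypothetical diggs table, range-partitioned by digg date.
    CREATE TABLE diggs (
        id      BIGINT NOT NULL,
        userid  BIGINT NOT NULL,
        itemid  BIGINT NOT NULL,
        digdate DATE   NOT NULL
    ) PARTITION BY RANGE (digdate);

    CREATE TABLE diggs_2009_q3 PARTITION OF diggs
        FOR VALUES FROM ('2009-07-01') TO ('2009-10-01');
    CREATE TABLE diggs_2009_q4 PARTITION OF diggs
        FOR VALUES FROM ('2009-10-01') TO ('2010-01-01');

    -- A per-partition index keeps the old-data lookups cheap too.
    CREATE INDEX ON diggs (userid, itemid, digdate DESC);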
imajes, over 15 years ago
Anyone from digg here? I'd love to know how long it took to build that dataset the first time. In other words: what's the recovery window like to rebuild a dataset of that size?
by, over 15 years ago
Why does the first step, "Query Friends for all my friends", take 1.5 seconds? I am struggling to understand this. If this simple table of, say, 100,000,000 rows is indexed on userid, and we are only reading and returning say 200 rows for a particular userid, what makes it so slow?
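For comparison, a lookup like that should normally be satisfied by a short index scan. A sketch of the kind of index that makes it cheap (the friends schema here is a guess, not Digg's actual one):

    -- A composite index that covers the lookup: finding ~200 friend rows
    -- among 100,000,000 becomes a short index range scan, with no heap access.
    CREATE INDEX idx_friends_userid ON friends (userid, friendid);

    SELECT friendid
      FROM friends
     WHERE userid = 59;   -- returns the ~200 friends straight from the index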
Quarrelsome, over 15 years ago
But surely that's broken. If I digg something but then add a friend AFTERWARDS, they won't see the shield, since the bucket for my digg didn't contain their user id at write time. Am I missing something?