
How Netflix loaded 1 billion rows into SimpleDB

32 points, by petewarden, over 15 years ago

5 comments

pierrefar, over 15 years ago
I was running a website that was doing millions of writes a day to SDB for real-time analytics. The biggest PITA feature of SDB is that it throttles writes in a horribly stringent way. You can barely tickle a domain and it will throttle you. I never got consistently good batch puts - they all eventually fail.

After talking with SDB folks, they recommended that I shard my data, because each domain maps to a different network computer cluster. I'm glad it's the first recommendation in the OP's list, because it seriously is the best thing you can do.

Another trick that I experimented with: use multiple EC2 instances to write to the same domain. I managed to convince myself that the throttling is per EC2 instance per domain, not global per domain. However, cost ruled this solution out.

Reading was much more consistent but was also throttled, especially at high write loads. The solution was two-fold:

1. Cache everything "indefinitely" and break the cache when you know its contents will change. For the real-time stuff, you can't cache. I used memcached, and looked at other solutions like Tokyo Tyrant, memcachedb and redis. Use what you feel comfortable using, really.

2. Read as little as possible. Doing a "select * from domain where..." is horrible compared to doing "select attribute1, attribute2 from domain where...". Once you read, cache.
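A minimal sketch of the write-sharding, throttling backoff, and cache-aside reads described above, using the classic boto 2 SimpleDB bindings and python-memcached. The domain names, shard count, attribute names, and backoff policy are illustrative assumptions, not details from the comment:

```python
# Sketch: shard writes across several SimpleDB domains, batch them,
# and back off exponentially on throttling errors. Reads fetch only
# the attributes they need and cache the result until the next write.
import hashlib
import time

import boto          # classic boto 2
import memcache      # python-memcached

SHARDS = 8  # each domain maps to a different cluster, so sharding spreads load
conn = boto.connect_sdb()  # credentials taken from the environment
domains = [conn.create_domain("analytics_%02d" % i) for i in range(SHARDS)]
mc = memcache.Client(["127.0.0.1:11211"])

def shard_for(item_name):
    """Hash the item name to a domain so write load is spread evenly."""
    h = int(hashlib.md5(item_name.encode()).hexdigest(), 16)
    return domains[h % SHARDS]

def batch_put(items, retries=5):
    """items: {item_name: {attribute: value}}. BatchPutAttributes is
    capped at 25 items per call, so chunk within each shard."""
    by_shard = {}
    for name, attrs in items.items():
        by_shard.setdefault(shard_for(name), {})[name] = attrs
        mc.delete(name)  # break the cache on write instead of expiring
    for domain, chunk in by_shard.items():
        names = list(chunk)
        for i in range(0, len(names), 25):
            batch = dict((n, chunk[n]) for n in names[i:i + 25])
            for attempt in range(retries):
                try:
                    domain.batch_put_attributes(batch, replace=True)
                    break
                except boto.exception.SDBResponseError:
                    time.sleep(2 ** attempt)  # back off when throttled

def get_attrs(item_name, wanted=("attribute1", "attribute2")):
    """Cache-aside read that fetches only the attributes it needs."""
    cached = mc.get(item_name)
    if cached is not None:
        return cached
    attrs = shard_for(item_name).get_attributes(
        item_name, attribute_name=list(wanted))
    mc.set(item_name, attrs)
    return attrs
```

Sharding by a hash of the item name keeps any one domain's write rate below the throttle threshold, which matches the "one domain per cluster" advice; the 25-item chunk size is SimpleDB's documented BatchPutAttributes limit.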
jordyhoyt, over 15 years ago
The linked blog post gives more details on how he did it and the throughput he got.

http://practicalcloudcomputing.com/post/284222088/forklift-1b-records

Very interesting that Oracle became the bottleneck.
pvg, over 15 years ago
The post is notable for its absence of any hint as to why, and with what kind of data, this was done. Was the driver cost? Performance? A pleasant cloudy feeling? It seems a given that you can, if you try, get a billion rows into SimpleDB. You can probably get a billion rows mechanical-turked onto clay tablets. The interesting thing to learn would be why doing so is advantageous.
lsb, over 15 years ago
You can buy a machine with 128GB of memory and 2 TB of disk space for under $10k at Dell. A billion rows could be an in-memory dataset now.
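A quick back-of-envelope check of that claim, assuming ~100 bytes per row (the row size is my assumption, not the commenter's):

```python
rows = 10 ** 9
bytes_per_row = 100                     # assumed average row size
total_gb = rows * bytes_per_row / 10 ** 9
print("%.0f GB" % total_gb)             # ~100 GB, inside a 128 GB machine
```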
elq, over 15 years ago
Bah, that's nothing! My team put over 3 billion rows into Amazon's cloud in a matter of hours without having to deal with the vagaries of SDB :)

To the best of my knowledge, Oracle was the bottleneck because the Oracle instance is an actual high-volume production database, and the IR process was restrained to minimize the impact on production users.