Ask HN: Storing millions and billions of URLs?

12 points by gerenuk, about 7 years ago
Hello everyone!

I currently use Elasticsearch to store metadata and other raw data, but only at a very small scale: around 500,000 domains.

I have been tasked with scaling this to 20-40 million domains and storing their internal/external links, while building a PageRank/domain authority score for each domain we add to our database.

What do you suggest or recommend for storing this data at a very large scale? Storing each web page's internal and external links will push it to a database of over 100M-1B links.

Any kind of feedback or suggestion would be appreciated.

Thanks.
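For readers unfamiliar with the scoring part of the question, here is a minimal in-memory sketch of a PageRank-style score over domain-to-domain links. It is an illustration only, not the poster's method: the function name `pagerank`, the damping factor of 0.85, the fixed iteration count, and the example domains are all assumptions, and at 100M-1B links the graph would not fit this naive dict-based approach.

```python
from collections import defaultdict


def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over (source_domain, target_domain) pairs.

    Hypothetical helper for illustration; parameters are assumed defaults.
    """
    nodes = set()
    out_links = defaultdict(set)
    for src, dst in edges:
        nodes.add(src)
        nodes.add(dst)
        if src != dst:  # ignore self-links (internal links on the same domain)
            out_links[src].add(dst)

    n = len(nodes)
    if n == 0:
        return {}

    # Start from a uniform distribution.
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by domains without outgoing links is spread evenly.
        dangling = sum(rank[node] for node in nodes if not out_links.get(node))
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        for src, targets in out_links.items():
            share = damping * rank[src] / len(targets)
            for dst in targets:
                new_rank[dst] += share
        rank = new_rank
    return rank


if __name__ == "__main__":
    example_edges = [
        ("a.example", "b.example"),
        ("c.example", "b.example"),
        ("b.example", "a.example"),
    ]
    for domain, score in sorted(pagerank(example_edges).items()):
        print(domain, round(score, 4))
```

The scores sum to 1 across all domains, so they can be compared directly or rescaled into whatever "domain authority" range is wanted.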

8 comments

nik736, about 7 years ago
I don't think any proper database technology will have issues with that amount of data. It all depends on how you use it.
sharemywin, about 7 years ago
Found this:

https://dba.stackexchange.com/questions/38793/which-database-could-handle-storage-of-billions-trillions-of-records

There's a nice little triangle diagram here: https://stackoverflow.com/questions/2794736/best-data-store-for-billions-of-rows
girishso, about 7 years ago
I have personally used CouchDB to store tens of millions of documents. If you can find a way to get the data you want using CouchDB views, the number of documents simply doesn't matter (maybe just the disk usage grows with additional documents/views), and performance stays excellent.
drizzle87, about 7 years ago
Elasticsearch should easily be able to handle your scaling needs. Why do you think it would not? What are your concerns?
jjirsa, about 7 years ago
The answer will depend primarily on how you expect to query it.

Cassandra can do many orders of magnitude more than 1B, but would limit you in your query patterns.
mr__y, about 7 years ago
Have you considered sharding the data across multiple independent ES instances? Each of them could then handle an amount of data that does not cause problems.
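As a rough illustration of the sharding idea above, the sketch below routes each URL to one of several independent instances by hashing its host name, so all links from the same domain land on the same shard. The shard count of 8 and the name `shard_for_url` are assumptions made for the example; nothing here is an Elasticsearch API.

```python
import hashlib
from urllib.parse import urlsplit

NUM_SHARDS = 8  # assumed number of independent instances


def shard_for_url(url: str, num_shards: int = NUM_SHARDS) -> int:
    """Return a stable shard index for a URL, keyed on its host name.

    Hypothetical helper: uses a cryptographic hash rather than Python's
    built-in hash() so the mapping is stable across processes.
    """
    host = (urlsplit(url).hostname or "").lower()
    digest = hashlib.sha1(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


if __name__ == "__main__":
    for u in ("https://example.com/a", "https://example.com/b?x=1"):
        print(u, "->", shard_for_url(u))
```

Keying on the host rather than the full URL keeps per-domain queries on a single instance, at the cost of uneven shards when a few domains are very large. Changing the shard count remaps most hosts, so a scheme like this suits a fixed number of instances.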
cimmanom, about 7 years ago
We've found Elasticsearch to be quite performant with hundreds of millions of documents. What are your concerns with scaling it?
dchuk, about 7 years ago
Building an Ahrefs/Moz/Majestic competitor?