Ask HN: Fastest Crawl of HN Articles

8 points by agencies about 3 years ago
HN links to over 6 million URLs in stories and comments. Many domains have expired or the content is no longer available. The Internet Archive has much of the content but throttles requests. What's the fastest way to get the historical content?
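A minimal sketch of the Internet Archive route mentioned above: check the Wayback Machine's availability API for an archived copy of each dead URL, pacing requests to stay under the throttling. The one-request-per-second delay and the example URL are assumptions, not documented values.

    # Sketch: look up a dead URL in the Wayback Machine availability API,
    # pacing requests since archive.org throttles heavy clients.
    import time
    import requests

    def wayback_snapshot(url):
        resp = requests.get(
            "https://archive.org/wayback/available",
            params={"url": url},
            timeout=30,
        )
        resp.raise_for_status()
        closest = resp.json().get("archived_snapshots", {}).get("closest")
        return closest["url"] if closest else None

    for url in ["http://example.com/long-gone-page"]:  # hypothetical dead URL
        print(url, "->", wayback_snapshot(url))
        time.sleep(1.0)  # assumed politeness delay, not a documented limit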

3 comments

arinlen about 3 years ago
HN does have a REST API which is quite easy to use.

https://github.com/HackerNews/API

I'm not sure what rate limiting policy is in place, but in theory you can start with a request for maxitem and from that point on just GET all items down to zero until you hit some sort of blocker.
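A minimal sketch of that maxitem walk, using the Firebase endpoints documented in the linked repo. It is deliberately naive: no retries or concurrency, and fetching tens of millions of items one at a time would be slow in practice.

    # Sketch: fetch maxitem, then GET every item id down to 1.
    import requests

    BASE = "https://hacker-news.firebaseio.com/v0"

    max_item = requests.get(f"{BASE}/maxitem.json", timeout=30).json()
    for item_id in range(max_item, 0, -1):
        item = requests.get(f"{BASE}/item/{item_id}.json", timeout=30).json()
        if item is None:  # deleted or missing items come back as null
            continue
        print(item_id, item.get("type"), item.get("url"))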
jpcapdevila about 3 years ago
The best way to do it is from Google BigQuery.

There's a dataset containing everything: bigquery-public-data.hacker_news.full

You can write SQL and it's super fast. Sample:

    SELECT * FROM `bigquery-public-data.hacker_news.full` LIMIT 1
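A short sketch of running that query with the official google-cloud-bigquery Python client; it assumes the library is installed (pip install google-cloud-bigquery) and that GCP credentials with a billable project are already configured.

    # Sketch: query the public Hacker News dataset via the BigQuery client.
    from google.cloud import bigquery

    client = bigquery.Client()
    query = """
        SELECT id, type, title, url
        FROM `bigquery-public-data.hacker_news.full`
        WHERE url IS NOT NULL
        LIMIT 10
    """
    for row in client.query(query).result():
        print(row.id, row.url)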
python273 about 3 years ago
maybe https://commoncrawl.org/
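A hedged sketch of the Common Crawl route: query a CDX index for captures of a URL pattern. The crawl id used here (CC-MAIN-2024-10) is just an example; current ids are listed at https://index.commoncrawl.org/. Each result line is a JSON record pointing at a WARC filename, offset, and length, which can then be fetched with a ranged request.

    # Sketch: look up captures of a URL pattern in a Common Crawl CDX index.
    import requests

    resp = requests.get(
        "https://index.commoncrawl.org/CC-MAIN-2024-10-index",
        params={"url": "example.com/*", "output": "json"},
        timeout=60,
    )
    resp.raise_for_status()
    for line in resp.text.splitlines():
        print(line)  # one JSON record per capture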