Any good API to scrape HN other than this?

34 points by kaushikfrnd over 11 years ago
How can I scrape HN other than with https://github.com/karan/HackerNewsAPI? Is there a good premade library in Python?

19 comments

carbocation over 11 years ago
The robots.txt from news.ycombinator.com reads as follows:

    User-Agent: *
    Disallow: /x?
    Disallow: /vote?
    Disallow: /reply?
    Disallow: /submitted?
    Disallow: /submitlink?
    Disallow: /threads?
    Crawl-delay: 30

So nominally you should feel free to set up a scraper that crawls one non-disallowed resource every 30 seconds.
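For anyone who wants to honor those rules from Python, here is a minimal sketch using the standard library's `urllib.robotparser`. It parses the rules quoted above from a string rather than fetching them, and the helper names (`build_parser`, `allowed_urls`) and the candidate URLs are made up for illustration. Nothing here touches the network.

```python
import urllib.robotparser

# The robots.txt rules quoted in the comment above, inlined so the sketch
# runs offline. A real crawler would fetch the live file instead.
ROBOTS_TXT = """\
User-Agent: *
Disallow: /x?
Disallow: /vote?
Disallow: /reply?
Disallow: /submitted?
Disallow: /submitlink?
Disallow: /threads?
Crawl-delay: 30
"""


def build_parser(robots_txt):
    """Hypothetical helper: build a RobotFileParser from robots.txt text."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed_urls(parser, urls, user_agent="*"):
    """Hypothetical helper: keep only the URLs the rules permit."""
    return [url for url in urls if parser.can_fetch(user_agent, url)]


parser = build_parser(ROBOTS_TXT)
candidates = [
    "https://news.ycombinator.com/news",
    "https://news.ycombinator.com/item?id=1",
    "https://news.ycombinator.com/vote?id=1",
]
to_crawl = allowed_urls(parser, candidates)
delay = parser.crawl_delay("*")  # seconds to wait between requests

print(to_crawl)
print(delay)
```

In an actual crawl loop you would call `time.sleep(delay)` between requests, so that at most one page is fetched per crawl-delay interval.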
napoleond over 11 years ago
Just use https://www.hnsearch.com, along with https://www.hnsearch.com/rss and https://www.hnsearch.com/bigrss if you want to mimic the front page.

There is rarely a need to scrape HN directly, but if you do, make sure your bot is polite (especially with respect to rate limits).
goldenkey over 11 years ago
Yahoo Pipes would work really well if you're willing to write a few HTML regexes or DOM element selectors.

http://pipes.yahoo.com/pipes/
jcla1 over 11 years ago
Not a full-featured API, but a way to scrape all of HN: http://jcla1.com/blog/2013/05/13/crawling-hackernews/

Disclaimer: It's my own blog.

Edit: It uses HNSearch, so it doesn't violate the robots.txt and can be crawled faster.
obayesshelton over 11 years ago
You don't even need an API. All you need is an RSS reader pointed at https://news.ycombinator.com/rss
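If you would rather read the feed from a script than from a reader app, a rough sketch with only the standard library could look like this. The sample feed below is invented so the code runs offline, and the assumption that each item carries `title`, `link`, and `comments` elements should be checked against the real feed. The helper names `parse_items` and `fetch_feed` are hypothetical.

```python
import urllib.request
import xml.etree.ElementTree as ET

# A made-up two-item feed in the usual RSS 2.0 shape, so the sketch runs
# without network access. The real feed's fields may differ.
SAMPLE_RSS = """\
<rss version="2.0">
  <channel>
    <title>Hacker News</title>
    <item>
      <title>Example story one</title>
      <link>https://example.com/one</link>
      <comments>https://news.ycombinator.com/item?id=1</comments>
    </item>
    <item>
      <title>Example story two</title>
      <link>https://example.com/two</link>
      <comments>https://news.ycombinator.com/item?id=2</comments>
    </item>
  </channel>
</rss>
"""


def parse_items(rss_text):
    """Hypothetical helper: return (title, link, comments) per <item>."""
    root = ET.fromstring(rss_text)
    return [
        (item.findtext("title"), item.findtext("link"), item.findtext("comments"))
        for item in root.iter("item")
    ]


def fetch_feed(url="https://news.ycombinator.com/rss", timeout=10):
    """Download the feed body as bytes. Not called in this sketch."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


stories = parse_items(SAMPLE_RSS)
for title, link, comments in stories:
    print(title, link, comments)
```

To read the live feed, `parse_items(fetch_feed())` should work as long as the feed is well-formed XML, since `ET.fromstring` accepts bytes as well as text. Poll it sparingly.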
deft over 11 years ago
I wrote an alright one in Python for use in my HN app for BlackBerry 10. Not sure how good it is, but check it out here: https://github.com/krruzic/Reader-YC/tree/master/app

I'm not sure what you're trying to do, though. I used BeautifulSoup because I couldn't get lxml working on BB10, but if it were switched to lxml it would be much faster.
shamsulbuddy over 11 years ago
http://hnapp.com/ is the best HN scraping site. It returns data in JSON / RSS format.
mikektung over 11 years ago
Depending on what you're trying to do with the data, you may find http://diffbot.com/products/automatic/ helpful for getting the clean article text and categorization in JSON format. It can be used as a complement/augmentation to the great suggestions here for getting the links.

Disclosure: Founder of Diffbot here.
dmpayton over 11 years ago
I wrote a Python wrapper for the iHackerNews API, if that helps.

https://github.com/dmpayton/python-ihackernews
droid_w over 11 years ago
There's a Twitter feed based on HN: https://twitter.com/newsycombinator

You can use the Twitter API and read from there.
amirouche over 11 years ago
There are hundreds of data sets out there. Why must it always be HN?
mvanveen over 11 years ago
I have a Scrapy-based crawler project available at http://github.com/mvanveen/hncrawl
cheeaun over 11 years ago
I built https://github.com/cheeaun/node-hnapi
kaushikfrnd over 11 years ago
Can anyone tell me how to get https://news.ycombinator.com/news through the HNSearch API? I want the API link -> [http://api.thriftdb.com/api.hnsearch.com/]
rotub over 11 years ago
https://www.hnsearch.com/api
jenjenhar over 11 years ago
Out of curiosity, why does HN not release an official API?
fakename over 11 years ago
other than this
notastartup over 11 years ago
I wrote http://scrape.it and http://scrape.ly to do this.
culo over 11 years ago
Try these:

- https://www.mashape.com/scrape/scrape-it#!documentation
- https://www.mashape.com/karangoel/hnify#!documentation