TechEcho

19 comments

carbocation over 11 years ago

The robots.txt from news.ycombinator.com reads as follows:<pre><code>User-Agent: *
Disallow: /x?
Disallow: /vote?
Disallow: /reply?
Disallow: /submitted?
Disallow: /submitlink?
Disallow: /threads?
Crawl-delay: 30
</code></pre> So nominally you should feel free to set up a scraper that crawls one non-disallowed resource every 30 seconds.
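As a rough sketch of how a scraper could honor those rules, the snippet below feeds the robots.txt text quoted above into Python's standard `urllib.robotparser` and asks it which paths are allowed and what the crawl delay is. The constant names and the example URLs (`/news`, `/vote?id=1`) are assumptions made for illustration; the live robots.txt may have changed since this comment was written.

```python
from urllib.robotparser import RobotFileParser

# robots.txt content as quoted in the comment above (may be out of date).
ROBOTS_TXT = """\
User-Agent: *
Disallow: /x?
Disallow: /vote?
Disallow: /reply?
Disallow: /submitted?
Disallow: /submitlink?
Disallow: /threads?
Crawl-delay: 30
"""

BASE = "https://news.ycombinator.com"  # site the rules apply to


def build_parser(text):
    """Parse robots.txt text without touching the network."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Hypothetical example URLs, one allowed and one disallowed.
allowed = parser.can_fetch("*", BASE + "/news")
blocked = parser.can_fetch("*", BASE + "/vote?id=1")
delay = parser.crawl_delay("*")  # seconds to wait between requests

print(allowed, blocked, delay)
```

In a real crawler you would probably call `set_url()` and `read()` on the parser to fetch the current file instead of hard-coding it.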

napoleond over 11 years ago

Just use <a href="https://www.hnsearch.com" rel="nofollow">https://www.hnsearch.com</a>, along with <a href="https://www.hnsearch.com/rss" rel="nofollow">https://www.hnsearch.com/rss</a> and <a href="https://www.hnsearch.com/bigrss" rel="nofollow">https://www.hnsearch.com/bigrss</a> if you want to mimic the front page.

There is rarely a need to scrape HN directly, but if you do, make sure your bot is polite (especially with respect to rate limits).
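One way to keep a bot polite about rate limits is to put a small throttle in front of every request. The class below is only a minimal sketch: the `Throttle` name, its parameters, and the 30-second interval are assumptions for illustration, not something any of the tools mentioned here provide. The clock and sleep functions are injectable so the behavior can be checked without actually waiting.

```python
import time


class Throttle:
    """Enforce a minimum interval between consecutive requests."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval  # seconds between requests
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        """Block until the next request is allowed; return seconds slept."""
        waited = 0.0
        if self._last is not None:
            remaining = self.min_interval - (self._clock() - self._last)
            if remaining > 0:
                self._sleep(remaining)
                waited = remaining
        self._last = self._clock()
        return waited
```

Usage would be along the lines of creating `throttle = Throttle(30)` once and calling `throttle.wait()` immediately before each HTTP request.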

goldenkey over 11 years ago

Yahoo Pipes would work really well if you're willing to write a few HTML regexes or DOM element selectors. <a href="http://pipes.yahoo.com/pipes/" rel="nofollow">http://pipes.yahoo.com/pipes/</a>

jcla1 over 11 years ago

Not a full-featured API, but a way to scrape all of HN: <a href="http://jcla1.com/blog/2013/05/13/crawling-hackernews/" rel="nofollow">http://jcla1.com/blog/2013/05/13/crawling-hackernews/</a>

Disclaimer: It's my own blog.

Edit: Uses HNSearch, so it doesn't violate the robots.txt and can be crawled faster.

obayesshelton over 11 years ago

You don't even need an API; all you need is an RSS reader pointed at <a href="https://news.ycombinator.com/rss" rel="nofollow">https://news.ycombinator.com/rss</a>
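If you would rather consume the feed from code than from a reader, a minimal sketch using only Python's standard library might look like the following. The `parse_rss_items` helper and the sample feed are made up for illustration; the sample merely assumes the usual RSS 2.0 layout of `item` elements with `title` and `link` children, and the real feed may carry more fields.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample in RSS 2.0 layout, standing in for the real feed.
SAMPLE_FEED = """\
<rss version="2.0">
  <channel>
    <title>Sample feed</title>
    <item>
      <title>First story</title>
      <link>https://example.com/1</link>
    </item>
    <item>
      <title>Second story</title>
      <link>https://example.com/2</link>
    </item>
  </channel>
</rss>
"""


def parse_rss_items(xml_text):
    """Return a list of (title, link) tuples, one per <item> element."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        link = item.findtext("link", default="")
        items.append((title, link))
    return items


for title, link in parse_rss_items(SAMPLE_FEED):
    print(title, link)
```

To use it against the live feed, you could download the XML with `urllib.request.urlopen` and pass the decoded text to `parse_rss_items`, keeping the crawl delay in mind.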

deft over 11 years ago

I wrote an alright one in Python for use in my HN app for BlackBerry 10. Not sure how good it is, but check it out here: <a href="https://github.com/krruzic/Reader-YC/tree/master/app" rel="nofollow">https://github.com/krruzic/Reader-YC/tree/master/app</a>

I'm not sure what you're trying to do though. I used BeautifulSoup because I couldn't get lxml working on BB10, but if it were switched to lxml it would be much faster.

shamsulbuddy over 11 years ago

<a href="http://hnapp.com/" rel="nofollow">http://hnapp.com/</a> -- This is the best HN-scraping site. It returns data in JSON / RSS format.

mikektung over 11 years ago

Depending on what you're trying to do with the data, you may find <a href="http://diffbot.com/products/automatic/" rel="nofollow">http://diffbot.com/products/automatic/</a> helpful for getting the clean article text and categorization in JSON format. It can be used as a complement/augmentation to the great suggestions here for getting the links.

Disclosure: Founder of Diffbot here.

dmpayton over 11 years ago

I wrote a Python wrapper for the iHackerNews API, if that helps. <a href="https://github.com/dmpayton/python-ihackernews" rel="nofollow">https://github.com/dmpayton/python-ihackernews</a>

droid_w over 11 years ago

There's a Twitter feed based on HN - <a href="https://twitter.com/newsycombinator" rel="nofollow">https://twitter.com/newsycombinator</a>

You can use the Twitter API and read from there.

amirouche over 11 years ago

There are hundreds of data sets out there; why must it always be HN?

mvanveen over 11 years ago

I have a Scrapy-based crawler project available at <a href="http://github.com/mvanveen/hncrawl" rel="nofollow">http://github.com/mvanveen/hncrawl</a>

cheeaun over 11 years ago

I built <a href="https://github.com/cheeaun/node-hnapi" rel="nofollow">https://github.com/cheeaun/node-hnapi</a>

kaushikfrnd over 11 years ago

Can anyone tell me how to get <a href="https://news.ycombinator.com/news" rel="nofollow">https://news.ycombinator.com/news</a> through the HNSearch API? I want the API link -> [<a href="http://api.thriftdb.com/api.hnsearch.com/" rel="nofollow">http://api.thriftdb.com/api.hnsearch.com/</a>]

rotub over 11 years ago

<a href="https://www.hnsearch.com/api" rel="nofollow">https://www.hnsearch.com/api</a>

jenjenhar over 11 years ago

Out of curiosity, why does HN not release an official API?

fakename over 11 years ago

other than this

notastartup over 11 years ago

I wrote <a href="http://scrape.it" rel="nofollow">http://scrape.it</a> and <a href="http://scrape.ly" rel="nofollow">http://scrape.ly</a> to do this.

culo over 11 years ago

Try these:

- <a href="https://www.mashape.com/scrape/scrape-it#!documentation" rel="nofollow">https://www.mashape.com/scrape/scrape-it#!documentation</a>
- <a href="https://www.mashape.com/karangoel/hnify#!documentation" rel="nofollow">https://www.mashape.com/karangoel/hnify#!documentation</a>

Any good API to scrape HN other than this?
