Be careful not to hammer the site. Your IP could be added to the blocklist if you are too aggressive:<p><i>"Yes, we block IPs that seem to be crawlers ignoring robots.txt. We've always blocked abusive IPs, but I tightened up the blocking a few weeks ago. A lot of people were crawling HN, most of them unnecessarily because they were doing things they could have done more efficiently through HNSearch's API[1]." --pg</i>[2]<p>[1] <a href="http://www.hnsearch.com/api" rel="nofollow">http://www.hnsearch.com/api</a><p>[2] <a href="http://news.ycombinator.com/item?id=3196298" rel="nofollow">http://news.ycombinator.com/item?id=3196298</a>
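The advice above amounts to throttling your requests. A minimal sketch of a polite fetcher in Ruby, waiting a fixed delay between requests (the class name and the 30-second delay are illustrative, not anything HN prescribes):

```ruby
require 'net/http'
require 'uri'

# Polite-crawling sketch: enforces a minimum delay between requests
# so the crawler doesn't hammer the site.
class PoliteFetcher
  def initialize(delay: 30)
    @delay = delay
    @last_request_at = nil
  end

  # Sleeps if necessary so requests are at least @delay seconds apart,
  # then fetches the URL body.
  def fetch(url)
    wait = seconds_to_wait(Time.now)
    sleep(wait) if wait > 0
    @last_request_at = Time.now
    Net::HTTP.get(URI(url))
  end

  # Throttling logic kept separate so it can be tested without the network.
  def seconds_to_wait(now)
    return 0 unless @last_request_at
    [@delay - (now - @last_request_at), 0].max
  end
end
```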
I've written a script that scrapes HN, which anyone is welcome to use. I use it for the Hacker News iPhone app:<p><a href="http://api.thequeue.org/hn/frontpage.xml" rel="nofollow">http://api.thequeue.org/hn/frontpage.xml</a><p><a href="http://api.thequeue.org/hn/new.xml" rel="nofollow">http://api.thequeue.org/hn/new.xml</a><p><a href="http://api.thequeue.org/hn/best.xml" rel="nofollow">http://api.thequeue.org/hn/best.xml</a>
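A sketch of consuming one of those feeds, assuming they are standard RSS with `item`/`title`/`link` elements (the actual schema may differ):

```ruby
require 'rexml/document'

# Parse an RSS-style feed string into an array of {title:, link:} hashes.
def extract_items(xml)
  doc = REXML::Document.new(xml)
  doc.get_elements('//item').map do |item|
    { title: item.text('title'), link: item.text('link') }
  end
end
```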
item = HN2JSON.find 4623690<p>NoMethodError: undefined method `url=' for #<HN2JSON::Entity:0x007fb84cd63a88><p>from /Users/markburns/.rvm/gems/ruby-1.9.3-p194/gems/hn2json-0.0.4/lib/hn2json/parser.rb:92:in `block in get_attrs_post'
from /Users/markburns/.rvm/gems/ruby-1.9.3-p194/gems/hn2json-0.0.4/lib/hn2json/entity.rb:92:in `add_attrs'
from /Users/markburns/.rvm/gems/ruby-1.9.3-p194/gems/hn2json-0.0.4/lib/hn2json/parser.rb:91:in `get_attrs_post'
from /Users/markburns/.rvm/gems/ruby-1.9.3-p194/gems/hn2json-0.0.4/lib/hn2json/entity.rb:71:in `get_attrs'
from /Users/markburns/.rvm/gems/ruby-1.9.3-p194/gems/hn2json-0.0.4/lib/hn2json/entity.rb:56:in `initialize'
from /Users/markburns/.rvm/gems/ruby-1.9.3-p194/gems/hn2json-0.0.4/lib/hn2json.rb:35:in `new'
from /Users/markburns/.rvm/gems/ruby-1.9.3-p194/gems/hn2json-0.0.4/lib/hn2json.rb:35:in `find'
Going through the code on GitHub to see how an HN page is parsed was informative. I may use this to build a similar parser in Node.js. My interest is in building an intelligent agent that filters content based on my interests (e.g. coding, customer acquisition, hiring) and notifies me on a daily or weekly basis.
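The filtering step described above could start as simply as a keyword match over story titles; a hypothetical sketch (the keyword list and story hashes are made up for illustration):

```ruby
# Interests expressed as a case-insensitive pattern over story titles.
INTERESTS = /coding|customer acquisition|hiring/i

# Keep only stories whose title matches one of the interest keywords.
def filter_stories(stories)
  stories.select { |s| s[:title] =~ INTERESTS }
end
```

A real agent would presumably score matches and batch them into a daily or weekly digest, but a keyword filter is enough to prototype the idea.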
I wrote a small, Scrapy-based HN crawler, available at <a href="http://github.com/mvanveen/hncrawl" rel="nofollow">http://github.com/mvanveen/hncrawl</a>, in case anyone is interested.