Be careful not to hammer the site. Your IP could be added to the blocklist if you are too aggressive:<p><i>"Yes, we block IPs that seem to be crawlers ignoring robots.txt. We've always blocked abusive IPs, but I tightened up the blocking a few weeks ago. A lot of people were crawling HN, most of them unnecessarily because they were doing things they could have done more efficiently through HNSearch's API[1]." --pg</i>[2]<p>[1] <a href="http://www.hnsearch.com/api" rel="nofollow">http://www.hnsearch.com/api</a><p>[2] <a href="http://news.ycombinator.com/item?id=3196298" rel="nofollow">http://news.ycombinator.com/item?id=3196298</a>
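The advice above amounts to throttling your requests. A minimal sketch of a polite fetcher in Ruby, waiting a fixed delay between requests (the class name and the 30-second delay are illustrative, not anything HN prescribes):

```ruby
require 'net/http'
require 'uri'

# Polite-crawling sketch: enforces a minimum delay between requests
# so the crawler doesn't hammer the site.
class PoliteFetcher
  def initialize(delay: 30)
    @delay = delay
    @last_request_at = nil
  end

  # Sleeps if necessary so requests are at least @delay seconds apart,
  # then fetches the URL body.
  def fetch(url)
    wait = seconds_to_wait(Time.now)
    sleep(wait) if wait > 0
    @last_request_at = Time.now
    Net::HTTP.get(URI(url))
  end

  # Throttling logic kept separate so it can be tested without the network.
  def seconds_to_wait(now)
    return 0 unless @last_request_at
    [@delay - (now - @last_request_at), 0].max
  end
end
```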
I've written a script that scrapes HN, which anyone is welcome to use. I use it for the Hacker News iPhone app:<p><a href="http://api.thequeue.org/hn/frontpage.xml" rel="nofollow">http://api.thequeue.org/hn/frontpage.xml</a><p><a href="http://api.thequeue.org/hn/new.xml" rel="nofollow">http://api.thequeue.org/hn/new.xml</a><p><a href="http://api.thequeue.org/hn/best.xml" rel="nofollow">http://api.thequeue.org/hn/best.xml</a>
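A sketch of consuming one of those feeds, assuming they are standard RSS with `item`/`title`/`link` elements (the actual schema may differ):

```ruby
require 'rexml/document'

# Parse an RSS-style feed string into an array of {title:, link:} hashes.
def extract_items(xml)
  doc = REXML::Document.new(xml)
  doc.get_elements('//item').map do |item|
    { title: item.text('title'), link: item.text('link') }
  end
end
```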
item = HN2JSON.find 4623690<p>NoMethodError: undefined method `url=' for #<HN2JSON::Entity:0x007fb84cd63a88><p>from /Users/markburns/.rvm/gems/ruby-1.9.3-p194/gems/hn2json-0.0.4/lib/hn2json/parser.rb:92:in `block in get_attrs_post'
from /Users/markburns/.rvm/gems/ruby-1.9.3-p194/gems/hn2json-0.0.4/lib/hn2json/entity.rb:92:in `add_attrs'
from /Users/markburns/.rvm/gems/ruby-1.9.3-p194/gems/hn2json-0.0.4/lib/hn2json/parser.rb:91:in `get_attrs_post'
from /Users/markburns/.rvm/gems/ruby-1.9.3-p194/gems/hn2json-0.0.4/lib/hn2json/entity.rb:71:in `get_attrs'
from /Users/markburns/.rvm/gems/ruby-1.9.3-p194/gems/hn2json-0.0.4/lib/hn2json/entity.rb:56:in `initialize'
from /Users/markburns/.rvm/gems/ruby-1.9.3-p194/gems/hn2json-0.0.4/lib/hn2json.rb:35:in `new'
from /Users/markburns/.rvm/gems/ruby-1.9.3-p194/gems/hn2json-0.0.4/lib/hn2json.rb:35:in `find'
Going through the code on GitHub to see how an HN page is parsed was informative. I may use this to build a similar parser in Node.js. My interest is in building an intelligent agent that filters content based on my interests (e.g. coding, customer acquisition, hiring) and notifies me on a daily or weekly basis.
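The filtering step described above could start as simply as a keyword match over story titles; a hypothetical sketch (the keyword list and story hashes are made up for illustration):

```ruby
# Interests expressed as a case-insensitive pattern over story titles.
INTERESTS = /coding|customer acquisition|hiring/i

# Keep only stories whose title matches one of the interest keywords.
def filter_stories(stories)
  stories.select { |s| s[:title] =~ INTERESTS }
end
```

A real agent would presumably score matches and batch them into a daily or weekly digest, but a keyword filter is enough to prototype the idea.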
I wrote a small, Scrapy-based HN crawler, available at <a href="http://github.com/mvanveen/hncrawl" rel="nofollow">http://github.com/mvanveen/hncrawl</a>, in case anyone is interested.