Diffbot's stuff is in a different league (it's a hosted service with a large dataset), but if anyone's vaguely interested in this area, I've been working on a Ruby library that offers some <i>similar</i> features: <a href="https://github.com/peterc/pismo" rel="nofollow">https://github.com/peterc/pismo</a><p>It is currently undergoing the "big rewrite" (which includes some proper classification work rather than shooting in the dark), but it's still in daily use on several sites. Hopefully I can learn a few lessons from Diffbot!<p>I should also point out boilerpipe - <a href="http://code.google.com/p/boilerpipe/" rel="nofollow">http://code.google.com/p/boilerpipe/</a> - an interesting Java-based content extraction project that's being worked on by an actual PhD student rather than a dilettante like me ;-) Again, Diffbot's stuff goes a lot further than this, but there are lessons to be learned nonetheless.<p>Last but not least, a paper by the aforementioned PhD student called <i>Boilerplate Detection using Shallow Text Features</i> is available at <a href="http://www.l3s.de/~kohlschuetter/boilerplate/" rel="nofollow">http://www.l3s.de/~kohlschuetter/boilerplate/</a><p>I suspect that there's going to be a lot more work in these areas in the medium term because of the growth of the "e-discovery" market and because the dreams of a consistently marked up "semantic Web" have been going down the pan for a while now.
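For anyone wondering what "shallow text features" might look like in practice, here is a rough sketch of the general idea: split a page into text blocks, then keep only blocks that have enough words and a low share of linked words. This is not the paper's classifier, nor what pismo, boilerpipe, or Diffbot actually do. The block tag list, the `extract_content` name, and both thresholds are assumptions picked purely for illustration.

```python
from html.parser import HTMLParser

# Assumption: these tags are treated as block boundaries for this sketch.
BLOCK_TAGS = {"p", "div", "li", "td", "h1", "h2", "h3", "blockquote"}


class BlockCollector(HTMLParser):
    """Split HTML into text blocks and count words inside <a> tags."""

    def __init__(self):
        super().__init__()
        self.blocks = []   # list of (text, word_count, linked_word_count)
        self._text = []
        self._words = 0
        self._linked = 0
        self._in_link = 0

    def _flush(self):
        # Close the current block, if it holds any words, and start a new one.
        if self._words:
            self.blocks.append((" ".join(self._text), self._words, self._linked))
        self._text, self._words, self._linked = [], 0, 0

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._flush()
        elif tag == "a":
            self._in_link += 1

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self._flush()
        elif tag == "a" and self._in_link:
            self._in_link -= 1

    def handle_data(self, data):
        words = data.split()
        self._text.extend(words)
        self._words += len(words)
        if self._in_link:
            self._linked += len(words)

    def close(self):
        super().close()
        self._flush()


def extract_content(html, min_words=10, max_link_density=0.33):
    """Keep blocks that are long enough and not dominated by link text.

    Both thresholds are illustrative guesses, not tuned values.
    """
    collector = BlockCollector()
    collector.feed(html)
    collector.close()
    kept = []
    for text, words, linked in collector.blocks:
        link_density = linked / words
        if words >= min_words and link_density <= max_link_density:
            kept.append(text)
    return kept
```

Calling `extract_content(page_html)` returns the surviving blocks as a list of strings. A real extractor would need more than two features (neighbouring blocks, for instance), and short blocks such as headings and captions get dropped by this sketch, so treat it as a starting point only.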
Cool, works better than other services like this I've found. I tried it on Ars Technica's review of the Xoom tablet and it found all 10 pages. It didn't find the embedded video though. Also, all the formatting is stripped, which makes it hard to differentiate section headers from content paragraphs, and all the images are in one list to the side, removed from their original context.<p>What I'd really love to see is a combination of the RSS API and the article API to produce full article RSS feeds for any site.
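The full-article feed idea above could be glued together roughly like this: walk the items in an RSS feed, pass each link to some article extractor, and swap the summary for the extracted text. The sketch uses only Python's standard library. `expand_feed` and `fetch_article` are hypothetical names; `fetch_article` stands in for whatever article-extraction call you have available and is not a real Diffbot API.

```python
import xml.etree.ElementTree as ET


def expand_feed(rss_xml, fetch_article):
    """Return the RSS document with each item's description replaced by full text.

    fetch_article is a caller-supplied function (hypothetical here) that takes
    a URL and returns the article text, or None if extraction failed.
    """
    root = ET.fromstring(rss_xml)
    for item in root.iter("item"):
        link = item.findtext("link")
        if not link:
            continue
        full_text = fetch_article(link.strip())
        if not full_text:
            continue  # keep the original summary when extraction fails
        description = item.find("description")
        if description is None:
            description = ET.SubElement(item, "description")
        description.text = full_text
    return ET.tostring(root, encoding="unicode")
```

In use you would fetch the feed, call `expand_feed(feed_xml, my_extractor)`, and serve the result at your own URL. Caching extracted articles by link would be the obvious next step, since feed readers poll often.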
I really like this.<p>You guys should really open up a tagging API. As a developer working on a social site, I'd love to be able to auto-tag content that users upload.
This looks great. I'd love to find out more about your API and what type of web scraping techniques you're using. It looks like this is going to be available publicly to developers? What type of usage do you guys allow?
Article content starts out in a straightforward, easy-to-process form, as created by the reporter/author, in a content management system. Then the CMS chops it up into pages and adds boilerplate for presentation as a web page. Then you expend lots of effort to stick the pages back together and filter the crap back out, to arrive at an approximation of the original. Generally a noisy, imperfect approximation that is less useful for your purposes (indexing, information extraction, etc).<p>If technical considerations were the only considerations, we would find a way to get at the content directly instead of using this Rube Goldberg mechanism. But of course there are also economic considerations. Content owners don't want to give you unadulterated content for free; their business model requires that ads be served along with it.<p>Will an arms race develop between scrapers and publishers, similar to the arms race between spammers and spam filters? Will publishers start randomizing their HTML generation, or otherwise making it difficult to separate content from peripheral material?
I like the "machine learning" part of the API, but there seems to be no way of improving the learning by giving feedback.<p>I just tested another article on HN[1] and the tags are pretty far off. I expected iPad 2 and photos/pixels, but I got 4G and manufacturing instead[2]. So I am really interested in how the system came up with the right and wrong tags (which I guess sounds more important than finding the body of the article, as people are making that easier for Facebook and others through Open Graph/RDFa/hNews etc.)<p>[1]: <a href="http://daringfireball.net/2011/03/bending_over_backwards" rel="nofollow">http://daringfireball.net/2011/03/bending_over_backwards</a><p>[2]: tags received:
Recyclable materials, Battery, 4G, Apple Inc., Rechargeable battery, Walter Mossberg, Technology, Computing, Manufacturing, Technology_Internet
Duh... for the pages I tried, it always gives me the "No article at this URL".<p>I'd like to have some kind of boilerplate removal that works well for forum content (e.g. phpBB and related), and boilerpipe (the library that I tried) gives relatively mixed results.<p>Does anyone know an existing solution for this?
We are using <a href="http://purifyr.com/" rel="nofollow">http://purifyr.com/</a> for this. Pretty happy with the Unicode support and the speed of 20-50 documents per second.