TechEcho

18 comments

petercooper about 14 years ago

Diffbot's stuff is in a different league (it's a hosted service with a large dataset), but if anyone's vaguely interested in this area, I've been working on a Ruby library that performs some similar functions: https://github.com/peterc/pismo

It is currently undergoing the "big rewrite" (which includes some proper classification work rather than shooting in the dark), but it's still in daily use on several sites. Hopefully I can learn a few lessons from Diffbot!

I should also point out boilerpipe - http://code.google.com/p/boilerpipe/ - an interesting Java-based content extraction project that's being worked on by an actual PhD student rather than a dilettante like me ;-) Again, Diffbot's stuff goes a lot further than this, but there are lessons to be learned nonetheless.

Last but not least, a paper by the aforementioned PhD student called "Boilerplate Detection using Shallow Text Features" is available at http://www.l3s.de/~kohlschuetter/boilerplate/

I suspect that there's going to be a lot more work in these areas in the medium term because of the growth of the "e-discovery" market and because the dreams of a consistently marked up "semantic Web" have been washing down the pan for a while now.
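For anyone wondering what "shallow text features" might look like in practice, here is a minimal Python sketch. It assumes two features commonly used for this kind of task, word count and link density per text block. The `Block` type, the function names, and both thresholds are hypothetical illustrations, not taken from the paper, boilerpipe, pismo, or Diffbot.

```python
import re
from dataclasses import dataclass


@dataclass
class Block:
    """A hypothetical text block pulled out of a page."""
    text: str         # all visible text in the block
    anchor_text: str  # the part of that text that sits inside links


def word_count(s: str) -> int:
    # Count runs of word characters as words.
    return len(re.findall(r"\w+", s))


def link_density(block: Block) -> float:
    # Share of the block's words that are link text.
    total = word_count(block.text)
    if total == 0:
        return 1.0  # assumption: treat empty blocks as boilerplate
    return word_count(block.anchor_text) / total


def looks_like_content(block: Block,
                       min_words: int = 15,
                       max_link_density: float = 0.33) -> bool:
    # Thresholds are illustrative guesses, not tuned values.
    return (word_count(block.text) >= min_words
            and link_density(block) <= max_link_density)
```

A navigation bar (few words, all of them links) would be rejected by this rule, while a long paragraph with no links would be kept. A real system would presumably learn such thresholds from labelled pages rather than hard-code them.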

modeless about 14 years ago

Cool, works better than other services like this I've found. I tried it on Ars Technica's review of the Xoom tablet and it found all 10 pages. It didn't find the embedded video though. Also, all the formatting is stripped, which makes it hard to differentiate section headers from content paragraphs, and all the images are in one list to the side, removed from their original context.

What I'd really love to see is a combination of the RSS API and the article API to produce full article RSS feeds for any site.

tansey about 14 years ago

I really like this. You guys should really open up a tagging API. As a developer working on a social site, I'd love to be able to auto-tag content that users upload.

quan about 14 years ago

It looks like something I would integrate into my current project, but your terms suggest that the API is only for personal and non-commercial use.

jranck about 14 years ago

This looks great. I'd love to find out more about your API and what type of web scraping techniques you're using. It looks like this is going to be available publicly to developers? What type of usage do you guys allow?

aaronkaplan about 14 years ago

Article content starts out in a straightforward, easy-to-process form, as created by the reporter/author, in a content management system. Then the CMS chops it up into pages and adds boilerplate for presentation as a web page. Then you expend lots of effort to stick the pages back together and filter the crap back out, to arrive at an approximation of the original. Generally a noisy, imperfect approximation that is less useful for your purposes (indexing, information extraction, etc.).

If technical considerations were the only considerations, we would find a way to get at the content directly instead of using this Rube Goldberg mechanism. But of course there are also economic considerations. Content owners don't want to give you unadulterated content for free; their business model requires that ads be served along with it.

Will an arms race develop between scrapers and publishers, similar to the arms race between spammers and spam filters? Will publishers start randomizing their HTML generation, or otherwise making it difficult to separate content from peripheral material?

itsnotvalid about 14 years ago

I like the "machine learning" part of the API, but there seems to be no way of improving the learning by giving feedback.

I just tested another article from HN [1], and the tags are pretty far off. I expected iPad 2 and photos/pixels, but I got 4G and manufacturing instead [2]. So I am really interested in how the system came up with the right and wrong tags (which I guess sounds more important than finding the body of the article, as people are making that easier for Facebook and others through Open Graph/RDFa/hNews etc.).

[1]: http://daringfireball.net/2011/03/bending_over_backwards

[2]: tags received: Recyclable materials, Battery, 4G, Apple Inc., Rechargeable battery, Walter Mossberg, Technology, Computing, Manufacturing, Technology_Internet
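To illustrate the point about publisher-supplied markup, here is a minimal Python sketch that reads Open Graph `<meta>` properties using only the standard library. The class name, function name, and sample page are hypothetical. Note that Open Graph carries metadata such as the title and type rather than the article body itself, so this shows only the part of the job that explicit markup makes easy.

```python
from html.parser import HTMLParser


class OpenGraphParser(HTMLParser):
    """Collects <meta property="og:..." content="..."> pairs."""

    def __init__(self):
        super().__init__()
        self.properties = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        prop = attr.get("property") or ""
        content = attr.get("content")
        if prop.startswith("og:") and content is not None:
            # Keep the first value seen for each property.
            self.properties.setdefault(prop, content)


def extract_open_graph(html: str) -> dict:
    parser = OpenGraphParser()
    parser.feed(html)
    return parser.properties


# Hypothetical sample page, not taken from any real site.
SAMPLE = """<html><head>
<meta property="og:title" content="Example article">
<meta property="og:type" content="article">
<meta name="viewport" content="width=device-width">
</head><body><p>Body text.</p></body></html>"""

print(extract_open_graph(SAMPLE))
```

Pages without such markup would still need the kind of statistical extraction discussed in this thread.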

mcfchan about 14 years ago

That's a really nifty API. Performance is nice too. Would be interested in knowing more about it.

shantanubala about 14 years ago

This is fantastic! How resource-intensive is it to run? Machine learning, depending on the implementation, can be pretty costly from what I understand.

ronnoch about 14 years ago

The tagging feature is very impressive.

sqrt17 about 14 years ago

Duh... for the pages I tried, it always gives me "No article at this URL".

I'd like to have some kind of boilerplate removal that works well for forum content (e.g. phpBB and related), and boilerpipe (the library that I tried) gives relatively mixed results.

Does anyone know an existing solution for this?

alexdong about 14 years ago

We are using http://purifyr.com/ for this. Pretty happy with the Unicode support and the speed of 20-50 documents per second.

Mamady about 14 years ago

Looks good but doesn't work for Wikipedia. Get it working there and you will have a lot more consumers.

dfgonzalez about 14 years ago

I sent a token request almost a week ago. Do you send both positive and negative answers to these?

Thanks

MrVitaliy about 14 years ago

Would this make Google's job of removing content aggregators slightly harder?

normaluser about 14 years ago

cool

bitanarch about 14 years ago

Looks awesome.

kevingao1 about 14 years ago

Very interesting...

Show HN: Readability-like API Using Machine Learning
