Wanted to thank the HN community for all your encouragement. I first released the Diffbot API as a "Show HN:" post last year (http://news.ycombinator.com/item?id=2310852). $2M+ and lots of hard work later, we're powering some of the largest destination sites out there, like StumbleUpon and the new Digg.
Conceptually I like the product; it's something I would consider paying for. But in practice it doesn't seem to perform that well. It misclassifies things it should get right (an article hosted on Posterous, a YouTube page, Hacker News), and for some queries it just returns results for a completely different web page.

The page-tagging technology looks good, though.
Really like the vision approach to classifying web pages. I've been thinking Google should add this to their algorithm for a while (if they haven't already).

Classifying individual parts of pages (as Diffbot seems to be doing) is difficult, but I suspect Google could take screenshots of pages reported as spam as one class and compare those to screenshots of pages with high PageRank to get a pretty interesting classifier they could use as an extra data point. Could be an interesting experiment anyway, using data they've already got lying around.
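To make the screenshot idea concrete, here's a toy sketch in Python. The folder names, the thumbnail size, and the plain logistic-regression-on-raw-pixels setup are all placeholder assumptions of mine, not anything Diffbot or Google actually does; a real pipeline would render pages with a headless browser and use far richer visual features.

```python
# Toy sketch: classify page screenshots (spam vs. high-PageRank) from raw pixels.
# Assumes screenshots have already been captured to PNG files (not shown here).
from pathlib import Path

import numpy as np
from PIL import Image
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

THUMB_SIZE = (128, 128)  # downscale every screenshot to a fixed size


def load_screenshots(directory, label):
    """Load each PNG in `directory` as a flat grayscale pixel vector with `label`."""
    xs, ys = [], []
    for path in Path(directory).glob("*.png"):
        img = Image.open(path).convert("L").resize(THUMB_SIZE)
        xs.append(np.asarray(img, dtype=np.float32).ravel() / 255.0)
        ys.append(label)
    return xs, ys


def main():
    # Hypothetical folders: screenshots of reported-spam pages vs. high-PR pages.
    spam_x, spam_y = load_screenshots("screenshots/spam", 1)
    good_x, good_y = load_screenshots("screenshots/high_pr", 0)

    X = np.array(spam_x + good_x)
    y = np.array(spam_y + good_y)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_train, y_train)
    print("held-out accuracy:", clf.score(X_test, y_test))


if __name__ == "__main__":
    main()
```

Even something this crude would probably separate ad-farm layouts from real content pages better than you'd expect, which is the extra data point I'm getting at.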
I see some potential in ad tech.

How does caching work?
Is there any focus on security?
Multiple geolocations?

I liked the TOS :)
----
Diffbot.com is made available for personal, non-commercial, and commercial purposes. Services are provided as-is, and we do not make any guarantees on the quality or performance.
Wow, pretty cool. I wonder, though: is there much use for it outside of aggregator sites like Digg? Even for a site like Reddit, all the content is already split up into categories by users. While this is really cool, I'm not really seeing much use for it. What are some problems that this will solve?