Is there an API that does what Instapaper or Readability do? Specifically, one that takes any web page and returns just the text, stripping out the navigation menus and advertising. I'm working on a project that needs to analyze several Internet news sites and extract their content. The problem is that each site has a different structure, which makes adding a new site difficult.<p>Greetings.
Viewtext [1] provides an API that returns cleaner HTML. The output still contains some markup, but it is vastly simplified. You can also roll your own with tools like HtmlCleaner [2] or lxml [3].<p>[1] <a href="http://viewtext.org/" rel="nofollow">http://viewtext.org/</a><p>[2] <a href="http://htmlcleaner.sourceforge.net/" rel="nofollow">http://htmlcleaner.sourceforge.net/</a><p>[3] <a href="http://lxml.de/" rel="nofollow">http://lxml.de/</a>
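If you go the roll-your-own route, here is a rough sketch of the idea. To keep it dependency-free it uses Python's standard-library `html.parser` rather than lxml or HtmlCleaner, so treat it as an illustration of the approach and not as how any of those tools work. The `SKIP_TAGS` set and the names `TextExtractor` and `extract_text` are my own assumptions. Real news sites will need per-site tuning or a proper readability-style scoring heuristic.

```python
from html.parser import HTMLParser

# Tags whose contents are treated as boilerplate.
# This list is an assumption; tune it for the sites you crawl.
SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside", "form"}


class TextExtractor(HTMLParser):
    """Collects text nodes that are not nested inside a skipped tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.skip_depth = 0  # > 0 while inside any tag from SKIP_TAGS
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            text = " ".join(data.split())  # collapse runs of whitespace
            if text:
                self.chunks.append(text)


def extract_text(html):
    """Return the visible text of an HTML string, one text node per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)
```

You would call `extract_text(page_source)` on the HTML you have already downloaded. This only drops whole elements by tag name, so ad blocks wrapped in plain `div`s will still get through. That is the part where per-site rules or a hosted service like the one above earns its keep.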