Ask HN: Is there an API that allows me to extract the text from an article?

3 points by ppjim about 13 years ago
I wonder if there is an API that does the same thing as Instapaper or Readability: you point it at any web page and get just the text, with the navigation menus and advertising removed. I'm working on a project that needs to analyze several Internet news sites and extract their contents. The problem is that each site has a different structure, which makes adding a new site difficult.

Greetings.

2 comments

polyfractal about 13 years ago
Viewtext [1] provides an API that gives you clean(er) HTML. It still contains some markup, but it is vastly simplified. You can also roll your own with tools like HtmlCleaner [2] or lxml [3].

[1] http://viewtext.org/

[2] http://htmlcleaner.sourceforge.net/

[3] http://lxml.de/
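
To give a rough idea of what "roll your own" could look like, here is a minimal sketch. It uses Python's standard-library html.parser rather than lxml so that it runs without extra dependencies; lxml would likely be more robust on messy real-world markup. The names `TextExtractor` and `extract_text` and the tag lists are illustrative assumptions, not part of any of the tools mentioned above, and the heuristic (drop everything inside nav, header, footer and similar tags) is much cruder than what Instapaper or Readability do.

```python
from html.parser import HTMLParser

# Tags whose contents are treated as boilerplate. This list is an assumption
# and will need tuning per site.
SKIP_TAGS = {"head", "script", "style", "nav", "header", "footer", "aside", "form"}

# Tags that start or end a block of text (one output line per block).
BLOCK_TAGS = {"p", "div", "h1", "h2", "h3", "h4", "h5", "h6",
              "li", "br", "article", "section"}


class TextExtractor(HTMLParser):
    """Collects visible text, skipping anything nested inside SKIP_TAGS."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.skip_depth = 0   # > 0 while inside a skipped element
        self.blocks = []      # finished text blocks
        self.current = []     # text fragments of the block being built

    def _flush(self):
        # Join fragments and collapse runs of whitespace into single spaces.
        text = " ".join("".join(self.current).split())
        if text:
            self.blocks.append(text)
        self.current = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._flush()
            self.skip_depth += 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            # Note: an unclosed skip tag would hide the rest of the page.
            if self.skip_depth > 0:
                self.skip_depth -= 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.current.append(data)


def extract_text(html):
    """Return the page's text, one block per line, without skipped regions."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    parser._flush()
    return "\n".join(parser.blocks)
```

Calling `extract_text(page_html)` on a downloaded page returns the remaining text with one block per line. Sites that put their menus in plain div elements will still leak navigation text, which is where a hosted API or a per-site rule set starts to pay off.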
astrofinch about 13 years ago
http://www.diffbot.com/docs/api/article