How can I identify primary article text when scraping news pages

2 points by invisiblerobot over 4 years ago
I suspect this is a hard problem and that deep learning is the state of the art. But maybe I'm missing something?

Just to be clear, given the HTML of a WaPo article, I want to discard all the affiliate links/comments and focus on the article text. I want a generalized solution for many blogs and news sites.

Any tips?

2 comments

tlack over 4 years ago
I've had some good luck with "Unfluff"[0], a credible Node.js package that uses a cascade of logical conditions to figure out what to extract.

It's a very practical start.

I thought the science of it was called "envelope detection" but I'm not getting any relevant hits on that keyword. Will report back if I recall the name.

[0] https://www.npmjs.com/package/unfluff
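
As a rough illustration of what a "cascade of logical conditions" can look like, here is a minimal Python sketch using only the standard library. It is not Unfluff's actual algorithm. The skipped tag list, the `MIN_CHARS` and `MAX_LINK_DENSITY` thresholds, and the names `ParagraphCollector` and `extract_article_text` are all assumptions made for this example, and real pages will need more rules than this.

```python
from html.parser import HTMLParser

# Assumed heuristics, chosen for illustration only.
SKIP_TAGS = {"script", "style", "nav", "aside", "footer", "header", "form"}
MIN_CHARS = 80          # drop very short paragraphs (bylines, buttons)
MAX_LINK_DENSITY = 0.3  # drop paragraphs that are mostly link text


class ParagraphCollector(HTMLParser):
    """Collect the text of each <p> and how much of it sits inside <a> tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.skip_depth = 0   # > 0 while inside an element from SKIP_TAGS
        self.in_p = False
        self.link_depth = 0
        self.text = []
        self.link_chars = 0
        self.paragraphs = []  # list of (text, link_chars) tuples

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._flush()
            self.skip_depth += 1
        elif tag == "p" and not self.skip_depth:
            self._flush()  # an unclosed <p> ends when the next one starts
            self.in_p = True
        elif tag == "a" and self.in_p:
            self.link_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth:
            self.skip_depth -= 1
        elif tag == "p":
            self._flush()
        elif tag == "a" and self.link_depth:
            self.link_depth -= 1

    def handle_data(self, data):
        if self.in_p and not self.skip_depth:
            self.text.append(data)
            if self.link_depth:
                self.link_chars += len(data)

    def _flush(self):
        """Store the paragraph collected so far and reset the state."""
        if self.in_p:
            text = " ".join("".join(self.text).split())
            self.paragraphs.append((text, self.link_chars))
        self.in_p = False
        self.link_depth = 0
        self.text = []
        self.link_chars = 0


def extract_article_text(html):
    """Return paragraphs that pass the length and link-density checks."""
    parser = ParagraphCollector()
    parser.feed(html)
    parser.close()
    parser._flush()  # pick up a trailing unclosed <p>
    kept = []
    for text, link_chars in parser.paragraphs:
        if len(text) < MIN_CHARS:
            continue
        if link_chars / len(text) > MAX_LINK_DENSITY:
            continue
        kept.append(text)
    return "\n\n".join(kept)
```

Calling `extract_article_text(html)` on a page's HTML string returns the surviving paragraphs joined by blank lines. Each condition is cheap and easy to tune per site, which is the appeal of this style, but a fixed tag list and fixed thresholds will misfire on some layouts (for example, articles that put body text in `<div>` elements instead of `<p>`).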
nmstoker over 4 years ago
You haven't given any details about programming language preferences, but if you're interested in a Python approach then Newspaper3k is worth a look:

https://github.com/codelucas/newspaper