How can I identify primary article text when scraping news pages?

2 points by invisiblerobot over 4 years ago

I suspect this is a hard problem and that deep learning is the state of the art. But maybe I'm missing something?

Just to be clear, given the HTML of a WaPo article I want to discard all the affiliate links/comments and focus on the article text. I want a generalized solution for many blogs and news sites.

Any tips?

2 comments

tlack over 4 years ago

I've had some good luck with "Unfluff" [0], a credible Node.js package that uses a cascade of logical conditions to figure out what to extract. It's a very practical start.

I thought the science of it was called "envelope detection" but I'm not getting any relevant hits on that keyword. Will report back if I recall the name.

[0] https://www.npmjs.com/package/unfluff
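
To make "a cascade of logical conditions" concrete, here is a rough sketch of the general idea in Python, using only the standard library. It is not Unfluff's actual rule set. The tag list, the two thresholds, and the names ParagraphCollector and extract_main_text are all made up for illustration, and real pages will need more rules than this.

```python
from html.parser import HTMLParser

# Assumed rules for illustration only; tune or extend for real sites.
SKIP_TAGS = {"script", "style", "nav", "aside", "footer", "header", "form"}
MIN_CHARS = 80          # paragraphs shorter than this are treated as chrome
MAX_LINK_DENSITY = 0.3  # share of a paragraph's text allowed inside <a> tags


class ParagraphCollector(HTMLParser):
    """Collect (text, link_chars) for every <p> outside of SKIP_TAGS."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.skip_depth = 0   # how many SKIP_TAGS we are currently inside
        self.in_p = False
        self.link_depth = 0
        self.text_parts = []
        self.link_chars = 0
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag == "p" and self.skip_depth == 0:
            self.flush_paragraph()  # tolerate a missing </p>
            self.in_p = True
        elif tag == "a" and self.in_p:
            self.link_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag == "p":
            self.flush_paragraph()
        elif tag == "a" and self.link_depth:
            self.link_depth -= 1

    def handle_data(self, data):
        if self.in_p and self.skip_depth == 0:
            self.text_parts.append(data)
            if self.link_depth:
                self.link_chars += len(data)

    def flush_paragraph(self):
        if self.in_p:
            # Collapse runs of whitespace into single spaces.
            text = " ".join("".join(self.text_parts).split())
            if text:
                self.paragraphs.append((text, self.link_chars))
        self.in_p = False
        self.link_depth = 0
        self.text_parts = []
        self.link_chars = 0


def extract_main_text(html):
    """Return paragraphs that pass each condition in the cascade."""
    parser = ParagraphCollector()
    parser.feed(html)
    parser.close()
    parser.flush_paragraph()
    kept = []
    for text, link_chars in parser.paragraphs:
        if len(text) < MIN_CHARS:
            continue  # too short: likely a button, byline, or caption
        if link_chars / len(text) > MAX_LINK_DENSITY:
            continue  # mostly link text: likely navigation or affiliate links
        kept.append(text)
    return "\n\n".join(kept)
```

Calling extract_main_text(html) on a page's HTML returns the surviving paragraphs joined by blank lines. Heuristics like these are cheap to run and easy to debug, but they are brittle across site layouts, which is why a maintained library is usually the better starting point.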

nmstoker over 4 years ago

You haven't given any details about programming language preferences, but if you're interested in a Python approach then Newspaper3k is worth a look: https://github.com/codelucas/newspaper