I'm working on a consumer-facing project that involves analyzing text from web articles.<p>Does anyone know of an API that can handle the text extraction part automatically?<p>Ideally the API would take in a URL and return just the main text content of the page, even for sites with slightly complex layouts.<p>For example: https://www.nytimes.com/2024/03/28/technology/personaltech/smart-glasses-ray-ban-meta.html<p>We're most interested in an API that has a decent free tier + usage-based pricing (at least for overages).<p>So far, most of our searches have turned up website scrapers that return HTML that needs to be further parsed (ScrapingBot, ScrapingBee, Scrapingdog, etc.), or services that are prohibitively priced (Diffbot).<p>Next, we're looking into Apify, but maybe we've missed something?<p>Any recommendations would be <i>greatly</i> appreciated!
Would you consider rolling your own? Python’s goose3 has worked well for me for article extraction. In my experience it succeeded more often than trafilatura and newspaper3k.
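For anyone who wants to try that route, here is a rough sketch of what it might look like with goose3 (`pip install goose3`). The function names `tidy` and `extract_main_text` are my own, not part of the library, and I have not tested this against the NYT link above; sites that block bots or sit behind a paywall may return little or no text.

```python
import re


def tidy(text: str) -> str:
    """Strip trailing spaces and collapse runs of blank lines into one."""
    lines = [line.rstrip() for line in text.strip().splitlines()]
    return re.sub(r"\n{3,}", "\n\n", "\n".join(lines))


def extract_main_text(url: str) -> dict:
    """Fetch a URL with goose3 and return its title and main body text."""
    # Imported lazily so tidy() can be used without goose3 installed.
    from goose3 import Goose

    # Goose works as a context manager, which cleans up its temp files.
    with Goose() as g:
        article = g.extract(url=url)
        return {"title": article.title, "text": tidy(article.cleaned_text)}
```

Calling `extract_main_text("https://example.com/some-article")` (placeholder URL) should give you a dict with `title` and `text` keys. Wrapping that in a small web endpoint would get you the "URL in, text out" API you described, without per-request pricing.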