科技回声

I'm working on a consumer-facing project that involves analyzing text from web articles. Does anyone know of an API that can handle the text extraction part automatically?

Ideally the API can take in a URL and just return the main text content of a website, even for sites with slightly complex layouts. For example: https://www.nytimes.com/2024/03/28/technology/personaltech/smart-glasses-ray-ban-meta.html

We're most interested in an API that has a decent free tier + usage-based pricing (at least for overages).

So far, most of our searches have turned up website scrapers that return HTML that needs to be further parsed (ScrapingBot, ScrapingBee, Scrapingdog, etc.), or services that are prohibitively priced (Diffbot).

Next, we're looking into Apify, but maybe we've missed something? Any recommendations would be greatly appreciated!

2 comments

timoteostewart, about 1 year ago

Would you consider rolling your own? Python’s goose3 has worked well for me in article extraction. It seemed to be successful more often than trafilatura and newspaper3k.
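
For anyone weighing the roll-your-own route, here is a minimal sketch of what that could look like. It assumes goose3 is installed (`pip install goose3`) and uses its `Goose().extract(raw_html=...)` call and the `cleaned_text` attribute. The helper names `fallback_extract` and `extract_article_text` are made up for illustration, and the stdlib fallback is a deliberately naive paragraph collector, not what goose3 does internally.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Naive fallback: keep the text of <p> elements, skip page chrome."""

    # Containers whose content is almost never article text (an assumption).
    SKIP = {"script", "style", "nav", "header", "footer", "aside"}

    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # > 0 while inside a skipped container
        self.in_p = False     # True while inside a <p> we want to keep
        self.current = []     # text fragments of the current paragraph
        self.paragraphs = []  # finished paragraphs

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif tag == "p" and self.skip_depth == 0:
            self.in_p = True
            self.current = []

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag == "p" and self.in_p:
            # Collapse runs of whitespace left over from the markup.
            text = " ".join("".join(self.current).split())
            if text:
                self.paragraphs.append(text)
            self.in_p = False

    def handle_data(self, data):
        if self.in_p and self.skip_depth == 0:
            self.current.append(data)


def fallback_extract(raw_html: str) -> str:
    """Return paragraph text joined by blank lines (stdlib only)."""
    parser = ParagraphCollector()
    parser.feed(raw_html)
    parser.close()
    return "\n\n".join(parser.paragraphs)


def extract_article_text(raw_html: str) -> str:
    """Use goose3 when it is installed, otherwise the naive fallback."""
    try:
        from goose3 import Goose  # third-party dependency
    except ImportError:
        return fallback_extract(raw_html)
    with Goose() as g:
        article = g.extract(raw_html=raw_html)
    # goose3 may return an empty string on pages it cannot parse.
    return article.cleaned_text or fallback_extract(raw_html)
```

You would fetch the page yourself and pass the HTML in, or call goose3 directly with `g.extract(url=...)` and let it do the download. Wrapping either in a small HTTP endpoint gets you roughly the "URL in, text out" API asked for above, though sites with paywalls or heavy JavaScript will likely still need extra handling.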

cranberryturkey, about 1 year ago

Brisk.news

Ask HN: Simple API to extract web article text?
