Ask HN: Best web scraping toolset for 2024?

14 points by mortallywounded about 1 year ago
What should I use for JavaScript-based websites with strong bot/scraping detection?

I find most popular frameworks out there are easily detectable (Selenium, Puppeteer, etc.).

I have been using a homebrew'd solution built on a native web view that mimics a popular web browser (yet allows me to run arbitrary JS, etc.).

4 comments

robk about 1 year ago
Three real paths I'd evaluate:

0) Tools: Puppeteer and Playwright are the cleanest way I've found to get proper JS page rendering and behavior control. Node is well suited for this, but some prefer Python. Since you sound concerned with blocks, I'm guessing plain HTML scrapers like BeautifulSoup or Cheerio would be insufficient, but they're more robust in terms of sheer volume and overhead, of course.

1) If you have money, fingerprint-avoiding systems like GoLogin are exceptionally good at avoiding detection. But they're not cheap, so you would need a reasonable budget to use them well. I've had extremely high success with GoLogin myself, and if budget wasn't a concern I'd just default to that.

2) Less expensively, you can use headless Chrome (Browserless.io has an excellent Docker image for this) and then proxy through 4G/5G proxies. You usually pay by kb, so you'd want to be savvy about blocking images etc. to manage costs, but with a good proxy and decent tuning this also takes you pretty far, except against some of the more onerous services like Cloudflare and DataDome. There are also very good captcha-solving services that integrate seamlessly. I've had very good results with simple proxying this way and well-crafted settings in Browserless, like making sure stealth plug-ins are used and the user agent is set properly.

3) Most inexpensively, you can simply use Playwright on Browserless (headless Chrome) with a captcha service and stealth plug-ins, running from a decent-quality IP. I'm careful to check things like viewport size, user agent, ghost cursor etc.

Data center IPs usually raise flags, and even the server should be selected to not have VPN ports or HTTP ports open, as those are also antibot signals, among others.

--

I do #3 at a fairly decent volume (<1m page views per day) across a few dozen machines (each with a half dozen IP addresses and separate VLANs for each container) to scrape from some private sites, and so long as I stay under a sane rate limit I've had years of success on many sites that are fairly strict about blocking browsers.

In all cases I parse the key parts of the page and dump them into Mongo for async processing of the data and to allow fixes when sites change. You need to keep an eye on your ETL pipeline and alert when something breaks. I expect that about once a quarter I have to fix a selector change or something trivial as sites change.

This is also a good Substack evaluating the various paid options for the toughest sites: https://substack.thewebscraping.club
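A minimal sketch of part of option 3 above, assuming Playwright for Python with a locally launched headless Chromium rather than a Browserless container. It only illustrates three of the points mentioned: setting the viewport and user agent explicitly, blocking images and similar resources to save proxy bandwidth, and staying under a rate limit. Stealth plug-ins, captcha solving, and ghost-cursor behavior are not shown. The names `should_block`, `seconds_to_wait`, and `fetch_rendered`, the user agent string, the viewport size, and the 5-second interval are all illustrative assumptions, not values from the comment.

```python
"""Sketch: Playwright with resource blocking and a simple rate limit."""
import time

# Resource types skipped to save bandwidth when a proxy bills by volume.
# This set is an assumption; tune it per site.
BLOCKED_RESOURCE_TYPES = {"image", "media", "font"}

# Hypothetical values; copy real ones from the browser you want to resemble.
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
)
VIEWPORT = {"width": 1366, "height": 768}


def should_block(resource_type: str) -> bool:
    """Return True for requests that only cost bandwidth."""
    return resource_type in BLOCKED_RESOURCE_TYPES


def seconds_to_wait(last_request_at: float, now: float, min_interval: float) -> float:
    """How long to sleep so requests stay at least min_interval apart."""
    return max(0.0, min_interval - (now - last_request_at))


def fetch_rendered(urls, min_interval=5.0, proxy_server=None):
    """Render each URL in headless Chromium and return {url: html}."""
    # Imported here so the helpers above work without Playwright installed.
    from playwright.sync_api import sync_playwright

    def handle_route(route):
        # Drop bandwidth-heavy requests, let everything else through.
        if should_block(route.request.resource_type):
            route.abort()
        else:
            route.continue_()

    pages = {}
    with sync_playwright() as p:
        launch_args = {"headless": True}
        if proxy_server:
            # e.g. a 4G/5G proxy endpoint; the value is supplied by the caller.
            launch_args["proxy"] = {"server": proxy_server}
        browser = p.chromium.launch(**launch_args)
        context = browser.new_context(user_agent=USER_AGENT, viewport=VIEWPORT)
        page = context.new_page()
        page.route("**/*", handle_route)

        last_request_at = float("-inf")  # first request goes out immediately
        for url in urls:
            time.sleep(seconds_to_wait(last_request_at, time.monotonic(), min_interval))
            last_request_at = time.monotonic()
            page.goto(url, wait_until="domcontentloaded")
            pages[url] = page.content()
        browser.close()
    return pages
```

Calling `fetch_rendered(["https://example.com/"])` (a placeholder URL) would return the rendered HTML keyed by URL. Parsing and storage are left out on purpose, since the comment describes doing those asynchronously.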
dhruvkar about 1 year ago
Usually JS-based websites operate with some internal API under the hood.

Inspect those calls, and then hit them directly with something like Python or Go.

I prefer Python (requests, lxml, BeautifulSoup).

Mimic the headers that your browser is using.

Have you tried this already and run into issues?
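A minimal sketch of the approach described above, assuming the site exposes a JSON endpoint that you found in the browser's network tab. The endpoint URL, the header values, and the function names `browser_headers` and `fetch_json` are hypothetical; in practice you would copy the exact headers your own browser sends.

```python
"""Sketch: call a site's internal JSON API with browser-like headers."""

# Hypothetical User-Agent; replace with the one your browser actually sends.
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
)


def browser_headers(referer: str) -> dict:
    """Build headers resembling what a browser sends for an XHR/fetch call."""
    return {
        "User-Agent": USER_AGENT,
        "Accept": "application/json, text/plain, */*",
        "Accept-Language": "en-US,en;q=0.9",
        "Referer": referer,
    }


def fetch_json(api_url: str, referer: str, params=None, timeout: float = 10.0):
    """GET the internal API endpoint and return the decoded JSON body."""
    # Imported here so browser_headers() works without requests installed.
    import requests

    response = requests.get(
        api_url,
        headers=browser_headers(referer),
        params=params,
        timeout=timeout,
    )
    response.raise_for_status()  # surface 403/429 blocks instead of parsing them
    return response.json()
```

For example, `fetch_json("https://example.com/api/items", "https://example.com/items", params={"page": 1})` would request one page of results from a placeholder endpoint. Some sites also require cookies or tokens set by the page itself, which this sketch does not handle.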
nicbou about 1 year ago
I have used Playwright to write tests for my website. It was *so* easy and the code is so readable. I can wholeheartedly recommend it if you need to control a browser from your code. I'm now using it to write a crawler for a dozen websites, because anything else would be too tedious.
tuktuktuk about 1 year ago
I think Puppeteer is the way to go!