The last time this question was asked on HN was in 2017 (<a href="https://news.ycombinator.com/item?id=15694118" rel="nofollow">https://news.ycombinator.com/item?id=15694118</a>). A lot has changed in the last 5 years in the world of web scraping (legal landscape, anti-bot unblockers, data-type-specific APIs, etc.), so I thought it might be a good idea to refresh this question and see which tools are most popular with the HN community these days.
It's increasingly difficult these days to write scrapers that don't at some point need to execute JavaScript on a page - so you need to have a good browser automation tool on hand.<p>I'm really impressed by Playwright. It feels like it has learned all of the lessons from systems like Selenium that came before it - it's very well designed and easy to apply to problems.<p>I wrote my own CLI scraping tool on top of Playwright a few months ago, which has been a fun way to explore Playwright's capabilities: <a href="https://simonwillison.net/2022/Mar/14/scraping-web-pages-shot-scraper/" rel="nofollow">https://simonwillison.net/2022/Mar/14/scraping-web-pages-sho...</a>
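For reference, a minimal sketch of what a Playwright scrape looks like with the Python sync API (the URL and selector below are placeholders, not taken from the comment above):<p><pre><code># Minimal Playwright sketch (pip install playwright; then: playwright install chromium)
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com")               # placeholder URL
    page.wait_for_load_state("networkidle")        # let client-side JS settle
    titles = page.locator("h2").all_inner_texts()  # placeholder selector
    print(titles)
    browser.close()
</code></pre>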
Beautiful Soup gets the job done. I've made several apps with it.<p>[1] <a href="https://github.com/altilunium/wistalk" rel="nofollow">https://github.com/altilunium/wistalk</a> (Scrapes Wikipedia to analyze a user's activity)<p>[2] <a href="https://github.com/altilunium/psedex" rel="nofollow">https://github.com/altilunium/psedex</a> (Scrapes a government website to get a list of all registered online services in Indonesia)<p>[3] <a href="https://github.com/altilunium/makalahIF" rel="nofollow">https://github.com/altilunium/makalahIF</a> (Scrapes a university lecturer's web page to get a list of papers)<p>[4] <a href="https://github.com/altilunium/wi-page" rel="nofollow">https://github.com/altilunium/wi-page</a> (Scrapes Wikipedia to get the most active contributors to a certain article)<p>[5] <a href="https://github.com/altilunium/arachnid" rel="nofollow">https://github.com/altilunium/arachnid</a> (Web scraper, optimized for WordPress and Blogger)
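For anyone new to it, a minimal requests + BeautifulSoup sketch (the URL is a placeholder):<p><pre><code># Minimal BeautifulSoup sketch (pip install requests beautifulsoup4)
import requests
from bs4 import BeautifulSoup

html = requests.get("https://example.com", timeout=30).text  # placeholder URL
soup = BeautifulSoup(html, "html.parser")
for link in soup.find_all("a"):  # print every link on the page
    print(link.get("href"), link.get_text(strip=True))
</code></pre>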
In the world of SPA (single page applications), headless browser API is super helpful, playwright[1] and puppeteer[2] are very good choices.<p>[1] <a href="https://github.com/microsoft/playwright" rel="nofollow">https://github.com/microsoft/playwright</a><p>[2] <a href="https://github.com/puppeteer/puppeteer" rel="nofollow">https://github.com/puppeteer/puppeteer</a>
I built a tool called Browserflow (<a href="https://browserflow.app" rel="nofollow">https://browserflow.app</a>) that lets you automate any task in the browser, including scraping websites.<p>People love it for its ease of use because you can record actions via point-and-click rather than having to manually come up with CSS selectors. It intelligently handles lists, infinite scrolling, pagination, etc. and can run on both your desktop and in the cloud.<p>Grateful for how much love it received when it launched on HN 8 months ago: <a href="https://news.ycombinator.com/item?id=29254147" rel="nofollow">https://news.ycombinator.com/item?id=29254147</a><p>Try it out and let me know what you think!
Unpopular opinion, but Bash/shell scripting. Seriously, it's probably the fastest way to get things done. For fetching, use cURL. Want to extract particular markup? Use pup[1]. Want to process CSV? Use csvkit[2]. Or JSON? Use jq[3]. Want to use a DB? Use psql. Once you get the hang of shell scripting, you can create simple scrapers by wiring up these utilities in a matter of minutes.<p>The only thing I wish was present was better support for regexes. Bash and most Unix tools don't support PCRE, which can be severely limiting. Plus, sometimes you want to process text as a whole vs line-by-line.<p>I would also recommend Python's sh[4] module if shell scripting isn't your cup of tea. You get the best of both worlds: faster dev work with Bash utils, and a saner syntax.<p>[1]: <a href="https://github.com/ericchiang/pup" rel="nofollow">https://github.com/ericchiang/pup</a><p>[2]: <a href="https://csvkit.readthedocs.io/en/latest/" rel="nofollow">https://csvkit.readthedocs.io/en/latest/</a><p>[3]: <a href="https://stedolan.github.io/jq/" rel="nofollow">https://stedolan.github.io/jq/</a><p>[4]: <a href="https://pypi.org/project/sh/" rel="nofollow">https://pypi.org/project/sh/</a>
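A hedged sketch of the sh-module approach mentioned above, assuming curl, pup and jq are installed and on PATH (URLs and selectors are placeholders):<p><pre><code># Wiring up shell utilities from Python with the sh module (pip install sh).
# sh pipes by composition: the inner command's output feeds the outer one.
import sh

# curl the page, then pull every link's href attribute with pup
links = sh.pup(sh.curl("-s", "https://example.com"), "a attr{href}")
print(links)

# curl a JSON API, then extract a field with jq
title = sh.jq(sh.curl("-s", "https://api.example.com/item/1"), "-r", ".title")
print(str(title).strip())
</code></pre>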
curl-impersonate[1] is a curl fork that I maintain and which lets you fetch sites while impersonating a browser. Unfortunately, the practice of TLS and HTTP fingerprinting of web clients has become extremely common in the past ~1 year, which means a regular curl request will often return some JS challenge and not the real content. curl-impersonate helps with that.<p>[1] <a href="https://github.com/lwthiker/curl-impersonate" rel="nofollow">https://github.com/lwthiker/curl-impersonate</a>
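For anyone curious, curl-impersonate ships wrapper scripts that set the browser-like TLS/HTTP options for you; a hedged sketch of calling one from Python (the exact wrapper name varies by release, so curl_chrome110 below is an assumption, and the URL is a placeholder):<p><pre><code># Shell out to a curl-impersonate wrapper script from Python.
# The wrapper name (curl_chrome110) is an assumption -- it depends on the installed release.
import subprocess

result = subprocess.run(
    ["curl_chrome110", "-s", "https://example.com"],  # placeholder URL
    capture_output=True, text=True, check=True,
)
print(result.stdout[:500])  # first 500 chars of the impersonated response
</code></pre>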
We've built <a href="https://serpapi.com" rel="nofollow">https://serpapi.com</a><p>We invented the industry of what you're referring to as "data type specific APIs": APIs that abstract away all the proxy issues, captcha solving, support for various layouts, even scraping-related legal issues, and much more, down to a clean JSON response on every single call. It was a lot of work, but our success rate and response times now rival non-scraping commercial APIs: <a href="https://serpapi.com/status" rel="nofollow">https://serpapi.com/status</a><p>I think the next battle will still be legal, despite all the wins in favor of scraping public pages and the common-sense understanding that this is the way to go. The EFF has been doing amazing work in this space and we are proud to be a significant yearly contributor to the EFF.
As someone that has built and maintained a few scraper tools in my career: hand-written logic and patience because your scraper will break any time upstream changes their HTML. It's an infinite game of whack-a-mole outside your control.<p>Scrapers are very simple, effective and probably one of the least fun things to build.
My <a href="http://heliumhq.com" rel="nofollow">http://heliumhq.com</a> is open source and gives you a very simple Python API:<p><pre><code> from helium import *
start_chrome('github.com/login')
write('user', into='Username')
write('password', into='Password')
click('Sign in')
</code></pre>
To get started:<p><pre><code> pip install helium
</code></pre>
Also, you need to download the latest ChromeDriver and put it in your PATH.<p>Have fun :-)
Probably the best tool for scraping websites protected by services like Cloudflare.
<a href="https://github.com/ultrafunkamsterdam/undetected-chromedriver" rel="nofollow">https://github.com/ultrafunkamsterdam/undetected-chromedrive...</a>
I've been using Puppeteer as it's got a very established ecosystem. There are also Puppeteer plugins that make it very powerful against captchas/detection/etc.<p>The worst thing about Puppeteer is Chrome and its bad memory management, so I'm going to give Playwright a spin soon.
estela is an elastic web scraping cluster running on Kubernetes. It provides mechanisms to deploy, run and scale web scraping spiders via a REST API and a web interface.<p>It is a modern alternative to the few OSS projects available for such needs, like scrapyd and gerapy. estela aims to help web scraping teams and individuals that are considering moving away from proprietary scraping clouds, or who are in the process of designing their on-premise scraping architecture, so as not to needlessly reinvent the wheel, and to benefit from the get-go from features such as built-in scalability and elasticity, among others.<p>estela has recently been published as OSS under the MIT license:<p><a href="https://github.com/bitmakerla/estela" rel="nofollow">https://github.com/bitmakerla/estela</a><p>More details about it can be found in the release blog post and the official documentation:<p><a href="https://bitmaker.la/blog/2022/06/24/estela-oss-release.html" rel="nofollow">https://bitmaker.la/blog/2022/06/24/estela-oss-release.html</a><p><a href="https://estela.bitmaker.la/docs/" rel="nofollow">https://estela.bitmaker.la/docs/</a><p>estela supports Scrapy spiders for the time being, but additional frameworks/languages are on the roadmap.<p>All kinds of feedback and contributions are welcome!<p>Disclaimer: I'm part of the development team behind estela :-)
Quick plug for running scrapers in GitHub Actions and writing the results back to a repository - which gives you a free way to track changes to a scraped resource over time. I call this "Git scraping" - I've written a whole bunch of notes about this technique here: <a href="https://simonwillison.net/series/git-scraping/" rel="nofollow">https://simonwillison.net/series/git-scraping/</a>
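The scraper script itself can be trivial; a hedged sketch of the core fetch-and-commit step (the scheduling side is just a cron-triggered Actions workflow, omitted here; the URL and filename are placeholders):<p><pre><code># Git scraping sketch: fetch a resource, write it into the repo, commit only if it changed.
# Assumes it runs inside a checked-out git repository (e.g. within a GitHub Actions job).
import subprocess
import requests

data = requests.get("https://example.com/data.json", timeout=30).text  # placeholder URL
with open("data.json", "w") as f:
    f.write(data)

subprocess.run(["git", "add", "data.json"], check=True)
# `git diff --cached --quiet` exits non-zero only when the staged file actually changed.
if subprocess.run(["git", "diff", "--cached", "--quiet"]).returncode != 0:
    subprocess.run(["git", "commit", "-m", "Latest data"], check=True)
</code></pre>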
The polite package using R is intended to be a friendly way of scraping content from the owner. "The three pillars of a polite session are seeking permission, taking slowly and never asking twice."<p><a href="https://github.com/dmi3kno/polite" rel="nofollow">https://github.com/dmi3kno/polite</a>
Python is my work horse, if I need to scrape something from a site that is relaxed about scraping (most are). I have my own library of helper functions I've built up over the years. In simple cases I just regex out what I need, if I need a full DOM then I use JSDOM/node.<p>For sites that are "difficult" I remote control a real browser, GUI and all. I don't use Chrome headless because if there's e.g. a captcha I want to be able to fill it in manually.
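For the relaxed-site case, the "just regex out what I need" approach can literally be a few lines; a sketch with a placeholder URL and pattern:<p><pre><code># Quick-and-dirty extraction with requests + a regex (fine for relaxed, stable pages).
import re
import requests

html = requests.get("https://example.com/prices", timeout=30).text  # placeholder URL
prices = re.findall(r"\$\d+(?:\.\d{2})?", html)  # placeholder pattern: anything that looks like a price
print(prices)
</code></pre>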
For Ruby I recommend Medusa Crawler gem.<p>[1] <a href="https://github.com/brutuscat/medusa-crawler" rel="nofollow">https://github.com/brutuscat/medusa-crawler</a><p>Which I maintain as a fork of the unmaintained Anemone gem.
I’m not sure about “best” but I’ve been using Colly (written in Go) and it’s been pretty slick. Haven’t run into anything it can’t do.<p><a href="http://go-colly.org/" rel="nofollow">http://go-colly.org/</a>
I don’t think the landscape has changed much since then. However, from my experience you should do everything possible to avoid a headless browser for scraping. It’s in the region of 10-100x slower and significantly more resource intensive, even if you carefully block unwanted requests (images, css, video, ads).<p>Obviously sometimes you have to go that route.
I have used the Apify SDK (now <a href="https://crawlee.dev/" rel="nofollow">https://crawlee.dev/</a>) in the past and found it very useful.
If the content you need is static, I like using node + cheerio [0] as the selector syntax is quite powerful. If there is some javascript execution involved however, I will fall back to puppeteer.<p>[0] - <a href="https://cheerio.js.org/" rel="nofollow">https://cheerio.js.org/</a>
I wrote my own web scraper: <a href="https://videlibri.de/xidel.html" rel="nofollow">https://videlibri.de/xidel.html</a><p>The main purpose was to submit HTML forms. You just say which input fields something should be written into, and it does the rest (i.e. downloads the page, finds all other fields and their default values, builds an HTTP request from all of them, and sends it).<p>The last 5 years, I spent updating the XPath implementation to XPath/XQuery 3.1. The W3C has put a lot of new stuff into the new XPath versions, like JSON support and higher-order functions; for some reason they decided to turn XPath into a Turing-complete functional programming language.
I’ve got several years of experience of webscraping, mainly in python.
Scrapy is the first choice for “basic websites”, while Playwright is used when things get difficult.
I’m collecting my experience in using these tools in this “web scraping open knowledge project” on github (<a href="https://github.com/reanalytics-databoutique/webscraping-open-project" rel="nofollow">https://github.com/reanalytics-databoutique/webscraping-open...</a>) and on my substack (<a href="http://thewebscraping.club/" rel="nofollow">http://thewebscraping.club/</a>) for longer free content
I have had some luck running puppeteer in a nodejs app hosted at glitch.com. I spring for the (cheap) paid hosting and get several containers for dev/test/prod, web based ide. Obviously, this would only scale to a point. In my case I just need a single client automating interaction with a single site. If I really needed scale, I'd probably use one of the services listed elsewhere.<p>Of course, if you don't need a full javascript-enabled browser parse, consider alternatives first: simple HTTP requests, API, RSS, etc.
For simple scraping where the content is fairly static, or when performance is critical, I will use linkedom to process pages.<p><a href="https://github.com/WebReflection/linkedom" rel="nofollow">https://github.com/WebReflection/linkedom</a><p>When the content is complex or involves clicking, Playwright is probably the best tool for the job.<p><a href="https://github.com/microsoft/playwright" rel="nofollow">https://github.com/microsoft/playwright</a>
Has anybody created anything similar to Portia for scraping? I'd love to self host or pay a nominal fee to allow my team to create / adjust scrapers via a UI
Obviously scraping logic using Puppeteer, but there are many other tooling aspects that are critical to bypass bot prevention.<p>One is signature/fingerprint emulation. It helps to run the bot in a real browser and export the fingerprint (e.g. UA, canvas, geolocation, etc.) into a JS object. Add noise to the data too.<p>Simulate residential IPs by routing through a residential proxy. If you run bots from the cloud you will get blocked.
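A hedged sketch of the proxy-routing part in Python (the gateway host, port and credentials are placeholders; most residential proxy providers expose a similar HTTP gateway):<p><pre><code># Route scraping traffic through a (residential) proxy gateway.
# Host/port/credentials below are placeholders for your provider's gateway.
import requests

proxy = "http://USERNAME:PASSWORD@proxy.example.com:8000"
proxies = {"http": proxy, "https": proxy}

resp = requests.get("https://example.com", proxies=proxies, timeout=30)
print(resp.status_code)
</code></pre>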
I’ve built a lot of tools utilizing web scraping, most recently <a href="https://GitHub.com/Jawerty/myAlgorithm" rel="nofollow">https://GitHub.com/Jawerty/myAlgorithm</a> and <a href="https://metaheads.xyz" rel="nofollow">https://metaheads.xyz</a>. I think the more control you have over the tools the better; if you know your way around CSS selectors and Selenium, you can do anything in web scraping. Selenium can seem hefty, but there are plenty of ways to optimize for resource intensity; look up Selenium Grid. Overall, don’t be afraid of browser automation; you can always find a way to optimize. The real difficulty is freshness of HTML. You can fix this by being smart about timestamps and caching. If you’re scraping the same data repeatedly… don’t do that. Also, if there’s a frontend in your application dependent on scraped data, NEVER use your scraping routines as a direct feed; store data whenever you scrape.
I hate to be that guy, but “it depends”.<p>Scrapy is still king for me (scrapy.org). There are even packages to use headless browsers for those awful JavaScript-heavy sites.<p>However, APIs and RSS are still in play, and those don't require a heavy scraper. I am building vertical industry portals, and many of my data rollups consume APIs and structured XML/RSS feeds from social and other sites.
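For the RSS/XML side of that, a minimal sketch with feedparser (the feed URL is a placeholder):<p><pre><code># Consuming a structured feed instead of scraping HTML (pip install feedparser)
import feedparser

feed = feedparser.parse("https://example.com/feed.xml")  # placeholder feed URL
for entry in feed.entries:
    print(entry.title, entry.link)
</code></pre>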
Many years ago I wrote a scraper-module for a scripting language that exposed a fake DOM to an embedded JS-engine, spidermonkey. The DOM was just an empty object graph, readable both from the scripting language and inside the JS context. The documents were parsed by libxml2 and the resulting DOMs were not identical to mozilla's, for example. But fast and efficient.<p>The purpose was to enable "live interactive" scraping of forms/js/ajax sites, with a web frontend controlling maybe 10 scrapers for each user. When that project fell through, I stopped maintaining it and the spidermonkey api has long since moved on.<p>It works for simple sites that don't require the DOM to actually do anything (for example triggering images to load with some magic url). But many simple DOM behaviours can be implemented.
It depends on what you are trying to accomplish, but I think a combination of Puppeteer and JSDOM or Cheerio should take you far. Where it gets complex is when you need to do things such as rotating IPs, but in my experience, that's only needed if you're engaging in a heavy scraping workload.<p>Puppeteer + JSDOM is what I used to build <a href="https://www.getscrape.com" rel="nofollow">https://www.getscrape.com</a>, which is a high-level web scraping API. Basically, you tell the API if you want links, images, texts, headings, numbers, etc; and the API gets all that stuff for you without the need to pass selectors or parsing instructions.<p>In case anyone here wants something straightforward. It works well to build generic scraping operations.
I'm working on a personal project that involves A LOT of scraping, and through several iterations I've gotten some stuff that works quite well. Here's a quick summary of what I've explored (both paid and free):<p>* Apify (<a href="https://apify.com/" rel="nofollow">https://apify.com/</a>) is a great, comprehensive system if you need to get fairly low-level. Everything is hosted there, they've got their own proxy service (or you can roll your own), and their open source framework (<a href="https://github.com/apify/crawlee" rel="nofollow">https://github.com/apify/crawlee</a>) is excellent.<p>* I've also experimented with running both their SDK (crawlee) and Playwright directly on Google Cloud Run, and that also works well and is an order-of-magnitude less expensive than running directly on their platform.<p>* Bright Data nee Luminati is excellent for cheap data center proxies ($0.65/GB pay as you go), but prices get several orders of magnitude more expensive if you need anything more thorough than data center proxies.<p>* For some direct API crawls that I do, all of the scraping stuff is unnecessary and I just ping the APIs directly.<p>* If the site you're scraping is using any sort of anti-bot protection, I've found that ScrapingBee (<a href="https://www.scrapingbee.com/" rel="nofollow">https://www.scrapingbee.com/</a>) is by far the easiest solution. I spent many many hours fighting anti-bot protection doing it myself with some combination of Bright Data, Apify and Playwright, and in the end I kinda stopped battling and just decided to let ScrapingBee deal with it for me. I may be lucky in that the sites I'm scraping don't really use JS heavily, so the plain vanilla, no-JS ScrapingBee service works almost all of the time for those. Otherwise it can get quite expensive if you need JS rendering, premium proxies, etc. But a big thumbs up to them for making it really easy.<p>Always looking for new techniques and tools, so I'll monitor this thread closely.
We’ve built a freemium cloud RPA software focused on web scraping and monitoring, called Browse AI.<p><a href="https://www.browse.ai" rel="nofollow">https://www.browse.ai</a><p>It lets you train a bot in 2 minutes. The bot will then open the site with rotating geolocated ip addresses, solve captchas, click on buttons and scroll and fill out forms, to get you the data you need.<p>It’s integrated with Google Sheets, Airtable, Zapier, and more.<p>We have a Google Sheets addon too which lets you run robots and get their results all in a spreadsheet.<p>We have close to 10,000 users with 1,000+ signing up every week these days. That made us raise a bit of funding from Zapier and others to be able to scale quicker and build the next version.
For a particular type of scraping, we wrote SSScraper on top of Colly and it works really well:<p><a href="https://github.com/gotripod/ssscraper/" rel="nofollow">https://github.com/gotripod/ssscraper/</a>
<i>* Shameless plug *</i>: our super-easy feed builder at New Sloth (formerly Feedity) - <a href="https://newsloth.com" rel="nofollow">https://newsloth.com</a> combines a scraper and data transformer, which helps create custom RSS feeds for any public webpage. Our API can auto-magically detect relevant articles in most cases. The platform includes an integrated feed reader and clusterer/deduplicator, specially aimed for knowledge workers with hundreds and thousands of feeds to monitor daily.
Not a full-fledged scraper, but IDS[1] has great heuristics for figuring out the relevant content/information behind the HTML code, so fewer (or no) iterations are needed when the frontend code changes.<p>Would be cool to reverse engineer it and probably plug it into some JS-rendering solution (say Puppeteer, etc.)<p>[1] <a href="https://chrome.google.com/webstore/detail/instant-data-scraper/ofaokhiedipichpaobibbnahnkdoiiah" rel="nofollow">https://chrome.google.com/webstore/detail/instant-data-scrap...</a>
Normalize REST APIs so programmers don’t have to rely on web scrapers, Selenium, and other flaky methods of retrieving data.<p>Web scraping is fun, but in production it’s an absolute joke.
I am probably in the minority, but I try to outsource scraping whenever possible. It's too much grunt work: you have to constantly babysit the crawlers that break because websites keep changing.<p>Personally, I use Indexed (<a href="https://www.indexedinc.com" rel="nofollow">https://www.indexedinc.com</a>) because they are technical and reliable, although there are many other providers out there.
Is there any form of markup / library that would allow me to access a file tree similar to what shows up in the Chrome "inspect" "Sources" tab? I'm working on a system to extract m3u8 files from websites. Haven't found a good way to do this yet; it's been a few years since my last project that required scraping with a headless browser.
Have to appreciate the irony of someone's SEO spam submission (submitter works for a company selling scraping services) being SEO spammed in the comments...<p>>Thanks for the links. And I read too. I see a lot of useful stuff that I will use for my site <a href="https://los-angeles-plumbers.com/" rel="nofollow">https://los-angeles-plumbers.com/</a>
Scrapy's crawling and CSS/xpath selectors are fine. But I'm annoyed about the pipeline after that. Especially to get the data into a SQLite database. I wish cleaning up the data was a series of transformations on SQL tables instead of a bunch of work on Python models.
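One pattern that gets close to that is keeping the Scrapy pipeline dumb -- just insert raw items into SQLite -- and doing all the cleanup as SQL transformations afterwards. A hedged sketch (the table and field names are placeholders):<p><pre><code># Minimal Scrapy item pipeline that writes raw items straight into SQLite,
# leaving cleanup to SQL afterwards. Field names are placeholders.
import sqlite3

class SQLitePipeline:
    def open_spider(self, spider):
        self.conn = sqlite3.connect("items.db")
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS raw_items (url TEXT, title TEXT, price TEXT)"
        )

    def process_item(self, item, spider):
        self.conn.execute(
            "INSERT INTO raw_items (url, title, price) VALUES (?, ?, ?)",
            (item.get("url"), item.get("title"), item.get("price")),
        )
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.conn.close()
</code></pre><p>Enable it via ITEM_PIPELINES in settings.py, then run the transformations as plain SQL over raw_items.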
A good no-code solution is <a href="https://simplescraper.io" rel="nofollow">https://simplescraper.io</a>. Leans towards non-developers but there's an API too.
I’m biased since I’m an owner of a web scraping agency (<a href="https://webscrapingsolutions.co.uk/" rel="nofollow">https://webscrapingsolutions.co.uk/</a>). I was asking myself the same question in 2019.
You can use any programming language, but we have settled on this tech stack: Python, Scrapy (<a href="https://github.com/scrapy/scrapy" rel="nofollow">https://github.com/scrapy/scrapy</a>), Redis, and PostgreSQL, for the following reasons:<p>[1] Scrapy is a well-documented framework, so any Python programmer can start using it after 1 month of training. There are a lot of guides for beginners.<p>[2] Lots of features are already implemented and open source, so you won’t have to waste time & money on them.<p>[3] There is a strong community that can help with most of the questions (I don't think any other alternative has that).<p>[4] Scrapy developers are cheap. You will only need junior+ to middle-level software engineers to pull off most of the projects. It’s not rocket science.<p>[5] Recruiting is easier: there are hundreds of freelancers with relevant expertise if you search on LinkedIn; there are hundreds of software developers that have worked with Scrapy in the past, and you don’t need that many; you can grow expertise in your own team quickly; developers are easily replaceable, even on larger projects; and you can use the same developers on backend tasks.<p>[6] You don’t need DevOps expertise in your web scraping team because Scrapy Cloud (<a href="https://www.zyte.com/scrapy-cloud/" rel="nofollow">https://www.zyte.com/scrapy-cloud/</a>) is good and cheap enough for 99% of projects.<p>[7] If you decide to have your own infrastructure, you can use <a href="https://github.com/scrapy/scrapyd" rel="nofollow">https://github.com/scrapy/scrapyd</a>.<p>[8] The entire ecosystem is well-maintained and steadily growing. You can integrate a lot of third-party services into your project within hours: proxies, captcha solving, headless browsers, HTML parsing APIs.<p>[9] It’s easy to integrate your own AI/ML models into the scraping workflow.<p>[10] With some work, you can use Scrapy for distributed projects that are scraping thousands (millions) of domains. We are using <a href="https://github.com/rmax/scrapy-redis" rel="nofollow">https://github.com/rmax/scrapy-redis</a>.<p>[11] Commercial support is available. There are several companies that can develop an entire project for you or take over an existing one, if you don’t have the time or don’t want to do it on your own.<p>We have built dozens of projects in multiple industries:<p>- news monitoring<p>- job aggregators<p>- real estate aggregators<p>- ecommerce (anything from 1 website to monitoring prices on 100k+ domains)<p>- lead generation<p>- search engines in a specific niche (SEO, pdf files, ecommerce, chemical retail)<p>- macroeconomic research & indicators<p>- social media, NFT marketplaces, etc.<p>So, most projects can be finished with these tools.
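For anyone who hasn't used Scrapy, a minimal spider sketch (the start URL and CSS selectors are placeholders; run it with `scrapy runspider spider.py`):<p><pre><code># Minimal Scrapy spider sketch (pip install scrapy)
import scrapy

class ExampleSpider(scrapy.Spider):
    name = "example"
    start_urls = ["https://example.com/products"]  # placeholder URL

    def parse(self, response):
        for product in response.css("div.product"):  # placeholder selectors
            yield {
                "title": product.css("h2::text").get(),
                "price": product.css(".price::text").get(),
            }
        # Follow pagination links, if any
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
</code></pre>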
I've thought <a href="https://www.scrapingbee.com/" rel="nofollow">https://www.scrapingbee.com/</a> looked great, especially their auto rotation of IP addresses.
Try <a href="https://webscraping.ai/" rel="nofollow">https://webscraping.ai/</a> if you need rotating proxies and JS rendering
You can also use the common crawl dataset.<p><a href="https://commoncrawl.org/" rel="nofollow">https://commoncrawl.org/</a>
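A hedged sketch of looking a page up in the Common Crawl CDX index from Python (the crawl ID changes with every crawl, so CC-MAIN-2023-50 below is an assumption; pick a current one from commoncrawl.org):<p><pre><code># Query the Common Crawl CDX index for captures of a URL.
# The crawl ID (CC-MAIN-2023-50) is an assumption -- use a current crawl from commoncrawl.org.
import json
import requests

index = "https://index.commoncrawl.org/CC-MAIN-2023-50-index"
resp = requests.get(index, params={"url": "example.com/*", "output": "json"}, timeout=60)
for line in resp.text.splitlines():
    record = json.loads(line)
    print(record["timestamp"], record["url"], record["status"])
</code></pre>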
It depends. For a no-code solution, check out [powerpage-web-crawler](<a href="https://github.com/casualwriter/powerpage-web-crawler" rel="nofollow">https://github.com/casualwriter/powerpage-web-crawler</a>) for crawling blogs/posts.
If you are technical enough to find the query selector of the elements, here is a great tool.<p>The great thing is, it has support for Zapier, webhooks, and API access too!<p><a href="https://browserbird.com" rel="nofollow">https://browserbird.com</a>
If you are looking for pre-made tools and don't want to write any code, check out <a href="https://webautomation.io" rel="nofollow">https://webautomation.io</a>