Note that 99% of the time, if a web page is worth scraping, it probably has an accompanying mobile app. It's worth downloading the app and running mitmproxy/Burp/Charles on the traffic to see if it uses a private API. In my experience, it's much easier to scrape the private mobile API than the public website. This way you get nicely formatted JSON and often bypass rate limits.
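Once you've spotted the endpoint in the proxy, calling it is usually trivial. A rough sketch with requests (the endpoint URL and headers below are made up; copy whatever the app actually sends):

    import requests

    # Hypothetical private mobile API endpoint discovered via mitmproxy;
    # replace with whatever the app really calls.
    API_URL = "https://api.target-site.example/v2/listings"

    headers = {
        # Mimic the app's own headers as captured in the proxy.
        "User-Agent": "TargetSite/4.2.0 (iPhone; iOS 15.0)",
        "Accept": "application/json",
    }

    resp = requests.get(API_URL, params={"page": 1}, headers=headers, timeout=10)
    resp.raise_for_status()

    # Private APIs typically return clean JSON, so no HTML parsing is needed.
    data = resp.json()
    print(data)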
Better solution: pay target-site.com to start building an API for you.

Pros:

* You'll be working with them rather than against them.

* Your solution will be far more robust.

* It'll be way cheaper, supposing you account for the ongoing maintenance costs of your fragile scraper.

* You're eliminating the possibility that you'll have to deal with legal antagonism.

* Good anti-scraper defenses are far more sophisticated than what the author dealt with. As a trivial example: he didn't even verify that he was following links that would be visible!

Cons:

* Possible that target-site.com's owners will tell you to get lost, or they are simply unreachable.

* Your competitors will gain access to the API methods you funded, perhaps giving them insight into why that data is valuable.

Alternative better solution for small one-off data collection needs: contract a low-income person to just manually download the data you need with a normal web browser. Provide a JS bookmarklet to speed up their process if the data set is a bit too big for that.
Scrapy is indeed excellent. One feature that I really like is Scrapy Shell [1].

It allows you to run and debug the scraping code without running the spider, right from the CLI.

I use it extensively to test that my selectors (both CSS and XPath) return the proper data on a test URL.

[1] https://doc.scrapy.org/en/latest/topics/shell.html
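A typical session looks roughly like this (the URL and selectors are just placeholders):

    $ scrapy shell "https://example.com/some-page"
    ...
    >>> response.status
    200
    >>> response.css("h1::text").get()
    'Some page title'
    >>> response.xpath("//a/@href").getall()[:3]
    ['/about', '/contact', '/blog']

You can poke at the response interactively until the selector is right, then paste it straight into the spider.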
Here's an idea (although probably an unpopular one around here): if a site is responding to your scraping attempts with 403s -- a.k.a. "Forbidden" -- stop what you're doing and go away.
My web scraping tool of choice is still WWW::Mechanize for Perl.

P.S. I wrote a WWW::Mechanize::Query extension for it that adds support for CSS selectors etc., if anyone is interested. It's on CPAN.
I have done a lot of scraping in Python with requests and lxml and never really understood what scrapy offers beyond that. What are the main features that can't be easily implemented manually?
I'm curious what others use to scrape modern (JavaScript-based) web applications.

The old web (HTML and links) works fine with tools like Scrapy, but for modern applications which rely on JavaScript this no longer works.

For my last project I used a Chrome plugin which controlled the browser's URL location and clicks. Results were transmitted to a backend server. New jobs (clicks, URL changes) were retrieved from the server.

This worked fine but required some effort to implement. Is there an open source solution which is as helpful as Scrapy but solves the problems posed by modern JavaScript websites/applications?

With tools like headless Chrome this should now be possible, right?
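Headless Chrome driven through Selenium is one way to do it; a rough sketch (the URL and selector are placeholders, and you need chromedriver installed):

    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By

    options = Options()
    options.add_argument("--headless")      # no visible browser window
    driver = webdriver.Chrome(options=options)

    try:
        # Placeholder URL: the content only exists after the JS app renders it.
        driver.get("https://example.com/app")
        driver.implicitly_wait(10)          # crude wait; WebDriverWait is nicer
        for el in driver.find_elements(By.CSS_SELECTOR, ".listing-title"):
            print(el.text)
    finally:
        driver.quit()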
I use Greasemonkey on Firefox. Recently I wrote a crawler for a major accommodation listing website in Copenhagen. Guess what? I got a place to live right in the center within 2 weeks. I love SCRAPERS, I love CRAWLERS.
I use Java with a simple task queue and multiple worker threads (Scrapy is single-threaded, although it uses async I/O).
Failed tasks are collected into a second queue and restarted when needed.
Used Jsoup [1] for parsing, and proxychains and HAProxy + Tor [2] for distributing requests across multiple IPs.

[1] https://jsoup.org/

[2] https://github.com/mattes/rotating-proxy
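Roughly this pattern, sketched in Python rather than Java here (the URLs and the parse step are placeholders):

    import queue
    import threading
    import requests

    tasks = queue.Queue()    # URLs waiting to be fetched
    failed = queue.Queue()   # failed tasks, collected for a later retry pass

    def handle(html):
        pass                 # placeholder for parsing/storing the page

    def worker():
        while True:
            url = tasks.get()
            try:
                resp = requests.get(url, timeout=10)
                resp.raise_for_status()
                handle(resp.text)
            except Exception:
                failed.put(url)          # collect failures into the second queue
            finally:
                tasks.task_done()

    for url in ["https://example.com/1", "https://example.com/2"]:
        tasks.put(url)

    for _ in range(8):                   # multiple worker threads
        threading.Thread(target=worker, daemon=True).start()

    tasks.join()

    # Restart failed tasks when needed by moving them back onto the main queue.
    while not failed.empty():
        tasks.put(failed.get())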
We all do this, but how legal is it? If people end up in prison for pen testing without permission, how safe is it to intentionally alter the user-agent and circumvent captchas, JavaScript and other protections? Can that be considered hacking a site and stealing the data?
Good article! I've been doing scraping for the last 10 years and I've seen a lot of different things sites try in order to stop us.

Also, I'm on the other side as well, protecting websites by banning scrapers. So funny!
What if the target site blocks by IP address, and even with 20 different IP addresses you wouldn't be able to fetch all the data you need within a month?
Have you seen the "sentry" anti-robot system? I can't remember the name exactly, but it's a hosted solution that randomly displays captchas when it senses suspicious (robot) crawling. It's a nightmare, because after you solve one captcha it can display 4 more, one after the other. They also ban your IP, so you need IP rotators. Any workarounds?
What if they use that before:after thing where the content takes, say, a couple of seconds to appear, so when you try to scrape the site it looks like nothing is there? I have only used the Simple HTML DOM scraper with PHP at this point.
The first part seems like a very long-winded way to say "don't use the default user agent".

The captcha was unusually simple to solve; in most cases the best strategy is to avoid triggering it in the first place.
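In Scrapy terms that mostly means a couple of lines in settings.py, something like this (the UA string is just an example; any realistic browser UA works):

    # settings.py: avoid Scrapy's default "Scrapy/x.y (+https://scrapy.org)" UA
    USER_AGENT = (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
    )

    # Slowing down and spreading out requests also helps avoid tripping defenses.
    DOWNLOAD_DELAY = 1.0
    CONCURRENT_REQUESTS_PER_DOMAIN = 2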