Note that 99% of the time, if a web page is worth scraping, it probably has an accompanying mobile app. It's worth downloading the app and running mitmproxy/Burp/Charles on the traffic to see if it uses a private API. In my experience, it's much easier to scrape the private mobile API than the public website. This way you get nicely formatted JSON and often bypass rate limits.
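Once you've spotted the endpoint in the proxy, calling it is usually trivial. A rough sketch with requests (the endpoint URL and headers below are made up; copy whatever the app actually sends):

    import requests

    # Hypothetical private mobile API endpoint discovered via mitmproxy;
    # replace with whatever the app really calls.
    API_URL = "https://api.target-site.example/v2/listings"

    headers = {
        # Mimic the app's own headers as captured in the proxy.
        "User-Agent": "TargetSite/4.2.0 (iPhone; iOS 15.0)",
        "Accept": "application/json",
    }

    resp = requests.get(API_URL, params={"page": 1}, headers=headers, timeout=10)
    resp.raise_for_status()

    # Private APIs typically return clean JSON, so no HTML parsing is needed.
    data = resp.json()
    print(data)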
Better solution: pay target-site.com to start building an API for you.

Pros:

* You'll be working with them rather than against them.

* Your solution will be far more robust.

* It'll be way cheaper, supposing you account for the ongoing maintenance costs of your fragile scraper.

* You're eliminating the possibility that you'll have to deal with legal antagonism.

* Good anti-scraper defenses are far more sophisticated than what the author dealt with. As a trivial example: he didn't even verify that he was following links that would be visible!

Cons:

* Possible that target-site.com's owners will tell you to get lost, or they are simply unreachable.

* Your competitors will gain access to the API methods you funded, perhaps giving them insight into why that data is valuable.

Alternative better solution for small one-off data collection needs: contract a low-income person to just manually download the data you need with a normal web browser. Provide a JS bookmarklet to speed up their process if the data set is a bit too big for that.
Scrapy is indeed excellent. One feature that I really like is Scrapy Shell [1].

It allows you to run and debug the scraping code without running the spider, right from the CLI.

I use it extensively to test that my selectors (both CSS and XPath) return the proper data on a test URL.

[1] https://doc.scrapy.org/en/latest/topics/shell.html
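A typical session looks roughly like this (the URL and selectors are just placeholders):

    $ scrapy shell "https://example.com/some-page"
    ...
    >>> response.status
    200
    >>> response.css("h1::text").get()
    'Some page title'
    >>> response.xpath("//a/@href").getall()[:3]
    ['/about', '/contact', '/blog']

You can poke at the response interactively until the selector is right, then paste it straight into the spider.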
Here's an idea (although probably an unpopular one around here): if a site is responding to your scraping attempts with 403s -- a.k.a. "Forbidden" -- stop what you're doing and go away.
My web scraping tool of choice is still WWW::Mechanize for Perl.

P.S. I wrote a WWW::Mechanize::Query extension for it that adds support for CSS selectors etc., if anyone is interested. It's on CPAN.
I have done a lot of scraping in Python with requests and lxml and never really understood what scrapy offers beyond that. What are the main features that can't be easily implemented manually?
I'm curious what others use to scrape modern (JavaScript-based) web applications.

The old web (HTML and links) works fine with tools like Scrapy, but for modern applications which rely on JavaScript this no longer works.

For my last project I used a Chrome plugin which controlled the browser's URL location and clicks. Results were transmitted to a backend server. New jobs (clicks, URL changes) were retrieved from the server.

This worked fine but required some effort to implement. Is there an open source solution which is as helpful as Scrapy but solves the problems posed by modern JavaScript websites/applications?

With tools like headless Chrome this should now be possible, right?
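Headless Chrome driven through Selenium is one way to do it; a rough sketch (the URL and selector are placeholders, and you need chromedriver installed):

    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By

    options = Options()
    options.add_argument("--headless")      # no visible browser window
    driver = webdriver.Chrome(options=options)

    try:
        # Placeholder URL: the content only exists after the JS app renders it.
        driver.get("https://example.com/app")
        driver.implicitly_wait(10)          # crude wait; WebDriverWait is nicer
        for el in driver.find_elements(By.CSS_SELECTOR, ".listing-title"):
            print(el.text)
    finally:
        driver.quit()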
I use Greasemonkey on Firefox. Recently I wrote a crawler for a major accommodation listing website in Copenhagen. Guess what? I got a place to live right in the center within 2 weeks. I love SCRAPERS, I love CRAWLERS.
I use Java with a simple task queue and multiple worker threads (Scrapy is single-threaded, although it uses async I/O).
Failed tasks are collected into a second queue and restarted when needed.
Used Jsoup [1] for parsing, and proxychains and HAProxy + Tor [2] for distributing requests across multiple IPs.

[1] https://jsoup.org/

[2] https://github.com/mattes/rotating-proxy
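Roughly this pattern, sketched in Python rather than Java here (the URLs and the parse step are placeholders):

    import queue
    import threading
    import requests

    tasks = queue.Queue()    # URLs waiting to be fetched
    failed = queue.Queue()   # failed tasks, collected for a later retry pass

    def handle(html):
        pass                 # placeholder for parsing/storing the page

    def worker():
        while True:
            url = tasks.get()
            try:
                resp = requests.get(url, timeout=10)
                resp.raise_for_status()
                handle(resp.text)
            except Exception:
                failed.put(url)          # collect failures into the second queue
            finally:
                tasks.task_done()

    for url in ["https://example.com/1", "https://example.com/2"]:
        tasks.put(url)

    for _ in range(8):                   # multiple worker threads
        threading.Thread(target=worker, daemon=True).start()

    tasks.join()

    # Restart failed tasks when needed by moving them back onto the main queue.
    while not failed.empty():
        tasks.put(failed.get())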
We all do this, but how legal is it? If people end up in prison for pen testing without permission, how safe is it to intentionally alter the user-agent and circumvent captchas, JavaScript and other protections? Can that be considered hacking a site and stealing the data?
Good article! I've been doing scraping for the last 10 years and I've seen a lot of different things sites try in order to stop us.

Also, I'm on the other side as well, protecting websites by banning scrapers. So funny!
What if the target site blocks by IP address, and even with 20 different IP addresses you wouldn't be able to fetch all the data you need within a month?
Have you seen the "sentry" anti-robot system? I can't remember the name exactly, but it's a hosted solution that randomly displays captchas when it senses suspicious (robot) crawling. It's a nightmare, because after you solve one captcha it can display 4 more, one after the other. They also ban your IP, so you need IP rotators. Any workarounds?
What if they use that before:after thing where the content takes, say, a couple of seconds to appear, so when you try to scrape the site it looks like nothing is there? I have only used the Simple HTML DOM scraper with PHP at this point.
The first part seems like a very long-winded way to say "don't use the default user agent".

The captcha was unusually simple to solve; in most cases the best strategy is to avoid triggering it in the first place.
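In Scrapy terms that mostly means a couple of lines in settings.py, something like this (the UA string is just an example; any realistic browser UA works):

    # settings.py: avoid Scrapy's default "Scrapy/x.y (+https://scrapy.org)" UA
    USER_AGENT = (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
    )

    # Slowing down and spreading out requests also helps avoid tripping defenses.
    DOWNLOAD_DELAY = 1.0
    CONCURRENT_REQUESTS_PER_DOMAIN = 2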