Web Scraping: Bypassing “403 Forbidden,” captchas, and more

564 points by foob, about 8 years ago

24 comments

chatmasta, about 8 years ago
Note that 99% of the time, if a web page is worth scraping, it probably has an accompanying mobile app. It's worth downloading the app and running mitmproxy/burp/charles on the traffic to see if it uses a private API. In my experience, it's much easier to scrape the private mobile API than a public website. This way you get nicely formatted JSON and often bypass rate limits.
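For anyone trying this, a minimal mitmproxy addon can surface an app's JSON endpoints while the phone browses through the proxy; the filename and JSON filter below are my own illustration, not from the comment:

```python
# log_json.py, run with: mitmdump -s log_json.py
# Point the phone's proxy at this machine and use the app; JSON
# responses (likely the private API) are printed as they go by.
from mitmproxy import http

def response(flow: http.HTTPFlow) -> None:
    content_type = flow.response.headers.get("content-type", "")
    if "application/json" in content_type:
        print(flow.request.method, flow.request.pretty_url)
```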
thefifthsetpin, about 8 years ago
Better solution: pay target-site.com to start building an API for you.

Pros:

* You'll be working with them rather than against them.
* Your solution will be far more robust.
* It'll be way cheaper, supposing you account for the ongoing maintenance costs of your fragile scraper.
* You're eliminating the possibility that you'll have to deal with legal antagonism.
* Good anti-scraper defenses are far more sophisticated than what the author dealt with. As a trivial example: he didn't even verify that he was following links that would be visible!

Cons:

* Possible that target-site.com's owners will tell you to get lost, or they are simply unreachable.
* Your competitors will gain access to the API methods you funded, perhaps giving them insight into why that data is valuable.

Alternative better solution for small one-off data collection needs: contract a low-income person to just manually download the data you need with a normal web browser. Provide a JS bookmarklet to speed up their process if the data set is a bit too big for that.
nip, about 8 years ago
Scrapy is indeed excellent. One feature that I really like is Scrapy Shell [1]. It lets you run and debug scraping code right from the CLI, without running the spider. I use it extensively to test that my selectors (both CSS and XPath) return the proper data on a test URL.

[1] https://doc.scrapy.org/en/latest/topics/shell.html
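A typical Scrapy Shell session looks something like this with a recent Scrapy version (URL and selectors are placeholders):

```
$ scrapy shell 'https://example.com/some-page'
...
>>> response.css('h1::text').get()           # first <h1> text, or None
>>> response.xpath('//a/@href').getall()     # every link href on the page
>>> fetch('https://example.com/other-page')  # load a different URL in place
```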
jlgaddis, about 8 years ago
Here's an idea (although probably an unpopular one around here): if a site is responding to your scraping attempts with 403s -- a.k.a. "Forbidden" -- stop what you're doing and go away.
superasn, about 8 years ago
The web scraping tool of my choice still has to be WWW::Mechanize for Perl.

P.S. I wrote a WWW::Mechanize::Query extension for it so that it supports CSS selectors etc., if anyone is interested. It's on CPAN.
Lxr, about 8 years ago
I have done a lot of scraping in Python with requests and lxml and never really understood what Scrapy offers beyond that. What are the main features that can't be easily implemented manually?
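One partial answer, sketched here rather than taken from the thread: much of what Scrapy bundles is the machinery around the requests (scheduling, duplicate filtering, retries, throttling, concurrency), which you configure instead of hand-rolling. A minimal sketch against Scrapy's own demo site:

```python
# A minimal spider; the retry/throttle/concurrency behaviour below is
# built in and merely configured, not implemented by hand.
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]
    custom_settings = {
        "AUTOTHROTTLE_ENABLED": True,  # adaptive politeness delay
        "RETRY_TIMES": 3,              # automatic retry of failed requests
        "CONCURRENT_REQUESTS": 8,      # parallel, async fetching
    }

    def parse(self, response):
        for quote in response.css("div.quote"):
            yield {"text": quote.css("span.text::text").get()}
        # follow pagination; duplicate-URL filtering is also built in
        yield from response.follow_all(css="li.next a", callback=self.parse)
```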
foxylion, about 8 years ago
I'm curious what others use to scrape modern (JavaScript-based) web applications.

The old web (HTML and links) works fine with tools like Scrapy, but for modern applications which rely on JavaScript this no longer works.

For my last project I used a Chrome plugin which controlled the browser's URL locations and clicks. Results were transmitted to a backend server. New jobs (clicks, URL changes) were retrieved from the server.

This worked fine but required some effort to implement. Is there an open-source solution which is as helpful as Scrapy but solves the issues posed by modern JavaScript websites/applications?

With tools like headless Chrome this should now be possible, right?
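One common route for this (my sketch, not the commenter's setup) is driving headless Chrome with Selenium; the URL and selector are placeholders:

```python
# pip install selenium; assumes a chromedriver on PATH
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

options = Options()
options.add_argument("--headless")  # run Chrome without a visible window

driver = webdriver.Chrome(options=options)
try:
    driver.get("https://example.com/js-heavy-page")
    # page_source here is the rendered DOM, after JavaScript has run
    html = driver.page_source
    links = [a.get_attribute("href")
             for a in driver.find_elements(By.CSS_SELECTOR, "a")]
finally:
    driver.quit()
```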
m00dy, about 8 years ago
I use Greasemonkey on Firefox. Recently, I wrote a crawler for a major accommodation-listing website in Copenhagen. Guess what? I got a place to live right in the center within 2 weeks. I love SCRAPERS, I love CRAWLERS.
janci, about 8 years ago
I use Java with a simple task queue and multiple worker threads (Scrapy is only single-threaded, although it uses async I/O). Failed tasks are collected into a second queue and restarted when needed. I used Jsoup [1] for parsing, and proxychains and HAProxy + Tor [2] for distributing across multiple IPs.

[1] https://jsoup.org/
[2] https://github.com/mattes/rotating-proxy
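The commenter's code is Java; a minimal Python sketch of the same pattern (a work queue, several worker threads, and a second queue collecting failed tasks for retry) might look like this, with placeholder URLs:

```python
import queue
import threading

import requests

tasks = queue.Queue()
failed = queue.Queue()  # failed tasks park here until a retry pass re-queues them

def worker():
    while True:
        url = tasks.get()
        try:
            resp = requests.get(url, timeout=10)
            resp.raise_for_status()
            # ... parse resp.text here ...
        except Exception:
            failed.put(url)
        finally:
            tasks.task_done()

for _ in range(8):  # worker count is arbitrary
    threading.Thread(target=worker, daemon=True).start()

for url in ["https://example.com/page1", "https://example.com/page2"]:
    tasks.put(url)
tasks.join()
```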
jacquesm, about 8 years ago
Note that in some places this constitutes breaking the law.
ivanhoe, about 8 years ago
We all do this, but how legal is it? If people end up in prison for pen testing without permission, how safe is it to intentionally alter the user-agent and circumvent captchas, JavaScript, and other protections? Can that be considered hacking a site and stealing the data?
piker, about 8 years ago
Proposition: 99% of scraping use cases are eliminated if the scraper agrees to subsequently abide by the target's terms of service.
herbst, about 8 years ago
I've used Antigate for captchas and either Tor or proxies for 403s before. Usually the browser header alone does not help for long.
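For the Tor half, routing requests through a local Tor SOCKS proxy can look like this; it assumes a Tor daemon on its default port 9050 and the requests[socks] extra installed:

```python
# pip install "requests[socks]"
import requests

proxies = {
    "http": "socks5h://127.0.0.1:9050",   # socks5h resolves DNS through Tor too
    "https": "socks5h://127.0.0.1:9050",
}
resp = requests.get("https://example.com/", proxies=proxies, timeout=30)
print(resp.status_code)
```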
jordif, about 8 years ago
Good article! I've been doing scraping for the last 10 years and I've seen lots of different things tried to keep us out. I'm also on the other side, protecting websites by banning scrapers. So funny!
fiatjaf, about 8 years ago
What if the target page is blocking by IP address, and even with 20 different IP addresses you wouldn't be able to fetch all the data you need in a month?
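The blunt version of spreading load across addresses is rotating each request through a proxy pool; a sketch with placeholder proxy addresses:

```python
import itertools

import requests

proxy_pool = itertools.cycle([
    "http://10.0.0.1:8080",
    "http://10.0.0.2:8080",
    "http://10.0.0.3:8080",
])

def fetch(url):
    proxy = next(proxy_pool)  # each call uses the next proxy in the cycle
    return requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=15)
```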
ic3cold, about 8 years ago
Have you seen the sentry anti-robot system? I can't remember the name exactly, but it's a hosted solution that randomly displays captchas when it senses suspicious (robot) crawling. It's a nightmare, because after you solve one captcha it can display four more, one after the other. They also ban your IP, so you need IP rotators. Any workarounds?
ge96, about 8 years ago
What if they use that before:after thing where the content takes, say, a couple of seconds to appear, so when you try to scrape the site it looks like nothing is there? I have only used the HTMLSimpleDom scraper with PHP at this point.
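For content that only appears after a delay, a browser-driven scraper with an explicit wait is the usual workaround; a sketch using Selenium, with a placeholder URL and selector:

```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Chrome()
try:
    driver.get("https://example.com/slow-page")
    # block for up to 10 seconds until the late element exists in the DOM
    element = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, ".late-content"))
    )
    print(element.text)
finally:
    driver.quit()
```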
mirimir, about 8 years ago
Sometimes it's also necessary to spread requests over numerous IP addresses.
dmn001, about 8 years ago
The first part seems like a very long-winded way to say "don't use the default user agent".

The captcha was unusually simple to solve; in most cases the best strategy is to avoid seeing it in the first place.
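In requests, replacing the default user agent is one line; the UA string below is just an example value:

```python
import requests

headers = {
    # any realistic browser string; requests would otherwise send
    # something like "python-requests/2.x", which is trivial to block
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/58.0.3029.110 Safari/537.36",
}
resp = requests.get("https://example.com/", headers=headers, timeout=15)
```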
eapen, about 8 years ago
Enjoyed learning this and playing with it. What would you recommend storing this sort of data in? Not too keen on going with the traditional MySQL.
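One low-friction option (a suggestion, not from the thread) is SQLite from Python's standard library, storing each item as a JSON blob so the schema stays loose:

```python
import json
import sqlite3

conn = sqlite3.connect("scraped.db")
conn.execute("""CREATE TABLE IF NOT EXISTS items (
    url TEXT PRIMARY KEY,
    payload TEXT  -- raw JSON keeps the schema flexible
)""")
item = {"title": "Example", "price": 9.99}
conn.execute("INSERT OR REPLACE INTO items VALUES (?, ?)",
             ("https://example.com/item/1", json.dumps(item)))
conn.commit()
conn.close()
```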
bla2, about 8 years ago
Nice overview! The "unfortunately-spelled threat_defence.php" just uses British spelling though.
ouid, about 8 years ago
Too bad it's named for ovine prions.
Exuma, about 8 years ago
Great article!
known, about 8 years ago
Try lynx