I used to lead Sys Eng for a FTSE 100 company. Our data was valuable, but only for a short amount of time. We were constantly scraped, which cost us in hosting etc. We even saw competitors use our figures (good ones used them to offset their prices, bad ones just used them straight).
As the article suggests, we couldn't block mobile operator IPs; some had over 100k customers behind them. Forcing users to log in did little, as the scrapers just created accounts.
We had a few approaches that minimised the scraping:<p>Rate limiting by login<p>Limiting data to known workflows
...<p>But our most fruitful effort was when we removed limits and started giving "bad" data. By bad I mean altering the price up or down by a small percentage. This hit them in the pocket, but again, it wasn't a silver bullet. If a customer made a transaction on an altered figure, we informed them and took it at the correct price.<p>It's a cool problem to tackle but it is just an arms race.
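A minimal sketch of that price-fuzzing idea, assuming you can already flag suspected scraper sessions; the jitter percentage and session fields are made up, and checkout would still use the true price as described above:

    import random

    JITTER_PCT = 0.03  # hypothetical: nudge prices by up to +/-3% for flagged sessions

    def displayed_price(session, true_price):
        """Real price for normal users, slightly altered price for suspected scrapers."""
        if not session.get("suspected_scraper"):
            return true_price
        # Seed per session so the same scraper sees consistent (but wrong) figures.
        rng = random.Random(session["id"])
        factor = 1 + rng.uniform(-JITTER_PCT, JITTER_PCT)
        return round(true_price * factor, 2)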
I scrape government sites a lot as they don't provide APIs. For mobile proxies, I use the Proxidize dongles and mobinet.io (free, with Android devices). As stated in the article, CGNAT makes them basically impossible to block: in my case, blocking them would mean half the country couldn't access the sites anymore (if you place the dongles in several locations and use one carrier at each).
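A minimal sketch of routing requests through such dongles with the Python requests library, assuming each dongle exposes a plain HTTP proxy endpoint on the LAN (the addresses below are placeholders):

    import random

    import requests

    # Placeholder proxy endpoints exposed by the 4G dongles; replace with your own.
    MOBILE_PROXIES = [
        "http://10.0.0.11:8080",
        "http://10.0.0.12:8080",
    ]

    def fetch(url):
        # Pick a dongle at random; from the target's side the request exits via CGNAT.
        proxy = random.choice(MOBILE_PROXIES)
        return requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=30)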
For one particularly hard-to-scrape website, which uses some kind of bot protection I just couldn't reliably get past (if anybody wants to know what it was exactly, I'll go and check), I now have a small Intel NUC running Firefox that listens to a local server and uses Tampermonkey to perform commands. Works like a charm and I can actually see what it's doing and where it's going wrong (though it's not scalable, of course).<p>We use it for data entry on a government website. A human would average around 10 minutes of clicking and typing, where the bot takes maybe 10 seconds. Last year we did 12,000 entries. Good bot.
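The commenter doesn't share code, but a minimal sketch of the "local server" side of such a setup might look like the following, assuming the Tampermonkey userscript polls for a job, performs the clicks and typing, and posts the outcome back (the endpoints and job format are invented for illustration):

    from queue import Empty, Queue

    from flask import Flask, jsonify, request

    app = Flask(__name__)
    jobs = Queue()    # data-entry jobs waiting for the browser
    results = []      # outcomes reported back by the userscript

    @app.route("/next-job")
    def next_job():
        # The userscript polls this; a null job means "nothing to do right now".
        try:
            return jsonify({"job": jobs.get_nowait()})
        except Empty:
            return jsonify({"job": None})

    @app.route("/result", methods=["POST"])
    def result():
        results.append(request.get_json())
        return "", 204

    if __name__ == "__main__":
        jobs.put({"action": "submit_entry", "fields": {"reference": "example"}})
        app.run(port=8000)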
Where I was working, we stopped caring about IPs, browsers, etc. because it was just a race. What we did instead was analyze click behaviour and act on that. When we recognized a bot, we served it a fake page. That also cut costs a little, because the fake pages were static.
In general it took them a long time to discover the pattern, and it was far more manageable for us.
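A toy version of that kind of click-behaviour heuristic; the window sizes and thresholds here are purely illustrative, not anything from the comment:

    import time
    from collections import defaultdict, deque

    WINDOW_SECONDS = 10        # illustrative values only
    MAX_CLICKS_IN_WINDOW = 30
    MIN_HUMAN_INTERVAL = 0.2   # seconds a human plausibly needs between clicks

    recent_clicks = defaultdict(lambda: deque(maxlen=MAX_CLICKS_IN_WINDOW))

    def looks_like_bot(session_id):
        """Flag sessions that click too fast or too often; serve them the static fake page."""
        now = time.time()
        clicks = recent_clicks[session_id]
        clicks.append(now)
        window = [t for t in clicks if now - t <= WINDOW_SECONDS]
        too_many = len(window) >= MAX_CLICKS_IN_WINDOW
        too_fast = len(window) > 2 and all(
            b - a < MIN_HUMAN_INTERVAL for a, b in zip(window, window[1:])
        )
        return too_many or too_fast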
It's easy to detect headless Chrome, so scraping with it is not really how the "big" boys do it :D The only scrapers/bots that are really hard to detect are the ones running and controlling a real browser, not Chromium. I do a lot of research against anti-bot systems, sometimes on a Friday night. Just because you spend every Friday night in the pub doesn't mean you're normal.
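For context on why it's easy: unpatched headless Chrome gives itself away in trivial ways, e.g. older headless builds announce themselves in the User-Agent by default. A crude server-side check might be something like this (trivially spoofed, which is exactly the commenter's point):

    def is_probably_headless_chrome(headers):
        """Very crude fingerprint: default headless builds say so in their User-Agent."""
        ua = headers.get("User-Agent", "")
        return "HeadlessChrome" in ua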
We are seeing a lot of bot traffic too, but chose to accept it as reality. Our view is that if thousands of bots create unpredictable cost surges, there is something wrong with our product; it should not put such heavy load on our servers in the first place to fulfil its mission.<p>I believe the future will make us more free by using more bot / AI technology, since who wants to spend their whole day in front of a computer researching information if machines can do the job just fine?
If you want to avoid bot detection, learn how bot detection works. A lot of commercial "web app firewalls" and the like actually have <i>minimum</i> requirements before they flag certain traffic as a botnet; stay below those limits and you can keep hammering away. Sometimes those limits are <i>quite</i> high.<p>In the past we've had the most success defeating bots by just finding stupid tricks to use against them. Identify the traffic, identify anything that is correlated with the botnet traffic, and throw a monkey wrench at it. They're only using one User Agent? Return fake results. 90% of the botnet traffic is coming from one network source (country/region/etc.)? Cause "random" network delays and timeouts. They still won't quit? During attacks, redirect specific pages to captchas. During active attacks this is enough to take them out for days to weeks while they figure it out and work around it.
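A toy middleware illustrating that "monkey wrench" approach; the fingerprints below (a single bot User-Agent string and a network prefix) are placeholders, and the delays and probabilities are just examples:

    import random
    import time

    # Placeholder fingerprints for an identified botnet, not real values.
    BOT_USER_AGENT = "ExampleBot/1.0"
    BOT_NETWORK_PREFIX = "203.0.113."   # TEST-NET range used as a stand-in

    def fake_results_page():
        # Plausible-looking but wrong data, analogous to the "bad data" trick above.
        return "<html><body>...fake results...</body></html>"

    def monkey_wrench(user_agent, remote_addr):
        """Return a degraded response for matching traffic, or None to serve the real page."""
        if user_agent == BOT_USER_AGENT:
            return fake_results_page()
        if remote_addr.startswith(BOT_NETWORK_PREFIX):
            time.sleep(random.uniform(2, 15))   # "random" delays and timeouts
            if random.random() < 0.3:
                return None                     # occasionally let it through to muddy the pattern
        return None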
Having spent a week battling a particularly inconsiderate scraping attempt, I’m quite unsurprised by the juvenile tone and fairly glib approach to the ethics of bots/scraping presented by the piece.<p>For the site I work for, about 20-30% of our monthly hosting costs go towards servicing bot/scraping traffic. We’ve generally priced this into the cost of doing business, as we’ve prioritised making our site as freely accessible as possible.<p>But after this week, where some amateur did real damage to us with a ham-fisted attempt to scrape too much too quickly, we’re forced to degrade the experience for ALL users by introducing captchas and other techniques we’d really rather not.
Doing a bit of low-stakes monitoring of webpages lately. It started (as I'm assuming it often does) with right-clicking a network request in Chrome and selecting "Copy as cURL".<p>Then I graduated to JavaScript for the surrounding logic, e.g. data transformation.<p>I had assumed I'd quickly give up and move to a headless browser, BUT I can't bring myself to move away from the tiny CPU utilization of curl.<p>Throwing together a "plugin" normally takes me less than 20 minutes.<p>I'll probably have a look at using Prowl to ping my phone.<p>And if I get more serious I'll look at auto-authentication options on npm. But I'm not sure the overhead of maintaining a bunch of spoofed requests will be worth it.
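A sketch of that curl-plus-glue pattern, wrapping curl in a small Python script rather than the JavaScript the commenter uses; the URL, headers, and fields are placeholders:

    import json
    import subprocess

    URL = "https://example.com/api/items"   # placeholder for the request copied from DevTools

    def fetch():
        # Replay (roughly) the same request via curl to keep CPU use tiny.
        out = subprocess.run(
            ["curl", "-s", URL, "-H", "Accept: application/json"],
            capture_output=True, text=True, check=True,
        )
        return json.loads(out.stdout)

    def transform(payload):
        # The "surrounding logic" step: keep only the fields being monitored.
        return [item["name"] for item in payload.get("items", [])]

    if __name__ == "__main__":
        print(transform(fetch()))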
Basic question: how does one profit from scraping data, and what kind of data?<p>Taking a stab at answering it: you scrape the data and build a business around selling it. Stock prices? But that's boring, plus how many others are already doing it? I bet a lot.
Thanks for the share. Great stuff.<p>I used to scrape websites to generate content for higher SERPs.<p>Ended up going into the adult industry lols. (<a href="https://javfilms.net" rel="nofollow">https://javfilms.net</a>)
Not the same kind of scraping, but does anyone have thoughts/resources/best practices for doing link previews (like Twitter/iMessage/Facebook)?
A little pet peeve of mine is when an obscure(ish) acronym is used and never defined. Is SERP a well-known acronym? Perhaps this is a niche blog and I'm not the intended audience.