
Scrape like the big boys

374 points by incolumitas, over 3 years ago

17 comments

biosed, over 3 years ago

I used to lead Sys Eng for a FTSE 100 company. Our data was valuable, but only for a short amount of time. We were constantly scraped, which cost us in hosting etc. We even saw competitors use our figures (good ones used them to offset their prices, bad ones just used them straight). As the article suggests, we couldn't block mobile operator IPs; some had over 100k customers behind them. Forcing the users to log in did little, as the scrapers just created accounts. We had a few approaches that minimised the scraping:

- Rate limiting by login
- Limiting data to known workflows ...

But our most fruitful effort was when we removed limits and started giving "bad" data. By bad I mean altering the price up or down by a small percentage. This hit them in the pocket but, again, wasn't a silver bullet. If a customer made a transaction on an altered figure, we informed them and took it at the correct price.

It's a cool problem to tackle, but it is just an arms race.
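The "bad data" countermeasure described above could be sketched roughly like this. All names and the percentage are hypothetical, and the commenter's actual system is not shown; the one non-obvious detail is seeding the noise per client, so a scraper can't average it away by re-requesting:

```python
import hashlib
import random


def fuzzed_price(true_price: float, client_id: str, max_pct: float = 0.03) -> float:
    """Perturb a price by up to +/- max_pct for a suspected scraper.

    Seeding the RNG with the client id makes the fuzz deterministic per
    client, so repeated requests return the same wrong number instead of
    noise that averages out to the true price.
    """
    seed = int(hashlib.sha256(client_id.encode()).hexdigest(), 16)
    rng = random.Random(seed)
    factor = 1 + rng.uniform(-max_pct, max_pct)
    return round(true_price * factor, 2)


def price_for(true_price: float, client_id: str, is_suspected_bot: bool) -> float:
    # Real customers always see (and transact at) the true price.
    return fuzzed_price(true_price, client_id) if is_suspected_bot else true_price
```

As the comment notes, any transaction made at an altered figure would still be honoured at the correct price on the real-customer path.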
abc03, over 3 years ago

I scrape government sites a lot, as they don't provide APIs. For mobile proxies, I use the Proxidize dongles and mobinet.io (free, with Android devices). As stated in the article, with CGNAT it's basically impossible to block them: in my case, half the country couldn't access the sites anymore (if you place them in several locations and use one carrier each there).
neals, over 3 years ago

For one particularly hard-to-scrape website, using some kind of bot protection that I just couldn't reliably get working (if anybody wants to know what that was exactly, I'll go and check), I now have a small Intel NUC running Firefox that listens to a local server and uses Tampermonkey to perform commands. Works like a charm, and I can actually see what it's doing and where it's going wrong. (Though it's not scalable, of course.)

We use it for data entry on a government website. A human would average around 10 minutes of clicking and typing, where the bot takes maybe 10 seconds. Last year we did 12,000 entries. Good bot.
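The "local server driving a real browser" pattern above could be sketched as a small job queue: a Tampermonkey script in the real Firefox polls for the next data-entry job, performs the clicks, and reports back. The endpoint names and job fields below are hypothetical; only the queue logic is shown, which in practice would sit behind something like `http.server`:

```python
import json
from queue import Queue, Empty


class CommandServer:
    """Holds data-entry jobs for a userscript running in a real browser."""

    def __init__(self) -> None:
        self._jobs: Queue = Queue()
        self.results: list = []

    def enqueue(self, job: dict) -> None:
        self._jobs.put(job)

    def next_job(self) -> str:
        """What a GET /next-job endpoint would return to the userscript."""
        try:
            return json.dumps(self._jobs.get_nowait())
        except Empty:
            return json.dumps(None)  # nothing to do; poll again later

    def report(self, result: dict) -> None:
        """What a POST /report endpoint would record."""
        self.results.append(result)
```

Because the clicks happen in a genuine browser session, most bot-protection signals look like a normal user, at the cost of the scalability the commenter mentions.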
InvOfSmallC, over 3 years ago

Where I was working, we stopped caring about IPs, browsers, etc., because it was just a race. What we did instead was analyse click behaviour and act on that. When we recognised a bot, we served it a fake page. That cut costs a little, because they were static pages. In general it took them a long time to discover the pattern, and it was far more manageable for us.
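A minimal sketch of this behaviour-based approach: flag clients whose click timing is implausibly fast, then serve them a cheap static fake page instead of the real one. The window size and threshold are illustrative, not the commenter's actual values:

```python
from collections import defaultdict, deque

WINDOW = 10          # look at the last 10 clicks per client
MIN_HUMAN_GAP = 0.3  # humans rarely sustain clicks faster than this (seconds)

_clicks: dict = defaultdict(lambda: deque(maxlen=WINDOW))


def record_click(client_id: str, ts: float) -> None:
    _clicks[client_id].append(ts)


def looks_like_bot(client_id: str) -> bool:
    """Flag clients whose average inter-click gap is inhumanly short."""
    ts = list(_clicks[client_id])
    if len(ts) < WINDOW:
        return False
    gaps = [b - a for a, b in zip(ts, ts[1:])]
    return sum(gaps) / len(gaps) < MIN_HUMAN_GAP


def serve(client_id: str) -> str:
    # Bots get a precomputed static fake page: no database hit, low cost.
    return "fake_static.html" if looks_like_bot(client_id) else "real_page.html"
```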
max002, over 3 years ago

It's easy to detect headless Chrome, so scraping with it is not really how the "big" boys do it :D The only scrapers/bots that are really hard to detect are the ones running and controlling a real browser, not Chromium. I do a lot of research against antibot systems; sometimes it's Friday night. If you spend every one in a pub, it doesn't mean you're normal.
chrisMyzel, over 3 years ago

We are seeing a lot of bot traffic too, but chose to accept it as reality. We're aware that if thousands of bots create unpredictable cost surges, there is something wrong with our product: it should not create such heavy loads on our servers in the first place to fulfil its mission.

I believe the future will make us more free through more bot / AI technology, since who wants to spend their whole day in front of a computer researching information if machines can do the job just fine?
throwaway984393, over 3 years ago

If you want to avoid bot detection, learn how bot detection works. A lot of commercial "web app firewalls" and the like actually have *minimum* requirements before they flag certain traffic as a botnet; stay below those limits and you can keep hammering away. Sometimes those limits are *quite* high.

In the past we've had the most success defeating bots by just finding stupid tricks to use against them. Identify the traffic, identify anything that is correlated with the botnet traffic, and throw a monkey wrench at it. They're only using one User-Agent? Return fake results. 90% of the botnet traffic is coming from one network source (country/region/etc.)? Cause "random" network delays and timeouts. They still won't quit? During attacks, redirect specific pages to captchas. During active attacks this is enough to take them out for days to weeks while they figure it out and work around it.
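The monkey-wrench tactics above could be sketched as a single request hook. The User-Agent string, IP prefix, and probabilities are hypothetical placeholders for whatever the identified botnet actually correlates on:

```python
import random

BOT_USER_AGENT = "Mozilla/5.0 (ScraperNet/1.0)"  # the botnet's single UA
BOT_NETWORK_PREFIX = "203.0.113."                # its dominant source network


def handle(request: dict, real_results: list) -> dict:
    """Return {'delay': seconds, 'body': payload} for one request."""
    ua = request.get("user_agent", "")
    ip = request.get("ip", "")
    if ua == BOT_USER_AGENT:
        # Plausible-looking but fabricated results for the known bot UA.
        return {"delay": 0, "body": [{"id": i, "price": 9.99} for i in range(3)]}
    if ip.startswith(BOT_NETWORK_PREFIX):
        # "Random" network trouble for the botnet's home network:
        # sometimes a long stall that reads as a timeout, otherwise slow.
        if random.random() < 0.3:
            return {"delay": 30, "body": None}
        return {"delay": random.uniform(2, 8), "body": real_results}
    return {"delay": 0, "body": real_results}
```

The point of the delays, as the comment says, is cost: the bot operator has to burn time diagnosing "network problems" that only their traffic ever sees.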
IceWreck, over 3 years ago

The author says proxies are expensive, and then proceeds to spend a shitton of money buying all that hardware.
ebbp, over 3 years ago

Having spent a week battling a particularly inconsiderate scraping attempt, I'm quite unsurprised by the juvenile tone and fairly glib approach to the ethics of bots/scraping presented by the piece.

For the site I work for, about 20-30% of our monthly hosting costs go towards servicing bot/scraping traffic. We've generally priced this into the cost of doing business, as we've prioritised making our site as freely accessible as possible.

But after this week, where some amateur did real damage to us with a ham-fisted attempt to scrape too much too quickly, we're forced to degrade the experience for ALL users by introducing captchas and other techniques we'd really rather not.
KuhlMensch, over 3 years ago

I've been doing a bit of low-stakes monitoring of webpages lately. It started (as I assume it often does) with right-clicking a network request in Chrome and selecting "Copy as cURL".

Then I graduated to JavaScript for the surrounding logic, e.g. data transformation.

I had assumed I'd quickly give up and move to a headless browser, BUT I can't bring myself to move away from the tiny CPU utilisation of curl.

Throwing together a "plugin" normally takes me less than 20 minutes.

I'll probably have a look at using Prowl to ping my phone.

And if I get more serious, I'll look at auto-authentication options on npm. But I'm not sure the overhead of maintaining a bunch of spoofy requests will be worth it.
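The "Copy as cURL" workflow amounts to replaying the exact request the browser made, headers and all. A sketch of the same idea in plain stdlib Python (the URL and header values below are placeholders for whatever DevTools shows, not real endpoints):

```python
import urllib.request

# Header values copied verbatim from Chrome's "Copy as cURL" output;
# the placeholders here stand in for the real strings.
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 ...",  # paste the browser's real UA here
    "Accept": "application/json",
    "Referer": "https://example.com/page",
}


def fetch(url: str, headers: dict) -> bytes:
    """Replay a browser request with its original headers."""
    req = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(req, timeout=30) as resp:
        return resp.read()

# e.g.: data = fetch("https://example.com/api/items", BROWSER_HEADERS)
```

Like the curl approach, this costs almost no CPU compared to a headless browser, at the price of maintaining the spoofed headers by hand.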
hall0ween, over 3 years ago

Basic question: how does one profit from scraping data, and what kind of data?

Taking a stab at answering it: you scrape the data and build a business around selling it. Stock prices? But that's boring; plus, how many others are doing it? I bet a lot.
kerokerokero, over 3 years ago

Thanks for the share. Great stuff.

I used to scrape websites to generate content for higher SERPs.

Ended up going into the adult industry lols. (https://javfilms.net)
wilg, over 3 years ago

Not the same kind of scraping, but does anyone have thoughts/resources/best practices for doing link previews (like Twitter/iMessage/Facebook)?
devops000, over 3 years ago

Could you share your code for AWS Lambda and Puppeteer? It's definitely interesting for other websites.
DeathArrow, over 3 years ago

You can put in some WASM crypto-mining code and at least profit from the bots. :D
mrg3_2013, over 3 years ago

Wow! That was an interesting read.
joekrill, over 3 years ago

A little pet peeve of mine is when an obscure(ish) acronym is used and never defined. Is SERP a well-known acronym? Perhaps this is a niche blog and I'm not the intended audience.