While interesting and well-intentioned, the advice in the article can cause issues. The agent isn't downloading CSS or images? Could be a blind user. Pages being downloaded in quick succession? Could be browser prefetch. Lack of mouse movements during navigation? Could be a user on a mobile device or a screen reader.<p>What if the bot is making individual requests from unique IP addresses? What if it's scraping many pages over time rather than taking a single smash-and-grab approach?<p>The article admits that this is a hard thing to solve. In most cases, it's probably not very worthwhile to try to detect bots in the ways that are suggested. Focusing on patterns in the server logs to find out what's being targeted might be more beneficial; then slap a login or captcha around anything valuable that bots shouldn't have access to.
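A minimal sketch of that log-first approach, assuming a common/combined-format access log (the "access.log" path and the top-20 cutoff are placeholders): tally which paths get hammered, then decide which of them deserve a login or captcha.

    import re
    from collections import Counter

    # Match the request line of a common/combined-format access log entry.
    REQUEST = re.compile(r'"(?:GET|POST|HEAD) (\S+) HTTP/[\d.]+"')

    counts = Counter()
    with open("access.log") as fh:  # placeholder path
        for line in fh:
            m = REQUEST.search(line)
            if m:
                counts[m.group(1)] += 1

    # The most-hammered paths are the candidates worth gating.
    for path, hits in counts.most_common(20):
        print(f"{hits:8d}  {path}")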
My favorite thing to do was to detect a bot and then show it a parallel dataset I'd developed of bogus information. Instead of hitting a block and working its way around it, competitors scraped a bunch of slightly warbled nonsense data. Have fun with that, gang! It also made it super duper easy to see who'd been trying to steal our data.
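For the curious, a rough sketch of the idea in Flask -- the user-agent heuristic and the product data here are made-up stand-ins for whatever detection and dataset you actually have:

    from flask import Flask, jsonify, request

    app = Flask(__name__)

    REAL_DATA = [{"sku": "A100", "price": 19.99}]    # stand-in for the real dataset
    DECOY_DATA = [{"sku": "A100", "price": 17.49}]   # slightly "warbled" parallel copy

    def looks_like_bot(req):
        # Placeholder heuristic; swap in whatever detection you actually trust.
        return "python-requests" in req.headers.get("User-Agent", "").lower()

    @app.route("/api/products")
    def products():
        # Bots never hit a block, so they never know to work around it -- and
        # anyone republishing the warbled numbers outs themselves.
        return jsonify(DECOY_DATA if looks_like_bot(request) else REAL_DATA)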
I work for a SaaS company that provides a domain per customer, and one of the consequences of this that never occurred to me before is that even polite crawlers and bots can thrash your servers if you have enough TLDs.<p>Every few months someone discovers the correlation between our log messages and user agents and we rehash the same discussion about a particular badly behaved crawler that produces enough log noise that it distracts from identifying new regressions.<p>I coordinated an effort to fix a problem with bad encoding of query parameters ages ago, but we still see broken ones from this one bot.
I just launched a site. I have not mentioned it anywhere and there are no links to it on the internet, yet the logs are already full of bots looking for vulnerabilities. Judging by what they are looking for, it seems I could eliminate half of them with something as simple as if (location.contains('php')).
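Something along the lines of this WSGI middleware sketch would do it (the 404 response and the bare "php" substring test are my assumptions, not anything the site above actually runs):

    def block_php_probes(app):
        """Reject requests whose path mentions 'php' on a site that serves no PHP."""
        def middleware(environ, start_response):
            if "php" in environ.get("PATH_INFO", "").lower():
                start_response("404 Not Found", [("Content-Type", "text/plain")])
                return [b"not found"]
            return app(environ, start_response)
        return middleware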
I think bots should have full rights to access the internet, just like humans.<p>Discriminating against bots just makes the internet worse for everyone. In the future, everyone will have personal bots and agents to help automate many things.
The message I'm getting from this article is that headless Chrome should offer an 'undetectable' mode where its unique, fingerprintable globals are replaced with those from the headful version.
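You can approximate a slice of that today by patching globals before any page script runs; here's a hedged sketch using Selenium's Chrome CDP hook, which only hides navigator.webdriver and is nowhere near a full "undetectable" mode:

    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)

    # Inject a script that runs before the page's own scripts, hiding one
    # fingerprintable global. Real fingerprinting checks many more signals.
    driver.execute_cdp_cmd(
        "Page.addScriptToEvaluateOnNewDocument",
        {"source": "Object.defineProperty(navigator, 'webdriver', {get: () => undefined});"},
    )
    driver.get("https://example.com")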
There are many false positives, though: the detection might think it's looking at a bot and be wrong. Changing the number in a URL is something I commonly do, and I often use curl when I want to download a file (rather than using the browser's download manager).
We detect web bots by analyzing the behavior of all web traffic; via clustering we find entities that behave unusually similarly to each other. This way we detect "clusters", or families, of bots.
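A toy version of that clustering idea, with made-up per-client features and arbitrary DBSCAN parameters purely for illustration -- clients whose behavior vectors land in the same tight cluster are the "families" worth a closer look:

    import numpy as np
    from sklearn.preprocessing import StandardScaler
    from sklearn.cluster import DBSCAN

    # Per-client features: [requests/min, mean gap between requests (s),
    # distinct paths hit, fraction of requests that fetch assets].
    features = np.array([
        [120, 0.5, 40, 0.02],   # three clients with eerily similar, asset-free traffic
        [118, 0.5, 41, 0.01],
        [119, 0.5, 39, 0.02],
        [3, 20.0, 12, 0.90],    # ordinary human-looking browsing
        [2, 35.0, 8, 0.85],
    ])

    labels = DBSCAN(eps=0.8, min_samples=2).fit_predict(
        StandardScaler().fit_transform(features)
    )
    print(labels)  # clients sharing a non -1 label behave "unusually similar"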
Since 1/21 I have identified 153,423 attempts to gain access to some servers I run, all from unique IP addresses. It's one thing to identify bots; it's quite another to know what to do about them.
> it’s safe to assume that attackers are likely to lie on their fingerprints<p>I hate this sentiment that everyone who doesn't want to be tracked is a criminal.<p>It's the digital version of "You don't have anything to hide if you aren't doing anything bad."<p>---<p>I don't understand the fear about bots or scraping. As long as bots behave nicely (not slamming servers), they are just as much web citizens as humans.<p>The web is about sharing information, and having an entire <i>company</i> devoted to exterminating the viability of bots is horrifying.
"..(the OS and its version, as well as whether it is a VM.."<p>Could anybody link me up with something to read on how to detect it is a VM based on TLS fingerprint?
I wonder if they could just make a new browser standard that lets the server put the browser into a mode where the browser itself ensures there’s a live human interacting with the page/app, like using the webcam + OS facial recognition to assure the server that it’s not a bot.
Answering this question was recently a $1B acquisition for F5 with their purchase of Shape Security.<p>Will be interesting to see what happens to other industry players like PerimeterX. Distil was eaten by the corpse of Imperva, and I don't see Akamai making strong headway with Botman.<p>Google is going after this too with reCAPTCHA. The HN reaction to that has been interesting.<p>It's interesting to me how many comments in these threads talk about scraping as the issue with bots. Every sale I saw when I worked on this problem was related to credential stuffing. Seems the enterprise dollars are in the fraud space, but the HN sentiment is in scraping.<p>Funny how disconnected the community here can be from what I saw first-hand as the "real" issue. Makes me wonder what other topics it gets wrong. Surely my area of expertise isn't special.