While interesting and well-intentioned, the advice in the article can cause issues. The agent isn't downloading CSS or images? Could be a blind user. Pages being downloaded in quick succession? Could be browser prefetch. Lack of mouse movements during navigation? Could be a user on a mobile device or a screen reader.<p>What if the bot is making individual requests from unique IP addresses? What if it's scraping many pages over time rather than taking a single smash-and-grab approach?<p>The article admits that this is a hard thing to solve. In most cases, it's probably not very worthwhile to try to detect bots in the ways that are suggested. Focusing on patterns in the server logs to find out what's being targeted might be more beneficial; then slap a login or captcha around anything valuable that bots shouldn't have access to.
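A minimal sketch of that log-first approach, assuming a common/combined-format access log (the "access.log" path and the top-20 cutoff are placeholders): tally which paths get hammered, then decide which of them deserve a login or captcha.

    import re
    from collections import Counter

    # Match the request line of a common/combined-format access log entry.
    REQUEST = re.compile(r'"(?:GET|POST|HEAD) (\S+) HTTP/[\d.]+"')

    counts = Counter()
    with open("access.log") as fh:  # placeholder path
        for line in fh:
            m = REQUEST.search(line)
            if m:
                counts[m.group(1)] += 1

    # The most-hammered paths are the candidates worth gating.
    for path, hits in counts.most_common(20):
        print(f"{hits:8d}  {path}")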
My favorite thing to do was to detect a bot and then show it a parallel dataset I'd developed of bogus information. Instead of hitting a block and working its way around it, competitors scraped a bunch of slightly warbled nonsense data. Have fun with that, gang! It also made it super duper easy to see who'd been trying to steal our data.
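For the curious, a rough sketch of the idea in Flask -- the user-agent heuristic and the product data here are made-up stand-ins for whatever detection and dataset you actually have:

    from flask import Flask, jsonify, request

    app = Flask(__name__)

    REAL_DATA = [{"sku": "A100", "price": 19.99}]    # stand-in for the real dataset
    DECOY_DATA = [{"sku": "A100", "price": 17.49}]   # slightly "warbled" parallel copy

    def looks_like_bot(req):
        # Placeholder heuristic; swap in whatever detection you actually trust.
        return "python-requests" in req.headers.get("User-Agent", "").lower()

    @app.route("/api/products")
    def products():
        # Bots never hit a block, so they never know to work around it -- and
        # anyone republishing the warbled numbers outs themselves.
        return jsonify(DECOY_DATA if looks_like_bot(request) else REAL_DATA)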
I work for a SaaS company that provides a domain per customer, and one of the consequences of this that never occurred to me before is that even polite crawlers and bots can thrash your servers if you have enough TLDs.<p>Every few months someone discovers the correlation between our log messages and user agents and we rehash the same discussion about a particular badly behaved crawler that produces enough log noise that it distracts from identifying new regressions.<p>I coordinated an effort to fix a problem with bad encoding of query parameters ages ago, but we still see broken ones from this one bot.
I just launched a site. I have not mentioned it anywhere and there are no links to it on the internet, yet the logs are already full of bots looking for vulnerabilities. Judging by what they are looking for, it seems I could eliminate half of them with something as simple as if (location.contains('php')).
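Something along the lines of this WSGI middleware sketch would do it (the 404 response and the bare "php" substring test are my assumptions, not anything the site above actually runs):

    def block_php_probes(app):
        """Reject requests whose path mentions 'php' on a site that serves no PHP."""
        def middleware(environ, start_response):
            if "php" in environ.get("PATH_INFO", "").lower():
                start_response("404 Not Found", [("Content-Type", "text/plain")])
                return [b"not found"]
            return app(environ, start_response)
        return middleware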
I think bots should have full rights to access the internet, just like humans.<p>Discriminating against bots just makes the internet worse for everyone. In the future, everyone will have personal bots and agents to help automate many things.
The message I'm getting from this article is that headless Chrome should offer an 'undetectable' mode where its unique, fingerprintable globals are replaced with those from the headful version.
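You can approximate a slice of that today by patching globals before any page script runs; here's a hedged sketch using Selenium's Chrome CDP hook, which only hides navigator.webdriver and is nowhere near a full "undetectable" mode:

    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)

    # Inject a script that runs before the page's own scripts, hiding one
    # fingerprintable global. Real fingerprinting checks many more signals.
    driver.execute_cdp_cmd(
        "Page.addScriptToEvaluateOnNewDocument",
        {"source": "Object.defineProperty(navigator, 'webdriver', {get: () => undefined});"},
    )
    driver.get("https://example.com")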
There are many false positives, though: the detection might think it's looking at a bot and be wrong. Changing the number in a URL is something I commonly do, and I often use curl when I want to download a file (rather than using the browser's download manager).
We detect web bots by analyzing the behavior of all web traffic; via clustering we find entities that behave unusually similarly to each other. This way we detect "clusters", or families, of bots.
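A toy version of that clustering idea, with made-up per-client features and arbitrary DBSCAN parameters purely for illustration -- clients whose behavior vectors land in the same tight cluster are the "families" worth a closer look:

    import numpy as np
    from sklearn.preprocessing import StandardScaler
    from sklearn.cluster import DBSCAN

    # Per-client features: [requests/min, mean gap between requests (s),
    # distinct paths hit, fraction of requests that fetch assets].
    features = np.array([
        [120, 0.5, 40, 0.02],   # three clients with eerily similar, asset-free traffic
        [118, 0.5, 41, 0.01],
        [119, 0.5, 39, 0.02],
        [3, 20.0, 12, 0.90],    # ordinary human-looking browsing
        [2, 35.0, 8, 0.85],
    ])

    labels = DBSCAN(eps=0.8, min_samples=2).fit_predict(
        StandardScaler().fit_transform(features)
    )
    print(labels)  # clients sharing a non -1 label behave "unusually similar"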
Since 1/21 I have identified 153,423 attempts to gain access to some servers I run, all from unique IP addresses. It's one thing to identify bots; it's quite another to know what to do about them.
> it’s safe to assume that attackers are likely to lie on their fingerprints<p>I hate this sentiment that everyone who doesn't want to be tracked is a criminal.<p>It's the digital version of "You don't have anything to hide if you aren't doing anything bad."<p>---<p>I don't understand the fear about bots or scraping. As long as bots behave nicely (not slamming servers), they are just as much web citizens as humans.<p>The web is about sharing information, and having an entire <i>company</i> devoted to exterminating the viability of bots is horrifying.
"..(the OS and its version, as well as whether it is a VM.."<p>Could anybody link me up with something to read on how to detect it is a VM based on TLS fingerprint?
I wonder if they could just make a new browser standard that lets the server put the browser into a mode where the browser itself ensures there’s a live human interacting with the page/app, like using the webcam + OS facial recognition to assure the server that it’s not a bot.
Answering this question was recently a $1B acquisition for F5 with their purchase of Shape Security.<p>Will be interesting to see what happens to other industry players like PerimeterX. Distil was eaten by the corpse of Imperva, and I don't see Akamai making strong headway with Botman.<p>Google is going after this too with reCAPTCHA. The HN reaction to that has been interesting.<p>It's interesting to me how many comments in these threads talk about scraping as the issue with bots. Every sale I saw when I worked on this problem was related to credential stuffing. Seems the enterprise dollars are in the fraud space, but the HN sentiment is in scraping.<p>Funny how disconnected the community here can be from what I saw first-hand as the "real" issue. Makes me wonder what other topics it gets wrong. Surely my area of expertise isn't special.