Sites detecting headless browsers versus headless browsers trying not to be detected is an arms race that has been going on for a long time. The problem is that if you're trying to detect headless browsers in order to stop scraping, you're stepping into an arms race that's being played very, very far above your level.

The main context in which JavaScript tries to detect whether it's being run headless is when malware is trying to evade behavioral fingerprinting by behaving nicely inside a scanning environment and badly inside real browsers. The main context in which a headless browser tries to make itself indistinguishable from a real user's web browser is when it's trying to stop malware from doing that. Scrapers can piggyback on the latter effort, but scraper-detectors can't really piggyback on the former. So this very strongly favors the scrapers.
Blocking crawlers is dead simple:

Find a way to build an API for your data that allows both of you to make money. Any effort beyond that is wasted.

Honeypot links? Great, my crawler only clicks things that are visible. See Capybara (a sketch of the same idea with headless Chrome is below).

IP thresholds? Great, I have burner IPs that hit a good page of yours until I'm blocked (time-banned, captcha'd, or perma-banned), and then I back that number out across my network of residential IPs (bought through squid, hello, or anyone else) and a mix of Tor nodes (I sample your site with those too) to make sure I never approach that number. I also geolocate each IP so it only crawls during sensible browsing hours for that location.

Keystroke detection? Yeah, I slow down keystrokes so it looks like Grandma is browsing.

Mouse detection? Looks like Michael J. Fox is on your site (that's an old Dell or Gateway commercial reference, don't be mad).

Poison the well? I fetch a page from multiple IP and headless-browser combinations on different screen orientations, and if I detect odd changes in the data I flag that URL for a turk to provide insight and tune the crawler. I keep the screenshot and full payload (CSS, JS, HTML), which I use over time to do more devious shit like rendering old versions of your page behind a private nginx server so I can re-extract pieces of data I may have missed.

Stop trying to stop the crawling and figure out how to create a revenue stream.
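For what it's worth, the visible-only-clicks trick above takes only a few lines with headless Chrome as well. This is a minimal sketch, assuming Node with Puppeteer installed; the URL and the delay range are placeholders rather than anything from the comment:

    // Only interact with links a human could actually see, with human-ish pacing.
    const puppeteer = require('puppeteer');

    const humanDelay = () => new Promise(r => setTimeout(r, 500 + Math.random() * 2000));

    (async () => {
      const browser = await puppeteer.launch({ headless: true });
      const page = await browser.newPage();
      await page.goto('https://example.com/listing', { waitUntil: 'networkidle2' });

      const links = await page.$$('a');
      for (const link of links) {
        // boundingBox() returns null for elements that are not rendered or visible,
        // which is how honeypot links (display:none, zero size) get skipped.
        const box = await link.boundingBox();
        if (!box || box.width === 0 || box.height === 0) continue;

        const href = await page.evaluate(el => el.href, link);
        console.log('would visit:', href);
        await humanDelay(); // pace requests like a person, not a loop
      }

      await browser.close();
    })();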
This article is a joke; all those methods of "protection" are a joke. What we used to call "script kiddies", and what now makes up a major share of so-called developers, are just underdeveloped lamers who don't know that the fight is lost in advance. All the methods you take are useless the moment the scraper is run by someone who can modify (and who can code in C/C++) and recompile the client side. The world has slid so far into Idiocracy that these methods are being invented by people so narrow-minded that they see development only within the scope of a browser and have a false sense that they "can handle it". Only if the opponent is as narrow-minded as they are. Only then. I can modify the source code of Chromium so you get back exactly what you expect from a regular user; I am able to scrape FB and LinkedIn, and the only thing they can do is slow me down (to hide the fact that the code, not a human, is doing the surfing). Stop wasting your time on protection: you are running your inefficient, crappy code in an insecure environment, and the only "attacker" you are safe against is the one who is as clueless as you are.

The moment you send content to the client, it is game over. You have lost all control.

I am sorry for all the non-gentle sentences here, but we had developers who could disassemble and patch asm code to defeat DRM, while now sandboxed idiots think they are smart. The whole dev environment became toxic =/ And people are just too stupid to understand how stupid they are =/
Isn't it impossible to win the game of blocking headless browsers?

What's stopping someone from creating an API that opens up a real browser, uses a real (or virtual) keyboard, types in or clicks the real address, and so on, then proceeds to use computer vision to scrape the information from the page without touching the DOM?
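Roughly the pipeline being described, sketched with Node, Puppeteer driving a real (headful) Chrome window, and tesseract.js for the OCR step. The URL is a placeholder; note that Puppeteer still automates the browser over DevTools even though the extraction itself never reads the DOM, and a stricter version would drive the keyboard and mouse with OS-level tools instead:

    const puppeteer = require('puppeteer');
    const Tesseract = require('tesseract.js');

    (async () => {
      const browser = await puppeteer.launch({ headless: false }); // a real, visible Chrome
      const page = await browser.newPage();
      await page.goto('https://example.com/article', { waitUntil: 'networkidle2' });

      // No DOM queries at all: grab pixels and run OCR on them.
      const png = await page.screenshot({ fullPage: true });
      const { data } = await Tesseract.recognize(png, 'eng');
      console.log(data.text);

      await browser.close();
    })();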
Good. The less effective various spying techniques are, and the easier they are to throw off, the better the internet is for its users. I don't want any website owner to know what device, browser, or other program I use to access their site, and they have no business knowing that. I like it being a piece of information I can supply voluntarily for my own purposes, and I get the heebie-jeebies every time I read about a new shady fingerprinting technique that exploits some new, previously unexplored quirk of web technologies.
I'm not sure why anyone would bother to do this.

With tools like Sikuli script (sikuli.org) already around for ages, automating a headed browser isn't rocket science. So the best-case scenario for detecting headless browsers is "the bad guys just use headed browsers and another automation solution."
This discussion is also happening on a counterpoint posted about 9 hours earlier, also currently on the front page:

It is possible to detect and block Chrome headless | https://news.ycombinator.com/item?id=16175646
"That’s when it becomes impossible. You can come up with whatever tests you want, but any dedicated web scraper can easily get around them."<p>As long as the logic is hidden from the scrapers, i.e. not running in a web browser, scrapers are at a disadvantage. They don't have the data about the users that websites have. And even something as simple as Accept-Language header associated with an IP subnet is a data point that can be used to protect against scraping. There are a lot more data points though and more aggressive fingerprinting can effectively destroy scraping.
Interesting follow-up (again). It will be very interesting to see where attempts to detect headless browsers first appear in the wild. Once we know that and their prevalence, we can make a judgement call on how much effort to put into anti-detection techniques. It's an arms race for sure, but once you know your target you can evaluate whether you even have to put up the effort to defeat a non-existent adversary.
It's a very dangerous thing to do for SEO reasons too.

I'm sure Google and others run automated, user-like crawls that validate what their official Google indexing bot sees.

If the results between the two differ in certain ways, you may well get your site buried way down in the search results.
Crawlers & scrapers that rely on headless browsers like Chrome often initiate playback of video on the pages they access.<p>The company I work for (Mux) has a product that collects user-experience metrics for video playback in browsers & native apps. It's been a non-trivial effort developing a system to identify video views from headless browsers so that we might limit their impact on metrics. Being able to make this differentiation has a real benefit to human users of our customer's websites.<p>My preference would be for headless browsers to not interact with web video or be easily identifiable via request headers, though I doubt either of these things will happen any time soon.
The author's navigator.webdriver fix is easily detected, though of course it is fixable with changes to Chrome. This cat and mouse game probably isn't worth pursuing against dedicated adversaries.

    if (navigator.webdriver || Object.getOwnPropertyDescriptor(navigator, 'webdriver')) {
      // navigator.webdriver exists or was redefined
    }
As someone who writes web scrapers for a living, I have only come across one site where I have been unable to reliably extract the information we need. If we were more flexible, we would be able to deal with this site too. Defending yourself from scrapers is an arms race you are almost certain to lose.
It's trivial to randomise HTTP headers, both the content and the *order*. There are free and commercial databases of user-agent strings available to any user, the same ones the websites may use.

Users can also modify or delete HTTP headers through local proxies, using the same proxy software that many high-volume websites use. Sites that rely on redirects to set headers make this even easier.

p0f only works with TCP. Could this be another selling point for alternative congestion-controlled reliable transports that are not TCP, e.g. CurveCP? I have prototype "websites" on my local LAN that do not use TCP.

The arguments in favor of controlling access to public information through "secret hacker ninja shit" (https://news.ycombinator.com/item?id=16176572) are not winning on the www or in the courts. Consider the recent Oracle ruling and the pending LinkedIn/hiQ case.

If the information is intended to be non-public, then there is no excuse for not using access controls. Anything from basic HTTP authentication to requiring client x509 certificates would suffice for making a believable claim.

Detecting headless Chrome and serving fake information, or any other such "secret hacker ninja shit", is not going to suffice as a legitimate access control, whether in practice or in an argument to a reasonable person.

The fact is that in 2017 websites still cannot even tell what "browser" I am using, let alone what "device" I am using. They still get it wrong every time. The best they can do is make lousy guesses and block indiscriminately. Everything that is not what they want or expect is a "bot", a competitor, an evil villain. Yet they have no idea. Sometimes, assumptions need to be tested.[1]

    [1] https://news.ycombinator.com/item?id=16103235 (where a developer thought a spike in traffic was an "attack")
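A minimal sketch of the header-randomisation point, assuming you want byte-level control of what goes on the wire: it writes the request by hand over a TLS socket instead of trusting an HTTP library's header ordering, and the tiny user-agent list stands in for the databases mentioned above.

    const tls = require('tls');

    const userAgents = [
      'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
      'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_2) AppleWebKit/537.36',
    ];

    // Fisher-Yates would be cleaner; a sort on random keys is enough for a sketch.
    function shuffle(arr) {
      return arr.map(v => [Math.random(), v]).sort((a, b) => a[0] - b[0]).map(p => p[1]);
    }

    const host = 'example.com';
    const headers = shuffle([
      ['User-Agent', userAgents[Math.floor(Math.random() * userAgents.length)]],
      ['Accept', 'text/html,application/xhtml+xml'],
      ['Accept-Language', 'en-US,en;q=0.9'],
      ['Connection', 'close'],
    ]);

    // Hand-written HTTP/1.1 request: the header order is exactly what we chose.
    const socket = tls.connect(443, host, { servername: host }, () => {
      socket.write(`GET / HTTP/1.1\r\nHost: ${host}\r\n` +
        headers.map(([k, v]) => `${k}: ${v}\r\n`).join('') + '\r\n');
    });
    socket.on('data', chunk => process.stdout.write(chunk));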
From my experience in the scene:

Bot-mill people are very aware that headless browsers are an effortless way to mimic a browser, but not an efficient one. The amount of RAM (and so on) a bot spends to do a single click can truly hurt their bottom line.

Top-tier collectives I've heard of use their own C/C++ frameworks with hardcoded requests and challenge solvers, plus in-depth knowledge of the anti-botting and anti-fraud techniques used by the opposing side. If DoubleClick finds a brand-new performance-profiling test and sends it out in the JS code of one in 1,000 requests, expect those guys to detect it and crack it within 24 hours.

They have no objective of getting through captchas, just keeping their number of valid clicks in the double digits.
The problem is that you can easily detect that some properties have been overridden. For example, you can execute Object.getOwnPropertyDescriptor(navigator, "languages") to detect whether navigator.languages is a native property or not.
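Concretely, a check along those lines might look like this in page JavaScript. In an unmodified Chrome, navigator.languages lives on Navigator.prototype with a native getter, so both tests below pass; of course, a determined scraper can spoof toString() as well, which is the whole point of the thread.

    function languagesLooksSpoofed() {
      // A scraper that did Object.defineProperty(navigator, 'languages', ...)
      // leaves an own property behind on the navigator instance.
      if (Object.getOwnPropertyDescriptor(navigator, 'languages') !== undefined) {
        return true;
      }
      // The prototype's getter should still be native code, not a JS function.
      const proto = Object.getOwnPropertyDescriptor(Navigator.prototype, 'languages');
      if (proto && proto.get && !/\[native code\]/.test(proto.get.toString())) {
        return true;
      }
      return false;
    }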
Could someone tell me why everybody wants to fight headless browsers? If I want to use such a browser to browse your site, a site that you voluntarily show to the public, then it's my problem, my code, not yours. If you want to protect your data so much, then maybe you shouldn't put it on the web in the first place. (Yep, I'm presenting things in black and white, but you get the picture.)

I would also add this:

https://www.bitlaw.com/copyright/database.html#Feist

because it basically says it's hard/pointless to protect data.
Some people seem to have figured out how to detect scrapers without relying on fingerprinting the browser, e.g. Crunchbase. But headless Chrome shouldn't be possible to distinguish from a regular Chrome browser.

The only vector left for blocking scrapers is some sort of navigational awareness, flagging behavior that deviates from the normal distribution, combined with awareness of the IP.

But this comes at a great cost: you hurt your own real visitors by taxing them with captchas or other annoyances.
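As a toy illustration of that kind of navigational signal (invented thresholds, in-memory store, nothing production-grade): flag IPs whose page-to-page timing is suspiciously regular, since fixed-delay crawlers are metronomic where humans are bursty.

    const visits = new Map(); // ip -> recent request timestamps (ms)

    function recordAndScore(ip, now = Date.now()) {
      const times = (visits.get(ip) || []).concat(now).slice(-50); // sliding window
      visits.set(ip, times);
      if (times.length < 10) return 0;

      const gaps = times.slice(1).map((t, i) => t - times[i]);
      const mean = gaps.reduce((a, b) => a + b, 0) / gaps.length;
      const variance = gaps.reduce((a, g) => a + (g - mean) ** 2, 0) / gaps.length;
      const cv = Math.sqrt(variance) / mean; // coefficient of variation of the gaps

      // Humans are bursty (high cv); a fixed-delay crawler is metronomic (low cv).
      return cv < 0.1 ? 1 : 0;
    }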
It is "easy" to block scraping. Make it very costly to scrape:<p>- Render your page using canvas and WebAssembly compiled from C, C++, or Rust. Create your own text rendering function.<p>- Have multiple page layouts<p>- Have multiple compiled versions of your code (change function names, introduce useless code, different implementations of the same function) so it is very difficult reverse engineer, fingerprint and patch.<p>- Try to prevent debugging by monitoring time interval between function calls, compare local time interval with server time interval to detect sandboxes.<p>- Always encrypt data from server using different encryption mechanisms every time.<p>- Hide the decryption key into random locations of your code (use generated multiple versions of the code that gets the key)<p>- Create huge objects in memory and consume a lot of CPU (you may mine some crypto coins) for a brief period of time (10s) on the first visit of the user. Make very expensive for the scrapers to run the servers. Save an encrypted cookie to avoid doing it later. Monitor concurrent requests from the same cookie.<p>The answer is that it is possible but it will cost you a lot.
If you want to detect if a human is visiting your site, open an ad popup with a big close button directly over the content.

A human being will always, 100% of the time, immediately close the popup. Automation won't care.
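If one wanted to actually measure that, a small sketch (with a made-up reporting endpoint) could be:

    function humanCheckPopup() {
      const overlay = document.createElement('div');
      overlay.style.cssText =
        'position:fixed;inset:0;background:rgba(0,0,0,.6);z-index:9999;';
      const close = document.createElement('button');
      close.textContent = 'Close';
      close.style.cssText = 'font-size:2em;margin:20vh auto;display:block;';
      overlay.appendChild(close);
      document.body.appendChild(overlay);

      const shownAt = Date.now();
      close.addEventListener('click', () => {
        overlay.remove();
        // A quick, deliberate close is weak evidence of a human; no close at all,
        // while the session keeps fetching pages, is weak evidence of automation.
        navigator.sendBeacon('/popup-closed', String(Date.now() - shownAt));
      });
    }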
It is impossible to make a headless and a normal browser send 100% indistinguishable traffic. The timing of the browser's requests is influenced by rendering, which will always differ between the two.
All those tests are useless and effective only against script kiddies (who by old standards are now something like 99.99999% of developers), people unable to code in anything but crappy languages like JS. For people who grew up with the web and are capable of coding in C/C++, those tests are a joke: I'll just modify the source code to return what is expected, and 'game over'. We were reversing DRM by disassembling and patching binaries; in a world of text-based protocols and scripts, the Idiocracy of today's world makes us invincible.