The listed techniques detect not only headless Chrome but also any custom browser built on CEF (Chromium Embedded Framework, https://bitbucket.org/chromiumembedded/cef), such as Kantu from https://a9t9.com.

If your goal is to allow only the original Google Chrome browser, that is fine. Otherwise this might cause false alarms.
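For illustration, here is a minimal sketch of one check in the family the comment describes (the plugin-count test is one of the article's techniques; treating an empty plugin list as "headless" is exactly what sweeps up CEF-based browsers, which don't ship Chrome's bundled plugins):

```typescript
// Runs in the page. Desktop Chrome ships bundled plugins (PDF Viewer,
// Widevine, ...), so navigator.plugins is non-empty there. Headless
// Chrome reports an empty list -- but so do CEF-based browsers such
// as Kantu, which is why this check produces false alarms.
function looksHeadless(): boolean {
  return navigator.plugins.length === 0;
}
```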
And it's possible to pretend not to be Chrome headless, too.

https://intoli.com/blog/making-chrome-headless-undetectable/
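In the spirit of the linked Intoli post, a sketch of the spoofing side. Using Puppeteer's evaluateOnNewDocument for the injection is an assumption about tooling (the post itself injects its scripts differently); the property overrides are the kind it describes:

```typescript
import puppeteer from 'puppeteer';

(async () => {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();

  // Patch the page before any site script runs, so detection code
  // sees "normal" values instead of headless giveaways.
  await page.evaluateOnNewDocument(() => {
    // Hide the automation flag.
    Object.defineProperty(navigator, 'webdriver', { get: () => undefined });
    // Report a non-empty plugin list and typical language settings.
    Object.defineProperty(navigator, 'plugins', { get: () => [1, 2, 3] });
    Object.defineProperty(navigator, 'languages', { get: () => ['en-US', 'en'] });
  });

  await page.goto('https://example.com');
  await browser.close();
})();
```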
I read these things and I think, "So much wasted energy and effort."

In the beginning was the web, and it was good. Content came along. Some was good, some was cats. Then paid sites with sign-up. Then search engines. Then ads.

Pretty soon folks thought, "I not only own this content, I own how it will be presented to the end user. If I choose to add in cats, or Flash ads, or whatnot? They're stuck consuming it. I own everything about the content from the server to the mind of the person consuming it, the entire pipe."

Many people did not like this idea. Ads were malicious; they installed malware. The practice of funding content with ads caused sites to track users like lab rats. Armies of psychology majors were hired to try to make the rats do more of whatever the site wanted them to do.

Ad blockers were born. Then anti-ad-blockers. Then headless browsers. Now anti-headless browsers.

It's just a huge waste of time and energy. The model is broken, and no amount of secret hacker ninja shit is going to make it work. You want to know where we'll end up? We'll end up with multiple VMs, each with a statistically common setup, each consuming content on the web looking just like a human doing it. (We'll be able to do that by tracking actual humans as they consume content.) But nobody will be looking at those VMs. Instead, those invisible screens will be read by image-recognition software, which will condense what's on them and send the results back to whoever wants it.

Content providers will never win at this. Nor should they. Instead, we're just going to sink billions into a busted-ass business model over the next couple of decades, throwing good money after bad.

</rant>
You probably want the web equivalent of malicious compliance: an algorithmically generated web-hole or similar. That way the bot author isn't entirely sure you're on to them; it could be a bot bug or a server error. For example: send the right headers but garbage data that looks like it's compressed but isn't, or doubly compressed garbage, or trim pages at a different place (before anything interesting), or slow the data transfers, or ...
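A rough sketch of the "headers say compressed, body isn't" variant, using Node's built-in http module. Everything here is illustrative; the port, chunk sizes, and the logic for deciding which requests get tarpitted are placeholders:

```typescript
import * as http from 'http';
import * as crypto from 'crypto';

// Tarpit: the response claims to be gzip, but the body is random
// bytes that will fail to decompress. From the bot author's side it
// is hard to tell whether the bot is broken or the server is flaky.
const server = http.createServer((_req, res) => {
  res.writeHead(200, {
    'Content-Type': 'text/html',
    'Content-Encoding': 'gzip',
  });

  // Dribble the garbage out slowly to also waste the scraper's time.
  const garbage = crypto.randomBytes(4096);
  let offset = 0;
  const timer = setInterval(() => {
    res.write(garbage.subarray(offset, offset + 256));
    offset += 256;
    if (offset >= garbage.length) {
      clearInterval(timer);
      res.end();
    }
  }, 500);
});

server.listen(8080);
```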
All web automation and automation prevention is a cat-and-mouse game where you never stop the scrapers; you just create more effort for them. It's like traditional and digital security in that regard, except that security often has an element of genuine difficulty to overcome (cryptography, the thickness of physical barriers), whereas stopping web scraping is about piling up trivial obstacles to make the process more complicated.

Eventually, human browsing and headless browsing converge. Nobody wants to make the human browsing experience bad, so the headless browsing continues.

In my opinion, if you're running a site that is existentially threatened by someone else having your content, you need something else for your moat.
This feels a bit like the "VMs aren't quite like real machines" problem: a cat-and-mouse game that will probably continue indefinitely.

Personally, as someone who regularly uses several different browsers and experiments with others, I wish the web were far more browser-neutral.
The whole point of using a headless browser is to work around web sites that attempt to block simple curl-style scraping (or where you need to execute JavaScript to scrape).

So making it detectable (intentionally, even, right there in the user agent!) is really absurd.

Or actually, it makes one wonder about Google's motives.
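For context, the self-identification is literal: headless Chrome's default user agent contains a "HeadlessChrome/<version>" token. A minimal sketch of removing it, assuming Puppeteer as the driver (the comment doesn't name one):

```typescript
import puppeteer from 'puppeteer';

(async () => {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();

  // The default UA looks like
  // "Mozilla/5.0 ... HeadlessChrome/<version> ...";
  // swapping the token back to "Chrome" drops the giveaway.
  const ua = await browser.userAgent();
  await page.setUserAgent(ua.replace('HeadlessChrome', 'Chrome'));

  await page.goto('https://example.com');
  await browser.close();
})();
```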
Is there a way to enable the Chrome PDF Viewer, Widevine Content Decryption Module, etc. in headless Chromium? Is there some switch in the Chromium code base that would enable that?
To every action there is always opposed an equal reaction...

https://intoli.com/blog/making-chrome-headless-undetectable/
Re: blocking scrapers: some of us are neither vast corporate-espionage practitioners nor zombie-botnet users; we're on our own, scraping for data science & other academic research purposes.

Is there some way to declare "I am a legitimate academic user", something akin to 'TSA Pre' status?

"Sure, register for & use the site's API," you'll say. What if they don't have one?

"Sure, just don't slam the server with too many requests in a short time," you'll say. But if they're rejecting you just because they detect you're headless, etc.?
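There's no standard "legitimate academic user" flag, but a common courtesy convention (an assumption that it helps; operators are free to ignore it) is to identify the project and a contact address in the User-Agent, so a site can whitelist you or get in touch instead of blocking outright. All names and URLs below are placeholders:

```typescript
// Polite-scraper sketch: a descriptive User-Agent with contact info.
// Pair this with respecting robots.txt and conservative rate limits.
// Uses the global fetch available in Node 18+ (or any browser).
async function politeFetch(url: string): Promise<string> {
  const res = await fetch(url, {
    headers: {
      'User-Agent':
        'ExampleResearchCrawler/1.0 (+https://example.edu/project; mailto:researcher@example.edu)',
    },
  });
  return res.text();
}
```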
For what it's worth, Dullahan, my headless SDK on top of the Chromium Embedded Framework, appears exactly the same as desktop Chrome:

Overview: https://bitbucket.org/lindenlab/dullahan/overview

Examples: https://bitbucket.org/lindenlab/dullahan/src/default/examples/?at=default

Not suggesting it's better or worse, just an alternative if you need something that appears to be a desktop browser.
This discussion is also happening on a counterpoint posted about 9 hours later, also currently on the front page:

It is not possible to detect and block Chrome headless | https://news.ycombinator.com/item?id=16179181
Worth noting, I believe: the word "block" doesn't appear in the article and seems to have been editorialized into the poster's title.