To comply with the new European legislation, many websites put a GDPR / cookie consent notice in front of their content. There are different implementations of this. Some are implemented only as a modal covering the page or a bar at the bottom of the screen (in both cases the original content is still served alongside the notice), while others redirect the user to a totally different (sub-)domain or even hijack the request and show the consent form instead of the requested content (on the same URL with a 200 status code).<p>The latter ones present an issue for my crawler: I cannot access the content of the page without accepting those notices.<p>Things I'm considering to bypass those notices:<p>* US IP address (easy to implement, but some websites also display those notices to US IPs)<p>* Heuristics to detect those notices and accept them programmatically. This takes some time to implement: while a couple of vendors (e.g. OneTrust) offer off-the-shelf solutions that are easy to identify and automate, there are also many custom-made solutions, so the system would need to understand the concept of a consent form and how to bypass it. Some forms only require pressing the right button, others involve checkboxes/radio buttons. To collect test data, one option might be to visit a set of websites once with a US IP and once with an EU IP, and/or with different user agents (browser or Googlebot).<p>Do you have any ideas how to approach this problem? Or are you already using some techniques and willing to share them?
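For the detection half of the heuristics idea, here is a rough Python sketch of how a fetched page might be flagged as a consent wall. Everything in it is an assumption for illustration: the function name `classify`, the vendor marker strings, the keyword list and the thresholds are guesses that would need tuning against real samples (for example the US/EU comparison data mentioned above). It only detects the wall; it does not accept anything.

```python
import re
from urllib.parse import urlsplit

# Assumed markers: substrings often seen in OneTrust-based pages.
# Extend this tuple from samples you actually collect.
VENDOR_MARKERS = ("onetrust", "cookielaw.org")

# Assumed vocabulary that dominates a consent form.
CONSENT_WORDS = re.compile(r"\b(consent|cookies?|gdpr|privacy|partners)\b", re.I)

# Strips script/style blocks first, then any remaining tag.
TAG = re.compile(r"<(script|style)\b.*?</\1>|<[^>]+>", re.S | re.I)


def visible_text(html):
    """Very rough tag stripper; good enough for counting words."""
    return " ".join(TAG.sub(" ", html).split())


def classify(requested_url, final_url, html, max_words=150, min_hits=3):
    """Guess whether `html` is a consent wall instead of the requested page.

    A wall replaces the content, so it tends to be a short page full of
    consent vocabulary, possibly served from a different host. A banner on
    top of a real article has the same words but far more other text.
    """
    text = visible_text(html)
    hits = len(CONSENT_WORDS.findall(text))
    redirected = urlsplit(requested_url).hostname != urlsplit(final_url).hostname
    lowered = html.lower()
    vendor = next((m for m in VENDOR_MARKERS if m in lowered), None)
    is_wall = hits >= min_hits and (redirected or len(text.split()) <= max_words)
    return {"consent_wall": is_wall, "vendor": vendor, "redirected": redirected}
```

The `vendor` field is reported separately on purpose: a known vendor string on a normal page usually just means a banner over real content, while the same string on a flagged page tells you which vendor-specific automation to try first.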
Side question - how does HN feel about the cookie/GDPR notices in general? I personally feel that while I like the purpose they have, they just feel like spam at this point. I kind of expect most websites to use cookies, and if I didn't want them to, I'd probably block them with an extension. As for the GDPR notices, are these going to be persistent forever? It feels like the web did 5 years ago, except instead of Viagra ads I'm getting GDPR and cookie popups on every site now.<p>Overall I feel like the intent of these is correct, but the execution is terrible. I'd much rather have, say, a badge in the address bar of the browser (similar to the HTTPS badge) saying a site is GDPR compliant and uses cookies than a popup everywhere.
I have a related question. How do you bypass them in the context of an RSS reader/podcatcher? I was building a service to parse some podcast feeds into JSON, and noticed it was failing on NPR podcasts. The URL loaded fine on my laptop, but the request failed from my Hetzner server.<p>Of course, it failed because it was getting some sort of GDPR page at the podcast feed URL. I'm wondering if there is some way around this, because it's not like podcatchers can opt into something via an RSS feed...can they? I'm pretty sure I sent headers accepting only feed content types, but even that wasn't enough.<p>Sure, I can host elsewhere, but I just didn't care enough about the project to do that. But if there's a way around this, then I might pick it up again.
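This doesn't get around the interstitial, but a feed parser can at least notice that it was handed an HTML page instead of a feed and report that clearly rather than failing somewhere deep in the parsing. A minimal sketch using only the standard library; the helper name `is_feed` and the set of accepted root elements are my own choices, not anything the feed publisher documents:

```python
import xml.etree.ElementTree as ET

# Root element names for RSS 2.0, Atom and RSS 1.0 (RDF) documents.
FEED_ROOTS = {"rss", "feed", "RDF"}


def is_feed(body):
    """Return True if `body` parses as XML with a feed-like root element.

    Consent pages are usually HTML, which either fails to parse as XML or
    parses with an `html` root, so both cases come back as False.
    """
    try:
        root = ET.fromstring(body)
    except ET.ParseError:
        return False
    # Tags look like "{namespace}name" when a default namespace is set.
    local_name = root.tag.rsplit("}", 1)[-1]
    return local_name in FEED_ROOTS
```

Checking the body rather than trusting the `Content-Type` header is deliberate, since the consent page in this case is returned at the feed URL itself.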
Why does every website need to create its own UI for this? Whatever happened to that "Do Not Track" browser setting? Enabling it should be equivalent to rejecting all of these notices automatically.
Sounds like a possible use case for Mechanical Turk, at least for the sites that redirect to a consent page rather than just placing a foreground DOM element over the actual content loaded behind it.