Ask HN: How to deal with GDPR / cookie notices in the context of a crawler?

65 points by mgliwka almost 7 years ago

To comply with the new European legislation, many websites put a GDPR / cookie consent notice in front of their content. There are different implementations of this. Some are implemented only as a modal covering the website or a bar at the bottom of the screen (in both cases right next to the original content), while others redirect the user to a totally different (sub-)domain or even hijack the request and show the consent form instead of the requested content (on the same URL with a 200 status code).

The latter ones present an issue for my crawler. I cannot access the content of the page without accepting those notices.

Things I'm considering to bypass those notices:

* US IP address (easy to implement, but some websites also display those notices to US IPs)

* Heuristics to detect those notices and accept them programmatically (takes some time to implement: while a couple of vendors (e.g. OneTrust) offer off-the-shelf solutions which are easy to identify and automate, there are also many custom-made solutions, so the system would need to understand the concept of a consent form and how to bypass it. Some forms only require pressing the right button, others involve checkboxes/radio buttons). To collect test data, one solution might be to visit a set of websites once with a US IP, once with an EU IP and/or with different user agents (browser or Googlebot).

Do you have any ideas how to approach this problem? Or are you even using some techniques already and are willing to share them?
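A minimal sketch of the detection half of the heuristic idea above, assuming the crawler already has the response body plus the requested and final URLs. The marker list, the thresholds, and the function names are illustrative assumptions, not a tested rule set; only OneTrust is taken from the post.

```python
import re
from urllib.parse import urlsplit

# Hypothetical marker list; tune it against real samples
# (for example pages fetched once from a US IP and once from an EU IP).
CONSENT_MARKERS = (
    "onetrust",          # vendor named in the post
    "cookie consent",
    "gdpr",
    "we use cookies",
    "accept all",
    "privacy settings",
)


def visible_text_length(html: str) -> int:
    """Rough size of the page text after dropping scripts, styles and tags."""
    html = re.sub(r"(?is)<(script|style)\b.*?</\1>", " ", html)
    text = re.sub(r"(?s)<[^>]+>", " ", html)
    return len(" ".join(text.split()))


def looks_like_consent_wall(html, requested_url, final_url,
                            max_text=1500, min_hits=2):
    """Guess whether a response is a consent form instead of the content.

    Flags two cases from the post: a redirect to a different host that
    mentions consent, and a short page on the same URL dominated by
    consent wording. Thresholds are assumptions.
    """
    lowered = html.lower()
    hits = sum(marker in lowered for marker in CONSENT_MARKERS)
    redirected = (urlsplit(requested_url).hostname
                  != urlsplit(final_url).hostname)
    short_page = visible_text_length(html) < max_text
    return (redirected and hits >= 1) or (short_page and hits >= min_hits)
```

A page flagged this way could then be routed to a vendor-specific handler or a manual review queue; a long article that merely carries a cookie bar stays below the threshold because of its text length.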

7 comments

jjcm over 6 years ago
Side question: how does HN feel about the cookie/GDPR notices in general? I personally feel that while I like the purpose they have, they just feel like spam at this point. I kind of expect most websites to use cookies, and if I didn't want them to, I'd probably block them with an extension. As for the GDPR notices, are these going to be persistent forever? It feels like the web did 5 years ago, except instead of Viagra ads I'm getting GDPR and cookie popups on every site now.

Overall I feel like the intent of these is correct, but the execution is terrible. I'd much rather have, say, a badge in the address bar of the browser (similar to the HTTPS badge) saying a site was GDPR compliant and used cookies than a popup everywhere.
ndarilek over 6 years ago
I have a related question. How do you bypass them in the context of an RSS reader/podcatcher? I was building a service to parse some podcast feeds into JSON, and noticed they were failing on NPR podcasts. Pulled up the URL fine on my laptop, but it failed in Hetzner.

Of course, it failed because it was getting some sort of GDPR page at the podcast feed URL. I'm wondering if there was some way around this, because it's not like podcatchers can opt into something via an RSS feed...can they? I'm pretty sure I passed headers only accepting feed content-types, but even that wasn't enough.

Sure I can host elsewhere, but I just didn't care enough about the project to do that. But if there's a way around this, then I might pick it up again.
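This does not get around the consent page, but a feed parser can at least notice it got something other than a feed instead of failing further down. A minimal sketch, assuming the service has the response's `Content-Type` header and raw body; the accepted media types and the function name are assumptions:

```python
import xml.etree.ElementTree as ET

# Media types and root elements treated as "a feed" here (an assumption;
# some servers label feeds differently).
FEED_TYPES = {
    "application/rss+xml",
    "application/atom+xml",
    "application/xml",
    "text/xml",
}
FEED_ROOTS = {"rss", "feed", "RDF"}  # RSS 2.0, Atom, RSS 1.0


def is_feed_response(content_type: str, body: bytes) -> bool:
    """Return True only if the response looks like an RSS/Atom document."""
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type not in FEED_TYPES:
        return False
    try:
        root = ET.fromstring(body)
    except ET.ParseError:
        return False
    # Strip an XML namespace such as "{http://www.w3.org/2005/Atom}feed".
    local_name = root.tag.rsplit("}", 1)[-1]
    return local_name in FEED_ROOTS
```

When this returns `False` for a URL that is known to be a feed, the service can log it as a probable consent interstitial rather than a broken feed.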
ddebernardy over 6 years ago
Manually accept (or reject) the tracking once, and then pass the relevant cookie as part of your crawler's request.
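A minimal sketch of that approach with Python's standard library. The cookie name and value are site-specific, so `consent=accepted`, the host table, and the user agent string below are placeholders to be replaced with whatever the browser stored after the manual click:

```python
import urllib.request
from urllib.parse import urlsplit

# Hypothetical store: host -> Cookie header value copied from a browser
# session after accepting (or rejecting) the notice by hand.
CONSENT_COOKIES = {
    "example.com": "consent=accepted",
}


def build_request(url, user_agent="example-crawler/0.1"):
    """Return a Request carrying the stored consent cookie for the URL's host, if any."""
    headers = {"User-Agent": user_agent}
    cookie = CONSENT_COOKIES.get(urlsplit(url).hostname or "")
    if cookie:
        headers["Cookie"] = cookie
    return urllib.request.Request(url, headers=headers)


# Usage (performs a network request, so it is left commented out):
# with urllib.request.urlopen(build_request("https://example.com/article")) as resp:
#     html = resp.read()
```

Consent cookies usually expire, so the stored values would need to be refreshed from time to time.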
eberkund over 6 years ago
Why does every website need to create its own UI for this? Whatever happened to that "Do not track" browser setting? This should be equivalent to rejecting all of these notices automatically.
Scaur over 6 years ago
Thanks for asking this question, I'd like to learn about this too.
dogma1138 over 6 years ago
Sounds like a possible use case for Mechanical Turk, for the sites that do a redirect popup and not just a foreground DOM object shown while the actual content loads behind it.
highace over 6 years ago
Don't use a server or IP based in Europe. Problem solved.