Ask HN: Can I get in trouble for crawling using the Googlebot user agent?

44 points by goferito about 8 years ago

A lot of sites have IP crawl restrictions, but add exceptions for Googlebot. Could Google or the crawled site legally do something when they find out?

19 comments

CobrastanJorji about 8 years ago

I feel like it's a good thing to maintain a certain level of professional ethics, and, while it depends on the specifics of the situation, I'd suggest that falsely claiming to third parties to be something you aren't, in order to do something they don't want you to do, generally falls short of that ethical bar.

Say your bot misbehaves and effectively starts DoSing a site with a whole lot of pages, like a small Reddit clone or something. And say Reddit doesn't have another way to distinguish between your bot and the Googlebot. You have now put Reddit in a position where they have to either block the Googlebot (and possibly lose a huge pile of money in the process) or else buy up a lot more hardware and bandwidth to pay for your crawler as well. That's not cool, to put it bluntly.

awinter-py about 8 years ago

I'm not a lawyer and this isn't legal advice, but my instinct is you won't get in trouble.

Most important argument: the Chrome user-agent contains the word 'mozilla'. Obviously (we argue) Google doesn't intend these strings to be accurate; they are instead some kind of compatibility mark.

Are you committing a trademark violation? Given the nature of trademarks, it's not clear that you are.

Are you misrepresenting yourself to the site in a way that violates the CFAA? This is probably your biggest area of risk. But you can argue the site is giving away information to Google, a company whose slogan until recently was 'free the world's information'. Therefore they weren't taking plausible steps to secure the information you've scraped.

mootothemax about 8 years ago

It depends on what and how you're trying to crawl. It's trivial to verify a "true" Googlebot using reverse DNS: https://support.google.com/webmasters/answer/80553?hl=en

I know of a few sites that use this as the first step (of many!) to add bots to their "naughty" list.
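For readers wondering what that reverse DNS check might look like on the site's side, here is a minimal sketch in Python. It assumes the procedure from the linked Google help page: reverse-resolve the client IP, check that the hostname is under googlebot.com or google.com, then forward-resolve the hostname and confirm it maps back to the same IP. The function name `is_verified_googlebot` and the injectable `reverse_lookup` / `forward_lookup` parameters are hypothetical, added only so the logic can be exercised without network access. It is IPv4-only as written.

```python
import socket

# Domains the linked Google help page names for genuine crawler hostnames.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def is_verified_googlebot(ip, reverse_lookup=None, forward_lookup=None):
    """Return True if `ip` passes a reverse-then-forward DNS check.

    `reverse_lookup` and `forward_lookup` are hypothetical hooks for testing;
    by default the standard library resolver is used.
    """
    if reverse_lookup is None:
        reverse_lookup = lambda addr: socket.gethostbyaddr(addr)[0]
    if forward_lookup is None:
        forward_lookup = lambda host: socket.gethostbyname_ex(host)[2]

    try:
        host = reverse_lookup(ip)
    except OSError:
        # No PTR record or resolver failure: treat as unverified.
        return False

    # The leading dot in each suffix prevents "evilgooglebot.com" from matching.
    if not host.lower().rstrip(".").endswith(GOOGLE_SUFFIXES):
        return False

    try:
        # Forward-confirm: the claimed hostname must resolve back to the same IP.
        return ip in forward_lookup(host)
    except OSError:
        return False
```

A crawler that only spoofs the user agent fails the first step, because it cannot control the PTR record of the address it connects from.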

cube00 about 8 years ago

Sure, go join the realms of shady SEOs and malware. If I really want to stop you, I'll know you're not coming from a Google IP range: https://www.incapsula.com/blog/was-that-really-a-google-bot-crawling-my-site.html

However, consider what your ultimate end game is. If it's a website you expect visitors to find through Google or the Play store, good luck once webmasters start reporting your misbehaving "Googlebot" crawler.

beejiu about 8 years ago

I cannot comment on the legal aspects, but the Chrome user-agent contains "like Gecko", "AppleWebKit" and "Safari". It is common for user-agents to be constructed like this for compatibility. (Mostly for historical reasons.)
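To make the compatibility point concrete, here is a small Python sketch that lists which well-known product tokens appear in a Chrome user-agent string. The specific string in `CHROME_UA` is an assumed example of a typical desktop Chrome user agent from around the time of this thread, and `tokens_present` is a hypothetical helper, not part of any library.

```python
# Assumed example of a typical desktop Chrome user-agent string.
CHROME_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
)

# Product tokens that naive substring sniffing might look for.
TOKENS = ("Mozilla", "AppleWebKit", "Gecko", "Chrome", "Safari", "Googlebot")


def tokens_present(user_agent, tokens=TOKENS):
    """Return the tokens that occur as substrings of the user-agent string."""
    return [token for token in tokens if token in user_agent]


print(tokens_present(CHROME_UA))
# → ['Mozilla', 'AppleWebKit', 'Gecko', 'Chrome', 'Safari']
```

A single browser matches five different product names, which is why a user-agent string is better read as a list of compatibility hints than as a statement of identity.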

matt4077 about 8 years ago

Depends on the jurisdiction. In the US, the answer is "you really don't want to find out".

In my home country, it's actually quite interesting: fraud usually requires (a) a lie (conveying wrong information with intent), (b) a financial cost to the other party, and (c) a financial gain for you.

It's debatable at that level already, because their loss is rather hard to quantify, and probably small. Plus, I believe your financial gain must be directly related to their cost.

And, finally, you actually have to lie to a human being. Lying to a machine doesn't qualify. There was a guy who earned some five-digit euro amount by producing fake bottles and feeding them into deposit machines: no crime!

d2p about 8 years ago

What happens if you put "(not Googlebot)" on the end of your user agent?

riceo100 about 8 years ago

Maybe "Googlebot" is a trademark, or maybe you are violating the usage terms the crawled sites have put in place by masquerading as it... Could you get into trouble? _MAYBE_? Seems like a stretch in practice though. I've come across people doing this to sites I've been an admin of relatively often, and unless you're crawling with enough intensity to cause a DoS or doing something nefarious with the content, most site owners would maybe roll their eyes and move on.

taftster about 8 years ago

If you crawl a site, index it, and then use that for commercial purposes -- all while using Google's trademark to crawl -- yes, you'll probably get a letter from Google.

As for the site owner, it's on them to decide what to do with your traffic. HTTP is an open protocol and extensible. You could send almost anything in your request, as allowed by the protocol. The site owner has opened their service to the HTTP protocol and it's on them to decide what to do with your traffic.

terminalcommand about 8 years ago

Are you only crawling or also scraping the website?

If the sites in question only add an exception for Googlebot and not other crawlers (e.g. Yahoo, Bing, etc.), I would say that it is against the site owner's consent.

However, if the site owner also adds this exception for other crawlers, you could argue that the site owner's intent of only allowing certain crawlers has not been made explicit. In that case you'd have a chance against the claims from the site's owner.

On the other hand, Google could possibly sue you for using the user-agent "Googlebot".

The important question here is: would they? If you stay under the radar, no one, not even the courts, would bother.

PS: I am only a law student, and I am not familiar with any laws/regulations/precedents governing this specific issue. I think from the site owner's perspective it's a grey area. From Google's perspective, brands and IP are established concepts in law. This is a student's very personal opinion at first sight, take it with a grain of salt :).

dbg31415 about 8 years ago

May violate a site's TOS... but I don't think you'd ever get in any real trouble... most you'd get is a cease and desist letter... have to waste some time with lawyers... But I think it's on them to block you at the IP level if you are violating the TOS / causing them grief. And look... if a crawl causes them grief then they really need to invest more in DevOps. (Please do what you can to encourage more companies to invest in DevOps!)

"I left my door unlocked and told my friends they could use my living room, but then they put their feet up on my coffee table... Not cool, man!" Pretty much the equivalent situation.

syrrim about 8 years ago

One thing they could do is tell you to stop. If they have told you to stop, and taken measures to block you out (blocking crawlers besides Google), then persisting is illegal. I believe somebody got successfully sued by Facebook for continuing to scrape after Facebook told them to stop. I'm not sure about the legality if you haven't been explicitly asked to stop, but as long as you are never blatant enough for them to notice, you shouldn't have any trouble.

Edmond about 8 years ago

You won't get in trouble, but if the site uses products from the likes of Akamai (Bot Manager) or Shape Security then you'll probably be blocked.

mightytightywty about 8 years ago

Since when has it ever been illegal to claim you're someone or something that you're not on the Internet? This is legal without question.

alxmdev about 8 years ago

I wonder if most major sites that whitelist Googlebot also have exceptions for Slurp, Bingbot, and other major search engine bots. If not, then it would be interesting to know how these other companies deal with it, or if they just politely back off.

fbomb about 8 years ago

I handle a number of sites that require a login but need Google to index their content. I verify that Googlebot requests are actually coming from a domain owned by Google. I can't imagine that I'm the only one doing that.

jasonkostempski about 8 years ago

Not saying you should live your life in irrational fear, but even if they just think they can do something about it, that would be enough to mess up your life significantly while having no effect on them at all.

mdekkers about 8 years ago

To properly allow the Googlebot to crawl your site, you usually combine checking for the Googlebot user agent with an IP whois lookup. This is also what Google recommends.
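As a rough illustration of combining the two signals, here is a Python sketch that accepts a request as Googlebot only if the user agent claims to be Googlebot and the source address falls inside a known network. Note the substitution: instead of a live whois lookup, it checks membership in a locally configured CIDR list, which is a different technique with a similar goal. The range in `ASSUMED_GOOGLEBOT_NETWORKS` is an assumption for illustration only and should be replaced with whatever ranges you have verified yourself; `looks_like_real_googlebot` is a hypothetical name.

```python
import ipaddress

# Assumption: illustrative range only. Replace with ranges you have verified.
ASSUMED_GOOGLEBOT_NETWORKS = [ipaddress.ip_network("66.249.64.0/19")]


def looks_like_real_googlebot(user_agent, remote_ip,
                              networks=ASSUMED_GOOGLEBOT_NETWORKS):
    """Return True only if both the user agent and the source IP check out."""
    # Step 1: the request must claim to be Googlebot at all.
    if "googlebot" not in user_agent.lower():
        return False

    # Step 2: the source address must parse and sit inside a trusted network.
    try:
        addr = ipaddress.ip_address(remote_ip)
    except ValueError:
        return False
    return any(addr in net for net in networks)
```

With this shape, a spoofed user agent coming from an unrelated address is rejected at step 2, so it can be handled by the normal crawl restrictions rather than the Googlebot exception.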

oliv__ about 8 years ago

If I were going to do this, the last place I'd ask this question is Hacker News. But that's just me.