
How Google’s Web Crawler Bypasses Paywalls

640 points by elaineo about 9 years ago

44 comments

lloyddobbler, about 9 years ago
"Remember: Any time you introduce an access point for a trusted third party, you inevitably end up allowing access to anybody."

See also: http://www.apple.com/customer-letter/

:)

slig, about 9 years ago
If they're now blocking clicks *from* Google, doesn't that mean that they're cloaking and violating Google's Webmaster Guidelines [1]?

[1]: https://support.google.com/webmasters/answer/66355?hl=en

anewhnaccount2, about 9 years ago
If this is true, what WSJ is doing is called "cloaking" and should cause it to get de-indexed: https://support.google.com/webmasters/answer/66355?hl=en

eps, about 9 years ago
Correct me if I'm wrong, but wasn't there a long-standing Google policy that the version of a page served to their crawler must also be publicly accessible? That would be the reason why WSJ articles were accessible through the paste-into-Google trick, rather than because WSJ was incompetent and failed to "fix" the bypass.

So does this mean that Google will no longer index full WSJ articles, or does it mean a change in Google's policy?

zaroth, about 9 years ago
And congratulations, you have likely just "exceeded authorized access" and committed a felony violation of the CFAA, punishable by a fine or imprisonment for not more than 5 years under 18 U.S.C. § 1030(c)(2)(B)(i).

From the ABA: "Exceeds authorized access" is defined in the Computer Fraud and Abuse Act (CFAA) to mean "to access a computer with authorization and to use such access to obtain or alter information in the computer that the accesser is not entitled so to obtain or alter."

To prove you have committed this terrible felony, the FBI will now demand that Apple assist in disabling the secure enclave of your device in order to access your browser history. But remember, they only need to do this because they aren't allowed to MITM all TLS and "acquire" -- not "collect" -- every HTTP request your machine ever makes. </s>

mbroshi, about 9 years ago
Am I alone in feeling like this is akin to a tutorial on how to shoplift without getting caught? WSJ, for better or worse, does not want to give you content without your paying for it. If you take that content without paying, you are stealing. Just because you have figured out how to get past their security does not mean it's not stealing.

(See the second precept here: https://en.wikipedia.org/wiki/Five_Precepts)

mikemikemike, about 9 years ago
This is an odd debate. Let's say a restaurant declares "veterans eat free." This blog post is like a friend telling you, "Hey, if you tell this restaurant you're a vet, they'll give you a free meal." No one said it's legal or ethical. It's lying to trick someone into giving you something at their expense.

I think the relevant point, underscored by the author's last sentence, is that it doesn't matter who you open a back door for - it opens the possibility for anyone to barge through.

mangeletti, about 9 years ago
This is not meant to be purely controversial, but I thought long and hard about WSJ a few months ago when an HN mod (I always forget his name) said to stop complaining about paywalled links being posted, because paywalls were OK. I agree paywalls are OK. But some things are not OK.

Take a look, for instance, at the WSJ.com home page with an ad blocker turned on (note all the missing letters and scrambled-up titles). They want me to pay, and they want me to see ads, and they want to track my behavior? Should I send them my DNA also?

Organizations like WSJ are exactly the disease that causes ad blockers to proliferate and ruin the web for all the decent publishers. They're at war with my privacy (by breaking their site intentionally when I visit with a blocker on). They want it all: ads, tracking, your private data, and subscription revenue, not to mention...

# Agenda-Driven Content

I mean, we're basically talking about NBC or Fox here, just on the web. Imagine every morning when you woke up, you turned on the television and tuned to some "news" show. After talking about the weather, they start talking about a lost pickle that is thought to be potentially alive and moving about with free will. Over the next two years, talk about the same pickle extends to every other TV show. Before you know it, everybody in the nation is talking about the same pickle. Years go by, and that pickle has become a part of our society - not because people are born with an innate care for the well-being of pickles, but because "news" shows taught them to care.

That's not a good position to be in. I have to believe I'm not the only one in here who doesn't watch any TV. So why do we all treat the same media giants differently on the web? We crave their content so much that we build browser add-ons to get to it, etc.

metafunctor, about 9 years ago
I'm pretty sure Google will soon stop indexing WSJ. Why index something if the vast majority of users cannot access the pages behind the links?

EDIT: The "paste a headline into Google" trick still works for me, though. If that continues to be the case, they will keep indexing, of course.

sylvinus, about 9 years ago
Well, that trick won't last long either. It's trivial to verify that an IP indeed belongs to Google:

https://support.google.com/webmasters/answer/80553?hl=en

kenshaw, about 9 years ago
Basically, the article is saying to change the User-Agent to GoogleBot or Bing or whatever other crawler UA you'd prefer. While that's doable, it's easily detectable and preventable, as all of the big crawlers can be validated against DNS.

Additionally, I would like to point out that I wrote a Varnish extension for the express purpose of validating User-Agent strings through DNS lookups; it is available here: https://github.com/knq/libvmod-dns

It was built because we had a problem with bad bots crawling a large site (multiply.com), and this was one of the easiest ways to filter the bad bots from the good and to enforce robots.txt policies on a per-bot basis. It works very well, as you can do any kind of DNS caching internally and prevent this kind of behavior, if that's your goal.

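A minimal sketch of the forward-confirmed reverse DNS check described above, using only Python's standard library. The helper name is ours, and the accepted hostname suffixes follow Google's published crawler-verification guidance:

```python
import socket

# Hostname suffixes Google documents for its crawler hosts.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")

def is_real_googlebot(ip: str) -> bool:
    """Forward-confirmed reverse DNS: the PTR record must point into
    Google's domains, and resolving that host must yield the same IP."""
    try:
        host, _, _ = socket.gethostbyaddr(ip)            # reverse (PTR) lookup
        if not host.endswith(GOOGLE_SUFFIXES):
            return False
        _, _, addresses = socket.gethostbyname_ex(host)  # forward lookup
        return ip in addresses                           # must round-trip
    except (socket.herror, socket.gaierror):
        return False

# Example IPs; actual results depend on live DNS.
print(is_real_googlebot("66.249.66.1"))   # True for a genuine Googlebot host
print(is_real_googlebot("203.0.113.9"))   # False: PTR won't confirm
```

A spoofed User-Agent passes no such check, since the client's IP rather than its headers drives the lookup; in production you would cache the verdict per IP, which is essentially what the Varnish module above does inside the cache layer.
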
matt_wulfeck, about 9 years ago
I like WSJ, but I only read maybe one article every other day. They need a more reasonable price point, especially since the market will bear almost no price at all.

That being said, I do enjoy their content, save for maybe the op-eds.

jrochkind1, about 9 years ago
I thought Google specifically disallowed returning different pages based on a User-Agent targeting Googlebot, and that this included paywalls.

Are they running afoul of Google policies and going to get pinged by Google?

I can't find the text from Google now (when can you ever find any docs at Google?), but I am very certain I remember reading from them that you may not return different content to GoogleBot based on User-Agent.

crazysim, about 9 years ago
Doesn't this kind of also hurt SEO? I would guess Google has some automated system to detect and apply a negative signal to sites that provide different content to a Googlebot user agent than to a non-Googlebot user agent. I guess these sites are counting on the other signals outweighing that negative hit.

Otherwise, why would expertsexchange be obligated to provide the answers at the very bottom? Did something change?

Gratsby, about 9 years ago
If you hit a paywall or a "sign up to access this content" message from a Google search result, report it. Google will remove them from the search results, they will lose their largest traffic source, and they will address the issue. Or they won't, because they have enough paying customers.

zem, about 9 years ago
I thought of doing that when the "search Google" trick stopped working, but I decided it crossed the point where I would feel like I was unfairly circumventing their clear desire not to serve me the content. I've just added WSJ to my mental ignore list and count it as a few more minutes gained to do something else.

jdunck, about 9 years ago
If Google (or any other crawler) wanted to play nice with paywalls, they could issue a public key for their bot and put a signature in the User-Agent string that the domain could then verify.

Those signatures could obviously leak, but on a per-domain basis. Perhaps the domains could have a secure way of bumping the valid key generation if they had a leak.

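A rough sketch of that idea, assuming Ed25519 signatures via the third-party cryptography package. The token format (crawler name, target domain, and timestamp signed by the crawler's private key) is invented here purely for illustration, not any scheme Google actually ships:

```python
import time

from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

# Crawler side: sign a per-domain, time-limited token into the User-Agent.
crawler_key = Ed25519PrivateKey.generate()
public_key = crawler_key.public_key()  # published by the crawler operator

def make_user_agent(domain: str) -> str:
    token = f"googlebot|{domain}|{int(time.time())}".encode()
    signature = crawler_key.sign(token)
    return ("Mozilla/5.0 (compatible; Googlebot/2.1) "
            f"sig={token.hex()}.{signature.hex()}")

# Site side: verify the signature against the crawler's published public key.
def verify_user_agent(ua: str, domain: str, max_age: int = 300) -> bool:
    try:
        token_hex, sig_hex = ua.rsplit("sig=", 1)[1].split(".")
        token = bytes.fromhex(token_hex)
        public_key.verify(bytes.fromhex(sig_hex), token)  # raises on a bad sig
        _, claimed_domain, timestamp = token.decode().split("|")
        return claimed_domain == domain and time.time() - int(timestamp) < max_age
    except (InvalidSignature, ValueError, IndexError):
        return False

ua = make_user_agent("example.com")
print(verify_user_agent(ua, "example.com"))  # True
print(verify_user_agent(ua, "other.com"))    # False: token bound to one domain
```

Binding the token to one domain and a short validity window limits the blast radius of a leaked signature to a single site, which lines up with the per-domain leak concern above.
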
mchahn, about 9 years ago
Bypassing the paywall is more unethical than blocking ads. It is one thing to have control over your own browser, but another to steal something from another site.

Also, isn't it illegal to bypass computer security?

hueving, about 9 years ago
Based on the comments here, am I to understand that by constantly browsing the web with my user agent string set to a Googlebot string, I am committing a felony? How would I even know which sites I'm gaining unauthorized access to?

It is completely idiotic if there is a string you can put in a Mozilla browser config that is literally illegal to browse the web with.

chrishn, about 9 years ago
> Remember: Any time you introduce an access point for a trusted third party, you inevitably end up allowing access to anybody.

*cough* NSA *cough*

ikeboy, about 9 years ago
New workaround: paste the article title into archive.is. I don't know what they're doing, but they have a workaround of some sort.

jgh, about 9 years ago
I just tried clicking on "Harper Lee, Author of ‘To Kill a Mockingbird,’ Dies at Age 89" from wsj.com's homepage and got the paywall.

I then pasted the headline into Google, clicked on it from the Google results, and did not get hit by the paywall.

GigabyteCoin, about 9 years ago
I was under the impression that the "hack" whereby you searched for the article on Google and clicked through to it (effectively skipping over the paywall) was a demand of Google's, not an oversight by the paywalled website.

I thought that Google deemed search results that led to paywalls a "bad experience" for its search users, and would penalize websites for providing them.

Is this no longer the case?

tete, about 9 years ago
Doesn't Google usually try to punish websites that show users something different, and doesn't it even mention that somewhere?

Not an SEO expert here, but I wonder how and whether Google will end up handling this. I mean, making an exception could also be considered an abuse of power in some countries of the world. I don't have a strong opinion on it yet; I only say that because of how the EU has exercised certain laws in recent years.

Illniyar, about 9 years ago
Aren't you supposed to verify whether a visitor is a Googlebot by a reverse lookup of the IP address? I.e.: https://support.google.com/webmasters/answer/80553?hl=en

User-agents are notoriously unreliable.

philip1209, about 9 years ago
I wonder how many Google Cloud customers use those servers to run spoofed Googlebot crawlers from the Google IP range in order to bypass paywalls and scrape large sites (like LinkedIn) without hindrance.

0xCMP, about 9 years ago
It's broken already. I tried to access an article about new China rules for online news and it paywalled me. They're probably looking for clients coming from googlebot.com now.

mikestew, about 9 years ago
So does HN now choose not to post articles from the WSJ? I was comfortable with the "google it" trick, and frankly was a little annoyed by the constant "paywall, wah!" comments when what should by now be a well-known workaround was available. But that workaround no longer works.

coverband, about 9 years ago
My Windows anti-virus automatically deletes the linked sample code upon download, marking it as "Trojan:Win32/Spursint.A". Did anyone else have the same experience? (I was actually more interested in using it as a template for writing a simple Chrome extension.)

mildweed, about 9 years ago
Solution: content providers register for a (yet-to-be-written) Google News API account and get an API key, with which Google indexes the site and which the site recognizes as legit.

jasonwilk, about 9 years ago
I've noticed that this has stopped working on WSJ if you've already hit the paywall and then try to Google the article to bypass it.

f137, about 9 years ago
I wonder if anybody has tried to do as suggested? I copied the files into Chrome as per the instructions, and the paywall was still in place.

warrenmar, about 9 years ago
You can also access WSJ for free at the library.

jupp0r, about 9 years ago
It's not bypassing at all. Google's crawlers are deliberately let in, because a paywall that nobody runs into is useless.

chinathrow, about 9 years ago
So soon they'll have to block anyone with a fake Google UA and whitelist the well-known 66.249 IP range. Trivial.

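That whitelist is a few lines with Python's standard ipaddress module. The 66.249.64.0/19 block below is the range Googlebot has historically crawled from, so treat the exact CIDR as an assumption to verify against Google's own documentation; reverse DNS, as discussed above, remains the more robust check:

```python
import ipaddress

# Googlebot's historically documented crawl range; verify before relying on it.
GOOGLEBOT_NET = ipaddress.ip_network("66.249.64.0/19")

def in_googlebot_range(ip: str) -> bool:
    return ipaddress.ip_address(ip) in GOOGLEBOT_NET

print(in_googlebot_range("66.249.66.1"))   # True: inside the published block
print(in_googlebot_range("198.51.100.7"))  # False: fake UA from elsewhere
```
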
yyin, about 9 years ago
Does WSJ check visits from a Googlebot UA against a list of known Google IP addresses?

amelius, about 9 years ago
Fix: replace the user agent string with a cryptographic challenge/response scheme.

pmontra, about 9 years ago
They'll start allowing only the IP addresses that search engines have agreed on with them.

daveheq, about 9 years ago
Is this possible in Firefox? Some people won't use Chrome.

spitfire, about 9 years ago
Is there a version of this available for Safari?

systemz, about 9 years ago
So their next move is to check whether the IP is from Google.

throwaway21816, about 9 years ago
> Archaic news source does something to hurt their market penetration on the internet

Great idea here, guys.

dude_abides, about 9 years ago
Or simply use incognito mode and click on the Google search result.

obelisk_, about 9 years ago
1. Google's web crawlers are not "bypassing" the paywall. It's the paywall that lets crawlers through - i.e., exactly the reverse of what the author implies with the headline.

2. The idea that this is somehow new is wrong. The way for a server to identify crawlers has "always" been to look at the user-agent and, when done right, the IP, verified either by net block owner or by doing a PTR lookup and then checking that the A or AAAA record for the claimed host points back at the same IPv4 or IPv6 address. Meanwhile, I do agree that paywalling is a more recent phenomenon, at least with regard to the extent it is popular among sites today, *but* the concept of presenting different data to crawlers and visitors arose much earlier; Google has been aware of it and has made sure to delist such sites when found, and in fact Google has since moved a bit in the direction of allowing it, in that they permit it for Google News if declared, as explained by others ITT.

So in my view, it seems that the author is jumping to incorrect conclusions based on an incomplete understanding of what's actually going on here. What about the HN readership, then: how come this article became so highly voted, and why don't I see these issues raised by anyone else? Or maybe I'm just crazy?