Matt Cutts is looking for scraper sites

428 points by adamlj about 11 years ago

33 comments

spindritf about 11 years ago
It's a funny quip, but it's getting more attention than the important piece of news it highlights. Google is finally doing something about scraper sites doing better in search results than the original creators. Good.

Many people don't write for money, to put ads on their website, or as part of some "content marketing" campaign. All they want is a little recognition. A boost in positioning on the SERP means we will be getting useful stuff at no cost.

And there are genuine replies there. Ryan Jones[1] even got the scrapers to confess their sins[2].

[1] https://twitter.com/RyanJones/status/439123533349015553

[2] https://www.google.com/search?q=%20%22istwfn%22+%22stole+this+word+from+noslang%22

VikingCoder about 11 years ago
Scrapers lift the full content, wholesale, without attribution.

You may as well just show http://images.google.com and complain that it's scraping. Or http://news.google.com.

In general, do you think Wikipedia gets more traffic because Google exists, or do you think Google gets more traffic because Wikipedia exists? Meaning, which effect is larger? I'm pretty sure the answer to this is obvious.

And if more scrapers donated millions to the sites they scrape from, the world would be a much better place.

http://wikimediafoundation.org/wiki/Press_releases/Wikimedia_Foundation_announces_$2_million_grant_from_Google

xuki about 11 years ago
This is pretty funny: https://twitter.com/danbarker/status/439125570115223552

jjoonathan about 11 years ago
Bah, what would I possibly need with a scraped definition that

1) Hasn't been chunked into 20 pieces of varying grammatical structure which are automatically matched to corresponding questions
2) Hasn't been subsequently pasted over a slideshow of completely irrelevant stock photos in bold, white font
3) Isn't accompanied by a grid of ~30 vaguely related questions helpfully linked to similar pages and tastefully decorated with more irrelevant stock photos
4) Only occupies ~1.5 rather than 3 or 4 of the front page search results
5) Contains only closely related textual ads rather than a melange of casino, fast food, and online college banners
6) Has fewer than 25 trustworthy stock faces smiling back at me from any given scroll position

If this is the best Google can do then I don't think wiki.answers.com has anything to fear.

------------

Seriously, how the hell does wiki.answers.com manage to pollute half of the searches I make with their algorithmically generated garbage (multiple times, at that)?! What kind of SEO catapulted them to the top despite 0 viewer retention and what surely must be about 0 reputable backlinks? How haven't they been sent to the 1000th page with manual penalties already? They show up before Wikipedia itself, for crying out loud!

Google, if you aren't going to let users maintain a manual blacklist, you need to be on top of this kind of thing. It's seriously degrading my search experience and I suspect I'm not alone. This kind of inattention is the type of thing that can push even the most inattentive users to change default search engines.

pud about 11 years ago
Wikipedia's database is public and used by Google with permission. You can probably use it for your projects, too.

So this is neither scraping, nor against the rules.

Here are dumps in SQL and XML format:

http://dumps.wikimedia.org/enwiki/

PS: Yes, the original post was meant to be funny and it was; I do have a sense of humor. :)

_wmd about 11 years ago
Cue damage control explaining the indiscernibly subtle difference between what Google does and what these evil, spammy scraper sites are doing

level09 about 11 years ago
Google is taking cognitive dissonance to a new level: it's okay for them to scrape every single site, download its content and images, cache it on their servers, and run their ad platform on top of it. But that's not enough; they would still like to impose their rules and punish people who do the same thing.

fear91 about 11 years ago
Google seems less and less connected to reality the bigger they grow.

It's a shame that the search engine market share isn't split evenly among several different engines. I think it would be beneficial both to the users and website owners. Right now everyone tries to court Google and they seem to do whatever the fuck they want.

smoyer about 11 years ago
There are lots of places where Google decides to "help" me, but sometimes I just want search results. Other times, I actually like getting the curated content (e.g. search for "delta 3810"). Is there a way to disable this?

EDIT: I should also note that I'm one of those who switched over to DuckDuckGo for privacy reasons, so I don't see these results as often now.

k-mcgrady about 11 years ago
I'd love to see a response from Matt to that. If they think the Wikipedia article is the most important result, and they will scrape it and put it at the top, why not just put the Wikipedia article as the top link and leave out the Google box?

300bps about 11 years ago
I wonder how Google chooses which Wikipedia articles they scrape and which ones they don't.

In testing, they definitely don't seem to scrape every article:

http://i.imgur.com/ujDqZhB.png

habosa about 11 years ago
Hey guys, you know this is meant to be humorous, right? I honestly can't believe that people here are saying Google is a scraper site and complaining about "hypocrisy". No more caching! When I search Google I want them to freshly crawl the web and get back to me in a day or two with my results.

</rant>

ITB about 11 years ago
Google is most certainly crossing the line here.

1. They are not only doing this with Wikipedia, but with many, many sites: "what is the smallest cell in the human body", "what is the biggest planet in the solar system".
2. The sites they choose to link are not always the highest quality sites, as in the two examples above. Why are these websites being featured?
3. Many times, the user will get their answer right then and there, and be done with the search process. The site misses a visitor. Even though the answers to these types of questions are "facts", someone took the time to organize and give context to these "facts". Turning facts into useful, consumable content costs money. Google should not be taking visitors away from these sites.
4. There should be public information on the CTR of these snippets. See if it helps or hurts the user.
5. Google is abusing its power as a major search engine to reinforce structuring rules, such as microformats. With these rules, webmasters are giving more and more semantic meaning to their content, which means Google has an easier time completing their knowledge graph. They might link to the source site for a while, but there is no good argument for linking back to Wikipedia to attribute the fact that Jupiter is the largest planet, since it's a fact, just like 2+2 is 4 (no attribution).
6. Google is all about ML/NLP/AI driven knowledge. But in reality they are turning all of the internet content creators into a giant sweat shop for their knowledge graph. This is not fair, and sooner or later it will come back to bite them.

higherpurpose about 11 years ago
Google should be *very* careful with this. They don't want someone in power to get an idea like "wait a minute... isn't Google mining whole websites too and *profiting* from it? Maybe we should do something about that!"

altcognito about 11 years ago
Scraper sites usually don't reference the source material, but yeah, you might want to get some ice for that burn.

Angostura about 11 years ago
Of course, it's not just Wikipedia these days. Movie theatres etc. all suffer from it.

nkuttler about 11 years ago
Indeed. Google is constantly doing things they punish other people for.

A happy DDG user, who still uses !g too often though.

tobehonest about 11 years ago
"All therefore whatsoever they bid you observe, that observe and do; but do not ye after their works: for they say, and do not." --Matthew 23:3

"Do as I say, not as I do" -- Google

Grue3 about 11 years ago
Cue Bing, DuckDuckGo and any other search engine (except Google, of course) being Google-killed for "scraping". It's the perfect plan!

ricg about 11 years ago
Easy. Search for a programming-related question. After the result from Stack Overflow you'll find dozens of scraper sites.

sebii about 11 years ago
More evidence: http://shadyseo.com/

gwu78 about 11 years ago
I do not understand the Wikipedia definition of "scraper site".

By this definition webcache.googleusercontent.com qualifies.

It is a full copy of every site GoogleBot scrapes.

Google gives attribution to the original source, but if this isn't "scraping", what is?

They have been sued for this, and they've won. The benefits of a decent search engine outweigh the burden of infringing the copyrights of others. At least where Google and other search engines that cache websites are concerned.

baldfat about 11 years ago
Not funny, since it is a double post for the same Wikipedia article: http://en.wikipedia.org/wiki/Scraper_site

Seriously, that was just a stretch, but they both show the full URL. So all of Google News is a scraper site, and any other summary given is a scraper site then. Sad.

return0 about 11 years ago
Joking aside, I see these people as the waste collectors of the web. Respect their work, but I wouldn't want to do it.

cousin_it about 11 years ago
That SERP should show only one result from Wikipedia instead of two. It should be on top, have a blue title link to Wikipedia, *and* look like an answer to the user's question. That could be done by a general mechanism that lets every site customize their representation on the SERP, or by a special case for Wikipedia.

bhartzer about 11 years ago
Who is deemed to be the scraper? The site that gets crawled and indexed first, and ranks better, or the site that ranks well but has sites with scraped content that doesn't rank as well?

Matt is looking for scrapers that rank better than the original, basically meaning that they have higher PageRank and more links.

lazyjones about 11 years ago
I would report Google to him, but I'm afraid he's not planning to act fairly/consistently ...

MitziMoto about 11 years ago
Google should offer some kind of revenue sharing (like YouTube) to the sites it's "stealing" visitors from by showing information directly. And you should have to opt into it through something like Webmaster Tools.

rip747 about 11 years ago
Why don't they just integrate this into the results page? What's wrong with having up and down votes or a "report this link" button for the results?

globalpanic about 11 years ago
I thought this was largely taken care of by Google Panda?

motyar about 11 years ago
I know one, Google.com

iamabraham about 11 years ago
Sensational.

pearjuice about 11 years ago
Ah, a whole thread filled with pseudo-intellectual discussion about what scraping is (or isn't) due to some silly snarky joke which Matt is probably laughing at, too. Hacker News to the rescue!