Hacker News bans Google and all other search engines

318 points by jsrfded about 15 years ago

Looks like news.ycombinator.com is rejecting all search engine robots with "User-Agent: * Disallow: /". This is unfortunate; I often do site: searches to find old threads here. Especially since news.ycombinator.com doesn't have its own site search, this is the only way to find old threads. pg, what's up?
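For readers unfamiliar with robots.txt, here is a minimal sketch of how a well-behaved crawler would read the rule quoted above. It uses Python's standard `urllib.robotparser`; the helper name `is_allowed` and the agent name `ExampleBot` are made up for illustration, and real crawlers may interpret edge cases differently.

```python
from urllib.robotparser import RobotFileParser

# The blanket rule quoted in the post, one directive per line.
BLANKET_RULES = ["User-Agent: *", "Disallow: /"]


def is_allowed(rules, agent, url):
    """Return True if `agent` may fetch `url` under the given robots.txt lines.

    `is_allowed` is a hypothetical helper, not part of any crawler's API.
    """
    parser = RobotFileParser()
    parser.parse(rules)  # parse the rules in memory, no network access
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    for url in (
        "http://news.ycombinator.com/",
        "http://news.ycombinator.com/item?id=165279",
    ):
        # "ExampleBot" is an invented agent name; "*" applies to every agent.
        print(url, is_allowed(BLANKET_RULES, "ExampleBot", url))
```

Under this rule every path starts with "/", so any crawler that honors robots.txt would skip the whole site, old threads included.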

36 comments

pg about 15 years ago

Don't worry, it doesn't mean anything. The software for ranking applications runs on the same server, and it is horribly inefficient (something 4 people use every 6 months doesn't tend to get optimized much). This weekend all of us were reading applications at the same time, and the system was getting so slow that I banned crawlers for a bit to buy us some margin. (Traffic from crawlers is much more expensive for us than traffic from human users, because it interacts badly with lazy item loading.) We only finished reading applications an hour before I had to leave for SXSW, so I forgot to set robots.txt back to the normal one, but I just did now.

eel about 15 years ago

What does this mean for <a href="http://www.searchyc.com" rel="nofollow">http://www.searchyc.com</a>? Will it stop too? (After all, it must be a bot too, and a robots.txt like this implies that no bots /should/ scrape the site.) Or will it be allowed?

g0atbutt about 15 years ago

That's pretty disappointing. I'd love to hear an official explanation of why this choice was made.

sahaj about 15 years ago

i just googled this: <a href="http://www.google.com/search?q=site:news.ycombinator.com+Hacker+News+bans+Google+and+all+other+search+engines" rel="nofollow">http://www.google.com/search?q=site:news.ycombinator.com+Hac...</a> and found this: <a href="http://news.ycombinator.com/item?id=165279" rel="nofollow">http://news.ycombinator.com/item?id=165279</a>

one of the top replies:

My vote is to constrain growth as much as possible, at least that which comes from stupid sources. Smart hackers will find this site just fine without Yahoo or MSN, probably even Google. As "evil" as blocking sites and crawlers may sound, I think these types of measures will be necessary to preserve the quality of content here. Whatever actions further that objective have my vote.

petercooper about 15 years ago

Will this break Google Reader and other RSS readers from legitimately using the RSS feed here? After all, Google Reader uses a bot to read the feed and allows us to search it from within their app.. much like their search engine does with regular pages :-) (That is, "web robots" aren't just spiders.)

Also things like <a href="http://hacker-newspaper.gilesb.com/" rel="nofollow">http://hacker-newspaper.gilesb.com/</a> and <a href="http://hnsort.com/" rel="nofollow">http://hnsort.com/</a> become less legitimate due to this. If the reason is to reduce the SEO benefits of getting a link on HN, just "nofollow" everything instead..

(Update: Googling on this topic brought up a page of my own where a Google Reader engineer explained how Google Reader deals with robots.txt - <a href="http://www.petercooper.co.uk/google-reader-ignores-robottxt-rules-51.html" rel="nofollow">http://www.petercooper.co.uk/google-reader-ignores-robottxt-...</a> - though their definition of Web robot is far from universal)

proee about 15 years ago

This makes me sad. I often do site:news.ycombinator.com google searches for key topics.

pierrefar about 15 years ago

Does that also kill Search YC? What about the "official" HNSearch (<a href="http://www.webmynd.com/html/hackernews.html" rel="nofollow">http://www.webmynd.com/html/hackernews.html</a>)?

chaosmachine about 15 years ago

If this is an anti-spam measure, I expect it will be about as effective as "no follow" was. There's still plenty of good reasons for spammers to submit crap. RSS and Twitter syndication of links, for example.

On the other hand, if the goal is to push HN back into semi-obscurity by making it harder to find, it might work.

tkiley about 15 years ago

Is this permanent? HN is a good source of google juice for interesting new startups, and it would be a shame to see that go away...

Matt_Cutts about 15 years ago

Does anyone know why HN decided to disallow search engines? It's certainly up to the website owner to decide, but there are good ways to (say) reduce the load on the web server without blocking search engines entirely.

simonw about 15 years ago

I noticed that the other day while playing around with the YQL console - YQL obeys robots.txt, so the Hacker News data table doesn't work any more.

tptacek about 15 years ago

I'm interested in the justification for this, but I'm happy about it. I'm actually uncomfortable with how high Hacker News comments score on Google.

redsymbol about 15 years ago

Everyone RELAX. There's probably a good explanation for this, maybe it was even an (easily correctable) accident. Give Paul a chance to respond before raising the pitchforks, okay?

chaosmachine about 15 years ago

It appears to have been changed, just seconds ago. Now it reads:
<pre><code>User-Agent: *
Disallow: /x?
Disallow: /vote?
Disallow: /reply?
Disallow: /submitted?
Disallow: /threads?
</code></pre>

WalkingDead about 15 years ago

I guess that might have to do with the load these search engines generate. Previously in a thread we checked that comments posted on HN appear in Google search within a minute of posting. That should create a good amount of load on HN. Multiply it by all the search engines in the wild, and PG has probably decided that blocking those won't do any harm. My guess. Could be wrong, or it might be temporary or a mistake.

mikecane about 15 years ago

What is the reason for this?

CoreDumpling about 15 years ago

Well, at least my profile on HN will no longer be the first result for a Google search for my username (it can now revert to some blog that's not mine). But I really wish HN were searchable; I try to find insight on the comment threads here all the time.

ars about 15 years ago

The robots page is returning the mime-type text/html instead of text/plain

brianr about 15 years ago

Whoa there, everything important is still crawlable and indexable. Here's what robots.txt says right now:
<pre><code>User-Agent: *
Disallow: /x?
Disallow: /vote?
Disallow: /reply?
Disallow: /submitted?
Disallow: /threads?
</code></pre>
This just disallows those pages... not the home page, and not the /item? action (note the url of this page).
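The point above can be checked with a short sketch, again using Python's standard `urllib.robotparser`. The helper `crawlable`, the agent name `ExampleBot`, and the query strings on the disallowed paths are assumptions made for illustration; the item id is one linked elsewhere in this thread. Python's parser does simple prefix matching, so other crawlers may treat the trailing "?" slightly differently.

```python
from urllib.robotparser import RobotFileParser

# The revised rules as quoted in the comment above.
REVISED_RULES = [
    "User-Agent: *",
    "Disallow: /x?",
    "Disallow: /vote?",
    "Disallow: /reply?",
    "Disallow: /submitted?",
    "Disallow: /threads?",
]


def crawlable(path, agent="ExampleBot"):
    """Return True if `agent` may fetch `path` on news.ycombinator.com.

    `crawlable` and "ExampleBot" are hypothetical names for this sketch.
    """
    parser = RobotFileParser()
    parser.parse(REVISED_RULES)  # parse in memory, no network access
    return parser.can_fetch(agent, "http://news.ycombinator.com" + path)


if __name__ == "__main__":
    # Query strings on the disallowed paths are invented examples.
    for path in ("/", "/item?id=165279", "/vote?for=1", "/threads?id=pg"):
        print(path, crawlable(path))
```

The home page and /item? pages match none of the Disallow prefixes, so they stay open to crawlers, while the voting, reply, and per-user listing pages are excluded.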

nirmal about 15 years ago

Looks like the Readable Feeds stuff is still working. Not sure how this will affect it in the future. <a href="http://andrewtrusty.appspot.com/readability/" rel="nofollow">http://andrewtrusty.appspot.com/readability/</a>

prs about 15 years ago

Truth be told: On various occasions the site: operator on Google came in handy for me to dig out some nugget of information from the archives of HN. I can't wait to hear the reason behind that decision.

tsally about 15 years ago

So is SearchYC done, or do they use a scraper not a crawler?

sev about 15 years ago

I'd like to understand why this decision has been made, as well as why the explanation has been delayed (or will not be given at all) directly from the source.

euroclydon about 15 years ago

Didn't I read here that the MSN bot refuses to obey the robots.txt file? Maybe I'll have to search HN with it.

AdamN about 15 years ago

I think this is still on the front page because people want it to stay this way.

rmason about 15 years ago

I think that it's brilliant. Short term it gets publicity and drives people to the group who are curious.

Long term PG signs a deal with Bing to be the exclusive search engine for Hacker News that pays for the servers and bandwidth.

Down mod me if you will but it's simply brilliant.

andrewcooke about 15 years ago

surprised no-one has suggested that this is to improve performance. this place can be very slow sometimes, and blocking robots can reduce load.

jchrisa about 15 years ago

not strictly true (read the code), but of all the sites I know, HN is best poised to implement their own search and drop Google. go for it!

ddsmooth about 15 years ago

That notwithstanding, if you go to Google.com, type hacker news into the search box and click "I'm Feeling Lucky" you will find yourself at a familiar web page. ;-)

karlzt about 15 years ago

I just searched on Google and it is not blocked.

_pius about 15 years ago

Very lame.

jgavris about 15 years ago

very interesting

python123 about 15 years ago

This is horrible. Now what's a talentless, middle-aged career software engineer with hopeless dreams of entrepreneurship going to do with his time?

hackermom about 15 years ago

i really enjoyed that the news item linked to the robots.txt file :) it's nerdy, it's short, it's clear and concise, all in one :)

quinto42 about 15 years ago

This is fantastic! No more spam links.

fnazeeri about 15 years ago

Bye bye HN...it was good to know you.
