Hacker News bans Google and all other search engines

318 points by jsrfded, about 15 years ago

Looks like news.ycombinator.com is rejecting all search engine robots with "User-Agent: * Disallow: /". This is unfortunate; I often do site: searches to find old threads here. Especially since news.ycombinator.com doesn't have its own site search, this is the only way to find old threads. pg, what's up?

36 comments

pg, about 15 years ago

Don't worry, it doesn't mean anything. The software for ranking applications runs on the same server, and it is horribly inefficient (something 4 people use every 6 months doesn't tend to get optimized much). This weekend all of us were reading applications at the same time, and the system was getting so slow that I banned crawlers for a bit to buy us some margin. (Traffic from crawlers is much more expensive for us than traffic from human users, because it interacts badly with lazy item loading.) We only finished reading applications an hour before I had to leave for SXSW, so I forgot to set robots.txt back to the normal one, but I just did now.

eel, about 15 years ago

What does this mean for http://www.searchyc.com? Will it stop too? (After all, it must be a bot too, and a robots.txt like this implies that no bots /should/ scrape the site.) Or will it be allowed?

g0atbutt, about 15 years ago

That's pretty disappointing. I'd love to hear an official explanation of why this choice was made.

sahaj, about 15 years ago

I just googled this: http://www.google.com/search?q=site:news.ycombinator.com+Hacker+News+bans+Google+and+all+other+search+engines
and found this: http://news.ycombinator.com/item?id=165279
One of the top replies:
My vote is to constrain growth as much as possible, at least that which comes from stupid sources. Smart hackers will find this site just fine without Yahoo or MSN, probably even Google. As "evil" as blocking sites and crawlers may sound, I think these types of measures will be necessary to preserve the quality of content here. Whatever actions further that objective have my vote.

petercooper, about 15 years ago

Will this stop Google Reader and other RSS readers from legitimately using the RSS feed here? After all, Google Reader uses a bot to read the feed and allows us to search it from within their app, much like their search engine does with regular pages :-) (That is, "web robots" aren't just spiders.)
Also, things like http://hacker-newspaper.gilesb.com/ and http://hnsort.com/ become less legitimate due to this. If the reason is to reduce the SEO benefits of getting a link on HN, just "nofollow" everything instead.
(Update: Googling on this topic brought up a page of my own where a Google Reader engineer explained how Google Reader deals with robots.txt - http://www.petercooper.co.uk/google-reader-ignores-robottxt-rules-51.html - though their definition of Web robot is far from universal)

proee, about 15 years ago

This makes me sad. I often do site:news.ycombinator.com google searches for key topics.

pierrefar, about 15 years ago

Does that also kill Search YC? What about the "official" HNSearch (http://www.webmynd.com/html/hackernews.html)?

chaosmachine, about 15 years ago

If this is an anti-spam measure, I expect it will be about as effective as "no follow" was. There are still plenty of good reasons for spammers to submit crap: RSS and Twitter syndication of links, for example. On the other hand, if the goal is to push HN back into semi-obscurity by making it harder to find, it might work.

tkiley, about 15 years ago

Is this permanent? HN is a good source of Google juice for interesting new startups, and it would be a shame to see that go away...

Matt_Cutts, about 15 years ago

Does anyone know why HN decided to disallow search engines? It's certainly up to the website owner to decide, but there are good ways to (say) reduce the load on the web server without blocking search engines entirely.

simonw, about 15 years ago

I noticed that the other day while playing around with the YQL console - YQL obeys robots.txt, so the Hacker News data table doesn't work any more.

tptacek, about 15 years ago

I'm interested in the justification for this, but I'm happy about it. I'm actually uncomfortable with how high Hacker News comments score on Google.

redsymbol, about 15 years ago

Everyone RELAX. There's probably a good explanation for this, maybe it was even an (easily correctable) accident. Give Paul a chance to respond before raising the pitchforks, okay?

chaosmachine, about 15 years ago

It appears to have been changed, just seconds ago. Now it reads:
<pre><code>User-Agent: *
Disallow: /x?
Disallow: /vote?
Disallow: /reply?
Disallow: /submitted?
Disallow: /threads?</code></pre>

WalkingDead, about 15 years ago

I guess that might have to do with the load these search engines generate. Previously, in a thread, we checked that comments posted on HN appear in Google search after a minute of posting. That should create a good amount of load on HN. Multiply it by all the search engines in the wild, and PG has probably decided that blocking them won't do any harm. My guess. Could be wrong, or it might be temporary or a mistake.

mikecane, about 15 years ago

What is the reason for this?

CoreDumpling, about 15 years ago

Well, at least my profile on HN will no longer be the first result for a Google search for my username (it can now revert to some blog that's not mine). But I really wish HN were searchable; I try to find insight on the comment threads here all the time.

ars, about 15 years ago

The robots page is returning the mime-type text/html instead of text/plain
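
The check ars describes can be sketched in Python. This is only a minimal sketch: the helper name `served_as_plain_text` is hypothetical, and it inspects a `Content-Type` header value you have already obtained from an HTTP response. It makes no claim about how any particular crawler reacts to a robots.txt served as text/html.

```python
def served_as_plain_text(content_type_header: str) -> bool:
    """Return True if a Content-Type header value names the text/plain media type.

    Parameters such as "; charset=utf-8" are ignored, and the comparison is
    case-insensitive, since media type names are not case-sensitive.
    """
    # Everything before the first ";" is the media type itself.
    media_type = content_type_header.split(";", 1)[0].strip().lower()
    return media_type == "text/plain"


if __name__ == "__main__":
    # Illustrative header values only, not captured from the real server.
    for header in ("text/html; charset=utf-8", "text/plain"):
        print(header, "->", served_as_plain_text(header))
```

In practice the header value would come from the response to a request for /robots.txt, for example via `urllib.request.urlopen(...).headers["Content-Type"]`.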

brianr, about 15 years ago

Whoa there, everything important is still crawlable and indexable. Here's what robots.txt says right now:
<pre><code>User-Agent: *
Disallow: /x?
Disallow: /vote?
Disallow: /reply?
Disallow: /submitted?
Disallow: /threads?</code></pre>
This just disallows those pages... not the home page, and not the /item? action (note the url of this page).
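
To illustrate brianr's point, here is a minimal Python sketch that uses the standard library's `urllib.robotparser` to compare the original blanket rule with the revised rules quoted above. The bot name `ExampleBot` and the query strings on the `/vote?` and `/threads?` sample paths are assumptions chosen for illustration, and real crawlers may interpret robots.txt somewhat differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# The rule quoted in the original post: blocks every path for every robot.
BLANKET_RULES = """\
User-Agent: *
Disallow: /
"""

# The revised rules quoted in this thread.
REVISED_RULES = """\
User-Agent: *
Disallow: /x?
Disallow: /vote?
Disallow: /reply?
Disallow: /submitted?
Disallow: /threads?
"""

BASE = "http://news.ycombinator.com"

# Sample paths; the query strings after /vote? and /threads? are hypothetical.
SAMPLE_PATHS = ["/", "/item?id=1194807", "/vote?for=1194807", "/threads?id=pg"]


def allowed(rules: str, url: str, agent: str = "ExampleBot") -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    for path in SAMPLE_PATHS:
        print(
            path,
            "| blanket:", allowed(BLANKET_RULES, BASE + path),
            "| revised:", allowed(REVISED_RULES, BASE + path),
        )
```

Under the blanket rule every sample path is refused. Under the revised rules the home page and `/item?` pages are allowed, while the `/vote?` and `/threads?` paths are refused.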

nirmal, about 15 years ago

Looks like the Readable Feeds stuff is still working. Not sure how this will affect it in the future. http://andrewtrusty.appspot.com/readability/

prs, about 15 years ago

Truth be told: on various occasions the site: operator on Google came in handy for me to dig out some nugget of information from the archives of HN. I can't wait to hear the reason behind that decision.

tsally, about 15 years ago

So is SearchYC done, or do they use a scraper not a crawler?

sev, about 15 years ago

I'd like to understand why this decision has been made, as well as why the explanation has been delayed (or will not be given at all) directly from the source.

euroclydon, about 15 years ago

Didn't I read here that the MSN bot refuses to obey the robots.txt file? Maybe I'll have to search HN with it.

AdamN, about 15 years ago

I think this is still on the front page because people want it to stay this way.

rmason, about 15 years ago

I think that it's brilliant. Short term it gets publicity and drives people to the group who are curious. Long term PG signs a deal with Bing to be the exclusive search engine for Hacker News that pays for the servers and bandwidth. Down mod me if you will but it's simply brilliant.

andrewcooke, about 15 years ago

surprised no-one has suggested that this is to improve performance. this place can be very slow sometimes, and blocking robots can reduce load.

jchrisa, about 15 years ago

not strictly true (read the code), but of all the sites I know, HN is best poised to implement their own search and drop Google. go for it!

ddsmooth, about 15 years ago

That notwithstanding, if you go to Google.com, type hacker news into the search box and click "I'm Feeling Lucky" you will find yourself at a familiar web page. ;-)

karlzt, about 15 years ago

I just searched on Google and it is not blocked.

_pius, about 15 years ago

Very lame.

jgavris, about 15 years ago

very interesting

python123, about 15 years ago

This is horrible. Now what's a talentless, middle-aged career software engineer with hopeless dreams of entrepreneurship going to do with his time?

hackermom, about 15 years ago

i really enjoyed that the news item linked to the robots.txt file :) it's nerdy, it's short, it's clear and concise, all in one :)

quinto42, about 15 years ago

This is fantastic! No more spam links.

fnazeeri, about 15 years ago

Bye bye HN...it was good to know you.
