Hacker News bans Google and all other search engines

318 points by jsrfded, about 15 years ago
Looks like news.ycombinator.com is rejecting all search engine robots with "User-Agent: * Disallow: /". This is unfortunate; I often do site: searches to find old threads here. Especially since news.ycombinator.com doesn't have its own site search, this is the only way to find old threads.

pg, what's up?
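
For readers unfamiliar with the syntax, the two directives quoted above ask every crawler to stay away from every path. A minimal sketch with Python's standard-library `urllib.robotparser` shows the effect; the helper `is_allowed` and the `Googlebot` user-agent string are illustrative assumptions, not part of the original post.

```python
from urllib.robotparser import RobotFileParser

# The rules quoted in the post, written one directive per line.
BLOCK_ALL = """\
User-Agent: *
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if `user_agent` may fetch `url` under `robots_txt`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Both the front page and an individual thread are checked.
    for url in (
        "http://news.ycombinator.com/",
        "http://news.ycombinator.com/item?id=165279",
    ):
        print(url, is_allowed(BLOCK_ALL, "Googlebot", url))
```

Because `Disallow: /` is a prefix match on every path, a well-behaved crawler would skip the thread pages that site: searches rely on.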

36 comments

pg, about 15 years ago
Don't worry, it doesn't mean anything. The software for ranking applications runs on the same server, and it is horribly inefficient (something 4 people use every 6 months doesn't tend to get optimized much). This weekend all of us were reading applications at the same time, and the system was getting so slow that I banned crawlers for a bit to buy us some margin. (Traffic from crawlers is much more expensive for us than traffic from human users, because it interacts badly with lazy item loading.) We only finished reading applications an hour before I had to leave for SXSW, so I forgot to set robots.txt back to the normal one, but I just did now.
eel, about 15 years ago
What does this mean for http://www.searchyc.com? Will it stop too? (After all, it must be a bot too, and a robots.txt like this implies that no bots /should/ scrape the site.) Or will it be allowed?
g0atbutt, about 15 years ago
That's pretty disappointing. I'd love to hear an official explanation of why this choice was made.
sahaj, about 15 years ago
I just googled this: http://www.google.com/search?q=site:news.ycombinator.com+Hacker+News+bans+Google+and+all+other+search+engines

and found this: http://news.ycombinator.com/item?id=165279

One of the top replies:

> My vote is to constrain growth as much as possible, at least that which comes from stupid sources. Smart hackers will find this site just fine without Yahoo or MSN, probably even Google. As "evil" as blocking sites and crawlers may sound, I think these types of measures will be necessary to preserve the quality of content here. Whatever actions further that objective have my vote.
petercooper, about 15 years ago
Will this stop Google Reader and other RSS readers from legitimately using the RSS feed here? After all, Google Reader uses a bot to read the feed and allows us to search it from within their app, much like their search engine does with regular pages :-) (That is, "web robots" aren't just spiders.)

Also, things like http://hacker-newspaper.gilesb.com/ and http://hnsort.com/ become less legitimate due to this. If the reason is to reduce the SEO benefits of getting a link on HN, just "nofollow" everything instead.

(Update: Googling on this topic brought up a page of my own where a Google Reader engineer explained how Google Reader deals with robots.txt: http://www.petercooper.co.uk/google-reader-ignores-robottxt-rules-51.html. Their definition of "web robot" is far from universal, though.)
proee, about 15 years ago
This makes me sad. I often do site:news.ycombinator.com google searches for key topics.
pierrefar, about 15 years ago
Does that also kill Search YC?

What about the "official" HNSearch (http://www.webmynd.com/html/hackernews.html)?
chaosmachine, about 15 years ago
If this is an anti-spam measure, I expect it will be about as effective as "nofollow" was. There are still plenty of good reasons for spammers to submit crap: RSS and Twitter syndication of links, for example.

On the other hand, if the goal is to push HN back into semi-obscurity by making it harder to find, it might work.
tkiley, about 15 years ago
Is this permanent?

HN is a good source of Google juice for interesting new startups, and it would be a shame to see that go away...
Matt_Cutts, about 15 years ago
Does anyone know why HN decided to disallow search engines? It's certainly up to the website owner to decide, but there are good ways to (say) reduce the load on the web server without blocking search engines entirely.
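
The comment does not say which alternatives it has in mind. One commonly cited option is the non-standard `Crawl-delay` directive, which asks crawlers to pause between requests while leaving pages indexable; not every crawler honors it, so treat this as a sketch of the idea, not as the commenter's recommendation. The 10-second value and the helper `requested_delay` are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: throttle crawlers instead of blocking them.
THROTTLED = """\
User-Agent: *
Crawl-delay: 10
"""


def requested_delay(robots_txt, user_agent):
    """Return the Crawl-delay (in seconds) that applies to `user_agent`,
    or None when the rules do not request one."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.crawl_delay(user_agent)


if __name__ == "__main__":
    print(requested_delay(THROTTLED, "ExampleBot"))
```

A polite crawler would sleep for the returned number of seconds between fetches; the server still has to cope with crawlers that ignore the hint.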
simonw, about 15 years ago
I noticed that the other day while playing around with the YQL console - YQL obeys robots.txt, so the Hacker News data table doesn't work any more.
tptacek, about 15 years ago
I'm interested in the justification for this, but I'm happy about it. I'm actually uncomfortable with how high Hacker News comments score on Google.
redsymbol, about 15 years ago
Everyone RELAX. There's probably a good explanation for this, maybe it was even an (easily correctable) accident. Give Paul a chance to respond before raising the pitchforks, okay?
chaosmachine, about 15 years ago
It appears to have been changed, just seconds ago. Now it reads:

    User-Agent: *
    Disallow: /x?
    Disallow: /vote?
    Disallow: /reply?
    Disallow: /submitted?
    Disallow: /threads?
WalkingDead, about 15 years ago
I guess that might have to do with the load these search engines generate. Previously, in another thread, we checked that comments posted on HN appear in Google search about a minute after posting. That should create a good amount of load on HN. Multiply it by all the search engines in the wild, and PG has probably decided that blocking them won't do any harm. My guess. Could be wrong, or it might be temporary or a mistake.
mikecane, about 15 years ago
What is the reason for this?
CoreDumpling, about 15 years ago
Well, at least my profile on HN will no longer be the first result for a Google search for my username (it can now revert to some blog that's not mine). But I really wish HN were searchable; I try to find insight on the comment threads here all the time.
ars, about 15 years ago
The robots page is returning the mime-type text/html instead of text/plain
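
A report like this can be checked by looking at the `Content-Type` header the server sends for `/robots.txt`. The sketch below only interprets a header value that has already been fetched; the function names and sample header strings are assumptions, and how strictly any particular crawler treats a `text/html` robots file is not something this thread establishes.

```python
def media_type(content_type_header: str) -> str:
    """Return the bare media type, dropping parameters such as charset."""
    return content_type_header.split(";", 1)[0].strip().lower()


def is_plain_text(content_type_header: str) -> bool:
    """True when the header declares text/plain, the usual type for robots.txt."""
    return media_type(content_type_header) == "text/plain"


if __name__ == "__main__":
    # Hypothetical header values for illustration.
    for header in ("text/html; charset=utf-8", "text/plain; charset=utf-8"):
        print(header, "->", is_plain_text(header))
```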
brianr, about 15 years ago
Whoa there, everything important is still crawlable and indexable. Here's what robots.txt says right now:

    User-Agent: *
    Disallow: /x?
    Disallow: /vote?
    Disallow: /reply?
    Disallow: /submitted?
    Disallow: /threads?

This just disallows those pages... not the home page, and not the /item? action (note the URL of this page).
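
That reading of the rules can be checked mechanically. Here is a minimal sketch using Python's standard-library `urllib.robotparser`; the helper `crawlable`, the `ExampleBot` user agent, and the query strings on the `/vote?` and `/threads?` URLs are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# The updated rules as quoted in the comment above.
UPDATED_RULES = """\
User-Agent: *
Disallow: /x?
Disallow: /vote?
Disallow: /reply?
Disallow: /submitted?
Disallow: /threads?
"""


def crawlable(url: str, user_agent: str = "ExampleBot") -> bool:
    """Return True if the updated rules let `user_agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(UPDATED_RULES.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    base = "http://news.ycombinator.com"
    for path in ("/", "/item?id=165279", "/vote?for=1", "/threads?id=pg"):
        print(path, crawlable(base + path))
```

Under these rules the front page and `/item?` pages should come back as crawlable, while the action URLs should not.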
nirmal, about 15 years ago
Looks like the Readable Feeds stuff is still working. Not sure how this will affect it in the future.

http://andrewtrusty.appspot.com/readability/
prs, about 15 years ago
Truth be told: on various occasions the site: operator on Google came in handy for me to dig out some nugget of information from the archives of HN.

I can't wait to hear the reason behind that decision.
tsally, about 15 years ago
So is SearchYC done, or do they use a scraper not a crawler?
sev, about 15 years ago
I'd like to understand why this decision has been made, as well as why the explanation has been delayed (or will not be given at all) directly from the source.
euroclydon, about 15 years ago
Didn't I read here that the MSN bot refuses to obey the robots.txt file? Maybe I'll have to search HN with it.
AdamN, about 15 years ago
I think this is still on the front page because people want it to stay this way.
rmason, about 15 years ago
I think that it's brilliant. Short term it gets publicity and drives people to the group who are curious.

Long term PG signs a deal with Bing to be the exclusive search engine for Hacker News that pays for the servers and bandwidth.

Down mod me if you will but it's simply brilliant.
andrewcooke, about 15 years ago
surprised no-one has suggested that this is to improve performance. this place can be very slow sometimes, and blocking robots can reduce load.
jchrisa, about 15 years ago
not strictly true (read the code), but of all the sites I know, HN is best poised to implement their own search and drop Google. go for it!
ddsmooth, about 15 years ago
That notwithstanding, if you go to Google.com, type hacker news into the search box and click "I'm Feeling Lucky" you will find yourself at a familiar web page. ;-)
karlzt, about 15 years ago
I just searched on Google and it is not blocked.
_pius, about 15 years ago
Very lame.
jgavris, about 15 years ago
*very* interesting
python123, about 15 years ago
This is horrible. Now what's a talentless, middle-aged career software engineer with hopeless dreams of entrepreneurship going to do with his time?
hackermom, about 15 years ago
i really enjoyed that the news item linked to the robots.txt file :) it's nerdy, it's short, it's clear and concise, all in one :)
quinto42, about 15 years ago
This is fantastic! No more spam links.
fnazeeri, about 15 years ago
Bye bye HN...it was good to know you.