Looks like news.ycombinator.com is rejecting all search engine robots with "User-Agent: * Disallow: /". This is unfortunate; I often do site: searches to find old threads here. Since news.ycombinator.com doesn't have its own site search, this is the only way to find old threads.<p>pg, what's up?
Don't worry, it doesn't mean anything. The software for ranking applications runs on the same server, and it is horribly inefficient (something 4 people use every 6 months doesn't tend to get optimized much). This weekend all of us were reading applications at the same time, and the system was getting so slow that I banned crawlers for a bit to buy us some margin. (Traffic from crawlers is much more expensive for us than traffic from human users, because it interacts badly with lazy item loading.) We only finished reading applications an hour before I had to leave for SXSW, so I forgot to set robots.txt back to the normal one, but I just did now.
What does this mean for <a href="http://www.searchyc.com" rel="nofollow">http://www.searchyc.com</a>? Will it stop too? (After all, it must be a bot too, and a robots.txt like this implies that no bots /should/ scrape the site.) Or will it be allowed?
I just googled this:
<a href="http://www.google.com/search?q=site:news.ycombinator.com+Hacker+News+bans+Google+and+all+other+search+engines" rel="nofollow">http://www.google.com/search?q=site:news.ycombinator.com+Hac...</a><p>and found this:
<a href="http://news.ycombinator.com/item?id=165279" rel="nofollow">http://news.ycombinator.com/item?id=165279</a><p>one of the top replies:<p><i>My vote is to constrain growth as much as possible, at least that which comes from stupid sources. Smart hackers will find this site just fine without Yahoo or MSN, probably even Google.
As "evil" as blocking sites and crawlers may sound, I think these types of measures will be necessary to preserve the quality of content here. Whatever actions further that objective have my vote.</i>
Will this break Google Reader and other RSS readers from legitimately using the RSS feed here? After all, Google Reader uses a bot to read the feed and allows us to search it from within their app.. much like their search engine does with regular pages :-) (That is, "web robots" aren't just spiders.)<p>Also things like <a href="http://hacker-newspaper.gilesb.com/" rel="nofollow">http://hacker-newspaper.gilesb.com/</a> and <a href="http://hnsort.com/" rel="nofollow">http://hnsort.com/</a> become less legitimate due to this. If the reason is to reduce the SEO benefits of getting a link on HN, just "nofollow" everything instead..<p>(Update: Googling on this topic brought up a page of my own where a Google Reader engineer explained how Google Reader deals with robots.txt - <a href="http://www.petercooper.co.uk/google-reader-ignores-robottxt-rules-51.html" rel="nofollow">http://www.petercooper.co.uk/google-reader-ignores-robottxt-...</a> - though their definition of Web robot is far from universal)
Does that also kill Search YC?<p>What about the "official" HNSearch (<a href="http://www.webmynd.com/html/hackernews.html" rel="nofollow">http://www.webmynd.com/html/hackernews.html</a> )?
If this is an anti-spam measure, I expect it will be about as effective as "nofollow" was. There are still plenty of good reasons for spammers to submit crap. RSS and Twitter syndication of links, for example.<p>On the other hand, if the goal is to push HN back into semi-obscurity by making it harder to find, it might work.
Does anyone know why HN decided to disallow search engines? It's certainly up to the website owner to decide, but there are good ways to (say) reduce the load on the web server without blocking search engines entirely.
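One possible way to do that, as a rough sketch: leave everything crawlable but ask bots to slow down with a Crawl-delay line. Crawl-delay is a non-standard directive that only some crawlers honor (Google ignores it), and the robots.txt below is hypothetical, not anything HN has served. The "ExampleBot" name is also made up. Python's standard urllib.robotparser can read the value back:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt (an assumption for illustration, not HN's file):
# pages stay indexable, but crawlers are asked to wait between requests.
THROTTLED_ROBOTS_TXT = """\
User-Agent: *
Crawl-delay: 10
Disallow: /vote?
"""


def requested_delay(rules, agent="ExampleBot"):
    """Return the Crawl-delay (in seconds) that `rules` asks `agent` to honor."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.crawl_delay(agent)


if __name__ == "__main__":
    print(requested_delay(THROTTLED_ROBOTS_TXT))
```

A polite crawler would sleep that many seconds between fetches. Whether a given search engine respects it is up to that engine.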
Everyone RELAX. There's probably a good explanation for this, maybe it was even an (easily correctable) accident. Give Paul a chance to respond before raising the pitchforks, okay?
It appears to have been changed, just seconds ago. Now it reads:<p><pre><code> User-Agent: *
Disallow: /x?
Disallow: /vote?
Disallow: /reply?
Disallow: /submitted?
Disallow: /threads?</code></pre>
I guess that might have to do with the load these search engines generate. Previously, in another thread, we checked that comments posted on HN appear in Google search about a minute after posting. That should create a good amount of load on HN. Multiply it by all the search engines in the wild, and PG has probably decided that blocking those won't do any harm. Just my guess. Could be wrong, or it might be temporary or a mistake.
Well, at least my profile on HN will no longer be the first result for a Google search for my username (it can now revert to some blog that's not mine). But I really wish HN were searchable; I try to find insight on the comment threads here all the time.
Whoa there, everything important is still crawlable and indexable. Here's what robots.txt says right now:<p><pre><code> User-Agent: *
Disallow: /x?
Disallow: /vote?
Disallow: /reply?
Disallow: /submitted?
Disallow: /threads?
</code></pre>
This just disallows those pages... not the home page, and not the /item? action (note the url of this page).
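If you want to check this yourself, here is a minimal sketch using Python's standard urllib.robotparser against the rules quoted in this thread. The "ExampleBot" user agent and the query strings on the vote and threads URLs are made up for illustration, and real crawlers may match rules slightly differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# The rules as quoted in this thread.
ROBOTS_TXT = """\
User-Agent: *
Disallow: /x?
Disallow: /vote?
Disallow: /reply?
Disallow: /submitted?
Disallow: /threads?
"""


def allowed(path, rules=ROBOTS_TXT, agent="ExampleBot"):
    """Return True if `agent` may fetch `path` under `rules`."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(agent, "http://news.ycombinator.com" + path)


if __name__ == "__main__":
    # Query strings other than the item id are illustrative assumptions.
    for path in ("/", "/item?id=165279", "/vote?for=1", "/threads?id=pg"):
        print(path, allowed(path))
```

The home page and /item? URLs come back as allowed, while /vote? and /threads? are blocked. Passing "User-Agent: *\nDisallow: /\n" as `rules` reproduces the earlier ban-everything file.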
Looks like the Readable Feeds stuff is still working. Not sure how this will affect it in the future.<p><a href="http://andrewtrusty.appspot.com/readability/" rel="nofollow">http://andrewtrusty.appspot.com/readability/</a>
Truth be told: On various occasions the site: operator on Google came in handy for me to dig out some nugget of information from the archives of HN.<p>I can't wait to hear the reason behind that decision.
I'd like to understand why this decision has been made, as well as why the explanation has been delayed (or will not be given at all) directly from the source.
I think that it's brilliant. Short term it gets publicity and drives people to the group who are curious.<p>Long term PG signs a deal with Bing to be the exclusive search engine for Hacker News that pays for the servers and bandwidth.<p>Down mod me if you will but it's simply brilliant.
That notwithstanding, if you go to Google.com, type hacker news into the search box and click "I'm Feeling Lucky" you will find yourself at a familiar web page. ;-)