Protect Your Site with a Blackhole for Bad Bots

75 points by ScotterC, over 14 years ago

14 comments

yetanotherjosh, over 14 years ago
I've worked at places that have tried this trick. It doesn't work - it's always been removed because of real users complaining they've lost access.

Several scenarios can trigger it, and probably more. The internet is a weird place. Consider:

1. Some clients, browser plugins, and proxy servers implement link prefetching. These agents will not care that the link is attached to a 1px gif that the user won't see. This is not really breaking the rules, either; it is quite permissible and in the scope of HTTP implementations - unless you've put your black hole behind a form POST (which bots won't fall for anyway).

2. Internet Explorer, among other tools, allows users to download content for offline viewing. The client does not respect robots.txt when such fetching has been initiated by a user.

3. Not all users browse the web visually, and your 1px gif is discriminating against the visually impaired. When browsed with a screen reader, a linked image is a linked image is a linked image.

Additionally, outright blacklisting by IP address, as noted by others on this thread, is highly problematic, especially when the behavior that triggers it could accidentally come from real users behind a NAT firewall (at a typical office, library, etc.). A single user performing any of the above behaviors would block the entire group from the service.

There are better ways to fight misbehaving robots that do not so easily trigger false positives...
ars, over 14 years ago
If you do this, create the robots.txt first, then wait a week or two!

Only then activate the actual blackhole.

The reason is that robots do not download robots.txt each time; they can, and do, cache it for quite a while, especially for sites that don't change much.
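A minimal sketch of the ordering ars describes, assuming the trap lives at a hypothetical /blackhole/ path: publish the rule first, leave the trap itself switched off for a week or two, and only then start banning, so crawlers still working from a cached copy of robots.txt are not caught.

    # robots.txt -- publish this ahead of time; crawlers may cache the file
    # for days, so keep the blackhole disabled until the rule has propagated.
    User-agent: *
    Disallow: /blackhole/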
jasonkester, over 14 years ago
So this is just to ban non-targeted crawlers? Any particular reason you'd want to ban crawlers from your site? Surely your server is up to the task of serving a few extra requests, enough so that it's not worth your time adding code (and slowing down good requests) to restrict them.

The kinds of bots that I care about are the ones that spam up my content site. They only go to pages that real users visit (the "Post Stuff" page), so this trick wouldn't help against them. And they never post from the same IP twice, preferring to hop between infected machines on a botnet every time they make a post.

I'm curious what sort of traffic pattern this author is seeing that would motivate him to build this.
groaner, over 14 years ago
> Whitelisting these user agents ensures that anything claiming to be a major search engine is allowed open access. The downside is that user-agent strings are easily spoofed, so a bad bot could crawl along and say, "hey look, I'm teh Googlebot!" and the whitelist would grant access.

How many of these so-called "bad bots" already do this sort of spoofing? Would usage of these techniques only encourage such behavior?
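On groaner's point about spoofing: the usual counter is not the user-agent whitelist itself but forward-confirmed reverse DNS, which the major search engines document for their crawlers. A rough Python sketch of that check (illustrative only; hostnames and ranges may change over time):

    import socket

    def is_verified_googlebot(ip):
        # Reverse-resolve the IP, check the crawler domain, then forward-confirm.
        try:
            host = socket.gethostbyaddr(ip)[0]
        except OSError:
            return False
        if not (host.endswith(".googlebot.com") or host.endswith(".google.com")):
            return False
        try:
            forward_ips = socket.gethostbyname_ex(host)[2]
        except OSError:
            return False
        return ip in forward_ips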
nostromo, over 14 years ago
A few bad ideas here:

1) Blocking by IP address. (AOL and universities come to mind.)

2) nofollow links are followed by search engines and users alike. (display:none is ignored by some text-based browsers that ignore CSS.)
benjoffe, over 14 years ago
If your site becomes popular this could become a target for trolls. E.g., in some forums trolls will post a fake link to a logout page; this is why sites should use POST or private keys for logging out. If you implement this blackhole it will become a much more serious target.
symkat, over 14 years ago
This is pretty neat.

Something we did at $company[-2] is we had a block of IPs that weren't used for customer traffic. If something hit them (SSH login attempts, HTTP GET requests that were looking for RFI vulnerabilities, etc., etc.) the IP would be firewalled from the entire network for a period of time (generally 2-3 days).
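A toy version of the honeypot-address scheme symkat describes, under assumed numbers (the unused addresses and the 2-3 day window are placeholders): any source that touches a "dark" IP gets blocked until its ban expires.

    import time

    DARK_IPS = {"198.51.100.10", "198.51.100.11"}  # hypothetical unused addresses
    BAN_SECONDS = 3 * 24 * 3600                    # roughly the 2-3 days mentioned

    banned_until = {}

    def record_hit(src_ip, dst_ip, now=None):
        # Ban the source if it touched an address no real customer ever uses.
        now = time.time() if now is None else now
        if dst_ip in DARK_IPS:
            banned_until[src_ip] = now + BAN_SECONDS

    def is_blocked(src_ip, now=None):
        now = time.time() if now is None else now
        return banned_until.get(src_ip, 0) > now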
keyle, over 14 years ago
This is okay, but what would happen if someone writes a popular Flash client that pulls data from a site using that blackhole.php?

The clients could access the data once and then be blocked forever?
jwr, over 14 years ago
Against what, exactly, does this protect? And why?

"Bad bots" are the least of my worries, and if I were to protect against anything, I'd protect against excessive requests per second.
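A minimal sketch of the per-IP rate limit jwr would rather enforce, here as an in-memory sliding window (the threshold and window size are arbitrary assumptions):

    import time
    from collections import defaultdict, deque

    WINDOW_SECONDS = 1.0
    MAX_REQUESTS = 10          # per IP, per window

    _hits = defaultdict(deque)

    def allow_request(ip, now=None):
        # Return True if this IP is still under the per-second budget.
        now = time.monotonic() if now is None else now
        q = _hits[ip]
        while q and now - q[0] > WINDOW_SECONDS:
            q.popleft()
        if len(q) >= MAX_REQUESTS:
            return False
        q.append(now)
        return True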
codefisher, over 14 years ago
There are some interesting ideas in the comments to work around the problems with this method, such as hashing the IP with a secret string on the link to stop others making you ban all your users, and all sorts of other stuff - even putting a CAPTCHA on the ban page as an escape method. But in the end I think the method is flawed:

1) A single infected computer on a network could take out a large number of users.

2) Anything doing prefetching will cause those users to be banned.

3) There is a risk of taking out valid bots, and verifying them correctly is just too expensive for a large site.

My main issue with bots is their spam, so I just use tools like Akismet to keep that under control.
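The "hashing the IP with a secret string" idea codefisher mentions could look roughly like this HMAC sketch (the secret and token layout are assumptions, not the article's code): the hidden link embeds a token tied to the visitor's own IP, so a forged link posted by a troll bans nobody else.

    import hmac, hashlib

    SECRET = b"change-me"  # hypothetical server-side secret

    def trap_token(ip):
        # Token embedded in the hidden blackhole link served to this visitor.
        return hmac.new(SECRET, ip.encode(), hashlib.sha256).hexdigest()

    def token_is_valid(ip, token):
        # Only ban when the token matches the IP actually making the request.
        return hmac.compare_digest(trap_token(ip), token)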
rarestblog, over 14 years ago
robots.txt is malformed in this example:

    Disallow: /*/blackhole/*

This line won't work even for good robots (robots.txt doesn't have wildcard characters).
WillyF, over 14 years ago
Wouldn't using a hidden link also subject you to a possible penalty from Google and other search engines?
pornel, over 14 years ago
The implementation is very weak. It reads the whole blacklist line by line (could have used sqlite at least), and uses extract() to emulate the register_globals misfeature on hosts that have disabled it (and it doesn't even check for disabled register_globals properly).
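Along the lines pornel suggests, an sqlite-backed ban list avoids rescanning a flat file on every request; a bare-bones Python sketch (schema and filename are made up for illustration):

    import sqlite3

    conn = sqlite3.connect("blackhole.db")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS banned (ip TEXT PRIMARY KEY, banned_at INTEGER)"
    )

    def is_banned(ip):
        # Indexed primary-key lookup instead of a line-by-line file scan.
        return conn.execute("SELECT 1 FROM banned WHERE ip = ?", (ip,)).fetchone() is not None

    def ban(ip):
        conn.execute(
            "INSERT OR IGNORE INTO banned VALUES (?, strftime('%s','now'))", (ip,)
        )
        conn.commit()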
quellhorst, over 14 years ago
Would like to see how someone implements this type of blocking in a Rails app.