Protect Your Site with a Blackhole for Bad Bots

75 points by ScotterC, over 14 years ago

14 comments

yetanotherjosh, over 14 years ago
I've worked at places that have tried this trick. It doesn't work - it's always been removed because of real users complaining they've lost access.

Several scenarios can trigger it, and probably more. The internet is a weird place. Consider:

1. Some clients, browser plugins, and proxy servers implement link prefetching. These agents will not care that the link is attached to a 1px gif that the user won't see. This is not really breaking the rules, either; it is quite permissible and in the scope of HTTP implementations - unless you've put your black hole behind a form POST (which bots won't fall for anyway).

2. Internet Explorer, among other tools, allows users to download content for offline viewing. The client does not respect robots.txt when such fetching has been initiated by a user.

3. Not all users browse the web visually, and your 1px gif is discriminating against the visually impaired. When browsed with a screen reader, a linked image is a linked image is a linked image.

Additionally, outright blacklisting by IP address, as noted by others on this thread, is highly problematic, especially when the behavior that triggers it could accidentally come from real users behind a NAT firewall (at a typical office, library, etc.). A single user performing any of the above behaviors would block the entire group from the service.

There are better ways to fight misbehaving robots that do not so easily trigger false positives...
ars, over 14 years ago
If you do this, create the robots.txt first, then wait a week or two!

Only then activate the actual blackhole.

The reason is that robots do not download robots.txt each time; they can, and do, cache it for quite a while, especially for sites that don't change much.
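A minimal sketch of the ordering ars describes, assuming the trap lives at a hypothetical /blackhole/ path: publish the rule first, leave the trap itself switched off for a week or two, and only then start banning, so crawlers still working from a cached copy of robots.txt are not caught.

    # robots.txt -- publish this ahead of time; crawlers may cache the file
    # for days, so keep the blackhole disabled until the rule has propagated.
    User-agent: *
    Disallow: /blackhole/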
jasonkester, over 14 years ago
So this is just to ban non-targeted crawlers? Any particular reason you'd want to ban crawlers from your site? Surely your server is up to the task of serving a few extra requests, enough so that it's not worth your time adding code (and slowing down good requests) to restrict them.

The kinds of bots that I care about are the ones that spam up my content site. They only go to pages that real users visit (the "Post Stuff" page), so this trick wouldn't help against them. And they never post from the same IP twice, preferring to hop between infected machines on a botnet every time they make a post.

I'm curious what sort of traffic pattern this author is seeing that would motivate him to build this.
groaner, over 14 years ago
> Whitelisting these user agents ensures that anything claiming to be a major search engine is allowed open access. The downside is that user-agent strings are easily spoofed, so a bad bot could crawl along and say, "hey look, I'm teh Googlebot!" and the whitelist would grant access.

How many of these so-called "bad bots" already do this sort of spoofing? Would usage of these techniques only encourage such behavior?
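On groaner's point about spoofing: the usual counter is not the user-agent whitelist itself but forward-confirmed reverse DNS, which the major search engines document for their crawlers. A rough Python sketch of that check (illustrative only; hostnames and ranges may change over time):

    import socket

    def is_verified_googlebot(ip):
        # Reverse-resolve the IP, check the crawler domain, then forward-confirm.
        try:
            host = socket.gethostbyaddr(ip)[0]
        except OSError:
            return False
        if not (host.endswith(".googlebot.com") or host.endswith(".google.com")):
            return False
        try:
            forward_ips = socket.gethostbyname_ex(host)[2]
        except OSError:
            return False
        return ip in forward_ips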
nostromo, over 14 years ago
A few bad ideas here:

1) Blocking by IP address. (AOL and universities come to mind.)

2) nofollow links are followed by search engines and users alike. (display:none is ignored by some text-based browsers that ignore CSS.)
benjoffe, over 14 years ago
If your site becomes popular this could become a target for trolls. E.g., in some forums trolls will post a fake link to a logout page; this is why sites should use POST or private keys for logging out. If you implement this blackhole it will become a much more serious target.
symkat, over 14 years ago
This is pretty neat.

Something we did at $company[-2] is we had a block of IPs that weren't used for customer traffic. If something hit them (SSH login attempts, HTTP GET requests that were looking for RFI vulnerabilities, etc., etc.) the IP would be firewalled from the entire network for a period of time (generally 2-3 days).
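A toy version of the honeypot-address scheme symkat describes, under assumed numbers (the unused addresses and the 2-3 day window are placeholders): any source that touches a "dark" IP gets blocked until its ban expires.

    import time

    DARK_IPS = {"198.51.100.10", "198.51.100.11"}  # hypothetical unused addresses
    BAN_SECONDS = 3 * 24 * 3600                    # roughly the 2-3 days mentioned

    banned_until = {}

    def record_hit(src_ip, dst_ip, now=None):
        # Ban the source if it touched an address no real customer ever uses.
        now = time.time() if now is None else now
        if dst_ip in DARK_IPS:
            banned_until[src_ip] = now + BAN_SECONDS

    def is_blocked(src_ip, now=None):
        now = time.time() if now is None else now
        return banned_until.get(src_ip, 0) > now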
keyle, over 14 years ago
This is okay, but what would happen if someone writes a popular Flash client that pulls data from a site using that blackhole.php?

The clients could access the data once and then be blocked forever?
jwr, over 14 years ago
Against what, exactly, does this protect? And why?

"Bad bots" are the least of my worries, and if I were to protect against anything, I'd protect against excessive requests per second.
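A minimal sketch of the per-IP rate limit jwr would rather enforce, here as an in-memory sliding window (the threshold and window size are arbitrary assumptions):

    import time
    from collections import defaultdict, deque

    WINDOW_SECONDS = 1.0
    MAX_REQUESTS = 10          # per IP, per window

    _hits = defaultdict(deque)

    def allow_request(ip, now=None):
        # Return True if this IP is still under the per-second budget.
        now = time.monotonic() if now is None else now
        q = _hits[ip]
        while q and now - q[0] > WINDOW_SECONDS:
            q.popleft()
        if len(q) >= MAX_REQUESTS:
            return False
        q.append(now)
        return True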
codefisher, over 14 years ago
There are some interesting ideas in the comments to work around the problems with this method, such as hashing the IP with a secret string on the link to stop others making you ban all your users, and all sorts of other stuff - even putting a CAPTCHA on the ban page as an escape method. But in the end I think the method is flawed:

1) A single infected computer on a network could take out a large number of users.

2) Anything doing prefetching will cause those users to be banned.

3) There is a risk of taking out valid bots, and verifying them correctly is just too expensive for a large site.

My main issue with bots is their spam, so I just use tools like Akismet to keep that under control.
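The "hashing the IP with a secret string" idea codefisher mentions could look roughly like this HMAC sketch (the secret and token layout are assumptions, not the article's code): the hidden link embeds a token tied to the visitor's own IP, so a forged link posted by a troll bans nobody else.

    import hmac, hashlib

    SECRET = b"change-me"  # hypothetical server-side secret

    def trap_token(ip):
        # Token embedded in the hidden blackhole link served to this visitor.
        return hmac.new(SECRET, ip.encode(), hashlib.sha256).hexdigest()

    def token_is_valid(ip, token):
        # Only ban when the token matches the IP actually making the request.
        return hmac.compare_digest(trap_token(ip), token)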
rarestblog, over 14 years ago
robots.txt is malformed in this example:

    Disallow: /*/blackhole/*

This line won't work even for good robots (robots.txt doesn't have wildcard characters).
WillyF, over 14 years ago
Wouldn't using a hidden link also subject you to a possible penalty from Google and other search engines?
pornel, over 14 years ago
The implementation is very weak. It reads the whole blacklist line by line (could have used sqlite at least), and uses extract() to emulate the register_globals misfeature on hosts that have disabled it (and it doesn't even check for disabled register_globals properly).
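Along the lines pornel suggests, an sqlite-backed ban list avoids rescanning a flat file on every request; a bare-bones Python sketch (schema and filename are made up for illustration):

    import sqlite3

    conn = sqlite3.connect("blackhole.db")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS banned (ip TEXT PRIMARY KEY, banned_at INTEGER)"
    )

    def is_banned(ip):
        # Indexed primary-key lookup instead of a line-by-line file scan.
        return conn.execute("SELECT 1 FROM banned WHERE ip = ?", (ip,)).fetchone() is not None

    def ban(ip):
        conn.execute(
            "INSERT OR IGNORE INTO banned VALUES (?, strftime('%s','now'))", (ip,)
        )
        conn.commit()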
quellhorst, over 14 years ago
Would like to see how someone implements this type of blocking in a Rails app.