Detecting Bots in Apache and Nginx Logs Using Python

66 points by marklit about 8 years ago

4 comments

jonknee about 8 years ago
I've had solid success detecting bots with a really easy pattern--usage frequency. Humans don't make request after request for long periods of time, but bots almost all do. The time between requests is usually pretty consistent too; not a lot of humans wait X seconds between doing things. Nor do humans go without breaks (what are the odds a human has made a request every hour for 48 hours straight?).
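One way the frequency heuristic above might be sketched in Python: group request timestamps by client, then flag clients whose gaps between requests are suspiciously uniform. The function name `flag_regular_clients`, the `(ip, unix_timestamp)` input shape, and both thresholds are assumptions made for illustration; they do not come from the article or the comment, and real logs would need tuning.

```python
from collections import defaultdict
from statistics import mean, pstdev


def flag_regular_clients(requests, min_requests=20, max_cv=0.1):
    """Return the set of IPs whose request timing looks machine-like.

    requests: iterable of (ip, unix_timestamp) pairs (assumed input shape).
    min_requests and max_cv are illustrative thresholds, not tuned values.
    """
    by_ip = defaultdict(list)
    for ip, ts in requests:
        by_ip[ip].append(ts)

    flagged = set()
    for ip, stamps in by_ip.items():
        if len(stamps) < min_requests:
            continue  # too few requests to judge regularity
        stamps.sort()
        gaps = [b - a for a, b in zip(stamps, stamps[1:])]
        avg = mean(gaps)
        # Coefficient of variation: spread of the gaps relative to their mean.
        # A zero mean (all requests at the same instant) is also flagged.
        if avg == 0 or pstdev(gaps) / avg <= max_cv:
            flagged.add(ip)
    return flagged
```

A client polling every 30 seconds has a coefficient of variation near zero, while a person's gaps typically vary widely, so the ratio separates the two without needing any blacklist.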
languagehacker about 8 years ago
I was hoping there would be some machine learning in here. This just seems to be cross referencing a couple of different data sources.
orf about 8 years ago
Seems to be more 'filtering access logs by a blacklist' than actually detecting bots.

I run a VPN through Hetzner, so requests from my IP are not a bot (I hope!). Really you want to look at the paths (filtering out all the /w00tw00t requests) and the user agents above all, which the author touches on. However a whitelist approach is better than a blacklist IMO.

Also in the `in_block` you really want to hoist the `IPAddress(ip)` call out of the `any()` loop!
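The hoisting suggestion above could look roughly like this. The article's actual `in_block` is not reproduced in this thread, so this is a sketch under assumptions: it uses the standard-library `ipaddress` module in place of whatever library provides `IPAddress` in the original, and the `BLOCKS` list holds reserved documentation ranges rather than real hosting-provider blocks.

```python
from ipaddress import ip_address, ip_network

# Placeholder CIDR blocks (reserved documentation ranges), not real provider data.
BLOCKS = [ip_network("192.0.2.0/24"), ip_network("198.51.100.0/24")]


def in_block(ip, blocks=BLOCKS):
    """Return True if the address string `ip` falls inside any of `blocks`."""
    # Parse the address once, before the loop. Writing
    # any(ip_address(ip) in block for block in blocks) would re-parse
    # the same string for every block that gets checked.
    addr = ip_address(ip)
    return any(addr in block for block in blocks)
```

With thousands of blocks and millions of log lines, parsing once per line instead of once per block comparison removes most of the redundant work.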
guillem_lefait about 8 years ago
You may also want to add the Amazon IP ranges: http://docs.aws.amazon.com/general/latest/gr/aws-ip-ranges.html
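A minimal sketch of folding those ranges into a block list, assuming the JSON layout AWS publishes for its IP ranges (a `prefixes` list with `ip_prefix` entries and an `ipv6_prefixes` list with `ipv6_prefix` entries). The helper name `aws_networks` is hypothetical, and the sample document below is made up from reserved documentation ranges; in practice the decoded JSON would come from the file the linked page describes.

```python
import json
from ipaddress import ip_address, ip_network


def aws_networks(doc):
    """Turn a decoded AWS ip-ranges document into a list of network objects."""
    nets = [ip_network(p["ip_prefix"]) for p in doc.get("prefixes", [])]
    nets += [ip_network(p["ipv6_prefix"]) for p in doc.get("ipv6_prefixes", [])]
    return nets


# Made-up sample in the assumed layout; these are not real AWS ranges.
SAMPLE = json.loads("""
{
  "prefixes": [
    {"ip_prefix": "192.0.2.0/24", "region": "example-1", "service": "EC2"}
  ],
  "ipv6_prefixes": [
    {"ipv6_prefix": "2001:db8::/32", "region": "example-1", "service": "EC2"}
  ]
}
""")

networks = aws_networks(SAMPLE)
# Parse the address once, then test it against every network.
addr = ip_address("192.0.2.10")
is_aws = any(addr in net for net in networks)
```

As orf's VPN example shows for Hetzner, a match against a hosting provider's ranges is a hint rather than proof that a request came from a bot.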