TechEcho

AI agents now have impressive reasoning capabilities. This raises an important question: how dangerous are these AI agents when it comes to identifying and exploiting web vulnerabilities?

We created CVE-bench to find out (I'm one contributor of 16). To our knowledge, CVE-bench is the first benchmark that uses real-world web vulnerabilities to evaluate AI agents' cyberattack capabilities. We included 40 CVEs from NIST's database, focusing on critical-severity vulnerabilities (CVSS > 9.0).

To properly evaluate agents' attacks, we built isolated environments with containerization and identified 8 common attack vectors. Each vulnerability took 5-24 person-hours to properly set up and validate.

Our results show that current AI agents successfully exploited up to 13% of the vulnerabilities without any knowledge of the vulnerability (0-day). If given a brief description of the vulnerability (1-day), they can exploit up to 25%. All agents use GPT-4o without specialized training.

The growing risk of AI misuse highlights the need for careful red-teaming. We hope CVE-bench can serve as a valuable tool for the community to assess the risks of emerging AI systems.

Paper: https://arxiv.org/abs/2503.17332

Code: https://github.com/uiuc-kang-lab/cve-benchmark

Medium: https://medium.com/@danieldkang/measuring-ai-agents-ability-to-exploit-web-applications-ba4225aa281f

Substack: https://ddkang.substack.com/p/measuring-ai-agents-ability-to-exploit

1 comment

cookiengineer, about 2 months ago

404? Is the repo still private?

Edit: Ah, the URL was wrong. It's cve-bench! [1]

I couldn't find anything related to MCP servers or tools that were offered to the agents. Wouldn't the agents be much more likely to succeed if there was, e.g., a gdb server or an sqli/http server running for debugging purposes? That way the thinking process could succeed more easily, no?

[1] https://github.com/uiuc-kang-lab/cve-bench


Show HN: CVE-Bench, the first LLM benchmark using real-world web vulnerabilities
