Show HN: CVE-Bench, the first LLM benchmark using real-world web vulnerabilities

6 points by jbenn about 2 months ago
AI agents now have impressive reasoning capabilities. This raises an important question: how dangerous are these AI agents at identifying & exploiting web vulnerabilities?

We created CVE-bench to find out (I'm one of 16 contributors). To our knowledge, CVE-bench is the first benchmark using real-world web vulnerabilities to evaluate AI agents' cyberattack capabilities. We included 40 CVEs from NIST's database, focusing on critical-severity vulnerabilities (CVSS > 9.0).

To properly evaluate agents' attacks, we built isolated environments with containerization and identified 8 common attack vectors. Each vulnerability took 5-24 person-hours to properly set up and validate.

Our results show that current AI agents successfully exploited up to 13% of vulnerabilities without any knowledge of the vulnerability (0-day). If given a brief description of the vulnerability (1-day), they can exploit up to 25%. The agents all use GPT-4o without specialized training.

The growing risk of AI misuse highlights the need for careful red-teaming. We hope CVE-bench can serve as a valuable tool for the community to assess the risks of emerging AI systems.

Paper: https://arxiv.org/abs/2503.17332

Code: https://github.com/uiuc-kang-lab/cve-benchmark

Medium: https://medium.com/@danieldkang/measuring-ai-agents-ability-to-exploit-web-applications-ba4225aa281f

Substack: https://ddkang.substack.com/p/measuring-ai-agents-ability-to-exploit
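For readers who want to picture the selection criterion mentioned above (critical severity, CVSS > 9.0), here is a minimal sketch. It is not taken from the CVE-bench code: the `CveRecord` type, its field names, and the sample entries are all hypothetical placeholders, and the real benchmark's selection process may involve further steps.

```python
from dataclasses import dataclass


# Hypothetical record type for illustration; not the NVD schema
# and not a structure used by CVE-bench.
@dataclass(frozen=True)
class CveRecord:
    cve_id: str
    cvss_base_score: float


def select_critical(records, threshold=9.0):
    """Keep records whose CVSS base score is strictly above the threshold."""
    return [r for r in records if r.cvss_base_score > threshold]


# Placeholder entries, not real CVEs.
sample = [
    CveRecord("CVE-0000-0001", 9.8),
    CveRecord("CVE-0000-0002", 9.0),
    CveRecord("CVE-0000-0003", 7.5),
]

critical = select_critical(sample)
print([r.cve_id for r in critical])
```

Note that the comparison is strict, so a score of exactly 9.0 is excluded under this reading of "CVSS > 9.0".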

1 comment

cookiengineer about 2 months ago
404? Is the repo still private?

Edit: Ah, the URL was wrong. It's cve-bench!

I couldn't find anything related to MCP servers or tools that were offered to the agents. Wouldn't the agents be much more likely to succeed if there was e.g. a gdb server or an sqli/http server running for debugging purposes? That way the thinking process could succeed more easily, no?

[1] https://github.com/uiuc-kang-lab/cve-bench