A small thing, but I found the author's project-organization practices useful: creating individual .prompt files for the system prompt, background information, and auxiliary instructions [1], and then running them through `llm`.

It reveals how good LLM use, like any other engineering tool, requires good engineering thinking for the best results: methodical, and oriented around thoughtful specifications that balance design constraints.

[1] https://github.com/SeanHeelan/o3_finds_cve-2025-37899
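For anyone curious what that looks like mechanically, here's a minimal sketch of the same idea in Python (rather than the `llm` CLI); the .prompt file names and the choice of model are my assumptions, not the author's exact setup:

```python
# Stitch separate .prompt files (system prompt, background, task) into one
# request, roughly mirroring the repo's layout. File names are hypothetical.
from pathlib import Path
from openai import OpenAI

system = Path("system_prompt.prompt").read_text()
background = Path("ksmbd_background.prompt").read_text()
task = Path("audit_task.prompt").read_text()

client = OpenAI()
resp = client.chat.completions.create(
    model="o3",  # any strong reasoning model you have API access to
    messages=[
        {"role": "system", "content": system},
        {"role": "user", "content": background + "\n\n" + task},
    ],
)
print(resp.choices[0].message.content)
```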
The article cites a signal-to-noise ratio of ~1:50. The author is clearly deeply familiar with this codebase and is thus well-positioned to triage the signal from the noise. Automating *this* part will be where the real wins are, so I'll be watching this closely.
The most interesting and significant bit of this article for me was that the author ran this search for vulnerabilities 100 times for each of the models. That's significantly more computation than I've historically been willing to expend on most of the problems that I try with large language models, but maybe I should let the models go brrrrr!
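If you do decide to let the models go brrrr, the mechanical part is cheap to script. Here's a rough sketch of firing the same audit prompt 100 times and tallying hits; the model name, the prompt file, and the grep for "use-after-free" are all placeholders:

```python
# Run the same audit prompt N times and count how often the known bug is found.
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from openai import OpenAI

client = OpenAI()
PROMPT = Path("audit.prompt").read_text()  # hypothetical combined prompt

def one_run(_):
    resp = client.chat.completions.create(
        model="o3",
        messages=[{"role": "user", "content": PROMPT}],
    )
    return resp.choices[0].message.content

with ThreadPoolExecutor(max_workers=10) as pool:
    reports = list(pool.map(one_run, range(100)))

hits = sum("use-after-free" in r.lower() for r in reports)
print(f"{hits}/100 runs mention a use-after-free")
```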
Either I'm very lucky or, as I suspected, Gemini 2.5 Pro can more easily identify the vulnerability. My success rate is high enough that running the following prompt a few times is sufficient: https://gist.github.com/antirez/8b76cd9abf29f1902d46b2aed3cdc1bf
This has become a common pattern recently.

Take a problem with a clear definition and an evaluation function, and let the LLM reduce the size of the solution space. LLMs are very good at pattern reconstruction, and if the solution has a similar pattern to something known before, this can work very well.

In this case the problem is a specific type of security vulnerability and the evaluator is the expert. This is similar in spirit to other recent endeavors where LLMs are used in genetic optimization, just on a different scale.

Here's an interesting read on "Mathematical discoveries from program search with large language models", which I believe was also featured on HN in the past:

https://www.nature.com/articles/s41586-023-06924-6

One small note: concluding that the LLM is "reasoning" about code based on _this experiment alone_ is a bit of a stretch IMHO.
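The structure is simple enough to sketch; both functions below are placeholders, with the proposer standing in for the LLM samples and the evaluator standing in for the expert (or a PoC harness):

```python
# Propose-then-verify: the LLM only narrows the search space; ground truth
# comes from a separate, more expensive evaluator.
def proposer(code: str) -> list[str]:
    # Placeholder for n LLM samples over the code under audit.
    return ["candidate report A", "candidate report B"]

def evaluator(report: str) -> bool:
    # Placeholder for the expensive step: an expert review, a crash repro,
    # or (in the FunSearch setting) a program scorer.
    return "use-after-free" in report.lower()

candidates = proposer("... ksmbd session handling code ...")
confirmed = [r for r in candidates if evaluator(r)]
print(confirmed)
```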
I really hope this is legit and not what keeps happening to curl [1].

[1] https://daniel.haxx.se/blog/2024/01/02/the-i-in-llm-stands-for-intelligence/
I'm not sure about the assertion that this is the first vulnerability found with an LLM. OSS-Fuzz [0], for example, has found a few using fuzzing, and Big Sleep [1] has done so with an agent approach.

[0] https://security.googleblog.com/2024/11/leveling-up-fuzzing-finding-more.html?m=1

[1] https://googleprojectzero.blogspot.com/2024/10/from-naptime-to-big-sleep.html?m=1
Given the value of finding zero-days, pretty much every intelligence agency in the world is going to be pouring money into this if it can reliably find them with just a few hundred API calls. Especially if you can fine-tune a model with lots of examples, which I don't think OpenAI etc. are going to do with any public API.
I know there were at least a few kernel devs who "validated" this bug, but did anyone actually build a PoC and test it? It's such a critical piece of the process, yet a proof of concept is completely omitted. If you don't have a PoC, you don't know what sort of hiccups would come up along the way, and therefore you can't determine exploitability or impact. At least the author avoided calling it an RCE without validation.

But what if there's a missing piece of the puzzle that the author and the devs missed or assumed o3 covered, but that was in fact outside o3's context, and that would invalidate this vulnerability?

I'm not saying there is, nor am I going to take the time to do the author's work for them; rather, I am saying this report is not fully validated, which feels like a dangerous precedent to set with what will likely be an influential blog post in the LLM VR space going forward.

IMO the idea of PoC || GTFO should be applied more strictly than ever to any vulnerability report generated by a model.

The underlying point, that o3 is much better than previous or other current models, still stands, and the methodology is still interesting. I understand the desire and need to get people to focus on something by wording it a specific way; it's the clickbait problem. But dammit, do better. Build a PoC and validate your claims, don't be lazy. If you're going to write a blog post that might influence how vulnerability researchers conduct their research, you should promote validation, not theoretical assumption. The alternative is the proliferation of ignorance through false-but-seemingly-true reporting, versus deepening the community's understanding of a system through vetted and provable reports.
There's a beautiful little snippet here that perfectly captures how most of my prompt development sessions go:

> *I tried to strongly guide it to not report false positives, and to favour not reporting any bugs over reporting false positives. I have no idea if this helps, but I'd like it to help, so here we are. In fact my entire system prompt is speculative in that I haven't ran a sufficient number of evaluations to determine if it helps or hinders, so consider it equivalent to me saying a prayer, rather than anything resembling science or engineering. Once I have ran those evaluations I'll let you know.*
My understanding is that ksmbd is a kernel-space SMB server "developed as a lightweight, high-performance alternative" to the traditional (user-space) Samba server...

Q1: Who is using ksmbd in production?

Q2: Why?
Anyone else feel like this is a best-case application for LLMs?

You could in theory automate the entire process, treating the LLM as a very advanced fuzzer. Run it against your target in one or more VMs. If the VM crashes or otherwise exhibits anomalous behavior, you've found something. (Most exploits like this will crash the machine initially, before you refine them.)

On one hand: a great application for LLMs.

On the other hand: it conversely implies that demonstrating this doesn't mean *that* much.
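A rough sketch of that loop; everything concrete here (the target address, how candidate payloads are generated, how a "crash" is detected) is a placeholder, and a real setup would watch the hypervisor or serial console rather than a TCP connection:

```python
# LLM-as-fuzzer: send model-suggested payloads at a disposable test VM and
# flag anything that makes the target stop responding.
import socket

TARGET = ("192.0.2.10", 445)  # documentation address; point at a throwaway VM

def llm_candidate_payloads() -> list[bytes]:
    # Placeholder: in practice, ask the model for malformed session-setup
    # packets and decode its suggestions into raw bytes.
    return [b"\x00" * 64, b"\xff" * 64]

def target_alive() -> bool:
    try:
        with socket.create_connection(TARGET, timeout=2):
            return True
    except OSError:
        return False

for payload in llm_candidate_payloads():
    try:
        with socket.create_connection(TARGET, timeout=2) as s:
            s.sendall(payload)
    except OSError:
        print("target unreachable; reset the VM before continuing")
        break
    if not target_alive():
        print("anomaly after payload:", payload[:16].hex())
        # Snapshot/reset the VM here and hand the payload to a human for triage.
        break
```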
We followed a very similar approach at work: we created a test harness and tested all the models available in AWS Bedrock plus the OpenAI ones. We wrote our own code challenges, not available on the Internet for training, with vulnerable and non-vulnerable inline snippets as well as more contextual multi-file bugs. We also used 100 tests per challenge. I wanted to do 1,000 tests per challenge, but realized that these models are not even close to two-sigma accuracy!
Overall we found very similar results. We were also able to increase accuracy using additional methods, which come at additional cost. The main issue we found is that when dealing with large codebases you need to put blinders on the LLM and shorten its context window so that hallucinated results are less likely. The worst thing would be to chase red herrings. Perhaps in 5 years we'll have models tuned for more engineering-specific tasks that can be rated at six-sigma accuracy when posed the same questions and problem sets.
> With o3 you get something that feels like a human-written bug report, condensed to just present the findings, whereas with Sonnet 3.7 you get something like a stream of thought, or a work log.

This is likely because the author didn't give Claude a scratchpad or space to think, essentially forcing it to mix its thoughts with its report. I'd be interested to see whether using the official thinking mechanism gives it enough space to produce different results.
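For reference, a hedged sketch of what that would look like with Anthropic's extended-thinking parameter; the model string and token budgets are assumptions, so check the current API docs before copying this shape:

```python
# Give the model an explicit scratchpad (thinking blocks) so the final text
# block can be just the condensed report.
import anthropic

client = anthropic.Anthropic()
resp = client.messages.create(
    model="claude-3-7-sonnet-20250219",   # assumed model id
    max_tokens=8000,                      # must exceed the thinking budget
    thinking={"type": "enabled", "budget_tokens": 4000},
    messages=[{"role": "user", "content": "Audit this code for use-after-free bugs: ..."}],
)
# Thinking arrives in separate blocks from the final answer, so the work log
# doesn't have to leak into the bug report itself.
report = "".join(block.text for block in resp.content if block.type == "text")
print(report)
```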
I think an approach like AlphaEvolve is very likely to work well in this space.

You've got all the elements of a successful optimization algorithm: 1) a fast, good-enough sampling function and 2) a fairly good energy function.

For 1), this post shows that LLMs (even unoptimized) are quite good at sampling candidate vulnerabilities in large code bases. A 1% accuracy rate isn't bad at all, and they can be made quite fast (or at least very parallelizable).

For 2), you can in theory test any exploit programmatically and determine whether it works. The main challenge is getting the energy function to provide a gradient: some signal that tells you when you're close to finding a vulnerability/exploit.

I expect we'll see such a system within the next 12 months (or maybe not, since it's the kind of system that many lettered agencies would be very interested in).
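A toy sketch of that loop, with the LLM sampler and the energy function stubbed out as placeholders:

```python
# AlphaEvolve-flavoured search: cheap LLM sampling, an energy/score function,
# and selection pressure in between. Both functions are stand-ins.
import random

def sample_refinement(lead: str) -> str:
    # Placeholder for: "here is a promising lead, propose a sharper variant".
    return lead + " (refined)"

def energy(lead: str) -> float:
    # Placeholder score; in practice: does a PoC build, does the target crash,
    # how far into the protocol does the payload get, etc.
    return random.random()

population = ["initial lead from a broad audit pass"]
for generation in range(10):
    population += [sample_refinement(lead) for lead in population]
    # Keep only the best-scoring candidates; sampling is cheap, scoring is not.
    population = sorted(population, key=energy, reverse=True)[:5]

print(population[0])
```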
I think this is the biggest alignment problem with LLMs in the short term. They are getting scarily good at this.

I recently found a pretty serious security vulnerability in a very niche open-source server I sometimes use. It took virtually no effort with LLMs. I'm worried that there is a huge long tail of software out there that wasn't worth hunting for vulnerabilities in for nefarious purposes when it had to be done manually, but that could lead to really serious problems once the hunt is automated.
I feel like our jobs are reasonably secure for a while because the LLM didn't immediately say "SMB implemented in the kernel, are you f-ing joking!?"
I'll have to get my facts straight, but I'm pretty sure that ksmbd is ... not used much (by me).

https://lwn.net/Articles/871866/

This also has nothing to do with Samba, which is a well-trodden path.

So why not attack a codebase that is rather more heavily used and older? Why not go for vi?
Meanwhile, as a maintainer, I've been reviewing more than a dozen false-positive slop CVEs against my library, and not a single one found an actual issue. This article is probably going to make my situation worse.
The scary part of this is that the bad guys are doing the same thing. They're looking for zero-day exploits, and their ability to find them just got better. More importantly, it's now almost automated. While the arms race will always continue, I wonder if this change in speed hurts the good guys more than the bad guys. There are many of these bugs, and they take time to fix.
This made me think that the near future will be LLMs trained specifically on Linux or another large project. The source code is only a small part of the dataset that could be fed to such LLMs. The more interesting part is runtime data flow, similar to what we observe in a debugger. Looking at the codebase alone is like trying to understand a waterfall by looking at the equations that describe the water flow.
This is a great case study. I wonder how hard o3 would find it to build a minimal repro for these vulns? That would of course make it easier to identify true positives and discard false positives.

This is, I suppose, an area where the engineer can apply their expertise to build a validation rig that the LLM can then make use of.
This is why AI safety is going to be impossible. This could easily have been a bad actor planning to use the finding for nefarious purposes. A person can just lie, and there really isn't *any* safety fine-tuning that would let the model separate the two intents.
Wow, interesting. I've been hacking on a tool called https://diffwithgpt.com with a similar angle, but indexing git changelogs with Qwen to have it raise risks for backward-compat issues, including security risks, when upgrading k8s and the like.
I have a presentation here on using this approach to target zk bugs: https://youtu.be/MN2LJ5XBQS0?si=x3nX1iQy7iex0K66
> *It is interesting by virtue of being part of the remote attack surface of the Linux kernel.*

...if your Linux kernel has ksmbd built into it; that's a much smaller interest group.
Are there better tools for finding this? It feels like the sort of thing static analysis should reliably find, but it's in the Linux kernel, so you'd think either coding standards or tooling around these sorts of C bugs would be mature.
Notable:

> o3 finds the kerberos authentication vulnerability in 8 of the 100 runs

And I'd guess this only became a blog post because the author already knew about the vuln and was just curious to see whether the intern could spot it too, given a curated subset of the codebase.