
I used o3 to find a remote zeroday in the Linux SMB implementation

658 points by zielmicha, 14 days ago

38 comments

nxobject 14 days ago
A small thing, but I found the author's project-organization practices useful – creating individual .prompt files for the system prompt, background information, and auxiliary instructions [1], and then running it all through `llm`.

It reveals how good LLM use, like any other engineering tool, requires good engineering thinking – methodical, and oriented around thoughtful specifications that balance design constraints – for best results.

[1] https://github.com/SeanHeelan/o3_finds_cve-2025-37899
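A minimal sketch of that workflow via the `llm` package's Python API (the same tool also has a CLI). The .prompt file names, the model id, and the input file here are assumptions for illustration, not the linked repo's exact layout:

```python
# Sketch of the .prompt-files-through-`llm` workflow. Assumes Simon
# Willison's `llm` package (pip install llm) plus a plugin that provides
# the chosen model; file names and model id are illustrative only.
from pathlib import Path

import llm

system = Path("system_prompt.prompt").read_text()
context = "\n\n".join(
    Path(name).read_text()
    for name in ("background.prompt", "auxiliary_instructions.prompt")
)
code_under_audit = Path("ksmbd_session_code.txt").read_text()

model = llm.get_model("o3")  # any model id your llm install supports
response = model.prompt(context + "\n\n" + code_under_audit, system=system)
print(response.text())
```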
Retr0id 14 days ago
The article cites a signal-to-noise ratio of ~1:50. The author is clearly deeply familiar with this codebase and is thus well-positioned to triage the signal from the noise. Automating *this* part will be where the real wins are, so I'll be watching this closely.
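One low-tech way to start automating that triage is to deduplicate findings across repeated runs and review the recurring ones first, on the theory that real bugs recur more consistently than hallucinated ones. A minimal sketch, where `run_audit` is a hypothetical function returning one report per model invocation:

```python
# Crude triage sketch: run the audit many times, bucket reports by a
# normalized fingerprint, and surface the most frequent buckets first.
import re
from collections import Counter

def fingerprint(report: str) -> str:
    """Reduce a report to its distinctive identifiers (function names etc.)."""
    tokens = re.findall(r"[a-z_][a-z0-9_]{3,}", report.lower())
    return " ".join(sorted(set(tokens))[:25])

def triage(run_audit, n_runs: int = 100):
    buckets = Counter()
    samples = {}
    for _ in range(n_runs):
        report = run_audit()          # one model invocation
        key = fingerprint(report)
        buckets[key] += 1
        samples.setdefault(key, report)
    # Highest-frequency buckets first: least likely to be one-off noise.
    return [(count, samples[key]) for key, count in buckets.most_common()]
```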
iandanforth 14 days ago
The most interesting and significant bit of this article for me was that the author ran this search for vulnerabilities 100 times for each of the models. That's significantly more computation than I've historically been willing to expend on most of the problems that I try with large language models, but maybe I should let the models go brrrrr!
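The cost is mostly API latency rather than local compute, and each run is independent, so the 100 samples parallelize trivially. A sketch, again assuming a hypothetical `run_audit` function for a single run:

```python
# Fan out N independent audit runs concurrently; each call is I/O-bound,
# so threads are enough. `run_audit` is a hypothetical single-run function.
from concurrent.futures import ThreadPoolExecutor

def sample_many(run_audit, n: int = 100, workers: int = 10) -> list[str]:
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda _: run_audit(), range(n)))
```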
antirez 13 days ago
Either I'm very lucky or, as I suspected, Gemini 2.5 Pro can more easily identify the vulnerability. My success rate is so high that running the following prompt a few times is enough: https://gist.github.com/antirez/8b76cd9abf29f1902d46b2aed3cdc1bf
geraneum 13 days ago
This has become a common recurrence recently.

Take a problem with a clear definition and an evaluation function, and let the LLM reduce the size of the solution space. LLMs are very good at pattern reconstruction, and if the solution has a similar pattern to something known before, this can work very well.

In this case the problem is a specific type of security vulnerability and the evaluator is the expert. This is similar in spirit to other recent endeavors where LLMs are used in genetic optimization, just on a different scale.

Here's an interesting read on "Mathematical discoveries from program search with large language models", which I believe was also featured on HN in the past:

https://www.nature.com/articles/s41586-023-06924-6

One small note: concluding that the LLM is "reasoning" about code _based on this experiment alone_ is a bit of a stretch IMHO.
firesteelrain 14 days ago
I really hope this is legit and not what keeps happening to curl [1].

[1] https://daniel.haxx.se/blog/2024/01/02/the-i-in-llm-stands-for-intelligence/
meander_water 14 days ago
I'm not sure about the assertion that this is the first vulnerability found with an LLM. For example, OSS-Fuzz [0] has found a few using fuzzing, and Big Sleep [1] using an agent approach.

[0] https://security.googleblog.com/2024/11/leveling-up-fuzzing-finding-more.html?m=1

[1] https://googleprojectzero.blogspot.com/2024/10/from-naptime-to-big-sleep.html?m=1
empath75 14 days ago
Given the value of finding zero days, pretty much every intelligence agency in the world is going to be pouring money into this if it can reliably find them with just a few hundred API calls. Especially if you can fine-tune a model with lots of examples, which I don't think OpenAI etc. are going to do with any public API.
stonepresto 13 days ago
I know there were at least a few kernel devs who "validated" this bug, but did anyone actually build a PoC and test it? It's such a critical piece of the process, yet a proof of concept is completely omitted. If you don't have a PoC, you don't know what sort of hiccups would come up along the way, and therefore can't determine exploitability or impact. At least the author avoided calling it an RCE without validation.

But what if there's a missing piece of the puzzle that the author and devs missed or assumed o3 covered, but that was in fact out of o3's context, and that would invalidate this vulnerability?

I'm not saying there is, nor am I going to take the time to do the author's work for them; rather, I am saying this report is not fully validated, which feels like a dangerous precedent to set with what will likely be an influential blog post in the LLM VR space going forward.

IMO the idea of PoC || GTFO should be applied more strictly than ever to any vulnerability report generated by a model.

The underlying point that o3 is much better than previous or other current models still stands, and the methodology is still interesting. I understand the desire and need to get people to focus on something by wording it a specific way; it's the clickbait problem. But dammit, do better. Build a PoC and validate your claims, don't be lazy. If you're going to write a blog post that might influence how vulnerability researchers conduct their research, you should promote validation, not theoretical assumption. The alternative is the proliferation of ignorance through false-but-seemingly-true reporting, versus deepening the community's understanding of a system through vetted and provable reports.
simonw 14 days ago
There's a beautiful little snippet here that perfectly captures how most of my prompt development sessions go:

> *I tried to strongly guide it to not report false positives, and to favour not reporting any bugs over reporting false positives. I have no idea if this helps, but I'd like it to help, so here we are. In fact my entire system prompt is speculative in that I haven't ran a sufficient number of evaluations to determine if it helps or hinders, so consider it equivalent to me saying a prayer, rather than anything resembling science or engineering. Once I have ran those evaluations I'll let you know.*
logifail 14 days ago
My understanding is that ksmbd is a kernel-space SMB server "developed as a lightweight, high-performance alternative" to the traditional (user-space) Samba server...

Q1: Who is using ksmbd in production?

Q2: Why?
zielmicha 14 days ago
(To be clear, I'm not the author of the post; the title just starts with "How I")
eqvinox 14 days ago
Anyone else feel like this is a best-case application for LLMs?

You could in theory automate the entire process and treat the LLM as a very advanced fuzzer. Run it against your target in one or more VMs. If the VM crashes or otherwise exhibits anomalous behavior, you've found something. (Most exploits like this will crash the machine initially, before you refine them.)

On one hand: great application for LLMs.

On the other hand: conversely, this implies that demonstrating it doesn't mean *that* much.
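A minimal sketch of that loop, treating the LLM as a payload generator and a liveness probe as the crash oracle. `propose_payload`, `send_to_target`, the reset script, and the guest's port are all hypothetical placeholders, not anything from the article:

```python
# Fuzz-style harness sketch: an LLM proposes candidate payloads, each one
# is thrown at a disposable VM restored from a clean snapshot, and a
# liveness probe decides whether the guest survived.
import socket
import subprocess

def vm_alive(host: str = "localhost", port: int = 2222) -> bool:
    """Probe the guest's forwarded SSH port as a crude liveness check."""
    try:
        with socket.create_connection((host, port), timeout=5):
            return True
    except OSError:
        return False

def run_campaign(propose_payload, send_to_target, rounds: int = 100):
    crashes = []
    for i in range(rounds):
        subprocess.run(["./reset-vm.sh"], check=True)  # restore clean snapshot
        payload = propose_payload()                    # LLM-generated candidate
        send_to_target(payload)                        # e.g. speak SMB to the guest
        if not vm_alive():
            crashes.append((i, payload))               # guest died: investigate
    return crashes
```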
KTibow 14 days ago
> *With o3 you get something that feels like a human-written bug report, condensed to just present the findings, whereas with Sonnet 3.7 you get something like a stream of thought, or a work log.*

This is likely because the author didn't give Claude a scratchpad or space to think, essentially forcing it to mix its thoughts with its report. I'd be interested to see if using the official thinking mechanism gives it enough space to get different results.
jp0001 13 days ago
We followed a very similar approach at work: we created a test harness and tested all the models available in AWS Bedrock and from OpenAI. We created our own code challenges, not available on the Internet for training, with vulnerable and non-vulnerable inline snippets and more contextual multi-file bugs. We also used 100 tests per challenge. I wanted to do 1000 tests per challenge, but realized that these models are not even close to 2 sigma in accuracy!

Overall we found very similar results. But we were also able to increase accuracy using additional methods, which come at additional cost.

The other issue we found is that when dealing with large codebases you'll need to put blinders on the LLMs to shorten context windows, so that hallucinated results are less likely. The worst thing would be to follow red herrings. Perhaps in five years we'll have models for more engineering-specific tasks that can be rated at Six Sigma accuracy when posed with the same questions and problem sets.
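A sketch of the kind of per-challenge harness described above, treating each run as a pass/fail trial. `ask_model` is a hypothetical function returning True when the model correctly classifies the challenge; the sigma thresholds use the conventional one-sided interpretation (Six Sigma with the usual 1.5-sigma shift means roughly 3.4 failures per million):

```python
# Per-challenge accuracy harness sketch. `ask_model` is hypothetical:
# True = the model flagged the planted bug (or correctly called a clean
# snippet clean) on this run.
def detection_rate(ask_model, challenge, n_runs: int = 100) -> float:
    passes = sum(ask_model(challenge) for _ in range(n_runs))
    return passes / n_runs

# Rough one-sided pass-rate thresholds for context.
SIGMA_THRESHOLDS = {"2-sigma": 0.9772, "3-sigma": 0.9987, "6-sigma": 0.9999966}

def rate_to_label(rate: float) -> str:
    passed = [name for name, t in SIGMA_THRESHOLDS.items() if rate >= t]
    return passed[-1] if passed else "below 2-sigma"
```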
resiros 13 days ago
I think an approach like AlphaEvolve is very likely to work well in this space.

You've got all the elements of a successful optimization algorithm: 1) a fast and good-enough sampling function, plus 2) a fairly good energy function.

For 1), this post shows that LLMs (even unoptimized) are quite good at sampling candidate vulnerabilities in large code bases. A 1% accuracy rate isn't bad at all, and they can be made quite fast (at least very parallelizable).

For 2), theoretically you can test any exploit easily and programmatically determine if it works. The main challenge is getting the energy function to provide gradient: some signal when you're close to finding a vulnerability/exploit.

I expect we'll see such a system within the next 12 months (or maybe not, since it's the kind of system that many lettered agencies would be very interested in).
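A sketch of that sample-then-score loop in an evolutionary style. `sample_candidates` (the LLM, optionally conditioned on promising prior candidates) and `energy` (the programmatic exploit check, returning a score in [0, 1]) are hypothetical placeholders:

```python
# Evolutionary search sketch: sample candidates with an LLM, score them
# with an energy function, and feed the best ones back as context for the
# next round of sampling.
import random

def evolve(sample_candidates, energy, generations: int = 20, pool_size: int = 50):
    population = sample_candidates(n=pool_size, seeds=[])
    for _ in range(generations):
        scored = sorted(population, key=energy, reverse=True)
        elites = scored[: pool_size // 5]  # keep the top 20%
        # Condition the next sampling round on a few of the best candidates.
        population = elites + sample_candidates(
            n=pool_size - len(elites),
            seeds=random.sample(elites, k=min(3, len(elites))),
        )
    return max(population, key=energy)
```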
martinald 14 days ago
I think this is the biggest alignment problem with LLMs in the short term, imo. They are getting scarily good at this.

I recently found a pretty serious security vulnerability in a very niche open source server I sometimes use. This took virtually no effort using LLMs. I'm worried that there is a huge long tail of software out there that wasn't worth probing for vulnerabilities manually for nefarious ends, but which, once the process is automated, could lead to really serious problems.
dboreham 14 days ago
I feel like our jobs are reasonably secure for a while, because the LLM didn't immediately say "SMB implemented in the kernel, are you f-ing joking!?"
gerdesj 14 days ago
I'll have to get my facts straight, but I'm pretty sure that ksmbd is ... not used much (by me).

https://lwn.net/Articles/871866/

This also has nothing to do with Samba, which is a well-trodden path.

So why not attack a codebase that is rather more heavily used and older? Why not go for vi?
zison 14 days ago
Very interesting. Is the bug it found exploitable in practice? Could this have been found by syzkaller?
mezyt 14 days ago
Meanwhile, as a maintainer, I've been reviewing more than a dozen false-positive slop CVEs in my library, and not a single one found an actual issue. This article is probably going to make my situation worse.
mehulashah 13 days ago
The scary part of this is that the bad guys are doing the same thing. They’re looking for zero day exploits, and their ability to find them just got better. More importantly, it’s now almost automated. While the arms race will always continue, I wonder if this change of speed hurts the good guys more than the bad guys. There are many of these, and they take time to fix.
akomtu 14 days ago
This made me think that the near future will be LLMs trained specifically on Linux or another large project. The source code is only a small part of the dataset fed to LLMs. More interesting is the runtime data flow, similar to what we observe in a debugger. Looking at the codebase alone is like trying to understand a waterfall by looking at the equations that describe the water flow.
davidgerard 14 days ago
This is just fuzzing with extra power consumption?
1oooqooq 13 days ago
I can ask o3 about that CVE offline and get a reply. Does that mean the author used a model that already knew about the vulnerability?
theptip 14 days ago
This is a great case study. I wonder how hard o3 would find it to build a minimal repro for these vulns? That would of course make it easier to identify true positives and discard false positives.

This is, I suppose, an area where the engineer can apply their expertise to build a validation rig that the LLM may be able to utilize.
jobswithgptcom 14 days ago
Wow, interesting. I've been hacking on a tool called https://diffwithgpt.com with a similar angle: indexing git changelogs with Qwen to have it raise risks for backward-compat issues, including security risks, when upgrading k8s etc.
qoez 13 days ago
This is why AI safety is going to be impossible. This could easily have been a bad actor who would use this finding for nefarious acts. A person can just lie, and there really isn't *any* safety finetuning that would let it separate the two intents.
baby 14 days ago
I have a presentation here on doing this to target zk bugs: https://youtu.be/MN2LJ5XBQS0?si=x3nX1iQy7iex0K66
mettamage 14 days ago
I wonder how often it will say there's a vulnerability where there is none. Running it 100 times is a lot.
ape4 14 days ago
Seems we need something like kernel modules but with memory protection
Hilift 14 days ago
Does the vulnerability exist in other implementations of SMB?
jokoon 13 days ago
Wow.

I think the NSA already has this, without the need for an LLM.
fsckboy 14 days ago
> *It is interesting by virtue of being part of the remote attack surface of the Linux kernel.*

...if your Linux kernel has ksmbd built into it; that's a much smaller interest group.
tomalbrc 14 days ago
"I brute-forced an AI to help me find potential zero-day bugs"
fHr 14 days ago
meanwhile boomers out here still thinking they are better than AI when even local Gemma 3 models can write better code than them already
dehrmann 14 days ago
Are there better tools for finding this? It feels like the sort of thing static analysis should reliably find, but it's in the Linux kernel, so you'd think either the coding standards or the tooling around these sorts of C bugs would be mature.
mdaniel 14 days ago
Notable:

> o3 finds the kerberos authentication vulnerability in 8 of the 100 runs

And I'd guess this only became a blog post because the author already knew about the vuln and was just curious to see if the intern could spot it too, given a curated subset of the codebase.