I read a lot of niggling comments here about whether Claude was really being smart in writing this GIF fuzzer. Of course it was trained on fuzzer source code. Of course it has read every blog post about esoteric boundary conditions in GIF parsers.<p>But to bring all of those things together and translate the concepts into working Python code is astonishing. We have just forgotten that a year ago, this achievement would have blown our minds.<p>I recently had to write an email to my kid’s school so that he could get some more support for a learning disability. I fed Claude 3 Opus a copy of his 35-page psychometric testing report along with a couple of his recent report cards and asked it to draft the email for me, making reference to things in the three documents provided. I also suggested it pay special attention to one of the test results.<p>The first email draft was ready to send. Sure, I tweaked a thing or two, but this saved me half an hour of digging through dense material written by a psychologist. After verifying that there were no factual errors, I hit “Send.” To me, it’s still magic.
I have kind of a pet peeve with people testing LLMs like this these days.<p>They take whatever it spits out on the first attempt, and then they extrapolate from that to draw all kinds of conclusions. They forget that the output is sampled: it depends on a random seed, and a new attempt (with a new seed) can give a totally different answer.<p>If the author had retried that prompt, the new attempt might have generated better code or much worse code. You cannot draw conclusions from just one answer.
You could likely also combine the LLM with a coverage tool to provide additional guidance when regenerating the fuzzer: "Your fuzzer missed lines XX-YY in the code. Explain why you think the fuzzer missed those lines, describe inputs that might reach those lines in the code, and then update the fuzzer code to match your observations."<p>This approach could likely also be combined with RL; the code coverage provides a decent reward signal.
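A rough sketch of that feedback loop in Python, assuming a hypothetical gif_parser module under test and an ask_llm() helper wrapping whatever model API you use (both are my stand-ins, not from the article):

    import coverage
    import gif_parser  # hypothetical module under test

    FEEDBACK = ("Your fuzzer missed lines {missing} in gif_parser.py. Explain why you "
                "think the fuzzer missed those lines, describe inputs that might reach "
                "them, then update the fuzzer code.\n\nCurrent fuzzer:\n{src}")

    def ask_llm(prompt: str) -> str:
        """Placeholder: call your LLM API of choice and return generated Python source."""
        raise NotImplementedError

    def run_and_measure(fuzzer_src: str) -> str:
        """Run the LLM-written fuzzer against the parser; return the missed lines."""
        ns = {}
        exec(fuzzer_src, ns)                      # expected to define generate_inputs(n)
        cov = coverage.Coverage()
        cov.start()
        for data in ns["generate_inputs"](500):
            try:
                gif_parser.parse_gif(data)
            except Exception:
                pass                              # exceptions and crashes are the point
        cov.stop()
        # analysis2() returns (filename, statements, excluded, missing, missing_str)
        return cov.analysis2(gif_parser)[4]

    src = ask_llm("Write a Python function generate_inputs(n) yielding n malformed GIF files.")
    for _ in range(5):                            # a few refinement rounds
        missing = run_and_measure(src)
        if not missing:
            break
        src = ask_llm(FEEDBACK.format(missing=missing, src=src))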
It seems to overlook that the language model was trained on a large corpus of code, which probably includes structured fuzzers for file formats such as GIF. Plus, the scope of the "unknown" format introduced is limited.
Why wouldn't you have an LLM write some code that uses something like libfuzzer instead?<p>That way you get an efficient, robust, coverage-driven fuzzing engine, rather than having the LLM poorly reinvent that part of the wheel. Let the LLM help write the boilerplate harness code for you.
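For a Python target, that is roughly what Atheris (Google's libFuzzer-based Python fuzzer) gives you. A minimal harness sketch, again assuming a hypothetical gif_parser module with a parse_gif() function:

    import sys
    import atheris

    # Instrument imports so the coverage-guided engine can see the code under test.
    with atheris.instrument_imports():
        import gif_parser  # hypothetical module under test

    def TestOneInput(data: bytes):
        try:
            gif_parser.parse_gif(data)
        except ValueError:
            pass  # expected parse errors; crashes and hangs are what we're hunting

    atheris.Setup(sys.argv, TestOneInput)
    atheris.Fuzz()

The coverage-guided mutation engine then does the heavy lifting; the LLM's job shrinks to writing the harness (and perhaps a structure-aware mutator).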
I don't understand why getting LLMs to generate <i>code</i> that creates fuzzing data has become a 'thing'.<p>Logically, LLMs should be quite good at creating the fuzzing data themselves.<p>To state the obvious answer: it's too expensive to use LLMs directly, and this way works, since they found "4 memory safety bugs and one hang".<p>But the future we're heading toward is one where LLMs directly pentest/test the code. That is where it gets interesting and new.
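Using the LLM as the generator directly would look something like this sketch (ask_llm() is a hypothetical wrapper around whatever model API you use, and gif_parser is my stand-in for the target; neither is from the article):

    import base64

    import gif_parser  # hypothetical module under test

    def ask_llm(prompt: str) -> str:
        """Placeholder: call your LLM API of choice and return its text response."""
        raise NotImplementedError

    PROMPT = ("Produce 10 malformed GIF files likely to trigger parser bugs: "
              "truncated headers, bogus logical screen sizes, corrupt LZW data. "
              "Return one base64-encoded file per line and nothing else.")

    for line in ask_llm(PROMPT).splitlines():
        data = base64.b64decode(line)
        try:
            gif_parser.parse_gif(data)
        except Exception as exc:
            print(f"parser raised {exc!r} on a {len(data)}-byte input")

Every test case here costs an LLM call, which is the expense the comment points at; having the LLM write a fuzzer instead amortizes that cost over millions of inputs.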