In every single system I have worked on, tests were not just tests: they were their own parallel application, and they required careful architecture and constant refactoring to keep from getting out of hand.

"More tests" is not the goal. You need to write high-impact tests, and you need to think about how to cover as much of your app's surface as possible with the least amount of test code. Sometimes I spend more time on the test code than on the actual code (probably normal).

Also, I feel like people will be inclined to go with whatever the LLM gives them, as opposed to really sitting down and thinking about all the unhappy paths and edge cases of the UX. Using an autocomplete to "bang it out" seems foolish.
I actually tested Claude Sonnet to see how it would fare at writing a test suite for a background worker. My previous experience was with some version of GPT via Copilot, and it was... not good.

I was, however, extremely impressed with Claude this time around. Not only did it do a great job right off the bat, but it taught me some techniques and tricks available in the language/framework (Ruby, RSpec) which I wasn't familiar with.

I'm certain it helped that I had a decent prompt, asked it to consider all the potential user paths and edge cases, and also had a very good understanding of the code myself. Still, this was the first time I could honestly say that an LLM actually saved me time as a developer.
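The original suite was Ruby/RSpec and isn't shown, so as a rough illustration only, here is a hypothetical pytest sketch of the kind of edge-case coverage described for a background worker (the worker, its retry behavior, and all names are made up for this example):

```python
# Hypothetical worker and tests; names and behavior are illustrative only.
import pytest
from unittest.mock import MagicMock


class RetryingWorker:
    """Calls a job function, retrying up to max_retries times on failure."""

    def __init__(self, job, max_retries=3):
        self.job = job
        self.max_retries = max_retries

    def run(self):
        last_error = None
        for _ in range(self.max_retries):
            try:
                return self.job()
            except RuntimeError as err:  # only transient errors are retried
                last_error = err
        raise last_error


def test_returns_result_on_first_success():
    worker = RetryingWorker(lambda: "done")
    assert worker.run() == "done"


def test_retries_until_success():
    job = MagicMock(side_effect=[RuntimeError("flaky"), "done"])
    assert RetryingWorker(job).run() == "done"
    assert job.call_count == 2


def test_raises_after_exhausting_retries():
    job = MagicMock(side_effect=RuntimeError("down"))
    with pytest.raises(RuntimeError):
        RetryingWorker(job, max_retries=2).run()
    assert job.call_count == 2
```

The point is less the specific assertions than the shape of the suite: one happy path plus the unhappy paths (transient failure, retry exhaustion) that a "consider all the edge cases" prompt is meant to surface.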
I am very sceptical of the usefulness of LLM (or any AI) code generation, and it does not really have anything to do with AI itself.

In the past I've been involved in several projects that made deep use of MDA (Model Driven Architecture) techniques, which relied on various code generation methods to develop software. One of the main obstacles was maintaining the generated code.

IOW: how should we treat generated code?

If we treat it the same way as code produced by humans (i.e. we maintain it), then the maintenance cost grows (super-linearly) with the amount of code we generate. To make matters worse for LLMs: since the code they generate is buggy, we end up with more buggy code to maintain. Code review is not the answer, because code review is a weak tool for finding bugs.

This is unlike compilers (which also generate code), because we don't maintain code generated by compilers; we regenerate it any time we need to.

The fundamental issue is: for a given set of requirements, the goal is to produce less code, not more. _Any_ code generation, however smart it might be, goes against this goal.

EDIT: typos
It's hard to generate tests for typical C# code, or for any context where you have external dependencies.

If your service has injected services, the LLM doesn't know anything about them, so it makes poor guesses. You have to bring those into context so they can be mocked properly.

You end up spending a lot of time guiding the LLM, so it's not measurably faster than writing tests by hand.

I want my prompt to be: "write unit tests for XYZ method" without having to accurately describe in the prompt what the method does, how it does it, and why it does it. Writing that much detail into the prompt takes as long as writing the code myself.

GitHub Copilot should be better, since it's supposed to have access to your entire codebase. But somehow it doesn't look at dependencies and only uses its knowledge of the codebase for stylistic purposes.

It's probably my fault, and there are surely better ways to use LLMs for code, but I am probably not the only one who struggles.
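The comment is about C#, but the same problem shows up in any dependency-injection-style code. As a minimal sketch (hypothetical names, Python used only for illustration), this is roughly what "bringing the dependency into context" buys you: without seeing the dependency's interface, a model can only guess what to mock.

```python
# Hypothetical service with an injected dependency; without PaymentGateway's
# interface in context, an LLM has to guess which methods to mock and how.
from unittest.mock import Mock


class PaymentGateway:
    def charge(self, customer_id: str, amount_cents: int) -> str:
        raise NotImplementedError  # real implementation calls an external API


class InvoiceService:
    def __init__(self, gateway: PaymentGateway):
        self.gateway = gateway

    def settle(self, customer_id: str, amount_cents: int) -> str:
        if amount_cents <= 0:
            raise ValueError("amount must be positive")
        return self.gateway.charge(customer_id, amount_cents)


def test_settle_charges_via_gateway():
    gateway = Mock(spec=PaymentGateway)  # spec keeps the mock tied to the real interface
    gateway.charge.return_value = "txn_123"
    service = InvoiceService(gateway)

    assert service.settle("cust_1", 500) == "txn_123"
    gateway.charge.assert_called_once_with("cust_1", 500)
```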
Like nearly all the articles about AI doing "testing" or any other skilled activity, the last part of it admits that it is an unreliable method. What I don't see in this article, which I suspect is because they haven't done any, is any description of a competent and reasonably complete testing process for this method of writing "tests." What they probably did is try this, feel good about it (because testing is not their passion, so they are easily impressed), and then mark it off in their minds as a solved problem.

The retort from AI fanboys is always "humans are unreliable, too." Yes, they are. But they have other important qualities: accountability, humility, legibility, and the ability to learn experientially as well as conceptually.

LLMs are good at instantiating typical or normal patterns (based on their training data). Skilled testing cannot be limited to typicality, although that's a start. What I'd say is that this is an interesting idea with an important hazard attached: complacency on the part of the developer who uses this method, which turns things that COULD be missed by a skilled tester into things that are GUARANTEED to be missed.
Each time a new LLM version comes out, I give it another try at generating tests. However, even with the latest models, tailored GPTs, and well-crafted prompts with code examples, the same issues keep surfacing:

- The models often create several tests within the same equivalence class, which barely expands test coverage
- They either skip parameterization, creating multiple redundant tests, or go overboard with 5+ parameters that make tests hard to read and maintain
- The models seem focused on "writing a test at any cost," often resorting to excessive mocking or monkey-patching without much thought
- The models don't leverage existing helper functions or classes in the project, which means uploading the whole project context each time or customizing GPTs for every individual project

Given these limitations, I primarily use LLMs for refactoring tests where the IDE isn't as efficient:

- Extracting repetitive code in tests into helpers or fixtures
- Merging multiple tests into a single parameterized test (see the sketch after this list)
- Breaking up overly complex parameterized tests for readability
- Renaming tests to maintain a consistent style across a module, without getting stuck on names
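For the parameterization point, a minimal pytest sketch (the function and cases are invented for illustration) of what collapsing several near-duplicate tests into one parameterized test looks like:

```python
# Hypothetical example: three separate tests covering one behavior become a
# single parameterized test with readable case IDs.
import pytest


def normalize_email(raw: str) -> str:
    return raw.strip().lower()


@pytest.mark.parametrize(
    "raw, expected",
    [
        ("  Alice@Example.com ", "alice@example.com"),
        ("BOB@EXAMPLE.COM", "bob@example.com"),
        ("carol@example.com", "carol@example.com"),
    ],
    ids=["strips-whitespace", "lowercases", "already-normalized"],
)
def test_normalize_email(raw, expected):
    assert normalize_email(raw) == expected
```

Kept to two parameters and named cases, this stays readable; the failure mode described above is the 5+ parameter version where nobody can tell what a failing row means.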
I did this for Laravel a few months ago and it's great. It's basically the same as the article describes, and it has definitely increased the number of tests I write.

Happy to open source it if anyone is interested.
I went with a more clinical approach and used models that were available half a year ago, but I was also interested in using LLMs to write unit tests. You can read the details of that experiment at https://www.infoq.com/articles/llm-productivity-experiment/ but the net result was that LLMs do improve developer productivity in the form of unit test creation, though only marginally. Perhaps that's why I find myself a bit skeptical of the claims in that Assembled blog post about significant improvement.
If you add "white-space: pre-wrap" to the elements containing those prompt examples, you'll avoid the horizontal scrollbar (which I'm getting even on desktop) and make them easier to read.
I would love to use it to change code in ways that still compile and then see whether the tests fail. Coverage metrics sometimes don't really tell you whether a piece of code is actually covered or not.
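What this describes is essentially mutation testing, which dedicated tools (e.g. mutmut for Python) automate. A hand-rolled sketch of the idea, with an invented function and mutation purely for illustration:

```python
# Illustration of the mutation-testing idea: apply a small, still-compiling
# change to the code under test and check that at least one test now fails.
def discount(price: float, is_member: bool) -> float:
    return price * 0.9 if is_member else price


def discount_mutant(price: float, is_member: bool) -> float:
    # Mutation: flip the condition; the code still compiles and runs.
    return price * 0.9 if not is_member else price


def test_member_gets_discount(fn=discount):
    assert fn(100.0, True) == 90.0


if __name__ == "__main__":
    test_member_gets_discount(discount)  # passes against the original
    try:
        test_member_gets_discount(discount_mutant)
        print("mutant survived: the suite did not catch the change")
    except AssertionError:
        print("mutant killed: the suite caught the change")
```

A line that is "covered" but whose mutants all survive is exactly the gap that a plain coverage metric hides.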