First off, congratulations folks! It’s never easy getting a new product off the ground, and I wish you the best of luck. So please don’t take this as anything other than genuine constructive criticism from a potential customer: generating tests to increase coverage is a misunderstanding of the point of collecting code coverage metrics, and businesses that depend on getting verification activities right will know this when they evaluate your product.

A high-quality test passes when the functionality of the software under test is consistent with the design intent of that software. If the software doesn’t do the Right Thing, the test must fail. That’s why TDD is effective: you’re essentially specifying the intent and then implementing code against it, like a self-verifying requirements specification. Looking at the Qodo tests in the GitHub PRs you’ve linked, the argument seems to be that a high-quality test is one that:

1. Executes successfully

2. Passes all assertions

3. Increases overall code coverage

4. Tests previously uncovered behaviors (as specified in the LLM prompt)

So, given a project’s source code as input, a hypothetical “perfect AI” built into Qodo that always writes a high-quality test would (naturally!) *never fail* to write a passing test for that code; the semantics of the code would be perfectly encoded in the test. But if the code has a defect, it follows that optimizing your AI for the metrics Qodo is aiming at will actually LOWER the probability of finding that defect. The generated test validates the code against itself, enshrining the defective behavior as correct (a concrete sketch of this failure mode is at the end of this comment). It’s easy to say that higher code coverage is good, more maintainable, etc., but this outcome is the exact opposite of maintainable: it actively undermines confidence in the code under test and your ability to refactor it.

There are better ways to do this, and you’ve got competitors who are already well on the way to doing them, using a diverse range of inputs beyond the code itself. It boils down to answering two questions:

1. Can a technique be applied so that an LLM, with or without explicit specifications and an understanding of developer intent, will reliably reconstruct the intended behavior of the code?

2. Can a technique be applied so that tests generated by an LLM truly verify the specific behaviors the LLM was prompted to test, rather than producing a valid test that just isn’t the one that was asked for?
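
To make that failure mode concrete, here’s a toy Python sketch. The function, its intended “free shipping over 20 kg” rule, and both tests are invented for illustration; they aren’t from your product or the linked PRs.

    # Intent (never stated in the code): orders over 20 kg ship free,
    # otherwise shipping costs 5 per kg.
    def shipping_cost(weight_kg: int) -> int:
        # BUG: the free-shipping threshold is missing entirely.
        return weight_kg * 5

    # What a coverage-driven generator produces: it reads the code and
    # asserts whatever the code already does. It executes, passes, and
    # covers the function, and it enshrines the defect as correct.
    def test_shipping_cost_generated_from_code():
        assert shipping_cost(25) == 125

    # What a test written from intent looks like: it fails against the
    # buggy implementation, which is exactly the signal you want.
    def test_shipping_cost_written_from_intent():
        assert shipping_cost(25) == 0

Both tests run, and the first one even raises coverage, but only the second one, written from the intent rather than from the code, can ever tell you the code is wrong.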