Note that the result is a bit more nuanced if you read the paper:
> Our results suggest that, for large Java programs, the correlation between coverage and effectiveness drops when suite size is controlled for.
and
> While coverage measures are useful for identifying under-tested parts of a program, and low coverage may indicate that a test suite is inadequate, high coverage does not indicate that a test suite is effective.
They also propose an alternative as a "quality goal" for test suites:
> Of course, developers still want to measure the quality of their test suites, meaning they need a metric that does correlate with fault detection ability. While this is still an open problem, we currently feel that mutation score may be a good substitute for coverage in this context.
I am not surprised by the findings of the study - not at all.
In discussions about software metrics, I'm always trying to make the point that you can *only* use most metrics from within a team to gain insights, not as an external measure of how good or bad the code (or the team) is. In other words, if a team thinks they have a problem, they can use metrics to gain insights and explore possible solutions. But as soon as someone says "[foo metric] needs to be at least [value]", you have already lost - the cheating and gaming begin. Even if the agreement on [value] comes from within the team.
Back to the topic :) I am not surprised by the findings - higher test coverage does not mean that everything is fine, but very low test coverage indicates that there might be hidden problems here or there. This is how I like to use test coverage and how I try to teach it.
But it is great that we now have empirical data: from now on, I can point others to this study when we are discussing whether the build server should reject commits with less than [value] coverage.
Another case of conflating a phenomenon and its measure.
Rather than write good tests, people have gotten side-tracked by chasing a magical number that may or may not reflect the phenomenon of interest.
Heuristics are sometimes wrong, but they are always fast and not fundamentally bad. Writing unit tests for high-risk code is a good heuristic, god dammit. Chasing after 100% test coverage is pedantic and -- I honestly think -- evidence of a development team that favors form over function.
I'll continue writing unit tests that actually exercise critical boundary conditions, and I will continue not to care if only 10% of my code is covered. If that 10% accounts for 80% of my bugs, I've won.
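To make that concrete, this is roughly the kind of boundary-condition test I mean -- a minimal sketch assuming JUnit 5, where `clamp` is a made-up method standing in for high-risk code:

```java
// A sketch only: JUnit 5 is assumed, and clamp() is a made-up method
// standing in for "high-risk code with interesting boundaries".
import org.junit.jupiter.api.Test;
import static org.junit.jupiter.api.Assertions.assertEquals;
import static org.junit.jupiter.api.Assertions.assertThrows;

class ClampBoundaryTest {

    // Code under test: clamp value into the inclusive range [lo, hi].
    static int clamp(int value, int lo, int hi) {
        if (lo > hi) {
            throw new IllegalArgumentException("lo > hi");
        }
        return Math.max(lo, Math.min(hi, value));
    }

    @Test
    void valuesInsideTheRangeAreUnchanged() {
        assertEquals(5, clamp(5, 0, 10));
    }

    @Test
    void boundariesAreInclusive() {
        assertEquals(0, clamp(0, 0, 10));    // lower edge
        assertEquals(10, clamp(10, 0, 10));  // upper edge
    }

    @Test
    void valuesOutsideTheRangeAreClamped() {
        assertEquals(0, clamp(-1, 0, 10));
        assertEquals(10, clamp(11, 0, 10));
    }

    @Test
    void invalidRangesAreRejected() {
        assertThrows(IllegalArgumentException.class, () -> clamp(5, 10, 0));
    }
}
```
A handful of tests like these at the edges buy far more than blanket coverage of the easy middle.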
No kidding. This has been known for a long time, but it's good to see empirical evidence. The empirical evidence from the '70s-'90s showed the best ways to reduce defects were the following:
1. Unambiguous specifications of what the system will do.
2. Implementation in safe languages or subsets, with plenty of interface checks.
3. *Human review* of design, implementation, and configuration, looking for errors. The best results were here.
4. Usage-based testing to eliminate most errors users will experience in practice.
5. Thorough functional and unit testing.
(Later on...)
6. Fuzz testing.
7. Static analysis.
Practices 1-4 were used in Cleanroom, some major commercial projects, and high-assurance security work. Every time, they prevented serious defects from entering the system that might otherwise have slipped past testing. So those are where the majority of our verification work should go. Supporting evidence comes from those that do a subset of this, from the OpenBSD team to Microsoft's SDL, along with the substantial defect reduction that followed.
Note: 6 and 7 showed their value later on. No. 7 should definitely be done where possible, with No. 6 done to a degree. I haven't researched No. 6 enough to give advice on the most efficient approach, though.
So, internal testing and coverage are the weakest forms of verification. They have many benefits for sure, but people put way too much effort and expectation into them when it pays off more elsewhere. Do enough testing for its key benefits, then cut it off from there. And don't even do most of that until you've done code reviews, etc., which pay off more.
I find it strange that so much time is spent trying to convince people that some aspect of testing (coverage / TDD) is not a panacea, and so little on improving how to test.
As a student / recent grad I remember thinking that testing was something you maybe had to do for super hardcore projects. Now I see it as one of the first things to think about on a serious project and something that is going to take ~30-50% of the cost / effort.
Coverage is pretty awesome feedback for reminding you that "oh yeah, I should probably test that case too." And if you start out with this kind of feedback early, it can help influence your design to increase its testability.
If X% coverage is a goal measured by non-technical team members, it likely loses much of its value.
Coverage does not guarantee effectiveness. Lack of coverage guarantees the behavior of uncovered code is not checked.
Use the tool for what it is good at. It's not a substitute for code review or writing convincing tests. Granted, even convincing tests are going to miss important stuff, but they will miss less, and after the initial bout of bugs they will become quite good (because you write regression tests for EVERY bug, right?).
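A small made-up illustration of that asymmetry (JUnit 5 assumed; `priceWithDiscount` is hypothetical): both tests below execute every branch, so coverage reports 100% either way, but only the second one would catch a broken discount rate.

```java
// A sketch only: JUnit 5 assumed; priceWithDiscount() is a made-up method.
import org.junit.jupiter.api.Test;
import static org.junit.jupiter.api.Assertions.assertEquals;
import static org.junit.jupiter.api.Assertions.assertTrue;

class DiscountTest {

    // Code under test: 10% discount for orders of 100.0 or more.
    static double priceWithDiscount(double price) {
        return price >= 100.0 ? price * 0.9 : price;
    }

    // Executes both branches, so line and branch coverage are 100%,
    // yet the assertions are too weak to notice a wrong discount rate.
    @Test
    void coveredButUnconvincing() {
        assertTrue(priceWithDiscount(50.0) > 0);
        assertTrue(priceWithDiscount(150.0) > 0);
    }

    // Adds no coverage at all, but pins down the behavior and is the
    // test that would actually catch a regression in the discount.
    @Test
    void coveredAndConvincing() {
        assertEquals(50.0, priceWithDiscount(50.0), 1e-9);
        assertEquals(135.0, priceWithDiscount(150.0), 1e-9);
    }
}
```
Coverage counts both of those tests the same; only one of them is worth much.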
I have a few objections to the evidence presented.
Keep in mind that I don't disagree with the statement. Test coverage is an objective metric, and test effectiveness is a... what is it again? How many bugs you'll find with it? Obviously the two are separate concepts.
This is my first objection: the paper seems to treat mutation testing as test effectiveness, but mutation testing is merely another metric. They cite other papers, but those papers only attempt to demonstrate that this metric is correlated with test "effectiveness".
Metric against metric - is that meaningful?
She presents graphs of results that seem to demonstrate a linear correlation between test suite size, test suite coverage, and mutation testing (called "effectiveness"). This is addressed later - "isn't this what we would expect?" Yeah! And the explanation for why it's unexpected sailed fully over my head. (I admit it! I'm dumb.)
Finally, many test suites are written with the goal of achieving code coverage; is the study valid without also having test suites that were made without that goal in mind? Could such a suite exist?
So, a summary of my objections:
* Is mutation testing a meaningful measure of effectiveness?
* Can you measure one metric against another, get a linear relationship, and conclude anything meaningful?
* Does the presence of code coverage as a target spoil the conclusion?
I'd love to hear input on this.
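For anyone who hasn't seen the metric in action, here is a hand-rolled sketch of what a mutation-testing tool automates (everything below is made up for illustration; real Java tools such as PIT generate and run the mutants for you). A mutant is a small syntactic change to the code under test; a test "kills" it if the test fails against the mutated version, and the mutation score is the fraction of generated mutants that get killed.

```java
// A sketch only: this hand-rolls what a mutation-testing tool automates.
// Both methods below are made up for the illustration.
import org.junit.jupiter.api.Test;
import static org.junit.jupiter.api.Assertions.assertTrue;

class MutationSketchTest {

    // Original code under test.
    static boolean isAdult(int age) {
        return age >= 18;
    }

    // One mutant of it: the relational operator >= replaced with >.
    // A real tool would generate many such variants automatically.
    static boolean isAdultMutant(int age) {
        return age > 18;
    }

    // This test covers the code, and it would also pass if the mutant
    // were swapped in (isAdultMutant(30) is still true), so it leaves
    // the mutant alive and contributes nothing to the mutation score.
    @Test
    void coversButDoesNotKillTheMutant() {
        assertTrue(isAdult(30));
    }

    // This boundary test passes on the original but would fail against
    // the mutant (isAdultMutant(18) is false), i.e. it "kills" it.
    // Mutation score = killed mutants / generated mutants.
    @Test
    void killsTheMutant() {
        assertTrue(isAdult(18));
    }
}
```
Whether that score really measures "effectiveness" is exactly what I'm questioning, but unlike coverage it at least reacts to the assertions, not just to which lines ran.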
I'm biased (it's my research field, in part), but I'd suggest that studies on coverage are all over the place, with this one showing lack of correlation and other studies showing good correlation between coverage and... some kind of effectiveness (the ICSE paper mentioned above, also from 2014, and a TOSEM paper coming out this year, as well as a variety of publications over the years).
http://www.cs.cmu.edu/~agroce/onwardessays14.pdf covers the Inozemtseva et al. paper as well as some other recent work, and nothing in the time since we wrote that has modified my view that the jury is still out on coverage, depending on the situation in which you want to use it. Saying "coverage is not useful" is pretty clearly wrong, and saying "coverage is highly effective for measuring all suites in all situations" is also clearly wrong. Beyond that, it's hard to make solidly supported claims that don't depend greatly on details of what you are measuring and how.
I suspect Laura generally agrees, though probably our guesses on what the eventual answers might be differ.
The SPLASH Onward! 2014 essay concludes with this advice to practitioners:
> In some cases where coverage is currently used, there is little real substitute for it; test suite size alone is not a very helpful measure of testing effort, since it is even more easily abused or misunderstood than coverage. Other testing efforts already have ways of determining when to stop that don’t rely on coverage (ranging from “we’re out of time or money” to “we see clearly diminishing returns in terms of bugs found per dollar spent testing, and predict few residual defects based on past projects”). When coverage levels are required by company or government policy, conscientious testers should strive to produce good suites that, additionally, achieve the required level of coverage rather than aiming very directly at coverage itself [56]. “Testing to the test” by writing a suite that gets “enough” coverage and expecting this to guarantee good fault detection is very likely a bad idea — even in the best-case scenario where coverage is well correlated with fault detection. Stay tuned to the research community for news on whether coverage can be used more aggressively, with confidence, in the future.
While I think it's well agreed upon in this community that code coverage is not all that big a deal, I think we should also consider the methods used in this study before we all pat ourselves on the back for having 'proof' of what we know.
> we generated 31,000 test suites for five systems consisting of up to 724,000 lines of source code
You auto-generated unit test suites, and you're surprised they weren't very good at finding bugs? Well, no kidding, they were auto-generated! Would you trust your unit tests to be generated by a computer? Of course not.
Do a study of real-world software, and compare the unit test coverage to the test suite effectiveness. Then I'll be interested.