First, you really should move away from frequentist statistical testing and use Bayesian statistics instead. It is a perfect fit for situations where you want to update your beliefs about which UX is best as empirical data comes in: collecting more data increases your confidence in the decision, rather than chasing an arbitrary p-value threshold.

Second, the “run-in-parallel” approach has a well-defined name in experimental design: a factorial design. The diagram shown is an example of a full factorial design, in which each level of each factor is combined with each level of every other factor. The advantage of such a design is that interactions between factors can be tested as well. If there are good reasons to believe there are no interactions between the factors, you could use a fractional (partial) factorial design instead, which needs fewer total combinations of levels while still allowing estimation of the individual factor effects.
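To make the Bayesian framing concrete, here is a minimal sketch with made-up conversion counts and uniform Beta(1, 1) priors; the output is the quantity you actually want for a decision, P(B beats A), rather than a p-value:

    # Beta-Binomial A/B comparison with hypothetical counts (not from the article).
    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical data: (conversions, visitors) per variant.
    a_conv, a_n = 120, 2400
    b_conv, b_n = 140, 2380

    # With a Beta(1, 1) prior, the posterior for each conversion rate is
    # Beta(1 + conversions, 1 + non-conversions).
    samples_a = rng.beta(1 + a_conv, 1 + (a_n - a_conv), size=100_000)
    samples_b = rng.beta(1 + b_conv, 1 + (b_n - b_conv), size=100_000)

    # Probability that B is better than A, and the expected lift with its uncertainty.
    print(f"P(B > A) = {np.mean(samples_b > samples_a):.3f}")
    print(f"mean lift = {np.mean(samples_b - samples_a):.4f}")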
Building your own Bayesian model with something like pymc3 is also a very reasonable approach when you have small data, or data with too much variance to detect effects in a timely manner. It also forces you to think about the underlying distributions that generate your data, which is an exercise that can yield interesting insights in its own right.
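For example (made-up counts again, and assuming the classic PyMC3 3.x API where sample() returns a MultiTrace), the same comparison written as an explicit model looks roughly like this:

    # Sketch of a Bayesian A/B model in PyMC3; priors and data are illustrative only.
    import pymc3 as pm

    a_conv, a_n = 120, 2400
    b_conv, b_n = 140, 2380

    with pm.Model():
        # Writing the priors down forces you to say which conversion rates you find plausible.
        p_a = pm.Beta("p_a", alpha=1, beta=1)
        p_b = pm.Beta("p_b", alpha=1, beta=1)
        pm.Binomial("obs_a", n=a_n, p=p_a, observed=a_conv)
        pm.Binomial("obs_b", n=b_n, p=p_b, observed=b_conv)
        # Track the decision quantity directly.
        pm.Deterministic("delta", p_b - p_a)
        trace = pm.sample(2000, tune=1000)

    print("P(B > A) =", (trace["delta"] > 0).mean())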
> Gut Check: Especially if you’re off by quite a bit, this is a chance to take a step back and ask whether the company has reached growth scale or not. It could be that there are plenty of obvious 0-1 tactics left. Not everything has to be an experiment.

This is a key point, imo. I have a sneaking suspicion that a lot of companies are running "growth teams" without the scale where it actually makes sense to do so.
There's an argument to be made that, so long as your testing fully encompasses all visitors to your site, you aren't sampling the population, you're fully observing it, and statistical significance is irrelevant.
“Using modern experiment frameworks, all 3 of ideas can be safely tested at once, using parallel A/B tests (see chart).”

Nooo! First, if one of them actually works, you’ve massively increased the “noise” for the other experiments, so your significance calculations are now off. Second, xkcd 882.
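xkcd 882 is the multiple comparisons problem, and it is easy to see in a simulation. With made-up numbers (three parallel tests, no real effect anywhere, 5% significance level), the chance that *something* comes up "significant" is already around 14%, not 5%:

    # Simulate several parallel A/B tests where NO variant has any real effect,
    # and count how often at least one looks significant at p < 0.05.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(42)
    n_sims, n_tests, n_per_arm, base_rate = 10_000, 3, 2000, 0.05

    false_alarms = 0
    for _ in range(n_sims):
        any_significant = False
        for _ in range(n_tests):
            a = rng.binomial(n_per_arm, base_rate)
            b = rng.binomial(n_per_arm, base_rate)
            # Two-proportion z-test; both arms are drawn from the same true rate.
            p_pool = (a + b) / (2 * n_per_arm)
            se = np.sqrt(2 * p_pool * (1 - p_pool) / n_per_arm)
            if se > 0:
                z = (a - b) / (n_per_arm * se)
                p_value = 2 * stats.norm.sf(abs(z))
                any_significant = any_significant or (p_value < 0.05)
        false_alarms += any_significant

    # Close to 1 - 0.95**3 ≈ 0.14 for three independent tests.
    print(f"family-wise false positive rate: {false_alarms / n_sims:.3f}")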