Another common issue he doesn't mention: using observed differences (or observed significance levels) as the stopping criterion. The common statistical-significance tests <i>don't</i> work if the decision of when to stop collecting data depends on the observed significance levels. Instead, you must decide ahead of time how many trials to run and stick to that decision, or use more sophisticated significance tests. (This is the "optional stopping" problem, a form of multiple testing.)<p>For example, it works to flip two coins 50 times each and then run a statistical-significance test. It does <i>not</i> work to flip two coins 50 times each, run a test, and, if there's no significance yet, continue to 100, then 150, etc., until you either find a significant difference or give up. That greatly increases the chance of a spurious significant result, because your stopping rule is biased in favor of answering "yes": if you found a difference at 50, you don't go on to 100 (where maybe the difference would disappear again), but if you <i>didn't</i> find a difference at 50, you <i>do</i> go on to 100.<p>Put differently, at each checkpoint you're using a separate p-value for "what is the chance I could've gotten this result in [50|100|150|...] trials with unweighted coins?" to reject the null hypothesis, as if the checks were independent; but the false-positive probability of the whole procedure is governed by the union, "what is the chance I could've seen a significant result at <i>any</i> of the 50, 100, 150, 200, ... stopping points with unweighted coins?", which is higher. Yet that's exactly how many A/B tests are done: you start collecting data and let the trials run until you find "significant" differences or give up.<p>(It <i>is</i> possible to set up a series of tests where you choose when to stop based on observed values, but you have to use different statistical machinery than the common significance tests, e.g. sequential analysis or group-sequential designs that spread the error budget across the looks.)
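<p>Here's a minimal simulation sketch of the effect (my own illustration, not from the article): it assumes two fair coins, a two-proportion z-test at alpha = 0.05, and checkpoints at 50, 100, 150, and 200 flips, then estimates the false-positive rate of a single fixed-size test versus the "peek at every checkpoint and stop at the first significant result" procedure. Since both coins are fair, every rejection it counts is spurious.<p><pre><code>
import random
from math import sqrt

Z_CRIT = 1.96                     # two-sided critical value for alpha = 0.05
CHECKPOINTS = [50, 100, 150, 200] # sample sizes at which we (optionally) peek
N_SIMULATIONS = 20_000

def z_stat(heads_a, heads_b, n):
    """Two-proportion z statistic for two coins flipped n times each."""
    p_a, p_b = heads_a / n, heads_b / n
    p_pool = (heads_a + heads_b) / (2 * n)
    se = sqrt(p_pool * (1 - p_pool) * (2 / n))
    return 0.0 if se == 0 else (p_a - p_b) / se

def run_experiment(peek):
    """Simulate one experiment; return True if the null is (falsely) rejected."""
    heads_a = heads_b = flips = 0
    for checkpoint in CHECKPOINTS:
        while flips < checkpoint:
            heads_a += random.random() < 0.5   # both coins are fair, so any
            heads_b += random.random() < 0.5   # "significant" result is spurious
            flips += 1
        if peek and abs(z_stat(heads_a, heads_b, flips)) > Z_CRIT:
            return True                        # stop early: "significant!"
    # Without peeking, only the single test at the final sample size counts.
    return abs(z_stat(heads_a, heads_b, flips)) > Z_CRIT

for peek in (False, True):
    false_positives = sum(run_experiment(peek) for _ in range(N_SIMULATIONS))
    label = "peek at 50/100/150/200" if peek else "single test at n = 200"
    print(f"{label}: false-positive rate ~= {false_positives / N_SIMULATIONS:.3f}")
</code></pre><p>With four looks like this, the fixed-size test comes out near the nominal 5%, while the peeking procedure typically lands somewhere around 12-13%, even though nothing about the coins changed.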