I like this article a lot, but there are a couple of things it gets slightly wrong.

The article discusses the standard textbook Z-test and then talks a lot about Optimizely. However, Optimizely doesn't actually use the Z-test: they use a sequential testing method instead, and the details are a bit different. (The first sketch below shows what the textbook test looks like, and why the fixed-sample-size assumption is the thing that separates it from a sequential method.)

The article also suggests you "start by serving variant B to only 10% of the users to ensure there are no implementation problems". That's a good idea, but once you've ensured there are no implementation problems you need to throw away the ramp-up data and restart. Conversion rates change over the week (e.g., Saturday != Tuesday), so keeping the data from the ramp-up period is a great way to get wrong results via Simpson's Paradox; the second sketch below shows how the reversal happens.
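For concreteness, here is a minimal sketch of the "standard textbook Z-test" (the pooled two-proportion z-test) with made-up numbers. Its p-value is only valid if you fix the sample size in advance and look once, which is exactly the assumption a sequential method like Optimizely's drops:

```python
# Textbook two-proportion z-test (fixed-horizon, pooled variance).
# Numbers below are hypothetical.
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for H0: both conversion rates are equal."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)          # pooled rate under H0
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

z, p = two_proportion_z_test(conv_a=200, n_a=5000, conv_b=260, n_b=5000)
print(f"z = {z:.2f}, p = {p:.3f}")   # z = 2.86, p = 0.004
```

If you instead peek at this statistic repeatedly and stop as soon as p < 0.05, the real false-positive rate is much higher than 5%; sequential tests are designed to allow that kind of continuous monitoring.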
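And here is a sketch, with invented numbers, of the Simpson's Paradox problem with keeping ramp-up data. B gets 10% of traffic on a high-converting weekday and 50% after full rollout on a low-converting weekend, so B's pooled sample is weighted toward the bad day. B wins on both days yet loses in the pooled totals:

```python
# Hypothetical (users, conversions) per variant per day.
# Day 1: ramp-up on a high-converting Tuesday, B gets 10% of traffic.
# Day 2: full 50/50 rollout on a low-converting Saturday.
data = {
    "tuesday (ramp-up)": {"A": (9000, 900), "B": (1000, 110)},
    "saturday (50/50)":  {"A": (5000, 200), "B": (5000, 220)},
}

def rate(users, conversions):
    return conversions / users

totals = {"A": [0, 0], "B": [0, 0]}
for day, variants in data.items():
    for v, (users, conv) in variants.items():
        totals[v][0] += users
        totals[v][1] += conv
    a, b = rate(*variants["A"]), rate(*variants["B"])
    print(f"{day}: A={a:.1%}  B={b:.1%}  B wins: {b > a}")

a, b = rate(*totals["A"]), rate(*totals["B"])
print(f"pooled: A={a:.1%}  B={b:.1%}  B wins: {b > a}")
# tuesday (ramp-up): A=10.0%  B=11.0%  B wins: True
# saturday (50/50):  A=4.0%   B=4.4%   B wins: True
# pooled:            A=7.9%   B=5.5%   B wins: False
```

The per-day comparisons are fair because the traffic split is constant within each day; pooling across days with different splits and different base rates is what produces the reversal, which is why you restart the test after ramp-up.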