On point 7 (Testing an unclear hypothesis), while agreeing with the overall point, I strongly disagree with the examples.

> Bad Hypothesis: Changing the color of the "Proceed to checkout" button will increase purchases.

This is succinct, and it is perfectly clear what the variable and the measure will be.

> Good hypothesis: User research showed that users are unsure of how to proceed to the checkout page. Changing the button's color will lead to more users noticing it and thus more people will proceed to the checkout page. This will then lead to more purchases.

> User research showed that users are unsure of how to proceed to the checkout page.

Not a hypothesis, but a problem statement. Cut the fluff.

> Changing the button's color will lead to more users noticing it and thus more people will proceed to the checkout page.

This is now two hypotheses.

> This will then lead to more purchases.

Sorry, I meant three hypotheses.
The biggest mistake engineers make is determining sample sizes. It is not trivial to determine the sample size for a trial without prior knowledge of effect sizes. Instead of waiting for a fixed sample size, I would recommend a sequential testing framework: set a stopping condition and run a test on each new batch of sample units.

This is called optional stopping, and it is not valid with a classic t-test, since its Type I and II error rates only hold at a predetermined sample size. Other tests make it possible, though: see safe anytime-valid statistics [1, 2] or, more simply, Bayesian testing [3, 4].

[1] https://arxiv.org/abs/2210.01948

[2] https://arxiv.org/abs/2011.03567

[3] https://pubmed.ncbi.nlm.nih.gov/24659049/

[4] http://doingbayesiandataanalysis.blogspot.com/2013/11/optional-stopping-in-data-collection-p.html?m=1
Surprised no one has said this yet, so I'll bite the bullet.

I don't think A/B testing is a good idea at all for the long term.

It seems like a recipe for having your software slowly evolve into a giant heap of dark patterns. When a metric becomes a target, it ceases to be a good metric.
I built an internal A/B testing platform with a team of 3-5 over the years. It needed to handle extreme load (hundreds of millions of participants in some cases). Our team also had a sister team responsible for teaching teams how to do proper A/B testing; they also reviewed implementations and results on demand.

Most of the A/B tests they reviewed (note the survivorship bias here: they were reviewed because the results were surprising) were incorrectly implemented and had to be redone. Most companies I worked at before or since did NOT have a team like this, and blindly trusted the results without hunting for biases, incorrect implementations, bugs, or other issues.
The one mistake I assume happens far too often is trying to measure "engagement".

Imagine a website is testing a redesign, and they want to decide whether people like it by measuring how long they spend on the site, to see if it's more "engaging". But the new site makes information harder to find, so users spend more time browsing and trying to find what they're looking for.

Management goes, "Oh, users are delighted with the new site! Look how much time they spend on it!" not realizing how frustrated the users are.
That's not Simpson's Paradox. Simpson's Paradox is when the aggregate winner differs from the winner in each element of a partition, not just in some of them.
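For a concrete illustration (the numbers below are invented for the example): A can win in every segment while B wins the aggregate, purely because B's traffic is concentrated in the high-converting segment.

    # A wins desktop (80% vs ~78%) and mobile (20% vs 19%), yet B wins overall,
    # because most of B's users land in the high-converting desktop segment.
    segments = {
        "desktop": {"A": (80, 100), "B": (700, 900)},   # (conversions, users)
        "mobile":  {"A": (180, 900), "B": (19, 100)},
    }
    for name, seg in segments.items():
        (ca, na), (cb, nb) = seg["A"], seg["B"]
        print(f"{name}: A = {ca / na:.1%}, B = {cb / nb:.1%}")
    for variant in ("A", "B"):
        conv = sum(segments[s][variant][0] for s in segments)
        users = sum(segments[s][variant][1] for s in segments)
        print(f"overall {variant}: {conv / users:.1%}")   # A = 26.0%, B = 71.9%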
I want an A/B test framework that automatically optimizes the size of the groups to maximize revenue.

At first it would pick, say, a 50/50 split. Then, as data rolls in showing that group A is more likely to convert, it would shift more users over to group A, keeping a few users on B to keep gathering data. Eventually, with enough data, it might turn out that flow A doesn't work at all for users in France, so the ideal would be for most users in France to end up in group B while the rest of the world is in group A.

I want the framework to do all this behind the scenes, preferably with statistical rigor, and then tell me which groups have diminished to near zero (allowing me to remove the associated code).
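What's being described is essentially a multi-armed bandit rather than a fixed-split A/B test. A minimal Thompson-sampling sketch, assuming Bernoulli conversions and Beta(1, 1) priors; a contextual version would keep one pair of counters per segment (e.g. per country, for the France case). All names here are illustrative:

    import random

    alpha = {"A": 1, "B": 1}   # successes + 1 per variant
    beta  = {"A": 1, "B": 1}   # failures + 1 per variant

    def assign_variant() -> str:
        """Sample each arm's posterior conversion rate; the higher draw gets the next user."""
        draws = {v: random.betavariate(alpha[v], beta[v]) for v in ("A", "B")}
        return max(draws, key=draws.get)

    def record_outcome(variant: str, converted: bool) -> None:
        """Update the chosen arm's Beta posterior with the observed outcome."""
        if converted:
            alpha[variant] += 1
        else:
            beta[variant] += 1

Traffic drifts toward the better arm while the worse arm keeps getting a trickle of exploration; the trade-off is that the result is harder to read as a classical significance test.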
Posthog is on Developer Dan's "Ads & Tracking" blocklist [1], if you're wondering why this doesn't load.

[1]: https://github.com/lightswitch05/hosts/blob/master/docs/lists/ads-and-tracking-extended.txt
Another challenge, related more to implementation than theory, is having too many experiments running in parallel.

As a company grows, there will be multiple experiments running in parallel, executed by different teams. The underlying assumption is that they are independent, but that is not necessarily true, or at least not entirely correct: for example, a graphics change on the main page together with a change in the login logic.

Obviously this can be solved by communication, for example by documenting running experiments, but like many other aspects of A/B testing there is a lot of guesswork and gut feeling involved.
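One common mitigation (a sketch of the general layering idea, not something from the article): put experiments that might interact into the same layer, so a given user is only ever in one of them, while separate layers stay independent via per-layer hashing. The layer names and bucket count are illustrative:

    import hashlib

    LAYERS = {
        "checkout_ui": ["button_color", "login_flow"],  # potentially interacting: same layer
        "pricing":     ["discount_banner"],             # independent: its own layer
    }

    def bucket(user_id: str, layer: str, num_buckets: int = 1000) -> int:
        """Per-layer hash, so assignments in different layers are uncorrelated."""
        digest = hashlib.sha256(f"{layer}:{user_id}".encode()).hexdigest()
        return int(digest, 16) % num_buckets

    def experiment_for(user_id: str, layer: str) -> str:
        """Each experiment in a layer owns a disjoint slice of the layer's buckets."""
        experiments = LAYERS[layer]
        slice_size = 1000 // len(experiments)
        index = min(bucket(user_id, layer) // slice_size, len(experiments) - 1)
        return experiments[index]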
The biggest mistake engineers make about A/B testing is not recognizing local maxima. Your test may be super successful, but there may be an even better solution that's significantly different from what you've arrived at.

It's important not only to A/B test minor changes but to occasionally throw in some major changes to see if they move the same metric, possibly even more than your existing success.
If I read the first mistake correctly, getFeatureFlag() has the side effect of counting how often it was called, and this count is used to calculate the outcome of the experiment? Wow. I don't know what to say...
Another one: don't program your own A/B testing framework! Every time I've seen engineers try to build this on their own, it fails an A/A test (where both versions are the same, so there should be no difference). Common reasons are overly complicated randomization schemes (keep it simple!) and differences in load times between test and control.
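A sketch of what an automated A/A check can look like: push identical behaviour through your real assignment code and verify that the test only flags a "difference" about alpha of the time. `assign` here is a stand-in for whatever your framework actually does:

    import hashlib
    import numpy as np
    from scipy.stats import ttest_ind

    def assign(user_id: str) -> str:
        """Stand-in for the framework's assignment function under test."""
        return "A" if int(hashlib.md5(user_id.encode()).hexdigest(), 16) % 2 == 0 else "B"

    rng = np.random.default_rng(0)
    runs, false_positives = 500, 0
    for run in range(runs):
        users = [f"run{run}-user{i}" for i in range(2000)]
        groups = np.array([assign(u) for u in users])
        outcome = rng.normal(size=len(users))          # identical behaviour in both groups
        _, p = ttest_ind(outcome[groups == "A"], outcome[groups == "B"])
        false_positives += p < 0.05

    print(f"false positive rate: {false_positives / runs:.3f} (expect about 0.05)")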
Enough traffic.

Isn't the biggest problem with A/B testing that very few websites even have enough traffic to properly measure statistical differences?

That essentially makes A/B testing useless for 99.9% of websites.
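To put rough numbers on that (a back-of-the-envelope sketch using the standard two-proportion normal-approximation formula; the conversion rates are made up): detecting a lift from 5.0% to 5.5% at 95% confidence and 80% power needs on the order of 31,000 users per arm.

    from math import ceil

    p1, p2 = 0.050, 0.055          # baseline and hoped-for conversion rates
    z_alpha, z_beta = 1.96, 0.84   # two-sided alpha = 0.05, power = 0.80

    n_per_arm = ceil((z_alpha + z_beta) ** 2
                     * (p1 * (1 - p1) + p2 * (1 - p2))
                     / (p2 - p1) ** 2)
    print(n_per_arm)               # about 31,200 per arm, so about 62,400 total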
Regarding point 7:

> Good hypothesis: User research showed that users are unsure of how to proceed to the checkout page. Changing the button's color will lead to more users noticing it (…)

Mind that you first have to show that this premise is actually true. Your user research is probably exploratory, qualitative data based on a small sample; at this point it's more of an assumption. You have to operationalize and test it (by quantitative means) for validity and significance. Only then can you proceed to the button hypothesis. Otherwise you are still testing multiple things at once, based on an unclear hypothesis, while merely assuming that part of that hypothesis is actually valid.
> The solution is to use an A/B test running time calculator to determine if you have the required statistical power to run your experiment and for how long you should run your experiment.

Wouldn't it be better to have an A/B testing system that simply counts how many users have been in each assignment group and ends the experiment when you have the required statistical power?

Time just seems like a stand-in for "that should be enough", when in reality the number of users actually exposed might differ from your expectations.
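A sketch of that count-based stopping (illustrative names; the required n per arm would come from an up-front power calculation like the one a few comments above):

    REQUIRED_PER_ARM = 31_196   # precomputed sample size per variant

    def experiment_complete(exposures_per_arm: dict[str, int]) -> bool:
        """True once every arm has reached the precomputed sample size."""
        return all(n >= REQUIRED_PER_ARM for n in exposures_per_arm.values())

    # experiment_complete({"A": 31_500, "B": 30_900}) -> False: keep collecting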
Point one seems to be an API naming issue. I would not anticipate getFeatureFlag to increment a hit counter. Seems like it should be called something like participateInFlagTest or whatever. Or maybe it should take a (key, arbitraryId) instead of just (key), use the hash of the id to determine if the flag is set, and idempotently register a hit for the id.
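A sketch of that split (illustrative names, not any particular SDK's API): evaluation is a pure function of the hash, and the exposure hit is registered separately and idempotently.

    import hashlib

    _exposures: set[tuple[str, str]] = set()   # stand-in for a real exposure store

    def is_enabled(key: str, arbitrary_id: str, rollout_pct: int = 50) -> bool:
        """Deterministic, side-effect-free flag evaluation from hash(key, id)."""
        h = int(hashlib.sha256(f"{key}:{arbitrary_id}".encode()).hexdigest(), 16)
        return h % 100 < rollout_pct

    def participate_in_flag_test(key: str, arbitrary_id: str) -> bool:
        """Evaluate the flag AND record the exposure, at most once per (key, id)."""
        _exposures.add((key, arbitrary_id))    # set semantics make the hit idempotent
        return is_enabled(key, arbitrary_id)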
Thanks for posting this. It's to the point and easy to understand. And much needed: most companies seem to do testing without teaching the intricacies involved.
> Relying too much on A/B tests for decision-making

Need I say more? Or just keep tweaking your website until it becomes a mindless, grey sludge.
Plus, mind the Honeymoon Effect: something new performs better just because it's new.

If you have a platform with lots of returning users, this one will hit you again and again.

So even if you have a winner after the test and make the change permanent, revisit it two months later and see if you are now really better off.

Summed over all your A/B-tested changes, there's a high chance you just end up with an average platform.
*Every* engineer? Electrical engineers? Kernel developers? Defense workers?

I hesitate to write this (because I don't want to be negative) but I get the sense that most software "engineers" have a very narrow view of the industry at large. Or this forum leans a particular way.

I haven't A/B tested in my last three roles. Two of them were defense jobs, and my current job deals with the Linux kernel.