Annoying A/B testing mistakes

292 points, by Twixes, almost 2 years ago

25 comments

alsiola, almost 2 years ago

On point 7 (Testing an unclear hypothesis), while agreeing with the overall point, I strongly disagree with the examples.

> Bad hypothesis: Changing the color of the "Proceed to checkout" button will increase purchases.

This is succinct, clear, and it is very clear what the variable/measure will be.

> Good hypothesis: User research showed that users are unsure of how to proceed to the checkout page. Changing the button's color will lead to more users noticing it and thus more people will proceed to the checkout page. This will then lead to more purchases.

> User research showed that users are unsure of how to proceed to the checkout page.

Not a hypothesis, but a problem statement. Cut the fluff.

> Changing the button's color will lead to more users noticing it and thus more people will proceed to the checkout page.

This is now two hypotheses.

> This will then lead to more purchases.

Sorry, I meant three hypotheses.

kimukasetsu, almost 2 years ago

The biggest mistake engineers make is determining sample sizes. It is not trivial to determine the sample size for a trial without prior knowledge of effect sizes. Instead of waiting for a fixed sample size, I would recommend using a sequential testing framework: set a stopping condition and perform a test for each new batch of sample units.

This is called optional stopping, and it is not possible using a classic t-test, since Type I and II errors are only valid at a determined sample size. However, other tests make it possible: see safe anytime-valid statistics [1, 2] or, simply, Bayesian testing [3, 4].

[1] https://arxiv.org/abs/2210.01948

[2] https://arxiv.org/abs/2011.03567

[3] https://pubmed.ncbi.nlm.nih.gov/24659049/

[4] http://doingbayesiandataanalysis.blogspot.com/2013/11/optional-stopping-in-data-collection-p.html?m=1

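To make the batch-wise stopping idea concrete, here is a minimal sketch of the Bayesian flavour (not the safe anytime-valid procedures of [1, 2]): keep a Beta posterior per variant and stop once the posterior probability that B beats A crosses a threshold. The conversion rates, batch size, and the 0.99 threshold are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def prob_b_beats_a(conv_a, n_a, conv_b, n_b, draws=100_000):
    """Monte Carlo estimate of P(rate_B > rate_A) under Beta(1, 1) priors."""
    post_a = rng.beta(1 + conv_a, 1 + n_a - conv_a, draws)
    post_b = rng.beta(1 + conv_b, 1 + n_b - conv_b, draws)
    return float((post_b > post_a).mean())

conv_a = n_a = conv_b = n_b = 0
p = 0.5
for batch in range(1, 101):                      # evaluate after each batch of traffic
    new_a = new_b = 500                          # illustrative batch size
    conv_a += rng.binomial(new_a, 0.030)         # simulated data; true rates unknown in practice
    conv_b += rng.binomial(new_b, 0.033)
    n_a += new_a
    n_b += new_b
    p = prob_b_beats_a(conv_a, n_a, conv_b, n_b)
    if p > 0.99 or p < 0.01:                     # stopping condition
        print(f"stop after batch {batch}: P(B beats A) = {p:.3f}")
        break
else:
    print(f"no decision after {n_a + n_b} users: P(B beats A) = {p:.3f}")
```
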
mtlmtlmtlmtl, almost 2 years ago

Surprised no one has said this yet, so I'll bite the bullet.

I don't think A/B testing is a good idea at all for the long term.

It seems like a recipe for having your software slowly evolve into a giant heap of dark patterns. When a metric becomes a target, it ceases to be a good metric.

withinboredom, almost 2 years ago

I built an internal A/B testing platform with a team of 3-5 over the years. It needed to handle extreme load (hundreds of millions of participants in some cases). Our team also had a sister team responsible for teaching/educating teams about how to do proper A/B testing -- they also reviewed implementations/results on demand.

Most of the A/B tests they reviewed (note the survivorship bias here: they were reviewed because the results were surprising) were incorrectly implemented and had to be redone. Most companies I worked at before or since did NOT have a team like this, and blindly trusted the results without hunting for biases, incorrect implementations, bugs, or other issues.

Sohcahtoa82, almost 2 years ago

The one mistake I assume happens too much is trying to measure "engagement".

Imagine a website is testing a redesign, and they want to decide if people like it by measuring how long they spend on the site to see if it's more "engaging". But the new site makes information harder to find, so they spend more time on the site browsing and trying to find what they're looking for.

Management goes, "Oh, users are delighted with the new site! Look how much time they spend on it!" not realizing how frustrated the users are.

throwaway084t95, almost 2 years ago

That's not Simpson's Paradox. Simpson's Paradox is when the aggregate winner is different from the winner in each element of a partition, not just some of them.

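For concreteness, here is a toy dataset (numbers invented purely for illustration) where variant A wins in every segment yet B wins in the aggregate, which is the situation the parent describes:

```python
# Invented counts: (conversions, users) per device segment and variant.
data = {
    "desktop": {"A": (81, 87),   "B": (234, 270)},
    "mobile":  {"A": (192, 263), "B": (55, 80)},
}

def rate(conversions, users):
    return conversions / users

for segment, variants in data.items():
    a, b = rate(*variants["A"]), rate(*variants["B"])
    print(f"{segment:8s} A={a:.1%}  B={b:.1%}  winner={'A' if a > b else 'B'}")

# Aggregate over the whole partition: the winner flips.
totals = {v: tuple(map(sum, zip(*(data[s][v] for s in data)))) for v in ("A", "B")}
a, b = rate(*totals["A"]), rate(*totals["B"])
print(f"overall  A={a:.1%}  B={b:.1%}  winner={'A' if a > b else 'B'}")
```

The flip happens because B received most of its traffic in the high-converting desktop segment, while A was mostly exposed to mobile traffic.
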
londons_explore, almost 2 years ago

I want an A/B test framework that automatically optimizes the size of the groups to maximize revenue.

At first, it would pick, say, a 50/50 split. Then, as data rolls in showing that group A is more likely to convert, shift more users over to group A. Keep a few users on B to keep gathering data. Eventually, when enough data has come in, it might turn out that flow A doesn't work at all for users in France - so the ideal would be for most users in France to end up in group B, whereas the rest of the world is in group A.

I want the framework to do all this behind the scenes - and preferably with statistical rigor. And then to tell me which groups have diminished to near zero (allowing me to remove the associated code).

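What is being asked for is essentially a multi-armed bandit. A minimal Thompson-sampling sketch (non-contextual, so it does not yet do the per-country routing; variant names and rates are illustrative):

```python
import random

class ThompsonSampler:
    """Beta-Bernoulli Thompson sampling over a fixed set of variants."""

    def __init__(self, variants=("A", "B")):
        # [successes + 1, failures + 1] per variant, i.e. a Beta(1, 1) prior.
        self.params = {v: [1, 1] for v in variants}

    def assign(self):
        # Draw a plausible conversion rate per variant and route to the best draw.
        draws = {v: random.betavariate(a, b) for v, (a, b) in self.params.items()}
        return max(draws, key=draws.get)

    def record(self, variant, converted):
        self.params[variant][0 if converted else 1] += 1

random.seed(7)
sampler = ThompsonSampler()
true_rates = {"A": 0.04, "B": 0.05}              # unknown in reality; simulated here
for _ in range(20_000):
    v = sampler.assign()
    sampler.record(v, random.random() < true_rates[v])

# Most traffic ends up on the better-converting variant, with a trickle kept on the other.
print(sampler.params)
```

A per-segment version would keep one sampler per country (or fit a contextual bandit), which is what would let France drift to B while the rest of the world drifts to A.
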
wasmitnetzen, almost 2 years ago

Posthog is on developerdan's "Ads & Tracking" blocklist [1], if you're wondering why this doesn't load.

[1]: https://github.com/lightswitch05/hosts/blob/master/docs/lists/ads-and-tracking-extended.txt

2rsf, almost 2 years ago

Another challenge, related more to implementation than theory, is having too many experiments running in parallel.

As a company grows there will be multiple experiments running in parallel, executed by different teams. The underlying assumption is that they are independent, but that is not necessarily true, or at least not entirely correct. For example, a graphics change on the main page together with a change in the login logic.

Obviously this can be solved by communication, for example by documenting running experiments, but like many other aspects of A/B testing there is a lot of guesswork and gut feeling involved.

jedberg, almost 2 years ago

The biggest mistake engineers make about A/B testing is not recognizing local maxima. Your test may be super successful, but there may be an even better solution that's significantly different from what you've arrived at.

It's important not only to A/B test minor changes, but to occasionally throw in some major changes to see if they move the same metric, possibly even more than your existing success.

rmetzler, almost 2 years ago

If I read the first mistake correctly, getFeatureFlag() has the side effect of counting how often it was called, and that count is used to calculate the outcome of the experiment? Wow. I don't know what to say...

dbroockman, almost 2 years ago

Another one: don't program your own A/B testing framework! Every time I've seen engineers try to build this on their own, it fails an A/A test (where both versions are the same, so there should be no difference). Common reasons are overly complicated randomization schemes (keep it simple!) and differences in load times between test and control.

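An A/A check is easy to simulate: feed two identical variants through the same analysis and confirm that only about the nominal 5% of runs come out "significant". A sketch under assumed traffic numbers and a made-up baseline rate:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# A/A simulation: both arms share the same true conversion rate, so a correct
# assignment + analysis pipeline should flag roughly alpha of the runs as "significant".
alpha, runs, n_per_arm, true_rate = 0.05, 5000, 10_000, 0.03
conv_a = rng.binomial(n_per_arm, true_rate, size=runs)
conv_b = rng.binomial(n_per_arm, true_rate, size=runs)

false_positives = 0
for ca, cb in zip(conv_a, conv_b):
    table = [[ca, n_per_arm - ca], [cb, n_per_arm - cb]]
    _, p, _, _ = stats.chi2_contingency(table, correction=False)
    false_positives += p < alpha

print(f"A/A false-positive rate: {false_positives / runs:.1%} (expected ≈ {alpha:.0%})")
```

A homegrown framework with biased bucketing or unequal load times will show up here as a false-positive rate well above 5%.
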
alberth, almost 2 years ago

Enough traffic.

Isn't the biggest problem with A/B testing that very few websites even have enough traffic to properly measure statistical differences?

Essentially making A/B testing useless for 99.9% of websites.

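The standard two-proportion sample-size formula gives a feel for how much traffic "enough" is. With an assumed 3% baseline conversion rate and a 10% relative lift to detect (illustrative numbers, not from the article), each arm needs on the order of 50,000 users:

```python
from statistics import NormalDist

def users_per_arm(p1, p2, alpha=0.05, power=0.80):
    """Approximate sample size per arm for a two-sided test of two proportions."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return (z_alpha + z_power) ** 2 * variance / (p1 - p2) ** 2

# Illustrative: 3% baseline conversion, hoping to detect a 10% relative lift (3.0% -> 3.3%).
n = users_per_arm(0.03, 0.033)
print(f"≈ {n:,.0f} users per arm, ≈ {2 * n:,.0f} in total")   # roughly 53k per arm
```
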
masswerk, almost 2 years ago

Regarding 7):

> Good hypothesis: User research showed that users are unsure of how to proceed to the checkout page. Changing the button's color will lead to more users noticing it (…)

Mind that you first have to prove that this premise is actually true. Your user research is probably exploratory, qualitative data based on a small sample. At this point, it's rather an assumption. You have to transform and test this (by quantitative means) for validity and significance. Only then can you proceed to the button hypothesis. Otherwise, you are still testing multiple things at once, based on an unclear hypothesis, while merely assuming that part of this hypothesis is actually valid.

mabbo, almost 2 years ago

> The solution is to use an A/B test running time calculator to determine if you have the required statistical power to run your experiment and for how long you should run your experiment.

Wouldn't it be better to have an A/B testing system that just counts how many users have been in each assignment group and ends when you have the required statistical power?

Time just seems like a stand-in for "that should be enough", when in reality the number of users actually exposed might differ from your expectations.

iudqnolq, almost 2 years ago

Point one seems to be an API naming issue. I would not anticipate getFeatureFlag to increment a hit counter. Seems like it should be called something like participateInFlagTest or whatever. Or maybe it should take a (key, arbitraryId) instead of just (key), use the hash of the id to determine if the flag is set, and idempotently register a hit for the id.

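A sketch of that second suggestion (the function names are the commenter's hypothetical ones, not any real SDK's API): assignment is a pure hash of the flag key and user id, and the exposure hit is registered idempotently and separately from assignment.

```python
import hashlib

_exposures: set[tuple[str, str]] = set()   # stand-in for an analytics event sink

def variant_for(flag_key: str, user_id: str, variants=("control", "test")) -> str:
    """Pure function: the same (flag, user) pair always maps to the same variant."""
    digest = hashlib.sha256(f"{flag_key}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % len(variants)
    return variants[bucket]

def participate_in_flag_test(flag_key: str, user_id: str) -> str:
    """Assignment plus an idempotent exposure event, named so the side effect is obvious."""
    _exposures.add((flag_key, user_id))    # a set makes repeated calls count the user once
    return variant_for(flag_key, user_id)

# Repeated calls neither reshuffle the user nor double-count the exposure.
print(participate_in_flag_test("new-checkout-button", "user-42"))
print(participate_in_flag_test("new-checkout-button", "user-42"))
print(len(_exposures))                     # 1
```
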
realjohng, almost 2 years ago

Thanks for posting this. It's to the point and easy to understand. And much needed: most companies seem to do testing without teaching the intricacies involved.

drpixie, almost 2 years ago

> Relying too much on A/B tests for decision-making

Need I say more? Or just keep tweaking your website until it becomes a mindless, grey sludge.

franze, almost 2 years ago

Plus, mind the Honeymoon Effect: something new performs better just because it's new.

If you have a platform with lots of returning users, this one will hit you again and again. So even if you have a winner after the test and make the change permanent, revisit it two months later and see if you are really better off now.

In sum, all the changes from A/B tests have a high chance of just adding up to an average platform.

donretag, almost 2 years ago

If anyone from PostHog is reading this, please fix your RSS feed. The link actually points back to the blog homepage.

2OEH8eoCRo0, almost 2 years ago

*Every* engineer? Electrical engineers? Kernel developers? Defense workers?

I hesitate to write this (because I don't want to be negative), but I get the sense that most software "engineers" have a very narrow view of the industry at large. Or this forum leans a particular way.

I haven't A/B tested in my last three roles. Two of them were defense jobs; my current job deals with the Linux kernel.

jldugger, almost 2 years ago

> 6. Not accounting for seasonality

Doesn't the online nature of an A/B test automatically account for this?

pil0u, almost 2 years ago

In the second table, shouldn't the mobile control conversion rate be 3.33%?

time4tea, almost 2 years ago

Annoying illegal cookie consent banner?

methou, almost 2 years ago

Probably off-topic, but how does one opt out of most A/B testing?