
Statistical significance & other A/B test pitfalls

19 points by japetheape almost 15 years ago

4 comments

shalmanese almost 15 years ago
It's disturbing to me how p < 0.05 is used somewhat unthinkingly as the test for statistical significance simply because it's ubiquitous in science.

It seems to me that if you have even a somewhat popular app, you're gathering enough data that you can afford to use p < 0.001 and avoid a lot of the complexities of statistical analysis that come with p < 0.05. If you don't have enough data to reach p < 0.001, it's probably better to spend the effort increasing traffic than chasing the piddling gains from A/B testing so early.
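As a rough illustration of that tradeoff (a sketch of mine, not something from the article or the comment), here is how the traffic needed per variant grows when the threshold is tightened from 0.05 to 0.001, using a standard two-proportion power approximation; the 5% baseline rate and the 5% → 6% lift are made-up numbers.

```python
# Rough sketch (not from the article): sample size per variant needed to
# detect a made-up 5% -> 6% conversion lift at 80% power, for two
# significance thresholds.
from scipy.stats import norm

def n_per_group(p1, p2, alpha, power=0.8):
    """Approximate n per variant for a two-sided two-proportion z-test."""
    z_alpha = norm.ppf(1 - alpha / 2)   # critical value for the chosen threshold
    z_power = norm.ppf(power)           # quantile for the desired power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return (z_alpha + z_power) ** 2 * variance / (p1 - p2) ** 2

for alpha in (0.05, 0.001):
    print(f"alpha={alpha}: ~{n_per_group(0.05, 0.06, alpha):,.0f} visitors per variant")
```

With these illustrative numbers the stricter threshold costs a bit more than twice the traffic, which is the sense in which a reasonably popular app can afford it.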
_delirium almost 15 years ago
Another common issue he doesn't mention: using observed differences (or observed significance-test values) as the stopping criterion. The common statistical-significance tests *don't* work if the decision when to stop collecting data is dependent on the observed levels of significance. Instead, you must decide ahead of time how many trials to do, and stick to that decision, or use more complicated significance tests. (This is the "multiple testing" problem.)

For example, it works to flip two coins 50 times each, and then run a statistical-significance test. It does *not* work to flip two coins 50 times each, run a test, and if there's no significance yet, continue to 100, then 150, etc. until you either find a significant difference or give up. That greatly increases the chance that you'll get a spurious significance, because your stopping is biased in favor of answering "yes": if you found a difference at 50, you don't go on to 100 (where maybe the difference would disappear again), but if you *didn't* find a difference at 50, you *do* go on to 100.

Put differently, it's using separate p-values for "what is the chance I could've gotten this result in [50|100|150|...] trials with unweighted coins?" to reject the null hypothesis each time, as if they were independent, but the null hypothesis for the entire series has to be the union, "what is the chance I could've seen this result at *any* of the 50, 100, 150, 200, ... stopping points with unweighted coins?", which is higher. Yet that's exactly how many A/B tests are done: you start collecting data, and let the trials run until you find "significant" differences or give up.

(It's possible to set up a series of tests where you choose when to stop based on observed values, but you have to use different statistical machinery than the common significance tests.)
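A small simulation (mine, not the commenter's) makes the bias concrete: two fair coins, so every "significant" difference is spurious, tested either once at a fixed sample size or repeatedly at the 50/100/150/200 checkpoints described above.

```python
# Simulation of the peeking problem described above (my sketch): two FAIR
# coins, so any "significant" difference is a false positive. Testing at
# every checkpoint and stopping early inflates the false-positive rate well
# above the nominal 5%; testing once at a fixed n stays near 5%.
import math
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
TRIALS = 2000
CHECKPOINTS = (50, 100, 150, 200)

def p_value(heads_a, n_a, heads_b, n_b):
    """Two-sided two-proportion z-test with a pooled standard error."""
    p_pool = (heads_a + heads_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (heads_a / n_a - heads_b / n_b) / se
    return 2 * norm.sf(abs(z))

peeking_hits = fixed_hits = 0
for _ in range(TRIALS):
    a = rng.integers(0, 2, size=max(CHECKPOINTS))  # coin A, P(heads) = 0.5
    b = rng.integers(0, 2, size=max(CHECKPOINTS))  # coin B, P(heads) = 0.5
    # Peeking: test at each checkpoint and stop as soon as p < 0.05.
    if any(p_value(a[:n].sum(), n, b[:n].sum(), n) < 0.05 for n in CHECKPOINTS):
        peeking_hits += 1
    # Fixed horizon: decide n = 200 up front and test exactly once.
    if p_value(a.sum(), 200, b.sum(), 200) < 0.05:
        fixed_hits += 1

print(f"false positive rate with peeking:     {peeking_hits / TRIALS:.1%}")
print(f"false positive rate at fixed n = 200: {fixed_hits / TRIALS:.1%}")
```

In typical runs the peeking arm reports a spurious "difference" noticeably more often than the nominal 5%, and the inflation keeps growing as more checkpoints are added.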
ugh almost 15 years ago
Wait, so people who do A/B tests didn't already do that? It drives me absolutely crazy when I don't have any measure to assess how likely or unlikely it is for some difference to be random.
seis6 almost 15 years ago
I have a test to propose.

Many people think they will become millionaires if they follow the style of person X.

Person X is like a trial in which a coin was tossed 10000 times and came up heads 6000 times.

Since there is no information about the other persons, the other trials, many fall into the illogical thinking that they will succeed in the same way.
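For what it's worth, here is a quick check (my framing and code, not the commenter's) of that analogy: the single-trial probability is easy to compute, but it cannot be interpreted without knowing how many other trials the "successful" one was picked from.

```python
# Quick check of the coin analogy (my sketch, not the commenter's). For a
# single, pre-chosen trial, 6000 heads in 10000 fair tosses would be
# astronomically unlikely. The commenter's point is that the calculation is
# meaningless when you only ever see the best of an unknown number of trials.
from scipy.stats import binom

p_single = binom.sf(5999, 10000, 0.5)   # P(X >= 6000) for one fair coin
print(f"P(>= 6000 heads in 10000 fair tosses) ~ {p_single:.1e}")

# With m unseen trials, the chance that at least one looks this good is
# roughly 1 - (1 - p_single)**m, so how impressive the "winner" really is
# depends entirely on m, which is exactly the missing information.
```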