TechEcho

A tech news platform built with Next.js, providing global tech news and discussions.


Run fewer, better A/B tests

94 points by econti almost 4 years ago

9 comments

btilly almost 4 years ago
The notifications examples make me wonder what fundamental mistakes they are making.

People respond to change. If you A/B test, say, a new email headline, the change usually wins. Even if it isn't better. Just because it is different. Then you roll it out in production, look at it a few months later, and it is probably worse.

If you don't understand downsides like this, then A/B testing is going to have a lot of pitfalls that you won't even know that you fell into.
tootie almost 4 years ago
I've seen a lot of really sophisticated data pipelines and testing frameworks at a lot of shops. I've seen precious few who were able to make well-considered product decisions based on the data.
jonathankoren almost 4 years ago
I'm pretty skeptical of this. I've run a lot of ML-based A/B tests over my career. I've talked to a lot of people who have also run ML A/B tests over their careers. And the one constant everyone has discovered is that offline evaluation metrics are only somewhat directionally correlated with online metrics.

Seriously. A/B tests are kind of a crap shoot. The systems are constantly changing. The online inference data drifts from the historical training data. User behavior changes.

I've seen positive offline models perform flat. I've seen *negative* offline metrics perform positively. There's just a lot of variance between offline and online performance.

Just run the test. Lower the friction for running the tests, and just run them. It's the only way to be sure.
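For readers unfamiliar with the OPE (off-policy evaluation) being debated here: the core idea can be sketched with an inverse-propensity-scoring (IPS) estimator. This is a minimal illustration with made-up log data, not the article's actual implementation:

```python
def ips_estimate(logs, target_policy):
    """Estimate the average reward a new policy would have earned on
    logged (context, action, reward, logging_prob) tuples collected
    under a known logging policy, by reweighting each logged reward."""
    total = 0.0
    for context, action, reward, logging_prob in logs:
        # How much more (or less) often the target policy would have
        # taken the action the logging policy actually took.
        weight = target_policy(context)[action] / logging_prob
        total += reward * weight
    return total / len(logs)

# Toy logs from a uniform-random logging policy over two actions.
logs = [
    (None, 0, 1.0, 0.5),  # action 0 earned reward 1
    (None, 1, 0.0, 0.5),  # action 1 earned reward 0
]
always_zero = lambda context: [1.0, 0.0]  # candidate policy: always pick action 0
print(ips_estimate(logs, always_zero))  # prints 1.0
```

The gap jonathankoren describes is exactly the weakness of this estimator: the reweighting is only unbiased if the logged world still resembles the one the new policy will run in.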
bruce343434 almost 4 years ago
Between the emojis in the headings and the 2009-era memes, this was a bit of a cringey read. Also, the author seems to avoid at all costs going in depth about the actual implementation of OPE, and I still don't quite understand how I would go about implementing it. Machine learning based on past A/B tests that finds similarities between the UI changes???
dr_dshiv almost 4 years ago
The challenge I've seen is to have a combination of good, small-scale Human-Centered Design research (watching people work, for instance) and good, large-scale testing. It can be really hard to learn the "why" from A/B tests otherwise.
austincheney almost 4 years ago
When I was the A/B test guy for Travelocity I was fortunate to have an excellent team. The largest bias we discovered is that our tests were executed with amazing precision and durability. My dedicated QA was the shining star that made that happen. Unfortunately, when the resulting feature entered the site in production as a released feature, there was always some defect, or some conflict, or some oversight. The actual business results would then underperform compared to the team's analyzed prediction.
eximius almost 4 years ago
I'm going to stick with multi-armed bandit testing.
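For context: multi-armed bandit testing adaptively shifts traffic toward the better-performing variant instead of holding a fixed split for the whole test. A minimal Thompson-sampling sketch for Bernoulli rewards (the conversion rates and pull counts are invented for illustration):

```python
import random

def thompson_bernoulli(true_rates, pulls, seed=0):
    """Thompson sampling over Bernoulli arms: keep a Beta posterior per
    arm, sample once from each posterior, and pull the argmax. Traffic
    drifts toward the best arm as its posterior tightens."""
    rng = random.Random(seed)
    successes = [1] * len(true_rates)  # Beta(1, 1) uniform priors
    failures = [1] * len(true_rates)
    total_reward = 0
    for _ in range(pulls):
        samples = [rng.betavariate(successes[i], failures[i])
                   for i in range(len(true_rates))]
        i = samples.index(max(samples))
        reward = 1 if rng.random() < true_rates[i] else 0  # simulated pull
        total_reward += reward
        if reward:
            successes[i] += 1
        else:
            failures[i] += 1
    return total_reward, successes, failures

# Two variants with (unknown to the algorithm) 20% and 80% success rates.
reward, s, f = thompson_bernoulli([0.2, 0.8], pulls=2000)
```

After 2000 pulls the vast majority of traffic ends up on the 80% arm, which is the trade-off eximius is pointing at: less regret during the test, at the cost of a messier fixed-horizon significance story.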
sbierwagen almost 4 years ago
> Now you might be thinking OPE is only useful if you have Facebook-level quantities of data. Luckily that's not true. If you have enough data to A/B test policies with statistical significance, you probably have more than enough data to evaluate them offline.

Isn't there a multiple comparisons problem here? If you have enough data to do a single A/B test, how can you do a hundred historical comparisons and still have the same p-value?
varsketiz almost 4 years ago
Recently I've heard booking.com given as an example of a company that runs a lot of A/B tests. Anyone from Booking reading this? How does it look from the inside, and is it worth it to run hundreds at a time?