TechEcho

A tech news platform built with Next.js, providing global tech news and discussions.


Run fewer, better A/B tests

94 points by econti almost 4 years ago

9 comments

btilly almost 4 years ago
The notifications examples make me wonder what fundamental mistakes they are making.

People respond to change. If you A/B test, say, a new email headline, the change usually wins. Even if it isn't better. Just because it is different. Then you roll it out in production, look at it a few months later, and it is probably worse.

If you don't understand downsides like this, then A/B testing is going to have a lot of pitfalls that you won't even know that you fell into.
tootie almost 4 years ago
I've seen a lot of really sophisticated data pipelines and testing frameworks at a lot of shops. I've seen precious few who were able to make well-considered product decisions based on the data.
jonathankoren almost 4 years ago
I'm pretty skeptical of this. I've run a lot of ML-based A/B tests over my career. I've talked to a lot of people who have also run ML A/B tests over their careers. And the one constant everyone has discovered is that offline evaluation metrics are only somewhat directionally correlated with online metrics.

Seriously. A/B tests are kind of a crap shoot. The systems are constantly changing. The online inference data drifts from the historical training data. User behavior changes.

I've seen positive offline models perform flat. I've seen *negative* offline metrics perform positively. There's just a lot of variance between offline and online performance.

Just run the test. Lower the friction for running the tests, and just run them. It's the only way to be sure.
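For readers unfamiliar with the OPE (off-policy evaluation) being debated here: the core idea can be sketched with an inverse-propensity-scoring (IPS) estimator. This is a minimal illustration with made-up log data, not the article's actual implementation:

```python
def ips_estimate(logs, target_policy):
    """Estimate the average reward a new policy would have earned on
    logged (context, action, reward, logging_prob) tuples collected
    under a known logging policy, by reweighting each logged reward."""
    total = 0.0
    for context, action, reward, logging_prob in logs:
        # How much more (or less) often the target policy would have
        # taken the action the logging policy actually took.
        weight = target_policy(context)[action] / logging_prob
        total += reward * weight
    return total / len(logs)

# Toy logs from a uniform-random logging policy over two actions.
logs = [
    (None, 0, 1.0, 0.5),  # action 0 earned reward 1
    (None, 1, 0.0, 0.5),  # action 1 earned reward 0
]
always_zero = lambda context: [1.0, 0.0]  # candidate policy: always pick action 0
print(ips_estimate(logs, always_zero))  # prints 1.0
```

The gap jonathankoren describes is exactly the weakness of this estimator: the reweighting is only unbiased if the logged world still resembles the one the new policy will run in.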
bruce343434 almost 4 years ago
Between the emojis in the headings and the 2009-era memes, this was a bit of a cringey read. Also, the author seems to avoid at all costs going in depth about the actual implementation of OPE, and I still don't quite understand how I would go about implementing it. Machine learning based on past A/B tests that finds similarities between the UI changes???
dr_dshiv almost 4 years ago
The challenge I've seen is to have a combination of good, small-scale Human-Centered Design research (watching people work, for instance) and good, large-scale testing. It can be really hard to learn the "why" from A/B tests otherwise.
austincheney almost 4 years ago
When I was the A/B test guy for Travelocity I was fortunate to have an excellent team. The largest bias we discovered is that our tests were executed with amazing precision and durability. My dedicated QA was the shining star that made that happen. Unfortunately, when the resulting feature entered the site in production as a released feature, there was always some defect, or some conflict, or some oversight. The actual business results would then underperform compared to the team's analyzed prediction.
eximius almost 4 years ago
I'm going to stick with multi-armed bandit testing.
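For context: multi-armed bandit testing adaptively shifts traffic toward the better-performing variant instead of holding a fixed split for the whole test. A minimal Thompson-sampling sketch for Bernoulli rewards (the conversion rates and pull counts are invented for illustration):

```python
import random

def thompson_bernoulli(true_rates, pulls, seed=0):
    """Thompson sampling over Bernoulli arms: keep a Beta posterior per
    arm, sample once from each posterior, and pull the argmax. Traffic
    drifts toward the best arm as its posterior tightens."""
    rng = random.Random(seed)
    successes = [1] * len(true_rates)  # Beta(1, 1) uniform priors
    failures = [1] * len(true_rates)
    total_reward = 0
    for _ in range(pulls):
        samples = [rng.betavariate(successes[i], failures[i])
                   for i in range(len(true_rates))]
        i = samples.index(max(samples))
        reward = 1 if rng.random() < true_rates[i] else 0  # simulated pull
        total_reward += reward
        if reward:
            successes[i] += 1
        else:
            failures[i] += 1
    return total_reward, successes, failures

# Two variants with (unknown to the algorithm) 20% and 80% success rates.
reward, s, f = thompson_bernoulli([0.2, 0.8], pulls=2000)
```

After 2000 pulls the vast majority of traffic ends up on the 80% arm, which is the trade-off eximius is pointing at: less regret during the test, at the cost of a messier fixed-horizon significance story.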
sbierwagen almost 4 years ago
> Now you might be thinking OPE is only useful if you have Facebook-level quantities of data. Luckily that's not true. If you have enough data to A/B test policies with statistical significance, you probably have more than enough data to evaluate them offline.

Isn't there a multiple comparisons problem here? If you have enough data to do a single A/B test, how can you do a hundred historical comparisons and still have the same p-value?
varsketiz almost 4 years ago
Recently I've heard booking.com given as an example of a company that runs a lot of A/B tests. Anyone from Booking reading this? How does it look from the inside, and is it worth it to run hundreds at a time?