Airbnb could likely get a lot more bang for their buck by letting hosts run experiments on pricing than by testing button colors and whatnot.<p>I ran an online marketplace at a previous gig. Our service providers always complained that they didn't know what to charge to maximize their business. They couldn't see the forest for the trees. Because we had the data for all providers, we started telling them when they were under- or over-priced, and we saw more conversions and revenue.<p>Dynamic pricing (like Uber does on holidays) alone could be hugely valuable.
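For flavor, a toy Python version of that under/over-priced nudge; the function, thresholds, and numbers are all hypothetical, not our actual system:

    # Toy pricing feedback: rank a provider's price against comparable
    # listings and nudge them if they sit in the tails. All thresholds
    # here are made up for illustration.
    def price_feedback(my_price, comparable_prices, low_pct=25, high_pct=75):
        below = sum(p < my_price for p in comparable_prices)
        pct = 100.0 * below / len(comparable_prices)
        if pct < low_pct:
            return "likely underpriced vs. comparable listings"
        if pct > high_pct:
            return "likely overpriced vs. comparable listings"
        return "in line with the market"

    print(price_feedback(80, [95, 99, 105, 110, 120, 135, 150]))
    # -> "likely underpriced vs. comparable listings"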
A simple hack is to run an A-A-B-B test instead of an A-B test. Rather than splitting 50-50, use 25-25-25-25 splits. When A1 == A2 and B1 == B2 (i.e., the duplicate buckets are statistically indistinguishable), you know your randomization is sound and you have enough data to compare A to B. Depending on traffic, this could take minutes or weeks.
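A minimal Python sketch of that check, assuming you log (conversions, visitors) per bucket; the counts below are made up:

    # A-A-B-B sanity check: only compare A to B once the duplicate
    # buckets agree with each other.
    from scipy.stats import chi2_contingency

    def indistinguishable(conv1, n1, conv2, n2, alpha=0.05):
        """True if two buckets' conversion rates can't be told apart."""
        table = [[conv1, n1 - conv1], [conv2, n2 - conv2]]
        _, p, _, _ = chi2_contingency(table)
        return p > alpha

    a1, a2 = (130, 1000), (121, 1000)   # (conversions, visitors), hypothetical
    b1, b2 = (155, 1000), (149, 1000)

    if indistinguishable(*a1, *a2) and indistinguishable(*b1, *b2):
        a = (a1[0] + a2[0], a1[1] + a2[1])   # pool the duplicate buckets
        b = (b1[0] + b2[0], b1[1] + b2[1])
        print("A vs B significant?", not indistinguishable(*a, *b))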
Statisticians have been thinking about the right way to handle these sorts of problems for a long time: <a href="https://en.wikipedia.org/wiki/Sequential_analysis" rel="nofollow">https://en.wikipedia.org/wiki/Sequential_analysis</a>.<p>Funnily enough, the page they reference for calculating the right sample size actually talks about sequential analysis, but AirBnB doesn't mention this in describing their solution...
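For the curious, a toy Wald SPRT for a conversion rate, the classic tool from that page; the rates and error levels below are illustrative, not anything AirBnB uses:

    import math

    def sprt(observations, p0=0.10, p1=0.12, alpha=0.05, beta=0.05):
        """Wald's SPRT for a Bernoulli rate: stop as soon as the evidence
        crosses a boundary instead of waiting for a fixed sample size."""
        upper = math.log((1 - beta) / alpha)   # cross it -> accept H1 (rate is p1)
        lower = math.log(beta / (1 - alpha))   # cross it -> accept H0 (rate is p0)
        llr = 0.0                              # running log-likelihood ratio
        for i, converted in enumerate(observations, 1):
            llr += math.log((p1 if converted else 1 - p1) /
                            (p0 if converted else 1 - p0))
            if llr >= upper:
                return "accept H1 after %d samples" % i
            if llr <= lower:
                return "accept H0 after %d samples" % i
        return "keep sampling"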
HN user btilly has a really helpful essay on the math behind stopping tests earlier than your predetermined sample size. It calls for setting a maximum duration and provides stopping points along the way, similar to the method AirBnB describes.<p><a href="http://elem.com/~btilly/ab-testing-multiple-looks/part2-limited-data.html" rel="nofollow">http://elem.com/~btilly/ab-testing-multiple-looks/part2-limi...</a>
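In the same spirit, a crude, conservative stand-in (not btilly's exact procedure): pre-plan a maximum sample size and k looks, and spend alpha/k at each look so the overall false-positive rate stays bounded:

    # Pre-planned interim looks with a Bonferroni-style alpha split.
    # Conservative but valid; btilly's essay derives tighter boundaries.
    from scipy.stats import chi2_contingency

    def run_with_looks(stream_a, stream_b, max_n=10000, looks=5, alpha=0.05):
        per_look = alpha / looks
        checkpoints = {max_n * (i + 1) // looks for i in range(looks)}
        conv_a = conv_b = n = 0
        for xa, xb in zip(stream_a, stream_b):   # one visitor per arm at a time
            conv_a += xa
            conv_b += xb
            n += 1
            if n in checkpoints:
                table = [[conv_a, n - conv_a], [conv_b, n - conv_b]]
                _, p, _, _ = chi2_contingency(table)
                if p < per_look:
                    return "stopped early at n=%d per arm, p=%.4f" % (n, p)
        return "no significant difference by max_n"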
As a developer working on a shopping page for a major competitor to AirBnB, having implemented hundreds of experiments on my page, I can say that these guys are way too obsessed with statistical certainty.<p>Rate of deployment of experiments is a better thing to focus on: since all your competitors are bound to copy your winners anyway, you have to rely on the few months' edge you earn before they do, and constantly maintain that lead.
This article contains some serious p-value abuse. The significance threshold should be adjusted to account for multiple testing, to minimise the chance that a hypothesis gets accepted purely by random chance.<p>Try setting your threshold to your Type 1 error rate <i>divided by the number of tests you perform</i> (the Bonferroni correction). It will be <i>much</i> smaller, and this is a good thing. A significance test should really detect an effect, not reward random chance.
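A minimal sketch of that correction, with made-up p-values:

    # Bonferroni: divide the overall Type 1 error rate by the number of
    # tests, and only accept hypotheses that clear the stricter bar.
    def bonferroni(p_values, alpha=0.05):
        threshold = alpha / len(p_values)
        return [p < threshold for p in p_values]

    print(bonferroni([0.003, 0.04, 0.20, 0.011]))
    # threshold is 0.05 / 4 = 0.0125 -> [True, False, False, True]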
I wish AirBnB would make the price slider's scale logarithmic, to match how prices themselves are roughly distributed. I'm usually only using the left-most 5% of that slider.
The cult of statistical significance is alive and well. A 0.05 p-value implies a 1-in-20 chance of "alternative" performing worse once finally deployed. That's rather risk averse. It also assumes that "alternative" is worse from the get-go. When is that the case? The costs of Type 1 and Type 2 errors are much more balanced in web apps. Anyone care to show me why that's a bad mentality?
Ok, I'll be "that" guy who heckles every AirBnB post, even if this one did have some nice graphs (and ideas).<p>When is AirBnB going to experiment with helping their hosts follow the law? I bet I can predict that graph. Why, look at all those illegal rentals in SF right there in the sample screenshots--oh the irony.<p>Remember, DON'T FUCK UP THE CULTURE! But it's OK to fuck up your host city for a buck or 2 billion.