One of the assumptions of vanilla multi-armed bandits is that the underlying reward rates are fixed. That assumption doesn't hold in a lot of settings, including e-commerce.

To see how things could go wrong, imagine that you are running a bandit on a website with a control and a treatment variant. After a bit you end up sampling the treatment a little more (say 60:40). You now start running a sale - and the conversion rate for BOTH variants goes up equally (say). But since you are sampling the treatment variant more, a larger share of its observations now come from the higher-converting sale period, so its overall (cumulative) conversion rate goes up faster than the control's - meaning you start weighting even more towards that variant. This could be happening purely because of the sale and random noise at the start - you could even end up optimising towards the wrong variant.

There are more sophisticated MAB approaches that try to drop the fixed reward-rate (stationarity) assumption - they have to model a lot more uncertainty, and so optimise more conservatively.
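A quick way to see the feedback loop is to simulate it. Here's a minimal sketch in Python (all the numbers - base rates, sale timing, lift - are made up for illustration) using Beta-Bernoulli Thompson sampling, where each arm's posterior mean is just its cumulative observed rate:

    import random

    random.seed(0)

    BASE = {"control": 0.05, "treatment": 0.05}  # identical true rates
    SALE_LIFT = 0.05        # the sale lifts BOTH arms by the same amount
    SALE_STARTS_AT = 2_000  # hypothetical step at which the sale begins

    # Beta(1, 1) priors: state[arm] = [alpha = successes + 1, beta = failures + 1]
    state = {arm: [1, 1] for arm in BASE}

    for t in range(10_000):
        # Thompson sampling: draw from each arm's posterior, play the max
        arm = max(state, key=lambda a: random.betavariate(*state[a]))
        rate = BASE[arm] + (SALE_LIFT if t >= SALE_STARTS_AT else 0.0)
        reward = 1 if random.random() < rate else 0
        state[arm][0] += reward
        state[arm][1] += 1 - reward

    for arm, (a, b) in state.items():
        pulls = a + b - 2
        print(f"{arm}: pulls={pulls}, estimated rate={(a - 1) / max(pulls, 1):.3f}")

On many runs, whichever arm happens to be ahead on noise when the sale starts absorbs more of the post-sale samples, so its cumulative estimate climbs first and the allocation snowballs - even though the true rates never differ between the arms.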
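On that last point: one common family of fixes is to forget old data so the estimate can track a moving rate - e.g. discounted Thompson sampling, where the success/failure counts decay each step. A rough sketch, with the decay factor an arbitrary illustrative choice:

    import random

    random.seed(0)

    GAMMA = 0.999  # forgetting factor; closer to 1 = longer memory

    # Discounted successes/failures per arm (prior is added at draw time)
    state = {"control": [0.0, 0.0], "treatment": [0.0, 0.0]}

    def true_rate(t):
        return 0.05 + (0.05 if t >= 2_000 else 0.0)  # same sale lift for both

    for t in range(10_000):
        for s in state.values():  # decay all counts each step, so stale
            s[0] *= GAMMA         # pre-sale evidence fades out over time
            s[1] *= GAMMA
        # Thompson draw from the discounted posterior (+1 = uniform prior)
        arm = max(state, key=lambda a: random.betavariate(state[a][0] + 1,
                                                          state[a][1] + 1))
        reward = 1 if random.random() < true_rate(t) else 0
        state[arm][0] += reward
        state[arm][1] += 1 - reward

The decay keeps the effective sample size bounded, so the posteriors never become overconfident - which is exactly the "model more uncertainty, optimise more conservatively" trade-off: you keep exploring forever, giving up some reward on a truly stationary problem in exchange for being able to track drift.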