I spent two years[0] designing, building, and maintaining a system that used contextual multi-armed bandits at large scale. A couple of pieces of advice relating to this post and this subject:

1. Thompson sampling is great. It's intuitive and computationally tractable. The literature is full of other strategies, notably semi-uniform ones, but I strongly recommend Thompson sampling if it works for your problem. (A minimal sketch is included at the end of this comment.)

2. This is broadly true of ML, but for contextual bandits most of the engineering work will probably be the feature engineering, not the algorithm implementation. Plan accordingly. Choosing the right inputs in the first place makes a big difference, and the hashing trick (sklearn's FeatureHasher, the hashing counterpart of DictVectorizer) can make a huge one. (See the hashing example below.)

3. It can be difficult to get organizational alignment on the intent behind using reinforcement learning. Tell stakeholders early and often that you're using bandit algorithms to produce a specific outcome, say clicks or conversions, and not to do science that will uncover deep truths.

[0] along with an excellent data scientist and a team of excellent engineers, of course :)
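To make point 1 concrete, here's a minimal Thompson sampling sketch for a plain Beta-Bernoulli bandit (no context), which is the easiest place to see why it's intuitive: sample a plausible reward rate from each arm's posterior and play the arm whose sample wins. The arm count, priors, and simulated reward rates are illustrative, not anything from our system.

    # Minimal Beta-Bernoulli Thompson sampling, illustrative only.
    import numpy as np

    rng = np.random.default_rng(0)
    n_arms = 3
    alpha = np.ones(n_arms)  # 1 + observed successes per arm
    beta = np.ones(n_arms)   # 1 + observed failures per arm

    def choose_arm():
        # Draw one plausible reward rate per arm from its posterior, play the best.
        return int(np.argmax(rng.beta(alpha, beta)))

    def update(arm, reward):
        # Bernoulli reward in {0, 1}; the posterior update is just counting.
        alpha[arm] += reward
        beta[arm] += 1 - reward

    true_rates = [0.10, 0.30, 0.50]  # stand-in environment
    for _ in range(1000):
        arm = choose_arm()
        update(arm, int(rng.random() < true_rates[arm]))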
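And here's a sketch of the hashing trick from point 2, using sklearn's FeatureHasher. The context fields and n_features value are made up for illustration; the point is that the feature matrix keeps a fixed width no matter how many distinct categorical values show up in production.

    # Hashed context features via sklearn's FeatureHasher; field names and
    # n_features are made up for illustration.
    from sklearn.feature_extraction import FeatureHasher

    hasher = FeatureHasher(n_features=2**18, input_type="dict")

    contexts = [
        {"user_country": "US", "device": "mobile", "hour_of_day": "14"},
        {"user_country": "DE", "device": "desktop", "hour_of_day": "02"},
    ]

    # String values hash as "field=value" indicator features; no vocabulary
    # has to be built or stored up front.
    X = hasher.transform(contexts)
    print(X.shape)  # (2, 262144) sparse matrix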