UCB bandit algorithms are quite simple to understand, and their generalisation to tree search spaces by the UCT algorithm seems like a pretty clean idea.

I had a go at using UCT to explore a search space earlier this year (not games, but selecting a good sequence of decisions inside a simulation).

The problem I ran into was that the algorithm has no way to generalise or lift experience from one explored state to other similar, unexplored states, and my search space was large, so having to evaluate each state at least once was prohibitive.

This isn't really a criticism of how well UCT tackles the problem as framed; it's more an observation that one might be able to do much better with a bit more structure -- e.g. some kind of similarity metric between states, along with bounds or assumptions on how the values of "similar" states should relate to each other.

It seems like combining machine learning / statistical / approximation techniques to approximate the true state->value function with a UCT-like approach to guide exploration could be very effective for the right problem, but getting it to work well might be more of an art than a science.

If anyone has pointers to reading about this kind of thing, I'd be most grateful! I have previously read about "approximate dynamic programming" techniques, e.g. [1].

[1] http://castlelab.princeton.edu/Papers/Powell-NRLWhat%20you%20should%20know%20about%20approximate%20dynamic%20programming.pdf
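
To make the UCB1/UCT part concrete, here's a minimal sketch of what I mean (the Node class and names are just placeholders, not from any particular library): score each child by its mean observed value plus an exploration bonus, and UCT simply applies that rule at every node of the tree.

    # Minimal sketch of the UCB1 rule and the UCT selection step.
    # Node and the constant c are placeholders for illustration only.
    import math

    def ucb1_score(total_value, visits, parent_visits, c=math.sqrt(2)):
        # Mean observed value plus an exploration bonus that shrinks
        # as this arm/child accumulates visits.
        if visits == 0:
            return float("inf")  # force every child to be tried at least once
        return total_value / visits + c * math.sqrt(math.log(parent_visits) / visits)

    class Node:
        def __init__(self, state, parent=None):
            self.state = state
            self.parent = parent
            self.children = {}   # action -> Node
            self.visits = 0
            self.total_value = 0.0

    def select_child(node):
        # UCT: apply the UCB1 rule at each node of the search tree.
        return max(node.children.values(),
                   key=lambda ch: ucb1_score(ch.total_value, ch.visits, node.visits))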
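
And roughly the sort of thing I mean by adding a similarity metric (purely a sketch with made-up names, reusing the Node fields above, not something I've validated): estimate the value of a state UCT hasn't visited from the returns of its nearest already-evaluated neighbours, and blend that estimate into the score until real visit counts take over.

    # Sketch only: similarity-based generalisation bolted onto the UCT score.
    # The feature vectors, distance metric, and blending weight are all
    # assumptions on my part, not an established method.
    import math

    def euclidean(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    class SimilarityValueEstimate:
        # Estimate an unseen state's value as the average return of its
        # k nearest already-evaluated states under some distance metric.
        def __init__(self, k=5, distance=euclidean):
            self.k = k
            self.distance = distance
            self.memory = []  # list of (features, observed_return)

        def update(self, features, observed_return):
            self.memory.append((features, observed_return))

        def estimate(self, features, default=0.0):
            if not self.memory:
                return default
            nearest = sorted(self.memory,
                             key=lambda fr: self.distance(features, fr[0]))[:self.k]
            return sum(r for _, r in nearest) / len(nearest)

    def informed_uct_score(child, parent_visits, value_est, features, c=math.sqrt(2)):
        # Like ucb1_score above, but an unvisited child starts from the
        # similarity-based estimate instead of +infinity, and the estimate's
        # influence fades as real visit counts accumulate.
        prior = value_est.estimate(features)
        if child.visits == 0:
            return prior + c
        w = 1.0 / (1.0 + child.visits)
        mean = (1 - w) * (child.total_value / child.visits) + w * prior
        bonus = c * math.sqrt(math.log(parent_visits) / child.visits)
        return mean + bonus

The point is only that the bandit-style exploration bonus and an approximate value function can coexist in one score; whether a blend like the one above is actually sensible is exactly the "more art than science" part.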