<a href="https://arxiv.org/abs/1705.08439" rel="nofollow">https://arxiv.org/abs/1705.08439</a><p>The original paper.<p>The references in the paper paint a much clearer picture of where exactly the idea behind reinforcement learning with optimal, suboptimal, random oracles comes from. There are also mathematical proofs that these setups work.<p>I was quite shocked to not see [6, 16] references in any of the recent MCTS papers.<p>These references prove why the stuff works and show how well it works. But the whole field of imitation learning seems invisible to the deep RL papers. Don't have the faintest idea why.<p>The algorithm described is the ultimate generalized algorithm. If you have the expert policy the algorithm is learning completely supervised, if expert policy is suboptimal but the score (loss) is fully calculable the learned policy will outperform the reference policy, if expert policy is completely random the algorithm behaves as reinforcement learning.<p>What the paper at the top adds is the ability to improve the expert policy with the learned one simultaneously in unison and the math covered previously guarantees improvement.