<a href="https://arxiv.org/abs/1705.08439" rel="nofollow">https://arxiv.org/abs/1705.08439</a><p>The original paper.<p>The references in the paper paint a much clearer picture of where exactly the idea behind reinforcement learning with optimal, suboptimal, random oracles comes from. There are also mathematical proofs that these setups work.<p>I was quite shocked to not see [6, 16] references in any of the recent MCTS papers.<p>These references prove why the stuff works and show how well it works. But the whole field of imitation learning seems invisible to the deep RL papers. Don't have the faintest idea why.<p>The algorithm described is the ultimate generalized algorithm. If you have the expert policy the algorithm is learning completely supervised, if expert policy is suboptimal but the score (loss) is fully calculable the learned policy will outperform the reference policy, if expert policy is completely random the algorithm behaves as reinforcement learning.<p>What the paper at the top adds is the ability to improve the expert policy with the learned one simultaneously in unison and the math covered previously guarantees improvement.