If you are not familiar with RL, I recommend first reading the two articles that the author links to:<p>- <a href="https://www.alexirpan.com/2018/02/14/rl-hard.html" rel="nofollow">https://www.alexirpan.com/2018/02/14/rl-hard.html</a><p>- <a href="https://himanshusahni.github.io/2018/02/23/reinforcement-learning-never-worked.html" rel="nofollow">https://himanshusahni.github.io/2018/02/23/reinforcement-lea...</a><p>They are not so recent anymore, but they still capture the problem well.<p>Long story short: RL doesn't work yet. We're not sure it ever will. Some big companies are betting that it will.<p>> My own hypothesis is that the reward function for learning organisms is really driven from maintaining homeostasis and minimizing surprise.<p>Both directions are actively researched: maximizing surprise (to improve exploration) and minimizing surprise (to improve exploitation).<p>See, e.g., "Exploration by Random Network Distillation" for the former and "Surprise Minimizing RL in Dynamic Environments" for the latter.
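<p>To make the first idea concrete, here is a rough sketch of the Random Network Distillation trick (all shapes and the learning rate are illustrative, not from the paper): the intrinsic reward is the error a trained predictor makes when mimicking a fixed, randomly initialized target network, so novel states yield a large exploration bonus and frequently visited states a shrinking one.

```python
# Toy sketch of the RND intrinsic-reward idea (linear "networks" for brevity;
# the paper uses deep nets). The target network is frozen at random init;
# only the predictor is trained.
import numpy as np

rng = np.random.default_rng(0)
OBS_DIM, FEAT_DIM, LR = 8, 4, 0.01  # illustrative choices

W_target = rng.normal(size=(OBS_DIM, FEAT_DIM))  # fixed, never trained
W_pred = np.zeros((OBS_DIM, FEAT_DIM))           # trained to match target

def intrinsic_reward(obs):
    """Exploration bonus = predictor's error vs. the frozen target."""
    err = obs @ W_target - obs @ W_pred
    return float(np.mean(err ** 2))

def train_predictor(obs):
    """One gradient step shrinking the prediction error on obs."""
    global W_pred
    err = obs @ W_target - obs @ W_pred
    # Gradient of 0.5 * ||err||^2 w.r.t. W_pred is -outer(obs, err).
    W_pred += LR * np.outer(obs, err)

obs = rng.normal(size=OBS_DIM)
r_before = intrinsic_reward(obs)   # novel state: large bonus
for _ in range(200):
    train_predictor(obs)
r_after = intrinsic_reward(obs)    # familiar state: bonus has decayed
assert r_after < r_before
```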