科技回声 (TechEcho)

Deep Reinforcement Learning is a waste of time

21 points, posted by snaky over 5 years ago

3 comments

beisner, over 5 years ago
In a lot of ways, the field has already come to this conclusion. At NeurIPS this year some of the biggest topics in Deep RL were model-based RL and meta-learning for RL, both of which aim to learn a generalized representation of an environment that can be used in a variety of downstream tasks.
MasterScrat, over 5 years ago
If you are not familiar with RL, I recommend first reading the two articles that the author links to:

- https://www.alexirpan.com/2018/02/14/rl-hard.html
- https://himanshusahni.github.io/2018/02/23/reinforcement-learning-never-worked.html

They are not so recent anymore, but they still capture the problem well.

Long story short: RL doesn't work yet. We're not sure it'll ever work. Some big companies are betting that it will.

> My own hypothesis is that the reward function for learning organisms is really driven from maintaining homeostasis and minimizing surprise.

Both directions are actively researched: maximizing surprise (to improve exploration), and minimizing surprise (to improve exploitation).

See e.g. "Exploration by Random Network Distillation" for the first, and "Surprise Minimizing RL in Dynamic Environments" for the second.
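
As a rough illustration of the "maximizing surprise" direction, here is a toy sketch of a prediction-error novelty bonus, loosely in the spirit of random network distillation. It is not the paper's architecture: the class name `NoveltyBonus`, the tiny linear predictor, the `tanh` target, and all hyperparameters are assumptions made up for this example. The idea it shows is only that a predictor trained on visited states makes small errors on familiar states and larger ones on unseen states, so the error can serve as an exploration bonus.

```python
import math
import random


class NoveltyBonus:
    """Toy novelty bonus: prediction error against a fixed random target.

    Hypothetical illustration only; names and sizes are assumptions.
    """

    def __init__(self, state_dim, feat_dim=8, lr=0.05, seed=0):
        rng = random.Random(seed)
        # Fixed, randomly initialised target map (never trained).
        self.target = [[rng.gauss(0, 1) for _ in range(state_dim)]
                       for _ in range(feat_dim)]
        # Predictor weights, trained to imitate the target on visited states.
        self.pred = [[0.0] * state_dim for _ in range(feat_dim)]
        self.lr = lr

    def _target_features(self, state):
        return [math.tanh(sum(w * x for w, x in zip(row, state)))
                for row in self.target]

    def _pred_features(self, state):
        return [sum(w * x for w, x in zip(row, state)) for row in self.pred]

    def bonus(self, state):
        """Squared prediction error, used here as the 'surprise' signal."""
        t = self._target_features(state)
        p = self._pred_features(state)
        return sum((pi - ti) ** 2 for pi, ti in zip(p, t))

    def update(self, state):
        """One gradient step pulling the predictor toward the target."""
        t = self._target_features(state)
        p = self._pred_features(state)
        for row, pi, ti in zip(self.pred, p, t):
            grad_scale = 2.0 * (pi - ti)
            for j, x in enumerate(state):
                row[j] -= self.lr * grad_scale * x


if __name__ == "__main__":
    nb = NoveltyBonus(state_dim=4)
    familiar = [1.0, 0.0, 0.0, 0.0]
    novel = [0.0, 1.0, 0.0, 0.0]
    print("bonus before visits:", nb.bonus(familiar))
    for _ in range(50):
        nb.update(familiar)
    print("bonus after visits: ", nb.bonus(familiar))
    print("bonus for new state:", nb.bonus(novel))
```

An exploration-driven agent would add some multiple of `bonus(state)` to its task reward. A surprise-minimizing agent would, roughly speaking, penalize a surprise measure instead, though the paper named above defines surprise differently from this toy error.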
w1nst0nsm1th, over 5 years ago
Sometimes sending a letter is the best way to do it.

Some systems fail to even implement the concept of reward (and punishment): the agent is not 'aware' of what a reward (or a 'punishment') is, so it doesn't even know that it is being rewarded (or 'punished') in the first place. Then the system has to be redesigned to optimize the code.

Sometimes AI is the least straightforward solution, the most expensive, and the least efficient in terms of results.