科技回声 (Tech Echo) — a tech news platform built with Next.js, providing global tech news and discussion.


Writing an LLM from scratch, part 10 – dropout

90 points · by gpjt · about 2 months ago

3 comments

tony-allan · about 2 months ago
https://www.manning.com/books/build-a-large-language-model-from-scratch
Comment #43419652 not loaded.
Scene_Cast2 · about 2 months ago
I never did as much thinking or testing of dropout on transformers as the author, but it didn't seem to help with my "baby" (~10 million param) transformer models. IIRC the latest Llama models don't use dropout either.
Comment #43419552 not loaded.
xg15 · about 2 months ago
&gt; <i>So you&#x27;d call the dropout function on the activations from each layer, zeroing out some at random so that they don&#x27;t contribute to the &quot;downstream&quot; calculations. (As I understand it, this means that they are also not adjusted during back-propagation -- if nothing else, it would be terribly unfair to the poor ignored neurons to have their weights changed when they didn&#x27;t contribute to the error.)</i><p>If the weights are effectively set to zero by the dropout, shouldn&#x27;t the propagated error in the backward pass be zero too, automatically?<p>(I.e., as I understand it, OP&#x27;s intuitive notion of &quot;fairness&quot; is literally how the error propagation works: Neurons are adjusted by the degree by which they contributed to the output)
Comment #43428593 not loaded.
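xg15's question largely answers itself once dropout is written out explicitly: the backward pass multiplies by the same mask used in the forward pass, so dropped positions automatically receive zero gradient. A minimal pure-Python sketch of inverted dropout (not from the article; function names and values are illustrative):

```python
import random

def dropout_forward(x, p):
    """Inverted dropout: zero each activation with probability p and
    scale survivors by 1/(1-p) so the expected activation is unchanged."""
    mask = [0.0 if random.random() < p else 1.0 / (1.0 - p) for _ in x]
    return [xi * mi for xi, mi in zip(x, mask)], mask

def dropout_backward(grad_out, mask):
    """Backward pass re-applies the same mask, so positions dropped in
    the forward pass receive exactly zero gradient."""
    return [g * mi for g, mi in zip(grad_out, mask)]

random.seed(0)
acts, mask = dropout_forward([0.3, -1.2, 0.7, 2.5, -0.4, 1.1], p=0.5)
grads = dropout_backward([1.0] * 6, mask)  # pretend upstream grad is all ones

# Dropped activations get no error signal -- the "fairness" the OP
# wished for falls out of the math automatically.
for m, g in zip(mask, grads):
    if m == 0.0:
        assert g == 0.0
```

In frameworks such as PyTorch the same thing happens without any special casing, since the mask multiplication is just another differentiable operation; at inference time the dropout layer is disabled entirely and acts as the identity.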