科技回声 (Tech Echo) — a tech news platform built with Next.js, providing global tech news and discussion.


Writing an LLM from scratch, part 10 – dropout

90 points · by gpjt · about 2 months ago

3 comments

tony-allan · about 2 months ago
https://www.manning.com/books/build-a-large-language-model-from-scratch
Comment #43419652 not loaded.
Scene_Cast2 · about 2 months ago
I never did as much thinking or testing of dropout on transformers as the author, but it didn't seem to help with my "baby" (~10 million param) transformer models. IIRC the latest Llama models don't use dropout either.
Comment #43419552 not loaded.
xg15 · about 2 months ago
&gt; <i>So you&#x27;d call the dropout function on the activations from each layer, zeroing out some at random so that they don&#x27;t contribute to the &quot;downstream&quot; calculations. (As I understand it, this means that they are also not adjusted during back-propagation -- if nothing else, it would be terribly unfair to the poor ignored neurons to have their weights changed when they didn&#x27;t contribute to the error.)</i><p>If the weights are effectively set to zero by the dropout, shouldn&#x27;t the propagated error in the backward pass be zero too, automatically?<p>(I.e., as I understand it, OP&#x27;s intuitive notion of &quot;fairness&quot; is literally how the error propagation works: Neurons are adjusted by the degree by which they contributed to the output)
Comment #43428593 not loaded.
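xg15's question largely answers itself once dropout is written out explicitly: the backward pass multiplies by the same mask used in the forward pass, so dropped positions automatically receive zero gradient. A minimal pure-Python sketch of inverted dropout (not from the article; function names and values are illustrative):

```python
import random

def dropout_forward(x, p):
    """Inverted dropout: zero each activation with probability p and
    scale survivors by 1/(1-p) so the expected activation is unchanged."""
    mask = [0.0 if random.random() < p else 1.0 / (1.0 - p) for _ in x]
    return [xi * mi for xi, mi in zip(x, mask)], mask

def dropout_backward(grad_out, mask):
    """Backward pass re-applies the same mask, so positions dropped in
    the forward pass receive exactly zero gradient."""
    return [g * mi for g, mi in zip(grad_out, mask)]

random.seed(0)
acts, mask = dropout_forward([0.3, -1.2, 0.7, 2.5, -0.4, 1.1], p=0.5)
grads = dropout_backward([1.0] * 6, mask)  # pretend upstream grad is all ones

# Dropped activations get no error signal -- the "fairness" the OP
# wished for falls out of the math automatically.
for m, g in zip(mask, grads):
    if m == 0.0:
        assert g == 0.0
```

In frameworks such as PyTorch the same thing happens without any special casing, since the mask multiplication is just another differentiable operation; at inference time the dropout layer is disabled entirely and acts as the identity.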