This style of documentation is called <i>literate programming</i>; if you've never heard of the term, it's worth googling it and the various implementations for widespread programming languages. It's an eye-opener how clear, transparent and well-intertwined good code and comments can be.<p>I once used such a literate programming style with scientific Python in university classes, and it was a breeze to prepare and hand in exercise sheets (rendered with LaTeX to PDF). My feeling is that today people use Jupyter/IPython notebooks to achieve something similar (especially with embedded results), but a Jupyter notebook is much more complex than a traditional, clean, terminal-readable literate programming source file.
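Just to illustrate the flavour (a made-up fragment, not from the actual course material): a literate-style source file reads top to bottom as prose, with small pieces of code in between, roughly like this:<p><pre><code> # We want the sample mean and population variance of some measurements.
 # Keep running sums so the data only has to be traversed once.
 values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

 total = sum(values)
 total_sq = sum(v * v for v in values)
 n = len(values)

 # mean = sum(x)/n, and var = E[x^2] - E[x]^2 for the population variance.
 mean = total / n
 var = total_sq / n - mean * mean
 print(f"mean={mean:.2f}, variance={var:.2f}")
 </code></pre>
Even a toy like this stays perfectly readable in a plain terminal, which is the part notebooks lose.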
In their Transformer section they have implementations of:<p><pre><code> - kNN-LM: Generalization through Memorization
- Feedback Transformer
- Switch Transformer
</code></pre>
All of these are from recent, highly interesting papers.
Something like this could be incredibly helpful with arXiv articles: being able to link a fragment of text or a formula directly to the actual implementation. It could save so much time and ping-ponging between the article and the code.
I thought of a change to gradient accumulation, which I call Adam accumulation:<p><a href="https://twitter.com/theshawwn/status/1355343951033602057" rel="nofollow">https://twitter.com/theshawwn/status/1355343951033602057</a><p><a href="https://news.ycombinator.com/item?id=25964420" rel="nofollow">https://news.ycombinator.com/item?id=25964420</a><p>Unfortunately, no one seems to understand it, which isn't a great sign. I'm either not explaining it very well, or the idea doesn't make sense.<p>In short:<p><pre><code> for example in batch:
     accum += adam(gradients(example))
 param += accum
 accum = 0
</code></pre>
That way, the Adam statistics are updated for every training example.<p>Traditional gradient accumulation looks like this:<p><pre><code> for example in batch:
     accum += gradients(example)
 param += adam(accum)
 accum = 0
</code></pre>
... which updates the Adam statistics only once per batch.<p>(It's equivalent to a bigger batch size.)<p>Probably best to just implement Adam accumulation and see if it works, I suppose.<p>(Sorry for rambling about this here. I was just hoping to find some prior work along these lines, in case anyone knows of something.)
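For anyone who wants to poke at it, here is a rough, untested PyTorch sketch of what I mean; the hand-rolled adam_step, the toy fitting problem and the hyperparameters are just illustration, not a claim about how it should be tuned:<p><pre><code> import torch

 def adam_step(grad, state, lr=1e-2, beta1=0.9, beta2=0.999, eps=1e-8):
     # One Adam update for a single per-example gradient; the moment
     # estimates in `state` advance on every call.
     state["t"] += 1
     state["m"] = beta1 * state["m"] + (1 - beta1) * grad
     state["v"] = beta2 * state["v"] + (1 - beta2) * grad ** 2
     m_hat = state["m"] / (1 - beta1 ** state["t"])
     v_hat = state["v"] / (1 - beta2 ** state["t"])
     return -lr * m_hat / (v_hat.sqrt() + eps)

 # Toy problem: fit a single weight so that param * x is close to 2 * x.
 param = torch.zeros(1, requires_grad=True)
 state = {"t": 0, "m": torch.zeros(1), "v": torch.zeros(1)}
 data = [(torch.tensor([float(x)]), torch.tensor([2.0 * x])) for x in range(1, 9)]

 for epoch in range(300):
     for batch in (data[:4], data[4:]):
         accum = torch.zeros(1)
         for x, y in batch:
             loss = ((param * x - y) ** 2).mean()
             grad, = torch.autograd.grad(loss, param)
             accum += adam_step(grad, state)  # Adam stats move per example
         with torch.no_grad():
             param += accum                   # weights move once per batch

 print(param.item())  # should end up close to 2.0
 </code></pre>
The only real difference from the traditional loop is which side of adam() the accumulation happens on.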