
Single Headed Attention RNN

212 points, by spatters, over 5 years ago

13 comments

albertzeyer, over 5 years ago
The writing style is amusing. :)

Some notes from a first glance:

* In the experiments, I see that he actually uses the Single Headed Attention model with 4 heads as well, which is kind of a contradiction to the name, isn't it?

* The main motivation is performance (mostly training speed). So some absolute numbers, e.g. training time, would be nice to have in the comparisons. He mentions, for example, that the Adaptive Transformer can also be trained on a single GPU within hours, and in the comparison the Adaptive Transformer gets much better BPC (enwik8) and even uses slightly fewer parameters. So isn't the Adaptive Transformer thus better in every aspect (speed and BPC)? Or how does it compare in speed? As far as I remember, the Sparse Transformer is also more efficient (as it has sparsity), so again the speed comparison would be interesting here. Or is the argument about inference speed? But then the inference speed should be compared, or not?
citilife, over 5 years ago
Honestly, I wish all research papers were written this way. Easy to understand, kept me entertained, and presented meaningful results with a way to reproduce them (on a single GPU).

I grant that not every deep learning paper can be reproduced on a single GPU in a reasonable time, but it should happen more often IMO. It seems lazy to just toss out a paper saying "we hit new benchmarks by increasing the parameters and throwing more compute at it". I'd like to see "we hit new benchmarks with a new design; the old ones had this issue", etc.

Anyway, great read, recommend it. Also, happy for the author, haha:

"The author has also moved to a one bedroom apartment in San Francisco, removing themselves from proximity to the alley of questionable odors and unsavory noises."
czr, over 5 years ago
For those who aren't familiar with the author: he previously worked at MetaMind / Salesforce Research doing NLP and has published many successful NLP papers [0]. He opted to write an informal paper for this project (similar to YOLOv3 [1]), but the work itself should still be taken seriously.

[0] https://scholar.google.com/citations?user=AolIi4QAAAAJ

[1] https://pjreddie.com/media/files/papers/YOLOv3.pdf
1maginary, over 5 years ago
You just have to love Stephen Merity.

His work on QRNNs saved me quite a bit of time and money when I was doing my undergrad dissertation on language models.

This SHA-RNN seems to have surfaced from a similar line of thinking that spawned the QRNN.
lopuhin, over 5 years ago
The paper raises a great point about tokenization affecting perplexity: we can't compare the perplexities of different tokenizers, say BPE vs. word tokenization, even when re-normalizing to take token counts into account. This example nails it: https://twitter.com/Smerity/status/1192252147598909441
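To make the re-normalization in question concrete, here is a minimal Python sketch (all counts and perplexities below are invented purely for illustration) that converts a token-level perplexity to bits per character using the token count. The comment's point is that even this adjustment does not make numbers fully comparable across tokenizers, since each tokenizer presents the model with a different symbol stream.

```python
import math

def bits_per_char(token_ppl: float, n_tokens: int, n_chars: int) -> float:
    """Convert a token-level perplexity into bits per character.

    Total bits paid over the text = n_tokens * log2(token_ppl);
    dividing by the character count gives a nominally tokenizer-neutral unit.
    """
    return n_tokens * math.log2(token_ppl) / n_chars

# Invented numbers for the same 1M-character text under two tokenizations.
n_chars = 1_000_000
word_ppl, n_word_tokens = 60.0, 200_000   # word-level: fewer, harder tokens
bpe_ppl, n_bpe_tokens = 18.0, 320_000     # BPE: more, individually easier tokens

print(f"word-level: {bits_per_char(word_ppl, n_word_tokens, n_chars):.3f} bpc")
print(f"BPE:        {bits_per_char(bpe_ppl, n_bpe_tokens, n_chars):.3f} bpc")
# Even with this normalization the comparison is shaky: the two models are
# predicting different symbol streams, and the tokenizer itself bakes in
# knowledge (word boundaries, frequent substrings) the model never pays for.
```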
MiroF, over 5 years ago
Perhaps I am missing the point of this article. The RNN approach seems to get similar performance, but uses more parameters and misses the parallelization benefits that Transformers have and recurrent networks do not.

What is the benefit of the RNN here?
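For context on the parallelization point raised above, here is a toy NumPy sketch (shapes, weights, and the shared projection are arbitrary, chosen only for brevity) contrasting the inherently sequential hidden-state recurrence of an RNN with attention, which processes every position of the sequence in a few batched matrix products.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 8, 4                      # sequence length, hidden size
x = rng.standard_normal((T, d))  # toy input sequence

# RNN: each hidden state depends on the previous one, so the T time steps
# cannot be computed in parallel along the sequence dimension.
W, U = rng.standard_normal((d, d)), rng.standard_normal((d, d))
h = np.zeros(d)
hs = []
for t in range(T):               # inherently sequential loop
    h = np.tanh(x[t] @ W + h @ U)
    hs.append(h)

# (Single-head) attention: all positions attend to all earlier positions at
# once, so the whole sequence is handled by batched matrix products.
Q, K, V = x @ W, x @ W, x @ W    # toy projections; weights shared for brevity
scores = Q @ K.T / np.sqrt(d)
mask = np.tril(np.ones((T, T), dtype=bool))           # causal mask
scores = np.where(mask, scores, -np.inf)
attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
attn /= attn.sum(axis=-1, keepdims=True)
out = attn @ V                   # one parallel pass over all T positions
```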
lucidrains, over 5 years ago
Another work in the opposite direction, introducing gating in Transformer-XL: https://arxiv.org/abs/1910.06764
octocop, over 5 years ago
Hilarious paper, I'm about to drop a SHA-RNN on my GPU to make it sweat.
sbpayne, over 5 years ago
Did anyone else read "SHA-RNN" as "SHHHAAAAARRRROOOOONNN" in Ozzy's voice?
reubens, over 5 years ago
Now that was some refreshing reading.
Dasemu, over 5 years ago
ok
madenine, over 5 years ago
Now I really want pop music made by Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin
toxik, over 5 years ago
A dissenting voice from the positive reception here on HN, I thought that this paper was a joke. Single author, no affiliation, snarky language. Why not be civil instead?