Discussion & links to various implementations: <a href="https://www.reddit.com/r/MachineLearning/comments/eg1wr3/reformer_the_efficient_transformer_anonymous_et/" rel="nofollow">https://www.reddit.com/r/MachineLearning/comments/eg1wr3/ref...</a>
There is no argument in the paper for why the LSH bucketing should work well, especially at the beginning of training. Since the weights are initially random, bucket assignment is effectively random as well (rough sketch below). If predicting at position A requires information from position B but they don't land in the same bucket, there is no gradient to pull the query embedding of A closer to the key embedding of B. The reversible-layer trick is neat, though.
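For concreteness, here is a rough numpy sketch of the kind of random-rotation bucketing involved (not the paper's code; the shapes, bucket count, and shared query/key vectors are simplified assumptions on my part):

    import numpy as np

    def lsh_buckets(vectors, n_buckets, seed=0):
        # Angular LSH: project onto random directions and take the most-aligned
        # one as the bucket (a simplified version of the paper's scheme).
        rng = np.random.default_rng(seed)
        rot = rng.normal(size=(vectors.shape[-1], n_buckets // 2))
        projected = vectors @ rot                       # (seq_len, n_buckets // 2)
        return np.argmax(np.concatenate([projected, -projected], axis=-1), axis=-1)

    # With untrained, random embeddings the buckets are essentially random:
    # two positions that *should* attend to each other share a bucket with
    # probability ~1/n_buckets, and if they don't, attention never connects
    # them, so no gradient pulls their embeddings together.
    seq_len, d_model, n_buckets = 1024, 64, 32
    qk = np.random.normal(size=(seq_len, d_model))      # shared query/key vectors
    buckets = lsh_buckets(qk, n_buckets)
    print((buckets[0] == buckets[1:]).mean())           # roughly 1/32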
One neat trick is that you can extend GPT-2 117M's context window from 1024 tokens up to 30k on a TPU, since TPUs can allocate up to 300GB of memory for backprop. <a href="https://twitter.com/gwern/status/1218001309435072513" rel="nofollow">https://twitter.com/gwern/status/1218001309435072513</a><p>It's not quite 1M words, but a 30k context window is big enough for, e.g., most MIDI songs.
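Back-of-envelope for why memory is the bottleneck (my numbers, not from the tweet; assuming fp32 and GPT-2 117M's 12 layers x 12 heads, naively storing every full attention matrix for backprop):

    seq_len, layers, heads, bytes_fp32 = 30_000, 12, 12, 4

    per_head = seq_len * seq_len * bytes_fp32       # one seq_len x seq_len matrix
    total = per_head * heads * layers
    print(per_head / 1e9, "GB per head per layer")  # ~3.6 GB
    print(total / 1e9, "GB for the whole model")    # ~518 GB

LSH attention avoids materializing the full attention matrices, and reversible layers avoid storing per-layer activations, which is what makes a 30k window fit at all.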
This seems like a big deal. Attention cost grows quadratically with window size, so an asymptotic reduction there should allow the development of substantially more complex models.
Vowpal Wabbit has been doing this 'hashing trick' since the 2000s (sketch below).<p>It also has feature interactions, which are the same thing as a layer in transformers (an all-against-all matrix).<p>So it seems like they are still catching up to where John Langford and crew were over a decade ago.<p>And the Vowpal Wabbit approach is extremely fast to train, because it's only doing stochastic gradient descent on a linear function - linear regression. Transformers are much slower to train.<p>EDIT: Downvoters, please see my last reply in this thread for why they're effectively the same. The person responding here seems unfamiliar with the full functionality of Vowpal Wabbit.
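A from-scratch sketch of what I mean by the hashing trick plus all-against-all interactions (illustrative only, not VW's actual code; VW's -q crosses features between namespaces, which I'm approximating here by crossing every feature pair, and the feature names are made up):

    import zlib

    BITS = 18                           # weights live in a fixed 2**BITS table
    N = 1 << BITS
    w = [0.0] * N

    def h(feature: str) -> int:
        # The hashing trick: hash the feature name straight to a weight index.
        return zlib.crc32(feature.encode()) & (N - 1)

    def predict(features):
        # Linear terms plus quadratic (all-against-all) interaction terms.
        idxs = [h(f) for f in features]
        idxs += [h(a + "^" + b) for a in features for b in features if a < b]
        return sum(w[i] for i in idxs), idxs

    def sgd_update(features, y, lr=0.1):
        # Plain SGD on squared loss: one pass over the data, no big matrices.
        pred, idxs = predict(features)
        for i in idxs:
            w[i] -= lr * (pred - y)

    sgd_update(["user=alice", "item=42", "hour=9"], y=1.0)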