DenseFormer: Enhancing Information Flow in Transformers

123 points by tipsytoad about 1 year ago

10 comments

p1esk about 1 year ago
This method has only been tested on tiny models (<1B parameters) and a tiny dataset (17B tokens). It's not clear whether it scales.
valine about 1 year ago
The architecture changes are very straightforward. Model merging has shown that pre-trained transformer layers are very robust. I'll bet it's possible to fine-tune a pre-trained model like Mistral to use this architecture. That would let someone test it with more parameters without training a whole new base model.
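
A minimal sketch of the retrofit valine is describing: wrap an existing stack of pre-trained blocks with DenseFormer-style depth-weighted averaging (DWA), with the new weights initialised so the wrapped model is numerically identical to the original before any fine-tuning. `blocks` is a stand-in for real pre-trained layers (which in practice take attention masks and return tuples), so treat this as an illustration of the idea rather than a working adapter for Mistral.

```python
# Sketch: wrapping pre-trained transformer blocks with DenseFormer-style
# depth-weighted averaging (DWA). Not the paper's code; `blocks` is a stand-in
# for real pre-trained layers.
import torch
import torch.nn as nn


class DWAWrapper(nn.Module):
    def __init__(self, blocks: nn.ModuleList):
        super().__init__()
        self.blocks = blocks
        # alphas[i][j] weights the output of block j (index 0 = embedding output)
        # when mixing the representation fed to block i+1. Identity init: weight 1
        # on the newest output, 0 elsewhere, so behaviour matches the base model
        # until fine-tuning moves the weights.
        self.alphas = nn.ParameterList()
        for i in range(len(blocks)):
            w = torch.zeros(i + 2)
            w[-1] = 1.0
            self.alphas.append(nn.Parameter(w))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        outputs = [x]                       # y_0: the embedding output
        for i, block in enumerate(self.blocks):
            outputs.append(block(x))        # run block i on the current mixed state
            stacked = torch.stack(outputs)  # (i + 2, batch, seq, d_model)
            x = torch.einsum("k,kbtd->btd", self.alphas[i], stacked)
        return x
```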
tbalsam about 1 year ago
This is a very interesting idea. With DenseNets there are often some terrible memory gotchas that have bitten me over the past 7-8 years or so, so part of me is leaning back waiting for some memory-usage shoe to drop that isn't spelled out in the paper (even with the activation patterns!).

However, maybe that's not the case here. I have a bit of a history of messing with residuals in neural networks, and seeing more work on them is good. Fast-training networks are a mild obsession of mine as well, and very useful to the field. Here's hoping this pans out as a motif; curious to see where it goes.
sp332 about 1 year ago
Even better is the result on page 7 that perplexity drops faster by wall-clock time. Even if you're getting fewer iterations per hour of rented GPU time, you're still coming out ahead in model performance.
ml_basics about 1 year ago
Cool paper. Really interesting to see that even quite straightforward architectural modifications haven't all been exhausted yet, despite all the resources being poured into LLMs.
danieldk about 1 year ago
Nice finding, and it makes a lot of sense! It is somewhat related to classification heads that use their own weighted representation of all transformer layer outputs.

I only glanced at the paper, but they don't seem to softmax the α_i for normalization?
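
For readers wondering what the unnormalised weighting looks like in practice, here is a toy comparison of a raw, identity-initialised depth-weighted average against a softmax-normalised variant. Shapes and names are illustrative rather than taken from the paper's code, and "identity initialisation" is my reading of the paper, not a quote from it.

```python
# Toy comparison: raw learned DWA weights vs. a softmax-normalised alternative.
# Shapes are arbitrary; this is not the authors' implementation.
import torch
import torch.nn.functional as F

outputs = torch.randn(5, 2, 8, 16)  # (outputs so far, batch, seq, d_model)

# Unnormalised weights, identity-initialised: only the newest output contributes.
alpha = torch.zeros(5)
alpha[-1] = 1.0
alpha.requires_grad_()
dwa_raw = torch.einsum("k,kbtd->btd", alpha, outputs)

# Softmax-normalised variant: weights become positive and sum to 1, which rules
# out an exact identity initialisation (softmax can only approach one-hot).
dwa_norm = torch.einsum("k,kbtd->btd", F.softmax(alpha, dim=0), outputs)
```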
zwaps about 1 year ago
1. They compare with an older, more or less standard implementation of a transformer. Unsure whether the results would be equally significant compared to models with gated units, multi-query attention, etc.

2. The difference seems to diminish with scale. Real-life transformers obviously are much larger and train on many more tokens.

3. A very significant part of training transformer models is the throughput and memory optimizations. I wonder how their model would work with fused kernels or specialized paged KV-cache schemes, or activation checkpointing if run locally.

4. They do claim no memory impact, but their code shows that their experiments are conducted with a special optimized version which requires all activations to reside in a single tensor at all times. Not sure this would work with 3D parallelism on multiple nodes, etc.
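
To make point 4 concrete, here is a rough reconstruction of the single-buffer layout zwaps describes: every block output lives in one pre-allocated tensor, so each DWA step is a single einsum over a slice of that buffer. This is an assumption about what such an optimisation might look like, not the authors' actual code, and the block call is replaced by a placeholder.

```python
# Rough reconstruction of a "single activation tensor" DWA loop (assumed layout,
# not the paper's implementation).
import torch

num_blocks, B, T, D = 12, 4, 128, 512
acts = torch.empty(num_blocks + 1, B, T, D)   # slot 0 holds the embedding output
alphas = torch.zeros(num_blocks, num_blocks + 1)
alphas[torch.arange(num_blocks), torch.arange(num_blocks) + 1] = 1.0  # identity init

x = torch.randn(B, T, D)
acts[0] = x
for i in range(num_blocks):
    acts[i + 1] = x            # placeholder for block_i(x) in a real model
    # DWA over everything computed so far, read straight from the shared buffer.
    x = torch.einsum("k,kbtd->btd", alphas[i, : i + 2], acts[: i + 2])
```

The appeal is avoiding per-layer stacking and re-allocation; the downside zwaps points at is that a buffer of shape (L+1, B, T, D) resident on one device is awkward to shard across nodes or to combine with activation checkpointing.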
matteopagli about 1 year ago
I'm one of the authors, happy to answer questions.
efrank3 about 1 year ago
Can't believe nobody thought of this yet.
aoeusnth1 about 1 year ago
> Impact statement:

> This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none which we feel must be specifically highlighted here.

I found this particularly charming.