
Simplifying Transformer Blocks

142 points by georgehill over 1 year ago

11 comments

low_tech_love over 1 year ago
“This complexity leads to brittle architectures, where seemingly minor changes can significantly reduce training speed, or render models untrainable.”

I am by no means an expert and I can’t verify the authors’ claims about reduced speed and untrainability, but this reflects an impression I’ve been getting from the papers I read and review. The field of ML research is moving so fast that people don’t even take the time anymore to explain the design decisions behind their architectures. It’s basically “we got nice results, and here is the architecture of the model” (proceeds to show a figure with a hundred coloured blocks connected together in some seemingly random, complex way).

It used to be that such a thing would get backlash from reviewers, and they would require you to actually justify the design. I don’t see that anymore. The problem with this for me is that we fail to build a nice, crisp understanding of the effect of each design decision on the final outcomes, which hurts the actual “science” of it. It also opens up the field for bogus and unreproducible claims.

But at least other people are picking up on the thread and doing that in follow-up papers, which is good.
Bayes7 over 1 year ago
“[...] modern neural network (NN) architectures have complex designs with many components [...]”

I find the Transformer architecture actually very simple compared to previous models like LSTMs or other recurrent models. You could argue that their vision counterparts like ViT are conceptually maybe even simpler than ConvNets?

Also, can someone explain why they are so keen to remove the skip connections? At least when it comes to coding, nothing is simpler than adding a skip connection, and computationally the effect should be marginal?
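To ground that point, here is a minimal sketch of a standard pre-LayerNorm transformer block in PyTorch (dimensions and layer choices are illustrative, not the paper's baseline): each skip connection is literally one `x + sublayer(x)` add, which is the commenter's argument that the coding and compute cost of keeping them is negligible.

```python
# Minimal pre-LayerNorm transformer block (illustrative sizes, not the paper's setup).
# Each skip connection is a single elementwise add: "x + sublayer(x)".
import torch
import torch.nn as nn

class PreNormBlock(nn.Module):
    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # attention skip connection
        x = x + self.mlp(self.norm2(x))                    # MLP skip connection
        return x

block = PreNormBlock(d_model=64, n_heads=4, d_ff=256)
print(block(torch.randn(2, 10, 64)).shape)  # torch.Size([2, 10, 64])
```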
imjonse over 1 year ago
“While we have demonstrated the efficacy of our simplifications across architectures, datasets, and tasks, the models we have considered (100-300M parameters) are small relative to the largest transformers.”

University researchers without a big lab's backing cannot try out such experiments on really large models.
chessgecko over 1 year ago
Not sure if I read it correctly, but it seems like the skip connections are kind of still present in the skipless block, because they add the identity after the softmax and use the previous hidden state as the values. Still a cool paper if they really managed to get rid of those projections without degrading the quality.
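A rough sketch of the implicit skip this comment describes: an identity matrix mixed into the attention weights, with the hidden state itself used as the values, so each token partly copies its own previous state even without an explicit residual add. The fixed coefficients and missing centering term mean this is not the paper's exact shaped-attention formulation, only an illustration of the mechanism.

```python
# Illustrative only: identity mixed into the attention matrix acts like a skip.
# alpha/beta are placeholder constants, not the paper's learned coefficients.
import torch
import torch.nn.functional as F

def attention_with_identity(x: torch.Tensor, w_q: torch.Tensor, w_k: torch.Tensor,
                            alpha: float = 1.0, beta: float = 1.0) -> torch.Tensor:
    # x: (seq, d); the values are the hidden state itself (no value projection)
    q, k = x @ w_q, x @ w_k
    scores = q @ k.T / (k.shape[-1] ** 0.5)
    attn = F.softmax(scores, dim=-1)
    eye = torch.eye(x.shape[0])
    return (alpha * eye + beta * attn) @ x  # the identity term carries the "skip"

seq, d = 5, 16
x = torch.randn(seq, d)
out = attention_with_identity(x, torch.randn(d, d), torch.randn(d, d))
print(out.shape)  # torch.Size([5, 16])
```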
karmasimida over 1 year ago
There could be a CISC vs. RISC argument for Transformers, which I can see, but my bet is that the traditional transformer has established enough fundamentals that it is going to take a long time for alternative architectures to prove they are indeed alternatives on a wide range of tasks.

Pretty much, at this point, to succeed the transformer an alternative model would need to achieve ChatGPT-level performance with a significant reduction in either compute or data requirements.
WithinReason over 1 year ago
This is a nice start, but what would really help is for someone who understands the GPU programming model in depth to give this a shot, with the goal of reducing DRAM bandwidth and fitting the layers exactly onto a GPU's memory hierarchy (cache levels, local memory and registers). Basically, the sizes of a target HW platform's memories should be hyper-parameters of the architecture.
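A back-of-the-envelope sketch of that idea: treat an assumed on-chip cache size as a constraint and check which block widths keep their weights inside it. All sizes here (cache bytes, fp16 weights, the candidate dimensions) are placeholder assumptions, not measurements of any specific GPU.

```python
# Sketch: check whether a transformer block's weights fit an assumed cache budget.
# Every number below is an illustrative assumption.
def block_param_bytes(d_model: int, d_ff: int, bytes_per_param: int = 2) -> int:
    attn = 4 * d_model * d_model          # Q, K, V, and output projections
    mlp = 2 * d_model * d_ff              # up- and down-projections
    return (attn + mlp) * bytes_per_param

ASSUMED_CACHE_BYTES = 50 * 2**20          # hypothetical ~50 MiB on-chip cache

for d_model, d_ff in [(1024, 4096), (2048, 8192), (4096, 16384)]:
    size = block_param_bytes(d_model, d_ff)
    verdict = "fits" if size <= ASSUMED_CACHE_BYTES else "exceeds"
    print(f"d_model={d_model}: {size / 2**20:.1f} MiB per block ({verdict} assumed cache)")
```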
Buttons840 over 1 year ago
What is “Shaped Attention”? They simplified everything except for changing “Attention” to “Shaped Attention”.
samus over 1 year ago
I love reading about papers like these. They raise hopes that novel model architectures might reduce the computational resources needed to train powerful models, which would help lower the barrier to entry.
bilsbie over 1 year ago
I wonder why training is only 15% faster? It seems like simplifying the main element would make a huge difference.
ofou over 1 year ago
Is anyone aware of similar simplifications?
m3kw9 over 1 year ago
Do they have a demo of this theory?