Implementation of Google's Griffin Architecture – RNN LLM

218 points by milliondreams about 1 year ago

4 comments

VHRanger about 1 year ago
Like RWKV and Mamba, this mixes in some RNN properties to avoid the issues transformers have.

However, I'm curious about their scaling claims. They have a plot that shows how the model scales in training with the FLOPs you throw at it.

But the issue we should really be concerned with is the wall time of training for a set amount of hardware.

Back in 2018, we could train medium-sized RNNs; the issue was the wall time of training and training stability.
Comment #39994871 not loaded
Comment #39994916 not loaded
Comment #39995050 not loaded
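
For intuition on VHRanger's wall-time point: Griffin-style recurrences are linear in the hidden state, so the whole sequence can be computed with an associative (parallel) scan instead of a strictly sequential loop. Below is a minimal, hypothetical sketch in JAX of a diagonal gated recurrence, not Griffin's actual RG-LRU parameterization; all names are illustrative.

```python
import jax
import jax.numpy as jnp

# Toy diagonal linear recurrence: h_t = a_t * h_{t-1} + b_t, with h_0 = 0.
# (Simplified for illustration; Griffin's RG-LRU adds input/recurrence
# gating and a specific parameterization on top of this idea.)

def sequential(a, b):
    # O(T) serial steps: the wall-time bottleneck classic RNNs hit.
    h = jnp.zeros_like(b[0])
    out = []
    for t in range(b.shape[0]):
        h = a[t] * h + b[t]
        out.append(h)
    return jnp.stack(out)

def parallel(a, b):
    # Same recurrence via an associative scan. Composing the affine maps
    # h -> a1*h + b1 then h -> a2*h + b2 gives h -> (a1*a2)*h + (a2*b1 + b2),
    # and that composition is associative, so all prefixes can be computed
    # in O(log T) depth on parallel hardware.
    def combine(x, y):
        a1, b1 = x
        a2, b2 = y
        return a1 * a2, a2 * b1 + b2
    _, h = jax.lax.associative_scan(combine, (a, b))
    return h  # with h_0 = 0, the offset term of each prefix map is h_t

T, D = 16, 4
a = jax.nn.sigmoid(jax.random.normal(jax.random.PRNGKey(0), (T, D)))  # gates in (0, 1)
b = jax.random.normal(jax.random.PRNGKey(1), (T, D))
assert jnp.allclose(sequential(a, b), parallel(a, b), atol=1e-5)
```

The sequential loop is why 2018-era RNNs were slow to train in wall time; the scan formulation is the property these linear-recurrence models exploit to train at transformer-like speed.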
riku_iki about 1 year ago
I didn't get one detail: they selected a 6B transformer as the baseline and compared it to a 7B Griffin.

Why wouldn't they select equal-size models?
Comment #39994681 not loaded
janwas about 1 year ago
For anyone interested in a C++ implementation, our github.com/google/gemma.cpp now supports this model.
Comment #39999864 not loaded
spxneo about 1 year ago
I'm not smart enough to know the significance of this... is Griffin like Mamba?
Comment #39996525 not loaded