Like RWKV and Mamba, this mixes RNN properties into the architecture to avoid the quadratic attention cost that transformers have.

However, I'm curious about their scaling claims. They show a plot of how the model improves in training with the FLOPs you throw at it.

But the quantity we should really be concerned with is the wall-clock time of training on a fixed amount of hardware.

Back in 2018 we could already train medium-sized RNNs; the problems were the wall-clock time of training and training stability.
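To make the FLOPs-vs-wall-clock point concrete, here is a minimal sketch (assuming PyTorch; the shapes and the toy recurrence are illustrative, not from the paper): a step-by-step recurrence and a single batched matmul do roughly the same number of FLOPs, yet the sequential one can take far longer on the same hardware, which is exactly the gap a FLOPs-only scaling plot hides.

    import time
    import torch

    batch, seq_len, d = 32, 512, 1024
    x = torch.randn(batch, seq_len, d)

    # Shared weight matrix; scaled so tanh doesn't saturate immediately.
    w = torch.randn(d, d) / d ** 0.5

    def rnn_like(x):
        # Sequential recurrence: each step depends on the previous hidden state.
        h = torch.zeros(x.size(0), x.size(-1))
        for t in range(x.size(1)):
            h = torch.tanh(x[:, t] + h @ w)
        return h

    def parallel_like(x):
        # One batched matmul with roughly the same FLOP count
        # (~ batch * seq_len * d * d multiply-adds in both cases).
        return x @ w

    def wall_time(fn, *args, reps=3):
        fn(*args)  # warm-up
        t0 = time.perf_counter()
        for _ in range(reps):
            fn(*args)
        return (time.perf_counter() - t0) / reps

    print(f"sequential recurrence: {wall_time(rnn_like, x):.3f} s/iter")
    print(f"parallel matmul:       {wall_time(parallel_like, x):.3f} s/iter")

On a GPU the gap typically widens further, because each recurrence step launches its own small kernels and cannot be parallelized across time.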