This is exciting because this architecture had so much promise, but we could never solve its gradient and parallelization problems as well as transformers did.<p>This code will let people experiment and see whether it is a viable architecture at foundation/frontier model scale.