If this one is from first principles, I wonder what the others are, since they're all out of the Transformer paper and from first principles too. It would be impossible to make a model using a layer of abstraction without understanding it from first principles.