Rephrasing the Web: A Recipe for Compute and Data-Efficient Language Modeling

3 points by jasondavies over 1 year ago

1 comment

gbickford over 1 year ago
This paper is well written. The results are pretty wild. They observed an amazing reduction in the training resources required to reach benchmarks comparable to models trained on conventional data:

> We observe that even at the first checkpoint (10B tokens) of WRAP training, the average perplexity of the LLM on the Pile is lower than that achieved by pre-training on C4 for 15 checkpoints. This suggests a 15x pre-training speed-up.
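For context, the recipe in the title amounts to rewriting raw web documents with an instruction-tuned LLM and pre-training on the rephrased text (usually mixed with the original). Below is a minimal sketch of that idea; the model name, prompt wording, and paraphrase style are illustrative assumptions, not the paper's exact setup.

# Sketch: WRAP-style corpus rephrasing before pre-training.
# Assumptions: an off-the-shelf instruction-tuned model and a generic
# paraphrase prompt; the paper's actual models and prompts may differ.
from transformers import pipeline

rephraser = pipeline(
    "text-generation",
    model="mistralai/Mistral-7B-Instruct-v0.2",  # assumed rephrasing model
)

PROMPT = (
    "Paraphrase the following passage in clear, high-quality prose, "
    "keeping all factual content:\n\n{doc}\n\nParaphrase:"
)

def rephrase(doc: str) -> str:
    """Return a synthetic rephrasing of one web document."""
    out = rephraser(
        PROMPT.format(doc=doc),
        max_new_tokens=300,
        do_sample=False,
    )[0]["generated_text"]
    # The pipeline returns prompt + continuation; keep only the continuation.
    return out.split("Paraphrase:", 1)[-1].strip()

# The rephrased documents then replace (or augment) raw C4 text in an
# otherwise standard pre-training loop.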