It's also worth mentioning that the original implementation by Meta is only 300 lines of very readable code [1].<p>[1]: <a href="https://github.com/meta-llama/llama3/blob/main/llama/model.py">https://github.com/meta-llama/llama3/blob/main/llama/model.p...</a>
If you `import jax.numpy as np`, you also get a JAX implementation after a few modifications: e.g. removing in-place index assignment, replacing unsupported functions, etc.
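For anyone curious, here is a minimal sketch of the most common change (the cache update is a hypothetical example, not code from the repo): NumPy-style in-place index assignment has to become JAX's functional `.at[...].set(...)` update, since JAX arrays are immutable.

```python
import jax.numpy as np  # drop-in replacement for "import numpy as np"

# NumPy version mutates the buffer in place:
#   cache[:, pos] = new_values
# JAX arrays are immutable, so the update returns a new array instead:
def update_cache(cache, pos, new_values):
    return cache.at[:, pos].set(new_values)

cache = np.zeros((2, 8, 4))  # hypothetical (batch, seq_len, dim) shapes
cache = update_cache(cache, 3, np.ones((2, 4)))
```

Once the in-place updates are rewritten this way, the rest of the array code usually runs unchanged under jax.numpy.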
According to the TinyStories dataset card [1], the dataset was generated by GPT-3.5 and GPT-4. Judging by the discussions in the community tab [2], there are a lot of incomplete or misspelled words, incorrect grammar, and even Chinese characters in the dataset.<p>As such, I'd be wary of using that dataset to train or evaluate models.<p>[1] <a href="https://huggingface.co/datasets/roneneldan/TinyStories" rel="nofollow">https://huggingface.co/datasets/roneneldan/TinyStories</a><p>[2] <a href="https://huggingface.co/datasets/roneneldan/TinyStories/discussions" rel="nofollow">https://huggingface.co/datasets/roneneldan/TinyStories/discu...</a>
We changed the URL from <a href="https://github.com/likejazz/llama3.np">https://github.com/likejazz/llama3.np</a> to the article it points to, which gives more background.
How does this differ from the llama.np repository credited in the README? <a href="https://github.com/hscspring/llama.np">https://github.com/hscspring/llama.np</a>
The rotary embeddings bit is neat. I wonder if a complex representation would simplify vs complexify things (readability, performance, expressive power).
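For what it's worth, I believe Meta's reference implementation (linked above) already uses the complex form, via torch.polar and torch.view_as_complex. A standalone NumPy sketch of that formulation (my own illustration, not code from either repo) would look something like:

```python
import numpy as np

def rope_complex(x, base=10000.0):
    # x: (seq_len, dim) with dim even; each pair (x[2i], x[2i+1]) is
    # treated as one complex number and rotated by a position-dependent
    # unit phasor exp(i * pos * theta_i), theta_i = base^(-2i/dim).
    seq_len, dim = x.shape
    freqs = 1.0 / base ** (np.arange(0, dim, 2) / dim)   # (dim/2,)
    angles = np.outer(np.arange(seq_len), freqs)         # (seq_len, dim/2)
    phasors = np.exp(1j * angles)                        # unit rotations
    xc = x.reshape(seq_len, dim // 2, 2)
    xc = xc[..., 0] + 1j * xc[..., 1]                    # view pairs as complex
    rotated = xc * phasors                               # the whole rotation
    return np.stack([rotated.real, rotated.imag], axis=-1).reshape(seq_len, dim)

out = rope_complex(np.random.randn(16, 64))
```

The complex form makes the rotation a single elementwise multiply, which arguably reads better; the equivalent real-valued version avoids complex dtypes, which some backends handle less efficiently.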