According to the TinyStories dataset card [1], the dataset was generated by GPT-3.5 and GPT-4. Judging by the discussions in the community tab [2], it looks like it contains a lot of incomplete or misspelled words, incorrect grammar, and even Chinese characters.<p>As such, I'd be wary of using that dataset to train or evaluate models.<p>[1] <a href="https://huggingface.co/datasets/roneneldan/TinyStories" rel="nofollow">https://huggingface.co/datasets/roneneldan/TinyStories</a><p>[2] <a href="https://huggingface.co/datasets/roneneldan/TinyStories/discussions" rel="nofollow">https://huggingface.co/datasets/roneneldan/TinyStories/discu...</a>