In various fields, we often encounter Sturgeon's Law, the adage that "90% of everything is crud."

How much of this holds true for datasets?

With the proliferation of LLMs, are we seeing an overwhelming amount of low-quality, irrelevant datasets?

Curious what HN thinks about that.