Let me add a few:<p>- organic data exhaustion - we need to step up synthetic data generation and validate what it produces<p>- imbalanced datasets - catalog what we have, assess the coverage, and fill in the missing data<p>- backtracking - make LLMs better at combinatorial and search problems<p>- deduction - we need to augment the training set with text that makes implicit knowledge explicit; in other words, study the text before learning from it<p>- defragmentation - information comes in small chunks, sits in separate silos, and context windows are short, so we need retrieval to bring it together for analysis<p>tl;dr: we need quantity, diversity, and depth in our training sets
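<p>To make the "catalog, assess and fill in" point concrete, here is a minimal sketch, assuming each training example already carries a topic tag. The function name `find_gaps`, the `min_count` threshold, and the sample topics are all made up for illustration; a real pipeline would need a proper taxonomy and a classifier to produce the tags.

```python
from collections import Counter


def find_gaps(topics, min_count):
    """Catalog topic counts and report how many examples each
    under-represented topic still needs to reach min_count.

    Both the tagging scheme and the threshold are hypothetical.
    """
    counts = Counter(topics)  # catalog: how many examples per topic
    # assess: keep only topics below the threshold, with their shortfall
    return {t: min_count - n for t, n in sorted(counts.items()) if n < min_count}


# Toy corpus: one topic tag per training example.
topics = ["code"] * 6 + ["math"] * 3 + ["law"] * 1
gaps = find_gaps(topics, 4)
print(gaps)  # {'law': 3, 'math': 1}
```

The shortfall per topic is what you would then hand to the "fill in" step, whether that is targeted collection or synthetic generation.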