Why does everybody (including this a16z dude) underestimate or not even mention:

1. Quality of input data. For language models that are currently set up to be force-fed any incoming data instead of really trained (see 2), this is the greatest gain you can get for your money. Models can't distinguish truth from nonsense; they're forced to auto-complete whatever is in the training data, however stupid or sane it is.

2. Evaluation of input data by the model itself. During training, the model would judge what is nonsense and what makes sense and is worth learning, based on the knowledge gathered so far, while dealing with the biases this introduces.

Current training methods put things like first-order logic on the same footing as any kind of nonsense; the only defense logic has is quantity, not quality.

But many widely repeated things are plainly wrong. To simplify the thought: if they weren't, there would be no further progress for humankind. We constantly re-examine assumptions and come up with new theories while leaving solid axioms untouched. Why not teach this approach to LLMs, or hardcode it into them?

Those two aspects look like problems with large potential gains, yet nobody seems to be discussing them.

Align training toward common sense and the model's own good judgment, not unconditional alignment with the input data (a rough sketch of self-evaluation as a data filter is at the end of this comment).

If fine-tuning works, why not start training from first principles: a dictionary, logic, base theories like sets and categories, an encyclopedia of facts (omitting historical facts, which are irrelevant at this stage), and so on, taking snapshots at each stage so others can fork their own training trees. Maybe even stop calling fine-tuning "fine-tuning" and just call these learning stages. Let researchers play with paths on those trees and evaluate them to find something more optimal, find the optimal network size for each step, allow models to gradually grow in size, and so on (a staged-training sketch is at the end).

To rephrase it a bit: we say that base models trained on large data work well when fine-tuned. Why not test whether base models trained on first principles, and then continued on concepts that recursively depend on those first principles, are efficient? Did anybody try?

As a concrete example: you want an LLM to be good at math? Tokenize digits, teach it base-10 arithmetic, addition, subtraction, multiplication, division, exponentiation, all the known basic operations and functions, then grow from there (an arithmetic-curriculum sketch is at the end).

You want it to do good code completion? Teach it BNF, parsing, ASTs, interpreting, then code examples with simple output, then more complex code (GitHub stuff).

Training an LLM should start by teaching a tiny model ASCII, numbers, and basic operations on them, then slowly introducing words instead of symbols ("is" instead of "="), then basic phrases, basic sentences, basic grammar, and so on. Everything in the software 2.0 way: just throw in examples with expected output and do back-propagation/gradient descent on them.

Training also has to have a way of gradually growing the model size in an (ideally) optimal way (a function-preserving width-growth sketch is at the end).
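
Here is a minimal Python sketch of point 2 under one possible interpretation: let the current model score candidate training text with its own loss and drop what it finds wildly implausible. It assumes a Hugging Face-style causal LM; the "gpt2" checkpoint and the 2x-average-loss cutoff are placeholders, not recommendations.

    # Sketch: let the current model score candidate training text and
    # down-weight what it finds implausible, instead of force-feeding everything.
    import torch
    from transformers import AutoTokenizer, AutoModelForCausalLM

    tok = AutoTokenizer.from_pretrained("gpt2")        # placeholder checkpoint
    model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

    def example_loss(text: str) -> float:
        """Average next-token loss of the current model on one candidate example."""
        ids = tok(text, return_tensors="pt").input_ids
        with torch.no_grad():
            return model(ids, labels=ids).loss.item()

    def self_filter(candidates: list[str], slack: float = 2.0) -> list[tuple[str, float]]:
        """Keep examples whose loss is within `slack` times the batch average.

        Caveat: a high loss can mean nonsense *or* genuinely new knowledge, so a
        real system would need a better signal than raw perplexity (consistency
        checks against already-learned facts, ensembles, etc.).
        """
        scored = [(text, example_loss(text)) for text in candidates]
        avg = sum(loss for _, loss in scored) / len(scored)
        return [(text, loss) for text, loss in scored if loss <= slack * avg]

    kept = self_filter([
        "Two plus two equals four.",
        "Two plus two equals five, as every textbook agrees.",
    ])

The obvious caveat is in the docstring: high loss can mean nonsense or genuinely new knowledge, so the real research problem is finding a better self-evaluation signal than raw perplexity.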
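
For the "learning stages with forkable snapshots" idea, a toy sketch in plain PyTorch; the stage names, the optimizer choice, and the deliberately simplified training loop are all illustrative, not a claim about how anyone trains today:

    # Sketch: train through an ordered curriculum and snapshot the weights after
    # every stage so anyone can fork the training tree from that point.
    import torch
    from torch import nn
    from torch.utils.data import DataLoader

    def train_stage(model, optimizer, loader, loss_fn, epochs=1):
        model.train()
        for _ in range(epochs):
            for inputs, targets in loader:
                optimizer.zero_grad()
                loss = loss_fn(model(inputs), targets)
                loss.backward()
                optimizer.step()

    def run_curriculum(model, stages, lr=1e-3):
        """stages: ordered list of (name, dataset); each stage ends in a forkable snapshot."""
        optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
        loss_fn = nn.CrossEntropyLoss()
        for name, dataset in stages:
            loader = DataLoader(dataset, batch_size=32, shuffle=True)
            train_stage(model, optimizer, loader, loss_fn)
            torch.save(model.state_dict(), f"stage-{name}.pt")  # fork point

    # Hypothetical stage order following the comment: symbols first, then
    # operations on them, then language built on top of both.
    # run_curriculum(model, [("digits", digits_ds), ("arithmetic", arith_ds),
    #                        ("dictionary", dict_ds), ("logic", logic_ds)])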
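
For the math example specifically, generating the staged arithmetic data is straightforward; the prompt format and the difficulty ladder below are just one illustrative choice:

    # Sketch: (prompt, answer) pairs for base-10 arithmetic, ordered from
    # single-digit addition up to multi-digit mixed operations, so each stage
    # only needs what the previous one taught.
    import random

    OPS = {
        "+": lambda a, b: a + b,
        "-": lambda a, b: a - b,
        "*": lambda a, b: a * b,
    }

    def make_examples(n, digits, ops):
        """n examples with operands of up to `digits` digits, using the given ops."""
        hi = 10 ** digits - 1
        out = []
        for _ in range(n):
            a, b = random.randint(0, hi), random.randint(0, hi)
            op = random.choice(ops)
            out.append((f"{a} {op} {b} =", str(OPS[op](a, b))))
        return out

    # A possible difficulty ladder: the model only sees a harder stage after
    # the easier ones (single-digit addition before multi-digit multiplication).
    curriculum = [
        make_examples(10_000, digits=1, ops=["+"]),
        make_examples(10_000, digits=1, ops=["+", "-"]),
        make_examples(10_000, digits=3, ops=["+", "-"]),
        make_examples(10_000, digits=3, ops=["+", "-", "*"]),
    ]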
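
Finally, "gradually growing the model" already has a known ingredient: function-preserving widening in the style of Net2Net, where new hidden units copy existing ones and the outgoing weights are split so the outputs don't change. Below is a toy two-layer version of that idea, not a recipe for a full transformer:

    # Sketch: widen a hidden layer without changing the function it computes.
    import torch
    from torch import nn

    def widen(fc_in: nn.Linear, fc_out: nn.Linear, new_width: int):
        """Return widened copies of (fc_in, fc_out) computing the same function."""
        old_width = fc_in.out_features
        assert new_width > old_width
        # Each new unit replicates a randomly chosen existing unit.
        mapping = torch.cat([torch.arange(old_width),
                             torch.randint(0, old_width, (new_width - old_width,))])
        counts = torch.bincount(mapping, minlength=old_width).float()

        new_in = nn.Linear(fc_in.in_features, new_width)
        new_out = nn.Linear(new_width, fc_out.out_features)
        with torch.no_grad():
            new_in.weight.copy_(fc_in.weight[mapping])
            new_in.bias.copy_(fc_in.bias[mapping])
            # Split each replicated unit's outgoing weight by its replica count
            # so the sum over replicas matches the original contribution.
            new_out.weight.copy_(fc_out.weight[:, mapping] / counts[mapping])
            new_out.bias.copy_(fc_out.bias)
        return new_in, new_out

    # Example: grow the hidden layer from 16 to 32 units mid-training.
    fc1, fc2 = nn.Linear(8, 16), nn.Linear(16, 4)
    x = torch.randn(5, 8)
    wide1, wide2 = widen(fc1, fc2, 32)
    assert torch.allclose(fc2(torch.relu(fc1(x))),
                          wide2(torch.relu(wide1(x))), atol=1e-5)

When to grow, and by how much, is exactly the open "optimal schedule" question the comment is asking about.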