Is it just me, or does anyone else cringe when they read "is/is not all you need" in the title of an AI-related paper?

Also, what does "SOTA" mean for a review? There isn't exactly a benchmark to compare against...

In terms of comprehensiveness, they don't mention PaLM and its variants, which probably should be covered since it is currently the largest LLM and holds SOTA on several benchmarks (e.g. MedQA-USMLE).

In terms of correctness, I admittedly skipped to the sections I'm familiar with (LLMs), but I don't understand why they distinguish 'text-science' from 'text-text'. Both are text-to-text, and there is no reason why you can't, for example, adapt GPT-3.5 to a scientific domain (some people even argue this is the better approach). Many powerful language models in the biomedical domain were initialized from general language models and use out-of-domain tokenizers/vocabularies (e.g. BioBERT).

The authors also make this statement regarding Galactica:

"The main advantage of [Galactica] is the ability to train on it for multiple epochs without overfitting"

This is not a unique feature of Galactica and has been done before. You're allowed to train LLMs for more than one epoch, and in fact it can be very beneficial (see BioBERT as an example of increasing training length).

People GENERALLY don't do this because the corpus used during self-supervised training is filled with garbage/noise, so the model starts to fit to that instead of what you actually want it to learn. There is nothing special about Galactica's architecture that specifically allows/encourages longer training runs; rather, the authors curated the dataset to minimize garbage. As another example, my research involves radiology NLP, and when doing domain-adaptive pretraining on a highly curated dataset we have been going up to 8 epochs without overfitting.
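For anyone curious what that looks like in practice, here's a minimal sketch of domain-adaptive (continued) pretraining for multiple epochs with HuggingFace Transformers. The checkpoint name, corpus path, and hyperparameters are illustrative, not what we actually run:

    # Continued MLM pretraining of a general-domain checkpoint on a curated
    # in-domain corpus, run for several epochs. Paths/hyperparameters are illustrative.
    from datasets import load_dataset
    from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                              DataCollatorForLanguageModeling, Trainer, TrainingArguments)

    base = "bert-base-uncased"          # any general-domain checkpoint as the starting point
    tokenizer = AutoTokenizer.from_pretrained(base)
    model = AutoModelForMaskedLM.from_pretrained(base)

    # Curated, low-noise in-domain text (one document per line)
    corpus = load_dataset("text", data_files={"train": "curated_corpus.txt"})["train"]
    corpus = corpus.map(lambda b: tokenizer(b["text"], truncation=True, max_length=512),
                        batched=True, remove_columns=["text"])

    trainer = Trainer(
        model=model,
        args=TrainingArguments(
            output_dir="dapt",
            num_train_epochs=8,         # multiple passes are fine when the corpus is clean
            per_device_train_batch_size=16,
            learning_rate=5e-5,
            save_strategy="epoch",
        ),
        train_dataset=corpus,
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15),
    )
    trainer.train()

In practice you'd also watch loss on a held-out in-domain split to catch overfitting; the point is simply that nothing architectural changes, only the data quality and the number of passes.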