A really important nuance here is that they are building on top of Llama-2, the pretrained model, and not Llama-2-chat.

I really think the entire field is doing more damage with chat fine tuning than might be expected, because that chat instruction regularly includes an emphasis on the model identifying itself as an LLM.

The problem with this is that nearly all of the training data it performs next-token prediction on is text generated by humans.

So most of the fine tuning I've seen inherently narrows the model's scope. While pretrained models are harder to use, I regularly prefer them over chat models when both are available: even at similar temperatures, the quality and variety of language is much better in the pretrained model than in the chat model.

This fine tuning only introduced a bias towards logical, step-by-step analysis and problem-solving techniques, and the results are great. But I'm willing to bet that an identical fine tuning on top of the chat model would have been much worse on the evaluations - not just the compounding of a typical fine-tuning loss of a few percent, but more like a double-digit relative difference.

It's quite frustrating that anxiety over model safety is likely throwing away tens of millions of dollars' worth of data in the pretrained model when only chat models are available for the SotA. I hope that in the future a lighter touch is taken with fine tuning the pretrained model, and that instead of building safety into the model itself, it is placed behind a safety-oriented discriminator or 'editor' which filters or modifies responses accordingly (rough sketch at the end of this comment).

I'd happily take a 2-3x increase in API cost for a much more broadly capable and performant model with similar safety characteristics but without the handicaps that come with safety tuning.

So while a lot of the gains here might be due to the fine tuning, I expect at least part comes from shrugging off the baggage of the chat/safety fine tuning as well. Even in the first detailed example, we can see that while Llama-2 goes off rambling later on, its statement of what John knows is much clearer and better connected between initial conditions and result than Llama-2-chat's, particularly regarding theory of mind (i.e. "he assumed" vs the latter's "it must be in").
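
To make the discriminator/'editor' idea concrete, here is a minimal sketch, assuming the Hugging Face transformers pipeline API; the model names, the "toxic" label, and the threshold are placeholders for illustration, and the exact output fields depend on whichever classifier is actually used:

    from transformers import pipeline

    # Illustrative choices, not a prescription: any pretrained base LM and
    # any off-the-shelf safety/toxicity classifier could stand in here.
    generator = pipeline("text-generation", model="meta-llama/Llama-2-7b-hf")
    safety = pipeline("text-classification", model="unitary/toxic-bert")

    def guarded_generate(prompt, threshold=0.5):
        # Let the unaligned pretrained model answer with its full breadth.
        draft = generator(prompt, max_new_tokens=256, do_sample=True)[0]["generated_text"]
        # Score the draft with a separate discriminator; crude character-level
        # truncation here just to keep the example short.
        verdict = safety(draft[:512])[0]
        if verdict["label"] == "toxic" and verdict["score"] > threshold:
            # Filter (or hand off to an 'editor' model to rewrite) instead of
            # baking refusals into the generator's weights.
            return "[withheld by safety filter]"
        return draft

The point isn't this particular classifier or threshold - it's that the base model's breadth is preserved and safety is enforced as a separate, swappable stage after generation.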