A really important nuance here is that they are building on top of Llama-2, the pretrained model, and not Llama-2-chat.

I really think the entire field is doing more damage with chat fine tuning than might be expected, because that chat instruction regularly includes an emphasis on the model identifying itself as an LLM.

The problem with this is that nearly all of the training data it performs next-token prediction on is text generated by humans.

So most of the fine tuning I've seen inherently narrows the model's scope. While pretrained models are harder to use, I regularly prefer them over chat models when both are available: even at similar temperatures, the quality and variety of language is much better in the pretrained model than in the chat model.

This fine tuning only introduced a bias towards logical, step-by-step analysis and problem-solving techniques, and the results are great. But I'm willing to bet that an identical fine tuning on top of the chat model would have been much worse on the evaluations - not just the compounding of a typical fine-tuning loss of a few percent, but more like a double-digit relative difference.

It's quite frustrating that anxiety over model safety is likely throwing away tens of millions of dollars' worth of data in the pretrained model when only chat models are available for the SotA. I hope that in the future a lighter touch is taken with fine tuning the pretrained model, and that instead of building safety into the model itself, it is placed behind a safety-oriented discriminator or 'editor' which filters or modifies responses accordingly (rough sketch at the end of this comment).

I'd happily take a 2-3x increase in API cost for a much more broadly capable and performant model with similar safety characteristics but without the handicaps that come with safety tuning.

So while a lot of the gains here might be due to the fine tuning, I expect at least part comes from shrugging off the baggage of the chat/safety fine tuning as well. Even in the first detailed example, we can see that while Llama-2 goes off rambling later on, its statement of what John knows is much clearer and better connected between initial conditions and result than Llama-2-chat's, particularly regarding theory of mind (i.e. "he assumed" vs the latter's "it must be in").
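
To make the discriminator/'editor' idea concrete, here is a minimal sketch, assuming the Hugging Face transformers pipeline API; the model names, the "toxic" label, and the threshold are placeholders for illustration, and the exact output fields depend on whichever classifier is actually used:

    from transformers import pipeline

    # Illustrative choices, not a prescription: any pretrained base LM and
    # any off-the-shelf safety/toxicity classifier could stand in here.
    generator = pipeline("text-generation", model="meta-llama/Llama-2-7b-hf")
    safety = pipeline("text-classification", model="unitary/toxic-bert")

    def guarded_generate(prompt, threshold=0.5):
        # Let the unaligned pretrained model answer with its full breadth.
        draft = generator(prompt, max_new_tokens=256, do_sample=True)[0]["generated_text"]
        # Score the draft with a separate discriminator; crude character-level
        # truncation here just to keep the example short.
        verdict = safety(draft[:512])[0]
        if verdict["label"] == "toxic" and verdict["score"] > threshold:
            # Filter (or hand off to an 'editor' model to rewrite) instead of
            # baking refusals into the generator's weights.
            return "[withheld by safety filter]"
        return draft

The point isn't this particular classifier or threshold - it's that the base model's breadth is preserved and safety is enforced as a separate, swappable stage after generation.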