Like other comments, I was also initially surprised. But I think the gains are both real and easy to understand where the improvements are coming from.<p>Under the hood Reflection 70B seems to be a Llama-3.1 finetune that encourages the model to add <think>, <reflection> and <output> tokens and corresponding phases. This is an evolution of Chain-of-Thought's "think step by step" -- but instead of being a prompting technique, this fine-tune bakes examples of these phases more directly into the model. So the model starts with an initial draft and 'reflects' on it before issuing a final output.<p>The extra effort spent on tokens, which effectively let the model 'think more' appears to let it defeat prompts which other strong models (4o, 3.5 Sonnet) appear to fumble. So for example, when asked "which is greater 9.11 or 9.9" the Reflection 70b model initially gets the wrong answer, then <reflects> on it, then spits the right output.<p>Personally, the comparison to Claude and 4o doesn't quite seem apples-to-apples. If you were to have 4o/Claude take multiple rounds to review and reflect on their initial drafts, would we see similar gains? I suspect they would improve massively as well.
Interesting idea!<p>You can somewhat recreate the essence of this using a system prompt with any sufficiently sized model. Here's the prompt I tried for anybody who's interested:<p><pre><code> You are an AI assistant designed to provide detailed, step-by-step responses. Your outputs should follow this structure:
1. Begin with a <thinking> section. Everything in this section is invisible to the user.
2. Inside the thinking section:
a. Briefly analyze the question and outline your approach.
b. Present a clear plan of steps to solve the problem.
c. Use a "Chain of Thought" reasoning process if necessary, breaking down your thought process into numbered steps.
3. Include a <reflection> section for each idea where you:
a. Review your reasoning.
b. Check for potential errors or oversights.
c. Confirm or adjust your conclusion if necessary.
4. Be sure to close all reflection sections.
5. Close the thinking section with </thinking>.
6. Provide your final answer in an <output> section.
Always use these tags in your responses. Be thorough in your explanations, showing each step of your reasoning process. Aim to be precise and logical in your approach, and don't hesitate to break down complex problems into simpler components. Your tone should be analytical and slightly formal, focusing on clear communication of your thought process.
Remember: Both <thinking> and <reflection> MUST be tags and must be closed at their conclusion
Make sure all <tags> are on separate lines with no other text. Do not include other text on a line containing a tag.</code></pre>
If this does indeed beat all the closed source models, then I'm flabbergasted. The amount of time and resources Google, OpenAI, and Anthropic have put into improving the models to only be beaten in a couple weeks by two people (who as far as I know do not have PhDs and years of research experience) would be a pretty crazy feat.<p>That said, I'm withholding judgment on how likely the claims are. A friend who developed NoCha [1] is running the model on that benchmark, which will really stress test its ability to reason over full novels. I'll reserve judgement until then.<p>[1]: <a href="https://novelchallenge.github.io/" rel="nofollow">https://novelchallenge.github.io/</a>
We need results from these harder/different benchmarks which give pretty bad scores to current top LLMs.<p><a href="https://www.wolfram.com/llm-benchmarking-project/" rel="nofollow">https://www.wolfram.com/llm-benchmarking-project/</a><p><a href="https://help.kagi.com/kagi/ai/llm-benchmark.html" rel="nofollow">https://help.kagi.com/kagi/ai/llm-benchmark.html</a><p>Edit : There are few other benchmarks that give pretty low scores (<20%) to top LLMs. Can't find them atm. There was a benchmark with common sense easy looking questions.<p>Edit: found two more papers<p><a href="https://arxiv.org/html/2405.19616" rel="nofollow">https://arxiv.org/html/2405.19616</a><p><a href="https://arxiv.org/html/2406.02061v1" rel="nofollow">https://arxiv.org/html/2406.02061v1</a><p>Edit: How about Wordle?<p><a href="https://www.strangeloopcanon.com/p/what-can-llms-never-do" rel="nofollow">https://www.strangeloopcanon.com/p/what-can-llms-never-do</a><p><a href="https://news.ycombinator.com/item?id=40179232">https://news.ycombinator.com/item?id=40179232</a>
To anyone coming into this thread late, this LLM announcement was most likely a scam. See this more recent thread: <a href="https://news.ycombinator.com/item?id=41484981">https://news.ycombinator.com/item?id=41484981</a>
I'm surprised this does so well in benchmarks, given the intuition I'm getting about its behavior from quick testing.<p>I gave it a medium-complexity design problem: Design the typescript interface for the state of a react app that manages a tree of chat turns/responses and displays the current path through the tree. (In other words, the kind of state that sits logically behind the ChatGPT or Claude Web UI, where previous conversation turns can be edited and used as a branching off point for new turns.)<p>Reflection-70B suffered from a bad initial idea, just as Llama 70B generally does (proposing to duplicate state between the "tree of all messages" and the "path to currently displayed message"), which is a very common error. The automated reflection process identified a whole bunch of nitpicks but missed the glaring logical bug. Furthermore the final output was missing many of the details included in the initial reflection / chain-of-thought scratchpad, even though the UI hides the scratchpad as though it's unimportant for the user to read.
Worth mentioning that LlaMa 70b already had pretty high benchmark scores to begin with
<a href="https://ai.meta.com/blog/meta-llama-3-1/" rel="nofollow">https://ai.meta.com/blog/meta-llama-3-1/</a><p>Still impressive that it can beat top models with fine-tuning, but now I’m mostly impressed by the fact that the 70b model was so good to begin with.
Just tried this out for coding. I asked it to download weather data for Dublin into a Pandas Dataframe and write it to Hopsworks. Worked as good as GPT-4o - code ran correctly. The playground is fast. Impressed!
At the risk of sounding like a stuck LLM, it's under the Llama licence, which isn't an open source licence because of the restrictions on fields of endeavour.
Crazy how simple the technique is if this holds up. Just <think> and <reflection> plus synthetic data, used to finetune Llama 3.1 70B.<p>Note that there's a threshold for how smart the model has to be to take advantage of this flow (<a href="https://x.com/mattshumer_/status/1831775436420083753" rel="nofollow">https://x.com/mattshumer_/status/1831775436420083753</a>) - 8B is too dumb.<p>In which case, what happens if you apply this to a GPT-4o finetune, or to Claude 3.5 Sonnet?<p>What happens if you combine it with variants of tree-based reasoning? With AlphaProof (<a href="https://www.nature.com/articles/s41586-023-06747-5#Sec3" rel="nofollow">https://www.nature.com/articles/s41586-023-06747-5#Sec3</a>)? With MCTSr (<a href="https://arxiv.org/abs/2406.07394" rel="nofollow">https://arxiv.org/abs/2406.07394</a>)?
Seems to really fall apart on subsequent prompts, and a few times I've had code end up in the "thinking" tokens.<p>I'm guessing most of the training data was single-turn, instead of multi-turn, but that should be relatively easy to iterate on.
Quick update here: the model in question is apparently an attempt at an attention grab, there are open questions as to whether it is a llama 3 fine-tune, a llama 3.1 fine-tune, or a series of api calls redirecting to claude 3.5 sonnet, with a find and replace of Claude for Llama
You can try this hugging face assistant that uses Llama 3.1 70b and system prompt engineering to simulate Reflection 70b's thinking and reflection process.<p><a href="https://hf.co/chat/assistant/66db391075ff4595ec2652b7" rel="nofollow">https://hf.co/chat/assistant/66db391075ff4595ec2652b7</a>
Wonder why no Llama-3.1-8B based variant if the new training method has such good results.
UPDATE: didn't work well <a href="https://x.com/mattshumer_/status/1831775436420083753?t=flm41D8Ru9Zld2bjsmvs0A" rel="nofollow">https://x.com/mattshumer_/status/1831775436420083753?t=flm41...</a>
Can we please stop allowing links to Twitter? Rationale: the artificial limitations on that site around post size mean that most announcements (such as this one) are multiple posts. This, combined with the questionable design decision of hiding all reply tweets when a user is not logged in, means that many posts are completely missing crucial context for those of us who don’t have Twitter accounts.<p>Alternatively, Twitter links could be rewritten to redirect to one of the few Nitter instances that are still functional.
This make me think we should be introducing 'tokens required to answer questions correctly' dimension to each metric. Since letting the model think more verbosely is essentially giving it more compute and memory to answer the question correctly.
(not that this is a bad thing, but I would be curious if other models get the answer correctly with the first couple of tokens, or after hundreds of reasoning)
Unfortunately the model is broken at present, It looks like they're working on a fix - <a href="https://huggingface.co/mattshumer/Reflection-70B/discussions/6" rel="nofollow">https://huggingface.co/mattshumer/Reflection-70B/discussions...</a>