Reflection 70B, the top open-source model

234 pointsby GavCo8 months ago

23 comments

Like other comments, I was also initially surprised. But I think the gains are both real and easy to understand where the improvements are coming from.Under the hood Reflection 70B seems to be a Llama-3.1 finetune that encourages the model to add <think>, <reflection> and <output> tokens and corresponding phases. This is an evolution of Chain-of-Thought's "think step by step" -- but instead of being a prompting technique, this fine-tune bakes examples of these phases more directly into the model. So the model starts with an initial draft and 'reflects' on it before issuing a final output.The extra effort spent on tokens, which effectively let the model 'think more' appears to let it defeat prompts which other strong models (4o, 3.5 Sonnet) appear to fumble. So for example, when asked "which is greater 9.11 or 9.9" the Reflection 70b model initially gets the wrong answer, then <reflects> on it, then spits the right output.Personally, the comparison to Claude and 4o doesn't quite seem apples-to-apples. If you were to have 4o/Claude take multiple rounds to review and reflect on their initial drafts, would we see similar gains? I suspect they would improve massively as well.

评论 #41461784 未加载

评论 #41460449 未加载

评论 #41462458 未加载

评论 #41462752 未加载

评论 #41461351 未加载

rwl48 months ago

Interesting idea!You can somewhat recreate the essence of this using a system prompt with any sufficiently sized model. Here's the prompt I tried for anybody who's interested:<pre><code> You are an AI assistant designed to provide detailed, step-by-step responses. Your outputs should follow this structure: 1. Begin with a <thinking> section. Everything in this section is invisible to the user. 2. Inside the thinking section: a. Briefly analyze the question and outline your approach. b. Present a clear plan of steps to solve the problem. c. Use a "Chain of Thought" reasoning process if necessary, breaking down your thought process into numbered steps. 3. Include a <reflection> section for each idea where you: a. Review your reasoning. b. Check for potential errors or oversights. c. Confirm or adjust your conclusion if necessary. 4. Be sure to close all reflection sections. 5. Close the thinking section with </thinking>. 6. Provide your final answer in an <output> section. Always use these tags in your responses. Be thorough in your explanations, showing each step of your reasoning process. Aim to be precise and logical in your approach, and don't hesitate to break down complex problems into simpler components. Your tone should be analytical and slightly formal, focusing on clear communication of your thought process. Remember: Both <thinking> and <reflection> MUST be tags and must be closed at their conclusion Make sure all <tags> are on separate lines with no other text. Do not include other text on a line containing a tag.</code></pre>

评论 #41461470 未加载

评论 #41461517 未加载

评论 #41461383 未加载

评论 #41461705 未加载

评论 #41473280 未加载

评论 #41470901 未加载

评论 #41460963 未加载

nsagent8 months ago

If this does indeed beat all the closed source models, then I'm flabbergasted. The amount of time and resources Google, OpenAI, and Anthropic have put into improving the models to only be beaten in a couple weeks by two people (who as far as I know do not have PhDs and years of research experience) would be a pretty crazy feat.That said, I'm withholding judgment on how likely the claims are. A friend who developed NoCha [1] is running the model on that benchmark, which will really stress test its ability to reason over full novels. I'll reserve judgement until then.[1]: <a href="https://novelchallenge.github.io/" rel="nofollow">https://novelchallenge.github.io/</a>

评论 #41460454 未加载

评论 #41460355 未加载

评论 #41460533 未加载

评论 #41463633 未加载

评论 #41460952 未加载

smusamashah8 months ago

We need results from these harder/different benchmarks which give pretty bad scores to current top LLMs.<a href="https://www.wolfram.com/llm-benchmarking-project/" rel="nofollow">https://www.wolfram.com/llm-benchmarking-project/</a><a href="https://help.kagi.com/kagi/ai/llm-benchmark.html" rel="nofollow">https://help.kagi.com/kagi/ai/llm-benchmark.html</a>Edit : There are few other benchmarks that give pretty low scores (<20%) to top LLMs. Can't find them atm. There was a benchmark with common sense easy looking questions.Edit: found two more papers<a href="https://arxiv.org/html/2405.19616" rel="nofollow">https://arxiv.org/html/2405.19616</a><a href="https://arxiv.org/html/2406.02061v1" rel="nofollow">https://arxiv.org/html/2406.02061v1</a>Edit: How about Wordle?<a href="https://www.strangeloopcanon.com/p/what-can-llms-never-do" rel="nofollow">https://www.strangeloopcanon.com/p/what-can-llms-never-do</a><a href="https://news.ycombinator.com/item?id=40179232">https://news.ycombinator.com/item?id=40179232</a>

评论 #41461582 未加载

评论 #41462671 未加载

评论 #41460817 未加载

评论 #41460816 未加载

sebastiennight8 months ago

To anyone coming into this thread late, this LLM announcement was most likely a scam. See this more recent thread: <a href="https://news.ycombinator.com/item?id=41484981">https://news.ycombinator.com/item?id=41484981</a>

JoshMandel8 months ago

I'm surprised this does so well in benchmarks, given the intuition I'm getting about its behavior from quick testing.I gave it a medium-complexity design problem: Design the typescript interface for the state of a react app that manages a tree of chat turns/responses and displays the current path through the tree. (In other words, the kind of state that sits logically behind the ChatGPT or Claude Web UI, where previous conversation turns can be edited and used as a branching off point for new turns.)Reflection-70B suffered from a bad initial idea, just as Llama 70B generally does (proposing to duplicate state between the "tree of all messages" and the "path to currently displayed message"), which is a very common error. The automated reflection process identified a whole bunch of nitpicks but missed the glaring logical bug. Furthermore the final output was missing many of the details included in the initial reflection / chain-of-thought scratchpad, even though the UI hides the scratchpad as though it's unimportant for the user to read.

GavCo8 months ago

Hugging Face: <a href="https://huggingface.co/mattshumer/Reflection-70B" rel="nofollow">https://huggingface.co/mattshumer/Reflection-70B</a>Playground: <a href="https://reflection-playground-production.up.railway.app/" rel="nofollow">https://reflection-playground-production.up.railway.app/</a>

Bjorkbat8 months ago

Worth mentioning that LlaMa 70b already had pretty high benchmark scores to begin with <a href="https://ai.meta.com/blog/meta-llama-3-1/" rel="nofollow">https://ai.meta.com/blog/meta-llama-3-1/</a>Still impressive that it can beat top models with fine-tuning, but now I’m mostly impressed by the fact that the 70b model was so good to begin with.

jamesblonde8 months ago

Just tried this out for coding. I asked it to download weather data for Dublin into a Pandas Dataframe and write it to Hopsworks. Worked as good as GPT-4o - code ran correctly. The playground is fast. Impressed!

RobotToaster8 months ago

At the risk of sounding like a stuck LLM, it's under the Llama licence, which isn't an open source licence because of the restrictions on fields of endeavour.

xianshou8 months ago

Crazy how simple the technique is if this holds up. Just <think> and <reflection> plus synthetic data, used to finetune Llama 3.1 70B.Note that there's a threshold for how smart the model has to be to take advantage of this flow (<a href="https://x.com/mattshumer_/status/1831775436420083753" rel="nofollow">https://x.com/mattshumer_/status/1831775436420083753</a>) - 8B is too dumb.In which case, what happens if you apply this to a GPT-4o finetune, or to Claude 3.5 Sonnet?What happens if you combine it with variants of tree-based reasoning? With AlphaProof (<a href="https://www.nature.com/articles/s41586-023-06747-5#Sec3" rel="nofollow">https://www.nature.com/articles/s41586-023-06747-5#Sec3</a>)? With MCTSr (<a href="https://arxiv.org/abs/2406.07394" rel="nofollow">https://arxiv.org/abs/2406.07394</a>)?

评论 #41460557 未加载

winddude8 months ago

Seems to really fall apart on subsequent prompts, and a few times I've had code end up in the "thinking" tokens.I'm guessing most of the training data was single-turn, instead of multi-turn, but that should be relatively easy to iterate on.

htrp8 months ago

Quick update here: the model in question is apparently an attempt at an attention grab, there are open questions as to whether it is a llama 3 fine-tune, a llama 3.1 fine-tune, or a series of api calls redirecting to claude 3.5 sonnet, with a find and replace of Claude for Llama

louay_tn8 months ago

You can try this hugging face assistant that uses Llama 3.1 70b and system prompt engineering to simulate Reflection 70b's thinking and reflection process.<a href="https://hf.co/chat/assistant/66db391075ff4595ec2652b7" rel="nofollow">https://hf.co/chat/assistant/66db391075ff4595ec2652b7</a>

imjonse8 months ago

Wonder why no Llama-3.1-8B based variant if the new training method has such good results. UPDATE: didn't work well <a href="https://x.com/mattshumer_/status/1831775436420083753?t=flm41D8Ru9Zld2bjsmvs0A" rel="nofollow">https://x.com/mattshumer_/status/1831775436420083753?t=flm41...</a>

评论 #41460353 未加载

评论 #41460347 未加载

评论 #41460596 未加载

angoragoats8 months ago

Can we please stop allowing links to Twitter? Rationale: the artificial limitations on that site around post size mean that most announcements (such as this one) are multiple posts. This, combined with the questionable design decision of hiding all reply tweets when a user is not logged in, means that many posts are completely missing crucial context for those of us who don’t have Twitter accounts.Alternatively, Twitter links could be rewritten to redirect to one of the few Nitter instances that are still functional.

评论 #41460716 未加载

评论 #41460710 未加载

评论 #41460615 未加载

nhmllms8 months ago

This make me think we should be introducing 'tokens required to answer questions correctly' dimension to each metric. Since letting the model think more verbosely is essentially giving it more compute and memory to answer the question correctly. (not that this is a bad thing, but I would be curious if other models get the answer correctly with the first couple of tokens, or after hundreds of reasoning)

smcleod8 months ago

Unfortunately the model is broken at present, It looks like they're working on a fix - <a href="https://huggingface.co/mattshumer/Reflection-70B/discussions/6" rel="nofollow">https://huggingface.co/mattshumer/Reflection-70B/discussions...</a>

anshumankmr8 months ago

So is reflection tuning a scam or something worth exploring?

d_sc8 months ago

Any way to have this work in LM Studio? Not showing up in search results.

评论 #41460977 未加载

rspoerri8 months ago

i hope the quantized version doesnt loose to much of it's quality.

spencerchubb8 months ago

I wonder how good it is with multi-turn conversations

jph008 months ago

(removed)

评论 #41461413 未加载

评论 #41461433 未加载

评论 #41461424 未加载