Can the New Mathstral LLM Accurately Compare 9.11 and 9.9?

23 points by 3Sophons, 10 months ago

13 comments

vessenes, 10 months ago
Wait wait wait… the json output is incorrect, full stop. It claims the first decimal digit of 9.9 is ‘0’. Mathstral might be great; it might be terrible; either way this particular test should be done first at 0 temp and then like 50 or 100 times at 0.7 temp, but in any event the writer owes it to themselves (and us) to notice that the claimed ‘good’ output is totally incorrect.
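
A minimal sketch of the test protocol described above: one run at temperature 0, then repeated runs at temperature 0.7, with the answers tallied. The `ask` callable, its `(prompt, temperature)` signature, and the `fake_ask` stub are assumptions made for illustration; they do not correspond to any particular model API.

```python
from collections import Counter
from typing import Callable

# Hypothetical signature: (prompt, temperature) -> the model's answer as a string.
AskFn = Callable[[str, float], str]


def tally_answers(ask: AskFn, prompt: str, temperature: float, runs: int) -> Counter:
    """Call the model `runs` times and count each distinct (stripped) answer."""
    return Counter(ask(prompt, temperature).strip() for _ in range(runs))


def run_protocol(ask: AskFn, prompt: str, sampled_runs: int = 100) -> dict:
    """One run at temperature 0, then `sampled_runs` runs at temperature 0.7."""
    return {
        "greedy": tally_answers(ask, prompt, 0.0, 1),
        "sampled": tally_answers(ask, prompt, 0.7, sampled_runs),
    }


if __name__ == "__main__":
    # Stand-in for a real model so the sketch runs offline; it always answers "9.9".
    def fake_ask(prompt: str, temperature: float) -> str:
        return "9.9"

    results = run_protocol(fake_ask, "Which is larger, 9.11 or 9.9?")
    print(results["greedy"])
    print(results["sampled"])
```

Swapping `fake_ask` for a function that calls a real model would show how often the sampled answers disagree with the temperature-0 answer.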
xanderlewis, 10 months ago
> As we have seen, leading edge LLMs, such as the GPT-4o, can solve very complex math problems.

No… they can’t. That’s like saying a search engine can solve math problems, which it can, in a sense.

I suspect that the people repeatedly saying this simply lack the knowledge to know what really constitutes a ‘complex math problem’.

And of course any half-decent new model can answer this particular question correctly; the designers aren’t stupid or unaware of what the expectations and common traps are. The model *itself* will probably be able to talk about why testing on such comparisons would be interesting (because it ‘knows’ that this is a recent meme).
lucabetelci, 10 months ago
In the JSON response (after "And the response is the following.") it says that "(...) Since 1 (from 9.11) is greater than 0 (implicitly, as there's no second digit in 9.9), we can conclude that:\n\n$$9.11 > 9.9$$ (...)"
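
For contrast with the quoted reasoning, here is a minimal sketch of a place-by-place comparison that pads the shorter fractional part with trailing zeros and starts at the tenths digit. The function name is hypothetical, and the sketch assumes both inputs are non-negative decimal strings with a non-empty integer part.

```python
def compare_decimal_strings(a: str, b: str) -> int:
    """Compare two non-negative decimal strings digit by digit.

    Returns 1 if a > b, -1 if a < b, and 0 if they are equal.
    Assumes each string has a non-empty integer part (e.g. "9.9", not ".9").
    """
    a_int, _, a_frac = a.partition(".")
    b_int, _, b_frac = b.partition(".")

    # Integer parts compare numerically.
    if int(a_int) != int(b_int):
        return 1 if int(a_int) > int(b_int) else -1

    # Pad the shorter fractional part with trailing zeros: "9" becomes "90".
    width = max(len(a_frac), len(b_frac))
    a_frac = a_frac.ljust(width, "0")
    b_frac = b_frac.ljust(width, "0")

    # Compare place by place, starting at the tenths digit.
    for da, db in zip(a_frac, b_frac):
        if da != db:
            return 1 if da > db else -1
    return 0


print(compare_decimal_strings("9.11", "9.9"))  # -1: tenths digit 1 < 9
```

The comparison is decided at the tenths place (1 versus 9), so the hundredths digit of 9.11 never comes into play.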
CharlesW, 10 months ago
This was all over Threads last week, posted by anti-AI people who don't know how LLMs work. These are the same people who post screenshots of LLMs attempting to count the number of 'r's in "strawberry".

> *"The 7B mathstral model answers the math common sense question perfectly with the correct reasoning."*

Answers perfectly, sure. But the word "reasoning" is anthropomorphism and promises a level of cognitive ability that LLMs do not possess.
g-w1, 10 months ago
I'm quite confused. In the article, the response from mathstral is also wrong???
TZubiri, 10 months ago
None of them is wrong, the answer depends on the type of the object, which the notation doesn't specify:

Version 9.11 is greater than 9.9

Decimal 9.9 is greater than 9.11
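
The two readings can be sketched in a few lines. This is an illustration only: the helper names are hypothetical, and the version comparison assumes plain dot-separated integer components rather than any particular versioning standard.

```python
from decimal import Decimal


def larger_as_decimal(a: str, b: str) -> str:
    """Return whichever string is larger when both are read as decimal numbers."""
    return a if Decimal(a) > Decimal(b) else b


def larger_as_version(a: str, b: str) -> str:
    """Return whichever string is larger when both are read as dotted versions."""
    def key(s: str) -> tuple:
        # "9.11" -> (9, 11); components compare as integers, left to right.
        return tuple(int(part) for part in s.split("."))

    return a if key(a) > key(b) else b


print(larger_as_decimal("9.11", "9.9"))  # 9.9
print(larger_as_version("9.11", "9.9"))  # 9.11
```

Under the decimal reading 9.9 equals 9.90, which exceeds 9.11; under the version reading the second component 11 exceeds 9.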
arnaudsm, 10 months ago
Naive question: will scaling laws be sufficient for reliable reasoning, or are transformer architectures incapable of that?
meisel, 10 months ago
> The case in point is that most LLMs, including GPT-4o, cannot tell whether 9.11 or 9.8 is bigger!

Wrong. GPT-4o gives me the correct answer to this question, 9.8.
rahduro, 10 months ago
It might have something to do with quantization though. I have used the Q6_K version from https://huggingface.co/bartowski/mathstral-7B-v0.1-GGUF with Llamafile. It always shows 9.11 is bigger than 9.9.
hdhshdhshdjd, 10 months ago
This is like a 200x more complicated setup than just running Ollama.
einarfd, 10 months ago
I found this interesting and tried the question with the top models from Anthropic, OpenAI, Google and Mistral, all of which gave the wrong result. But if you preface the question with "Of these two decimal numbers ", the answers change and the results are correct. I suspect what we are seeing is that the models handle the numbers as version numbers, not decimal numbers. This is disappointing and confusing, but IMO it also underlines that giving them context about what you are trying to get them to do is worthwhile.
bee_rider, 10 months ago
Which is the correct answer?

(Note that the logic in the response from the LLM is blatantly nonsense).
3Sophons, 10 months ago
AI cannot handle basic math like deciding which is greater between 9.11 and 9.9? A popular meme sparks debates about LLMs' grasp of elementary math. Introducing mathstral, Mistral AI's latest open-source model, fine-tuned specifically for mathematical reasoning and scientific discovery. I just ran a series of tests to determine if mathstral can truly discern the larger of two decimal numbers in a way that makes sense to us humans. Using LlamaEdge's Rust + Wasm tech stack, I set up mathstral on my local machine, with no complex installations needed! The results? Absolutely fascinating and promising for the future of AI in education and beyond. Want to see how it performed and possibly set it up yourself? Check out this detailed easy-to-follow walkthrough