TechEcho

Can the New Mathstral LLM Accurately Compare 9.11 and 9.9?

23 points by 3Sophons 10 months ago

13 comments

vessenes 10 months ago
Wait wait wait… the json output is incorrect, full stop. It claims the first decimal digit of 9.9 is ‘0’. Mathstral might be great; it might be terrible; either way this particular test should be done first at 0 temp and then like 50 or 100 times at 0.7 temp, but in any event the writer owes it to themselves (and us) to notice that the claimed ‘good’ output is totally incorrect.
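The protocol suggested above (one run at temperature 0, then many runs at 0.7) could be sketched roughly as follows. This is only an illustration: `tally_answers` and `fake_model` are hypothetical names, and `fake_model` is a random stand-in, not a call to Mathstral or any real API. In practice you would swap in whatever client function talks to your local model.

```python
import random
from collections import Counter
from typing import Callable


def tally_answers(ask: Callable[[str, float], str], prompt: str,
                  temperature: float, runs: int) -> Counter:
    """Ask the same prompt `runs` times and count each distinct answer."""
    return Counter(ask(prompt, temperature) for _ in range(runs))


def fake_model(prompt: str, temperature: float) -> str:
    """Hypothetical stand-in for a real LLM call (assumption, not Mathstral)."""
    if temperature == 0.0:
        return "9.9"  # greedy decoding is deterministic
    # Sampling at a higher temperature may occasionally flip the answer.
    return random.choice(["9.9", "9.9", "9.11"])


prompt = "Which is greater, 9.11 or 9.9?"
greedy = tally_answers(fake_model, prompt, temperature=0.0, runs=1)
sampled = tally_answers(fake_model, prompt, temperature=0.7, runs=100)

print("temp 0.0:", dict(greedy))
print("temp 0.7:", dict(sampled))
```

Reporting the full tally, rather than a single screenshot, shows how often the model gets the comparison right instead of whether it got lucky once.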
xanderlewis 10 months ago
> As we have seen, leading edge LLMs, such as the GPT-4o, can solve very complex math problems.

No… they can’t. That’s like saying a search engine can solve math problems, which it can, in a sense.

I suspect that the people repeatedly saying this simply lack the knowledge to know what really constitutes a ‘complex math problem’.

And of course any half-decent new model can answer this particular question correctly; the designers aren’t stupid or unaware of what the expectations and common traps are. The model itself will probably be able to talk about why testing on such comparisons would be interesting (because it ‘knows’ about this being a recent meme).
lucabetelci 10 months ago
In the JSON response (after "And the response is the following.") it says that "(...) Since 1 (from 9.11) is greater than 0 (implicitly, as there's no second digit in 9.9), we can conclude that:\n\n$$9.11 > 9.9$$ (...)"
CharlesW 10 months ago
This was all over Threads last week, posted by anti-AI people who don't know how LLMs work. These are the same people who post screenshots of LLMs attempting to count the number of 'r's in "strawberry".

> "The 7B mathstral model answers the math common sense question perfectly with the correct reasoning."

Answers perfectly, sure. But the word "reasoning" is anthropomorphism and promises a level of cognitive ability that LLMs do not possess.
g-w1 10 months ago
I'm quite confused. In the article, the response from mathstral is also wrong???
TZubiri 10 months ago
Neither answer is wrong; it depends on the type of the object, which the notation doesn't specify:

Version 9.11 is greater than version 9.9

Decimal 9.9 is greater than decimal 9.11
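The two readings can be made concrete with a short sketch. It assumes Python, using `decimal.Decimal` for the numeric reading and a tuple of integer components for the version reading; the helper name `as_version` is hypothetical and only handles simple dotted integers.

```python
from decimal import Decimal


def as_version(text: str) -> tuple[int, ...]:
    """Split a dotted string into integer components, e.g. "9.11" -> (9, 11)."""
    return tuple(int(part) for part in text.split("."))


a, b = "9.11", "9.9"

# Read as decimal numbers: 9.11 < 9.90, so 9.9 is the greater one.
decimal_greater = max(a, b, key=Decimal)

# Read as version numbers: minor component 11 > 9, so 9.11 is the greater one.
version_greater = max(a, b, key=as_version)

print("decimal:", decimal_greater)
print("version:", version_greater)
```

The same pair of strings orders differently depending only on the `key` function, which is the ambiguity a prompt without context leaves open.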
arnaudsm 10 months ago
Naive question: will scaling laws be sufficient for reliable reasoning, or are transformer architectures incapable of that?
meisel 10 months ago
> The case in point is that most LLMs, including GPT-4o, cannot tell whether 9.11 or 9.8 is bigger!

Wrong. GPT-4o gives me the correct answer to this question, 9.8.
rahduro 10 months ago
It might have something to do with quantization, though. I used the Q6_K version from https://huggingface.co/bartowski/mathstral-7B-v0.1-GGUF with Llamafile, and it always says 9.11 is bigger than 9.9.
hdhshdhshdjd 10 months ago
This is like a 200x more complicated setup than just running Ollama.
einarfd 10 months ago
I found this interesting and tried the question with the top models from Anthropic, OpenAI, Google and Mistral, which all gave the wrong result. But if you preface the question with "Of these two decimal numbers ", the answers change and the results are correct. I suspect what we are seeing is that the models handle the numbers as version numbers, not decimal numbers. This is disappointing and confusing, but IMO it also underlines that giving them context on what you are trying to get them to do is worthwhile.
bee_rider 10 months ago
Which is the correct answer?

(Note that the logic in the response from the LLM is blatantly nonsense).
3Sophons 10 months ago
AI cannot handle basic math like deciding which is greater between 9.11 and 9.9? A popular meme has sparked debates about LLMs' grasp of elementary math. Introducing mathstral, Mistral AI's latest open-source model, fine-tuned specifically for mathematical reasoning and scientific discovery. I just ran a series of tests to determine whether mathstral can truly discern the larger of two decimal numbers in a way that makes sense to us humans. Using LlamaEdge's Rust + Wasm tech stack, I set up mathstral on my local machine, with no complex installations needed! The results? Absolutely fascinating and promising for the future of AI in education and beyond. Want to see how it performed and possibly set it up yourself? Check out this detailed, easy-to-follow walkthrough