Wait wait wait… the JSON output is incorrect, full stop. It claims the first decimal digit of 9.9 is ‘0’. Mathstral might be great; it might be terrible; either way, this particular test should be run first at temperature 0 and then something like 50 or 100 times at temperature 0.7. But in any event, the writer owes it to themselves (and us) to notice that the claimed ‘good’ output is totally incorrect.
> As we have seen, leading edge LLMs, such as the GPT-4o, can solve very complex math problems.<p>No… they can’t. That’s like saying a search engine can solve math problems — which it can, in a sense.<p>I suspect that the people repeatedly saying this simply lack the knowledge to know what really constitutes a ‘complex math problem’.<p>And of course any half-decent new model can answer this particular question correctly; the designers aren’t stupid or unaware of what the expectations and common traps are. The model <i>itself</i> could probably even explain why testing on such comparisons would be interesting (because it ‘knows’ that this is a recent meme).
In the JSON response (after "And the response is the following.") it says that "(...) Since 1 (from 9.11) is greater than 0 (implicitly, as there's no second digit in 9.9), we can conclude that:\n\n$$9.11 > 9.9$$ (...)"
This was all over Threads last week, posted by anti-AI people who don't know how LLMs work. These are the same people who post screenshots of LLMs attempting to count the number of 'r's in "strawberry".<p>> <i>"The 7B mathstral model answers the math common sense question perfectly with the correct reasoning."</i><p>Answers perfectly, sure. But the word "reasoning" is anthropomorphism and promises a level of cognitive ability that LLMs do not possess.
Neither answer is wrong; which one is correct depends on the type of the object, which the notation doesn't specify:<p>Version 9.11 is greater than 9.9<p>Decimal 9.9 is greater than 9.11
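The ambiguity is easy to demonstrate in code. A minimal Python sketch (my own illustration, not from the article): comparing the same two strings as decimals versus as dot-separated version components gives opposite orderings.

```python
from decimal import Decimal

# As decimal numbers, 9.9 > 9.11 (0.90 vs 0.11 in the fractional part).
assert Decimal("9.9") > Decimal("9.11")
assert 9.9 > 9.11  # same result with floats

def version_tuple(v):
    """Split a dot-separated version string into integer components,
    e.g. '9.11' -> (9, 11). Tuples compare component by component."""
    return tuple(int(part) for part in v.split("."))

# As version numbers, 9.11 > 9.9 because the component 11 > 9.
assert version_tuple("9.11") > version_tuple("9.9")
```

Both assertions pass, which is exactly the point: the notation alone doesn't pin down which comparison is meant.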
> The case in point is that most LLMs, including GPT-4o, cannot tell whether 9.11 or 9.8 is bigger!<p>Wrong. GPT-4o gives me the correct answer to this question, 9.8.
It might have something to do with quantization, though. I used the Q6_K version from <a href="https://huggingface.co/bartowski/mathstral-7B-v0.1-GGUF" rel="nofollow">https://huggingface.co/bartowski/mathstral-7B-v0.1-GGUF</a> with Llamafile, and it always says 9.11 is bigger than 9.9.
I found this interesting and tried the question with the top models from Anthropic, OpenAI, Google, and Mistral.
All of them gave the wrong result. But if you preface the question with "Of these two decimal numbers ", the answers changed and the results were correct.
I suspect what we are seeing is that the models handle the numbers as version numbers, not as decimal numbers.
This is disappointing and confusing, but IMO it also underlines that giving the models context about what you're trying to get them to do is worthwhile.
AI can't handle basic math, like deciding which is greater, 9.11 or 9.9? A popular meme has sparked debates about LLMs' grasp of elementary math.
Introducing mathstral, Mistral AI's latest open-source model, fine-tuned specifically for mathematical reasoning and scientific discovery. I just ran a series of tests to determine whether mathstral can truly discern the larger of two decimal numbers in a way that makes sense to us humans.
Using LlamaEdge's Rust + Wasm tech stack, I set up mathstral on my local machine—no complex installations needed! The results? Absolutely fascinating and promising for the future of AI in education and beyond.
Want to see how it performed, and possibly set it up yourself? Check out this detailed, easy-to-follow walkthrough.