The sample solution for Balkan MO 2023 seems ... questionable?<p>The problem has players removing stones in turn and asks which one wins under perfect play. The listed solution definitely doesn’t cover all possible types of strategies.<p>The answer it gives may be right; in fact I bet it is correct (the second player). But does the Qwen team present the solution as correct, logic included? And is the solution's logic correct?
FYI the model is almost 150GB: <a href="https://huggingface.co/Qwen/Qwen2-Math-72B-Instruct/tree/main" rel="nofollow">https://huggingface.co/Qwen/Qwen2-Math-72B-Instruct/tree/mai...</a>
The first solution (IMO 2002) is completely wrong. It shows that 1, 2, or 3 cubes are not sufficient, and provides an obstacle that <i>doesn't rule out</i> 4 cubes, but it does not prove that there actually are 4 cubes that sum to the given number. This is much harder (and I don't know the true answer).
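A minimal sketch of the gap described above, assuming the problem is the one asking for the fewest cubes that sum to 2002^2002 (the number mentioned elsewhere in this thread). The first part reproduces the mod 9 obstacle; the second part checks one explicit four-cube decomposition, built from the identity 2002 = 10^3 + 10^3 + 1^3 + 1^3 and the fact that 2002^2001 is itself a cube. The names `reachable`, `b`, and `four_cubes` are illustrative, not taken from the model's output.

```python
from itertools import product

# Cubes modulo 9 only take the values 0, 1 and 8.
CUBE_RESIDUES = sorted({(x ** 3) % 9 for x in range(9)})

def reachable(k):
    """Residues mod 9 that a sum of k integer cubes can take."""
    return sorted({sum(c) % 9 for c in product(CUBE_RESIDUES, repeat=k)})

target = pow(2002, 2002, 9)  # residue of 2002^2002 mod 9

# The obstacle: the target residue is unreachable with 1, 2 or 3 cubes.
# This alone does not show that 4 cubes actually work.
for k in (1, 2, 3, 4):
    print(k, reachable(k), target in reachable(k))

# The missing existence step, checked with exact integers:
# 2002^2002 = 2002 * (2002^667)^3 and 2002 = 10^3 + 10^3 + 1^3 + 1^3.
b = 2002 ** 667
four_cubes = [10 * b, 10 * b, b, b]
print(sum(c ** 3 for c in four_cubes) == 2002 ** 2002)
```

The residue check only rules things out; it is the explicit decomposition that shows four cubes are enough.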
These solutions aren't perfect, but imagine how many more people can become mathematicians now that the price of an elite IMO medal winning tutor can be quantified as Hugging Face hosting costs!
I see that they do some decontamination of the datasets, in the hope that the models won't just recite answers from the training data. But in the recent interview with Subbarao Kambhampati on MLST (<a href="https://www.youtube.com/watch?v=y1WnHpedi2A" rel="nofollow">https://www.youtube.com/watch?v=y1WnHpedi2A</a>) they explain that models fail as soon as one slightly rephrases the test problems (indicating that they are indeed mostly reciting). I expect this to be the case with this model too.
It is obvious that all of these problems are still way too hard, although the model sometimes has ideas. It flawlessly demonstrates how to simplify (2002^2002) mod 9. I recall that there was once a scandalous university exam for future math teachers in Germany that asked for tasks like that, and everyone failed. With Qwen-2 at hand this might not have happened.
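For reference, the simplification of (2002^2002) mod 9 can be sketched as below. This is the standard by-hand reduction, not necessarily the exact steps the model printed: reduce the base mod 9, find the multiplicative order of the reduced base, reduce the exponent modulo that order, and compare against Python's built-in three-argument `pow`. The variable names are illustrative.

```python
# Step 1: reduce the base modulo 9.
base = 2002 % 9  # 4, which is coprime to 9

# Step 2: multiplicative order of the reduced base modulo 9,
# i.e. the smallest k >= 1 with base^k congruent to 1.
order = next(k for k in range(1, 10) if pow(base, k, 9) == 1)

# Step 3: reduce the exponent modulo that order.
reduced_exp = 2002 % order

# Step 4: the small power that remains.
by_hand = pow(base, reduced_exp, 9)

# Cross-check with fast modular exponentiation on the original numbers.
direct = pow(2002, 2002, 9)

print(base, order, reduced_exp, by_hand, direct)
```

Both routes give the residue 4.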
The solution for IMO 2022 is barely a 1/7 solution. It just says ‘might not satisfy the inequality for all y’ without a proof, and proving that was the point of the question.