"We don’t actually want them to do math for the sake of replacing calculators" - couldn't the article just end here? People aren't giving it multiplication problems or asking it to count letters because they want to know the answer. Given that you can "patch" the issue by having it invoke python for computation, the real value is in seeing whether current models can learn to follow a simple step-by-step procedure.<p>The linked tweet <a href="https://twitter.com/yuntiandeng/status/1836114401213989366" rel="nofollow">https://twitter.com/yuntiandeng/status/1836114401213989366</a> is far more interesting to me, gpt models clearly _can_ learn to multiply with intermediate tokens, but even o1 currently doesn't. And yet this would be a case where generating synthetic data is almost trivial. And moreover, being able to perform computations in this fashion would be valuable for many types of benchmarks (e.g. FrontierMath, since I'm sure at the end of the day you'll have to grind through some computation).<p>So why hasn't it been a priority? I remember some NeurIPS presentation claiming that heavily training on math in this fashion hurt language scores. But then the follow-up would be to have specialized models for each and route between them...