One of the benefits of using thinking tokens compared to “thinking in a latent space” is that you can directly observe the quality of the CoT. In R1 they saw the CoT was mixing languages and fixed it with cold-start data.<p>It would be hard to SFT this architecture because you can only SFT the final output, not the latent space.<p>I also notice the authors only had compute for a single full training run. It’s impressive they saw such good results from that, but I wonder if they could get better results by incorporating recent efficiency improvements.<p>I would personally not use this architecture because 1) it adds a lot of hyperparameters which don’t have a strong theoretical grounding and 2) it’s not clearly better than simpler methods.
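To make the SFT point concrete, here is a toy sketch (the shapes and names are made up for illustration, not taken from the paper): a supervised loss needs labeled targets, and those only exist for the visible output tokens, not for the intermediate latent states.

    import torch
    import torch.nn.functional as F

    # Hypothetical shapes, for illustration only.
    latent_states = torch.randn(8, 16, 512)   # 8 recurrent iterations of hidden state: no token-level labels exist here
    final_logits  = torch.randn(16, 32000)    # logits for the visible output tokens
    target_tokens = torch.randint(0, 32000, (16,))

    # SFT can only attach a loss where labeled text exists, i.e. the final output...
    sft_loss = F.cross_entropy(final_logits, target_tokens)
    # ...whereas there is no analogous supervised target for latent_states, so a defect
    # in the latent "reasoning" (the analogue of R1's language mixing in a visible CoT)
    # can't be patched with cold-start SFT data in the same way.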
One of the hoped-for benefits of this approach is described later in the paper. It’s not fully fleshed out what this will mean, but the prospect is tantalizing.<p>"On a more philosophical note, we hope that latent reasoning captures facets of human reasoning that defy verbalization, such as spatial thinking, physical intuition or (motor) planning. Over many iterations of the recurrent process, reasoning in a high-dimensional vector space would enable the deep exploration of multiple directions simultaneously, instead of linear thinking, leading to a system capable of exhibiting novel and complex reasoning behavior."
My opinion is that opaque reasoning is a prerequisite for many of the worst possible AI outcomes.<p>We should make reasoning fully visible in the output space.
Slightly off topic: I rarely see papers talk about their failed training runs and why those runs failed. This paper is definitely a breath of fresh air. And their analyses of the failures, the changes they made to fix them, and the rationale behind those changes are all very insightful.
Latent / embedding-space reasoning seems like a step in the right direction, but building recurrence into the model while still relying on gradient descent (i.e. BPTT) to train it seems to create more of a problem (training inefficiency) than it solves, especially since they still end up externally specifying the number of recurrent iterations (r=4, 8, etc.) for a given inference. Ideally, having recurrence internal to the model would allow the model itself to decide how long to iterate before outputting anything.
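For anyone who hasn't read the paper, the training-cost issue looks roughly like the sketch below. This is a minimal PyTorch-style illustration, not the authors' code: the layer choices, the zero-initialized latent state, and the truncated-BPTT depth bptt_k are my assumptions. A shared core block is applied r times and gradients flow only through the last few iterations.

    import torch
    import torch.nn as nn

    class RecurrentDepthSketch(nn.Module):
        # Toy illustration: ignores causal masking and KV caching for brevity.
        def __init__(self, d_model=256, vocab_size=32000):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, d_model)   # "prelude": tokens -> latent space
            self.core = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)  # shared recurrent block
            self.head = nn.Linear(d_model, vocab_size)       # "coda": latent state -> logits

        def forward(self, tokens, r=8, bptt_k=4):
            x = self.embed(tokens)
            s = torch.zeros_like(x)                          # latent state refined over r iterations
            for i in range(r):
                if i < r - bptt_k:
                    # truncated BPTT: early iterations contribute no gradients
                    with torch.no_grad():
                        s = self.core(s + x)
                else:
                    s = self.core(s + x)
            return self.head(s)

    # r is fixed from outside at inference time; the model never decides when to stop.
    logits = RecurrentDepthSketch()(torch.randint(0, 32000, (1, 16)), r=8)

Even with the truncation, the tracked iterations multiply per-step activation memory and compute, and r stays an externally chosen knob rather than something the model learns to set for itself.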
Twitter thread about this by the author: <a href="https://x.com/jonasgeiping/status/1888985929727037514" rel="nofollow">https://x.com/jonasgeiping/status/1888985929727037514</a>
Interesting stuff. As the authors note, using latent reasoning seems to be a way to sink more compute into the model and get better performance without increasing the model size; good news for those on a steady diet of 'scale pills'.