Nice idea. Essentially, making the best-of-n choice differentiable lets them encourage the model to produce some diversity "naturally" (rough sketch of how I picture it at the end of this comment). The Gemma 2B results suggest it's probably worth trying this on larger models.<p>That said, I'm unclear how much this helps in practice; we don't usually sift through, say, 32 responses from our 2B-parameter models. I suppose if you instrumented parallel reasoning processes in batch, this might be helpful. Perhaps that's what o1-pro is doing in the background, actually.<p>Anyway, this one seems to me like it might make its way onto the "good idea" list once RL is available in the training pipeline.
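<p>For concreteness, here's a minimal PyTorch sketch of how I picture the relaxation; the function name and the softmax-over-rewards surrogate are my own guesses at the mechanism, not necessarily the paper's exact formulation:

    import torch

    # Hard best-of-n picks argmax over reward scores, which has zero
    # gradient. A softmax over the same scores is a differentiable
    # "choice" you can fold into the training objective.
    def soft_best_of_n(log_probs, rewards, tau=1.0):
        # log_probs: (n,) policy log-likelihoods of n sampled responses
        # rewards:   (n,) reward-model scores, treated as constants here
        weights = torch.softmax(rewards / tau, dim=0)
        return (weights.detach() * log_probs).sum()  # REINFORCE-style surrogate

    # Toy usage with 4 candidate responses:
    log_probs = torch.randn(4, requires_grad=True)
    rewards = torch.tensor([0.1, 2.0, 0.5, 1.2])
    loss = -soft_best_of_n(log_probs, rewards, tau=0.5)
    loss.backward()  # credit spreads over good candidates, not just the argmax

<p>As tau -> 0 this collapses to hard best-of-n; a larger tau spreads gradient across several candidates, which is presumably where the diversity pressure comes from.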