"Prompting is not enough, test-time fine-tuning is needed<p>Clearly this competition has shown that LLMs need test-time fine-tuning to do new tasks. Few-shot prompting is not enough for the model to learn novel tasks."<p>pretty interesting. However, only a pretty small model was used Qwen2.5-0.5B-Instruct. Bigger models were not available because of competition runtime constraints.