Note that this still doesn't seem as good at formal theorem proving as some of the specialized prover models that aren't LLM-based. In particular, they show a 3 percentage point increase in proofs solved over COPRA on the MiniF2F Lean dataset, but in COPRA's own paper they prove about 18.5 percentage points fewer theorems than Proverbot9001 on the CompCert dataset (their pitch is that they're faster, not better at proving, than other tools). Assuming COPRA isn't just way better at Lean than Coq (there should be some difference, but since it's the same algorithm run over a uniform interface to Coq and Lean, it's unlikely to be large), chaining those numbers (18.5 minus 3, with some slack for cross-benchmark noise) puts Llemma at 10-15 percentage points fewer proofs solved than Proverbot9001's algorithm. There's some trickiness to this kind of cross-benchmark comparison: COPRA picks a *bit* of a weird dataset to compare to Proverbot9001 on to save costs, but they also compare against a slightly older version of Proverbot9001, which has worse results, so those effects probably roughly cancel. It likely still washes out to the best LLM-based provers being about 10-15 percentage points worse at proving theorems than the best specialized models.

EDIT: To be clear, that's 10-15% of the *total* theorems in a test set, not a relative 10-15% improvement. Given that solve rates for these tools are in the 10-30% range, that's much more than a relative 10-15% improvement.
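
A minimal sketch of the arithmetic behind that distinction (the 20% baseline solve rate below is a hypothetical placeholder, not a figure from any of these papers):

    # Absolute percentage points vs. relative improvement.
    # The solve rates here are hypothetical placeholders, not numbers
    # from the Llemma, COPRA, or Proverbot9001 papers.

    def relative_improvement(baseline: float, improved: float) -> float:
        """Relative gain of `improved` over `baseline`, as a fraction."""
        return (improved - baseline) / baseline

    baseline = 0.20                      # baseline prover solves 20% of theorems

    plus_12_points = baseline + 0.12     # 32% solve rate (absolute gain)
    plus_12_percent = baseline * 1.12    # 22.4% solve rate (relative gain)

    print(f"{relative_improvement(baseline, plus_12_points):.0%}")   # -> 60%
    print(f"{relative_improvement(baseline, plus_12_percent):.0%}")  # -> 12%

So a 12-point absolute gain on a 20% baseline is a 60% relative improvement, five times the size of a relative 12% gain.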