Note that this still doesn't seem as good at formal theorem proving as some of the specialized prover models that aren't LLM-based. In particular, they show a 3% increase in proofs solved over COPRA on the MiniF2F Lean dataset, but in COPRA's own paper they show that they prove about 18.5% fewer theorems than Proverbot9001 on the CompCert dataset (their pitch is that they are faster, not better at proving, than other tools). Assuming COPRA isn't just way better at Lean than Coq (there should be some amount of difference, but given that it's the same algorithm run on a uniform interface over Coq and Lean, it's unlikely to be large), Llemma should be proving 10-15% fewer theorems than Proverbot9001's algorithm. There's a bit of trickiness to this kind of cross-benchmark comparison, and COPRA picks a *bit* of a weird dataset to compare to Proverbot9001 on to save costs, but they also use a slightly older version, which has worse results. So it probably still washes out to the best LLM-based provers being about 10-15% worse at proving theorems than the best specialized models.

EDIT: To be clear, that's 10-15% of the *total* theorems in a test set, not a relative 10-15% difference. Given that solve rates for these tools are in the 10-30% range, that's much more than a relative 10-15% gap.
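A rough sketch of the arithmetic, assuming (shakily) that absolute percentage points of theorems solved transfer across the two benchmarks:

    # Back-of-envelope only; the numbers come from different papers on
    # different benchmarks, so treating them as comparable is the weak link.
    copra_minus_proverbot = -18.5  # COPRA vs Proverbot9001 (CompCert, COPRA's paper)
    llemma_minus_copra = 3.0       # Llemma vs COPRA (MiniF2F Lean, this paper)
    print(copra_minus_proverbot + llemma_minus_copra)  # ~ -15.5 points vs Proverbot9001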
WizardMath[1] clearly has better benchmark results, and they conveniently ignored it. They knew about the model because they used their dataset.

This model scores 51.5% on GSM8K, compared to 81.6% for WizardMath.

[1]: https://huggingface.co/WizardLM/WizardMath-70B-V1.0
I wonder if I could use this to explore mathematics. I have been attempting to use existing LLMs, but results vary widely. I could imagine getting intuitions and insights about an equation, where the responses don't have to be watertight algebraic manipulations, but rather just deeply informed descriptions from several perspectives.
>> Before training LLEMMA 7B, we contract the RoPE (Su et al., 2022) base period of the Code Llama 7B initialization from θ = 1,000,000 to θ = 10,000. This is so that the long context finetuning procedure described in Peng et al. (2023) and Rozière et al. (2023) can be repeated on the trained LLEMMA 7B (we leave actually doing so to future work).

This doesn't explain the reason for the RoPE contraction in the first place. Can anyone elaborate on what the reason might be?
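For anyone wondering what the base period actually controls, here's a minimal sketch in Python (my own illustration, not the paper's code) of the standard RoPE frequency schedule. A larger θ pushes the lowest rotation frequencies lower (longer positional wavelengths, which is why Code Llama ships with 1,000,000 for long contexts); contracting to θ = 10,000 just restores the original Llama value. My guess is they want the standard base back so the long-context re-scaling of Peng et al. / Rozière et al. can later be applied as published, but the paper doesn't spell that out.

    def rope_freqs(head_dim, theta):
        # Standard RoPE rotation frequencies: theta ** (-2*i / head_dim) for
        # each dimension pair i. Larger theta => lower minimum frequency =>
        # longer positional wavelengths.
        return [theta ** (-2 * i / head_dim) for i in range(head_dim // 2)]

    print(rope_freqs(128, 1_000_000)[-1])  # Code Llama initialization: lowest freq ~1e-6
    print(rope_freqs(128, 10_000)[-1])     # contracted Llemma 7B value:  ~1e-4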
This seems to be about as open as a proprietary game engine: no model license is mentioned in the paper, the code, or the model card. I wonder whether their incorrect and misleading use of "open" will be caught during peer review.

Maybe the authors also believe that Windows is open. Its source code is also publicly available, after all.
Can we stop using catchy marketing names for things that are *supposed to be* scientific results?

Choosing "llemma" (similar to Meta's "Llama") is just clout-seeking.