Note that this still doesn't seem as good at formal theorem proving as some of the specialized prover models that aren't LLM-based. In particular, they show a 3% increase in proofs solved over COPRA on the MiniF2F Lean dataset, but in COPRA's own paper they show that they prove about 18.5% fewer theorems than Proverbot9001 on the CompCert dataset (their pitch is that they are faster, not better at proving, than other tools). Assuming COPRA isn't just way better at Lean than Coq (there should be some amount of difference, but given that it's the same algorithm run on a uniform interface over Coq and Lean, it's unlikely to be large), Llemma should be proving 10-15% fewer theorems than Proverbot9001's algorithm. There's a bit of trickiness to this kind of cross-benchmark comparison, and COPRA picks a *bit* of a weird dataset to compare to Proverbot9001 on to save costs, but they also use a slightly older version, which has worse results. So it probably still washes out to the best LLM-based provers being about 10-15% worse at proving theorems than the best specialized models.

EDIT: To be clear, that's 10-15% of the *total* theorems in a test set, not a relative 10-15% difference. Given that solve rates for these tools are in the 10-30% range, that's much more than a relative 10-15% gap.
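A rough sketch of the arithmetic, assuming (shakily) that absolute percentage points of theorems solved transfer across the two benchmarks:

    # Back-of-envelope only; the numbers come from different papers on
    # different benchmarks, so treating them as comparable is the weak link.
    copra_minus_proverbot = -18.5  # COPRA vs Proverbot9001 (CompCert, COPRA's paper)
    llemma_minus_copra = 3.0       # Llemma vs COPRA (MiniF2F Lean, this paper)
    print(copra_minus_proverbot + llemma_minus_copra)  # ~ -15.5 points vs Proverbot9001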
WizardMath[1] clearly has better benchmark results, and they conveniently ignored it. They knew about the model because they used their dataset.

This model scores 51.5% on GSM8K, compared to 81.6% for WizardMath.

[1]: https://huggingface.co/WizardLM/WizardMath-70B-V1.0
I wonder if I could use this to explore mathematics. I have been attempting to use existing LLMs, but results vary widely. I could imagine getting intuitions and insights about an equation, where the responses don't have to be watertight algebraic manipulations, but rather just deeply informed descriptions from several perspectives.
>> Before training LLEMMA 7B, we contract the RoPE (Su et al., 2022) base period of the Code Llama 7B initialization from θ = 1,000,000 to θ = 10,000. This is so that the long context finetuning procedure described in Peng et al. (2023) and Rozière et al. (2023) can be repeated on the trained LLEMMA 7B (we leave actually doing so to future work).

This doesn't explain the reason for the RoPE contraction in the first place. Can anyone elaborate on what the reason might be?
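For anyone wondering what the base period actually controls, here's a minimal sketch in Python (my own illustration, not the paper's code) of the standard RoPE frequency schedule. A larger θ pushes the lowest rotation frequencies lower (longer positional wavelengths, which is why Code Llama ships with 1,000,000 for long contexts); contracting to θ = 10,000 just restores the original Llama value. My guess is they want the standard base back so the long-context re-scaling of Peng et al. / Rozière et al. can later be applied as published, but the paper doesn't spell that out.

    def rope_freqs(head_dim, theta):
        # Standard RoPE rotation frequencies: theta ** (-2*i / head_dim) for
        # each dimension pair i. Larger theta => lower minimum frequency =>
        # longer positional wavelengths.
        return [theta ** (-2 * i / head_dim) for i in range(head_dim // 2)]

    print(rope_freqs(128, 1_000_000)[-1])  # Code Llama initialization: lowest freq ~1e-6
    print(rope_freqs(128, 10_000)[-1])     # contracted Llemma 7B value:  ~1e-4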
This seems to be about as open as a proprietary game engine: no model license is mentioned in the paper, the code, or the model card. I wonder whether their incorrect and misleading use of "open" will be caught during peer review.

Maybe the authors also believe that Windows is open. Its source code is also publicly available, after all.
Can we stop using catchy marketing names for things that are *supposed to be* scientific results?

Choosing "llemma" (similar to Meta's "Llama") is just clout-seeking.