
Llemma: An Open Language Model for Mathematics

267 points | by AlphaWeaver | over 1 year ago

13 comments

aSanchezStern · over 1 year ago
Note that this still doesn't seem as good at formal theorem proving as some of the specialized prover models that aren't LLM-based. In particular, they show a 3% increase in proofs solved over COPRA on the MiniF2F Lean dataset, but in COPRA's own paper they show that they prove about 18.5% fewer theorems than Proverbot9001 on the CompCert dataset (their pitch is that they are faster, but not better at proving, than other tools). Assuming COPRA isn't just far better at Lean than Coq (there should be some amount of difference, but given that it's the same algorithm run on a uniform interface over Coq and Lean, it's unlikely to be large), Llemma should be proving 10-15% fewer proofs than Proverbot9001's algorithm. There's a bit of trickiness to this kind of cross-benchmark comparison, and COPRA picks a *bit* of a weird dataset for its Proverbot9001 comparison to save costs, and it also uses a slightly older version, which has worse results. So it probably still washes out to the best LLM-based provers being about 10-15% worse at proving theorems than the best specialized models.

EDIT: To be clear, that's 10-15% of the *total* theorems in a test set, not a relative 10-15% improvement. Given that solve rates for tools are in the 10-30% range, that's much more than a relative 10-15% improvement.
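To make the EDIT's point concrete: with solve rates in the 10-30% range, an absolute gap of 10-15 percentage points is a very large relative gap. A minimal sketch (the solve rates below are hypothetical, chosen only to fall in the range mentioned above):

```python
# Hypothetical solve rates on a 1,000-theorem test set (illustrative numbers).
total_theorems = 1000
specialized_rate = 0.28  # a specialized prover solves 28% of theorems
llm_rate = 0.15          # an LLM-based prover solves 15%

absolute_gap = specialized_rate - llm_rate      # 0.13 -> 13 points of the *total* set
relative_gap = absolute_gap / specialized_rate  # ~0.46 -> a 46% relative difference

print(f"Absolute gap: {absolute_gap:.0%} of all theorems "
      f"({absolute_gap * total_theorems:.0f} theorems)")
print(f"Relative gap: {relative_gap:.0%} of the specialized prover's solves")
```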
YetAnotherNick · over 1 year ago
WizardMath [1] clearly has better benchmark results, and they conveniently ignored it. They knew about the model because they used its dataset.

This has 51.5% on GSM8K, compared to 81.6% for WizardMath.

[1]: https://huggingface.co/WizardLM/WizardMath-70B-V1.0
alhirzel · over 1 year ago
I love the name. It's intuitive today and doesn't have any overloaded meanings or other sources of confusion.
Syntonicles · over 1 year ago
I wonder if I could use this to explore mathematics. I have been attempting to use existing LLMs, but results vary widely. I could imagine getting intuitions and insights about an equation, where the responses don't have to be watertight algebraic manipulations, but rather just deeply informed descriptions from several perspectives.
pk-protect-ai · over 1 year ago
>> Before training LLEMMA 7B, we contract the RoPE (Su et al., 2022) base period of the Code Llama 7B initialization from θ = 1,000,000 to θ = 10,000. This is so that the long context finetuning procedure described in Peng et al. (2023) and Rozière et al. (2023) can be repeated on the trained LLEMMA 7B (we leave actually doing so to future work).

This doesn't explain what the reason for the RoPE contraction was in the first place. Can anyone elaborate on what the reason could be?
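Some background (not from the paper): Llama-family models pretrain with RoPE base θ = 10,000, and Code Llama raises it to 1,000,000 as part of its long-context adaptation, so contracting θ plausibly just restores the base-Llama setting before the long-context procedure is re-run later. The base period sets the per-dimension rotation frequencies. A minimal sketch of that computation (the head size is illustrative, not from the paper):

```python
import numpy as np

def rope_frequencies(theta: float, head_dim: int) -> np.ndarray:
    """Per-dimension-pair rotation frequencies used by RoPE (Su et al., 2022)."""
    # Pair i of the head dimension rotates at theta^(-2i / head_dim).
    return theta ** (-np.arange(0, head_dim, 2) / head_dim)

head_dim = 128  # illustrative head size

# Code Llama's long-context base period vs. the contracted Llama default.
long_ctx = rope_frequencies(1_000_000, head_dim)
contracted = rope_frequencies(10_000, head_dim)

# A larger theta yields slower low-band rotations, i.e. positions stay
# distinguishable over a longer range; contracting theta reverses that.
print(long_ctx[-1], contracted[-1])  # lowest-frequency component of each
```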
classified · over 1 year ago
So we get hallucinations in math, too. This was inevitable.
leedrake5 · over 1 year ago
Downloading this now to test it out - if anyone has prompts or tests they'd like to see, let me know. I plan to run it with llama.cpp.
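For anyone wanting to try the same, a minimal sketch using the llama-cpp-python bindings; the GGUF file name and the prompt are placeholders, not files the authors ship:

```python
from llama_cpp import Llama  # pip install llama-cpp-python

# Placeholder path: point this at your own converted/quantized Llemma weights.
llm = Llama(model_path="./llemma-7b.Q4_K_M.gguf", n_ctx=2048)

out = llm(
    "Problem: Prove that the sum of two even integers is even.\nSolution:",
    max_tokens=256,
    temperature=0.2,
)
print(out["choices"][0]["text"])
```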
pumanoir · over 1 year ago
I skimmed (ctrl-f) the paper and didn't see a comparison against GPT-4. Does anybody know how they'd compare?
Zuiii · over 1 year ago
This seems to be as open as proprietary game engines, since no mention of a model license is present in the paper, code, or model card. I wonder whether their incorrect and misleading use of "open" will be caught during peer review.

Maybe the authors also believe that Windows is open. Its source code is also publicly available, after all.
akomtu · over 1 year ago
A missed opportunity: TheLemma.
an_aparallel · over 1 year ago
So - now you don't need to know maths to work on DSP-heavy / math-reliant programming? :)
lanstin · over 1 year ago
Can we run this somewhere?
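One local option is loading the released weights with Hugging Face transformers. A minimal sketch; the repo id below is assumed from the release naming, so verify it against the actual model card:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed repo id for the 7B release; check the Hub before relying on it.
model_id = "EleutherAI/llemma_7b"

tokenizer = AutoTokenizer.from_pretrained(model_id)
# device_map="auto" requires the accelerate package to be installed.
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "The derivative of sin(x) is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```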
behnamoh · over 1 year ago
Can we stop using catchy marketing names for things that are *supposed to be* scientific results?

Choosing "llemma" (similar to Meta's "Llama") is just clout-seeking.