TechEcho

Llama 2 is about as factually accurate as GPT-4 for summaries and is 30X cheaper

143 points by pallas_athena over 1 year ago

8 comments

Michelangelo11 over 1 year ago
Nice idea and all, but I think their methodology was totally unsuited to the task. What they actually assessed was whether an LLM can recognize an accurate summary of a given article, not whether the LLM can produce an accurate summary itself. Here's their description of it:

> We used a 3-way verified hand-labeled set of 373 news report statements and presented one correct and one incorrect summary of each. Each LLM had to decide which statement was the factually correct summary.

The problem with this approach is its assumption that if an LLM can recognize an accurate summary, it'll be able to reliably produce accurate summaries. We know very little about the inner workings of LLMs right now, and what we do know suggests that they work highly counterintuitively, so I think there's no basis to make this assumption.
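To make the distinction concrete, here is a minimal sketch of how a two-way forced-choice evaluation like the one quoted above could be scored. It is not the benchmark authors' code: the `choose` callable (which would wrap a model call), the `Item` layout, and the function name are all hypothetical. Note that it only measures whether the model picks the right summary, never whether it can write one.

```python
import random
from typing import Callable, Sequence, Tuple

# Hypothetical item layout: (article text, correct summary, incorrect summary).
Item = Tuple[str, str, str]

def forced_choice_accuracy(
    items: Sequence[Item],
    choose: Callable[[str, str, str], str],
    seed: int = 0,
) -> float:
    """Fraction of items where `choose` picks the correct summary.

    `choose(article, option_a, option_b)` is an assumed interface that
    returns "A" or "B"; in practice it would wrap an LLM call.
    """
    if not items:
        raise ValueError("need at least one item")
    rng = random.Random(seed)
    hits = 0
    for article, good, bad in items:
        # Shuffle positions so a model that always answers "A" lands near 50%.
        good_first = rng.random() < 0.5
        option_a, option_b = (good, bad) if good_first else (bad, good)
        picked_a = choose(article, option_a, option_b) == "A"
        if picked_a == good_first:
            hits += 1
    return hits / len(items)
```

Scoring generated summaries would need a different harness entirely, for example human or model grading of free-form output, which is the gap the comment is pointing at.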
behindai over 1 year ago
It costs $0.001 per 1K tokens, which is slightly cheaper than GPT-3.5-turbo. I just tested it, and it gives far worse results on the tasks in my pipelines. Not a game changer, unfortunately.
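As a rough sketch of the arithmetic, taking the $0.001 per 1K figure from the comment above and the headline's "30X cheaper" claim at face value (and assuming "1K" means 1,000 tokens), the implied comparison looks like this. The workload size is a made-up example, and the GPT-4 rate here is derived from the 30X claim, not looked up.

```python
def cost_usd(tokens: int, usd_per_1k: float) -> float:
    """Cost of processing `tokens` tokens at a flat per-1K-token rate."""
    return tokens / 1000 * usd_per_1k

LLAMA_USD_PER_1K = 0.001  # rate quoted in the comment above
CLAIMED_RATIO = 30        # the headline's "30X cheaper"

# GPT-4 rate implied by the two numbers above (an inference, not a quoted price).
implied_gpt4_usd_per_1k = LLAMA_USD_PER_1K * CLAIMED_RATIO

tokens = 1_000_000  # hypothetical workload size
print(f"Llama 2 at quoted rate: ${cost_usd(tokens, LLAMA_USD_PER_1K):.2f}")
print(f"GPT-4 at implied rate:  ${cost_usd(tokens, implied_gpt4_usd_per_1k):.2f}")
```

The price gap only matters if output quality holds up on the actual task, which is exactly what the comment disputes.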
sorokod over 1 year ago
"It is not too much of a stretch to conclude that a system that is better at telling factual from non-factual sentences is better at not making them up in the first place – or alternatively could decide through a two stage process if it was being inconsistent."

Stretching aside, how does one follow from the other?
born-jre over 1 year ago
if someone reputable could maintain a blind benchmark that is not public, that would be great.
yieldcrv over 1 year ago
also means that OpenAI can just swap in Llama 2 and increase their capacity by orders of magnitude

this is the age (or year) of token price arbitrage
cheema33 over 1 year ago
Almost as good as GPT-4? I hear that claim quite often. And then when I test the claim, it falls far short. I want real competition in this space. But currently, there is none. Except maybe in some very narrow corner cases.
ldjkfkdsjnv over 1 year ago
Anyscale's business model was completely disrupted by OpenAI. They are trying to shift to providing hosting/fine-tuning for open source LLMs, but the model performance will get crushed by GPT-4/newer OpenAI models. In theory the alternative to OpenAI models is nice; in reality Anyscale is now competing with Azure/AWS/etc. to provide model hosting.

Their original compute platform for running arbitrary ML workloads will become obsolete as the industry consolidates around LLMs.
m3kw9 over 1 year ago
OpenAI is probably laughing their asses off internally reading all the "it's equivalent to GPT-4" results