科技回声

Blending Is All You Need: Cheaper, Better Alternative to Trillion-Parameters LLM

147 points · by naturalauction · over 1 year ago

16 comments

abeppu · over 1 year ago
Ok, this seems bunk, basically because they never really provide evidence of "better".

> ... traditional gold-standard approaches use human evaluators that score the quality of generated responses, which can be costly. However, since chat AIs are by definition deployed in social environments with humans, one can leverage statistics of user interaction as a meaningful and aligned measure of chat AI engagingness and quality. To assess the 'quality' of a chat AI, we consider two main proxy functions: the industry standard user retention and the main objective function, user engagement.

Maybe retention and engagement _are_ sufficiently well correlated with human evaluations, but you should probably do both and show that they're strongly correlated before deciding to drop the human evaluators in favor of your cheap proxy measurements.

And in this field, where there are known issues with chat LLMs, it's probably important to check things like:

- Does the model seem "engaging" just because the user has to refine their prompt several times before getting a satisfying response?

- Do responses include a lot of hallucinations, which might be engaging but not true?

- Do successive responses show decreased consistency or coherence between messages, in a way that might accidentally elicit continued engagement?

Overall, it seems sloppy to believe that it's not a waste of humans' time to talk to your chatbots, and not a waste of readers' time to look at this paper about your chatbots, but that it's too expensive for you to actually measure the quality of your chatbots' responses.
Animats · over 1 year ago
"Responses are selected randomly from a group of base chat AIs. ... The response generated by a specific chat AI is conditional on all previous responses generated by the previously selected chat AIs."

That's all? That works? Useful.

Could that be extended? Nothing in the approach seems to require that all the chat AIs be LLMs. Some could be special-purpose systems: solvers or knowledge bases such as Wolfram Alpha or a database front end could play too, as could systems at the Alexa/Siri level that can do simple tasks. Domain-specific systems with natural language in and out have been around for decades.
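The scheme quoted above is simple enough to sketch in a few lines. A minimal illustration, not the paper's actual code: `models` is a hypothetical list of callables, each mapping a shared conversation history to a reply.

```python
import random

def blended_chat(models, user_message, history):
    """One dialogue turn of the blending scheme: draw one base chat
    model at random and condition it on the full shared history,
    including replies produced by the other models on earlier turns."""
    history = history + [("user", user_message)]   # copy, don't mutate caller's list
    model = random.choice(models)                  # uniform random selection of a base chat AI
    reply = model(history)                         # conditional on all previous responses
    history.append(("assistant", reply))
    return reply, history
```

Because every model sees the same accumulated history, nothing in the loop cares whether a given callable wraps an LLM, a solver, or a database front end, which is the extension the comment suggests.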
goethes_kind · over 1 year ago
I find it suspicious that they use user engagement and retention, and none of the standard benchmarks, to evaluate their model.
enoch2090 · over 1 year ago
I've said this a few times before, yet I certainly want to say it again: "All You Need" titles are definitely not what we all need.
m3kw9 · over 1 year ago
I really would like them to compare against GPT-4 instead of claiming victory when matching 3.5. To me, GPT-4 is the first one usable for many professional purposes. 3.5 is fun and gets some things right, but it's more like a demo.
sp332 · over 1 year ago
Is it weird to refer to GPT-3.5 as "state of the art" when GPT-4 is right there? The paper actually uses davinci interchangeably with GPT-3.5 (sometimes without a hyphen) and ChatGPT.
jeffrallen · over 1 year ago
"All you need" is all you need, apparently, to get an AI paper on HN.
rfw300 · over 1 year ago
The paper refers to ChatGPT as a 175B-parameter LLM. This is almost certainly incorrect; the original largest version of GPT-3 was 175B, but analysis of the current model's speed and cost, as well as public statements by OpenAI, indicates it's as much as 5-10x smaller.
denimboy · over 1 year ago
mergekit is the tool you need to do this:

    https://github.com/cg123/mergekit

You can slice off layers and blend models with different strategies.
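For illustration, a hypothetical mergekit-style YAML config showing the layer-slicing idea the comment mentions. The model names and layer ranges here are placeholders, and the schema may have changed; consult the mergekit README for the current format.

```yaml
# Hypothetical "frankenmerge": stack the first 24 layers of one model
# on top of layers 8-32 of another, using the passthrough merge method
# (no weight averaging, layers are copied as-is).
slices:
  - sources:
      - model: example-org/model-a
        layer_range: [0, 24]
  - sources:
      - model: example-org/model-b
        layer_range: [8, 32]
merge_method: passthrough
dtype: float16
```

Other merge strategies (e.g. weighted averaging or spherical interpolation of weights) blend parameters instead of stacking layers; this is weight-space merging, distinct from the response-level blending the paper describes.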
block_dagger · over 1 year ago
Reminds me of Numenta's Thousand Brains Theory of Intelligence.
miven · over 1 year ago
Now that I think about it, doesn't this "technique" triple the amount of compute and memory per generated token, since each model also needs to compute and store the KV values for the two previous tokens it didn't generate and thus has never seen?

Edit: On second thought, depending on how it's actually implemented, the other two tokens are probably run through the model in parallel, so it shouldn't be all that much slower.
patrickhogan1 · over 1 year ago
Foundational models are designed to be universally applicable, covering a wide range of use cases. While it's relatively easy to tailor smaller models to specific scenarios through overfitting, a model that is overly specialized loses its broad applicability and ceases to be a foundational model.
Buttons840 · over 1 year ago
How does the blending work? I'm imagining installing a bunch of "AIs" and having them all work together intelligently.
huytersd · over 1 year ago
I've said this before, but every time someone uses 3.5 to make a point, there's an agenda.
teddyh · over 1 year ago
Three small LLMs in a trenchcoat.
matmulbro · over 1 year ago
Machine learning papers are astrology for men.