GPT-4 details leaked?

661 points by bx376 · almost 2 years ago

36 comments

neonate · almost 2 years ago
https://archive.ph/2RQ8X
CSMastermind · almost 2 years ago
Previously posted about here: https://news.ycombinator.com/item?id=36671588 and here: https://news.ycombinator.com/item?id=36674905

With the original source being: https://www.semianalysis.com/p/gpt-4-architecture-infrastructure

The Twitter guy seems to just be paraphrasing the actual blog post? That's presumably why the tweets are now deleted.

---

The fact that they're using MoE was news to me and very interesting. I'd love to know more details about how they got that to work. Variations in that implementation would explain the fluctuations in the quality of output that people have observed.

I'm still waiting for the release of their vision model, which is mentioned here but which we still know little about, aside from a few demos a few months ago.
xeckr · almost 2 years ago
If this is true, then:

1. Training took 21 yottaFLOPs. When was the last time you saw the yotta- prefix for anything?

2. The training cost of GPT-4 is now only 1/3 of what it was about a year ago. It is absolutely staggering how quickly the price of training an LLM is dropping, which is great news for open source. The Google memo was right about the lack of a moat.
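
As a rough sanity check of those figures, here is a minimal Python sketch using the H100 numbers quoted later in the thread (~8,192 GPUs, ~55 days, $2 per GPU-hour) and the common ~6·N·D training-FLOPs rule of thumb. The active-parameter and token counts below are the leak's claimed values as reported, not confirmed numbers.

```python
# Rough sanity check of the leak's figures. Every input here is a claim from
# the leaked post (or an assumption), not a confirmed number.

gpus = 8192              # claimed H100 count
days = 55                # claimed pre-training duration
usd_per_gpu_hour = 2.0   # quoted cloud rate

cost = gpus * days * 24 * usd_per_gpu_hour
print(f"compute cost: ${cost / 1e6:.1f}M")        # ~$21.6M, close to the quoted ~$21.5M

active_params = 280e9    # leak's claimed active parameters per forward pass
tokens = 13e12           # leak's claimed training tokens
flops = 6 * active_params * tokens                # standard ~6*N*D training-FLOPs estimate
print(f"training FLOPs: {flops:.2e} (~{flops / 1e24:.0f} yottaFLOPs)")  # ~2.2e25, i.e. ~22
```
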
TeMPOraL · almost 2 years ago
> The conspiracy theory that the new GPT-4 quality had been deteriorated might be simply because they are letting the oracle model accept lower probability sequences from the speculative decoding model.

In other words: the speculation was likely right, I'll propose a specific mechanism explaining it, but then still insult the people bringing it up and keep gaslighting them.
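
For context, this is how the standard speculative-decoding acceptance test works in the published technique (not OpenAI's confirmed setup). The `leniency` knob below is a hypothetical illustration of the trade-off the quoted post speculates about: accepting more of the draft model's tokens speeds up inference but drifts away from the oracle model's distribution.

```python
import random

def accept_draft_token(p_target: float, p_draft: float, leniency: float = 1.0) -> bool:
    """Speculative-decoding acceptance test (standard published form).

    p_target: probability the large "oracle" model assigns to the drafted token
    p_draft:  probability the small draft model assigned to it
    leniency: 1.0 is the usual lossless rule; a hypothetical value > 1.0 accepts
              more draft tokens, trading output quality for faster inference.
    """
    accept_prob = min(1.0, leniency * p_target / p_draft)
    return random.random() < accept_prob
```
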
shahules · almost 2 years ago
This guy doesn't have any idea what he is talking about. He consistently posts such bullshit on Twitter. Mostly copy-paste with added spice mix.
potatoman22 · almost 2 years ago
Google has been doing research into mixture of experts for scaling LLMs. Their GLaM model, published in 2022, has 1.7 trillion parameters and 64 experts.

https://icml.cc/media/icml-2022/Slides/17378.pdf
qaq · almost 2 years ago
Hmm, "Sam Altman won't tell you that GPT-4 has 220B parameters and is a 16-way mixture model with 8 sets of weights." George Hotz said this in his recent interview with Lex Fridman. It looked like Lex knew this to be true by the way he reacted.
npsomaratna · almost 2 years ago
This is unsubstantiated. The only folks who know exactly how GPT-4 works are employed at OpenAI. The rest of us can only guess.
mmahemoff · almost 2 years ago
I've been wondering how freemium services like Thread Reader still operate now that Twitter is charging prohibitive prices for API access and taking measures to prevent scraping. The cheapest API plan with read access is $100/month, which allows reading 10,000 tweets, so it could only produce about 500 pages like this one on demand.
RC_ITR · almost 2 years ago
For all the "I know every number" certainty of this post, there's some weird stuff:

> (Today, the pre-training could be done with ~8,192 H100 in ~55 days for $21.5 million at $2 per H100 hour.)

Why flex both system size and training time to arbitrary numbers?

> For example, MoE is incredibly difficult to deal with on inference because not every part of the model is utilized on every token generation. This means parts may sit dormant when other parts are being used. When serving users, this really hurts utilization rates.

Utilization of what? Memory? If you're that worried about inference utilization, then why not just fire up a non-MoE model?

Here's what the post said about MQA:

> Because of that only 1 head is needed and memory capacity can be significantly reduced for the KV cache

This is close but wrong. You only need one Key and Value (KV) head, but you still have the same number of query heads.

My guess is that this is all a relatively knowledgeable person using formulas laid out by the 2020 scaling paper and making a fantasy system (with the correct math) based on that.

Put differently, I could probably fake my way through a similar post and be equally "close but definitely wrong" because I'm way out of my league. That vibe makes me very suspicious.
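
To make the MQA point concrete, here is a minimal sketch of KV-cache sizing for multi-head versus multi-query attention. The dimensions are illustrative assumptions, not GPT-4's actual configuration; the point is only that MQA keeps all the query heads but shares a single K/V head, so the cache shrinks by roughly the number of query heads.

```python
# Illustrative KV-cache sizing for multi-head attention (MHA) vs. multi-query
# attention (MQA). All dimensions are made-up example values, not GPT-4's.

n_layers  = 48
n_q_heads = 64      # query heads -- unchanged by MQA
head_dim  = 128
seq_len   = 8192
batch     = 1
bytes_per = 2       # fp16

def kv_cache_bytes(n_kv_heads: int) -> int:
    # K and V tensors per layer, each of shape [batch, n_kv_heads, seq_len, head_dim]
    return 2 * n_layers * batch * n_kv_heads * seq_len * head_dim * bytes_per

mha = kv_cache_bytes(n_kv_heads=n_q_heads)  # MHA: one KV head per query head
mqa = kv_cache_bytes(n_kv_heads=1)          # MQA: a single KV head shared by all query heads

print(f"MHA KV cache: {mha / 2**30:.1f} GiB")   # ~12.0 GiB with these numbers
print(f"MQA KV cache: {mqa / 2**30:.2f} GiB")   # ~0.19 GiB -- 64x smaller
```
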
PUSH_AX · almost 2 years ago
What is this hyper-dramatic nonsense tweet about, "It's over"? What's over?
dmarchand90 · almost 2 years ago
Can anyone provide an alternative link to https://twitter.com/i/web/status/1678545170508267522

I haven't registered for Twitter since it started and I'd rather not now (though I probably will if it's the only way to get leaked GPT-4 training details).
Roark66 · almost 2 years ago
The tweet is gone. What was in it?

Also, I'm dubious about this unsubstantiated claim. The biggest past innovation (training with human feedback) actually shrunk the size of a model. Compare Bloom-366B with Falcon-40B (much better). I would be mildly surprised if it turned out GPT-4 has 1.8T parameters (even if it's a composite model as they say).

The article says they use 16 experts of 111B each. So the best thing to assume is probably that each of these experts is basically a fine-tuned version of the same initial model for some problem domain.
getmeinrn · almost 2 years ago
> If their cost in the cloud was about $1 per A100 hour, the training costs for this run alone would be about $63 million.

If someone legitimate put together a crowdfunding effort, I would donate a non-insignificant amount to train an open model. Has it been tried before?
aussieguy1234 · almost 2 years ago
The fact they are using MoE is interesting. There are a lot of specialised open-source models on HuggingFace. You just need an LLM to act as the core "brain" and a few other components.

HuggingGPT works similarly to this. It automatically chooses, downloads and runs the right "expert" model from HuggingFace: https://arxiv.org/abs/2303.17580
potatoman22 · almost 2 years ago
I wonder what the legal implications of them using SciHub and Libgen would be if that's true. I'd imagine OpenAI is big enough to make deals with publishers.
langsoul-com · almost 2 years ago
We should default to using the thread aggregators instead of using Twitter links. My God, Twitter threads are unreadable.
PostOnce · almost 2 years ago
"Open" AI, a charity to benefit us all by pushing and publishing the frontier of scientific knowledge.

Nevermind, fuckers, actually it's just to take your jobs and make a few VCs richer. We'll keep the science a secret and try to pressure the government into making it illegal for you to compete with us.

https://github.com/ggerganov/llama.cpp

https://github.com/openlm-research/open_llama

https://huggingface.co/TheBloke/open-llama-7b-open-instruct-GGML

https://huggingface.co/TheBloke/open-llama-13b-open-instruct-GGML

You can use the above without paying OpenAI. You don't even need a GPU. There are no license issues like with the Facebook LLaMA.
TheRealPomax · almost 2 years ago
"*The post about GPT-4's architecture had been removed due to a copyright claim.", https://twitter.com/Yampeleg/status/1678582275561103360
qwertox · almost 2 years ago
> This, of course, is "only" a batch size of 7.5 million tokens per expert due to not every expert seeing all tokens.

> Mixture of Expert Tradeoffs: There are multiple MoE tradeoffs taken: For example, MoE is incredibly difficult to deal with on inference because not every part of the model is utilized on every token generation.

Are these experts able to communicate among themselves within one query? How do they get selected? How do they know who to pass information to?

Would I be able to influence the selection of experts by how I phrase my questions? For example, to ensure that a question about code gets passed directly to an expert in code? I feel silly asking this, but I honestly have no idea how to interpret it.
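
In published MoE designs (e.g. the Switch Transformer and GLaM papers), the experts don't talk to each other: a small learned gating network scores every expert for every token, only the top-k experts run, and their outputs are blended by the gate weights. Routing is learned during training rather than steered directly by the user. Below is a minimal sketch of that generic mechanism; it is not a description of GPT-4's actual router, and all sizes are illustrative.

```python
import numpy as np

# Generic top-k MoE routing as described in public papers (Shazeer 2017,
# Switch Transformer, GLaM). NOT GPT-4's actual router -- just the standard
# mechanism: a learned linear gate scores every expert for every token, the
# top-k experts run, and their outputs are combined by the gate weights.

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 512, 16, 2

W_gate = rng.standard_normal((d_model, n_experts)) * 0.02                     # learned in training
experts = [rng.standard_normal((d_model, d_model)) * 0.02 for _ in range(n_experts)]

def moe_layer(token: np.ndarray) -> np.ndarray:
    logits = token @ W_gate                     # one score per expert
    chosen = np.argsort(logits)[-top_k:]        # pick the top-k experts for this token
    weights = np.exp(logits[chosen])
    weights /= weights.sum()                    # softmax over the chosen experts
    # Only the chosen experts run; the rest sit idle for this token, which is
    # exactly the utilization problem the quoted post describes.
    return sum(w * (token @ experts[i]) for w, i in zip(weights, chosen))

out = moe_layer(rng.standard_normal(d_model))
print(out.shape)   # (512,)
```
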
gdubs · almost 2 years ago
Recently I was saying how much amazing stuff there is in retro computing. One thing that keeps coming to mind for me is just how visionary the Thinking Machines Connection Machine supercomputer architecture was, with its massive parallelism built in and neural network applications as a key predicted use case at the time. That was so long ago!

Interesting to think about in comparison to the challenges today around parallelizing 'commodity' GPUs. Scare quotes because the A100 and H100 are pretty impressive machines in and of themselves.
wokwokwok · almost 2 years ago
This is a duplicate post of pure speculation.
rurp · almost 2 years ago
> The conspiracy theory that the new GPT-4 quality had been deteriorated might be simply because they are letting the oracle model accept lower probability sequences from the speculative decoding model.

Whether or not this specific theory is true, something along these lines seems like the most likely explanation for the quality degradation that many have noticed, where OpenAI's claims about not changing the model are both technically true and completely misleading.
elzbardico · almost 2 years ago
It is a bit problematic if it is being trained on copyrighted textbooks without compensation for the authors. Even for open-source science, I think it is a bit unethical if OpenAI is using publicly funded research without attribution or compensation. Taxpayers paid for those NIH grants, you know...
rjb7731 · almost 2 years ago
I've previously noticed when playing with GPT-4 that it can sometimes 'autocomplete' on different sections of the text it's feeding back, sometimes what looks like 4 or more different sections. Might be unrelated, but is this MoE in action, or just them streaming the response in some way?
nightsd01 · almost 2 years ago
"Leaked" seems like a strong clickbait claim from whoever wrote this, along with the "it's over" part...
eminence32 · almost 2 years ago
> It is over.

What does this mean?
dataangel · almost 2 years ago
> Part of this extremely low utilization is due to an absurd number of failures requiring checkpoints that needed to be restarted from.

Hahahaha, the truth of anyone who has worked with quanty types running Python code at scale on a cluster.
Ozzie_osman · almost 2 years ago
There's a section at the end where there is speculation on what the entire dataset entails. My guess is a chunk of it is probably from ChatGPT data (or GPT-3 data from when training on your requests was opt-out rather than opt-in).
refulgentis · almost 2 years ago
No, this is fake; a light dusting of nothing on top of a meme post that was circulating in grifting communities as early as Q4 2022. It gains a little bit in every retelling; it's sort of impressive to see it reach almost blog scale.
kristianp · almost 2 years ago
I wonder if any open-source MoE models are being worked on. Could I run an 8x13B model on my 16GB graphics card, only loading the expert that is needed per run?
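
As a rough back-of-the-envelope answer (illustrative arithmetic only, counting weights alone and ignoring the KV cache and activations): a single 13B expert only fits in 16 GB once quantized, and since MoE routers in published designs pick experts per token and per layer rather than once per run, you would likely end up swapping experts constantly rather than loading just one.

```python
# Back-of-the-envelope VRAM needed just for the weights of one 13B-parameter
# expert at a few precisions. Illustrative arithmetic, not a benchmark.

params = 13e9
for name, bits in [("fp16", 16), ("8-bit", 8), ("4-bit", 4)]:
    gib = params * bits / 8 / 2**30
    print(f"{name}: ~{gib:.1f} GiB")

# fp16:  ~24.2 GiB -> does not fit in 16 GB
# 8-bit: ~12.1 GiB -> fits, with little headroom for the KV cache
# 4-bit:  ~6.1 GiB -> two experts could be resident at once
```
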
henkdehenker · almost 2 years ago
So George Hotz was right
DanAtC · almost 2 years ago
These words are nonsense to me. Can someone explain?
jonplackett · almost 2 years ago
Bard taking notes…
esaym · almost 2 years ago
Everyone hates on crypto because of all the electricity use for mining. But how much electricity is the training of all the giant LLMs costing us?
abrax3141 · almost 2 years ago
If it was trained on CS textbooks, they weren't very good ones. I asked it (GPT-4) to write a quantum computer algorithm to square a number. It very confidently told me that to simplify the problem it would use two bits. Okay, fine. But then the algorithm it (again confidently) implemented did a left shift (which it reminded me was multiplying by 2, so it definitely intended this!) and then added the number to itself. It then wrote that in terms of QC gates. Tada! It took me a half beat to realize that rather than this being some new version of squaring a number that I somehow wasn't aware of, it's completely wrong. It only works on 00! Confronted, of course it did the usual "So sorry... I guess I don't know how to do this." dance. I don't get why anyone thinks that this thing is worth anything at all, except for cheating on creative writing tests.