Previously posted about here: <a href="https://news.ycombinator.com/item?id=36671588">https://news.ycombinator.com/item?id=36671588</a> and here: <a href="https://news.ycombinator.com/item?id=36674905">https://news.ycombinator.com/item?id=36674905</a><p>With the original source being: <a href="https://www.semianalysis.com/p/gpt-4-architecture-infrastructure" rel="nofollow noreferrer">https://www.semianalysis.com/p/gpt-4-architecture-infrastruc...</a><p>The Twitter guy seems to just be paraphrasing the actual blog post? That's presumably why the tweets are now deleted.<p>---<p>The fact that they're using MoE was news to me and very interesting. I'd love to know more details about how they got that to work. Variations in that implementation would explain the fluctuations in output quality that people have observed.<p>I'm still waiting for the release of their vision model, which is mentioned here but which we still know little about, aside from a few demos a few months ago.
If this is true, then:<p>1. Training took ~21 yottaFLOPs. When was the last time you saw the yotta- prefix for anything?<p>2. The training cost of GPT-4 is now only about 1/3 of what it was roughly a year ago. It is absolutely staggering how quickly the price of training an LLM is dropping, which is great news for open source. The Google memo was right about the lack of a moat.
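Taking the post's figures at face value, the arithmetic roughly hangs together. A quick back-of-the-envelope check (every input here is the post's claim, not a confirmed number):

```python
# Sanity check of the claimed training numbers (all inputs are the post's
# claims: FLOP count, GPU counts, durations and hourly rates).
total_flops = 2.15e25                      # ~21.5 yottaFLOPs of pre-training

# Claimed original run: ~25,000 A100s for ~90-100 days at ~$1 per GPU-hour
a100_hours = 25_000 * 95 * 24
print(f"A100 GPU-hours: {a100_hours:.2e} -> ~${a100_hours / 1e6:.0f}M at $1/hr")

# Claimed cost today: ~8,192 H100s for ~55 days at ~$2 per GPU-hour
h100_hours = 8_192 * 55 * 24
print(f"H100 GPU-hours: {h100_hours:.2e} -> ~${2 * h100_hours / 1e6:.1f}M at $2/hr")

# Sustained per-GPU throughput an H100 would need for the claim to work out
per_gpu = total_flops / (h100_hours * 3600)
print(f"Implied ~{per_gpu / 1e12:.0f} TFLOPs sustained per H100")
```

The H100 numbers land right on the quoted $21.5M; the implied ~550 TFLOPs sustained per H100 is aggressive but not crazy.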
> <i>The conspiracy theory that the new GPT-4 quality had been deteriorated might be simply because they are letting the oracle model accept lower probability sequences from the speculative decoding model.</i><p>In other words: the speculation was likely right, and I'll even propose a specific mechanism explaining it, but I'll still insult the people who brought it up and keep gaslighting them.
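For anyone unfamiliar with speculative decoding, the knob being described would sit in the acceptance test. A minimal sketch of the standard rule (the `leniency` parameter is hypothetical, purely to illustrate the claimed mechanism; nothing here is confirmed about OpenAI's setup):

```python
import random

def accept_draft_token(p_oracle: float, p_draft: float,
                       leniency: float = 1.0) -> bool:
    """Acceptance test used in standard speculative decoding.

    p_oracle / p_draft: probabilities the big (oracle) and small (draft)
    models assign to the drafted token. With leniency = 1.0 this is the
    exact scheme: combined with the usual resample-on-reject step, the
    output distribution matches the oracle's. leniency > 1.0 is a made-up
    knob for the mechanism described above: accept more of the draft
    model's tokens, get faster but lower-quality output.
    """
    return random.random() < min(1.0, leniency * p_oracle / p_draft)
```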
Google has been doing research into mixture of experts for scaling LLMs. Their GLaM model, published in 2022, has 1.2 trillion parameters and 64 experts per MoE layer.<p><a href="https://icml.cc/media/icml-2022/Slides/17378.pdf" rel="nofollow noreferrer">https://icml.cc/media/icml-2022/Slides/17378.pdf</a>
Hmm. “Sam Altman won't tell you that GPT-4 has 220B parameters and is 16-way mixture model with 8 sets of weights.” George Hotz said this in his recent interview with Lex Fridman, and it looked like Lex knew it to be true from the way he reacted.
I've been wondering how freemium services like Thread Reader still operate now that Twitter is charging prohibitive prices for API access and taking measures to prevent scraping. The cheapest API plan with read access is $100/month, which only allows reading 10,000 tweets per month, so it could produce only about 500 pages like this one on demand.
For all the 'I know every number' certainty of this post, there's some weird stuff:<p>>(Today, the pre-training could be done with ~8,192 H100 in ~55 days for $21.5 million at $2 per H100 hour.)<p>Why flex both system size <i>and</i> training time to arbitrary numbers?<p>>For example, MoE is incredibly difficult to deal with on inference because not every part of the model is utilized on every token generation.
This means parts may sit dormant when other parts are being used. When serving users, this really hurts utilization rates.<p>Utilization of what? Memory? If you're that worried about inference utilization, then why not just fire up a non-MoE model?<p>Here's what the post said about MQA:<p>>Because of that only 1 head is needed and memory capacity can be significantly reduced for the KV cache<p>This is close but wrong. You only need one <i>Key and Value (KV)</i> head, but you still have the same number of query heads.<p>My guess is that this is a relatively knowledgeable person using the formulas laid out in the 2020 scaling-laws paper to construct a fantasy system (with the math done correctly).<p>Put differently, I could probably fake my way through a similar post and land at an equal level of close-but-definitely-wrong, because I'm way out of my league. That vibe makes me very suspicious.
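To make the MQA point concrete, here's the rough KV-cache arithmetic (the layer/head counts below are placeholders for illustration, not GPT-4's actual dimensions):

```python
# Rough KV-cache sizing: multi-head vs multi-query attention.
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, bytes_per=2):
    # K and V (hence the factor of 2), stored per layer, per KV head, per position
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per

# MHA: every query head has its own K/V head (illustrative dimensions)
mha = kv_cache_bytes(n_layers=96, n_kv_heads=96, head_dim=128, seq_len=8192, batch=1)
# MQA: all query heads share a single K/V head; the number of *query*
# heads is unchanged, only the cached K/V shrink
mqa = kv_cache_bytes(n_layers=96, n_kv_heads=1, head_dim=128, seq_len=8192, batch=1)
print(f"MHA KV cache: {mha / 2**30:.1f} GiB, MQA: {mqa / 2**30:.2f} GiB per sequence")
```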
Can anyone provide an alternative link to <a href="https://twitter.com/i/web/status/1678545170508267522" rel="nofollow noreferrer">https://twitter.com/i/web/status/1678545170508267522</a>?<p>I've never registered for Twitter and I'd rather not start now (though I probably will if it's the only way to get the leaked GPT-4 training details).
The tweet is gone. What was in it?<p>Also, I'm dubious about this unsubstantiated claim. The biggest past innovation (training with human feedback) actually shrank the size of model needed. Compare BLOOM-176B with Falcon-40B (which is much better). I would be mildly surprised if it turned out GPT-4 has 1.8T parameters (even if it's a composite model, as they say).<p>The article says they use 16 experts of ~111B parameters each. So the best thing to assume is probably that each of these experts is basically a fine-tuned version of the same base model for some problem domain.
>If their cost in the cloud was about $1 per A100 hour, the training costs for this run alone would be about $63 million.<p>If someone legitimate put together a crowd funding effort, I would donate a non-insignificant amount to train an open model. Has it been tried before?
The fact they are using MoE is interesting. There are a lot of specialised open-source models on HuggingFace. You just need an LLM to act as the core "brain" and a few other components.<p>HuggingGPT works along similar lines: it automatically chooses, downloads and runs the right "expert" model from the HuggingFace Hub. <a href="https://arxiv.org/abs/2303.17580" rel="nofollow noreferrer">https://arxiv.org/abs/2303.17580</a>
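A toy sketch of that kind of routing (this is not HuggingGPT's actual implementation; HuggingGPT prompts the LLM itself to plan tasks and pick models from the Hub, whereas the keyword router and model choices below are placeholders):

```python
# Toy HuggingGPT-style routing: a "brain" picks a specialised model for the
# task, then the chosen model does the work.
from transformers import pipeline

CANDIDATES = {
    "summarization": "facebook/bart-large-cnn",
    "translation_en_to_fr": "t5-base",
    "sentiment-analysis": "distilbert-base-uncased-finetuned-sst-2-english",
}

def route(request: str) -> str:
    # HuggingGPT does this step by prompting the LLM; a keyword match stands in here.
    text = request.lower()
    if "summar" in text:
        return "summarization"
    if "translat" in text:
        return "translation_en_to_fr"
    return "sentiment-analysis"

task = route("Please summarize this article for me")
expert = pipeline(task, model=CANDIDATES[task])
print(expert("Mixture-of-experts models route each token to a few experts, "
             "while systems like HuggingGPT route whole requests to whole models.")[0])
```

Worth noting the difference in granularity, though: MoE in the GPT-4 sense routes individual tokens inside one network, while HuggingGPT routes entire requests to entire models.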
I wonder what the legal implications of them using SciHub and Libgen would be if that's true. I'd imagine OpenAI is big enough to make deals with publishers.
"Open" AI, a charity to benefit us all by pushing and publishing the frontier of scientific knowledge.<p>Nevermind, fuckers, actually it's just to take your jobs and make a few VCs richer. We'll keep the science a secret and try to pressure the government into making it illegal for you to compete with us.<p><a href="https://github.com/ggerganov/llama.cpp">https://github.com/ggerganov/llama.cpp</a><p><a href="https://github.com/openlm-research/open_llama">https://github.com/openlm-research/open_llama</a><p><a href="https://huggingface.co/TheBloke/open-llama-7b-open-instruct-GGML" rel="nofollow noreferrer">https://huggingface.co/TheBloke/open-llama-7b-open-instruct-...</a><p><a href="https://huggingface.co/TheBloke/open-llama-13b-open-instruct-GGML" rel="nofollow noreferrer">https://huggingface.co/TheBloke/open-llama-13b-open-instruct...</a><p>You can use the above without paying OpenAI. You don't even need a GPU. There are no license issues like with the facebook llama.
"*The post about GPT-4's architecture had been removed due to a copyright claim.", <a href="https://twitter.com/Yampeleg/status/1678582275561103360" rel="nofollow noreferrer">https://twitter.com/Yampeleg/status/1678582275561103360</a>
> This, of course, is “only” a batch size of 7.5 million tokens per expert due to not every expert seeing all tokens.<p>> Mixture of Expert Tradeoffs: There are multiple MoE tradeoffs taken: For example, MoE is incredibly difficult to deal with on inference because not every part of the model is utilized on every token generation.<p>Are these experts able to communicate among themselves within one query? How do they get selected? How do they know who to pass information to?<p>Would I be able to influence the selection of experts by how I phrase my questions? For example, could I ensure that a question about code gets passed directly to an expert on code? I feel silly asking this, but I honestly have no idea how to interpret any of it.
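Since part of the question is how experts get selected: in the usual MoE setups it's a small learned gate that scores every token and sends it to the top-k experts, so there's no explicit hand-off or negotiation between experts, and you can't pick an expert directly (at best, the content of your prompt changes the token representations the gate sees). Whether GPT-4 works this way is speculation, but here's a minimal sketch of GShard/Switch-style top-k gating:

```python
import torch
import torch.nn.functional as F
from torch import nn

class TopKMoE(nn.Module):
    """Minimal top-k gated mixture-of-experts layer (GShard/Switch-style sketch)."""

    def __init__(self, d_model=512, n_experts=16, k=2):
        super().__init__()
        self.gate = nn.Linear(d_model, n_experts)      # learned router
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)])
        self.k = k

    def forward(self, x):                              # x: (tokens, d_model)
        scores = F.softmax(self.gate(x), dim=-1)       # per-token expert scores
        topv, topi = scores.topk(self.k, dim=-1)       # each token keeps k experts
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = topi[:, slot] == e              # tokens routed to expert e
                if mask.any():
                    out[mask] += topv[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

layer = TopKMoE()
tokens = torch.randn(10, 512)
print(layer(tokens).shape)                             # torch.Size([10, 512])
```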
Recently I was saying how much amazing stuff there is in retro computing. One thing that keeps coming to mind for me lately is just how visionary the Thinking Machines Connection Machine supercomputer architecture was, with its massive parallelism built in and neural network applications being a key predicted use case at the time. That was so long ago!<p>Interesting to think about in comparison to the challenges today around parallelizing 'commodity' GPUs. Scare quotes because the A100 and H100 are pretty impressive machines in and of themselves.
> The conspiracy theory that the new GPT-4 quality had been deteriorated might be simply because they are letting the oracle model accept lower probability sequences from the speculative decoding model.<p>Whether or not this specific theory is true, something along these lines seems like the most likely explanation for the quality degradation that many have noticed, where OpenAI's claims about not changing the model are both technically true and completely misleading.
It is a bit problematic if it is being trained on copyrighted textbooks without compensation for the authors.
Even for open-source science, I think it is a bit unethical if OpenAI is using publicly funded research without attribution or compensation. Taxpayers paid for those NIH grants, you know...
I've previously noticed when playing with GPT-4 that it can sometimes 'autocomplete' on different sections of the text it's feeding back, sometimes what looks like 4 or more different sections. This might be unrelated, but is that MoE in action, or just them streaming the response in some way?
> Part of this extremely low utilization is due to an absurd number of failures requiring checkpoints that needed to be restarted from.<p>Hahahaha, a familiar truth for anyone who has worked with quanty types running Python code at scale on a cluster
There's a section at the end where there is speculation on what the entire dataset entails. My guess is a chunk of it is probably from ChatGPT data (or GPT3 data from when training on your requests was opt-out rather than opt-in).
No, this is fake, a light dusting of nothing on top of a meme post that was circulating in grifting communities as early as Q4 2022. It gains a little bit in every retelling; it's sort of impressive to see it reach almost blog scale.
I wonder if any open-source MoE models are being worked on. Could I run an 8x13B model on my 16GB graphics card, only loading the expert that is needed for each run?
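Rough memory math for that idea (a sketch assuming 4-bit quantisation; the 8x13B configuration is hypothetical): a 13B expert is about 6 GiB, so a couple would fit, but the catch is that in a per-token-routed MoE the active experts change every token and every layer, so paging experts in and out of VRAM only helps if routing happens per request rather than per token.

```python
# Back-of-the-envelope VRAM math for a hypothetical 8x13B MoE, 4-bit quantised.
params_per_expert = 13e9
bytes_per_param = 0.5                            # ~4-bit weights
gib_per_expert = params_per_expert * bytes_per_param / 2**30
print(f"~{gib_per_expert:.1f} GiB per expert")   # ~6 GiB: two experts fit in 16 GiB,
                                                 # all eight (~48 GiB) do not
```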
If it was trained on CS textbooks, they weren't very good ones. I asked it (GPT-4) to write a quantum computer algorithm to square a number. It very confidently told me that to simplify the problem it would use two bits. Okay, fine. But then the algorithm it (again confidently) implemented did a left shift (which it reminded me was multiplying by 2, so it definitely intended this!) and then added the number to itself. It then wrote that in terms of QC gates. Tada! It took me half a beat to realize that rather than this being some new version of squaring a number that I somehow wasn't aware of, it's completely wrong. It only works on 00! Confronted, of course it did the usual "So sorry... I guess I don't know how to do this" dance. I don't get why anyone thinks this thing is worth anything at all, except for cheating on creative writing tests.
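For what it's worth, the classical arithmetic it reinvented (exactly what got added is my reading of the transcript, but neither interpretation squares anything):

```python
# A left shift is 2*x; adding the original number gives 3*x, adding the
# shifted value again gives 4*x -- neither equals x*x except at zero
# (3*x also happens to coincide with x*x at x = 3).
for x in range(4):                    # the two-(qu)bit inputs 00..11
    print(f"x={x:02b}  2x+x={3*x}  2x+2x={4*x}  x^2={x*x}")
```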