Previously posted about here: <a href="https://news.ycombinator.com/item?id=36671588">https://news.ycombinator.com/item?id=36671588</a> and here: <a href="https://news.ycombinator.com/item?id=36674905">https://news.ycombinator.com/item?id=36674905</a><p>With the original source being: <a href="https://www.semianalysis.com/p/gpt-4-architecture-infrastructure" rel="nofollow noreferrer">https://www.semianalysis.com/p/gpt-4-architecture-infrastruc...</a><p>The Twitter guy seems to just be paraphrasing the actual blog post? That's presumably why the tweets are now deleted.<p>---<p>The fact that they're using MoE was news to me and very interesting. I'd love to know more details about how they got that to work. Variations in that implementation would explain the fluctuations in output quality that people have observed.<p>I'm still waiting for the release of their vision model, which is mentioned here but which we still know little about, aside from a few demos a few months ago.
If this is true, then:<p>1. Training took ~21 yottaFLOPs. When was the last time you saw the yotta- prefix for anything?<p>2. The training cost of GPT-4 is now only about 1/3 of what it was roughly a year ago. It is absolutely staggering how quickly the price of training an LLM is dropping, which is great news for open source. The Google memo was right about the lack of a moat.
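Taking the post's figures at face value, the arithmetic roughly hangs together. A quick back-of-the-envelope check (every input here is the post's claim, not a confirmed number):

```python
# Sanity check of the claimed training numbers (all inputs are the post's
# claims: FLOP count, GPU counts, durations and hourly rates).
total_flops = 2.15e25                      # ~21.5 yottaFLOPs of pre-training

# Claimed original run: ~25,000 A100s for ~90-100 days at ~$1 per GPU-hour
a100_hours = 25_000 * 95 * 24
print(f"A100 GPU-hours: {a100_hours:.2e} -> ~${a100_hours / 1e6:.0f}M at $1/hr")

# Claimed cost today: ~8,192 H100s for ~55 days at ~$2 per GPU-hour
h100_hours = 8_192 * 55 * 24
print(f"H100 GPU-hours: {h100_hours:.2e} -> ~${2 * h100_hours / 1e6:.1f}M at $2/hr")

# Sustained per-GPU throughput an H100 would need for the claim to work out
per_gpu = total_flops / (h100_hours * 3600)
print(f"Implied ~{per_gpu / 1e12:.0f} TFLOPs sustained per H100")
```

The H100 numbers land right on the quoted $21.5M; the implied ~550 TFLOPs sustained per H100 is aggressive but not crazy.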
> <i>The conspiracy theory that the new GPT-4 quality had been deteriorated might be simply because they are letting the oracle model accept lower probability sequences from the speculative decoding model.</i><p>In other words: the speculation was likely right, and I'll even propose a specific mechanism explaining it, but I'll still insult the people who brought it up and keep gaslighting them.
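For anyone unfamiliar with speculative decoding, the knob being described would sit in the acceptance test. A minimal sketch of the standard rule (the `leniency` parameter is hypothetical, purely to illustrate the claimed mechanism; nothing here is confirmed about OpenAI's setup):

```python
import random

def accept_draft_token(p_oracle: float, p_draft: float,
                       leniency: float = 1.0) -> bool:
    """Acceptance test used in standard speculative decoding.

    p_oracle / p_draft: probabilities the big (oracle) and small (draft)
    models assign to the drafted token. With leniency = 1.0 this is the
    exact scheme: combined with the usual resample-on-reject step, the
    output distribution matches the oracle's. leniency > 1.0 is a made-up
    knob for the mechanism described above: accept more of the draft
    model's tokens, get faster but lower-quality output.
    """
    return random.random() < min(1.0, leniency * p_oracle / p_draft)
```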
Google has been doing research into mixture of experts for scaling LLMs. Their GLaM model, published in 2022, has 1.2 trillion parameters and 64 experts per MoE layer.<p><a href="https://icml.cc/media/icml-2022/Slides/17378.pdf" rel="nofollow noreferrer">https://icml.cc/media/icml-2022/Slides/17378.pdf</a>
Hmm. “Sam Altman won't tell you that GPT-4 has 220B parameters and is 16-way mixture model with 8 sets of weights.” George Hotz said this in his recent interview with Lex Fridman, and it looked like Lex knew it to be true from the way he reacted.
I've been wondering how freemium services like Thread Reader still operate now that Twitter is charging prohibitive prices for API access and taking measures to prevent scraping. The cheapest API plan with read access is $100/month, which only allows reading 10,000 tweets per month, so it could produce only about 500 pages like this one on demand.
For all the 'I know every number' certainty of this post, there's some weird stuff:<p>>(Today, the pre-training could be done with ~8,192 H100 in ~55 days for $21.5 million at $2 per H100 hour.)<p>Why flex both system size <i>and</i> training time to arbitrary numbers?<p>>For example, MoE is incredibly difficult to deal with on inference because not every part of the model is utilized on every token generation.
This means parts may sit dormant when other parts are being used. When serving users, this really hurts utilization rates.<p>Utilization of what? Memory? If you're that worried about inference utilization, then why not just fire up a non-MoE model?<p>Here's what the post said about MQA:<p>>Because of that only 1 head is needed and memory capacity can be significantly reduced for the KV cache<p>This is close but wrong. You only need one <i>Key and Value (KV)</i> head, but you still have the same number of query heads.<p>My guess is that this is a relatively knowledgeable person using the formulas laid out in the 2020 scaling-laws paper to construct a fantasy system (with the math done correctly).<p>Put differently, I could probably fake my way through a similar post and land at an equal level of close-but-definitely-wrong, because I'm way out of my league. That vibe makes me very suspicious.
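To make the MQA point concrete, here's the rough KV-cache arithmetic (the layer/head counts below are placeholders for illustration, not GPT-4's actual dimensions):

```python
# Rough KV-cache sizing: multi-head vs multi-query attention.
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, bytes_per=2):
    # K and V (hence the factor of 2), stored per layer, per KV head, per position
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per

# MHA: every query head has its own K/V head (illustrative dimensions)
mha = kv_cache_bytes(n_layers=96, n_kv_heads=96, head_dim=128, seq_len=8192, batch=1)
# MQA: all query heads share a single K/V head; the number of *query*
# heads is unchanged, only the cached K/V shrink
mqa = kv_cache_bytes(n_layers=96, n_kv_heads=1, head_dim=128, seq_len=8192, batch=1)
print(f"MHA KV cache: {mha / 2**30:.1f} GiB, MQA: {mqa / 2**30:.2f} GiB per sequence")
```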
Can anyone provide an alternative link to <a href="https://twitter.com/i/web/status/1678545170508267522" rel="nofollow noreferrer">https://twitter.com/i/web/status/1678545170508267522</a>?<p>I've never registered for Twitter and I'd rather not start now (though I probably will if it's the only way to get the leaked GPT-4 training details).
The tweet is gone. What was in it?<p>Also, I'm dubious about this unsubstantiated claim. The biggest past innovation (training with human feedback) actually shrank the size of model needed. Compare BLOOM-176B with Falcon-40B (which is much better). I would be mildly surprised if it turned out GPT-4 has 1.8T parameters (even if it's a composite model, as they say).<p>The article says they use 16 experts of ~111B parameters each. So the best thing to assume is probably that each of these experts is basically a fine-tuned version of the same base model for some problem domain.
>If their cost in the cloud was about $1 per A100 hour, the training costs for this run alone would be about $63 million.<p>If someone legitimate put together a crowd funding effort, I would donate a non-insignificant amount to train an open model. Has it been tried before?
The fact they are using MoE is interesting. There are a lot of specialised open-source models on HuggingFace. You just need an LLM to act as the core "brain" and a few other components.<p>HuggingGPT works along similar lines: it automatically chooses, downloads and runs the right "expert" model from the HuggingFace Hub. <a href="https://arxiv.org/abs/2303.17580" rel="nofollow noreferrer">https://arxiv.org/abs/2303.17580</a>
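A toy sketch of that kind of routing (this is not HuggingGPT's actual implementation; HuggingGPT prompts the LLM itself to plan tasks and pick models from the Hub, whereas the keyword router and model choices below are placeholders):

```python
# Toy HuggingGPT-style routing: a "brain" picks a specialised model for the
# task, then the chosen model does the work.
from transformers import pipeline

CANDIDATES = {
    "summarization": "facebook/bart-large-cnn",
    "translation_en_to_fr": "t5-base",
    "sentiment-analysis": "distilbert-base-uncased-finetuned-sst-2-english",
}

def route(request: str) -> str:
    # HuggingGPT does this step by prompting the LLM; a keyword match stands in here.
    text = request.lower()
    if "summar" in text:
        return "summarization"
    if "translat" in text:
        return "translation_en_to_fr"
    return "sentiment-analysis"

task = route("Please summarize this article for me")
expert = pipeline(task, model=CANDIDATES[task])
print(expert("Mixture-of-experts models route each token to a few experts, "
             "while systems like HuggingGPT route whole requests to whole models.")[0])
```

Worth noting the difference in granularity, though: MoE in the GPT-4 sense routes individual tokens inside one network, while HuggingGPT routes entire requests to entire models.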
I wonder what the legal implications of them using SciHub and Libgen would be if that's true. I'd imagine OpenAI is big enough to make deals with publishers.
"Open" AI, a charity to benefit us all by pushing and publishing the frontier of scientific knowledge.<p>Nevermind, fuckers, actually it's just to take your jobs and make a few VCs richer. We'll keep the science a secret and try to pressure the government into making it illegal for you to compete with us.<p><a href="https://github.com/ggerganov/llama.cpp">https://github.com/ggerganov/llama.cpp</a><p><a href="https://github.com/openlm-research/open_llama">https://github.com/openlm-research/open_llama</a><p><a href="https://huggingface.co/TheBloke/open-llama-7b-open-instruct-GGML" rel="nofollow noreferrer">https://huggingface.co/TheBloke/open-llama-7b-open-instruct-...</a><p><a href="https://huggingface.co/TheBloke/open-llama-13b-open-instruct-GGML" rel="nofollow noreferrer">https://huggingface.co/TheBloke/open-llama-13b-open-instruct...</a><p>You can use the above without paying OpenAI. You don't even need a GPU. There are no license issues like with the facebook llama.
"*The post about GPT-4's architecture had been removed due to a copyright claim.", <a href="https://twitter.com/Yampeleg/status/1678582275561103360" rel="nofollow noreferrer">https://twitter.com/Yampeleg/status/1678582275561103360</a>
> This, of course, is “only” a batch size of 7.5 million tokens per expert due to not every expert seeing all tokens.<p>> Mixture of Expert Tradeoffs: There are multiple MoE tradeoffs taken: For example, MoE is incredibly difficult to deal with on inference because not every part of the model is utilized on every token generation.<p>Are these experts able to communicate among themselves within one query? How do they get selected? How do they know who to pass information to?<p>Would I be able to influence the selection of experts by how I phrase my questions? For example, could I ensure that a question about code gets passed directly to an expert on code? I feel silly asking this, but I honestly have no idea how to interpret any of it.
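Since part of the question is how experts get selected: in the usual MoE setups it's a small learned gate that scores every token and sends it to the top-k experts, so there's no explicit hand-off or negotiation between experts, and you can't pick an expert directly (at best, the content of your prompt changes the token representations the gate sees). Whether GPT-4 works this way is speculation, but here's a minimal sketch of GShard/Switch-style top-k gating:

```python
import torch
import torch.nn.functional as F
from torch import nn

class TopKMoE(nn.Module):
    """Minimal top-k gated mixture-of-experts layer (GShard/Switch-style sketch)."""

    def __init__(self, d_model=512, n_experts=16, k=2):
        super().__init__()
        self.gate = nn.Linear(d_model, n_experts)      # learned router
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)])
        self.k = k

    def forward(self, x):                              # x: (tokens, d_model)
        scores = F.softmax(self.gate(x), dim=-1)       # per-token expert scores
        topv, topi = scores.topk(self.k, dim=-1)       # each token keeps k experts
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = topi[:, slot] == e              # tokens routed to expert e
                if mask.any():
                    out[mask] += topv[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

layer = TopKMoE()
tokens = torch.randn(10, 512)
print(layer(tokens).shape)                             # torch.Size([10, 512])
```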
Recently I was saying how much amazing stuff there is in retro computing. One thing that keeps coming to mind for me lately is just how visionary the Thinking Machines Connection Machine supercomputer architecture was, with its massive parallelism built in and neural network applications being a key predicted use case at the time. That was so long ago!<p>Interesting to think about in comparison to the challenges today around parallelizing 'commodity' GPUs. Scare quotes because the A100 and H100 are pretty impressive machines in and of themselves.
> The conspiracy theory that the new GPT-4 quality had been deteriorated might be simply because they are letting the oracle model accept lower probability sequences from the speculative decoding model.<p>Whether or not this specific theory is true, something along these lines seems like the most likely explanation for the quality degradation that many have noticed, where OpenAI's claims about not changing the model are both technically true and completely misleading.
It is a bit problematic if it is being trained on copyrighted textbooks without compensation for the authors.
Even for open-source science, I think it is a bit unethical if OpenAI is using publicly funded research without attribution or compensation. Taxpayers paid for those NIH grants, you know...
I've previously noticed when playing with GPT-4 that it can sometimes 'autocomplete' on different sections of the text it's feeding back, sometimes what looks like 4 or more different sections. This might be unrelated, but is that MoE in action, or just them streaming the response in some way?
> Part of this extremely low utilization is due to an absurd number of failures requiring checkpoints that needed to be restarted from.<p>Hahahaha, a familiar truth for anyone who has worked with quanty types running Python code at scale on a cluster
There's a section at the end where there is speculation on what the entire dataset entails. My guess is a chunk of it is probably from ChatGPT data (or GPT3 data from when training on your requests was opt-out rather than opt-in).
No, this is fake, a light dusting of nothing on top of a meme post that was circulating in grifting communities as early as Q4 2022. It gains a little bit in every retelling; it's sort of impressive to see it reach almost blog scale.
I wonder if any open-source MoE models are being worked on. Could I run an 8x13B model on my 16GB graphics card, only loading the expert that is needed for each run?
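Rough memory math for that idea (a sketch assuming 4-bit quantisation; the 8x13B configuration is hypothetical): a 13B expert is about 6 GiB, so a couple would fit, but the catch is that in a per-token-routed MoE the active experts change every token and every layer, so paging experts in and out of VRAM only helps if routing happens per request rather than per token.

```python
# Back-of-the-envelope VRAM math for a hypothetical 8x13B MoE, 4-bit quantised.
params_per_expert = 13e9
bytes_per_param = 0.5                            # ~4-bit weights
gib_per_expert = params_per_expert * bytes_per_param / 2**30
print(f"~{gib_per_expert:.1f} GiB per expert")   # ~6 GiB: two experts fit in 16 GiB,
                                                 # all eight (~48 GiB) do not
```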
If it was trained on CS textbooks, they weren't very good ones. I asked it (GPT-4) to write a quantum computer algorithm to square a number. It very confidently told me that to simplify the problem it would use two bits. Okay, fine. But then the algorithm it (again confidently) implemented did a left shift (which it reminded me was multiplying by 2, so it definitely intended this!) and then added the number to itself. It then wrote that in terms of QC gates. Tada! It took me half a beat to realize that rather than this being some new version of squaring a number that I somehow wasn't aware of, it's completely wrong. It only works on 00! Confronted, of course it did the usual "So sorry... I guess I don't know how to do this" dance. I don't get why anyone thinks this thing is worth anything at all, except for cheating on creative writing tests.
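For what it's worth, the classical arithmetic it reinvented (exactly what got added is my reading of the transcript, but neither interpretation squares anything):

```python
# A left shift is 2*x; adding the original number gives 3*x, adding the
# shifted value again gives 4*x -- neither equals x*x except at zero
# (3*x also happens to coincide with x*x at x = 3).
for x in range(4):                    # the two-(qu)bit inputs 00..11
    print(f"x={x:02b}  2x+x={3*x}  2x+2x={4*x}  x^2={x*x}")
```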