
GPT4 is 8 x 220B params = 1.7T params

388 points by georgehill almost 2 years ago

23 comments

andreyk almost 2 years ago
weird title, note that the tweet said "so yes, GPT4 is *technically* 10x the size of GPT3, and all the small circle big circle memes from January were actually... in the ballpark?"

It's really 8 models that are 220B, which is not the same as one model that is 1.7T params. There have been 1T+ models via mixtures of experts for a while now.

Note also the follow up tweet: "since MoE is So Hot Right Now, GLaM might be the paper to pay attention to. Google already has a 1.2T model with 64 experts, while Microsoft Bing's modes are different mixes accordingly"

There is also this linked tweet https://twitter.com/LiamFedus/status/1536791574612303872 - "They are all related to Switch-Transformers and MoE. Of the 3 people on Twitter, 2 joined OpenAI. Could be related, could be unrelated"

Which links to this tweet: "Today we're releasing all Switch Transformer models in T5X/JAX, including the 1.6T param Switch-C and the 395B param Switch-XXL models. Pleased to have these open-sourced!"

Anyway... remember not to just read the headlines, they can be misleading.
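To put the "8 x 220B is not the same as one 1.7T model" point in numbers, a hedged back-of-envelope (the expert count and size come from the headline claim; the two-experts-per-token figure is just the common MoE routing pattern used by e.g. GLaM, an assumption rather than anything confirmed about GPT-4):

```python
# Total parameters vs. parameters actually exercised per token in a
# mixture-of-experts model (illustrative; assumes top-2 routing).
n_experts = 8
expert_params = 220e9
active_experts_per_token = 2          # assumption: typical MoE top-k routing

total_params = n_experts * expert_params                    # ~1.76e12 -> the "1.7T" headline
active_params = active_experts_per_token * expert_params    # ~0.44e12 per forward pass

print(f"total:  {total_params / 1e12:.2f}T parameters")
print(f"active: {active_params / 1e12:.2f}T parameters per token")
```

The headline number counts everything that sits in memory; a dense 1.7T model would push every token through all of it.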
SheinhardtWigCo almost 2 years ago
GPT-4 is 1.7T params in the same way that an AMD Ryzen 9 7950X is 72 GHz.
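The arithmetic behind the analogy, as a back-of-envelope (16 cores at ~4.5 GHz are the commonly quoted 7950X specs; the expert numbers come from the headline claim):

```python
# Summing per-unit figures produces a large but not very meaningful total,
# which is the point of the analogy.
cores, clock_ghz = 16, 4.5        # Ryzen 9 7950X: 16 cores at ~4.5 GHz base clock
experts, params_b = 8, 220        # claimed GPT-4: 8 experts of ~220B parameters each

print(cores * clock_ghz)          # 72.0 "GHz"      -- no single core runs at 72 GHz
print(experts * params_b / 1000)  # 1.76 "T params" -- no single forward pass touches them all
```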
bluedevilzn almost 2 years ago
I wouldn’t trust anything geohot says. He doesn’t have access to any inside information.
eightysixfour almost 2 years ago
I find it interesting that geohot says it is what you do “when you are out of ideas.” I can’t help but think that having multiple blended models is what makes GPT-4 seem like it has more “emergent” behavior than earlier models.
andy_xor_andrew almost 2 years ago
Are the models specifically trained to be experts in certain domains?

Or are the models all trained on the same corpus, but just queried with different parameters?

Is this functionally the same as beam search?

Do they select the best output on a token-by-token basis, or do they let each model stream to completion and then pick the best final output?
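A toy sketch of the two selection strategies the last question contrasts, with a stand-in ToyExpert class instead of real models (everything here is hypothetical, just to make the shapes of the two approaches concrete):

```python
import random

class ToyExpert:
    """Stand-in for one expert model; emits placeholder tokens with a pseudo-random confidence."""
    def __init__(self, name):
        self.name = name
        self.rng = random.Random(name)

    def next_token(self, context):
        # Returns (confidence, token); higher confidence = this expert "likes" its token more.
        return self.rng.random(), f"<{self.name}:{len(context)}>"

experts = [ToyExpert(f"e{i}") for i in range(8)]

def route_per_token(prompt, steps=4):
    """Token-by-token: at each step ask every expert, keep the highest-confidence token.
    Per-token routing is roughly what an MoE does (with a learned router, not a vote)."""
    out = list(prompt)
    for _ in range(steps):
        confidence, token = max(e.next_token(out) for e in experts)
        out.append(token)
    return out

def pick_best_completion(prompt, steps=4):
    """Completion-level: every expert writes its own continuation, then one winner is kept
    (a best-of-n / ensemble strategy rather than an MoE)."""
    completions = []
    for e in experts:
        out, total = list(prompt), 0.0
        for _ in range(steps):
            confidence, token = e.next_token(out)
            total += confidence
            out.append(token)
        completions.append((total, out))
    return max(completions)[1]

print(route_per_token(["Hello"]))
print(pick_best_completion(["Hello"]))
```

Neither of these is beam search, which keeps several partial hypotheses from a single model rather than choosing among separate models.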
esperent almost 2 years ago
So *if* this is true - which is a big if since this looks like speculation rather than real information - could this work with even smaller models?

For example, what about 20 x 65B = 1.3T params? Or 100 x 13B = 1.3T params?

Hell, what about 5000 x 13B params? Thousands of small highly specialized models, with maybe one small "categorization" model as the first pass?
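A rough sketch of the "small categorization model as the first pass" idea; the domain names, model names, and keyword classifier below are hypothetical placeholders for whatever a real first-pass router would be:

```python
# Hypothetical two-stage dispatch: a cheap classifier picks a domain,
# then only that domain's small specialist model needs to run.
SPECIALISTS = {            # e.g. thousands of ~13B models in the comment's scenario
    "code":    "specialist-code-13b",
    "math":    "specialist-math-13b",
    "general": "specialist-general-13b",
}

def classify(prompt: str) -> str:
    """Toy stand-in for the small first-pass categorization model."""
    if "def " in prompt or "{" in prompt:
        return "code"
    if any(ch.isdigit() for ch in prompt):
        return "math"
    return "general"

def answer(prompt: str) -> str:
    domain = classify(prompt)
    model = SPECIALISTS[domain]          # only this one model's weights are needed
    return f"[{model}] would handle: {prompt!r}"

print(answer("def fib(n):"))
print(answer("What is 17 * 23?"))
```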
it_citizen almost 2 years ago
« We can’t really make models bigger than 220B parameters »

Can someone explain why?
MuffinFlavored almost 2 years ago
> GPT-4: 8 x 220B experts trained with different data/task distributions and 16-iter inference.

There was a post on HackerNews the other day about a 13B open source model.

Any 220B open source models? Why or why not?

I wonder what the 8 categories were. I wonder what goes into identifying tokens and then trying to guess which category/model you should look up. What if tokens go between two models, how do the models route between each other?
ramraj07 almost 2 years ago
If it’s trivial then why does every other competitor suck at replicating it? Is it possible this is just a case of sour grapes that this intellectual is annoyed they’re not at the driving wheel of the coolest thing anymore?
LarsDu88 almost 2 years ago
That's on the order of 25 4090 GPUs to run inference. Not a crazy number by any means. We will see consumer robots running that by the end of the decade, mark my words.
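One hedged way to land in that ballpark, with the assumptions spelled out (8-bit weights, 24 GB per 4090, and only a couple of experts active per query; none of this is public information about the actual deployment):

```python
# Back-of-envelope VRAM math for the "~25 GPUs" figure. All assumptions,
# not facts from the thread: 1 byte/param (int8), RTX 4090 = 24 GB,
# and either all 8 x 220B experts resident or only 2 active per query.
GB = 10**9
gpu_vram = 24 * GB

all_experts    = 8 * 220e9 * 1    # ~1.76 TB if everything must be in memory
active_experts = 2 * 220e9 * 1    # ~440 GB if only two experts serve a query

print(all_experts / gpu_vram)     # ~73 GPUs to hold every expert
print(active_experts / gpu_vram)  # ~18 GPUs for two experts (+ KV cache/overhead -> ~25)
```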
lyu07282 almost 2 years ago
full podcast here: https://www.latent.space/p/geohot
refulgentis almost 2 years ago
We're sourcing off Geohot? Yikes.
janalsncm almost 2 years ago
At a minimum he glossed over the multimodal capabilities of GPT-4. If they use the same set of tokens, it’s unclear how this doesn’t pollute text training data. If they use separate tokens, the model size should be bigger.
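As a rough sense of scale for the separate-tokens case, a hypothetical calculation (the hidden size and image-token count are illustrative, not known GPT-4 figures): each extra token id adds one embedding row, and usually one output-head row, of size d_model.

```python
# Illustrative only: how much an extra image-token vocabulary adds.
d_model = 12_288          # assumed hidden size, not a known GPT-4 figure
image_tokens = 16_384     # e.g. a discrete image codebook, also an assumption

extra_params = image_tokens * d_model * 2   # input embedding rows + output head rows
print(f"~{extra_params / 1e6:.0f}M extra parameters")   # ~403M
```

Noticeable, but small next to the parameter counts in the headline claim.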
generalizations almost 2 years ago
Reminds me of this: https://en.wikipedia.org/wiki/Society_of_Mind

> In his book of the same name, Minsky constructs a model of human intelligence step by step, built up from the interactions of simple parts called agents, which are themselves mindless. He describes the postulated interactions as constituting a "society of mind", hence the title.
jiggywiggy almost 2 years ago
I often hear the idea that digital is faster than biology. This seems mostly derived from small math computations.

Yet it seems the current form of large language computation is much, much slower than our biology. Making it even larger will be necessary to come closer to human levels, but what about the speed?

If this is the path to GI, the computational levels need to be very high and very centralized.

Are there ways to improve this in its current implementation other than cache & more hardware?
chriskanan almost 2 years ago
This seems pretty consistent with what Sam Altman has said in past interviews regarding the end of continuously increasing scale and having multiple smaller specialist models: https://finance.yahoo.com/news/openai-sam-altman-says-giant-164924270.html
donkeyboy almost 2 years ago
I think this could be fake. He says it’s an MoE model, but then explains that it’s actually a blended ensemble. Anyone else have thoughts on that?
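For reference, the distinction this comment is drawing, as a minimal numerical sketch (random matrices stand in for real experts, and the gate is untrained, purely illustrative): an MoE routes each token through a few gated experts, while a blended ensemble runs everything and averages.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_experts, top_k = 16, 8, 2

x = rng.normal(size=d)                                # one token's hidden state
experts = [rng.normal(size=(d, d)) for _ in range(n_experts)]
gate_w = rng.normal(size=(n_experts, d))              # the learned router in a real MoE

def moe(x):
    """Mixture of experts: the gate scores all experts, only the top-k actually run,
    and their outputs are combined with the softmaxed gate weights."""
    scores = gate_w @ x
    top = np.argsort(scores)[-top_k:]
    weights = np.exp(scores[top]) / np.exp(scores[top]).sum()
    return sum(w * (experts[i] @ x) for w, i in zip(weights, top))

def blended_ensemble(x):
    """Blended ensemble: every expert runs on every input and the outputs are averaged."""
    return sum(e @ x for e in experts) / n_experts

print(moe(x).shape, blended_ensemble(x).shape)
```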
letitgo12345 almost 2 years ago
What's 16 iter inference?
sylware almost 2 years ago
Is this still orders of magnitude smaller than a human brain?

How many? Based on current human neurons/synapses knowledge?
jerpint almost 2 years ago
I really wonder if it is the case that the image processing is simply more tokens appended to the sequence. It would make the most sense from an architecture perspective; training must be a whole other ballgame of alchemy though.
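A minimal sketch of what "more tokens appended to the sequence" could look like, in the style of a ViT patch embedding; every shape here, and the random projection standing in for a learned one, is an illustrative assumption rather than anything confirmed about GPT-4:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, patch = 512, 16

def embed_image(img):
    """Cut the image into patches and project each patch to d_model,
    so every patch becomes one 'token' embedding."""
    H, W, C = img.shape
    proj = rng.normal(size=(patch * patch * C, d_model)) * 0.02   # stands in for a learned projection
    patches = (img.reshape(H // patch, patch, W // patch, patch, C)
                  .transpose(0, 2, 1, 3, 4)
                  .reshape(-1, patch * patch * C))
    return patches @ proj                                 # (num_patches, d_model)

text_tokens  = rng.normal(size=(10, d_model))             # 10 already-embedded text tokens
image_tokens = embed_image(rng.random((224, 224, 3)))     # 14 * 14 = 196 patch tokens

sequence = np.concatenate([text_tokens, image_tokens])    # fed to the same transformer
print(sequence.shape)                                     # (206, 512)
```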
bigyikes almost 2 years ago
I like how geohot takes these concepts that seem impossibly complex to an outsider (mixture of models, multi modality) and discusses them so casually that they seem accessible to anyone. The flippancy is refreshing.
Havoc almost 2 years ago
Interesting. And makes sense. E.g. I could see one of the eight being very closely focused and trained on GitHub-like data. Could help it stay on task too.
reportgunner almost 2 years ago
Sad that there is a grammar error lining the top of the video.