
GigaGPT: GPT-3 sized models in 565 lines of code

223 points by georgehill over 1 year ago

16 comments

icyfox over 1 year ago
For those hearing about Cerebras for the first time: they make a chipset that's similar to a GPU in matrix-multiplication speed but way bigger (a whole wafer), so it can fit more transistors and memory onto one chip. They achieve this small LOC count because they don't need to shard across multiple devices, consolidate backprop on a central CPU, etc. Those tricks are usually what blows a project up from a single-architecture proof of concept into a robust training pipeline that can handle the billions of parameters in modern models. This is more akin to training a whole model on a single GPU because... it kind of is.

Even with a wafer-scale chipset this approach has limits. You will eventually still need to shard to fit more parameters, use different training modalities, etc. I'd look at this more as a proof of concept for the ergonomics of what LLM training can look like when you have access to a much larger compute primitive, rather than a new state of the art in feature-equivalent clean code.

Disclaimer: I'm a small investor in Cerebras.

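To make the parent's point concrete, here is a minimal sketch (names like `model`, `optimizer`, and `batch` are assumptions for illustration, not from the article) of what a training step looks like when everything fits on one device; the 20k-LOC frameworks exist largely to split this loop's state across many smaller devices:

```python
import torch
import torch.nn.functional as F

def train_step(model, optimizer, batch):
    inputs, targets = batch
    logits = model(inputs)            # full forward pass: no pipeline stages
    loss = F.cross_entropy(
        logits.view(-1, logits.size(-1)), targets.view(-1)
    )
    loss.backward()                   # full backward pass: no gradient sharding
    optimizer.step()                  # optimizer state lives on the same device
    optimizer.zero_grad()
    return loss.item()
```
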
filterfiber over 1 year ago
I don't understand why they're comparing parameter sizes to lines of code.

AFAIK you can just increase the layer parameters of a 1B model to whatever you want? Like, the difference between a 1B and a 175B model can be just changing a few numbers, without adding any LOC at all.

LOC has never been the limitation for large models; it's the compute and training data required.

Most of the LOC is spent on optimization, and they don't address MoE or anything fancy like that?

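A sketch of that point (the formula is the usual rough estimate for decoder-only transformers; exact counts depend on biases, layer norms, and embedding tying):

```python
from dataclasses import dataclass

@dataclass
class GPTConfig:
    n_layer: int
    n_head: int
    n_embd: int
    vocab_size: int = 50257

def approx_params(cfg: GPTConfig) -> int:
    # ~12 * n_layer * n_embd^2 for attention + MLP blocks, plus embeddings
    return 12 * cfg.n_layer * cfg.n_embd**2 + cfg.vocab_size * cfg.n_embd

# GPT-3-paper-style sizes: only the numbers change, not the code.
small = GPTConfig(n_layer=24, n_head=16, n_embd=2048)    # ~1.3B params
large = GPTConfig(n_layer=96, n_head=96, n_embd=12288)   # ~175B params
print(f"{approx_params(small)/1e9:.1f}B -> {approx_params(large)/1e9:.1f}B")
```
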
Voloskaya over 1 year ago
Distributed training infra/libs have made insane progress since the Megatron era. I worked with the Megatron codebase to train larger-than-175B models a few years back; a lot of the boilerplate that you find in those 20k LoC you could remove today by just importing deepspeed or other distributed training libs.

Cerebras' point still stands, though: even if you can get the LoC count down significantly nowadays, it's still a major PITA to debug those systems, deal with nodes crashing, tweak the architecture and the data-loading pipeline to keep GPU utilization high, optimize network bottlenecks, etc. Scaling vertically first like Cerebras is doing surely makes that much easier.

On a tangentially related note, this is imho where OpenAI has built its moat: the training and inference stack they have refined over the last 6 years. They have good researchers, but so do MS, Google, and Meta. No one else has the ability to train such large models with such ease. Same for the inference stack: being able to run GPT-3.5/4 in prod at the scale at which they are doing it is no joke, and I'm 100% convinced this is why Gemini is still not widely available a year after 3.5 came out.

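For instance, a rough sketch of the import-a-library version (config values are illustrative; `model` and `dataloader` are assumed defined): DeepSpeed's ZeRO stage 3 shards parameters, gradients, and optimizer state behind a couple of calls:

```python
import deepspeed

ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 8,
    "bf16": {"enabled": True},
    "zero_optimization": {"stage": 3},   # shard params, grads, optimizer state
}

model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)

for batch in dataloader:
    loss = model_engine(batch)       # forward
    model_engine.backward(loss)      # gradient sharding/all-reduce handled inside
    model_engine.step()              # optimizer step + LR schedule
```
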
jwan584 over 1 year ago
Everyone knows Cerebras for their wafer-scale chips. The less understood part is the 12TB of external memory. That's the real reason large models fit by default and you don't have to chop them up in software a la Megatron/DeepSpeed.

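A back-of-the-envelope check on why 12 TB is the interesting number (the 16 bytes/param figure is the usual mixed-precision-Adam estimate, not from the article):

```python
def training_footprint_gb(n_params: float, bytes_per_param: int = 16) -> float:
    # ~2 (fp16 weights) + 2 (fp16 grads) + 4 (fp32 master weights)
    # + 4 (Adam momentum) + 4 (Adam variance) = 16 bytes per parameter
    return n_params * bytes_per_param / 1e9

print(f"{training_footprint_gb(175e9):,.0f} GB")  # ~2,800 GB: far beyond any
                                                  # single GPU, well inside 12 TB
```
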
realityloop over 1 year ago
According to https://www.anandtech.com/show/16626/cerebras-unveils-wafer-scale-engine-two-wse2-26-trillion-transistors-100-yield the WSE-1 was $2 million, so I expect the WSE-2 costs an arm and a leg too.

lopuhin over 1 year ago
Strange that they don't mention performance: how long does one step take, and how does it compare to a similarly priced GPU cluster? Sure, simple code is good, but it also needs to be useful.

101008 over 1 year ago
Would it ever be possible to run a GPT-{n}, n>3, similar model on a home computer without a GPU? I have a "good" laptop with 32GB and a good processor, but no GPU (I was never interested in gaming, crypto, or ML). I find GPT very useful, and I'd prefer to run a local version instead of continuing to feed OpenAI.

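For scale, a rough RAM estimate (assumed model sizes and the common weights-only approximation): CPU inference is bounded mainly by memory, and 4-bit quantization puts mid-size open models within reach of a 32 GB machine:

```python
def weights_ram_gb(n_params: float, bits: int = 4) -> float:
    # weights only; the runtime adds KV cache and activation overhead on top
    return n_params * bits / 8 / 1e9

for n_params, name in [(7e9, "7B"), (13e9, "13B"), (70e9, "70B")]:
    print(f"{name}: ~{weights_ram_gb(n_params):.1f} GB at 4-bit")
# 7B (~3.5 GB) and 13B (~6.5 GB) fit comfortably in 32 GB; 70B (~35 GB) does not
```
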
kkzz99 over 1 year ago
Looks like hardware vs. software abstraction. Considering the perspective of an LLM startup: would you rather write 20k LOC of complex code that lets you switch hardware platforms more easily, or write 600 LOC of less complex code and be pinned to a single provider?

natch over 1 year ago
Ignorant question: why are we interested in training models much smaller than GPT-4? For academic reasons? I understand training on specific domains, but isn't that covered by fine-tuning, with much less compute?

blobbers over 1 year ago
I'm curious whether these low-code models matter. I understand that small codebases can be cached effectively, speeding up computation, but isn't data loading the bottleneck in training?

Furthermore, how important is the breadth of data in the dataset to getting the desired results? I was under the impression that the main reason these LLMs work is massive datasets.

As such, are there data-breadth metrics to validate whether training on a given dataset is even worthwhile? (i.e., to avoid sunk cost on a dataset that will yield a poorly performing LLM)

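On the data-loading question, a quick bandwidth sanity check (all numbers assumed for illustration): even at very high training throughput, the raw token stream is tiny, so the pipeline pain is usually shuffling, tokenization, and keeping many workers fed rather than raw bytes per second:

```python
effective_flops = 1e18    # assumed aggregate training throughput (FLOP/s)
n_params = 175e9
tokens_per_s = effective_flops / (6 * n_params)  # invert the 6*N*D cost estimate
bytes_per_s = tokens_per_s * 2                   # ~2 bytes per stored token
print(f"{tokens_per_s:,.0f} tokens/s ≈ {bytes_per_s/1e6:.1f} MB/s of input data")
# ~952,381 tokens/s ≈ 1.9 MB/s: trivial next to the compute involved
```
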
nojvek over 1 year ago
nanoGPT & micrograd are masterpieces of code. Truly god-level code.

Andrej Karpathy is truly a gem, and I'm super grateful he still publishes videos showing his art.

Cerebras showing their distributed architecture on that same piece of code is impressive.

All of AI is a search for a god algorithm: an algorithm so simple it could be written on an A4 piece of paper in 12px font, but with enough data and compute it can be more intelligent than entire cities of humans combined.

NanoGPT is a glimpse of that.

leobg over 1 year ago
I would've been interested to learn how much it costs to train these models on their platform. Like, a 70B model: are we talking millions of dollars here?

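Cerebras doesn't publish a figure, but a rough compute-cost sketch (all assumptions mine: a Chinchilla-style 20 tokens per parameter, the standard 6*N*D FLOPs estimate, and illustrative GPU pricing) does land in the millions:

```python
n_params = 70e9
tokens = 20 * n_params                 # Chinchilla-style token budget: 1.4T tokens
train_flops = 6 * n_params * tokens    # standard 6*N*D training-cost estimate

gpu_flops = 312e12 * 0.4               # A100 bf16 peak at an assumed 40% utilization
gpu_hours = train_flops / gpu_flops / 3600
print(f"~{gpu_hours/1e6:.1f}M GPU-hours, ~${gpu_hours*2/1e6:.1f}M at $2/GPU-hour")
# ~1.3M GPU-hours, roughly $2.6M
```
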
syntaxing over 1 year ago
Inverted Y axis?! I get that marketing probably wanted them in the top right corner as "best", but it makes me cringe seeing this.

I_am_tiberius over 1 year ago
I assume the 565 lines of code refers to the number of lines of native code (not counting the lines of the libraries it imports).

whimsicalism over 1 year ago
Yes, transformers are very simple, but typically the additional lines of code are doing useful work. The comparison with Nvidia Megatron is particularly ridiculous, imo.

I don't see the novelty/interesting bit in this article, personally.

voz_ over 1 year ago
If I see "import torch" in your models, is it really 565 LOC?