
Building Meta's GenAI infrastructure

664 points by mootpt about 1 year ago

35 comments

danielhanchen · about 1 year ago
float8 got a mention! 2x more FLOPs! Also, xformers has 2:4 sparsity support now, so another 2x? Is Llama 3 going to use float8 + 2:4 sparsity for the MLP, so 4x the H100's float16 FLOPs? PyTorch has experimental fp8 support, whilst attention is still tricky to do in float8 due to precision issues, so maybe attention is in float16, RoPE / layernorms in float16 / float32, and everything else in float8?
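To make the multipliers concrete, here is a rough back-of-the-envelope in Python. The baseline is NVIDIA's published peak dense BF16 Tensor Core throughput for the H100 SXM (roughly 990 TFLOPS; treat it as an assumed ballpark), and the 2x factors are the nominal FP8 and 2:4-sparsity gains, which real workloads rarely reach:

```python
# Sketch of the speedup arithmetic in the comment above. Spec numbers are
# assumed ballpark peaks for the H100 SXM; real utilization is far lower.

H100_BF16_DENSE_TFLOPS = 990   # peak dense BF16 Tensor Core throughput (approx.)
FP8_FACTOR = 2                 # FP8 nominally doubles peak Tensor Core FLOPs
SPARSITY_24_FACTOR = 2         # 2:4 structured sparsity nominally doubles again

peak = H100_BF16_DENSE_TFLOPS * FP8_FACTOR * SPARSITY_24_FACTOR
print(f"FP8 + 2:4 sparsity peak: {peak} TFLOPS "
      f"({peak / H100_BF16_DENSE_TFLOPS:.0f}x dense BF16)")
# -> 3960 TFLOPS, i.e. the "4x H100 float16 FLOPs" the comment refers to,
#    and only for the layers (e.g. the MLPs) that can actually run in that mode.
```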
dougdonohoe · about 1 year ago
Having lived through the dot-com era, I find the AI era slightly dispiriting because of the sheer capital cost of training models. At the start of the dot-com era, anyone could spin up an e-commerce site with relatively little infrastructure cost. Now, it seems, only the hyperscale companies can build these AI models: Meta, Google, Microsoft, OpenAI, etc.
islewis · about 1 year ago
I know we won't get this from FB, but I'd be really interested to see how the relationship of compute power to engineering hours scales.

They mention custom-building as much as they can. If FB magically had the option to 10x the compute power, would they need to re-engineer the whole stack? What about 100x? Is each of these rewrites just a rewrite, or is it a whole order of magnitude more complex?

My technical understanding of what's under the hood of these clusters is pretty surface-level; super curious whether anyone with relevant experience has thoughts.
jvanderbot · about 1 year ago
So, I'd love to work on optimizing pipelines like this. How does one "get into" it? It seems an ML scientist with some C/C++ and infra knowledge just dips down into the system when required? Or is it CUDA/SIMD experts who move "up" into ML?
fuddle · about 1 year ago
How much are they paying for H100s? If they are paying $10k: 350,000 NVIDIA H100s x $10k = $3.5B.
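The sensitivity to the assumed unit price is worth spelling out; both prices below are guesses from this thread (about $10k at the low end, about $30k "street price" per another commenter), not disclosed figures:

```python
# Back-of-the-envelope GPU capex under two guessed unit prices.
gpus = 350_000

for unit_price in (10_000, 30_000):
    total = gpus * unit_price
    print(f"{gpus:,} H100s at ${unit_price:,} each = ${total / 1e9:.1f}B")
# -> $3.5B at $10k and $10.5B at $30k, bracketing the ~$10B estimates
#    given elsewhere in the thread.
```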
gingergoat · about 1 year ago
The article doesn't mention MTIA, Meta's custom ASIC for training & inference acceleration: https://ai.meta.com/blog/meta-training-inference-accelerator-AI-MTIA/

I wonder if they will use it in RSC.
benreesman · about 1 year ago
I think it's always useful to pay attention to the history on stuff like this, and it's a rare pleasure to be able to give those interested some pointers into the literature, along with some color from first-hand experience.

I'd point the interested at the DLRM paper [1]: that was just after I left and I'm sad I missed it. FB got into disaggregated racks and SDN and such fairly early, and we already had half-U dual-socket SKUs with the SSD and (increasingly) even DRAM elsewhere in the rack in 2018, but we were doing *huge* NNs for recommenders and rankers even then. I don't know if this is considered proprietary, so I'll play it safe and just say that a click-prediction model on IG Stories in 2018 was on the order of a modest but real LLM today (at FP32!).

The crazy part is they were HOGWILD-trained on Intel AVX2, which is just wild to think about. When I was screwing around with CUDA kernels we were time-sharing NVIDIA dev boxes; typically 2-4 people doing CUDA were splitting up a single card as late as maybe 2016. I was managing what was called "IGML Infra" when I left and was on a first-name basis with the next-gen hardware people, and any NVIDIA deal was still so closely guarded that I didn't hear more than rumors about GPUs for training, let alone inference.

350k Hoppers this year, Jesus. Say what you want about Meta, but don't say they can't pour concrete and design SKUs on a dime: best damned infrastructure folks in the game, pound for pound, to this day.

The talk by Thomas "tnb" Bredillet in particular I'd recommend: one of the finest hackers, mathematicians, and humans I've ever had the pleasure to know.

[1] https://arxiv.org/pdf/1906.00091.pdf
[2] https://arxiv.org/pdf/2108.09373.pdf
[3] https://engineering.fb.com/2022/10/18/open-source/ocp-summit-2022-grand-teton/
[4] https://youtu.be/lQlIwWVlPGo?si=rRbRUAXX7aM0UcVO
DEDLINE · about 1 year ago
I wonder if Meta would ever try to compete with AWS / MSFT / GOOG for AI workloads.
mjburgess · about 1 year ago
It'd be great if they could invest in an alternative to Nvidia and, in one fell swoop, destroy the moats of everyone in the industry.
elwell · about 1 year ago
> Meta's long-term vision is to build artificial general intelligence (AGI)
hendersoon · about 1 year ago
350k H100 cards: around ten *billion* dollars just for the GPUs. Less if Nvidia gives a volume discount, which I imagine they do not.
alexsereno · about 1 year ago
Honestly, Meta is consistently one of the better companies at releasing tech-stack info or just open-sourcing things; these kinds of articles are super fun.
wseqyrku · about 1 year ago
> Commitment to open AI innovation

I see what you did there, Meta.
zone411 · about 1 year ago
Meta is still playing catch-up. It might be hard to believe, but according to Reuters they were trying to run AI workloads mostly on CPUs until 2022, and they had to pull the plug on the first iteration of their AI chip.

https://www.reuters.com/technology/inside-metas-scramble-catch-up-ai-2023-04-25/
latchkey · about 1 year ago
> we have successfully used both RoCE and InfiniBand clusters for large, GenAI workloads (including our ongoing training of Llama 3 on our RoCE cluster) without any network bottlenecks.

Interesting dig at IB. RoCE is the right solution since it is an open standard and, more importantly, available without a 52+ week lead time.
seydor · about 1 year ago
This is great news for Nvidia and their stock, but are they sure the LLMs and image models will scale indefinitely? Nature and biology have a preference for sigmoids. What if we find out that AGI requires different kinds of CPU capabilities?
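The two growth shapes being contrasted here can be written down. Empirical LLM scaling laws fit loss as a power law in compute (returns diminish but never stop), while the sigmoid worry is that capability saturates at a ceiling. A sketch of both forms, with illustrative rather than fitted symbols:

```latex
% Power-law scaling (Chinchilla-style fit): loss L decays toward an
% irreducible floor L_inf as training compute C grows.
L(C) = L_{\infty} + a\,C^{-b}, \qquad a, b > 0

% Logistic (sigmoid) alternative: capability K saturates at a ceiling
% K_max, with C_0 the midpoint in log-compute.
K(C) = \frac{K_{\max}}{1 + e^{-k(\log C - \log C_{0})}}
```

The two curves look similar until well past the sigmoid's midpoint, which is part of why the question is hard to settle from today's data alone.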
spencerchubb · about 1 year ago
All this compute and my Instagram Reels feed still isn't as good as my TikTok feed.
mrkramer · about 1 year ago
"Share this: Hacker News". Noice.
pinko · about 1 year ago
The link mentions "our internal job scheduler" and how they had to optimize it for this work. Does anyone know what this job scheduler is called, or how it works?
zerop · about 1 year ago
> At Meta, we handle hundreds of trillions of AI model executions per day

Such a large number; does it make sense?
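A quick plausibility check. The daily-user count below is a rough public figure (an assumption here); the executions figure comes from the quote:

```python
# Plausibility check on "hundreds of trillions of AI model executions per day".
executions_per_day = 200e12  # "hundreds of trillions"; take 200T as a midpoint
daily_users = 3e9            # rough public figure for daily users across Meta's apps

per_user = executions_per_day / daily_users
print(f"~{per_user:,.0f} model executions per user per day")
# -> ~66,667. Plausible once you count every ranking, recommendation, ad,
#    and integrity model invoked for each item a user scrolls past.
```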
ilaksh · about 1 year ago
"Everything You Wanted to Know About GenAI at Meta, Except the One Thing You Honestly Care About" (Llama 3).
dekhn · about 1 year ago
It's really interesting just how similar these systems are to the designs adopted for HPC over the past few decades. I'm salty because it took a while for the ML community to converge on this (20K+ GPUs connected by a real fabric with low latency and high bandwidth).
sashank_1509 · about 1 year ago
Meta is backing itself into a corner with its admirable commitment to open source. Unfortunately, at some point, when they decide to monetize their billions spent and try to release a closed-source model, the level of vitriol they will deal with will be an order of magnitude above what even OpenAI is experiencing. I don't think they realize that!
marmaduke · about 1 year ago
Just for comparison, CSCS's new Alps system in Switzerland will get 5k GH200 nodes (each with an H100).
dazhbog · about 1 year ago
Searched for "H100" and an Amazon link popped up. Good reviews.

https://www.amazon.com/Tesla-NVIDIA-Learning-Compute-Graphics/dp/B0C3XH4QSJ#customerReviews
delanyoyoko · about 1 year ago
You've got to read "open" roughly 3x in a paragraph.
lvl102 · about 1 year ago
This reads more like a flex for the investment community.
codingjaguar · about 1 year ago
"By the end of 2024, we're aiming to continue to grow our infrastructure build-out that will include 350,000 NVIDIA H100 GPUs as part of a portfolio that will feature compute power equivalent to nearly 600,000 H100s." This AI game is turning into a GPU war. I've heard that Meta is pushing a lot of CPU workloads to GPU, to co-locate them with model inference for infra simplicity.
delegate · about 1 year ago
Subtitled "Here's what you'll never be able to do".
froonly · about 1 year ago
lmfao at the Meta folks not giving any credit whatsoever to the company that actually came up with and implemented the infrastructure work.
pwb25 · about 1 year ago
So tired of this; not everyone needs to work on AI stuff. Work on Facebook instead, which is a disaster of a page.
sidcool · about 1 year ago
Those are some seriously great engineering numbers. Meta, with all the negative press it receives (rightfully so), is an engineering powerhouse.

But I do wonder how they foresee monetising this.
pedrovhb · about 1 year ago
Meta seems to actually be taking all the right steps in how they're contributing to open-source AI research. Is this a "commoditize your complement" kind of situation?
CuriouslyC · about 1 year ago
Yann wants to be open and Mark seems happy to salt the earth.
choppaface · about 1 year ago
The total cluster, they say, will reach 350k H100s, which at a $30k street price is about $10B.

In contrast, Microsoft is spending over $10B per quarter on cloud capex.

That makes Zuck look conservative after his big loss on the metaverse.

https://www.datacenterdynamics.com/en/news/q3-2023-cloud-results-ai-investments-drive-up-results-and-capex/