
AI’s compute fragmentation: what matrix multiplication teaches us

122 points by tzhenghao, about 2 years ago

14 comments

BenoitP, about 2 years ago
There's hope in intermediate representations, in OpenXLA:

https://opensource.googleblog.com/2023/03/openxla-is-ready-to-accelerate-and-simplify-ml-development.html?m=1

> OpenXLA is an open source ML compiler ecosystem co-developed by AI/ML industry leaders including Alibaba, Amazon Web Services, AMD, Apple, Arm, Cerebras, Google, Graphcore, Hugging Face, Intel, Meta, and NVIDIA. It enables developers to compile and optimize models from all leading ML frameworks for efficient training and serving on a wide variety of hardware.
brrrrrm, about 2 years ago
> Hand-written assembly kernels don't scale!

I used to think this. And I think, in theory, it is true. But the fact of the matter is, modern ML just doesn't use that many kernels. Every framework uses the same libraries (BLAS) and every library uses the same basic idea (maximally saturate FMA-like units).

Large language models are being run natively on commodity hardware with code written from scratch within days of their release (e.g. llama.cpp).

From a conceptual standpoint, it's really easy to saturate hardware in this domain. It's been pretty easy since 2014, when convolutions were interpreted as matrix multiplications. Sure, the actual implementations can be tricky, but a single engineer (trained in it) can get that done for a specific piece of hardware in a couple of months.

Of course, the interesting problem is how to generalize kernel generation. I spent years working with folks trying to do just that. But, in retrospect, the actual value add from a system that does all this for you is quite low. It's a realization I've been struggling to accept :'(
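As a concrete illustration of "convolutions interpreted as matrix multiplications" (my own sketch, not the commenter's code): a minimal single-channel im2col in C++, assuming stride 1 and no padding, which lowers a convolution to one GEMM that a BLAS-style, FMA-saturating kernel can then handle.

```cpp
#include <cstddef>
#include <vector>

// Unroll every KxK patch of an HxW single-channel image into one row of a
// matrix, so that convolving with any number of KxK filters becomes a single
// matrix multiply: (patches as rows) x (filters as columns).
std::vector<float> im2col(const std::vector<float>& img, std::size_t H,
                          std::size_t W, std::size_t K) {
    const std::size_t outH = H - K + 1, outW = W - K + 1;
    std::vector<float> cols(outH * outW * K * K);
    for (std::size_t y = 0; y < outH; ++y)
        for (std::size_t x = 0; x < outW; ++x)
            for (std::size_t ky = 0; ky < K; ++ky)
                for (std::size_t kx = 0; kx < K; ++kx)
                    cols[(y * outW + x) * K * K + ky * K + kx] =
                        img[(y + ky) * W + (x + kx)];
    return cols;  // shape: (outH*outW) rows x (K*K) columns, row-major
}

// After im2col, output[p][f] = sum_k cols[p][k] * filter[f][k] -- a GEMM,
// which is exactly the shape of kernel BLAS libraries already optimize.
```

Real frameworks generalize this over channels, strides, and padding, but the reduction to a matmul is the same.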
nitwit005, about 2 years ago
> "Think about it: how can a small number of specialized experts, who hand write and tune assembly code, possibly scale their work to all the different configurations while also incorporating their work into all the AI frameworks?! It's simply an impossible task."

By committing it to a common library that a lot of people use? There are already multiple libraries with optimized matrix multiplication.

This is also exaggerating the expertise required. I'm not going to claim it's trivial, but you can genuinely google "intel avx-512 matrix multiplication" and find both papers and Intel samples.
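To give a flavor of what such samples boil down to (a hedged sketch, not Intel's code; the function name and layout assumptions are mine): an AVX-512 loop that accumulates sixteen columns of C at a time with fused multiply-adds, assuming row-major matrices, N a multiple of 16, and a compiler flag such as -mavx512f.

```cpp
#include <immintrin.h>
#include <cstddef>

// C[i][:] += sum_k A[i][k] * B[k][:], vectorized 16 floats at a time.
// Assumes row-major A (MxK), B (KxN), C (MxN) with N a multiple of 16,
// and that C has been initialized by the caller.
void matmul_avx512(const float* A, const float* B, float* C,
                   std::size_t M, std::size_t N, std::size_t K) {
    for (std::size_t i = 0; i < M; ++i) {
        for (std::size_t j = 0; j < N; j += 16) {
            __m512 acc = _mm512_loadu_ps(&C[i * N + j]);
            for (std::size_t k = 0; k < K; ++k) {
                __m512 a = _mm512_set1_ps(A[i * K + k]);    // broadcast one scalar of A
                __m512 b = _mm512_loadu_ps(&B[k * N + j]);  // 16 consecutive floats of B
                acc = _mm512_fmadd_ps(a, b, acc);           // fused multiply-add
            }
            _mm512_storeu_ps(&C[i * N + j], acc);
        }
    }
}
```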
photochemsyn, about 2 years ago
> "Think about it: how can a small number of specialized experts, who hand write and tune assembly code, possibly scale their work to all the different configurations while also incorporating their work into all the AI frameworks?! It's simply an impossible task."

Naively, I wonder if this is the kind of problem that AI itself can solve, which is a rather singularity-approaching concept. Maybe there's too much logic involved and not enough training data on different configurations for that to work? A bit spooky, however, the thought of self-bootstrapping AI.
junrushao1994, about 2 years ago
My take: optimizing matrix multiplication is not hard on modern architectures if you have the right abstraction. The code itself could be fragmented across different programming models, which is true, but the underlying techniques are not hard for a 2nd/3rd-year undergrad to understand. There are only a few important ones on GPU: loop tiling, pipelining, shared memory swizzle, and memory coalescing. A properly designed compiler can allow developers to optimize matmuls within 100 lines of code.
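For the loop-tiling technique named here, a cache-blocking sketch of my own (assuming row-major layout and dimensions divisible by the tile size); the GPU variant stages the same tiles through shared memory and adds swizzling and coalescing concerns, but the structure is the same.

```cpp
#include <cstddef>

constexpr std::size_t T = 64;  // tile edge, chosen so a few TxT tiles fit in cache

// Cache-blocked C += A * B for row-major MxK, KxN, MxN matrices whose
// dimensions are multiples of T. A GPU kernel would copy each TxT tile of
// A and B into shared memory before running the three innermost loops.
void matmul_tiled(const float* A, const float* B, float* C,
                  std::size_t M, std::size_t N, std::size_t K) {
    for (std::size_t ii = 0; ii < M; ii += T)
        for (std::size_t kk = 0; kk < K; kk += T)      // k-block before j-block: reuse the A tile
            for (std::size_t jj = 0; jj < N; jj += T)
                for (std::size_t i = ii; i < ii + T; ++i)
                    for (std::size_t k = kk; k < kk + T; ++k) {
                        const float a = A[i * K + k];
                        for (std::size_t j = jj; j < jj + T; ++j)
                            C[i * N + j] += a * B[k * N + j];
                    }
}
```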
bee_rider, about 2 years ago
The article seems to be missing a conclusion.

Writing assembly doesn't scale across lots of platforms? Sure… the solution for matrix multiplication is to use the vendor's BLAS.

If the vendor can't at least plop some kernels into BLIS, they don't want you to use their platform for matmuls… don't fight them.
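"Use the vendor's BLAS" is, in practice, a single call through the standard CBLAS interface. A minimal sketch, assuming some CBLAS implementation (OpenBLAS, MKL, etc.) is installed and linked in:

```cpp
#include <cblas.h>
#include <vector>

int main() {
    const int M = 256, N = 256, K = 256;
    std::vector<float> A(M * K, 1.0f), B(K * N, 1.0f), C(M * N, 0.0f);

    // C = 1.0 * A * B + 0.0 * C, row-major, no transposes.
    // The vendor's library selects its tuned kernel for the local hardware.
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                M, N, K,
                1.0f, A.data(), K,   // lda = K for row-major MxK
                      B.data(), N,   // ldb = N
                0.0f, C.data(), N);  // ldc = N
    return 0;
}
```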
gleenn, about 2 years ago
I really like the Neanderthal library because it does a pretty good job of abstracting over Nvidia, AMD, and Intel hardware to provide matrix operations in an extremely performant manner for each one with the same code. Dragan goes into a lot of detail about the hardware differences. His library provides some of the fastest implementations for the given hardware, too; it's not a hand-wavy, half-baked performance abstraction, the code is really fast. https://github.com/uncomplicate/neanderthal
kickingvegas, about 2 years ago
Off topic, but related: https://mastodon.social/@mcc/110024854706734967
bigbillheck, about 2 years ago
Surely one solution is for each AI framework to understand the operating environment itself and choose the best implementation at run time, much as they currently do.
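A hedged sketch of what that run-time choice can look like on the CPU side (the kernel names refer to the hypothetical sketches above; __builtin_cpu_supports is a GCC/Clang builtin). Frameworks do the same thing at a higher level when they dispatch across cuBLAS, rocBLAS, oneDNN, and so on.

```cpp
#include <cstddef>

using MatmulFn = void (*)(const float*, const float*, float*,
                          std::size_t, std::size_t, std::size_t);

// Portable fallback: naive C += A * B, row-major.
static void matmul_generic(const float* A, const float* B, float* C,
                           std::size_t M, std::size_t N, std::size_t K) {
    for (std::size_t i = 0; i < M; ++i)
        for (std::size_t k = 0; k < K; ++k)
            for (std::size_t j = 0; j < N; ++j)
                C[i * N + j] += A[i * K + k] * B[k * N + j];
}

// Hypothetical tuned kernel, e.g. the AVX-512 sketch above, defined elsewhere.
void matmul_avx512(const float* A, const float* B, float* C,
                   std::size_t M, std::size_t N, std::size_t K);

// Inspect the CPU once and hand back the best kernel it supports.
MatmulFn pick_matmul() {
#if defined(__GNUC__) || defined(__clang__)
    if (__builtin_cpu_supports("avx512f"))
        return matmul_avx512;
#endif
    return matmul_generic;
}
```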
brucethemoose2, about 2 years ago
Yeah well tell all that to Nvidia, who very much likes the fragmentation and wants to keep things that way.
EntrePrescott, about 2 years ago
> performance has become increasingly constrained by memory latency, which has grown much slower than processing speeds.

Sounds like they would oddly prefer memory latency to grow at least as fast as processing speeds, which would be terrible. Obviously, memory latency has actually decreased, just not enough.

So it seems likely they made a mistake and actually meant that memory latency has decreased more slowly than processing speeds have increased; in other words, that it is not memory latency but memory random-access throughput (which, in rough approximation, is proportional to the inverse of memory latency) that has grown much slower than processing speeds.
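A brief note on that parenthetical (my addition, not the commenter's): with a fixed number of outstanding memory requests $L$, Little's law gives

$$\lambda = \frac{L}{W} \quad\Longrightarrow\quad \text{throughput } \lambda \;\propto\; \frac{1}{\text{latency } W},$$

so sustained random-access throughput is indeed roughly the inverse of latency at fixed concurrency.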
b34r, about 2 years ago
Chad Jarvis is an AI-generated name if I’ve ever heard one
version_five, about 2 years ago
A cool mission
adamnemecek, about 2 years ago
No, they compute spectra.