
AI Flame Graphs

316 points by JNRowe 7 months ago

21 comments

wcunning 7 months ago
I actually looked at this in detail about a year ago for some automated driving compute work at my previous job, and I found that the detailed info you'd want from Nvidia was just 100% unavailable. There are pretty good proxies in some of the data you can get out of Nvidia tools, and there's some extra info you can glean from the function call stack in the open source Nvidia driver shim layer (because the actual main components are still binary blobs, even with the "open source" driver), but overall you still can't get much useful info out.

Now that Brendan works for Intel, he can get a lot of this info from the much more open source Intel GPU driver, but that's only so useful since everyone is still on Nvidia or AMD. The more hopeful sign is that a lot of Nvidia's major customers are going to start demanding this sort of access, and there's a real chance that AMD's more accessible driver starts documenting what to actually look at, which will create the market competition to fill this space. In the meantime, take a look at the flamegraph capabilities in PyTorch and similar frameworks, go up an abstraction level, and eke out what performance you can.
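For anyone who wants to try the PyTorch route, a minimal sketch of its flamegraph support (the model and sizes here are placeholders, and depending on your PyTorch version, stack export may need extra configuration):

```python
# Sketch: collect Python + CUDA stacks with torch.profiler and emit
# folded stacks that Brendan Gregg's flamegraph.pl can render.
# The model and tensor sizes are toy placeholders.
import torch
from torch.profiler import profile, ProfilerActivity

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 1024), torch.nn.ReLU(), torch.nn.Linear(1024, 10)
).cuda()
x = torch.randn(64, 1024, device="cuda")

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    with_stack=True,  # record Python call stacks alongside kernel events
) as prof:
    model(x)
    torch.cuda.synchronize()

# Folded-stack format; render with: flamegraph.pl stacks.txt > flame.svg
prof.export_stacks("stacks.txt", "self_cuda_time_total")
```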
zkry 7 months ago
> Imagine halving the resource costs of AI and what that could mean for the planet and the industry -- based on extreme estimates such savings could reduce the total US power usage by over 10% by 2030.

Why would it be the case that reducing the costs of AI reduces power consumption, as opposed to increasing AI usage (or another application using the electricity)? I would think that with cheaper AI, usage would become more ubiquitous: LLMs in fridges, toasters, smart alarms, etc.
xnx 7 months ago
> Imagine halving the resource costs of AI and what that could mean for the planet and the industry

Google has done this: "In eighteen months, we reduced costs by more than 90% for these queries through hardware, engineering, and technical breakthroughs, while doubling the size of our custom Gemini model." https://blog.google/inside-google/message-ceo/alphabet-earnings-q3-2024/
dan-robertson 7 months ago
Being able to 'connect' call stacks between Python, C++, and the GPU/accelerator seems useful.

I wonder if this pushes a bit much towards flamegraphs specifically. They were an innovation when they were first invented and the alternatives were things like perf report, but now I think they're more one tool among many. In particular, I think many people who are serious about performance often reach for things like pprof for statistical profiles, and various tracing and trace-visualisation tools for more fine-grained information (things like bpftrace, systemtap, or custom instrumentation on the recording side, and perfetto or the many game-development-oriented tools on the visualisation, and sometimes instrumentation, side).

I was particularly surprised by the statement about Intel's engineers not knowing what to do with the flamegraphs. I read it as them already having tools that are better suited to their particular needs, because I think the alternative has to be that they are incompetent or, at best, not thinking about performance at all.

Lots of performance measuring on Linux is done through the perf subsystem, and Intel have made a lot of contributions to make it good. Similarly, Intel have added hardware features that are useful for measuring and improving performance, an area where their chips have features that, at least on chips I've used, easily beat AMD's offerings. This kind of plumbing is important and useful, and I guess the flamegraphs demonstrate that the plumbing was done.
kevg123 7 months ago
> based on Intel EU stall profiling for hardware profiling

It wasn't clearly defined, but I think EU stall means Execution Unit stall, which is when a GPU "becomes stalled when all of its threads are waiting for results from fixed function units": https://www.intel.com/content/www/us/en/docs/gpa/user-guide/2022-4/gpu-metrics.html
simpledood 7 months ago
I've tried using flame graphs, but in my view nothing beats the simplicity and succinctness of gprof output for quickly analyzing program bottlenecks.

https://ftp.gnu.org/old-gnu/Manuals/gprof-2.9.1/html_chapter/gprof_5.html#SEC12

For each function you know how much CPU is spent in the function itself, as opposed to child calls. All in a simple text file, without constant scrolling, panning, and enlarging to get the information you need.
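In the same spirit, Python's standard library gives a gprof-like flat text profile. A rough sketch (the toy functions are just for illustration):

```python
# cProfile + pstats: a flat, gprof-style text profile. "tottime" is
# time spent in the function itself; "cumtime" includes child calls,
# the same self-vs-children split the comment above values.
import cProfile
import pstats


def child(n):
    return sum(i * i for i in range(n))


def parent():
    return [child(100_000) for _ in range(50)]


cProfile.run("parent()", "profile.out")
stats = pstats.Stats("profile.out")
stats.strip_dirs().sort_stats("tottime").print_stats(5)  # plain text, no panning
```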
davidclark 7 months ago
This is so cool! Flame graphs are super helpful for analyzing bottlenecks. The eflambe library for Elixir has let us catch some tricky issues.

https://github.com/Stratus3D/eflambe/blob/master/README.adoc
saagarjha 7 months ago
I never really liked flamegraphs much, but I am going to put that aside for a bit and try to be as objective as possible.

I don't find the use case presented here compelling. Cutting out the "yo, we will save you $x billion in compute costs", the tools presented here seem to be… stack traces for your kernels. Stack traces that go from your Python code through the driver shim to the kernel and finally onto the GPU. Neat. I don't actually know very much about what Intel has in this area, so perhaps this is a step forward for them? If so, I will always applaud people figuring out how to piece together symbols and whatnot to make profiling work.

However, I am still not very impressed. Sure, there are some workloads where it is nice to know that 70% of your time is spent in some GEMM. But I think the real optimization doesn't look like that at all. For most "real" workloads, you already know the basics of how your kernels look and execute. Nobody is burning a million dollars an hour on a training run without knowing what each and every one of the important kernels are. Some of them were probably written by hand. Some might be written in higher-level PyTorch/Triton/JAX/whatever. Still others might be built on some general library. But the people who do this are not stupid, and they aren't going to be caught unawares by a random kernel suddenly popping up on their flamegraph. They should already know what is there. And most of these tools have debugging facilities to dump intermediate state in forms that tools understand. Often this is incomplete and buggy, I know. But it's there and people *do* use them.

What these people are optimizing are things that flamegraphs do not show. That's things like latency in kernel launches, or synchronization overhead with the host. It's global memory traffic and warp stalls. Sure, the tools to profile this are immature compared to what the hyperscalers have for CPUs. But they are still present and used heavily: I don't buy the argument that knowing that your Python code calls a kernel through __cuda12_ioctl_whatever is actually helpful. This seems like a solution searching for a problem, or maybe a basic diagnostic tool at best.
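To make the launch/synchronization point concrete, here is a rough sketch (sizes arbitrary) of measuring the gap between device-side kernel time and host wall-clock time around a launch, the kind of overhead a flamegraph won't surface:

```python
# Compare device-side kernel time (CUDA events) against host wall-clock
# time around the launch; the difference is launch plus sync overhead.
import time
import torch

a = torch.randn(4096, 4096, device="cuda")
b = torch.randn(4096, 4096, device="cuda")
torch.cuda.synchronize()  # settle pending work before timing

start_evt = torch.cuda.Event(enable_timing=True)
end_evt = torch.cuda.Event(enable_timing=True)

t0 = time.perf_counter()
start_evt.record()
c = a @ b  # the launch itself is asynchronous
end_evt.record()
torch.cuda.synchronize()
t1 = time.perf_counter()

gpu_ms = start_evt.elapsed_time(end_evt)  # device-side execution time
wall_ms = (t1 - t0) * 1000                # includes launch and sync cost
print(f"gpu {gpu_ms:.3f} ms, wall {wall_ms:.3f} ms, overhead {wall_ms - gpu_ms:.3f} ms")
```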
_heimdall 7 months ago
> Imagine halving the resource costs of AI and what that could mean for the planet and the industry -- based on extreme estimates such savings could reduce the total US power usage by over 10% by 2030

The way this is phrased threw me off. It sounded to me like the author was comparing the power use of a more efficient LLM industry to US usage *without* LLMs and expecting it to be 10% lower.

Looking into the source linked with the claim, it doesn't even hold up when compared against how much power LLMs use today. The linked article raises an estimate that LLM power use could increase 15-23 times between 2023 and 2027, and that by 2030 LLMs could account for 20-25% of our total energy use.

Working that math backwards, the benefit the author is hailing as a success is that we would *only* increase energy use by, say, 7.5-11.5 times by 2027, and that in 2030 LLMs would *only* be 10% of total energy use. That's not a win in my book, and it doesn't account for the Jevons paradox problem, where we would almost certainly just use all that efficiency gain to further grow LLM use compared to the 2030 prediction without the efficiency gains.
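Spelled out with the article's numbers (a quick sanity check of the arithmetic above):

```python
# The linked article's estimates, halved per the post's framing.
growth_2027 = (15, 23)      # projected LLM power growth factor, 2023 -> 2027
share_2030 = (0.20, 0.25)   # projected LLM share of total US energy, 2030

print([g / 2 for g in growth_2027])  # [7.5, 11.5]: still a huge increase
print([s / 2 for s in share_2030])   # [0.10, 0.125]: still ~10% of US energy
```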
have_faith 7 months ago
> Imagine halving the resource costs of AI ... based on extreme estimates such savings could reduce the total US power usage by over 10% by 2030

Is that implying that by 2030 they expect at least 20% of all US energy to be used by AI?
adrianco 7 months ago
This is super interesting and useful. I tried reading the code to understand how GPU workloads worked last year and it was easy to get lost in all the options and pluggable layers.
Veserv 7 months ago
I do not really understand the mentioned difficulties with instruction profiling.

Are they saying it is hard to sample the stacks across the boundary? Are they saying it is hard to do so coherently because the accelerator engine is actually asynchronous, so you need to do some sort of cross-boundary correlation?

However, they then talk about file systems and /proc representations, which have nothing to do with the actual sampling process; they only pose problems for the display of human-readable information. Many naive profiling, tracing, and logging implementations conflate these actions to their detriment; are they being conflated here, or is it just a generic statement of the scope of problems?
yanniszark 7 months ago
Trying to find out more about this EU stall thing Brendan talks about. Is it instruction sampling that gives you the reason for the stall? Sounds like a pretty advanced hw functionality.
shidoshi 7 months ago
I can imagine Nelson and other Anthropic engineers jumping for joy at this release.
treefarmer 7 months ago
Would love it if it were available and open source so people could use it in their own projects (or on their own hardware), instead of only being available on Intel's AI Cloud. But cool idea and execution nevertheless!
r3tr0 7 months ago
I am actually working on a platform that makes this sort of stuff easy. We use BPF under the hood and let you remotely deploy BPF programs across a cluster and visualize them.

Check us out: https://yeet.cx

Our current package index is a bit thin: https://yeet.cx/discover

We have a ton in the pipeline and are going to add more in the coming weeks and release an SDK.
impish9208 7 months ago
Dupe: https://news.ycombinator.com/item?id=41983876
ryao 7 months ago
Wow. Nice.
FeepingCreature 7 months ago
Unrelated, but on the topic of reducing power consumption, I want to once again note that both AMD and NVidia max out a CPU core per blocking API call, preventing your CPU from entering low power states even when doing nothing but waiting on the GPU, for no reason other than to minimally rice benchmarks.

Basically, these APIs are set up to busy-spin while waiting for a bus write from the GPU by default (!), rather than use interrupts like every other hardware device on your system.

You turn it off with:

NVidia: cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync)

AMD: hipSetDeviceFlags(hipDeviceScheduleBlockingSync)

In PyTorch:

NVidia: import ctypes; ctypes.CDLL('libcudart.so').cudaSetDeviceFlags(4)

AMD: import ctypes; ctypes.CDLL('libamdhip64.so').hipSetDeviceFlags(4)

This saves me 20W whenever my GPU is busy in ComfyUI.

Every single device using the default settings for CUDA/ROCm burns a CPU core per worker thread for no reason.
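For what it's worth, an end-to-end sketch of the PyTorch recipe above, under two stated assumptions: that flag value 4 is cudaDeviceScheduleBlockingSync, and that the flag should be set before the first CUDA context is created:

```python
# Set blocking-sync scheduling before any CUDA work happens; the CPU
# should then sleep on an interrupt while waiting instead of busy-spinning.
import ctypes

# Assumption: 4 == cudaDeviceScheduleBlockingSync in the CUDA runtime API
ctypes.CDLL("libcudart.so").cudaSetDeviceFlags(4)

import torch  # imported after the flag, before any CUDA context exists

x = torch.randn(8192, 8192, device="cuda")
y = x @ x
torch.cuda.synchronize()  # waits for the GPU without pegging a CPU core
```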
nonamepcbrand1 7 months ago
Totally looks like a self-promotion article lol
Lerc 7 months ago
There has been a bit of hyperbole of late about energy-saving AI.

There isn't a magic bullet here; it's just people improving a relatively new technology. Even though the underlying neural nets are fairly old now, the newness of transformers and the newness of the massive scale mean there's still quite a lot of low-hanging fruit. Some of the best minds are on this problem and are reaching for the hardest-to-get fruit.

A lot of these advancements work well together, improving efficiency a few percent here, a few percent there.

This is a good thing, but people are doing crazy comparisons by extrapolating older tech into future use cases.

This is like estimating the impact of cars by correctly guessing that there are 1.4 billion cars in the world and multiplying that by the impact of a single Model T Ford.