Exploring GPU-to-GPU Communication: Insights into Supercomputer Interconnects

4 points, by matt_d, 9 months ago

1 comment

matt_d, 9 months ago
Abstract:

"Multi-GPU nodes are increasingly common in the rapidly evolving landscape of exascale supercomputers. On these systems, GPUs on the same node are connected through dedicated networks, with bandwidths up to a few terabits per second. However, gauging performance expectations and maximizing system efficiency is challenging due to different technologies, design options, and software layers. This paper comprehensively characterizes three supercomputers - Alps, Leonardo, and LUMI - each with a unique architecture and design. We focus on performance evaluation of intra-node and inter-node interconnects on up to 4096 GPUs, using a mix of intra-node and inter-node benchmarks. By analyzing its limitations and opportunities, we aim to offer practical guidance to researchers, system architects, and software developers dealing with multi-GPU supercomputing. Our results show that there is untapped bandwidth, and there are still many opportunities for optimization, ranging from network to software optimization."

Observations:

> Observation 1: Achieving good performance on multi-GPU systems requires non-trivial tuning, which depends on the system, message size, communication library, and number of nodes. The default choices made by *CCL and GPU-Aware MPI are not always optimal, and manual tuning can improve performance up to an order of magnitude.

> Observation 2: GPU-Aware MPI provides the highest goodput for intra-node point-to-point transfers on all the analyzed systems. For small transfers, the optimal solution changes across the systems, depending on architectural features and specific optimization implemented by MPI.

> Observation 3: On LUMI, RCCL point-to-point communication primitives do not correctly determine the bandwidth available between GPUs on the same node, thus underutilizing the available bandwidth.

> Observation 4: For single node collectives, *CCL outperforms GPU-Aware MPI in most cases, except for small collectives on LUMI. Indeed, unlike MPI, *CCL collectives are optimized for the specific GPU models. Nevertheless, there is still room for collective algorithms optimization.

> Observation 5: On inter-node point-to-point communications, MPI outperforms *CCL by up to one order of magnitude on small transfers, and by up to 3x on larger transfers.

> Observation 6: On Alps and LUMI, GPU's network location has a marginal impact on average performance (below 30% for latency and 1% for goodput). On the other hand, on Leonardo, the average latency increases by up to 2x when the GPUs are in different groups rather than under the same switch. Similarly, the average goodput decreases by 17%. This is mainly due to network performance variability caused by network noise.

> Observation 7: *CCL exploits the intra-node GPU-GPU interconnect more effectively than MPI, being specifically optimized for the target devices. Those advantages are more evident at smaller node counts and for larger transfers, for which the performance of intra-node communications has a higher weight on the overall performance. However, we experienced instability at large node counts for the alltoall on both NCCL and RCCL.

> Observation 8: Network noise decreases the goodput of allreduce and alltoall up to 50%.

*CCL refers to NVIDIA Collective Communications Library (NCCL) and AMD ROCm Collective Communication Library (RCCL)
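For context on the kind of point-to-point goodput measurement the observations refer to (e.g. Observations 2 and 5), a GPU-aware benchmark essentially reduces to a ping-pong over device buffers. The sketch below is illustrative only, not the authors' benchmark code: it assumes a CUDA-aware MPI build (device pointers passed directly to MPI_Send/MPI_Recv), exactly two ranks with one visible GPU each, and an arbitrary 64 MiB message size with no warm-up or message-size sweep.

```c
/* Minimal sketch of a GPU-aware MPI ping-pong goodput benchmark.
 * Assumptions (not from the paper): CUDA-aware MPI, 2 ranks,
 * one visible GPU per rank, fixed 64 MiB message, 100 iterations. */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size != 2) {
        if (rank == 0) fprintf(stderr, "run with exactly 2 ranks\n");
        MPI_Finalize();
        return 1;
    }

    const int nbytes = 1 << 26;   /* 64 MiB message (arbitrary choice) */
    const int iters = 100;
    char *buf;
    cudaSetDevice(0);             /* one visible GPU per rank assumed,
                                     e.g. via the launcher's GPU binding */
    cudaMalloc((void **)&buf, nbytes);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else {
            MPI_Recv(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0) {
        /* two transfers per iteration (ping + pong), reported in Gbit/s */
        double gbps = (2.0 * iters * (double)nbytes * 8.0) / (t1 - t0) / 1e9;
        printf("goodput: %.2f Gbit/s\n", gbps);
    }

    cudaFree(buf);
    MPI_Finalize();
    return 0;
}
```

Pinning the two ranks to GPUs on the same node measures the intra-node interconnect (NVLink, xGMI, etc.), while placing them on different nodes exercises the inter-node fabric; the paper's comparison against NCCL/RCCL swaps the MPI calls for the corresponding *CCL send/recv primitives.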