Abstract:<p>"Multi-GPU nodes are increasingly common in the rapidly evolving landscape of exascale supercomputers. On these systems, GPUs on the same node are connected through dedicated networks, with bandwidths up to a few terabits per second. However, gauging performance expectations and maximizing system efficiency is challenging due to different technologies, design options, and software layers. This paper comprehensively characterizes three supercomputers - Alps, Leonardo, and LUMI - each with a unique architecture and design. We focus on performance evaluation of intra-node and inter-node interconnects on up to 4096 GPUs, using a mix of intra-node and inter-node benchmarks. By analyzing its limitations and opportunities, we aim to offer practical guidance to researchers, system architects, and software developers dealing with multi-GPU supercomputing. Our results show that there is untapped bandwidth, and there are still many opportunities for optimization, ranging from network to software optimization."<p>Observations:<p>> Observation 1: Achieving good performance on multi-GPU systems requires non-trivial tuning, which depends on the system, message size, communication library, and number of nodes. The default choices made by *CCL and GPU-Aware MPI are not always optimal, and manual tuning can improve performance up to an order of magnitude.<p>> Observation 2: GPU-Aware MPI provides the highest goodput for intra-node point-to-point transfers on all the analyzed systems. For small transfers, the optimal solution changes across the systems, depending on architectural features and specific optimization implemented by MPI.<p>> Observation 3: On LUMI, RCCL point-to-point communication primitives do not correctly determine the bandwidth available between GPUs on the same node, thus underutilizing the available bandwidth.<p>> Observation 4: For single node collectives, *CCL outperforms GPU-Aware MPI in most cases, except for small collectives on LUMI. Indeed, unlike MPI, *CCL collectives are optimized for the specific GPU models. Nevertheless, there is still room for collective algorithms optimization.<p>> Observation 5: On inter-node point-to-point communications, MPI outperforms *CCL by up to one order of magnitude on small transfers, and by up to 3x on larger transfers.<p>> Observation 6: On Alps and LUMI, GPU’s network location has a marginal impact on average performance (below 30% for latency and 1% for goodput). On the other hand, on Leonardo, the average latency increases by up to 2x when the GPUs are in different groups rather than under the same switch. Similarly, the average goodput decreases by 17%. This is mainly due to network performance variability caused by network noise.<p>> Observation 7: *CCL exploits the intra-node GPU-GPU interconnect more effectively than MPI, being specifically optimized for the target devices. Those advantages are more evident at smaller node counts and for larger transfers, for which the performance of intra-node communications has a higher weight on the overall performance. However, we experienced instability at large node counts for the alltoall on both NCCL and RCCL.<p>> Observation 8: Network noise decreases the goodput of allreduce and alltoall up to 50%.<p>*CCL refers to NVIDIA Collective Communications Library (NCCL) and AMD ROCm Collective Communication Library (RCCL)