HPC-X, MPI, PMIx & NCCL: A Full Dissection of the GPU Cluster Communication Stack
HPC-X, UCX, PMIx, and NCCL each own a clearly defined slice of the GPU cluster communication stack. This deep dive traces every call from mpirun launch to the GPU kernel's RDMA Write, pinpoints the single 128-byte crossing point between HPC-X and NCCL, and explains why UCX never touches your tensor data.