A Comprehensive Guide to Multi-Node, Multi-GPU NVIDIA GPU Fabric Deployment

1. Introduction

NVIDIA GPU Fabric is a high-speed, low-latency interconnect technology that enables direct peer-to-peer communication between GPUs across multiple nodes. This technology is critical for scaling high-performance computing (HPC) and artificial intelligence (AI) workloads that require massive parallel processing capabilities. By creating a unified memory space across multiple GPUs and nodes, GPU Fabric significantly reduces the overhead of data transfers, allowing for near-linear performance scaling for distributed training and inference tasks.

This document provides a comprehensive, step-by-step guide for system administrators and DevOps engineers to configure a multi-node, multi-GPU environment using NVIDIA GPU Fabric. The instructions cover the necessary prerequisites, software stack installation, network configuration, and validation procedures to ensure a robust and high-performance setup.

2. Prerequisites

Before proceeding with the GPU Fabric setup, it is imperative to ensure that the system meets the following hardware and software requirements. A proper foundation is crucial for a successful deployment.

2.1. Hardware Requirements

  • NVIDIA GPUs: One or more NVIDIA GPUs per node, with support for GPUDirect RDMA.
  • NVIDIA ConnectX-7 VPI Adapters: These adapters are used for the GPU Fabric and must be installed in each node.
  • Network Switches: High-speed Ethernet switches capable of handling the bandwidth and latency requirements of the GPU Fabric.

2.2. Software Requirements

The following software components must be installed and configured on each node in the cluster.

2.2.1. Mellanox OFED Driver

The Mellanox OpenFabrics Enterprise Distribution (OFED) driver is essential for enabling RDMA over Converged Ethernet (RoCE) and other high-performance networking features. Install the driver using the following command:

./mlnxofedinstall --without-dkms --add-kernel-support --kernel $(uname -r) --without-fw-update --force --basic

Note: The --without-dkms flag is used to prevent the installation of the driver through the Dynamic Kernel Module Support (DKMS) framework. The --add-kernel-support flag ensures that the driver is compiled for the currently running kernel. The --without-fw-update flag skips the firmware update process. The --force flag is used to bypass any non-critical errors, and --basic installs the basic driver package.

2.2.2. NVIDIA Software Stack

The NVIDIA software stack includes the necessary drivers and tools to manage and utilize the GPUs. The following components are required:

  • NVIDIA Driver: The core driver for the NVIDIA GPUs.
  • NVIDIA Fabric Manager: A service that manages the GPU Fabric and ensures its proper operation.
  • CUDA Driver: The driver for the CUDA toolkit, which is required for running GPU-accelerated applications.

2.2.3. NVIDIA Peer Memory Kernel Module

The nvidia-peermem kernel module enables direct memory access between GPUs, which is a key component of GPUDirect RDMA. This module is included in the NVIDIA driver but has a dependency on the OFED driver. It is crucial to verify that this module is loaded after the installation of both the NVIDIA and OFED drivers.

lsmod | grep nvidia_peermem

If the module is not loaded, you can load it manually:

modprobe nvidia-peermem
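The check and load steps above can be combined into a short sketch that also persists the module across reboots via modules-load.d (the path assumes a systemd-based distribution):

```shell
# Load nvidia-peermem now if it is missing.
if ! lsmod | grep -q nvidia_peermem; then
    modprobe nvidia-peermem
fi

# systemd loads modules listed in this directory at boot, so the
# module survives reboots.
echo "nvidia-peermem" > /etc/modules-load.d/nvidia-peermem.conf
```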

2.2.4. NVIDIA Fabric Manager Service

The nvidia-fabricmanager service must be running to manage the GPU Fabric. You can check the status of the service and enable it to start on boot with the following commands:

systemctl status nvidia-fabricmanager
systemctl enable nvidia-fabricmanager
systemctl start nvidia-fabricmanager
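The enable and start steps can also be combined with systemctl enable --now; the sketch below fails early if the service did not actually come up:

```shell
# Enable the service at boot and start it immediately.
systemctl enable --now nvidia-fabricmanager

# Fail early if Fabric Manager is not running.
if ! systemctl is-active --quiet nvidia-fabricmanager; then
    echo "nvidia-fabricmanager failed to start" >&2
    exit 1
fi
echo "nvidia-fabricmanager is active"
```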

3. GPU Fabric Setup

With the prerequisites in place, the next step is to configure the GPU Fabric itself. This involves setting the network adapters to the correct mode, configuring IP addresses, and enabling link discovery.

3.1. VPI Adapter Configuration

The NVIDIA ConnectX-7 VPI (Virtual Protocol Interconnect) adapters support both InfiniBand and Ethernet modes. For GPU Fabric, the adapters must be in Ethernet mode. This can be configured either through the BIOS during system provisioning (the preferred method) or via the command line.

To switch the mode via the command line, first identify the PCI address of the adapter:

mlx_pci=$(sudo ibdev2netdev -v | awk '{print $1}' | head -n1)

Then, use the mlxconfig utility to set the link type to Ethernet for both ports:

sudo mlxconfig -d $mlx_pci set LINK_TYPE_P1=ETH LINK_TYPE_P2=ETH

Important: A system reboot is required for the changes to take effect.
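After the reboot, the configured link type can be read back with mlxconfig to confirm the change took effect (a sketch; the exact output format varies by firmware version, but both ports should report ETH(2)):

```shell
# $mlx_pci is the PCI address identified earlier with ibdev2netdev -v.
sudo mlxconfig -d $mlx_pci query | grep LINK_TYPE
```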

3.2. Network Interface Configuration

Once the adapters are in Ethernet mode, the network interfaces must be configured with IP addresses and brought up. The following commands demonstrate how to configure an interface with an IP address and set the Maximum Transmission Unit (MTU). The MTU value of 4200 is used as an initial example and can be increased to a larger value, such as 9000, after verifying network stability.

ip link set dev ${i} up
ip link set dev ${i} mtu 4200
ADDR="192.168.${subnet}.${octet}/24"
ip addr add dev ${i} ${ADDR}
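The addressing scheme used here places each fabric interface on its own /24 subnet, with the host octet identifying the node. The sketch below just prints the resulting plan; the interface names and the node index of 1 are hypothetical:

```shell
# Each interface gets its own /24 subnet; the host octet must be
# unique per node. Interface names here are illustrative.
octet=1      # hypothetical node index; set a unique value per node
subnet=50
for i in eth2 eth3 eth4 eth5; do
    echo "${i} -> 192.168.${subnet}.${octet}/24"
    subnet=$((subnet + 1))
done
```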

3.3. Link Layer Discovery Protocol (LLDP)

LLDP is used for link discovery and to verify the physical topology of the network. The lldpd package needs to be installed and configured on each node.

3.3.1. Install lldpd

Install the lldpd package, together with lldpad, which provides the lldptool utility used in the next step:

apt-get install -y lldpd lldpad

3.3.2. Configure lldpd

Configure lldpd to transmit and receive LLDP frames on the network interfaces:

echo "enabling lldp for interface: $i"
lldptool set-lldp -i $i adminStatus=rxtx
lldptool -T -i $i -V sysName enableTx=yes
lldptool -T -i $i -V portDesc enableTx=yes
lldptool -T -i $i -V sysDesc enableTx=yes
lldptool -T -i $i -V sysCap enableTx=yes
lldptool -T -i $i -V mngAddr enableTx=yes
lldptool -i $i -T -V portID subtype=PORT_ID_INTERFACE_NAME

3.4. Automation Script

To simplify the configuration process, the following bash script can be used to automate the configuration of all network interfaces (excluding eth0 and eth1):

#!/bin/bash
# Configure all fabric interfaces (excluding eth0 and eth1): bring them
# up, set the MTU, assign per-subnet IP addresses, and enable LLDP.

IFACES=$(ip -br addr | grep eth | grep -v -E 'eth0|eth1' | awk '{print $1}')

# Each interface gets its own /24 subnet; the host octet must be unique
# per node, so adjust it on each node in the cluster.
subnet=50
octet=1

for i in ${IFACES}; do

        ip link set dev ${i} up
        ip link set dev ${i} mtu 4200
        ADDR="192.168.${subnet}.${octet}/24"
        ip addr add dev ${i} ${ADDR}
        subnet=$((subnet + 1))

        echo "enabling lldp for interface: $i"
        lldptool set-lldp -i $i adminStatus=rxtx
        lldptool -T -i $i -V sysName enableTx=yes
        lldptool -T -i $i -V portDesc enableTx=yes
        lldptool -T -i $i -V sysDesc enableTx=yes
        lldptool -T -i $i -V sysCap enableTx=yes
        lldptool -T -i $i -V mngAddr enableTx=yes
        lldptool -T -i $i -V portID subtype=PORT_ID_INTERFACE_NAME
done

Finally, verify the assigned addresses and the LLDP neighbors discovered on each link:

ip -br addr
lldpcli show neighbors

4. GPU Fabric Validation

After configuring the GPU Fabric, it is crucial to validate the setup to ensure that it is functioning correctly and performing as expected. This section outlines the steps to validate the GPU Fabric using the perftest tool.

4.1. RDMA Performance Testing with perftest

The perftest package provides a set of tools for benchmarking RDMA devices. It can be installed either from the distribution's package manager or compiled from source.

4.1.1. Installation

From APT:

apt install perftest

From Source:

For the latest version and more control over the build, it is recommended to compile perftest from source.

git clone https://github.com/linux-rdma/perftest
cd perftest
./autogen.sh
./configure CUDA_H_PATH=/usr/local/cuda/include/cuda.h
make -j$(nproc)

4.1.2. Peer-to-Peer (P2P) Bandwidth Test

To test the bandwidth between two nodes, run the ib_read_bw tool in server mode on one node and in client mode on the other.

Server:

./ib_read_bw -a -q 20 --report_gbits -d mlx5_0

Client:

./ib_read_bw -a -q 20 --report_gbits -d mlx5_0 192.168.50.1

Note: The -a flag runs the test across all message sizes. The -q 20 flag uses 20 queue pairs. The --report_gbits flag reports bandwidth in gigabits per second. The -d mlx5_0 flag selects the RDMA device to use. Replace 192.168.50.1 with the IP address of the server node.
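If perftest was compiled with CUDA support as shown in section 4.1.1, the same test can target GPU memory directly via the --use_cuda flag, exercising the GPUDirect RDMA path end to end (a sketch; the GPU index of 0 is an assumption):

```shell
# Server side: stage the test buffer in GPU 0's memory.
./ib_read_bw -a -q 20 --report_gbits -d mlx5_0 --use_cuda=0

# Client side: likewise use GPU 0, targeting the server's fabric IP.
./ib_read_bw -a -q 20 --report_gbits -d mlx5_0 --use_cuda=0 192.168.50.1
```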

4.2. Expected Output

The output of the ib_read_bw test should show a high bandwidth, close to the theoretical maximum of the network adapters. This confirms that the GPU Fabric is configured correctly and that GPUDirect RDMA is functioning as expected.
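As a rough sanity check, the measured figure can be compared against the adapter's line rate; the numbers below are illustrative placeholders, not measurements:

```shell
# Hypothetical example: a 400 Gb/s link delivering 370 Gb/s of RDMA
# read bandwidth runs at roughly 92% of line rate, which is healthy.
line_rate_gbps=400   # adapter line rate (illustrative)
measured_gbps=370    # value reported by ib_read_bw (illustrative)
pct=$(( measured_gbps * 100 / line_rate_gbps ))
echo "link efficiency: ${pct}%"
```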

5. References

  1. NVIDIA Enterprise Support. (n.d.). How-to: Change Port Type in Mellanox ConnectX-3 Adapter. Retrieved from https://enterprise-support.nvidia.com/s/article/howto-change-port-type-in-mellanox-connectx-3-adapter
  2. NVIDIA. (n.d.). Set Up GPUDirect RDMA. Holoscan SDK User Guide. Retrieved from https://docs.nvidia.com/holoscan/sdk-user-guide/set_up_gpudirect_rdma.html#enabling-rdma-on-the-connectx-smartnic
  3. NVIDIA Enterprise Support. (n.d.). How-to: Enable LLDP on Linux Servers for Link Discovery. Retrieved from https://enterprise-support.nvidia.com/s/article/howto-enable-lldp-on-linux-servers-for-link-discovery
  4. Linux RDMA. (n.d.). perftest. GitHub. Retrieved from https://github.com/linux-rdma/perftest/tree/master