UFM vs OpenSM: Understanding InfiniBand Fabric Management

Quick Answer

UFM and OpenSM both provide Subnet Manager functionality for InfiniBand fabrics, but they serve very different purposes. UFM is NVIDIA's comprehensive enterprise fabric management platform built around a Subnet Manager, while OpenSM is a lightweight, open-source Subnet Manager implementation.


What Is a Subnet Manager (SM)?

Before comparing UFM and OpenSM, it helps to understand what a Subnet Manager actually does. The Subnet Manager is the "brain" of an InfiniBand fabric. It is responsible for:

  1. Fabric Discovery — Discovers all switches, adapters, and connections in the fabric
  2. LID Assignment — Assigns Local IDs (LIDs) to every port
  3. Routing Table Configuration — Programs forwarding tables into switches
  4. Port State Management — Drives ports through: Down → Initialize → Armed → Active
  5. Partition Management — Creates partitions (the InfiniBand analogue of VLANs) for traffic isolation

Important: Without a functioning Subnet Manager, InfiniBand ports cannot reach the Active state and cannot pass any traffic — regardless of physical link status.
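This dependency is easy to spot with `ibstat`. The snippet below parses a sample of `ibstat`-style output (hypothetical values) to extract the logical port state; on a real node you would pipe the actual command instead of the here-string.

```shell
# Hypothetical ibstat-style output; on a live node, pipe `ibstat` itself.
sample='CA mlx5_0
        Port 1:
                State: Down
                Physical state: LinkUp
                Rate: 400'

# Print only the logical port state; the pattern skips the
# "Physical state" line. "Physical state: LinkUp" together with
# "State: Down" usually means no SM has configured the port.
echo "$sample" | awk -F': ' '/^[[:space:]]*State:/ {print $2}'
# → Down
```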

OpenSM (Open Subnet Manager)

What Is OpenSM?

OpenSM is an open-source, lightweight Subnet Manager that implements the InfiniBand specification for fabric management. It runs as a daemon on any Linux node, embedded switch, or dedicated management server.

Key Characteristics

| Aspect | Details |
| --- | --- |
| Type | Command-line tool / daemon |
| License | Open source (GPL/BSD) |
| Complexity | Simple, minimal configuration |
| Interface | Command-line only, no GUI |
| Resource Usage | Very lightweight (low CPU/memory) |
| Cost | Free |
| Deployment | Any Linux node, switch, or dedicated server |

Where OpenSM Runs

  • On a compute node — Any server with an InfiniBand adapter
  • Embedded in a switch — Many IB switches ship with OpenSM built-in
  • On a dedicated management server — A separate server solely for SM duties

Feature Set

What OpenSM does:

  •  Fabric discovery and initialization
  •  LID assignment and routing
  •  Partition (VLAN) management
  •  Basic high availability (master/standby)
  •  Multiple routing algorithms (MinHop, LASH, Fat-Tree, etc.)
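Selecting one of these routing engines is a one-flag affair. The following invocation is a sketch based on the options documented in opensm(8); verify the flags against your installed version:

```shell
# Start OpenSM with the fat-tree routing engine, SM priority 14,
# running as a background daemon (flags per opensm(8)):
opensm --routing_engine ftree --priority 14 --daemon

# The same choices can be made persistent in /etc/rdma/opensm.conf:
#   routing_engine ftree
#   sm_priority 14
```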

What OpenSM does NOT do:

  •  No graphical user interface
  •  No performance monitoring or analytics
  •  No centralized logging or alerting
  •  No topology visualization
  •  No advanced diagnostics
  •  No job/workload scheduler integration

Configuration Files

/etc/rdma/opensm.conf       # Main configuration file
/etc/sysconfig/opensm       # Startup parameters (GUID, priority)
/etc/rdma/partitions.conf   # Partition definitions
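As an illustration of the partition file's syntax (`name=pkey [, flags] : members ;`), here is a minimal sketch; the partition names and port GUIDs are hypothetical placeholders:

```
# /etc/rdma/partitions.conf — sketch
# Default partition, full membership for all ports, with IPoIB enabled:
Default=0x7fff, ipoib, defmember=full : ALL;
# A separate partition restricted to two (hypothetical) port GUIDs:
storage=0x0002, ipoib : 0x0002c9030001aaaa, 0x0002c9030001bbbb;
```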

Best Suited For

  • Small to medium InfiniBand clusters (roughly under 100 nodes)
  • Development and test environments
  • Simple point-to-point or small switched fabrics
  • Environments where minimal overhead and full control are priorities
  • Budget-conscious deployments

UFM (Unified Fabric Manager)

What Is UFM?

UFM (Unified Fabric Manager) is NVIDIA's enterprise-grade fabric management platform. It is a comprehensive software suite that includes a full Subnet Manager plus extensive monitoring, analytics, diagnostics, and management capabilities — all accessible through a web interface and REST API.

Key Characteristics

| Aspect | Details |
| --- | --- |
| Type | Enterprise management platform with web UI |
| License | Commercial (requires NVIDIA license) |
| Complexity | Feature-rich, extensive configuration options |
| Interface | Web GUI + REST API + CLI |
| Resource Usage | Medium to high (requires dedicated server) |
| Cost | Commercial license required |
| Deployment | Dedicated server (physical or VM) |

UFM Architecture Components

  1. UFM Server — Core management engine with embedded SM
  2. UFM Web UI — Browser-based management interface
  3. UFM Agents (optional) — Installed on compute nodes for detailed telemetry
  4. Database — Stores historical data, events, and topology snapshots

Additional Capabilities (Beyond OpenSM)

Monitoring & Analytics

  •  Real-time performance monitoring (bandwidth, latency, errors)
  •  Historical data collection and trend analysis
  •  Per-port counters and statistics
  •  Cable health and temperature monitoring

Visualization & Topology

  •  Interactive fabric topology maps
  •  Color-coded port status visualization
  •  Rack and chassis views

Advanced Management

  •  Firmware update management
  •  Job-aware routing (Slurm, PBS, and other scheduler integration)
  •  Quality of Service (QoS) configuration
  •  Multi-fabric management from a single console
  •  Prometheus / Grafana exporters
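All of the above is also scriptable over the REST API. As a sketch only — the endpoint path and authentication scheme are assumptions to verify against the REST API guide for your UFM version:

```shell
# Hypothetical query: list fabric ports via the UFM REST API.
# The /ufmRest base path, resource name, and basic-auth credentials
# are placeholders, not verified values.
curl -k -u admin:password "https://ufm-server/ufmRest/resources/ports"
```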

Best Suited For

  • Large-scale HPC clusters (hundreds to thousands of nodes)
  • Production data centers and AI/ML training clusters
  • Environments requiring detailed monitoring and analytics
  • Organizations with compliance or audit requirements (historical data retention)

Side-by-Side Feature Comparison

| Feature | OpenSM | UFM |
| --- | --- | --- |
| Core SM Functionality | ✅ Yes | ✅ Yes |
| Cost | Free | Commercial |
| User Interface | CLI only | Web GUI + CLI + API |
| Topology Visualization | ❌ No | ✅ Yes |
| Performance Monitoring | ❌ No | ✅ Yes |
| Historical Data | ❌ No | ✅ Yes |
| Alerting | ❌ No | ✅ Yes |
| Cable Diagnostics | ❌ No | ✅ Yes |
| Job Scheduler Integration | ❌ No | ✅ Yes |
| Multi-Fabric Management | ❌ No | ✅ Yes |
| Resource Requirements | Very low | Medium–High |
| Setup Complexity | Simple | Complex |
| Suitable for Small Clusters | ✅ Yes | ⚠️ Overkill |
| Suitable for Large Clusters | ⚠️ Limited | ✅ Yes |
| High Availability | Basic | Advanced |
| Support | Community | NVIDIA commercial |

How to Identify Which One You Have

Check if UFM is installed locally

rpm -qa | grep -i ufm
systemctl list-units | grep -i ufm
ps aux | grep ufm

Check if OpenSM is installed / running locally

rpm -qa | grep -i opensm
systemctl status opensm
ps aux | grep opensm

Identify the active SM on the fabric

# Query the SM (requires at least one Active port)
sminfo

# Discover the fabric and look for SM entries
ibnetdiscover | grep -i "sm"
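The interesting fields in `sminfo` output are the master SM's LID and its priority. The snippet below parses a sample line (hypothetical GUID and counters) in the format `sminfo` typically prints; on a live fabric you would capture `$(sminfo)` instead:

```shell
# Hypothetical sminfo output; on a live fabric use:  out=$(sminfo)
out='sminfo: sm lid 3 sm guid 0x248a070300a1b2c3, activity count 1024 priority 15 state 3 SMINFO_MASTER'

# Extract the master SM's LID and priority by field position.
echo "$out" | awk '{print "lid=" $4, "priority=" $12}'
# → lid=3 priority=15
```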

Common Deployment Patterns

| Pattern | Description | Typical Use Case |
| --- | --- | --- |
| Switch-Embedded OpenSM | OpenSM runs inside the IB switch | Small–medium clusters; most common |
| Dedicated UFM Server | UFM runs on a standalone management server | Large HPC / AI clusters |
| Compute Node OpenSM | OpenSM runs on one of the compute nodes | Small clusters or dev environments |
| Multi-SM (HA) | Primary SM + standby SM with priority config | High-availability production environments |
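In the Multi-SM pattern, mastership is decided by the configured SM priority: the higher value wins the election and the other instance goes standby. A minimal opensm.conf sketch:

```
# /etc/rdma/opensm.conf excerpt (sketch): the SM with the higher
# sm_priority becomes master; the other drops to standby.
sm_priority 15      # on the intended master
# sm_priority 14    # on the standby node
```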

Recommendations for Your Current Situation

Root Cause: You started a local OpenSM on your B300 node, which conflicted with the remote SM running on your switch or UFM server. As a result, all ports got stuck in State: Down even though they reported Physical state: LinkUp — a classic SM-conflict symptom.

If you have a switch-embedded SM

  1. Log in to the switch management interface
  2. Check SM status: show sm
  3. Restart the SM service (NVIDIA/Mellanox switches):
sm stop
sm start

If you have a UFM server

  1. Contact the UFM administrator, or log in to the UFM server directly
  2. Restart the UFM service:
sudo systemctl restart ufmd

If you are unsure which SM you have

  • Ask your cluster administrator or network team
  • They should know what SM infrastructure is in place
  • Run rpm -qa | grep -E "ufm|opensm" to check what is installed locally

Summary

|  | OpenSM | UFM |
| --- | --- | --- |
| In one sentence | Lightweight, free, basic SM functionality | Enterprise platform: SM + monitoring + management + analytics |
| Best for | Small clusters, dev/test environments | Large production clusters, AI/ML training |
| Cost | Free | Requires NVIDIA commercial license |

Both tools can manage your InfiniBand fabric, but UFM provides far greater visibility and control at scale. Your current problem is that the primary SM — wherever it runs — has stopped working correctly. The path to recovery is to either fix the remote SM (switch or UFM) or bring up a new SM instance once the root cause of the conflict has been resolved.
