UFM vs OpenSM: Understanding InfiniBand Fabric Management
Quick Answer
Both UFM and OpenSM provide Subnet Manager functionality for InfiniBand fabrics, but they serve very different purposes. UFM is NVIDIA's comprehensive enterprise fabric management platform (with an embedded Subnet Manager), while OpenSM is a lightweight, open-source Subnet Manager implementation.
What Is a Subnet Manager (SM)?
Before comparing UFM and OpenSM, it helps to understand what a Subnet Manager actually does. The Subnet Manager is the "brain" of an InfiniBand fabric. It is responsible for:
- Fabric Discovery — Discovers all switches, adapters, and connections in the fabric
- LID Assignment — Assigns Local IDs (LIDs) to every port
- Routing Table Configuration — Programs forwarding tables into switches
- Port State Management — Drives ports through: Down → Initialize → Armed → Active
- Partition Management — Creates partitions (P_Keys, analogous to Ethernet VLANs) for network isolation
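The practical consequence of these duties shows up in a port's `ibstat` output. A minimal sketch of the diagnostic logic (the sample text below is an illustrative capture, not live output; on a real node you would run `ibstat` against your actual device):

```shell
# Decide whether a missing/failed SM is the likely culprit from ibstat-style
# output. On a real node you would run something like: ibstat mlx5_0 1
# (the device name mlx5_0 is an assumption). A captured sample is parsed
# here so the logic is self-contained.
sample='State: Initializing
Physical state: LinkUp'
state=$(printf '%s\n' "$sample" | awk -F': ' '/^State/ {print $2}')
phys=$(printf '%s\n' "$sample" | awk -F': ' '/^Physical state/ {print $2}')
if [ "$phys" = "LinkUp" ] && [ "$state" != "Active" ]; then
  echo "Link is up but port is not Active: no SM has brought this port up"
fi
```

The key pattern: a healthy physical link combined with a logical state stuck below Active points at the Subnet Manager, not the cable.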
Without a running Subnet Manager, ports never reach the Active state and cannot pass any traffic, regardless of physical link status.
OpenSM (Open Subnet Manager)
What Is OpenSM?
OpenSM is an open-source, lightweight Subnet Manager that implements the InfiniBand specification for fabric management. It runs as a daemon on any Linux node, embedded switch, or dedicated management server.
Key Characteristics
| Aspect | Details |
|---|---|
| Type | Command-line tool / daemon |
| License | Open source (GPL/BSD) |
| Complexity | Simple, minimal configuration |
| Interface | Command-line only, no GUI |
| Resource Usage | Very lightweight (low CPU/memory) |
| Cost | Free |
| Deployment | Any Linux node, switch, or dedicated server |
Where OpenSM Runs
- On a compute node — Any server with an InfiniBand adapter
- Embedded in a switch — Many IB switches ship with OpenSM built-in
- On a dedicated management server — A separate server solely for SM duties
Feature Set
What OpenSM does:
- ✅ Fabric discovery and initialization
- ✅ LID assignment and routing
- ✅ Partition (VLAN) management
- ✅ Basic high availability (master/standby)
- ✅ Multiple routing algorithms (MinHop, LASH, Fat-Tree, etc.)
What OpenSM does NOT do:
- ❌ No graphical user interface
- ❌ No performance monitoring or analytics
- ❌ No centralized logging or alerting
- ❌ No topology visualization
- ❌ No advanced diagnostics
- ❌ No job/workload scheduler integration
Configuration Files
/etc/rdma/opensm.conf # Main configuration file
/etc/sysconfig/opensm # Startup parameters (GUID, priority)
/etc/rdma/partitions.conf # Partition definitions
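A sketch of what these files might contain. All values below are illustrative, and exact variable names and paths can differ between distributions and OpenSM versions:

```text
# /etc/rdma/opensm.conf: select a routing engine (e.g. Fat-Tree for Clos fabrics)
routing_engine ftree

# /etc/sysconfig/opensm: bind the SM to a port GUID and set election priority
GUIDS="0x0002c90300001234"
PRIORITY=15

# /etc/rdma/partitions.conf: default partition plus an isolated partition
Default=0x7fff, ipoib, defmember=full : ALL, SELF=full;
storage=0x0002 : ALL;
```

The priority value (0-15, highest wins the master election) is what makes the basic master/standby HA mentioned above work: run OpenSM on two nodes with different priorities, and the lower-priority instance takes over if the master disappears.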
Best Suited For
- Small to medium InfiniBand clusters (< 100 nodes)
- Development and test environments
- Simple point-to-point or small switched fabrics
- Environments where minimal overhead and full control are priorities
- Budget-conscious deployments
UFM (Unified Fabric Manager)
What Is UFM?
UFM (Unified Fabric Manager) is NVIDIA's enterprise-grade fabric management platform. It is a comprehensive software suite that includes a full Subnet Manager plus extensive monitoring, analytics, diagnostics, and management capabilities — all accessible through a web interface and REST API.
Key Characteristics
| Aspect | Details |
|---|---|
| Type | Enterprise management platform with web UI |
| License | Commercial (requires NVIDIA license) |
| Complexity | Feature-rich, extensive configuration options |
| Interface | Web GUI + REST API + CLI |
| Resource Usage | Medium to high (requires dedicated server) |
| Cost | Commercial license required |
| Deployment | Dedicated server (physical or VM) |
UFM Architecture Components
- UFM Server — Core management engine with embedded SM
- UFM Web UI — Browser-based management interface
- UFM Agents (optional) — Installed on compute nodes for detailed telemetry
- Database — Stores historical data, events, and topology snapshots
Additional Capabilities (Beyond OpenSM)
Monitoring & Analytics
- ✅ Real-time performance monitoring (bandwidth, latency, errors)
- ✅ Historical data collection and trend analysis
- ✅ Per-port counters and statistics
- ✅ Cable health and temperature monitoring
Visualization & Topology
- ✅ Interactive fabric topology maps
- ✅ Color-coded port status visualization
- ✅ Rack and chassis views
Advanced Management
- ✅ Firmware update management
- ✅ Job-aware routing (Slurm, PBS, and other scheduler integration)
- ✅ Quality of Service (QoS) configuration
- ✅ Multi-fabric management from a single console
- ✅ Prometheus / Grafana exporters
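Much of this functionality is exposed over UFM's REST API. A hedged sketch: the endpoint path, credentials, and response shape below are assumptions based on typical UFM REST usage, not verified against a specific UFM release, and the JSON here is a hypothetical capture:

```shell
# On a live UFM server a port query might look roughly like:
#   curl -k -u admin:PASSWORD "https://ufm-server/ufmRest/resources/ports"
# Sketch: filter a captured (hypothetical) JSON response for ports reporting
# symbol errors, using only standard shell tools.
response='[{"name":"sw1/p1","symbol_errors":0},{"name":"sw1/p2","symbol_errors":12}]'
printf '%s\n' "$response" | tr '}' '\n' | grep '"symbol_errors":[1-9]'
```

In practice you would feed the real API response into a proper JSON tool (e.g. `jq`) or into the Prometheus/Grafana exporters listed above; the point is that UFM makes per-port counters queryable, which OpenSM does not.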
Best Suited For
- Large-scale HPC clusters (hundreds to thousands of nodes)
- Production data centers and AI/ML training clusters
- Environments requiring detailed monitoring and analytics
- Organizations with compliance or audit requirements (historical data retention)
Side-by-Side Feature Comparison
| Feature | OpenSM | UFM |
|---|---|---|
| Core SM Functionality | ✅ Yes | ✅ Yes |
| Cost | Free | Commercial |
| User Interface | CLI only | Web GUI + CLI + API |
| Topology Visualization | ❌ No | ✅ Yes |
| Performance Monitoring | ❌ No | ✅ Yes |
| Historical Data | ❌ No | ✅ Yes |
| Alerting | ❌ No | ✅ Yes |
| Cable Diagnostics | ❌ No | ✅ Yes |
| Job Scheduler Integration | ❌ No | ✅ Yes |
| Multi-Fabric Management | ❌ No | ✅ Yes |
| Resource Requirements | Very low | Medium–High |
| Setup Complexity | Simple | Complex |
| Suitable for Small Clusters | ✅ Yes | ⚠️ Overkill |
| Suitable for Large Clusters | ⚠️ Limited | ✅ Yes |
| High Availability | Basic | Advanced |
| Support | Community | NVIDIA Commercial |
How to Identify Which One You Have
Check if UFM is installed locally
rpm -qa | grep -i ufm
systemctl list-units | grep -i ufm
ps aux | grep ufm
Check if OpenSM is installed / running locally
rpm -qa | grep -i opensm
systemctl status opensm
ps aux | grep opensm
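If OpenSM is running locally, its log confirms whether fabric initialization completed: OpenSM writes a "SUBNET UP" message once the fabric is configured (default log path /var/log/opensm.log; the sample line below is illustrative, not real output):

```shell
# On a live node: grep "SUBNET UP" /var/log/opensm.log
# A captured sample line is parsed here for illustration.
logline='Mar 10 14:02:11 123456 [A8697700] 0x02 -> SUBNET UP'
printf '%s\n' "$logline" | grep -q 'SUBNET UP' && echo "fabric initialized"
```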
Identify the active SM on the fabric
# Query the SM (requires at least one Active port)
sminfo
# Discover the fabric and look for SM entries
ibnetdiscover | grep -i "sm"
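`sminfo` reports the master SM's LID, GUID, activity count, priority, and state. A sketch extracting the key fields; the sample line mimics typical `sminfo` output, but the exact field layout is an assumption and may vary across versions:

```shell
# On a live fabric, simply run: sminfo
# Here a captured sample line is parsed for illustration.
sample='sminfo: sm lid 3 sm guid 0x248a0703009c1e96, activity count 9876 priority 15 state 3 SMINFO_MASTER'
printf '%s\n' "$sample" | awk '{print "SM LID:", $4, "| state:", $NF}'
```

A state of SMINFO_MASTER tells you which node currently owns the fabric; the LID lets you map that back to a specific switch or server with `ibnetdiscover`.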
Common Deployment Patterns
| Pattern | Description | Typical Use Case |
|---|---|---|
| Switch-Embedded OpenSM | OpenSM runs inside the IB switch | Small–medium clusters; most common |
| Dedicated UFM Server | UFM runs on a standalone management server | Large HPC / AI clusters |
| Compute Node OpenSM | OpenSM runs on one of the compute nodes | Small clusters or dev environments |
| Multi-SM (HA) | Primary SM + standby SM with priority config | High-availability production environments |
Recommendations for Your Current Situation
Your symptom (State: Down even though Physical state: LinkUp) is a classic SM conflict symptom.
If you have a switch-embedded SM
- Log in to the switch management interface
- Check SM status: show sm
- Restart the SM service (NVIDIA/Mellanox switches):
sm stop
sm start
If you have a UFM server
- Contact the UFM administrator, or log in to the UFM server directly
- Restart the UFM service:
sudo systemctl restart ufmd
If you are unsure which SM you have
- Ask your cluster administrator or network team
- They should know what SM infrastructure is in place
- Run rpm -qa | grep -E "ufm|opensm" to check what is installed locally
Summary
| | OpenSM | UFM |
|---|---|---|
| In one sentence | Lightweight, free, basic SM functionality | Enterprise platform: SM + monitoring + management + analytics |
| Best for | Small clusters, dev/test environments | Large production clusters, AI/ML training |
| Cost | Free | Requires NVIDIA commercial license |
Both tools can manage your InfiniBand fabric, but UFM provides far greater visibility and control at scale. Your current problem is that the primary SM — wherever it runs — has stopped working correctly. The path to recovery is to either fix the remote SM (switch or UFM) or bring up a new SM instance once the root cause of the conflict has been resolved.