UFM vs OpenSM: Understanding InfiniBand Fabric Management
Quick Answer
Both UFM and OpenSM provide Subnet Manager functionality for InfiniBand fabrics, but they serve very different purposes. UFM is NVIDIA's comprehensive enterprise fabric management platform (with an embedded Subnet Manager), while OpenSM is a lightweight, open-source Subnet Manager implementation.
What Is a Subnet Manager (SM)?
Before comparing UFM and OpenSM, it helps to understand what a Subnet Manager actually does. The Subnet Manager is the "brain" of an InfiniBand fabric. It is responsible for:
- Fabric Discovery — Discovers all switches, adapters, and connections in the fabric
- LID Assignment — Assigns Local IDs (LIDs) to every port
- Routing Table Configuration — Programs forwarding tables into switches
- Port State Management — Drives ports through: Down → Initialize → Armed → Active
- Partition Management — Creates partitions (P_Keys, analogous to Ethernet VLANs) for network isolation
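The practical consequence of these duties shows up in a port's `ibstat` output. A minimal sketch of the diagnostic logic (the sample text below is an illustrative capture, not live output; on a real node you would run `ibstat` against your actual device):

```shell
# Decide whether a missing/failed SM is the likely culprit from ibstat-style
# output. On a real node you would run something like: ibstat mlx5_0 1
# (the device name mlx5_0 is an assumption). A captured sample is parsed
# here so the logic is self-contained.
sample='State: Initializing
Physical state: LinkUp'
state=$(printf '%s\n' "$sample" | awk -F': ' '/^State/ {print $2}')
phys=$(printf '%s\n' "$sample" | awk -F': ' '/^Physical state/ {print $2}')
if [ "$phys" = "LinkUp" ] && [ "$state" != "Active" ]; then
  echo "Link is up but port is not Active: no SM has brought this port up"
fi
```

The key pattern: a healthy physical link combined with a logical state stuck below Active points at the Subnet Manager, not the cable.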
Without a running Subnet Manager, ports never reach the Active state and cannot pass any traffic, regardless of physical link status.
OpenSM (Open Subnet Manager)
What Is OpenSM?
OpenSM is an open-source, lightweight Subnet Manager that implements the InfiniBand specification for fabric management. It runs as a daemon on any Linux node, embedded switch, or dedicated management server.
Key Characteristics
| Aspect | Details |
|---|---|
| Type | Command-line tool / daemon |
| License | Open source (GPL/BSD) |
| Complexity | Simple, minimal configuration |
| Interface | Command-line only, no GUI |
| Resource Usage | Very lightweight (low CPU/memory) |
| Cost | Free |
| Deployment | Any Linux node, switch, or dedicated server |
Where OpenSM Runs
- On a compute node — Any server with an InfiniBand adapter
- Embedded in a switch — Many IB switches ship with OpenSM built-in
- On a dedicated management server — A separate server solely for SM duties
Feature Set
What OpenSM does:
- ✅ Fabric discovery and initialization
- ✅ LID assignment and routing
- ✅ Partition (VLAN) management
- ✅ Basic high availability (master/standby)
- ✅ Multiple routing algorithms (MinHop, LASH, Fat-Tree, etc.)
What OpenSM does NOT do:
- ❌ No graphical user interface
- ❌ No performance monitoring or analytics
- ❌ No centralized logging or alerting
- ❌ No topology visualization
- ❌ No advanced diagnostics
- ❌ No job/workload scheduler integration
Configuration Files
/etc/rdma/opensm.conf # Main configuration file
/etc/sysconfig/opensm # Startup parameters (GUID, priority)
/etc/rdma/partitions.conf # Partition definitions
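A sketch of what these files might contain. All values below are illustrative, and exact variable names and paths can differ between distributions and OpenSM versions:

```text
# /etc/rdma/opensm.conf: select a routing engine (e.g. Fat-Tree for Clos fabrics)
routing_engine ftree

# /etc/sysconfig/opensm: bind the SM to a port GUID and set election priority
GUIDS="0x0002c90300001234"
PRIORITY=15

# /etc/rdma/partitions.conf: default partition plus an isolated partition
Default=0x7fff, ipoib, defmember=full : ALL, SELF=full;
storage=0x0002 : ALL;
```

The priority value (0-15, highest wins the master election) is what makes the basic master/standby HA mentioned above work: run OpenSM on two nodes with different priorities, and the lower-priority instance takes over if the master disappears.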
Best Suited For
- Small to medium InfiniBand clusters (< 100 nodes)
- Development and test environments
- Simple point-to-point or small switched fabrics
- Environments where minimal overhead and full control are priorities
- Budget-conscious deployments
UFM (Unified Fabric Manager)
What Is UFM?
UFM (Unified Fabric Manager) is NVIDIA's enterprise-grade fabric management platform. It is a comprehensive software suite that includes a full Subnet Manager plus extensive monitoring, analytics, diagnostics, and management capabilities — all accessible through a web interface and REST API.
Key Characteristics
| Aspect | Details |
|---|---|
| Type | Enterprise management platform with web UI |
| License | Commercial (requires NVIDIA license) |
| Complexity | Feature-rich, extensive configuration options |
| Interface | Web GUI + REST API + CLI |
| Resource Usage | Medium to high (requires dedicated server) |
| Cost | Commercial license required |
| Deployment | Dedicated server (physical or VM) |
UFM Architecture Components
- UFM Server — Core management engine with embedded SM
- UFM Web UI — Browser-based management interface
- UFM Agents (optional) — Installed on compute nodes for detailed telemetry
- Database — Stores historical data, events, and topology snapshots
Additional Capabilities (Beyond OpenSM)
Monitoring & Analytics
- ✅ Real-time performance monitoring (bandwidth, latency, errors)
- ✅ Historical data collection and trend analysis
- ✅ Per-port counters and statistics
- ✅ Cable health and temperature monitoring
Visualization & Topology
- ✅ Interactive fabric topology maps
- ✅ Color-coded port status visualization
- ✅ Rack and chassis views
Advanced Management
- ✅ Firmware update management
- ✅ Job-aware routing (Slurm, PBS, and other scheduler integration)
- ✅ Quality of Service (QoS) configuration
- ✅ Multi-fabric management from a single console
- ✅ Prometheus / Grafana exporters
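Much of this functionality is exposed over UFM's REST API. A hedged sketch: the endpoint path, credentials, and response shape below are assumptions based on typical UFM REST usage, not verified against a specific UFM release, and the JSON here is a hypothetical capture:

```shell
# On a live UFM server a port query might look roughly like:
#   curl -k -u admin:PASSWORD "https://ufm-server/ufmRest/resources/ports"
# Sketch: filter a captured (hypothetical) JSON response for ports reporting
# symbol errors, using only standard shell tools.
response='[{"name":"sw1/p1","symbol_errors":0},{"name":"sw1/p2","symbol_errors":12}]'
printf '%s\n' "$response" | tr '}' '\n' | grep '"symbol_errors":[1-9]'
```

In practice you would feed the real API response into a proper JSON tool (e.g. `jq`) or into the Prometheus/Grafana exporters listed above; the point is that UFM makes per-port counters queryable, which OpenSM does not.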
Best Suited For
- Large-scale HPC clusters (hundreds to thousands of nodes)
- Production data centers and AI/ML training clusters
- Environments requiring detailed monitoring and analytics
- Organizations with compliance or audit requirements (historical data retention)
Side-by-Side Feature Comparison
| Feature | OpenSM | UFM |
|---|---|---|
| Core SM Functionality | ✅ Yes | ✅ Yes |
| Cost | Free | Commercial |
| User Interface | CLI only | Web GUI + CLI + API |
| Topology Visualization | ❌ No | ✅ Yes |
| Performance Monitoring | ❌ No | ✅ Yes |
| Historical Data | ❌ No | ✅ Yes |
| Alerting | ❌ No | ✅ Yes |
| Cable Diagnostics | ❌ No | ✅ Yes |
| Job Scheduler Integration | ❌ No | ✅ Yes |
| Multi-Fabric Management | ❌ No | ✅ Yes |
| Resource Requirements | Very low | Medium–High |
| Setup Complexity | Simple | Complex |
| Suitable for Small Clusters | ✅ Yes | ⚠️ Overkill |
| Suitable for Large Clusters | ⚠️ Limited | ✅ Yes |
| High Availability | Basic | Advanced |
| Support | Community | NVIDIA Commercial |
How to Identify Which One You Have
Check if UFM is installed locally
rpm -qa | grep -i ufm
systemctl list-units | grep -i ufm
ps aux | grep ufm
Check if OpenSM is installed / running locally
rpm -qa | grep -i opensm
systemctl status opensm
ps aux | grep opensm
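If OpenSM is running locally, its log confirms whether fabric initialization completed: OpenSM writes a "SUBNET UP" message once the fabric is configured (default log path /var/log/opensm.log; the sample line below is illustrative, not real output):

```shell
# On a live node: grep "SUBNET UP" /var/log/opensm.log
# A captured sample line is parsed here for illustration.
logline='Mar 10 14:02:11 123456 [A8697700] 0x02 -> SUBNET UP'
printf '%s\n' "$logline" | grep -q 'SUBNET UP' && echo "fabric initialized"
```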
Identify the active SM on the fabric
# Query the SM (requires at least one Active port)
sminfo
# Discover the fabric and look for SM entries
ibnetdiscover | grep -i "sm"
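`sminfo` reports the master SM's LID, GUID, activity count, priority, and state. A sketch extracting the key fields; the sample line mimics typical `sminfo` output, but the exact field layout is an assumption and may vary across versions:

```shell
# On a live fabric, simply run: sminfo
# Here a captured sample line is parsed for illustration.
sample='sminfo: sm lid 3 sm guid 0x248a0703009c1e96, activity count 9876 priority 15 state 3 SMINFO_MASTER'
printf '%s\n' "$sample" | awk '{print "SM LID:", $4, "| state:", $NF}'
```

A state of SMINFO_MASTER tells you which node currently owns the fabric; the LID lets you map that back to a specific switch or server with `ibnetdiscover`.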
Common Deployment Patterns
| Pattern | Description | Typical Use Case |
|---|---|---|
| Switch-Embedded OpenSM | OpenSM runs inside the IB switch | Small–medium clusters; most common |
| Dedicated UFM Server | UFM runs on a standalone management server | Large HPC / AI clusters |
| Compute Node OpenSM | OpenSM runs on one of the compute nodes | Small clusters or dev environments |
| Multi-SM (HA) | Primary SM + standby SM with priority config | High-availability production environments |
Recommendations for Your Current Situation
Your symptom (State: Down even though Physical state: LinkUp) is a classic SM conflict symptom.
If you have a switch-embedded SM
- Log in to the switch management interface
- Check SM status: show sm
- Restart the SM service (NVIDIA/Mellanox switches):
sm stop
sm start
If you have a UFM server
- Contact the UFM administrator, or log in to the UFM server directly
- Restart the UFM service:
sudo systemctl restart ufmd
If you are unsure which SM you have
- Ask your cluster administrator or network team
- They should know what SM infrastructure is in place
- Run rpm -qa | grep -E "ufm|opensm" to check what is installed locally
Summary
| | OpenSM | UFM |
|---|---|---|
| In one sentence | Lightweight, free, basic SM functionality | Enterprise platform: SM + monitoring + management + analytics |
| Best for | Small clusters, dev/test environments | Large production clusters, AI/ML training |
| Cost | Free | Requires NVIDIA commercial license |
Both tools can manage your InfiniBand fabric, but UFM provides far greater visibility and control at scale. Your current problem is that the primary SM — wherever it runs — has stopped working correctly. The path to recovery is to either fix the remote SM (switch or UFM) or bring up a new SM instance once the root cause of the conflict has been resolved.