Deep Dive into Server Memory Architecture: From DRAM Granules to NUMA Modes

Introduction

In modern data centers, server performance is no longer solely determined by the number of CPU cores and their frequency; the architecture and configuration of the memory subsystem play an equally crucial role. From the familiar memory stick (DIMM) to the black granules soldered onto it (DRAM Chip), and further to the storage matrix composed of billions of transistors inside the chip, this is a precise and complex multi-level system. Understanding this system requires not only a high-level perspective (from the OS and BIOS) to distinguish between system-level memory modes like UMA, NUMA, and SNC, but also a low-level hardware perspective to explore the fundamental working principles of DRAM itself, such as how concepts like Bank, Rank, and Channel ultimately affect performance.

This article, based on our test summary and integrated with public technical documents and industry practices, builds a complete knowledge framework from "low-level physics" to "top-level architecture." We first delve into the internals of DRAM granules to understand their basic working principles, then connect that foundation to system-level memory access modes such as UMA, NUMA, SNC, and Hemisphere/Quadrant. Using the real-world CPU architectures of Intel Emerald Rapids (EMR) and Granite Rapids (GNR) as examples, we aim to give system architects and operations engineers a comprehensive, actionable tuning guide that spans both software and hardware.


1. The Physical World of Memory: From Memory Stick to Storage Cell

Before we discuss complex system architectures, we must first establish a solid physical foundation. What we commonly refer to as "memory" has a physical entity and organizational form far more complex than just "a large warehouse" [6].

1.1 Physical Hierarchy: Channel, DIMM, Rank, Chip, Bank

  • Memory Stick (DIMM): Short for Dual In-line Memory Module, this is the hardware entity we can see directly. It is a circuit board that carries multiple DRAM chips and connects to the memory slots on the motherboard via golden fingers.
  • Channel: A data pathway between the memory controller and a group of DIMMs. The CPU, through the memory controller, simultaneously directs all memory within a channel to work in concert. Modern server CPUs support multi-channel technology (e.g., dual-channel, quad-channel, eight-channel), which is equivalent to expanding the data transmission road from a single lane to multiple lanes, directly multiplying the total memory bandwidth [6].
  • Chip/Device: The black square granules soldered onto the memory stick, which are the basic units of DRAM. Each chip has its own bit width (e.g., x4, x8, x16), representing how many bits of data it can provide at one time.
  • Rank: To satisfy the CPU's data width (usually 64 bits), multiple chips need to be combined to work in parallel. All chips within a Rank share the same command/address bus and act together in the same clock cycle to provide 64 bits of data. For example, a 64-bit Rank can be formed using 8 x8 chips or 16 x4 chips [7]. A DIMM can have one or more Ranks.
  • Logical Bank (L-Bank): Inside a single DRAM chip, the storage matrix is further divided into multiple independent "small warehouses," known as logical Banks. Their core purpose is to achieve operational parallelism. While the memory controller is accessing L-Bank 0, it can simultaneously precharge L-Bank 1 and activate a row in L-Bank 2, thereby hiding operational latency and improving efficiency [6].
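The rank arithmetic above is easy to sketch in a few lines of code (illustrative only; the 64-bit width ignores the extra chips that server DIMMs add for ECC):

```python
# Sketch of the rank composition rule described above: enough chips must
# fire in parallel to fill the 64-bit data width the CPU expects.
# (Server DIMMs add extra chips for ECC; that is ignored here.)

RANK_WIDTH_BITS = 64  # data width of one rank, excluding ECC

def chips_per_rank(chip_width_bits: int) -> int:
    """How many DRAM chips of a given bit width form one 64-bit rank."""
    if RANK_WIDTH_BITS % chip_width_bits != 0:
        raise ValueError("chip width must evenly divide the rank width")
    return RANK_WIDTH_BITS // chip_width_bits

print(chips_per_rank(8))  # x8 chips: 8 per rank
print(chips_per_rank(4))  # x4 chips: 16 per rank
```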

1.2 Core Working Principles: Cell, Burst, and Prefetch

  • Storage Cell: The smallest unit of data storage in DRAM, its structure consists of one transistor and one capacitor (1T1C). The presence or absence of charge in the capacitor represents a logical "1" or "0". This structure determines two fundamental characteristics of DRAM: volatility (data is lost upon power-off) and the need for refresh (the capacitor leaks charge, so data must be periodically rewritten to maintain its state) [7].
  • Burst Transfer: The CPU does not read data from memory bit by bit, but rather reads a "data block" at a time. This process is called a burst transfer. The Burst Length (BL) defines how many consecutive data cycles are transferred after a single read/write command. For example, the BL for DDR4 is typically 8.
  • Prefetch: To bridge the growing speed gap between the CPU interface and the slower DRAM core, DRAM introduced prefetch technology. Prefetch is the amount of data that DRAM reads from the storage array and stages in a single internal operation, equal to the prefetch depth multiplied by the internal bus width. For example, DDR4 uses an 8n prefetch, meaning it reads 8 data units in one internal operation. This data is latched in the I/O buffer and then transferred out at high speed as a burst. A common misconception is to conflate prefetch with burst length: prefetch is the amount of data in one internal DRAM operation, while burst length is the number of consecutive cycles of the external transfer [6].
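The burst arithmetic above is easy to verify: on a 64-bit channel, a DDR4 BL8 burst delivers exactly one 64-byte cache line. A minimal worked sketch using standard DDR4 parameters:

```python
# Worked example of the prefetch-vs-burst distinction described above.
# Prefetch: data staged by one internal array access (per chip).
# Burst: external bus cycles that ship the staged data out.

CHANNEL_WIDTH_BITS = 64  # external data bus of one memory channel
BURST_LENGTH = 8         # DDR4 BL8
PREFETCH_DEPTH = 8       # DDR4 8n prefetch

# Bytes delivered to the CPU by one read command:
bytes_per_burst = BURST_LENGTH * CHANNEL_WIDTH_BITS // 8
print(bytes_per_burst)   # 64 bytes -> exactly one x86 cache line

# Per x8 chip, one internal operation stages prefetch_depth * chip_width bits:
bits_staged_per_x8_chip = PREFETCH_DEPTH * 8
print(bits_staged_per_x8_chip)  # 64 bits per chip per internal access
```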

2. UMA (Uniform Memory Access): The Simplified, Monolithic Model

UMA, or Uniform Memory Access, is a model where all physical memory is treated as a single, contiguous address space. In theory, the latency for any CPU core to access any memory address is the same. On modern multi-socket servers, forcing the system into UMA mode by explicitly setting NUMA=off in the BIOS causes the memory controllers of the two CPU sockets to interleave their addresses, presenting a single large memory pool to the operating system [1].

The core challenge lies in the limited bandwidth of the UPI (Ultra Path Interconnect) bus. When many cores concurrently perform cross-socket memory accesses, the UPI link quickly saturates and becomes a system-wide bottleneck. Our test data demonstrates this clearly: in UMA mode, global memory bandwidth dropped by 26% compared to NUMA mode, a direct consequence of UPI saturation. For modern multi-socket servers, disabling NUMA in favor of UMA mode is generally a performance trap [2].


3. NUMA (Non-Uniform Memory Access): The Cornerstone of Performance Optimization

The NUMA architecture acknowledges and leverages the physical reality that memory access costs differ based on the physical location of the memory. It is the standard configuration for modern multi-socket servers. In NUMA mode, each CPU socket and the memory sticks directly connected to it jointly form a NUMA Node. A dual-socket server has two NUMA nodes, which are fully visible to the operating system [1].

  • Local Access: A core accesses memory within its own node. The request is completed only through the local IMC, resulting in the shortest data path, lowest latency, and highest bandwidth.
  • Remote Access: A core accesses memory in another node. The request must travel through the local CHA (Caching and Home Agent) → UPI link → remote CHA → remote IMC → remote DRAM, and then return data along the same path in reverse.

Our test data clearly quantifies this difference: cross-node latency (450 ns) is roughly 3.5 times the local latency (130 ns), while cross-node bandwidth (250 GB/s) is only about 45% of local bandwidth (550 GB/s). This enormous performance gap is the core basis for NUMA tuning. Keeping NUMA enabled is the foremost principle for maximizing memory performance on multi-socket servers.
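A simple weighted-average model makes the cost of remote access concrete. Using the latency figures above (130 ns local, 450 ns remote), the expected latency degrades quickly as the remote fraction grows; the 50/50 case approximates UMA-style interleaving on a dual-socket system:

```python
# Illustrative model (not a measurement): expected memory latency as a
# function of the fraction of accesses that land on the remote NUMA node.

LOCAL_NS = 130.0   # local-access latency from the test data above
REMOTE_NS = 450.0  # cross-node latency from the test data above

def avg_latency_ns(remote_fraction: float) -> float:
    """Weighted average of local and remote access latency."""
    return (1.0 - remote_fraction) * LOCAL_NS + remote_fraction * REMOTE_NS

print(avg_latency_ns(0.0))   # 130.0 -> perfectly NUMA-local placement
print(avg_latency_ns(0.5))   # 290.0 -> roughly the UMA interleave case
```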


4. Die Topology and Cluster Mode Practical Analysis (SNC, Hemisphere, Quadrant)

SNC (Sub-NUMA Clustering), Hemisphere, and Quadrant are further subdivisions built on top of NUMA. Their core idea is to divide cores, LLC, and IMC into smaller clusters based on physical layout within a single CPU socket, in pursuit of lower "hyper-local" memory access latency. The availability of these modes is closely tied to the physical die topology of the CPU.

4.1 Core Concepts: SNC vs. UMA-Based Clustering (Hemisphere/Quadrant)

  • SNC (Sub-NUMA Clustering): This is an explicit cluster mode. After enabling SNC2 or SNC4, the clusters inside the CPU are exposed to the operating system as independent NUMA nodes. For example, on a dual-socket server with SNC2 enabled, the OS will see 4 NUMA nodes. This mode has the highest performance potential but strongly depends on applications performing precise core and memory binding via numactl; otherwise, performance may actually decrease rather than improve.
  • Hemisphere / Quadrant (UMA-Based Clustering): This is an implicit, software-transparent cluster mode. When enabled, the OS still sees each socket as a single NUMA node (the outward appearance of UMA), but the internal access path borrows SNC's locality idea. Each access proceeds in two steps:
    1. CHA Selection (UMA-like): On an L2 cache miss, the request is distributed to one of all CHAs within the socket via a hash function.
    2. IMC Routing (SNC-like): The selected CHA intelligently routes the request to the physically nearest IMC, rather than accessing memory controllers randomly [4] [5].

This mode is a "plug-and-play" optimization designed for non-NUMA-aware legacy applications, providing better performance than pure UMA without any code modification.
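The two-step flow above can be modeled as a toy routing function. The hash function and the CHA/IMC counts below are invented for illustration (Intel does not document the real hash); the point is only the structure: a global hash in step 1, a fixed locality map in step 2.

```python
# Toy model of the two-step Hemisphere/Quadrant flow described above.
# The hash and the CHA/IMC topology are hypothetical, for illustration only.

N_CHA = 8   # caching/home agents in the socket (assumed)
N_IMC = 4   # integrated memory controllers (assumed)

# Static map: each CHA always routes to its physically nearest IMC.
NEAREST_IMC = {cha: cha // (N_CHA // N_IMC) for cha in range(N_CHA)}

def route(phys_addr: int) -> tuple[int, int]:
    # Step 1 (UMA-like): hash the cache-line address across ALL CHAs.
    cha = (phys_addr >> 6) % N_CHA
    # Step 2 (SNC-like): the chosen CHA uses only its nearest IMC.
    imc = NEAREST_IMC[cha]
    return cha, imc

print(route(0x0000))  # cache line 0 -> CHA 0 -> IMC 0
print(route(0x01C0))  # cache line 7 -> CHA 7 -> IMC 3
```

Successive cache lines thus spread across every CHA in the socket (preserving the UMA appearance), while each individual request still lands on a nearby memory controller.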

4.2 Architecture Example 1: Intel EMR MCC (Emerald Rapids, Monolithic Die)

The MCC (Medium Core Count) SKU in the 5th Gen Xeon family, such as the Xeon Gold 6542Y, adopts a single monolithic die design with up to 32 cores and integrates 4 IMCs internally (corresponding to 8 DDR5-5600 channels) [8] [10]. Since all resources are on a single die, its cluster mode support is the most flexible.

  • Quadrant / SNC4: This is the finest-grained partitioning on the MCC die. The die is divided into 4 clusters, with each cluster allocated approximately 8 cores, 1/4 of the LLC, and 1 IMC (2 memory channels). Enabling SNC4 causes the OS to see 4 NUMA nodes; enabling Quadrant mode causes the OS to see only 1 NUMA node.
  • Hemisphere / SNC2: The die is divided into 2 clusters, with each cluster allocated approximately 16 cores, 1/2 of the LLC, and 2 IMCs (4 memory channels). Enabling SNC2 causes the OS to see 2 NUMA nodes; enabling Hemisphere mode causes the OS to see only 1 NUMA node.

4.3 Architecture Example 2: Intel EMR XCC (Emerald Rapids, Dual Tile)

The XCC (Extreme Core Count) SKU in the 5th Gen Xeon family, such as the Xeon Platinum 8592+, adopts a dual-tile package connected via an EMIB bridge. Each tile has up to 32 cores, its own 2 IMCs (4 memory channels), and 160 MB of LLC. The entire CPU has 64 cores, 4 IMCs (8 channels), and 320 MB of LLC [8] [10].

Because there are physically only two equal tiles, EMR XCC no longer supports the 4-partition Quadrant and SNC4 modes. This is a significant architectural change from the 4th Gen SPR XCC (4-tile design) to the 5th Gen EMR XCC (2-tile design) [10].

  • Hemisphere / SNC2: This is the primary cluster mode supported by EMR XCC. Each tile forms one cluster. In Hemisphere mode, the OS sees 1 NUMA node; in SNC2 mode, each tile becomes an independent NUMA node, and the OS sees 2 NUMA nodes (4 total in a dual-socket system).

4.4 Architecture Example 3: Intel GNR XCC (Granite Rapids, Dual Compute Die)

Xeon 6 P-core (Granite Rapids) adopts a more thorough chiplet design, dividing the CPU into Compute Dies (Intel 3 process) and IO Dies (Intel 7 process). The GNR XCC SKU consists of 2 Compute Dies and 2 IO Dies [9] [11].

Each Compute Die has up to approximately 60 cores and 4 IMCs (corresponding to 4 DDR5-6400 channels). The entire GNR XCC CPU has up to approximately 120 cores and 8 IMCs (8 channels) [11] [12]. Similar to EMR XCC, because there are physically two equal Compute Dies, GNR XCC also only supports Hemisphere and SNC2 modes, and does not support Quadrant and SNC4.

  • Hemisphere / SNC2: Each Compute Die forms one cluster. In Hemisphere mode, the OS sees 1 NUMA node; in SNC2 mode, each Compute Die becomes an independent NUMA node, and the OS sees 2 NUMA nodes (4 total in a dual-socket system).
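Across all three architecture examples, the OS-visible node counts follow one simple rule: sockets multiplied by SNC partitions per socket, with the UMA-based modes (Hemisphere/Quadrant) counting as 1 because they stay invisible to the OS. A trivial sketch:

```python
# The node counts quoted in sections 4.2-4.4 all reduce to this rule.

def visible_numa_nodes(sockets: int, partitions_per_socket: int) -> int:
    """OS-visible NUMA nodes; partitions_per_socket is 1 when SNC is off
    or a software-transparent mode (Hemisphere/Quadrant) is used."""
    return sockets * partitions_per_socket

print(visible_numa_nodes(2, 2))  # dual-socket SNC2 -> 4 nodes
print(visible_numa_nodes(1, 4))  # single-socket MCC with SNC4 -> 4 nodes
print(visible_numa_nodes(2, 1))  # dual-socket Hemisphere -> 2 nodes
```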

Architectural Evolution Insight: The transition from SPR's 4-tile design to EMR/GNR's 2-tile/2-die design simplified packaging but also limited the maximum number of cluster mode partitions. This indicates that while Intel pursues higher core density, it also makes trade-offs at the physical topology level, reserving the finest-grained partitioning capability (SNC4/Quadrant) for the monolithic die MCC SKU.


5. Comprehensive Comparison and Final Tuning Recommendations

The following table summarizes the memory modes across different architectures:

Feature | UMA (NUMA=off) | NUMA (SNC=off) | SNC (SNC=on) | Hemisphere/Quadrant
OS-Visible Nodes (per socket) | 1 | 1 | 2 (SNC2) / 4 (SNC4) | 1
CHA Selection | Global Interleave | Local-First | Intra-Cluster Interleave | Global Interleave
IMC Routing | Global Interleave | Local-First | Intra-Cluster Priority | Localized Routing
Best Scenario | Legacy single-socket apps | General-purpose multi-socket servers | Highly localized memory access (HPC) | Non-NUMA-aware applications
Core Trade-off | Simple but poor performance | Requires software cooperation | Ultimate local latency vs. global bandwidth | Transparent optimization vs. performance ceiling

Based on the above analysis, we update and finalize the tuning recommendations:

CPU Architecture | Application Type | Recommended BIOS Setting | Additional Action | Core Rationale
General | Large Database (OLTP/OLAP) | NUMA=on, SNC=disable | Bind instances to nodes | In random access patterns, stable global bandwidth matters more than ultimate local latency.
General | Virtualization / Container Cloud | NUMA=on, SNC=disable | Use NUMA-aware scheduler | Provides a stable, predictable global memory performance baseline for upper-layer tenants.
General | AI Training (Single-machine, Multi-GPU) | NUMA=on, SNC=disable | Ensure GPU-to-NUMA-node affinity | GPU-CPU data transfers are frequent and cross-domain; SNC partitioning may actually increase overhead.
EMR MCC | Legacy / Non-NUMA-aware Apps | Quadrant | No special action required | Leverages the 4-partition hardware advantage of the monolithic die to transparently reduce average latency.
EMR/GNR XCC | Legacy / Non-NUMA-aware Apps | Hemisphere | No special action required | The best software-transparent choice on XCC, matching the 2-tile/die physical partitioning.
EMR MCC | HPC (latency-critical) | NUMA=on, SNC4 | Strictly bind processes per NUMA domain | The monolithic die supports 4 partitions, enabling the finest-grained core binding to maximize locality benefits.
EMR/GNR XCC | HPC (latency-critical) | NUMA=on, SNC2 | Strictly bind processes per NUMA domain | The finest-grained NUMA partition on 2-tile/die parts; each tile/die becomes one node.
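The "bind instances/processes to nodes" actions in the table are usually done with `numactl --cpunodebind=N --membind=N`. From inside a program, the CPU half of that binding can be sketched in Python on Linux; note this pins CPUs only and does not bind memory allocation, for which numactl or libnuma remains necessary. The example core range is hypothetical:

```python
# Sketch of per-NUMA-domain CPU pinning from inside a process (Linux only).
# This pins CPUs but does NOT bind memory; use numactl/libnuma for that.

import os

def pin_to_cpus(cpus: set[int]) -> None:
    """Restrict the current process to the given logical CPU set."""
    os.sched_setaffinity(0, cpus)  # pid 0 = the calling process

# Example (hypothetical numbering -- check `lscpu -e` on the real machine):
# pin to the 8 cores of one SNC4 cluster on an EMR MCC part.
# pin_to_cpus(set(range(0, 8)))
```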

Final Principle: Any BIOS-level performance tuning must serve the upper-layer application workload. Before changing configurations, deeply understand the memory access patterns of the target application and validate quantitatively through rigorous benchmark testing (e.g., STREAM, Intel MLC, sysbench). There is no "one-size-fits-all best configuration," only the solution best suited to a specific workload.


References

  1. 酷毙的我啊. (2024). Linux NUMA Mechanism. CSDN.
  2. 酷毙的我啊. (2024). Huawei Server BIOS Performance Tuning Practical Guide. CSDN.
  3. Intel Corporation. (2022). Technical Overview Of The 4th Gen Intel® Xeon® Scalable processor family. Intel Developer Zone.
  4. Intel Corporation. (2022). Intel® Xeon® Scalable processor Max Series. Intel Developer Zone.
  5. Bloom, M. (2023). What's the difference between "Sub-NUMA Clustering" and "Hemisphere and Quadrant Modes" in Intel CPU? Stack Overflow.
  6. dapp9builder. (2024). From "Memory Stick" to "DRAM Granule": Do You Really Understand DRAM?. CSDN.
  7. OurPlay. (2024). DRAM Memory Architecture Full Analysis: A Hardware Engineer's Guide from Cell to Channel. CSDN.
  8. Paul Alcorn. (2023). Intel 'Emerald Rapids' 5th-Gen Xeon Platinum 8592+ Review. Tom's Hardware.
  9. Timothy Prickett Morgan. (2024). Intel Shoots "Granite Rapids" Xeon 6 Into The Datacenter. The Next Platform.
  10. Brad Bourque. (2024). Intel 5th Gen Xeon Processors Debut: Emerald Rapids Benchmarks. HotHardware.
  11. Patrick Kennedy. (2023). 5th Gen Intel Xeon Processors Emerald Rapids Resets Servers by Intel. ServeTheHome.
  12. Chips and Cheese. (2025). A Look into Intel Xeon 6's Memory Subsystem. Chips and Cheese.