Optimizing System Memory Bandwidth with Micron CXL™ Memory Expansion Modules on Intel® Xeon® 6 Processors
Rohit Sehgal, Vishal Tanna (Micron Technology, San Jose, CA)
Vinicius Petrucci (Micron Technology, Austin, TX)
Anil Godbole (Intel Corporation, Santa Clara, CA)
Abstract— High-Performance Computing (HPC) and Artificial Intelligence (AI) workloads typically demand substantial memory bandwidth and, to a degree, memory capacity. CXL™ memory expansion modules, also known as CXL “type-3” devices, enable enhancements in both memory capacity and bandwidth for server systems by utilizing the CXL protocol, which runs over the PCIe interfaces of the processor. This paper discusses experimental findings on achieving increased memory bandwidth for HPC and AI workloads using Micron’s CXL modules. This is the first study that presents real-data experiments utilizing eight CXL E3.S (x8) Micron CZ122 devices on the Intel® Xeon® 6 processor 6900P (previously codenamed Granite Rapids AP) featuring 128 cores, alongside Micron DDR5 memory operating at 6400 MT/s on each of the CPU’s 12 DRAM channels. The eight CXL memories were set up as a unified NUMA configuration, employing a software-based page-level interleaving mechanism, available in Linux kernel v6.9+, between the DDR5 and CXL memory nodes to improve overall system bandwidth. Memory expansion via CXL boosts read-only bandwidth by 24% and mixed read/write bandwidth by up to 39%. Across HPC and AI workloads, the geometric mean of performance speedups is 24%.

Keywords—DDR5, CXL, HPC, software-interleaving, bandwidth, LLM inferencing, AI vector search
I. INTRODUCTION

High-performance and AI workloads encompass important computational tasks that demand substantial processing and memory resources. These workloads are frequently utilized in scientific research, simulations, and data-intensive applications, including computational fluid dynamics, weather forecasting, and DNA sequencing. Their memory bandwidth demands often exceed what can be supplied by the locally attached DRAM modules. The memory bandwidth expansion enabled by CXL is essential for enhancing the performance of HPC and AI workloads.

While CXL has primarily aimed at expanding memory capacity, its advantages for bandwidth-intensive workloads still need to be thoroughly explored and quantified in real CXL-capable systems, utilizing as many supported PCIe lanes as possible. In particular, the unique bandwidth characteristics of local DRAM and CXL memory can differ depending on the read/write ratio of workloads, creating challenges in optimizing the capabilities of each memory tier in terms of memory bandwidth. For this purpose, a software-based weighted interleaving method, available in the mainstream Linux kernel distribution, is employed for optimization.

II. PLATFORM CONFIGURATION

A. Intel Xeon 6 CPU System (Avenue City platform)

The 6900P CPU supports 6 x16 (96) PCIe 5.0 lanes. The lanes support CXL 2.0 Type-3 devices, allowing for memory expansion. The CPU supports any four x16 lanes to be used as CXL links.
As the focus of this paper is on demonstrating the effectiveness of increasing bandwidth rather than capacity, smaller memory modules were intentionally chosen for both the native DRAM (64 GB) and CXL (128 GB) modules.

The system configuration employed (Figure 1) facilitates the management of the various memory tiers by efficiently organizing and distinguishing between the locally attached DRAM and the CXL memory modules.

Traditionally, the Linux kernel has managed memory allocation across multiple NUMA (Non-Uniform Memory Access) nodes. Each of the memory types (either DRAM or CXL) is represented as a single NUMA node, allowing the system to use existing abstractions to manage and allocate memory across these two different pools.

Recently, NUMA nodes have been used to categorize memory into performance tiers, while existing allocation policies can place memory on specific NUMA nodes. For example, when brought up as system memory, CXL memory is treated as a separate NUMA node.

To showcase the advantages of using CXL memories, the system configuration is designed so that the local RDIMM slots are filled with the fastest available Micron RDIMMs, operating at 6400 MT/s per slot. All 12 available slots are populated, totaling 768 GB of memory capacity. As shown in Figure 2, eight Micron CZ122 128 GB CXL devices are utilized, occupying 64 PCIe lanes and providing a total additional memory capacity of 1 TB.

OS: Red Hat Enterprise Linux 9.4
Kernel: 6.11.6 (with support for weighted memory interleaving)

B. Memory Expansion with Micron CZ122 CXL modules

Micron's CZ122 CXL modules are currently in production and have demonstrated reliable performance across various workloads, effectively showcasing memory expansion over the CXL interface. The addition of these CXL modules enhances both the memory bandwidth and the capacity of the server, building on what is already provided by the RDIMM slots; that is, delivering memory bandwidth expansion.

Optimally placing newly allocated pages is a complex issue. NUMA interleaving, a traditional approach under Linux, evenly distributes pages across memory nodes for consistent performance. However, it lacks the ability to consider memory tier performance differences.

A recent series of patches has added weighted NUMA interleaving capabilities to the Linux kernel, allowing for more strategic memory allocation based on the performance characteristics of the different memory nodes in the system. This strategy optimizes system memory bandwidth by effectively utilizing the bandwidth of both the local DRAM and CXL memory nodes. The weighted-interleaving feature, introduced in Linux kernel version 6.9+ and influenced significantly by Micron's contributions, enables the adjustment of weights assigned to individual pages across various memory types, thereby enhancing overall memory bandwidth (as illustrated in Figure 2).
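As a concrete illustration of how this policy can be applied (this sketch is not taken from the study itself), the per-node weights exposed by the kernel's weighted-interleave sysfs interface can be written directly, and a workload can then be launched under the weighted-interleave policy. The node numbering (node 0 = local DRAM, node 1 = CXL) and the 3:1 weights below are illustrative, root privileges are required to write the sysfs files, and the numactl invocation assumes a numactl build that includes the --weighted-interleave option.

```python
import pathlib
import subprocess

# Per-node weights for the weighted-interleave policy (Linux v6.9+ sysfs interface).
# Node numbering is illustrative: node 0 = local DRAM, node 1 = CXL memory.
WEIGHTS = {0: 3, 1: 1}  # allocate 3 pages on DRAM for every 1 page on CXL

for node, weight in WEIGHTS.items():
    path = pathlib.Path(f"/sys/kernel/mm/mempolicy/weighted_interleave/node{node}")
    path.write_text(str(weight))

# Launch the workload with the weighted-interleave policy spanning nodes 0 and 1
# (assumes a numactl build with --weighted-interleave; "./my_workload" is a placeholder).
subprocess.run(["numactl", "--weighted-interleave=0,1", "./my_workload"], check=True)
```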
Figure 4. Bandwidth vs Latency curves using DRAM only vs DRAM + CXL. The interleaving weights are represented as pairs
(DRAM, CXL). It’s important to note that at low bandwidth, a greater number of pages (9) are allocated to DRAM compared to
CXL (1), as indicated by the weights (9,1). Conversely, under high load conditions, the optimal interleaving weights shift to (3,1).
III. NATIVE DRAM VS. CXL ATTACHED MEMORY PERFORMANCE CHARACTERISTICS

Before the performance analysis of the actual workloads is introduced, the performance characteristics of local DRAM and CXL memory regarding bandwidth at various read-to-write ratios of memory traffic will be presented and discussed.¹

Workload | Memory Tier | Bandwidth (GB/s) | Bandwidth (Normalized) | CXL over DRAM (theoretical gain with CXL)
Read only | DRAM | 556 | 1.00 | -
3R,1W | DRAM | 486 | 0.87 | -
2R,1W | DRAM | 474 | 0.85 | -
2R,1W (non-temporal W) | DRAM | 466 | 0.84 | -
1R,1W | DRAM | 446 | 0.80 | -
Read only | CXL | 205 | 1.00 | 37%
3R,1W | CXL | 214 | 1.04 | 44%
2R,1W | CXL | 208 | 1.01 | 44%
2R,1W (non-temporal W) | CXL | 189 | 0.92 | 41%
1R,1W | CXL | 214 | 1.04 | 48%

The performance data from the table above indicates that DRAM performs optimally in read-only workloads, but its performance diminishes when the number of writes is equal to or exceeds the number of reads. For instance, in a workload with a 1:1 read-to-write ratio, DRAM's bandwidth drops by 20% compared to a read-only scenario.

Conversely, CXL memory demonstrates the opposite trend due to the bidirectional nature of the PCIe interface, resulting in better performance for mixed read/write workloads. Another noteworthy observation is that CXL memory shows an 8% decrease in bandwidth during a non-temporal write workload. Therefore, it is crucial to analyze the read-to-write ratio of a workload to identify the optimal interleaving strategy for utilizing the DRAM and CXL memory tiers effectively.

As shown in Figure 4, memory latency is also reduced when using CXL. Workloads that rely solely on local DRAM can be bandwidth-limited, leading to significantly higher memory access latency (loaded latency) under heavy loads. In contrast, combining DRAM with CXL memory through optimized weighted interleaving results in lower latency, despite CXL memory having a higher unloaded latency.

At each data point on the "DRAM + CXL" curve, the interleave ratio of DRAM and CXL is displayed. Under low-bandwidth conditions, it is advantageous to utilize more DRAM due to its lower latency compared to CXL memory (9:1 ratio). However, as the load increases, the reliance on DRAM decreases while the emphasis shifts towards CXL memory. Ultimately, a 3:1 ratio was identified as optimal under maximum load conditions for read-only traffic.

When comparing the use of CXL memory alongside local DRAM, various performance improvements can be observed. For instance, in a read-only scenario (where DRAM excels), the addition of CXL memory bandwidth results in a 24% performance boost. The upcoming experiments will demonstrate that for mixed read/write workloads, the performance improvements with CXL, attributed to balanced memory interleaving, can reach as high as 38%. The following sections will also show that, for different workload mixes, the interleaving weights may need to be adjusted based on the read-to-write ratio of the workload.
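A back-of-the-envelope calculation (not part of the original text) shows where the 3:1 peak-load weights come from: distributing pages in proportion to the bandwidth each tier can deliver, using the read-only numbers from the table above, already suggests roughly a 3:1 DRAM:CXL split.

```python
# Read-only bandwidths from the table above (GB/s).
dram_bw = 556  # 12 channels of DDR5-6400
cxl_bw = 205   # aggregate of the eight CZ122 modules

# Heuristic: interleave pages in proportion to deliverable bandwidth,
# then round to the small integer weights the policy expects.
dram_weight = round(dram_bw / cxl_bw)  # 556 / 205 ~ 2.7 -> 3
cxl_weight = 1
print(f"suggested DRAM:CXL weights = {dram_weight}:{cxl_weight}")  # 3:1
```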
¹ Performance results are derived from testing in the specified configuration (Section II.A). Results may vary, so it is recommended to reconfirm them in your setting.
IV. WORKLOAD ANALYSIS
The AI workloads evaluated in this study take advantage of Intel® Advanced Vector Extensions 512 (Intel® AVX-512) Vector Neural Network Instructions (VNNI) and Intel® Advanced Matrix Extensions (Intel® AMX) on Intel CPUs.

With 128 physical cores, the CPU architecture provides specialized acceleration for AI operations, improving throughput and reducing latency in LLM inferencing and vector search workloads. The architecture supports matrix multiplication and efficiently handles models with billions of parameters.
LLM Inference - To run LLM inferencing on the Intel hardware, the open-source Intel® Extension for PyTorch (IPEX) was used. IPEX provides up-to-date optimizations for an extra performance boost on Intel hardware. The LLM model used was Meta-Llama3-8B-Instruct, with weights in the bfloat16 data type and a batch size of one. Using IPEX for inferencing, Llama3-8B-Instruct achieved a 17% speedup with a 3:1 DRAM-to-CXL ratio versus using DRAM-only memory.

Weight (DRAM) | Weight (CXL) | Output Token Latency (ms) | Speedup
1 | 0 | 42.91 | 1.00
2 | 1 | 40.43 | 1.06
5 | 2 | 37.54 | 1.14
3 | 1 | 36.83 | 1.17
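The listing below sketches how such a run can be set up; it is not the authors' exact script. It assumes the Hugging Face model id meta-llama/Meta-Llama-3-8B-Instruct, the transformers library, and IPEX's ipex.llm.optimize entry point from recent IPEX releases. The DRAM:CXL placement itself comes from launching the process under the weighted-interleave policy shown earlier, not from anything in the script.

```python
import torch
import intel_extension_for_pytorch as ipex
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Meta-Llama-3-8B-Instruct"  # assumed Hugging Face model id

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)
model.eval()

# Apply IPEX's LLM-specific optimizations for bfloat16 inference on Xeon (AMX/AVX-512 paths).
model = ipex.llm.optimize(model, dtype=torch.bfloat16)

# Batch size of one, as in the experiments reported above.
inputs = tokenizer("Explain CXL memory expansion in one sentence.", return_tensors="pt")
with torch.inference_mode():
    output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```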
FAISS (Vector Search) - FAISS [7] is a library developed by Facebook AI for efficient similarity search and clustering of dense vectors. The dataset used was Microsoft Turing-ANNS, a raw vector space of one billion points with 100 dimensions, using L2 distance and the k-NN method. As recommended by Meta [8], the index used was OPQ128_256-IVF65536_HNSW32-PQ128x4fsr. This is an optimized FAISS index configuration that specifies a series of transformations and indexing methods for efficient similarity search. Here is a breakdown of what each part means:
• OPQ128_256: Optimized Product Quantization rotates the vectors for efficient encoding (128 sub-spaces, 256 output dimensions).
• IVF65536: Inverted File Index with 65,536 clusters speeds up the search by dividing the vector space into clusters.
• HNSW32: Hierarchical Navigable Small World graph with 32 neighbors, a graph-based method for approximate nearest neighbor search.
• PQ128x4fsr: Product Quantization with 128 dimensions and 4-bit sub-quantizers for further optimization.

The configuration combines several advanced techniques to create an efficient and scalable index for similarity search in large datasets.

To report the final performance data, these parameters were configured: nprobe=4096 and efSearch=512. Both are crucial for balancing speed and accuracy in FAISS searches. A higher nprobe (number of clusters probed) increases accuracy but also search time. Similarly, efSearch (number of candidate nodes explored) enhances accuracy at the cost of search time. These values were optimized to achieve a high recall rate with minimal search time. The configuration resulted in a recall rate of 77% @ 10, meaning 77% of the true nearest neighbors are included in the top 10 results returned by the search algorithm.

Weight (DRAM) | Weight (CXL) | Time (ms/query) | Speedup
1 | 0 | 0.545 | 1.00
2 | 1 | 0.442 | 1.23
5 | 2 | 0.454 | 1.20

The FAISS workload demonstrated a 23% improvement with a DRAM-to-CXL ratio of 2:1.
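As an illustration (not the authors' code), the snippet below builds the same index pipeline with FAISS's index_factory, which separates the stages with commas rather than hyphens, and then applies the nprobe and efSearch values quoted above. The random vectors are a tiny stand-in so the sketch stays self-contained; the study trained and searched over the one-billion-point Turing-ANNS data.

```python
import faiss
import numpy as np

d = 100  # dimensionality of the Turing-ANNS vectors

# Same pipeline as OPQ128_256-IVF65536_HNSW32-PQ128x4fsr (factory syntax uses commas).
index = faiss.index_factory(d, "OPQ128_256,IVF65536_HNSW32,PQ128x4fsr")

# Tiny random stand-in data; the real experiments used the 1B-point dataset.
rng = np.random.default_rng(0)
xt = rng.random((200_000, d), dtype=np.float32)    # training vectors
xb = rng.random((1_000_000, d), dtype=np.float32)  # database vectors
index.train(xt)
index.add(xb)

# Search-time parameters reported above: nprobe on the IVF layer,
# efSearch on the HNSW coarse quantizer.
ivf = faiss.extract_index_ivf(index)
ivf.nprobe = 4096
faiss.downcast_index(ivf.quantizer).hnsw.efSearch = 512

xq = rng.random((10, d), dtype=np.float32)
distances, ids = index.search(xq, 10)  # top-10 neighbors; recall@10 is measured against ground truth
```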
C. HPC Workloads

The HPC workloads evaluated include OpenFOAM, HPCG, Xcompact3d, and POT3D. These workloads typically require higher memory bandwidth in addition to increased capacity.

OpenFOAM - OpenFOAM workload benchmarks are standardized test cases designed to evaluate the performance and scalability of hardware and software configurations when running OpenFOAM, an open-source computational fluid dynamics (CFD) software package. These benchmarks simulate various fluid dynamics scenarios to assess how efficiently different systems handle complex CFD computations. The OpenFOAM drivaerFastback case was used with an input of approximately 200 million cells. The results from the benchmark for different DRAM/CXL ratios are shown below:

Weight (DRAM) | Weight (CXL) | Execution time (s) | Speedup
1 | 0 | 254 | 1.00
2 | 1 | 212 | 1.20
5 | 2 | 209 | 1.22
3 | 1 | 210 | 1.21

The OpenFOAM workload exhibited a 22% improvement with a DRAM-to-CXL ratio of 5:2.

HPCG - The High-Performance Conjugate Gradients (HPCG) benchmark is a workload designed to assess supercomputing systems by solving a large, sparse linear system using a multigrid preconditioned conjugate gradient algorithm. Unlike the High Performance Linpack (HPL) benchmark, which focuses on dense matrix computations, HPCG emphasizes memory access patterns and data movement, reflecting the behavior of real-world scientific and engineering applications. By doing so, HPCG provides a more comprehensive measure of a system's capability to handle complex, memory-intensive workloads. The input used was x=192, y=192, z=192. Results are shown in the table below.

Weight (DRAM) | Weight (CXL) | Performance (GFlops/s) | Speedup
1 | 0 | 92 | 1.00
2 | 1 | 111 | 1.20
5 | 2 | 113 | 1.23
3 | 1 | 117 | 1.27
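Each row in these tables corresponds to a separate run with a different pair of sysfs weights. A generic sweep driver along the following lines can reproduce that methodology for any of the workloads above; it is illustrative only, with a placeholder benchmark command and node numbering, and it needs root privileges to write the sysfs files. Because the sysfs weights must be at least 1, the DRAM-only baseline (1:0) is handled by simply not applying the weighted-interleave policy.

```python
import pathlib
import subprocess
import time

SYSFS = pathlib.Path("/sys/kernel/mm/mempolicy/weighted_interleave")
BENCH = ["./run_benchmark.sh"]              # placeholder for the OpenFOAM / HPCG launch command
RATIOS = [(1, 0), (2, 1), (5, 2), (3, 1)]   # (DRAM weight, CXL weight) pairs from the tables

for dram_w, cxl_w in RATIOS:
    (SYSFS / "node0").write_text(str(dram_w))         # node 0 = local DRAM (illustrative numbering)
    (SYSFS / "node1").write_text(str(max(cxl_w, 1)))  # node 1 = CXL; weights below 1 are not accepted
    cmd = BENCH if cxl_w == 0 else ["numactl", "--weighted-interleave=0,1"] + BENCH
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    print(f"{dram_w}:{cxl_w} -> {time.perf_counter() - start:.1f} s")
```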
Figure 5. Summary of performance gains for the HPC and AI workloads running on DDR5-6400 (baseline) vs. DDR5-6400 + CXL.
Key takeaways from this study include:
• Significant improvements in system performance
with the combination of CXL based memory
expansion and native DDR5-6400 memory due to
bandwidth improvements.
• The optimization of the DRAM:CXL ratios as a
critical factor in achieving these performance gains.
• The potential for CXL technology to drastically
elevate the capabilities of high-performance
computing and artificial intelligence applications.
micron.com
©2024 Micron Technology, Inc. All rights reserved. All information herein is provided on an “AS IS” basis without warranties of any kind, including any implied
warranties, warranties of merchantability or warranties of fitness for a particular purpose. Micron, the Micron logo, and all other Micron trademarks are the property of
Micron Technology, Inc. Intel and Xeon are trademarks of Intel Corporation. All other trademarks are the property of their respective owners. No hardware, software or
system can provide absolute security and protection of data under all conditions. Micron assumes no liability for lost, stolen or corrupted data arising from the use of any
Micron product, including those products that incorporate any of the mentioned security features. Products are warranted only to meet Micron’s production data sheet
specifications. Products, programs and specifications are subject to change without notice. Rev. A 12/2024 CCM004-676576390-11778