Low-overhead General-purpose Near-Data Processing in CXL Memory Expanders

Ham, Hyungkyu; Hong, Jeongmin; Park, Geonwoo; Shin, Yunseon; Woo, Okkyun; Yang, Wonhyuk; Bae, Jinhoon; Park, Eunhyeok; Sung, Hyojin; Lim, Euicheol; Kim, Gwangsun

Computer Science > Hardware Architecture

arXiv:2404.19381 (cs)

[Submitted on 30 Apr 2024 (v1), last revised 23 Sep 2024 (this version, v3)]



Abstract: Emerging Compute Express Link (CXL) enables cost-efficient memory expansion beyond the local DRAM of processors. While its CXL.mem protocol incurs minimal latency overhead through an optimized protocol stack, frequent CXL memory accesses can still cause significant slowdowns for memory-bound applications, whether they are latency-sensitive or bandwidth-intensive. Near-data processing (NDP) in the CXL controller promises to overcome these limitations of passive CXL memory. However, prior work on NDP in CXL memory proposes application-specific units, which are not suitable for practical CXL memory-based systems that must support a variety of applications. On the other hand, existing CPU or GPU cores are not cost-effective for NDP because they are not optimized for memory-bound applications. In addition, the communication between the host processor and the CXL controller for NDP offloading should achieve low latency, but existing CXL.io/PCIe-based mechanisms incur $\mu$s-scale latency and are not suitable for fine-grained NDP.
To achieve high-performance NDP end-to-end, we propose a low-overhead general-purpose NDP architecture for CXL memory referred to as Memory-Mapped NDP (M$^2$NDP), which comprises memory-mapped functions (M$^2$func) and memory-mapped $\mu$threading (M$^2\mu$thread). M$^2$func is a CXL.mem-compatible low-overhead communication mechanism between the host processor and the NDP controller in CXL memory. M$^2\mu$thread enables a low-cost, general-purpose NDP unit design by introducing lightweight $\mu$threads that support highly concurrent execution of kernels with minimal resource wastage. Combining them, M$^2$NDP achieves significant speedups of up to 128x (14.5x overall) across various workloads and reduces energy by up to 87.9% (80.3% overall) compared to baseline CPU/GPU hosts with passive CXL memory.
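
The general idea behind a memory-mapped function interface can be illustrated with a toy software model. The sketch below is not the paper's design: the class `ToyExpander`, the constants `FUNC_BASE` and `FUNC_STRIDE`, and the slot layout are all hypothetical names chosen for illustration. It only shows how ordinary reads and writes to a reserved address range could, in principle, be interpreted by a device-side controller as function invocations, so that no separate command channel is needed.

```python
# Toy model only. Every name, address, and layout here is a hypothetical
# assumption for illustration; none of it is taken from the paper.
from typing import Callable, Dict, Optional

FUNC_BASE = 0x1000_0000  # assumed start of the reserved "function" region
FUNC_STRIDE = 0x40       # assumed spacing: one 64-byte slot per function


class ToyExpander:
    """Memory device whose controller treats some addresses as function calls."""

    def __init__(self) -> None:
        self.mem: Dict[int, int] = {}                     # ordinary data words
        self.funcs: Dict[int, Callable[[int], int]] = {}  # slot -> handler
        self.results: Dict[int, int] = {}                 # slot -> last result

    def register(self, slot: int, handler: Callable[[int], int]) -> None:
        """Bind a device-side handler to a function slot."""
        self.funcs[slot] = handler

    def _slot(self, addr: int) -> Optional[int]:
        """Return the registered slot an address maps to, or None for plain memory."""
        if addr < FUNC_BASE:
            return None
        slot, offset = divmod(addr - FUNC_BASE, FUNC_STRIDE)
        return slot if offset == 0 and slot in self.funcs else None

    def write(self, addr: int, value: int) -> None:
        """A write to a function slot runs the handler with `value` as its argument."""
        slot = self._slot(addr)
        if slot is None:
            self.mem[addr] = value
        else:
            self.results[slot] = self.funcs[slot](value)

    def read(self, addr: int) -> int:
        """A read from a function slot returns the handler's last result."""
        slot = self._slot(addr)
        if slot is None:
            return self.mem.get(addr, 0)
        return self.results.get(slot, 0)


dev = ToyExpander()
for i in range(8):
    dev.write(i * 8, i + 1)  # store the words 1..8 in ordinary memory

# Device-side "kernel": sum the first n words without moving them to the host.
dev.register(0, lambda n: sum(dev.read(i * 8) for i in range(n)))

dev.write(FUNC_BASE, 8)      # host "calls" function 0 with argument 8
total = dev.read(FUNC_BASE)  # host fetches the result with a plain read
```

In this model the host only ever issues plain loads and stores; the address decides whether the access touches data or triggers computation next to it. A real controller would also have to handle ordering, concurrency, and protection, which this sketch ignores.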

Comments:	Accepted at the 57th IEEE/ACM International Symposium on Microarchitecture (MICRO), 2024
Subjects:	Hardware Architecture (cs.AR)
Cite as:	arXiv:2404.19381 [cs.AR]
	(or arXiv:2404.19381v3 [cs.AR] for this version)
	https://doi.org/10.48550/arXiv.2404.19381

Submission history

From: Gwangsun Kim
[v1] Tue, 30 Apr 2024 09:14:12 UTC (990 KB)
[v2] Fri, 19 Jul 2024 08:12:24 UTC (1,401 KB)
[v3] Mon, 23 Sep 2024 08:38:27 UTC (1,479 KB)
