(12) United States Patent — Malladi et al.
(10) Patent No.: US 11,461,263 B2
(45) Date of Patent: Oct. 4, 2022

(54) DISAGGREGATED MEMORY SERVER

(71) Applicant: SAMSUNG ELECTRONICS CO., LTD., Suwon-si (KR)

(72) Inventors: Krishna Teja Malladi, San Jose, CA (US); Byung Hee Choi, Fremont, CA (US); Andrew Chang, Los Altos, CA (US); Ehsan M. Najafabadi, San Jose, CA (US)

(73) Assignee: Samsung Electronics Co., Ltd., Suwon-si (KR)

(*) Notice: Subject to any disclaimer, the term of this patent is extended or adjusted under 35 U.S.C. 154(b) by 0 days.

(21) Appl. No.: 17/026,087

(22) Filed: Sep. 18, 2020

(65) Prior Publication Data: US 2021/0311646 A1, Oct. 7, 2021

Related U.S. Application Data
(60) Provisional application No. 63/031,508, filed on May 28, 2020; provisional application No. 63/031,509, (Continued)

(51) Int. Cl.: G06F 13/40 (2006.01); G06F 15/173 (2006.01); (Continued)

(52) U.S. Cl.: CPC ... G06F 13/4027 (2013.01); G06F 3/0604 (2013.01); G06F 3/067 (2013.01); (Continued)

(58) Field of Classification Search: CPC ... G06F 13/4027; G06F 13/1663; G06F 12/0802; G06F 12/0808; G06F 12/1045; (Continued)

(56) References Cited

U.S. PATENT DOCUMENTS
8,971,423 B1   3/2015   Fu et al.
9,235,519 B2   1/2016   Lih et al.
(Continued)

FOREIGN PATENT DOCUMENTS
EP 1235154 A2   8/2002
WO 2005/116839 A1   12/2005

OTHER PUBLICATIONS
U.S. Office Action dated Jun. 21, 2021, issued in U.S. Appl. No. 17/026,071 (12 pages).
(Continued)

Primary Examiner — Hiep T Nguyen
(74) Attorney, Agent, or Firm — Lewis Roca Rothgerber Christie LLP

(57) ABSTRACT
A system and method for managing memory resources. In some embodiments, the system includes a first memory server, a second memory server, and a server-linking switch connected to the first memory server and to the second memory server. The first memory server may include a cache-coherent switch and a first memory module. In some embodiments, the first memory module is connected to the cache-coherent switch, and the cache-coherent switch is connected to the server-linking switch.

20 Claims, 11 Drawing Sheets
[Front-page drawing: datacenter spine Ethernet, showing elements 130, 135, and 137]
(56) References Cited (Continued)

OTHER PUBLICATIONS
U.S. Advisory Action dated Dec. 9, 2021, issued in U.S. Appl. No. 17/026,071 (4 pages).
European Search Report for EP Application No. 21162578.5 dated Sep. 15, 2021, 13 pages.
Notice of Allowance for U.S. Appl. No. 17/026,074 dated Dec. 29, 2021, 10 pages.
Notice of Allowance for U.S. Appl. No. 17/026,074 dated Mar. 9, 2021, 10 pages.
Notice of Allowance for U.S. Appl. No. 17/026,082 dated Mar. 30, 2022, 7 pages.
Office Action for U.S. Appl. No. 17/026,071 dated Mar. 17, 2022, 14 pages.
Office Action for U.S. Appl. No. 17/026,082 dated Nov. 26, 2021, 13 pages.
Office Action for U.S. Appl. No. 17/246,448 dated May 13, 2022, 12 pages.
U.S. Notice of Allowance dated Jun. 28, 2022, issued in U.S. Appl. No. 17/026,074 (10 pages).
U.S. Office Action dated Aug. 3, 2022, issued in U.S. Appl. No. 17/026,082 (13 pages).

* cited by examiner
[FIG. 1B (Sheet 1 of 11): servers 105, each with a CPU 115, DDR4 memory 120, PCIe 5 links 125, 10 GbE NIC ports, memory modules 135, controllers 137, and elements 140 and 145]
[FIG. 1C (Sheet 3 of 11): servers 105 connected through a ToR Ethernet switch 110 to the datacenter spine Ethernet; each server with 10 GbE NIC ports, a CPU 115, PCIe 5 x48 links 125, DDR4 memory 120, an element 130 with controller 137, and memory modules 135]
[Sheet 4 of 11 (figure label not legible; presumably FIG. 1D): servers 105 with CPUs 115, DDR memory 120, links 125, elements 110 and 140, and memory modules 135]
[FIG. 1E (Sheet 5 of 11): servers 105 with CPUs 115, DDR4 memory 120, PCIe 5 x48 links 125, memory modules 135, and a server-linking switch 112]
[Sheet 6 of 11 (figure label not legible; presumably FIG. 1F): servers 105 with CPUs 115, DDR memory 120, PCIe 5 connector ports, links 125, element 140, memory modules 135, and a server-linking switch 112]
[FIG. 1G (Sheet 7 of 11): a disaggregated rack with a ToR server-linking switch 112, PCIe 5 connectors, an enhanced capability CXL switch 130 with controller 137, memory modules 135, and memory servers 150]
[FIG. 2A (Sheet 8 of 11): flow chart — a controller of a CXL memory module generates an RDMA request for additional remote aggregated memory, sends it directly via NICs and the ToR Ethernet switch (bypassing local and remote CPUs), and receives the RDMA response directly]
[FIG. 2B (Sheet 9 of 11): flow chart — a CPU transmits data or a workload request over Ethernet; the ToR Ethernet switch routes the request to the corresponding server, where it is received over port(s) of a 100 GbE-enabled NIC; CPUs (e.g., x86) receive and process the request together, using DDR and additional memory resources shared via CXL 2.0 (e.g., aggregated memory)]
[FIG. 2C (Sheet 10 of 11): flow chart — a CPU determines that it needs memory contents from a different server and sends a request via CXL; the request propagates through the local PCIe connector to the ToR PCIe/CXL switch, which transmits it to the PCIe connector of a second server on the rack; the second server's CPUs process the request using second DDR and memory resources shared via CXL 2.0, and transmit the result back through the respective PCIe connectors and the server-linking switch]
[FIG. 2D (Sheet 11 of 11): flow chart — a memory module controller processes a request using local memory, determines that it needs aggregated memory contents from a different server, and sends a request via CXL through the local PCIe connector and the ToR PCIe/CXL switch to a second server's PCIe connector, which provides access to the shared aggregated memory so the controller can retrieve the contents]
…the cache-coherent switch is configured to virtualize the first memory module and the second memory module. In some embodiments, the first memory module includes flash memory, and the cache-coherent switch is configured to provide a flash translation layer for the flash memory. In some embodiments, the first server includes an expansion socket adapter, connected to an expansion socket of the first server, the expansion socket adapter including: the cache-coherent switch; and a memory module socket, the first memory module being connected to the cache-coherent switch through the memory module socket. In some embodiments, the memory module socket includes an M.2 socket. In some embodiments: the cache-coherent switch is connected to the server-linking switch through a connector, and the connector is on the expansion socket adapter. According to an embodiment of the present invention, there is provided a method for performing remote direct memory access in a computing system, the computing system including: a first server, a second server, a third server, and a server-linking switch connected to the first server, to the second server, and to the third server, the first server including: a stored-program processing circuit, a cache-coherent switch, and a first memory module, the method including: receiving, by the server-linking switch, a first packet, from the second server, receiving, by the server-linking switch, a second packet, from the third server, and transmitting the first packet and the second packet to the first server. In some embodiments, the method further includes: receiving, by the cache-coherent switch, a straight remote direct memory access (RDMA) request, and sending, by the cache-coherent switch, a straight RDMA response. In some embodiments, the receiving of the straight RDMA request includes receiving the straight RDMA request through the server-linking switch. In some embodiments, the method further includes: receiving, by the cache-coherent switch, a read command, from the stored-program processing circuit, for a first memory address, translating, by the cache-coherent switch, the first memory address to a second memory address, and retrieving, by the cache-coherent switch, data from the first memory module at the second memory address. According to an embodiment of the present invention, there is provided a system, including: a first server, including: a stored-program processing circuit, …

…from different memory servers 150, and it may group packets, reducing control overhead. The enhanced capability CXL switch 130 may include composable hardware building blocks to (i) route data to different memory types based on workload, and (ii) virtualize processor-side addresses (translating such addresses to memory-side addresses). The system illustrated in FIG. 1G may be CXL 2.0 based, it may include composable and disaggregated shared memory within a rack, and it may use the ToR server-linking switch 112 to provide pooled (i.e., aggregated) memory to remote devices.

The ToR server-linking switch 112 may have an additional network connection (e.g., an Ethernet connection, as illustrated, or another kind of connection, e.g., a wireless connection such as a WiFi connection or a 5G connection) for making connections to other servers or to clients. The server-linking switch 112 and the enhanced capability CXL switch 130 may each include a controller, which may be or include a processing circuit such as an ARM processor. The PCIe interfaces may comply with the PCIe 5.0 standard or with an earlier version, or with a future version of the PCIe standard, or a different standard (e.g., NVDIMM-P, CCIX, or OpenCAPI) may be employed instead of PCIe. The memory modules 135 may include various memory types, including DDR4 DRAM, HBM, LPDDR, NAND flash, and solid state drives (SSDs). The memory modules 135 may be partitioned or contain cache controllers to handle multiple memory types, and they may be in different form factors, such as HHHL, FHHL, M.2, U.2, mezzanine card, daughter card, E1.S, E1.L, E3.L, or E3.S.

In the embodiment of FIG. 1G, the enhanced capability CXL switch 130 may enable one-to-many and many-to-one switching, and it may enable a fine grain load-store interface at the flit (64-byte) level. Each memory server 150 may have aggregated memory devices, each device being partitioned into multiple logical devices, each with a respective LD-ID. The enhanced capability CXL switch 130 may include a controller 137 (e.g., an ASIC or an FPGA), and a circuit (which may be separate from, or part of, such an ASIC or FPGA) for device discovery, enumeration, partitioning, and presenting physical address ranges. Each of the memory modules 135 may have one physical function (PF) and as many as 16 isolated logical devices.
In some embodiments, the number of logical devices (e.g., the number of partitions) may be limited (e.g., to 16), and one control partition (which may be a physical function used for controlling the device) may also be present. Each of the memory modules 135 may be a Type 2 device with cxl.cache, cxl.mem and cxl.io, and an address translation service (ATS) implementation to deal with cache line copies that the processing circuits 115 may hold.

The enhanced capability CXL switch 130 and a fabric manager may control discovery of the memory modules 135 and (i) perform device discovery and virtual CXL software creation, and (ii) bind virtual to physical ports. As in the embodiments of FIGS. 1A-1D, the fabric manager may operate through connections over an SMBus sideband. An interface to the memory modules 135, which may be an Intelligent Platform Management Interface (IPMI) or an interface that complies with the Redfish standard (and that may also provide additional features not required by the standard), may enable configurability.

Building blocks, for the embodiment of FIG. 1G, may include (as mentioned above) a CXL controller 137 implemented on an FPGA or on an ASIC, switching to enable aggregating of memory devices (e.g., of the memory modules 135), SSDs, accelerators (GPUs, NICs), CXL and PCIe 5 connectors, and firmware to expose device details to the advanced configuration and power interface (ACPI) tables of the operating system, such as the heterogeneous memory attribute table (HMAT) or the static resource affinity table (SRAT).

In some embodiments, the system provides composability. The system may provide an ability to online and offline CXL devices and other accelerators based on the software configuration, and it may be capable of grouping accelerator, memory, and storage device resources and rationing them to each memory server 150 in the rack. The system may hide the physical address space and provide transparent caching using faster devices like HBM and SRAM.

In the embodiment of FIG. 1G, the controller 137 of the enhanced capability CXL switch 130 may (i) manage the memory modules 135, (ii) integrate and control heterogeneous devices such as NICs, SSDs, GPUs, and DRAM, and (iii) effect dynamic reconfiguration of storage to memory devices by power-gating. For example, the ToR server-linking switch 112 may disable power (i.e., shut off power, or reduce power) to one of the memory modules 135 (by instructing the enhanced capability CXL switch 130 to disable power to the memory module 135). The enhanced capability CXL switch 130 may then disable power to the memory module 135, upon being instructed, by the server-linking switch 112, to disable power to the memory module. Such disabling may conserve power, and it may improve the performance (e.g., the throughput and latency) of other memory modules 135 in the memory server 150. Each remote server 105 may see a different logical view of memory modules 135 and their connections, based on negotiation. The controller 137 of the enhanced capability CXL switch 130 may maintain state so that each remote server maintains allotted resources and connections, and it may perform compression or deduplication of memory to save memory capacity (using a configurable chunk size). The disaggregated rack of FIG. 1G may have its own BMC. It also may expose an IPMI network interface and a system event log (SEL) to remote devices, enabling the master (e.g., a remote server using storage provided by the memory servers 150) to measure performance and reliability on the fly, and to reconfigure the disaggregated rack. The disaggregated rack of FIG. 1G may provide reliability, replication, consistency, system coherence, hashing, caching, and persistence, in a manner analogous to that described herein for the embodiment of FIG. 1E, with, e.g., coherence being provided with multiple remote servers reading from or writing to the same memory address, and with each remote server being configured with a different consistency level. In some embodiments, the server-linking switch maintains eventual consistency between data stored on a first memory server and data stored on a second memory server. The server-linking switch 112 may maintain different consistency levels for different pairs of servers; for example, the server-linking switch may also maintain, between data stored on the first memory server and data stored on a third memory server, a consistency level that is strict consistency, sequential consistency, causal consistency, or processor consistency. The system may employ communications in "local-band" (the server-linking switch 112) and "global-band" (disaggregated server) domains. Writes may be flushed to the "global band" to be visible to new reads from other servers. The controller 137 of the enhanced capability CXL switch 130 may manage persistent domains and flushes separately for each remote server. For example, the cache-coherent switch may monitor a fullness of a first region of memory (volatile memory, operating as a cache), and, when the fullness level exceeds a threshold, the cache-coherent switch may move data from the first region of memory to a second region of memory, the second region of memory being in persistent memory. Flow control may be handled in that priorities may be established, by the controller 137 of the enhanced capability CXL switch 130, among remote servers, to present different perceived latencies and bandwidths.

According to an embodiment of the present invention, there is provided a system, including: a first memory server, including: a cache-coherent switch, and a first memory module; and a second memory server; and a server-linking switch connected to the first memory server and to the second memory server, wherein: the first memory module is connected to the cache-coherent switch, and the cache-coherent switch is connected to the server-linking switch. In some embodiments, the server-linking switch is configured to disable power to the first memory module. In some embodiments: the server-linking switch is configured to disable power to the first memory module by instructing the cache-coherent switch to disable power to the first memory module, and the cache-coherent switch is configured to disable power to the first memory module, upon being instructed, by the server-linking switch, to disable power to the first memory module. In some embodiments, the cache-coherent switch is configured to perform deduplication within the first memory module. In some embodiments, the cache-coherent switch is configured to compress data and to store compressed data in the first memory module. In some embodiments, the server-linking switch is configured to query a status of the first memory server. In some embodiments, the server-linking switch is configured to query a status of the first memory server through an Intelligent Platform Management Interface (IPMI). In some embodiments, the querying of a status includes querying a status selected from the group consisting of a power status, a network status, and an error check status. In some embodiments, the server-linking switch is configured to batch cache requests directed to the first memory server. In some embodiments, the system further includes a third memory server connected to the server-linking switch, wherein the server-linking switch is configured to maintain, between data stored on the first memory server and data stored on the third memory server, a consistency level selected from the group consisting of strict consistency, sequential consistency, causal consistency, and processor consistency.
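The fullness-monitoring behavior described above (a volatile region operating as a cache, with data moved to a persistent region when a threshold is exceeded) can be sketched as follows. This is an illustrative sketch only; the 0.8 threshold, the class name, and the dict-based regions are assumptions for the example:

```python
# Sketch of threshold-triggered migration from a volatile cache region
# to a persistent region, as described for the cache-coherent switch.
# The 0.8 threshold and dict-based regions are illustrative assumptions.

class TieredMemory:
    def __init__(self, cache_capacity, threshold=0.8):
        self.cache_capacity = cache_capacity
        self.threshold = threshold
        self.cache = {}       # first region: volatile memory, operating as a cache
        self.persistent = {}  # second region: persistent memory

    def fullness(self):
        return len(self.cache) / self.cache_capacity

    def write(self, addr, value):
        self.cache[addr] = value
        if self.fullness() > self.threshold:
            self.flush()

    def flush(self):
        # move data from the first (volatile) region to the second (persistent) region
        self.persistent.update(self.cache)
        self.cache.clear()

mem = TieredMemory(cache_capacity=10)
for a in range(9):
    mem.write(a, a * a)
# the 9th write raises fullness to 0.9 > 0.8, so the cache region is flushed
```

A hardware implementation would migrate at cache-line or page granularity and track dirtiness; the sketch models only the threshold trigger.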
In some embodiments, the cache-coherent switch is configured to: monitor a fullness of a first region of memory, and move data from the first region of memory to a second region of memory, wherein: the first region of memory is in volatile memory, and the second region of memory is in persistent memory. In some embodiments, the server-linking switch includes a Peripheral Component Interconnect Express (PCIe) switch. In some embodiments, the server-linking switch includes a Compute Express Link (CXL) switch. In some embodiments, the server-linking switch includes a top of rack (ToR) CXL switch. In some embodiments, the server-linking switch is configured to transmit data from the second memory server to the first memory server, and to perform flow control on the data. In some embodiments, the system further includes a third memory server connected to the server-linking switch, wherein: the server-linking switch is configured to: receive a first packet, from the second memory server, receive a second packet, from the third memory server, and transmit the first packet and the second packet to the first memory server. According to an embodiment of the present invention, there is provided a method for performing remote direct memory access in a computing system, the computing system including: a first memory server; a first server; a second server; and a server-linking switch connected to the first memory server, to the first server, and to the second server, the first memory server including: a cache-coherent switch, and a first memory module; the first server including: a stored-program processing circuit; the second server including: a stored-program processing circuit; the method including: receiving, by the server-linking switch, a first packet, from the first server; receiving, by the server-linking switch, a second packet, from the second server; and transmitting the first packet and the second packet to the first memory server. In some embodiments, the method further includes: compressing data, by the cache-coherent switch, and storing the data in the first memory module. In some embodiments, the method further includes: querying, by the server-linking switch, a status of the first memory server. According to an embodiment of the present invention, there is provided a system, including: a first memory server, including: a cache-coherent switch, and a first memory module; and a second memory server; and server-linking switching means connected to the first memory server and to the second memory server, wherein: the first memory module is connected to the cache-coherent switch, and the cache-coherent switch is connected to the server-linking switching means.

FIGS. 2A-2D are flow charts for various embodiments. In the embodiments of these flow charts, the processing circuits 115 are CPUs; in other embodiments they may be other processing circuits (e.g., GPUs). Referring to FIG. 2A, the controller 137 of a memory module 135 of the embodiment of FIGS. 1A and 1B, or the enhanced capability CXL switch 130 of any of the embodiments of FIGS. 1C-1G, may virtualize across the processing circuit 115 and initiate an RDMA request on an enhanced capability CXL switch 130 in another server 105, to move data back and forth between servers 105, without involving a processing circuit 115 in either server (with the virtualization being handled by the controllers 137 of the enhanced capability CXL switches 130). For example, at 205, the controller 137 of the memory module 135, or the enhanced capability CXL switch 130, generates an RDMA request for additional remote memory (e.g., CXL memory or aggregated memory); at 210, the network interface circuits 125 transmit the request to the ToR Ethernet switch 110 (which may have an RDMA interface), bypassing processing circuits; at 215, the ToR Ethernet switch 110 routes the RDMA request to the remote server 105 for processing by the controller 137 of a memory module 135, or by a remote enhanced capability CXL switch 130, via RDMA access to remote aggregated memory, bypassing the remote processing circuit 115; at 220, the ToR Ethernet switch 110 receives the processed data and routes the data to the local memory module 135, or to the local enhanced capability CXL switch 130, bypassing the local processing circuits 115 via RDMA; and, at 222, the controller 137 of a memory module 135 of the embodiment of FIGS. 1A and 1B, or the enhanced capability CXL switch 130, receives the RDMA response straightly (e.g., without its being forwarded by the processing circuits 115).

In such an embodiment, the controller 137 of the remote memory module 135, or the enhanced capability CXL switch 130 of the remote server 105, is configured to receive straight remote direct memory access (RDMA) requests and to send straight RDMA responses. As used herein, the controller 137 of the remote memory module 135 receiving, or the enhanced capability CXL switch 130 receiving, "straight RDMA requests" (or receiving such requests "straightly") means receiving such requests, by the controller 137 of the remote memory module 135, or by the enhanced capability CXL switch 130, without their being forwarded or otherwise processed by a processing circuit 115 of the remote server; and sending "straight RDMA responses" (or sending such responses "straightly") means sending such responses, by the controller 137 of the remote memory module 135, or by the enhanced capability CXL switch 130, without their being forwarded or otherwise processed by a processing circuit 115 of the remote server.

Referring to FIG. 2B, in another embodiment, RDMA may be performed with the processing circuit of the remote server being involved in the handling of the data. For example, at 225, a processing circuit 115 may transmit data or a workload request over Ethernet; at 230, the ToR Ethernet switch 110 may receive the request and route it to the corresponding server 105 of the plurality of servers 105; at 235, the request may be received, within the server, over port(s) of the network interface circuits 125 (e.g., a 100 GbE-enabled NIC); at 240, the processing circuits 115 (e.g., x86 processing circuits) may receive the request from the network interface circuits 125; and, at 245, the processing circuits 115 may process the request (e.g., together), using DDR and additional memory resources via the CXL 2.0 protocol to share the memory (which, in the embodiment of FIGS. 1A and 1B, may be aggregated memory).

Referring to FIG. 2C, in the embodiment of FIG. 1E, RDMA may be performed with the processing circuit of the remote server being involved in the handling of the data. For example, at 225, a processing circuit 115 may transmit data or a workload request over Ethernet or PCIe; at 230, the ToR Ethernet switch 110 may receive the request and route it to the corresponding server 105 of the plurality of servers 105; at 235, the request may be received, within the server, over port(s) of the PCIe connector; at 240, the processing circuits 115 (e.g., x86 processing circuits) may receive the request from the network interface circuits 125; and, at 245, the processing circuits 115 may process the request (e.g., together), using DDR and additional memory resources via the CXL 2.0 protocol to share the memory (which, in the embodiment of FIGS. 1A and 1B, may be aggregated memory). At 250, the processing circuit 115 may identify a requirement to access memory contents (e.g., DDR or aggregated memory contents) from a different server; at 252, the processing circuit 115 may send the request for said memory contents (e.g., DDR or aggregated memory contents) from the different server, via a CXL protocol (e.g., CXL 1.1 or CXL 2.0); at 254, the request propagates through the local PCIe connector to the server-linking switch 112, which then transmits the request to a second PCIe connector of a second server on the rack; at 256, the second processing circuits 115 (e.g., x86 processing circuits) receive the request from the second PCIe connector; at 258, the second processing circuits 115 may process the request (e.g., retrieval of memory contents) together, using second DDR and second additional memory resources via the CXL 2.0 protocol to share the aggregated memory; and, at 260, the second processing circuits (e.g., x86 processing circuits) transmit the result of the request back to the original processing circuits via the respective PCIe connectors and through the server-linking switch 112.
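The cross-server request chain of FIG. 2C (steps 250 through 260) can be modeled as a chain of forwarding objects. This is a toy sketch only: the class names are invented, and only the routing hops are modeled, not the actual CXL or PCIe protocols:

```python
# Toy model of the FIG. 2C path: a processing circuit's request for remote
# memory contents crosses the local PCIe connector, the server-linking
# switch 112, and a second server's PCIe connector; the result returns
# along the same chain. Class names are invented; only routing is modeled.

class SecondServer:
    """Holds the second server's DDR contents (step 258: retrieval)."""
    def __init__(self, ddr):
        self.ddr = ddr

    def handle(self, addr):
        return self.ddr[addr]

class ServerLinkingSwitch:
    """Step 254: forwards the request to the second server on the rack."""
    def __init__(self):
        self.servers = {}

    def transmit(self, server_id, addr):
        result = self.servers[server_id].handle(addr)
        # step 260: the result is sent back through the switch
        return result

switch = ServerLinkingSwitch()
switch.servers["second"] = SecondServer({0x40: "contents"})

# steps 250/252: the local processing circuit requests remote memory contents
result = switch.transmit("second", 0x40)
# result == "contents"
```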
Referring to FIG. 2D, in the embodiment of FIG. 1G, RDMA may be performed with the processing circuit of the remote server being involved in the handling of the data. For example, at 225, a processing circuit 115 may transmit data or a workload request over Ethernet; at 230, the ToR Ethernet switch 110 may receive the request and route it to the corresponding server 105 of the plurality of servers 105; at 235, the request may be received, within the server, over port(s) of the network interface circuits 125 (e.g., 100 GbE-enabled NICs). At 262, a memory module 135 receives the request from the PCIe connector; at 264, the controller of the memory module 135 processes the request, using local memory; at 250, the controller of the memory module 135 identifies a requirement to access memory contents (e.g., aggregated memory contents) from a different server; at 252, the controller of the memory module 135 sends a request for said memory contents (e.g., aggregated memory contents) from the different server, via the CXL protocol; at 254, the request propagates through the local PCIe connector to the server-linking switch 112, which then transmits the request to a second PCIe connector of a second server on the rack; and, at 266, the second PCIe connector provides access, via the CXL protocol, to the shared aggregated memory, to allow the controller of the memory module 135 to retrieve the memory contents.

As used herein, a "server" is a computing system including at least one stored-program processing circuit (e.g., a processing circuit 115), at least one memory resource (e.g., a system memory 120), and at least one circuit for providing network connectivity (e.g., a network interface circuit 125). As used herein, "a portion of" something means "at least some of" the thing, and as such may mean less than all of, the thing.

Processing circuit hardware may include, for example, application specific integrated circuits (ASICs), general purpose or special purpose central processing units (CPUs), digital signal processors (DSPs), graphics processing units (GPUs), and programmable logic devices such as field programmable gate arrays (FPGAs). In a processing circuit, as used herein, each function is performed either by hardware configured, i.e., hard-wired, to perform that function, or by more general purpose hardware, such as a CPU, configured to execute instructions stored in a non-transitory storage medium. A processing circuit may be fabricated on a single printed circuit board (PCB) or distributed over several interconnected PCBs. A processing circuit may contain other processing circuits; for example, a processing circuit may include two processing circuits, an FPGA and a CPU, interconnected on a PCB.

As used herein, a "controller" includes a circuit, and a controller may also be referred to as a "control circuit" or a "controller circuit". Similarly, a "memory module" may also be referred to as a "memory module circuit" or as a "memory circuit". As used herein, the term "array" refers to an ordered set of numbers regardless of how stored (e.g., whether stored in consecutive memory locations, or in a linked list). As used herein, when a second number is "within Y%" of a first number, it means that the second number is at least (1 - Y/100) times the first number and the second number is at most (1 + Y/100) times the first number. As used herein, the term "or" should be interpreted as "and/or", such that, for example, "A or B" means any one of "A" or "B" or "A and B".

As used herein, when a method (e.g., an adjustment) or a first quantity (e.g., a first variable) is referred to as being "based on" a second quantity (e.g., a second variable), it means that the second quantity is an input to the method or influences the first quantity, e.g., the second quantity may be an input (e.g., the only input, or one of several inputs) to a function that calculates the first quantity, or the first quantity may be equal to the second quantity, or the first quantity may be the same as (e.g., stored at the same location or locations in memory as) the second quantity.

It will be understood that, although the terms "first", "second", "third", etc., may be used herein to describe various elements, components, regions, layers and/or sections, these elements, components, regions, layers and/or sections should not be limited by these terms. These terms are only used to distinguish one element, component, region, layer or section from another element, component, region, layer or section. Thus, a first element, component, region, layer or section discussed herein could be termed a second element, component, region, layer or section, without departing from the spirit and scope of the inventive concept.
or all of, the thing. As such, “ a portion of a thing includes Spatially relative terms, such as “ beneath ” , “ below ” ,
the entire thing as a special case , i.e. , the entire thing is an " lower ” , “ under ” , “ above” , “ upper ” and the like , may be
example of a portion of the thing . used herein for ease of description to describe one element
The background provided in the Background section of 55 or feature's relationship to another element (s ) or feature ( s)
the present disclosure section is included only to set context, as illustrated in the figures. It will be understood that such
and the content of this section is not admitted to be prior art . spatially relative terms are intended to encompass different
Any of the components or any combination of the compo- orientations of the device in use or in operation, in addition
nents described ( e.g. , in any system diagrams included to the orientation depicted in the figures. For example, if the
herein ) may be used to perform one or more of the opera- 60 device in the figures is turned over, elements described as
tions of any flow chart included herein . Further, (i ) the “ below ” or “ beneath ” or “ under ” other elements or features
operations are example operations, and may involve various would then be oriented " above ” the other elements or
additional steps not explicitly covered , and ( ii ) the temporal features . Thus, the example terms “ below ” and “ under ” can
order of the operations may be varied . encompass both an orientation of above and below . The
The term “ processing circuit ” or “ controller means ” is 65 device may be otherwise oriented ( e.g. , rotated 90 degrees or
used herein to mean any combination of hardware , firmware , at other orientations) and the spatially relative descriptors
and software , employed to process data or digital signals. used herein should be interpreted accordingly. In addition, it
US 11,461,263 B2
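The FIG. 2D request flow described above can be sketched as a small Python model. This is an illustrative sketch only, not part of the disclosed embodiments: the class and function names (MemoryModule, ServerLinkingSwitch, handle_request) are hypothetical, and the real system operates over PCIe/CXL hardware rather than Python dictionaries.

```python
# Hypothetical sketch of the FIG. 2D flow: a memory module's controller serves
# a request from local memory when it can (steps 262/264); otherwise it
# identifies a remote-access requirement (step 250), sends a CXL request
# (step 252), and the server-linking switch propagates it to a second server,
# whose PCIe connector shares the aggregated memory contents (steps 254/266).

class MemoryModule:
    """Models a memory module 135: local memory plus a simple controller."""
    def __init__(self, contents):
        self.local = dict(contents)

class ServerLinkingSwitch:
    """Models the server-linking switch 112 connecting servers on the rack."""
    def __init__(self):
        self.modules = {}  # server name -> attached memory module

    def attach(self, server_name, module):
        self.modules[server_name] = module

    def fetch(self, server_name, key):
        # Steps 254/266: the request reaches the second server's PCIe
        # connector, which provides access to the aggregated memory.
        return self.modules[server_name].local[key]

def handle_request(module, switch, key, remote_server):
    """Controller-side handling of a request arriving over the PCIe connector."""
    if key in module.local:                   # step 264: serve from local memory
        return module.local[key]
    value = switch.fetch(remote_server, key)  # steps 250/252: remote access via CXL
    module.local[key] = value                 # keep a local copy (optional)
    return value

switch = ServerLinkingSwitch()
server_a = MemoryModule({"page0": "local data"})
server_b = MemoryModule({"page7": "aggregated data"})
switch.attach("server-a", server_a)
switch.attach("server-b", server_b)

print(handle_request(server_a, switch, "page0", "server-b"))  # served locally
print(handle_request(server_a, switch, "page7", "server-b"))  # fetched via switch
```

The point of the sketch is the division of labor: the memory module's controller decides locally whether a request can be satisfied from its own memory, and only involves the server-linking switch when aggregated contents reside on another server.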
In addition, it will also be understood that when a layer is referred to as being “between” two layers, it can be the only layer between the two layers, or one or more intervening layers may also be present.

The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the inventive concept. As used herein, the terms “substantially,” “about,” and similar terms are used as terms of approximation and not as terms of degree, and are intended to account for the inherent deviations in measured or calculated values that would be recognized by those of ordinary skill in the art. As used herein, the singular forms “a” and “an” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms “comprises” and/or “comprising”, when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. As used herein, the term “and/or” includes any and all combinations of one or more of the associated listed items. Expressions such as “at least one of,” when preceding a list of elements, modify the entire list of elements and do not modify the individual elements of the list. Further, the use of “may” when describing embodiments of the inventive concept refers to “one or more embodiments of the present disclosure”. Also, the term “exemplary” is intended to refer to an example or illustration. As used herein, the terms “use,” “using,” and “used” may be considered synonymous with the terms “utilize,” “utilizing,” and “utilized,” respectively.

It will be understood that when an element or layer is referred to as being “on”, “connected to”, “coupled to”, or “adjacent to” another element or layer, it may be directly on, connected to, coupled to, or adjacent to the other element or layer, or one or more intervening elements or layers may be present. In contrast, when an element or layer is referred to as being “directly on”, “directly connected to”, “directly coupled to”, or “immediately adjacent to” another element or layer, there are no intervening elements or layers present.

Any numerical range recited herein is intended to include all sub-ranges of the same numerical precision subsumed within the recited range. For example, a range of “1.0 to 10.0” or “between 1.0 and 10.0” is intended to include all subranges between (and including) the recited minimum value of 1.0 and the recited maximum value of 10.0, that is, having a minimum value equal to or greater than 1.0 and a maximum value equal to or less than 10.0, such as, for example, 2.4 to 7.6. Any maximum numerical limitation recited herein is intended to include all lower numerical limitations subsumed therein and any minimum numerical limitation recited in this specification is intended to include all higher numerical limitations subsumed therein.

Although exemplary embodiments of system and method for managing memory resources have been specifically described and illustrated herein, many modifications and variations will be apparent to those skilled in the art. Accordingly, it is to be understood that system and method for managing memory resources constructed according to principles of this disclosure may be embodied other than as specifically described herein. The invention is also defined in the following claims, and equivalents thereof.

What is claimed is:
1. A system, comprising:
    a first memory server, comprising:
        a cache-coherent switch, and
        a first memory module; and
    a second memory server; and
    a server-linking switch connected to the first memory server and to the second memory server,
    wherein:
        the first memory module is connected to the cache-coherent switch via a first interface, and
        the cache-coherent switch is connected to the server-linking switch via a second interface different from the first interface.
2. The system of claim 1, wherein the server-linking switch is configured to disable power to the first memory module.
3. The system of claim 2, wherein:
    the server-linking switch is configured to disable power to the first memory module by instructing the cache-coherent switch to disable power to the first memory module, and
    the cache-coherent switch is configured to disable power to the first memory module, upon being instructed, by the server-linking switch, to disable power to the first memory module.
4. The system of claim 1, wherein the cache-coherent switch is configured to perform deduplication within the first memory module.
5. The system of claim 1, wherein the cache-coherent switch is configured to compress data and to store compressed data in the first memory module.
6. The system of claim 1, wherein the server-linking switch is configured to query a status of the first memory server.
7. The system of claim 6, wherein the server-linking switch is configured to query a status of the first memory server through an Intelligent Platform Management Interface (IPMI).
8. The system of claim 7, wherein the querying of a status comprises querying a status selected from the group consisting of a power status, a network status, and an error check status.
9. The system of claim 1, wherein the server-linking switch is configured to batch cache requests directed to the first memory server.
10. The system of claim 1, further comprising a third memory server connected to the server-linking switch, wherein the server-linking switch is configured to maintain, between data stored on the first memory server and data stored on the third memory server, a consistency level selected from the group consisting of strict consistency, sequential consistency, causal consistency, and processor consistency.
11. The system of claim 1, wherein the cache-coherent switch is configured to:
    monitor a fullness of a first region of memory, and
    move data from the first region of memory to a second region of memory,
    wherein:
        the first region of memory is in volatile memory, and
        the second region of memory is in persistent memory.
12. The system of claim 1, wherein the server-linking switch comprises a Peripheral Component Interconnect Express (PCIe) switch.
13. The system of claim 1, wherein the server-linking switch comprises a Compute Express Link (CXL) switch.
14. The system of claim 13, wherein the server-linking switch comprises a top of rack (ToR) CXL switch.
15. The system of claim 1, wherein the server-linking switch is configured to transmit data from the second memory server to the first memory server, and to perform flow control on the data.
16. The system of claim 1, further comprising a third memory server connected to the server-linking switch, wherein:
    the server-linking switch is configured to:
        receive a first packet, from the second memory server,
        receive a second packet, from the third memory server, and
        transmit the first packet and the second packet to the first memory server.
17. A method for performing remote direct memory access in a computing system, the computing system comprising:
    a first memory server;
    a first server;
    a second server; and
    a server-linking switch connected to the first memory server, to the first server, and to the second server,
    the first memory server comprising:
        a cache-coherent switch, and
        a first memory module, wherein the first memory module is connected to the cache-coherent switch via a first interface, and the cache-coherent switch is connected to the server-linking switch via a second interface different from the first interface;
    the first server comprising:
        a stored-program processing circuit;
    the second server comprising:
        a stored-program processing circuit;
    the method comprising:
        receiving, by the server-linking switch, a first packet, from the first server;
        receiving, by the server-linking switch, a second packet, from the second server; and
        transmitting the first packet and the second packet to the first memory server.
18. The method of claim 17, further comprising:
    compressing data, by the cache-coherent switch, and
    storing the data in the first memory module.
19. The method of claim 17, further comprising:
    querying, by the server-linking switch, a status of the first memory server.
20. A system, comprising:
    a first memory server, comprising:
        a cache-coherent switch, and
        a first memory module; and
    a second memory server; and
    server-linking switching means connected to the first memory server and to the second memory server,
    wherein:
        the first memory module is connected to the cache-coherent switch via a first interface, and
        the cache-coherent switch is connected to the server-linking switching means via a second interface different from the first interface.