
US011461263B2

(12) United States Patent — Malladi et al.
(10) Patent No.: US 11,461,263 B2
(45) Date of Patent: Oct. 4, 2022

(54) DISAGGREGATED MEMORY SERVER

(71) Applicant: SAMSUNG ELECTRONICS CO., LTD., Suwon-si (KR)

(72) Inventors: Krishna Teja Malladi, San Jose, CA (US); Byung Hee Choi, Fremont, CA (US); Andrew Chang, Los Altos, CA (US); Ehsan M. Najafabadi, San Jose, CA (US)

(73) Assignee: Samsung Electronics Co., Ltd., Suwon-si (KR)

(*) Notice: Subject to any disclaimer, the term of this patent is extended or adjusted under 35 U.S.C. 154(b) by 0 days.

(21) Appl. No.: 17/026,087

(22) Filed: Sep. 18, 2020

(65) Prior Publication Data: US 2021/0311646 A1, Oct. 7, 2021

Related U.S. Application Data
(60) Provisional application No. 63/031,508, filed on May 28, 2020, provisional application No. 63/031,509, (Continued)

(51) Int. Cl.: G06F 13/40 (2006.01); G06F 15/173 (2006.01); (Continued)

(52) U.S. Cl.: CPC .... G06F 13/4027 (2013.01); G06F 3/0604 (2013.01); G06F 3/067 (2013.01); (Continued)

(58) Field of Classification Search: CPC .... G06F 13/4027; G06F 13/1663; G06F 12/0802; G06F 12/0808; G06F 12/1045; (Continued)

(56) References Cited

U.S. PATENT DOCUMENTS
8,971,423 B1 3/2015 Fu et al.
9,235,519 B2 1/2016 Lih et al.
(Continued)

FOREIGN PATENT DOCUMENTS
EP 1235154 A2 8/2002
WO 2005/116839 A1 12/2005

OTHER PUBLICATIONS
U.S. Office Action dated Jun. 21, 2021, issued in U.S. Appl. No. 17/026,071 (12 pages).
(Continued)

Primary Examiner — Hiep T Nguyen
(74) Attorney, Agent, or Firm — Lewis Roca Rothgerber Christie LLP

(57) ABSTRACT
A system and method for managing memory resources. In some embodiments, the system includes a first memory server, a second memory server, and a server-linking switch connected to the first memory server and to the second memory server. The first memory server may include a cache-coherent switch and a first memory module. In some embodiments, the first memory module is connected to the cache-coherent switch, and the cache-coherent switch is connected to the server-linking switch.

20 Claims, 11 Drawing Sheets

[Representative figure: datacenter spine and ToR Ethernet switch 110 connecting servers 105, each with ports, NICs 125, CPUs 115, DDR memory 120, switch/controller 130/137, memory modules 135, and expansion sockets 140/145.]
Related U.S. Application Data (continued): filed on May 28, 2020, provisional application No. 63/068,054, filed on Aug. 20, 2020, provisional application No. 63/057,746, filed on Jul. 28, 2020.

(51) Int. Cl.:
G06F 9/4401 (2018.01)
G06F 3/06 (2006.01)
G06F 12/0808 (2016.01)
G06F 12/1045 (2016.01)
G06F 13/16 (2006.01)
G06F 13/42 (2006.01)
G06F 12/0802 (2016.01)
G06F 13/28 (2006.01)
H04L 49/45 (2022.01)
H04L 49/351 (2022.01)

(52) U.S. Cl.: CPC .... G06F 3/0619 (2013.01); G06F 3/0625 (2013.01); G06F 3/0629 (2013.01); G06F 3/0647 (2013.01); G06F 3/0653 (2013.01); G06F 3/0659 (2013.01); G06F 3/0679 (2013.01); G06F 9/4401 (2013.01); G06F 12/0802 (2013.01); G06F 12/0808 (2013.01); G06F 12/1045 (2013.01); G06F 13/1663 (2013.01); G06F 13/28 (2013.01); G06F 13/409 (2013.01); G06F 13/4022 (2013.01); G06F 13/4068 (2013.01); G06F 13/4221 (2013.01); G06F 15/17331 (2013.01); H04L 49/45 (2013.01); G06F 2212/621 (2013.01); G06F 2213/0026 (2013.01); G06F 2213/28 (2013.01); H04L 49/351 (2013.01)

(58) Field of Classification Search: CPC .... G06F 3/0604; G06F 3/0619; G06F 3/0625; G06F 3/0629; G06F 3/0647; G06F 3/0653; G06F 3/0659; G06F 3/067; G06F 3/0679; G06F 9/4401. See application file for complete search history.

(56) References Cited

U.S. PATENT DOCUMENTS
9,432,298 B1 8/2016 Smith
9,483,318 B2 11/2016 Vajapeyam
9,619,389 B1 4/2017 Roug
9,916,241 B2 3/2018 McKean et al.
10,331,588 B2 6/2019 Frandzel et al.
10,389,800 B2 8/2019 Blainey et al.
10,523,748 B2 12/2019 Yang et al.
11,126,564 B2 9/2021 Schlansker et al.
2002/0018484 A1* 2/2002 Kim H04L 12/1868 370/432
2003/0140274 A1 7/2003 Neumiller et al.
2004/0133409 A1 7/2004 Mukherjee et al.
2005/0160430 A1 7/2005 Steely et al.
2006/0063501 A1 3/2006 Adkisson et al.
2006/0230119 A1* 10/2006 Hausauer H04L 47/6265 709/216
2008/0025289 A1 1/2008 Kapur et al.
2011/0320690 A1* 12/2011 Petersen G06F 3/0685 711/103
2012/0069029 A1 3/2012 Bourd et al.
2012/0151141 A1* 6/2012 Bell, Jr. G06F 11/3485 711/118
2013/0318308 A1 11/2013 Jayasimha et al.
2014/0195672 A1* 7/2014 Raghavan H04L 67/025 709/224
2015/0058642 A1 2/2015 Okamoto et al.
2015/0106560 A1 4/2015 Perego et al.
2015/0143037 A1 5/2015 Smith
2015/0258437 A1 9/2015 Kruglick
2015/0263985 A1 9/2015 Schmitter et al.
2016/0182154 A1 6/2016 Fang et al.
2016/0267209 A1 9/2016 Ikram et al.
2016/0299767 A1 10/2016 Mukadam
2016/0328273 A1 11/2016 Molka et al.
2016/0344629 A1 11/2016 Gray
2017/0075576 A1 3/2017 Cho
2017/0187846 A1 6/2017 Shalev et al.
2017/0228317 A1 8/2017 Drapala et al.
2017/0300298 A1 10/2017 Ishii
2017/0308483 A1 10/2017 Ishii
2017/0346915 A1* 11/2017 Gay H04L 67/32
2018/0024935 A1 1/2018 Meswani et al.
2018/0048711 A1 2/2018 Aslam et al.
2018/0089115 A1 3/2018 Schmisseur et al.
2018/0089881 A1 3/2018 Johnson
2018/0191523 A1 7/2018 Shah et al.
2018/0293489 A1 10/2018 Eyster et al.
2019/0042388 A1 2/2019 Wang et al.
2019/0073265 A1 3/2019 Brennan et al.
2019/0102346 A1 4/2019 Wang et al.
2019/0171373 A1 6/2019 Frank et al.
2019/0179805 A1 6/2019 Prahlad et al.
2019/0213130 A1 7/2019 Madugula et al.
2019/0220319 A1 7/2019 Parees et al.
2019/0235777 A1 8/2019 Wang et al.
2019/0243579 A1* 8/2019 Li G06F 3/0659
2019/0297015 A1 9/2019 Marolia et al.
2019/0303345 A1* 10/2019 Zhu G06F 13/28
2019/0384733 A1 12/2019 Jen et al.
2019/0385057 A1 12/2019 Litichever et al.
2019/0391936 A1 12/2019 Stalley
2020/0012604 A1 1/2020 Agarwal
2020/0021540 A1 1/2020 Marolia et al.
2020/0026656 A1* 1/2020 Liao G06F 13/4282
2020/0050403 A1 2/2020 Suri et al.
2020/0050570 A1 2/2020 Agarwal et al.
2020/0104275 A1 4/2020 Sen et al.
2020/0125503 A1 4/2020 Graniello et al.
2020/0125529 A1 4/2020 Byers et al.
2020/0136943 A1 4/2020 Banyai et al.
2020/0137896 A1 4/2020 Elenitoba-Johnson et al.
2020/0159449 A1 5/2020 Davis et al.
2020/0167098 A1 5/2020 Shah et al.
2020/0167258 A1 5/2020 Chattopadhyay et al.
2020/0192715 A1 6/2020 Wang et al.
2020/0241926 A1 7/2020 Guim Bernat
2020/0257517 A1 8/2020 Seater et al.
2020/0412798 A1 12/2020 Devireddy et al.
2021/0011864 A1 1/2021 Guim Bernat et al.
2021/0058388 A1* 2/2021 Knotwell G06F 21/30
2021/0064530 A1* 3/2021 Palfer-Sollier G06F 12/0833
2021/0084787 A1 3/2021 Weldon et al.
2021/0117360 A1 4/2021 Kutch et al.
2021/0120039 A1* 4/2021 Bett G06F 11/1451
2021/0200667 A1* 7/2021 Bernstein G06F 3/0658

OTHER PUBLICATIONS
AWS Summit, Seoul, Korea, 2017, 36 pages, https://www.slideshare.net/awskorea/aws-cloud-game-architecture?from_action=save), Amazon Web Services, Inc.
Unpublished U.S. Appl. No. 17/026,082, filed Sep. 18, 2020.
Unpublished U.S. Appl. No. 17/026,071, filed Sep. 18, 2020.
Unpublished U.S. Appl. No. 17/026,074, filed Sep. 18, 2020.
Jack Tigar Humphries, et al., "Mind the Gap: A Case for Informed Request Scheduling at the NIC", HotNets '19: Proceedings of the 18th ACM Workshop on Hot Topics in Networks, Nov. 2019, pp. 60-68, https://doi.org/10.1145/3365609.3365856.
U.S. Office Action dated Aug. 19, 2021, issued in U.S. Appl. No. 17/026,074 (16 pages).
EPO Extended European Search Report dated Sep. 13, 2021, issued in corresponding European Patent Application No. 21158607.8 (14 pages).
U.S. Final Office Action dated Sep. 28, 2021, issued in U.S. Appl. No. 17/026,071 (13 pages).
U.S. Advisory Action dated Dec. 9, 2021, issued in U.S. Appl. No. 17/026,071 (4 pages).
European Search Report for EP Application No. 21162578.5 dated Sep. 15, 2021, 13 pages.
Notice of Allowance for U.S. Appl. No. 17/026,074 dated Dec. 29, 2021, 10 pages.
Notice of Allowance for U.S. Appl. No. 17/026,074 dated Mar. 9, 2021, 10 pages.
Notice of Allowance for U.S. Appl. No. 17/026,082 dated Mar. 30, 2022, 7 pages.
Office Action for U.S. Appl. No. 17/026,071 dated Mar. 17, 2022, 14 pages.
Office Action for U.S. Appl. No. 17/026,082 dated Nov. 26, 2021, 13 pages.
Office Action for U.S. Appl. No. 17/246,448 dated May 13, 2022, 12 pages.
U.S. Notice of Allowance dated Jun. 28, 2022, issued in U.S. Appl. No. 17/026,074 (10 pages).
U.S. Office Action dated Aug. 3, 2022, issued in U.S. Appl. No. 17/026,082 (13 pages).

* cited by examiner
[Drawing Sheets 1-11 of 11, US 11,461,263 B2 (figure contents summarized; see the Brief Description of the Drawings):
Sheet 1, FIG. 1A: system for attaching memory resources to computing resources using a cache-coherent connection (datacenter spine, ToR Ethernet switch 110, servers 105, 10 GbE NICs 125, CPUs 115 with PCIe 5, DDR4 memory 120, CXL memory modules 135 with controllers 137).
Sheet 2, FIG. 1B: the system of FIG. 1A, with the memory modules 135 and NICs 125 on expansion socket adapters 140 in expansion sockets 145.
Sheet 3, FIG. 1C: system for aggregating memory with an Ethernet ToR switch 110 and an enhanced capability CXL switch 130.
Sheet 4, FIG. 1D: the system of FIG. 1C, employing an expansion socket adapter 140.
Sheet 5, FIG. 1E: system for aggregating memory with a ToR PCIe 5 switch 112 and PCIe 5 connectors.
Sheet 6, FIG. 1F: the system of FIG. 1E, employing an expansion socket adapter 140.
Sheet 7, FIG. 1G: system for disaggregating servers (ToR PCIe 5/CXL switch 112, CXL switches 130 with controllers 137, memory modules 135).
Sheet 8, FIG. 2A: flow chart (steps 205-222) of an RDMA transfer that bypasses the processing circuits.
Sheet 9, FIG. 2B: flow chart (steps 225-245) of an RDMA transfer with the participation of the processing circuits.
Sheet 10, FIG. 2C: flow chart (steps 225-260) of an RDMA transfer through the ToR PCIe/CXL switch.
Sheet 11, FIG. 2D: flow chart (steps 225-266) of an RDMA transfer through a CXL switch for the embodiment of FIG. 1G.]
DISAGGREGATED MEMORY SERVER

CROSS-REFERENCE TO RELATED APPLICATION(S)

The present application claims priority to and the benefit of U.S. Provisional Application No. 63/031,508, filed May 28, 2020, entitled "EXTENDING MEMORY ACCESSES WITH NOVEL CACHE COHERENCE CONNECTS", and priority to and the benefit of U.S. Provisional Application No. 63/031,509, filed May 28, 2020, entitled "POOLING SERVER MEMORY RESOURCES FOR COMPUTE EFFICIENCY", and priority to and the benefit of U.S. Provisional Application No. 63/068,054, filed Aug. 20, 2020, entitled "SYSTEM WITH CACHE-COHERENT MEMORY AND SERVER-LINKING SWITCH FIELD", and priority to and the benefit of U.S. Provisional Application No. 63/057,746, filed Jul. 28, 2020, entitled "DISAGGREGATED MEMORY ARCHITECTURE WITH NOVEL INTERCONNECTS", the entire contents of all of which are incorporated herein by reference; the present application also claims priority to and the benefit of U.S. Provisional Application No. 63/006,073, filed Apr. 6, 2020, entitled "SYSTEMS, METHODS, AND APPARATUSES FOR MEMORY ACCESS USING CACHE COHERENT INTERCONNECTS".

FIELD

One or more aspects of embodiments according to the present disclosure relate to computing systems, and more particularly to a system and method for managing memory resources in a system including one or more servers.

BACKGROUND

The present background section is intended to provide context only, and the disclosure of any embodiment or concept in this section does not constitute an admission that said embodiment or concept is prior art.

Some server systems may include collections of servers connected by a network protocol. Each of the servers in such a system may include processing resources (e.g., processors) and memory resources (e.g., system memory). It may be advantageous, in some circumstances, for a processing resource of one server to access a memory resource of another server, and it may be advantageous for this access to occur while minimizing the processing resources of either server.

Thus, there is a need for an improved system and method for managing memory resources in a system including one or more servers.

SUMMARY

In some embodiments, a data storage and processing system includes one or more servers and one or more memory servers connected by a server-linking switch. Each memory server may include one or more memory modules connected to the server-linking switch through a cache-coherent switch. Each memory module may include a controller (e.g., a field-programmable gate array (FPGA) or an application-specific integrated circuit (ASIC)) providing it with enhanced capabilities. These capabilities may include enabling a server to interact with a memory module without having to access a processor such as a central processing unit (CPU) (e.g., by performing remote direct memory access (RDMA)).

According to an embodiment of the present invention, there is provided a system, including: a first memory server, including: a cache-coherent switch, and a first memory module; and a second memory server; and a server-linking switch connected to the first memory server and to the second memory server, wherein: the first memory module is connected to the cache-coherent switch, and the cache-coherent switch is connected to the server-linking switch.

In some embodiments, the server-linking switch is configured to disable power to the first memory module.

In some embodiments: the server-linking switch is configured to disable power to the first memory module by instructing the cache-coherent switch to disable power to the first memory module, and the cache-coherent switch is configured to disable power to the first memory module, upon being instructed, by the server-linking switch, to disable power to the first memory module.

In some embodiments, the cache-coherent switch is configured to perform deduplication within the first memory module.

In some embodiments, the cache-coherent switch is configured to compress data and to store compressed data in the first memory module.

In some embodiments, the server-linking switch is configured to query a status of the first memory server.

In some embodiments, the server-linking switch is configured to query a status of the first memory server through an Intelligent Platform Management Interface (IPMI).

In some embodiments, the querying of a status includes querying a status selected from the group consisting of a power status, a network status, and an error check status.
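The three status categories map naturally onto standard baseboard-management queries. As a hedged illustration (an editorial sketch, not part of the patent text), the Python fragment below polls a memory server's BMC with the stock ipmitool CLI; the host address and credentials are placeholders, and a real server-linking switch would issue equivalent IPMI requests from its management firmware.

```python
import subprocess

def ipmi_query(host: str, user: str, password: str, *command: str) -> str:
    """Run one ipmitool command against a memory server's BMC."""
    result = subprocess.run(
        ["ipmitool", "-I", "lanplus", "-H", host, "-U", user, "-P", password,
         *command],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

def query_memory_server_status(host: str, user: str, password: str) -> dict:
    # One query per status category named above.
    return {
        "power": ipmi_query(host, user, password, "chassis", "power", "status"),
        "network": ipmi_query(host, user, password, "lan", "print"),
        "errors": ipmi_query(host, user, password, "sel", "list"),
    }
```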
In some embodiments, the server-linking switch is configured to batch cache requests directed to the first memory server.

In some embodiments, the system further includes a third memory server connected to the server-linking switch, wherein the server-linking switch is configured to maintain, between data stored on the first memory server and data stored on the third memory server, a consistency level selected from the group consisting of strict consistency, sequential consistency, causal consistency, and processor consistency.

In some embodiments, the cache-coherent switch is configured to: monitor a fullness of a first region of memory, and move data from the first region of memory to a second region of memory, wherein: the first region of memory is in volatile memory, and the second region of memory is in persistent memory.
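A toy model can make the fullness-triggered migration concrete. The sketch below is an editorial illustration rather than the patented design: a dict stands in for each memory region, and "fullness" is simply an entry count against a fixed capacity.

```python
class TieredMemory:
    """Spill data from a volatile region to a persistent region when the
    volatile region's fullness crosses a threshold."""

    def __init__(self, volatile_capacity: int, threshold: float = 0.9):
        self.volatile: dict[int, bytes] = {}    # e.g., a DRAM-backed region
        self.persistent: dict[int, bytes] = {}  # e.g., a NAND-backed region
        self.capacity = volatile_capacity
        self.threshold = threshold

    def fullness(self) -> float:
        return len(self.volatile) / self.capacity

    def write(self, addr: int, data: bytes) -> None:
        self.volatile[addr] = data
        while self.fullness() > self.threshold:
            # Move the oldest entry (insertion order) to persistent memory.
            old_addr, old_data = next(iter(self.volatile.items()))
            self.persistent[old_addr] = old_data
            del self.volatile[old_addr]

    def read(self, addr: int) -> bytes:
        if addr in self.volatile:
            return self.volatile[addr]
        return self.persistent[addr]
```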
In some embodiments, the server-linking switch includes a Peripheral Component Interconnect Express (PCIe) switch.

In some embodiments, the server-linking switch includes a Compute Express Link (CXL) switch.

In some embodiments, the server-linking switch includes a top of rack (ToR) CXL switch.

In some embodiments, the server-linking switch is configured to transmit data from the second memory server to the first memory server, and to perform flow control on the data.

In some embodiments, the system further includes a third memory server connected to the server-linking switch, wherein: the server-linking switch is configured to: receive a first packet, from the second memory server, receive a
second packet, from the third memory server, and transmit the first packet and the second packet to the first memory server.

According to an embodiment of the present invention, there is provided a method for performing remote direct memory access in a computing system, the computing system including: a first memory server, a first server, a second server, and a server-linking switch connected to the first memory server, to the first server, and to the second server, the first memory server including: a cache-coherent switch, and a first memory module; the first server including: a stored-program processing circuit; the second server including: a stored-program processing circuit; the method including: receiving, by the server-linking switch, a first packet, from the first server; receiving, by the server-linking switch, a second packet, from the second server, and transmitting the first packet and the second packet to the first memory server.

In some embodiments, the method further includes: compressing data, by the cache-coherent switch, and storing the data in the first memory module.

In some embodiments, the method further includes: querying, by the server-linking switch, a status of the first memory server.

According to an embodiment of the present invention, there is provided a system, including: a first memory server, including: a cache-coherent switch, and a first memory module; and a second memory server; and server-linking switching means connected to the first memory server and to the second memory server, wherein: the first memory module is connected to the cache-coherent switch, and the cache-coherent switch is connected to the server-linking switching means.

BRIEF DESCRIPTION OF THE DRAWINGS

The drawings provided herein are for purposes of illustrating certain embodiments only; other embodiments, which may not be explicitly illustrated, are not excluded from the scope of this disclosure.

These and other features and advantages of the present disclosure will be appreciated and understood with reference to the specification, claims, and appended drawings, wherein:

FIG. 1A is a block diagram of a system for attaching memory resources to computing resources using a cache-coherent connection, according to an embodiment of the present disclosure;

FIG. 1B is a block diagram of a system, employing expansion socket adapters, for attaching memory resources to computing resources using a cache-coherent connection, according to an embodiment of the present disclosure;

FIG. 1C is a block diagram of a system for aggregating memory employing an Ethernet ToR switch, according to an embodiment of the present disclosure;

FIG. 1D is a block diagram of a system for aggregating memory employing an Ethernet ToR switch and an expansion socket adapter, according to an embodiment of the present disclosure;

FIG. 1E is a block diagram of a system for aggregating memory, according to an embodiment of the present disclosure;

FIG. 1F is a block diagram of a system for aggregating memory, employing an expansion socket adapter, according to an embodiment of the present disclosure;

FIG. 1G is a block diagram of a system for disaggregating servers, according to an embodiment of the present disclosure;

FIG. 2A is a flow chart for an example method of performing a remote direct memory access (RDMA) transfer, bypassing processing circuits, for embodiments illustrated in FIGS. 1A-1G, according to an embodiment of the present disclosure;

FIG. 2B is a flow chart for an example method of performing an RDMA transfer, with the participation of processing circuits, for embodiments illustrated in FIGS. 1A-1D, according to an embodiment of the present disclosure;

FIG. 2C is a flow chart for an example method of performing an RDMA transfer, through a Compute Express Link (CXL) switch, for embodiments illustrated in FIGS. 1E and 1F, according to an embodiment of the present disclosure; and

FIG. 2D is a flow chart for an example method of performing an RDMA transfer, through a CXL switch, for the embodiment illustrated in FIG. 1G, according to an embodiment of the present disclosure.

DETAILED DESCRIPTION

The detailed description set forth below in connection with the appended drawings is intended as a description of exemplary embodiments of a system and method for managing memory resources provided in accordance with the present disclosure and is not intended to represent the only forms in which the present disclosure may be constructed or utilized. The description sets forth the features of the present disclosure in connection with the illustrated embodiments. It is to be understood, however, that the same or equivalent functions and structures may be accomplished by different embodiments that are also intended to be encompassed within the scope of the disclosure. As denoted elsewhere herein, like element numbers are intended to indicate like elements or features.

Peripheral Component Interconnect Express (PCIe) can refer to a computer interface which may have a relatively high and variable latency that can limit its usefulness in making connections to memory. CXL is an open industry standard for communications over PCIe 5.0, which can provide fixed, relatively short packet sizes and, as a result, may be able to provide relatively high bandwidth and relatively low, fixed latency. As such, CXL may be capable of supporting cache coherence, and CXL may be well suited for making connections to memory. CXL may further be used to provide connectivity between a host and accelerators, memory devices, and network interface circuits (or "network interface controllers" or "network interface cards" (NICs)) in a server.

Cache-coherent protocols such as CXL may also be employed for heterogeneous processing, e.g., in scalar, vector, and buffered memory systems. CXL may be used to leverage the channel, the retimers, the PHY layer of a system, the logical aspects of the interface, and the protocols from PCIe 5.0 to provide a cache-coherent interface. The CXL transaction layer may include three multiplexed sub-protocols that run simultaneously on a single link and can be referred to as CXL.io, CXL.cache, and CXL.memory. CXL.io may include I/O semantics, which may be similar to PCIe. CXL.cache may include caching semantics, and CXL.memory may include memory semantics; both the caching semantics and the memory semantics may be optional. Like PCIe, CXL may support (i) native widths of x16, x8, and x4, which may be partitionable, (ii) a data rate of 32 GT/s, degradable to 8 GT/s and 16 GT/s, 128b/130b, (iii) 300 W (75 W in a x16 connector), and (iv) plug and
play. To support plug and play, either a PCIe or a CXL device link may start training in PCIe in Gen1, negotiate CXL, complete Gen 1-5 training, and then start CXL transactions.
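As a rough illustration of that bring-up sequence (an editorial sketch with invented state names, not terminology from the CXL specification), the link can be modeled as passing through a fixed series of states:

```python
from enum import Enum, auto

class LinkState(Enum):
    PCIE_GEN1_TRAINING = auto()  # training always begins in PCIe Gen1
    CXL_NEGOTIATION = auto()     # the two ends negotiate CXL support
    SPEED_TRAINING = auto()      # complete Gen 1-5 training
    CXL_ACTIVE = auto()          # CXL transactions may begin
    PCIE_ONLY = auto()           # fallback if the partner is PCIe-only

def bring_up_link(partner_supports_cxl: bool) -> list[LinkState]:
    """Return the sequence of states the link passes through."""
    states = [LinkState.PCIE_GEN1_TRAINING, LinkState.CXL_NEGOTIATION]
    if not partner_supports_cxl:
        states.append(LinkState.PCIE_ONLY)
        return states
    states += [LinkState.SPEED_TRAINING, LinkState.CXL_ACTIVE]
    return states
```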
In some embodiments, the use of CXL connections to an aggregation, or "pool", of memory (e.g., a quantity of memory, including a plurality of memory cells connected together) may provide various advantages, in a system that includes a plurality of servers connected together by a network, as discussed in further detail below. For example, a CXL switch having further capabilities in addition to providing packet-switching functionality for CXL packets (referred to herein as an "enhanced capability CXL switch") may be used to connect the aggregation of memory to one or more central processing units (CPUs) (or "central processing circuits") and to one or more network interface circuits (which may have enhanced capability). Such a configuration may make it possible (i) for the aggregation of memory to include various types of memory, having different characteristics, (ii) for the enhanced capability CXL switch to virtualize the aggregation of memory, and to store data of different characteristics (e.g., frequency of access) in appropriate types of memory, and (iii) for the enhanced capability CXL switch to support remote direct memory access (RDMA) so that RDMA may be performed with little or no involvement from the server's processing circuits. As used herein, to "virtualize" memory means to perform memory address translation between the processing circuit and the memory.
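A minimal sketch of such address translation (illustrative only; the page size and table layout are invented for the example) resolves a processor-side address to a (module, device offset) pair:

```python
class AddressTranslator:
    """Translate processor-side addresses to memory-side locations,
    masking the physical layout of the memory modules."""

    PAGE = 4096  # assumed translation granularity

    def __init__(self):
        # processor-side page number -> (module index, device page number)
        self.table: dict[int, tuple[int, int]] = {}

    def map_page(self, proc_page: int, module: int, dev_page: int) -> None:
        self.table[proc_page] = (module, dev_page)

    def translate(self, processor_addr: int) -> tuple[int, int]:
        page, offset = divmod(processor_addr, self.PAGE)
        module, dev_page = self.table[page]
        return module, dev_page * self.PAGE + offset
```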
A CXL switch may (i) support memory and accelerator disaggregation through single-level switching, (ii) enable resources to be off-lined and on-lined between domains, which may enable time-multiplexing across domains, based on demand, and (iii) support virtualization of downstream ports. CXL may be employed to implement aggregated memory, which may enable one-to-many and many-to-one switching (e.g., it may be capable of (i) connecting multiple root ports to one end point, (ii) connecting one root port to multiple end points, or (iii) connecting multiple root ports to multiple end points), with aggregated devices being, in some embodiments, partitioned into multiple logical devices, each with a respective LD-ID (logical device identifier). In such an embodiment a physical device may be partitioned into a plurality of logical devices, each visible to a respective initiator. A device may have one physical function (PF) and a plurality (e.g., 16) of isolated logical devices. In some embodiments the number of logical devices (e.g., the number of partitions) may be limited (e.g., to 16), and one control partition (which may be a physical function used for controlling the device) may also be present.
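The LD-ID partitioning can be sketched as follows; this is an editorial toy model, with the 16-device limit taken from the text and everything else invented:

```python
MAX_LOGICAL_DEVICES = 16  # the per-device partition limit mentioned above

class PhysicalDevice:
    """One physical function carved into isolated logical devices,
    each identified by an LD-ID and visible to a single initiator."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.logical_devices: dict[int, int] = {}  # LD-ID -> allocated size
        self.visible_to: dict[int, str] = {}       # LD-ID -> initiator
        self.control_partition = "PF0"             # controls the device

    def partition(self, ld_id: int, size: int, initiator: str) -> None:
        if not 0 <= ld_id < MAX_LOGICAL_DEVICES:
            raise ValueError("LD-ID out of range")
        if sum(self.logical_devices.values()) + size > self.capacity:
            raise ValueError("insufficient capacity")
        self.logical_devices[ld_id] = size
        self.visible_to[ld_id] = initiator
```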
In some embodiments, a fabric manager may be employed to (i) perform device discovery and virtual CXL software creation, and to (ii) bind virtual ports to physical ports. Such a fabric manager may operate through connections over an SMBus sideband. The fabric manager may be implemented in hardware, or software, or firmware, or in a combination thereof, and it may reside, for example, in the host, in one of the memory modules 135, or in the enhanced capability CXL switch 130, or elsewhere in the network. The fabric manager may issue commands, including commands issued through a sideband bus or through the PCIe tree.
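A toy fabric manager illustrating the two duties named above (discovery, and binding virtual ports to physical ports); the port naming and data structures are invented for the example:

```python
class FabricManager:
    """Editorial sketch: discover devices and bind virtual ports to
    physical ports (in a real system, over an SMBus sideband or the
    PCIe tree)."""

    def __init__(self, physical_ports: list[str]):
        self.physical_ports = physical_ports
        self.bindings: dict[str, str] = {}  # virtual port -> physical port

    def discover(self) -> list[str]:
        # Stand-in for device discovery and virtual CXL software creation.
        return list(self.physical_ports)

    def bind(self, virtual_port: str, physical_port: str) -> None:
        if physical_port not in self.physical_ports:
            raise ValueError("unknown physical port")
        self.bindings[virtual_port] = physical_port

    def unbind(self, virtual_port: str) -> None:
        self.bindings.pop(virtual_port, None)
```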
Referring to FIG. 1A, in some embodiments, a server system includes a plurality of servers 105, connected together by a top of rack (ToR) Ethernet switch 110. While this switch is described as using the Ethernet protocol, any other suitable network protocol may be used. Each server includes one or more processing circuits 115, each connected to (i) system memory 120 (e.g., Double Data Rate (version 4) (DDR4) memory or any other suitable memory), (ii) one or more network interface circuits 125, and (iii) one or more CXL memory modules 135. Each of the processing circuits 115 may be a stored-program processing circuit, e.g., a central processing unit (CPU) (e.g., an x86 CPU), a graphics processing unit (GPU), or an ARM processor. In some embodiments, a network interface circuit 125 may be embedded in (e.g., on the same semiconductor chip as, or in the same module as) one of the memory modules 135, or a network interface circuit 125 may be separately packaged from the memory modules 135.

As used herein, a "memory module" is a package (e.g., a package including a printed circuit board and components connected to it, or an enclosure including a printed circuit board) including one or more memory dies, each memory die including a plurality of memory cells. Each memory die, or each of a set of groups of memory dies, may be in a package (e.g., an epoxy mold compound (EMC) package) soldered to the printed circuit board of the memory module (or connected to the printed circuit board of the memory module through a connector). Each of the memory modules 135 may have a CXL interface and may include a controller 137 (e.g., an FPGA, an ASIC, a processor, and/or the like) for translating between CXL packets and the memory interface of the memory dies, e.g., the signals suitable for the memory technology of the memory in the memory module 135. As used herein, the "memory interface" of the memory dies is the interface that is native to the technology of the memory dies; in the case of DRAM, e.g., the memory interface may be word lines and bit lines. A memory module may also include a controller 137 which may provide enhanced capabilities, as described in further detail below. The controller 137 of each memory module 135 may be connected to a processing circuit 115 through a cache-coherent interface, e.g., through the CXL interface. The controller 137 may also facilitate data transmissions (e.g., RDMA requests) between different servers 105, bypassing the processing circuits 115. The ToR Ethernet switch 110 and the network interface circuits 125 may include an RDMA interface to facilitate RDMA requests between CXL memory devices on different servers (e.g., the ToR Ethernet switch 110 and the network interface circuits 125 may provide hardware offload or hardware acceleration of RDMA over Converged Ethernet (RoCE), Infiniband, and iWARP packets).

The CXL interconnects in the system may comply with a cache-coherent protocol such as the CXL 1.1 standard, or, in some embodiments, with the CXL 2.0 standard, with a future version of CXL, or with any other suitable protocol (e.g., cache-coherent protocol). The memory modules 135 may be directly attached to the processing circuits 115 as shown, and the top of rack Ethernet switch 110 may be used for scaling the system to larger sizes (e.g., with larger numbers of servers 105).

In some embodiments, each server can be populated with multiple direct-attached CXL memory modules 135, as shown in FIG. 1A. Each memory module 135 may expose a set of base address registers (BARs) to the host's Basic Input/Output System (BIOS) as a memory range. One or more of the memory modules 135 may include firmware to transparently manage its memory space behind the host OS map. Each of the memory modules 135 may include one of, or a combination of, memory technologies including, for example (but not limited to), Dynamic Random Access Memory (DRAM), not-AND (NAND) flash, High Bandwidth Memory (HBM), and Low-Power Double Data Rate
Synchronous Dynamic Random Access Memory (LPDDR SDRAM) technologies, and may also include a cache controller, or separate respective split controllers for different-technology memory devices (for memory modules 135 that combine several memory devices of different technologies). Each memory module 135 may include a different interface width (x4-x16), and may be constructed according to any of various pertinent form factors, e.g., U.2, M.2, half height, half length (HHHL), full height, half length (FHHL), E1.S, E1.L, E3.S, and E3.H.
In some embodiments, as mentioned above, the enhanced capability CXL switch 130 includes an FPGA (or ASIC) controller 137 and provides additional features beyond switching of CXL packets. The controller 137 of the enhanced capability CXL switch 130 may also act as a management device for the memory modules 135 and help with host control plane processing, and it may enable rich control semantics and statistics. The controller 137 may include an additional "backdoor" (e.g., 100 gigabit Ethernet (GbE)) network interface circuit 125. In some embodiments, the controller 137 presents as a CXL Type 2 device to the processing circuits 115, which enables the issuing of cache invalidate instructions to the processing circuits 115 upon receiving remote write requests. In some embodiments, DDIO technology is enabled, and remote data is first pulled to the last level cache (LLC) of the processing circuit and later written to the memory modules 135 (from cache). As used herein, a "Type 2" CXL device is one that can initiate transactions and that implements an optional coherent cache and host-managed device memory, and for which applicable transaction types include all CXL.cache and all CXL.mem transactions.
As mentioned above, one or more of the memory modules 135 may include persistent memory, or "persistent storage" (i.e., storage within which data is not lost when external power is disconnected). If a memory module 135 is presented as a persistent device, the controller 137 of the memory module 135 may manage the persistent domain, e.g., it may store, in the persistent storage, data identified (e.g., as a result of an application making a call to a corresponding operating system function) by a processing circuit 115 as requiring persistent storage. In such an embodiment, a software API may flush caches and data to the persistent storage.
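On the software side, such a flush API behaves like the familiar write-then-sync pattern. A hedged sketch using only standard-library calls (the path is a placeholder for whatever interface the persistent device exposes):

```python
import os

def persist(path: str, data: bytes) -> None:
    """Write data so it survives power loss: drain user-space buffers,
    then ask the OS to force the data down to persistent media."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()             # flush user-space buffers
        os.fsync(f.fileno())  # flush OS caches to the persistent domain
```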
In some embodiments, direct memory transfer to the memory modules 135 from the network interface circuits 125 is enabled. Such transfers may be one-way transfers to remote memory for fast communication in a distributed system. In such an embodiment, the memory modules 135 may expose hardware details to the network interface circuits 125 in the system to enable faster RDMA transfers. In such a system, two scenarios may occur, depending on whether the Data Direct I/O (DDIO) of the processing circuit 115 is enabled or disabled. DDIO may enable direct communication between an Ethernet controller or an Ethernet adapter and a cache of a processing circuit 115. If the DDIO of the processing circuit 115 is enabled, the transfer's target may be the last level cache of the processing circuit, from which the data may subsequently be automatically flushed to the memory modules 135. If the DDIO of the processing circuit 115 is disabled, the memory modules 135 may operate in device-bias mode to force accesses to be directly received by the destination memory module 135 (without DDIO). An RDMA-capable network interface circuit 125 with a host channel adapter (HCA), buffers, and other processing may be employed to enable such an RDMA transfer, which may bypass the target memory buffer transfer that may be present in other modes of RDMA transfer. For example, in such an embodiment, the use of a bounce buffer (e.g., a buffer in the remote server, when the eventual destination in memory is in an address range not supported by the RDMA protocol) may be avoided. In some embodiments, RDMA uses another physical medium option, other than Ethernet (e.g., for use with a switch that is configured to handle other network protocols). Examples of inter-server connections that may enable RDMA include (but are not limited to) Infiniband, RDMA over Converged Ethernet (RoCE) (which uses the Ethernet User Datagram Protocol (UDP)), and iWARP (which uses the Transmission Control Protocol/Internet Protocol (TCP/IP)).
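The two scenarios reduce to a simple routing decision. In the editorial sketch below, plain dicts stand in for the last-level cache and the destination memory module:

```python
def deliver_remote_write(addr: int, data: bytes, ddio_enabled: bool,
                         llc: dict, memory_module: dict) -> None:
    """Route an inbound transfer according to the two scenarios above."""
    if ddio_enabled:
        llc[addr] = data                     # target the last-level cache...
        memory_module[addr] = llc.pop(addr)  # ...then flush to the module
    else:
        # Device-bias mode: the destination module receives it directly.
        memory_module[addr] = data
```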
FIG. 1B shows a system similar to that of FIG. 1A, in which the processing circuits 115 are connected to the network interface circuits 125 through the memory modules 135. The memory modules 135 and the network interface circuits 125 are on expansion socket adapters 140. Each expansion socket adapter 140 may plug into an expansion socket 145, e.g., an M.2 connector, on the motherboard of the server 105. As such, the server may be any suitable (e.g., industry standard) server, modified by the installation of the expansion socket adapters 140 in expansion sockets 145. In such an embodiment, (i) each network interface circuit 125 may be integrated into a respective one of the memory modules 135, or (ii) each network interface circuit 125 may have a PCIe interface (the network interface circuit 125 may be a PCIe endpoint (i.e., a PCIe slave device)), so that the processing circuit 115 to which it is connected (which may operate as the PCIe master device, or "root port") may communicate with it through a root port to endpoint PCIe connection, and the controller 137 of the memory module 135 may communicate with it through a peer-to-peer PCIe connection.

According to an embodiment of the present invention, there is provided a system, including: a first server, including: a stored-program processing circuit, a first network interface circuit, and a first memory module, wherein: the first memory module includes: a first memory die, and a controller, the controller being connected: to the first memory die through a memory interface, to the stored-program processing circuit through a cache-coherent interface, and to the first network interface circuit. In some embodiments: the first memory module further includes a second memory die, the first memory die includes volatile memory, and the second memory die includes persistent memory. In some embodiments, the persistent memory includes NAND flash. In some embodiments, the controller is configured to provide a flash translation layer for the persistent memory. In some embodiments, the cache-coherent interface includes a Compute Express Link (CXL) interface. In some embodiments, the first server includes an expansion socket adapter, connected to an expansion socket of the first server, the expansion socket adapter including: the first memory module; and the first network interface circuit. In some embodiments, the controller of the first memory module is connected to the stored-program processing circuit through the expansion socket. In some embodiments, the expansion socket includes an M.2 socket. In some embodiments, the controller of the first memory module is connected to the first network interface circuit by a peer-to-peer Peripheral Component Interconnect Express (PCIe) connection. In some embodiments, the system further includes: a second server, and a network switch connected to the first server and to the second server. In some embodiments, the network switch includes a top of rack (ToR) Ethernet switch. In some embodiments, the controller of the
first memory module is configured to receive straight remote direct memory access (RDMA) requests, and to send straight RDMA responses. In some embodiments, the controller of the first memory module is configured to receive straight remote direct memory access (RDMA) requests through the network switch and through the first network interface circuit, and to send straight RDMA responses through the network switch and through the first network interface circuit. In some embodiments, the controller of the first memory module is configured to: receive data, from the second server; store the data in the first memory module; and send, to the stored-program processing circuit, a command for invalidating a cache line. In some embodiments, the controller of the first memory module includes a field-programmable gate array (FPGA) or an application-specific integrated circuit (ASIC). According to an embodiment of the present invention, there is provided a method for performing remote direct memory access in a computing system, the computing system including: a first server and a second server, the first server including: a stored-program processing circuit, a network interface circuit, and a first memory module including a controller, the method including: receiving, by the controller of the first memory module, a straight remote direct memory access (RDMA) request; and sending, by the controller of the first memory module, a straight RDMA response. In some embodiments: the computing system further includes an Ethernet switch connected to the first server and to the second server, and the receiving of the straight RDMA request includes receiving the straight RDMA request through the Ethernet switch. In some embodiments, the method further includes: receiving, by the controller of the first memory module, a read command, from the stored-program processing circuit, for a first memory address, translating, by the controller of the first memory module, the first memory address to a second memory address, and retrieving, by the controller of the first memory module, data from the first memory module at the second memory address. In some embodiments, the method further includes: receiving data, by the controller of the first memory module, storing, by the controller of the first memory module, the data in the first memory module, and sending, by the controller of the first memory module, to the stored-program processing circuit, a command for invalidating a cache line. According to an embodiment of the present invention, there is provided a system, including: a first server, including: a stored-program processing circuit, a first network interface circuit, and a first memory module, wherein: the first memory module includes: a first memory die, and controller means, the controller means being connected: to the first memory die through a memory interface, to the stored-program processing circuit through a cache-coherent interface, and to the first network interface circuit.
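A toy controller illustrating the straight-RDMA flow described above (an editorial sketch; the request format and translation table are invented), including the address translation and the cache-line-invalidate command sent to the processing circuit on a remote write:

```python
from dataclasses import dataclass

@dataclass
class RdmaRequest:
    op: str        # "read" or "write"
    addr: int      # first (processor-side) memory address
    data: bytes = b""

class ModuleController:
    """Services straight RDMA requests itself, keeping the host CPU
    off the data path."""

    def __init__(self):
        self.memory: dict[int, bytes] = {}
        self.translation: dict[int, int] = {}  # first addr -> second addr
        self.cpu_commands: list[str] = []      # commands sent to the CPU

    def handle(self, req: RdmaRequest) -> bytes:
        dev_addr = self.translation.get(req.addr, req.addr)
        if req.op == "write":
            self.memory[dev_addr] = req.data
            # Remote write: tell the processing circuit to invalidate the
            # corresponding cache line so it does not read stale data.
            self.cpu_commands.append(f"invalidate cache line {req.addr:#x}")
            return b""
        # Straight RDMA response for a read.
        return self.memory.get(dev_addr, b"")
```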
Referring to FIG. 1C, in some embodiments, a server system includes a plurality of servers 105, connected together by a top of rack (ToR) Ethernet switch 110. Each server includes one or more processing circuits 115, each connected to (i) system memory 120 (e.g., DDR4 memory), (ii) one or more network interface circuits 125, and (iii) an enhanced capability CXL switch 130. The enhanced capability CXL switch 130 may be connected to a plurality of memory modules 135. That is, the system of FIG. 1C includes a first server 105, including a stored-program processing circuit 115, a network interface circuit 125, a cache-coherent switch 130, and a first memory module 135. In the system of FIG. 1C, the first memory module 135 is connected to the cache-coherent switch 130, the cache-coherent switch 130 is connected to the network interface circuit 125, and the stored-program processing circuit 115 is connected to the cache-coherent switch 130.

The memory modules 135 may be grouped by type, form factor, or technology type (e.g., DDR4, DRAM, LPDDR, high bandwidth memory (HBM), or NAND flash, or other persistent storage (e.g., solid state drives incorporating NAND flash)). Each memory module may have a CXL interface and include an interface circuit for translating between CXL packets and signals suitable for the memory in the memory module 135. In some embodiments, these interface circuits are instead in the enhanced capability CXL switch 130, and each of the memory modules 135 has an interface that is the native interface of the memory in the memory module 135. In some embodiments, the enhanced capability CXL switch 130 is integrated into (e.g., in an M.2 form factor package with, or integrated into a single integrated circuit with other components of) a memory module 135.

The ToR Ethernet switch 110 may include interface hardware to facilitate RDMA requests between aggregated memory devices on different servers. The enhanced capability CXL switch 130 may include one or more circuits (e.g., it may include an FPGA or an ASIC) to (i) route data to different memory types based on workload, (ii) virtualize host addresses to device addresses, and/or (iii) facilitate RDMA requests between different servers, bypassing the processing circuits 115.

The memory modules 135 may be in an expansion box (e.g., in the same rack as the enclosure housing the motherboard of the enclosure), which may include a predetermined number (e.g., more than 20, or more than 100) of memory modules 135, each plugged into a suitable connector. The modules may be in an M.2 form factor, and the connectors may be M.2 connectors. In some embodiments, the connections between servers are over a different network, other than Ethernet; e.g., they may be wireless connections such as WiFi or 5G connections. Each processing circuit may be an x86 processor or another processor, e.g., an ARM processor or a GPU. The PCIe links on which the CXL links are instantiated may be PCIe 5.0 or another version (e.g., an earlier version or a later (e.g., future) version (e.g., PCIe 6.0)). In some embodiments, a different cache-coherent protocol is used in the system instead of, or in addition to, CXL, and a different cache-coherent switch may be used instead of, or in addition to, the enhanced capability CXL switch 130. Such a cache-coherent protocol may be another standard protocol or a cache-coherent variant of the standard protocol (in a manner analogous to the manner in which CXL is a variant of PCIe 5.0). Examples of standard protocols include, but are not limited to, non-volatile dual in-line memory module (version P) (NVDIMM-P), Cache Coherent Interconnect for Accelerators (CCIX), and Open Coherent Accelerator Processor Interface (OpenCAPI).

The system memory 120 may include, e.g., DDR4 memory, DRAM, HBM, or LPDDR memory. The memory modules 135 may be partitioned or contain cache controllers to handle multiple memory types. The memory modules 135 may be in different form factors, examples of which include but are not limited to HHHL, FHHL, M.2, U.2, mezzanine card, daughter card, E1.S, E1.L, E3.L, and E3.S.

In some embodiments, the system implements an aggregated architecture, including multiple servers, with each server aggregated with multiple CXL-attached memory modules 135. Each of the memory modules 135 may contain multiple partitions that can separately be exposed as memory devices to multiple processing circuits 115. Each
input port of the enhanced capability CXL switch 130 may independently access multiple output ports of the enhanced capability CXL switch 130 and the memory modules 135 connected thereto. As used herein, an "input port" or "upstream port" of the enhanced capability CXL switch 130 is a port connected to (or suitable for connecting to) a PCIe root port, and an "output port" or "downstream port" of the enhanced capability CXL switch 130 is a port connected to (or suitable for connecting to) a PCIe endpoint. As in the case of the embodiment of FIG. 1A, each memory module 135 may expose a set of base address registers (BARs) to the host BIOS as a memory range. One or more of the memory modules 135 may include firmware to transparently manage its memory space behind the host OS map.
In some embodiments, as mentioned above, the enhanced capability CXL switch 130 includes an FPGA (or ASIC) controller 137 and provides additional features beyond switching of CXL packets. For example, it may (as mentioned above) virtualize the memory modules 135, i.e., operate as a translation layer, translating between processing-circuit-side addresses (or "processor-side" addresses, i.e., addresses that are included in memory read and write commands issued by the processing circuits 115) and memory-side addresses (i.e., addresses employed by the enhanced capability CXL switch 130 to address storage locations in the memory modules 135), thereby masking the physical addresses of the memory modules 135 and presenting a virtual aggregation of memory. The controller 137 of the enhanced capability CXL switch 130 may also act as a management device for the memory modules 135 and facilitate host control plane processing. The controller 137 may transparently move data without the participation of the processing circuits 115 and accordingly update the memory map (or "address translation table") so that subsequent accesses function as expected. The controller 137 may contain a switch management device that (i) can bind and unbind the upstream and downstream connections during runtime, as appropriate, and (ii) can enable rich control semantics and statistics associated with data transfers into and out of the memory modules 135. The controller 137 may
105 or to other networked equipment. In some embodi- module 135 containing HBM , and modify its address trans
ments, the controller 137 presents as a Type 2 device to the 45 lation table so that the data, in the new location, are stored
processing circuits 115 , which enables the issuing of cache in the same range of virtual addresses. In some embodiments
invalidate instructions to the processing circuits 115 upon one or more of the memory modules 135 includes flash
receiving remote write requests. In some embodiments , memory (e.g. , NAND flash ), and the controller 137 of the
DDIO technology is enabled, and remote data is first pulled enhanced capability CXL switch 130 implements a flash
to last level cache ( LLC ) of the processing circuit 115 and 50 translation layer for this flash memory. The flash translation
later written to the memory modules 135 ( from cache ). layer may support overwriting of processor- side memory
As mentioned above , one or more of the memory modules locations (by moving the data to a different location and
135 may include persistent storage. If a memory module 135 marking the previous location of the data as invalid ) and it
is presented as a persistent device , the controller 137 of the may perform garbage collection (e.g. , erasing a block , after
enhanced capability CXL switch 130 may manage the 55 moving , to another block, any valid data in the block , when
persistent domain (e.g. , it may store , in the persistent stor- the fraction of data in the block marked invalid exceeds a
age , data identified ( e.g. , by the use of a corresponding threshold ).
operating system function) by a processing circuit 115 as In some embodiments, the controller 137 of the enhanced
requiring persistent storage. In such an embodiment, a capability CXL switch 130 may facilitate a physical function
software API may flush caches and data to the persistent 60 (PF ) to PF transfer. For example, if one of the processing
storage. circuits 115 needs to move data from one physical address
In some embodiments, one or more of the memory modules 135 includes flash memory (e.g., NAND flash), and the controller 137 of the enhanced capability CXL switch 130 implements a flash translation layer for this flash memory. The flash translation layer may support overwriting of processor-side memory locations (by moving the data to a different location and marking the previous location of the data as invalid), and it may perform garbage collection (e.g., erasing a block, after moving any valid data in the block to another block, when the fraction of data in the block marked invalid exceeds a threshold).
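A minimal flash-translation-layer sketch of this behavior follows; the block geometry, the garbage-collection threshold, and the allocation policy (no wear leveling) are illustrative assumptions.

PAGES_PER_BLOCK = 4
GC_THRESHOLD = 0.5

l2p = {}                                   # logical page -> (block, page)
blocks = [[None] * PAGES_PER_BLOCK for _ in range(8)]
invalid = [set() for _ in range(8)]
free_pages = [(b, p) for b in range(8) for p in range(PAGES_PER_BLOCK)]

def write(lpage, data):
    if lpage in l2p:                       # overwrite: invalidate the old copy
        b, p = l2p.pop(lpage)
        invalid[b].add(p)
        maybe_gc(b)
    b, p = free_pages.pop(0)               # simple allocation, no wear leveling
    blocks[b][p] = (lpage, data)
    l2p[lpage] = (b, p)

def maybe_gc(b):
    if len(invalid[b]) / PAGES_PER_BLOCK <= GC_THRESHOLD:
        return
    live = [entry for p, entry in enumerate(blocks[b])
            if entry is not None and p not in invalid[b]]
    blocks[b] = [None] * PAGES_PER_BLOCK   # erase the block
    invalid[b].clear()
    free_pages.extend((b, p) for p in range(PAGES_PER_BLOCK))
    for lpage, data in live:               # move valid data to another block
        del l2p[lpage]
        write(lpage, data)

write(2, "keep")
for v in ("v1", "v2", "v3", "v4"):
    write(1, v)                            # repeated overwrites trigger GC
print(l2p)                                 # {2: (1, 0), 1: (1, 1)}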
In some embodiments, the controller 137 of the enhanced capability CXL switch 130 may facilitate a physical function (PF) to PF transfer. For example, if one of the processing circuits 115 needs to move data from one physical address to another (which may have the same virtual addresses; this fact need not affect the operation of the processing circuit 115), or if the processing circuit 115 needs to move data between two virtual addresses (which the processing circuit 115 would need to have), the controller 137 of the enhanced capability CXL switch 130 may supervise the transfer, without the involvement of the processing circuit 115. For example, the processing circuit 115 may send a CXL request, and data may be transmitted from one memory module 135 to another memory module 135 (e.g., the data may be copied from one memory module 135 to another memory module 135) behind the enhanced capability CXL switch 130 without going to the processing circuit 115. In this situation, because the processing circuit 115 initiated the CXL request, the processing circuit 115 may need to flush its cache to ensure consistency. If instead a Type 2 memory device (e.g., one of the memory modules 135, or an accelerator that may also be connected to the CXL switch) initiates the CXL request and the switch is not virtualized, then the Type 2 memory device may send a message to the processing circuit 115 to invalidate the cache.

In some embodiments, the controller 137 of the enhanced capability CXL switch 130 may facilitate RDMA requests between servers. A remote server 105 may initiate such an RDMA request, and the request may be sent through the ToR Ethernet switch 110, and arrive at the enhanced capability CXL switch 130 in the server 105 responding to the RDMA request (the "local server"). The enhanced capability CXL switch 130 may be configured to receive such an RDMA request and it may treat a group of memory modules 135 in the receiving server 105 (i.e., the server receiving the RDMA request) as its own memory space. In the local server, the enhanced capability CXL switch 130 may receive the RDMA request as a direct RDMA request (i.e., an RDMA request that is not routed through a processing circuit 115 in the local server) and it may send a direct response to the RDMA request (i.e., it may send the response without it being routed through a processing circuit 115 in the local server). In the remote server, the response (e.g., data sent by the local server) may be received by the enhanced capability CXL switch 130 of the remote server, and stored in the memory modules 135 of the remote server, without being routed through a processing circuit 115 in the remote server.
FIG. 1D shows a system similar to that of FIG. 1C, in which the processing circuits 115 are connected to the network interface circuits 125 through the enhanced capability CXL switch 130. The enhanced capability CXL switch 130, the memory modules 135, and the network interface circuits 125 are on an expansion socket adapter 140. The expansion socket adapter 140 may be a circuit board or module that plugs into an expansion socket, e.g., a PCIe connector 145, on the motherboard of the server 105. As such, the server may be any suitable server, modified only by the installation of the expansion socket adapter 140 in the PCIe connector 145. The memory modules 135 may be installed in connectors (e.g., M.2 connectors) on the expansion socket adapter 140. In such an embodiment, (i) the network interface circuits 125 may be integrated into the enhanced capability CXL switch 130, or (ii) each network interface circuit 125 may have a PCIe interface (the network interface circuit 125 may be a PCIe endpoint), so that the processing circuit 115 to which it is connected may communicate with the network interface circuit 125 through a root port to endpoint PCIe connection. The controller 137 of the enhanced capability CXL switch 130 (which may have a PCIe input port connected to the processing circuit 115 and to the network interface circuits 125) may communicate with the network interface circuit 125 through a peer-to-peer PCIe connection.
According to an embodiment of the present invention, there is provided a system, including: a first server, including: a stored-program processing circuit, a network interface circuit, a cache-coherent switch, and a first memory module, wherein: the first memory module is connected to the cache-coherent switch, the cache-coherent switch is connected to the network interface circuit, and the stored-program processing circuit is connected to the cache-coherent switch. In some embodiments, the system further includes a second memory module connected to the cache-coherent switch, wherein the first memory module includes volatile memory and the second memory module includes persistent memory. In some embodiments, the cache-coherent switch is configured to virtualize the first memory module and the second memory module. In some embodiments, the first memory module includes flash memory, and the cache-coherent switch is configured to provide a flash translation layer for the flash memory. In some embodiments, the cache-coherent switch is configured to: monitor an access frequency of a first memory location in the first memory module; determine that the access frequency exceeds a first threshold; and copy the contents of the first memory location into a second memory location, the second memory location being in the second memory module. In some embodiments, the second memory module includes high bandwidth memory (HBM). In some embodiments, the cache-coherent switch is configured to maintain a table for mapping processor-side addresses to memory-side addresses. In some embodiments, the system further includes: a second server, and a network switch connected to the first server and the second server. In some embodiments, the network switch includes a top of rack (ToR) Ethernet switch. In some embodiments, the cache-coherent switch is configured to receive straight remote direct memory access (RDMA) requests, and to send straight RDMA responses. In some embodiments, the cache-coherent switch is configured to receive the remote direct memory access (RDMA) requests through the ToR Ethernet switch and through the network interface circuit, and to send straight RDMA responses through the ToR Ethernet switch and through the network interface circuit. In some embodiments, the cache-coherent switch is configured to support a Compute Express Link (CXL) protocol. In some embodiments, the first server includes an expansion socket adapter, connected to an expansion socket of the first server, the expansion socket adapter including: the cache-coherent switch; and a memory module socket, the first memory module being connected to the cache-coherent switch through the memory module socket. In some embodiments, the memory module socket includes an M.2 socket. In some embodiments, the network interface circuit is on the expansion socket adapter. According to an embodiment of the present invention, there is provided a method for performing remote direct memory access in a computing system, the computing system including: a first server and a second server, the first server including: a stored-program processing circuit, a network interface circuit, a cache-coherent switch, and a first memory module, the method including: receiving, by the cache-coherent switch, a straight remote direct memory access (RDMA) request, and sending, by the cache-coherent switch, a straight RDMA response. In some embodiments: the computing system further includes an Ethernet switch, and the receiving of the straight RDMA request includes receiving the straight RDMA request through the Ethernet switch. In some embodiments, the method further includes: receiving, by the cache-coherent switch, a read command, from the stored-program processing circuit, for a first memory address, translating, by the cache-coherent switch, the first memory address to a second memory address, and retrieving, by the cache-coherent switch, data from the first memory module at the second memory address. In some embodiments, the method further includes: receiving data, by the cache-coherent switch, storing, by the cache-coherent switch, the data in the first memory module, and sending, by the cache-coherent switch, to the stored-program processing circuit, a command for invalidating a cache line. According to an embodiment of the present invention, there is provided a system, including: a first server, including: a stored-program processing circuit, a network interface circuit, cache-coherent switching means, and a first memory module, wherein: the first memory module is connected to the cache-coherent switching means, the cache-coherent switching means is connected to the network interface circuit, and the stored-program processing circuit is connected to the cache-coherent switching means.
FIG. 1E shows an embodiment in which each of a plurality of servers 105 is connected to a ToR server-linking switch 112, which may be a PCIe 5.0 CXL switch, having PCIe capabilities, as illustrated. The server-linking switch 112 may include an FPGA or ASIC, and may provide performance (in terms of throughput and latency) superior to that of an Ethernet switch. Each of the servers 105 may include a plurality of memory modules 135 connected to the server-linking switch 112 through the enhanced capability CXL switch 130 and through a plurality of PCIe connectors. Each of the servers 105 may also include one or more processing circuits 115, and system memory 120, as shown. The server-linking switch 112 may operate as a master, and each of the enhanced capability CXL switches 130 may operate as a slave, as discussed in further detail below.

In the embodiment of FIG. 1E, the server-linking switch 112 may group or batch multiple cache requests received from different servers 105, and it may group packets, reducing control overhead. The enhanced capability CXL switch 130 may include a slave controller (e.g., a slave FPGA or a slave ASIC) to (i) route data to different memory types based on workload, (ii) virtualize processor-side addresses to memory-side addresses, and (iii) facilitate coherent requests between different servers 105, bypassing the processing circuits 115. The system illustrated in FIG. 1E may be CXL 2.0 based, it may include distributed shared memory within a rack, and it may use the ToR server-linking switch 112 to natively connect with remote nodes.

The ToR server-linking switch 112 may have an additional network connection (e.g., an Ethernet connection, as illustrated, or another kind of connection, e.g., a wireless connection such as a WiFi connection or a 5G connection) for making connections to other servers or to clients. The server-linking switch 112 and the enhanced capability CXL switch 130 may each include a controller, which may be or include a processing circuit such as an ARM processor. The PCIe interfaces may comply with the PCIe 5.0 standard or with an earlier version, or with a future version of the PCIe standard, or interfaces complying with a different standard (e.g., NVDIMM-P, CCIX, or OpenCAPI) may be employed instead of PCIe interfaces. The memory modules 135 may include various memory types including DDR4 DRAM, HBM, LPDDR, NAND flash, or solid state drives (SSDs). The memory modules 135 may be partitioned or contain cache controllers to handle multiple memory types, and they may be in different form factors, such as HHHL, FHHL, M.2, U.2, mezzanine card, daughter card, E1.S, E1.L, E3.L, or E3.S.

In the embodiment of FIG. 1E, the enhanced capability CXL switch 130 may enable one-to-many and many-to-one switching, and it may enable a fine grain load-store interface at the flit (64-byte) level. Each server may have aggregated memory devices, each device being partitioned into multiple logical devices each with a respective LD-ID. A ToR switch 112 (which may be referred to as a "server-linking switch") enables the one-to-many functionality, and the enhanced capability CXL switch 130 in the server 105 enables the many-to-one functionality. The server-linking switch 112 may be a PCIe switch, or a CXL switch, or both. In such a system, the requesters may be the processing circuits 115 of the multiple servers 105, and the responders may be the many aggregated memory modules 135. The hierarchy of two switches (with the master switch being, as mentioned above, the server-linking switch 112, and the slave switch being the enhanced capability CXL switch 130) enables any-any communication. Each of the memory modules 135 may have one physical function (PF) and as many as 16 isolated logical devices. In some embodiments the number of logical devices (e.g., the number of partitions) may be limited (e.g., to 16), and one control partition (which may be a physical function used for controlling the device) may also be present. Each of the memory modules 135 may be a Type 2 device with cxl.cache, cxl.mem and cxl.io and address translation service (ATS) implementation to deal with cache line copies that the processing circuits 115 may hold. The enhanced capability CXL switch 130 and a fabric manager may control discovery of the memory modules 135 and (i) perform device discovery, and virtual CXL software creation, and (ii) bind virtual to physical ports. As in the embodiments of FIGS. 1A-1D, the fabric manager may operate through connections over an SMBus sideband. An interface to the memory modules 135, which may be Intelligent Platform Management Interface (IPMI) or an interface that complies with the Redfish standard (and that may also provide additional features not required by the standard), may enable configurability.

As mentioned above, some embodiments implement a hierarchical structure with a master controller (which may be implemented in an FPGA or in an ASIC) being part of the server-linking switch 112, and a slave controller being part of the enhanced capability CXL switch 130, to provide a load-store interface (i.e., an interface having cache-line (e.g., 64 byte) granularity and that operates within the coherence domain without software driver involvement). Such a load-store interface may extend the coherence domain beyond an individual server, or CPU or host, and may involve a physical medium that is either electrical or optical (e.g., an optical connection with electrical-to-optical transceivers at both ends). In operation, the master controller (in the server-linking switch 112) boots (or "reboots") and configures all the servers 105 on the rack. The master controller may have visibility on all the hosts, and it may (i) discover each server and discover how many servers 105 and memory modules 135 exist in the server cluster, (ii) configure each of the servers 105 independently, (iii) enable or disable some blocks of memory (e.g., enable or disable any of the memory modules 135) on different servers, based on, e.g., the configuration of the racks, (iv) control access (e.g., which server can control which other server), (v) implement flow control (e.g., it may, since all host and device requests go through the master, transmit data from the one server to another server, and perform flow control on the data), (vi) group or batch requests or packets (e.g., multiple cache requests being received by the master from different servers 105), and (vii) receive remote software updates, broadcast communications, and the like. In batch mode, the server-linking switch 112 may receive a plurality of packets destined for the same server (e.g., destined for a first server) and send them together (i.e., without a pause between them) to the first server. For example, the server-linking switch 112 may receive a first packet, from a second server, and a second packet, from a third server, and transmit the first packet and the second packet, together, to the first server.
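The batch mode just described might look like the following queue-based sketch; the data structures and function names are illustrative assumptions.

from collections import defaultdict

pending = defaultdict(list)    # destination server -> queued packets

def enqueue(dest, packet):
    pending[dest].append(packet)

def flush(dest, send):
    batch = pending.pop(dest, [])
    if batch:
        send(dest, batch)      # one transmission for the whole group

enqueue("server1", ("server2", b"req-a"))   # from a second server
enqueue("server1", ("server3", b"req-b"))   # from a third server
flush("server1", lambda d, b: print(d, [p[1] for p in b]))
# server1 [b'req-a', b'req-b']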
Each of the servers 105 may expose, to the master controller, (i) an IPMI network interface, (ii) a system event log (SEL), and (iii) a board management controller (BMC), enabling the master controller to measure performance, to measure reliability on the fly, and to reconfigure the servers 105.

In some embodiments, a software architecture that facilitates a high availability load-store interface is used. Such a software architecture may provide reliability, replication, consistency, system coherence, hashing, caching, and persistence. The software architecture may provide reliability (in a system with a large number of servers), by performing periodic hardware checks of the CXL device components via IPMI. For example, the server-linking switch 112 may query a status of a memory server 150, through an IPMI interface of the memory server 150, querying, for example, the power status (whether the power supplies of the memory server 150 are operating properly), the network status (whether the interface to the server-linking switch 112 is operating properly), and an error check status (whether an error condition is present in any of the subsystems of the memory server 150). The software architecture may provide replication, in that the master controller may replicate data stored in the memory modules 135 and maintain data consistency across replicas.
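A periodic reliability check of this kind could be sketched as follows. The query_ipmi function is a stand-in for whatever IPMI transport the deployment provides, not a real library call, and the status keys mirror the three statuses named above.

STATUS_KEYS = ("power", "network", "error_check")

def query_ipmi(server, key):
    # Placeholder: a real implementation would issue an IPMI request to
    # the server's BMC over the sideband and parse the response.
    return "ok"

def check_memory_server(server):
    report = {key: query_ipmi(server, key) for key in STATUS_KEYS}
    healthy = all(v == "ok" for v in report.values())
    return healthy, report

print(check_memory_server("memory-server-150"))
# (True, {'power': 'ok', 'network': 'ok', 'error_check': 'ok'})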
The software architecture may provide consistency in that the master controller may be configured with different consistency levels, and the server-linking switch 112 may adjust the packet format according to the consistency level to be maintained. For example, if eventual consistency is being maintained, the server-linking switch 112 may reorder the requests, while to maintain strict consistency, the server-linking switch 112 may maintain a scoreboard of all requests with precise timestamps at the switches. The software architecture may provide system coherence in that multiple processing circuits 115 may be reading from or writing to the same memory address, and the master controller may, to maintain coherence, be responsible for reaching the home node of the address (using a directory lookup) or broadcasting the request on a common bus.
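The two consistency regimes could be sketched as follows; the request format and the scoreboard representation are illustrative assumptions.

import heapq

def process_eventual(requests, execute):
    # Eventual consistency: independent requests may be freely reordered.
    for req in sorted(requests, key=lambda r: r["target"]):
        execute(req)

def process_strict(requests, execute):
    # Strict consistency: a timestamped scoreboard enforces global order.
    scoreboard = [(r["timestamp"], i, r) for i, r in enumerate(requests)]
    heapq.heapify(scoreboard)
    while scoreboard:
        _, _, req = heapq.heappop(scoreboard)
        execute(req)

reqs = [{"timestamp": 2, "target": "a"}, {"timestamp": 1, "target": "b"}]
process_eventual(reqs, lambda r: print("eventual:", r["target"]))
process_strict(reqs, lambda r: print("strict:", r["timestamp"]))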
The software architecture may provide hashing in that the server-linking switch 112 and the enhanced capability CXL switch may maintain a virtual mapping of addresses which may use consistent hashing with multiple hash functions to evenly map data to all CXL devices across all nodes at boot-up (or to adjust when one server goes down or comes up).
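A consistent-hashing ring with multiple hash points per device, as described, can be sketched as follows; the device names and replica count are illustrative.

import bisect, hashlib

def h(value, salt):
    return int(hashlib.sha256(f"{salt}:{value}".encode()).hexdigest(), 16)

class Ring:
    def __init__(self, devices, replicas=4):
        # Multiple hash functions (salted hashes) per device spread the
        # address space evenly across all CXL devices.
        self.points = sorted((h(d, i), d) for d in devices
                             for i in range(replicas))

    def lookup(self, address):
        keys = [p[0] for p in self.points]
        i = bisect.bisect(keys, h(address, "addr")) % len(self.points)
        return self.points[i][1]

    def remove(self, device):       # e.g., a server goes down
        self.points = [p for p in self.points if p[1] != device]

ring = Ring(["cxl0", "cxl1", "cxl2"])
owner = ring.lookup(0xDEAD000)
ring.remove(owner)                  # only the removed device's ranges move
print(owner, "->", ring.lookup(0xDEAD000))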
The software architecture may provide caching in that the master controller may designate certain memory partitions (e.g., in a memory module 135 that includes HBM or a technology with similar capabilities) to act as cache (employing write-through caching or write-back caching, for example). The software architecture may provide persistence in that the master controller and the slave controller may manage persistent domains and flushes.

In some embodiments, the capabilities of the CXL switch are integrated into the controller of a memory module 135. In such an embodiment, the server-linking switch 112 may nonetheless act as a master and have enhanced features as discussed elsewhere herein. The server-linking switch 112 may also manage other storage devices in the system, and it may have an Ethernet connection (e.g., a 100 GbE connection), for connecting, e.g., to client machines that are not part of the PCIe network formed by the server-linking switch 112.

In some embodiments, the server-linking switch 112 has enhanced capabilities and also includes an integrated CXL controller. In other embodiments, the server-linking switch 112 is only a physical routing device, and each server 105 includes a master CXL controller. In such an embodiment, masters across different servers may negotiate a master-slave architecture. The intelligence functions of (i) the enhanced capability CXL switch 130 and of (ii) the server-linking switch 112 may be implemented in one or more FPGAs, one or more ASICs, one or more ARM processors, or in one or more SSD devices with compute capabilities. The server-linking switch 112 may perform flow control, e.g., by reordering independent requests. In some embodiments, because the interface is load-store, RDMA is optional but there may be intervening RDMA requests that use the PCIe physical medium (instead of 100 GbE). In such an embodiment, a remote host may initiate an RDMA request, which may be transmitted to the enhanced capability CXL switch 130 through the server-linking switch 112. The server-linking switch 112 and the enhanced capability CXL switch 130 may prioritize RDMA 4 KB requests, or CXL's flit (64-byte) requests.

As in the embodiment of FIGS. 1C and 1D, the enhanced capability CXL switch 130 may be configured to receive such an RDMA request and it may treat a group of memory modules 135 in the receiving server 105 (i.e., the server receiving the RDMA request) as its own memory space. Further, the enhanced capability CXL switch 130 may virtualize across the processing circuits 115 and initiate an RDMA request on remote enhanced capability CXL switches 130 to move data back and forth between servers 105, without the processing circuits 115 being involved.

FIG. 1F shows a system similar to that of FIG. 1E, in which the processing circuits 115 are connected to the network interface circuits 125 through the enhanced capability CXL switch 130. As in the embodiment of FIG. 1D, in FIG. 1F the enhanced capability CXL switch 130, the memory modules 135, and the network interface circuits 125 are on an expansion socket adapter 140. The expansion socket adapter 140 may be a circuit board or module that plugs into an expansion socket, e.g., a PCIe connector 145, on the motherboard of the server 105. As such, the server may be any suitable server, modified only by the installation of the expansion socket adapter 140 in the PCIe connector 145. The memory modules 135 may be installed in connectors (e.g., M.2 connectors) on the expansion socket adapter 140. In such an embodiment, (i) the network interface circuits 125 may be integrated into the enhanced capability CXL switch 130, or (ii) each network interface circuit 125 may have a PCIe interface (the network interface circuit 125 may be a PCIe endpoint), so that the processing circuit 115 to which it is connected may communicate with the network interface circuit 125 through a root port to endpoint PCIe connection, and the controller 137 of the enhanced capability CXL switch 130 (which may have a PCIe input port connected to the processing circuit 115 and to the network interface circuits 125) may communicate with the network interface circuit 125 through a peer-to-peer PCIe connection.

According to an embodiment of the present invention, there is provided a system, including: a first server, including: a stored-program processing circuit, a cache-coherent switch, and a first memory module; and a second server; and a server-linking switch connected to the first server and to the second server, wherein: the first memory module is connected to the cache-coherent switch, the cache-coherent switch is connected to the server-linking switch, and the stored-program processing circuit is connected to the cache-coherent switch. In some embodiments, the server-linking switch includes a Peripheral Component Interconnect Express (PCIe) switch. In some embodiments, the server-linking switch includes a Compute Express Link (CXL) switch. In some embodiments, the server-linking switch includes a top of rack (ToR) CXL switch. In some embodiments, the server-linking switch is configured to discover the first server. In some embodiments, the server-linking switch is configured to cause the first server to reboot. In some embodiments, the server-linking switch is configured to cause the cache-coherent switch to disable the first memory module. In some embodiments, the server-linking switch is configured to transmit data from the second server to the first server, and to perform flow control on the data. In some embodiments, the system further includes a third server connected to the server-linking switch, wherein: the server-linking switch is configured to: receive a first packet, from the second server, receive a second packet, from the third server, and transmit the first packet and the second packet to the first server. In some embodiments, the system further includes a second memory module connected to the cache-coherent switch, wherein the first memory module includes volatile memory and the second memory module includes persistent memory. In some embodiments, the cache-coherent switch is configured to virtualize the first memory module and the second memory module. In some embodiments, the first memory module includes flash memory, and the cache-coherent switch is configured to provide a flash translation layer for the flash memory. In some embodiments, the first server includes an expansion socket adapter, connected to an expansion socket of the first server, the expansion socket adapter including: the cache-coherent switch; and a memory module socket, the first memory module being connected to the cache-coherent switch through the memory module socket. In some embodiments, the memory module socket includes an M.2 socket. In some embodiments: the cache-coherent switch is connected to the server-linking switch through a connector, and the connector is on the expansion socket adapter. According to an embodiment of the present invention, there is provided a method for performing remote direct memory access in a computing system, the computing system including: a first server, a second server, a third server, and a server-linking switch connected to the first server, to the second server, and to the third server, the first server including: a stored-program processing circuit, a cache-coherent switch, and a first memory module, the method including: receiving, by the server-linking switch, a first packet, from the second server, receiving, by the server-linking switch, a second packet, from the third server, and transmitting the first packet and the second packet to the first server. In some embodiments, the method further includes: receiving, by the cache-coherent switch, a straight remote direct memory access (RDMA) request, and sending, by the cache-coherent switch, a straight RDMA response. In some embodiments, the receiving of the straight RDMA request includes receiving the straight RDMA request through the server-linking switch. In some embodiments, the method further includes: receiving, by the cache-coherent switch, a read command, from the stored-program processing circuit, for a first memory address, translating, by the cache-coherent switch, the first memory address to a second memory address, and retrieving, by the cache-coherent switch, data from the first memory module at the second memory address. According to an embodiment of the present invention, there is provided a system, including: a first server, including: a stored-program processing circuit, cache-coherent switching means, and a first memory module; and a second server; and a server-linking switch connected to the first server and to the second server, wherein: the first memory module is connected to the cache-coherent switching means, the cache-coherent switching means is connected to the server-linking switch, and the stored-program processing circuit is connected to the cache-coherent switching means.

FIG. 1G shows an embodiment in which each of a plurality of memory servers 150 is connected to a ToR server-linking switch 112, which may be a PCIe 5.0 CXL switch, as illustrated. As in the embodiment of FIGS. 1E and 1F, the server-linking switch 112 may include an FPGA or ASIC, and may provide performance (in terms of throughput and latency) superior to that of an Ethernet switch. As in the embodiment of FIGS. 1E and 1F, the memory server 150 may include a plurality of memory modules 135 connected to the server-linking switch 112 through a plurality of PCIe connectors. In the embodiment of FIG. 1G, the processing circuits 115 and system memory 120 may be absent, and the primary purpose of the memory server 150 may be to provide memory, for use by other servers 105 having computing resources.

In the embodiment of FIG. 1G, the server-linking switch 112 may group or batch multiple cache requests received from different memory servers 150, and it may group packets, reducing control overhead. The enhanced capability CXL switch 130 may include composable hardware building blocks to (i) route data to different memory types based on workload, and (ii) virtualize processor-side addresses (translating such addresses to memory-side addresses). The system illustrated in FIG. 1G may be CXL 2.0 based, it may include composable and disaggregated shared memory within a rack, and it may use the ToR server-linking switch 112 to provide pooled (i.e., aggregated) memory to remote devices.

The ToR server-linking switch 112 may have an additional network connection (e.g., an Ethernet connection, as illustrated, or another kind of connection, e.g., a wireless connection such as a WiFi connection or a 5G connection) for making connections to other servers or to clients. The server-linking switch 112 and the enhanced capability CXL switch 130 may each include a controller, which may be or include a processing circuit such as an ARM processor. The PCIe interfaces may comply with the PCIe 5.0 standard or with an earlier version, or with a future version of the PCIe standard, or a different standard (e.g., NVDIMM-P, CCIX, or OpenCAPI) may be employed instead of PCIe. The memory modules 135 may include various memory types including DDR4 DRAM, HBM, LPDDR, NAND flash, and solid state drives (SSDs). The memory modules 135 may be partitioned or contain cache controllers to handle multiple memory types, and they may be in different form factors, such as HHHL, FHHL, M.2, U.2, mezzanine card, daughter card, E1.S, E1.L, E3.L, or E3.S.

In the embodiment of FIG. 1G, the enhanced capability CXL switch 130 may enable one-to-many and many-to-one switching, and it may enable a fine grain load-store interface at the flit (64-byte) level. Each memory server 150 may have aggregated memory devices, each device being partitioned into multiple logical devices each with a respective LD-ID. The enhanced capability CXL switch 130 may include a controller 137 (e.g., an ASIC or an FPGA), and a circuit (which may be separate from, or part of, such an ASIC or FPGA) for device discovery, enumeration, partitioning, and presenting physical address ranges. Each of the memory modules 135 may have one physical function (PF) and as many as 16 isolated logical devices. In some embodiments the number of logical devices (e.g., the number of partitions) may be limited (e.g., to 16), and one control partition (which may be a physical function used for controlling the device) may also be present. Each of the memory modules 135 may be a Type 2 device with cxl.cache, cxl.mem and cxl.io and address translation service (ATS) implementation to deal with cache line copies that the processing circuits 115 may hold.

The enhanced capability CXL switch 130 and a fabric manager may control discovery of the memory modules 135 and (i) perform device discovery, and virtual CXL software creation, and (ii) bind virtual to physical ports. As in the embodiments of FIGS. 1A-1D, the fabric manager may operate through connections over an SMBus sideband. An interface to the memory modules 135, which may be Intelligent Platform Management Interface (IPMI) or an interface that complies with the Redfish standard (and that may also provide additional features not required by the standard), may enable configurability.

Building blocks, for the embodiment of FIG. 1G, may include (as mentioned above) a CXL controller 137 implemented on an FPGA or on an ASIC, switching to enable aggregating of memory devices (e.g., of the memory modules 135), SSDs, accelerators (GPUs, NICs), CXL and PCIe5 connectors, and firmware to expose device details to the advanced configuration and power interface (ACPI) tables of the operating system, such as the heterogeneous memory attribute table (HMAT) or the static resource affinity table (SRAT).

In some embodiments, the system provides composability. The system may provide an ability to online and offline CXL devices and other accelerators based on the software configuration, and it may be capable of grouping accelerator, memory, and storage device resources and rationing them to each memory server 150 in the rack. The system may hide the physical address space and provide transparent cache using faster devices like HBM and SRAM.

In the embodiment of FIG. 1G, the controller 137 of the enhanced capability CXL switch 130 may (i) manage the memory modules 135, (ii) integrate and control heterogeneous devices such as NICs, SSDs, GPUs, and DRAM, and (iii) effect dynamic reconfiguration of storage to memory devices by power-gating. For example, the ToR server-linking switch 112 may disable power (i.e., shut off power, or reduce power) to one of the memory modules 135 (by instructing the enhanced capability CXL switch 130 to disable power to the memory module 135). The enhanced capability CXL switch 130 may then disable power to the memory module 135, upon being instructed, by the server-linking switch 112, to disable power to the memory module. Such disabling may conserve power, and it may improve the performance (e.g., the throughput and latency) of other memory modules 135 in the memory server 150.
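The two-step command chain just described (the ToR switch instructing the CXL switch, which gates the module's power) might be sketched as follows; the class and method names are hypothetical.

class CxlSwitch:
    def __init__(self, modules):
        self.powered = {m: True for m in modules}

    def disable_power(self, module):
        self.powered[module] = False   # gate the module's power rail

class ServerLinkingSwitch:
    def __init__(self, cxl_switch):
        self.cxl_switch = cxl_switch

    def disable_module(self, module):
        # The ToR switch does not touch the module directly; it instructs
        # the enhanced capability CXL switch to do so.
        self.cxl_switch.disable_power(module)

cxl = CxlSwitch(["mod0", "mod1"])
ServerLinkingSwitch(cxl).disable_module("mod1")
print(cxl.powered)   # {'mod0': True, 'mod1': False}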
Each remote server 105 may see a different logical view of memory modules 135 and their connections based on negotiation. The controller 137 of the enhanced capability CXL switch 130 may maintain state so that each remote server maintains allotted resources and connections, and it may perform compression or deduplication of memory to save memory capacity (using a configurable chunk size).
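Chunk-level deduplication with a configurable chunk size might be sketched as follows; the fingerprinting scheme and chunk size are illustrative assumptions.

import hashlib

CHUNK_SIZE = 4096      # the configurable chunk size

store = {}             # chunk fingerprint -> chunk bytes (stored once)
refs = []              # logical layout: ordered list of fingerprints

def write_deduplicated(data):
    for i in range(0, len(data), CHUNK_SIZE):
        chunk = data[i:i + CHUNK_SIZE]
        fp = hashlib.sha256(chunk).hexdigest()
        store.setdefault(fp, chunk)       # duplicate chunks share storage
        refs.append(fp)

write_deduplicated(b"A" * CHUNK_SIZE * 3)  # three identical chunks
print(len(refs), "chunks referenced,", len(store), "stored")
# 3 chunks referenced, 1 stored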
The disaggregated rack of FIG. 1G may have its own BMC. It also may expose an IPMI network interface and a system event log (SEL) to remote devices, enabling the master (e.g., a remote server using storage provided by the memory servers 150) to measure performance and reliability on the fly, and to reconfigure the disaggregated rack. The disaggregated rack of FIG. 1G may provide reliability, replication, consistency, system coherence, hashing, caching, and persistence, in a manner analogous to that described herein for the embodiment of FIG. 1E, with, e.g., coherence being provided with multiple remote servers reading from or writing to the same memory address, and with each remote server being configured with different consistency levels. In some embodiments, the server-linking switch maintains eventual consistency between data stored on a first memory server, and data stored on a second memory server. The server-linking switch 112 may maintain different consistency levels for different pairs of servers; for example, the server-linking switch may also maintain, between data stored on the first memory server, and data stored on a third memory server, a consistency level that is strict consistency, sequential consistency, causal consistency, or processor consistency. The system may employ communications in "local-band" (the server-linking switch 112) and "global-band" (disaggregated server) domains. Writes may be flushed to the "global band" to be visible to new reads from other servers. The controller 137 of the enhanced capability CXL switch 130 may manage persistent domains and flushes separately for each remote server. For example, the cache-coherent switch may monitor a fullness of a first region of memory (volatile memory, operating as a cache), and, when the fullness level exceeds a threshold, the cache-coherent switch may move data from the first region of memory to a second region of memory, the second region of memory being in persistent memory. Flow control may be handled in that priorities may be established, by the controller 137 of the enhanced capability CXL switch 130, among remote servers, to present different perceived latencies and bandwidths.
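The fullness-triggered flush from the volatile cache region to persistent memory could be sketched as follows; the threshold, capacity, and dictionary representation are illustrative assumptions.

FLUSH_THRESHOLD = 0.75
CACHE_CAPACITY = 4

volatile_cache = {}    # first region of memory, operating as a cache
persistent = {}        # second region of memory (persistent)

def write_cached(addr, value):
    volatile_cache[addr] = value
    if len(volatile_cache) / CACHE_CAPACITY > FLUSH_THRESHOLD:
        persistent.update(volatile_cache)  # flush to the persistent domain
        volatile_cache.clear()

for a in range(4):
    write_cached(a, a * 10)
print(persistent)      # {0: 0, 1: 10, 2: 20, 3: 30}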
According to an embodiment of the present invention, there is provided a system, including: a first memory server, including: a cache-coherent switch, and a first memory module; and a second memory server; and a server-linking switch connected to the first memory server and to the second memory server, wherein: the first memory module is connected to the cache-coherent switch, and the cache-coherent switch is connected to the server-linking switch. In some embodiments, the server-linking switch is configured to disable power to the first memory module. In some embodiments: the server-linking switch is configured to disable power to the first memory module by instructing the cache-coherent switch to disable power to the first memory module, and the cache-coherent switch is configured to disable power to the first memory module, upon being instructed, by the server-linking switch, to disable power to the first memory module. In some embodiments, the cache-coherent switch is configured to perform deduplication within the first memory module. In some embodiments, the cache-coherent switch is configured to compress data and to store compressed data in the first memory module. In some embodiments, the server-linking switch is configured to query a status of the first memory server. In some embodiments, the server-linking switch is configured to query a status of the first memory server through an Intelligent Platform Management Interface (IPMI). In some embodiments, the querying of a status includes querying a status selected from the group consisting of a power status, a network status, and an error check status. In some embodiments, the server-linking switch is configured to batch cache requests directed to the first memory server. In some embodiments, the system further includes a third memory server connected to the server-linking switch, wherein the server-linking switch is configured to maintain, between data stored on the first memory server and data stored on the third memory server, a consistency level selected from the group consisting of strict consistency, sequential consistency, causal consistency, and processor consistency. In some embodiments, the cache-coherent switch is configured to: monitor a fullness of a first region of memory, and move data from the first region of memory to a second region of memory, wherein: the first region of memory is in volatile memory, and the second region of memory is in persistent memory. In some embodiments, the server-linking switch includes a Peripheral Component Interconnect Express (PCIe) switch. In some embodiments, the server-linking switch includes a Compute Express Link (CXL) switch. In some embodiments, the server-linking switch includes a top of rack (ToR) CXL switch. In some embodiments, the server-linking switch is configured to transmit data from the second memory server to the first memory server, and to perform flow control on the data. In some embodiments, the system further includes a third memory server connected to the server-linking switch, wherein: the server-linking switch is configured to: receive a first packet, from the second memory server, receive a second packet, from the third memory server, and transmit the first packet and the second packet to the first memory server. According to an embodiment of the present invention, there is provided a method for performing remote direct memory access in a computing system, the computing system including: a first memory server; a first server; a second server; and a server-linking switch connected to the first memory server, to the first server, and to the second server, the first memory server including: a cache-coherent switch, and a first memory module; the first server including: a stored-program processing circuit; the second server including: a stored-program processing circuit; the method including: receiving, by the server-linking switch, a first packet, from the first server; receiving, by the server-linking switch, a second packet, from the second server; and transmitting the first packet and the second packet to the first memory server. In some embodiments, the method further includes: compressing data, by the cache-coherent switch, and storing the data in the first memory module. In some embodiments, the method further includes: querying, by the server-linking switch, a status of the first memory server. According to an embodiment of the present invention, there is provided a system, including: a first memory server, including: a cache-coherent switch, and a first memory module; and a second memory server; and server-linking switching means connected to the first memory server and to the second memory server, wherein: the first memory module is connected to the cache-coherent switch, and the cache-coherent switch is connected to the server-linking switching means.

FIGS. 2A-2D are flow charts for various embodiments. In the embodiments of these flow charts, the processing circuits 115 are CPUs; in other embodiments they may be other processing circuits (e.g., GPUs). Referring to FIG. 2A, the controller 137 of a memory module 135 of the embodiment of FIGS. 1A and 1B, or the enhanced capability CXL switch 130 of any of the embodiments of FIGS. 1C-1G, may virtualize across the processing circuit 115 and initiate an RDMA request on an enhanced capability CXL switch 130 in another server 105, to move data back and forth between servers 105, without involving a processing circuit 115 in either server (with the virtualization being handled by the controller 137 of the enhanced capability CXL switches 130). For example, at 205, the controller 137 of the memory module 135, or the enhanced capability CXL switch 130, generates an RDMA request for additional remote memory (e.g., CXL memory or aggregated memory); at 210, the network interface circuits 125 transmit the request to the ToR Ethernet switch 110 (which may have an RDMA interface), bypassing processing circuits; at 215, the ToR Ethernet switch 110 routes the RDMA request to the remote server 105 for processing by the controller 137 of a memory module 135, or by a remote enhanced capability CXL switch 130, via RDMA access to remote aggregated memory, bypassing the remote processing circuit 115; at 220, the ToR Ethernet switch 110 receives the processed data and routes the data to the local memory module 135, or to the local enhanced capability CXL switch 130, bypassing the local processing circuits 115 via RDMA; and, at 222, the controller 137 of a memory module 135 of the embodiment of FIGS. 1A and 1B, or the enhanced capability CXL switch 130, receives the RDMA response straightly (e.g., without it being forwarded by the processing circuits 115).

In such an embodiment, the controller 137 of the remote memory module 135, or the enhanced capability CXL switch 130 of the remote server 105, is configured to receive straight remote direct memory access (RDMA) requests and to send straight RDMA responses. As used herein, the controller 137 of the remote memory module 135 receiving, or the enhanced capability CXL switch 130 receiving, "straight RDMA requests" (or receiving such requests "straightly") means receiving, by the controller 137 of the remote memory module 135, or by the enhanced capability CXL switch 130, such requests without their being forwarded or otherwise processed by a processing circuit 115 of the remote server, and sending, by the controller 137 of the remote memory module 135, or by the enhanced capability CXL switch 130, "straight RDMA responses" (or sending such responses "straightly") means sending such responses without their being forwarded or otherwise processed by a processing circuit 115 of the remote server.
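The FIG. 2A path (steps 205 through 222) could be modeled as follows, with stub classes and hypothetical method names; the point of the sketch is the routing, in which every hop goes switch to switch and no processing circuit at either end touches the request or the response.

class TorSwitch:
    def route(self, msg, to):           # steps 210/215 and step 220
        return to.receive(msg)

class EnhancedCxlSwitch:
    def __init__(self, memory):
        self.memory = memory
        self.inbox = []

    def make_request(self, addr, n):    # step 205
        return ("READ", addr, n)

    def receive(self, msg):
        if isinstance(msg, tuple):      # a request: serve it straightly
            _, addr, n = msg
            return self.memory[addr:addr + n]
        self.inbox.append(msg)          # a response: accept it straightly (222)

tor = TorSwitch()
local = EnhancedCxlSwitch(memory=b"")
remote = EnhancedCxlSwitch(memory=b"remote-aggregated-memory")
req = local.make_request(0, 6)
data = tor.route(req, to=remote)        # served without the remote CPU
tor.route(data, to=local)               # delivered without the local CPU
print(local.inbox)                      # [b'remote']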
Referring to FIG. 2B, in another embodiment, RDMA may be performed with the processing circuit of the remote server being involved in the handling of the data. For example, at 225, a processing circuit 115 may transmit data or a workload request over Ethernet; at 230, the ToR Ethernet switch 110 may receive the request and route it to the corresponding server 105 of the plurality of servers 105; at 235, the request may be received, within the server, over port(s) of the network interface circuits 125 (e.g., 100 GbE-enabled NICs); at 240, the processing circuits 115 (e.g., x86 processing circuits) may receive the request from the network interface circuits 125; and, at 245, the processing circuits 115 may process the request (e.g., together), using DDR and additional memory resources via the CXL 2.0 protocol to share the memory (which, in the embodiment of FIGS. 1A and 1B, may be aggregated memory).

Referring to FIG. 2C, in the embodiment of FIG. 1E, RDMA may be performed with the processing circuit of the remote server being involved in the handling of the data. For example, at 225, a processing circuit 115 may transmit data or a workload request over Ethernet or PCIe; at 230, the ToR Ethernet switch 110 may receive the request and route it to the corresponding server 105 of the plurality of servers 105; at 235, the request may be received, within the server, over port(s) of the PCIe connector; at 240, the processing circuits 115 (e.g., x86 processing circuits) may receive the request from the network interface circuits 125; and, at 245, the processing circuits 115 may process the request (e.g., together), using DDR and additional memory resources via the CXL 2.0 protocol to share the memory (which, in the embodiment of FIGS. 1A and 1B, may be aggregated memory). At 250, the processing circuit 115 may identify a requirement to access memory contents (e.g., DDR or aggregated memory contents) from a different server; at 252, the processing circuit 115 may send the request for said memory contents (e.g., DDR or aggregated memory contents) from a different server, via a CXL protocol (e.g., CXL 1.1 or CXL 2.0); at 254, the request propagates through the local PCIe connector to the server-linking switch 112, which then transmits the request to a second PCIe connector of a second server on the rack; at 256, the second processing circuits 115 (e.g., x86 processing circuits) receive the request from the second PCIe connector; at 258, the second processing circuits 115 may process the request (e.g., retrieval of memory contents) together, using second DDR and second additional memory resources via the CXL 2.0 protocol to share the aggregated memory; and, at 260, the second processing circuits (e.g., x86 processing circuits) transmit the result of the request back to the original processing circuits via respective PCIe connectors and through the server-linking switch 112.

Referring to FIG. 2D, in the embodiment of FIG. 1G, RDMA may be performed with the processing circuit of the remote server being involved in the handling of the data. For example, at 225, a processing circuit 115 may transmit data or a workload request over Ethernet; at 230, the ToR Ethernet switch 110 may receive the request and route it to the corresponding server 105 of the plurality of servers 105; at 235, the request may be received, within the server, over port(s) of the network interface circuits 125 (e.g., 100 GbE-enabled NICs). At 262, a memory module 135 receives the request from the PCIe connector; at 264, the controller of the memory module 135 processes the request, using local memory; at 250, the controller of the memory module 135 identifies a requirement to access memory contents (e.g., aggregated memory contents) from a different server; at 252, the controller of the memory module 135 sends a request for said memory contents (e.g., aggregated memory contents) from a different server via the CXL protocol; at 254, the request propagates through the local PCIe connector to the server-linking switch 112, which then transmits the request to a second PCIe connector of a second server on the rack; and at 266, the second PCIe connector provides access via the CXL protocol to share the aggregated memory to allow the controller of the memory module 135 to retrieve memory contents.
As used herein , a “ server ” is a computing system includ- 45 sections should not be limited by these terms. These terms
ing at least one stored -program processing circuit (e.g. , a are only used to distinguish one element, component, region ,
processing circuit 115 ) , at least one memory resource ( e.g. , layer or section from another element, component, region ,
a system memory 120 ) , and at least one circuit for providing layer or section . Thus, a first element, component, region ,
network connectivity (e.g. , a network interface circuit 125 ) . layer or section discussed herein could be termed a second
As used herein , “ a portion of something means “ at least 50 element, component, region, layer or section , without
some of ” the thing, and as such may mean less than all of, departing from the spirit and scope of the inventive concept.
or all of, the thing. As such, “ a portion of a thing includes Spatially relative terms, such as “ beneath ” , “ below ” ,
the entire thing as a special case , i.e. , the entire thing is an " lower ” , “ under ” , “ above” , “ upper ” and the like , may be
example of a portion of the thing . used herein for ease of description to describe one element
The background provided in the Background section of 55 or feature's relationship to another element (s ) or feature ( s)
the present disclosure section is included only to set context, as illustrated in the figures. It will be understood that such
and the content of this section is not admitted to be prior art . spatially relative terms are intended to encompass different
Any of the components or any combination of the compo- orientations of the device in use or in operation, in addition
nents described ( e.g. , in any system diagrams included to the orientation depicted in the figures. For example, if the
herein ) may be used to perform one or more of the opera- 60 device in the figures is turned over, elements described as
tions of any flow chart included herein . Further, (i ) the “ below ” or “ beneath ” or “ under ” other elements or features
operations are example operations, and may involve various would then be oriented " above ” the other elements or
additional steps not explicitly covered , and ( ii ) the temporal features . Thus, the example terms “ below ” and “ under ” can
order of the operations may be varied . encompass both an orientation of above and below . The
The term “ processing circuit ” or “ controller means ” is 65 device may be otherwise oriented ( e.g. , rotated 90 degrees or
used herein to mean any combination of hardware , firmware , at other orientations) and the spatially relative descriptors
and software , employed to process data or digital signals. used herein should be interpreted accordingly. In addition, it
US 11,461,263 B2
27 28
In addition, it will also be understood that when a layer is referred to as being "between" two layers, it can be the only layer between the two layers, or one or more intervening layers may also be present.

The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the inventive concept. As used herein, the terms "substantially," "about," and similar terms are used as terms of approximation and not as terms of degree, and are intended to account for the inherent deviations in measured or calculated values that would be recognized by those of ordinary skill in the art. As used herein, the singular forms "a" and "an" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising", when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items. Expressions such as "at least one of," when preceding a list of elements, modify the entire list of elements and do not modify the individual elements of the list. Further, the use of "may" when describing embodiments of the inventive concept refers to "one or more embodiments of the present disclosure". Also, the term "exemplary" is intended to refer to an example or illustration. As used herein, the terms "use," "using," and "used" may be considered synonymous with the terms "utilize," "utilizing," and "utilized," respectively.

It will be understood that when an element or layer is referred to as being "on", "connected to", "coupled to", or "adjacent to" another element or layer, it may be directly on, connected to, coupled to, or adjacent to the other element or layer, or one or more intervening elements or layers may be present. In contrast, when an element or layer is referred to as being "directly on", "directly connected to", "directly coupled to", or "immediately adjacent to" another element or layer, there are no intervening elements or layers present.

Any numerical range recited herein is intended to include all sub-ranges of the same numerical precision subsumed within the recited range. For example, a range of "1.0 to 10.0" or "between 1.0 and 10.0" is intended to include all subranges between (and including) the recited minimum value of 1.0 and the recited maximum value of 10.0, that is, having a minimum value equal to or greater than 1.0 and a maximum value equal to or less than 10.0, such as, for example, 2.4 to 7.6. Any maximum numerical limitation recited herein is intended to include all lower numerical limitations subsumed therein, and any minimum numerical limitation recited in this specification is intended to include all higher numerical limitations subsumed therein.

Although exemplary embodiments of a system and method for managing memory resources have been specifically described and illustrated herein, many modifications and variations will be apparent to those skilled in the art. Accordingly, it is to be understood that a system and method for managing memory resources constructed according to principles of this disclosure may be embodied other than as specifically described herein. The invention is also defined in the following claims, and equivalents thereof.

What is claimed is:
1. A system, comprising:
    a first memory server, comprising:
        a cache-coherent switch, and
        a first memory module; and
    a second memory server; and
    a server-linking switch connected to the first memory server and to the second memory server,
    wherein:
        the first memory module is connected to the cache-coherent switch via a first interface, and
        the cache-coherent switch is connected to the server-linking switch via a second interface different from the first interface.
2. The system of claim 1, wherein the server-linking switch is configured to disable power to the first memory module.
3. The system of claim 2, wherein:
    the server-linking switch is configured to disable power to the first memory module by instructing the cache-coherent switch to disable power to the first memory module, and
    the cache-coherent switch is configured to disable power to the first memory module, upon being instructed, by the server-linking switch, to disable power to the first memory module.
4. The system of claim 1, wherein the cache-coherent switch is configured to perform deduplication within the first memory module.
5. The system of claim 1, wherein the cache-coherent switch is configured to compress data and to store compressed data in the first memory module.
6. The system of claim 1, wherein the server-linking switch is configured to query a status of the first memory server.
7. The system of claim 6, wherein the server-linking switch is configured to query a status of the first memory server through an Intelligent Platform Management Interface (IPMI).
8. The system of claim 7, wherein the querying of a status comprises querying a status selected from the group consisting of a power status, a network status, and an error check status.
9. The system of claim 1, wherein the server-linking switch is configured to batch cache requests directed to the first memory server.
10. The system of claim 1, further comprising a third memory server connected to the server-linking switch, wherein the server-linking switch is configured to maintain, between data stored on the first memory server and data stored on the third memory server, a consistency level selected from the group consisting of strict consistency, sequential consistency, causal consistency, and processor consistency.
11. The system of claim 1, wherein the cache-coherent switch is configured to:
    monitor a fullness of a first region of memory, and
    move data from the first region of memory to a second region of memory,
    wherein:
        the first region of memory is in volatile memory, and
        the second region of memory is in persistent memory.
12. The system of claim 1, wherein the server-linking switch comprises a Peripheral Component Interconnect Express (PCIe) switch.
13. The system of claim 1, wherein the server-linking switch comprises a Compute Express Link (CXL) switch.
14. The system of claim 13, wherein the server-linking switch comprises a top of rack (TOR) CXL switch.
15. The system of claim 1, wherein the server-linking switch is configured to transmit data from the second memory server to the first memory server, and to perform flow control on the data.
16. The system of claim 1, further comprising a third memory server connected to the server-linking switch, wherein:
    the server-linking switch is configured to:
        receive a first packet, from the second memory server,
        receive a second packet, from the third memory server, and
        transmit the first packet and the second packet to the first memory server.
17. A method for performing remote direct memory access in a computing system, the computing system comprising:
    a first memory server;
    a first server;
    a second server; and
    a server-linking switch connected to the first memory server, to the first server, and to the second server,
    the first memory server comprising:
        a cache-coherent switch, and
        a first memory module, wherein the first memory module is connected to the cache-coherent switch via a first interface, and the cache-coherent switch is connected to the server-linking switch via a second interface different from the first interface;
    the first server comprising:
        a stored-program processing circuit;
    the second server comprising:
        a stored-program processing circuit;
    the method comprising:
        receiving, by the server-linking switch, a first packet, from the first server;
        receiving, by the server-linking switch, a second packet, from the second server; and
        transmitting the first packet and the second packet to the first memory server.
18. The method of claim 17, further comprising:
    compressing data, by the cache-coherent switch, and
    storing the data in the first memory module.
19. The method of claim 17, further comprising:
    querying, by the server-linking switch, a status of the first memory server.
20. A system, comprising:
    a first memory server, comprising:
        a cache-coherent switch, and
        a first memory module; and
    a second memory server; and
    server-linking switching means connected to the first memory server and to the second memory server,
    wherein:
        the first memory module is connected to the cache-coherent switch via a first interface, and
        the cache-coherent switch is connected to the server-linking switching means via a second interface different from the first interface.
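Claim 11 recites a cache-coherent switch that monitors the fullness of a volatile region of memory and moves data into a persistent region. A minimal sketch of such a policy follows; the 90% high-water mark, the dictionary representation of the regions, and the eviction order are assumptions for illustration, not part of the claim.

```python
# Sketch of the claim 11 policy: when the volatile region's fullness
# crosses a high-water mark, move entries to the persistent region.
# The threshold, region representation, and eviction order are assumed.

HIGH_WATER_MARK = 0.9

def fullness(region: dict, capacity: int) -> float:
    """Fraction of the region's capacity currently occupied."""
    return len(region) / capacity

def migrate_if_full(volatile: dict, persistent: dict, capacity: int) -> None:
    """Move data from volatile memory to persistent memory while over the mark."""
    while fullness(volatile, capacity) > HIGH_WATER_MARK:
        key, value = next(iter(volatile.items()))  # arbitrary eviction choice
        persistent[key] = value                    # store in the persistent region
        del volatile[key]                          # free the volatile slot

# A full 10-slot volatile region spills entries into the persistent region.
volatile_region = {f"page{i}": bytes(64) for i in range(10)}
persistent_region: dict = {}
migrate_if_full(volatile_region, persistent_region, capacity=10)
```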

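The method of claim 17 has the server-linking switch receive one packet from each of two servers and transmit both to the first memory server. The sketch below uses queues as a software stand-in for switch ports; the function and port names are hypothetical.

```python
# Sketch of the claim 17 flow; queues stand in for switch ports.
from queue import Queue

def forward_pair(first_server: Queue, second_server: Queue, memory_server: Queue) -> None:
    first_packet = first_server.get()    # receiving, by the switch, from the first server
    second_packet = second_server.get()  # receiving, by the switch, from the second server
    memory_server.put(first_packet)      # transmitting both packets
    memory_server.put(second_packet)     # to the first memory server

port_a: Queue = Queue()
port_b: Queue = Queue()
port_mem: Queue = Queue()
port_a.put(b"RDMA write, addr 0x1000")
port_b.put(b"RDMA read, addr 0x2000")
forward_pair(port_a, port_b, port_mem)
```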
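Claims 2 and 3 describe power control delegated down the switching hierarchy: the server-linking switch does not act on the module directly but instructs the cache-coherent switch, which disables power to the module. A sketch of that delegation, again with hypothetical names; in practice this would be a management command on the fabric rather than a method call.

```python
# Sketch of the claims 2-3 delegation chain.

class Module:
    def __init__(self) -> None:
        self.powered = True

class CoherentSwitch:
    def __init__(self, module: Module) -> None:
        self.module = module

    def disable_module_power(self) -> None:
        self.module.powered = False  # acts only upon instruction

class LinkingSwitch:
    def __init__(self, coherent_switch: CoherentSwitch) -> None:
        self.coherent_switch = coherent_switch

    def disable_module_power(self) -> None:
        # Per claim 3, the linking switch instructs the cache-coherent
        # switch rather than touching the module itself.
        self.coherent_switch.disable_module_power()

module = Module()
LinkingSwitch(CoherentSwitch(module)).disable_module_power()
assert module.powered is False
```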