MDS Hardware Architecture
A 20-year journey of continuous innovation
Harsha Bharadwaj
Principal Engineer
San Jose, California, May 8-11, 2023
Agenda
• MDS HW Introduction
• MDS ASIC Architecture
• MDS Product Evolution
• Day in the life of a packet inside MDS
• 15 unique MDS ASIC features
• Wrap up
Cisco MDS 9700 / 9500 Directors

• Line Card (w/ front-panel multiprotocol ports)
• Supervisor (w/ Mgmt, Console, and USB ports + NX-OS + Arbitration ASIC)
• Power Supply (w/ grid redundancy)
MDS Modular HW Specification

                  M9706    M9710    M9718
Line Cards        4        8        16
Power Supplies    4        6        12
Airflow           Front-to-Back

*replacement mode only
Fixed Switch (aka Fabric Switch)

• Chassis (1 or 2 RU)
• Front: Mgmt Port, Console Port, USB Port, Switch Ports (built-in)
• Rear: Fans (modular)
FC-MAC

[Diagram: FC-MAC ingress/egress pipeline between the ingress and egress ports — framer/parser, timestamping (TS), frame check, crediting, buffering, encode/decode, and loopback, feeding the QUE and XBAR ARB stages]

T11 defined Standard Features:
• 4 FC Speeds
• Link Speed Auto Neg
• B2B crediting (Tx/Rx)
• Frame Encode/Decode
• FEC
• VSAN Classification
• BB_SC credit/frame loss recovery
• FC Encryption/Decryption

MDS additional Features:
• CRC check/drop at line rate
• Internal header for metadata
• Timestamping and timeout check
• Slow N_Port (NPIV)/VM detection
• FC traffic generator
• Loopback
• Credit pacing (IRL)
• Frame Stats
FC-MAC: Buffering

[Diagram: 64G ASIC ingress buffer partitioning — a 24K-buffer pool shared across ports and virtual lanes (VL0–VL7)]

Speed   Max B2B credits/port   Default B2B credits/port
16G*    500                    32
32G     500                    32
64G     1000                   100

*except 9148S
FWD: Forwarding

• FSPF: lowest-cost (highest-speed) port chosen among possible options
• Host->Target forwarding at MDS1 has 4 cases:

[Diagram: Host–MDS1–MDS2/MDS3–Target topologies with 1x64G and 2x64G ISL options]

• Ingress port buffers are organized on a Virtual Output Queue (VoQ) basis for every other egress port, so a frame waiting on one congested egress port cannot block frames destined to other egress ports (see the sketch below)
• Prevents Head-of-Line (HoL) blocking
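To make the VoQ idea concrete, here is a minimal software sketch — my illustration, not the ASIC's actual structures: each ingress port keeps one queue per egress port, so a frame waiting for a busy egress port never blocks frames bound elsewhere.

```python
from collections import deque

class IngressPort:
    """Toy model of per-ingress-port Virtual Output Queues (VoQs)."""

    def __init__(self, num_egress_ports: int):
        # One queue per egress port instead of a single shared FIFO:
        # this separation is what prevents Head-of-Line blocking.
        self.voqs = [deque() for _ in range(num_egress_ports)]

    def enqueue(self, frame: bytes, egress_port: int) -> None:
        self.voqs[egress_port].append(frame)

    def dequeue_for(self, egress_port: int):
        # Called once the arbiter grants a buffer at egress_port; draining
        # this VoQ is independent of every other egress port's state.
        q = self.voqs[egress_port]
        return q.popleft() if q else None
```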
Queuing: QoS/Scheduling

• Ingress:
  • Frames classified to a QoS level (CCos/QCos) using the incoming FC Header.Priority (CS_CTL), the EISL header user priority (EISL_UP/VH_UP), or the switch's Zone QoS/VL policy
  • VoQ (and Arbitration) is per-port, per-QoS level
• Egress:
  • Every QoS level can be scheduled with either Strict Priority (SP) or DWRR (3 levels: 50%, 30%, 20%) — see the scheduler sketch below

[Diagram: the ingress classifier (Isola/Thunderbird) derives CCos/QCos from the EISL or Vegas header + FC frame (CCos[1:0] = VH_UP mapping result; QCos[1:0] = CCos, the QCos = CCos mapping result for Vegas-1 packets), returns the GID with CCos to the central arbiter, and enqueues per (Port, QoS); at egress a QCos demux feeds H/M/L DWRR queues and an SP queue, with an optional CS_CTL rewrite and an optional VL mapping table on EISL ports]

VLs Evolution:
EISL port    VLs
16G, 32G     4
64G          8
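As a rough illustration of the egress scheduling above, here is a simplified software model — illustrative only; the queue names, byte-sized quanta, and one-frame-per-call loop are my simplifications, not the ASIC's behavior: one strict-priority queue that always wins, plus three DWRR queues weighted 50/30/20.

```python
from collections import deque

class EgressScheduler:
    """Toy SP + DWRR scheduler: 3 DWRR levels weighted 50/30/20."""

    def __init__(self):
        self.sp = deque()  # strict-priority queue, always served first
        self.dwrr = {"high": deque(), "med": deque(), "low": deque()}
        self.quantum = {"high": 50, "med": 30, "low": 20}  # credit per round
        self.deficit = {name: 0 for name in self.dwrr}

    def next_frame(self):
        if self.sp:
            return self.sp.popleft()          # SP traffic preempts DWRR
        for name, q in self.dwrr.items():
            if not q:
                self.deficit[name] = 0        # classic DWRR: reset when empty
                continue
            self.deficit[name] += self.quantum[name]
            if self.deficit[name] >= len(q[0]):  # frames modeled as bytes
                self.deficit[name] -= len(q[0])
                return q.popleft()
        return None                           # credit accrues for next round
```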
XBAR ARB: Central Arbitration

• Ensures a frame is sent from the ingress to the egress port only when the egress port has a buffer to accept it
• One Arbiter per switch (centralized ARB): a dedicated ASIC (on the SUP) in modular switches; integrated into the FC ASIC in fixed switches

0. Port-C informs the Arbiter it has (N) egress buffers; the Arbiter accounts for them
1. Frame arrives on Port-A destined to a D_ID behind Port-C
2. Frame is enqueued into the VoQ of Port-C
3. Request sent to the Arbiter for Port-C from Port-A
4. Arbiter checks buffer availability at Port-C and grants the request
5. Grant given to Port-A; Port-C accounting updates to (N-1) egress buffers
7. Frame transmitted out of Port-C
8. Port-C informs the Arbiter that it now has a free buffer
9. Arbiter accounts that Port-C again has (N) egress buffers
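The same accounting, sketched in software (the real arbiter is a dedicated ASIC; the class, method names, and dict are mine):

```python
class CentralArbiter:
    """Toy model of the centralized credit accounting described above."""

    def __init__(self):
        self.credits = {}          # egress port -> free egress buffers

    def register_port(self, port: str, buffers: int) -> None:
        # Step 0: the port advertises its egress buffer count.
        self.credits[port] = buffers

    def request(self, egress_port: str) -> bool:
        # Steps 3-5: grant only if the egress port has a free buffer.
        if self.credits.get(egress_port, 0) > 0:
            self.credits[egress_port] -= 1
            return True
        return False               # frame stays in its VoQ; nothing is dropped

    def buffer_freed(self, egress_port: str) -> None:
        # Steps 8-9: the egress port returns the buffer after transmit.
        self.credits[egress_port] += 1
```

Because a grant is only issued against a known free buffer, a frame never leaves its VoQ without guaranteed landing space at the egress port — which is why frames are never dropped inside the switch.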
Crossbar Switch

• Performs the port-to-port switching function
• Establishes a temporary connection between input and output port for the duration of the frame transmit
• Frames are transmitted once the connection is made and the path is available

[Diagram: control and scheduling logic driving the ingress crossbar; fabric ASIC generations scale from 256G to 1.5T]
[Diagram: modular chassis software/hardware split — the Supervisor runs NX-OS (kickstart/kernel plus control plane and drivers) on its CPU/NPU and hosts the ARB ASICs; Fabric Modules carry the XBAR (Fabric) ASICs; Linecards carry the switching ASICs (data plane) with NX-OS drivers; all are connected over the EOBC]
No. of Fabric Modules   FC front-panel BW per slot* (FAB-1)   FC front-panel BW per slot* (FAB-3)
1                       256 Gbps                              512 Gbps
2                       512 Gbps                              1024 Gbps

*BW per slot = sum of the operating speeds of all ports in the Linecard
[Diagram: fixed-switch internals — each fabric switch integrates XBAR and ARB into its FC ASICs: 9148S/9396S use 16G ASICs (8x16G) with internal ARB-16/XBAR-16 and packet memory; 9148T/9396T use 32G ASICs (16x32G) with internal ARB-32/XBAR-32/64; 9148V/9396V use single-die 64G ASICs (24x64G) with internal XBAR/ARB (9396V adds LEM1–LEM3). On directors, FAB-1 modules carry the XBAR-16 ASIC and FAB-3 modules carry the XBAR-32/64 ASIC, x6 per chassis]
SW Analytics (32G Linecard)

[Diagram: ingress and egress ACL taps on the 32G ASICs copy FC & SCSI/NVMe headers (+ zone info) to the NPU 1.0; the SW Analytics Engine encodes them into a DB (temporary on the NPU, pushed/pulled to a persistent aggregate DB on the SUP) and exports streaming telemetry via the Mgmt Port]
Day in the life of a packet inside MDS (32G modular example)

[Diagram: LC 1 (ingress port fc1/20, three 32G ASICs/SOCs with an Analytics Engine and ACL/FIB TCAMs) → six Fabric Modules (Fabric ASICs) → LC 2 (egress port fc2/25); R_RDY credits are returned to the sender as ingress buffers free]

1. Rx packet from the wire on fc1/20
2. CRC check; internal header added with timestamp and VSAN (PHY/MAC)
3. Ingress packet parsing
4. Packet headers sent to the forwarding engine
5. Copy of the header to Analytics
6. Zoning lookups (ACL TCAM), FIB lookups, load-balancing decision
7. Final lookup result: destination port + priority
8. Packet stored in the ingress buffer based on incoming port and priority/VL
9. Packet descriptor queued in the VoQ (destination port + priority)
10. Buffer credit requested from the Arbiter for the destination port + priority
12. Transmit to fabric (super-framing)
14. Receive from fabric
15. Copy of the header to Analytics
16. Buffered on egress based on destination port + priority; QoS scheduler (SP/DWRR)
17. Scheduled for transmission
18. Internal header removed; timestamp check; Tx on fc2/25
Agenda
• MDS HW Introduction
• MDS ASIC Architecture
• MDS Product Evolution
• Day in the life of a packet inside MDS
• 15 unique MDS ASIC features
• Wrap up
#1: Store-and-Forward switching

• Slightly higher per-frame switching latency hardly matters compared to overall I/O completion times
• Storage access times = 100+ µs (even on NVMe flash storage); per-frame switching latency ≈ 2 µs, i.e. under 2% of the I/O time
• Ensures data integrity

[Diagram: a standard FC frame (up to 2148 bytes) fully received at the PHY before being switched]
#2: Consistent switching Latency

• In all architectures, the latency between any two ports of the switch is the same (same or different LC/ASIC)

[Diagram: identical latency whether traffic stays within one ASIC, crosses ASIC1↔ASIC2 on one LC, or crosses LC1↔LC2 via ARB/XBAR]
[Diagram: exchange-based load balancing — the Host (FCID 10.1.1) issues an I/O Req with OXID=100; both switch ends hash the same exchange to the same link: hash(10.1.2/20.1.1/200) → 1 (PortA) on one side and hash(20.1.1/10.1.2/200) → 1 (PortC) on the other, so the I/O Req and I/O Resp of an exchange stay on one path]

[Diagram: superframing — frames F1–F6 (CMD (Read), DATA, ELS (RSCN)) are aggregated into superframes across multiple paths in three stages (Stage1–Stage3) and delivered in order]
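The in-order behavior in the diagram follows from hashing on the exchange: every frame of one exchange carries the same (S_ID, D_ID, OX_ID), so it always lands on the same link. A minimal sketch, assuming a simple XOR fold (the ASIC's actual hash function is not public):

```python
def pick_link(s_id: int, d_id: int, ox_id: int, num_links: int) -> int:
    """Exchange-based load balancing: same inputs -> same link, always."""
    return (s_id ^ d_id ^ ox_id) % num_links

# domain.area.port -> 24-bit FCID (helper for readability)
fcid = lambda d, a, p: (d << 16) | (a << 8) | p

# All frames of exchange OX_ID=200 between FCIDs 10.1.2 and 20.1.1 take
# one link (this toy hash happens to pick link 1, as in the figure); the
# XOR is symmetric, so the response direction picks the same link too.
link = pick_link(fcid(10, 1, 2), fcid(20, 1, 1), 200, num_links=2)
```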
#7: Ensuring Data Integrity
Data Integrity is paramount in Storage
• Multistage CRC error checking
  • CRC check on the incoming frame (ingress MAC)
  • ASIC-to-crossbar superframe (inside the XBAR module)
  • Crossbar-to-ASIC superframe (egress path)
• Error correction
  • FEC on every incoming frame
  • FEC inside the crossbar
  • ECC-protected packet memories
• Collectively, these ensure the integrity of data to/from the storage media as it transits the MDS (see the CRC sketch below)
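FC frames use the same CRC-32 polynomial as Ethernet, so a software stand-in for each checking stage can lean on zlib. A simplified sketch (real frames carry the CRC before the EOF delimiter, and the hardware checks at line rate):

```python
import zlib

def make_frame(payload: bytes) -> bytes:
    """Append a CRC-32 over the payload (simplified frame layout)."""
    return payload + (zlib.crc32(payload) & 0xFFFFFFFF).to_bytes(4, "big")

def crc_ok(frame: bytes) -> bool:
    """Verify a frame whose last 4 bytes are its CRC-32."""
    payload, received = frame[:-4], int.from_bytes(frame[-4:], "big")
    return (zlib.crc32(payload) & 0xFFFFFFFF) == received

frame = make_frame(b"some FC payload")
assert crc_ok(frame)                        # intact frame passes
assert not crc_ok(b"X" + frame[1:])         # corrupted payload is caught
```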
#8: VSAN
• Logically partitions the physical SAN into multiple virtual SANs
• Classified at the ingress MAC and carried via the internal header from ingress to egress port
• Carried across ISL if the other end is a Cisco MDS (Trunking E_Port)
• Traffic Isolation, Scalability, Redundancy
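A toy model of port-based VSAN classification and isolation — the port names and VSAN numbers are examples; the real classification happens in the ingress MAC and the tag travels in the internal (or EISL) header:

```python
PORT_VSAN = {"fc1/20": 10, "fc2/25": 10, "fc1/1": 20}   # example config

def classify(ingress_port: str) -> int:
    # Ingress MAC: tag the frame with its port's VSAN; the tag is carried
    # in the internal header from ingress to egress.
    return PORT_VSAN[ingress_port]

def may_forward(ingress_port: str, egress_port: str) -> bool:
    # Isolation: traffic never crosses from one VSAN into another.
    return PORT_VSAN[ingress_port] == PORT_VSAN[egress_port]

assert may_forward("fc1/20", "fc2/25")       # both in VSAN 10
assert not may_forward("fc1/20", "fc1/1")    # VSAN 10 vs VSAN 20
```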
Virtual Links

[Diagram: ISL crediting — with plain R_RDY, the Port Tx/Rx buffers share a single credit lane; with Extended R_RDY (ER_RDY), credits are returned per Virtual Link against per-VL buffers, and a queuing reorder/scheduler serves the VLs]
Table 30 - Frame_Header

Word   Bits 31..24         23..16     15..08    07..00
0      R_CTL               D_ID
1      CS_CTL/Priority     S_ID
2      TYPE                F_CTL
3      SEQ_ID              DF_CTL     SEQ_CNT
4      OX_ID                          RX_ID
5      Parameter

The Frame_Header shall immediately follow the SOF delimiter, if no Extended_Headers are present, or shall follow the last Extended_Header present, and shall be transmitted on a word boundary. The Frame_Header is used to control link operations and device protocol transfers, as well as to detect missing or out-of-order frames.
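Since the forwarding and analytics discussion keeps referring to these fields (D_ID, S_ID, CS_CTL, OX_ID), here is a small parser for the 24-byte header laid out in Table 30 — field slicing per FC-FS; the function itself is just an illustration:

```python
import struct

def parse_frame_header(hdr: bytes) -> dict:
    """Unpack the 24-byte FC Frame_Header (six big-endian 32-bit words)."""
    w = struct.unpack(">6I", hdr[:24])
    return {
        "R_CTL":   w[0] >> 24, "D_ID":   w[0] & 0xFFFFFF,
        "CS_CTL":  w[1] >> 24, "S_ID":   w[1] & 0xFFFFFF,
        "TYPE":    w[2] >> 24, "F_CTL":  w[2] & 0xFFFFFF,
        "SEQ_ID":  w[3] >> 24, "DF_CTL": (w[3] >> 16) & 0xFF,
        "SEQ_CNT": w[3] & 0xFFFF,
        "OX_ID":   w[4] >> 16, "RX_ID":  w[4] & 0xFFFF,
        "Parameter": w[5],
    }

# Example header: D_ID 10.1.1, TYPE 0x08 (FCP), OX_ID 100
hdr = bytes.fromhex("060a0101" "80140102" "08290000"
                    "01000000" "00640001" "00000000")
assert parse_frame_header(hdr)["OX_ID"] == 100
```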
#11: Credit Pacing (aka Ingress Rate Limiting)
• Paces the rate of B2B credit return to the link partner, which dictates the incoming frame rate in the other direction
• DIRL (Dynamic Ingress Rate Limiting) builds dynamism on top of credit pacing, throttling the incoming rate on the port up and down
• Pushes oversubscription/credit-stall congestion back to the misbehaving devices (see the pacing sketch after the diagram below)
[Diagram: three phases — Normal case (no IRL): the switch port returns R_RDY as soon as its Rx buffer is free. Switch Tx congestion: credit return toward the host is programmed to a trickle using IRL (independent of switch Rx buffer availability), so the Tx rate from the host slows down. Congestion subsides: credit return resumes gradually, and the Tx rate from the host slowly picks up.]
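A software caricature of static credit pacing — the hardware paces R_RDY primitives per port; the `send_r_rdy` callback, the rate parameter, and the sleep-based pacing are all illustrative:

```python
import time

class CreditPacer:
    """Toy IRL: delay each credit return to cap the incoming frame rate."""

    def __init__(self, rate_frames_per_sec: float):
        self.interval = 1.0 / rate_frames_per_sec
        self.next_ok = 0.0

    def return_credit(self, send_r_rdy) -> None:
        now = time.monotonic()
        if now < self.next_ok:
            time.sleep(self.next_ok - now)   # pace the credit, not the frame
        self.next_ok = max(now, self.next_ok) + self.interval
        send_r_rdy()

# DIRL (not shown) would adjust rate_frames_per_sec down as congestion
# appears and back up as it subsides.
```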
#12: Onchip, Realtime, Programmable Analytics

• 64G ASIC can compute 70+ I/O metrics on chip
• Full visibility into I/O metrics on all ports at line rate
• DATA IUs are not inspected (the DATA could also be encrypted)
• No impact to switching latency
• NPU is also connected to the ASIC data path
• ACLs can be programmed (ingress/egress) to copy headers of frames of interest out to the NPU
• The software Analytics Engine in the NPU can compute custom metrics not computed on chip
[Diagram: SCSI or NVMe CMD/XRDY/RSP frames are tapped to the HW Analytics Engine (on-chip) in the FWD path; any frame matching an ACL can be tapped to the SW Analytics Engine (NPU) between the ingress and egress MACs]
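One example of a metric derivable from command/response headers alone, with no DATA inspection: Exchange Completion Time. A sketch of the bookkeeping (the class and field names are mine; the on-chip engine does this per flow at line rate, using timestamps from the internal header):

```python
class EctTracker:
    """Toy Exchange Completion Time tracker keyed by (flow, OX_ID)."""

    def __init__(self):
        self.open_cmds = {}                  # (flow, ox_id) -> CMD timestamp

    def on_cmd(self, flow, ox_id, ts):
        # CMD frame header seen: remember when the exchange started.
        self.open_cmds[(flow, ox_id)] = ts

    def on_rsp(self, flow, ox_id, ts):
        # RSP frame header seen: ECT = response time minus command time.
        start = self.open_cmds.pop((flow, ox_id), None)
        return None if start is None else ts - start

tracker = EctTracker()
tracker.on_cmd(("host", "target", 0), ox_id=100, ts=0.000)
assert tracker.on_rsp(("host", "target", 0), ox_id=100, ts=0.0005) == 0.0005
```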
#13: Congestion Management

Detection:
• TxWait, RxWait, TBBZ, RBBZ per port (credit stall)
• Egress buffer occupancy / outstanding credits per port (oversubscription)
• HW Creditmon

Notification:
• Congestion Signals (FPIN is software)

Avoidance/Mitigation:
• Non-oversubscribed switching fabric
• No-credit drop
• Timeout drop
• Congestion Isolation w/ VLs
• VLs with HBAs
• Credit Pacing

Combination of T11-standard and MDS-unique solutions for congestion management
#14: Non-disruptive HW upgrade

• SUP1, SUP4, FAB1, and FAB3 are compatible with all the Linecards of the MDS system
• The SUP4 ARB ASIC is backward compatible with the SUP1 ARB ASIC
• The FAB3 XBAR ASIC is backward compatible with the FAB1 XBAR ASIC
• Mix-and-match is allowed only during the migration procedure

[Diagram: ISSU flow — 1. pre-checks; the standby SUP is upgraded from x.x.x to y.y.y and becomes active; 5. Linecards are upgraded in batches]

ISSU impact:
Switch type   Control plane   Data plane
Modular       ~Zero           Zero
Fixed         ~2 mins         Zero
Agenda
• MDS HW Introduction
• MDS ASIC Architecture
• MDS Product Evolution
• Day in the life of a packet inside MDS
• 15 unique MDS ASIC features
• Wrap up
• Non-blocking architecture inside the switch for efficient use of switch resources
• Central arbitration to ensure frames are never dropped inside the switch
• Source fairness to ensure all ports get a proportionate share of switch resources
• Consistent latency to ensure predictable application I/O performance
• In-order frame delivery, even during fabric changes, so end devices don't unnecessarily spend cycles reassembling the frames of an I/O
• On-chip, real-time analytics for unprecedented visibility into all I/O transiting the fabric