P51a 05and06
Lecture 5/6
A huge thank you to Eben Upton, Raspberry Pi Foundation, and PiHut people for enabling this incarnation
of the module at incredibly short notice.
With great appreciation to Dick Sites for sharing wisdom, patience, and teaching materials.
• Cost
• Manufacturing limitations (e.g. maximum silicon size)
• Power consumption
• General purpose or user specific?
• I/O on the package
• Number of ports:
• Front panel size (24, 32, or 48 ports in a 19-inch rack)
• MAC area
Packet Rate as a Performance Metric
• Bandwidth is misleading
• For example: full line rate for 1024B packets
but not for 64B packets…
• Packet Rate: how many packets can be processed every second?
• Unit: packets per second (PPS)
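To make the point about bandwidth versus packet rate concrete, here is a minimal sketch that converts a line rate into a maximum packet rate. It assumes Ethernet framing with 20B of per-frame overhead on the wire (preamble, start-of-frame delimiter, inter-frame gap); the function name and the 10 Gb/s example are illustrative and not taken from the slides.

```python
# Per-frame overhead on the wire for Ethernet (assumption for this sketch):
# 7B preamble + 1B start-of-frame delimiter + 12B inter-frame gap.
ETH_OVERHEAD_BYTES = 20


def packets_per_second(line_rate_bps: float, frame_bytes: int) -> float:
    """Maximum packet rate at full line rate for a given frame size."""
    bits_on_wire = (frame_bytes + ETH_OVERHEAD_BYTES) * 8
    return line_rate_bps / bits_on_wire


if __name__ == "__main__":
    # Same bandwidth, very different packet rates.
    for size in (64, 1024):
        pps = packets_per_second(10e9, size)
        print(f"{size:5d}B frames at 10 Gb/s: {pps / 1e6:6.2f} Mpps")
```

A device that keeps up with the 1024B packet rate may therefore fall well short of line rate for 64B packets, where the per-packet work arrives roughly twelve times as often.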
[Diagram: four parallel header processing (HP) blocks]
Network Interfaces
[Diagram: four parallel paths, each NIF → HP → NIF]
Switching
[Diagram: four parallel paths, each NIF → HP → NIF]
Output Queues
[Diagram: four parallel paths, each NIF → HP → OQ → NIF]
Scheduling
[Diagram: four parallel paths, each NIF → HP → OQ → NIF, with a scheduler (SCH)]
Is This A Real Switch?
[Diagram: four parallel paths, each NIF → HP → OQ → NIF, with a scheduler (SCH)]
Recall What Drives Real World Switches
• Cost
• Power
• Area
Sharing Resources Is Good!
[Diagram: four parallel paths, each NIF → OQ → NIF]
Rethinking The Switch Architecture
[Diagram: four NIF → OQ → NIF paths sharing a single HP and a single SCH]
Where Is The Switching?
[Diagram: four NIF → OQ → NIF paths sharing a single HP and a single SCH]
Output Queueing
[Diagram: several output queues (OQ) feeding a "Schedule & Rate limit" block in front of the NIF]
Input Queueing
[Diagram: four NIF → IQ → OQ → NIF paths sharing a single HP and a single SCH]
Virtual Output Queueing
[Diagram: four IQ → OQ → NIF paths sharing a single HP and a single SCH]
Virtual Output Queueing
[Diagram: four VOQ → OQ → NIF paths sharing a single HP and a single SCH]
Deep Buffers
[Diagram: queues (Q), queues manager, external memory controller, external memory PHY, external memory]
Scheduling
[Diagram: multiple schedulers (SCH); labels: SP, Pn+RL, BE]
SCH (Priority)
[Diagram: four NIF → OQ → NIF paths sharing a single HP and a single SCH]
High Throughput Switching
• High throughput metrics
• Types of high performance switches
• cut through vs. store and forward
• ToR vs. Core switch
• General purpose vs. proprietary ASIC
• High throughput switch architectures (including silicon vs system vs
network)
Bandwidth, Throughput and Goodput
But bandwidth can be limited below the link's capacity and vary over
time, throughput can be measured differently from bandwidth,
etc.
Speed and Bandwidth
Circuit Switches
• In a circuit switch:
[Diagram: packets (PKT) entering NIF → IQ → OQ → NIF paths, with a scheduler and PP]
Store and Forward
[Diagram: NIF → IQ → OQ → NIF paths, with a scheduler and PP]
Cut Through
[Diagram: a packet (PKT) split into parts 1, 2, 3 entering NIF → IQ → OQ → NIF paths, with a scheduler and PP]
Measuring Performance
• Bandwidth: number of bits (or bytes) through the channel every unit of
time
• One way to calculate: bus width × clock frequency
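The bus width × clock frequency rule can be sketched in a few lines, together with its inverse: the bus width needed to sustain a target rate at a given clock. The function names and example numbers below are assumptions chosen for illustration, not values prescribed by the lecture.

```python
def datapath_bandwidth_bps(bus_width_bytes: int, clock_hz: float) -> float:
    """Peak bandwidth of a data path: bits moved per cycle times cycles per second."""
    return bus_width_bytes * 8 * clock_hz


def min_bus_width_bytes(target_rate_bps: float, clock_hz: float) -> float:
    """Smallest bus width (in bytes) that can carry target_rate_bps at clock_hz."""
    return target_rate_bps / (8 * clock_hz)


if __name__ == "__main__":
    # A 256B-wide bus clocked at 1 GHz.
    print(datapath_bandwidth_bps(256, 1e9) / 1e12, "Tb/s")
    # Width needed to carry 12.8 Tb/s through a single 1 GHz data path.
    print(min_bus_width_bytes(12.8e12, 1e9), "bytes")
```

This is a peak figure: it is only reached when every clock cycle carries a full bus word.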
Measuring Performance
[Figure: a 512B packet crossing a data path of width e.g. 256B in two clock cycles, 256B per cycle]
The Truth About Switch Silicon Design
[Figure: a 257B packet crossing a data path of width e.g. 256B in two clock cycles: 256B in cycle 1, 1B in cycle 2]
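The 257B example can be quantified with a short sketch. It assumes that every packet starts on a fresh bus word, so the unused bytes of a packet's last cycle are wasted; real designs may pack data differently, and the function names are illustrative.

```python
import math


def cycles_for_packet(packet_bytes: int, bus_width_bytes: int) -> int:
    """Clock cycles needed to move one packet across the data path."""
    return math.ceil(packet_bytes / bus_width_bytes)


def datapath_utilisation(packet_bytes: int, bus_width_bytes: int) -> float:
    """Fraction of the bus capacity that carries packet data."""
    cycles = cycles_for_packet(packet_bytes, bus_width_bytes)
    return packet_bytes / (cycles * bus_width_bytes)


if __name__ == "__main__":
    for size in (256, 257, 512):
        print(size, cycles_for_packet(size, 256),
              f"{datapath_utilisation(size, 256):.1%}")
```

Under this assumption a single extra byte roughly halves the useful throughput of the data path, which is why performance is usually checked at the worst-case packet size rather than at a convenient one.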
Low Latency Switches
How to lower the latency of a switch?
[Diagram: NIF → IQ → OQ → NIF paths, with PP]
Cut Through Switching
[Diagram: Node 1, cut-through switch, Node 2]
What is a cut-through switch?
• Observation: the data bus width can be no less than the incoming data rate
divided by the feasible clock rate
Network Interfaces
• A likely flow:
• Possible implementations:
• The entire packet goes through the header processing unit
• Just the header goes through the header processing unit
• “Better” depends on your performance profile (what are the
bottlenecks? Resource limitations?)
Packet Processing
• A likely flow:
• Challenges:
• A field may arrive over multiple clock cycles (e.g. 32b field, 16b on
clock 2 and 16b on clock 3)
• Memory access taking more than 1 clock cycle
• E.g. request on clock 1, reply on clock 3
• Some memories allow multiple concurrent accesses, some don’t
• The bigger the memory, the more time it takes
Packet Processing
• A likely flow:
• Solutions:
• Pipelining!
Don’t stall, add NOP stages in your pipe.
• Reorder operations (where possible)
• E.g. Lookup 1 → Action 1 → Lookup 2 → Action 2 turns into:
Lookup 1 → Lookup 2 → Action 1 → Action 2
• Don’t create hazards!
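The benefit of reordering lookups ahead of actions can be illustrated with a toy latency model. It assumes a pipelined memory that accepts one new lookup per cycle, a single action unit, and no dependency of Lookup 2 on Action 1 (otherwise the reordering would create a hazard). The cycle counts and function names are assumptions for illustration.

```python
def sequential_latency(lookup_cycles: int, action_cycles: int = 1) -> int:
    """Lookup 1, Action 1, Lookup 2, Action 2, each waiting for the previous step."""
    return 2 * (lookup_cycles + action_cycles)


def reordered_latency(lookup_cycles: int, action_cycles: int = 1) -> int:
    """Lookup 1 and Lookup 2 issued back to back, then Action 1 and Action 2."""
    lookup1_done = lookup_cycles          # issued at cycle 0
    lookup2_done = 1 + lookup_cycles      # issued at cycle 1 (pipelined memory)
    action1_done = lookup1_done + action_cycles
    # Action 2 needs its lookup result and a free action unit.
    action2_start = max(lookup2_done, action1_done)
    return action2_start + action_cycles


if __name__ == "__main__":
    # Memory reply arrives two cycles after the request.
    print(sequential_latency(2), reordered_latency(2))
```

The saving grows with memory latency, since the second lookup's wait is hidden behind the first.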
Arbitration
• (approximately) same arrival time
• Arbiter uses Round Robin
[Diagram: NIF, IQ, Arbiter, PP]
• Yes: packets need to wait for previous packets to be handled before being
admitted.
Worst-case waiting with N inputs is (N-1) × packet time
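The worst-case bound can be checked with a small round-robin model. It assumes every input presents one equal-length packet at the same instant and the arbiter grants exactly one input per packet time; the function name and the `start` parameter (the input that holds priority first) are hypothetical.

```python
def round_robin_waits(num_inputs: int, start: int = 0) -> list:
    """Waiting time, in packet times, of each input under round-robin arbitration."""
    waits = [0] * num_inputs
    for slot in range(num_inputs):
        granted = (start + slot) % num_inputs  # next input in round-robin order
        waits[granted] = slot                  # it waited `slot` packet times
    return waits


if __name__ == "__main__":
    waits = round_robin_waits(4, start=2)
    print(waits, "worst case:", max(waits))
```

Whichever input is served last has waited for the other N-1 packets, regardless of where the round-robin pointer starts.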
Arbitration
[Diagram: FIFO between two clock domains, with signals Write Clk, Write Ptr, Read Clk, Read Ptr, Clk Out]
• The flow of the data through the device (the network) needs to be
regulated
• Different events may lead to stopping the data:
• An indication from the destination to stop
• Congestion (e.g. 2 ports sending to 1 port)
• Crossing clock domains
• Rate control
• …
Back pressure
Flow Control
[Diagram: NIF → IQ → OQ → NIF paths, with a scheduler and PP]
Flow Control and Buffering
[Figure: timeline showing the gap between the stop being triggered and the data actually stopping]
• Need to either:
• Assert back pressure sufficient time before traffic needs to stop
OR
• Provide sufficient buffering
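The trade-off between early back pressure and buffering can be sketched as follows. The model assumes data keeps arriving at a constant rate for a fixed reaction time after back pressure is asserted; the function name and the numbers are illustrative only.

```python
def buffer_bytes_needed(arrival_rate_bps: float, reaction_time_s: float) -> float:
    """Bytes that still arrive between asserting back pressure and traffic stopping."""
    return arrival_rate_bps * reaction_time_s / 8


if __name__ == "__main__":
    # A 100 Gb/s input that takes 1 microsecond to react to back pressure.
    print(buffer_bytes_needed(100e9, 1e-6), "bytes")
```

Either this much buffer space must still be free when back pressure is asserted, or back pressure must be asserted this much earlier.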
Flow Control and Buffering
• Performance measures
– Latency (response time)
– Throughput (bandwidth)
– Desktops & embedded systems
• Mainly interested in response time & diversity of devices
– Servers
• Mainly interested in throughput & expandability of devices
• Reliability
– Particularly for storage devices (fault avoidance, fault tolerance, fault
forecasting)
I/O Management and strategies
• Polling
• Interrupt Driven
• Direct Memory Access
The I/O Access Problem
• Data is written and read directly to/from the last level cache
(LLC)
PCIe introduction
Provides:
• Processor independence &
buffered isolation
• Bus mastering
• Messages. Handled like posted writes. Used for event signaling and
general purpose messaging.
PCIe architecture
Interrupt Model
3. INTx Emulation
The four physical interrupt signals INTA-INTD are emulated as messages sent upstream,
ultimately routed to the system interrupt controller
[Diagram: host system and PCI endpoint; blocks labelled Direct Memory Access and AXI Interconnect]
NetFPGA Reference Projects
[Diagram: 4 × 10GE ports; blocks labelled Input Arbiter, Output Port Lookup, Output Queues]
Processing Overheads
[Diagram: software stacks compared (TCP/IP/ETH, buffers, kernel, OS packet I/O library), each above a device driver]
• Usage examples:
• Send and receive packets within a minimum number of CPU cycles
• E.g. less than 80 cycles
• Fast packet capture algorithms
• Running third-party stacks
• Some projects demonstrated hundreds of millions of packets per second
• But with limited functionality
• E.g. as a software switch / router
High Throughput Switches
The Truth About Switch Silicon Design
12.8Tbps Switches!
• 5.8 Gpps @ 256B
• 19.2 Gpps @ 64B
• But clock rate is only ~1GHz…
[Figure: required parallelism (0 to 20) vs. packet size [B] (50 to 1450)]
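The required-parallelism curve can be approximated with a short sketch. It assumes 20B of Ethernet per-frame overhead on the wire and a pipeline that processes at most one packet per clock cycle; both are modelling assumptions, and the function names are illustrative.

```python
ETH_OVERHEAD_BYTES = 20  # preamble + start-of-frame delimiter + inter-frame gap


def packet_rate_pps(line_rate_bps: float, frame_bytes: int) -> float:
    """Packet rate at full line rate for a given frame size."""
    return line_rate_bps / ((frame_bytes + ETH_OVERHEAD_BYTES) * 8)


def required_parallelism(line_rate_bps: float, frame_bytes: int,
                         clock_hz: float) -> float:
    """Number of one-packet-per-cycle pipelines needed to keep up with line rate."""
    return packet_rate_pps(line_rate_bps, frame_bytes) / clock_hz


if __name__ == "__main__":
    for size in (64, 256, 1500):
        p = required_parallelism(12.8e12, size, 1e9)
        print(f"{size:5d}B: {p:5.2f} packets per cycle")
```

With these assumptions the 64B case comes out slightly above 19 packets per cycle; the 19.2 Gpps figure on the slide is consistent with rounding each 100GE port to 150 Mpps (128 × 150 Mpps).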
Multi-Core Switch Design
Barefoot Tofino
Broadcom Tomahawk 3
– Shared tables:
– need to allow access from multiple pipelines
– need to support query rate at packet rate
– Separate tables:
– wastes resources
– need to maintain consistency
– Not everyone agrees with this assumption
Multi Core Switch Design
[Diagram: two parallel pipelines, each NIF → PP → OQ → NIF]
Inferring Switch Architecture
refer to:
http://www.mellanox.com/tolly/
All interpretations in the following slides are a guess, and
not based on internal information
What is wrong with Broadcom Tomahawk?
Broadcom Tomahawk
• 32 x 100GE
• In packet rate: 32 x 150Mpps = 4800 Mpps
• Manufacturing process: 28nm
• Therefore clock frequency likely <1GHz
• More than 7 billion transistors
• For reference: Intel debuted the 18-core Xeon E5-2600 v3,
with 5.57 billion transistors, around the same time
• On chip memories
• Advantage: fast access time
• Disadvantage: limited size (10’s of MB)
• Off chip memory:
• Advantage: large size (up to many GB)
• Disadvantage: access time, cost, area, power
• New technologies
• Offer mid-way solutions
Example: QDR-IV SRAM
[Figure: speed up (1 to 1.4) vs. packet size [B] (60 to 1460); e.g., 128B, 256B, …]
Example: PCI Express Gen 3, x8