NVMe-oF: An Advanced Introduction

2018 IBM Systems Technical University

Please complete the session survey!

© Copyright IBM Corporation 2018. Technical University/Symposia materials may not be reproduced in whole or in part without the prior written permission of IBM.
IBM Systems Technical Events | ibm.com/training/events

What is NVMe?

• NVMe (Non-Volatile Memory Express) is an open logical device interface specification
• Created for accessing non-volatile storage media over a PCI Express (PCIe) bus inside a server chassis
• Most commonly associated with flash memory cards and flash systems
• Reduces I/O stack overhead and can bring improved performance
• A precursor of NVMe was made public at the Intel Developer Forum in 2007 as NVMHCI (Non-Volatile Memory Host Controller Interface)
• Technical work on NVMe started in 2H 2009, with contributions from more than 90 companies


What is NVMe? – Specification Milestones

• Standard specification milestones for NVMe:
   v1.0 – March 2011
   v1.1 – October 2012 (multi-path I/O, namespace sharing and scatter-gather I/O)
   v1.2.1 – June 2016
   v1.3 – May 2017
   v1.3c – May 2018
• First NVMe products were announced in July 2013; the first NVMe chipsets followed in August 2014
• NVMe-oF v1.0 – May 2016; NVMe-oF v1.0a – July 2018
• FC-NVMe v1.0 – August 2017

NOTE: NVMe-oF defines how the protocol works over various networking transports; FC-NVMe specifies how NVMe-oF is implemented on Fibre Channel fabrics.


IBM Delivers NVMe & NVMe-oF Solutions – The Evolution of IBM Storage & SDI

• 2011 – NVM Express specification 1.0 published by industry leaders on March 1
• 2012 – NVMe specification 1.1 published on October 11: multi-path I/O (basic), namespace sharing and scatter-gather I/O
• 2013 – Nothing happened
• 2014 – The NVM Express Work Group was incorporated as NVM Express, Inc., the consortium responsible for the development of the specification; work on the NVM Express over Fabrics (NVMe-oF) specification kicked off; NVM Express specification 1.2 released on November 3
• 2015 – NVM Express Management Interface (NVMe-MI) specification officially released, providing out-of-band management for NVMe components and systems and a common baseline management feature set across all NVMe devices and systems
• 2016 – NVMe over Fabrics (NVMe-oF) specification published, extending NVMe onto fabrics such as Ethernet, Fibre Channel and InfiniBand and providing access to individual devices and storage systems; first partial NVMe solutions from competitors
• 2017 – NVMe specification 1.3 published, addressing the needs of mobile devices (low power and other technical features) and making NVMe the only storage interface available for all platforms, from mobile devices through data center storage systems

http://nvmexpress.org/resources/specifications/

Why Use NVMe?

• NVMe is an alternative to SCSI (Small Computer System Interface)
• SCSI became a standard in 1986 for connecting and transferring data between hosts and storage devices (HDD and tape)
• SCSI-based commands work well with disk storage, but performance does not improve as much when working with flash systems
• How? (see the sketch after this list)
   Simplified I/O stack (particularly on the host side)
   Parallel requests are easy with enhanced queuing capabilities
   NVMe provides large numbers of queues (up to 64,000) and supports massive queue depth (up to 64,000 commands)
   I/O locking is not required
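
A minimal sketch, in C, of why those queue counts matter (illustrative, not from the session materials): each CPU core can own its own submission/completion queue pair, so submissions proceed in parallel with no shared lock. All names here are hypothetical.

#include <stdint.h>
#include <stdlib.h>

/* One NVMe queue pair: a submission queue (SQ) and its completion queue (CQ). */
struct nvme_queue_pair {
    uint16_t qid;       /* queue identifier; qid 0 is reserved for the Admin Queue */
    uint16_t depth;     /* entries per queue, up to 64K */
    uint32_t sq_tail;   /* host-side tail index for the SQ */
    uint32_t cq_head;   /* host-side head index for the CQ */
    void    *sq, *cq;   /* queue memory holding SQEs / CQEs */
};

/* Allocate one queue pair per core: cores submit I/O independently,
 * which is why NVMe needs no I/O locking on the submission path. */
struct nvme_queue_pair *alloc_per_core_queues(unsigned ncores, uint16_t depth)
{
    struct nvme_queue_pair *qps = calloc(ncores, sizeof(*qps));
    if (!qps)
        return NULL;
    for (unsigned i = 0; i < ncores; i++) {
        qps[i].qid   = (uint16_t)(i + 1);
        qps[i].depth = depth;
    }
    return qps;
}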


Why Use NVMe? – I/O Stack Comparison

SCSI path: Application or file system layer → Block layer → SCSI mid layer → SCSI driver layer → HBA driver layer → HBA

NVMe path: Application or file system layer → Block layer → NVMe HBA driver → HBA


Why Use NVMe? – Command Set

• NVMe has a greatly reduced, simpler instruction set: 34 basic SCSI commands map to just 15 NVMe commands
• Two general types of instructions (a few real opcodes are sketched below):
   Management: health status polling, configuration Set/Get, VPD read/write, Reset, Connect
   NVMe commands: firmware actions, Format media, Get Features/Log Page, Namespace Management/Attachment, Security Send/Receive, Set Features, Read (4), Write (4), Identify (7)
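
To make the reduced command set concrete, here are a few real opcodes from the NVMe base specification, grouped by the queue they are submitted on. This is a hedged, non-exhaustive illustration, not the slide's exact command list.

/* Selected NVMe opcodes (values per the NVMe base specification). */
enum nvme_admin_opcode {           /* submitted on the Admin Queue */
    NVME_ADMIN_GET_LOG_PAGE  = 0x02,
    NVME_ADMIN_IDENTIFY      = 0x06,
    NVME_ADMIN_SET_FEATURES  = 0x09,
    NVME_ADMIN_GET_FEATURES  = 0x0A,
    NVME_ADMIN_NS_MGMT       = 0x0D,  /* Namespace Management */
    NVME_ADMIN_NS_ATTACH     = 0x15,  /* Namespace Attachment */
    NVME_ADMIN_FORMAT_NVM    = 0x80,
    NVME_ADMIN_SECURITY_SEND = 0x81,
    NVME_ADMIN_SECURITY_RECV = 0x82,
};

enum nvme_io_opcode {              /* submitted on I/O Queues */
    NVME_CMD_FLUSH = 0x00,
    NVME_CMD_WRITE = 0x01,
    NVME_CMD_READ  = 0x02,
};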


Why Use NVMe? – SCSI vs. NVMeFC on a Typical SCSI 70% Read Workload

Workload
• I/O size 4 KB at 70% read / 30% write, cache hit
• Identical workload of 200K IOPS on SCSI and NVMeFC
• 32 SCSI devices with total queue depth (QD) = 32
• 32 NVMeFC devices with 4 associations and a total of 64 queues

Analysis
• Identical response time for the typical workload
• Identical IOPS for the typical workload
• SCSI maxes out at QD 32
• NVMeFC delivers the same IOPS at half the CPU consumption, thanks to code efficiency in the NVMeFC host stack
• The queuing potential in NVMeFC gives the same response time at a lower CPU cost than SCSI


Why Use NVMe? – SCSI vs. NVMeFC on an I/O-Intensive 70% Read Workload

Workload
• I/O size 4 KB at 70% read / 30% write, cache hit
• Maximum workload with QD 512
• 32 SCSI devices
• 32 NVMeFC devices with 4 associations and a total of 64 queues

Analysis
• NVMeFC IOPS scale to 400K–500K, limited by the storage target port capability
• NVMeFC shows a 50% latency drop versus SCSI
• SCSI IOPS are limited to 220K by a host stack bottleneck
• SCSI drives CPU usage to almost 70%


NVMe Basic Terminology

• Subsystem – A non-volatile memory storage device
• Capsule – The unit of information exchange used in NVMe-oF, containing an NVMe command, data and/or responses
• Discovery Controller – A type of controller that supports minimal functionality, used for discovery of NVMe media controllers
• Namespace ID (NSID) – Similar to SCSI's LUN (Logical Unit Number) identifier. The NSID identifies a set of logical block addresses (LBAs) on the NVM media, i.e., a volume
• SQ (Submission Queue) – A queue used to submit I/O commands to a controller
• CQ (Completion Queue) – A queue used by a controller to indicate command completions, including any return data and completion status


NVMe Basic Terminology (continued)

• Admin Queue – A queue used to submit administrative commands to a controller
• I/O Queue – A queue used to submit I/O commands to a controller for data movement
• Association – An exclusive relationship between a specific controller and a specific host that includes the Admin Queue and all I/O queues on that controller accessible by that host
• Scatter-Gather List (SGL) – One or more pointers to memory containing data to be moved or stored, where each pointer consists of a memory address and a length value (see the sketch below)
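
A sketch of the SGL building block just described: the NVMe base specification's 16-byte SGL Data Block descriptor, a 64-bit address plus a 32-bit length with the descriptor type in the final byte (abbreviated here for illustration).

#include <stdint.h>

/* SGL Data Block descriptor: 16 bytes on the wire. */
struct nvme_sgl_descriptor {
    uint64_t address;      /* starting memory address of the data */
    uint32_t length;       /* byte count at that address */
    uint8_t  reserved[3];
    uint8_t  type;         /* upper nibble: descriptor type (0x0 = Data Block) */
};
/* A command points at one descriptor, or at a segment (list) of them, which
 * is how a single command scatters/gathers non-contiguous buffers. */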


Taxonomy of Transports

• NVMe – inside the host chassis (PCIe attached)
• NVMe-oF – network-attached, over RDMA* or Fibre Channel fabrics

* RDMA means remote direct memory access


Taxonomy of Transports – RoCE

• RoCE (pronounced "rocky") is RDMA over Converged Ethernet
• Two versions of RoCE:
   Version 1 is Ethernet link-layer based and limited to a broadcast domain, a limitation similar to FCoE's
   Version 2 is built on top of UDP/IP; in-order delivery is not guaranteed by UDP, so packets between the same source and destination pair must not be reordered by the network
   v2 has a simple congestion control mechanism
   Converged Ethernet with DCB (data center bridging) networks is needed to get performance characteristics similar to InfiniBand

   RoCE v1                     RoCE v2
   RDMA software stack         RDMA software stack
   IB transport protocol       IB transport protocol
   IB network layer            UDP/IP
   Ethernet link layer         Ethernet link layer
   Ethernet management         Ethernet / IP management


Taxonomy of Transports – iWARP

• iWARP stands for Internet Wide-Area RDMA Protocol
• Created by the IETF and initially defined in 5 RFCs in 2007 (RFC 5040, RFC 5041, RFC 5042, RFC 5043 and RFC 5044)
• Since 2011, the IETF has made updates in 3 additional RFCs (RFC 6580 – 2012, RFC 6581 – 2011 and RFC 7306 – 2014)
• Uses TCP or SCTP (Stream Control Transmission Protocol) to carry RDMA flows
• iWARP is a protocol, not a full implementation mechanism (i.e., it leaves out some implementation details)
• Implemented as part of the TCP I/O stack (a kernel/software bottleneck) or by using TOE NICs (TCP/IP Offload Engine)

   iWARP stack
   RDMA software
   iWARP protocol
   TCP/IP network
   Ethernet link layer
   Ethernet / IP management


Taxonomy of Transports – InfiniBand

• InfiniBand (IB) architecture specifications define how RDMA is performed over an IB network
• IB has a link-level flow control mechanism, via a credit-based algorithm, that guarantees lossless communication
• IB has a congestion control method based on FECN/BECN (forward/backward explicit congestion notification) marking
• IB switches have lower latency than Ethernet switches (approximately 100 ns versus 230 ns), but cost per IB port is a factor


Taxonomy of Transports – FC-NVMe

• NVMe over Fibre Channel standards were developed by the T11 organization
• The T11 standards document for NVMe is FC-NVMe
• Very similar to FCP (Fibre Channel Protocol): NVMe is "mapped" into the payload area of FC frame(s)
• FC-NVMe reuses slightly more than 90% of the in-host NVMe implementation over a PCIe bus
• The main difference is that NVMe-oF uses a message or packet model for communication between a host and a storage target, while in-host NVMe implementations just use shared memory and memory pointers
• NVMe-oF is meant to extend the distance between host and storage device
• The original design goal of NVMe-oF was to add no more than 10 microseconds of latency

Summary of NVMe Connectivity Options

NVMe host subsystem: host software sits above a host-side transport abstraction, which feeds one of three transports:
• Fibre Channel transport → FC fabric
• NVMe RDMA transport (RDMA verbs) over iWARP, RoCE or InfiniBand → RDMA fabric
• NVMe PCIe → PCIe fabric

NVMe controller subsystem: the mirror image on the target side — a controller-side transport abstraction above the Fibre Channel transport, the RDMA verbs transport (iWARP/RoCE/InfiniBand) or the NVMe PCIe interface function.


Taxonomy of Transports – Stack Comparison

[Figure: side-by-side comparison of the transport protocol stacks]

FC-NVMe Introduction

• NVMe is mapped using an FCP mapping similar to SCSI's
• Built from the ground up with non-volatile memory as the storage, to leverage the speed and robustness of Fibre Channel
• Does not require building a parallel infrastructure
• Easily provides for scalability and access controls
• Aimed at 16 Gbps, 32 Gbps and higher-speed switches and fabrics
• The fabric switches have to be able to recognize FC-NVMe services and devices, plus handle the registration and queries of FC features:
   Brocade FOS 8.2 or higher
   Cisco NX-OS 8.x or higher


FC-NVMe Components

• Same basic components as SCSI and FCP: hosts (initiators), SAN (network) and storage devices (targets)
• A storage device can also be known as an NVMe storage subsystem
• An NVMe storage subsystem has:
   NVMe controllers (which contain the SQ and CQ queues)
   NVMe namespaces
   NVMe storage media
• NVMe controllers are logical devices; they can be created dynamically when a host connects to a namespace, or the NVMe storage subsystem may have a static number of controllers already set up
• NVMe controllers handle tags to the namespace(s), which are known as Namespace IDs

FC-NVMe Components – Namespaces

• An NVMe namespace is another logical construct within the NVM subsystem
• A namespace provides an interface between the controller and the storage media
• The namespace is a group of LBAs (logical block addresses) within the storage media
• The namespace defines the format used for the LBAs and the size of the LBAs (see the sketch below)
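
As an illustration of "format and size", here is a heavily abbreviated sketch of the Identify Namespace data that carries this information. Field names follow the NVMe base specification, but the real structure is 4 KB; treat this as a sketch, not the full layout.

#include <stdint.h>

struct nvme_lba_format {
    uint16_t ms;     /* metadata bytes per LBA */
    uint8_t  lbads;  /* LBA data size as a power of two: 9 = 512 B, 12 = 4 KB */
    uint8_t  rp;     /* relative performance hint */
};

/* Abbreviated Identify Namespace data: size, capacity and the LBA formats. */
struct nvme_id_ns_sketch {
    uint64_t nsze;   /* namespace size, in logical blocks */
    uint64_t ncap;   /* namespace capacity, in logical blocks */
    uint64_t nuse;   /* logical blocks currently in use */
    uint8_t  flbas;  /* which entry of lbaf[] is the format in use */
    struct nvme_lba_format lbaf[16];  /* supported LBA formats */
};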


FC-NVMe Components – Controllers and Queues

[Figure: NVMe storage controllers and their queues within an NVMe subsystem]

FC-NVMe Components – NVMe Queue Handling

Applies to Admin and I/O queues, both Submission and Completion:
• The Tail points to the next free entry in the queue
• The Head points to the next entry to be pulled out of the queue
• Head and Tail point to the same entry? Empty queue!
• Head equals Tail plus one? Full queue!
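
A minimal sketch of that head/tail arithmetic (illustrative; in a real driver the host also writes its updated index to a doorbell register to notify the controller):

#include <stdbool.h>
#include <stdint.h>

/* Empty: head and tail point at the same entry. */
static inline bool queue_empty(uint32_t head, uint32_t tail)
{
    return head == tail;
}

/* Full: advancing the tail once more would catch the head,
 * so one slot is always left unused. */
static inline bool queue_full(uint32_t head, uint32_t tail, uint32_t size)
{
    return ((tail + 1) % size) == head;
}

/* Both indices wrap around the fixed-size ring. */
static inline uint32_t queue_advance(uint32_t index, uint32_t size)
{
    return (index + 1) % size;
}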


FC-NVMe Components – NVMe Queue Sizing

• Queue size is the number of entries minus 1
• I/O Queue minimum size is 2; maximum size is 64K entries
• Admin Queue minimum size is 2; maximum size is 4K entries
• There is one Admin Queue per NVMe subsystem; the Admin Queue is used to configure I/O queues and for controller management
• I/O Queues control data movement (read and write operations)
• Queues can be a contiguous block of physical memory or a set of memory pages
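
A small sketch of those limits as validation checks (illustrative only; the spec encodes queue sizes 0's based in its registers, so a programmed value of N means N+1 slots):

#include <stdbool.h>
#include <stdint.h>

/* I/O Queues: 2 to 64K entries. */
static bool io_queue_size_ok(uint32_t entries)
{
    return entries >= 2 && entries <= 65536;
}

/* Admin Queue: 2 to 4K entries. */
static bool admin_queue_size_ok(uint32_t entries)
{
    return entries >= 2 && entries <= 4096;
}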


FC-NVMe Components – NVMe Queue Formats

• Submission Queue Entries (SQEs) are 64 bytes each
• Completion Queue Entries (CQEs) are 16 bytes each
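
A sketch of those two entry formats, with the field breakdown abbreviated; the sizes are the ones named above and the layout follows the NVMe base specification's common command and completion formats:

#include <assert.h>
#include <stdint.h>

struct nvme_sqe {                 /* Submission Queue Entry: 64 bytes */
    uint8_t  opcode;              /* command opcode (CDW0 bits 07:00) */
    uint8_t  flags;               /* fused operation + PRP/SGL selector */
    uint16_t command_id;          /* echoed back in the completion */
    uint32_t nsid;                /* namespace this command targets */
    uint64_t reserved;
    uint64_t metadata_ptr;
    uint64_t data_ptr[2];         /* PRP1/PRP2 or one SGL descriptor */
    uint32_t cdw10_15[6];         /* command-specific dwords */
};

struct nvme_cqe {                 /* Completion Queue Entry: 16 bytes */
    uint32_t result;              /* command-specific result */
    uint32_t reserved;
    uint16_t sq_head;             /* controller's view of the SQ head */
    uint16_t sq_id;               /* which SQ the command came from */
    uint16_t command_id;          /* matches the SQE's command_id */
    uint16_t status;              /* phase tag + status field */
};

static_assert(sizeof(struct nvme_sqe) == 64, "SQE must be 64 bytes");
static_assert(sizeof(struct nvme_cqe) == 16, "CQE must be 16 bytes");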


Discovery or Getting Started (1)

1) Devices perform normal FC fabric login operations
2) Devices register their FC features with the Name Server

(The diagrams on these slides show the host, the Name Server, and an FC-NVMe storage subsystem containing a Discovery Controller and a Storage Controller.)


Discovery or Getting Started (2)

3) The host queries the Name Server for the list of fabric members it is allowed to access via zoning
4) The Name Server responds with a list of FCIDs


Discovery or Getting Started (3)

5) The host queries the Name Server about each device, asking for the FC Features supported
6) The Name Server responds to each query with FC Feature information


Discovery or Getting Started (4)

7) The host logs into the storage subsystem and queries the NVMe subsystem's Discovery Controller for specific feature(s)
8) The NVM subsystem's Discovery Controller responds with information about available resources


Discovery or Getting Started (5)

9) The host logs into the storage controller and queries the NVMe Storage Controller
10) The NVMe Storage Controller responds with information about available resources

NOTE: At this point, the host can disconnect from the NVMe storage subsystem's Discovery Controller.


Discovery or Getting Started (6)

11) The host issues a Connect request to the NVMe Storage Controller
12) The NVMe Storage Controller responds by creating Admin and I/O queues
13) I/O operations start
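
As an illustration of step 11, here is a sketch of the data payload an NVMe-oF Connect request carries, modeled on the nvmf_connect_data layout used by the Linux host driver. Reserved padding is elided, so this is illustrative rather than wire-accurate.

#include <stdint.h>

#define NVMF_NQN_LEN 256  /* size of an NVMe Qualified Name field */

struct nvmf_connect_data_sketch {
    uint8_t  hostid[16];              /* host identifier (UUID) */
    uint16_t cntlid;                  /* 0xFFFF requests a dynamic controller */
    char     subsysnqn[NVMF_NQN_LEN]; /* NQN of the target subsystem */
    char     hostnqn[NVMF_NQN_LEN];   /* NQN of this host */
};
/* A Connect on queue ID 0 creates the Admin Queue and establishes the
 * association; Connects with nonzero queue IDs then create the I/O queues. */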


I/O Operations

[Figures: I/O operation flows of FCP–SCSI and FC-NVMe — read operation flow and write operation flow]

Exploiting Dual Protocol: FCP and FC-NVMe

A dual-protocol infrastructure makes for the easiest migration:
• Deploy NVMe-based arrays while leveraging the existing infrastructure; a dual infrastructure is easily supported
• How long will the transition take? Avoid risks with incremental migration:
   Applications dictate how individual volumes can be migrated
   Changes can be rolled back easily, without disruption to hardware or cabling
• FCP and NVMe over FC can both leverage FC zoning:
   SAN zoning improves security
   Zoning restricts devices from accessing network areas they should not be visiting!
• Discovery: Emulex creates drivers that leverage FCP for device discovery, then check those devices for FC-NVMe traffic

(Diagram: servers with FC-NVMe Gen6 HBAs on an existing 32 Gb FC fabric carrying both SCSI-on-FC and FC-NVMe, connecting the existing enterprise storage infrastructure, dual-protocol SAN arrays, SCSI-on-FC arrays and FC-NVMe arrays.)


Final Thoughts

Session Summary

• Assumptions for this session
• What is NVMe
• Why use NVMe
• Basic terminology
• Taxonomy of transports
• FC-NVMe introduction
• FC-NVMe components
• Discovery
• I/O operations

Questions?

© Copyright IBM Corporation 2018. Technical University/Symposia materials may not be reproduced in whole or in part without the prior written permission of IBM.
IBM Systems Technical Events | ibm.com/training/events
