
High Performance Storage with

blk-mq and scsi-mq

Christoph Hellwig

2014 Storage Developer Conference. © Christoph Hellwig. All Rights Reserved.
Problem Statement

 The Linux storage stack doesn't scale:


– ~250,000 to 500,000 IOPS per LUN
– ~ 1,000,000 IOPS per HBA
– High completion latency
– High lock contention and cache line bouncing
– Bad NUMA scaling
Linux SCSI Performance
fio 4k random read performance - RAID HBA with 16 SAS SSDs

[Chart: aggregate IOPS vs. number of LUNs (1-16) on Linux 2.6.32; y-axis from 0 to 900,000 IOPS]
Linux Storage Stack - Issues

 The Linux block layer can't handle high-IOPS or low-latency devices
– All of the block layer?
Linux Storage Stack

[Diagram: BIO submission → Device mapper / Software RAID → Request layer → SCSI layer → HW drivers]
Linux Storage Stack – Issues (2)

 The request layer can't handle high-IOPS or low-latency devices
 Vendors work around this by implementing make_request-based drivers
– Lots of code duplication
– Missing features
 SCSI drivers are tied into the request framework
Linux Storage Stack – blk-mq

 A replacement for the request layer


– First prototyped in 2011
– Merged in Linux 3.13 (2014)
 Not a drop-in replacement

– Different driver API


– Different queuing model (push vs pull)
Blk-mq – architecture

 Processes dispatch into per-cpu software queues


 Software queues map to hardware issue queues (sketched below)

– In the optimal case:


• N(hardware queues) = N(CPU cores)
– For now the most common case is:
• N(hardware queues) = 1
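
As a rough driver-side illustration of this mapping, here is a minimal, hypothetical sketch of a blk-mq tag set describing the hardware queues. The struct blk_mq_tag_set fields and blk_mq_alloc_tag_set() are the upstream blk-mq API; the example driver, its ops table, and the chosen queue depth and cmd_size are assumptions.

#include <linux/blk-mq.h>

/* Hypothetical driver ops (queue_rq and friends), defined elsewhere. */
extern struct blk_mq_ops example_mq_ops;

static struct blk_mq_tag_set example_tag_set;   /* static, so zero-initialized */

static int example_setup_tag_set(unsigned int hba_hw_queues)
{
        example_tag_set.ops = &example_mq_ops;
        /*
         * Optimal case: one hardware queue per CPU core.  Many HBAs only
         * offer a single submission queue, hence nr_hw_queues = 1 for now.
         */
        example_tag_set.nr_hw_queues = hba_hw_queues ? hba_hw_queues : 1;
        example_tag_set.queue_depth = 256;              /* assumed HBA queue depth */
        example_tag_set.numa_node = NUMA_NO_NODE;
        /* Per-request "slack" space for driver data (see a later slide). */
        example_tag_set.cmd_size = 192;                 /* assumed driver data size */

        return blk_mq_alloc_tag_set(&example_tag_set);
}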
Blk-mq I/O submission path

[Diagram: Processes → Software contexts (per-CPU) → Hardware contexts (based on HW capabilities) → HBA]
Blk-mq – request allocation and tagging

 Provides combined request allocation and tagging


– Requests are allocated at initialization
– Requests are indexed by the tag
– Tag and request allocation are combined
 Avoids per-request allocations in the driver

– Driver data in “slack” space behind request
– S/G list is part of driver data (see the sketch below)
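
A hypothetical sketch of what this buys a driver in the submission path: the per-request driver data is reached with blk_mq_rq_to_pdu(), a real blk-mq helper, so nothing is allocated per I/O. The example_cmd structure and its inline S/G list size are illustrative assumptions.

#include <linux/blk-mq.h>
#include <linux/scatterlist.h>

/*
 * Hypothetical per-request driver data.  Its size is reported via the tag
 * set's cmd_size, so it is preallocated behind every struct request.
 */
struct example_cmd {
        int                     status;
        struct scatterlist      sg[128];        /* inline S/G list in the driver data */
};

static void example_start_request(struct request *rq)
{
        /* No allocation here: the driver data lives in the "slack" space
         * directly behind the preallocated request. */
        struct example_cmd *cmd = blk_mq_rq_to_pdu(rq);

        cmd->status = 0;
        sg_init_table(cmd->sg, ARRAY_SIZE(cmd->sg));
        /* ... map the bio data, then hand (rq->tag, cmd) to the hardware ... */
}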
Blk-mq – I/O completions

 Uses IPIs to complete on the submitting node and


avoid false cache line sharing
– Can be disabled, or forced to the submitting
core
 Old request code provided similar functionality

– Non-integrated additional functionality


– Uses software interrupts instead of IPIs
Prototype for blk-mq usage in SCSI

 First “scsi-mq” prototype from Nic Bellinger


– Published in late 2012
– Used early blk-mq to drive SCSI
– Demonstrated millions of IOPS
– Required (small) changes to drivers
– Only using a single hardware queue
– Did not support various existing SCSI stack
features
Production design for blk-mq in SCSI

 Should be a drop-in replacement


– Must support full SCSI stack functionality
– Must not require driver API changes
– Driver should not be tied to blk-mq
 Should avoid code duplication

– Push as much work as possible to blk-mq


– Refactor SCSI code to avoid separate code paths
as much as possible
Production design for blk-mq in SCSI -
Request allocation and tagging

 Considerations for request and tag allocation:


– Allocating a request for each per-LUN tag would
inflate memory usage
– Various hardware requires per-host tags anyway
 Thus we went with blk-mq changes to allow per-host tag sets (sketched below)
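
A minimal sketch of what a per-host (shared) tag set looks like from the SCSI side, assuming the blk-mq API of that era: one blk_mq_tag_set sized for the host is used to create every LUN's request_queue, so requests and tags are preallocated once per host. The host structure and LUN count are hypothetical.

#include <linux/blk-mq.h>
#include <linux/err.h>

/* Hypothetical host: two LUNs sharing one per-host tag set. */
struct example_host {
        struct blk_mq_tag_set   tag_set;        /* queue_depth sized to the HBA's can_queue */
        struct request_queue    *lun_queue[2];
};

static int example_add_luns(struct example_host *h)
{
        int i;

        /*
         * Every LUN queue is created from the same tag set, so a tag is
         * unique per host rather than per LUN, and memory usage is not
         * inflated by allocating a request for every per-LUN tag.
         */
        for (i = 0; i < ARRAY_SIZE(h->lun_queue); i++) {
                h->lun_queue[i] = blk_mq_init_queue(&h->tag_set);
                if (IS_ERR(h->lun_queue[i]))
                        return PTR_ERR(h->lun_queue[i]);
        }
        return 0;
}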
Production design for blk-mq in SCSI -
S/G lists

 Modern SCSI HBAs allow for huge S/G lists


– Linux supports up to 2048 S/G list entries,
which require 56 KiB of S/G list structures
– We don't want to preallocate that much
 Preallocate a single 128-entry chunk

– Enough for most latency-sensitive small I/O
– The rest is dynamically allocated as needed (see the sketch below)
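
A sketch of that two-tier policy, reusing the hypothetical example_cmd from the earlier sketch: small commands use the preallocated 128-entry chunk, and only large ones pay for a dynamic allocation. The helper name and GFP flag are illustrative; the real scsi-mq code builds on the kernel's chained S/G list support rather than a flat allocation.

#include <linux/scatterlist.h>
#include <linux/slab.h>

#define EXAMPLE_INLINE_SG       128     /* preallocated S/G entries per command */

/*
 * Hypothetical helper: hand out the inline chunk when it is large enough,
 * otherwise fall back to a dynamic allocation for the rare huge I/O
 * (up to 2048 entries, i.e. roughly 56 KiB of S/G structures).
 */
static struct scatterlist *example_get_sg(struct example_cmd *cmd,
                                          unsigned int nents)
{
        if (nents <= EXAMPLE_INLINE_SG)
                return cmd->sg;         /* fast path: no allocation at all */

        return kmalloc_array(nents, sizeof(struct scatterlist), GFP_ATOMIC);
}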
Blk-mq work driven by SCSI

 Transparent pre/post-flush request handling


 Head of queue request insertion
 Partial completion support
 BIDI request support
 Shared tag space between multiple request_queues
 Better support for requeuing from IRQ context
 Lots of bugfixes and small features / cleanups


SCSI preparation for blk-mq

 New cmd_size field in host template


– Allows allocating per-driver command data (see the sketch after this slide)
 Host-lock reductions

– Elimination of host-wide spinlocks in I/O


submission and completion
 Upper level driver refactoring

– Avoids legacy request layer interaction


– Provides a cleaner driver abstraction
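
A hypothetical fragment showing the new hook from an LLDD's perspective: cmd_size is the scsi_host_template field added by this work, and setting it makes the midlayer preallocate that many bytes of per-command driver data. Everything else in the template below is an illustrative placeholder, not a complete driver.

#include <scsi/scsi_host.h>
#include <scsi/scsi_cmnd.h>

/* Hypothetical per-command driver data, allocated by the midlayer. */
struct example_lld_cmd {
        dma_addr_t      sense_dma;
        u32             fw_handle;
};

/* Assumed to be implemented elsewhere in the driver. */
extern int example_queuecommand(struct Scsi_Host *shost, struct scsi_cmnd *cmd);

static struct scsi_host_template example_sht = {
        .name           = "example_lld",
        .queuecommand   = example_queuecommand,
        .can_queue      = 1024,         /* per-host tag space */
        /* New for scsi-mq: per-command driver data size. */
        .cmd_size       = sizeof(struct example_lld_cmd),
};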
SCSI blk-mq status

 Required blk-mq features included in Linux 3.16


 Preparatory SCSI work merged in Linux 3.16
 Blk-mq support for SCSI merged in Linux 3.17

– Must be enabled with the scsi_mod.use_blk_mq=Y boot option
– Does not work with dm-multipath
 Big distributions include preparatory patches
Linux SCSI Performance
fio 512 byte random read performance - RAID HBA with 16 SAS SSDs

[Chart: aggregate IOPS vs. number of LUNs (1-16) for Linux 2.6.32 and 3.17-rc3 (with blk-mq); y-axis from 0 to 1,200,000 IOPS]
Note: HBA maxes out at about 1 million IOPS
SCSI profiling data

Linux 2.6.32:
  46.13%  [kernel]  [k] _spin_lock_irq
  26.92%  [kernel]  [k] _spin_lock_irqsave
   9.32%  [kernel]  [k] _spin_lock
   0.47%  [kernel]  [k] kmem_cache_alloc
   0.45%  [kernel]  [k] scsi_request_fn
   0.39%  [kernel]  [k] _spin_unlock_irqrestore
   0.33%  [kernel]  [k] kref_get
   0.32%  [kernel]  [k] __blockdev_direct_IO_newtrunc
   0.32%  [kernel]  [k] kmem_cache_free
   0.30%  [kernel]  [k] native_write_msr_safe

Linux 3.17-rc3 (with blk-mq):
   2.67%  [kernel]  [k] do_blockdev_direct_IO
   2.60%  [kernel]  [k] __bt_get
   2.43%  [kernel]  [k] __blk_mq_run_hw_queue
   2.07%  [kernel]  [k] put_compound_page
   1.87%  [kernel]  [k] __blk_mq_alloc_request
   1.60%  [kernel]  [k] _raw_spin_lock
   1.59%  [kernel]  [k] kmem_cache_alloc
   1.58%  [kernel]  [k] scsi_queue_rq
   1.44%  [kernel]  [k] _raw_spin_lock_irqsave
Linux SCSI Performance
Multiple LUN performance, single threaded - SRP attached null_io target

[Chart: IOPS (left axis, 0 to 1,400,000) and CPU usage (right axis, 0% to 140%) vs. number of LUNs (1, 2, 4, 6, 8) for 3.14.3, 3.16+, and 3.16+ (with blk-mq)]
Note: Target overload in the 8 LUN case prevents linear scaling

Linux SCSI Performance
Single LUN performance - SRP attached null_io target

[Chart: IOPS (0 to 1,400,000) for 3.14.3, 3.16+, and 3.16+ (with blk-mq) across four workloads: random read with 12 threads, random write with 12 threads, random read with 1 thread, random write with 1 thread]
SCSI blk-mq status - near term work

 Better way to select blk-mq vs legacy code path


– Compile-time option added for 3.18-rc
 We would like to fully replace the old SCSI I/O path
with the blk-mq one.
 Missing features:

– I/O scheduler support in blk-mq


– multipath support (prototype exists now)
Exposing multiple HW queues to SCSI
drivers

 SCSI core so far only exposes a single queue


– Some drivers are ready for multiple queues
– So far they do the queue mapping internally
 Design for tag allocation:

– We want per-queue tag allocations for scalability


reasons
– Add a queue prefix to the tag (see the sketch below)
– Work done by Bart van Assche, likely to be merged for Linux 3.19
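
The queue-prefix idea is exposed to drivers through blk-mq's "unique tag" helpers, which pack the hardware queue index into the upper bits and the per-queue tag into the lower bits; the surrounding function below is a hypothetical sketch of how a SCSI LLDD might consume them.

#include <linux/blk-mq.h>

/*
 * Hypothetical LLDD helper: derive a host-wide command identifier from a
 * request when each hardware queue has its own tag space.
 */
static u32 example_host_tag(struct request *rq)
{
        u32 unique = blk_mq_unique_tag(rq);

        /* Upper bits: hardware queue index; lower bits: tag within that queue. */
        pr_debug("hwq %u tag %u\n",
                 blk_mq_unique_tag_to_hwq(unique),
                 blk_mq_unique_tag_to_tag(unique));

        return unique;  /* usable directly as a per-host unique tag */
}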
Future work – better integration

 Expose more blk-mq flags to SCSI


– Request merge control
– Better command allocation/freeing hooks
– Reserved tags for HBA use
Future work - longer term research

 Further reduction of shared cache lines:


– Let blk-mq handle per-host queuing limits
– Let hardware handle per-LUN or per-target queuing limits
 Map multiple LUNs (request_queues) to the same
blk-mq contexts
References

 Benchmarks:
– Bart van Assche (Fusion-io / Sandisk):
• https://docs.google.com/file/d/0B1YQOreL3_FxWmZfbl8xSzRfdGM/edit?pli=1

– Robert Elliott (HP):


• http://marc.info/?l=linux-kernel&m=140313968523237&w=2
Thanks

 Fusion-io (now a Sandisk company)


– For sponsoring the blk-mq in SCSI work
 Jens Axboe

– For code and slide review, and blk-mq itself


 Bart van Assche, Robert Elliott

– For code and slide review as well as benchmark


data
Questions?
