Wagner W 420 KVM Performance Improvements and Optimizations
Overview
This session shows how various features impact SPECvirt results, and also how they fare against real-world applications.
Note that not all features are in all releases. Some of this will apply to your environment, but your mileage may vary...
Agenda
What is SPECvirt
A quick, high level overview of KVM
Low Hanging Fruit
Memory
Networking
Block I/O basics
NUMA and affinity settings
CPU Settings
Wrap up
SPECvirt - Basics
The unit of work is called a Tile, made up of six guests:
Web (HTTP)
Infrastructure (NFS for the Web server)
Application (Java Enterprise)
DB (for the App server)
Mail (IMAP)
Idle
Run as many Tiles as possible until any of the workloads fails any of the Quality of Service requirements.
The SPECjApp workload has peaks and valleys that greatly vary resource usage in the App and DB guests.
SPECvirt Home Page
http://www.spec.org/virt_sc2010/
SPECvirt Diagram
Each client runs a modified version of SPECweb, SPECjApp, and SPECmail
[Diagram: Tile guests plus an Idle guest, driven by the clients and coordinated by a Controller]
[Charts: SPECvirt_sc2010 score and Tiles/Core. One chart compares vmware, RHEL 5, and RHEL 6 (scores shown include 1169, 1221, 1367, 1369, and 1763). Another compares VMware ESX 4.1 on Bull SAS (32 core, score 2721), IBM x3850 X5 (32 core, score 2742), and HP DL580 G7 (40 core, score 3723) against RHEL 6 KVM on IBM x3850 X5 (64 core).]
[Chart: SPECvirt Tiles, Baseline: 8. Based on a presentation by Andrew Theurer at KVM Forum, August 2010.]
A Quick, High Level Overview of KVM
Guests run as a process in userspace on the host.
A virtual CPU is implemented as a Linux thread, so the Linux scheduler schedules it like any other thread.
Guests therefore inherit host kernel features: NUMA, Huge Pages, support for new hardware.
I/O settings in the host can make a big difference in guest I/O performance:
Proper settings are needed to achieve true direct I/O from the guest.
The Deadline scheduler (on the host) typically gives the best performance.
[Diagram: KVM architecture, multiple user VMs running as host processes]
Component: Feature
CPU/Kernel: NUMA; ticketed spinlocks; Completely Fair Scheduler; extensive use of Read-Copy-Update (RCU); scales up to 64 vCPUs per guest
Memory: large memory optimizations; Transparent Huge Pages is ideal for hardware-based virtualization
Networking: vhost-net, a kernel-based virtio backend with better throughput and latency; SR-IOV for near-native performance
Block: AIO, MSI, scatter-gather
Performance Enhancements
Vhost-net
A new host-kernel networking backend providing superior throughput and latency over the prior userspace implementation.
Avoids the need for the host to trap guest FPU CR0 accesses.
Performance Enhancements
Batches writes on I/O barriers, rather than issuing an fsync every time additional block storage is needed (thin-provisioning growth).
Low Hanging Fruit
Remember this?
Guest NFS Write Performance
Be Specific!
Specify the OS type and flavor when you create a guest. virt-manager will tailor the guest configuration accordingly: the more info you provide, the more tailoring will happen (see the sketch below).
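As a minimal sketch, creating a guest from the command line with an explicit OS variant so sensible defaults get picked (the name, sizes, paths, and install media below are illustrative assumptions):

    virt-install \
        --name rhel6-guest \
        --ram 4096 --vcpus 4 \
        --os-type linux --os-variant rhel6 \
        --disk path=/var/lib/libvirt/images/rhel6-guest.img,size=20 \
        --network bridge=br0,model=virtio \
        --cdrom /var/lib/libvirt/images/rhel6.iso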
vhost_net drivers
[Chart: guest NFS write performance for RHEL5-virtio, RHEL6-default, and RHEL6-vhost, showing a 12.5x improvement]
Memory
Extended Page Table (EPT) age bits: allow the host to make smarter swap choices when under pressure
Kernel Same-page Merging (KSM): consolidates duplicate pages; particularly efficient for Windows guests
Transparent Huge Pages: efficiently manage large memory allocations as one unit
Memory sharing (KSM)
Helps delay or avoid swapping, at the cost of slower memory access.
The major issue is making sure THP won't reduce the memory overcommit capacity of the host: a throughput vs. density trade-off.
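A quick sketch of inspecting and controlling KSM on a RHEL 6 host, using the standard sysfs interface; ksmtuned is Red Hat's adaptive tuning daemon:

    cat /sys/kernel/mm/ksm/pages_sharing    # pages currently deduplicated
    echo 1 > /sys/kernel/mm/ksm/run         # enable scanning
    service ksmtuned status                 # adaptive KSM tuning daemon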
With 2MB huge pages, the virtual-to-physical page map is 512 times smaller, and the TLB can map more physical memory, resulting in fewer misses.
Traditional Huge Pages are always pinned
Transparent Huge Pages in RHEL6
Most databases support Huge Pages
How to configure Huge Pages (16G): see the sketch below
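A minimal sketch of reserving 16G of static 2MB huge pages on the host (8192 x 2MB = 16G) and backing a guest with them; the memoryBacking element is standard libvirt XML:

    echo 8192 > /proc/sys/vm/nr_hugepages   # reserve 16G of 2MB pages
    grep Huge /proc/meminfo                 # verify HugePages_Total / HugePages_Free
    # persist across reboots: add "vm.nr_hugepages = 8192" to /etc/sysctl.conf
    # to back a guest with huge pages, add to its XML:
    #   <memoryBacking><hugepages/></memoryBacking>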
[Chart: SPECjbb2005 bops with no huge pages, host using huge pages (+24%), and guest & host using huge pages (+46%)]
[Chart: SPECjbb2005 bops at 10U, 20U, and 40U, r6-guest vs r6-metal, with and without THP (~10% gain from THP)]
[Chart: SPECvirt Tiles: Baseline 8, Huge Pages 8.8. Based on a presentation by Andrew Theurer at KVM Forum, August 2010.]
Networking
General Tips
Virtio
vhost_net
PCI passthrough
SR-IOV (Single Root I/O Virtualization)
Packet size: traffic between VMs on the same box never hits the hardware, so you can raise the MTU as needed. Make sure the MTU is set consistently across all components (see the sketch below).
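A sketch, assuming a bridged setup with illustrative device names; the MTU must match on the host NIC, the bridge, and inside each guest:

    ip link set dev eth0 mtu 9000    # host NIC
    ip link set dev br0 mtu 9000     # bridge the guests attach to
    # ...and set the same MTU on each guest's interface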
Virtio: paravirtualized network drivers; included in the Linux kernel, and available for Windows
vhost_net: bypasses the qemu layer
PCI passthrough: bypasses the host and passes the PCI device to the guest; can be passed to only one guest
SR-IOV: can be shared among multiple guests; limited hardware support; passed through to the guest
[Chart: latency (usecs), virtio guest vs host: 4X gap in latency]
vhost_net (new in RHEL 6.1)
Moves the QEMU network stack from userspace into the kernel:
Improved performance
Lower latency
Reduced context switching
One less copy
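As a sketch, a guest NIC stanza requesting the in-kernel vhost backend (the guest and bridge names are illustrative; recent libvirt defaults to vhost when it is available):

    virsh edit myguest    # then, inside <devices>:
    #   <interface type='bridge'>
    #     <source bridge='br0'/>
    #     <model type='virtio'/>
    #     <driver name='vhost'/>
    #   </interface>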
[Chart: latency (usecs), vhost vs virtio]
vhost_net Efficiency
[Chart: 8-guest scale-out RX, vhost vs virtio, Mbit per % host CPU, netperf TCP_STREAM]
PCI passthrough
The physical NIC is passed directly to the guest; the guest sees the real physical device.
Requires hardware support
You lose hardware independence
1:1 mapping of NIC to guest
BTW, this also works on some I/O controllers
SR-IOV (Single Root I/O Virtualization)
A new class of PCI devices that present multiple virtual devices, each appearing as a regular PCI device.
Requires hardware support
Low overhead, high throughput
No live migration
Lose hardware independence
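A sketch of assigning a physical function or SR-IOV virtual function to a guest (the PCI address below is illustrative):

    virsh nodedev-list | grep pci            # locate the device
    virsh nodedev-dettach pci_0000_03_00_1   # detach it from the host driver
    # then in the guest XML, under <devices>:
    #   <hostdev mode='subsystem' type='pci' managed='yes'>
    #     <source>
    #       <address domain='0x0000' bus='0x03' slot='0x00' function='0x1'/>
    #     </source>
    #   </hostdev>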
[Chart: Total OPM, 1 Red Hat KVM bridged guest vs 1 Red Hat KVM SR-IOV guest; values shown: 69,984, 86,469, 92,680]
[Chart: SPECvirt Tiles: Baseline 8, Huge Pages 8.8 (+10%), SR-IOV 10.6 (+32%)]
Block
Recent improvements
Tuning hardware
Choosing an elevator
Choosing the caching model
Tuned / ktune
Device assignment
Block Improvements
MSI support
AIO implementation
Tuning hardware
SAS or SATA? Fibre Channel, Ethernet, or SSD?
Bandwidth limits
Multiple HBAs
Device-mapper-multipath: provides multipathing capabilities and LUN persistence
How to test: low-level I/O tools such as dd, iozone, dt (see the sketch below)
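A minimal sketch of a low-level write test against a multipath test LUN (the path and sizes are illustrative; oflag=direct bypasses the host page cache):

    dd if=/dev/zero of=/dev/mapper/testlun bs=1M count=4096 oflag=direct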
Deadline: two queues per device, one for reads and one for writes; I/Os are dispatched based on time spent in the queue.
CFQ: per-process queues; each process's queue gets a fixed time slice (based on process priority).
Noop
The elevator can be selected per device at runtime or for all devices at boot time (see the sketch below).
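A sketch of selecting the deadline elevator (the device name is illustrative):

    cat /sys/block/sda/queue/scheduler             # current choice shown in brackets
    echo deadline > /sys/block/sda/queue/scheduler # per device, at runtime
    # at boot time: append elevator=deadline to the kernel command line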
[Chart: elevator comparison at 1, 2, and 4 guests]
Three types
Cache=none: I/O from the guest is not cached on the host.
Cache=writethrough: I/O from the guest is cached and written through on the host; potential scaling problems with this option with multiple guests (host CPU used to maintain the cache).
Set via the virt-manager dropdown under Advanced Options, or in the libvirt XML file: driver name='qemu' type='raw' cache='none' io='native'
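For reference, a sketch of a full disk stanza using cache=none with native AIO (the source path is illustrative):

    #   <disk type='block' device='disk'>
    #     <driver name='qemu' type='raw' cache='none' io='native'/>
    #     <source dev='/dev/mapper/guestlun'/>
    #     <target dev='vda' bus='virtio'/>
    #   </disk>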
I/O barriers: needed for data integrity; on by default; can be disabled on enterprise-class storage.
Read ahead: databases have parameters to configure read-ahead; for block devices use getra/setra (see the sketch below).
Asynchronous I/O: eliminates synchronous I/O stalls; critical for I/O-intensive applications.
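A sketch of checking and raising read-ahead on a block device (the device and value are illustrative; units are 512-byte sectors):

    blockdev --getra /dev/sda        # current read-ahead
    blockdev --setra 4096 /dev/sda   # raise it for sequential workloads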
AIO mode is configurable per device (only via the XML configuration file): driver name='qemu' type='raw' cache='none' io='native'
tuned profiles (see the sketch below):
default: CFQ elevator (cgroup); I/O barriers on; ondemand power savings; upstream VM settings; 4 msec quantum
latency-performance: elevator=deadline; power=performance
throughput-performance
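A sketch of switching profiles with tuned-adm on a RHEL 6 host:

    tuned-adm list                             # available profiles
    tuned-adm profile throughput-performance   # apply one
    tuned-adm active                           # confirm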
Device Assignment
It works for block too! Device-specific, with similar benefits and drawbacks...
[Chart: SAS workload throughput relative to bare metal: KVM VirtIO 79%, KVM PCI passthrough 94%]
NUMA and Affinity Settings
NUMA is needed for scaling
RHEL 5 / 6 are completely NUMA-aware
Additional performance gains come from enforcing NUMA placement
Static Huge Page allocation takes place uniformly across NUMA nodes
Workaround 1: use Transparent Huge Pages
Workaround 2: allocate huge pages, start the guest, then de-allocate the huge pages
Example: 128G of physical memory across 4 NUMA nodes; 80G of huge pages, 20G in each NUMA node. Compare a 20GB guest using huge pages alone against a 20GB guest using NUMA placement plus huge pages (see the sketch below).
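A sketch of inspecting the topology and enforcing placement (the node number is illustrative, and the numatune element assumes a reasonably recent libvirt):

    numactl --hardware    # node / CPU / memory layout
    numastat              # per-node allocation statistics
    # constrain a guest's memory to one node, in its XML:
    #   <numatune><memory mode='strict' nodeset='1'/></numatune>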
[Charts: throughput, non-NUMA vs NUMA at 10U, 20U, and 40U; NUMA gains of 8%, 12%, and 18%]
CPU Performance
CPU Improvements
We scale to 64 vCPUs per guest!
Ticket spinlocks with dynamic cancel
User-return notifiers in the kernel
x2APIC
Results with CPU type and topology settings are mixed; experiment and see what works best in your case.
1:1 vCPU pinning (see the sketch below)
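A sketch of 1:1 pinning and an explicit guest topology (the guest name, CPU numbers, and topology values are illustrative):

    virsh vcpupin myguest 0 8    # pin vcpu 0 to host cpu 8
    virsh vcpupin myguest 1 9    # pin vcpu 1 to host cpu 9
    # expose a topology in the guest XML:
    #   <cpu><topology sockets='1' cores='4' threads='1'/></cpu>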
[Chart: SPECvirt Tiles: Baseline 8, Huge Pages 8.8 (+10%), SR-IOV 10.6 (+32%), Node Binding 12 (+50%)]
Monitoring tools: top, vmstat, ps, iostat, netstat, sar, perf
Kernel tools: /proc, sysctl, AltSysrq
Networking: ethtool, ifconfig
Profiling: oprofile, strace, ltrace, systemtap, perf
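A sketch of a quick first pass with these tools while a guest is under load:

    vmstat 5          # memory, swap, and CPU at a glance
    sar -n DEV 5 3    # per-interface network rates
    perf top          # hottest kernel and user symbols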
Wrap up
KVM can be tuned effectively
Understand what is going on under the covers
Turn off stuff you don't need
Be specific when you create your guest
Look at using NUMA or affinity
Choose appropriate elevators (Deadline vs. CFQ)
Choose your cache wisely
Related sessions:
Performance Analysis & Tuning of Red Hat Enterprise Linux, with Shak and Larry, Campground 2
Part 1: Thurs 10:20
Part 2: Thurs 11:30
Sanjay Rao: Fri 9:45
KVM Wiki
http://www.linux-kvm.org/page/Main_Page
http://www.linux-kvm.org/page/Lists%2C_IRC
libvirt Wiki
http://libvirt.org/
Red Hat documentation
http://docs.redhat.com/docs/enUS/Red_Hat_Enterprise_Linux/index.html (should be available soon!)
Reference architecture
https://access.redhat.com/knowledge/refarch/TBD
KVM Forum 2010: performance and scalability for server consolidation
http://www.linux-kvm.org/wiki/images/7/7f/2010-forum-perf-andscalability-server-consolidation.pdf
Principled Technologies
[Backup charts: SPECvirt_sc2010 score and Tiles/Core, repeating the comparison of vmware, RHEL 5, and RHEL 6 (scores shown include 1169, 1221, 1367, 1369, and 1820), and the VMware ESX 4.1 systems on Bull SAS (32 cores), IBM x3850 X5 (32 cores), and HP DL580 G7 (40 cores) against RHEL 6 KVM, score 5466.]