Building High Performance Storage for a Hyper-V Cluster with Scale-Out File Servers over InfiniBand
Danyu Zhu
Liang Yang
Dan Lovinger
This document does not provide you with any legal rights to any intellectual property in any Microsoft product. You
may copy and use this document for your internal, reference purposes.
Microsoft, Windows, Windows Server, Hyper-V are either registered trademarks or trademarks of Microsoft
Corporation in the United States and/or other countries.
Violin Memory is a registered trademark of Violin Memory, Inc in the United States.
The names of other actual companies and products mentioned herein may be the trademarks of their respective
owners.
The following results highlight the scalability, throughput, bandwidth, and latency that can be achieved
from the platform presented in this report using two Violin WFA-64 arrays in a Scale-Out File Server
Cluster in a virtualized environment:
Throughput: linear scaling to over 2 million random read IOPS or 1.6 million random write IOPS.
Bandwidth: linear scaling to over 8.6 GB/s sequential read bandwidth or 6.2 GB/s sequential write bandwidth.
Latency: 99th percentile latencies of 4.5 ms at a load of 2 million random read IOPS, and 99th percentile latencies of 3.7-4 ms for simulated OLTP traffic at a load of 1.15 million IOPS.
Microsoft Windows Server 2012 R2 provides a continuum of availability options that protect against a wide range of failure modes, starting from availability within a single node across the storage stack and extending to multi-node availability through clustering and the Scale-Out File Server role. To provide Continuous Availability storage solutions to the volume server market, Microsoft has partnered with many industry-leading vendors to develop a set of Cluster-in-a-Box (CiB) storage platforms that provide a clustered system for simple deployment. These systems combine server blades, shared storage, cabling, and redundant power supplies into a single pre-configured and pre-cabled chassis. They enable higher levels of availability, cost-effectiveness, and easier deployment across all market segments to meet customers' different Service Level Agreements (SLAs).
Violin Windows Flash Array (WFA) is a next-generation All-Flash Array storage platform delivered through the joint efforts of Microsoft and Violin Memory, providing built-in high performance, availability, and scalability. With the integration of Violin's All Flash Array and the Microsoft Windows Server 2012 R2 Scale-Out File Server cluster, the Violin WFA provides a tier-zero and tier-one storage solution for customers' mission-critical applications in datacenters and in public and private cloud computing environments.
Figure 1 presents an overview of the Scale-Out File Server solution built using the Violin WFA-64.
In this white paper, we discuss some of the scenarios and workloads that benefit from the capabilities and the performance of the storage platform provided by the Violin WFA. A high-value scenario is Hyper-V using Scale-Out File Servers to store virtual disk files (VHD/VHDX) for VMs on remote storage shares, with inherent availability and scalability. With Violin's enterprise-class all-flash storage, Microsoft's SMB Direct protocol, and Microsoft Windows Server 2012 R2 storage features, the Violin WFA-64 is well suited as a file server solution when deploying Hyper-V over SMB.
This white paper demonstrates that synthetic virtualized IO workloads running in Hyper-V VMs can linearly scale to over two million random read IOPS and over 8.6 GB/s sequential read bandwidth with two Violin WFA-64 arrays in a Scale-Out File Server Cluster. On this platform, 99th percentile latencies of 4.5 ms can be achieved at a load of 2 million random read IOPS. For simulated OLTP IO traffic, 99th percentile latencies of 3.7-4 ms can be achieved at a load of 1.15 million IOPS. The Violin WFA, with its high performance, availability, and scalability, can easily keep up with customers' most demanding application SLAs while providing increased density and efficiency in a virtualized environment.
Table 1 presents the hardware specification for the WFA-64 arrays used in this white paper. Each Violin WFA-64 array has a raw flash capacity of 70 TB, with 44 TB of usable capacity at the default 84% format level. The Violin WFA-64 supports several different Remote Direct Memory Access (RDMA) I/O modules, including InfiniBand (IB), Internet Wide Area RDMA Protocol (iWARP), and RDMA over Converged Ethernet (RoCE). For the performance results presented in this white paper, we use Mellanox FDR InfiniBand RDMA modules. The two memory gateways in each WFA-64 array run Windows Server 2012 R2.
The WFA architecture offers sub-millisecond latency and wide stripe vRAID accelerated switched flash
for maximum performance. Figure 2 presents an overview of the Violin Windows Flash Array
architecture. The system can be divided into the following blocks:
IO Modules: The Violin WFA’s IO modules support all current RDMA protocols, including
InfiniBand, iWARP and RoCE.
Active/Active Memory Gateways (MG): The built-in Windows Server 2012 R2 makes it easy to build and configure Windows Failover Clustering across multiple Memory Gateways, manage the Windows Scale-Out File Server role, and set up Continuously Available File Shares with Cluster Shared Volume (CSV) support. Violin also provides a user-friendly control utility to manage storage disk LUN configurations for Violin storage devices.
vRAID Control Modules (VCM): The Violin WFA provides 4 Active-Active vRAID Control Modules
for full redundancy. The VCMs implement Violin Memory’s patented vRAID algorithm to manage
the flash modules in RAID mode. vRAID is specifically engineered for flash and highly optimized
for Violin’s all flash memory arrays. It delivers fabric level flash optimization, dynamic wear
leveling, advanced ECC for fine grained flash endurance management, as well as fabric
orchestration of garbage collection and grooming to maximize system level performance. vRAID
also provides Violin Intelligent Memory Module (VIMM) redundancy support and protects the
system from VIMM failures.
Flash Fabric Architecture: The Flash Fabric Architecture (FFA) implements dynamic hardware-based flash optimization. Violin's VIMMs form the core building block of the FFA. The WFA-64 model uses 64 VIMMs, with 60 active VIMMs plus 4 global hot spares. A single VIMM can contain up to 128 flash dies, so the 64-VIMM implementation in the WFA-64 contains more than 8,000 flash dies, managed as a single system by vRAID in the VCMs. Optimizing flash endurance, data placement, and performance across such a large number of dies is the key to delivering sustainable performance, low latency, and high flash endurance. The Violin Memory Flash Memory Fabric can leverage thousands of dies to make optimization decisions.
Besides performance and cost-efficiency, business-critical tier-0 and tier-1 applications place high demands on system reliability. The Violin WFA-64 provides multi-level redundancy with the capability to hot-swap all active components. The system has redundancy at all layers for hot serviceability and fault tolerance. Table 2 provides details of the Violin WFA-64 redundancy at each component layer.
Module Total
Fans 6
Power Supply 2
vRAID Controllers 4
Array Controllers 2
Memory Gateways 2
Table 2: Violin WFA-64 Multi-Level Redundancy
2.2 Next Generation All Flash Array with Full Integration of Windows Scale-Out File Server
Violin’s WFA-64 model is a next generation All-Flash Array with full integration of a Windows Scale-Out
File Server solution. Customers can set up, configure and manage their file server storage in the familiar
Windows environment using Windows native tools.
SMB Direct is compatible with SMB Multichannel to achieve load balancing and automatic failover. SMB Multichannel automatically detects multiple networks for SMB connections. It provides a simple, configuration-free form of dynamic Multipath IO (MPIO) for SMB traffic. SMB Multichannel offers resiliency against path failures and transparent failover with recovery without service disruption. By aggregating network bandwidth from multiple network interfaces, SMB Multichannel also provides much improved throughput. Server applications can then take full advantage of all available network bandwidth and become more resilient to network failure. In this white paper, the memory gateways for the Violin WFAs have been configured with multiple InfiniBand RDMA network adapters. With SMB Multichannel, the Violin WFA can fully utilize the redundancy and capacity provided by those adapters.
A failover cluster is a group of independent computers that work together to increase the availability and scalability of clustered roles. If one or more of the cluster nodes fail, the services automatically fail over to other nodes without disruption of service. The Scale-Out File Server (SOFS) role in Windows Server 2012 R2 not only provides a continuously available SMB service, but also provides a mechanism for clustered file servers in an active-active configuration to aggregate bandwidth across the cluster nodes. In continuously available file shares, persistent file handles are always opened with write-through to guarantee that data is on stable storage and durable against cluster node failure, which matches the capabilities of an All-Flash Array such as the Violin WFA-64.
Operating against a SOFS, SMB clients are transparently directed to issue their IO against the owning node of each share to achieve balancing around the cluster. Cluster Shared Volumes in a Windows Server 2012 R2 failover cluster allow multiple nodes in the cluster to simultaneously access shared storage with a consistent and distributed file namespace. CSVs therefore greatly simplify the management of a large number of LUNs in a failover cluster.
For the performance testing performed for this white paper, we create a failover cluster across two Violin WFA-64 arrays with the Scale-Out File Server role. This mode of operation is an asymmetric SOFS, since each individual CSV is served by one WFA and its pair of Violin Memory Gateways. The Scale-Out File Server Cluster supports scaling up to four Violin WFA arrays, at the supported limit of eight file server nodes (Violin Memory Gateways) per SOFS in both Windows Server 2012 and Windows Server 2012 R2, with up to 280 TB of raw flash capacity.
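As a quick arithmetic check of those limits, a minimal sketch using only the figures quoted above (four arrays, two Memory Gateways per array, and 70 TB of raw flash per array):

```python
# Derived limits for a maximally scaled SOFS cluster built from Violin WFA-64 arrays.
# Inputs are the figures quoted in the text above; the totals are simple products.
arrays_max         = 4    # supported Violin WFA arrays per Scale-Out File Server cluster
gateways_per_array = 2    # Memory Gateways (file server nodes) per WFA-64
raw_tb_per_array   = 70   # raw flash capacity per WFA-64, in TB

print(f"file server nodes : {arrays_max * gateways_per_array}")     # 8 (the SOFS node limit)
print(f"raw flash capacity: {arrays_max * raw_tb_per_array} TB")    # 280 TB
```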
Windows Server 2012 R2 provides a Hyper-V storage NUMA I/O capability that creates a number of communication channels between the guest devices and the host storage stack, with a dedicated set of virtual processors (VPs) for storage IO processing. Hyper-V storage NUMA I/O offers a more efficient I/O completion mechanism that distributes interrupts among the virtual processors to avoid expensive inter-processor interrupts. With these improvements, the Hyper-V storage stack provides scalability improvements in terms of I/O throughput to support the needs of large virtual machine configurations for data-intensive workloads like SQL Server.
VHDX is a new virtual hard disk format introduced in Windows Server 2012 that allows the creation of resilient, high-performance virtual disks up to 64 terabytes in size with online resizing capability. Microsoft recommends VHDX as the default virtual hard disk format for VMs. VHDX provides additional protection against data corruption during power failures by logging updates to the VHDX metadata structures, as well as the ability to store custom metadata. The VHDX format also supports the TRIM command, which results in smaller file sizes and allows the underlying physical storage device to reclaim unused space. Support for 4 KB logical sector virtual disks, as well as larger block sizes for dynamic and differencing disks, allows for increased performance.
In this white paper, we create a Hyper-V failover cluster that groups together multiple Hyper-V VMs
from each host to provide high availability. The data storage for those VMs is hosted in the remote SMB
shares on the Violin WFA arrays. The results presented in this white paper demonstrate extremely high
throughput and low latency for IO intensive workloads within Hyper-V VMs.
[Diagram: four Dell R820 Sandy Bridge servers (client nodes).]
The family of Intel Sandy Bridge processors has embedded PCIe lanes for improved I/O performance with reduced latency, and supports up to 160 lanes of PCIe 3.0 across a four-socket system (40 per socket).
4 Hardware Configurations
4.1 Server Configurations
Each Violin gateway (SMB server node) is a two-socket server with dual Intel Sandy Bridge Xeon Hyper-Threaded 8-core E5-2448L 1.80 GHz CPUs and 24 GB DDR3-1600 RAM, for 32 total logical processors. One socket has the direct connection to the add-on PCI Express cards (RDMA NICs), and the other has the direct connection to the NAND flash array.
The Violin WFA-64 gateway only supports PCIe 2.0 (5.0 GT/s), which is sufficient to drive the array itself. The following screen copy shows the actual PCI Express link speed and width for each add-on Mellanox RDMA NIC in the Violin gateway.
(Mellanox switch software: MLNX-OS SX_3.3.5006)
All the InfiniBand network adapters on both the Violin gateway nodes and the client nodes were using the latest driver and firmware as of September 2014. Figure 5 shows that the link speed negotiated to the switch is indeed the 54 Gbps of FDR InfiniBand.
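As a rough sanity check on these link rates, the sketch below works out the usable bandwidth of a PCIe 2.0 slot and an FDR InfiniBand port from their standard signaling rates and encodings. The x8 link width is an assumption on our part (it is what the screen copy above would confirm), not a figure taken from the paper.

```python
# Back-of-the-envelope usable data rates for the gateway I/O path.
# Assumption: each Mellanox RDMA NIC sits in a PCIe 2.0 x8 slot.

def pcie2_gbps(lanes):
    """PCIe 2.0: 5.0 GT/s per lane with 8b/10b encoding."""
    return lanes * 5.0 * 8 / 10            # usable Gb/s

def fdr_ib_gbps(lanes=4):
    """FDR InfiniBand: 14.0625 Gb/s per lane, 4 lanes, 64b/66b encoding."""
    return lanes * 14.0625 * 64 / 66       # usable Gb/s

print(f"PCIe 2.0 x8 slot : {pcie2_gbps(8):.1f} Gb/s (~{pcie2_gbps(8) / 8:.1f} GB/s)")
print(f"FDR IB port      : {fdr_ib_gbps():.1f} Gb/s (~{fdr_ib_gbps() / 8:.1f} GB/s)")
```

The FDR figure (about 54.5 Gb/s) matches the negotiated 54 Gbps reported above; the comparison also helps explain why each gateway is configured with multiple RDMA NICs, since a single PCIe 2.0 x8 slot tops out at roughly 4 GB/s.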
Figure 7 and Figure 8 show these two clusters and their configured roles.
Figure 9 shows a snapshot of Hyper-V Cluster with four Hyper-V server nodes.
Figure 10. File Server Cluster with four file server nodes
To avoid the performance hit of redirected I/O, each node (memory gateway) in the file server cluster is assigned ownership of cluster storage volumes based on the affinity of the physical storage. Figure 12 on the next page shows that \\9-1109ARR01 owns Cluster Disks 2/12/13/15, \\9-1109ARR02 owns Cluster Disks 6/7/14/17, \\9-1109ARR03 owns Cluster Disks 1/4/9/11, and \\9-1109ARR04 owns Cluster Disks 3/8/10/16. Note: since the sixteen LUNs come from two separate arrays (8 per array), the disk numbers 1 through 8 appear twice.
In this report, the storage cluster created from the two Violin WFA arrays is asymmetric. If a storage cluster is created using only a single Violin WFA array, it is considered symmetric, as direct IO is possible from both nodes (memory gateways) within the same Violin WFA array. For both symmetric and asymmetric storage, metadata operations (e.g. creating a new file) continue to go through the CSV resource owner node.
DeviceID and InstanceID can be obtained from the device instance path of the virtual SCSI controller in Device Manager within the VM.
1 More VM tunings can be found in the Windows Server 2012 R2 Performance Tuning Guide for Hyper-V Servers: http://msdn.microsoft.com/en-us/library/windows/hardware/dn567657.aspx#storageio
Figure 20 shows the VM settings for a VM running in the cluster HYPV9-1109 from the Windows Failover Cluster Manager UI. All the VMs use the same settings: each VM is configured with 16 virtual processors (VPs), one virtual NUMA node, 16 GB RAM, and one virtual SCSI controller with 16 VHDX data files attached. All the VHDX files attached to the same VM are fixed-type files of about 127 GB, hosted on the same SMB file share in the Scale-Out File Server cluster.
1 IOMeter manager
16 IOMeter worker threads
One worker per target (VHDX) and 16 targets per VM
o Warm-up time prior to measurements: 5 minutes
o Measurement run time: 5 minutes
Queue depth per thread: 64 for Random and 1 for Sequential workloads
The DISKSPD2 2.0.12 performance benchmarking tool was used to capture latency measurements due to its support for high-fidelity latency histograms. Equivalent settings were used, and are discussed in more detail in the presentation of those results. The results presented in section 6.3.4 were collected using the DISKSPD benchmarking tool.
2 DISKSPD is available as a binary and as an open source (MIT License) release.
Binary: http://aka.ms/diskspd
Source: http://github.com/microsoft/diskspd
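For reference, the aggregate scale of this synthetic load is easy to derive from the configuration above; a minimal sketch for the two-array case (all 8 VMs active, one target per worker, sizes as listed earlier):

```python
# Aggregate view of the synthetic workload configuration described above.
# The inputs are the per-VM settings from the text; the totals are derived arithmetic.
vms            = 8      # 4 Hyper-V client hosts x 2 VMs each (two-array configuration)
workers_per_vm = 16     # one worker per VHDX target, 16 targets per VM
vhdx_gb        = 127    # approximate size of each fixed VHDX file
qd_random      = 64     # queue depth per worker for random workloads
qd_sequential  = 1      # queue depth per worker for sequential workloads

workers = vms * workers_per_vm
print(f"workers / targets     : {workers}")                            # 128
print(f"aggregate working set : ~{workers * vhdx_gb / 1024:.1f} TB")   # ~15.9 TB
print(f"outstanding IOs       : {workers * qd_random} random, "
      f"{workers * qd_sequential} sequential")                         # 8192 / 128
```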
I/O Connectivity: 10GbE, 56Gb IB
Max. 4KB IOPS: 1M / 1M / 750K / 750K / 750K (by model)
Nominal Latency: <500 µsec
Our experimental results show that we can fully achieve the published data from Hyper-V VMs running on the SMB client side over the RDMA network. Table 7 summarizes the performance results achieved on this platform for monolithic and mixed-type workloads. The Violin WFA can scale linearly from one array to two arrays in terms of throughput and bandwidth.
OLTP (8K, 90% Read, 10% Write, 100% Random): 550K IOPS / 4.4 GB/s with one array; 1.1 Million IOPS / 8.8 GB/s with two arrays; scaling factor 2.0
Exchange Server (32K, 60% Read, 40% Write, 80% Random, 20% Sequential): 130K IOPS / 4.16 GB/s with one array; 260K IOPS / 8.32 GB/s with two arrays; scaling factor 2.0
Table 7. Summary of Experimental Performance Results
[Charts (including Figure 21, 4K random read, and Figure 25, OLTP): throughput (million IOPS) and bandwidth (GB/s) for each workload measured at three scales: Half Array (1 Client w/ 2 VMs), One Array (2 Clients w/ 4 VMs), and Two Arrays (4 Clients w/ 8 VMs).]
4K random reads: extending the data in Figure 21 (page 26), a sweep of increasing IO load
leading up to the 2 million 4K read IOPS result
OLTP: extending the data in Figure 25 (page 28), a sweep of increasing IO load leading up to the
1.15 million IOPS result
Each measurement was taken over a two-minute period, preceded by a two-minute warmup. For instance, two minutes at 2 million IOPS yields 240 million (2.4 × 10^8) total measured IOs.
Earlier, in Figure 21, the 2 million 4K read result was presented, as measured with each of the 8 VMs' 16 workers queueing 64 IOs. The following figure provides context for that result:
[Figure 27: 4K random read throughput (IOPS, up to 2.5 million) and 99th percentile latency (ms) versus queue depth per worker, from 1 to 64.]
As opposed to looking only at average latency, percentiles3 slice the results by answering the question of how many IOs took more or less time than a given value. In the case of the 99th percentiles shown in the figure, 99% of the IOs completed as fast or faster, and 1% took longer.
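To make the percentile idea concrete, here is a minimal sketch with a small set of hypothetical latency samples (illustrative values only, not the measured data):

```python
import numpy as np

# Hypothetical per-IO latencies in milliseconds -- one slow outlier among fast IOs.
latencies_ms = np.array([0.4, 0.5, 0.5, 0.6, 0.7, 0.8, 0.9, 1.1, 1.4, 4.5])

mean = latencies_ms.mean()                 # pulled upward by the single outlier
p50  = np.percentile(latencies_ms, 50)     # the median: half the IOs were faster
p99  = np.percentile(latencies_ms, 99)     # 99% of IOs completed at or below this value

print(f"mean = {mean:.2f} ms, median = {p50:.2f} ms, 99th percentile = {p99:.2f} ms")
```

The mean hides where the tail sits; the 99th percentile pins it down, which is why the figures in this section report percentiles.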
From the left, the figure shows that at the shallowest queue depth the WFAs were already driving
370,000 IOPS at a 99th percentile 540us latency to the VMs. Doubling the queue depth at each
3 For more discussion: http://en.wikipedia.org/wiki/Percentile
The full distribution for each of the 8 VMs in the 32 queue depth case is shown in the following figure:
[Charts: per-VM latency (ms) by percentile for the 32 queue depth case; the left panel spans the 0th to 100th percentile, and the right panel expands the tail from the 99th to the 100th percentile, with latencies up to roughly 25 ms.]
These figures now put the 99th percentile in context: a sweep from approximately 1.5-2.5 ms between the 20th and 80th percentile (over half of the total IO), and outlying latencies of between 15-25 ms, with a few around 50 ms affecting only two VMs. Although these outliers are significant, they compare very well to the behavior of locally attached flash storage, while at the same time providing the benefits of a scale-out shared storage architecture at over two million total IOPS.
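As a rough consistency check on these numbers, here is a sketch using Little's Law (mean latency equals in-flight IOs divided by throughput); the check yields the mean, so it naturally sits somewhat below the reported 99th percentile:

```python
# Little's Law: mean latency = outstanding IOs / throughput.
# Inputs are taken from the 4K random read result above.
outstanding = 8 * 16 * 64      # 8 VMs x 16 workers x 64 queued IOs = 8192 in flight
iops        = 2_000_000        # aggregate 4K random read throughput

mean_latency_ms = outstanding / iops * 1000
print(f"implied mean latency: ~{mean_latency_ms:.1f} ms")   # ~4.1 ms, vs. the 4.5 ms 99th percentile
```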
Turning to OLTP, the following results repeat the OLTP data shown earlier in Figure 25 (page 28).
[Chart: OLTP throughput (IOPS, up to 1 million) and 99th percentile latency (ms) versus queue depth per worker, from 1 to 64.]
Similar to the 4K read scaling in Figure 27, OLTP also saturates earlier than the single result presented for OLTP in Section 6.3.3, Figure 25 (page 28). A queue depth of 16 IOs for each of the 16 workers in each of the 8 VMs drives 1.08 million OLTP IOPS with a 99th percentile latency of 3.8 ms for both reads and writes. This latency correlation indeed holds across the full scaling.
The full distribution for each of the 8 VMs in the 16 queue depth case is shown in the following figure:
[Charts: per-VM latency (ms) by percentile for the 16 queue depth case; the left panel spans the 0th to 100th percentile, and the right panel expands the tail from the 99th to the 100th percentile, with latencies up to roughly 25 ms.]
The latency outliers seen for 4K reads do not reappear; instead, all VMs see a similar distribution of maximum latencies in the range of 15-22 ms. The median of the distribution centers on 1.8 ms.
With a total queue depth of 8 VMs x 16 workers x 16 IOs = 2,048 IOs in flight from the VMs to the WFAs, there is ample capacity in the system. This demonstrates the capability of the scale-out Windows Server Hyper-V and Violin WFA solution to handle high-intensity application loads.
7 Conclusion
Violin Windows Flash Array (WFA) is a next generation All Flash Array. Through the joint efforts of Microsoft and Violin Memory, the Windows Flash Array provides a tier-zero and tier-one storage solution for critical applications, transforming the speed of business by providing high performance, availability, and scalability in a virtualized environment with low-cost management. The results presented in this white paper show the high throughput and low latency that can be achieved using Microsoft technologies bundled with Violin hardware. With two Violin WFA-64 arrays, workloads running in Hyper-V VMs can scale linearly to over two million IOPS for random reads or 1.6 million IOPS for random writes, and to 8.6 GB/s of bandwidth for sequential reads or 6.2 GB/s for sequential writes. Even at the maximum throughput of 2 million IOPS, the 99th percentile latency is still capped at 4.5 ms, and the latency of simulated OLTP IO traffic at a load of 1.15 million IOPS is capped at 3.7-4 ms as well.
References
[1] Achieving over 1-Million IOPS from Hyper-V VMs in a Scale-Out File Server Cluster using Windows
Server 2012 R2: http://www.microsoft.com/en-us/download/details.aspx?id=42960
[2] Windows Storage Server Overview: http://technet.microsoft.com/en-us/library/jj643303.aspx
[3] Storage Quality of Service for Hyper-V: http://technet.microsoft.com/en-us/library/dn282281.aspx
[7] Windows Server 2012 R2 Performance Tuning Guide for Hyper-V Servers:
http://msdn.microsoft.com/en-us/library/windows/hardware/dn567657.aspx#storageio
Hyper-V: Rick Baxter, John Starks, Harini Parthasarathy, Senthil Rajaram, Jake Oshins
Windows Fundamentals: Brad Waters, Jeff Fuller, Ahmed Talat
File Server: Jim Pinkerton, Jose Barreto, Greg Kramer, David Kruse, Tom Talpey, Spencer Shepler
Windows Cluster: Elden Christensen, Claus Joergensen, Vladimir Petter
Windows Server and System Center: John Loveall, Jeff Woolsey
Windows Storage: Scott Lee, Michael Xing, Darren Moss
Networking: Sudheer Vaddi, Don Stanwyck, Rajeev Nagar
The authors would also like to thank our industry partners, including Violin Memory and Mellanox, for providing their latest product samples, which allowed us to build the test infrastructure for the performance experiments discussed in this paper. The product pictures used in this report are also provided courtesy of these partners. In particular, we want to give our special thanks to the following people for their help: