VMware vSphere APIs - Array Integration (VAAI)
Please visit
https://core.vmware.com/resource/vmware-vsphere-apis-array-integration-vaai for the latest version.
Table of contents
ATS
XCOPY
WRITE_SAME
UNMAP
Acknowledgements
It is not essential to understand all of these operations in the context of this paper. It is sufficient to understand that various VMFS
metadata operations require a lock.
ATS is an enhanced locking mechanism designed to replace the use of SCSI reservations on VMFS volumes when doing metadata
updates. A SCSI reservation locks a whole LUN and prevents other hosts from doing metadata updates of a VMFS volume when one
host sharing the volume has a lock. This can lead to various contention issues when many virtual machines are using the same
datastore. It is a limiting factor for scaling to very large VMFS volumes.
ATS is a lock mechanism that modifies only a disk sector on the VMFS volume. When successful, it enables an ESXi host to perform
a metadata update on the volume. This includes allocating space to a VMDK during provisioning, because certain characteristics
must be updated in the metadata to reflect the new size of the file. The introduction of ATS addresses the contention issues with
SCSI reservations and enables VMFS volumes to scale to much larger sizes. ATS works with a test image and a set image: as long as the on-disk image matches the expected test image during the compare, the host knows it can safely write the set image and update the lock.
In vSphere 4.0, VMFS3 used SCSI reservations for establishing the lock, because there was no VAAI support in that release. In
vSphere 4.1, on a VAAI-enabled array, VMFS3 used ATS for only operations 1 and 2 listed previously, and only when there was no
contention for disk lock acquisitions. VMFS3 reverted to using SCSI reservations if there was a multi-host collision when acquiring
an on-disk lock using ATS.
The use of ATS to maintain “liveness” of heartbeats (operation 5 above) was introduced in vSphere 5.5U2. Prior to this release,
SCSI reservations were used to maintain the “liveness” of the heartbeat.
One point of note with VMFS3 using ATS: if there was a mid-air collision between two hosts trying to lock the same sector via ATS, the hosts would revert to SCSI reservations on retry. For VMFS5 and VMFS6 datastores formatted on a VAAI-enabled array, all the critical section functionality from operations 1 to 9 listed above is done using ATS. There should no longer be any SCSI reservations on VAAI-enabled VMFS5, and ATS continues to be used even if there is contention.
In the initial VAAI release, the ATS primitives had to be implemented differently on each storage array, requiring a different ATS
opcode depending on the vendor. ATS is now a standard T10 SCSI command and uses opcode 0x89 (COMPARE AND WRITE).
On non-VAAI enabled storage, SCSI reservations continue to be used for establishing critical sections in VMFS5.
You cannot turn on the ATS only flag on VMFS5 volumes upgraded from VMFS3. The ATS only flag can be manually enabled or
disabled on a newly created VMFS5. To manually enable it, you can use the hidden option:
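A sketch of how this is typically done with the hidden vmkfstools option, where the device path is a placeholder for the head extent of the actual volume (verify the exact syntax against VMware documentation for your ESXi release before use):
# vmkfstools --configATSOnly 1 /vmfs/devices/disks/naa.xxxxxxxxxxxxxxxx:1
Passing 0 instead of 1 disables the ATS only flag again.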
Since there is no upgrade path from VMFS3 or VMFS5 to VMFS6, this is not a concern for VMFS6 volumes.
ATS Miscompare
When ATS was introduced for maintaining heartbeat “liveness” in vSphere 5.5U2, some issues arose. Some storage arrays
returned ATS miscompares when the array was overloaded. In another scenario, reservations on the LUN on which the VMFS
resided also caused ATS miscompares. In another case, the ATS “set” was correctly written to disk, but the array still returned a
miscompare.
But it was not all array-related issues. Situations also arose where VMFS itself detected an ATS miscompare incorrectly. In this case, a heartbeat ATS I/O timed out and VMFS canceled it, but before the cancel took effect, the ATS “set” had actually made it to disk. VMFS then retried the ATS using the original “test” image, on the assumption that the canceled ATS never reached the disk. Because the earlier “set” had already been written, the on-disk image no longer matched the “test” image, so the array returned an ATS miscompare even though the heartbeat was valid.
When an ATS miscompare is received, all outstanding I/O is canceled. This led to additional stress and load being placed on the storage arrays, and degraded performance.
In vSphere 6.5, new heuristics were added so that when an ATS miscompare event is encountered, VMFS reads the on-disk heartbeat data and verifies it against the ATS “test” and “set” images to determine whether there is actually a real miscompare. If the miscompare is real, the behavior is the same as before: all outstanding I/O is canceled. If the on-disk heartbeat data has not changed, this is a false miscompare. In the event of a false miscompare, VMFS does not immediately cancel I/Os; instead, it re-attempts the ATS heartbeat operation after a short interval (usually less than 100ms).
NOTE: Some storage arrays, on receipt of the WRITE_SAME SCSI command, will write zeroes directly to disk. Other arrays do not
need to write zeroes to every location; they simply do a metadata update to write a page of all zeroes. Therefore, it is possible that
you will observe significant differences in the performance of this primitive, depending on the storage array.
UNMAP
One VAAI primitive, UNMAP, enables an ESXi host to inform the storage array that space, previously occupied by a virtual machine
that has been migrated to another datastore or deleted, can now be reclaimed. This is commonly referred to as ‘dead space
reclamation’ and enables an array to accurately report space consumption of a thin-provisioned datastore as well as enabling
users to monitor and correctly forecast new storage requirements.
The mechanism for conducting a space reclamation operation has been regularly updated since the primitive was introduced in
vSphere 5.0. Initially, the operation was automatic. When a virtual machine was migrated from a datastore or deleted, the UNMAP
was called immediately and space was reclaimed on the array. There were some issues with this approach, primarily regarding
performance and an array’s ability to reclaim the space in an optimal time frame. For this reason, the UNMAP operation was
changed to a manual process. However, with the release of vSphere 6.5, the UNMAP primitive is once again automatic, as long as the underlying datastore is formatted with VMFS6, where it is enabled automatically. The automated UNMAP crawler mechanism for reclaiming dead or stranded space on VMFS datastores now runs continuously in the background.
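As an illustration, the automatic reclamation settings of a VMFS6 datastore can be inspected with esxcli (the datastore name below is a placeholder, and the output shown is indicative only):
# esxcli storage vmfs reclaim config get -l Datastore01
   Reclaim Granularity: 1048576 Bytes
   Reclaim Priority: low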
With esxcli, users can display device-specific details regarding Thin Provisioning and VAAI.
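For example (the naa identifier is a placeholder for a real device ID, and only a subset of the output is shown):
# esxcli storage core device list -d naa.xxxxxxxxxxxxxxxx
naa.xxxxxxxxxxxxxxxx
   Thin Provisioning Status: yes
   VAAI Status: supported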
Users can also display the VAAI primitives supported by the array for that device, including whether the array supports the UNMAP
primitive for dead space reclamation (referred to as the Delete Status). Another esxcli command is used for this step, as is shown
in the following:
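For example, again using a placeholder device identifier:
# esxcli storage core device vaai status get -d naa.xxxxxxxxxxxxxxxx
naa.xxxxxxxxxxxxxxxx
   VAAI Plugin Name:
   ATS Status: supported
   Clone Status: supported
   Zero Status: supported
   Delete Status: supported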
The device displays “Delete Status: supported” to indicate that the host can send SCSI UNMAP commands to the array for that device when a space reclamation operation is requested. If a Storage vMotion operation is initiated and a virtual machine is moved from a source datastore to a destination datastore, the array can then reclaim the space freed on the source datastore.
The granularity of the reclaim is set to 1MB chunks. Automatic UNMAP is not supported on arrays with an UNMAP granularity greater than 1MB. Therefore, customers should check the footnotes for their storage array in the VMware Hardware Compatibility Guide (HCL) to confirm whether the Auto UNMAP feature is supported.
Disabling the UNMAP primitive does not affect any of the other Thin Provisioning primitives such as Thin Provisioning Stun and the
space threshold warning. All primitives are orthogonal.
Note that UNMAP is only automatically issued to VMFS datastores that are VMFS-6 and have powered-on VMs. Since UNMAP is
automated, it can take 12-24 hours to fully reclaim any dead space on the datastore. UNMAP can still be run manually against
older VMFS volumes.
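For older VMFS volumes, a manual reclaim can be issued with esxcli; the datastore name below is a placeholder, and the reclaim unit (the number of VMFS blocks processed per iteration) is optional:
# esxcli storage vmfs unmap -l Datastore01 -n 200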
TRIM Considerations
TRIM is the ATA equivalent of SCSI UNMAP. A TRIM operation gets converted to UNMAP in the I/O stack, which is SCSI. However,
there are some issues with TRIM getting converted into UNMAP. UNMAP works at certain block boundaries on VMFS, whereas TRIM
does not have such restrictions. While this should be fine on VMFS-6, which is now 4K aligned, certain TRIMs converted into
UNMAPs may fail due to block alignment issues on previous versions of VMFS.
Extended Statistics
This primitive enables visibility into space usage on NAS datastores. This is especially useful for thin-provisioned datastores
because it enables vSphere to display actual space usage statistics in situations where oversubscription was used. Previously,
vSphere administrators needed to use array-based tools to manage and monitor how much space a thinly provisioned VMDK was
consuming on a back-end datastore. With this new primitive, this information can be surfaced up in the VMware vSphere Client™.
This enables vSphere administrators to have a much better insight into storage resource consumption, without the need to rely on
third-party tools or engage with their storage array administrators.
Reserve Space
In previous versions of vSphere, only a thin VMDK could be created on NAS datastores. This new reserve space primitive enables
the creation of thick VMDK files on NAS datastores, so administrators can reserve all of the space required by a VMDK, even when
the datastore is NAS. As shown in Figure 1, users now have the ability to select lazy-zero or eager-zero disks on NAS datastores.
The default XCOPY transfer size is 4MB; it is controlled by the DataMover/MaxHWTransferSize advanced parameter (specified in KB), which can be queried and changed as follows:
# esxcfg-advcfg -g /DataMover/MaxHWTransferSize
Value of MaxHWTransferSize is 4096
# esxcfg-advcfg -s 16384 /DataMover/MaxHWTransferSize
Value of MaxHWTransferSize is 16384
NOTE: Please use extreme caution when changing this advanced parameter. This parameter is a global parameter, so it will impact
ALL storage arrays attached to the host. While a storage array vendor might suggest making a change to this parameter for
improved performance on their particular array, it may lead to issues on other arrays which do not work well with the new setting,
including degraded performance.
The data-out buffer of the WRITE_SAME command contains all zeroes. A single zero operation has a default zeroing size of 1MB.
The maximum number of outstanding WRITE_SAME commands is 32. We currently do not support changing the WRITE_SAME size
of 1MB.
One can observe the clone operations (XCOPY), ATS operations, and zero operations (WRITE_SAME) in an esxtop output. The
following is an esxtop output showing the various block primitives:
In this—and only this—instance, the latencies observed in esxtop should not be interpreted as a performance issue, unless there
are other symptoms present.
Figure 5. esxtop output showing the VAAI block primitive counters.
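One way to reach this view (the exact field letters vary between ESXi releases) is:
# esxtop
(press u for the disk device view, then f to add fields and enable VAAISTATS and VAAILATSTATS/cmd)
Counters such as CLONE_RD/CLONE_WR/CLONE_F (XCOPY), ATS/ATSF, ZERO/ZERO_F and DELETE/DELETE_F are then shown per device.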
VAAI Status
Hardware Acceleration status is determined by the support state of the various primitives. If ATS is supported, the UI displays Hardware Acceleration as Supported. If, however, ATS, XCOPY and Zero all are unsupported, Hardware Acceleration displays as Unsupported. All other support states set the Hardware Acceleration status to Unknown, which typically is a state that is displayed until the first clone and zero operation is initiated. Many users initiate an eagerzeroedthick clone of a file to test the VAAI status.
In vSphere 6.5, VMFS6 datastores also display a Space Reclamation field. This indicates whether or not Automatic UNMAP is available. In vSphere 6.5, the only Space Reclamation priority available is Low; this cannot be changed.
ATS
To check the status of the ATS primitive and to turn it on and off at the command line, the following commands can be used:
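For example, with esxcli (shown as an illustration; a value of 1 enables the primitive and 0 disables it):
# esxcli system settings advanced list -o /VMFS3/HardwareAcceleratedLocking
# esxcli system settings advanced set -i 0 -o /VMFS3/HardwareAcceleratedLocking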
XCOPY
To turn the Extended Copy primitive for cloning on and off, use the previous command with the following advanced setting:
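The setting commonly used for this is /DataMover/HardwareAcceleratedMove, for example:
# esxcli system settings advanced set -i 0 -o /DataMover/HardwareAcceleratedMove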
WRITE_SAME
To turn the Write Same primitive for zeroing blocks on and off, use the following advanced setting:
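The setting commonly used for this is /DataMover/HardwareAcceleratedInit, for example:
# esxcli system settings advanced set -i 0 -o /DataMover/HardwareAcceleratedInit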
UNMAP
On vSphere releases prior to 6.5, and on earlier versions of VMFS, to turn the UNMAP primitive for space reclamation on and off, use the following advanced setting:
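The setting commonly used for this is /VMFS3/EnableBlockDelete, for example (a value of 1 enables UNMAP handling and 0 disables it):
# esxcli system settings advanced set -i 0 -o /VMFS3/EnableBlockDelete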
In vSphere 4.1, it was possible to define how often an ESXi host verified whether hardware acceleration was supported on the
storage array. This means that if at initial deployment, an array does not support the offload primitives, but at a later date the
firmware on the arrays is upgraded and the offload primitives become supported, nothing must be done regarding ESXi—it
automatically will start to use the offload primitives.
# esxcfg-advcfg -g /DataMover/HardwareAcceleratedMoveFrequency
Value of HardwareAcceleratedMoveFrequency is 16384
The value relates to the number of I/O requests that occur before another offload is attempted. With an I/O size of 32MB, 512GB of
I/O must flow before a VAAI primitive is retried.
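That is, 16,384 I/O requests x 32MB per I/O = 524,288MB, or 512GB of data moved by the software Data Mover before the offload is attempted again.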
The same is true for an offload failure. If the array returns a failure status for an offload operation, the software Data Mover is used
to continue the operation. After another 16,384 I/O requests using the software Data Mover, the VAAI offload is once again
attempted.
HardwareAcceleratedMoveFrequency exists only in vSphere 4.1. In later versions of vSphere, the parameter was replaced with a
periodic VAAI state evaluation that runs every 5 minutes.
The following diagram shows the various paths taken by the various Data Movers:
The decision starts by looking at the storage array hardware acceleration state. If the VMkernel decides to transfer using VAAI and the offload then fails, a scenario called degraded mode, it will try to accomplish the transfer using the software Data Mover. It should be noted that the operation is not restarted; rather, it picks up from where the previous transfer left off, as we do not want to abandon what could be very many gigabytes of already-copied data because of a single transient transfer error. When reverting to the software Data Mover, we cannot operate in terms of the XCOPY transfer size of 4MB, due to memory buffer requirements, so the transfer size is reduced to 64KB.
VAAI Caveats
This section covers known caveats related to VAAI. In the cases described below, hardware offloads to the array will not be leveraged; the software Data Mover will be used instead:
Misalignment
The software Data Mover is used if the logical address and/or transfer length in the requested operation is not aligned to the
minimum alignment required by the storage device.
If the VMFS datastores are created using the vSphere Client, it is likely that all requests will be aligned. If a LUN is incorrectly partitioned using low-level tools such as fdisk or partedUtil, the operation might revert to using the software Data Mover. Because customers typically use the vSphere Client and vCenter exclusively to create datastores, it is unlikely that this issue would be encountered. If a customer’s environment was migrated from ESX 2.x/VMFS2 in the distant past, however, the volumes might be misaligned. This can be an issue on some VAAI arrays that have a high alignment requirement.
Automatic UNMAP also relies on alignment, either on VMFS6 or in-guest. Customers are advised to check the footnote in the VMware HCL to confirm whether or not their storage array supports automatic UNMAP.
ATS use on multi-extent datastores:
New VMFS5/6 on VAAI hardware: allows spanning only on ATS hardware
Upgraded VMFS5: ATS, except when locks are on the non-head extent
VMFS3: ATS, except when locks are on the non-head extent
Acknowledgements
Thank you to Paudie O’Riordan and Duncan Epping for reviewing the contents of this paper.