Pure Storage and VMware VAAI

Today I posted a new document to our repository on purestorage.com: Pure Storage and VMware Storage APIs for Array Integration—VAAI. This new white paper describes in detail the VAAI block primitives that VMware offers and that we support. It also sets performance expectations, comparing operations before and after VAAI and showing how they behave at scale, and it lists a handful of best practices along with the why and how behind each recommendation.

I have to say, especially when it comes to XCOPY, I have never seen a storage array handle it so well. It is really quite impressive how fast XCOPY sessions complete and how scaling up (in the number of VMs or the size of the VMDKs) doesn’t weaken the process at all. The main purpose of this post is to alert you to the new document, but I will also go over some high-level performance information. Read the document for the details and more.


[Image: cover of the Pure Storage and VMware VAAI white paper]

The Pure Storage FlashArray supports the following VAAI primitives:

  1. Full Copy (XCOPY)
  2. Hardware-Assisted Locking (Atomic Test & Set)
  3. Block Zero (WRITE SAME)
  4. Dead Space Reclamation (UNMAP)

I will show a couple of examples of 1-3, but I have covered UNMAP thoroughly in posts here and here so I’ll skip over that.

VAAI and Pure Storage work out of the box without any special configuration. Beyond standard multipathing best practices, there are only two non-default settings Pure recommends for the best VAAI performance; WRITE SAME and ATS require nothing. For XCOPY we advise setting MaxHWTransferSize to its maximum of 16 MB. This will not make a night-and-day difference, but XCOPY does move somewhat quicker with the larger transfer size. See the white paper or here for information on setting it. For UNMAP (well, the esxcli version in 5.5), we recommend using a larger block count so the UNMAP process finishes sooner; around 60,000 blocks works well.
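For reference, here is a minimal PowerCLI sketch of those two recommendations, assuming an existing Connect-VIServer session; the host name esx01 and datastore name MyDatastore are hypothetical, and the Get-EsxCli argument names are worth double-checking against your PowerCLI version before running this:

```powershell
# Raise the XCOPY transfer size to its 16 MB maximum (the setting is expressed in KB).
$esxHost = Get-VMHost -Name "esx01"
Get-AdvancedSetting -Entity $esxHost -Name "DataMover.MaxHWTransferSize" |
    Set-AdvancedSetting -Value 16384 -Confirm:$false

# Run a space reclamation with a larger block count so it finishes sooner.
# Shell equivalent on the host: esxcli storage vmfs unmap -l MyDatastore -n 60000
$esxcli = Get-EsxCli -VMHost $esxHost -V2
$unmapArgs = $esxcli.storage.vmfs.unmap.CreateArgs()
$unmapArgs.volumelabel = "MyDatastore"   # reclaim against this VMFS datastore
$unmapArgs.reclaimunit = 60000           # blocks reclaimed per esxcli iteration
$esxcli.storage.vmfs.unmap.Invoke($unmapArgs)
```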

Full Copy

XCOPY on the FlashArray really blew me away. First, there are ZERO caveats on the array side that would prevent XCOPY from being engaged (there are, of course, some on the VMware side). So regardless of the storage configuration, if XCOPY is enabled on ESXi and ESXi wants to use it, we will use it. Second, the performance is very good, by far the best I have worked with. There are no XCOPY queue limits per device or per array, so whatever and however much you want to offload will be offloaded.
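If you want to confirm the ESXi side actually has these offloads switched on, a quick PowerCLI check along these lines will show it (the host name is hypothetical; a value of 1 means the primitive is enabled):

```powershell
# The ESXi-side VAAI switches: 1 = enabled, 0 = disabled.
$vaaiSettings = @(
    "DataMover.HardwareAcceleratedMove"   # XCOPY (Full Copy)
    "DataMover.HardwareAcceleratedInit"   # WRITE SAME (Block Zero)
    "VMFS3.HardwareAcceleratedLocking"    # ATS (Hardware Assisted Locking)
)
$esxHost = Get-VMHost -Name "esx01"
foreach ($setting in $vaaiSettings) {
    Get-AdvancedSetting -Entity $esxHost -Name $setting |
        Select-Object Entity, Name, Value
}
```

With that covered, let's look at some use cases: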

All tests will use the same virtual machine:

  • Windows Server 2012 R2 64-bit
  • 4 vCPUs, 8 GB Memory
  • One zeroedthick 100 GB virtual disk containing 50 GB of data

[Chart: Storage vMotion XCOPY durations for the three power-state tests]

Note that this chart has three parts. The first test was a Storage vMotion with the VM powered on but basically idle. The second was with the VM powered off. The third had the VM powered on and running a solid workload.
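If you want to run a similar comparison yourself, a single Storage vMotion can be timed with a short PowerCLI sketch like this (the VM and datastore names are hypothetical):

```powershell
# Time a Storage vMotion of one VM to another FlashArray datastore.
$vm = Get-VM -Name "win2012r2-test"
$targetDatastore = Get-Datastore -Name "PureDatastore02"

$elapsed = Measure-Command {
    Move-VM -VM $vm -Datastore $targetDatastore | Out-Null
}
"Storage vMotion completed in {0:N1} seconds" -f $elapsed.TotalSeconds
```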

Basically, Storage vMotion/migration on Pure is extremely fast regardless of the power state. Virtual disk type does make a difference though; see my post here about that. You will see similar performance with deploy-from-template operations or standard cloning. It also scales quite well. Using the same VM converted into a template, I deployed 32 virtual machines from it with a PowerCLI script (a rough sketch of the approach appears below). The entire 32-VM deployment took about 56 seconds from the time the first machine started deploying to the time the last one finished. Each individual VM took 13-17 seconds apiece to copy. So it took a bit longer on a per-VM basis, but deploying 32 VMs in under a minute overall is not too shabby! Scaling up even more, I deployed 128 VMs in 3 minutes and 39 seconds, with the average per-VM deployment time similar to before, 13-17 seconds. Note that these overall times may vary because vCenter/ESXi have some concurrency limits, so there is a certain amount of queuing on the VMware side of things; only so many operations occur at once. Overall times may therefore be higher or lower, but individual VM times should be about the same. The concurrency limit in my test was 16 VM copy operations at a time.
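The exact deployment script is not included here, but a rough PowerCLI sketch of the approach looks something like the following; the template, host, and datastore names are hypothetical:

```powershell
# Deploy N VMs from a template asynchronously, then wait for all of the clones to finish.
$template  = Get-Template -Name "win2012r2-template"
$vmHost    = Get-VMHost -Name "esx01"
$datastore = Get-Datastore -Name "PureDatastore01"
$count     = 32

$tasks = 1..$count | ForEach-Object {
    New-VM -Name ("xcopy-test-{0:D3}" -f $_) -Template $template `
           -VMHost $vmHost -Datastore $datastore -RunAsync
}

# vCenter/ESXi queue these internally; only so many clone operations run at once.
Wait-Task -Task $tasks | Out-Null
```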

[Chart: XCOPY deploy-from-template scaling from 1 to 128 VMs]

The chart above plots a trendline through the three tests just mentioned plus a fourth (1 VM, 32 VMs, 64 VMs and 128 VMs) against time to complete, and the progression is pretty much perfectly linear.

Hardware-Assisted Locking

This feature is somewhat harder to demonstrate, especially at small scale, but it can be done. In this scenario, a virtual machine ran a workload against five virtual disks that all resided on the same datastore as 150 virtual machines, which were all booted up simultaneously. Looking at the perfmon charts below, it is easy to see that with hardware-assisted locking disabled the workload is deeply disrupted, resulting in inconsistent and inferior performance during the boot storm; both the IOPS and throughput vary wildly throughout the test. When hardware-assisted locking is enabled, the disruption is almost entirely gone and the workload proceeds unfettered.
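The ATS-off half of a comparison like this comes from flipping the relevant ESXi advanced setting before the boot storm. Here is a minimal PowerCLI sketch, with a hypothetical host name, and keeping in mind that disabling ATS is strictly a lab exercise:

```powershell
# Disable hardware-assisted locking (ATS) for the "off" test run (lab use only).
$esxHost = Get-VMHost -Name "esx01"
Get-AdvancedSetting -Entity $esxHost -Name "VMFS3.HardwareAcceleratedLocking" |
    Set-AdvancedSetting -Value 0 -Confirm:$false

# ...run the boot storm and capture perfmon data...

# Re-enable ATS afterwards (the default and recommended state).
Get-AdvancedSetting -Entity $esxHost -Name "VMFS3.HardwareAcceleratedLocking" |
    Set-AdvancedSetting -Value 1 -Confirm:$false
```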

Note that throughput is measured in MB/s but is plotted at ten times its actual value so that it fits readably on the chart alongside the IOPS values. So a throughput number of 1,000 on the chart is actually 100 MB/s.

[Charts: workload IOPS and throughput during the boot storm, ATS off vs. ATS on]

As you can see, when ATS is not enabled the boot storm does a number on the workload for its entire duration.

Block Zero

Block Zero (WRITE SAME) is one of the most used VAAI primitives, save perhaps ATS. Any time new space is allocated, WRITE SAME jumps into action: formatting a VMFS volume, creating an eagerzeroedthick VMDK, writing to new blocks in a zeroedthick or thin VMDK, and so on. Deployment time for a new eagerzeroedthick VMDK is probably the most cited example of the WRITE SAME benefit, mostly because the improvement there is the most obvious. There are a variety of other benefits too, but I will show the EZT use case here. For more information check the VAAI white paper or this post here.
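A comparison like the one below can be reproduced by timing the creation of an eagerzeroedthick disk, toggling WRITE SAME via the DataMover.HardwareAcceleratedInit setting between runs. A minimal PowerCLI sketch, with a hypothetical VM name and disk size:

```powershell
# Time the creation of an eagerzeroedthick virtual disk on an existing VM.
$vm = Get-VM -Name "win2012r2-test"

$elapsed = Measure-Command {
    New-HardDisk -VM $vm -CapacityGB 100 -StorageFormat EagerZeroedThick | Out-Null
}
"EZT disk created in {0:N1} seconds" -f $elapsed.TotalSeconds
```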

Here are simple tests of deploying eagerzeroedthick virtual disks with WRITE SAME off and on.

[Chart: eagerzeroedthick deployment times with WRITE SAME off vs. on]

It is a rather profound performance increase regardless of the virtual disk size. Replacing those contiguous zero writes with WRITE SAME SCSI commands makes quite the difference. Not much more to say here; I think the chart speaks for itself.

Okay, that’s enough on this. There is no reason to rewrite the white paper here; check it out! Let me know if you have questions, either in the comments or on Twitter.
