I have been working with VMware’s vCenter Site Recovery Manager since the tail end of the 1.x release and I have to say this is the most excited I have been about a Storage Replication Adapter release that I can remember. Since I started with Pure in late April 2014 I have been working with our development team and product management to design and shape this initial release of the Pure Storage SRA. I have to say it has been a blast–a really great team that does some really amazing work! It is now officially approved and posted on VMware’s compatibility guide and SRA download site:
One of the issues I would run into quite often working with previous SRAs was complexity. A lot of setup and configuration was required and this configuration was not particularly flexible. As the environment changed or grew it took a lot of remediation to make sure SRM test/recovery didn’t break. The primary goal for the initial release of our SRA was to avoid complexity for the user. Anything we could automate or mask from the end user, we would automate or mask. Seeing the end result I think we definitely achieved this. We worked on simplicity from a few different fronts: configuration, operation and troubleshooting.
Oftentimes configuration was the most difficult part of using an SRA, which usually coincided with the complexity of configuring the replication itself. Test failover could at times be a nightmare. Our goal was that the user should not have to pre-create anything besides the volume they want to replicate. So let’s walk through the configuration of the Pure SRA:
- Install the SRA on the protected and remote SRM servers. Pretty straightforward process for all SRAs–just a quick installation wizard.
- Enable replication between your source and target FlashArrays. Prior to replicating any volume remotely you need to allow replication–this prevents just anyone from replicating to any array they choose. The connected FlashArrays will show up as “enabled pairs” in SRM after the array manager is configured.
- Configure the array managers. In order to discover array pairs and replicated devices you need to create “instances” of an SRA, which in SRM are called array managers. There is nothing special that needs to be configured on the array or in between the SRA and the array to allow this. If the array is running Purity 4.0 you are good to go. Put in the IP address (or FQDN) of the local and peer FlashArrays and valid credentials and configuration is done! You will be able to enable array pairs and discover devices immediately. No additional software to install or daemons to configure.
- Replicate devices. Figure out what devices you want to replicate and either create a new protection policy on the FlashArray or add them to an existing one (technically this is called a protection group on the array, but for this post I will use “policy” so as not to confuse it with SRM protection groups). Once a device is in a protection policy, decide your local and remote replication policy (only remote is required for SRM though) and to what array and what volumes. About 8 or 10 clicks at most.
- Configure hosts or host group on the recovery FlashArray. Create entries for any hosts/host groups (host usually means one ESXi server and host group is a collection of ESXi servers, i.e. a cluster).
- At this point no more work needs to be done outside of SRM to run tests/cleanup/recovery/reprotect for existing replicated devices! Everything else is automated and ready to go. Create your SRM protection groups and recovery plans and go nuts.
All of that configuration can be done in about 8 minutes (in reality it can be done even quicker than that). See the below video that I created that walks through all of it. I did include a voice over for the video.
Operationally the Pure Storage SRA is quite simple as well.
- Preconfiguration requirements. There really aren’t any besides what is listed above. While any replication has static source volumes, our replication also does not require static/pre-configured remote/target volumes. Users simply set up their protection policy and it replicates the snapshots over. When you want to use a particular snap (or restore from it etc.) you simply associate that snap with an actual volume and Purity will create a metadata copy for that selected volume which can then be presented to a host or hosts. The original snapshot is not changed though, so it can be re-used over and over.
- In discovery we just tell SRM our target volume is “replica-of-<source-volume-name>”, which has yet to be created. This volume will only be created (and deleted) as needed, automatically, by the SRA. During a test recovery we will create the volume(s), present them to the recovery cluster for the duration of the test and then delete them during cleanup. The same goes for an actual recovery–the volume is created on-demand and the original source volume is removed during reprotection. You don’t need to pre-create volumes ever on the recovery side.
- Test recovery PiT options. The first release (as it stands now) will support two options for test recovery PiT. We will either replicate over the latest changes at the time of the test recovery in order to run the test or we will just use the last copy that was automatically made by the protection policy of the device(s). This is decided by whether the user leaves selected or deselects the “Replicate recent changes to the recovery site” option built into the SRM test recovery initiation wizard. In a future release we plan on offering the ability to choose any existing PiT to recover to.
- The next feature is something we are leveraging from SRM itself. SRM optionally provides a feature to the SRA called “dynamic access restrictions”. This causes SRM (when the SRA tells it that this is supported) to inform the SRA upon a test or recovery what WWNs or IQNs a given volume needs to be presented to for that operation to succeed. When we get this information, we analyze the configured hosts or host groups to see what matches them. When we find a match we will attach the volume automatically to the appropriate hosts/host groups. While this does require pre-configuration of the hosts/host groups on the FlashArray, this is a one-time operation.
- The reprotect operation is automated too. You do not need to pre-create a protection policy on the remote side–when we perform a reprotect, which instantiates replication for a set of devices back to the original site, we will analyze the original protection policy (or policies) and create them on the recovery site. This provides the same SLA the devices were originally configured with, but in the opposite direction.
- No granularity of failover restrictions. The Pure SRA will never require any two volumes to be failed over together. It will always allow the user to fail over a single volume without affecting other volumes, even if it is in a protection policy with other volumes. The only restrictions will be enforced by SRM itself, like if a VM spans two volumes they will of course have to be failed over in unison.
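To make the on-demand target volume and the dynamic access restrictions steps above a bit more concrete, here is a minimal Python sketch of the naming and matching logic. It is only an illustration under stated assumptions, not the SRA's actual code: the `Host` class and the `normalize`, `replica_name` and `hosts_to_connect` helpers are hypothetical names, the sample WWNs and IQN are made up, and the rule that a host matches when it owns at least one requested initiator is my simplification.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class Host:
    """Hypothetical model of a host entry on the recovery FlashArray."""
    name: str
    initiators: Set[str] = field(default_factory=set)  # WWNs or IQNs
    host_group: Optional[str] = None  # e.g. the ESXi cluster, if any


def normalize(initiator: str) -> str:
    """Lower-case an initiator; drop colon separators from WWNs only."""
    value = initiator.strip().lower()
    if value.startswith(("iqn.", "eui.", "naa.")):
        return value  # iSCSI names keep their colons
    return value.replace(":", "")


def replica_name(source_volume: str) -> str:
    # Naming scheme described in the post: "replica-of-<source-volume-name>".
    return f"replica-of-{source_volume}"


def hosts_to_connect(requested, hosts):
    """Return the sorted host group (or host) names matching the initiators
    that SRM says the volume must be presented to."""
    wanted = {normalize(i) for i in requested}
    targets = set()
    for host in hosts:
        if wanted & {normalize(i) for i in host.initiators}:
            # Prefer the host group so the whole cluster sees the volume.
            targets.add(host.host_group or host.name)
    return sorted(targets)


if __name__ == "__main__":
    hosts = [
        Host("esx01", {"21:00:00:24:FF:4C:A1:01"}, "prod-cluster"),
        Host("esx02", {"21:00:00:24:FF:4C:A1:02"}, "prod-cluster"),
        Host("backup01", {"iqn.1998-01.com.vmware:backup01"}),
    ]
    print(replica_name("sql-data-01"))
    print(hosts_to_connect(["21000024ff4ca101"], hosts))
```

In this sketch a single matching initiator is enough to attach the volume to the whole host group, which mirrors the idea that the hosts/host groups only need to be defined once on the recovery array.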
Another item we put a lot of work into getting right is the logging. Instead of logging to one big log file for an entire day or until it hits a certain size, we log each individual SRM operation into its own log file and name it accordingly. This makes it much easier to see what happened in a given operation–no need to make your eyes bleed looking through a huge text file for timestamps and error messages. Furthermore, everything is logged: decisions, inputs, outputs etc. The logs read like a narrative, making it very easy to find out what happened and when. Lastly, since the design is so lean (no additional software to install or option files to configure), there is a lot less to go wrong, which makes the troubleshooting chain much shorter.
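As a rough illustration of the one-log-file-per-operation idea, the following Python sketch opens a separate, descriptively named log for each operation using the standard `logging` module. The file naming scheme, the `operation_logger` and `close_logger` helpers, the operation name and the sample messages are all assumptions of mine for illustration; they are not the SRA's real log format.

```python
import logging
import tempfile
from datetime import datetime
from pathlib import Path


def operation_logger(log_dir, operation, when=None):
    """Return (logger, path) for a log file dedicated to one operation."""
    when = when or datetime.now()
    # Hypothetical naming scheme: <operation>_<timestamp>.log
    path = Path(log_dir) / f"{operation}_{when:%Y-%m-%d_%H-%M-%S}.log"
    logger = logging.getLogger(f"sra.{path.stem}")
    logger.setLevel(logging.DEBUG)
    logger.propagate = False  # keep entries out of any shared root log
    handler = logging.FileHandler(path, encoding="utf-8")
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(message)s")
    )
    logger.addHandler(handler)
    return logger, path


def close_logger(logger):
    """Flush and detach the file handlers once the operation is finished."""
    for handler in list(logger.handlers):
        handler.close()
        logger.removeHandler(handler)


if __name__ == "__main__":
    log_dir = tempfile.mkdtemp()
    log, path = operation_logger(
        log_dir, "test_failover", datetime(2014, 9, 1, 10, 30, 0)
    )
    log.info("Input: source volume sql-data-01")
    log.info("Decision: creating replica-of-sql-data-01 for the test")
    close_logger(log)
    print(path.name)
```

With one file per operation, finding out what happened during a given test or recovery comes down to opening the file named after it rather than searching a day-long log for timestamps.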
See a video of the test recovery process and then the recovery/reprotect process below:
Essentially the SRA has been designed to work out of the box for 99% of customers (I made that number up, but I bet it is close). In future versions we will allow more granular control of the SRA and what it does and how it does it, but I think the defaults should work for most, which makes the process of using our SRA very straightforward. My hope is that a couple-page overview should be enough documentation for anyone to understand how the SRA works.