Pure Storage FlashArray SRA for Site Recovery Manager

I have been working with VMware’s vCenter Site Recovery Manager since the tail end of the 1.x release, and I have to say this is the most excited I have been about a Storage Replication Adapter release that I can remember. Since I started with Pure in late April 2014 I have been working with our development team and product management to design and shape this initial release of the Pure Storage SRA. I have to say it has been a blast: a really great team that does some really amazing work! It is now officially approved and posted on VMware’s compatibility guide and SRA download site:

http://www.vmware.com/resources/compatibility/detail.php?productid=38264&deviceCategory=sra&details=1&partner=399

https://my.vmware.com/group/vmware/details?downloadGroup=SRM_SRA55&productId=451


One of the issues I would run into quite often when working with previous SRAs was complexity. A lot of setup and configuration was required, and this configuration was not particularly flexible. As the environment changed or grew, it took a lot of remediation to make sure SRM test/recovery didn’t break. The primary goal for the initial release of our SRA was to avoid complexity for the user. Anything we could automate or mask from the end user, we would automate or mask. Seeing the end result, I think we definitely achieved this. We approached simplicity on three fronts: configuration, operation and troubleshooting.

Configuration was often the most difficult part of using an SRA, and that difficulty usually tracked the complexity of configuring the replication itself. Test failover could at times be a nightmare. Our goal was that the user should not have to pre-create anything besides the volume they want to replicate. So let’s walk through the configuration of the Pure SRA:

  1. Install the SRA on the protected and remote SRM servers. This is a pretty straightforward process for all SRAs, just a quick installation wizard.
  2. Enable replication between your source and target FlashArrays. Prior to replicating any volume remotely you need to allow replication between the arrays, which prevents just anyone from replicating to any array they choose. The connected FlashArrays will show up as “enabled pairs” in SRM after the array manager is configured.
  3. Configure the array managers. In order to discover array pairs and replicated devices you need to create “instances” of an SRA, which in SRM are called array managers. There is nothing special that needs to be configured on the array or in between the SRA and the array to allow this. If the array is running Purity 4.0 you are good to go. Put in the IP address (or FQDN) of the local and peer FlashArrays along with valid credentials and configuration is done! You will be able to enable array pairs and discover devices immediately. No additional software to install or daemons to configure.
  4. Replicate devices. Figure out what devices you want to replicate and either create a new protection policy on the FlashArray or add them to an existing one. (This is technically called a protection group on the array, but for this post I will use “policy” so as not to confuse it with SRM protection groups.) Then decide on the local and remote replication policy (only remote is required for SRM) and which array to replicate to. About 8 or 10 clicks at most.
  5. Configure hosts or host group on the recovery FlashArray. Create entries for any hosts/host groups (host usually means one ESXi server and host group is a collection of ESXi servers, i.e. a cluster).
  6. At this point no more work needs to be done outside of SRM to run tests/cleanup/recovery/reprotect for existing replicated devices! Everything else is automated and ready to go. Create your SRM protection groups and recovery plans and go nuts.
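Since the array managers in step 3 need nothing more than an address and credentials, one of the few things that can still go wrong at this stage is basic network reachability from each SRM server to both FlashArrays. The following is a minimal Python sketch of a pre-flight check. It is not part of the SRA; it assumes the array's management interface answers on TCP port 443 (HTTPS), and the function names are hypothetical.

```python
import socket


def can_reach(host: str, port: int = 443, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


def check_array_pair(local: str, peer: str, port: int = 443) -> dict:
    """Check the local and peer array addresses (assumed names) from this machine."""
    return {address: can_reach(address, port) for address in (local, peer)}
```

Run from both SRM servers, a call such as `check_array_pair("flasharray-prod.example.com", "flasharray-dr.example.com")` (placeholder addresses) returns a dictionary of address to reachability. A `False` for either entry suggests the array manager on that server would not be able to connect to that array.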

All of that configuration can be done in about 8 minutes (in reality it can be done even quicker than that). See the video below, which I recorded with a voice-over, for a walkthrough of all of it.

Operationally the Pure Storage SRA is quite simple as well.

  1. Preconfiguration requirements. There really aren’t any besides what is listed above. While any replication has static source volumes, our replication does not require static, pre-configured target volumes. Users simply set up their protection policy and it replicates the snapshots over. When you want to use a particular snapshot (or restore from it, etc.) you simply associate that snapshot with an actual volume and Purity will create a metadata copy for that selected volume, which can then be presented to a host or hosts. The original snapshot is not changed, so it can be re-used over and over.
  2. In discovery we just tell SRM our target volume is “replica-of-<source-volume-name>”, which has yet to be created. This volume will only be created (and deleted) as needed, automatically, by the SRA. During a test recovery we will create the volume(s), present them to the recovery cluster for the duration of the test, and then delete them during cleanup. The same goes for an actual recovery: the volume is created on demand and the original source volume is removed during reprotection. You never need to pre-create volumes on the recovery side.
  3. Test recovery point-in-time (PiT) options. The first release (as it stands now) will support two options for the test recovery PiT. We will either replicate over the latest changes at the time of the test recovery in order to run the test, or we will just use the last copy that was automatically made by the protection policy of the device(s). This is decided by whether the user leaves selected or de-selects the “Replicate recent changes to the recovery site” option built into the SRM test recovery initiation wizard. In a future release we plan on offering the ability to choose any existing PiT to recover to.
  4. The next feature is something we are leveraging from SRM itself. SRM optionally provides a feature to the SRA called “dynamic access restrictions”. This causes SRM (when the SRA tells it that this is supported) to inform the SRA upon a test or recovery what WWNs or IQNs a given volume needs to be presented to for that operation to succeed. When we get this information, we analyze the configured hosts or host groups to see what matches them. When we find a match we will attach the volume automatically to the appropriate hosts/host groups. While this does require pre-configuration of the host/host groups on the FlashArray this is a one time operation.
  5. The reprotect operation is automated too. You do not need to pre-create a protection policy on the remote side. When we perform a reprotect, which instantiates replication for a set of devices back to the original site, we will analyze the original protection policy (or policies) and create them on the recovery site. This provides the same SLA the devices were originally configured with, but in the opposite direction.
  6. No granularity of failover restrictions. The Pure SRA will never require any two volumes to be failed over together. It will always allow the user to fail over a single volume without affecting other volumes, even if it is in a protection policy with other volumes. The only restrictions will be enforced by SRM itself; for example, if a VM spans two volumes they will of course have to be failed over in unison.
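Two of the behaviours in the list above are easy to picture in code: the name reported for the not-yet-created target volume, and the matching of SRM-supplied initiators against hosts configured on the recovery FlashArray. The sketch below is a simplified Python illustration under stated assumptions, not the SRA's actual implementation. The `HostEntry` type, the function names, the case-insensitive comparison, and the rule that a host group is preferred over an individual host are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set


def replica_name(source_volume: str) -> str:
    """Target volume name reported at discovery, per the naming described above."""
    return f"replica-of-{source_volume}"


@dataclass
class HostEntry:
    """Hypothetical view of a host configured on the recovery array."""
    name: str
    initiators: Set[str]              # WWNs or IQNs registered for this host
    host_group: Optional[str] = None  # set when the host belongs to a host group


def targets_for(initiators_from_srm: Iterable[str], hosts: Iterable[HostEntry]) -> List[str]:
    """Return the host groups (or standalone hosts) owning any requested initiator."""
    wanted = {i.lower() for i in initiators_from_srm}
    targets: List[str] = []
    for host in hosts:
        if wanted & {i.lower() for i in host.initiators}:
            # Assumption: connect to the host group when there is one, else the host.
            target = host.host_group or host.name
            if target not in targets:
                targets.append(target)
    return targets
```

With two clustered hosts in a host group and one standalone host, passing the initiators of the clustered hosts yields a single host group to connect the volume to, while an unknown initiator yields an empty list. This is why the hosts or host groups have to exist on the recovery array beforehand.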

Another item we put a lot of work into getting right is the logging. Instead of logging to one big log file for an entire day or until it hits a certain size, we log each individual SRM operation into its own log file and name it accordingly. This makes it much easier to see what happened in a given operation, with no need to make your eyes bleed looking through a huge text file for timestamps and error messages. Furthermore, everything is logged: decisions, inputs, outputs, etc. The logs read like a narrative, making it very easy to find out what happened and when. Lastly, since the design is so lean, with no additional software to install or option files to configure, there is a lot less to go wrong, which makes the troubleshooting chain much shorter.
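To make the one-file-per-operation idea concrete, here is a small Python sketch that builds a log file name from the command name, a timestamp and a unique identifier. The layout is inferred from a file name quoted in the comments on this post (`discoverDevices_2017-09-15-19-04-04-826731-d0fccfee-59ef-4f5c-b1fd-bf80ebc14a10.log`). Reading the trailing fields as microseconds followed by a UUID is an assumption, and `operation_log_name` is a hypothetical helper rather than anything exposed by the SRA.

```python
import uuid
from datetime import datetime
from typing import Optional


def operation_log_name(
    command: str,
    started: Optional[datetime] = None,
    session_id: Optional[uuid.UUID] = None,
) -> str:
    """Build a per-operation log file name: <command>_<timestamp>-<uuid>.log."""
    started = started or datetime.now()
    session_id = session_id or uuid.uuid4()
    # %f is microseconds, zero-padded to six digits.
    stamp = started.strftime("%Y-%m-%d-%H-%M-%S-%f")
    return f"{command}_{stamp}-{session_id}.log"
```

Because the command name leads and the timestamp sorts lexicographically, a directory listing groups each operation type together in chronological order, which is what makes a single failed operation quick to find.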

See a video of the test recovery process and then the recovery/reprotect process below:

Essentially the SRA has been designed to work out of the box for 99% of customers (I made that number up, but I bet it is close). In future versions we will allow more granular control of the SRA, what it does and how it does it, but I think the defaults should work for most, making the process of using our SRA very straightforward. My hope is that a couple-page overview should be enough documentation for anyone to understand how the SRA works.

 

 

5 thoughts on “Pure Storage FlashArray SRA for Site Recovery Manager”

  1. Hey Cody,

    Nice article again !

    Can’t believe this SRA setup is so damn easy… The things we had to do for the “other SRA” to get it working, and now looking at the PureStorage SRA configuration, this just seems like child’s play!

    The sync between the PureStorage plug-in and the array’s management UI is pretty seamless.

    Btw, how does the SRA communicate with the array ? Do you need to install any APIs before installing the SRA?

    And what are files that are created when the SRA is installed ? Like GlobalOptions.xml etc. ?

  2. Arjun, thanks! Yeah it really is very straightforward. Eliminating complexity was our main goal with this SRA. The SRA communicates with the array over a built-in REST API, so no configuration/setup/deployment of a management server or the like is needed. The array has everything built in.

    The install just creates a few exe files and the log folders. There are no configuration files, except for one if you need to change a login timeout (which basically should never be required).

  3. Hi Cody,

    I have enabled the Pure array pair in SRM after creating array managers at both the Prod and DR sites. However, at the DR site, on the array pair, I am seeing the error below:

    SRA command ‘discoverDevices’ failed. Cannot reach array. Error message: Unable to connect to the remote server
    Inner Exception=’A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond 10.16.152.20:443′.
    Please check network connectivity to array. Please see logs at C:\ProgramData\VMware\VMware vCenter Site Recovery Manager\Logs\SRAs\purestorage\discoverDevices_2017-09-15-19-04-04-826731-d0fccfee-59ef-4f5c-b1fd-bf80ebc14a10.log.

    On the SRA Log i see,

    [09/15/2017 19:04:04,Logging session for discoverDevices,V] *** Session started at log level Verbose ***
    [09/15/2017 19:04:04,Logging session for discoverDevices,V] SRA version: 1.5.7.2016-05-18_14-38-44
    [09/15/2017 19:04:04,Logging session for discoverDevices,V] Log path: C:\ProgramData\VMware\VMware vCenter Site Recovery Manager\Logs\SRAs\purestorage\discoverDevices_2017-09-15-19-04-04-826731-d0fccfee-59ef-4f5c-b1fd-bf80ebc14a10.log
    [09/15/2017 19:04:04,Logging session for discoverDevices,V] Machine name: SCIRVNCMPSRM01
    [09/15/2017 19:04:04,Logging session for discoverDevices,V] Process is 64-Bit.
    [09/15/2017 19:04:04,Logging session for discoverDevices,V] Running as the administrator.
    [09/15/2017 19:04:04,Logging session for discoverDevices,V] CLR version: 4.0.30319.34209
    [09/15/2017 19:04:04,Logging session for discoverDevices,V] Operating System: Microsoft Windows NT 6.1.7601 Service Pack 1
    [09/15/2017 19:04:04,PureSRA.cs:ParseCommand,V] Entering
    [09/15/2017 19:04:04,PureSRA.cs:ParseCommand,V] Received input:

    discoverDevices
    C:\Windows\TEMP\vmware-SYSTEM\sra-output-56-195
    C:\Windows\TEMP\vmware-SYSTEM\sra-status-57-99
    verbose
    C:\ProgramData\VMware\VMware vCenter Site Recovery Manager\Logs\SRAs\purestorage

    10.31.152.20

    pureuser
    ***

    10.16.152.20

    pureuser
    ***

    sgdrppfam01(3370b347-0d4f-4fe8-9bee-7fbbb6e02601)
    ircdppfam01(4c334d6c-b24f-4df6-8beb-a6d6a21198e0)

    [09/15/2017 19:04:04,PureSRA.cs:ParseCommand,V] Exiting
    [09/15/2017 19:04:04,PureSRA.cs:HandleCommand,V] Entering
    [09/15/2017 19:04:04,SRMCommandHandlerBase.cs:WriteResponse,V] Entering
    [09/15/2017 19:04:04,SRMCommandHandlerBase.cs:WriteResponse,V] Entering
    [09/15/2017 19:04:04,DiscoverDevices.cs:ProcessCommand,V] Entering
    [09/15/2017 19:04:04,SRMCommandHandlerBase.cs:PrepareArrayPair,V] Entering
    [09/15/2017 19:04:04,SRMCommandHandlerBase.cs:GetPairInfo,V] Entering
    [09/15/2017 19:04:04,SRMCommandHandlerBase.cs:ExtractArraySerialFromID,V] Entering
    [09/15/2017 19:04:04,SRMCommandHandlerBase.cs:ExtractArraySerialFromID,V] Array ID given by SRM is sgdrppfam01(3370b347-0d4f-4fe8-9bee-7fbbb6e02601)
    [09/15/2017 19:04:04,SRMCommandHandlerBase.cs:ExtractArraySerialFromID,V] Parsed out array serial 3370b347-0d4f-4fe8-9bee-7fbbb6e02601
    [09/15/2017 19:04:04,SRMCommandHandlerBase.cs:ExtractArraySerialFromID,V] Exiting
    [09/15/2017 19:04:04,SRMCommandHandlerBase.cs:ExtractArraySerialFromID,V] Entering
    [09/15/2017 19:04:04,SRMCommandHandlerBase.cs:ExtractArraySerialFromID,V] Array ID given by SRM is ircdppfam01(4c334d6c-b24f-4df6-8beb-a6d6a21198e0)
    [09/15/2017 19:04:04,SRMCommandHandlerBase.cs:ExtractArraySerialFromID,V] Parsed out array serial 4c334d6c-b24f-4df6-8beb-a6d6a21198e0
    [09/15/2017 19:04:04,SRMCommandHandlerBase.cs:ExtractArraySerialFromID,V] Exiting
    [09/15/2017 19:04:04,SRMCommandHandlerBase.cs:GetPairInfo,V] The given array pair is 3370b347-0d4f-4fe8-9bee-7fbbb6e02601 (local) => 4c334d6c-b24f-4df6-8beb-a6d6a21198e0 (peer)
    [09/15/2017 19:04:04,SRMCommandHandlerBase.cs:GetPairInfo,V] Exiting
    [09/15/2017 19:04:04,SRMCommandHandlerBase.cs:ConnectToInputArrays,V] Entering
    [09/15/2017 19:04:04,SRMCommandHandlerBase.cs:ConnectToInputArrays,V] Instantiating arrays from connection parameters
    [09/15/2017 19:04:04,SRMCommandHandlerBase.cs:ConnectToInputArrays,V] Connecting to FlashArray at 10.31.152.20 using connection localArray
    [09/15/2017 19:04:04,SRMCommandHandlerBase.cs:ConnectToInputArrays,V] Attempting to connect to end point https://10.31.152.20/api
    [09/15/2017 19:04:04,Utils.cs:GetSetting,V] Entering
    [09/15/2017 19:04:04,Utils.cs:GetSetting,V] Key doesn’t exist. Defaulting to 60.
    [09/15/2017 19:04:04,Utils.cs:GetSetting,V] Exiting
    [09/15/2017 19:04:04,SRMCommandHandlerBase.cs:ConnectToInputArrays,V] HTTP timeout is 60 seconds.
    [09/15/2017 19:04:23,SRMCommandHandlerBase.cs:ConnectToInputArrays,V] Successfully connected to FlashArray at https://10.31.152.20/api/1.3
    [09/15/2017 19:04:23,SRMCommandHandlerBase.cs:ConnectToInputArrays,V] Name: sgdrppfam01
    [09/15/2017 19:04:23,SRMCommandHandlerBase.cs:ConnectToInputArrays,V] Serial: 3370b347-0d4f-4fe8-9bee-7fbbb6e02601
    [09/15/2017 19:04:23,SRMCommandHandlerBase.cs:ConnectToInputArrays,V] Revision: 201705102013+977fb3c
    [09/15/2017 19:04:23,SRMCommandHandlerBase.cs:ConnectToInputArrays,V] Version: 4.8.10
    [09/15/2017 19:04:23,SRMCommandHandlerBase.cs:ConnectToInputArrays,V] Role: ArrayAdmin
    [09/15/2017 19:04:23,SRMCommandHandlerBase.cs:ConnectToInputArrays,V] Supported API: 1.0,1.1,1.2,1.3,1.4,1.5,1.6,1.7
    [09/15/2017 19:04:23,SRMCommandHandlerBase.cs:ConnectToInputArrays,V] Connecting to FlashArray at 10.16.152.20 using connection peerArray
    [09/15/2017 19:04:23,SRMCommandHandlerBase.cs:ConnectToInputArrays,V] Attempting to connect to end point https://10.16.152.20/api
    [09/15/2017 19:04:23,Utils.cs:GetSetting,V] Entering
    [09/15/2017 19:04:23,Utils.cs:GetSetting,V] Key doesn’t exist. Defaulting to 60.
    [09/15/2017 19:04:23,Utils.cs:GetSetting,V] Exiting
    [09/15/2017 19:04:23,SRMCommandHandlerBase.cs:ConnectToInputArrays,V] HTTP timeout is 60 seconds.
    [09/15/2017 19:05:05,SRMCommandHandlerBase.cs:ConnectToInputArrays,W] Connection failed to FlashArray at 10.16.152.20 using connection peerArray
    [09/15/2017 19:05:05,SRMCommandHandlerBase.cs:ConnectToInputArrays,E] “PureRestException: HttpStatusCode = ‘ServiceUnavailable’, RestErrorCode = ‘ServiceUnavailable’, Details = ”, InnerException = ‘System.Net.WebException: Unable to connect to the remote server —> System.Net.Sockets.SocketException: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond 10.16.152.20:443
    at System.Net.Sockets.Socket.DoConnect(EndPoint endPointSnapshot, SocketAddress socketAddress)
    at System.Net.ServicePoint.ConnectSocketInternal(Boolean connectFailure, Socket s4, Socket s6, Socket& socket, IPAddress& address, ConnectSocketState state, IAsyncResult asyncResult, Exception& exception)
    — End of inner exception stack trace —
    at System.Net.HttpWebRequest.GetResponse()
    at PureStorage.Rest.PureRestClientBase.SendWebRequest(String verb, String requestUri, PureRequest request) in c:\hudson\workspace\BuildRESTLibrary_REL_1_5\PureStorage.Rest\PureStorage.Rest\PureRestClientBase.cs:line 205
    at PureStorage.Rest.PureRestClientBase.SendWithRetryAndErrorHandling(String verb, String uri, PureRequest request, Int32 maxAttampts) in c:\hudson\workspace\BuildRESTLibrary_REL_1_5\PureStorage.Rest\PureStorage.Rest\PureRestClientBase.cs:line 87′”
    [09/15/2017 19:05:05,SRMCommandHandlerBase.cs:ConnectToInputArrays,V] Exiting
    [09/15/2017 19:05:05,SRMCommandHandlerBase.cs:PrepareArrayPair,V] Exiting
    [09/15/2017 19:05:05,DiscoverDevices.cs:ProcessCommand,V] Exiting
    [09/15/2017 19:05:05,SRMCommandHandlerBase.cs:WriteResponse,V] Exiting
    Setting output:

    Unable to connect to the remote server
    Inner Exception=’A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond 10.16.152.20:443’
    C:\ProgramData\VMware\VMware vCenter Site Recovery Manager\Logs\SRAs\purestorage\discoverDevices_2017-09-15-19-04-04-826731-d0fccfee-59ef-4f5c-b1fd-bf80ebc14a10.log

    [09/15/2017 19:05:05,SRMCommandHandlerBase.cs:WriteResponse,V] Exiting
    [09/15/2017 19:05:05,PureSRA.cs:HandleCommand,V] Exiting
    [09/15/2017 19:05:05,Logging session for discoverDevices,V] Rest Library transcript:
    [09/15/2017 19:05:05,Logging session for discoverDevices,V] PureStorage.Rest Information: 12 : 2017-09-15T23:04:04.9047325Z Initializing PureRestClient
    PureStorage.Rest Information: 12 : 2017-09-15T23:04:04.9047325Z IgnoreCertificateError: True, HttpTimeoutInMilliseconds: 60000, ClientInfo: SRA/1.5.7, RoleRequired: StorageAdmin
    PureStorage.Rest Information: 10 : 2017-09-15T23:04:22.3458679Z GET https://10.31.152.20/api/api_version OK 17411ms
    PureStorage.Rest Information: 12 : 2017-09-15T23:04:22.3458679Z Response to API version query: {“version”: [“1.0”, “1.1”, “1.2”, “1.3”, “1.4”, “1.5”, “1.6”, “1.7”, “1.8”]}
    PureStorage.Rest Information: 12 : 2017-09-15T23:04:23.2194847Z Parameter version = 1.3
    PureStorage.Rest Information: 12 : 2017-09-15T23:04:23.2194847Z Supported version = 1.0
    PureStorage.Rest Information: 12 : 2017-09-15T23:04:23.2194847Z Supported version = 1.1
    PureStorage.Rest Information: 12 : 2017-09-15T23:04:23.2194847Z Supported version = 1.2
    PureStorage.Rest Information: 12 : 2017-09-15T23:04:23.2194847Z Supported version = 1.3
    PureStorage.Rest Information: 12 : 2017-09-15T23:04:23.2194847Z Supported version = 1.4
    PureStorage.Rest Information: 12 : 2017-09-15T23:04:23.2194847Z Supported version = 1.5
    PureStorage.Rest Information: 12 : 2017-09-15T23:04:23.2194847Z Supported version = 1.6
    PureStorage.Rest Information: 12 : 2017-09-15T23:04:23.2194847Z Supported version = 1.7
    PureStorage.Rest Information: 12 : 2017-09-15T23:04:23.2194847Z Validated version: 1.3
    PureStorage.Rest Information: 10 : 2017-09-15T23:04:23.5002901Z POST https://10.31.152.20/api/1.3/auth/apitoken OK 284ms
    PureStorage.Rest Information: 10 : 2017-09-15T23:04:23.5470910Z POST https://10.31.152.20/api/1.3/auth/session OK 23ms
    PureStorage.Rest Information: 10 : 2017-09-15T23:04:23.6094922Z GET https://10.31.152.20/api/1.3/array OK 42ms
    PureStorage.Rest Information: 12 : 2017-09-15T23:04:23.6250925Z Initializing PureRestClient
    PureStorage.Rest Information: 12 : 2017-09-15T23:04:23.6250925Z IgnoreCertificateError: True, HttpTimeoutInMilliseconds: 60000, ClientInfo: SRA/1.5.7, RoleRequired: StorageAdmin
    PureStorage.Rest Error: 11 : 2017-09-15T23:04:44.7010978Z ‘”PureRestException: HttpStatusCode = ‘ServiceUnavailable’, RestErrorCode = ‘ServiceUnavailable’, Details = ”, InnerException = ‘System.Net.WebException: Unable to connect to the remote server —> System.Net.Sockets.SocketException: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond 10.16.152.20:443
    at System.Net.Sockets.Socket.DoConnect(EndPoint endPointSnapshot, SocketAddress socketAddress)
    at System.Net.ServicePoint.ConnectSocketInternal(Boolean connectFailure, Socket s4, Socket s6, Socket& socket, IPAddress& address, ConnectSocketState state, IAsyncResult asyncResult, Exception& exception)
    — End of inner exception stack trace —
    at System.Net.HttpWebRequest.GetResponse()
    at PureStorage.Rest.PureRestClientBase.SendWebRequest(String verb, String requestUri, PureRequest request) in c:\hudson\workspace\BuildRESTLibrary_REL_1_5\PureStorage.Rest\PureStorage.Rest\PureRestClientBase.cs:line 205
    at PureStorage.Rest.PureRestClientBase.SendWithRetryAndErrorHandling(String verb, String uri, PureRequest request, Int32 maxAttampts) in c:\hudson\workspace\BuildRESTLibrary_REL_1_5\PureStorage.Rest\PureStorage.Rest\PureRestClientBase.cs:line 87′”‘
    PureStorage.Rest Information: 10 : 2017-09-15T23:04:44.7010978Z GET https://10.16.152.20/api/api_version 21082ms
    PureStorage.Rest Error: 11 : 2017-09-15T23:05:05.7771031Z ‘”PureRestException: HttpStatusCode = ‘ServiceUnavailable’, RestErrorCode = ‘ServiceUnavailable’, Details = ”, InnerException = ‘System.Net.WebException: Unable to connect to the remote server —> System.Net.Sockets.SocketException: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond 10.16.152.20:443
    at System.Net.Sockets.Socket.DoConnect(EndPoint endPointSnapshot, SocketAddress socketAddress)
    at System.Net.ServicePoint.ConnectSocketInternal(Boolean connectFailure, Socket s4, Socket s6, Socket& socket, IPAddress& address, ConnectSocketState state, IAsyncResult asyncResult, Exception& exception)
    — End of inner exception stack trace —
    at System.Net.HttpWebRequest.GetResponse()
    at PureStorage.Rest.PureRestClientBase.SendWebRequest(String verb, String requestUri, PureRequest request) in c:\hudson\workspace\BuildRESTLibrary_REL_1_5\PureStorage.Rest\PureStorage.Rest\PureRestClientBase.cs:line 205
    at PureStorage.Rest.PureRestClientBase.SendWithRetryAndErrorHandling(String verb, String uri, PureRequest request, Int32 maxAttampts) in c:\hudson\workspace\BuildRESTLibrary_REL_1_5\PureStorage.Rest\PureStorage.Rest\PureRestClientBase.cs:line 87′”‘
    PureStorage.Rest Information: 10 : 2017-09-15T23:05:05.7771031Z GET https://10.16.152.20/api/api_version 21062ms

    [09/15/2017 19:05:05,Logging session for discoverDevices,V] *** Session ended ***
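The SRA log pasted above follows a regular `[timestamp,source,level] message` pattern, so the warnings and errors can be pulled out of a long session mechanically. Below is a hedged Python sketch of such a filter. The regular expression is derived only from the lines shown in this comment, and treating `V`, `W` and `E` as verbose, warning and error is an inference from the log itself rather than documented behaviour. Continuation lines such as stack traces are skipped.

```python
import re
from typing import Iterable, Iterator, List, NamedTuple

# Matches e.g. "[09/15/2017 19:04:04,PureSRA.cs:ParseCommand,V] Entering"
LINE = re.compile(
    r"^\[(?P<ts>\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2}),"
    r"(?P<source>[^,\]]+),(?P<level>[VWE])\] (?P<message>.*)$"
)


class Entry(NamedTuple):
    timestamp: str
    source: str
    level: str
    message: str


def parse_entries(lines: Iterable[str]) -> Iterator[Entry]:
    """Yield one Entry per line that matches the log pattern; skip everything else."""
    for raw in lines:
        match = LINE.match(raw.strip())
        if match:
            yield Entry(match["ts"], match["source"], match["level"], match["message"])


def problems(lines: Iterable[str]) -> List[Entry]:
    """Keep only the entries assumed to be warnings (W) or errors (E)."""
    return [entry for entry in parse_entries(lines) if entry.level in ("W", "E")]
```

Applied to the session above, this would surface the `Connection failed to FlashArray at 10.16.152.20 using connection peerArray` warning and the `PureRestException` error that follows it. Both point at the peer array being unreachable on port 443 from this SRM server, while the connection to 10.31.152.20 succeeded.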
