SRM Cannot Identify Replicated Datastores on iSCSI Devices

We recently ran into a customer issue with VMware Site Recovery Manager that I had not seen before, and I could not find any on-point articles about it, so I thought I’d share this one. It was an insidious one, too: when troubleshooting it I could not find the issue, and eventually one of our rockstar escalation engineers at Pure (Jacob Hopkinson) figured it out after going through SRM debug logs line by line. It comes down to case sensitivity in iSCSI IQNs. I’ll explain…

The problem the customer ran into was that the SRA was returning the replicated devices to SRM properly, but SRM could not match them to a datastore. The VMFS was certainly on the device and it was definitely replicated. The SRA response was correct. The customer in question was using SRM 5.8 (I am doing this example with SRM 6.1, so the version doesn’t seem to matter) and was using iSCSI to present the FlashArray storage to the ESXi farm.

The issue

Since it couldn’t make a match, the discovered device listing in SRM always looked like below, with the datastore name empty.

[Screenshot: SRM discovered devices list with the datastore name empty]

Some review of standards

The question is, why?

The bug turned out to be caused by ESXi IQNs that contained uppercase characters. By default, when using software iSCSI, ESXi will create an IQN using, in part, the host name of the ESXi host. It looks something like this:

[Screenshot: ESXi software iSCSI adapter IQN containing an uppercase character]

As you can see, the “E” in Esx1 is uppercase. Furthermore, you can edit this and make it look different if you so choose: upper, lower, or something entirely different. It is important to note that the iSCSI standards set forth by the Internet Engineering Task Force (IETF) allow for this:

RFC 1034 states that “By convention, domain names can be stored with arbitrary case…”

So this is fine. But comparisons must be done in a case-insensitive manner. So ESX1 is the same as esx1. That line above from RFC 1034 continues:

“…but domain name comparisons for all present domain functions are done in a case-insensitive manner…”

and:

“…This means that you are free to create a node with label “A” or a node with label “a”, but not both as brothers; you could refer to either using “a” or “A”. When you receive a domain name or label, you should preserve its case…”

Essentially, when comparing domain names, don’t pay attention to case, but also don’t change the case if given to you. VMware even has a KB that talks about this:

https://kb.vmware.com/kb/2017582
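To make the rule concrete, here is a minimal Python sketch of a comparison that ignores case but leaves the stored strings untouched. The function name `iqn_equal` is made up for illustration, and the two IQNs are the upper- and lowercase forms of the host IQN that shows up later in this post. This is not how SRM or any SRA is implemented, just the behavior the RFC text describes.

```python
def iqn_equal(a: str, b: str) -> bool:
    """Compare two IQNs without regard to case; the inputs are not modified."""
    return a.casefold() == b.casefold()


# Hypothetical example values: the same initiator name in two different cases.
host_iqn = "iqn.1998-01.com.vmware:csg-vw-Esx1-199b476e"
reported_iqn = "iqn.1998-01.com.vmware:csg-vw-esx1-199b476e"

# A plain string comparison is case-sensitive, so it treats these as different.
print(host_iqn == reported_iqn)

# The case-insensitive comparison treats them as the same initiator.
print(iqn_equal(host_iqn, reported_iqn))

# The original case is still preserved on the stored value.
print(host_iqn)
```

The point is that normalization only happens inside the comparison; nothing rewrites the IQN that was handed over.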

Okay, great so what’s the problem?

In the above situation, the IQN had an uppercase character, and it was stored identically on the FlashArray:

[Screenshot: host IQN as stored on the FlashArray]

The ESXi host could see the two volumes connected.

[Screenshot: the two volumes connected to the ESXi host]

All to standards yay!

The bug(s)

Two problems occurred, though. One is that our SRA would convert the IQN to lowercase when responding to SRM. You would see this log line excerpt:

Found iSCSI initiator: iqn.1998-01.com.vmware:csg-vw-Esx1-199b476e

Which is the SRA finding the host. Then the following later:

<Initiator type="iSCSI" id="iqn.1998-01.com.vmware:csg-vw-esx1-199b476e" />

It changed! It moved to lowercase. The reasoning for this was semi-well founded: the IQN is being returned in order to be compared, and according to RFC 3722 and its allowed characters, uppercase characters should be mapped to their lowercase equivalents. But technically the SRA isn’t doing the comparison, SRM is, so the case should have been kept as it was.

This is bug #1. Which we have fixed in our SRA. We no longer convert IQNs.

But there is bug #2, which is in SRM and is still present at this point. I am not sure if it will be fixed.

SRM was comparing the IQN we returned against the IQNs from the ESXi servers it sees, and the comparison failed: we sent back lowercase, but SRM had uppercase. See the following log line (which can only be seen when SRM storage logging is set to trivia):

[01516 trivia 'Storage' opID=162ba00a] Skipped access path '[iSCSI] iqn.1998-01.com.vmware:csg-vw-Esx1-199b476e ;

It was expecting lowercase, which is not in line with iSCSI/domain name standards. So it doesn’t match, and the volume relationships fail. The next question is: why does IQN matching even matter to SRM? It is really just looking at volumes. Let’s look at that.

Why this IQN business matters to SRM

So you might say, I have upper case IQNs with other SRAs and this doesn’t matter or show up in the logs. That might be. The reason for this is a special feature the FlashArray SRA uses in SRM that not all SRAs leverage. It’s called Dynamic Access Restrictions and it is an optional SRA command. As said in the SRA specification:

DynamicAccessRestriction–Ability to present snapshot and promoted devices to a list of initiators specified in testFailoverStart and failover requests.

Essentially, it allows the SRA to intelligently present the volume to the proper hosts during test failover and failover automatically. This is based on inventory mappings in SRM and also currently checks the configuration of the existing protected devices. This is done, among other reasons, to allow for restoreReplication (a “hidden” feature of SRM). This is why the device discovery fails.

If your SRA does not support DAR, I do not believe this would be an issue, because without DAR support the presentation/removal of the volumes is expected to be a manual operation by the user. You will see something like this in a device discovery response from a DAR-enabled SRA:

<SourceDevices>
  <SourceDevice id="iscsiSRM" state="read-write">
    <Name>iscsiSRM</Name>
    <TargetDevice key="peer-of-22598a73-81d7-4206-9450-42f2840a3e5c:iscsiSRM" />
    <Identity>
      <Lun initiatorGroupId="host-esxi1-on-array-22598a73-81d7-4206-9450-42f2840a3e5c">10</Lun>
      <Lun initiatorGroupId="host-esxi2-on-array-22598a73-81d7-4206-9450-42f2840a3e5c">10</Lun>
    </Identity>
  </SourceDevice>
</SourceDevices>
<InitiatorGroups>
  <InitiatorGroup id="host-esxi1-on-array-22598a73-81d7-4206-9450-42f2840a3e5c">
    <Initiator type="iSCSI" id="iqn.1998-01.com.vmware:csg-vw-esx1-199b476e" />
  </InitiatorGroup>
  <InitiatorGroup id="host-esxi2-on-array-22598a73-81d7-4206-9450-42f2840a3e5c">
    <Initiator type="iSCSI" id="iqn.1998-01.com.vmware:csg-vmw-esx2-3ea85cd3" />
  </InitiatorGroup>
</InitiatorGroups>
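If you want to check a response like this against what your hosts actually report, a small script can pull the initiator IDs out of the XML and flag any that differ from a host IQN only by case. The sketch below is an illustration, not a supported tool: the `<Response>` wrapper element, the function names, and the `host_iqns` list are all assumptions made so that the fragment parses as a single document, and the host IQN is the uppercase one from the log line earlier in this post.

```python
import xml.etree.ElementTree as ET

# Fragment of the discovery response above, wrapped in a hypothetical
# <Response> root element so that it is a single well-formed document.
RESPONSE_FRAGMENT = """
<Response>
  <InitiatorGroups>
    <InitiatorGroup id="host-esxi1-on-array-22598a73-81d7-4206-9450-42f2840a3e5c">
      <Initiator type="iSCSI" id="iqn.1998-01.com.vmware:csg-vw-esx1-199b476e" />
    </InitiatorGroup>
    <InitiatorGroup id="host-esxi2-on-array-22598a73-81d7-4206-9450-42f2840a3e5c">
      <Initiator type="iSCSI" id="iqn.1998-01.com.vmware:csg-vmw-esx2-3ea85cd3" />
    </InitiatorGroup>
  </InitiatorGroups>
</Response>
"""


def initiators_by_group(xml_text):
    """Map each InitiatorGroup id to the list of Initiator ids it contains."""
    root = ET.fromstring(xml_text.strip())
    return {
        group.get("id"): [init.get("id") for init in group.findall("Initiator")]
        for group in root.iter("InitiatorGroup")
    }


def case_only_mismatches(host_iqns, reported_iqns):
    """Return (host, reported) pairs that are equal ignoring case but not exactly."""
    return [
        (h, r)
        for h in host_iqns
        for r in reported_iqns
        if h != r and h.lower() == r.lower()
    ]


# Hypothetical input: the IQN as configured on the ESXi host (uppercase "E").
host_iqns = ["iqn.1998-01.com.vmware:csg-vw-Esx1-199b476e"]

groups = initiators_by_group(RESPONSE_FRAGMENT)
reported = [iqn for iqns in groups.values() for iqn in iqns]

for host, rep in case_only_mismatches(host_iqns, reported):
    print(f"case-only mismatch: host has {host}, response has {rep}")
```

Any pair it prints is a candidate for exactly the problem described here: the same initiator, spelled with different case in two places.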

Dynamic Access Restriction, when this occurs

This will occur in the following situations with DAR-enabled SRAs:

  1. The IQN is one case in ESXi and another on the array. This fails regardless of how the SRA (FlashArray or otherwise) handles the IQN
  2. Upper case on ESXi and the FlashArray with a non-patched FlashArray SRA
  3. Lowercase on ESXi and uppercase on the array (patched FlashArray SRA and likely other DAR-enabled SRAs)

So because the issue still resides in SRM, even an environment with a patched FlashArray SRA can still run into this issue if IQNs are entered differently across the environment.

Workaround

The nice thing is that the workaround is simple: make all of your IQNs lowercase everywhere and this will never be an issue.
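A quick way to audit for this is to run your list of IQNs through a check for uppercase characters. The sketch below assumes you have already collected the IQNs from your ESXi hosts and the array by whatever means you prefer; the `configured` list is a made-up input built from the IQNs in this post, and the function name is only for illustration.

```python
def iqns_needing_lowercase(iqns):
    """Map each IQN that contains uppercase characters to its lowercase form."""
    return {iqn: iqn.lower() for iqn in iqns if iqn != iqn.lower()}


# Hypothetical inventory of IQNs gathered from hosts and the array.
configured = [
    "iqn.1998-01.com.vmware:csg-vw-Esx1-199b476e",
    "iqn.1998-01.com.vmware:csg-vmw-esx2-3ea85cd3",
]

for current, suggested in iqns_needing_lowercase(configured).items():
    print(f"{current} -> {suggested}")
```

Anything it lists should be changed to the suggested lowercase form on both the ESXi host and the array, so the two stay identical.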