Recent ESXi 6 Storage Bugs and the FlashArray

As you might be aware, there have been a few storage-related issues with ESXi 6.0 as of late:

Accidental PDL during dropped paths:

Storage PDL responses may not trigger path failover in vSphere 6.0 (2144657)

Host issues during smartd inquiries:

Issuing a 0x85 SCSI Command from a VMware ESXi 6.0 host results in a PDL error (2133286)

The question that comes up for the Pure Storage FlashArray is: are we susceptible? The short answer is no. Let’s explain why.

Accidental PDL

A bug introduced in ESXi 6.0 affects multipathing and prevents proper failover of paths. Direct from the KB:

“An inadvertent change in PDL multipathing behavior in ESXi 6.0 results in alternative working paths for a LUN not being checked if a PDL condition/error is detected. Upon encountering a PDL condition on the active path, the ESXi host initiates a health check against the remaining paths but does not fail over if another path is responsive/healthy. The correct response would be to failover to one of the healthy working paths. The result is the host is no longer be able to issue I/O to these LUNs until the ESXi host is rebooted.”

This was patched and the fix is included permanently in ESXi 6.0 U2.

So why has this affected some storage and not the FlashArray (and others)? The situation was encountered during code upgrades or controller reboots on other arrays, where a certain number of paths for a given device would go down. When those paths went down, the arrays would respond with a PDL SCSI sense code instead of an APD one. Essentially, the array was telling the ESXi server that it didn’t expect the storage/path to come back. That response, combined with the ESXi bug, started the failure scenario: even if there were surviving paths through other controllers, they would not be used, and the device would go PDL, causing the VMs on it to hang. PDL responses are one of the following:

[table id=1 /]

The PDL response you will see most often is 0x5 0x25 0x0. On the FlashArray, we do not send PDL responses when a controller is being rebooted, because we expect those paths to come back quickly. Instead, we send responses like:

  • 0x5 0x24 0x0 ILLEGAL REQUEST INVALID FIELD IN CDB
  • 0x2 0x3a 0x0 NOT READY MEDIUM NOT PRESENT
  • 0x6 0x29 0x0 UNIT ATTENTION POWER ON, RESET, OR BUS DEVICE RESET OCCURRED

None of these result in the PDL workflow because the “failure” is temporary. These will appear in your /var/log/vmkernel.log like:

2016-03-24T21:35:06.408Z cpu0:33382)NMP: nmp_ThrottleLogForDevice:3286: Cmd 0x89 (0x439dc0918e40, 32798) to dev "naa.624a937073e940225a2a52bb0001a06b" on path "vmhba2:C0:T0:L10" Failed: H:0x2 D:0x0 P:0x0 Possible sense data: 0x5 0x24 0x0. Act:EVAL
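If you have a lot of these lines to go through, a short Python sketch like the one below can pull the sense data out of each line and tell you whether it is the PDL code discussed above. Treat it as an illustration rather than a Pure Storage or VMware tool: the function name parse_nmp_line, the regular expression, and the lookup tables are assumptions made for this example based on the log format shown in this post. The PDL set only contains 0x5 0x25 0x0, the one PDL code named here, so extend it from the VMware KB if you need the rest.

```python
import re

# Sense triplets (sense key, ASC, ASCQ) named in this post and their meanings.
SENSE_NAMES = {
    (0x5, 0x25, 0x0): "ILLEGAL REQUEST LOGICAL UNIT NOT SUPPORTED",
    (0x5, 0x24, 0x0): "ILLEGAL REQUEST INVALID FIELD IN CDB",
    (0x2, 0x3A, 0x0): "NOT READY MEDIUM NOT PRESENT",
    (0x6, 0x29, 0x0): "UNIT ATTENTION POWER ON, RESET, OR BUS DEVICE RESET OCCURRED",
    (0x5, 0x20, 0x0): "ILLEGAL REQUEST INVALID COMMAND OPERATION CODE",
}

# Assumption: only the PDL code this post names; add the others from the KB.
PDL_SENSE = {(0x5, 0x25, 0x0)}

# Matches the NMP "Failed:" lines shown in this post.
LINE_RE = re.compile(
    r'Cmd (?P<cmd>0x[0-9a-fA-F]+) .*?'
    r'to dev "(?P<dev>[^"]+)" on path "(?P<path>[^"]+)" Failed: '
    r'H:(?P<h>0x[0-9a-fA-F]+) D:(?P<d>0x[0-9a-fA-F]+) P:(?P<p>0x[0-9a-fA-F]+) '
    r'(?P<kind>Valid|Possible) sense data: '
    r'(?P<key>0x[0-9a-fA-F]+) (?P<asc>0x[0-9a-fA-F]+) (?P<ascq>0x[0-9a-fA-F]+)'
)


def parse_nmp_line(line):
    """Return a dict describing one NMP failure line, or None if it does not match."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    sense = tuple(int(m.group(k), 16) for k in ("key", "asc", "ascq"))
    return {
        "cmd": int(m.group("cmd"), 16),
        "device": m.group("dev"),
        "path": m.group("path"),
        # True when the log line says "Valid sense data" rather than "Possible".
        "sense_valid": m.group("kind") == "Valid",
        "sense": sense,
        "meaning": SENSE_NAMES.get(sense, "unknown"),
        "is_pdl": sense in PDL_SENSE,
    }
```

Feeding it the line above should report the sense data as INVALID FIELD IN CDB with is_pdl set to False, which matches the point that these responses do not start the PDL workflow.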

Also, if you haven’t used it, this decoding tool is really helpful to figure out what these error messages mean, instead of having to look everything up:

New Tool: Decoder for ESXi SCSI Sense Codes

The only time we send a PDL response is when a volume is removed from the host group or host. We use 0x5 0x25 0x0, so if you pull an active device from ESXi you will see that flare up in the vmkernel log.

Even though the FlashArray is not affected, you might want to fix this anyway. There is a patch you can request from VMware if you have storage that might be affected, but I think VMware wants you to upgrade to ESXi 6.0 U2, which includes the fix. But that brings us to the ESXi 6.0 U2 bug…

ESXi 6.0 U2 Smartd/0x85 Bug

A new issue cropped up in ESXi 6.0 U2 involving the smartd inquiries that ESXi sends to devices. If you want to find out more about SMART, check out this blog post:

http://cormachogan.com/2012/09/12/vsphere-5-1-storage-enhancements-part-6-iodm-ssd-monitoring/

Essentially, ESXi sends a 0x85 SCSI command (ATA PASS-THROUGH(16)) to a device at a certain interval to look at mode page 1c. The issue in ESXi 6.0 U2 is that if the array responds with 0x5 0x25 0x0 (ILLEGAL REQUEST LOGICAL UNIT NOT SUPPORTED), which as you remember is a PDL response, this can trigger the problem, which VMware describes as:

  • Widespread IO timeouts and subsequent aborts with the H:0x5 failure code.
  • Hosts may take a long time to reconnect to vCenter after reboot or hosts may enter a Not Responding state in vCenter Server
  • Storage-related tasks such as HBA rescan may take a very long time to complete

So if you see something like this:

2015-07-23T20:34:05.108Z cpu2:33198)NMP: nmp_ThrottleLogForDevice:3178: Cmd 0x85 (0x439e16768f40, 34616) to dev "naa.514f0c514ba0000e" on path "vmhba4:C0:T0:L10" Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x25 0x0. Act:EVAL

Your array is susceptible. There is no patch for this yet, so the solution is to disable smartd:

  1. Stop the smartd service using this command: /etc/init.d/smartd stop
  2. Disable the service using this command: chkconfig smartd off

That being said, the FlashArray does not respond with a PDL response when this mode page is queried. Instead, we respond with 0x5 0x20 0x0 (INVALID COMMAND OPERATION CODE), which is not one of the PDL responses. You will see this in your log instead:

2016-03-24T21:35:29.457Z cpu23:33383)NMP: nmp_ThrottleLogForDevice:3286: Cmd 0x85 (0x43a5c08d2000, 34417) to dev "naa.624a937073e940225a2a52bb0001a2a1" on path "vmhba2:C0:T5:L16" Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0. Act:NONE

So in a FlashArray-only environment this problem will not occur.
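If you have a mix of arrays and want to check which devices fall into the susceptible case, the comparison between the two log lines in this section can be sketched in Python. This is a rough example under stated assumptions: the function name susceptible_devices and the regular expression are hypothetical and only modeled on the log format shown in this post, and the check only looks for a 0x85 command answered with 0x5 0x25 0x0.

```python
import re

ATA_PASS_THROUGH_16 = 0x85      # the smartd command discussed in this post
PDL_SENSE = (0x5, 0x25, 0x0)    # the PDL response that triggers the issue

# Assumption: log lines follow the NMP format of the examples in this post.
PATTERN = re.compile(
    r'Cmd (0x[0-9a-fA-F]+) .*?to dev "([^"]+)".*?'
    r'Valid sense data: (0x[0-9a-fA-F]+) (0x[0-9a-fA-F]+) (0x[0-9a-fA-F]+)'
)


def susceptible_devices(lines):
    """Return sorted device IDs that answered a 0x85 command with 0x5 0x25 0x0."""
    hits = set()
    for line in lines:
        m = PATTERN.search(line)
        if m is None:
            continue  # not an NMP failure line with valid sense data
        cmd = int(m.group(1), 16)
        sense = tuple(int(m.group(i), 16) for i in (3, 4, 5))
        if cmd == ATA_PASS_THROUGH_16 and sense == PDL_SENSE:
            hits.add(m.group(2))
    return sorted(hits)
```

You could pass it the lines of a copy of vmkernel.log. A device like the first example in this section (0x5 0x25 0x0) would be listed, while the FlashArray volume answering 0x5 0x20 0x0 would not.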

Conclusion

While the FlashArray does not respond in ways that could contribute to these problems, it is important to make sure nothing else in your environment does. So either patch your hosts or upgrade them (or delay moving to 6). If you are already on ESXi 6.0 U2 and have an array that could run into this issue, disable smartd until VMware patches it or your vendor changes their mode page 1c response.

AS ALWAYS, please contact Pure Storage and/or VMware support for official explanations and patches.

Thanks to Cormac Hogan and Jacob Hopkinson for their help on this post.

References:

Issuing a 0x85 SCSI Command from a VMware ESXi 6.0 host results in a PDL error (2133286)

Permanent Device Loss (PDL) and All-Paths-Down (APD) in vSphere 5.x and 6.x (2004684)

Storage PDL responses may not trigger path failover in vSphere 6.0 (2144657)

http://pubs.vmware.com/Release_Notes/en/vsphere/60/vsphere-esxi-60u2-release-notes.html

SCSI events that can trigger ESX server to fail a LUN over to another path (1003433)

Interpreting SCSI sense codes in VMware ESXi and ESX (289902)

https://en.wikipedia.org/wiki/SCSI_command

 
