Recovering from an iSCSI Permanent Device Loss (PDL) in ESXi

What if, for some reason, your iSCSI storage is unavailable for a very short amount of time as seen from your ESXi box(es)? Usually, everything recovers automatically because ESXi retries the connection by itself. I came across a situation where this did not happen, and in this post I explain how I fixed it.

In my particular situation, I had about 12 datastores on the same storage array (synchronously replicated) and several VMs on each datastore. After the storage hiccup, only one datastore was marked as unavailable, and when I inspected the iSCSI initiator, the path to it appeared to be dead.

I first compared this state with the other hosts, and all of them were experiencing the same issue. All other datastores were connected fine, but this one remained offline. I checked the array and did see active connections coming from the ESXi boxes to this LUN. Strange!

A rescan of the storage adapters on the hosts didn’t do anything, so the only solution without any impact, as it seemed to me, was to place each host in maintenance mode and reboot them one by one. While rebooting the first one, I couldn’t believe this was the only way, and I started digging a little through the GUI.

As a matter of fact, the iSCSI sessions that have been brought up and are connected to an actual LUN are listed under “Static Discovery” in your iSCSI initiator settings. See the screenshot below for an example. I used the Web Client here, but you can find the same place in the vSphere Client.

After I removed the dead LUN target on each host and rescanned the storage adapter, the LUN was available on each ESXi box again, and I didn’t have to reboot any of the hosts. Caution: be careful when removing LUN targets, as you might disconnect the wrong one and cause extra impact.
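
I did all of this through the GUI, but the same steps can probably be done from the ESXi shell with esxcli. Below is a minimal Python sketch that only builds the command lines for one host (list the static targets, remove the dead one, rescan the adapter) so you can review them before running anything. The adapter name, portal address and target IQN are made-up placeholders, not values from my environment, and I have not tested this against the situation described above.

```python
import shlex


def build_recovery_commands(adapter, address, target_iqn):
    """Return the esxcli command lines, in order, for one ESXi host.

    Nothing is executed here; the commands are only assembled so they
    can be reviewed first. All three arguments are placeholders that
    you have to look up yourself (assumption: software iSCSI adapter).
    """
    return [
        # 1. Show the static discovery targets so you can verify the IQN.
        ["esxcli", "iscsi", "adapter", "discovery", "statictarget", "list",
         f"--adapter={adapter}"],
        # 2. Remove only the dead target. Double-check address and IQN!
        ["esxcli", "iscsi", "adapter", "discovery", "statictarget", "remove",
         f"--adapter={adapter}",
         f"--address={address}",
         f"--name={target_iqn}"],
        # 3. Rescan the adapter so the LUN can be picked up again.
        ["esxcli", "storage", "core", "adapter", "rescan",
         f"--adapter={adapter}"],
    ]


if __name__ == "__main__":
    # Hypothetical example values, replace them with your own.
    commands = build_recovery_commands(
        adapter="vmhba64",
        address="192.0.2.10:3260",
        target_iqn="iqn.2001-05.com.example:dead-lun",
    )
    for cmd in commands:
        print(shlex.join(cmd))
```

Printing the commands instead of running them is a deliberate choice: as mentioned in the caution above, removing the wrong target would cause extra impact, so it is worth comparing the output of the first command with the remove command before pasting anything into a host.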

