SAN Loop Failure


Category : esx   vmware   netapp   storage   troubleshooting   vmware-kb


Here’s an example of how important it is for your VMware engineers to either have visibility into the SAN infrastructure or work very closely with the SAN admins. We lost a fiber loop on one of our NetApp FAS 3160 SANs yesterday. If my team had not received the critical fail-over emails, we would not have known about the storage issues for a couple of vital hours.

When your customers complain that something is wrong with their VMs, or that they have lost access to them, it is imperative to investigate and start troubleshooting. Had we started troubleshooting without knowing about the SAN issue, we would have started with the VMs, which would have quickly led us back to the ESX hosts they resided on. Troubleshooting the ESX hosts could have made our outage A LOT worse.

This particular environment consists of vCenter Server 4.1 and eight ESX 4.01 U2 hosts. There happens to be a bug in ESX 4.x that occurs when a rescan is issued while an all-paths-down state exists for any LUN in the vCenter Server cluster. As a result, a virtual machine on one LUN stops responding (temporarily or permanently) because a different LUN in the vCenter Server cluster is in an all-paths-down state. (Yeah, I cut and pasted that from the VMware KB, which you can read here.) The KB also mentions that the bug was fixed in ESX 4.1 U1. Since a rescan is a routine step when troubleshooting storage from the ESX side, this is exactly how we could have made things worse.
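To make the rule concrete, here is a minimal sketch of the "check before you rescan" logic. It is not a VMware tool or API: the data structure, the function names, and the path-state strings ("active", "dead") are all assumptions made for illustration. How the path states are collected from each host is left out on purpose.

```python
from typing import Dict, List

# Hypothetical snapshot of the cluster: host -> LUN -> list of path states.
# The state strings used here ("active", "dead") are illustrative assumptions.
PathStates = Dict[str, Dict[str, List[str]]]


def luns_all_paths_down(cluster: PathStates) -> Dict[str, List[str]]:
    """Return {host: [luns]} for every LUN whose paths are all dead.

    A LUN with no paths listed at all is also treated as down.
    """
    affected: Dict[str, List[str]] = {}
    for host, luns in cluster.items():
        down = [lun for lun, states in luns.items()
                if all(state == "dead" for state in states)]
        if down:
            affected[host] = sorted(down)
    return affected


def safe_to_rescan(cluster: PathStates) -> bool:
    """Only consider a rescan when no LUN anywhere in the cluster is all-paths-down."""
    return not luns_all_paths_down(cluster)


if __name__ == "__main__":
    # Made-up host and LUN names, purely for demonstration.
    cluster = {
        "esx01": {"lun_a": ["active", "active"], "lun_b": ["dead", "dead"]},
        "esx02": {"lun_a": ["active", "dead"]},
    }
    print(luns_all_paths_down(cluster))
    print(safe_to_rescan(cluster))
```

The check is cluster-wide rather than per host because, per the KB description above, a single LUN in an all-paths-down state can affect VMs on other LUNs once a rescan is issued.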

Since we were receiving the outage emails, we knew that something was up with our storage. This allowed us to work closely with our storage admins to understand the full extent of the outage.

The details of our recovery go something like this: a fail-over from Node A to Node B was made during the outage. However, Node B did not have access to the failed loop, so the aggregate on that loop was down. Node B carried the load for all other working aggregates, giving our storage guys and the NetApp technician time to work on the loop. When repairs were completed, a fail-back (giveback) was done to allow Node A to take back its share of the load. We confirmed with the NetApp tech that all paths and LUNs were presented to the ESX hosts. We then rescanned each ESX host in the cluster so that the hosts would recognize the downed LUNs once more. After the rescan, we viewed the properties of each LUN to ensure all paths were up. Once that was verified, we QA’d our VMs. Of 45 affected VMs, we had one casualty: the VMX file of one VM was corrupted.

The situation could have been worse, much worse. But I’m very glad that we stayed smart and calm and worked closely with our storage admins.


About Sam Aaron

Father, Husband, Geek. Workaholic.

Email : mail@micronauts.us

Website : http://micronauts.us


Father. Husband. Geek. Workaholic. US Marine Corps Veteran.

Sam Aaron is a Senior Consultant in the Professional Services Organization for Entelligence, bringing over a decade of expertise in enterprise cloud automation and infrastructure. Sam has spent almost eleven years at VMware leading cloud automation initiatives using VCF Automation (formerly Aria Automation & vRA) and designing scalable, multi-tenant environments with VMware Cloud Director (vCD).

Sam holds multiple certifications including VCF-Architect 2024, VCIX-CMA, and dual VCPs (DCV & CMA), and is a recognized contributor to VMware’s certification exams. As a VMware Hands-On Lab (HOL) Captain and content author from 2015-2025, Sam played a key role in educating and mentoring the global VMware community. He helped to create and develop the automation challenge and troubleshooting labs for VMworld and global virtual forums.

When Sam is not working, he has several hobbies, among these are 3D printing Star Wars robots and turning them into animatronics.

Launched in April 2010, micronauts is Sam's online presence. Here, he has been blogging and sharing knowledge with the virtualization community. This blog acts as a central repository to retain the resolutions and other trivial knowledge that Sam has discovered.

** No information provided here was reviewed or endorsed by VMware by Broadcom, Microsoft, or anyone else for that matter. All information here are opinions based on Sam's personal experience. Use this knowledge at your own risk. **
