Split brain in ESXi occurs when a member of an ESXi HA cluster is unable to determine its HA Master/Slave role. Basically, the HA Master election process fails. The most common scenario is a two-node HA cluster where one host loses its mind and can’t communicate with the other, fully functional, host.
I dealt with this again just last night.
It started with a customer contacting us about ‘slow performance’ on a couple of their virtual machines. Upon logging in, I discovered lots of failover activity. Half an hour later, most of the HA operations failed due to a loss of connection to both of the separate management (and VMkernel) networks. All VMs remained up on both hosts.
As a reminder, the recommended configuration is almost always to allow ESXi to automatically shut down all VMs in response to a host isolation event. The thing is, the ESXi host KNOWS it is isolated. What it doesn’t know is whether it’s the last remaining host. Shutting down the VMs releases their storage locks on the SAN, so another host can start them. Instead of waiting on HA, which can’t do its job due to the loss of connectivity, all you need to do is power on the VMs on the other host. Easy.
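To make that reasoning concrete, here is a minimal sketch of the isolation decision in plain Python. It is a toy model, not VMware code: the `HostView` class, its fields, the two policy names and the VM names are all made up for illustration, and real HA tracks far more state than this. The assumption is simply that a host calls itself isolated when it sees no peer heartbeats and can’t reach its isolation address.

```python
from dataclasses import dataclass
from enum import Enum


class IsolationResponse(Enum):
    # Hypothetical policy names for this sketch only.
    LEAVE_POWERED_ON = "leave powered on"
    SHUT_DOWN = "shut down VMs"


@dataclass
class HostView:
    """What one host can observe about itself (hypothetical model)."""
    name: str
    sees_peer_heartbeats: bool       # heartbeats from other cluster members
    reaches_isolation_address: bool  # assumed check, e.g. a ping to the gateway
    running_vms: list


def is_isolated(host: HostView) -> bool:
    # The host can tell it is cut off, but not whether its peers are alive.
    return not host.sees_peer_heartbeats and not host.reaches_isolation_address


def apply_isolation_response(host: HostView, policy: IsolationResponse):
    """Return (VMs still running here, VMs whose storage locks are released)."""
    if is_isolated(host) and policy is IsolationResponse.SHUT_DOWN:
        return [], list(host.running_vms)
    return list(host.running_vms), []


# An isolated host with two VMs, under each policy.
isolated_host = HostView("esx-b", False, False, ["web01", "db01"])
kept, released = apply_isolation_response(isolated_host, IsolationResponse.SHUT_DOWN)
stuck, none_released = apply_isolation_response(
    isolated_host, IsolationResponse.LEAVE_POWERED_ON
)
```

With the shut-down policy, `released` holds both VMs and the other host is free to power them on. With the leave-powered-on policy, nothing is released and the isolated host keeps holding the locks.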
This did not happen last night. All VMs remained running on both ESXi hosts. Normally, you’d want to start driving to the data center (unless you have OOBM access, which I did not). In this scenario, the main concern was to rescue a mission-critical web server that failed HA and locked up.
After trying SSH, vSphere, vCenter and every other connection in the book, I was desperate to get that server restarted ASAP.
So what is the ‘other way’ of recovering from Split Brain?
Kill the storage!
This is never recommended. But, it works if you don’t care about data loss.
This ‘method’ is akin to pulling the power plug on the ESXi host, which in a real-world scenario is slightly more significant than pulling the plug on a physical, non-virtualized server. The reason is that you risk data loss not only inside the VMs but also in the configuration pieces of each VM: the .vmx file, the whole business. But you have backups and hourly SAN snapshots, right? Should you worry about it?
Admittedly, I did not have the courage to yank the storage away from the host and elected to make a trip to the data center. But hey, if all you have are headless web servers, this could be an easy (unsupported and dangerous) way of dealing with the Split Brain problem quickly.
Jacob R, PEI