This post was originally finished on May 4th, 2010. It was never posted, and I was never allowed to share the fact that this issue happened, or try to warn the community about it. It’s been over three years, so I’m going to take a few of these posts out of the drafts folder and put them back on the site.
First a disclaimer: this is the most nervous I’ve been putting together a post. For some reason, I’m more than happy to share my opinion, philosophy or viewpoint with anyone and everyone, but the thought of smart people picking through the details of a technical post fills me with stage fright. There’s probably no reason for this irrational fear, but I figured I’d share before I dove into my first real technical post on this blog.
It’s also worth noting that I’m happy to go solo on the business-facing posts, but I definitely enlisted help from the team for this one. They vetted and fact-checked all of this information, so if I’m not talented enough to relay the information correctly, blame me and not them!
You know the saying that bad things come in threes? Well, it definitely happened to us a couple of weeks back. First, we had a Cisco 2960 access switch fail that was providing the 1Gbps ports for the management (console and VMkernel) networks that support our Enterprise Cloud clusters in one of our three nodes. This is something we don’t like to see, but it’s something that the network design plans for: normally, the switch fails, everything reconverges, problem over. On the management network in particular, the HA agents throw a little fit, especially if one of the hosts thinks it’s isolated, but generally it’s no big deal and certainly not customer-impacting. In this case, however, the switch FAILED.
In a good demonstration of the worst-case scenario that can arise in a multi-switch, multi-trunk, redundant network environment, the switch decided that it was going to shut down all of the individual port ASICs and essentially turn itself into a hub. Yeah, all of you network engineers out there just cringed. The spanning tree keep-alives did what they were supposed to, and the resulting broadcast storm and slew of err-disable conditions on every trunk that the failed switch could reach turned an unfortunate situation into a bad, bad night. Network Engineering identified the issue, cleaned up the trunk port connections, and we got everything back up and running. Customer VMs never rebooted, and as soon as the ports were re-enabled everything on the customer side came back up.
So here’s where bad thing #2 comes in, and it’s the core of what I want to share with the group. We have three different kinds of clusters in this particular environment: Legacy (HP G5 servers, ESX 3.5 U5), Production (HP G6 servers, vSphere 4) and Internal (Dell R900, vSphere 4U1). Each of these clusters reacted very differently to both the loss of connectivity to the rest of the servers in the cluster and the broadcast storm from the failed switch. The Legacy cluster needed some cleanup on the HA agents, which was to be expected, but other than that we had no impact whatsoever. The Internal cluster was even less impacted, with even the HA agents recovering on their own. The Production cluster? Yeah, it didn’t do so well.
The first thing we noticed was that even after the network connectivity was restored, half of the hosts were showing as disconnected in vCenter and all of their associated VMs showed as invalid. When we’d try to SSH into the hosts, we got no response at all. Using a console connection we were able to log in, but as soon as any command was run the host would go completely unresponsive. Now, the customer VMs connected to the front-end networks were fine, and nothing, to this point, had impacted them once the network came back. The management network, however, was well and truly wrecked. Almost immediately after the network came back up, the third part of our triple-whammy showed up and one of the hosts crashed completely, throwing a PSOD and requiring a manual reboot. Once the host came back up, everything on it was back to normal! We opened a priority case with VMware, and after about 6 hours of troubleshooting we were told something that no one on my team had ever heard from them: they were telling us that we needed to reboot the hosts. Wow.
Because it’s a live, production environment we had been communicating with our customers the whole time. We notified them of the situation and planned for a maintenance window after the close of business to minimize the impact the reboots (still amazes me that it came to that!) would have on their environments. While we waited for the maintenance window, two more of the hosts threw a PSOD and needed to be rebooted, and in both cases the reboot immediately cleared all of the management network issues. When we finally got into the window, we rebooted the remainder of the hosts in coordination with our customers, and got everything cleared up, working smoothly and back to normal. Now the fun begins.
In conjunction with VMware we discovered through some pretty thorough log analysis and lab re-creation that we actually ran into two separate issues during the event. The first was that, as a result of the broadcast storm, the vmklinux appears to have failed to request memory and drove a wedge (VMware’s term) between the vNIC and the kernel, rendering the network stack completely unusable. The Broadcom NIC driver stack actually removed itself from memory! Once this condition occurs the host hangs and actually stops logging, so there are big gaps in the logs, making the post-mortem more difficult. In fact, on each host the final log entry before the gap reads “VMNIX: WARNING: NetCos: 1086: virtual HW appears wedged (bug number 90831)” and at that point all of the SCSI devices went into a looped rescan state. This rescan loop started eating more and more memory space, and once the vmklinux ran out of heap memory a PSOD was generated and the host died, which was the second issue. For the hosts that have hung but haven’t crashed, there is no recourse for resolving the issue other than a hard reboot. Trying to run a “service mgmt-vmware restart” or an “esxcfg-vswitch -U vmnic0 vSwitch0” command resulted in the console hanging and the host becoming completely unmanageable by any method.
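If you want to check your own hosts for the same symptom, here is a minimal sketch of scanning saved log lines for the wedge warning quoted above. The function name and the sample lines are hypothetical; the only thing taken from our logs is the warning text itself, and the exact line format on your hosts may differ.

```python
# Text quoted from the final log entry we saw before each gap.
WEDGE_MARKER = "virtual HW appears wedged"


def find_wedge_lines(lines):
    """Return (line_number, text) for every line containing the wedge warning."""
    hits = []
    for number, text in enumerate(lines, start=1):
        if WEDGE_MARKER in text:
            hits.append((number, text.rstrip("\n")))
    return hits


if __name__ == "__main__":
    # Made-up sample: one filler line followed by the warning from the post.
    sample = [
        "an earlier, unrelated log line\n",
        "VMNIX: WARNING: NetCos: 1086: virtual HW appears wedged (bug number 90831)\n",
    ]
    for number, text in find_wedge_lines(sample):
        print(f"line {number}: {text}")
```

Since the hosts stop logging once they wedge, a hit on the last line of a log followed by a long silence is the pattern we saw; a hit alone is only a hint, not a diagnosis.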
From a resolution standpoint we have a ways to go. We know that there is a patch in vSphere 4U1 that will mitigate the PSOD issue, meaning that the host will still be unmanageable and the network wedge will still be in place, but the hosts won’t crash. In our case that would mean we could have scheduled the reboots of the servers with our customers, which would have been preferable. As for the vmklinux memory issue, there’s no fix for it yet. We were able to reproduce the issue fairly easily by creating a spanning tree loop on a pair of switches and watching the resulting broadcast storm kill the hosts. We have provided serial-line logging to VMware per their request and they have identified another customer who is experiencing the same issue. VMware expects that once everything is analyzed they’ll need to put out some new code to resolve the issue. Wonderful, right? I hate being first.
On the commentary side, there were good things that came out of this issue and things that will need to be reconsidered going forward. Obviously my team was outstanding. Working straight through for 30+ hours to make sure all of the customers were taken care of was the least of their contributions to both the initial triage and the final resolution. All technology, no matter how advanced, has a core of people who make it work to its fullest capacity and our group is magnificent. They make me look good, and that’s no small feat.
The fact that all of this happened because of a broadcast storm is an issue for me. In a multi-tenant, multi-VLAN redundant network the risk of looping while adding a cross-connect is always present, and we (and I would imagine ALL service providers) have extensive policies around how, when and by whom physical changes are made to those environments. Switches that fail OPEN are another issue, and we’ll push Cisco to give us as much information there as they can, since that’s another first for me. Obviously there are some choices to be made on the network side. One option is to not allow the keep-alive checks to shut down inter-switch trunk ports when they detect a loop. We can use storm suppression to allow the switches to attempt to handle the storm internally without attempting to isolate the source. There’s a tradeoff here, since the storm suppression will prevent new ARP entries from being handled and we can’t be sure that the sheer volume of packets won’t bring the switches down anyway! There’s no real downside to that compared to the err-disables shutting things down, however, and it does introduce the possibility that the environment will stay up until we can find and remove the offending configuration/cross-connect/device. It’s not a resolution, but it does give us more options, which is a good thing.
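To make the ARP tradeoff concrete, here is a toy model of storm suppression. It is only a sketch: real switches do this in hardware with their own thresholds and measurement windows, and the function name, the per-interval frame limit and the frame labels below are all made up for illustration. The point is that a suppression threshold cannot tell storm traffic from legitimate broadcasts, so anything arriving after the limit is hit gets dropped with the rest.

```python
def suppress_broadcasts(frames, limit_per_interval):
    """Forward at most `limit_per_interval` broadcast frames per interval.

    `frames` is a list of (interval, label) tuples in arrival order.
    Returns (forwarded, dropped) lists of labels.
    """
    seen = {}  # interval -> number of broadcasts counted so far
    forwarded, dropped = [], []
    for interval, label in frames:
        seen[interval] = seen.get(interval, 0) + 1
        if seen[interval] <= limit_per_interval:
            forwarded.append(label)
        else:
            # Over the threshold: dropped regardless of what the frame is.
            dropped.append(label)
    return forwarded, dropped


if __name__ == "__main__":
    # One interval: five storm frames, then a legitimate ARP request.
    frames = [(0, "storm")] * 5 + [(0, "arp-request")]
    forwarded, dropped = suppress_broadcasts(frames, limit_per_interval=3)
    print("forwarded:", forwarded)
    print("dropped:", dropped)
```

In this run the ARP request lands behind the storm frames and is dropped along with them, which is exactly why new ARP entries stop being learned while suppression is active.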
Consolidating multiple individual access switches into something stackable/chassis-based is also an option. Our upcoming migration to new Cisco Nexus 7010 10Gb switches will reduce the risk of looping just by reducing the number of devices that could ever cause one. It’s harder to loop a system that contains two devices than one that contains dozens! Overall, though, I don’t think there is an easy solution on the network side. VMware needs to patch this one quickly, and we’ll push them to do so. If any of you out there are running the same kind of environment and want to reproduce this issue in the lab, please leave your contact information in a comment and I’ll get you hooked up with the team and get you the instructions. I don’t know the scope of the systems affected, so if you want to test on your platforms you can decide whether you need to open a ticket with VMware on your own.
Whew. Not only was this my first technical post, it also clocks in at over 1,600 words. If you stuck around this long you deserve a reward! Leave me a comment below and I’ll see what I can do. Thanks for reading.
Note: one of the things that came out of the post-mortem of this issue with Cisco was that the device that failed was manufactured in Europe and had been purchased off the grey market as a used device. At some point, you place your bets and you take your chances. In hindsight, I would have preferred new switches purchased from Cisco with SmartNet support, but that’s what happens in a small start-up environment.