Hurricane, then Superstorm, Sandy certainly affected all of us on the east coast to some degree or another. Between friends, family and colleagues who were in harm’s way and travel plans that were disrupted, most of the people I know felt some impact from the storm.
There were many plotlines to the coverage of the storm as well. Global warming, on-scene reporters being called reckless with their live shots, news folks falling for Photoshopped mock-ups of scuba divers in the subway. It’s safe to say that any news-generating event is magnified when it includes New York City and New Jersey, and this one was no different.
There was also a lot of chatter in the circles I run in about data centers and how the storm was going to affect them. For those of you who don’t know, there’s a LOT of data center space in Manhattan where more than a dozen providers have facilities. Many of those were in the “Zone A” flood zone, and most of them were affected by power outages and by limited fuel availability caused by street closures.
From looking at the reports, it seems like one of the most common issues was an inability to pump fuel up to the generators, which were located many, many floors up in high-rise buildings. Most of these pumping facilities appear to be located on basement levels, and when those floors flooded the generators were stranded with the fuel they had access to locally. Some, like the Peer 1 facility at 75 Broad St., also saw the flooding contaminate primary fuel supplies, leaving only “day tanks” to supply the generators. The stories of customers spending days carrying fuel up flights of stairs to keep the generators running are pretty amazing.
There are, however, a few parts of the overall situation that bother me. First, how was it that despite being located in a known flood zone, data center operators allowed a single point of failure like the fuel pumps to be located in a basement? Sure, in hindsight it seems obvious, and the flooding was extraordinarily bad, but I’ll bet that there are a bunch of data center architects looking at ways to provide not just alternate pumps, but alternate ways to get fuel directly to the generators.
(Note: the picture to the right is the lobby of the Verizon data center at 140 West Street, taken Monday night. All five basement levels flooded during the storm, and 3 and 1/2 of them were still underwater four days later.)
Second, it appears that the effort to keep the Peer 1 data center online was instigated and to a large extent carried out by customers, led by Anthony Casalena, the founder and CEO of Squarespace, not by the data center operations team itself. Once the adrenaline wears off, my guess is that there will be some customers wondering why THEY had to save the day on behalf of the provider they are paying princely sums to. Robert Miggins, senior vice president of business development at Peer 1, is quoted as saying, “we wouldn’t have had the manpower there to actually bring the fuel up in time.” He added, “There’s a lot of good will, and there’s a lot of hard work and there’s a few lucky bounces for good measure.” I’m sure there have been some incredibly hard-working people from Peer 1 right there the whole time, but I don’t know how you take that statement as a good thing. In the Squarespace updates you can also see that Peer 1 is going to have to bring the generator down to replace the fuel filter, something that is easily avoided with a little pre-planning.
Finally, while there were many reports of data center operations teams sleeping in the facility over the duration of the storm, in most cases it appears those teams were locally based. To me, this is a huge operational no-no, and one that could definitely impact customers. Let me explain.
In a former life, I was the Director of Operations and Engineering for multiple free-standing data centers with around 50,000 sq. ft. of usable space in Charlotte, NC. The company I worked for grew from three facilities in three markets to 19 facilities in 10 markets over my 6 years there, with three of those markets being in Florida. Hurricane and storm planning was a very real part of our day-to-day processes. Early on, we identified that we needed to standardize both the equipment and processes as much as possible from facility to facility so that we weren’t dependent on the local teams to be able to run things efficiently, especially in the Florida markets. Why? Because we didn’t want the local staff working in the data center during a disaster.
One of the primary things I learned from running the data centers was that I never wanted to put an operations engineer in a position where he had to pick between his job and his family. Working in a critical facility role is inherently stressful all by itself; everything in the data center can be life-threatening. The best facilities engineers are the ones who are completely focused on the job at hand. In the case of a huge disaster like Sandy, every one of those engineers has friends and family who were also affected by the storm. How can they focus on the complex problems at the facility when they have kids at home with no power, or family with flooding and downed trees? They can’t, and it’s not fair to ask them to.
One of the plans we put in place (but thankfully never had to implement) during my time running data centers was that each facility had 3-5 designated engineers who were part of a disaster response team (DRT). If a significant event was forecast for a market that we had a data center in, we’d fly a team in 2-3 days early. Then the local market team would do a full handoff to the DRT, making sure that everyone was on the same page. Once that handoff was complete, we only asked one thing of the local team: go home and take care of your family. They knew the data center was in good hands, and we let them focus on the things that the company couldn’t. The ultimate winner there was the customer, who had a fully prepped, rested and non-distracted facilities team on-site with nothing else to focus on but making sure their equipment stayed powered and connected.
Maybe it just hasn’t gotten any press, but why wouldn’t Peer 1, Internap, or Equinix have a plan like this in place? They have 19 data centers worldwide and had a week’s notice to prepare, so why didn’t they have additional staffing on-site? Why were customers carrying fuel? What would have happened if one of them had slipped on the stairs and gotten hurt? I’m glad these customers were able to keep the lights on despite their provider, but this certainly isn’t an optimal resolution.
In the end, it’s all still about people. Whether it’s on the software side or the facilities end, it’s still people who build, test, operate and support the “clouds” that our workloads call home. When one of those steps fails, the cascading effects typically become visible right away, with far-reaching consequences.