Steve has spent the past 27 years exploring and improving data center reliability – how to calculate it, how to measure it, and how to investigate failures. Data center reliability is Steve’s second career. While earning Bachelor’s and Master’s degrees in Physics from MIT, plus a Master’s in Electrical Engineering and Computer Science, he spent his first 20 professional years building and operating nuclear fusion research facilities. The goal was to produce energy via nuclear fusion reactions, similar to what powers the sun. The promise was limitless, clean, safe energy. The challenge was formidable – heat a gas (plasma) to 100 million degrees, then hold it in a magnetic bottle for a second or so to allow the reaction to take place. Steve’s fusion career culminated as head of Engineering and Operations during the design, commissioning, and initial operation of the MIT Alcator C-MOD facility, which ran from 1991 to 2016.
Seeking a field with more immediate practical applications, Steve left MIT in 1994 and focused on the reliability of data center power and cooling systems. “I attended the Uninterruptible Uptime Users Group (UUUG) conferences, and soon realized a lot of people were very unhappy with the reliability of their data center’s mechanical and electrical plants (MEP). I listened for two years, then started asking questions. ‘What did you ideally want? What did you expect for MEP reliability?’ I learned pretty quickly that the mantra of ‘five nines equals 5 minutes of downtime per year’ was largely responsible for misperceptions about reliability. There is no such thing as a five-minute outage in a data center. The damage is done in the first few milliseconds, and recovery takes many hours or days, sometimes longer.”
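The arithmetic behind the "five nines" mantra Steve mentions is easy to check. This is a minimal illustrative sketch, not anything from MTech's actual methodology:

```python
# "Five nines" availability: the system is up 99.999% of the time.
availability = 0.99999

# Convert the unavailable fraction into minutes of downtime per year.
minutes_per_year = 365.25 * 24 * 60  # about 525,960 minutes
downtime_minutes = (1 - availability) * minutes_per_year

print(f"Allowed downtime at five nines: {downtime_minutes:.2f} minutes/year")
# Roughly 5 minutes per year - the number behind the mantra.
```

The point of the quote, of course, is that this average conceals the real failure profile: damage in milliseconds, recovery over hours or days.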
How I got into critical facility work:
I visited my first data center when I was a teen working on the Computers merit badge in the Boy Scouts. The L&N railroad had a very large (for the time) mainframe data center in Louisville, Kentucky, where I grew up.
Twenty-five years later I started learning about the data center industry, and I was surprised that reliability calculations were not used. The techniques to do these kinds of calculations were well developed, and I had encountered them in my nuclear fusion career. When I joined MTech in 1997, my first two projects were calculating the reliability of a proposed data center that would use fuel cells as the primary power source, and investigating a static switch failure that had brought down a large telecom facility.
The technique MTech prefers for calculating reliability and availability is called Fault-Tree Analysis. There are other methods, such as Reliability Block Diagrams and Monte Carlo simulation, but I prefer fault trees because they force analysts and clients to think carefully about how the facility succeeds and fails. Fault trees also produce a rich output that can be of great value when making difficult choices about redundancy, costs, complexity, space, and schedule. Our work resolves a lot of disputes because we deliver engineering calculations, not opinions or judgments.
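A fault tree works from a top event (say, loss of power to the IT load) down through AND/OR gates to basic component failures. As a minimal sketch of the idea – the topology and probabilities below are invented for illustration and assume independent events, nothing like the detail of a real analysis:

```python
# Minimal fault-tree evaluation for a hypothetical top event:
# "loss of power to the load." All probabilities are illustrative
# assumptions, not drawn from any real study. Basic events are
# assumed independent.

def p_and(*ps):
    """AND gate: the event occurs only if ALL inputs fail (product)."""
    result = 1.0
    for p in ps:
        result *= p
    return result

def p_or(*ps):
    """OR gate: the event occurs if ANY input fails (1 - product of survivals)."""
    survive = 1.0
    for p in ps:
        survive *= 1.0 - p
    return 1.0 - survive

# Assumed basic-event probabilities over some mission time:
p_utility = 0.01      # utility power outage
p_generator = 0.05    # standby generator fails to start or run
p_ups = 0.001         # UPS fails to ride through the transfer

# Top event: the load is lost if the UPS fails, OR if the utility
# AND the generator both fail.
p_top = p_or(p_ups, p_and(p_utility, p_generator))
print(f"P(loss of load) = {p_top:.6f}")
```

Even this toy tree shows the kind of output Steve describes: each gate makes explicit which combinations of failures matter, so redundancy trade-offs become calculations rather than opinions.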
We apply similar mathematical rigor when calculating availability, performing Failure Modes and Effects Analyses (FMEAs), and modeling the risks and rewards of maintenance interventions. While the potential market for our services is huge, I’ve deliberately kept MTech small because I want to continue to be able to do some of the technical work myself.
How I got involved with 7×24 Exchange:
I attended the 1994 UUUG conference. After listening to presentations and talking with members for two years, I gave a presentation at the 1996 conference in Boston. I introduced the concepts of reliability and risk, explained that they could be calculated, and then reviewed cases I had investigated where human error was a primary cause. UUUG became the 7×24 Exchange in 1997.
My 1998 presentation on the causes of the Challenger space shuttle accident made the point that poor communication and terrible graphics derailed an all-night debate, the night before the launch, about the effect of cold on the O-rings – the very failure that doomed the vehicle. The decision makers got it wrong, seven people died, a billion-dollar vehicle was destroyed, and the US space program was hobbled for years. That proved to be an extremely popular presentation, and I gave it many times at chapter meetings. When Columbia was destroyed in 2003, Bob Cassiliano asked me to present a failure analysis at the next 7×24 Exchange conference.
I’ve always believed that a more rigorous, engineered approach to data center reliability is key to advancing our industry. I teamed with MIT professor Mike Golay and gave a course called “Real Availability” the Sunday before several conferences. Our number one feedback was “Too much math!”
The 7×24 Exchange has consistently been the best conference and best organization for advancing the industry, increasing knowledge, and forging key interpersonal connections. I don’t advertise MTech in trade magazines or elsewhere; our presence at 7×24 Exchange conferences is more than enough. I’ve been privileged to work with fantastic clients, many of whom I’ve met through the 7×24 Exchange, and now I am fortunate to count some of them as friends.