By Jorge A. Gil
Data Center Depth of Impact and Causes of Downtime
The Downtime Severity Levels (DSL) system is a straightforward approach for categorizing the state of a Data Center facility during maintenance, failures, and non-inherent incidents. The method considers two dimensions of downtime:
1. Depth of Impact
DSL is tied to the “Depth of Impact”, which can also be read as the “Depth of Failure” of the Data Center. Broadly, downtimes inherent to the Data Center are caused by two kinds of situations: Failure events and Maintenance events. Events that are non-inherent to the Data Center’s mission may be “External Events” or “Non-inherent Internal Events”. From the vantage point of the Data Center’s mission-critical objective, every downtime event could be considered a “Failure Event”, since each one prevents the “7×24” business objective.
The Depth of Impact of a Data Center downtime is the extent of the effect caused by the failure, the maintenance, or any other internal/external event. It follows from the hierarchy of Data Center Systems and Structures. Critical Systems are conceptually composed of Capacity Components and Distribution Paths. The same applies to rooms. Data Centers also have purpose-built structures/rooms that may contain components and distribution paths from different systems. Structures/rooms are considered part of the building and Site.
Depth of Impact Levels:
Component Level. A downtime affecting only Critical Components, e.g. UPS, CRAC, Chiller, Switchboard, etc. It usually means that at least one critical component is impacted without shutting down the whole related system. In that case, other Components or Mechanisms must exist to support the load of the off-line Component(s).
System Level. A downtime event impacting a particular Critical System, e.g. Electrical System, Mechanical System, etc. In that scenario, other parallel Systems must exist to support the load.
Critical Room Level. A Critical Room downtime means that the IT Load can no longer be supported, shutting down the IT Load delivery service. It can be a partial or a full Critical Room downtime. A partial Critical Room downtime affects only a specific area (groups of IT Racks) of the total Critical Room area.
Site Level. A downtime at Site Level means that an important area, or the full Site, could be compromised or damaged, potentially causing health- or life-threatening situations or unsafe conditions. It might impact the IT Load (Critical Rooms), but not necessarily.
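The four levels above can be read as a hierarchy running from the narrowest to the widest impact. The following is a minimal Python sketch of that reading. The numeric values, the ordering they imply, and the helper name widest_impact are illustrative assumptions, not part of DSL itself.

```python
from enum import IntEnum


class DepthOfImpact(IntEnum):
    """Depth of Impact levels in the order the article lists them.

    The integer values are an assumption, used only so levels can be compared.
    """
    COMPONENT = 1      # e.g. UPS, CRAC, chiller, switchboard
    SYSTEM = 2         # e.g. electrical system, mechanical system
    CRITICAL_ROOM = 3  # IT load can no longer be supported (partial or full)
    SITE = 4           # an important area or the full site is compromised


def widest_impact(levels):
    """Return the widest Depth of Impact among the levels an event touched."""
    return max(levels)


if __name__ == "__main__":
    touched = [DepthOfImpact.COMPONENT, DepthOfImpact.SYSTEM]
    print(widest_impact(touched).name)
```

Treating Site as "wider" than Critical Room is a simplification: as noted above, a Site Level downtime does not necessarily impact the IT Load.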
2. Causes of Downtime (Categories)
The other dimension considered by DSL is the cause of downtime. The DSL system groups Data Center downtimes into five categories.
Categories of Causes of Downtime:
Planned/Preventive Maintenance. Downtime required to carry out Planned Preventive Maintenance (PM, PdM, RCM).
Depth of Impact at:
Failure / Corrective Maintenance. Downtime caused by Inherent Failures, which must be followed by Corrective Maintenance. The failure may be an inherent failure, a human error, a cascade failure, etc.
Depth of Impact at:
Critical Event. Downtime usually caused by cascading Component/System failures that result in an Outage. Human errors at Critical Room level, accidental EPO activation, and any other inherent events capable of causing Outages (IT service delivery shutdown) are considered Critical Event downtimes.
Depth of Impact at:
Catastrophic Event. An incident caused by Inherent Failures or by Internal/External Non-inherent Events that triggers a Catastrophic condition without impacting the Critical Room. Examples: a diesel spill, a fire under control, an external hazmat spill, a disaster-type event, etc. Any potentially Catastrophic Event, regardless of the structure/system where it occurred, must be considered a Catastrophic Event with Site Level impact. Catastrophic Events can be potentially life-threatening situations.
Depth of Impact at:
Catastrophic Failure. Typically the escalation of a Catastrophic Event to the point where it impacts the availability of the Critical Room/IT Load service.
Depth of Impact at:
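The five categories, and the escalation from Catastrophic Event to Catastrophic Failure described above, can be sketched in Python as follows. The DowntimeEvent record, its field names, and the reclassify helper are hypothetical names introduced for illustration. Only the five category names and the escalation rule come from the text.

```python
from dataclasses import dataclass, replace
from enum import Enum


class Cause(Enum):
    """The five DSL categories of causes of downtime."""
    PLANNED_MAINTENANCE = "Planned/Preventive Maintenance"
    CORRECTIVE_MAINTENANCE = "Failure / Corrective Maintenance"
    CRITICAL_EVENT = "Critical Event"
    CATASTROPHIC_EVENT = "Catastrophic Event"
    CATASTROPHIC_FAILURE = "Catastrophic Failure"


@dataclass(frozen=True)
class DowntimeEvent:
    """Hypothetical record of a downtime event (field names are assumptions)."""
    description: str
    cause: Cause
    critical_room_impacted: bool


def reclassify(event):
    """Escalate a Catastrophic Event that has reached the Critical Room.

    Other events are returned unchanged.
    """
    if event.cause is Cause.CATASTROPHIC_EVENT and event.critical_room_impacted:
        return replace(event, cause=Cause.CATASTROPHIC_FAILURE)
    return event


if __name__ == "__main__":
    spill = DowntimeEvent("diesel spill", Cause.CATASTROPHIC_EVENT, False)
    print(reclassify(spill).cause.value)
    escalated = replace(spill, critical_room_impacted=True)
    print(reclassify(escalated).cause.value)
```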
THE DOWNTIME SEVERITY LEVELS (DSL)
The DSL is a metric for internal use in Data Center operations. It was designed as a communication tool that conveys the severity of any incident/event causing downtime to a component, system, or structure of a Data Center, together with the appropriate level of required response. DSL can also be used to analyze downtime events in order to understand the chain of impact and the escalation and de-escalation process. It accounts for and categorizes inherent and non-inherent events that could impact the availability of Data Center/Telco facilities.
DSL consists of seven levels of severity for Data Center downtimes (Figure 3).
The seven DSL levels are a simple way to communicate the severity of downtimes. The Severity assigned to a downtime conveys its Depth of Impact on Data Center operations.
Component-level downtimes, for example, can have two different levels of Severity depending on the situation they create. Preventive Maintenance work on a critical component produces a controlled shutdown. That kind of downtime creates a Vulnerable Condition for the Data Center, since redundant equipment must be used to keep the Data Center operating. In that case, it is a Severity 1 downtime (DSL Severity 1). However, if the Data Center depends on that sole piece of critical equipment (a non-redundant component), the result is a Severity 5 downtime (DSL Severity 5), because shutting the component down produces a Critical Downtime or Outage in the Critical Room. The Severity level (DSL) therefore reflects not only the cause of the downtime, but also the level of impact on Data Center operations.
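The component-level example above reduces to a single decision, sketched here in Python. The function name and the has_redundancy parameter are assumptions made for illustration. The sketch encodes only the two cases the text states and does not attempt to reproduce the full seven-level scale.

```python
def component_maintenance_severity(has_redundancy):
    """Severity of Preventive Maintenance on one critical component.

    Returns a (severity, resulting condition) pair for the two stated cases:
    - redundant equipment carries the load -> Severity 1, Vulnerable Condition
    - the component is the sole one        -> Severity 5, Critical Downtime/Outage
    """
    if has_redundancy:
        return 1, "Vulnerable Condition"
    return 5, "Critical Downtime or Outage"


if __name__ == "__main__":
    print(component_maintenance_severity(True))
    print(component_maintenance_severity(False))
```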
DSL Severity addresses where the failure/event is “contained”, as shown in figure 4. Nonetheless, if inherent failures/events are not properly addressed, they may end up causing an Outage or Critical Downtime in the Critical Rooms. Non-inherent events, on the other hand, have an actual or potential Site-level impact. If a Site-level event is not contained, it may also end up causing an Outage/Critical Downtime in the Critical Rooms. In that case it is a Severity 7 downtime event, designated a Catastrophic Failure, since it produces an Emergency Site Shutdown (either manual or automatic). DSL also describes the Overall Site Condition by defining five Site Condition levels associated with the Causes of Downtime categories. Beginning with Normal Operating Condition (Full Operational Capacity), the downtime Site Conditions are defined as:
This helps convey and reinforce to the staff the consequential relationship between a downtime and Overall Site Operations, with the intention of keeping everyone aligned with the Data Center mission objective.
Finally, figure 6 shows the relationship between several simulated downtime events, different Data Center topologies according to their criticality levels, and the resulting Severity level (DSL). It demonstrates that the severity level of any downtime event is directly related to the resiliency level of the Data Center. Basic Data Centers, which have non-redundant components, have few or no alternatives for avoiding a Critical Downtime (Severity 5 downtime) when performing Preventive Maintenance at Component level. By contrast, Fault-tolerant Data Centers should be able to contain most inherent downtime events, and even some catastrophic events, depending on the site design configuration. The DSL approach can also help select the level of resiliency and design configuration for a future Data Center/Telco facility, by simulating different scenarios with inherent and non-inherent events.
As figure 6 shows, DSL can be used effectively to communicate the severity of impact of any component/system/structure downtime: where the failure occurred, what damage it caused to Data Center operations, and how it should be treated in order to restore the Site from the failure event to Normal Operating Conditions.
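A scenario simulation of this kind might be sketched as follows for a single case, Preventive Maintenance on one component. The component counts per topology are hypothetical figures chosen for illustration and are not taken from figure 6. The capacity check (remaining units must still cover the load) is an assumed reading of the redundancy requirement, and only Severity 1 and Severity 5 are modeled.

```python
def maintenance_severity(needed, installed, offline=1):
    """Severity of taking `offline` components out of service for maintenance.

    needed:    components required to carry the load (assumed figure)
    installed: components actually installed (assumed figure)
    """
    remaining = installed - offline
    # Load still covered: Vulnerable Condition (Severity 1).
    # Load not covered: Critical Downtime (Severity 5).
    return 1 if remaining >= needed else 5


# Hypothetical topologies as (needed, installed) pairs.
TOPOLOGIES = {
    "Basic": (2, 2),           # no redundant components
    "Fault tolerant": (2, 4),  # assumed fully duplicated capacity
}


def simulate(topologies=TOPOLOGIES):
    """Return the resulting Severity per topology for one component offline."""
    return {
        name: maintenance_severity(needed, installed)
        for name, (needed, installed) in topologies.items()
    }


if __name__ == "__main__":
    print(simulate())  # {'Basic': 5, 'Fault tolerant': 1}
```

Changing the counts or the number of components taken offline shows how the same maintenance task moves between Severity 1 and Severity 5 as the design's redundancy changes.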
Jorge A. Gil, DCEP is Principal of SERES, LLC. He can be reached at [email protected].