Data center risk assessment: A decision-making tool

Risk management is a key aspect of data centers because of their mission-critical nature and our increasing reliance on them. Risk assessment helps us understand the risks that data centers are exposed to, enabling strategies and decision-making based on the risk level.

Omdia highlights the importance of data center risk assessment as a decision-making tool to reduce downtime. Risk assessment contributes to understanding the risks that the data center is exposed to, and comprises risk identification, analysis of risk level, and evaluation of risk tolerance. It also helps communicate the risks to stakeholders and develop mitigation strategies.

In general terms, the impact of data center failures is major because they may result in a lack of access to information. Data centers are exposed to a variety of natural, technological, and man-made threats.

Enterprises have decisions to make about risk management strategies. First, data center risk assessment must be considered as part of data center operations. Next, decisions about human and technological resources need to be made and executed. Finally, the strategy must consider a holistic view of the organization.

The Project Management Institute defines risk as an uncertain or unforeseen event that, if it occurs, will impact a project’s outcomes. Risk is linked to present and past events, hence the importance of understanding and tracking them to better predict the future. Omdia defines risk as the product of the probability that an event occurs and its impact.
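As a minimal sketch of the probability-times-impact definition, the Python snippet below computes a risk level for a single event. The function name and the figures are hypothetical and chosen only for illustration; when impact is expressed as a cost, the result can be read as an expected loss over the period the probability refers to.

```python
def risk_level(probability: float, impact: float) -> float:
    """Risk as the probability of occurrence multiplied by the impact."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    if impact < 0:
        raise ValueError("impact must be non-negative")
    return probability * impact


# Hypothetical figures: a 2% chance per year of an outage costing USD 500,000.
annual_outage_probability = 0.02
outage_cost_usd = 500_000
expected_annual_loss = risk_level(annual_outage_probability, outage_cost_usd)
print(f"Expected annual loss: USD {expected_annual_loss:,.0f}")
```

The same score can be lowered either by reducing the probability (prevention) or by reducing the impact (mitigation and recovery).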

Failures are very expensive. Depending on the industry, the cost of downtime can be significant because access to information is lost. Data centers are exposed to a variety of natural, technological, and man-made threats. The most common threats are those associated with the specific location. These include natural disasters and hazards from adjacent properties, as well as limited access to reliable power, high-speed telecommunication networks, and water service. Data centers can also be targets of malevolent intruders or hackers attempting to breach physical or data security. Operational and human factors also entail substantial risk, which automation has helped to mitigate.

Reliability is the probability that an item will perform its intended function for a specified period of time under stated conditions. The goal is to reduce the probability of events leading to failures. The data center’s reliability is the complement of its probability of failure: reliability equals one minus the probability of failure.

One way to increase reliability is to have the same function performed by two or more elements arranged in parallel. This is called redundancy, a widely implemented data center strategy intended to eliminate single points of failure (SPOF). Structural redundancy uses more components for the same purpose. Examples include dual UPS, dual circuit breakers, a reserve water pump, a spare generator, or a reserve electric line.
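To illustrate why parallel elements raise reliability, the sketch below uses the standard textbook formula for a parallel arrangement: the system fails only if every element fails. It assumes the elements fail independently, which real installations with shared causes of failure may not satisfy. The function name and the 0.95 figure are hypothetical.

```python
from math import prod
from typing import Iterable


def parallel_reliability(reliabilities: Iterable[float]) -> float:
    """Reliability of elements in parallel, assuming independent failures.

    Each element's failure probability is 1 - reliability; the parallel
    system fails only when all elements fail at once.
    """
    return 1.0 - prod(1.0 - r for r in reliabilities)


# Hypothetical figure: one UPS with a reliability of 0.95 over the period.
single_ups = parallel_reliability([0.95])
dual_ups = parallel_reliability([0.95, 0.95])
print(f"Single UPS: {single_ups:.4f}, dual UPS: {dual_ups:.4f}")
```

With these illustrative numbers, adding a second unit cuts the probability of failure from 5% to roughly 0.25%.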

In addition, redundancy can be active or standby. In active redundancy, the parallel components work or are loaded simultaneously. In standby redundancy, only one component is active, and the redundant ones are switched on when the active element fails. The advantage of standby redundancy is that only one component is loaded and exposed to wear or other kinds of deterioration. A disadvantage is that such arrangements usually require a switch or similar item, which increases the costs and may contribute to the unreliability of the system.
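The effect of the switch can be sketched with a deliberately simplified model. The snippet assumes two units with fixed reliabilities over the period, independent failures, and a switch that works with a given probability when called upon. It ignores the reduced wear on the idle unit, which is the main advantage of standby redundancy, so it shows only the switch penalty. All names and numbers are hypothetical.

```python
def active_reliability(r_primary: float, r_secondary: float) -> float:
    """Both units loaded in parallel; the system fails only if both fail."""
    return 1.0 - (1.0 - r_primary) * (1.0 - r_secondary)


def standby_reliability(r_primary: float, r_secondary: float, r_switch: float) -> float:
    """Primary runs alone; on failure, the switch must work and so must the standby unit."""
    return r_primary + (1.0 - r_primary) * r_switch * r_secondary


# Hypothetical figures: two units at 0.95 and a switch that works 98% of the time.
print(f"Active:  {active_reliability(0.95, 0.95):.5f}")
print(f"Standby: {standby_reliability(0.95, 0.95, 0.98):.5f}")
```

Under these assumptions, a perfect switch makes the two arrangements equivalent, and any switch unreliability pulls the standby arrangement below the active one.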

Risk assessment is a systematic, step-by-step approach for evaluating risk. It comprises the identification, analysis, and evaluation of data center threats. By investing resources to assess data center risks, stakeholders can utilize the most effective solutions to reduce downtime and operational costs and improve data security and integrity.

There are no absolute rules as to how to perform risk assessment, so it is up to the organization to decide the scope and depth of the analysis. Figure 1 describes a proposed framework for data center risk assessment. The first step is to identify all potential threats associated with the data center and assign them weights according to relevance. This is a cyclical process as new threats can appear throughout the data center lifecycle.

The second step is risk analysis to determine the risk level, where we quantify the probability of occurrence of the events and their impact. The reliability of the data center infrastructure needs to be factored in to estimate impact.

Risk tolerance varies across different people and organizations. The third step is risk evaluation, which benchmarks the estimated risk against an acceptable level that depends on the organization's risk appetite. A holistic view of the organization is needed, as each one is different and has its own needs and constraints.
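The identification, analysis, and evaluation steps can be sketched as a small risk register. This is an illustration under stated assumptions rather than the framework itself: the `Threat` fields, the choice to multiply the relevance weight into the risk level, the threat names, and every number are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Threat:
    name: str
    weight: float       # step 1: relevance weight (hypothetical 0..1 scale)
    probability: float  # step 2: probability of occurrence per year
    impact: float       # step 2: consequence if it occurs, e.g. cost in USD

    def risk_level(self) -> float:
        """Weighted risk level: relevance weight x probability x impact."""
        return self.weight * self.probability * self.impact


def evaluate(threats: List[Threat], tolerance: float) -> List[Threat]:
    """Step 3: return threats above the tolerance, highest risk first."""
    over = [t for t in threats if t.risk_level() > tolerance]
    return sorted(over, key=Threat.risk_level, reverse=True)


# Hypothetical register and tolerance, for illustration only.
register = [
    Threat("Flood", weight=0.5, probability=0.01, impact=1_000_000),
    Threat("Utility power loss", weight=1.0, probability=0.10, impact=200_000),
    Threat("Physical intrusion", weight=0.8, probability=0.05, impact=50_000),
]
for threat in evaluate(register, tolerance=4_000):
    print(f"{threat.name}: {threat.risk_level():,.0f}")
```

Because new threats can appear over the data center lifecycle, a register like this would be revisited and re-evaluated rather than computed once.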

Judgement is required from the right team of people, including data center subject matter experts, to identify, analyze, evaluate, and interpret results. The success of this practice depends on the quality of the information used.

Before making a decision, we should consider risk treatment, which is the process of selecting and implementing measures or controls to adjust the risk to the desired level. Decreasing the risk level with minimum investment requires a great deal of knowledge. The final step is risk acceptance. In this step, no further action is taken to treat the risk level, but ongoing monitoring and communication are encouraged.
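One simple way to think about treatment with minimum investment is to pick the cheapest control that brings the residual risk down to the target level. The sketch below assumes each control reduces only the probability of the event by a known fraction, which is a strong simplification. The control names, costs, and reduction figures are hypothetical.

```python
from typing import List, NamedTuple, Optional


class Control(NamedTuple):
    name: str
    cost: float                   # investment required, e.g. in USD
    probability_reduction: float  # fraction by which the event probability drops (0..1)


def residual_risk(probability: float, impact: float, control: Control) -> float:
    """Risk level remaining after the control is applied."""
    return probability * (1.0 - control.probability_reduction) * impact


def cheapest_sufficient_control(
    probability: float, impact: float, target: float, controls: List[Control]
) -> Optional[Control]:
    """Cheapest control whose residual risk meets the target, or None if none does."""
    sufficient = [c for c in controls if residual_risk(probability, impact, c) <= target]
    return min(sufficient, key=lambda c: c.cost, default=None)


# Hypothetical example: a 10% yearly chance of a USD 200,000 outage, target risk of 5,000.
options = [
    Control("Second utility feed", cost=150_000, probability_reduction=0.9),
    Control("Spare generator", cost=80_000, probability_reduction=0.8),
    Control("Extra maintenance", cost=10_000, probability_reduction=0.3),
]
choice = cheapest_sufficient_control(0.10, 200_000, 5_000, options)
print(choice.name if choice else "No single control reaches the target")
```

If the risk is already at or below the target, no control is needed and the risk can move to acceptance with ongoing monitoring.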

Below are some examples of recent undesired events.

The COVID-19 pandemic has highlighted the importance of the critical infrastructure workforce. Supply chains have been disrupted, affecting operations and planning. Most data centers had to implement contingency and business continuity plans to adapt to the new reality, considering remote management and protecting their workforce while trying to meet objectives.

During the week of February 12–17, 2021, Winter Storm Uri caused major power outages throughout the state of Texas, affecting millions of households and businesses. Semiconductor manufacturing facilities had to shut down because of their substantial energy demand. Most of Texas runs on its own electric grid, which is largely isolated from the rest of the country, and the state was not prepared to rapidly respond to power disruptions caused by extreme weather conditions. A few data centers also had to shut down, leaving users offline; however, most of them were prepared for disaster recovery with diesel generators. Because road conditions made transportation and diesel fuel supply challenging, data centers implemented workload management strategies to reduce energy consumption.

On March 10, 2021, a fire affected OVHcloud, a global cloud infrastructure provider, destroying the five-story SBG2 data center and damaging the adjacent SBG1 data center in Strasbourg, France. The event, which took hours to control, disrupted the IT operations of clients and millions of users worldwide, many of whom had no disaster recovery plan.

On June 24, 2021, at about 1:30 am, Champlain Towers, a 12-story building located in Surfside, Florida, partially collapsed, taking multiple human lives. People in the US are not used to events like this. Perception of risk is more commonly associated with driving a car, taking a plane, or crossing the street: anything but staying home. Events such as this building collapse are usually preceded by warning signs. In this specific case, structural damage had been reported in prior years. It is not enough, however, to identify the source of risk; corrective actions must be implemented in a timely manner.

Over the 4th of July weekend in 2021, a sizable cyberattack targeted Kaseya, a US-based IT management software developer. It affected thousands of businesses worldwide. Servers were shut down proactively, and the incident response began, including unplanned maintenance and the release of a software patch, to bring services back online.

Several lessons can be learned from undesired events. Transparency and multiple sources of reliable information are much needed to understand what really happened and how to prevent similar events in the future. Integrity reminds us to do the right thing.

As an analogy, we have learned over time that regular health checkups can help us avoid undesired health situations. Similarly, for data centers, inspections, audits, and certifications should be performed periodically to understand risk, which helps in developing a strategic plan.

Risk assessment can be initially daunting, but failure is not an option. Choosing the right team of people and promoting effective and transparent communication are key to a successful strategy. In addition, some data center standards and best practices such as ANSI/BICSI 002 (Data Center Design and Implementation Best Practices) and BICSI 009 (Data Center Operations and Maintenance Best Practices) have already incorporated the concept of risk.

Analyzing data over time is valuable, but we live in an age where we need to stay on top of trends so we can make educated and fast decisions. With advances in technology, real-time data has become a key player. Data and analytics help to refine insights and drive successful decision-making.

Like many processes in other fields, risk assessment can be automated by incorporating real-time and historical data to provide continuous, consistent, and reliable monitoring. Artificial intelligence can also help by finding patterns in large volumes of data and making predictions faster than manual analysis allows, turning data into actionable information.

As the Roman philosopher Seneca said, “Luck is what happens when preparation meets opportunity.”
