TDWI | Training & Research | Business Intelligence, Analytics, Big Data, Data Warehousing

TDWI Articles

Resilience in Data Services Through Mitigation and Prevention: Lessons Learned

The OVHcloud data center fire highlights how simple backups aren't enough to get your business back to normal when data is lost or unavailable -- and that simple fire code compliance may not adequately limit the risk of data loss.

By Paul Amico
September 16, 2021

In March 2021, a massive fire broke out at the OVHcloud data center in Strasbourg, destroying one building and causing serious damage to another. The fire caught many clients unprepared; they lost access to their data or, even worse, lost the data itself. Did they put too much faith in the cloud or in the cloud service?

For Further Reading:

Why 2020 Was the Year of Disaster Recovery

Executive Q&A: Enterprise Security in the Post-Pandemic Era

Protect Your Data Before Disaster Strikes: Best Practices for Disaster Recovery Plans

The truth is that most companies don't think enough about the possibility of such a serious event when they select a cloud service provider or package. Instead, they focus only on encryption, cybersecurity, access controls, access speed, features and flexibility, technical support, or cost.

These are important considerations for selecting a cloud service, but companies should also scrutinize the cloud service's resilience in the event of extreme hazards, with fire being the most impactful in terms of frequency and consequences.

Resilience refers to how well an organization can handle a major disruptive event and return to normal operation. Operational resilience is the ability to deliver critical operations in the face of disruption. It allows organizations to absorb internal and external shocks and ensure the continuity of critical operations by protecting key processes and resources such as systems, data, people, and property.

Backup Versus Replication

Most people will say the simplest and most rudimentary way to ensure resilience is to back up data. However, having a backup just means that your system can be rebuilt -- eventually. If the site providing your cloud service is destroyed, it can take time to rebuild the data system so your business can return to normal operation. Furthermore, if you did not arrange for your data to be backed up, the cloud service may not be at fault for your downtime, depending on your contract terms, even if you can eventually recover the data from your own resources.

OVHcloud took 50 days to restore 118,000 of its 120,000 customers affected by the fire -- a significant timespan to be without services. Moreover, OVHcloud only compensated clients for their losses with refunds on service contracts. Customers whose virtual private server (VPS) hosting was destroyed but who had not ordered paid backup, for example, received a six-month refund. OVHcloud did not offer reimbursement for business losses due to lost access or data.

Companies put tremendous faith in the protection, reliability, and availability afforded by a cloud service. Therefore, rebuild time (or "return to service") should be considered and specified in your service terms. I asked one IT manager whether his company had thought about the rebuild time when selecting a cloud service provider. He said no and, in fact, didn't know how long rebuilding would take.

The ultimate way to ensure rapid return to service is complete replication, which effectively eliminates the need to rebuild the data structure and function. This involves creating an exact replica of the data structure and sites such that the service provider can "throw a switch" and customers are back to normal quickly -- perhaps in a matter of hours. It is an expensive solution, but it may be worth the cost when balanced against the business interruption risk posed by a catastrophic event at the primary cloud site.
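One rough way to weigh that cost is to compare the expected yearly loss from downtime under each option. The Python sketch below does this under stated assumptions: the event frequency, daily loss, replication cost, and six-hour failover time are hypothetical placeholders, and only the 50-day restore time echoes the OVHcloud figure above. It is an illustration of the trade-off, not a pricing model.

```python
def expected_annual_downtime_loss(events_per_year, downtime_days, loss_per_day):
    """Expected yearly loss ($/year) from outages of a given length."""
    return events_per_year * downtime_days * loss_per_day


# All inputs below are hypothetical, chosen only to illustrate the comparison.
EVENTS_PER_YEAR = 0.01          # assumed chance of a catastrophic site loss per year
LOSS_PER_DAY = 200_000          # assumed business loss per day without service ($)
BACKUP_DOWNTIME_DAYS = 50       # rebuild from backup (mirrors the 50 days cited above)
REPLICA_DOWNTIME_DAYS = 0.25    # assumed six-hour failover to a replica site
REPLICATION_EXTRA_COST = 60_000 # assumed added yearly cost of full replication ($)

backup_loss = expected_annual_downtime_loss(
    EVENTS_PER_YEAR, BACKUP_DOWNTIME_DAYS, LOSS_PER_DAY)
replica_loss = expected_annual_downtime_loss(
    EVENTS_PER_YEAR, REPLICA_DOWNTIME_DAYS, LOSS_PER_DAY)

# Positive means replication pays for itself on an expected-value basis.
net = backup_loss - replica_loss - REPLICATION_EXTRA_COST

print(f"Expected loss with backup only: ${backup_loss:,.0f}/year")
print(f"Expected loss with replication: ${replica_loss:,.0f}/year")
print(f"Net benefit of replication:     ${net:,.0f}/year")
```

An expected-value comparison like this understates the case for replication when a single long outage could threaten the business outright, so it is best read as a starting point for the conversation with a provider.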

Prevention through Risk Assessment

Although mitigation minimizes the business impact of an event, it treats the likelihood of such events as acceptable. Wouldn't it be better if the facility owner had focused more on prevention and already taken actions to reduce the risk associated with such events? When selecting a supplier, ask: "Have you performed an assessment to determine the residual risk associated with events beyond the requirements of the codes and acted upon the findings?"

At a minimum, data center owners should conduct a fire risk assessment and update it every one to two years, or whenever a major change is made to the facility, because fire is likely the highest risk at these facilities. In the case of a major modification or new facility, the fire risk assessment should be an integral part of the design process and inform key design decisions. The difference between the level of protection afforded by the fire codes and the "ultimate" total protection of this data is the residual risk. It can be expressed as:

Business Risk ($/year) =
Frequency of Fire (events/year) x Probability of Fire Damage x
Consequences of Damage ($/event)

If performed correctly, the risk assessment will identify the largest contributors to total risk (key ignition sources, the most important protective features, and the areas of greatest vulnerability) and produce an understandable, actionable numerical picture of that risk. This will allow a provider to apply risk-informed methods to identify where to spend money to reduce the risk.
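As a minimal sketch of the business risk calculation, the Python below computes the annual risk for a few fire scenarios and ranks them by their share of the total. The scenario names and every number are hypothetical, invented for illustration; they are not data from OVHcloud or any real facility.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FireScenario:
    """One fire scenario; the values used below are illustrative assumptions."""
    name: str
    frequency_per_year: float   # fires per year
    damage_probability: float   # chance a fire leads to damage (0 to 1)
    consequence_dollars: float  # loss per damaging event ($/event)

    def annual_risk(self) -> float:
        # Business Risk = Frequency x Probability of Damage x Consequences
        return (self.frequency_per_year
                * self.damage_probability
                * self.consequence_dollars)


def rank_contributors(scenarios):
    """Return (name, risk, share_of_total) tuples, largest risk first."""
    total = sum(s.annual_risk() for s in scenarios)
    ranked = sorted(scenarios, key=lambda s: s.annual_risk(), reverse=True)
    return [
        (s.name, s.annual_risk(), s.annual_risk() / total if total else 0.0)
        for s in ranked
    ]


# Hypothetical scenarios, invented for illustration only.
scenarios = [
    FireScenario("UPS/battery room", 0.02, 0.30, 5_000_000),
    FireScenario("Server hall", 0.05, 0.05, 20_000_000),
    FireScenario("Generator yard", 0.01, 0.50, 1_000_000),
]

ranked = rank_contributors(scenarios)
for name, risk, share in ranked:
    print(f"{name:18s} ${risk:>10,.0f}/year  {share:6.1%}")
```

Ranking by share of the total is what makes the result actionable: spending aimed at the top one or two contributors typically reduces risk more than the same spending spread evenly.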

A forward-thinking data center operator needs to apply the process of risk-informed engineering to the design (or upgrade) of the facility. The process is as follows:

Step 1: Collect the Data

Review facility drawings and fire-protection program documents
Assess fire response procedures by conducting interviews and reviewing documentation

Step 2: Review the Codes

Review international and local codes and requirements, such as IBC/IFC requirements for building construction, special separation requirements, and strategies to reduce the inherent system hazards with passive and active fire protection features

Step 3: Identify Hazards

Identify fire and explosion hazards as well as designed mitigation strategies for such conditions, including programmatic controls such as ignition source control, inspection, testing, and maintenance
Compare the facility information available for review against the applicable codes

Step 4: Assess the Fire Risk

Review existing risk thresholds and define the thresholds for acceptable risk
Perform a qualitative assessment of each area's need for protection and a fire risk assessment that quantitatively characterizes the frequencies and consequences of the identified fire scenarios. This characterization will include the effects of the designated fire-protection program.
Evaluate the effect of assumed prevention and mitigation strategies on the identified fire scenarios that indicate high risk (above acceptable consequence or frequency thresholds)
Develop a comprehensive list of recommendations based on the risk assessment results (risk-informed engineering to identify and prioritize the addition of programs and features to obtain an acceptable risk cost-effectively)
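The screening and prioritization in Step 4 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the thresholds, scenarios, mitigation options, and dollar figures are all hypothetical, and a real assessment would derive them from the facility data gathered in Steps 1 through 3.

```python
def flag_high_risk(scenarios, max_frequency, max_consequence):
    """Return scenarios that exceed either acceptability threshold."""
    return [
        s for s in scenarios
        if s["frequency"] > max_frequency or s["consequence"] > max_consequence
    ]


def rank_mitigations(options):
    """Order options by yearly risk reduction per dollar of yearly cost.

    Assumes every option has a positive annual_cost.
    """
    return sorted(
        options,
        key=lambda o: o["risk_reduction"] / o["annual_cost"],
        reverse=True,
    )


# Hypothetical acceptability thresholds.
MAX_FREQUENCY = 0.01         # damaging events per year
MAX_CONSEQUENCE = 1_000_000  # dollars per event

# Hypothetical scenarios, invented for illustration only.
scenarios = [
    {"name": "UPS/battery room fire", "frequency": 0.006, "consequence": 5_000_000},
    {"name": "Office wastebasket fire", "frequency": 0.02, "consequence": 10_000},
    {"name": "Loading dock fire", "frequency": 0.002, "consequence": 200_000},
]

# Hypothetical mitigation options with assumed yearly risk reduction and cost ($).
options = [
    {"name": "Gas suppression in battery room", "risk_reduction": 20_000, "annual_cost": 8_000},
    {"name": "Added fire-rated separation", "risk_reduction": 30_000, "annual_cost": 25_000},
    {"name": "Hot-work permit program", "risk_reduction": 6_000, "annual_cost": 1_000},
]

flagged = flag_high_risk(scenarios, MAX_FREQUENCY, MAX_CONSEQUENCE)
prioritized = rank_mitigations(options)

print("Above threshold:", [s["name"] for s in flagged])
print("Spend first on: ", [o["name"] for o in prioritized])
```

Sorting by risk reduction per dollar is one simple way to express "cost-effectively"; an operator might instead fix a budget, or require every flagged scenario to fall below threshold regardless of cost.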

Certainly, data center owners may be somewhat reluctant to invest money in these processes. However, if clients start rewarding owners who do by choosing them as their service provider, other data center owners will see the value in going the extra mile to prevent large, catastrophic events.

Reassessment

The assessment of the appropriate level of operational resilience is not a one-time task. The value of data changes over time as do the threats to that data. Assessments need to be an integral part of your corporate risk management program.

Well-run businesses regularly reassess their risk profile. Questions such as "Do we have enough/the right insurance?" and "Is our cybersecurity adequate?" are not typically asked only once. They are asked repeatedly and regularly. The risk of data loss due to catastrophic data center damage needs to be part of this same update process. This could (and should) lead to changes in what cloud services to purchase and from whom.

A Final Word

Catastrophic events such as the OVHcloud data center fire need not result in extended business interruption. Although data center customers can take actions to mitigate the impact and improve resilience, data center owners can also take actions to prevent such events. Customers should consider this when selecting their cloud service provider and ask:

How is my data backed up? What options are available?
If my primary data center suffers catastrophic damage, how and when will my service be restored?
How long will it take to restore my service to normal?
When did you last perform a fire risk assessment? What have you done to reduce risk from catastrophic fire damage?

Revisit these questions periodically as part of your overall corporate risk management program to ensure that the level of protection addresses changes in the risk profile. In general, a review every one to two years should be adequate if there are no major changes to the facility. In the case of a planned major modification (or a new facility), the review should be conducted before construction starts. You should also ask whether a fire risk assessment is an integral part of the design process and whether its insights are used to inform the design.

The OVHcloud data center fire didn't need to result in such significant business losses. Due diligence about prevention and mitigation is the key to cloud service resilience.

About the Author

Paul Amico has specialized in risk assessment, risk management, and risk-informed, performance-based engineering services for over 40 years. During that time, Paul has participated in many risk-based projects for a wide range of industries. Most projects focused on risk mitigation, risk management, and performance-based design of protection systems for hazards associated with fires, floods, explosions, the full range of natural phenomena, and man-made hazards. His international experience includes projects in over two dozen countries. Paul is a charter member of the Society for Risk Analysis and has participated in development of risk assessment standards for the American Society of Mechanical Engineers, the U.S. Department of Energy, and the American Nuclear Society. You can contact the author via email or LinkedIn.
