Resilience in Data Services Through Mitigation and Prevention: Lessons Learned
The OVHcloud data center fire highlights how simple backups aren't enough to get your business back to normal when data is lost or unavailable -- and that simple fire code compliance may not adequately limit the risk of data loss.
- By Paul Amico
- September 16, 2021
In March 2021, a massive fire broke out at the OVHcloud data center in Strasbourg, destroying one building and causing serious damage to another. The fire caught many clients unprepared; they lost access to their data or, even worse, lost the data itself. Did they put too much faith in the cloud or in the cloud service?
The truth is that most companies don't think enough about the possibility of such a serious event when they select a cloud service provider or package. Instead, they focus only on encryption, cybersecurity, access controls, access speed, features and flexibility, technical support, or cost.
These are important considerations for selecting a cloud service, but companies should also scrutinize the cloud service's resilience in the event of extreme hazards, with fire being the most impactful in terms of frequency and consequences.
Resilience refers to how well an organization can handle a major disruptive event and return to normal operation. Operational resilience is the ability to deliver critical operations in the face of disruption. It allows organizations to absorb internal and external shocks and ensure the continuity of critical operations by protecting key processes and resources such as systems, data, people, and property.
Backup Versus Replication
Most people will say the simplest way to ensure resilience is to back up data. However, having a backup just means that your system can be rebuilt -- eventually. If the site providing your cloud service is destroyed, it can take time to rebuild the data system so your business can return to normal operation. Furthermore, if you did not arrange for backups, the cloud service may not be at fault for your downtime, depending on your contract terms, even if your data can eventually be recovered from your own resources.
OVHcloud took 50 days to restore 118,000 of its 120,000 customers affected by the fire -- a significant timespan to be without services. Moreover, OVHcloud compensated clients for their losses only with refunds on service contracts. For example, customers whose virtual private server (VPS) hosting was destroyed but who had not ordered paid backup received a six-month refund. OVHcloud did not offer reimbursement for business losses due to lost access or data.
Companies put tremendous faith in the protection, reliability, and availability afforded by a cloud service. Therefore, rebuild time (or "return to service") should be considered and specified in your service terms. I asked one IT manager whether his company had thought about the rebuild time when selecting a cloud service provider. He said no and, in fact, didn't know how long rebuilding would take.
The ultimate way to ensure rapid return to service is complete replication, which effectively eliminates the need to rebuild the data structure and function. This involves creating an exact replica of the data structure and sites such that the service provider can "throw a switch" and customers are back to normal quickly -- perhaps in a matter of hours. It is an expensive solution, but it may be worth the cost when balanced against the business interruption risk posed by a catastrophic event at the primary cloud site.
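The cost trade-off described above can be checked with a rough calculation. The following is a minimal sketch, not a pricing model: it compares the expected yearly outage loss under a backup-only plan with the loss under full replication, then checks whether the risk avoided exceeds the yearly replication cost. The function names and all input values are hypothetical assumptions, except the 50-day rebuild time, which echoes the OVHcloud figure cited earlier.

```python
def expected_annual_outage_cost(events_per_year, downtime_days, loss_per_day):
    """Expected yearly business-interruption loss from a site-destroying event."""
    return events_per_year * downtime_days * loss_per_day


def replication_is_worth_it(events_per_year, loss_per_day, rebuild_days,
                            failover_days, replication_cost_per_year):
    """Return (yearly risk avoided by replication, whether it exceeds the cost)."""
    backup_only = expected_annual_outage_cost(events_per_year, rebuild_days, loss_per_day)
    replicated = expected_annual_outage_cost(events_per_year, failover_days, loss_per_day)
    risk_avoided = backup_only - replicated
    return risk_avoided, risk_avoided > replication_cost_per_year


# Hypothetical inputs: one catastrophic event per 100 years, $200,000 lost per
# day of downtime, a 50-day rebuild versus a 6-hour failover, and a
# $60,000/year replication fee.
risk_avoided, worth_it = replication_is_worth_it(
    events_per_year=0.01,
    loss_per_day=200_000,
    rebuild_days=50,
    failover_days=0.25,
    replication_cost_per_year=60_000,
)
print(f"Risk avoided: ${risk_avoided:,.0f}/year; replication justified: {worth_it}")
```

With different assumptions the answer can flip, which is why the rebuild time and the daily cost of downtime are worth pinning down before choosing a service package.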
Prevention through Risk Assessment
Although mitigation minimizes the business impact of an event, it takes the likelihood of such events as a given. Wouldn't it be better if the facility owner had focused more on prevention and already taken actions to reduce the risk associated with such events? When selecting a supplier, ask: "Have you performed an assessment to determine the residual risk associated with events beyond the requirements of the codes and acted upon the findings?"
Because fire is likely the highest risk at these facilities, data center owners should, at a minimum, conduct a fire risk assessment and update it every one to two years or whenever a major change is made to the facility. In the case of a major modification or new facility, the fire risk assessment should be an integral part of the design process and inform key design decisions. The residual risk is the gap between the level of protection afforded by the fire codes and the "ultimate" total protection of the data. It can be expressed as:
Business Risk ($/year) = Frequency of Fire (events/year) × Probability of Fire Damage × Consequences of Damage ($/event)
If performed correctly, the risk assessment will identify the largest contributors to the total in terms of key ignition sources, most important protective features, and areas of most vulnerability, and create an understandable and actionable numerical risk construct. This will allow a provider to apply risk-informed methods to identify where to spend money to reduce the risk.
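As a minimal sketch of how the business risk formula can be applied per fire scenario and then ranked by contribution, consider the following. The scenario names and all numbers are hypothetical placeholders chosen for illustration; a real assessment would derive them from facility data and fire statistics.

```python
from dataclasses import dataclass


@dataclass
class FireScenario:
    name: str
    frequency_per_year: float   # fires per year (events/year)
    damage_probability: float   # probability that a fire causes damage
    consequence_dollars: float  # loss per damaging event ($/event)

    @property
    def risk_per_year(self) -> float:
        # Business risk ($/year) = frequency x probability of damage x consequence
        return self.frequency_per_year * self.damage_probability * self.consequence_dollars


def rank_contributors(scenarios):
    """Return total risk and (name, risk, share of total), largest contributor first."""
    total = sum(s.risk_per_year for s in scenarios)
    ranked = sorted(scenarios, key=lambda s: s.risk_per_year, reverse=True)
    rows = [(s.name, s.risk_per_year, s.risk_per_year / total if total else 0.0)
            for s in ranked]
    return total, rows


# Hypothetical scenarios, for illustration only.
example_scenarios = [
    FireScenario("Server hall", 0.05, 0.1, 10_000_000),
    FireScenario("UPS and battery room", 0.02, 0.5, 20_000_000),
    FireScenario("Generator yard", 0.01, 0.2, 5_000_000),
]

total_risk, contributors = rank_contributors(example_scenarios)
print(f"Total business risk: ${total_risk:,.0f}/year")
for name, risk, share in contributors:
    print(f"  {name}: ${risk:,.0f}/year ({share:.0%} of total)")
```

Ranking by share of the total is one simple way to see where the largest contributors lie; it does not replace the engineering judgment needed to estimate each input.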
A forward-thinking data center operator needs to apply the process of risk-informed engineering to the design (or upgrade) of the facility. The process is as follows:
Step 1: Collect the Data
- Review facility drawings and fire-protection program documents
- Assess fire response procedures by conducting interviews and reviewing documentation
Step 2: Review the Codes
- Review international and local codes and requirements, such as IBC/IFC requirements for the building construction, special separation requirements, and strategies to reduce the inherent system hazards with passive and active fire protection features
Step 3: Identify Hazards
- Identify fire and explosion hazards as well as designed mitigation strategies for such conditions, including programmatic controls such as ignition source control, inspection, testing, maintenance, etc.
- Compare the facility, as described in the information available for review, against the applicable codes
Step 4: Assess the Fire Risk
- Review existing risk thresholds and define the thresholds for acceptable risk
- Perform a qualitative assessment of each area's need for protection and a fire risk assessment that quantitatively characterizes the frequencies and consequences of the identified fire scenarios. This characterization will include the effects of the designated fire-protection program.
- Evaluate the effect of assumed prevention and mitigation strategies for the identified fire scenarios suggesting high risk (above acceptable consequence or frequency thresholds)
- Develop a comprehensive list of recommendations given the risk assessment results (risk-informed engineering to identify and prioritize the addition of programs and features to obtain an acceptable risk cost-effectively)
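The last two bullets of Step 4 can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not a prescribed method: scenarios are flagged when they exceed either a frequency or a consequence threshold, and candidate measures are ordered by yearly risk reduction per dollar of yearly cost. The thresholds, scenario values, measure names, and the cost-effectiveness ratio itself are hypothetical choices.

```python
def screen_scenarios(scenarios, max_frequency, max_consequence):
    """Return names of scenarios above either acceptability threshold."""
    return [name for name, frequency, consequence in scenarios
            if frequency > max_frequency or consequence > max_consequence]


def prioritize(measures):
    """Order measures by yearly risk reduction per dollar of yearly cost.

    Assumes every measure has a positive cost.
    """
    return sorted(measures,
                  key=lambda m: m["risk_reduction_per_year"] / m["cost_per_year"],
                  reverse=True)


# Hypothetical scenarios: (name, fires per year, loss per event in dollars).
scenarios = [
    ("UPS and battery room", 0.02, 20_000_000),
    ("Server hall", 0.05, 10_000_000),
    ("Generator yard", 0.01, 5_000_000),
]
high_risk = screen_scenarios(scenarios, max_frequency=0.03, max_consequence=15_000_000)
print("Scenarios above thresholds:", high_risk)

# Hypothetical candidate measures with assumed costs and benefits.
measures = [
    {"name": "Battery room fire separation",
     "risk_reduction_per_year": 120_000, "cost_per_year": 40_000},
    {"name": "Early smoke detection in server hall",
     "risk_reduction_per_year": 30_000, "cost_per_year": 5_000},
    {"name": "Additional suppression zone",
     "risk_reduction_per_year": 20_000, "cost_per_year": 25_000},
]
for m in prioritize(measures):
    ratio = m["risk_reduction_per_year"] / m["cost_per_year"]
    print(f"{m['name']}: {ratio:.1f} dollars of risk removed per dollar spent")
```

A ratio below 1 suggests a measure costs more than the risk it removes under these assumptions, although an owner may still adopt it if a scenario remains above the acceptability thresholds.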
Certainly, data center owners may be somewhat reluctant to invest money in these processes. However, if clients start rewarding owners who do by choosing them as their service provider, other data center owners will see the value in going the extra mile to prevent large, catastrophic events.
Assessing the appropriate level of operational resilience is not a one-time task. The value of data changes over time, as do the threats to that data. Assessments need to be an integral part of your corporate risk management program.
Well-run businesses regularly reassess their risk profile. Questions such as "Do we have enough/the right insurance?" and "Is our cybersecurity adequate?" are not typically asked only once. They are asked repeatedly and regularly. The risk of data loss due to catastrophic data center damage needs to be part of this same update process. This could (and should) lead to changes in what cloud services to purchase and from whom.
A Final Word
Catastrophic events such as the OVHcloud data center fire need not result in extended business interruption. Although data center customers can take actions to mitigate the impact and improve resilience, data center owners can also take actions to prevent such events. Customers should consider this when selecting their cloud service provider and ask:
- How is my data backed up? What options are available?
- If my primary data center suffers catastrophic damage, how and when will my service be restored?
- How long will it take to restore my service to normal?
- When did you last perform a fire risk assessment? What have you done to reduce risk from catastrophic fire damage?
Revisit these questions periodically as part of your overall corporate risk management program to ensure that the level of protection addresses changes in the risk profile. In general, a review every one to two years should be adequate if there are no major changes to the facility. In the case of a planned major modification (or a new facility), conduct the review before construction starts. You should also ask whether a fire risk assessment is an integral part of the design process and whether its insights are used to inform the design.
The OVHcloud data center fire didn't need to result in such significant business losses. Due diligence about prevention and mitigation is the key to cloud service resilience.