Executive Summary
- Despite advances in automation, resilience engineering and UPS systems, downtime persists, and it can have disastrous impacts on businesses, operations, consumers and infrastructure.
- Planned theoretical scenarios can help, but real-world failures rarely match them: infrastructure is more complex, there is usually more than one point of failure, and there must always be a margin for human error.
- The main causes of downtime are human error, network and IT system failures, power failures, cooling system failures and complex infrastructure. Operators must examine the causes of downtime and use real-world simulation to find new ways of reducing it.
In an era of AI automation and cloud infrastructure, downtime still exists in data centres, largely because today's infrastructure is more complex and nuanced, with many interdependent systems in place, and that complexity increases the risk of downtime.
We can automate until the cows come home, but real systems operate in messy, human environments, so there will always be a risk of downtime, no matter how slim the margin for error.
Theoretical scenarios vs real-world failure scenarios
Theoretical resilience scenarios are employed in the initial design and architecture phase, in resilience modelling, and in documentation, compliance and certification, such as the Uptime Institute Tier Classifications.
The issue with theoretical scenarios is that they assume ideal conditions, such as perfect redundancy and predictable failure models, and rely on concepts such as N+1 and N+2 redundancy to eliminate single points of failure. These scenarios don't factor in the complexity of today's data centre infrastructure, the possibility of multiple simultaneous points of failure, or human error.
Real-world failures rarely look like this, not least because theoretical scenarios don't fully account for human and operational factors such as rushed changes, incorrect procedures and simple mistakes, which in practice account for a good proportion of outages. There are also unpredictable environmental conditions: models assume stable operating conditions and don't take into consideration volatile weather, heatwaves, storm seasons and freezing winter temperatures that can affect infrastructure. Furthermore, theoretical scenarios tend to assume a single, isolated point of failure, but in the interconnected systems of modern data centre infrastructure, failures can compound. A seemingly minor issue, such as a sensor misreading, can trip multiple systems, a phenomenon engineers call cascading faults.
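The gap between the two views can be made concrete with a toy simulation. The sketch below (all failure rates are illustrative assumptions, not industry data) estimates the outage probability of a hypothetical N+1 system of three units under the textbook assumption that units fail independently, and again with a simple cascade effect where one failure stresses the survivors:

```python
import random

def outage_probability(trials=100_000, units=3, p_fail=0.01, cascade_p=0.0):
    """Estimate the chance an N+1 system (needs units-1 of `units` running)
    loses too much capacity. If cascade_p > 0, a first failure makes each
    surviving unit fail with that extra probability (shared bus, heat,
    load shift). All figures are illustrative, not measured data."""
    outages = 0
    for _ in range(trials):
        failed = sum(random.random() < p_fail for _ in range(units))
        if failed >= 1 and cascade_p:
            # correlated second-order failures triggered by the first fault
            failed += sum(random.random() < cascade_p
                          for _ in range(units - failed))
        if failed >= 2:  # N+1 tolerates one failure, not two
            outages += 1
    return outages / trials

random.seed(0)
independent = outage_probability()              # textbook independence
cascading = outage_probability(cascade_p=0.2)   # correlated real world
print(f"independent: {independent:.4%}  cascading: {cascading:.4%}")
```

With these made-up numbers, the correlated model produces outages more than an order of magnitude more often than the independent one; the redundancy maths hasn't changed, only the assumption that faults stay isolated.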
Unplanned downtime can have a domino effect of consequences, from immediate financial loss, operational disruptions, and temporary inconvenience, to long-term strategic impacts and sometimes even life-threatening impacts. Just take a look at the five-day Berlin power outage and how that affected thousands of people, businesses, schools and hospitals.
Here, we examine the biggest real-world causes of unplanned downtime.
Power Failures
Insufficient redundancy or failures in the UPS system often lead to outages. Energy has been called the Achilles heel of data centres because power failures are among the most common causes of downtime: the Uptime Institute's "Annual Outage Analysis 2025" reports that energy and power failures caused 54% of the major impact outages reported in 2024.
Cooling Systems
Heatwaves and data centres are mortal enemies, and rising temperatures greatly reduce the efficiency of the cooling systems in place, whether the facility uses air cooling, liquid cooling or a combination of both.
Complex Infrastructure
Data centres consist of multiple interconnected systems and are becoming more complex, and with that complexity comes an increased risk of misconfigurations that can bring down multiple systems in a matter of minutes, resulting in frequent outages.
Human Error
There is always a margin for human error, no matter how slim the chance. It could be a lack of processes, a missed step in a procedure, or a misconfiguration during routine maintenance or upgrades that leads to an outage. Last year, outages caused by human error increased by 10%, and the most common reason was failure to follow procedures to the letter, something likely amplified by staff shortages and rapid industry growth.
Networks and IT System Issues
These could be hardware faults, mismanaged traffic or latency, any of which can create severe bottlenecks. Such failures can turn a single point of disruption into widespread service outages.
Despite innovations and advances in technology and infrastructure, downtime is still an unavoidable reality that data centre operators must face. To minimise it, operators must truly understand all the causes and factors that contribute to downtime, and keep testing in simulations to find new ways of reducing risk, increasing resilience and, ultimately, cutting downtime as far as possible.