Data Centre Resilience: 100% Uptime is a myth


Executive Summary

  • Achieving 100% uptime is a myth; even the best data centre resilience and redundancy strategies cannot eliminate all risks.
  • Each tier in the Uptime Institute’s classification system has a strict uptime requirement and a fixed allowance of downtime per year, which gives data centres guidelines on resilience expectations.
  • Redundancy protects infrastructure, but process and human risks remain that can’t be fully eliminated.

Data centre resilience is a hot topic in the industry, and keeping data centres running 24/7 is the ultimate goal. But with several factors that can cause errors, outages and downtime, 100% uptime is in reality a myth we chase, like the pot of gold at the end of the rainbow. Myth though it may be, uptime is still an expectation, and one with defined limits.

Here we take a closer look at data centre tiers and the uptime classifications expected of them, to demonstrate that even the best redundancy cannot eliminate all risks.

The data centre tiers

The Uptime Institute’s tier classification system for data centres is one of the most prominent standards in the industry. The system is progressive, so each tier automatically includes the requirements of the tiers below it.

Tier 1 data centres

This tier is best suited to small companies and start-ups with low budgets and low IT requirements.

Requirements

  • A single path for power and cooling
  • Limited cooling capacity at around 220-230 watts per square metre
  • No fault tolerance
  • No redundancy
  • A maximum of 28.8 hours of downtime per year, which works out to 99.67% uptime

Tier 2 data centres

Data centres in this tier are used for simple IT processes that need good performance but are not mission-critical.

  • One path for power
  • Cooling capacity of 430-450 watts per square metre
  • Partial redundancy for cooling and power
  • Low fault tolerance
  • Maximum of 22 hours of downtime per year, which is a 99.75% expected uptime.

Tier 3 data centres

This is the recommended tier for businesses with a high standard for seamless IT processes, e-commerce and mission-critical processes.

  • Reliable redundancy for different components
  • Two servers with multiple paths for cooling and power
  • Good fault tolerance
  • Cooling capacity of 1,070-1,620 watts per square metre
  • Maximum of 1.6 hours of downtime per year, with a 99.98% expected uptime.

Tier 4 data centres

This fourth and final tier is for large companies with internationally connected computing networks, high mission-critical IT and 24/7 system availability.

  • Complete redundancy for all parts of their system
  • High fault tolerance with no single points of failure
  • Great cooling capacity of over 1,620 watts per square metre
  • A maximum of 0.8 hours (roughly 48 minutes!) of downtime per year, an expectation of 99.991% uptime
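The downtime allowances and uptime percentages quoted for each tier are two views of the same figure. As a rough sketch, assuming a non-leap year of 8,760 hours, the conversion can be checked in a few lines of Python. The function and variable names here are illustrative only and are not part of any Uptime Institute material.

```python
# Assumption: a 365-day year (8,760 hours); leap years are ignored.
HOURS_PER_YEAR = 365 * 24


def uptime_percent(downtime_hours: float) -> float:
    """Convert an annual downtime allowance in hours to an uptime percentage."""
    return 100.0 * (1.0 - downtime_hours / HOURS_PER_YEAR)


def downtime_minutes(uptime_pct: float) -> float:
    """Convert an uptime percentage to the annual downtime allowance in minutes."""
    return (1.0 - uptime_pct / 100.0) * HOURS_PER_YEAR * 60


# Annual downtime allowances as quoted in this article, in hours.
TIER_DOWNTIME_HOURS = {
    "Tier 1": 28.8,
    "Tier 2": 22.0,
    "Tier 3": 1.6,
    "Tier 4": 0.8,
}

if __name__ == "__main__":
    for tier, hours in TIER_DOWNTIME_HOURS.items():
        print(f"{tier}: {hours} h/year -> {uptime_percent(hours):.3f}% uptime")
```

Running the same calculation in reverse with downtime_minutes is a quick way to sanity-check any uptime figure a provider quotes.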

With this tier system comes a set of core redundancy models that describe the baseline capacity and what kind of failures the centre can handle. In simplest terms, N is the baseline capacity needed to run the load. N+1 means the centre can handle a single component failure. 2N is a full duplication of the system, meaning the centre can survive a complete outage of one system. 2N+1 adds one more spare on top of that duplication, so there is still resilience if one entire system is also under maintenance.
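To make the redundancy models concrete, here is a minimal sketch that counts installed units for each model. It assumes a deliberately simplified picture in which every unit (a UPS or a chiller, say) is interchangeable and N is the number of units needed to carry the load. Real 2N designs are built as two independent systems rather than one pool of spares, so treat this as arithmetic only. All names are hypothetical.

```python
def installed_units(n: int, model: str) -> int:
    """Number of units installed under a redundancy model, given baseline n."""
    models = {
        "N": n,             # baseline only, no spare capacity
        "N+1": n + 1,       # baseline plus one spare unit
        "2N": 2 * n,        # full duplicate of the baseline
        "2N+1": 2 * n + 1,  # full duplicate plus one further spare
    }
    return models[model]


def spare_units(n: int, model: str) -> int:
    """Units installed beyond the baseline n."""
    return installed_units(n, model) - n


def still_meets_load(n: int, model: str, failed_units: int) -> bool:
    """True if the remaining units can still carry the baseline load.

    Simplification: units are treated as interchangeable.
    """
    return installed_units(n, model) - failed_units >= n


if __name__ == "__main__":
    baseline = 4  # hypothetical: four units needed to carry the load
    for model in ("N", "N+1", "2N", "2N+1"):
        print(
            f"{model}: {installed_units(baseline, model)} installed, "
            f"{spare_units(baseline, model)} spare"
        )
```

With a baseline of four units, N leaves no margin at all, while 2N+1 installs nine units for the same load, which is where the cost dilemma comes from.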

With power failures being one of the top causes of data centre outages, power redundancy is non-negotiable, not a luxury. When planning data centres, managers face a dilemma between:

  • Accepting occasional downtime, which keeps costs lower
  • Investing in expensive duplicate systems to maximise resilience.

Most operators would lean towards the second option, since downtime at a company such as a financial trading firm could cost millions in a single hour.

Why would they want to take that risk, right?

After all, the uptime requirement for the tier classification is a minimum and most operators strive to reduce outages as much as humanly possible.

You can have great redundancy, but you can’t eliminate all risk

Redundancy can eliminate single points of failure, but it can’t eradicate all failure, even in Tier 4 facilities. Tier 4 and 2N+1 data centres can withstand certain failure scenarios, but they can’t withstand all of them.
