The thermal chain: why AI demands a systemic rethink of cooling

Executive Summary

  • We are entering the era of the thermal chain, where cooling shifts from a utility to a thermal strategy, and those who master it can build the resilience needed to power the next generation of intelligence.
  • Maurizio Frizziero, Vice President, Chilled Water Systems at Vertiv, notes a marked shift towards advanced technologies and solutions that trim cooling loads. These systems allow for precise control over heat dissipation.
  • Rack densities are soaring past 50kW and moving towards 1MW and beyond. The transition to liquid cooling for thermal management is inevitable, with most operators managing hybrid setups for years to come.

 

The rapid adoption of generative AI has fundamentally shifted the physics of the data centre. As rack densities soar past 50kW, approach and exceed 100kW, and move toward 1MW or more, the traditional methods of removing heat are hitting a thermodynamic wall. Above 50kW, we can no longer rely on air transfer alone to keep pace with the computational demands of the future.
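A rough sensible-heat calculation shows why air hits a wall. Using the relation Q = ρ · V̇ · cp · ΔT with assumed values for air density, specific heat, and a typical supply-to-return temperature rise, the airflow a rack demands scales linearly with its heat load — the numbers below are illustrative, not from the article:

```python
# Back-of-envelope airflow needed to remove a rack's heat with air alone,
# assuming sensible cooling only: Q = rho * V_dot * cp * dT.
RHO_AIR = 1.2    # kg/m^3, air density at ~20 C (assumed)
CP_AIR = 1005.0  # J/(kg*K), specific heat of air (assumed)

def required_airflow_m3s(heat_kw: float, delta_t_k: float = 12.0) -> float:
    """Volumetric airflow (m^3/s) needed to absorb heat_kw at a delta_t_k rise."""
    return heat_kw * 1000.0 / (RHO_AIR * CP_AIR * delta_t_k)

for kw in (10, 50, 100):
    v = required_airflow_m3s(kw)
    print(f"{kw:>4} kW rack -> {v:5.2f} m^3/s (~{v * 2119:.0f} CFM)")
```

At 50kW a single rack already needs several cubic metres of air per second; at 100kW that doubles, which is far beyond what standard floor grilles and fans can practically deliver.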

We are entering the era of the “Thermal Chain.”

The physics of intelligence

To understand the cooling challenge, we must first understand the workload. Traditional enterprise applications are usually stable and predictable: they surge during peak hours and idle overnight. Cooling infrastructure is designed to handle this flux, often running at partial load for much of its life.

AI training workloads are different. They run at maximum thermal output for weeks or even months at a time. The silicon is pushed to its absolute limit, generating a dynamic, intense heat flux. This fundamentally changes the requirements for the cooling plant.

Beyond the chiller

For decades, cooling has been treated as a utility. It was simply a case of keeping the room cold enough so the servers would continue to function at optimum levels. However, AI workloads run hotter and harder, more consistently and at much higher density than any traditional enterprise applications. To manage this heat effectively, we need to stop thinking about individual cooling units and start thinking about the entire thermal ecosystem.

This ecosystem extends to the room itself. Modern designs are increasingly leveraging perimeter cooling and thermal wall technology to manage ambient temperatures more efficiently. By using the physical structure of the room as part of the thermal strategy, data centre operators can create a robust first line of defence before heat even reaches the row. This approach maximises the efficiency of the air volume and stops high density clusters from creating hotspots that compromise the rest of the facility.

Perimeter cooling units work in concert with the containment aisles to allow a uniform distribution of cold air. This prevents the starvation of high-density racks, a common issue in legacy data centres where the sheer volume of air required by modern graphics processing units (GPUs) exceeds the capacity of standard floor grilles.

The hybrid reality

It is no exaggeration to say that the transition to liquid cooling is inevitable, with most operators managing hybrid setups for years to come. They will mix high-density, liquid-cooled AI clusters with traditional air-cooled infrastructure in the same hall.

This presents a unique operational challenge. Data centre operators must balance the airflow dynamics of a legacy row with the hydraulic complexity of a liquid loop. The answer lies in integrated management. The air and liquid systems must work in concert, communicating real time data to the building management system to optimise performance across the entire floor.

Direct-to-Chip as the current standard

Direct-to-chip, single-phase liquid cooling is rapidly becoming the standard for AI. By bringing the cooling medium directly to the heat source, the inefficiencies of air transfer are bypassed entirely. Alongside direct-to-chip solutions, rear-door heat exchangers (RDHx) remain a practical option for sites transitioning to higher densities without fully committing to liquid cooling.

This shift allows for significantly higher compute density. It also enhances the Power Usage Effectiveness (PUE) of the facility. However, it requires a new set of skills and infrastructure. Operators need to be comfortable managing fluid dynamics, leak detection systems, and coolant distribution units (CDUs). This represents a significant cultural shift for teams used to managing fans and compressors.
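PUE itself is a simple ratio: total facility power divided by the power delivered to the IT equipment. The sketch below uses made-up numbers purely to illustrate why cutting fan and compressor energy improves the metric; the figures are assumptions, not Vertiv data:

```python
def pue(it_kw: float, cooling_kw: float, other_kw: float) -> float:
    """Power Usage Effectiveness: total facility power divided by IT power."""
    return (it_kw + cooling_kw + other_kw) / it_kw

# Illustrative (hypothetical) numbers: liquid cooling trims fan/compressor energy.
air = pue(it_kw=1000, cooling_kw=400, other_kw=100)     # air-cooled hall
liquid = pue(it_kw=1000, cooling_kw=150, other_kw=100)  # liquid-cooled hall
print(f"air-cooled PUE: {air:.2f}, liquid-cooled PUE: {liquid:.2f}")
```

An ideal facility would approach a PUE of 1.0, where every watt drawn goes to compute rather than overhead.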

The CDU acts as the backbone of the liquid cooling system. It manages the flow rate, pressure, and temperature of the coolant, enabling the chips to receive exactly what they need to function optimally. In modern designs, these units are intelligent, communicating directly with the rack management system to adjust cooling capacity in real-time based on the computational load.
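The flow-rate side of that control loop follows the same heat-balance relation, Q = ṁ · cp · ΔT. A minimal sketch, assuming pure water as the coolant and a hypothetical rack-telemetry feed, of how a CDU-style setpoint could scale flow with the reported load:

```python
CP_WATER = 4186.0  # J/(kg*K), specific heat of water (coolant assumed to be water)

def coolant_flow_lpm(load_kw: float, delta_t_k: float = 10.0) -> float:
    """Flow (litres/min) the loop must carry to absorb load_kw with a
    delta_t_k rise across the cold plates (1 kg of water ~ 1 litre)."""
    kg_per_s = load_kw * 1000.0 / (CP_WATER * delta_t_k)
    return kg_per_s * 60.0

# Setpoint update driven by (hypothetical) rack telemetry readings in kW.
for load in (40, 80, 120):
    print(f"rack load {load:>3} kW -> flow setpoint {coolant_flow_lpm(load):6.1f} L/min")
```

In practice the CDU also regulates pressure and supply temperature, but the flow calculation shows why telemetry matters: a rack swinging from 40kW to 120kW needs three times the coolant flow at the same ΔT.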

Heat rejection and the evolving final link

The thermal chain extends beyond the building, where heat rejection systems must evolve to handle the new thermal load. Increasingly, however, this does not have to be the end of the chain: waste heat can be captured and reused, extending the thermal pathway beyond simple rejection.

We are seeing a marked shift towards advanced technologies and solutions that trim the cooling loads. These systems allow for precise control over heat dissipation. They enable the high-grade heat generated by AI chips to be rejected efficiently without overworking the mechanical plant during peak loads, while maximising free cooling operation.

Oil-free centrifugal chillers, whose moving parts make no contact and therefore require no oil, offer exceptional efficiency (especially at partial loads) as well as quiet operation. This makes them ideal for maximising cooling efficiency and preserving power for the constant intensity of AI training. When combined with magnetic bearings, these systems operate with minimal friction, reducing maintenance needs and extending the lifespan of the equipment.

Trimming the cooling load is especially beneficial in hybrid operations. The technology offers broad applicability across all climates, full compatibility with the highest server densities, and the ability to operate at very high leaving water temperatures – higher than any other technology – thus enabling maximum free cooling with no water waste.

This technology provides flexibility across different climates and copes with unpredictable conditions, often making it the smartest choice when site requirements are not yet fully defined.

Inverter screw technology is another robust, reliable approach that performs effectively across a wide range of site conditions. By incorporating inverter-driven screw chillers, along with advanced system-level controls, into the chilled-water architecture, deployment timelines can be accelerated while maintaining both cooling continuity and efficiency, making it well-suited to demanding, high-density scenarios.

A bright future for those who master the thermal chain

Operators who master the entire thermal chain will be able to achieve a competitive advantage. By integrating solutions, mastering hybrid cooling management, and deploying advanced heat rejection and heat reuse technologies, the industry can build the resilience needed to power the next generation of intelligence. This is a fundamental evolution in how we build for high performance compute.
