Datacenters are having fewer, but bigger failures


And neither AI nor international conflict is helping

There’s good news and bad news when it comes to datacenter uptime. According to a recent report from the Uptime Institute, bit barns have actually gotten more resilient over the past five years. However, the report suggests that the failures that do occur are lasting longer and costing more to resolve.

According to Uptime, half of the operators surveyed reported an impactful or serious outage in the past three years. “This is the lowest level recorded since 2020 and continues a multi-year trend of improving reliability.”

However, the report also finds that datacenter operators may be having a harder time adding more 9s of reliability to their SLAs. According to Uptime, failure rates are falling at a slower pace, suggesting that existing efforts to improve resiliency may be at the point of diminishing returns.
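
For a sense of what each extra 9 buys, here is a minimal Python sketch of the standard availability arithmetic. It is not taken from Uptime's report: the function name and the 365-day year are our own assumptions, and real SLAs often measure downtime over a month rather than a year.

```python
# Illustrative only: converts "N nines" of availability into the
# maximum downtime per year. Assumes a 365-day year (no leap days).
HOURS_PER_YEAR = 365 * 24  # 8,760 hours


def downtime_hours_per_year(nines: int) -> float:
    """Return the yearly downtime budget for an availability of N nines."""
    unavailability = 10 ** -nines  # e.g. 3 nines -> 0.001 (99.9% available)
    return HOURS_PER_YEAR * unavailability


if __name__ == "__main__":
    for n in range(2, 6):
        availability_pct = 100 * (1 - 10 ** -n)
        hours = downtime_hours_per_year(n)
        print(f"{n} nines ({availability_pct:.3f}%): {hours:.2f} hours/year")
```

Each additional 9 cuts the downtime budget by a factor of ten, from roughly 87.6 hours a year at two 9s to about five minutes at five 9s, which is one way to see why later gains are harder to come by.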

This doesn’t appear to be the result of complacency. Instead, analysts suggest that efforts to improve uptime are being offset by greater system complexity and more challenging operating environments caused by the widespread deployment of power-dense infrastructure used in AI training and inference.

“Higher rack densities, load variability, and operating closer to available power limits may increase the likelihood of cascading failures,” Uptime warns.

Shortages of critical physical infrastructure like generators, switchgear, transformers, and other power and cooling systems have driven some operators to adopt second-hand or unproven hardware. 

“This is believed to have contributed to several failures and incidents at some datacenters,” the report reads.

Power-related failures remain the leading cause of major datacenter disruptions, but even this is improving. “While power issues accounted for 45 percent of respondents’ most impactful outages in 2025, this is down from 54 percent in 2024,” the analysts write.

However, the analysts also warn that this could change as local grids are stressed by ever larger datacenter deployments. 

While Uptime doesn’t expect grid power failure to be a primary cause of outages going forward, grid failures can still affect the availability of onsite power. During an outage, datacenters have a limited window to switch over to onsite generators, which can and do fail.

Overburdened grids aren’t the only external factors on Uptime’s radar. The industry watchers note that many public outages have been linked to fiber cuts and other networking disruptions.

“Digital infrastructure is becoming more distributed with outages originating outside the datacenter, including those tied to power availability, network connectivity or the reliance on external cloud services playing a larger role,” Uptime Analyst Andy Lawrence said in a statement.

According to the report, networking-related issues remain the most frequently cited cause for IT disruptions. Even if the datacenter itself doesn’t fail, a bad network configuration can still result in service outages.

The good news is that wide adoption of software-defined networking and automated traffic rerouting has helped mitigate this risk. The report found that 20 percent of those surveyed reported having no IT service outages in the past three years, an improvement of nine points from 2024.

Software-level resiliency is helping to mitigate localized disruptions, like a fiber cut, by distributing the workload across multiple sites. However, this software resiliency comes with its own challenges, most notably complexity. 

As we saw with the drone strikes on Amazon’s UAE and Bahrain datacenters, spreading your workloads out across multiple availability zones doesn’t do much good if the failure spreads to multiple sites.

While Uptime observed fewer outages in 2025, the report suggests outages may be lasting longer.

“While a majority of publicly reported incidents are still resolved within 12 hours (55 percent), the share lasting more than 48 hours has increased for the second consecutive year.”

As we mentioned earlier, many of these were tied to factors like damaged fiber lines, which Uptime notes occurred more than twice as often as usual.

As you might expect, the longer the outage, the more costly it can be, particularly when it concerns highly leveraged AI infrastructure. Uptime reports that one in five outages now exceeds $1 million in total costs, and expects that figure to continue to rise in the coming years. ®
