When cloud giants neglect resilience

opinion
Apr 17, 2026 · 5 mins

Numerous cloud outages reveal the cracks in the providers’ foundations. Enterprises face tough choices as reliability declines in importance.

In a recent article chronicling the history of Microsoft Azure and its intensifying woes, we see a narrative that has been building throughout the industry for years. As cloud computing evolved from a buzzword to the backbone of digital infrastructure, major providers like Microsoft, Amazon, and Google have had to make compromises. Their promises of near-perfect uptime shifted from an expectation to “good enough,” influenced by economic pressures that have seen the cloud giants prioritize cost cuts and staff reductions over previously non-negotiable service reliability.

Frankly, many of us who follow the cloud space closely have been warning about this situation for some time. Cloud outages are no longer rare, freak events. They are built into the model as accepted collateral for the rapid growth and relentless cost-cutting that define this era of cloud computing. The story of Azure, as discussed in the referenced Register piece, is simply the latest and most prominent example of a much larger, industrywide trend.

This is not to say that cloud computing is inherently unstable or that its advantages—agility, scalability, rapid deployment—are a mirage. Enterprises aren’t abandoning the cloud. Far from it. Adoption continues at pace, even as these high-profile outages occur. The question is not whether the cloud is worth it, but rather, how much unreliability is acceptable for all that innovation and efficiency?

The price of cost optimization

If you trace the decisions of major public cloud players, a clear theme emerges. Competitive pressure from rivals translates to constant cost control, rushing services to market, shaving operational budgets, automating wherever possible, and reducing (or outright eliminating) teams of deeply experienced engineering talent who once ensured continuity and institutional knowledge. The comments from a former Azure engineer clearly illustrate how an exodus of talent, paired with an almost single-minded focus on AI and automation, is having downstream effects on the platform’s stability and support.

The irony is sharp: As cloud providers trumpet their AI prowess and machine-driven automation, the human expertise that built and reliably ran these platforms is no longer considered mission-critical. Automation isn’t a cure-all; companies still need experienced architects and operators who understand system limits, manage dependencies, and respond deftly to unpredictable failures. Recent major outages reflect the slow but sure loss of that deeply embedded human knowledge. Meanwhile, engineering decisions are increasingly made by people juggling ever-larger portfolios, new feature launches, and cost-reduction mandates, rather than by people with a methodical focus on resilience and craftsmanship.

Azure faces growing pains at scale, with tens of thousands of lines of AI-generated code created, tested, and deployed daily (sometimes by other AI agents), creating a self-reinforcing cycle of complexity and opacity. The resulting “compute crunch” puts even more strain on infrastructure, which, despite its sophistication, now handles heavier loads with fewer people providing oversight.

Outages aren’t driving users away

A natural question emerges: With reliability clearly taking a back seat, why aren’t enterprises reconsidering cloud altogether? I’ve argued for years that the game has changed. The benefits of cloud centralization, automation, and connectivity have become so fundamental to operations that the industry has quietly recalibrated its tolerance for outages. Public cloud is so deeply embedded into the business and digital operations that stepping back would mean undoing years, and often decades, of progress.

Headline-grabbing outages are dramatic but usually survivable. Disaster recovery plans, multi-region deployments, and architectural workarounds are now essentials for all major cloud-based companies. Building with failure in mind is a standard cost, not an avoidable exception. For most CIOs, the persistent risk of downtime is a manageable variable, balanced against the unmatchable benefits of cloud agility and in-house scale.

Providers know this well, and their actions reflect it. Outages may sting a bit in the press, but the real-world consequences have yet to outweigh the benefits to companies that push further into the cloud. As such, the providers’ logic is simple: As long as customers accept outages, however grudgingly, providers have little incentive to switch to costlier, less scalable systems.

How enterprises can adapt

With outages now the price of admission, enterprises should recognize that neither staff cuts nor the blind pursuit of automation will stop anytime soon. Cloud providers may promise improvements, but their incentives will remain focused on cost control over reliability. Organizations must adapt to this new normal, but they can still make choices that reduce their risk.

First, enterprises should prioritize fault-tolerant cloud architecture. Adopting multicloud and hybrid cloud strategies, while complex, reduces the technical risk of relying on a single provider.
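For readers who want to see what reduced reliance on a single provider can look like in code, here is a minimal Python sketch of ordered failover. It is an illustration under stated assumptions, not a production pattern: the names `call_with_failover` and `AllProvidersFailed` are hypothetical, and each "provider" is simply a callable that the enterprise would write to wrap one cloud's endpoint.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


class AllProvidersFailed(Exception):
    """Raised when every configured provider has failed (hypothetical name)."""


def call_with_failover(
    providers: Sequence[tuple[str, Callable[[], T]]],
) -> tuple[str, T]:
    """Try each (name, call) pair in order and return the first success.

    Returns the name of the provider that answered along with its result,
    so callers can log when traffic has shifted away from the primary.
    """
    errors = []
    for name, call in providers:
        try:
            return name, call()
        except Exception as exc:  # broad on purpose: any failure triggers failover
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))
```

A real deployment would add timeouts, health checks, and data replication between providers, which is where most of the complexity mentioned above actually lives.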

Second, it’s crucial to invest in in-house expertise that understands both the workloads and the nuances of cloud service behavior. While the providers may treat their operations talent as expendable, nothing will replace the value of an enterprise’s in-house team to independently monitor, test, and prepare for the unexpected.

Finally, enterprises must enforce strict vendor management. This means holding providers accountable for promised service-level agreements, monitoring transparency in communication and incident reporting, and leveraging contracted services to their fullest extent, especially as the cloud market matures and customer influence grows.
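Holding a provider to an SLA starts with simple arithmetic that an in-house team can run independently. The Python sketch below shows one way to do it, assuming availability is measured as uptime over a fixed period; the function names and the example figures are illustrative, and actual SLA contracts define downtime and measurement windows in their own terms.

```python
from datetime import timedelta


def allowed_downtime(sla_percent: float, period: timedelta) -> timedelta:
    """Downtime that an availability target permits over a period."""
    return period * (1 - sla_percent / 100)


def measured_availability(outages: list[timedelta], period: timedelta) -> float:
    """Availability percentage observed over the period, given outage durations."""
    downtime = sum(outages, timedelta())
    return 100 * (1 - downtime / period)


def sla_breached(
    outages: list[timedelta], period: timedelta, sla_percent: float
) -> bool:
    """True when observed availability falls below the promised target."""
    return measured_availability(outages, period) < sla_percent
```

Under these assumptions, a 99.9% target over a 30-day month leaves a budget of roughly 43 minutes, so two outages of 30 and 20 minutes in the same month would put the provider in breach.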

The era of the infallible cloud is over. As public cloud providers pursue operational efficiency and AI dominance, resilience has taken a hit, and both providers and users must adapt. The challenge for today’s enterprises is to strategically mitigate the most likely consequences before the next outage strikes.

David Linthicum

David S. Linthicum is an internationally recognized industry expert and thought leader. Dave has authored 13 books on computing, the latest of which is An Insider’s Guide to Cloud Computing. Dave’s industry experience includes tenures as CTO and CEO of several successful software companies, and upper-level management positions in Fortune 100 companies. He keynotes leading technology conferences on cloud computing, SOA, enterprise application integration, and enterprise architecture. Dave writes the Cloud Insider blog for InfoWorld. His views are his own.
