Whose Fault Is It When the Cloud Fails? Does It Matter?

Written by Jeremy Rossbach | Oct 29, 2025 6:02:20 PM
Key Takeaways
  • See why, in today's interconnected environments, assigning blame for an outage is inefficient.
  • Gain visibility into networks you don't own, including the internet and cloud provider infrastructure.
  • Gain definitive proof of where a problem lies, so you can stop internal finger-pointing.

On Monday, October 20th, a significant portion of the digital services we use every day became inaccessible. For hours, banking, communication, and entertainment applications were unavailable. The root cause was later identified as a major outage within Amazon Web Services (AWS), the infrastructure that powers a vast number of online services. The initial response for any business affected by such an event is a frantic effort to diagnose the problem. Is it our application? Is our network down? What exactly is happening?

The technical explanation pointed to a DNS resolution issue within AWS' US-EAST-1 region, which led to a cascade of failures across other services. For most leaders, the technical specifics are secondary to the immediate business impact. What truly matters is the operational paralysis that occurs when a critical dependency fails and you’re left without information on the cause, the location of the problem, or the expected resolution time. These events are not isolated and will happen again, which raises a question: When your operations depend on a complex chain of third-party services, how can you maintain effective control?
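To make that first diagnostic question concrete: when a dependency misbehaves, one of the fastest checks is whether its name even resolves, or whether it resolves but won't accept connections. The sketch below is a minimal illustration of that distinction using only the Python standard library; the endpoint name is a placeholder for whatever dependency your application calls, not a statement about what failed inside AWS.

```python
import socket
import time

def classify_failure(hostname: str, port: int = 443, timeout: float = 3.0) -> str:
    """Roughly distinguish a DNS resolution failure from a connectivity failure."""
    try:
        start = time.monotonic()
        # Resolve the name the same way a client would before opening a TCP connection.
        addrinfo = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)
        resolve_ms = (time.monotonic() - start) * 1000
    except socket.gaierror as exc:
        return f"DNS resolution failed for {hostname}: {exc}"

    ip = addrinfo[0][4][0]
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return f"{hostname} resolved to {ip} in {resolve_ms:.0f} ms and accepts connections"
    except OSError as exc:
        return f"{hostname} resolved to {ip}, but connecting failed: {exc}"

# Placeholder dependency endpoint; substitute the service your application actually calls.
print(classify_failure("dependency.example.com"))
```

A check like this doesn't tell you why a provider is failing, but it does tell you which layer to stop suspecting, which is the point of the sections that follow.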

The inefficiency of internal troubleshooting

Consider the typical response to a major outage. Teams assemble in an emergency session, each equipped with data to prove their area of responsibility is not the source of the problem. Application teams demonstrate their code is functioning correctly. Infrastructure teams show their servers are online and healthy. Network teams provide metrics indicating the corporate network is operating perfectly.

This approach is a holdover from an era when organizations owned and controlled their entire IT stack. In that environment, fault isolation was a logical process of elimination. However, that model is no longer relevant. Your applications now run on hardware you don't manage, connected by a global network of providers you don't control. The internet is your enterprise network, and the cloud provider's servers are your data center.

Attempting to manage this new reality with obsolete methods is ineffective. The goal should not be to assign blame. The more productive question is, "Where is the problem occurring?" Answering this question requires a shift in how you monitor your services. It requires a new category of visibility.

Gaining visibility into external networks

The primary challenge is a gap in visibility. You can thoroughly monitor the systems within your direct control, but performance issues often originate in networks that lie outside your administrative domain. The path from your user to your application and back crosses numerous independent networks, including local ISPs, internet backbone carriers, and the cloud provider’s own complex infrastructure. When your monitoring stops at your network's edge, you have no data to explain externally caused performance degradation.

To address this, you need a method to measure performance across the entire, end-to-end delivery path. This is accomplished through active, synthetic monitoring. By deploying lightweight monitoring points in user locations, data centers, or cloud environments, you can continuously send test traffic that simulates user transactions. This active testing measures critical metrics like latency, packet loss, and path performance on a hop-by-hop basis through every network segment. This provides a complete and segmented view of the entire service delivery chain.
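As a flavor of the idea (a simplified sketch, not a product implementation), the snippet below shows the end-to-end transaction piece of an active probe in Python: it periodically times a full HTTP request against a test URL and records latency or the failure reason, the kind of raw data a monitoring point would stream to a collector alongside traceroute-style, hop-by-hop path measurements. The URL, interval, and iteration count are placeholder assumptions.

```python
import time
import urllib.request
from dataclasses import dataclass

@dataclass
class ProbeResult:
    timestamp: float
    ok: bool
    latency_ms: float | None
    detail: str

def http_probe(url: str, timeout: float = 5.0) -> ProbeResult:
    """Time one synthetic HTTP transaction, simulating a user request."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            resp.read(1024)  # read some bytes so the timing covers real data transfer
        latency_ms = (time.monotonic() - start) * 1000
        return ProbeResult(time.time(), True, latency_ms, "ok")
    except Exception as exc:  # DNS errors, timeouts, TLS failures, HTTP errors, ...
        return ProbeResult(time.time(), False, None, repr(exc))

def run_probe_loop(url: str, interval_s: float = 30.0, iterations: int = 10) -> list[ProbeResult]:
    """Continuously send test traffic; in practice results stream to a central collector."""
    results = []
    for _ in range(iterations):
        results.append(http_probe(url))
        time.sleep(interval_s)
    return results

if __name__ == "__main__":
    for r in run_probe_loop("https://example.com/health", interval_s=5.0, iterations=3):
        print(r)
```

The value comes from running probes like this from many vantage points at once: when the same transaction succeeds from one location and fails from another, the difference between the two paths is where you start looking.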

During an outage, such as the recent AWS one, this capability would have allowed you to see that your own networks and the initial internet segments were performing as expected, but that performance collapsed at a specific point within the cloud provider infrastructure. The process of isolating the problem's location would have taken minutes, not hours. It's not about finding fault; it’s about finding the failure domain with certainty.

From reactive blame to proactive resolution

When you can definitively isolate a problem to a domain outside your control, the entire dynamic of your response changes. Unproductive internal debates cease. Instead of consuming valuable time troubleshooting healthy systems, your teams can immediately focus on mitigation and communication. You can provide accurate information to your users and engage with your provider, armed with specific data that pinpoints the issue.

This level of insight shifts your organization from a reactive to a proactive posture. You can identify and address network and application performance issues, sometimes before users are affected. The continuous stream of performance data allows you to validate whether your cloud and ISP vendors are meeting the expected service level. You can make data-driven architectural decisions, such as evaluating the resilience of a single cloud region or the reliability of a specific network provider.
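To give a sense of the vendor-validation point (a simplified sketch; the threshold and data shape are illustrative assumptions, not contract terms), the snippet below computes what fraction of synthetic probes met a latency objective over a measurement window:

```python
from typing import Iterable, Optional

def slo_compliance(latencies_ms: Iterable[Optional[float]],
                   threshold_ms: float = 250.0) -> float:
    """Fraction of probes that succeeded and stayed under the latency objective.

    `None` entries represent failed probes (timeouts, DNS errors, etc.) and
    count against compliance.
    """
    samples = list(latencies_ms)
    if not samples:
        return 0.0
    good = sum(1 for v in samples if v is not None and v <= threshold_ms)
    return good / len(samples)

# Example with made-up measurements: three good probes, one slow, one failed.
window = [120.0, 95.0, 180.0, 410.0, None]
print(f"Compliance over window: {slo_compliance(window):.0%}")  # -> 60%
```

Tracked per provider and per network segment, a number like this turns "the internet felt slow last week" into evidence you can bring to a vendor conversation or an architecture review.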

The complex, interconnected services we all depend on will fail again. The choice is whether you will be left without information, stuck in an inefficient internal effort to assign blame, or whether you will have the visibility to know exactly where the problem is the moment it occurs. You can’t control the cloud, and you can’t control the internet, but you can equip yourself with tools that give you complete visibility across these domains.

To learn more about building a foundation for true network observability from user to application, no matter where it's hosted, see what's required for a modern multi-cloud observability practice.