    October 29, 2025

    Whose Fault Is It When the Cloud Fails? Does It Matter?

    Stop Hunting for the Guilty Party and Start Pinpointing the Problem

    6 min read

    Key Takeaways
    • See why, in today's interconnected environments, assigning blame for an outage is inefficient.
    • Gain visibility into networks you don't own, including the internet and cloud provider infrastructure.
    • Gain definitive proof of where a problem lies, so you can stop internal finger-pointing.

    On Monday, October 20th, a significant portion of the digital services we use every day became inaccessible. For hours, banking, communication, and entertainment applications were unavailable. The root cause was later identified as a major outage within Amazon Web Services (AWS), the infrastructure that powers a vast number of online services. The initial response for any business affected by such an event is a frantic effort to diagnose the problem. Is it our application? Is our network down? What exactly is happening?

    The technical explanation pointed to a DNS resolution issue within AWS' US-EAST-1 region, which led to a cascade of failures across other services. For most leaders, the technical specifics are secondary to the immediate business impact. What truly matters is the operational paralysis that occurs when a critical dependency fails and you’re left without information on the cause, the location of the problem, or the expected resolution time. These events are not isolated and will happen again, which raises a question: When your operations depend on a complex chain of third-party services, how can you maintain effective control?
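A failure like this is observable from the outside with even a trivial check. As a minimal illustration (not AWS's tooling, just a sketch using Python's standard library), the following times a DNS lookup and reports whether it succeeded; during a resolution outage, the failure shows up here even while your own servers remain healthy:

```python
import socket
import time

def check_dns(hostname: str) -> tuple[bool, float]:
    """Attempt to resolve a hostname; return (success, elapsed seconds)."""
    start = time.monotonic()
    try:
        # getaddrinfo exercises the same resolution path applications use
        socket.getaddrinfo(hostname, 443)
        return True, time.monotonic() - start
    except socket.gaierror:
        return False, time.monotonic() - start

ok, elapsed = check_dns("example.com")
print(f"resolved={ok} in {elapsed * 1000:.1f} ms")
```

A real monitoring agent would run checks like this continuously from many vantage points, but the principle is the same: measure the dependency directly rather than inferring its health from your own systems.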

    The inefficiency of internal troubleshooting

    Consider the typical response to a major outage. Teams assemble in an emergency session, each equipped with data to prove their area of responsibility is not the source of the problem. Application teams demonstrate their code is functioning correctly. Infrastructure teams show their servers are online and healthy. Network teams provide metrics indicating the corporate network is operating perfectly.

    This approach is a holdover from an era when organizations owned and controlled their entire IT stack. In that environment, fault isolation was a logical process of elimination. However, that model is no longer relevant. Your applications now run on hardware you don't manage, connected by a global network of providers you don't control. The internet is your enterprise network, and the cloud provider's servers are your data center.

    Attempting to manage this new reality with obsolete methods is ineffective. The goal should not be to assign blame. The more productive question is, "Where is the problem occurring?" Answering this question requires a shift in how you monitor your services. It requires a new category of visibility.

    Gaining visibility into external networks

    The primary challenge is a gap in visibility. You can thoroughly monitor the systems within your direct control, but performance issues often originate in networks that lie outside your administrative domain. The path from your user to your application and back crosses numerous independent networks, including local ISPs, internet backbone carriers, and the cloud provider’s own complex infrastructure. When your monitoring stops at your network's edge, you have no data to explain externally caused performance degradation.

    To address this, you need a method to measure performance across the entire, end-to-end delivery path. This is accomplished through active, synthetic monitoring. By deploying lightweight monitoring points in user locations, data centers, or cloud environments, you can continuously send test traffic that simulates user transactions. This active testing measures critical metrics like latency, packet loss, and path performance on a hop-by-hop basis through every network segment. This provides a complete and segmented view of the entire service delivery chain.

    During an outage, such as the recent AWS one, this capability would have allowed you to see that your own networks and the initial internet segments were performing as expected, but that performance collapsed at a specific point within the cloud provider infrastructure. The process of isolating the problem's location would have taken minutes, not hours. It's not about finding fault; it’s about finding the failure domain with certainty.

    From reactive blame to proactive resolution

    When you can definitively isolate a problem to a domain outside your control, the entire dynamic of your response changes. Unproductive internal debates cease. Instead of consuming valuable time troubleshooting healthy systems, your teams can immediately focus on mitigation and communication. You can provide accurate information to your users and engage with your provider, armed with specific data that pinpoints the issue.

    This level of insight shifts your organization from a reactive to a proactive posture. You can identify and address network and application performance issues, sometimes before users are affected. The continuous stream of performance data allows you to validate whether your cloud and ISP vendors are meeting the expected service level. You can make data-driven architectural decisions, such as evaluating the resilience of a single cloud region or the reliability of a specific network provider.

    The complex, interconnected services we all depend on will fail again. The choice is whether you will be left without information, mired in an inefficient internal exercise to assign blame, or whether you will have the visibility to know exactly where the problem lies the moment it occurs. You can’t control the cloud, and you can’t control the internet, but you can equip yourself with tools that give you complete visibility across these domains.

    To learn more about building a foundation for true network observability from user to application, no matter where it's hosted, see what's required for a modern multi-cloud observability practice.

    Jeremy Rossbach

    As the Chief Technical Evangelist for NetOps by Broadcom, Jeremy is passionate about meeting with customers to identify their IT operational challenges and produce solutions that fit their business and network transformation goals. Prior to joining Broadcom, he spent more than 15 years working in IT, across both public...
