    October 29, 2025

    Whose Fault Is It When the Cloud Fails? Does It Matter?

    Stop Hunting for the Guilty Party and Start Pinpointing the Problem

    6 min read

    Key Takeaways
    • See why, in today's interconnected environments, assigning blame for an outage is inefficient.
    • Gain visibility into networks you don't own, including the internet and cloud provider infrastructure.
    • Gain definitive proof of where a problem lies, so you can stop internal finger-pointing.

    On Monday, October 20th, a significant portion of the digital services we use every day became inaccessible. For hours, banking, communication, and entertainment applications were unavailable. The root cause was later identified as a major outage within Amazon Web Services (AWS), the infrastructure that powers a vast number of online services. The initial response for any business affected by such an event is a frantic effort to diagnose the problem. Is it our application? Is our network down? What exactly is happening?

    The technical explanation pointed to a DNS resolution issue within AWS' US-EAST-1 region, which led to a cascade of failures across other services. For most leaders, the technical specifics are secondary to the immediate business impact. What truly matters is the operational paralysis that occurs when a critical dependency fails and you’re left without information on the cause, the location of the problem, or the expected resolution time. These events are not isolated and will happen again, which raises a question: When your operations depend on a complex chain of third-party services, how can you maintain effective control?
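A failure like this is observable from the outside with even a trivial check. As a minimal illustration (not AWS's tooling, just a sketch using Python's standard library), the following times a DNS lookup and reports whether it succeeded; during a resolution outage, the failure shows up here even while your own servers remain healthy:

```python
import socket
import time

def check_dns(hostname: str) -> tuple[bool, float]:
    """Attempt to resolve a hostname; return (success, elapsed seconds)."""
    start = time.monotonic()
    try:
        # getaddrinfo exercises the same resolution path applications use
        socket.getaddrinfo(hostname, 443)
        return True, time.monotonic() - start
    except socket.gaierror:
        return False, time.monotonic() - start

ok, elapsed = check_dns("example.com")
print(f"resolved={ok} in {elapsed * 1000:.1f} ms")
```

A real monitoring agent would run checks like this continuously from many vantage points, but the principle is the same: measure the dependency directly rather than inferring its health from your own systems.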

    The inefficiency of internal troubleshooting

    Consider the typical response to a major outage. Teams assemble in an emergency session, each equipped with data to prove their area of responsibility is not the source of the problem. Application teams demonstrate their code is functioning correctly. Infrastructure teams show their servers are online and healthy. Network teams provide metrics indicating the corporate network is operating perfectly.

    This approach is a holdover from an era when organizations owned and controlled their entire IT stack. In that environment, fault isolation was a logical process of elimination. However, that model is no longer relevant. Your applications now run on hardware you don't manage, connected by a global network of providers you don't control. The internet is your enterprise network, and the cloud provider's servers are your data center.

    Attempting to manage this new reality with obsolete methods is ineffective. The goal should not be to assign blame. The more productive question is, "Where is the problem occurring?" Answering this question requires a shift in how you monitor your services. It requires a new category of visibility.

    Gaining visibility into external networks

    The primary challenge is a gap in visibility. You can thoroughly monitor the systems within your direct control, but performance issues often originate in networks that lie outside your administrative domain. The path from your user to your application and back crosses numerous independent networks, including local ISPs, internet backbone carriers, and the cloud provider’s own complex infrastructure. When your monitoring stops at your network's edge, you have no data to explain externally caused performance degradation.

    To address this, you need a method to measure performance across the entire, end-to-end delivery path. This is accomplished through active, synthetic monitoring. By deploying lightweight monitoring points in user locations, data centers, or cloud environments, you can continuously send test traffic that simulates user transactions. This active testing measures critical metrics like latency, packet loss, and path performance on a hop-by-hop basis through every network segment. This provides a complete and segmented view of the entire service delivery chain.

    During an outage, such as the recent AWS one, this capability would have allowed you to see that your own networks and the initial internet segments were performing as expected, but that performance collapsed at a specific point within the cloud provider infrastructure. The process of isolating the problem's location would have taken minutes, not hours. It's not about finding fault; it’s about finding the failure domain with certainty.

    From reactive blame to proactive resolution

    When you can definitively isolate a problem to a domain outside your control, the entire dynamic of your response changes. Unproductive internal debates cease. Instead of consuming valuable time troubleshooting healthy systems, your teams can immediately focus on mitigation and communication. You can provide accurate information to your users and engage with your provider, armed with specific data that pinpoints the issue.

    This level of insight shifts your organization from a reactive to a proactive posture. You can identify and address network and application performance issues, sometimes before users are affected. The continuous stream of performance data allows you to validate whether your cloud and ISP vendors are meeting the expected service level. You can make data-driven architectural decisions, such as evaluating the resilience of a single cloud region or the reliability of a specific network provider.

    The complex, interconnected services we all depend on will fail again. The choice is whether you will be left without information, mired in an inefficient internal exercise to assign blame, or whether you will have the visibility to know exactly where the problem lies the moment it occurs. You can’t control the cloud, and you can’t control the internet, but you can equip yourself with tools that give you complete visibility across these domains.

    To learn more about building a foundation for true network observability from user to application, no matter where it's hosted, see what's required for a modern multi-cloud observability practice.

    Jeremy Rossbach

    As the Chief Technical Evangelist for NetOps by Broadcom, Jeremy is passionate about meeting with customers to identify their IT operational challenges and produce solutions that fit their business and network transformation goals. Prior to joining Broadcom, he spent more than 15 years working in IT, across both public...
