    October 8, 2021

    Facebook Outage Underscores Risks of Modern SDN Networks

    On October 4, 2021, an outage took Facebook, WhatsApp, and Instagram down for six hours. This high-profile outage is a powerful example of the difficulties facing those tasked with managing modern network architectures. While there has been a lot of chatter about misconfiguration and DNS failures, the outage may have more to do with Facebook's software-defined networks (SDN) and the errors organizations are exposed to when the control plane is centralized and separated from network devices.

    An early adopter of SDN, Facebook pioneered innovative methods to automate provisioning of its massive networks. Najam Ahmad, director of technical operations, who architected SDN for Facebook, said in 2014, "We want to deploy, manage, monitor and fix networks using software." One widely publicized use case is Facebook's software-defined backbone routing. Facebook uses SDN to augment BGP route selection, employing advanced congestion and capacity analytics to overcome the common shortcomings of BGP.

    Facebook’s SDN controllers produce routes that are then sent to the edge routers, which are responsible for connecting to the outside world. Public speculation holds that these SDN-controlled BGP route selection processes encountered a glitch that made the Facebook edge routers unreachable. This initial issue was then exacerbated when Facebook's internally hosted DNS went into hibernation mode in response to the lost connections. The blog post by Facebook's Santosh Janardhan further supports this hypothesis: "During one of these routine maintenance jobs, a command was issued with the intention to assess the availability of global backbone capacity, which unintentionally took down all the connections in our backbone network, effectively disconnecting Facebook data centers globally."

    This type of disconnect between the control plane and the data plane also occurs in other SDN environments, including SD-WAN. For example, operators find cases in which an SD-WAN controller issues a reroute command to an edge device and the edge device ignores it. There have also been cases in which SD-WAN controllers issue rerouting commands unnecessarily. It is becoming vital that teams in the network operations center monitor the state and the activities of controllers, so they can ensure that critical software-defined transactions are executed correctly.
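
    One way to picture this kind of controller monitoring is a reconciliation check that compares the routes a controller intended to install with the routes each edge device actually reports. The following is a minimal sketch under stated assumptions: the `RouteIntent` type, the `reconcile` function, and the device names are hypothetical, and how intent and device state are collected (controller API, telemetry, CLI scraping) is left out entirely.

```python
from dataclasses import dataclass


# Hypothetical record of one route: a prefix and the next hop it should use.
@dataclass(frozen=True, order=True)
class RouteIntent:
    prefix: str
    next_hop: str


def reconcile(intended, observed):
    """Compare controller intent with observed device state.

    Both arguments map a device name to a set of RouteIntent values.
    Returns a dict of devices that drifted, listing routes the device
    never applied ("missing") and routes nobody asked for ("unexpected").
    """
    drift = {}
    for device in sorted(set(intended) | set(observed)):
        want = intended.get(device, set())
        have = observed.get(device, set())
        missing = sorted(want - have)      # command issued but not applied
        unexpected = sorted(have - want)   # state with no matching intent
        if missing or unexpected:
            drift[device] = {"missing": missing, "unexpected": unexpected}
    return drift


if __name__ == "__main__":
    intended = {"edge1": {RouteIntent("10.0.0.0/24", "192.0.2.1")}}
    observed = {"edge1": {RouteIntent("10.0.0.0/24", "192.0.2.9")}}
    for device, delta in reconcile(intended, observed).items():
        print(device, delta)
```

    A device that ignores a reroute shows up under "missing", while state that no controller requested shows up under "unexpected". Either result is a signal for the operations team to investigate, not a verdict on which side is wrong.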

    The important lesson here is that, as we evolve networks to be more intelligent, whether by centralizing the control plane or by leveraging automation to boost efficiency, we must establish suitable network observability. To support SDN and automation, new NetOps capabilities have to be deployed. For example, teams need to employ new operational workflows so they can validate commands using network intelligence, such as discovered topologies and historical performance and flow data. Teams can also initiate active tests that simulate traffic going to and coming from the edge router, so they can better preempt potential issues. SDN has provided many benefits, and promises more to come, but it has also introduced risks. SDN-enabled network observability is therefore needed to establish checks and balances. By establishing modern network observability, teams can monitor not only the networks, but also the software intelligence that governs those networks.
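
    As an illustration of validating a command against a discovered topology, the sketch below simulates draining a set of backbone links and checks whether every site would still be reachable before the change is applied. It is a simplified example, not a description of Facebook's tooling: the function names, the site names, and the representation of the topology as undirected links between known sites are all assumptions made for illustration.

```python
from collections import deque


def is_connected(nodes, links):
    """Return True if every node can reach every other node over links.

    Assumes each link is a pair of names that both appear in nodes.
    """
    nodes = set(nodes)
    if not nodes:
        return True
    adjacency = {node: set() for node in nodes}
    for a, b in links:
        adjacency[a].add(b)
        adjacency[b].add(a)
    # Breadth-first search from an arbitrary starting node.
    start = next(iter(nodes))
    seen = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for neighbor in adjacency[current]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen == nodes


def safe_to_apply(nodes, links, links_to_drain):
    """Simulate removing links_to_drain and report whether the
    remaining topology still connects all nodes."""
    # Links are undirected, so ("a", "b") and ("b", "a") are the same link.
    drained = {frozenset(link) for link in links_to_drain}
    remaining = [link for link in links if frozenset(link) not in drained]
    return is_connected(nodes, remaining)


if __name__ == "__main__":
    sites = ["dc1", "dc2", "dc3"]
    backbone = [("dc1", "dc2"), ("dc2", "dc3"), ("dc1", "dc3")]
    print(safe_to_apply(sites, backbone, [("dc1", "dc2")]))
    print(safe_to_apply(sites, backbone, backbone))
```

    A real pre-check would also weigh capacity and historical flow data, as described above. Connectivity alone is only the first gate a proposed change should pass.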

    Jeremy Rossbach

    As the Chief Technical Evangelist for NetOps by Broadcom, Jeremy is passionate about meeting with customers to identify their IT operational challenges and produce solutions that fit their business and network transformation goals. Prior to joining Broadcom, he spent more than 15 years working in IT, across both public...
