With all the hype surrounding AI, it’s critical to focus on building resilient networks capable of handling the performance demands that AI will introduce. As I often say, if your network observability solution isn’t detecting packet loss, neither will your AI engine. When you ask, “What’s the status of our global network health this morning?” a flawed or incomplete response could jeopardize critical decisions.
Consider Meta’s documented challenges in contending with AI’s impact on their network operations.
In 2022, the team at Meta foresaw that the growing demands of AI on their backbone network would be unpredictable and turbulent. The company faced a tsunami of AI data traffic, driven by a huge increase in their GPU builds.
Figure 1: Meta discovered that 57% of its AI workload data was spent on network operations, highlighting the critical role of network I/O in enabling AI success. (Source: Alexis Bjorlin, 2022 OCP Global Summit)
“The impact of large clusters, generative AI (GenAI), and artificial general intelligence (AGI) is still an unknown,” said Jyotsna Sundaresan, a Meta network strategist. “We haven’t fully explored what that means for the backend.”
Meta’s networking team saw a higher-than-anticipated increase in backbone traffic, leaving them unprepared. As Sundaresan acknowledged, “We were not ready for this.”
It’s a warning echoed by Gartner below: Are any of our networks truly prepared for AI workloads?
Figure 2: Gartner’s analysis on how gen AI will reshape network infrastructure.
Today’s consumer may tolerate some buffering or latency in certain scenarios. For instance, during the Mike Tyson/Jake Paul fight, the stream occasionally buffered at critical moments. While frustrating, it didn’t ruin the overall experience—it was free, and viewers could still catch the highlights and find out who won.
AI workloads, however, are far less forgiving. Massive data volumes, complex traffic patterns, and unprecedented bandwidth requirements will demand a level of network performance we’ve never seen before.
For example, imagine walking into a network operations center and asking your AI assistant, “What should I prioritize this morning?” If your Seattle branch office is dropping packets and your observability solution isn’t catching it, your AI engine won’t either. It will reply with a misleading, “Everything’s okay.”
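To make that concrete, here is a minimal Python sketch. Everything in it is hypothetical: the `SiteSample` type, the `summarize_health` function, the 1% loss threshold, and the site names are illustrative assumptions, not the API of any particular observability product. It shows that a health summary can only flag sites where loss was actually measured, and that labeling unmonitored sites as "unknown" is safer than folding them into "Everything's okay."

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SiteSample:
    """Hypothetical per-site probe result; None means no data was collected."""
    site: str
    packets_sent: Optional[int] = None
    packets_received: Optional[int] = None


def loss_percent(sample: SiteSample) -> Optional[float]:
    """Return packet loss as a percentage, or None when the site is unmonitored."""
    if not sample.packets_sent or sample.packets_received is None:
        return None
    lost = sample.packets_sent - sample.packets_received
    return 100.0 * lost / sample.packets_sent


def summarize_health(samples, loss_threshold=1.0):
    """Group sites into degraded, healthy, and unknown (no telemetry)."""
    summary = {"degraded": [], "healthy": [], "unknown": []}
    for s in samples:
        loss = loss_percent(s)
        if loss is None:
            summary["unknown"].append(s.site)
        elif loss >= loss_threshold:
            summary["degraded"].append(s.site)
        else:
            summary["healthy"].append(s.site)
    return summary


# Illustrative data only; the threshold and counts are assumptions.
samples = [
    SiteSample("Seattle", packets_sent=1000, packets_received=950),
    SiteSample("Austin", packets_sent=1000, packets_received=1000),
    SiteSample("Denver"),  # no probe data: must not be reported as healthy
]
print(summarize_health(samples))
```

An assistant built on a summary like this can answer "Seattle is degraded, Denver is unmonitored" instead of a blanket all-clear, but only if the Seattle measurements exist in the first place.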
Many AI initiatives will incorporate cloud and SD-WAN components, introducing additional challenges. With data transfers, front-end user interfaces, backups, and high availability systems potentially spread across multiple locations, visibility across the entire network path is essential.
If your observability solution doesn’t collect end-to-end delivery metrics from both internally and externally managed networks, you’ll struggle to troubleshoot when an end user reports an issue.
During a recent interview on The NetOps Expert podcast, I spoke with Julian Guthrie, CEO of Alphy, a startup using AI to improve human communication. When I asked what the typical network path might look like for a user of their solution, the answer looked something like this:
👩💻 Hybrid worker → 🛜 Home/Hotel WiFi → 📡 Residential/Local ISP → 🚦 Transit ISPs → 🌦️ Cloud/SaaS Provider (hosting the AI solution).
Seems straightforward, right? But here’s the catch: Does your network team have administrative access to any of these networks? Likely not. If a user reports poor performance, will you tell them to call their ISP?
Figure 3: Visibility into the entire network path the user experience relies upon reveals poor performance of unmanaged cloud and ISP network devices.
Of course not. Regardless of who owns the infrastructure, your organization is responsible for delivering a seamless user experience. It’s your brand, revenue, and reputation on the line. A mature network observability solution provides visibility into ISP and cloud infrastructure, enabling you to triage such issues effectively.
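As a rough illustration of that kind of triage, the Python sketch below models the path described above as a list of segments and picks out the one contributing the most loss. The `Hop` type, the `worst_segment` helper, and every latency and loss figure are made-up assumptions for the example; a real observability tool would supply measured per-segment data.

```python
from dataclasses import dataclass


@dataclass
class Hop:
    """Hypothetical per-segment measurement along one user's path."""
    segment: str
    managed: bool        # True if your team administers this segment
    latency_ms: float    # latency added by this segment
    loss_pct: float      # packet loss observed on this segment


def worst_segment(path):
    """Return the hop with the highest loss, breaking ties by latency."""
    return max(path, key=lambda h: (h.loss_pct, h.latency_ms))


# Illustrative numbers only; none of these values come from a real network.
path = [
    Hop("Home/Hotel WiFi", managed=False, latency_ms=4.0, loss_pct=0.1),
    Hop("Residential/Local ISP", managed=False, latency_ms=12.0, loss_pct=0.2),
    Hop("Transit ISPs", managed=False, latency_ms=85.0, loss_pct=3.5),
    Hop("Cloud/SaaS Provider", managed=False, latency_ms=9.0, loss_pct=0.0),
]

culprit = worst_segment(path)
owner = "managed" if culprit.managed else "unmanaged"
print(f"Worst segment: {culprit.segment} ({owner}), "
      f"{culprit.loss_pct}% loss, {culprit.latency_ms} ms")
```

Note that every segment here is unmanaged, which is the point: without per-segment visibility, there is nothing to hand to the ISP or cloud provider beyond "it's slow."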
Earlier this year, tens of thousands of AT&T users experienced widespread service disruptions. While AT&T restored service quickly, they acknowledged the disruption and pledged to prevent similar outages in the future.
With AI’s growing role in our networks, disruptions like this will have far greater consequences. AI-driven systems rely on flawless performance, and even minor issues can derail critical initiatives.
AI is not just on the horizon—it’s already here. The success of your AI initiatives will depend on the resilience of your network. Start building that resilience today to meet the demands AI will place on your infrastructure tomorrow.
Because in the world of AI, there’s no room for “buffering.”