    November 22, 2024

    Cloud Application Performance: Common Reasons for Slow-Downs

    Key Takeaways
    • Gain an understanding of common factors that can cause cloud application performance issues.
    • Discover ways to boost performance by reducing network-induced round-trip latencies.
    • Employ the Universal Monitoring Agent (UMA) within AIOps and Observability solutions from Broadcom to monitor entire cluster interactions.

    Worklog entry

    It often happens that an application performs well when running on bare metal. However, after the application is packaged into an image, tested on Docker/Podman, and then migrated to Kubernetes, performance plummets. As the number of database calls issued by the application grows, application response times increase.

    This common situation is often due to data-transfer-induced I/O delays.


    Note: I/O stands for input/output and can be applied to disk read/write speeds, network transmission/reception, and other areas.

    What experience has shown

    Imagine running this same application on bare metal again, but in this scenario the database has its storage created on a remote Network File System (NFS) device.

    This means we need to account for these factors:

    1. The number of applications that access the storage endpoint (assuming the endpoint itself is fast by design).
      This will influence overall response times through storage I/O contention.
    2. The way the application has been programmed to retrieve data from the database.
      Does it cache the data, or does it simply issue another request? Are the requests small or rather large?
    3. The latency induced through the network access to the storage and the speed of the storage overall.

    When accessing a network device, several major factors impact data transfer speeds:

    • Available link speed
    • The configured maximum transmission unit (MTU)/maximum segment size (MSS)
    • The size of the transferred payload
    • TCP window settings (sliding window)
    • Average round-trip time (RTT) latency from A to B, in ms

    In a data center setup, the link speed is usually not an issue, since all hosts are interconnected with at least 1 Gbps, often 10 Gbps, and frequently through the same switch. The MTU/MSS will be optimal for the configured link. What remains are the size of the transferred payload, the TCP sliding window, and the average RTT. The factors with the greatest impact are the payload size and the link latency and quality, which in turn influence the sliding window.
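
    To see where a given link stands on these factors, a few standard Linux commands are usually sufficient (a quick sketch, assuming the relevant interface is named eth0; adjust the name to your environment):

    # Configured MTU of the interface carrying the storage/DB traffic
    ip link show eth0

    # Kernel TCP buffer limits and window scaling (these bound the sliding window)
    sysctl net.ipv4.tcp_rmem net.ipv4.tcp_wmem net.ipv4.tcp_window_scaling

    # Per-connection view of MSS, congestion window, and measured RTT
    ss -ti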

    Here’s an example of the inter-node latencies in a current Kubernetes setup:

    adm@k8s-jm-master:~$ ping k8s-jm-master
    PING k8s-jm-master (127.0.1.1) 56(84) bytes of data.
    64 bytes from k8s-jm-master (127.0.1.1): icmp_seq=1 ttl=64 time=0.062 ms
    64 bytes from k8s-jm-master (127.0.1.1): icmp_seq=2 ttl=64 time=0.056 ms
    64 bytes from k8s-jm-master (127.0.1.1): icmp_seq=3 ttl=64 time=0.045 ms

    --- k8s-jm-master ping statistics ---
    3 packets transmitted, 3 received, 0% packet loss, time 2041ms
    rtt min/avg/max/mdev = 0.045/0.054/0.062/0.007 ms

    adm@k8s-jm-master:~$ ping k8s-jm-node1
    PING k8s-jm-node1.aiops-jm.broadcom.net (10.252.213.5) 56(84) bytes of data.
    64 bytes from k8s-jm-node1.aiops-jm.broadcom.net (10.252.213.5): icmp_seq=1 ttl=64 time=0.370 ms
    64 bytes from k8s-jm-node1.aiops-jm.broadcom.net (10.252.213.5): icmp_seq=2 ttl=64 time=0.347 ms
    64 bytes from k8s-jm-node1.aiops-jm.broadcom.net (10.252.213.5): icmp_seq=3 ttl=64 time=0.469 ms

    --- k8s-jm-node1.aiops-jm.broadcom.net ping statistics ---
    3 packets transmitted, 3 received, 0% packet loss, time 2041ms
    rtt min/avg/max/mdev = 0.347/0.395/0.469/0.052 ms

    adm@k8s-jm-master:~$ ping k8s-jm-node2
    PING k8s-jm-node2.aiops-jm.broadcom.net (10.252.213.6) 56(84) bytes of data.
    64 bytes from k8s-jm-node2.aiops-jm.broadcom.net (10.252.213.6): icmp_seq=1 ttl=64 time=0.470 ms
    64 bytes from k8s-jm-node2.aiops-jm.broadcom.net (10.252.213.6): icmp_seq=2 ttl=64 time=0.407 ms
    64 bytes from k8s-jm-node2.aiops-jm.broadcom.net (10.252.213.6): icmp_seq=3 ttl=64 time=0.451 ms

    --- k8s-jm-node2.aiops-jm.broadcom.net ping statistics ---
    3 packets transmitted, 3 received, 0% packet loss, time 2038ms
    rtt min/avg/max/mdev = 0.407/0.442/0.470/0.026 ms


    Note: When the payload size of the ping packet is increased beyond the current MTU, response times will rise. The following example pings node1 with a payload size of roughly 64 KB, an extreme case that clearly illustrates the latency increase.

    adm@k8s-jm-master:~$ ping -s 65500 k8s-jm-node1
    PING k8s-jm-node1.aiops-jm.broadcom.net (10.252.213.5) 65500(65528) bytes of data.
    65508 bytes from k8s-jm-node1.aiops-jm.broadcom.net (10.252.213.5): icmp_seq=1 ttl=64 time=1.42 ms
    65508 bytes from k8s-jm-node1.aiops-jm.broadcom.net (10.252.213.5): icmp_seq=2 ttl=64 time=1.43 ms
    65508 bytes from k8s-jm-node1.aiops-jm.broadcom.net (10.252.213.5): icmp_seq=3 ttl=64 time=1.47 ms

    --- k8s-jm-node1.aiops-jm.broadcom.net ping statistics ---
    3 packets transmitted, 3 received, 0% packet loss, time 2003ms
    rtt min/avg/max/mdev = 1.422/1.440/1.474/0.023 ms
    adm@k8s-jm-master:~$ tracepath k8s-jm-node1
     1?: [LOCALHOST]                      pmtu 1500
     1:  k8s-jm-node1.aiops-jm.broadcom.net                    0.560ms reached
     1:  k8s-jm-node1.aiops-jm.broadcom.net                    0.359ms reached
         Resume: pmtu 1500 hops 1 back 1

    Taking these three points together (the size of the transferred payload, the TCP sliding window, and the average RTT), if a developer creates an application without built-in caching capabilities, database requests will stack up and cause I/O pressure. This omission routes all requests through the network link and increases the response times of the entire application.

    On local storage or in a low-latency (network and storage) environment, you can expect the application to perform well. This situation is typical when development teams test in a local environment. If we switch to network storage, network-induced latency will limit the maximum possible request rate. When performing the same tests in production, teams are likely to find that a system that was capable of handling, say, 1000 DB requests per second now delivers only 50 DB requests per second. In this case, consider the three likely culprits noted above:

    • Variable packet sizes, with an average larger than the configured MTU
    • Network latency, because the DB was on a dedicated node, separate from the application
    • A reduced sliding window, because all surrounding systems connect to that DB node. This causes I/O pressure on filesystem requests which, in turn, makes the DB unable to process requests fast enough.

    Here, tests revealed that the sliding window algorithm caused data transfers (for background, this is what remote storage over NFS uses) with 4 KB average payload sizes to plummet from 12 MB/s on a fast 1 Gbps link (8 ms latency) gradually down to ~15 Kbps when the link latency hit 180 ms. With a small sliding window, the sender may have to wait for an ACK packet for each request before sending the subsequent packet, so every additional millisecond of round-trip latency directly reduces throughput. This situation can be tested locally using the tc application on a Linux system, where teams can manually force a latency on a per-device basis. For example, the interface can be set to respond with a latency of 150 ms, as sketched below.
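
    As a minimal sketch of such a test (assuming the relevant interface is eth0 and the iproute2 tc tool is installed), the artificial delay can be added and removed as follows:

    # Inject 150 ms of artificial delay on all traffic leaving eth0
    sudo tc qdisc add dev eth0 root netem delay 150ms

    # Confirm that the netem queueing discipline is active
    tc qdisc show dev eth0

    # Remove the artificial delay once the test is done
    sudo tc qdisc del dev eth0 root netem

    Re-running the same DB or NFS throughput test with the delay in place reproduces the latency-driven drop in request rate without touching the production network.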

    On Kubernetes, for example, you may deploy an application consisting of a front-end, an app-server, and a back-end DB. If the app-server and the back-end DB happen to be deployed on different nodes, then even if these nodes are fast, they will induce network delays for every packet sent from the app server to the DB and back. These delays add up and will significantly degrade performance!
    Below is a real-world example that shows application response times. The database is deployed on node2.

    adm@k8s-jm-master:~/k8s/apmia$ k get pods -n cemperf -o wide
    NAME                               READY   STATUS      RESTARTS       AGE    IP                NODE         
    cemperf-app-7d669c76c-7qqsg        1/1     Running     2 (147d ago)   300d   192.168.255.136   k8s-jm-master
    cemperf-db-9774fb7d9-cvp8d         1/1     Running     0              25m    192.168.21.197    k8s-jm-node2

    Here, the application execution requires a large number of DB requests, which dramatically impacts the average response times of the affected components: in this case, an average response time of 3.9 seconds.

    [Figure 1: Average application response times with the database deployed on node2]

    Note: For this test, the graph_dyn function queries the database for 80k data rows where each row provides metadata to build various graph points.

    By moving the database to the same machine hosting the application component, the average response time improved to 0.998 seconds!

    jmertin@k8s-jm-master:~/k8s/apmia$ k get pods -n cemperf -o wide
    NAME                               READY   STATUS      RESTARTS       AGE    IP                NODE         
    cemperf-app-7d669c76c-7qqsg        1/1     Running     2 (147d ago)   300d   192.168.255.136   k8s-jm-master
    cemperf-db-cbd45f9f6-frdwf         1/1     Running     0              7s     192.168.255.168   k8s-jm-master

    [Figure 2: Average application response times with the database deployed on the same node as the application]

    This enormous performance improvement was achieved by reducing network-induced round-trip latencies, that is, by moving the database container to the same node.
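
    One simple way to achieve this kind of co-location (a sketch; only the node name and the standard kubernetes.io/hostname label come from the setup above, the rest is an assumed deployment layout) is to add a nodeSelector to the database deployment so the scheduler places it on the node that already runs the application:

    # Hypothetical excerpt from the cemperf-db Deployment spec
    spec:
      template:
        spec:
          nodeSelector:
            kubernetes.io/hostname: k8s-jm-master   # the node running cemperf-app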

    Another common factor that impacts Kubernetes performance is found in ESX environments where the cluster hosts are oversubscribed in terms of network I/O and the Kubernetes hosts are deployed as VM images. The Kubernetes node will not see that the CPU, memory, disk I/O, and/or remote storage attached to it is already under load pressure. The result is that the application response times experienced by users will suffer.

    How to optimize application performance

    First of all, make sure the application design takes into account caching (requests, DB requests, and so on) and does not perform unnecessary requests.

    Second, when running Kubernetes/OpenShift on top of ESX, make sure over-subscription is not enabled for any of the available CPU and memory. Note that small delays stack up quickly and cause elastic behavior in application response times. Moreover, if the ESX host is already under pressure, the images running on that host will not be made aware of the issue, and the Kubernetes orchestrator may deploy additional pods on that node because resources (CPU, memory, storage) are apparently still available. So, if you run Kubernetes/OpenShift and the like and can run it on bare metal instead, do it! You will have direct, real visibility into the host.

    Third, on the Kubernetes side, reduce network-induced I/O, primarily through pod deployment. Pack the containers into the same pod so they are deployed on the same node, as sketched below. This will drastically reduce the network-induced I/O delays between the app-server and the back-end DB.
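
    A minimal sketch of this packing, with hypothetical names and images, is a single pod carrying both containers; traffic between them then stays on localhost instead of crossing the node network:

    apiVersion: v1
    kind: Pod
    metadata:
      name: app-with-db                              # hypothetical pod name
    spec:
      containers:
      - name: app-server
        image: registry.example.com/app-server:1.0   # assumed image
        ports:
        - containerPort: 8080
      - name: backend-db
        image: registry.example.com/backend-db:1.0   # assumed image
        ports:
        - containerPort: 5432

    If the components must stay in separate pods (for example, to scale them independently), pod affinity rules can still be used to keep them on the same node.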

    Finally, deploy the containers that require fast DB access on nodes that have fast storage, taking high availability and whatever load balancing is in place into account. If NFS storage is not fast enough, consider creating a PV on a specific node that has a local RAID 10 disk (see the sketch below). Note that in this case, redundancy will have to be handled manually, because the Kubernetes orchestrator will not be able to compensate if that specific node's hardware fails.
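
    A minimal sketch of such a local PersistentVolume (the name, path, capacity, and storage class are assumptions; the node affinity is what pins the volume, and therefore the DB pod, to the node owning the RAID 10 disk) could look like this:

    apiVersion: v1
    kind: PersistentVolume
    metadata:
      name: db-local-pv                  # hypothetical name
    spec:
      capacity:
        storage: 100Gi                   # assumed size
      accessModes:
      - ReadWriteOnce
      persistentVolumeReclaimPolicy: Retain
      storageClassName: local-raid10     # assumed storage class
      local:
        path: /mnt/raid10/db             # assumed mount point of the local RAID 10 array
      nodeAffinity:
        required:
          nodeSelectorTerms:
          - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values:
              - k8s-jm-node2             # node that owns the RAID 10 disk (assumed)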

    How to monitor latencies between components

    For Kubernetes, OpenShift, and Docker (Swarm), the Universal Monitoring Agent (UMA), used within AIOps and Observability solutions from Broadcom, can monitor the entire cluster interaction. This agent can extract all relevant metrics on the cluster. If UMA detects a running technology it can monitor, such as Java, .NET, PHP, or Node.js, it will automatically inject a probe and provide fine-grained application execution details on demand. These details include I/O requests from that component to the DB, remote calls, and so on, as application-internal metrics.

    Without additional configuration, when an application is misbehaving, the built-in anomaly detection of UMA initiates a transaction trace. UMA will collect fine grained, detailed metrics from agents linked to components for the specific application execution path. These capabilities generate a detailed view of the application execution down to the line number in the code snippet causing the slow-down.


    Note: Administrators can choose to set up full, end-to-end monitoring, which provides execution details that include front-end (browser) performance.

    Here, the full end-to-end metric details will be collected so the admin team can analyze what went wrong and take corrective actions as required:

     User front-end => Web-Server front-end => app-server => databases | remote services

    One advantage of this monitoring solution is that it will show the average execution times of the components within the cluster. Over time, the solution will assume bad average times to be normal in an oversubscribed ESX environment. Still, the operator can assess the real values and compare them to values from other applications running on the same node.

    Jörg Mertin

    Jörg Mertin, a Master Solution Engineer on the AIOps and Observability team, is a self-learner and technology enthusiast. A testament to this is his early adopter work to learn and evangelize Linux in the early 1990s. Whether addressing coordinating monitoring approaches for full-fledged cloud deployments or a...
