Broadcom Software Academy Blog

Cloud Application Performance: Common Reasons for Slow-Downs

Written by Jörg Mertin | Nov 22, 2024 7:32:29 PM
Key Takeaways
  • Gain an understanding of common factors that can cause cloud application performance issues.
  • Discover ways to boost performance by reducing network-induced round-trip latencies.
  • Employ the Universal Monitoring Agent (UMA) within AIOps and Observability solutions from Broadcom to monitor entire cluster interactions.

Worklog entry

It often happens that an application performs well when running on bare metal. However, after the application is packaged into an image, tested on Docker/Podman, and then migrated to Kubernetes, performance plummets. As the number of database calls issued by the application grows, its response times increase.

This common situation is often due to I/O delays induced by data transfer.

Note: I/O stands for input/output and can be applied to disk read/write speeds, network transmission/reception, and other areas.

What experience has shown

Imagine running this same application on bare metal again, but in this scenario the database has its storage created on a remote Network File System (NFS) device.

This means we need to account for these factors:

  1. The number of applications that access the storage endpoint (assuming the endpoint itself is fast by design).
    The more clients share the endpoint, the more storage I/O performance limits overall response times.
  2. The way the application has been programmed to retrieve data from the database.
    Does it cache the data, or does it simply issue another request? Are the requests tiny or rather large?
  3. The latency induced by network access to the storage, and the overall speed of the storage itself (a quick way to measure this is sketched below).
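
A quick way to get a feel for the third point is to probe the storage path directly. The sketch below uses the ioping utility (if it is installed); the mount point is purely illustrative:

$ # Measure I/O request latency of the filesystem backing the database (path is an assumption)
$ ioping -c 10 /mnt/nfs-db-storage

Running the same probe against a local disk and against the NFS mount gives a first estimate of how much of the storage latency is network-induced.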

When accessing a network device, several major factors impact data transfer speeds:

  • Available link speed
  • Configured maximum transmission unit (MTU)/maximum segment size (MSS)
  • The size of the transferred payload
  • TCP window settings (sliding window)
  • Average round-trip time (RTT) latency from A to B, in ms
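
As a quick aside (not part of the original worklog), the first four factors can be inspected directly on a Linux host; the interface name eth0 is an assumption, and the RTT itself is measured with ping, as shown further below:

$ sudo ethtool eth0 | grep -i speed      # available link speed
$ ip link show eth0                      # configured MTU of the interface
$ sysctl net.ipv4.tcp_window_scaling     # is TCP window scaling enabled?
$ ss -ti                                 # per-connection RTT, cwnd, and window details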

In a data center setup, link speed is usually not an issue, since all hosts are connected to the same switch at 1 Gbps or more, sometimes 10 Gbps, and the MTU/MSS will be optimal for the configured link. What remains are the size of the transferred payload, the TCP sliding window, and the average RTT. Of these, the payload size and the link latency and quality have the greatest impact, because they in turn influence the sliding window.
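
To see why the RTT dominates, recall that a single TCP connection cannot move more than roughly one window of data per round trip, so throughput is bounded by window size divided by RTT. The back-of-the-envelope numbers below assume a 64 KB window and are purely illustrative:

$ # Throughput bound in MB/s = window (bytes) / RTT (seconds) / 1,000,000
$ echo "scale=2; (64 * 1024) / 0.008 / 1000000" | bc    # 8 ms RTT
8.19
$ echo "scale=2; (64 * 1024) / 0.180 / 1000000" | bc    # 180 ms RTT
.36

With the same window, raising the RTT from 8 ms to 180 ms cuts the achievable throughput by more than a factor of 20, before packet loss shrinks the window any further.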

Here’s an example of the inter-node latencies in a current Kubernetes setup:

adm@k8s-jm-master:~$ ping k8s-jm-master
PING k8s-jm-master (127.0.1.1) 56(84) bytes of data.
64 bytes from k8s-jm-master (127.0.1.1): icmp_seq=1 ttl=64 time=0.062 ms
64 bytes from k8s-jm-master (127.0.1.1): icmp_seq=2 ttl=64 time=0.056 ms
64 bytes from k8s-jm-master (127.0.1.1): icmp_seq=3 ttl=64 time=0.045 ms

--- k8s-jm-master ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2041ms
rtt min/avg/max/mdev = 0.045/0.054/0.062/0.007 ms

adm@k8s-jm-master:~$ ping k8s-jm-node1
PING k8s-jm-node1.aiops-jm.broadcom.net (10.252.213.5) 56(84) bytes of data.
64 bytes from k8s-jm-node1.aiops-jm.broadcom.net (10.252.213.5): icmp_seq=1 ttl=64 time=0.370 ms
64 bytes from k8s-jm-node1.aiops-jm.broadcom.net (10.252.213.5): icmp_seq=2 ttl=64 time=0.347 ms
64 bytes from k8s-jm-node1.aiops-jm.broadcom.net (10.252.213.5): icmp_seq=3 ttl=64 time=0.469 ms

--- k8s-jm-node1.aiops-jm.broadcom.net ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2041ms
rtt min/avg/max/mdev = 0.347/0.395/0.469/0.052 ms

adm@k8s-jm-master:~$ ping k8s-jm-node2
PING k8s-jm-node2.aiops-jm.broadcom.net (10.252.213.6) 56(84) bytes of data.
64 bytes from k8s-jm-node2.aiops-jm.broadcom.net (10.252.213.6): icmp_seq=1 ttl=64 time=0.470 ms
64 bytes from k8s-jm-node2.aiops-jm.broadcom.net (10.252.213.6): icmp_seq=2 ttl=64 time=0.407 ms
64 bytes from k8s-jm-node2.aiops-jm.broadcom.net (10.252.213.6): icmp_seq=3 ttl=64 time=0.451 ms

--- k8s-jm-node2.aiops-jm.broadcom.net ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2038ms
rtt min/avg/max/mdev = 0.407/0.442/0.470/0.026 ms

Note: When the payload size of the ping packet is increased beyond the current MTU, response times rise. In the example below, node1 is pinged with a payload of roughly 64 KB, an extreme case that clearly illustrates the latency increase.

adm@k8s-jm-master:~$ ping -s 65500 k8s-jm-node1
PING k8s-jm-node1.aiops-jm.broadcom.net (10.252.213.5) 65500(65528) bytes of data.
65508 bytes from k8s-jm-node1.aiops-jm.broadcom.net (10.252.213.5): icmp_seq=1 ttl=64 time=1.42 ms
65508 bytes from k8s-jm-node1.aiops-jm.broadcom.net (10.252.213.5): icmp_seq=2 ttl=64 time=1.43 ms
65508 bytes from k8s-jm-node1.aiops-jm.broadcom.net (10.252.213.5): icmp_seq=3 ttl=64 time=1.47 ms

--- k8s-jm-node1.aiops-jm.broadcom.net ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2003ms
rtt min/avg/max/mdev = 1.422/1.440/1.474/0.023 ms
adm@k8s-jm-master:~$ tracepath k8s-jm-node1
 1?: [LOCALHOST]                      pmtu 1500
 1:  k8s-jm-node1.aiops-jm.broadcom.net                    0.560ms reached
 1:  k8s-jm-node1.aiops-jm.broadcom.net                    0.359ms reached
     Resume: pmtu 1500 hops 1 back 1

Taking these three points together (size of the transferred payload, TCP sliding window, and average RTT): if a developer created an application without using built-in caching capabilities, database requests will stack up and cause I/O pressure. This omission routes all requests through the network link and increases the response times of the entire application.

On local storage, or in a low-latency (network and storage) environment, you can expect the application to perform well. This situation is typical when development teams test in a local environment. If we switch to network storage, network-induced latency will limit the maximum possible request rate. When performing the same tests in production, teams are likely to find that a system that was capable of handling, say, 1,000 DB requests per second now delivers only 50 DB requests per second. In this case, consider the three likely culprits noted above:

  • Variable packet sizes, with an average larger than the configured MTU
  • Network latency, because the DB was on a dedicated node, apart from the application
  • A reduced sliding window, because all surrounding systems connect to that DB node. This causes I/O pressure on filesystem requests, which in turn makes the DB unable to process requests fast enough.

Here, tests revealed that, under the sliding window algorithm, data transfers with an average payload size of 4 KB (for background, this is what remote storage over NFS uses) plummeted from 12 MBytes/s on a fast 1 Gbps link with 8 ms latency down to roughly 15 Kbps once the link latency hit 180 ms. A shrinking sliding window can force the sender to wait for an ACK for each packet before sending the next one, so every additional millisecond of round-trip latency directly throttles throughput. This situation can be tested locally using the tc utility on a Linux system, which lets teams force a latency on a per-device basis. For example, the interface can be set to respond with a latency of 150 ms, as shown below.
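
A minimal sketch of such a test, assuming the outbound interface is called eth0 (adjust to your environment); netem adds the artificial delay, and the last command removes it again:

$ sudo tc qdisc add dev eth0 root netem delay 150ms   # add 150 ms of latency to all egress traffic
$ ping -c 3 k8s-jm-node1                              # RTTs should now be roughly 150 ms higher
$ sudo tc qdisc del dev eth0 root                     # remove the artificial delay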

On Kubernetes, for example, you may want to deploy an application using a front-end, an app-server, and a back-end DB. If the app-server and back-end DB happen to be deployed on different nodes, then even if these nodes are fast, the network adds a delay to every packet sent from the app-server to the DB and back. These delays add up and will significantly degrade performance!
Below is a real-world example that shows application response times. The database is deployed on node2.

adm@k8s-jm-master:~/k8s/apmia$ k get pods -n cemperf -o wide
NAME                               READY   STATUS      RESTARTS       AGE    IP                NODE         
cemperf-app-7d669c76c-7qqsg        1/1     Running     2 (147d ago)   300d   192.168.255.136   k8s-jm-master
cemperf-db-9774fb7d9-cvp8d         1/1     Running     0              25m    192.168.21.197    k8s-jm-node2

Here, executing an application function that issues large numbers of DB requests dramatically impacts the average response times of the affected components; in this case, the average response time is 3.9 seconds.

Note: For this test, the graph_dyn function queries the database for 80k data rows where each row provides metadata to build various graph points.

By moving the database to the same machine hosting the application part, average response time improved to 0.998 seconds!

jmertin@k8s-jm-master:~/k8s/apmia$ k get pods -n cemperf -o wide
NAME                               READY   STATUS      RESTARTS       AGE    IP                NODE         
cemperf-app-7d669c76c-7qqsg        1/1     Running     2 (147d ago)   300d   192.168.255.136   k8s-jm-master
cemperf-db-cbd45f9f6-frdwf         1/1     Running     0              7s     192.168.255.168   k8s-jm-master

This enormous performance improvement was achieved by reducing network-induced round-trip latencies, that is, by moving the database container to the same node.

Another common factor that impacts Kubernetes performance is found in ESX environments where the cluster hosts are oversubscribed in terms of network I/O and the Kubernetes hosts are deployed as VM images. The Kubernetes node will not see that the CPU, memory, disk I/O, and/or remote storage behind it is already under load pressure. The result is that the application response times experienced by users will suffer.

How to optimize application performance

First of all, make sure the application design takes caching into account (requests, DB requests, and so on) and does not perform unnecessary requests.

Second, when running Kubernetes/OpenShift on top of ESX, make sure the available CPU and memory are not oversubscribed. Small delays stack up fast and cause elastic behavior in application response times. Moreover, if the ESX host is already under pressure, the images running on that host will not be made aware of the issue, and the Kubernetes orchestrator may deploy additional pods on that node because resources (CPU, memory, storage) apparently are still available. So, if you run Kubernetes/OpenShift and the like and you can run it on bare metal instead, do it! You will have direct and real visibility into the host.
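
One way to see the gap described above is to compare what Kubernetes believes is free on a node with what the applications actually experience. The commands below are generic kubectl calls and assume the metrics-server add-on is installed for kubectl top:

$ kubectl describe node k8s-jm-node2 | grep -A 8 "Allocated resources"   # what the scheduler sees
$ kubectl top node k8s-jm-node2                                          # current usage reported by metrics-server

Neither view includes pressure on the underlying ESX host, which is exactly why response times can degrade while the node still looks healthy.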

Third, on the Kubernetes side, reduce the network-induced I/O, primarily through the pod deployment. Pack the containers into the same pod so they are deployed on the same node, as sketched below. This will drastically reduce the network-induced I/O delays between the app-server and the back-end DB.
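
A minimal sketch of such a packing, reusing the cemperf naming from the example above; the pod name and image references are assumptions, not the actual deployment used in the worklog:

$ kubectl apply -n cemperf -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: cemperf-app-with-db                           # hypothetical pod holding both containers
spec:
  containers:
  - name: app
    image: registry.example.com/cemperf-app:latest    # placeholder image reference
  - name: db
    image: registry.example.com/cemperf-db:latest     # placeholder image reference
EOF

Containers in the same pod always land on the same node and share a network namespace, so app-to-DB traffic never has to leave the host.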

Finally, deploy the containers that require fast DB access on nodes that have fast storage, taking high availability and whatever load balancing (LB) is in place into account. If NFS storage is not fast enough, consider creating a PV on a specific node that has a local RAID 10 disk. Note that in this case, redundancy will have to be handled manually, because the Kubernetes orchestrator will not be able to compensate if that specific node's hardware fails.
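
One simple way to express this placement, using a node label and the cemperf-db deployment from the example above purely as an illustration (the label key and value are assumptions):

$ kubectl label node k8s-jm-node1 disktype=raid10     # mark the node that has the fast local RAID 10
$ kubectl patch deployment cemperf-db -n cemperf \
    -p '{"spec":{"template":{"spec":{"nodeSelector":{"disktype":"raid10"}}}}}'

From then on, the DB pods will only be scheduled onto nodes carrying that label, which is also why redundancy for that node becomes your responsibility.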

How to monitor latencies between components

For Kubernetes, OpenShift, and Docker (Swarm), the Universal Monitoring Agent (UMA) used within AIOps and Observability solutions from Broadcom can be used to monitor the entire cluster interaction. This agent can extract all relevant metrics on the cluster. If UMA detects a running technology it can monitor, such as Java, .NET, PHP, or Node.js, it will automatically inject a probe and provide fine-grained application execution details on demand. These details include I/O requests from that component to the DB, remote calls, and so on, exposed as application-internal metrics.

Without additional configuration, when an application is misbehaving, the built-in anomaly detection of UMA initiates a transaction trace. UMA will collect fine-grained, detailed metrics from the agents linked to the components on the specific application execution path. These capabilities generate a detailed view of the application execution, down to the line number of the code causing the slow-down.

Note: Administrators can choose to set up full, end-to-end monitoring, which provides execution details that include front-end (browser) performance.

Here, the full end-to-end metric details will be collected so the admin team can analyze what went wrong and take corrective actions as required:

 User front-end => Web-Server front-end => app-server => databases | remote services

One advantage of this monitoring solution is that it shows the average execution times of the components within the cluster. Over time, the solution will come to treat bad average times as normal in an oversubscribed ESX environment. Still, the operator can assess the real values and compare them to values from other applications running on the same node.