by: Shamim Ahmed
I often hear from customers who complain about how “classic” performance testing (i.e., end-to-end testing with a high volume of virtual users) of their applications before release slows down the cycle time by several weeks. In addition, this testing consumes significant people and infrastructure (hardware and software license) resources.
This is especially true for retailers and other e-commerce providers, who typically do this type of testing before a major product/service launch, as well as before key shopping seasons. They ask how they can reduce or even eliminate the testing bottleneck and instead be “peak performance ready” all the time, so that they can release software updates without delay while not risking issues in production.
We have seen how Continuous Testing (CT) addresses the bottleneck problem from a functional testing perspective. When applications are architected using a component-based approach such as micro-services, it is possible to effectively implement CT for performance testing as well. The key to enabling Continuous Performance Testing (CPT) for micro-services-based applications is being able to test and scale each component in isolation. This is especially applicable for modern cloud-native applications.
Most literature on micro-services testing — for example, this canonical approach from Martin Fowler — seems to focus primarily on continuous functional testing, and considerably less on continuous performance testing. In this blog, I will describe the full lifecycle of implementing end-to-end continuous performance testing for micro-services.
Note that it is possible to implement similar continuous performance testing for monolithic applications as well, but applying it there is less elegant and more onerous. It will be the subject of a subsequent blog.
Continuous Performance Testing (CPT) derives from the principle of “Continuous Everything” in DevOps. It is a subset of Continuous Testing (CT), where performance testing needs to happen across the different phases of the Continuous Integration/Delivery (CI/CD) lifecycle as opposed to a single “performance testing phase.” CPT is a key enabler for Continuous Delivery (CD), as we will discuss in the next sections.
As mentioned before, the key to enabling CPT for micro-services-based applications is being able to specify performance requirements at the component level and to test and scale each component in isolation. This allows us to run frequent, shorter-duration performance tests, as well as test for scalability using smaller-scale test profiles – which is exactly what CPT needs.
For example, if we have an application that comprises Services A, B, C, and D (see Figure below), each with its own service level objective (or SLO, see the Requirements practice below), we can test each component quickly in isolation, in addition to API-based x-service system tests, and end-user journey tests.
Performance testing components in isolation gives us the ability to test early and often using smaller time slots, without having to rely on long-duration traditional performance tests. If the component tests do not pass, the higher-order tests will not pass either, so there is no need to run them. Thus, we save time and resources, which is critical from a CD perspective.
In addition, we combine this with the key practice of change impact testing, where we test only those components (and higher-order transactions and user journeys) that have been impacted by a change. This further reduces the time and effort spent on performance testing, which is key to CD.
We will expand on each step of the CPT lifecycle later in this blog. But first, let’s discuss the key practices required to support CPT.
Good testing starts with good requirements. This principle applies to performance testing as well; however, we often see performance requirements that are not as rigorously specified as functional testing requirements. Performance requirements are considered part of the non-functional requirements (NFR) of an application/system.
There are different performance testing requirements that can be captured at different levels of system/application granularity. In general, there are three major types of performance requirements: performance constraints attached to functional requirements (such as user stories and features), performance requirements for transactions and customer journeys, and service level objectives (SLOs) defined for individual service components.
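As an illustration of the component-level type, an SLO for Service A from the earlier figure might be captured in a simple, tool-agnostic form such as the sketch below (the format, metric names, and thresholds are hypothetical and would normally live in whatever SLO tooling you use):

# Hypothetical, tool-agnostic SLO definition for a single service component
service: service-a
slos:
  - name: latency
    objective: "95% of requests complete in under 250 ms"   # illustrative threshold
    indicator: p95_response_time_ms
    window: 30d
  - name: availability
    objective: "99.9% of requests succeed"
    indicator: success_rate
    window: 30d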
As with most activities in the CT lifecycle, CPT also requires that most of the performance testing work such as test specification, design, generation, and execution “shifts left,” so it mostly happens during the CI part of the DevOps lifecycle. This minimizes the delay that CPT processes can cause during the CD part of the lifecycle, where minimizing the Lead Time for Change is of utmost importance (see figure below). This is enabled by adherence to the testing pyramid, as seen in the next section.
A summary of the key performance testing activities performed during the CI and CD process is shown in the figure below. These activities are mapped to the CPT lifecycle phases that we describe later.
Shift-left of testing is supported by the use of the test pyramid. We generally see the test pyramid used in the context of functional testing, but it applies just as well to performance testing of component-based applications. See a figure of the performance testing pyramid below.
This means that most of the (shorter-duration) performance testing needs to be conducted rigorously at the unit/component/API levels (as part of the Continuous Integration process), with fewer, longer-duration tests executed at the system and e2e levels (as part of the Continuous Delivery process), and then only as needed. This helps ensure less slowdown on the “Lead Time for Change” metric, which is indicative of the velocity of change deployment. It is quite a departure from the classic performance testing approach, where most of the long-duration testing is done at the system or e2e level before release.
As we have mentioned before, the key to reducing testing elapsed time in the context of the CD process within CPT is to reduce the number of tests that need to be done. A key way to do this is to leverage change impact testing – i.e., focus testing principally on those parts of the application that have changed. This is an important part of the Design phase of the CI/CD lifecycle.
There are various types of change impact techniques that we may leverage. We will describe two that we recommend here.
We may in fact extend the benefit of this technique to the lower-order tests (marked with a broken outline); however, those are better served by the “inside-out” approach described earlier.
Service virtualization (SV) is a key enabler for all types of testing, and especially so for CPT of component-based applications. SV allows us not only to virtualize dependencies between components but, more importantly, to correctly emulate the performance of the dependent components, thereby allowing proper testing of the component under test. Virtual services also allow us to easily and quickly set up and configure the ephemeral test environments that are needed for CPT.
While using SV in the context of CPT, we need to leverage the corresponding techniques around “Continuous” Service Virtualization. Please see my blog on that subject here.
Test Data Management (TDM) is another key enabler for all types of testing, and especially so for CPT. TDM allows us to automate the creation and provisioning of data – which may sometimes be quite voluminous – for performance testing activities. It is a key capability that allows us to easily set up and configure the ephemeral test environments needed for CPT.
While using TDM in the context of CPT, we need to leverage the corresponding techniques around “Continuous” Test Data Management. Please see my blog on that subject here.
We recommend use of both Continuous SV and Continuous TDM together to support the needs of CPT, since SV helps to reduce the burden of the more onerous TDM activities.
Monitoring is a key component of all performance testing, and especially so for CPT. This is not only required for understanding performance bottlenecks (and follow-up tuning activities), but the high frequency and volume of tests — and resulting data — during CPT require the use of data analytics and alerting capabilities. For example, data analytics is required to establish performance baselines, report regressions, and generate alerts for other anomalous behaviors.
For component-based applications, we also need specialized tools for monitoring the containers that host application and test assets.
In fact, we recommend the use of Continuous Observability solutions to not only analyze performance data, but also provide proactive insights around problem detection and remediation. This is especially important for debugging issues like tail latencies and other system issues unrelated to application components.
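As a sketch of what such analytics-driven alerting might look like, assuming a Prometheus-style monitoring stack that records a request-duration histogram (the metric, label, and threshold below are hypothetical), a tail-latency regression alert could be defined as:

# Hypothetical Prometheus alerting rule flagging a p99 latency regression
groups:
  - name: cpt-performance-alerts
    rules:
      - alert: SearchP99LatencyRegression
        expr: 'histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{service="search"}[5m])) by (le)) > 0.5'
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "p99 latency for the search service has exceeded its 500 ms objective for 10 minutes"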
CPT requires tests to be run frequently in response to change events (e.g., code updates, builds, deployments, etc.) with a variety of accompanying test assets, such as test scripts, test configurations, test data, and virtual services, in dedicated test environments. It is practically impossible to manage all of this manually in the context of the CI/CD lifecycle.
It is therefore key that all CPT activities be integrated with a CI/CD orchestration engine that triggers the provisioning of environments, deployment of application components and test assets, execution of tests, capture and communication of test results, and cleanup after completion. For component-based applications that are deployed in containers, we may package test assets (including test data) in sidecar deployment containers and deploy them alongside application containers.
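As a minimal sketch of such orchestration, assuming a GitLab-CI-style pipeline and a runner with the Taurus CLI (bzt) available (the job names, paths, and helper scripts are hypothetical), baseline and scaled performance tests could be triggered per component like this:

# Hypothetical pipeline fragment; build and deploy jobs omitted for brevity
stages:
  - build
  - baseline-perf
  - scaled-perf

baseline_perf_search:
  stage: baseline-perf
  script:
    - bzt tests/baseline/search-api.yml        # short, small-scale API baseline test
  artifacts:
    paths:
      - results/
  rules:
    - changes:
        - services/search/**/*                 # change impact: run only when this component changed

scaled_perf_search:
  stage: scaled-perf
  script:
    - ./scripts/provision-ephemeral-env.sh     # hypothetical helper: environment, virtual services, test data
    - bzt tests/scaled/search-slo.yml
    - ./scripts/teardown-ephemeral-env.sh
  rules:
    - changes:
        - services/search/**/*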
The following figure summarizes the different activities in a typical CPT process across the different stages of the CI/CD pipeline along with typical personas who perform them.
These activities are summarized below:
CPT starts with well-defined performance requirements. As we discussed earlier, these can be either performance constraints attached to functional requirements (such as user stories and features), performance requirements for transactions or customer journeys (typically defined by the Product Owner), or SLOs defined for service components (typically defined by Site Reliability Engineers).
The CPT activities at this stage support the needs of Development and subsequent CI/CD phases. This includes:
During development, developers/SDETs conduct unit and component level performance testing of the code/component being built (or updated). Some of the key practices include:
This is an important stage where SDETs and testers create or update the performance test scenarios (and accompanying test assets such as virtual services and test data), and define/update the test asset packaging and deployment configurations for the rest of the CI/CD lifecycle based on change impact analysis as described earlier.
There are various approaches to defining performance tests for acceptance, integration, system and e2e test scenarios.
For performance requirements attached to functional requirements, we may use Behavior Driven Development (BDD) to define performance acceptance cases in Gherkin format. For example, the baseline acceptance test for an API may be as follows:
Scenario: API Tests with small load
Given API Query http://dbankdemo.com/bank
And 100 concurrent users
And the test executes for 10 minutes
And has ramp time of 5 minutes
When Response time is less than 1ms
Then Response is Good
The Gherkin feature file can then be translated into YAML that may be executed using tools like JMeter.
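For illustration, the scenario above could map to a Taurus-style YAML configuration (Taurus drives JMeter from YAML); the sketch below simply mirrors the values in the Gherkin:

# Sketch of a Taurus config mirroring the Gherkin scenario above
execution:
  - concurrency: 100                  # "100 concurrent users"
    ramp-up: 5m                       # "ramp time of 5 minutes"
    hold-for: 10m                     # "executes for 10 minutes"
    scenario: api-baseline

scenarios:
  api-baseline:
    requests:
      - url: http://dbankdemo.com/bank
        method: GET

reporting:
  - module: passfail
    criteria:
      - avg-rt>1ms, stop as failed    # mirrors "Response time is less than 1ms"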
For System and E2E test scenarios, our recommendation is to define those using model-based testing tools like ARD. This allows us to conduct automated change impact analysis and precise optimization, and to generate performance test scripts and data directly from the model that can be executed with tools like JMeter.
As part of every build, we recommend running baseline performance tests on selected components based on the change impact analysis as described earlier. These are short duration, limited scale (using a small number of virtual users) performance tests on a single instance of the component at the API level, to establish a baseline, assess build-over-build regression in performance, and provide fast feedback to the development team.
A significant regression may be used as a criterion to fail the build. Tests on multiple impacted APIs can be run in parallel, each in its own dedicated environment. Tools such as JMeter and BlazeMeter may be used for such tests.
The profile of such a test would depend on performance requirements of the component. For example, for the Search component that we discussed earlier, we could set the test profile as follows:
Throughput: 24K TPS
Number of virtual users: 100
Wait (think) time: 0.5 sec
Duration: 10 mins
Warm-up time: about 15 sec
Key measures: throughput (TPS), response time (P95/P99), CPU/memory usage, etc.
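As a sketch, this profile (including pass/fail criteria of the kind that could gate the build, as noted above) might be expressed in a Taurus-style YAML config; the endpoint and thresholds below are illustrative assumptions:

# Sketch of the baseline test profile for the Search component
execution:
  - concurrency: 100                  # number of virtual users
    throughput: 24000                 # cap the request arrival rate at ~24K requests/sec
    ramp-up: 15s                      # warm-up time
    hold-for: 10m                     # test duration
    scenario: search-baseline

scenarios:
  search-baseline:
    think-time: 500ms                 # wait time between requests
    requests:
      - http://search.example.internal/api/v1/search?q=test   # hypothetical endpoint

reporting:
  - module: passfail
    criteria:
      - p95>500ms for 1m, stop as failed     # illustrative regression gate on P95 response time
      - p99>1000ms for 1m, stop as failed    # illustrative gate on tail latency
# Throughput is reported by the tool; CPU/memory usage comes from the monitoring stack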
See the figure below for examples of outputs from such tests that show baselines and trend charts.
If a component has dependencies on other components, we recommend using virtual services to stand in for the dependent components so that these tests can be spun up and executed in a lightweight manner within limited time and environment resources.
A variety of test assets (such as test scripts, test data, virtual services, and test configurations) are created in the previous steps and need to be packaged for deployment to the appropriate downstream environments for running different types of tests. As mentioned above, for component-based applications — where micro-services are typically deployed as containers — we recommend packaging these test assets as accompanying sidecar containers. The sidecar containers are defined appropriately for the type of tests that need to be run in specific CI/CD environments. This is an important aspect of being able to automate the orchestration of tests in the pipeline.
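As a sketch, assuming the component runs on Kubernetes (image names, labels, and ports below are hypothetical), a pod with a test-asset sidecar might look like this:

# Hypothetical pod spec packaging test assets as a sidecar next to the component
apiVersion: v1
kind: Pod
metadata:
  name: search-service-perf
  labels:
    app: search-service
spec:
  containers:
    - name: search-service                               # application component under test
      image: registry.example.com/search-service:1.4.2
      ports:
        - containerPort: 8080
    - name: perf-test-assets                             # sidecar bundling test scripts, test data, and virtual services
      image: registry.example.com/search-perf-assets:1.4.2
      ports:
        - containerPort: 9090                            # e.g., a virtual service endpoint for dependencies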
Scaled component tests are conducted on isolated impacted components based on change impact analysis to test for SLO conformance and auto-scaling.
This is a single-component, isolated performance test that is typically run at 20% to 30% of the target production load in a dedicated test environment. We may run this at higher loads, but that takes more time (thereby increasing the Lead Time for Change) and consumes more test environment resources, so run at the highest load possible within the maximum time allotted for the test. Typically, scaled component tests should be limited to no more than 30 minutes in order to minimize delays to the CD pipeline.
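Where the component runs on Kubernetes, the auto-scaling behavior being validated is typically defined by a HorizontalPodAutoscaler such as the following sketch (the Deployment name, replica bounds, and CPU target are assumptions for illustration):

# Hypothetical auto-scaling policy exercised by the scaled component test
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: search-service-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: search-service
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70      # scale out when average CPU exceeds 70%

The test then verifies both that the component scales out as load ramps up and that it stays within its SLO while doing so.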
The typical process for running such a test is shown in the figure below.
After the SLO validation tests are completed, the results are reported and the CD pipeline progresses. However, we recommend running spike and soak tests over a longer duration, often greater than a day, without holding up the CD pipeline. These tests often help catch creeping regressions and other reliability problems that may not be caught by limited-duration tests.
Another key item to monitor during these tests is tail latency, which is typically not detected in the baseline tests described above. We need to closely monitor P99 performance.
Scaled component tests should leverage service virtualization to isolate dependencies on dependent components. Such virtual services must be configured with response times that conform to their SLOs. See more on this in the section on the use of service virtualization.
To minimize test data provisioning time, these tests need to use hybrid test data – i.e., a mix of mostly synthetic and some production-like test data (typically 70:30 ratio).
Although distributed load generators can be used to account for network overheads, use of appropriate network virtualization significantly simplifies the provisioning of environments for such tests.
Scaled system tests are API-level transaction tests (based on change impact analysis) across multiple components to test for transaction SLO conformance and auto-scaling. These tests should be run after functional x-service contract tests have passed.
A transaction involves a sequence of service invocations (using the service APIs) in a chain (see figure below) where the transaction invokes Service A, followed by B and C, etc. These tests help expose communication latencies and other performance characteristics over and above individual component performance. Distributed load generators should be used to account for network overheads (or with suitable network virtualization).
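As a sketch (Taurus-style YAML driving JMeter, with hypothetical endpoints and payloads), such a chained transaction test might look like the following, where the order ID returned by Service A is passed along to Services B and C:

# Sketch of an API-level transaction test across a chain of services
execution:
  - concurrency: 50
    ramp-up: 2m
    hold-for: 15m
    scenario: order-transaction

scenarios:
  order-transaction:
    requests:
      - url: https://api.example.com/service-a/orders          # Service A: create the order
        method: POST
        headers:
          Content-Type: application/json
        body: '{"sku": "ABC-123", "qty": 1}'
        extract-jsonpath:
          order_id:
            jsonpath: $.id                                      # capture the order ID for downstream calls
      - url: https://api.example.com/service-b/payments         # Service B: pay for the order
        method: POST
        headers:
          Content-Type: application/json
        body: '{"orderId": "${order_id}"}'
      - url: https://api.example.com/service-c/orders/${order_id}/status   # Service C: check order status
        method: GET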
Since such tests take more time and resources, we recommend that they be run only periodically, and then only for critical transactions that have been impacted by a change.
Such tests also need to be run with real components that have been impacted by the change, but use virtual services for dependent components that have not.
Also, to minimize test data provisioning time, these tests need to use hybrid test data – i.e., a balanced mix of synthetic and production-like test data (typically a 50:50 ratio).
The typical process for running such a test is shown in the figure below. Depending on the cycle time available in the CD pipeline, we need to limit the load level.
System tests are probably the most challenging performance tests in the context of CPT, since they cross component boundaries. At this stage of the CD pipeline, we should be confident that the individual components that have been impacted are well tested and scale correctly. However, additional latencies may creep in from other system components, such as the network, message buses, shared databases, and other cloud infrastructure, as well as from the aggregation of tail latencies across multiple components; these typically originate outside of the application components themselves.
For this reason, we recommend that other system components also be performance tested individually using the CPT methodology described here. It is easier to do so for systems that use infrastructure-as-code, since changes to such systems can be detected (and tested) more easily. Leveraging Site Reliability Engineering techniques is ideal for addressing such problems.
In the pre-prod environments, we recommend running scaled e2e user journey tests for selected journeys based on change impact analysis. These tests measure customer experience as perceived by the user.
Since these tests typically take more time and resources to run, we recommend that they be run sparingly and selectively in the context of CPT; for example, when multiple critical transactions have been impacted, or major configuration updates have been made.
Such tests are typically run with real service instances (virtual services may be used to stand in for dependent components if they are not part of the critical path), realistic test data, a realistic mix of user actions typically derived from production usage, and real network components with distributed load generators. For example, an e-commerce site has a mix of varied user transactions such as login, search, checkout, etc., each with different loading patterns.
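For example, assuming a Taurus-style YAML config and hypothetical endpoints, a production-derived mix in which most virtual users browse and search, fewer log in, and only a small fraction check out could be modeled with separate executions per scenario:

# Sketch of a mixed e2e workload with different loading patterns per user action
execution:
  - concurrency: 300                 # ~60% of users browse and search
    hold-for: 30m
    scenario: browse-and-search
  - concurrency: 150                 # ~30% of users log in
    hold-for: 30m
    scenario: login
  - concurrency: 50                  # ~10% of users check out
    hold-for: 30m
    scenario: checkout

scenarios:
  browse-and-search:
    requests:
      - https://shop.example.com/search?q=shoes
  login:
    requests:
      - url: https://shop.example.com/login
        method: POST
  checkout:
    requests:
      - url: https://shop.example.com/checkout
        method: POST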
As in the system testing described above, it is vital to closely monitor the other system components, even more so than the application components, during these tests.
Performance canary tests are similar to functional canary tests, except that they validate system performance using a limited set of users. Such tests are a great way to validate performance scenarios that are difficult or time-consuming to run in pre-production environments, and provide valuable feedback before the application changes are rolled out to a wider body of users.
Canary environments can also be used for chaos tests and destructive experimental testing to understand the impact on application performance. Component-based applications lend themselves very well to controlled chaos experiments, since we can simulate failures at various levels of granularity, enabling us to understand and resolve problems faster. Some organizations even use virtual services in chaos environments to easily simulate failure conditions.
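For example, a controlled pod-failure experiment against a single component could be described with a Chaos-Mesh-style manifest like the sketch below (assuming the component runs on Kubernetes; the namespace and labels are hypothetical), letting us observe how performance degrades and recovers at that level of granularity:

# Hypothetical Chaos Mesh experiment: take down one pod of a service for 60 seconds
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: search-pod-failure
  namespace: perf-canary
spec:
  action: pod-failure
  mode: one                          # affect a single randomly chosen pod
  duration: "60s"
  selector:
    namespaces:
      - perf-canary
    labelSelectors:
      app: search-service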
As we have mentioned before, one of the key requirements of CPT is being able to orchestrate all of the key testing processes and steps described above in an automated manner. For component-based applications, this is a complex problem to manage for thousands of changes occurring across hundreds of components and their corresponding deployment pipelines.
For component-based applications, production monitoring additionally involves tracking the SLIs, SLOs, and SLAs at the component, transaction, and business service levels.
This blog has provided an overall approach for Continuous Performance Testing practices for component-based applications. As you can tell, component-based applications are particularly well suited for this approach.
Testing, however, is an activity, while quality is the real outcome that we desire. The goal of CPT is to help ensure that quality outcomes such as reliability are continuously assured. We call this Continuous Reliability (CR). My next blog will focus on CR concepts and how we can achieve CR.