Continuous testing has emerged as a popular practice within DevOps, assisting teams in their quest to release high-quality software on demand. While test data management (TDM) practices are extremely important to ensure that testing is effective, various surveys have indicated that TDM issues are one of the leading causes of delays in testing and application delivery. That is because traditional TDM has relied on ETL (extract-transform-load) type activities to extract and mask subsets of data from production data stores. Consequently, if teams are to keep pace with the demands of continuous testing, they will need to modernize TDM practices and processes.
In this blog, we will describe some best practices for establishing continuous TDM as part of continuous testing.
What is Continuous Test Data Management?
“Continuous” TDM is derived from the principle of “continuous everything” in DevOps. Continuous TDM is a key component of continuous testing within the DevOps framework. With this approach, testing and the management of test data happen across the different phases of the CI/CD lifecycle, as opposed to a single “testing phase.” See figure below.
The continuous nature of testing means that TDM activities need to be embedded in all parts of the lifecycle (such as design, development, and all testing and deployment activities) along the Agile CI/CD pipeline. See an example in the figure below.
Because these tests run within short windows of time, the TDM activities (such as test data creation, provisioning, deployment, and so on) must also fit within these windows. This article provides an overview of continuous TDM in the context of DevOps and explains how it differs from traditional TDM.
Key Practices for Continuous TDM
Traditional TDM has typically relied on ETL (extract-transform-load) type activities to subset and mask data from production data stores (right side of figure below).
This approach relies on understanding the schema of the production data stores as well as the ability to gain direct access to data stores to extract a subset of the data. As the number of applications (and associated data stores) increases, this often becomes a time-consuming process, and contributes to significant delays in testing and release cycles.
This approach does not align with the needs of agile teams, who need fast access to updated data in frequent tests in the context of continuous testing described above. In order to address these needs, we need to establish new practices, including shifting TDM activities left, using synthetic test data, and integrating with model-based testing and service virtualization (the left side of the above figure). In addition, this requires new techniques, such as traffic sensing and data virtualization. These practices are described as follows.
1. Shift-Left TDM Activities
As with most activities in the continuous testing lifecycle, continuous TDM also requires that most of the TDM work (such as test data specification, design, and generation) “shifts left,” so it happens during the CI part of the DevOps lifecycle. This minimizes the delay that TDM processes can cause during the CD part of the lifecycle, helping to speed deployment cycle time. (See figure below.) We will describe what TDM activities happen in each stage of the lifecycle later in this blog.
2. Synthesize as Much Test Data as Possible
A key tenet of shifting TDM left is the need to synthesize as much of the test data as possible, as opposed to traditional approaches where we extract and mask a subset of data from production. Synthetic test data supports a more agile, and more developer and tester friendly approach to TDM, since it reduces dependency on production data and operations teams that control that data. It allows for fast, controlled generation of data suited for purpose, and is free from privacy and security concerns associated with personally identifiable information. This is especially important for shift-left tests, such as unit and component tests, in which we are building out new functionality (for which data may not exist in production), and developers and SDETs need access to small amounts of test data with more variety as quickly as possible.
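To make this concrete, here is a minimal sketch of synthetic data generation for unit and component tests. The record shape (`id`, `username`, `email`, `age`, `status`) and the function name `synth_customers` are illustrative assumptions, not part of any TDM product; the point is only that a seeded generator gives small, varied, repeatable data sets without touching production.

```python
import random
import string


def synth_customers(n, seed=0):
    """Return n synthetic customer records; the same seed yields the same data."""
    rng = random.Random(seed)  # private generator, so tests stay repeatable
    records = []
    for i in range(n):
        username = "".join(rng.choices(string.ascii_lowercase, k=8))
        records.append({
            "id": i + 1,
            "username": username,
            # The .test top-level domain is reserved, so these addresses
            # can never belong to a real person.
            "email": f"{username}@example.test",
            "age": rng.randint(18, 90),
            "status": rng.choice(["active", "suspended", "closed"]),
        })
    return records
```

Calling `synth_customers(3, seed=42)` in a test fixture returns the same three records on every run, which keeps failures reproducible while still allowing variety by changing the seed.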
For tests in the latter stages of the CD pipeline, such as end-to-end tests, it probably makes sense to have more realistic test data, often taking a subset from production. For other tests along the CI/CD pipeline, we recommend the use of hybrid test data with an emphasis on synthetic test data. See the next section on the test pyramid.
3. Adhere to the Test Pyramid for Test Data
The test pyramid is one of the key tenets of shifting testing left. This means that more and more TDM processes need to support tests in the lower half of the pyramid, for example unit, component, and integration tests. (See figure below.) This is a good thing, since test data is easier to create and provision in the lower tiers of the pyramid than it is in the higher tiers.
This also influences the type of TDM approach we use:
- For the vast majority of unit and component tests, which are typically run by developers, we need to use synthetic test data. Synthetic test data is easier and faster to generate, and can be used to provide extensive test data variety and the high degree of coverage required for such tests.
- For the vast majority of end-to-end, UAT, and business process tests, TDM is a significant challenge, since testing often occurs across multiple systems or applications. This means we need to provision synchronized test data across these systems. Typically, for such tests, we need more realistic production-like data. Such data is typically obtained by extracting a subset of data from production environments and masking it. This is a more expensive process, and typically offers less test data coverage. Therefore, we need to optimize the amount of such testing (and related TDM efforts).
- For most integration and system tests, we need to provide hybrid test data (that is, a mix of synthetic and production-like) with an intermediate level of test data coverage.
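The mapping in the list above can be captured as a simple lookup that a pipeline script might consult when deciding how to source data for a test suite. This is a sketch only: the tier names and the `strategy_for` helper are assumptions made for illustration, and the mapping expresses a default for "most" tests in a tier rather than a hard rule.

```python
from enum import Enum


class DataStrategy(Enum):
    SYNTHETIC = "synthetic"
    HYBRID = "hybrid"
    PRODUCTION_LIKE = "production-like"


# Default strategy per test type, following the pyramid tiers described above.
TIER_STRATEGY = {
    "unit": DataStrategy.SYNTHETIC,
    "component": DataStrategy.SYNTHETIC,
    "integration": DataStrategy.HYBRID,
    "system": DataStrategy.HYBRID,
    "end-to-end": DataStrategy.PRODUCTION_LIKE,
    "uat": DataStrategy.PRODUCTION_LIKE,
    "business-process": DataStrategy.PRODUCTION_LIKE,
}


def strategy_for(test_type):
    """Look up the default test data strategy for a test type (case-insensitive)."""
    key = test_type.strip().lower()
    if key not in TIER_STRATEGY:
        raise ValueError(f"unknown test type: {test_type!r}")
    return TIER_STRATEGY[key]
```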
4. Take a Model-Driven Approach for Test Data
We recommend model-based testing (MBT) as a key enabler for continuous testing. It makes sense therefore to integrate TDM with MBT. This requires us to specify test data requirements or constraints as part of the model itself. The figure below offers an example of how we do this using Broadcom Agile Requirements Designer. Other MBT tools provide similar support. This example shows how we set the test data for the “UserName” of a simple login model.
The test data rules embedded in the model could either be static (hardcoded or tied to data in a spreadsheet), formulaic (as in the example above) for synthetic generation, or tied to a back-end TDM system. The test data is generated or refreshed automatically every time tests are generated from the model.
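The three kinds of rules can be illustrated with a small sketch. This is not the Agile Requirements Designer format or API; the rule table `LOGIN_MODEL_RULES`, the helper names, and the stand-in for a back-end TDM lookup are all hypothetical. A static rule is a plain value, while formulaic and back-end rules are functions that are evaluated each time data is generated for the model.

```python
import random
import string


def random_username(rng):
    """Formulaic rule: 6 to 12 lowercase letters or digits."""
    length = rng.randint(6, 12)
    return "".join(rng.choices(string.ascii_lowercase + string.digits, k=length))


def lookup_account_id(rng):
    """Stand-in for a rule tied to a back-end TDM system (hypothetical)."""
    return "ACCT-FROM-TDM-POOL"


# Test data rules attached to the fields of a simple login model.
LOGIN_MODEL_RULES = {
    "UserName": random_username,     # formulaic (synthetic generation)
    "Password": "Passw0rd!",         # static (hardcoded)
    "AccountId": lookup_account_id,  # tied to a back-end system
}


def generate_test_data(rules, seed=None):
    """Evaluate every rule once; callables are invoked, static values are copied."""
    rng = random.Random(seed)
    return {
        field: rule(rng) if callable(rule) else rule
        for field, rule in rules.items()
    }
```

Regenerating tests from the model would call `generate_test_data` again, so the data is refreshed together with the test cases.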
This approach extends all of the benefits of MBT to TDM, namely:
- It enables the test data specification to be shifted left, since modeling is implicitly a shift-left testing activity. It allows more rigorous test data specification, and that specification is stored as part of the “source of truth” about system behavior that the models encapsulate. It also provides a better collaboration platform for TDM between multiple stakeholders, such as test data engineers, data engineers, testers (or SDETs), and developers.
- The test data is “matched” to test cases generated from the model. Since the models support multiple optimization techniques for generating tests (such as maximal, optimal, or minimal coverage), it limits the test data generation to such optimized tests only. This also makes it easy to integrate the provisioning and deployment of tests and associated test data together (for example, in test containers).
- Models are a great way to support change impact testing to minimize the amount of testing. This also allows us to do change-impact based TDM—we only generate test data for affected test cases. This is especially useful for end-to-end and system integration testing in the upper tiers of the test pyramid as described above, where we want to minimize testing effort.
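Change-impact based TDM can be sketched as follows, assuming each test case generated from the model is recorded as the list of model nodes it traverses. The representation and the function names `impacted_tests` and `generate_data_for` are illustrative assumptions; real MBT tools track this relationship in their own way.

```python
def impacted_tests(test_paths, changed_nodes):
    """Return the names of tests whose path touches at least one changed node."""
    changed = set(changed_nodes)
    return sorted(
        name for name, nodes in test_paths.items()
        if changed.intersection(nodes)
    )


def generate_data_for(tests, generator):
    """Generate test data only for the given tests, using the supplied generator."""
    return {name: generator(name) for name in tests}
```

For example, if only the error-handling node of a login model changes, only the tests that pass through that node get fresh test data; the rest keep what they already have.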
5. Use Service Virtualization in Conjunction with TDM
Service virtualization is a well-established practice for agile development and testing. By virtualizing the components that the system under test depends upon, we also reduce the TDM burden for those components.
In fact, the type of test data used (whether synthetic, hybrid, or production-like) correlates with the extent of service virtualization used. At the bottom of the test pyramid, we aggressively use both synthetic test data and virtual services. Towards the top of the pyramid, we use more realistic test data with real application components. We can use hybrid approaches for the middle tiers.
This correlation of progressive service virtualization with progressive TDM along the CI/CD lifecycle is shown below.
When using virtual services for test data, we need to make sure that we maintain consistency between data used to drive the tests, the application database, and the virtual service. The first two are typically taken care of by TDM tools. For the virtual service, we need to either record or synthesize the virtual service with the same test data set used for the test. See figure below.
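One simple way to keep the three in step is to derive all of them from a single test data set. The sketch below assumes a hypothetical scenario: an application database with a `users` table (an in-memory SQLite database stands in for it) and a virtualized credit-score dependency. The table layout, the field names, and the handler shape are assumptions made for illustration.

```python
import sqlite3

# Single source of test data shared by the test, the database, and the virtual service.
TEST_USERS = [
    {"id": 1, "username": "alice", "credit_score": 720},
    {"id": 2, "username": "bob", "credit_score": 580},
]


def seed_database(conn, users):
    """Load the shared test users into the application database."""
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, username TEXT)")
    conn.executemany(
        "INSERT INTO users (id, username) VALUES (?, ?)",
        [(u["id"], u["username"]) for u in users],
    )
    conn.commit()


def build_virtual_service(users):
    """Synthesize a virtual credit-score service from the same test users."""
    responses = {
        u["id"]: {"userId": u["id"], "creditScore": u["credit_score"]}
        for u in users
    }

    def handler(user_id):
        return responses.get(user_id, {"error": "not found"})

    return handler
```

Because both `seed_database` and `build_virtual_service` read from `TEST_USERS`, a test driven by the same list never asks the virtual service about a user that the database does not know.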
6. Consider Alternate Approaches to TDM
In addition to the TDM approaches described above, where available, we also need to consider other complementary approaches to generating test data. Some of these approaches include the following:
- Data virtualization. There are different definitions of “data virtualization.” One, promoted by vendors of solutions such as Broadcom vTDM, Delphix, and Actifio, is about creating “virtual copies” of datastores. The other, related definition, supported by other solutions, is about creating a unified data access layer across multiple underlying datastores (using virtual tables that map into the backend databases). Regardless of the definition, both are applicable in the context of TDM. “Virtual copies” of data, like service virtualization, allow us to provision fast, lightweight views of datastores. This also solves the problem of data refresh that comes with data copying in ETL approaches. Similarly, virtual tables simplify access to data across multiple datastores (across applications), which is often required for end-to-end system testing.
- Data traffic sniffing. This is especially useful for generating test data for API testing. Test data can be generated by sniffing and recording data that flows in and out of APIs (using tools like Wireshark or Fiddler). This approach is very similar to the way we record and create virtual services (as described above).
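As a rough sketch of the traffic-based approach, suppose recorded API exchanges have already been exported into a list of dictionaries with `path`, `request`, `status`, and `response` keys. That shape is an assumption made for illustration and is not the native export format of Wireshark or Fiddler. Test cases for one endpoint can then be derived by filtering on the path and dropping duplicate requests.

```python
import json


def cases_from_traffic(exchanges, path):
    """Turn recorded request/response pairs for one API path into test cases."""
    seen = set()
    cases = []
    for exchange in exchanges:
        if exchange["path"] != path:
            continue
        # Serialize with sorted keys so identical payloads compare equal.
        key = json.dumps(exchange["request"], sort_keys=True)
        if key in seen:
            continue  # keep only the first recording of each distinct request
        seen.add(key)
        cases.append({
            "input": exchange["request"],
            "expected_status": exchange["status"],
            "expected_body": exchange["response"],
        })
    return cases
```

Recorded traffic may contain personal data, so in practice the captured payloads would still need masking before being reused as test data.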
7. Integrate TDM with CI/CD Processes
In order to achieve continuous TDM, we need to ensure that test data provisioning and deployment are also automated as part of provisioning and deployment automation along the CI/CD lifecycle. As discussed under shifting TDM left, this is especially important in the CD part of the lifecycle, where we need to minimize elapsed time to reduce cycle time. This can be achieved by integrating the deployment of test data with deployment automation tools. For applications that are deployed in containers, we may package test assets (including test data) in sidecar deployment containers and deploy them alongside application containers.
Pulling It All Together: Continuous TDM Lifecycle
The following figure summarizes the different activities in a typical continuous TDM process across the different stages of the CI/CD pipeline.
These activities are summarized below:
Step 1(a): Backlog Grooming
Test data management starts with well-defined acceptance criteria for the backlog items. This provides the dev/test team with seed test data that can be used for defining acceptance test cases. In keeping with our recommended approach for model-driven TDM, we recommend that teams capture this information as part of the model, which defines the behavior associated with the backlog item. In this way, test data can be generated from the model along with acceptance tests.
Step 1(b): Agile Design
The TDM activities at this stage support the needs of development and subsequent CI/CD phases. This includes:
- Generation of synthetic test data (and virtual services) for supporting unit and acceptance tests (BDD).
- Definition of test data constraints in the models for other tests (such as integration or system tests) in the CI/CD pipeline.
- Definition of the specifications of test data that needs to be extracted from production.
- Specification of test data packaging for deployment into various environments of the CI/CD pipeline.
Step 2(a): Agile Parallel Development
During development, developers and SDETs execute unit and component tests using the synthetic test data (and virtual services) that were created in the previous step. The availability of test data and virtual services is in fact a big enabler for supporting extensive unit testing.
Step 2(b): Agile Parallel Testing
This is an important stage in which testers and test data engineers design (or even generate/refresh) the test data for impacted test scenarios (based on the backlog items under development) that will be run in subsequent stages of the CI/CD lifecycle. The test data developed here will typically be hybrid (mix of synthetic data and a subset of data from production) based on the testing pyramid discussed above. In addition, the test data will need to be packaged (for example in containers or using virtual data copies) in order to ease and speed provisioning into the appropriate test environment (along with test scripts and other artifacts).
Step 3: Build
In this step, we typically run automated build verification tests and regression tests using the test data generated in the previous step.
Step 4: Testing in the CD Lifecycle Stages
The focus in these stages is to run tests (in the upper layers of the test pyramid) using hybrid test data created during Step 2(b). (See figure below.) The key in these stages is to minimize the elapsed time TDM activities require. For example, the time taken to create, provision, or deploy the required test data must not exceed the time to deploy the application in each stage.
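The timing rule in the example above can be turned into a simple pipeline check. This sketch assumes that the TDM activities for a stage run one after another, so their durations are summed; if they run in parallel, the longest one would be the relevant figure instead. The function name and the shape of its inputs are illustrative.

```python
def tdm_fits_deploy_window(tdm_seconds, app_deploy_seconds):
    """Check whether the total TDM time for a stage stays within the app deploy time.

    tdm_seconds maps each TDM activity (for example "create", "provision",
    "deploy") to its measured duration in seconds.
    Returns a (fits, total_seconds) tuple.
    """
    total = sum(tdm_seconds.values())
    return total <= app_deploy_seconds, total
```

A stage that takes 90 seconds to deploy the application and 60 seconds in total for test data passes the check; if deployment took only 45 seconds, the check would flag test data as the bottleneck for that stage.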
How do we get started with continuous TDM?
Continuous TDM is meant to be practiced in conjunction with continuous testing. Various resources offer insights into evolving to continuous testing. If you are already practicing continuous testing and want to move to continuous TDM, our recommendation is to proceed as follows:
- For new functionality, follow the TDM approach we have described.
- For existing software, you may choose to focus continuous TDM efforts on the most problematic or change-prone application components, since those are the ones you need to test most often. It would help to model the tests related to those components, since you can derive the benefits of combining TDM with model-based testing. While focusing on TDM for these components, aggressively virtualize dependencies on other legacy components, which can lighten your overall TDM burden.
- For other components that do not change as often, you need to test less often. As described above, virtualize these components while testing others that need testing. In this way, teams can address TDM needs as part of technical debt remediation for these components.
Next Up: Continuous Testing for Microservices
This blog has provided an overall approach for continuous TDM practices. As you can probably tell, microservices-based applications are extremely well suited to supporting continuous TDM. This is true because such applications are modular and componentized. I have previously blogged on new approaches to TDM for such applications. In a future blog, I will discuss approaches for continuous TDM for microservices applications.
Until such time, my friends, stay well, and may all your TDM efforts be continuous!