For teams responsible for data masking, demands continue to grow. Each month, the tables that need masking seem to get significantly larger. It's now not uncommon for teams to receive requests to mask tables with hundreds of millions, or even billions, of rows. However, while table sizes keep growing, time frames don't. No matter how large the table, teams still typically need to turn around requests within eight to twelve hours.
How do teams scale their masking to accommodate these expanding demands and tight turnaround times? The good news is that Broadcom Test Data Manager is helping customers meet these demands every day. In this post, I’ll offer an introduction to a new feature in Broadcom Test Data Manager called Scalable Masking and outline how you can use the feature most effectively.
Starting with release 4.9, Test Data Manager offers Scalable Masking capabilities. Customers who upgrade to the latest release, 4.10, will be able to leverage these features and more. Users can initiate masking jobs through the TDM portal and centrally run masking jobs across a range of tables and data models.
In these new releases, Test Data Manager is offered as a Docker container. This allows teams to run multiple masking engines, each in its own container. As a result of this container-based approach, when teams have large tables, they can split jobs across multiple masking engines, which provides significant performance and scalability benefits.
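The details of how TDM partitions work are internal to the product, but the idea of splitting one large table across several engines can be illustrated with a simple sketch. The function below is a hypothetical example (not TDM's actual algorithm): it divides a table's rows into contiguous ranges, one per masking engine, handing any remainder rows to the earliest engines.

```python
def split_row_ranges(total_rows, num_engines):
    """Partition a table's rows into contiguous (start, end) ranges,
    one range per masking engine, covering every row exactly once."""
    base, extra = divmod(total_rows, num_engines)
    ranges = []
    start = 0
    for i in range(num_engines):
        # The first `extra` engines each take one additional row.
        size = base + (1 if i < extra else 0)
        ranges.append((start, start + size - 1))
        start += size
    return ranges

# A 1-billion-row table split across 4 engines yields four
# 250-million-row ranges, which can then be masked in parallel.
ranges = split_row_ranges(1_000_000_000, 4)
```

Each range can then be submitted as an independent masking job, which is what makes the container-based approach scale: adding engines shrinks the per-engine range.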
While a lot of variables will affect performance, Scalable Masking can mask between four and 15 million cells per minute. (See the table below for examples of differing data source sizes and configurations.)
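Those throughput figures make it easy to do a back-of-envelope sizing check. The sketch below uses the 4-to-15-million-cells-per-minute range quoted above; the 500-million-row, 10-column table is an assumed example, not a benchmark from the post.

```python
def masking_hours(rows, masked_columns, cells_per_minute):
    """Estimate wall-clock hours to mask a table at a given throughput,
    where total cells = rows x masked columns."""
    total_cells = rows * masked_columns
    return total_cells / cells_per_minute / 60

# A 500M-row table with 10 masked columns is 5 billion cells:
# at the high end (15M cells/min) that's roughly 5.6 hours,
# at the low end (4M cells/min) roughly 20.8 hours.
best_case = masking_hours(500_000_000, 10, 15_000_000)
worst_case = masking_hours(500_000_000, 10, 4_000_000)
```

This is why configuration matters: at the high end of the range such a table fits comfortably inside an eight-to-twelve-hour window, while at the low end it does not.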
When requests are made in the portal, they are submitted as RESTful requests to the message bus. Based on the number of engines available and the processing status of the engines, the message bus sends those requests to the appropriate engine.
Each masking engine connects to the target database, reads its assigned table, and performs the masking. The message bus reports progress back to the portal. This reporting enables administrators to track status, and it provides a documented record that can be retained for auditing purposes.
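TDM does not publish the internals of its message bus, but the routing behavior described above (sending each request to an appropriate engine based on how busy the engines are) can be sketched as a simple least-loaded dispatcher. Everything here, including the engine names, is a hypothetical illustration of the pattern, not TDM code.

```python
from dataclasses import dataclass

@dataclass
class Engine:
    """A masking engine as the dispatcher sees it: a name plus a load count."""
    name: str
    active_jobs: int = 0

def dispatch(request, engines):
    """Route a masking request to the engine with the fewest active jobs
    (ties go to the first engine in the list) and return its name."""
    target = min(engines, key=lambda e: e.active_jobs)
    target.active_jobs += 1
    return target.name

# Two idle engines: successive requests alternate between them.
engines = [Engine("engine-1"), Engine("engine-2")]
first = dispatch({"table": "customers"}, engines)   # goes to engine-1
second = dispatch({"table": "orders"}, engines)     # goes to engine-2
```

In a real deployment the load signal would come from the engines' status reports rather than a local counter, but the dispatching principle is the same.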
It is important to note that the Docker masking engine communicates directly with the database instance, and they both reside on the same subnet, which can provide significant benefits in performance and throughput.
To illustrate how masking jobs can be split, here's a hypothetical example:
To get the most out of Scalable Masking, here are some key strategies:
Following are suggested settings for Scalable Masking:
For today’s development teams, the ability to scale data masking continues to grow more critical. By employing Scalable Masking and properly configuring their environments, teams can dramatically scale their masking capacity. To learn more, be sure to read Masking Performance Optimization in CA TDM Portal.