Test Data Management
Analyst Coverage: Daniel Howard
Traditionally, testing and quality assurance teams create test data by copying the live database. However, the average Global 2000 company has seven such copies, which is expensive in terms of licence fees, hardware, and running costs. A cheaper option is to take subsets of the database instead of full copies. However, without sophisticated tools to ensure that the subset you take is representative of the database as a whole, you cannot guarantee coverage of all the testing scenarios that might apply. There is therefore a trade-off between cost and quality of testing.
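To make the representativeness problem concrete, here is a minimal sketch (ours, not any particular vendor's; the customer table and column names are invented) of stratified sub-setting, which keeps rare values that a naive random sample of the same size might drop entirely:

```python
import random
from collections import defaultdict

def stratified_subset(rows, key, fraction):
    """Sample each group separately so the subset preserves the
    distribution of `key`; a naive random sample of the same size
    could miss rare but test-relevant values entirely."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    subset = []
    for group in groups.values():
        # Keep at least one row per group so edge cases survive.
        k = max(1, round(len(group) * fraction))
        subset.extend(random.sample(group, k))
    return subset

# Hypothetical customer table in which Singapore is ~1% of rows.
countries = ["US"] * 90 + ["DE"] * 9 + ["SG"]
customers = [{"id": i, "country": random.choice(countries)}
             for i in range(100_000)]
subset = stratified_subset(customers, key="country", fraction=0.01)
```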
The second problem that assails the test data environment is that as little workload as possible should fall on DBAs; otherwise testing (and therefore the development environment as a whole) will be less agile than it needs to be. Operations is frequently seen as an obstacle to providing test data, while Development is all too often seen as a nuisance by DBAs. DevOps is a generalised approach to improving collaboration across these environments; test data management is a specific technology designed to achieve the same end, while also supporting an agile development environment in which testing is conducted early and often.
Test data management aims to square the circle of providing fully representative data in right-sized datasets (you may need differently sized subsets for different types of test) with minimal impact on the database administrator. There are two methods generally in use for generating test data: either you take a representative subset of the data or you generate a synthetic set of data. The former is achieved by sub-setting the data and then repeatedly applying data masking techniques (once for each new dataset), while the latter relies on having profiled the source data using a data profiling and discovery tool.
The advantage of a completely synthetic approach is that you don't touch the live data at all, other than for the original profiling, and it is therefore very quick and easy to generate new test data sets without having to go to Operations for assistance. This makes it a particularly suitable approach for agile environments.
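As a rough sketch of this profile-then-generate pattern (our own simplification, not any product's implementation: real tools also model formats, correlations between columns, and referential integrity, and the column names here are hypothetical), profiling captures summary statistics once, after which test sets can be regenerated at will without touching the live database:

```python
import random
import statistics

def profile(rows, categorical, numeric):
    """One-off pass over live data: record category frequencies and
    numeric distributions, after which the live rows are discarded."""
    prof = {"cat": {}, "num": {}}
    for col in categorical:
        values = [r[col] for r in rows]
        prof["cat"][col] = {v: values.count(v) / len(values) for v in set(values)}
    for col in numeric:
        values = [r[col] for r in rows]
        prof["num"][col] = (statistics.mean(values), statistics.stdev(values))
    return prof

def generate(prof, n):
    """Generate synthetic rows from the profile alone: no live value
    is ever copied, so no masking is needed afterwards."""
    rows = []
    for _ in range(n):
        row = {}
        for col, freqs in prof["cat"].items():
            row[col] = random.choices(list(freqs), weights=list(freqs.values()))[0]
        for col, (mean, stdev) in prof["num"].items():
            row[col] = random.gauss(mean, stdev)
        rows.append(row)
    return rows

# Profile once (with DBA help), then regenerate test sets on demand.
live = [{"plan": random.choice(["basic", "pro"]),
         "balance": random.gauss(500, 120)} for _ in range(1_000)]
prof = profile(live, categorical=["plan"], numeric=["balance"])
test_set = generate(prof, n=10_000)
```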
Test data management solutions will also include data masking capabilities, so that personally identifiable and other sensitive data can be discovered and masked appropriately (this is really a governance issue), although it should be noted that masking is unnecessary if you are generating completely synthetic data.
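A common requirement here is deterministic (consistent) masking, so that the same real value always maps to the same masked value and joins across tables keep working. A minimal sketch of the idea, using a hypothetical secret key rather than any vendor's mechanism:

```python
import hashlib

def mask_email(email: str, secret: str) -> str:
    """Deterministically replace an email address: the same input
    always yields the same output, so foreign-key joins on the
    masked column still work, but the real address never appears
    in the test environment. The secret key is illustrative only."""
    digest = hashlib.sha256((secret + email).encode()).hexdigest()[:12]
    return f"user_{digest}@example.com"

# Same input, same masked value: referential consistency holds.
assert mask_email("jane@acme.com", "key1") == mask_email("jane@acme.com", "key1")
assert mask_email("jane@acme.com", "key1") != mask_email("john@acme.com", "key1")
```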
Those in charge of testing teams and quality control will be the most interested, but this is also relevant to compliance officers (especially when development is to be outsourced) because the data used for testing is either synthetic or masked.
In addition, development teams adopting an agile methodology should care, because agile development is not much use without agile testing, and you can't have agile testing if you don't also have agile test data.
While test data management has been around for some years, it is only in this decade that it has really come to the fore. In our view the most likely trend going forward is the merger of test data management with service virtualisation to further speed up testing processes. Indeed, partnerships and acquisitions are already taking place within this sector to enable exactly this.
One noticeable fissure in the market is between those companies providing test data management from the perspective of developers (integrating with service virtualisation, testing tools, code coverage and so on), as exemplified by Grid-Tools, and those that offer a more data-centric approach, as typified by Informatica. In practice, nearly all vendors are in the latter camp, which potentially gives Grid-Tools an advantage.
Informatica acquired Applimation, IBM acquired Green Hat (a service virtualisation provider), and Grid-Tools has extended its own portfolio to include service virtualisation, as well as partnering with a number of the service virtualisation vendors. New entrants into the field include Rever and Delphix; the latter provides a virtualised environment for SQL Server and Oracle, and it works with, rather than provides, data masking.
The big trend, however, is towards synthetic data generation. It used to be that only Grid-Tools offered this, but now GenRocket has emerged, Rever has introduced a test data management product (SEAL) that also includes data masking, and Informatica has added synthetic data generation. We expect IBM to follow suit in due course.
The next step for vendors will be to introduce something comparable to Grid-Tools’ test data warehouse. Informatica has announced that it will do so later in 2014.