"When looking into various test data management solutions, we found that Datamaker was very useful for getting the right kind of data for testing and development. This was incredibly important for us and one of the main factors in our tool selection. Yes, we needed data for testing and development, but we also needed many data scenarios. Datamaker gave us a small amount of data, but a rich spread of data."
Jochen Westheide,
The ARAG Group
"As far as we know, Grid-Tools is the only specialist vendor in this space. In our view, Datamaker is the most extensive and most complete test data management product that is available on the market today."
Philip Howard, Bloor analyst
Read the full publication on test data management methods here
Test Data Management (TDM) is about the provisioning of data for non-production environments, especially for test purposes but also for development, training, quality assurance, the creation of demonstration or other dummy data, and so on. There are essentially three ways to derive such data: you can take a copy of your production database and, where appropriate, mask any sensitive data; or you can subset your data and mask it; or you can generate synthetic data, based on an understanding of the underlying data model, which means that no masking is required. Each of these approaches to TDM has both advantages and disadvantages.
Copying your database has the advantage of being relatively simple. However, it is expensive in terms of hardware, license and support costs to have multiple copies of the same database. It is not unknown for companies with large development shops to have upwards of twenty different copies of the same database for development and testing purposes.
Sub-setting your database is less expensive. However, it suffers from the same problems as all sampling processes in that you can miss outliers. This is particularly important in development environments because outliers may cause the system to break whereas normal results do not, so it is important that outliers are properly tested. Therefore, if you are using sub-setting then you need to ensure that outliers are captured and represented within the process of creating your subset. However, this pre-supposes that those outliers are present in the production database at the point at which the data is sub-setted. Since it is unlikely that all possible outliers are present at any one time this means that a copied or sub-setted database can never be fully representative of what you need to test.
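The outlier problem can be sketched in a few lines. The example below is a minimal illustration rather than a description of any product: the table, the `amount` field and the thresholds are all hypothetical, and "keep every tenth row" stands in for whatever sampling rule a real subsetting process uses.

```python
def systematic_subset(rows, step):
    """Keep every step-th row, a simple stand-in for a sampling-based subset."""
    return rows[::step]


def subset_with_outliers(rows, step, field, low, high):
    """Systematic subset plus every row whose field value lies outside [low, high]."""
    subset = systematic_subset(rows, step)
    kept_ids = {row["id"] for row in subset}
    for row in rows:
        if not (low <= row[field] <= high) and row["id"] not in kept_ids:
            subset.append(row)
    return subset


# Hypothetical production table: 100 ordinary payments, one of them extreme.
rows = [{"id": i, "amount": 100.0 + i} for i in range(100)]
rows[57]["amount"] = 1_000_000.0

naive = systematic_subset(rows, 10)
augmented = subset_with_outliers(rows, 10, "amount", low=0.0, high=10_000.0)

print(max(r["amount"] for r in naive))      # the extreme payment is not in the plain subset
print(max(r["amount"] for r in augmented))  # the extreme payment has been captured
```

Note that the second function can only capture outliers that exist in the source table at the time the subset is taken, which is exactly the limitation described above.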
A further issue that occurs with both full and partial database copies is that any sensitive data may need to be masked. Of course, there are many applications that address data that is not sensitive or subject to data privacy laws. On the other hand, there are also large numbers of applications where it is necessary to de-identify data to meet compliance requirements or to protect intellectual property. Where that is the case, the data will need to be masked. This is not as simple as it may appear at first sight. In the first instance, you need to identify the data that needs to be hidden: typically, this is done by looking for relevant patterns of information, such as credit card numbers ending xxxx. This can be done manually but it is an onerous process that is better automated.
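As an illustration of what automated, pattern-based discovery might look like, the sketch below scans free text for candidate credit card numbers. It is an assumption-laden example, not how any particular tool works: the regular expression (13 to 19 digits with optional space or hyphen separators) and the function names are chosen for illustration, and the Luhn checksum is used only to discard digit runs that cannot be card numbers.

```python
import re

# Candidate pattern: 13 to 19 digits, optionally separated by single spaces or hyphens.
CARD_PATTERN = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def luhn_valid(digits):
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_card_numbers(text):
    """Return substrings that look like card numbers and pass the Luhn check."""
    hits = []
    for match in CARD_PATTERN.finditer(text):
        digits = re.sub(r"[ -]", "", match.group())
        if luhn_valid(digits):
            hits.append(match.group())
    return hits


sample = "Paid with 4111 1111 1111 1111; reference 1234-5678-9012-3456."
print(find_card_numbers(sample))  # only the first number passes the checksum
```

In practice the same idea would be applied column by column across a schema rather than to a single string, and further patterns (national identifiers, postcodes and so on) would be added alongside this one.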
Related to this point is the question of how the masking is to be achieved. This will depend, at least in part, on why you are doing the masking. For example, you could simply hide a credit card number by replacing each digit with an x (xxxx-xxxx-xxxx-xxxx), which will be fine if you are only concerned with data protection. However, if you want to test a payment application then you will need to work with real (pseudo-) numbers in order to test your applications. Similarly, simple shuffling techniques (for example, replacing zip code 12345 with 54321) will not work if your application requires a valid zip code. For test data management you will need to mask in such a way that the data remains valid.
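The difference between hiding a value and masking it so that it remains valid can be shown with a short sketch. The code below replaces a card number with a different number of the same length and layout that still passes the Luhn checksum. The hashing scheme, the `secret` parameter and the function names are assumptions made for this example. It does not preserve issuer prefixes, so an application that also validates those would need a more careful scheme.

```python
import hashlib


def luhn_valid(digits):
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def with_check_digit(body):
    """Append the single final digit that makes the whole number Luhn-valid."""
    for d in "0123456789":
        if luhn_valid(body + d):
            return body + d
    raise ValueError("no valid check digit found")  # not reachable for digit strings


def mask_card_number(card, secret="demo-secret"):
    """Replace the digits of a card number, keeping separators and validity."""
    digits = "".join(c for c in card if c.isdigit())
    digest = hashlib.sha256((secret + digits).encode("utf-8")).hexdigest()
    # Derive replacement digits (all but the last) from the keyed hash.
    body = "".join(str(int(h, 16) % 10) for h in digest[: len(digits) - 1])
    replacement = iter(with_check_digit(body))
    # Put the new digits back into the original layout.
    return "".join(next(replacement) if c.isdigit() else c for c in card)


masked = mask_card_number("4111-1111-1111-1111")
print(masked)
```

Because the replacement is derived from a hash of the original, the same input always yields the same masked value, which helps when the same card number appears in more than one table.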
Further, it may not be simply a question of identifying what data needs to be masked and then hiding it. This is because you need to ensure that data relationships remain intact during the masking process, otherwise application testing may break down. This will, of course, be dependent on the application but in complex environments it can be critical. For example, a patient has a disease, which has a treatment, which has a consulting physician who practices in a particular hospital and uses a designated operating theatre. If you scramble the data so that a patient with flu ends up having open heart surgery then your software may break down simply because your masking routines have not ensured that important relationships remain intact. So, discovery of these relationships may be essential.
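One common way to keep relationships intact is to mask join keys deterministically, so that the same original identifier always maps to the same masked identifier in every table. The sketch below illustrates this with two hypothetical tables and a keyed hash. The table layout, field names and key are invented for the example and are not taken from any product.

```python
import hashlib
import hmac


def pseudonym(value, key):
    """Map an identifier to a stable, keyed pseudonym (same input, same output)."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()[:12]


def mask_rows(rows, key, id_field="patient_id", drop_fields=("name",)):
    """Drop directly identifying fields and replace the join key with a pseudonym."""
    masked = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k not in drop_fields}
        new_row[id_field] = pseudonym(row[id_field], key)
        masked.append(new_row)
    return masked


def diagnosis_treatment_pairs(patients, treatments):
    """Join the two tables on patient_id and list (diagnosis, treatment) pairs."""
    by_id = {p["patient_id"]: p["diagnosis"] for p in patients}
    return sorted((by_id[t["patient_id"]], t["treatment"]) for t in treatments)


# Hypothetical source tables.
patients = [
    {"patient_id": "P001", "name": "Alice Smith", "diagnosis": "flu"},
    {"patient_id": "P002", "name": "Bob Jones", "diagnosis": "heart disease"},
]
treatments = [
    {"patient_id": "P001", "treatment": "rest and fluids"},
    {"patient_id": "P002", "treatment": "open heart surgery"},
]

KEY = b"hypothetical-masking-key"
masked_patients = mask_rows(patients, KEY)
masked_treatments = mask_rows(treatments, KEY)

# The patient with flu is still paired with rest and fluids after masking.
print(diagnosis_treatment_pairs(masked_patients, masked_treatments))
```

Had each table been scrambled independently, the join would either fail or pair the patient with flu with open heart surgery, which is the failure described above.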
It should also be borne in mind that masking is never perfect. In healthcare environments, to continue the preceding example, a determined hacker may still be able to identify individuals, precisely because of the need to retain relationships. In addition, and as another example, your largest customer will still be your largest customer even if he, she or it is not immediately identifiable by name.
The third alternative is to generate synthetic data. From an a priori perspective this is preferable to either of the other two approaches because the dataset can be relatively small, assuming that it is representative, thereby keeping costs down, and because there is no requirement for masking. Moreover, there is no requirement to access production data, which means no impact on operational performance. However, in order to generate representative synthetic data you do need a good understanding of the data relationships: not only those embedded within the database schema (or file system) but also those that are implicit within the data and not formally detailed within the schema. In other words, you need some sort of discovery process; but then you need a comparable capability to do a good job of masking, for the reasons just discussed. You will also want to be able to include errors within your synthetic data generation, as you will wish to test how the software handles them. A further point is that the world does not stand still: trading patterns change over time and you may want to discover such trends that already exist within your data and project those forward to test against patterns of data that may be applicable in two or three years' time. This is clearly something that you cannot do using either a copy-based or a subset-based approach but which should be possible with synthetic data creation.
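As a rough illustration of the idea, the sketch below generates a small synthetic dataset from a toy data model: every combination of a few field values, including boundary amounts, plus a handful of deliberately invalid rows. The field names, value lists and the `expect_valid` flag are all hypothetical. A real generator would derive them from the discovered schema and relationships rather than from hard-coded lists.

```python
import itertools

# Hypothetical data model: the allowed values for each field.
ACCOUNT_TYPES = ["personal", "business"]
CURRENCIES = ["EUR", "USD", "GBP"]
AMOUNTS = [0.01, 999.99, 1_000_000.00]  # boundary values, including an extreme one


def generate_valid_rows():
    """One row for every combination of the modelled field values."""
    rows = []
    combos = itertools.product(ACCOUNT_TYPES, CURRENCIES, AMOUNTS)
    for n, (account_type, currency, amount) in enumerate(combos, start=1):
        rows.append({"id": n, "account_type": account_type,
                     "currency": currency, "amount": amount,
                     "expect_valid": True})
    return rows


def generate_error_rows(start_id):
    """Deliberately invalid rows, used to test how the software handles errors."""
    bad = [
        {"account_type": "personal", "currency": "XXX", "amount": 10.0},  # unknown currency
        {"account_type": "personal", "currency": "EUR", "amount": -5.0},  # negative amount
        {"account_type": None, "currency": "USD", "amount": 10.0},        # missing account type
    ]
    return [dict(row, id=start_id + i, expect_valid=False) for i, row in enumerate(bad)]


valid_rows = generate_valid_rows()
error_rows = generate_error_rows(start_id=len(valid_rows) + 1)
dataset = valid_rows + error_rows
print(len(valid_rows), len(error_rows))
```

The resulting dataset is small, contains no production values and so needs no masking, yet it exercises every modelled combination once.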
We should also note that some vendors claim to be able to generate synthetic data based on subsetting or copying the data and then repeatedly masking it. While this can be used to support load testing that is probably the limit of its value and we would not describe this as synthetic data generation in any real sense of that term. Otherwise, claims for synthetic data generation may be based on nothing more than having seed tables. If you are interested in synthetic data generation you will therefore need to be wary of different vendors calling different things by the same name.
It is also worth noting that it is a common misconception that tools such as HP’s QuickTest Professional generate data: they do not.
Finally, another major issue in test data management and, indeed, testing in general, is that of coverage. What you would really like to achieve is testing of every possible code path with every possible combination of data with a minimum of tests. Unfortunately, that is very far from the experience of most testers. Taking a database copy, for example, often supports no more than 30% coverage and often much less. There are mechanisms (which we will discuss later) available to improve this percentage and reduce duplicated testing but this cannot be eliminated because of the very nature of the production data, not least because production data is not representative of all possible data, as previously discussed. The same problem also applies to sub-setted data.
Conversely, the aim of synthetic data is to provide a truly representative dataset but without duplication. When combined with appropriate mechanisms it is possible to get as much as 100% functional coverage and 90% code coverage with an absolute minimum of tests. This is much more thorough than typical development environments (where 50% coverage is nearer the norm) and should result in the production of better code in less time and at less cost, because of the reduced number of tests that need to be run. Anecdotal evidence suggests that the use of synthetic data can reduce testing cycles by as much as one third.
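One simple way to make the notion of data coverage concrete is to count how many of the possible combinations of field values actually appear in a dataset. The sketch below uses that definition purely for illustration. It is an assumption of this example, not the metric behind the percentages quoted above, and the domains and rows are invented.

```python
from itertools import product


def combination_coverage(rows, domains):
    """Fraction of all possible value combinations that occur in rows."""
    fields = list(domains)
    possible = set(product(*(domains[f] for f in fields)))
    seen = {tuple(row[f] for f in fields) for row in rows}
    return len(seen & possible) / len(possible)


# Hypothetical value domains for two fields.
domains = {
    "account_type": ["personal", "business"],
    "currency": ["EUR", "USD", "GBP"],
}

# A production-like sample: many rows, but the same few combinations repeated.
production_like = [
    {"account_type": "personal", "currency": "EUR"},
    {"account_type": "personal", "currency": "EUR"},
    {"account_type": "personal", "currency": "USD"},
    {"account_type": "personal", "currency": "EUR"},
    {"account_type": "personal", "currency": "USD"},
]

# A synthetic set: each combination exactly once, with no duplicates.
synthetic = [dict(zip(domains, combo)) for combo in product(*domains.values())]

print(combination_coverage(production_like, domains))  # 2 of 6 combinations
print(combination_coverage(synthetic, domains))        # all 6 combinations
```

Here five production-like rows exercise only two of the six combinations, while six synthetic rows exercise all of them, which is the sense in which a smaller dataset can support more thorough testing.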
Grid-Tools provides a complete suite of Test Data Management tools, which we will describe in detail in this report. These can be licensed en masse as the single product Datamaker or on a modular basis. As far as we know the company is the only vendor to support all of the methods just discussed for managing test data. Grid-Tools works with data in both flat files and databases and it can also be used to support SOA and user interface test environments.
In the opinion of Bloor Research the following represent the key facts of which prospective users should be aware with respect to Grid-Tools Test Data Management: