Grid-Tools Test Data Management

"When looking into various test data management solutions, we found that Datamaker was very useful for getting the right kind of data for testing and development. This was incredibly important for us and one of the main factors in our tool selection. Yes, we needed data for testing and development, but we also needed many data scenarios. Datamaker gave us a small amount of data, but a rich spread of data."

Jochen Westheide,
The ARAG Group

"As far as we know, Grid-Tools is the only specialist vendor in this space. In our view, Datamaker is the most extensive and most complete test data management product that is available on the market today."

Philip Howard, Bloor analyst

Executive introduction: major issues in test data management

Read the full publication on test data management methods here

Test Data Management (TDM) is about the provisioning of data for non-production environments, especially for test purposes but also for development, training, quality assurance, the creation of demonstration or other dummy data, and so on. There are essentially three ways to derive such data: you can take a copy of your production database and, where appropriate, mask any sensitive data; or you can subset your data and mask it; or you can generate synthetic data, based on an understanding of the underlying data model, which means that no masking is required. Each of these approaches to TDM has both advantages and disadvantages.

Copying your database has the advantage of being relatively simple. However, it is expensive in terms of hardware, license and support costs to have multiple copies of the same database. It is not unknown for companies with large development shops to have upwards of twenty different copies of the same database for development and testing purposes.

Sub-setting your database is less expensive. However, it suffers from the same problems as all sampling processes in that you can miss outliers. This is particularly important in development environments because outliers may cause the system to break whereas normal results do not, so it is important that outliers are properly tested. Therefore, if you are using sub-setting then you need to ensure that outliers are captured and represented within the process of creating your subset. However, this pre-supposes that those outliers are present in the production database at the point at which the data is sub-setted. Since it is unlikely that all possible outliers are present at any one time, this means that a copied or sub-setted database can never be fully representative of what you need to test.
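As an illustration of capturing outliers during sub-setting, the following is a minimal sketch rather than a description of any vendor's method. It assumes that rows are simple dictionaries and that an outlier is just the smallest or largest value of one numeric field; the function name and the sample data are hypothetical.

```python
import random

def subset_with_outliers(rows, key, fraction, seed=0):
    """Return a random sample of rows plus the rows holding the min and max of `key`."""
    rng = random.Random(seed)
    sample_size = max(1, int(len(rows) * fraction))
    sample = rng.sample(rows, sample_size)
    # Force the extreme rows into the subset if the random sample missed them.
    extremes = [min(rows, key=lambda r: r[key]), max(rows, key=lambda r: r[key])]
    for row in extremes:
        if row not in sample:
            sample.append(row)
    return sample

# Hypothetical production-like data: 1,000 ordinary orders and two outliers.
orders = [{"id": i, "amount": 100 + i} for i in range(1000)]
orders.append({"id": 1000, "amount": 0})          # outlier: zero-value order
orders.append({"id": 1001, "amount": 9_999_999})  # outlier: very large order

subset = subset_with_outliers(orders, key="amount", fraction=0.01)
amounts = {row["amount"] for row in subset}
```

Note that this can only retain outliers that exist in the source at the time of sub-setting, which is exactly the limitation described above.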

A further issue that occurs with both full and partial database copies is that any sensitive data may need to be masked. Of course, there are many applications that address data that is not sensitive or subject to data privacy laws. On the other hand, there are also large numbers of applications where it is necessary to de-identify data to meet compliance requirements or to protect intellectual property. Where that is the case then data will need to be masked. This is not as simple as it may appear at first sight. In the first instance, you need to identify data that needs to be hidden: typically, this is done by looking for relevant patterns of information such as credit card numbers ending xxxx. This can be done manually but it is an onerous process that is better automated.
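To make the idea of pattern-based discovery concrete, here is a minimal sketch that scans free text for candidate credit card numbers. It assumes that a candidate is a run of 13 to 19 digits, optionally separated by spaces or hyphens, and that the standard Luhn checksum is an acceptable filter for false positives; the regular expression and function names are illustrative only.

```python
import re

# 13 to 19 digits, each optionally followed by a space or hyphen (an assumption).
CARD_PATTERN = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")

def luhn_valid(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:      # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def find_card_numbers(text: str) -> list:
    """Return the digit strings in `text` that look like valid card numbers."""
    hits = []
    for match in CARD_PATTERN.finditer(text):
        digits = re.sub(r"[ -]", "", match.group())
        if luhn_valid(digits):
            hits.append(digits)
    return hits

# 4111 1111 1111 1111 is a widely used test number; the order number fails Luhn.
sample = "Card: 4111 1111 1111 1111, order 1234567890123456"
found = find_card_numbers(sample)
```

A real discovery process would need many more patterns (names, national identifiers, addresses) and would profile column contents rather than single strings.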

Related to this point is the question of how the masking is to be achieved. This will depend, at least in part, on why you are doing the masking. For example, you could simply hide a credit card number by replacing each digit with an x (xxxx-xxxx-xxxx-xxxx), which will be fine if you are only concerned with data protection. However, if you want to test a payment application then you will need to work with real (pseudo-) numbers in order to test your applications. Similarly, simple shuffling techniques (for example, replacing zip code 12345 with 54321) will not work if your application requires a valid zip code. For test data management you will need to mask in such a way that the data remains valid.
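The difference between hiding a value and masking it so that it remains valid can be sketched as follows. This is an illustration under stated assumptions, not any product's algorithm: validity is taken to mean "same length and passes the Luhn checksum", the replacement digits are derived from a keyed hash so that the same input always masks to the same output, and the `secret` value and function names are hypothetical.

```python
import hashlib

def luhn_check_digit(payload: str) -> str:
    """Compute the Luhn check digit for a digit string that lacks its final digit."""
    total = 0
    for i, ch in enumerate(reversed(payload)):
        d = int(ch)
        if i % 2 == 0:      # these positions are doubled once the check digit is appended
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return str((10 - total % 10) % 10)

def luhn_valid(digits: str) -> bool:
    """Return True if the final digit is the correct Luhn check digit."""
    return luhn_check_digit(digits[:-1]) == digits[-1]

def mask_card_number(card: str, secret: str = "demo-secret") -> str:
    """Replace a card number with a different-looking number that still passes Luhn."""
    digits = "".join(ch for ch in card if ch.isdigit())
    digest = hashlib.sha256((secret + digits).encode()).hexdigest()
    # Map hex characters to decimal digits (slightly biased, fine for a sketch)
    # and keep the original length minus one, leaving room for the check digit.
    body = "".join(str(int(c, 16) % 10) for c in digest)[: len(digits) - 1]
    return body + luhn_check_digit(body)

masked = mask_card_number("4111 1111 1111 1111")
```

A payment application under test may impose further rules, such as recognised issuer prefixes, that this sketch does not attempt to honour.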

Further, it may not be simply a question of identifying what data needs to be masked and then hiding it. This is because you need to ensure that data relationships remain intact during the masking process, otherwise application testing may break down. This will, of course, be dependent on the application but in complex environments it can be critical. For example, a patient has a disease, which has a treatment, which has a consulting physician who practices in a particular hospital and uses a designated operating theatre. If you scramble the data so that a patient with flu ends up having open heart surgery then your software may break down simply because your masking routines have not ensured that important relationships remain intact. So, discovery of these relationships may be essential.
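One common way to keep relationships intact is to mask identifying values deterministically, so that the same original value receives the same replacement wherever it appears. The sketch below assumes two hypothetical tables linked by `patient_id` and uses a keyed hash (HMAC) to derive pseudonyms; the table layout, field names and key are invented for illustration.

```python
import hashlib
import hmac

def pseudonym(value: str, secret: bytes, prefix: str) -> str:
    """Derive a stable pseudonym: the same value and key always give the same result."""
    digest = hmac.new(secret, value.encode(), hashlib.sha256).hexdigest()
    return f"{prefix}-{digest[:10]}"

def mask_tables(patients, treatments, secret):
    """Mask identifiers in both tables while keeping the patient/treatment link."""
    masked_patients = [
        {
            "patient_id": pseudonym(p["patient_id"], secret, "PAT"),
            "name": pseudonym(p["name"], secret, "NAME"),
            "diagnosis": p["diagnosis"],  # retained: the application logic depends on it
        }
        for p in patients
    ]
    masked_treatments = [
        {
            "patient_id": pseudonym(t["patient_id"], secret, "PAT"),
            "treatment": t["treatment"],
        }
        for t in treatments
    ]
    return masked_patients, masked_treatments

def diagnosis_treatment_pairs(patients, treatments):
    """Join the two tables on patient_id and return sorted (diagnosis, treatment) pairs."""
    by_id = {p["patient_id"]: p["diagnosis"] for p in patients}
    return sorted((by_id[t["patient_id"]], t["treatment"]) for t in treatments)

patients = [
    {"patient_id": "P001", "name": "Alice Smith", "diagnosis": "flu"},
    {"patient_id": "P002", "name": "Bob Jones", "diagnosis": "coronary artery disease"},
]
treatments = [
    {"patient_id": "P001", "treatment": "rest and fluids"},
    {"patient_id": "P002", "treatment": "open heart surgery"},
]

masked_patients, masked_treatments = mask_tables(patients, treatments, b"demo-key")
```

After masking, the patient with flu is still joined to rest and fluids rather than to open heart surgery, because both tables map `P001` to the same pseudonym.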

It should also be borne in mind that masking is never perfect. In healthcare environments, to continue the preceding example, a determined hacker may still be able to identify individuals, precisely because of the need to retain relationships. In addition, and as another example, your largest customer will still be your largest customer even if he, she or it is not immediately identifiable by name.

The third alternative is to generate synthetic data. From an a priori perspective this is preferable to using either of the other two approaches because the dataset can be relatively small, assuming that it is representative, thereby keeping costs down, and because there is no requirement for masking. Moreover, there is no requirement to access production data, which means no impact on operational performance. However, in order to generate representative synthetic data you do need to have a good understanding of the data relationships: not only those embedded within the database schema (or file system) but also those that are implicit within the data and not formally detailed within the schema. In other words you need some sort of discovery process; but then you need a comparable capability to do a good job of masking, for the reasons just discussed. You will also want to be able to include errors within your synthetic data generation, as you will wish to test the software in this respect. A further point is that the world does not stand still: trading patterns change over time and you may want to discover such trends that already exist within your data and project those forward to test against patterns of data that may be applicable in two or three years' time. This is clearly something that you cannot do using either a copy-based or a subset-based approach but which should be possible with synthetic data creation.
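A minimal sketch of synthetic generation, under heavily simplified assumptions, might look like the following. It invents a two-table model (customers and orders), keeps every order's foreign key valid, and deliberately corrupts a defined fraction of orders with a negative amount; the schema, the kind of error injected and the `is_error` marker are all hypothetical choices made for illustration.

```python
import random

def generate_synthetic(n_customers, n_orders, error_rate, seed=0):
    """Generate customers and orders with valid foreign keys and a fixed share of bad rows."""
    rng = random.Random(seed)
    customers = [
        {"customer_id": i, "region": rng.choice(["north", "south", "east", "west"])}
        for i in range(1, n_customers + 1)
    ]
    orders = [
        {
            "order_id": order_id,
            # Referential integrity: always pick an existing customer.
            "customer_id": rng.choice(customers)["customer_id"],
            "amount": round(rng.uniform(1, 500), 2),
            "is_error": False,
        }
        for order_id in range(1, n_orders + 1)
    ]
    # Inject errors into a defined percentage of rows (here: a negative amount).
    for order in rng.sample(orders, round(n_orders * error_rate)):
        order["amount"] = -order["amount"]
        order["is_error"] = True
    return customers, orders

customers, orders = generate_synthetic(n_customers=10, n_orders=200, error_rate=0.05)
error_count = sum(order["is_error"] for order in orders)
```

Nothing here is read from a production system. The price is that the value ranges and relationships must be known in advance, which is why a discovery step matters.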

We should also note that some vendors claim to be able to generate synthetic data based on subsetting or copying the data and then repeatedly masking it. While this can be used to support load testing that is probably the limit of its value and we would not describe this as synthetic data generation in any real sense of that term. Otherwise, claims for synthetic data generation may be based on nothing more than having seed tables. If you are interested in synthetic data generation you will therefore need to be wary of different vendors calling different things by the same name.

It is also worth noting that it is a common misconception that tools such as HP’s QuickTest Professional generate data: they do not.

Finally, another major issue in test data management and, indeed, testing in general, is that of coverage. What you would really like to achieve is testing of every possible code path with every possible combination of data with a minimum of tests. Unfortunately, that is very far from the experience of most testers. Taking a database copy, for example, often supports no more than 30% coverage and often much less. There are mechanisms (which we will discuss later) available to improve this percentage and reduce duplicated testing but this cannot be eliminated because of the very nature of the production data, not least because production data is not representative of all possible data, as previously discussed. The same problem also applies to sub-setted data.

Conversely, the aim of synthetic data is to provide a truly representative dataset but without duplication. When combined with appropriate mechanisms it is possible to get as much as 100% functional coverage and 90% code coverage with an absolute minimum of tests. This is much more thorough than typical development environments (where 50% coverage is nearer the norm) and should result in the production of better code in less time and at less cost, because of the reduced number of tests that need to be run. Anecdotal evidence suggests that the use of synthetic data can reduce testing cycles by as much as one third.
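The contrast between production-derived and synthetic data can be illustrated with a simple measure of data-combination coverage. This is a sketch only: it measures how many of the possible combinations of field values appear in a dataset, which is not the same thing as code coverage, and the fields, domains and sample rows are invented for the example.

```python
from itertools import product

def combination_coverage(rows, domains):
    """Fraction of all possible value combinations that appear at least once in rows."""
    fields = list(domains)
    all_combos = set(product(*(domains[f] for f in fields)))
    seen = {tuple(row[f] for f in fields) for row in rows}
    return len(seen & all_combos) / len(all_combos)

# Hypothetical field domains: 2 x 3 x 2 = 12 possible combinations.
domains = {
    "account_type": ["personal", "business"],
    "currency": ["EUR", "GBP", "USD"],
    "status": ["active", "suspended"],
}

# Production-like data: duplicates of common cases, rare cases absent.
production_like = [
    {"account_type": "personal", "currency": "EUR", "status": "active"},
    {"account_type": "personal", "currency": "EUR", "status": "active"},
    {"account_type": "business", "currency": "USD", "status": "active"},
    {"account_type": "personal", "currency": "GBP", "status": "active"},
]

# Synthetic data: exactly one row per combination, so no duplication.
synthetic = [dict(zip(domains, combo)) for combo in product(*domains.values())]

production_coverage = combination_coverage(production_like, domains)  # 3 of 12
synthetic_coverage = combination_coverage(synthetic, domains)         # 12 of 12
```

Enumerating every combination is only practical for small domains; with many fields the number of combinations grows multiplicatively, so some reduction technique would be needed.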

Fast facts

Grid-Tools provides a complete suite of Test Data Management tools, which we will describe in detail in this report. These can be licensed en masse as the single product Datamaker or on a modular basis. As far as we know the company is the only vendor to support all of the methods just discussed for managing test data. Grid-Tools works with data in both flat files and databases and it can also be used to support SOA and user interface test environments.

Key findings

In the opinion of Bloor Research the following represent the key facts of which prospective users should be aware with respect to Grid-Tools Test Data Management:

  • Grid-Tools supports database copying with masking, sub-setting combined with masking and the generation of synthetic data with full support for referential integrity.
  • Test data, however derived, is stored in a test data warehouse. This allows you to manipulate, filter and, in the case of synthetic data, re-generate the data without having to access the production database. Re-generation, in particular, supports an agile approach to development and testing.
  • To support both masking and synthetic data creation, as well as archival, Grid-Tools provides data profiling capabilities in order to understand data relationships that exist within the data, regardless of whether these are explicit or implicit. The product can also link to third party tools that can expose relevant data models.
  • The product has advanced coverage support.
  • Synthetic data may include a defined percentage of errors for testing purposes.
  • Two different data masking products are provided: one uses native database connectors for popular databases and the other uses generic connectors for everything else. As a result, the former is very fast at masking but the latter is not. The former also has extended functionality. We expect (and hope) to see more data sources moving into the native category as time progresses.
  • There is a module designed specifically to support SOA testing.
  • A workflow engine is built into the product.
  • There is a module available to identify trends within your data (how its profile is changing) that can be used in the generation of data to cater for future requirements.
