Building the Subset Logic
Taken from A Short Guide to Database Subsetting
Written by Huw Price and produced by Grid-Tools Ltd © Grid-Tools Ltd
Deciding on which data to be extracted is probably the most complex part or a project, it will usually break down into a series of components, these are:
- The initial selection criteria, usually at an initial driving table.
- The sub tables associated with the driving tables and how these spread out across the database. These sub tables may themselves have some kind of criteria, for example only extract active orders, do not extract more than a 100 items per bill etc.
- The intersection or union of different parts of an application. The subset database will be used by different testing teams each with their own needs. Each team will need data from other parts of the database and a superset of data needs to be built. You may, for example have providers and claimants in a health care application, the common components of these would be claims. You would have to include a set of claims that satisfies the testing for both testing teams.
- Managing in-flight data. If an application is in constant usage you need to pick a time when transactional data is not flowing through the system or adjust your selection criteria to exclude this data.
Before spending too much on building your subsetting logic a very worthwhile exercise is to categorize your tables.

Figure 6 –Categorize Tables into Large, Small, Subset, Reference & Ignore
The categories I tend to use are as follows:
- Small – Move no matter what, it is not worth the effort subsetting them. I use about 10,000 as my threshold here.
- Large – Should be subsetted, over a million. It is probably worth obtaining some statistics from your DBA here, tables which are growing rapidly should be added to this category.
- Reference – Should be moved as it would affect the processing of new rows when testing. In other words if you have a large pricing table the tester would expect all prices to be available when testing not just the ones that have actually been used in production.
- Ignore – These tend to be work, transient or interface tables. These may well get populated during processing in the testing environment. This can also be used for tables which fall into the buckets “haven’t a clue what this is for” or “I think this an old table that isn’t used any more”. As you iterate through and validate that you can actually work with the subset database this list will shrink.
- Subset – Tables to be subsetted.
Once you have completed the categorization you can perform some pre-validation of your potential subset database. The types of issues that can be resolved prior to actually running are:

Figure 7 – Example of a pre subset check
You need to be aware of database constraints when loading data into your subset database. Generally, I would suggest disabling foreign key constraints and then re-enabling them after population, life’s too short! However you can calculate the load order based on the constraints.

Figure 8 –A calculated load order
This load order should be used when loading the data into your subset database.
Before beginning a subsetting project you will need to spend time understanding your data. In reality you do not need to understand all of the data relationships, see categorization above, however understanding the main transactional data is key to building a useful test database.
To begin building up a picture you will need to gather all of the available “free” information surrounding the key tables to be subsetted, this includes:
- Foreign Keys. How are tables related in the database?
- Documentation. This is usually held in a variety of formats and applications, however, they are rarely current.
- User knowledge. What is the users understanding of how and where key data is held and displayed?
- Naming standards. A surprisingly good source of information, column names in tables can give a strong hint to their use and relationship to other columns.
Once you have gathered a basic picture you will need to investigate the data itself to verify any documentation and try to understand in detail where the data is held and how it relates to other data. There are a number of problems common across most systems, these include:
- Data columns being used for multiple purposes. It is quite common for limitations in an application to be overcome by creative use of fields. Thus a field used for one purpose contains data to identify data for other uses
- Invalid Data, see figure 1. As applications and databases evolve and merge with other systems data may be created that is invalid. Users usually have an idea that this invalid data exists however have made the decision to ignore the data problems as there is no critical problem that would justify the time to clean up the data. When it comes to subsetting you need to decide whether to extract the invalid data or leave it behind. Orphan records are a typical example of this, they will generally not be extracted as they do not link to parent rows.

Figure 9 – Relationships and Table Categorization
