Taken from A Guide to Database Subsetting
Written by Huw Price and produced by Grid-Tools Ltd © Grid-Tools Ltd
If an organization has been running an application for many years there will, inevitably, be a build-up of information within the database, across any associated file storage and in data passed to other applications. Run times of month end jobs get longer and the need to constantly upgrade production and testing hardware is common.
Many home grown and virtually all packaged applications have no viable archiving strategy and data tends to grow inside databases, over time your main transaction tables will include transactional data spanning many years. Users are usually very reluctant to remove data from an application as they are worried that there will be unforeseen consequences and cannot spare the time to investigate fully all effects of deleting data. If you are considering building an archiving strategy then you could do worse than use database subsetting as a step toward building an archiving solution as all of the steps and methods can be transferred almost directly between the two projects, one man’s subset is another man’s archive!
While database growth in production can be managed by applying Moore’s law, i.e. that technology will double exponentially every two years. It is rare that the same high spec equipment is used in development and testing. The result of this is that typically developers and testers will be using full size production data on low powered hardware, a job that runs in three hours in production could easily run for six in development. Reducing the run time of jobs in development provide a powerful argument for database subsetting, however there are many more which should be considered:
All project managers have to fight with budgets and cost justify any new procedures and methodologies. One perennial problem with infrastructure projects, which database subsetting is, is who pays for them? Nowadays project funding is usually controlled by users of the application who will carefully cost out a new feature and project teams are expected to bring in the new features in or under budget. Presenting a project proposal which would just seem to benefit the development team can be tricky and will usually get “bumped” as immediate revenue generating projects will take precedence, bringing to market a new type of investment bond as quickly as possible is more important than taking your top analyst off a project for a month.
For these reason the cost of infrastructure projects tend to be spread across multiple cost centres, however this means that you now have to get the buy in of all users which can of course mean you get into politics and usually involves some kind of horse trading. Some organizations will already have budgets set aside for infrastructure projects and if you are lucky and can tie the subsetting projects to cost savings then you are in with a chance of implementing earlier rather than later.
One of the most effective cost justification areas is disk, in a large organization it is not uncommon for up to twenty copies of production sized database to exist. These copies can be used for development, unit testing, regression testing, performance testing and so forth. Usually copies of production are taken once are twice a week and then the copy is then moved into different regions when requested. When calculating the cost of disk the management of the copies should be considered, while a one terabyte disk for your PC may cost $800 the internal cost is much higher. Most organization nowadays have a much more realistic cost per megabyte, while it might seem that disk is cheap when you add in the cost of managing the disk, backups, DBA time, disaster recovery etc the cost quickly rises. A typical internal cost of $200 a Gigabyte is not uncommon.
While you will not eliminate all full sized copies and indeed you may actually increase the total number of test and development databases, see later, the cost savings are quite straight forward and can be expressed in a simple formula:
D = Cost per Gigabyte
S = Database size in Gig
NP = Number of current production copies
Perc = Percentage reduction in size
Cost Savings = NP * D * S – (NP * Perc * D * S)
An example would be:
D = $200
S = 1 Terabyte = 1,000 Gig
NP = 18
Perc = 5%
Cost saving = 1,000 * 18 * 200 – (1,000,000*0.05*200 * 18) = $3,420,000
In reality you are a likely to increase the number of database copies so this number may be closer to $2.5Million however this is still a significant number and should allow you to cost justify the initial and ongoing costs easily. It should however be remembered that this number only really makes sense at the time you are forced to buy more disk which is usually when you are just about to run out of space. CIOs should try and plan ahead and try and implement subsetting prior to crisis points allowing them to negotiate better deals with service and infrastructure providers.
Testing with large production size databases has many challenges especially when it comes to testing batch processes, typically these are end of period processes for example close out this months’ transactions and post the balances to the general ledger. These batch jobs can run for a long time and if an error occurs can require the entire database to be reset or rolled back, it is not uncommon for batch jobs to run for several hours and for end to end processes to span tens of hours.
One of our customer’s end of month processes took 30 hours to complete, testing any changes to this on a full size production copy meant that only one or two changes could be made to the production codes set as testing just took too long. After building a subset database the month end run was reduced to 3 hours allowing many more changes to their production code sets.
Typically many developers or testers will use the same database at the same time to test an application. From time to time users will request a database refresh and a new copy of production data is copied over and testing continues. The problem, of course with this approach is that some testers may be in the middle of a complex series of tests where data has been entered manually and this data will now be lost. The use of subset databases can solve this problem, different teams can “draw down” subset databases without interfering with other users.
One of our customers who had invested time in database subsetting changed their development strategy and each developer would import a subset database into their own local RDBMs, in this case Oracle, for each new piece of development work. The DBAs automatically prepared several different flavours of subset databases each night and the developers would choose and import the appropriate one using standard database utilities.
There are some other benefits which are not immediately obvious but can dramatically improve the entire testing and development framework. These include:

Figure 1. A Data Check Report
Figure 2. Data Browsing and Relational Data Editing