"The Datamaker tool had features we’d never thought of. It not only fulfilled the expectations of our development team, it exceeded them! It was taking us over 20 hours to create data from one row or schema. Once we started using Datamaker for test data creation, it was taking us less than 2-3 hours"
Large financial services organization
Every system needs to be tested, and every test needs data. Whether it’s data about customers in a CRM system, products and parts in a commerce environment, or stock market prices for a fiscal modeling application, without data, you’ve simply got nothing to test your system against. How will you provision that data?
Many companies use real data, taken from their production databases. However, production data doesn’t actually test a system thoroughly: most of it follows predictable patterns, with only a few cases really pushing the boundaries. It’s also a bad idea from a security point of view. At best, it’s risky, relying not only on engineers never acting maliciously, but also on them never making an accidental mistake. At worst, it’s illegal, where standards and regulations such as PCI-DSS, HIPAA, or data protection laws apply.
Data masking provides a fast and effective solution to this problem. By identifying the sensitive information in the production data and masking or ‘scrambling’ it, a copy of the data can be created that carries significantly lower risk and satisfies most legislative requirements. It’s easy to understand and easy to implement.
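As a rough illustration of what that masking step involves, the sketch below replaces each sensitive column value with a plausible fabricated substitute. It is a minimal, hypothetical example written in Python with the open-source Faker library; the table name, column names, and DataFrame layout are assumptions for the purpose of illustration, not a description of any particular tool.

```python
import pandas as pd
from faker import Faker

fake = Faker()

def mask_customers(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of a customer table with sensitive columns scrambled."""
    masked = df.copy()
    # Swap personally identifiable values for realistic fakes,
    # keeping the shape and types of the data intact.
    masked["first_name"] = [fake.first_name() for _ in range(len(df))]
    masked["last_name"] = [fake.last_name() for _ in range(len(df))]
    masked["email"] = [fake.email() for _ in range(len(df))]
    masked["credit_card"] = [fake.credit_card_number() for _ in range(len(df))]
    return masked
```

The original rows are never modified; the masked copy is what gets handed to development and test environments.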
Masked data is not perfect, however. Firstly, while it might be sufficient for testing business cases that you’ve previously encountered in the wild, it doesn’t do much to help you test for the situations of tomorrow. If you’re developing a system that is designed to exceed the scope of your previous operations, you need data to match that. Secondly, even if the properties of your data aren’t changing, you might simply need more of it – and data masking can only give you as much data as you started with. Thirdly, while data masking makes the extraction of sensitive information very difficult, it’s not unbreakable; outliers in the data can sometimes be identified by cross-referencing them against other information sources.
The best solution to these problems is to extend your data with additional synthetic (‘fake’) data.
Synthetic data can be created according to your system specification, ensuring that you test everything that the system is intended to do in the future, and not merely those things it has done in the past. Any quantity of data can be generated, allowing you to turn 250,000 rows of masked production data into 2 million rows as quickly as your RDBMS will permit. Lastly, sensitive information can never be extracted from the synthetic data, because there’s nothing to extract.
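To make the scale point concrete, here is a minimal sketch of how additional synthetic rows might be generated in bulk. It again uses Python with the Faker library purely as an illustration; the column set, the flat tax rate, and the target row count are assumptions chosen to mirror the numbers above, not the output of any specific product.

```python
import csv
import random
from faker import Faker

fake = Faker()
TARGET_ROWS = 2_000_000  # scale up from the masked data set as needed

with open("synthetic_orders.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["order_id", "customer_name", "order_date", "total", "tax"])
    for order_id in range(1, TARGET_ROWS + 1):
        total = round(random.uniform(10, 5000), 2)
        writer.writerow([
            order_id,
            fake.name(),
            fake.date_between(start_date="-5y", end_date="today"),
            total,
            round(total * 0.2, 2),  # assumed flat 20% tax rate for illustration
        ])
```

In practice the generated rows would be bulk-loaded into the test database rather than written to a file, but the principle is the same: volume is limited only by how fast the data can be produced and inserted.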
Imagine you’re developing a new invoicing system, and you want to include a facility for giving discounts to customers who’ve spent more than $100,000 with you in the past – but it’s not a situation you’ve actually encountered yet. To properly model what this situation would be like, you need not just the customer’s data and the order that should be discounted, but also the $100,000 worth of past orders they’ve made. By describing the data you need to the generator – first and last names, credit card numbers, order dates and totals, tax amounts, and so on – it can generate all the rows required, across all the relevant tables, ensuring that referential integrity is maintained.
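A sketch of what that generation might look like follows. It is a hypothetical Python example using the Faker library: the table structures, the `customer_id` foreign key, and the $100,000 threshold logic are assumptions made to mirror the discount scenario, not the workings of any particular generator.

```python
import random
from faker import Faker

fake = Faker()

def generate_high_value_customer(customer_id: int):
    """Generate one customer plus enough past orders to exceed $100,000,
    keeping the customer_id foreign key consistent across both tables."""
    customer = {
        "customer_id": customer_id,
        "first_name": fake.first_name(),
        "last_name": fake.last_name(),
        "credit_card": fake.credit_card_number(),
    }
    orders, lifetime_spend = [], 0.0
    while lifetime_spend <= 100_000:        # add orders until the discount threshold is crossed
        total = round(random.uniform(500, 5000), 2)
        orders.append({
            "order_id": len(orders) + 1,
            "customer_id": customer_id,     # referential integrity: every order points at its customer
            "order_date": fake.date_between(start_date="-3y", end_date="today"),
            "total": total,
            "tax": round(total * 0.2, 2),   # assumed flat 20% tax rate for illustration
        })
        lifetime_spend += total
    return customer, orders
```

Because the customer row and the order rows are produced together, the foreign-key relationship is correct by construction, and the test can exercise the discount rule against a lifetime spend that genuinely exceeds the threshold.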