Why masked data is no good!
Posted on 31. Mar, 2010 by Huw Price in Data Masking, Synthetic versus masked data, Test Data Creation, Test Data Management
I’m a keen, bright developer at a bank, working on a new report. The DBAs have given me a full copy of masked production data. To test, I need to find some data that has changed over time. Where should I start? I think my boss has just had a big pay rise, so let’s try to find her. Already I have three pieces of information: a) the sex is female; b) the monthly direct debit has increased by over 10%; and c) it happened in the last 30 days. There are a million customers in the bank: a) reduces them to 500,000, b) reduces them to 2,212 and c) reduces them to 38. Now I have a list of 38 people. Let’s look for the date the annual company bonus is paid and bingo, there she is! What a clever developer I am. Now I can run off my reports and present them to my boss. Won’t she be pleased?
Good old human curiosity will generally find a way. Most complex systems hold so much information that finding the intersection of a few data points will usually get you to the record you want. Look at pretty much any HR or healthcare system: there are so many data points that trying to second-guess human curiosity is mind-boggling. Changing a name and address is not enough!
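To make the narrowing-down in the boss example concrete, here is a minimal Python sketch of the three filters applied one after another. The record layout (MaskedCustomer and its field names), the four sample rows and the dates are all invented for illustration; a real masked copy would have its own schema and a million rows rather than four.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class MaskedCustomer:
    # Hypothetical layout: names and addresses are masked, the rest is untouched.
    customer_id: int    # surrogate key, not the real account number
    sex: str            # "F" or "M"
    old_debit: float    # previous monthly direct debit
    new_debit: float    # current monthly direct debit
    changed_on: date    # date the direct debit changed


def narrow_down(customers, today, min_increase=0.10, window_days=30):
    """Apply filters a), b) and c) in turn and return the survivors of each step."""
    # a) the sex is female
    step_a = [c for c in customers if c.sex == "F"]
    # b) the monthly direct debit has increased by more than 10%
    step_b = [
        c for c in step_a
        if c.old_debit > 0
        and (c.new_debit - c.old_debit) / c.old_debit > min_increase
    ]
    # c) the change happened in the last 30 days
    cutoff = today - timedelta(days=window_days)
    step_c = [c for c in step_b if cutoff <= c.changed_on <= today]
    return step_a, step_b, step_c


today = date(2010, 3, 31)
customers = [
    MaskedCustomer(1, "F", 1000.0, 1200.0, date(2010, 3, 15)),  # passes all three
    MaskedCustomer(2, "F", 1000.0, 1050.0, date(2010, 3, 20)),  # increase too small
    MaskedCustomer(3, "M", 1000.0, 1500.0, date(2010, 3, 10)),  # fails filter a)
    MaskedCustomer(4, "F", 800.0, 1000.0, date(2009, 12, 1)),   # changed too long ago
]

a, b, c = narrow_down(customers, today)
print(len(a), len(b), len(c))  # 3 2 1
```

None of the filters touches a name or an address, which is why masking those fields alone does nothing to stop the search.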
The only safe way is to generate the data based on the characteristics of production, not the actual production data!
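As a rough illustration of what generating from the characteristics of production could look like, the Python sketch below reduces a table to a few aggregates and then draws brand-new rows from them. This is an assumption about one simple way to do it, not a description of any particular tool; the function names (profile, generate), the field names and the sample figures are all made up, and a real profile would need far more care, since aggregates over very small groups can still leak information.

```python
import random
import statistics
from collections import Counter


def profile(records):
    """Reduce the table to aggregates only; no individual row is carried forward."""
    debits = [r["debit"] for r in records]
    sex_counts = Counter(r["sex"] for r in records)
    return {
        "sex_share": {s: k / len(records) for s, k in sex_counts.items()},
        "debit_mean": statistics.mean(debits),
        "debit_stdev": statistics.pstdev(debits),
    }


def generate(stats, count, seed=0):
    """Draw new rows that follow the aggregates but correspond to no real customer."""
    rng = random.Random(seed)
    sexes = list(stats["sex_share"])
    weights = [stats["sex_share"][s] for s in sexes]
    rows = []
    for i in range(count):
        debit = rng.gauss(stats["debit_mean"], stats["debit_stdev"])
        rows.append({
            "customer_id": i + 1,
            "sex": rng.choices(sexes, weights=weights)[0],
            "debit": round(max(0.0, debit), 2),  # clip so no debit is negative
        })
    return rows


# Invented stand-in for the production table.
production = [
    {"sex": "F", "debit": 1200.0},
    {"sex": "F", "debit": 950.0},
    {"sex": "M", "debit": 1100.0},
    {"sex": "M", "debit": 700.0},
]

stats = profile(production)
synthetic = generate(stats, count=5)
print(len(synthetic))  # 5
```

Because each synthetic row is sampled from the summary rather than copied and disguised, there is no real person behind it for a curious developer to track down.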




Terry
12. Jul, 2010
Scary stuff! What would Orwell say??
Mike
12. Jul, 2010
All data masking is equal, but some is more equal than others? I’m putting Grid-Tools amongst the latter!
Huwprice
12. Jul, 2010
All data masking is equal, but not all data is equal after masking! That’s the point here. The ability to use data generation in combination with masking, or as an alternative, can really improve the quality of the data being used!