When security is a concern, synthetic data generation is unmatched
Posted on 09. Dec, 2009 by JamesKoopmann in Test data
Gartner reports that more than 80% of companies will use live production data for non-production purposes, including BI, QA, development, and testing. By now we all know the problems with moving production data into non-production, non-secure environments. Really, how many times do we have to read about data theft? The root of the problem is that these systems need data. So why do we just copy production data? I’ll tell you: it’s just too darn easy to get what you want, and for the most part the intentions not to leak information are good.
Well, most data centers are starting to read the writing on the wall and are beginning to discourage simply copying data from production system to test system to QA system to who knows where. Data centers are now implementing one of two methods, masking data or creating synthetic data, to help secure sensitive information in transit to these secondary systems. But how secure is each of these methods? Is one better than the other?
Data masking is easy to understand: it obscures specific pieces of information, replacing sensitive values with realistic data that does not identify the data it replaced. So if John Doe had an SSN, data masking would replace it with another SSN that could not be traced or reverse engineered back to John Doe’s original SSN. But how secure is this method of masking data? Two very typical failure scenarios are:
- The inability to understand what needs to be masked. Let’s face it: databases are complex, the information is often obscure, and it takes years to master the schemas of these systems. Unfortunately, many database systems are managed by teams that do not fully understand the data. That reduces their ability to mask all the sensitive data a database contains to nothing more than a shot in the dark. Sensitive information ends up copied throughout the enterprise, leaving a gaping security hole.
- The inability to use a masking tool effectively. Call this operator error, but many times the copy is performed before masking takes place, either unknowingly or with the intent to mask on the target system. In both cases sensitive information has left the production system, traveled across the network, and landed on an unsecured system. Again, the data is left vulnerable while it travels on the network and for as long as it sits unmasked on the target system.
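To make the two scenarios concrete, here is a minimal sketch of masking applied on the production side, before anything is copied. It is not how any particular masking product works: the key, the function names `mask_ssn` and `mask_rows`, and the row layout are all made up for illustration, and a keyed hash is just one possible way to derive a replacement value.

```python
import hashlib
import hmac

# Hypothetical secret for this sketch; it would have to stay on the production side.
MASKING_KEY = b"example-key-kept-on-production"


def mask_ssn(ssn: str, key: bytes = MASKING_KEY) -> str:
    """Derive an SSN-shaped substitute from a keyed hash of the original value."""
    digest = hmac.new(key, ssn.encode("utf-8"), hashlib.sha256).digest()
    # Reduce the first 8 bytes of the digest to 9 decimal digits.
    digits = f"{int.from_bytes(digest[:8], 'big') % 10**9:09d}"
    return f"{digits[:3]}-{digits[3:5]}-{digits[5:]}"


def mask_rows(rows, sensitive_columns):
    """Return masked copies of the rows; the originals are left untouched."""
    masked = []
    for row in rows:
        copy = dict(row)
        for column in sensitive_columns:
            if column in copy:
                copy[column] = mask_ssn(copy[column])
        masked.append(copy)
    return masked


production_rows = [{"name": "John Doe", "ssn": "123-45-6789"}]
# Only the masked copies should ever be shipped to a test or QA system.
safe_to_copy = mask_rows(production_rows, ["ssn"])
```

Both failure modes show up here. Any column left out of `sensitive_columns` passes through untouched (note that `name` is still in the clear), and if `mask_rows` is only called after the copy, the real values have already left production. A derived value like this could also coincide with someone else’s real SSN, so treat it as an illustration of the idea rather than a vetted masking scheme.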
Synthetic data generation, on the other hand, is the process of creating realistic data through a detailed process of data anonymization, that is, the creation of data that has no real identity. Its most important aspect is that, because it doesn’t strictly rely on production data, rules or conditions can be defined to generate data that tests aspects of an application or system that normal production data might not be able to. How secure is synthetic data generation?
- Clearly, synthetic data generation is the ultimate in protecting the privacy of real production data. No longer are test systems tied to production data and no longer is production data flying through the enterprise unsecured.
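As a rough illustration of what rule-driven generation can look like, here is a small sketch that builds customer records from nothing but a seed and a few rules, then appends edge cases that production data might never contain. The field names, name lists, value ranges, and the fixed reference date are all assumptions for the example; a dedicated tool would work from the real schema and its constraints.

```python
import random
from datetime import date, timedelta

# Made-up name pools; no production data is read at any point.
FIRST_NAMES = ["Alex", "Sam", "Jordan", "Casey"]
LAST_NAMES = ["Smith", "Lee", "Garcia", "Novak"]


def synthetic_customer(rng, customer_id, today):
    """Build one customer record that satisfies some simple rules."""
    age_days = rng.randint(18 * 365, 90 * 365)  # rule: roughly 18 to 90 years old
    return {
        "id": customer_id,
        "name": f"{rng.choice(FIRST_NAMES)} {rng.choice(LAST_NAMES)}",
        "birth_date": today - timedelta(days=age_days),
        "balance": round(rng.uniform(0, 10_000), 2),
    }


def generate_customers(count, seed=0, today=date(2010, 1, 1)):
    """Generate `count` ordinary customers plus two deliberate edge cases."""
    rng = random.Random(seed)  # seeded, so a test run can be reproduced
    rows = [synthetic_customer(rng, i + 1, today) for i in range(count)]
    # Conditions that normal production data might not cover.
    rows.append({"id": count + 1, "name": "Edge Oldest",
                 "birth_date": date(1900, 1, 1), "balance": 0.0})
    rows.append({"id": count + 2, "name": "Edge Overdrawn",
                 "birth_date": today - timedelta(days=18 * 365), "balance": -0.01})
    return rows


test_data = generate_customers(5, seed=42)
```

Because every value comes from the generator, there is nothing in `test_data` that could be traced back to a real person, and the same seed always yields the same data set.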
I enjoy some banter on this subject, as to me synthetic data generation seems to be the most secure method of providing test data throughout the enterprise. I can see how many would be taken aback by what seems to be a very complex initiative. But with tools such as Datamaker from Grid-Tools, I see this aversion to creating test data being reduced. Plus, with the added value of ensuring representative data is always available for testing, synthetic data generation can be extremely beneficial in providing data that goes far beyond raw production data while encompassing relationships, properties, and nuances of data not normally seen in production.




Per
01. Feb, 2010
When testing systems that need more than 40 years of historical data, I can’t see how to use synthetic data generation. Has anyone tried?
To get realistic data for these systems, I think you have to start with production data and then perhaps add synthetic data to it. I think it is far too hard to generate data that will behave like production data that was first registered 30 years ago and has gone through a number of migrations etc.
As for the security issue, I think you can solve it by having a secured masking environment where you land your data and perform the necessary masking actions.