Much has been said about how big data will help solve many of the world’s thorniest problems, including pandemics, hunger, cancer treatment, and conservation. However, given the seriousness of those problems and the complexity of big data and its analysis, a great deal of testing is required before any results can be considered trustworthy. Unfortunately, most businesses and organizations lack the in-house capability to perform that testing, so the normal procedure has been to outsource the work to third-party vendors.
The operative phrase is “has been.” Big data, more often than not, contains sensitive information about the individuals an organization serves, and releasing that information to outside parties may put the organization or business in violation of state and federal privacy regulations.
A possible solution to this privacy issue
Three researchers at MIT may have figured out a way to ease those privacy concerns. In their paper The Synthetic Data Vault (PDF), principal researcher Kalyan Veeramachaneni and researchers Neha Patki and Roy Wedge describe a machine-learning system that automatically creates what they call “synthetic data.” The team’s original goal was to produce artificial data for developing and testing algorithms and analytical models. Besides giving data scientists as much data as they need, synthetic data can be generated so that individual records cannot be traced back to any real person.
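To make the idea concrete, here is a minimal sketch of one common synthetic-data technique: fit the per-column statistics and the correlation structure of a numeric table, then sample new rows from that fitted model. This is an illustration of the general approach, not the Synthetic Data Vault's actual implementation, and the column names and numbers below are invented for the example.

```python
import numpy as np

def fit_synthesizer(data):
    """Capture each column's mean/std plus the correlation between columns."""
    mean = data.mean(axis=0)
    std = data.std(axis=0)
    # Correlation of the standardized columns captures inter-column structure.
    corr = np.corrcoef(((data - mean) / std).T)
    return mean, std, corr

def sample_synthetic(model, n_rows, seed=0):
    """Draw synthetic rows that mimic the fitted marginals and correlations."""
    mean, std, corr = model
    rng = np.random.default_rng(seed)
    # Sample correlated standard-normal noise, then restore scale and location.
    z = rng.multivariate_normal(np.zeros(len(mean)), corr, size=n_rows)
    return z * std + mean

# Hypothetical sensitive table: an age-like and an income-like column.
rng = np.random.default_rng(1)
original = np.column_stack([
    rng.normal(40, 10, 1000),       # age-like column
    rng.normal(55000, 12000, 1000), # income-like column
])

model = fit_synthesizer(original)
synthetic = sample_synthetic(model, 1000)
```

No row of `synthetic` corresponds to a row of `original`; only aggregate statistics carry over, which is the property that lets such data be shared with outside testers. Real systems, including the one the paper describes, model far richer structure (non-normal distributions, categorical columns, relationships across tables), but the fit-then-sample pattern is the same.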
To start, the researchers needed to decide what synthetic data would look like. They came up with the following requirements:
- To ensure realism, the data must resemble the original…