The need for smart(er) data protection mechanisms

Samuel Cristobal

2017-11-07 09:35:06

Open data is relatively scarce in the aeronautics field. Many data owners are still reluctant to share their data, even within exclusive partnerships. Data has always been considered a major asset, and it is becoming increasingly valuable with the explosion of business intelligence. Nevertheless, most data owners agree that sharing some data can benefit the community, particularly for shared objectives such as improved aviation safety. However, sharing only what the data owner considers non-critical is often not sufficient. For example, most airlines would not share the specific day of a particular operation, because it could be used to identify specific flights as well as the crew.

A common approach to data de-identification is to delete the sensitive fields from the source or simply replace their values with a placeholder. This approach causes a significant loss of information and can impede data analysis. It is impossible to know a priori whether a removed feature would have a major impact on later data mining exercises. A careful preliminary analysis would therefore be needed to verify the influence of the removed variables on the model, but that analysis cannot be performed without the private data.

Additionally, flight data might need to be linked with other sources. For instance, the date and time of operation might be needed to link a flight with the corresponding weather reports, or with other resources that help explain the context, e.g. ATM constraints. Here again, it is not possible to determine a priori which specific dependencies have a relevant impact on the data mining models.
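To make the linking problem concrete, the sketch below shows one possible way to keep two sources joinable without storing the raw date and time: replacing the join fields with a keyed hash (HMAC). This is only an illustration under stated assumptions, not the project's actual method. The field names, the example records, the hour-level time bucket and the secret key are all hypothetical.

```python
import hashlib
import hmac

# Hypothetical secret, assumed to be known only to the data owners and
# never stored in the shared repository.
SECRET_KEY = b"example-key-known-only-to-data-owners"


def obfuscate_key(airport: str, date_hour: str, key: bytes = SECRET_KEY) -> str:
    """Return a keyed hash of the join fields (airport + hour of operation)."""
    message = f"{airport}|{date_hour}".encode("utf-8")
    return hmac.new(key, message, hashlib.sha256).hexdigest()


def protect(records, key_fields=("airport", "date_hour")):
    """Replace the identifying join fields of each record with a single token."""
    protected = []
    for record in records:
        token = obfuscate_key(*(record[f] for f in key_fields))
        rest = {k: v for k, v in record.items() if k not in key_fields}
        protected.append({"token": token, **rest})
    return protected


def merge_on_token(left, right):
    """Join two protected data sets on the obfuscated token."""
    index = {row["token"]: row for row in right}
    return [{**row, **index[row["token"]]} for row in left if row["token"] in index]


# Made-up example data, for illustration only.
flights = [
    {"airport": "LEMD", "date_hour": "2017-11-07T09", "go_around": True},
    {"airport": "LEBL", "date_hour": "2017-11-07T10", "go_around": False},
]
weather = [
    {"airport": "LEMD", "date_hour": "2017-11-07T09", "wind_kt": 25},
    {"airport": "LEBL", "date_hour": "2017-11-08T10", "wind_kt": 5},
]

merged = merge_on_token(protect(flights), protect(weather))
```

Here only the first flight finds a matching weather record, and the merged row carries the token instead of the airport and date. The approach only works if both sources are bucketed at the same granularity, and its protection depends entirely on the key staying with the data owners.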

Both problems are critical within the project, and two tools have been developed to address them. The first, Secure Multiparty Computation, allows distributed mathematical functions to be computed in such a way that the participants do not share any of their input data and only the outcome of the algorithm is known to all participants at the end. This solves the first problem, de-identification. The second tool is Smart Data Protection: by storing obfuscated information in the data repository, different data sources can still be merged without disclosing the original information.

However, having the right tools is not sufficient, as they need to be supported by the right infrastructure. Each data provider has its own local environment, either physically on premises or in a private cloud infrastructure, that is accessible only to that provider. A shared environment fuses the information from the local nodes and coordinates the data mining tasks in such a way that private data is always kept in the local environment and only references to this data are communicated to the shared environment.
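To illustrate the idea behind Secure Multiparty Computation, here is a minimal sketch of a secure sum based on additive secret sharing, one of the simplest building blocks of such protocols. It is not the project's implementation. The three "airlines", their event counts and all function names are hypothetical, and the parties are simulated in a single process rather than over a network.

```python
import secrets

# Modulus for the share arithmetic; all private values must be smaller.
PRIME = 2**61 - 1


def make_shares(value: int, n_parties: int) -> list[int]:
    """Split a value into n random shares that sum to it modulo PRIME."""
    shares = [secrets.randbelow(PRIME) for _ in range(n_parties - 1)]
    last = (value - sum(shares)) % PRIME
    return shares + [last]


def secure_sum(private_values: list[int]) -> int:
    """Simulate a secure sum: no party ever sees another party's value."""
    n = len(private_values)
    # Step 1: each party splits its value and sends share j to party j.
    distributed = [make_shares(v, n) for v in private_values]
    # Step 2: party j adds up the shares it received, one from each party.
    partial_sums = [
        sum(distributed[i][j] for i in range(n)) % PRIME for j in range(n)
    ]
    # Step 3: the partial sums are published and combined into the result.
    return sum(partial_sums) % PRIME


# Made-up counts of a safety event at three hypothetical airlines.
event_counts = [12, 7, 30]
total = secure_sum(event_counts)
```

Each individual share is a uniformly random number, so a party learns nothing about the others' counts from the shares it receives; only the final total is revealed to everyone. Real deployments also have to deal with secure channels, dishonest participants and functions more complex than a sum.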