Data Protection is key as part of the data science process

David Perez

2018-04-18 10:30:11

If the General Data Protection Regulation (GDPR), Regulation (EU) 2016/679, the EU law on data protection and privacy for all individuals within the European Union, was not enough to stress the importance of protecting personal data properly, we now also have the scandal over how Cambridge Analytica handled personal Facebook data.

The data protection discussion around this latest scandal centres on three questions. The first is whether all the data was collected with the authorisation of the users. The second concerns the transfer of data between different entities, which can mean losing track of the data's original protection requirements. The third concerns the use and deletion of the data, which, according to Facebook, was apparently certified only by a dishonest declaration.

Although this particular case, like the GDPR, is about personal data, important lessons can and should be learned for the protection of confidential data in general, such as air transport operations data from airlines, airports, air navigation service providers, etc.

The first key lesson is that the trust of data owners is very difficult to gain and very easy to lose, often forever. Those of us who are trusted with third-party data must honor this trust and do everything we can to ensure that our partners, customers and suppliers respect the data protection policies we are given.

However, as apparently happened to Facebook, trusting or even regulating how downstream partners use the data is not enough. Regulating data usage is the first step, but data protection sits at the center of the data science process. The way data is ingested, stored, and fused with other data must be designed with data protection in mind. So must the analytics themselves, for instance by using advanced secure computation techniques that avoid exposing confidential data while still allowing the analytics authorised by the data providers. Last but not least, “certifications” of any kind are not enough: strong security audit procedures (and the technology supporting them) need to be designed so that analytics audits can be performed efficiently with minimal resources (and therefore frequently).
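To make "secure computation" a little more concrete, here is a minimal sketch of additive secret sharing, one common building block of secure multi-party computation. It is an illustration under stated assumptions, not a description of any production system: the function names, the modulus and the delay figures are all hypothetical, and a real deployment would also need secure channels between parties and protection against dishonest participants.

```python
import secrets

# Illustrative modulus for share arithmetic (an assumption, not a standard).
PRIME = 2**61 - 1


def make_shares(value: int, n_parties: int) -> list[int]:
    """Split value into n additive shares that sum to value modulo PRIME."""
    shares = [secrets.randbelow(PRIME) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % PRIME)
    return shares


def secure_sum(private_values: list[int]) -> int:
    """Compute the total without any party seeing another party's raw value."""
    n = len(private_values)
    # Each data provider splits its own value; share j is sent to party j.
    all_shares = [make_shares(v, n) for v in private_values]
    # Each party adds up the shares it received, which look random on their own.
    partial_sums = [
        sum(all_shares[i][j] for i in range(n)) % PRIME for j in range(n)
    ]
    # Only the partial sums are published and combined into the final result.
    return sum(partial_sums) % PRIME


# Hypothetical total delay minutes reported by three airlines.
delays = [1200, 850, 430]
total = secure_sum(delays)  # equals 1200 + 850 + 430 = 2480
```

Each provider learns the agreed aggregate, but no single share reveals anything about an individual provider's figure.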

We trust that our partners follow the regulations that come with the data we manage, but we also complement this trust with technology: tools that guarantee data de-identification; smart data fusion (fusing de-identified datasets through cryptography); secure multi-party computation and/or other cryptographic techniques such as homomorphic encryption; cloud computing infrastructures that allow the data used to be audited; and analytic notebooks that give insight into what any analyst may be computing.
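As a rough illustration of fusing de-identified datasets through cryptography, the sketch below replaces an identifier with a keyed hash (HMAC-SHA256) so that two providers can join their records on the resulting token without exchanging the raw identifier. This is only one possible approach and an assumption on our part for the example; the field names, records and key are hypothetical, and in practice the key would be managed securely rather than written in code.

```python
import hashlib
import hmac


def pseudonymize(identifier: str, key: bytes) -> str:
    """Return a keyed hash of the identifier; the same key gives the same token."""
    normalized = identifier.strip().upper()
    return hmac.new(key, normalized.encode("utf-8"), hashlib.sha256).hexdigest()


def deidentify(records: list[dict], id_field: str, key: bytes) -> list[dict]:
    """Drop the raw identifier from each record and add its token instead."""
    result = []
    for record in records:
        clean = {k: v for k, v in record.items() if k != id_field}
        clean["token"] = pseudonymize(record[id_field], key)
        result.append(clean)
    return result


def fuse(left: list[dict], right: list[dict]) -> list[dict]:
    """Join two de-identified datasets on their shared token."""
    index = {r["token"]: r for r in right}
    return [{**l, **index[l["token"]]} for l in left if l["token"] in index]


# Hypothetical key agreed between providers; never hard-code a real key.
shared_key = b"example-key-agreed-between-providers"

# Hypothetical records from two different data providers.
airline = [{"flight": "AB123", "delay_min": 25}, {"flight": "CD456", "delay_min": 0}]
airport = [{"flight": "ab123 ", "gate": "B7"}, {"flight": "EF789", "gate": "A1"}]

fused = fuse(
    deidentify(airline, "flight", shared_key),
    deidentify(airport, "flight", shared_key),
)
```

Here `fused` contains a single record combining the delay and the gate for the one flight both providers know about, and the original identifier appears in neither the inputs to the join nor its output.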
