As the second installment in this series of posts, I will touch upon on the topic of privacy in data science and algorithms. In particular, I’m going to discuss a relatively novel concept of privacy called **differential privacy **that promises, similar to algorithmic fairness, a way of quantifying the privacy of AI algorithms.

When we, as humans, talk about privacy, we mostly refer to a desire to not be observed by others. However, what does privacy mean in the context of algorithms that “observe” us by using data that has information on us?

### What do we mean by “privacy” in the context of statistics and machine learning?

In a very general sense, we could say that privacy will be preserved if, after analysis, the algorithm that used our data (e.g. an application on our smartphones) doesn’t know anything about us. We want our machine learning models to only learn what they’re supposed to learn from data and nothing more — without retaining any information on particular individuals in the dataset.

This request was addressed in the proposition of an ad omnia guarantee of privacy in 1977 by Tor Dalenius. The proposition was then discussed by Cynthia Dwork, pioneer of differential privacy, in her paper from 2008:

*Anything that can be learned about a respondent from the statistical database should be learnable without access to the database.*

This essentially means that we should make sure that everything that a model learns about an individual is already public information. However, this is difficult to implement in statistical analysis for several reasons. First, how do we know what information is publicly available? Second, how do we selectively access any private dataset for that matter, especially in cases where the application needs certain information relevant for its purpose (for example, a model that determines whether a person suffers from some disease)?

### Differential privacy to the rescue

With those issues in mind, **differential privacy** has been developed to enable statistical analyses on datasets without compromising the privacy of individuals whose records may be contained in the dataset. As with algorithmic fairness, the ideas behind differential privacy stem from the application of algorithmic concepts to the study of privacy.

In her intuitive definition of differential privacy in The Algorithmic Foundations of Differential Privacy, Cynthia Dwork mentions its fundamental motivation:

*“Differential privacy” describes a promise, made by a data holder, or curator, to a data subject: “You will not be affected, adversely or otherwise, by allowing your data to be used in any study or analysis, no matter what other studies, data sets, or information sources, are available.”*

The last part holds one of the key ideas behind differential privacy: it aims to protect an individual’s private information regardless of whatever other information might be available to a potential attacker (perhaps in other datasets, for example).

### Why don’t we just anonymise data?

You might be wondering: Can’t we simply anonymise data? It turns out that anonymisation precisely cannot guarantee this last bit of Cynthia Dwork’s promise: that the users are protected against attacks in case there are other information sources out there.

One of the most known examples of this flaw of anonymisation of data is the Netflix Prize dataset, a large dataset of users’ movie reviews which Netflix released that contained names of movies and individuals replaced by integers. Researchers A. Narayanan and V. Shmatikov showed in their paper Robust De-anonymization of Large Sparse Datasets that, even without releasing private information about users, it was still possible to uncover private aspects of the information by using another private dataset that was independently released and anonymised. In this case, Narayanan and Shmatikov used the IMDB movie site to find individuals who rated movies both on Netflix and IMDB; they ultimately were able to de-anonymise 99% of both the names of the movies and users.

This research is just one example of how data anonymisation can be a fundamentally flawed technique with a weak promise of privacy. If an attacker has auxiliary information from other data sources, something we can’t ever really know or guarantee protection against, anonymisation is not enough. Differential privacy strengthens the promise of privacy as it protects users’ privacy even against attackers that use other datasets.

### Intuition behind differential privacy

The math that differential privacy is based on was developed years ago, and differential privacy relies on it to give a rigorous mathematical definition of privacy. Essentially, differential privacy promises that an algorithm is (differentially) private if, by looking at the output, an observer cannot tell whether any individual’s data is included in the dataset used to produce said output.

To better understand the fundamental concept of differential privacy, let’s look at an example of a simple database with only one column: user responses to a question (e.g. *Have you ever committed a crime?*), with 1 being a **YES** and 0 a **NO **(in addition to the ID column).

**Figure 1**. Simple database

Let’s say we want to know how many people in the database committed a crime, for which we construct a query counting the number of rows that contain value 1. In order to measure the privacy of a query, we are going to compare the output of the query on two databases: first, a full database, and second, a database with one entry removed.

If we remove a person from the database with the value 1 in the *committed_crime* column, the result of our query will change. That means that the removed person is leaking information into the output of the query — thus the promise of differential privacy is not fulfilled. In other words, this query is not differentially private and is vulnerable to a so-called **differencing attack**. Differential privacy is interested in whether we can construct a query such that its result does not change when any person is removed from the database.

**Figure 2**. Coin flip example

### Just add noise!

Differential privacy addresses this issue by a simple and traditional method: adding **noise** to data. For example, imagine you decide to, before sending individuals’ answers to the server, mask some answers at random according to the diagram below. That is, for each individual’s answer, you flip two coins. If the first coin is heads, do not change the answer and save as is. If it is tails, flip the second coin and change the answer according to the second coin flip. This introduces a degree of plausible deniability since an observer of the data can no longer be certain whether a value is true.

In real world applications, this is a bit more complicated. Differential privacy adds noise from **Laplacian** or **Gaussian** distribution and the decision of “how much” noise to inject is made based on which function is to be computed over the data and the acceptable amount of data leakage. Additionally, there are two kinds of differential privacy: local and global. **Local differential privacy** adds noise to each data entry (as in the coin example above) while **global** adds noise to the output of the function calculated over the data.

### The good…

Differential privacy gives a way of quantifying privacy. Indeed, it gives a mathematical definition of privacy and a way to calculate privacy budget or privacy loss. In the definition shown in the image below, taken from The Algorithmic Foundations of Differential Privacy, that quantification is expressed through parameters ϵ and δ.

**Figure 3.** Definition of differential privacy, from The Algorithmic Foundations of Differential Privacy by C. Dwork

This makes it possible for us to compare different algorithms and privacy preserving techniques. Privacy evolves from the binary notion of whether an individual’s data is exposed or not and is instead expressed as a degree of risk. Moreover, it provides us with a tool to design strategies that are robust even against hackers with additional information.

### …and not so good

However, every time the same dataset is processed, the risk of revealing private information increases and the budget may be reached. That budget is quantified by the parameters ϵϵ and δδ (see the definition on Figure 3).

Moreover, as was the case with algorithmic fairness, there is a tradeoff between privacy and accuracy. A perfectly accurate picture can be difficult to obtain from a noisy dataset or noisy output. However, for many aggregate purposes and applications, the loss of accuracy is acceptable. For those reasons, differential privacy techniques may be considered to be applicable only on larger datasets, as the results obtained on smaller datasets might be too noisy for practical purposes. Nevertheless, that’s precisely what makes differential privacy such a promising tool in the area of deep learning!

### The promise of differential privacy for the aviation community

Differentially private tools have enormous potential in the aviation industry. They can enable useful information to be shared among ATM researchers, especially when it comes to sensitive datasets. For example, data owners could feel encouraged to share passenger data with greater ease knowing that the data is more protected due to the concept of plausible deniability.

***

In this short series of two posts, I talked about the concepts of fairness and privacy in the algorithmic domain. While many issues still remain open, it is very motivating to see that there are steps being taken towards quantifying such abstract concepts, making it possible to solidify our discussions on privacy and fairness in the era of algorithmic decision making.