Unique identifiers and the benefits and risks of data anonymization

Hector Ureta

2018-01-04 17:12:36

A unique identifier is an attribute (usually a number, code or other piece of information) that is guaranteed to be unique among all identifiers in use. For instance, the serial number of your phone and your DNA are unique identifiers. In these examples, the identifier is linked to a specific individual: either an object or a human being. There are more than 1 billion iPhones, but no two share a serial number. You may share 99% of your DNA with someone else, but the full DNA sequence is unique to every organism.

Researchers, engineers and data scientists apply data anonymization in many technical projects to prevent unique identifiers from being “published”. This protects individuals’ privacy and personal information while ensuring the data remains valid for the intended purpose. There are two ways of anonymizing data:

Encrypting: Encoding the original value into something that appears random and meaningless. Only authorized parties holding the key can recover the original value. The threat is a non-authorized party obtaining the key and decrypting the information.
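
As a rough illustration of turning an identifier into a random-looking token that depends on a secret key, here is a minimal sketch using a keyed hash (HMAC) from the Python standard library. Note that this is a swapped-in technique, not encryption proper: a keyed hash cannot be reversed even with the key, whereas encryption can. The function name, key and serial number are hypothetical.

```python
import hashlib
import hmac


def pseudonymize(identifier: str, key: bytes) -> str:
    """Replace an identifier with a keyed, random-looking token (HMAC-SHA256)."""
    return hmac.new(key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()


# Hypothetical secret key and serial number, for illustration only.
secret_key = b"keep-this-key-out-of-the-dataset"
token = pseudonymize("SERIAL-0001", secret_key)

# Same input and same key always give the same token,
# so records belonging to one device can still be grouped.
same_token = pseudonymize("SERIAL-0001", secret_key)

# Without the right key, the token cannot be reproduced.
other_token = pseudonymize("SERIAL-0001", b"another-key")

print(token)
```

As with encryption, the protection is only as good as the secrecy of the key: anyone who obtains it can recompute the tokens for known serial numbers.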

Deletion of source data: The other option for protecting privacy is simply to remove every unique identifier and keep only data that is not confidential or sensitive. For instance, most health-related research projects do not need personal names; knowing the relevant attributes (for instance gender and age) is sufficient in a specific context. The same applies to personalized marketing, which has no need for sensitive personal information (such as your name); the profiling only requires non-sensitive consumer data such as buying behaviour, lifestyle, preferences or age range.
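
The deletion approach can be sketched in a few lines of Python. The field names and records below are made up for illustration; which columns count as direct identifiers is an assumption that depends on the dataset.

```python
# Hypothetical set of columns treated as direct identifiers.
DIRECT_IDENTIFIERS = {"name", "phone_serial"}


def drop_identifiers(records, identifiers=DIRECT_IDENTIFIERS):
    """Return copies of the records without the direct-identifier fields."""
    return [
        {field: value for field, value in record.items() if field not in identifiers}
        for record in records
    ]


# Made-up example records.
patients = [
    {"name": "Alice Example", "phone_serial": "SERIAL-0001", "gender": "F", "age": 34},
    {"name": "Bob Example", "phone_serial": "SERIAL-0002", "gender": "M", "age": 51},
]

released = drop_identifiers(patients)
print(released)
```

Only gender and age survive in the released records; the original list is left untouched.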

What are the threats associated with this second anonymization approach? The answer is potential re-identification through the fusion of anonymized data. Could people’s identity (or any other unique identifier) be recovered by “smartly” connecting datasets that lack unique identifiers with other information? The answer is clear: yes, in most cases. For instance, the combination of gender, birth date and postal code is sufficient to identify 87% of individuals in the United States. (source)
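
To make the fusion threat concrete, here is a minimal sketch of a linkage between a dataset without names and a second, public dataset that has names, joined on gender, birth date and postal code. All records, field names and the function `link` are invented for illustration.

```python
QUASI_IDENTIFIERS = ("gender", "birth_date", "postal_code")


def link(anonymized, public, keys=QUASI_IDENTIFIERS):
    """Attach a name to every anonymized record matched by exactly one public record."""
    index = {}
    for person in public:
        index.setdefault(tuple(person[k] for k in keys), []).append(person["name"])

    reidentified = []
    for record in anonymized:
        names = index.get(tuple(record[k] for k in keys), [])
        if len(names) == 1:  # a unique match reveals the identity
            reidentified.append({**record, "name": names[0]})
    return reidentified


# Made-up "anonymized" health records: no names, but quasi-identifiers remain.
health = [
    {"gender": "F", "birth_date": "1984-03-12", "postal_code": "02138", "diagnosis": "asthma"},
    {"gender": "M", "birth_date": "1990-07-01", "postal_code": "02139", "diagnosis": "diabetes"},
]

# Made-up public list (for example a register) that includes names.
public_list = [
    {"name": "Alice Example", "gender": "F", "birth_date": "1984-03-12", "postal_code": "02138"},
    {"name": "Bob Example", "gender": "M", "birth_date": "1990-07-01", "postal_code": "02139"},
    {"name": "Carl Example", "gender": "M", "birth_date": "1990-07-01", "postal_code": "02139"},
]

matches = link(health, public_list)
print(matches)
```

In this toy data the first health record is re-identified because only one person shares its three attributes. The second is not, because two people share them, which is the intuition behind hiding each person in a group.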

Whether re-identification is feasible depends on the so-called “k-anonymity” property. It was first labelled as such in 1998 in this paper and is a property possessed by certain anonymized datasets. In brief, it guarantees that the individuals behind the data cannot be re-identified while the data remain useful. A release of data is said to have the k-anonymity property if the information for each person contained in the release cannot be distinguished from that of at least k-1 other individuals whose information also appears in the release.
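
The definition translates directly into a check: group the records by their quasi-identifier values and take the size of the smallest group. The sketch below assumes we already know which columns act as quasi-identifiers; the data and the function name `k_anonymity` are illustrative.

```python
from collections import Counter


def k_anonymity(records, quasi_identifiers):
    """Return the largest k for which the release is k-anonymous (0 if empty)."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values()) if groups else 0


# Made-up release: two groups of indistinguishable people (sizes 2 and 3).
release = [
    {"gender": "F", "age_range": "30-39", "postal_code": "021**", "diagnosis": "asthma"},
    {"gender": "F", "age_range": "30-39", "postal_code": "021**", "diagnosis": "flu"},
    {"gender": "M", "age_range": "50-59", "postal_code": "021**", "diagnosis": "diabetes"},
    {"gender": "M", "age_range": "50-59", "postal_code": "021**", "diagnosis": "asthma"},
    {"gender": "M", "age_range": "50-59", "postal_code": "021**", "diagnosis": "flu"},
]

k = k_anonymity(release, ("gender", "age_range", "postal_code"))
print(k)  # smallest group has 2 records, so the release is 2-anonymous
```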

There are several methods to ensure the k-anonymity of datasets (suppression, generalization, etc.); perhaps we will explore them in another post!
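
As a small taste of those two methods, the sketch below generalizes exact values into coarser ones (age into a ten-year range, postal code into a three-digit prefix) and then suppresses any record whose group is still smaller than k. The records, the choice of ranges and both function names are assumptions made for illustration, not a recommended recipe.

```python
from collections import Counter


def generalize(record):
    """Coarsen quasi-identifiers: exact age -> decade, postal code -> 3-digit prefix."""
    decade = (record["age"] // 10) * 10
    return {
        "gender": record["gender"],
        "age_range": f"{decade}-{decade + 9}",
        "postal_code": record["postal_code"][:3] + "**",
        "diagnosis": record["diagnosis"],
    }


def suppress_small_groups(records, quasi_identifiers, k):
    """Drop every record whose quasi-identifier group has fewer than k members."""
    def key(record):
        return tuple(record[q] for q in quasi_identifiers)

    sizes = Counter(key(r) for r in records)
    return [r for r in records if sizes[key(r)] >= k]


# Made-up raw records: every one is unique on (gender, age, postal_code).
raw = [
    {"gender": "F", "age": 34, "postal_code": "02138", "diagnosis": "asthma"},
    {"gender": "F", "age": 37, "postal_code": "02139", "diagnosis": "flu"},
    {"gender": "M", "age": 51, "postal_code": "02139", "diagnosis": "diabetes"},
    {"gender": "M", "age": 58, "postal_code": "02141", "diagnosis": "asthma"},
    {"gender": "F", "age": 72, "postal_code": "02474", "diagnosis": "flu"},
]

QUASI_IDENTIFIERS = ("gender", "age_range", "postal_code")
generalized = [generalize(r) for r in raw]
released = suppress_small_groups(generalized, QUASI_IDENTIFIERS, k=2)
print(released)
```

After generalization the first four records fall into two groups of two, while the fifth remains alone in its group and is suppressed, so the released data is 2-anonymous at the cost of some detail and one record.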


© datascience.aero