Anonymization and pseudonymization of personal data
Cybercriminals are waging a war on our personal data. The latest Cost of a Data Breach research from IBM and Ponemon shows that exposed data records carry a high price tag: the mean cost now stands at $150 per exposed record. Healthcare records are the costliest, at $429 per record.
Personal data exposure isn’t just a problem in terms of security and financial cost. Privacy, too, is a crucial consideration. Consumers want to have their privacy respected, so much so that privacy is now a competitive differentiator. A poll carried out by Harris and Finn Partners found that 65% of U.S. consumers said privacy was very important when dealing with a company.
However, protecting personal data is a complicated business. One approach that is often touted is the use of specialist techniques such as anonymization or pseudonymization. Here, I take a look at the pros and cons of these techniques.
Definitions of de-identification, anonymization and pseudonymization
Personal data or Personally Identifiable Information (PII) is information that can be used to identify an individual. Many of the privacy and data security regulations, such as HIPAA and GDPR, are based around the ability to link personal data back to an individual. Therefore, being able to remove links between data and an individual can, in theory, help with meeting some parts of these regulations.
If you can hide or obfuscate those identifying links, this should protect the individual and, by the same token, help with compliance with regulations like GDPR. For example, Recital 26 of the GDPR states that:
“The principles of data protection should therefore not apply to anonymous information, namely information which does not relate to an identified or identifiable natural person or to personal data rendered anonymous in such a manner that the data subject is not or no longer identifiable.”
Article 4(5) of the GDPR, in turn, defines pseudonymization as:
“ … the processing of personal data in such a manner that the personal data can no longer be attributed to a specific data subject without the use of additional information.”
Before continuing, it is worth defining the de-identification mechanisms of interest here:
Anonymization
A mechanism that converts identifiable data into anonymized data in such a way that the anonymized data cannot be transformed back into the original personal data.
De-identification
This term is often used alongside health data. However, it could be applied to any personal data to enhance privacy. In general, de-identification is used to separate PII from health data. The data is not anonymized: de-identified data has the potential to be re-associated with an individual at a later date.
Pseudonymization
This is a technique that takes personal identifiers and replaces them with artificial identifiers. For example, the technique may take a first name and surname and replace them with a pseudonym. With pseudonymization, an individual could still be identified if the pseudonymous data is linked with other identifiable data.
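To make the pseudonymization definition concrete, here is a minimal Python sketch. It assumes that a keyed hash (HMAC-SHA256) is an acceptable way to generate pseudonyms; the function names, the key and the sample record are hypothetical and not taken from any particular product. Anyone who holds the key can recompute the pseudonym for a known name, so the key acts as the "additional information" and would need to be stored separately from the data.

```python
import hashlib
import hmac


def pseudonymize(identifier: str, secret_key: bytes) -> str:
    """Return a stable pseudonym for an identifier using a keyed hash (HMAC-SHA256)."""
    # Normalize so that trivial formatting differences map to the same pseudonym.
    normalized = identifier.strip().lower().encode("utf-8")
    digest = hmac.new(secret_key, normalized, hashlib.sha256).hexdigest()
    return "user_" + digest[:16]


def pseudonymize_record(record: dict, identifying_fields: list, secret_key: bytes) -> dict:
    """Replace the identifying fields of a record with pseudonyms; other fields are kept."""
    return {
        field: pseudonymize(value, secret_key) if field in identifying_fields else value
        for field, value in record.items()
    }


# Hypothetical key and record, for illustration only.
KEY = b"example-key-kept-separately-from-the-data"
patient = {"name": "Alice Smith", "zip": "02139", "diagnosis": "asthma"}
pseudonymous = pseudonymize_record(patient, ["name"], KEY)
```

The same input always maps to the same pseudonym, so records can still be joined for analysis. That is also why pseudonymous data remains re-identifiable in principle: the zip code and diagnosis are untouched and can be linked with other data.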
Typical types of data that can be anonymized
Any data that can be defined as “personal” can be anonymized. Typical examples include:
- First name
- Zip/postal codes
- Postal address
- Identity documents such as national identification number, bank number, etc.
Anonymization is applied within several privacy-preserving models, a full discussion of which is outside the remit of this article. The most commonly used models, particularly in the health data context, are k-anonymity, l-diversity and t-closeness.
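As a rough illustration of the first two models, the Python sketch below measures k-anonymity (every combination of quasi-identifier values is shared by at least k records) and a simple "distinct" form of l-diversity (every such group contains at least l different sensitive values). The records, field names and the zip code generalization step are invented for this example; real tools use more sophisticated generalization strategies.

```python
from collections import Counter, defaultdict


def k_anonymity(records, quasi_identifiers):
    """Return k: the size of the smallest group sharing the same quasi-identifier values."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values())


def l_diversity(records, quasi_identifiers, sensitive):
    """Return l: the smallest number of distinct sensitive values found in any group."""
    groups = defaultdict(set)
    for r in records:
        groups[tuple(r[q] for q in quasi_identifiers)].add(r[sensitive])
    return min(len(values) for values in groups.values())


def generalize_zip(record, keep_digits=3):
    """Return a copy of the record with the trailing zip code digits masked."""
    generalized = dict(record)
    zip_code = record["zip"]
    generalized["zip"] = zip_code[:keep_digits] + "*" * (len(zip_code) - keep_digits)
    return generalized


# Hypothetical records, for illustration only.
records = [
    {"zip": "02138", "age_band": "30-39", "diagnosis": "asthma"},
    {"zip": "02139", "age_band": "30-39", "diagnosis": "diabetes"},
    {"zip": "02141", "age_band": "40-49", "diagnosis": "asthma"},
    {"zip": "02142", "age_band": "40-49", "diagnosis": "flu"},
]
quasi = ["zip", "age_band"]

k_before = k_anonymity(records, quasi)         # every zip code is unique
generalized = [generalize_zip(r) for r in records]
k_after = k_anonymity(generalized, quasi)      # groups now share a zip prefix
l_after = l_diversity(generalized, quasi, "diagnosis")
```

In this toy data, masking the last two zip digits moves the dataset from k = 1 to k = 2, at the cost of less precise location information. That trade-off between privacy and utility is typical of these models.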
When and why are de-identification, anonymization or pseudonymization used?
There are a number of use cases where one of these techniques may be applied. Some examples include:
Health research
Anonymization can facilitate the sharing of health data in research scenarios. In addition, anonymizing patient data helps with HIPAA Privacy Rule compliance, as it removes the need for patient consent.
Smart cities
Personal data is a driver for many smart city services. This includes behavioral and geolocation data. Connected devices and sensors need a continuous feed of such data to optimize these services, creating big data.
There are a number of reasons why anonymization of these data is important. Privacy is the obvious one, but the safety of individuals is another.
Vendors are now looking at ways to anonymize massive amounts of data across multiple services that are often interconnected and aggregated. This volume causes scalability issues for traditional methods of de-identification. Hackathons run by the likes of Global Urban Datafest offer companies opportunities to explore the complexities of the challenge.
Training sets for AI
Artificial intelligence requires training data. These data are often personal, behavioral and sensitive. Anonymizing them is an important part of preserving privacy in AI.
Frameworks such as the OPen ALgorithms (OPAL) project provide privacy-preserved data sets for training AI algorithms. However, OPAL-like frameworks need to be expanded to offer more varied datasets for training AI.
Methods and practices in de-identifying personal data
There are a growing number of methods that can be used to decouple identifying data or replace it with artificial identifiers. Software programs that perform anonymization are fairly common and use a variety of techniques. These include transforming data by removing identifiers and replacing them or, in the case of photos, blurring the images. In the former case, randomization techniques are often used to assign a value to an attribute deemed sensitive.
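Two simple randomization techniques can be sketched in Python as follows: adding random noise to a numeric attribute, and shuffling a sensitive attribute across records so that each value is no longer tied to its original row. The function names and sample data are hypothetical, and this is only a sketch of the general idea rather than the method used by any specific anonymization tool.

```python
import random


def add_noise(values, scale, rng):
    """Perturb numeric values with uniform noise in the range [-scale, +scale]."""
    return [v + rng.uniform(-scale, scale) for v in values]


def shuffle_attribute(records, attribute, rng):
    """Permute one attribute across records, breaking the link between row and value."""
    values = [r[attribute] for r in records]
    rng.shuffle(values)
    return [{**r, attribute: v} for r, v in zip(records, values)]


# Hypothetical data, for illustration only. The seed is fixed so runs are repeatable.
rng = random.Random(42)
staff = [
    {"department": "sales", "salary": 41000},
    {"department": "sales", "salary": 52000},
    {"department": "legal", "salary": 67000},
    {"department": "legal", "salary": 73000},
]

noisy_salaries = add_noise([r["salary"] for r in staff], scale=1000, rng=rng)
shuffled_staff = shuffle_attribute(staff, "salary", rng)
```

Shuffling preserves the overall distribution of the attribute (the same salaries are still present), while noise addition preserves approximate values per row. Which is appropriate depends on what the data will be used for.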
De-identification of data, however, is not the end of the story. Using de-identification and anonymization techniques is as much about governance and people as it is about technology.
De-identification governance requires a framework to ensure successful implementation of the technology. If you only apply technology to anonymize data, you miss out on a vital area of the overall strategy — the people and decisions behind the solution. Without these elements, you miss the tenets of governance — accountability, transparency and applicability.
Problems with de-identification and anonymization
The fundamental question we have to ask about using techniques to separate identity from data is: do they work, and is it worth it?
Like many technological methods, they do not work if they are not implemented correctly.
One demonstration of re-identification was carried out at the Harvard University Data Privacy Lab. The lab showed how publicly available data sets and news items about hospitalizations could be used to re-identify patients whose records had previously been de-identified.
The University of Melbourne has also found ways of re-identifying personal data by using known information and finding links with de-identified data. A mass of known data is available because we now live so much of our lives online, creating a massive digital footprint.
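The general shape of such a linkage attack can be sketched in a few lines of Python. This is a simplified illustration with invented records and field names, not a reproduction of either study: it matches publicly known facts about named people against de-identified records on shared quasi-identifiers, and reports a match only when exactly one record fits.

```python
def link_records(deidentified, known, quasi_identifiers):
    """Return (name, record) pairs where a known person matches exactly one record."""
    # Index the de-identified records by their quasi-identifier values.
    index = {}
    for record in deidentified:
        key = tuple(record[q] for q in quasi_identifiers)
        index.setdefault(key, []).append(record)

    matches = []
    for person in known:
        key = tuple(person[q] for q in quasi_identifiers)
        candidates = index.get(key, [])
        if len(candidates) == 1:  # an unambiguous match re-identifies the record
            matches.append((person["name"], candidates[0]))
    return matches


# Hypothetical de-identified hospital records (names removed).
hospital = [
    {"zip": "98101", "age": 34, "admitted": "2011-03", "diagnosis": "fracture"},
    {"zip": "98101", "age": 61, "admitted": "2011-03", "diagnosis": "cardiac"},
    {"zip": "98052", "age": 34, "admitted": "2011-05", "diagnosis": "burns"},
    {"zip": "98052", "age": 34, "admitted": "2011-05", "diagnosis": "concussion"},
]
# Hypothetical facts gathered from public sources such as news items.
news = [
    {"name": "J. Doe", "zip": "98101", "age": 34, "admitted": "2011-03"},
    {"name": "R. Roe", "zip": "98052", "age": 34, "admitted": "2011-05"},
]

reidentified = link_records(hospital, news, ["zip", "age", "admitted"])
```

Here one of the two named people is linked to a single record, and with it a diagnosis, even though the hospital data contains no names. The second person is protected only because another record happens to share the same quasi-identifiers.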
Imperial College London recently published a paper that demonstrated how modern data anonymization techniques do not meet the standards required by GDPR to protect data. The researchers used a generative model, built with machine learning techniques, to estimate the probability that a re-identification is correct.
The resulting research challenged the claim that a low population uniqueness is sufficient to protect people’s privacy. The paper claims that 99.98% of Americans could be correctly re-identified in any dataset using 15 demographic attributes. If you are thinking that is a lot of attributes, the paper points out that Experian sold a “de-identified dataset containing 248 attributes per household for 120M Americans.”
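Why more attributes make re-identification easier can be seen even in toy data. The Python sketch below is a simple empirical count of unique records, not the generative model used in the paper, and the population and attribute names are invented. It measures the fraction of people whose combination of attribute values is shared with nobody else.

```python
from collections import Counter


def uniqueness(records, attributes):
    """Fraction of records whose combination of attribute values appears exactly once."""
    keys = [tuple(r[a] for a in attributes) for r in records]
    counts = Counter(keys)
    return sum(1 for key in keys if counts[key] == 1) / len(records)


# Hypothetical population, for illustration only.
people = [
    {"gender": "F", "birth_year": 1980, "zip": "10001"},
    {"gender": "F", "birth_year": 1980, "zip": "10002"},
    {"gender": "F", "birth_year": 1975, "zip": "10001"},
    {"gender": "M", "birth_year": 1980, "zip": "10001"},
    {"gender": "M", "birth_year": 1980, "zip": "10002"},
    {"gender": "M", "birth_year": 1975, "zip": "10002"},
]

one_attribute = uniqueness(people, ["gender"])
two_attributes = uniqueness(people, ["gender", "birth_year"])
three_attributes = uniqueness(people, ["gender", "birth_year", "zip"])
```

In this example nobody is unique on gender alone, a third of the population is unique on gender plus birth year, and everybody is unique once zip code is added. With hundreds of attributes per household, very few people share a combination with anyone else.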
Conclusion: Privacy frameworks and de-identification/anonymization
The best way to keep data safe is to not collect it in the first place. Of course, in reality, this is never going to happen. Various use cases across e-commerce, banking, government and healthcare need to process personal and health data. However, minimizing the data collected should be part of an overall approach to maintaining the security and privacy of personal data. Online digital footprints containing behavioral, location and other data will be harder to contain.
We must accept that de-identification, pseudonymization and even full anonymization come with risks and are not a panacea for data privacy. A program of de-identification coupled with good data practices and governance can help to reduce those risks.
Sources
- Guide to Basic Data Anonymisation Techniques, PDPC
- ARX Data Anonymization Tool, ARX
- Cost of a Data Breach Report highlights, IBM
- Harris Poll And Finn Partners Unveil New Metric For The Return On Investment For Social Good, PR Newswire
- A Smart Cities Hackathon, Global Urban Datafest
- Matching Known Patients to Health Records in Washington State Data, Latanya Sweeney
- Research reveals de-identified patient data can be re-identified, The University of Melbourne
- Estimating the success of re-identifications in incomplete datasets using generative models, Luc Rocher, Julien M. Hendrickx & Yves-Alexandre de Montjoye