AI-powered Data Anonymization Techniques for Preserving Privacy

March 20, 2024

INTRODUCTION

The integration of digital technology into all areas of a business fundamentally changes how that business operates and delivers value to its customers. As businesses undergo digital transformation, they increasingly rely on vast amounts of data for purposes such as improving customer experience, optimizing operations, and creating new products and services.

However, with this reliance on data comes a heightened risk to individuals’ privacy. These ever-expanding volumes of data reflect the exponential growth in the amount of data being generated, collected, and processed by organizations. Such data often contains sensitive information about individuals, including personal details, preferences, and behaviour patterns.

To address these concerns, organizations must prioritize data privacy. Data privacy refers to the set of practices, policies, and measures designed to protect individuals’ personal information and ensure that it is used only for its intended purpose. This includes implementing robust security measures to safeguard data against unauthorized access, ensuring compliance with relevant regulations such as the Digital Personal Data Protection Act, 2023 (hereinafter referred to as “DPDP Act”), the General Data Protection Regulation (hereinafter referred to as “GDPR”) or the California Consumer Privacy Act (hereinafter referred to as “CCPA”), and being transparent with customers about how their data is collected, stored, and used.

DATA ANONYMIZATION AND ITS NEEDS

Data anonymization acts like a digital camouflage: it plays a crucial role in protecting individuals’ privacy while still allowing organizations to derive valuable insights from data. By removing or encrypting identifiable information in datasets, anonymization ensures that individuals cannot be directly identified from the data alone.

However, it is essential to strike a balance between preserving privacy and retaining the usefulness of the data for analysis or operational purposes. If anonymization is too aggressive, it may result in data that is no longer useful for its intended purposes. Conversely, if it is too lenient, it may fail to adequately protect individuals’ privacy.

The Personal Data Protection Bill, 2019 defined anonymization under Section 3(2), which stated that “‘anonymisation’ in relation to personal data, means such irreversible process of transforming or converting personal data to a form in which a data principal cannot be identified, which meets the standards of irreversibility specified by the Authority”. However, the DPDP Act does not provide any categorical definition of anonymisation.

In a data-driven world where organizations rely heavily on data for decision-making and innovation, anonymization helps navigate the ethical considerations associated with data use. It allows organizations to extract valuable insights while minimizing the risk of exposing individuals’ identities.

This not only helps maintain trust with customers but also ensures compliance with data protection regulations. Ultimately, effective anonymization techniques enable organizations to harness the power of data responsibly and ethically.

Personally Identifiable Information

This is a type of information that can help to identify an individual when used alone or with other relevant data. Personally Identifiable Information (hereinafter referred to as “PII”) can encompass direct identifiers, such as passport information, which can uniquely identify an individual, or quasi-identifiers, such as race, which when combined with other quasi-identifiers like date of birth, can effectively lead to the recognition of a specific individual.

Essential data containing PII can be utilized if handled and protected properly, for example, by encrypting sensitive data. However, non-essential data containing PII should either be deleted when appropriate or anonymized to ensure the best protection for the individual.

GENERATIVE AI FOR DATA ANONYMIZATION

In today’s digital environment, ensuring data privacy and security is of utmost importance. With the continuous accumulation and exchange of sensitive data, safeguarding individual privacy has emerged as a vital necessity. An effective approach to this challenge involves employing generative artificial intelligence (hereinafter referred to as “AI”) models to anonymize data. This method shows great potential for creating synthetic data that closely mirrors real data while keeping privacy intact.

Consequently, organizations can safely share and analyse datasets without the fear of exposing confidential information, thereby minimizing the risk of data breaches. Benefits of using AI for data anonymization include:

1. Preserving Data Utility

2. Privacy Protection

3. Enabling Data Sharing and Collaboration

4. Reducing the Impact of Data Breaches

5. Differential Privacy
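The last item above, differential privacy, can be illustrated with a minimal sketch (the function name and parameters are illustrative, not from any particular library): a counting query is answered with calibrated Laplace noise, so that adding or removing any one individual changes the output distribution only slightly.

```python
import math
import random

def dp_count(true_count: int, epsilon: float) -> float:
    """Answer a counting query via the Laplace mechanism.

    A count changes by at most 1 when one person is added or removed
    (sensitivity = 1), so noise is drawn from Laplace(0, 1/epsilon).
    Smaller epsilon means more noise and stronger privacy.
    """
    u = random.random() - 0.5
    # Inverse-CDF sampling of a Laplace(0, 1/epsilon) variate.
    noise = -(1.0 / epsilon) * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))
    return true_count + noise

random.seed(42)
print(dp_count(1000, epsilon=0.5))  # close to 1000, but never the exact count
```

Each query consumes privacy budget, so in practice an analyst would track cumulative epsilon across all queries against the same dataset.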

TECHNIQUES OF DATA ANONYMIZATION

i. k-Anonymity: A dataset satisfies k-anonymity when each individual’s record cannot be distinguished from those of at least k-1 other individuals in the data release. Its primary benefit is shielding against the disclosure of identities.

ii. l-Diversity: l-diversity is an extension of k-anonymity which guarantees that within every anonymized group there are at least “l” distinct values for the sensitive attribute. Its main benefit lies in providing a strong safeguard against attribute disclosure.

iii. t-Closeness: t-closeness requires that the distribution of a sensitive attribute within any anonymized group closely resembles its distribution in the overall dataset, within a specified threshold “t”. This approach aims to rectify the limitations of l-diversity by offering a more balanced trade-off between privacy and utility.

iv. Data masking: It is a security technique employed to safeguard sensitive data by concealing or substituting it with fictitious data. This method serves as a protective measure for confidential data, PII, financial data etc.

v. Synthetic data: Synthetic data pertains to data that is artificially generated to replicate the statistical characteristics of authentic data. It is produced using algorithms or computational techniques.

vi. Generalization: Generalization is a method employed to safeguard sensitive information by decreasing the level of detail or specificity in the data. This technique entails substituting specific values or data points with broader or more abstract values, maintaining the overall meaning and statistical characteristics of the data intact.

vii. Data Perturbation: Data perturbation is a technique aimed at safeguarding sensitive information by introducing random noise or errors into the data. Its objective is to make it harder to identify individuals within a dataset.

viii. Data Swapping: Data swapping is employed to safeguard sensitive information by exchanging or swapping data among individuals or entities. Its aim is to increase the challenge of identifying individuals while preserving the data’s usability for research or statistical analysis purposes.
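The first two properties above can be checked mechanically. A minimal sketch (the toy records and column names are invented for illustration): group records by their quasi-identifiers, then measure the smallest group (k) and the least-diverse sensitive attribute within any group (l).

```python
from collections import defaultdict

def k_and_l(records, quasi_identifiers, sensitive):
    """Return (k, l): k is the size of the smallest equivalence class
    over the quasi-identifiers; l is the minimum number of distinct
    sensitive values inside any one class."""
    classes = defaultdict(list)
    for r in records:
        classes[tuple(r[q] for q in quasi_identifiers)].append(r[sensitive])
    k = min(len(v) for v in classes.values())
    l = min(len(set(v)) for v in classes.values())
    return k, l

# Ages and zip codes have already been generalized into ranges/prefixes.
records = [
    {"age": "30-39", "zip": "380***", "disease": "flu"},
    {"age": "30-39", "zip": "380***", "disease": "diabetes"},
    {"age": "40-49", "zip": "395***", "disease": "flu"},
    {"age": "40-49", "zip": "395***", "disease": "flu"},
]
k, l = k_and_l(records, ["age", "zip"], "disease")
# k = 2, but l = 1: the dataset is 2-anonymous yet fails 2-diversity,
# because both records in the 40-49 class share the same disease.
```

This illustrates why l-diversity exists as a refinement: a dataset can satisfy k-anonymity while still leaking a sensitive attribute for an entire group.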

PROCESS OF DATA ANONYMIZATION

  1. The organization collects data from the customer or user;
  2. Different anonymization techniques are applied to the raw data, which may include personal and sensitive personal data;
  3. Once a technique is applied to the collected data, the resulting dataset makes it difficult to identify any individual;
  4. This anonymized data is then stored in the organization’s database and shared with third parties if required.
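The four steps above can be sketched end-to-end as a minimal pipeline; the field names and generalization rules below are hypothetical, not taken from any real system:

```python
def anonymize_record(raw: dict) -> dict:
    """Steps 2-3 of the process: drop direct identifiers and
    generalize quasi-identifiers before storage or sharing."""
    age_lo = (raw["age"] // 10) * 10
    return {
        # direct identifiers (name, email) are dropped entirely
        "age_band": f"{age_lo}-{age_lo + 9}",      # generalized age
        "pincode": raw["pincode"][:3] + "***",     # truncated location
        "purchase_total": raw["purchase_total"],   # retained for analysis
    }

raw = {"name": "A. Sharma", "email": "a@example.com",
       "age": 34, "pincode": "380015", "purchase_total": 1999}
stored = anonymize_record(raw)  # only this record reaches the database
```

Only the anonymized output of step 3 should ever be persisted or shared in step 4; the raw record is discarded or held under separate, stricter controls.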

Risk-Based anonymization process:

After collecting data from the user, the data is sorted on the basis of its attributes into two types: direct identifiers like names and addresses, and quasi-identifiers like age or income. Then the sensitive attribute is defined, which shouldn’t be linked to an individual. The risk of identifying someone from this data is evaluated and compared to a preset acceptable risk level. A “risk threshold” is established, indicating the level of risk deemed acceptable.

A risk threshold of zero implies that no risk of re-identification is acceptable, resulting in no data being shared since any level of risk is considered too high. On the other hand, a threshold value of one indicates that no anonymization has been applied, meaning the data remains fully identifiable. The acceptable level of risk falls somewhere between zero and one. In other words, some degree of risk is deemed acceptable, allowing for the sharing of data while still maintaining a reasonable level of anonymity. Typically, the acceptable risk threshold is set lower, i.e. more strictly, when sharing data externally or when dealing with highly regulated personal data such as health records.
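One common way to put a number on this risk (an assumption here; the article does not name a specific model) is the prosecutor model, where the risk is 1 divided by the size of the smallest quasi-identifier group, then compared against the chosen threshold:

```python
from collections import Counter

def reidentification_risk(records, quasi_identifiers):
    """Maximum re-identification risk under the prosecutor model:
    1 / size of the smallest equivalence class. A value of 1.0 means
    some record is unique; values near 0 mean every record hides in
    a large crowd."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return 1.0 / min(groups.values())

def safe_to_release(records, quasi_identifiers, threshold):
    """Release only if measured risk does not exceed the threshold
    (0 = no risk tolerated, 1 = fully identifiable data allowed)."""
    return reidentification_risk(records, quasi_identifiers) <= threshold

records = [{"age_band": "30-39", "sex": "F"}] * 5 + [{"age_band": "60-69", "sex": "M"}]
risk = reidentification_risk(records, ["age_band", "sex"])
# risk = 1.0: the single 60-69/M record is unique and fully re-identifiable.
```

If the risk exceeds the threshold, further generalization or suppression is applied and the risk is re-measured, iterating until the release passes.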

PSEUDONYMIZATION AND HOW IT IS DIFFERENT FROM ANONYMIZATION

Pseudonymization involves substituting one attribute, commonly a unique identifier like an individual’s name, with another value. This practice diminishes the ability to link a dataset with the original identity of a data subject. Popular pseudonymization techniques include encryption, hash functions, deterministic encryption, and tokenization. These methods help to obscure personal identifiers, enhancing data privacy and security.
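A minimal sketch of one such technique, a keyed hash (HMAC-SHA-256); the key below is a placeholder, and in practice it would be stored separately from the dataset so that re-identification requires information the data holder keeps apart:

```python
import hashlib
import hmac

SECRET_KEY = b"store-me-separately"  # placeholder; keep apart from the data

def pseudonymize(identifier: str) -> str:
    """Replace an identifier with a deterministic keyed-hash pseudonym.
    The same input always yields the same pseudonym, so records remain
    linkable across tables; reversing the mapping requires the key."""
    return hmac.new(SECRET_KEY, identifier.encode(), hashlib.sha256).hexdigest()[:16]

p1 = pseudonymize("alice@example.com")
p2 = pseudonymize("alice@example.com")
# p1 == p2: linkability is preserved, yet the e-mail address itself
# never appears in the pseudonymized dataset.
```

Using a keyed HMAC rather than a bare hash matters: an unkeyed hash of a guessable identifier can be reversed by brute force, as the taxi-data case discussed later shows.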

Article 4(5) of the GDPR defines ‘pseudonymisation’ as “the processing of personal data in such a manner that the personal data can no longer be attributed to a specific data subject without the use of additional information, provided that such additional information is kept separately and is subject to technical and organisational measures to ensure that the personal data are not attributed to an identified or identifiable natural person”.

Data strategies aimed at pseudonymization or full anonymization are governed by varying criteria within the framework of the GDPR. Complete anonymization essentially discharges the data collector of their data stewardship responsibilities, whereas pseudonymization presents a more complex scenario.

The GDPR obligations of providing notice to data subjects, honouring the right to be forgotten, limiting data retention, securing data, notifying breaches, keeping records, and obtaining consent do not apply to anonymized data, whereas they continue to apply to pseudonymized data.

RE-IDENTIFICATION OF ANONYMIZED DATA

Data re-identification refers to the process of uncovering PII within supposedly anonymized or scrubbed datasets. Scrubbing, or anonymization, typically involves removing direct identifiers such as names, addresses, or social security numbers to protect individuals’ privacy. However, even after this process, it is still possible to infer or deduce individuals’ identities through various means.

1. Insufficient De-identification: This occurs when the anonymization process is not thorough enough, leaving traces of PII or indirect identifiers in the dataset. For example, removing names but leaving other identifiable information such as birth dates or zip codes might still enable re-identification.

2. Pseudonym Reversal: In some cases, individuals are assigned pseudonyms or unique identifiers to replace their actual names. However, if these pseudonyms can be linked back to the real identities of individuals, re-identification becomes possible. This could happen if there is a breach in the security of the pseudonymization process or if external data sources can be used to match the pseudonyms to real identities.

3. Combining Datasets: Re-identification can also occur by combining supposedly anonymized datasets with other available data sources. By correlating information from multiple datasets, it might be possible to uncover individuals’ identities or infer sensitive information. This could involve merging datasets from different sources or cross-referencing with publicly available data.

These techniques are not mutually exclusive; they can be used in combination to increase the likelihood of re-identifying individuals within scrubbed datasets. Direct identifiers directly reveal the real identity of individuals, while indirect identifiers provide information about preferences and habits, which can also contribute to re-identification. Therefore, even after data has been scrubbed, there’s a risk that it can still be re-identified, compromising individuals’ privacy.
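Technique 3 above, combining datasets, is straightforward to demonstrate. In this invented example, a “scrubbed” health dataset is joined with a public register on shared quasi-identifiers; any scrubbed record with exactly one match is re-identified:

```python
def linkage_attack(scrubbed, public, keys):
    """Re-identify scrubbed records that match exactly one public
    record on the shared quasi-identifiers."""
    hits = []
    for s in scrubbed:
        matches = [p for p in public if all(p[k] == s[k] for k in keys)]
        if len(matches) == 1:  # a unique match means re-identification
            hits.append((matches[0]["name"], s["diagnosis"]))
    return hits

scrubbed = [  # names removed, but quasi-identifiers left intact
    {"zip": "02138", "birth_year": 1960, "sex": "F", "diagnosis": "asthma"},
    {"zip": "02139", "birth_year": 1975, "sex": "M", "diagnosis": "flu"},
]
public = [  # e.g. a publicly available voter roll
    {"name": "J. Doe", "zip": "02138", "birth_year": 1960, "sex": "F"},
    {"name": "R. Roe", "zip": "02139", "birth_year": 1975, "sex": "M"},
    {"name": "S. Poe", "zip": "02139", "birth_year": 1975, "sex": "M"},
]
hits = linkage_attack(scrubbed, public, ["zip", "birth_year", "sex"])
# Only the first record matches uniquely: the asthma diagnosis is
# linked back to "J. Doe"; the second hides among two candidates.
```

Note how the second record survives precisely because it is not unique on the join keys, which is the intuition behind k-anonymity.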

Risks of Re-Identification:

Given the continual influx of data breaches worldwide, the risk of a malicious actor successfully re-identifying a dataset remains ever-present. According to a study released by the European Union (hereinafter referred to as “EU”) in January 2020, there were 160,000 reported data breaches under the GDPR during the eight-month period from May 2018 to January 2019.

In a widely documented case brought to light by Anthony Tockar of Northwestern University in 2014, the New York City Taxi and Limousine Commission received a Freedom of Information Law (hereinafter referred to as “FOIL”) request for data regarding all taxi rides in 2013. This dataset included detailed information such as pickup and drop-off times, locations, fare and tip amounts, as well as anonymized (hashed) versions of the taxi’s license and medallion numbers. As outlined in the findings, a simple web search for “celebrities in taxis in Manhattan in 2013” yielded a photograph which, when correlated with quasi-identifiers in the data, allowed the identification of two specific celebrities, their starting and ending points, and the amounts they paid and tipped. This represented a significant breach of privacy.
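The dataset’s core weakness is easy to reproduce: because the space of possible medallion numbers is small, every unsalted hash can be precomputed and reversed. The four-digit format below is a simplification of the real medallion scheme, used only for illustration:

```python
import hashlib

def hash_id(medallion: str) -> str:
    """The release 'anonymized' medallions by plain, unsalted hashing."""
    return hashlib.md5(medallion.encode()).hexdigest()

# Precompute the hash of every possible (simplified) medallion number.
# With only 10,000 candidates this takes a fraction of a second.
lookup = {hash_id(f"{n:04d}"): f"{n:04d}" for n in range(10_000)}

leaked = hash_id("0042")    # a hash value appearing in the public release
recovered = lookup[leaked]  # "0042" -- the anonymization is fully undone
```

A keyed hash or random tokenization, where the mapping is held separately, would have resisted this attack; deterministic hashing of a small identifier space offers essentially no protection.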

REGULATORY FRAMEWORK

1. Advisory No. 2(4)/2023-CyberLaws-3 published by the Ministry of Electronics and Information Technology (hereinafter referred to as “MeitY”) dated 1st March, 2024 on “Due diligence by Intermediaries / Platforms under the Information Technology Act, 2000 and Information Technology (Intermediary Guidelines and Digital Media Ethics Code) Rules, 2021” states that:

“Where any intermediary through its software or any other computer resource permits or facilitates synthetic creation, generation or modification of a text, audio, visual or audio-visual information, in such a manner that such information may be used potentially as misinformation or deepfake, it is advised that such information created, generated, or modified through its software or any other computer resource is labeled or embedded with a permanent unique metadata or identifier, by whatever name called, in a manner that such label, metadata or identifier can be used to identify that such information has been created, generated or modified using computer resource of the intermediary, or identify the user of the software or such other computer resource, the intermediary through whose software or such other computer resource such information has been created, generated or modified and the creator or first originator of such misinformation or deepfake.”

2. In July 2022, a draft report titled “Guidelines On Data Anonymization For E-Governance” was released for public consultation by MeitY, prepared by the Standardization Testing and Quality Certification (hereinafter referred to as “STQC”) Directorate and the Centre for Development of Advanced Computing (hereinafter referred to as “C-DAC”).

The Guidelines provide various techniques and SOPs that e-governance projects can use to anonymise the data they collect, so that it can be utilized for other projects. They also aim to support the implementation of data anonymization provisions in policies and laws enacted by the Government, and were open for public consultation till 21st September, 2022. However, the draft report was subsequently pulled down by MeitY, which stated that it had been “released without adequate expert consultation.”

AMLEGALS REMARKS

In conclusion, the intricate interplay between data privacy and AI-assisted anonymization underscores the evolving landscape of digital ethics and regulatory frameworks. While advancements in AI present promising avenues for safeguarding sensitive information, the potential for unintended consequences, breaches and ethical dilemmas necessitates vigilant oversight and interdisciplinary collaboration.

By prioritizing transparency, accountability, and the alignment of technological innovation with ethical principles, stakeholders can navigate the complexities of data privacy in an increasingly interconnected world. As we continue to harness the power of AI for anonymization, it is imperative to strike a delicate balance between innovation and protection, ensuring that individuals’ rights to privacy remain paramount in the digital age.

Using AI-powered methods to anonymize data offers a promising approach to protecting privacy in today’s digital era. Sophisticated algorithms can transform sensitive information to preserve anonymity while maintaining its usefulness for analysis and research.

– Team AMLEGALS assisted by Ms. Prishita Saraiwala


