Novel reversible text data de-identification techniques based on native data structures

Al-Abdullah, Bayan (2022) Novel reversible text data de-identification techniques based on native data structures. Doctoral thesis (PhD), University of Sussex.

[img] PDF - Published Version
Download (3MB)


Technological development in today's digital world has resulted in the collection and storage of large amounts of personal data. These data enable both direct services and non-direct activities, known as secondary use. The secondary use of data can improve decision-making, service experiences, and healthcare systems. However, the widespread reuse of personal data raises significant privacy and policy issues, especially for health- related information; these data may contain sensitive data, leading to privacy breaches if compromised. Legal systems establish laws to protect the privacy of personal data disclosed for secondary use. A well-known example is the General Data Protection Regulation (GDPR), which outlines a specific set of rules for sharing and storing personal data to protect individual privacy. The GDPR explicitly points to data de-identification, especially pseudonymization, as one measure that can help meet the requirements for the processing of personal data.

The literature on privacy preservation approaches has largely been developed in the field of data anonymization, where personal data are irreversibly removed or obfuscated and there is no means by which to recover an individual's identity if needed. By contrast, pseudonymization is a promising technique to protect privacy while enabling the recovery of de-identified data. Significantly, many existing approaches for pseudonymization were developed long before the GDPR requirements were established, and so they may fail to satisfy its provisions. Therefore, it is worthwhile to offer technical solutions to preserve privacy while supporting the legitimate use of data.

This thesis proposes a novel de-identification system for unstructured textual data, known as ARTPHIL, that generates de-identified data in compliance with the GDPR requirement for strong pseudonymization. The system was evaluated using 2014 i2b2 testing data. The proposed system achieved a recall of 96.93% in terms of detecting and encrypting personal health information, as specified under guidelines provided by the Health Insurance Portability and Accountability Act (HIPAA). The system used a novel and lightweight cryptography algorithm E-ART to encrypt personal data cost-effectively and without compromising security. The main novelty of the E-ART algorithm is the use of the reflection property of a balanced binary tree data structure as substitution method instead of complex and multiple iterations. The performance and security of the proposed algorithm were compared to two symmetric encryption algorithms: The Advanced Encryption Standard and Data Encryption Standard. The security analysis showed comparable results, but the performance analysis indicated that E‐ART had the shortest ciphertext and running time with comparable memory usage, which indicates the feasibility of using ARTPHIL for delay-sensitive or data-intensive applications

Item Type: Thesis (Doctoral)
Schools and Departments: School of Engineering and Informatics > Informatics
Subjects: K Law > K Law in General. Comparative and uniform Law. Jurisprudence > K7000 Private international law. Conflict of laws > K7093.A-Z Concepts applying to several branches of law, A-Z > K7093.C66 Computers. Internet. Data protection
Q Science > QA Mathematics > QA0075 Electronic computers. Computer science > QA0076.9.A-Z Other topics, A-Z > QA0076.9.D343 Data mining
T Technology > T Technology (General) > T0055.4 Industrial engineering. Management engineering > T0058.5 Information technology
Depositing User: Library Cataloguing
Date Deposited: 25 May 2022 19:39
Last Modified: 25 May 2022 19:39

View download statistics for this item

📧 Request an update