Privacy Preserving Data Sanitization and Publishing
Recent trends have shown a drastic increase in the size of data repositories held by corporations, governments, and healthcare organizations. According to Bernard Marr, writing for Forbes (2015), more data was created in 2014/15 alone than in the entire prior history of the human race. Data sharing is beneficial in areas such as healthcare services and collaborative research. However, it carries a significant risk of compromising sensitive information, for example through de-anonymization. Privacy Preserving Data Publishing (PPDP) allows one to share sanitized data while protecting against identity disclosure of individuals. Removing explicit identifiers and personally identifiable information (PII) from a data set, or making the data set compliant with the Health Insurance Portability and Accountability Act (HIPAA), does not guarantee the privacy of data donors. Data sanitization may be achieved in several ways, such as k-anonymization, l-diversity, or delta-presence; however, the differential privacy paradigm provides the strongest privacy guarantee for sanitized data publishing. This research proposes two privacy preserving algorithms that satisfy the epsilon-differential privacy requirement and adopt the non-interactive privacy model for sanitizing and publishing data. Along with differential privacy, generalization and suppression of attributes are applied to enforce privacy and to prevent re-identification of records in a data set.
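To illustrate the epsilon-differential privacy guarantee mentioned above (this is a generic textbook sketch, not the thesis's proposed algorithm), a counting query over a data set can be released privately via the standard Laplace mechanism: since adding or removing one record changes a count by at most 1, noise drawn from Laplace(0, 1/epsilon) suffices. The function names `laplace_noise` and `dp_count` are illustrative, not from the thesis.

```python
import math
import random

def laplace_noise(scale: float) -> float:
    """Sample from a Laplace(0, scale) distribution via inverse-CDF sampling."""
    u = random.random() - 0.5          # uniform on [-0.5, 0.5)
    sign = -1.0 if u < 0 else 1.0
    return -scale * sign * math.log(1.0 - 2.0 * abs(u))

def dp_count(records, predicate, epsilon: float) -> float:
    """Release a counting query with epsilon-differential privacy.

    A counting query has global sensitivity 1 (one record changes the
    count by at most 1), so Laplace noise with scale 1/epsilon suffices.
    """
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon)
```

In the non-interactive model adopted by the thesis, such noisy statistics (together with generalized and suppressed attributes) are computed once and released as a full sanitized data set, rather than answered query-by-query.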
The key contributions of this thesis are: 1) the proposed algorithm adopts the non-interactive model for data publishing; as a result, data miners have full access to the published data set for further processing, which promotes data sharing in a safe way; 2) the algorithm can sanitize microdata and/or HIPAA-compliant data sets for publishing; 3) the published data is independent of an adversary's background knowledge; 4) the algorithm is independent of the choice of quasi-identifiers (QIDs); and finally, 5) it protects the published data set from the risk of re-identification. Data sanitized and published using the proposed algorithm is shown to retain higher utility, measured by data classification accuracy, than in other existing works, while significantly reducing the risk of re-identification.