Ιδιωτικότητα σε Βάσεις Δεδομένων

Ιδιωτικότητα σε Βάσεις Δεδομένων

Οκτώβρης 2011

Roadmap

• Motivation• Core ideas• Extensions

2

Roadmap


3

Reasons for privacy preserving data publishing

• Vast amount of data collected nowadays• Estimated user data per day– 8-10 GB public content– ~ 4 TB private content (emails, SMSs, content

annotations, social networks…)

4


• Organizations (hospitals, ministries, internet providers, …) publicly release data concerning individual records (internet searches, medical records, …)

• The laws oblige these agencies to protect the individuals’ privacy

5


• So, data are stripped of the attributes that can reveal the individuals’ identities

• Unfortunately, this is not enough…

6

Sweeney’s breach of governor’s medical record

• “ … In Massachusetts, the Group Insurance Commission (GIC) is responsible for purchasing health insurance for state employees.

• For twenty dollars I purchased the voter registration list for Cambridge Massachusetts and received the information on two diskettes.

• The rightmost circle in Figure 1 shows that these data included the name, address, ZIP code, birth date, and gender of each voter.

• This information can be linked using ZIP code, birth date and gender to the medical information, thereby linking diagnosis, procedures, and medications to particularly named individuals. …”

7

Sweeney’s breach of governor’s medical record

• “ … For example, William Weld was governor of Massachusetts at that time and his medical records were in the GIC data. Governor Weld lived in Cambridge Massachusetts. According to the Cambridge Voter list, six people had his particular birth date; only three of them were men; and, he was the only one in his 5-digit ZIP code.…”

8

AOL’s exposure of user 4417749• August 2006: AOL publicizes anonymized data for

21M user queries• User’s 4417749 had a strong essence of geo and

thematic locality• Researchers would focus more and more the search,

based on these queries• Ms Arnold, 62, would prove to be the user search for

medication, resorts, dogs and family members, …

9

The context of privacy-preserving data publishing

10

Detailedmicrodata

T

Anonymizedpublic data

T*

Bob (the victim) to be hidden

Ben, the benevolent, data miner

Alice, the external attacker

Deborah, a star DBA & a TRUSTED data publisher

Roadmap


11

Anonymization

• To retain privacy one must:– Remove the attributes that directly identify

individuals (name, SSN, …)– Organize the tuples and the cell values of the data

set in such a way that:• The statistical properties of the data set are retained• The attacker cannot guess to which individual a tuple

corresponds with statistical meaningful guarantee

12

Fundamentals

• Identifier(s): attribute(s) that explicitly reveal the identity of a person (name, SSN, …). These attributes are removed from the public data set

• Quasi identifier: attribute(s) that if joined with external data can reveal sensitive information (zip code, birth date, sex,…)– Typically accompanied by “generalization hierarchies”

• Sensitive attribute: containing the values that should be kept private (disease, salary,…)

13

14

Generalization hierarchies

General methods for Anonymization

• “Hide tuples in the crowd”– Generalization– Anatomization

• “Lies to the attacker, truth to the statistician”– Noise injection– Value perturbation

15

k-anonymity (TKDE 01, IJUFKS 02)

17

A relation Τ is k-anonymous when every tuple of the relation is identical to k-1 other tuples with respect to their Quasi-Identifier set of attributes.

Naïve l-diversity

18

A relation T satisfies the naïve l-diversity property whenever every group of the relation contains at least l different values in its sensitive attributes.

Information utility

• Must prevent the attackers, by satisfying the privacy criterion (k for k-anonymity, l for l-diversity) – Fundamental anonymization technique: hide

individual in groups of identical QI values!!

• Must serve the well-meaning users, by maximizing information utility i.e., by minimizing

• The tuples we remove (see next)• the amount of generalization that we apply to the QI

attributes.

19

Generalization vs suppression

20

This anonymization suppressed no tuples, and guarantees 3-anonymity.

What if we want 4-anonymity?

Generalization vs suppression

21

Low height, 6 tuples suppressed

Higher height, no tuples suppressed

//the difference is in the work_class field

Incognito (SIGMOD 2005)

• Two fundamental ideas can be exploited with hierarchies:• If a data set generalized at a certain level (e.g., 1345*) is

k-anonymous, then it is also k-anonymous if it is even more generalized (e.g., 134**)

• If a data set of N attributes is not k-anonymous if – n attributes are not fully anonymized (age) and N-n are fully

anonymized (sex, zip)• then, the same data set is still not k- anonymous with

– n+1 attributes are not fully anonymized (age,sex) and N-n-1 are fully anonymized (zip)

22

Incognito

23

Birth date, zip code, sex

Combinations of 2 attributes

Incognito

24

Birth date, zip code, sex

Combinations of 3 attributes, after non-anonymous gener. are pruned

25

What disease Bob is suffering from? Since Alice is Bob’s neighbor, she knows that Bob is a 31-year-old American male who lives in the zip code 13053. Therefore, Alice knows that Bob’s record number is 9,10,11, or 12.Now, all of those patients have the same medical condition (cancer), and so Alice concludes that Bob has cancer.

Umeko is a 21 year oldJapanese female who currently lives in zip code 13068.Therefore, Umeko’s informationis contained in record number 1,2,3, or 4.

+BCGR Knowledge: it is well-knownthat Japanese have an extremely low incidence of heart disease. Therefore, Alice concludes with near certaintythat Umeko has a viral infection.

L-diversity (ICDE 2006)

• Every q-block group, has – At least k tuples– At least l well-represented values– Well-represented?

• Ούτε όλες οι τιμές σε ένα group είναι ίδιες (έχω τουλάχιστον l, l>=2)

• Ούτε κάποια τιμή είναι απίθανο να υπάρχει => μπορώ να συνάγω ότι ισχύει κάποια άλλη αν το l είναι σχετικά μικρό

26

Well-represented

• Distinct l-diversity: simply l different values

• Entropy l-diversity: for each pair (public value q*, sensitive value s) measure the value p(q*,s)logp(q*,s)

• Entropy of a q-block with value q* is -Σsp()logp() over all sensitive values s

• You need to have E > log(p) for all groups (and this can be guaranteed if it holds for the whole table, too)

• Recursive l-diversity: the most frequent values do not appear too frequently and the less frequent do not appear too rarely

27

Roadmap


28

Mondrian (ICDE 2006)

29

Why must we generalize fully every attributes?

Some records are in regions with many records and anonymity is easily preserved even by giving out more information. Some others are in sparse areas and need to be generalized more…

age

zip

Mondrian (ICDE 2006)

30

Global recoding

local recoding

Original data

M-invariance (SIGMOD ‘07)

31

If I know that Bob is in group 1 + he has been taken to the hospital twice, I can deduce bronc. from:

{dysp., bronch.}

{dysp.,gastr.}

M-invariance

32

Many other extensions

• Concerning multi-relational privacy• Data perturbations• More sophisticated “local recoding” a-la

Mondrian• Trajectory, set-valued, OLAP, … data• …

33

Ιδιωτικότητα σε Βάσεις Δεδομένων

Documents

Transcript of Ιδιωτικότητα σε Βάσεις Δεδομένων