Privacy in Databases

October 2011


Page 1: Privacy in Databases

Privacy in Databases

October 2011

Page 2: Privacy in Databases

Roadmap

• Motivation
• Core ideas
• Extensions

Page 3: Privacy in Databases

Roadmap

• Motivation
• Core ideas
• Extensions

Page 4: Privacy in Databases

Reasons for privacy-preserving data publishing

• Vast amounts of data are collected nowadays
• Estimated user data per day:
  – 8-10 GB public content
  – ~4 TB private content (emails, SMSs, content annotations, social networks, …)

Page 5: Privacy in Databases

Reasons for privacy preserving data publishing

• Organizations (hospitals, ministries, internet providers, …) publicly release data concerning individual records (internet searches, medical records, …)

• The laws oblige these agencies to protect the individuals’ privacy

Page 6: Privacy in Databases

Reasons for privacy preserving data publishing

• So, data are stripped of the attributes that can reveal the individuals’ identities

• Unfortunately, this is not enough…

Page 7: Privacy in Databases

Sweeney’s breach of governor’s medical record

• “ … In Massachusetts, the Group Insurance Commission (GIC) is responsible for purchasing health insurance for state employees.

• For twenty dollars I purchased the voter registration list for Cambridge Massachusetts and received the information on two diskettes.

• The rightmost circle in Figure 1 shows that these data included the name, address, ZIP code, birth date, and gender of each voter.

• This information can be linked using ZIP code, birth date and gender to the medical information, thereby linking diagnosis, procedures, and medications to particularly named individuals. …”

Page 8: Privacy in Databases

Sweeney’s breach of governor’s medical record

• “ … For example, William Weld was governor of Massachusetts at that time and his medical records were in the GIC data. Governor Weld lived in Cambridge Massachusetts. According to the Cambridge Voter list, six people had his particular birth date; only three of them were men; and, he was the only one in his 5-digit ZIP code.…”
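The linkage described in the quote amounts to a join on the three shared attributes. The sketch below uses invented records (not Sweeney's data), and the field names and the `link` helper are assumptions made for illustration only.

```python
# Toy linkage attack: join a "de-identified" medical table with a public
# voter list on the shared attributes (ZIP code, birth date, gender).
# All records and field names here are hypothetical.

QUASI_IDENTIFIERS = ("zip", "birth_date", "gender")

medical = [  # names removed, diagnosis kept
    {"zip": "02138", "birth_date": "1950-01-15", "gender": "M", "diagnosis": "condition A"},
    {"zip": "02139", "birth_date": "1962-03-02", "gender": "F", "diagnosis": "condition B"},
    {"zip": "02139", "birth_date": "1962-03-02", "gender": "F", "diagnosis": "condition C"},
]

voters = [  # public list, names included
    {"name": "Voter One", "zip": "02138", "birth_date": "1950-01-15", "gender": "M"},
    {"name": "Voter Two", "zip": "02139", "birth_date": "1962-03-02", "gender": "F"},
    {"name": "Voter Three", "zip": "02139", "birth_date": "1962-03-02", "gender": "F"},
]


def qi_key(record):
    """Project a record onto the shared attributes."""
    return tuple(record[a] for a in QUASI_IDENTIFIERS)


def link(medical_rows, voter_rows):
    """Return {voter name: set of diagnoses compatible with that voter}."""
    diagnoses_by_key = {}
    for row in medical_rows:
        diagnoses_by_key.setdefault(qi_key(row), set()).add(row["diagnosis"])
    return {v["name"]: diagnoses_by_key.get(qi_key(v), set()) for v in voter_rows}


linked = link(medical, voters)
# Voters left with exactly one compatible diagnosis are fully exposed.
exposed = {name for name, diagnoses in linked.items() if len(diagnoses) == 1}
print(exposed)
```

In this toy data only the voter with a unique (ZIP, birth date, gender) combination is exposed; the other two share a combination and remain ambiguous.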

Page 9: Privacy in Databases

AOL’s exposure of user 4417749

• August 2006: AOL publishes anonymized data for 21M user queries
• The queries of user 4417749 showed strong geographic and thematic locality
• Researchers progressively narrowed down the search, based on these queries
• Ms Arnold, 62, proved to be the user, who had searched for medication, resorts, dogs, family members, …

Page 10: Privacy in Databases

The context of privacy-preserving data publishing

• Detailed microdata: T
• Anonymized public data: T*
• Bob (the victim), to be hidden
• Ben, the benevolent data miner
• Alice, the external attacker
• Deborah, a star DBA and a TRUSTED data publisher

Page 11: Privacy in Databases

Roadmap

• Motivation
• Core ideas
• Extensions

Page 12: Privacy in Databases

Anonymization

• To retain privacy one must:
  – Remove the attributes that directly identify individuals (name, SSN, …)
  – Organize the tuples and the cell values of the data set in such a way that:
    • The statistical properties of the data set are retained
    • The attacker cannot guess, with any statistically meaningful guarantee, to which individual a tuple corresponds

Page 13: Privacy in Databases

Fundamentals

• Identifier(s): attribute(s) that explicitly reveal the identity of a person (name, SSN, …). These attributes are removed from the public data set

• Quasi-identifier: attribute(s) that, if joined with external data, can reveal sensitive information (zip code, birth date, sex, …)
  – Typically accompanied by “generalization hierarchies”

• Sensitive attribute: containing the values that should be kept private (disease, salary,…)

Page 14: Privacy in Databases

Generalization hierarchies
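The figure for this slide is not part of the transcript. As an illustration only, a generalization hierarchy can be thought of as a function from a value and a level to a coarser value. The two hierarchies below (digit masking for zip codes, 10-year bands for age) and the function names are assumptions, not taken from the slide.

```python
def generalize_zip(zip_code, level):
    """Mask the last `level` digits of a zip code with '*' (level 0 = original)."""
    if not 0 <= level <= len(zip_code):
        raise ValueError("level out of range for this hierarchy")
    return zip_code[: len(zip_code) - level] + "*" * level


def generalize_age(age, level):
    """Level 0: exact age; level 1: 10-year band; level 2: fully suppressed."""
    if level == 0:
        return str(age)
    if level == 1:
        low = (age // 10) * 10
        return f"[{low}-{low + 9}]"
    if level == 2:
        return "*"
    raise ValueError("level out of range for this hierarchy")


zip_levels = [generalize_zip("13453", lvl) for lvl in range(6)]
age_levels = [generalize_age(31, lvl) for lvl in range(3)]
print(zip_levels)
print(age_levels)
```

Each step up the hierarchy loses information but makes more tuples look alike, which is what the anonymization methods that follow exploit.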

Page 15: Privacy in Databases

General methods for Anonymization

• “Hide tuples in the crowd”
  – Generalization
  – Anatomization
• “Lies to the attacker, truth to the statistician”
  – Noise injection
  – Value perturbation
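A minimal sketch of the "lies to the attacker, truth to the statistician" idea, assuming independent zero-mean uniform noise and invented salary values. It is not calibrated to any formal privacy guarantee, and `inject_noise` is a hypothetical helper.

```python
import random
import statistics


def inject_noise(values, spread, seed=0):
    """Add independent zero-mean uniform noise in [-spread, +spread] to each value."""
    rng = random.Random(seed)  # fixed seed only to make the sketch reproducible
    return [v + rng.uniform(-spread, spread) for v in values]


salaries = [30_000 + 10 * i for i in range(10_000)]  # hypothetical data
noisy = inject_noise(salaries, spread=5_000)

true_mean = statistics.mean(salaries)
noisy_mean = statistics.mean(noisy)
# Individual values are off by up to 5000, yet the two means stay close.
print(true_mean, round(noisy_mean, 1))
```

Because the noise has zero mean, its average over many tuples tends toward zero, so aggregates survive while single values become unreliable.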

Page 16: Privacy in Databases

k-anonymity (TKDE 01, IJUFKS 02)

A relation T is k-anonymous when every tuple of the relation is identical to at least k-1 other tuples with respect to the Quasi-Identifier set of attributes.
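The definition can be checked by counting how often each quasi-identifier combination occurs. The sketch below assumes rows are dictionaries; the table and the attribute names are invented.

```python
from collections import Counter


def is_k_anonymous(rows, quasi_identifiers, k):
    """True if every QI combination that occurs in `rows` occurs at least k times."""
    counts = Counter(tuple(r[a] for a in quasi_identifiers) for r in rows)
    return all(c >= k for c in counts.values())


# Hypothetical generalized table: QI = (zip, age), sensitive = disease.
table = [
    {"zip": "130**", "age": "<30", "disease": "flu"},
    {"zip": "130**", "age": "<30", "disease": "cancer"},
    {"zip": "130**", "age": "<30", "disease": "flu"},
    {"zip": "148**", "age": ">=40", "disease": "cancer"},
    {"zip": "148**", "age": ">=40", "disease": "heart disease"},
    {"zip": "148**", "age": ">=40", "disease": "flu"},
]

three_anonymous = is_k_anonymous(table, ("zip", "age"), k=3)
four_anonymous = is_k_anonymous(table, ("zip", "age"), k=4)
print(three_anonymous, four_anonymous)
```

Both groups have exactly three tuples, so this table is 3-anonymous but not 4-anonymous.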

Page 17: Privacy in Databases

Naïve l-diversity

A relation T satisfies the naïve l-diversity property whenever every group of the relation contains at least l different values in its sensitive attributes.
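A short sketch of this check, assuming a single sensitive attribute and rows stored as dictionaries. The table is invented, and the parameter is named `l` only to match the notation.

```python
from collections import defaultdict


def is_distinct_l_diverse(rows, quasi_identifiers, sensitive, l):
    """True if every QI group holds at least l distinct sensitive values."""
    values_by_group = defaultdict(set)
    for row in rows:
        key = tuple(row[a] for a in quasi_identifiers)
        values_by_group[key].add(row[sensitive])
    return all(len(values) >= l for values in values_by_group.values())


# Hypothetical 3-anonymous table whose first group is homogeneous.
table = [
    {"zip": "130**", "age": "3*", "disease": "cancer"},
    {"zip": "130**", "age": "3*", "disease": "cancer"},
    {"zip": "130**", "age": "3*", "disease": "cancer"},
    {"zip": "148**", "age": "2*", "disease": "flu"},
    {"zip": "148**", "age": "2*", "disease": "heart disease"},
    {"zip": "148**", "age": "2*", "disease": "flu"},
]

one_diverse = is_distinct_l_diverse(table, ("zip", "age"), "disease", l=1)
two_diverse = is_distinct_l_diverse(table, ("zip", "age"), "disease", l=2)
print(one_diverse, two_diverse)
```

The table is 3-anonymous, yet it fails for l = 2 because everyone in the first group has the same disease.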

Page 18: Privacy in Databases

Information utility

• Must thwart attackers, by satisfying the privacy criterion (k for k-anonymity, l for l-diversity)
  – Fundamental anonymization technique: hide individuals in groups of identical QI values!
• Must serve well-meaning users, by maximizing information utility, i.e., by minimizing:
  – the number of tuples we remove (see next)
  – the amount of generalization that we apply to the QI attributes

Page 19: Privacy in Databases

Generalization vs suppression

This anonymization suppressed no tuples, and guarantees 3-anonymity.

What if we want 4-anonymity?

Page 20: Privacy in Databases

Generalization vs suppression

Low generalization height: 6 tuples suppressed

Higher generalization height: no tuples suppressed

(The difference is in the work_class field.)
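The trade-off can be sketched by counting, for a given generalization level, how many tuples fall in groups smaller than k and would therefore have to be suppressed. The work_class hierarchy, the data, and the counts below are invented and do not reproduce the slide's figure.

```python
from collections import Counter

# Hypothetical two-level hierarchy for a work_class attribute.
WORK_CLASS_PARENT = {
    "Private": "Non-government",
    "Self-employed": "Non-government",
    "Federal-gov": "Government",
    "State-gov": "Government",
}


def generalize(value, level):
    """Level 0 keeps the value, level 1 maps it to its parent, level 2 hides it."""
    if level == 0:
        return value
    if level == 1:
        return WORK_CLASS_PARENT[value]
    return "*"


def suppressed_count(values, level, k):
    """Tuples that must be dropped so every remaining group has at least k tuples."""
    counts = Counter(generalize(v, level) for v in values)
    return sum(c for c in counts.values() if c < k)


work_classes = (["Private"] * 3 + ["Self-employed"] * 2
                + ["Federal-gov"] * 1 + ["State-gov"] * 3)

per_level = [suppressed_count(work_classes, level, k=3) for level in range(3)]
print(per_level)
```

With this toy data, staying at the lowest level costs three suppressed tuples for k = 3, while one step of generalization suppresses none.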

Page 21: Privacy in Databases

Incognito (SIGMOD 2005)

• Two fundamental ideas can be exploited with hierarchies:
  – If a data set generalized at a certain level (e.g., 1345*) is k-anonymous, then it is also k-anonymous when it is generalized even further (e.g., 134**)
  – If a data set of N attributes is not k-anonymous when n attributes are not fully anonymized (age) and the other N-n are fully anonymized (sex, zip), then the same data set is still not k-anonymous when n+1 attributes are not fully anonymized (age, sex) and N-n-1 are fully anonymized (zip)
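The two properties can be observed on a toy table. This is not the Incognito algorithm itself, only a sketch of the two facts it relies on for pruning. The data and the `mask` hierarchy (masking trailing characters) are assumptions.

```python
from collections import Counter


def mask(value, level):
    """Replace the last `level` characters of a string with '*'."""
    return value[: len(value) - level] + "*" * level


def is_k_anonymous(rows, levels, k):
    """`levels` maps each QI attribute to the number of trailing characters masked."""
    groups = Counter(
        tuple(mask(r[a], lvl) for a, lvl in levels.items()) for r in rows
    )
    return all(c >= k for c in groups.values())


# Hypothetical microdata; every value is a string so that `mask` applies.
rows = [
    {"zip": "13451", "sex": "M"},
    {"zip": "13452", "sex": "M"},
    {"zip": "13453", "sex": "F"},
    {"zip": "13454", "sex": "F"},
]

# Property 1: k-anonymous at 1345* implies k-anonymous at 134**.
at_1345 = is_k_anonymous(rows, {"zip": 1, "sex": 1}, k=2)
at_134 = is_k_anonymous(rows, {"zip": 2, "sex": 1}, k=2)

# Property 2: not 3-anonymous with only sex exposed (zip fully masked)
# implies not 3-anonymous once zip is partly exposed as well.
sex_only = is_k_anonymous(rows, {"zip": 5, "sex": 0}, k=3)
sex_and_zip = is_k_anonymous(rows, {"zip": 1, "sex": 0}, k=3)

print(at_1345, at_134, sex_only, sex_and_zip)
```

A search over the generalization lattice can therefore skip any node that exposes a superset of attributes already known to fail.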

Page 22: Privacy in Databases

Incognito

Birth date, zip code, sex

Combinations of 2 attributes

Page 23: Privacy in Databases

Incognito

Birth date, zip code, sex

Combinations of 3 attributes, after non-anonymous generalizations are pruned

Page 24: Privacy in Databases

What disease is Bob suffering from? Since Alice is Bob’s neighbor, she knows that Bob is a 31-year-old American male who lives in zip code 13053. Therefore, Alice knows that Bob’s record number is 9, 10, 11, or 12. Now, all of those patients have the same medical condition (cancer), and so Alice concludes that Bob has cancer.

Umeko is a 21-year-old Japanese female who currently lives in zip code 13068. Therefore, Umeko’s information is contained in record number 1, 2, 3, or 4.

+ Background knowledge: it is well known that Japanese have an extremely low incidence of heart disease. Therefore, Alice concludes with near certainty that Umeko has a viral infection.

Page 25: Privacy in Databases

L-diversity (ICDE 2006)

• Every q-block (group) has:
  – At least k tuples
  – At least l well-represented values
  – Well-represented?
    • Not all the values in a group are the same (there are at least l of them, l >= 2)
    • No value is so unlikely to occur that the attacker can rule it out and, if l is relatively small, infer that one of the others holds

Page 26: Privacy in Databases

Well-represented

• Distinct l-diversity: simply l different values

• Entropy l-diversity: for each pair (public value q*, sensitive value s), let p(q*, s) be the fraction of tuples in the q*-block that have sensitive value s, and measure the value p(q*, s) log p(q*, s)

• The entropy of a q-block with value q* is E(q*) = -Σ_s p(q*, s) log p(q*, s), summed over all sensitive values s

• You need to have E(q*) >= log(l) for all groups (and this is possible only if it holds for the whole table, too)

• Recursive l-diversity: the most frequent values do not appear too frequently and the less frequent do not appear too rarely
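The entropy and recursive variants can be sketched per q-block as follows. The recursive check uses the usual (c, l) formalization, r_1 < c (r_l + ... + r_m) over the sorted value counts; the constant c is not mentioned on the slide and is an assumption here, as are the function names and the sample blocks.

```python
import math
from collections import Counter


def block_entropy(sensitive_values):
    """Entropy -sum_s p(s) log p(s) of the sensitive values in one q*-block."""
    n = len(sensitive_values)
    counts = Counter(sensitive_values)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def is_entropy_l_diverse(blocks, l):
    """Every block needs entropy >= log(l); a tiny tolerance absorbs rounding."""
    return all(block_entropy(b) >= math.log(l) - 1e-12 for b in blocks)


def is_recursive_cl_diverse(sensitive_values, c, l):
    """r_1 < c * (r_l + ... + r_m), with counts sorted as r_1 >= ... >= r_m."""
    r = sorted(Counter(sensitive_values).values(), reverse=True)
    if len(r) < l:
        return False
    return r[0] < c * sum(r[l - 1:])


balanced = [["flu", "cancer", "heart disease"], ["flu", "cancer", "flu", "cancer"]]
skewed = [["flu", "flu", "flu", "cancer"]]  # 2 distinct values, but one dominates

print(is_entropy_l_diverse(balanced, l=2))
print(is_entropy_l_diverse(skewed, l=2))
print(is_recursive_cl_diverse(skewed[0], c=2, l=2))
```

The skewed block has two distinct values, so it passes distinct 2-diversity, but its entropy (about 0.56) is below log 2 (about 0.69), which is the situation the stricter definitions are meant to catch.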

Page 27: Privacy in Databases

Roadmap

• Motivation
• Core ideas
• Extensions

Page 28: Privacy in Databases

Mondrian (ICDE 2006)

Why must we fully generalize every attribute?

Some records are in regions with many records, and anonymity is easily preserved even when giving out more information. Others are in sparse areas and need to be generalized more…

[Figure axes: age, zip]
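A simplified sketch in the spirit of Mondrian's greedy median partitioning, assuming numeric (age, zip) points. Choosing the split dimension by raw (un-normalized) range and the exact tie handling are simplifications of this sketch, and the data is invented.

```python
def mondrian(points, k):
    """Recursively split `points` at the median of the widest dimension,
    as long as both halves keep at least k points. Returns a list of partitions."""
    dims = range(len(points[0]))
    # Try dimensions from the widest to the narrowest value range.
    by_width = sorted(
        dims,
        key=lambda d: max(p[d] for p in points) - min(p[d] for p in points),
        reverse=True,
    )
    for d in by_width:
        ordered = sorted(points, key=lambda p: p[d])
        median = ordered[len(ordered) // 2][d]
        left = [p for p in ordered if p[d] < median]
        right = [p for p in ordered if p[d] >= median]
        if len(left) >= k and len(right) >= k:
            return mondrian(left, k) + mondrian(right, k)
    return [points]  # no allowable cut: publish this region as one group


def bounding_box(partition):
    """The (min, max) range per dimension, i.e. the published generalization."""
    dims = range(len(partition[0]))
    return [(min(p[d] for p in partition), max(p[d] for p in partition)) for d in dims]


# Hypothetical (age, zip) records: a dense cluster plus two sparse points.
points = [(25, 13050), (26, 13051), (27, 13052), (28, 13053),
          (40, 13060), (60, 13090)]

partitions = mondrian(points, k=2)
boxes = [bounding_box(p) for p in partitions]
print(boxes)
```

Records in the dense area end up in a tight box, while the group containing the sparse records needs much wider age and zip ranges.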

Page 29: Privacy in Databases

Mondrian (ICDE 2006)

[Figure panels: original data, global recoding, local recoding]

Page 30: Privacy in Databases

M-invariance (SIGMOD ’07)

If I know that Bob is in group 1 and that he has been taken to the hospital twice, I can deduce dysp. (the only value common to both of his groups) from:

{dysp., bronch.}

{dysp., gastr.}
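The attack boils down to intersecting the sensitive values of Bob's group across releases. The sketch assumes that his disease did not change between the two releases and that the abbreviations stand for dyspepsia, bronchitis, and gastritis. The `same_signature` helper is a hypothetical illustration of the condition that keeps the intersection from shrinking.

```python
# Sensitive values of the group that contains Bob in each release.
release_1_group = {"dyspepsia", "bronchitis"}
release_2_group = {"dyspepsia", "gastritis"}

# Bob's disease must appear in both groups.
candidates = release_1_group & release_2_group
print(candidates)


def same_signature(group_a, group_b):
    """True if both groups expose the same set of sensitive values,
    so that intersecting them rules nothing out."""
    return set(group_a) == set(group_b)


print(same_signature(release_1_group, release_2_group))
```

If Bob's group carried the same set of sensitive values in every release, the intersection would never narrow the candidates.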

Page 31: Privacy in Databases

M-invariance

Page 32: Privacy in Databases

Many other extensions

• Concerning multi-relational privacy
• Data perturbations
• More sophisticated “local recoding” à la Mondrian
• Trajectory, set-valued, OLAP, … data
• …