SIGMOD 2013 - Patricia's talk on "Value invention for Data Exchange"

Value Invention in Data Exchange. Patricia Arocena (1), Boris Glavic (2), Renée J. Miller (1). (1) University of Toronto, DBGroup; (2) Illinois Institute of Technology, DBGroup. SIGMOD 2013 - June 25, 2013 - New York, USA

description

The creation of values to represent incomplete information, often referred to as value invention, is central in data exchange. Within schema mappings, Skolem functions have long been used for value invention as they permit a precise representation of missing information. Recent work on a powerful mapping language called second-order tuple generating dependencies (SO tgds) has drawn attention to the fact that the use of arbitrary Skolem functions can have negative computational and programmatic properties in data exchange. In this paper, we present two techniques for understanding when the Skolem functions needed to represent the correct semantics of incomplete information are computationally well-behaved. Specifically, we consider when the Skolem functions in second-order (SO) mappings have a first-order (FO) semantics and are therefore programmatically and computationally more desirable for use in practice. Our first technique, linearization, significantly extends the Nash, Bernstein and Melnik unskolemization algorithm by understanding when the sets of arguments of the Skolem functions in a mapping are related by set inclusion. We show that such a linear relationship leads to mappings that have FO semantics and are expressible in popular mapping languages including source-to-target tgds and nested tgds. Our second technique uses source semantics, specifically functional dependencies (including keys), to transform SO mappings into equivalent FO mappings. We show that our algorithms are applicable to a strictly larger class of mappings than previous approaches, but more importantly, we present an extensive experimental evaluation that quantifies this difference (about 78% improvement) over an extensive schema mapping benchmark and illustrates the applicability of our results on real mappings.

Transcript of SIGMOD 2013 - Patricia's talk on "Value invention for Data Exchange"

Page 1: SIGMOD 2013 - Patricia's talk on "Value invention for Data Exchange"

Value Invention in Data Exchange

Patricia Arocena (1), Boris Glavic (2), Renée J. Miller (1)

(1) University of Toronto, DBGroup

(2) Illinois Institute of Technology, DBGroup

SIGMOD 2013 - June 25, 2013 - New York, USA


Outline

1 Introduction

2 Linearization

3 Exploiting Source Constraints

4 Experiments

5 Conclusions


The Data Exchange Problem [1]

Schema Mappings M = (S, T, Σ)

• Source Schema S and Target Schema T

• High-level specification Σ
  • models the relationship between S and T

Data Exchange

• Given an instance of S

• How to materialize a target instance of T?

[Figure: mapping M between source schema S (with source data) and target schema T (with target data)]

[1] R. Fagin et al., Theor. Comput. Sci. 336 (2005).

Slide 1 of 26 Arocena, Glavic, and Miller - Value Invention in Data Exchange: Introduction

Example

Source schema S: WorksOn(Department, Project, City)
Target schema T: Projects(PId, City, ManagerId)

Source data (WorksOn):

Department  Project   City
IT          Web       Toronto
IT          Big Data  Chicago
Sales       Mobile    New York

Target data (Projects), with NULLs for the unknown values:

PId   City      ManagerId
NULL  Toronto   NULL
NULL  Chicago   NULL
NULL  New York  NULL

We usually create values to represent incomplete information!

Value Invention

Target data (Projects), with invented values:

PId          City      ManagerId
f(Web)       Toronto   g(IT)
f(Big Data)  Chicago   g(IT)
f(Mobile)    New York  g(Sales)

∃f ∃g ( WorksOn(d, p, c) → Projects(f(p), c, g(d)) )
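The value-invention example can be replayed in a few lines of Python. This is only an illustrative sketch: the helper names `skolem` and `exchange` are hypothetical, and invented values are modeled as plain strings such as `f(Web)` rather than as proper labeled nulls.

```python
# Source instance of WorksOn(Department, Project, City), as on the slide.
works_on = [
    ("IT", "Web", "Toronto"),
    ("IT", "Big Data", "Chicago"),
    ("Sales", "Mobile", "New York"),
]

def skolem(name, *args):
    """Represent an invented value as an uninterpreted term such as f(Web)."""
    return f"{name}({', '.join(args)})"

def exchange(source):
    """Apply WorksOn(d, p, c) -> Projects(f(p), c, g(d)) to every source tuple."""
    return [(skolem("f", p), c, skolem("g", d)) for (d, p, c) in source]

projects = exchange(works_on)
for row in projects:
    print(row)
```

Because g depends only on the department, both IT projects receive the same invented ManagerId, which is exactly the correlation that plain NULLs cannot express.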


Our Goal

• Understand when schema mappings specified by SO tgds
  • Flexible and precise value invention

• . . . can be rewritten into nested GLAV mappings
  • Desirable computational and programmatic properties

Skolem Functions

• Introduced by Thoralf A. Skolem (1920s)

• Widely used in Mathematical Logic and Computer Science

Many important uses in Information Integration

• to model object identifier (OID) invention [a]

• to express correlation semantics (e.g., grouping and data merging) [b][c][d][e]

• to provide a precise representation of missing and incomplete information [b][f][g]

[a] R. Hull, M. Yoshikawa, In VLDB (1990).
[b] L. Popa et al., In VLDB (2002).
[c] A. Fuxman et al., In VLDB (2006).
[d] L. Libkin, C. Sirangelo, J. Comput. Syst. Sci. 77 (2011).
[e] B. Alexe et al., VLDB J. 21 (2012).
[f] Y. Papakonstantinou et al., In VLDB (1996).
[g] R. Fagin et al., TODS 30 (2005).

Schema Mapping Languages

Various logical mapping formalisms

• s-t tgds (also known as GLAV) [a]

• Nested s-t tgds (nested GLAV) [b]

• Second-Order (SO) tgds [c]

Expressiveness

• SO tgds permit arbitrary Skolems! [c]

• FO mapping languages have more desirable programmatic and computational properties [d]

[a] R. Fagin et al., Theor. Comput. Sci. 336 (2005).
[b] A. Fuxman et al., In VLDB (2006).
[c] R. Fagin et al., TODS 30 (2005).
[d] B. ten Cate, P. Kolaitis, In ICDT (2009).

Characterization of Mapping Languages [2][3][4]

Property              GLAV            nested GLAV         SO tgds
Composition           Not closed      Not closed          Closed
Value Invention       No correlation  Linear correlation  Fully customized correlation
Target Homomorphisms  Closed          Closed              Not closed
Model Checking        PTIME           PTIME               NP-complete

[2] R. Fagin et al., Theor. Comput. Sci. 336 (2005).
[3] R. Fagin et al., TODS 30 (2005).
[4] B. ten Cate, P. Kolaitis, In ICDT (2009).

The Quest for FO Rewritability

Rewritability

• Many SO tgds are equivalent to FO mappings!
  • We call this FO/GLAV/nested GLAV rewritable

• Some SO tgds are not FO rewritable [a]

• . . . Even testing for FO rewritability is undecidable [b]

Nash, Bernstein and Melnik

• First sufficient condition for GLAV rewritability [c]

• Tailored to consider SO tgds produced by mapping composition

[a] R. Fagin et al., TODS 30 (2005).
[b] I. Feinerer et al., In AMW (2011).
[c] A. Nash et al., TODS 32 (2007).

Our Contributions

1 Sufficient condition for nested GLAV rewritability of SO tgds

2 Linearize:
  • PTIME algorithm for rewriting SO tgds

3 Equivalence preserving transformation of SO tgds using source semantics

4 LinearizeFDs:
  • PTIME algorithm for rewriting SO tgds using source FDs

5 Extensive experimental evaluation
  • STBenchmark 2.0 [a]
  • Real-life mapping scenarios

[a] P. C. Arocena et al., “STBenchmark 2.0”, tech. rep. (Uni. of Toronto, 2013).


Intuition of Rewriting

Rewrite SO tgds into nested GLAV

• Replace second-order existentials with first-order existentials
  • ∃f (x) → ∃vf

• Apply logical equivalence of Skolemization in reverse direction
  • May have to reorder universal quantifiers to create ∀x

Skolemization Equivalence

∀x ∃vf δ(x, vf) ≡ ∃f ∀x δ(x, vf)[vf ← f(x)]
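The equivalence can be sanity-checked on small finite domains by brute force. The sketch below is only an illustration, not part of the talk: the function names and the three sample relations are assumptions chosen for the example, and a finite check is of course no proof.

```python
from itertools import product

def forall_exists(xs, vs, delta):
    """Left-hand side: for every x there is some value v with delta(x, v)."""
    return all(any(delta(x, v) for v in vs) for x in xs)

def exists_function(xs, vs, delta):
    """Right-hand side: some function f, found by exhaustive search,
    satisfies delta(x, f(x)) for every x."""
    for images in product(vs, repeat=len(xs)):
        f = dict(zip(xs, images))
        if all(delta(x, f[x]) for x in xs):
            return True
    return False

xs = [0, 1, 2]
vs = [0, 1, 2, 3]
relations = {
    "successor": lambda x, v: v == x + 1,
    "greater": lambda x, v: v > x,
    "impossible": lambda x, v: v > x + 3,
}
# For each sample relation, record the truth value of both sides.
results = {
    name: (forall_exists(xs, vs, d), exists_function(xs, vs, d))
    for name, d in relations.items()
}
print(results)
```

Both sides agree on every sample relation, which is what allows the rewriting to swap a Skolem term f(x) for a first-order existential vf placed directly after ∀x.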

UnSkolemization Revisited

Example: Key Invention

Source Schema
WorksOn (Department, Project, BudgetId)
Audit (BudgetId, Auditor)
City (Department, City)

Target Schema
Project (PId, BudgetId)
Dept (Dept, Year, Project, NumEmp)
Location (Department, DepId, City, State)
Budget (Project, Leader, Size)

∃f (∀d ∀p ∀b WorksOn(d, p, b) → Project(f(d, p), b))

We need to introduce ∃vf nested within the scope of d and p

∀d ∀p ∃vf ∀b WorksOn(d, p, b) → Project(vf, b)

Sufficient Rewriting Condition

Approach

• When can Unskolemization be applied to all Skolems of an SO tgd?

• Adapt notions from SO quantifier elimination methods [a]

• Consistency:
  • OK: . . . f(a) . . . f(a) → ∀a ∃vf
  • NOT OK: . . . f(a) . . . f(b) → ∀a ∃vf ∀b ∃vf

• Linearity:
  • OK: . . . f(a) . . . g(a, b) → ∀a ∃vf ∀b ∃vg
  • NOT OK: . . . f(a, b) . . . g(b, c) → ∀a ∀b ∃vf ∀c ∃vg

• Partitioning scheme for multi-clause SO tgds

Theorem: Linearity

• Given an SO tgd θ without equalities between or with Skolem terms that is
  • Consistent
  • Linear

• ⇒ θ can be rewritten as nested GLAV

[a] D. Gabbay et al., Second Order Quantifier Elimination: Foundations, Computational Aspects and Applications, (College Publications, 2008).
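The two conditions can be sketched as simple checks over the Skolem terms of a formula. This is a simplified reading based on the slide examples and on the abstract's description of linearity as argument sets related by set inclusion; the representation of terms as `(name, args)` pairs and the function names are assumptions, and the paper's formal definitions may be more refined.

```python
def is_consistent(terms):
    """Every occurrence of a Skolem function uses the same argument list."""
    seen = {}
    for name, args in terms:
        if seen.setdefault(name, args) != args:
            return False
    return True

def is_linear(terms):
    """The argument sets of the Skolem terms form a chain under set inclusion."""
    arg_sets = sorted({frozenset(args) for _, args in terms}, key=len)
    # After sorting by size, a chain only needs each set to be contained in the next.
    return all(a <= b for a, b in zip(arg_sets, arg_sets[1:]))

# The four cases from the slide.
print(is_consistent([("f", ("a",)), ("f", ("a",))]))      # OK case
print(is_consistent([("f", ("a",)), ("f", ("b",))]))      # NOT OK case
print(is_linear([("f", ("a",)), ("g", ("a", "b"))]))      # OK case
print(is_linear([("f", ("a", "b")), ("g", ("b", "c"))]))  # NOT OK case
```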

Linearize Algorithm

Properties of the Algorithm

• Rewrites an SO tgd into nested GLAV

• PTIME

• Size of resulting formula is linear in the size of the input

Linearize(θ)

1 Partition θ into independent sub-formulas (maximal partitioning Π)

2 For each partition
  • Check consistency and linearity

3 If all partitions are linear and consistent then
  • Rewrite θ into Ω
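For a single clause that already passed the consistency and linearity checks, the rewriting step amounts to choosing a quantifier order in which each invented value sits directly inside the scope of its arguments. The sketch below shows one way to build that prefix; `quantifier_prefix` and its input format are hypothetical, and the actual Linearize algorithm also handles multiple clauses and partitions, which this sketch does not.

```python
def quantifier_prefix(clause_vars, skolems):
    """Build a nested quantifier prefix for one consistent, linear clause.

    clause_vars: universally quantified variables, in their original order.
    skolems: dict mapping a Skolem function name to its tuple of argument variables.
    Linearity is assumed, not checked.
    """
    prefix, bound = [], set()
    # Visit Skolem functions from the smallest to the largest argument set.
    ordered = sorted(skolems.items(), key=lambda kv: (len(set(kv[1])), kv[0]))
    for name, args in ordered:
        for v in clause_vars:
            if v in args and v not in bound:
                prefix.append(f"∀{v}")
                bound.add(v)
        prefix.append(f"∃v{name}")
    # Variables that no Skolem term depends on are quantified last.
    prefix.extend(f"∀{v}" for v in clause_vars if v not in bound)
    return " ".join(prefix)

# Key Invention example: WorksOn(d, p, b) -> Project(f(d, p), b)
print(quantifier_prefix(["d", "p", "b"], {"f": ("d", "p")}))
```

For the Key Invention clause this yields the prefix ∀d ∀p ∃vf ∀b, matching the unskolemized formula shown earlier in the talk.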


A Note on Linearity

• Linearity is a syntactic but not a semantic condition

• ⇒ There is hope that an equivalent mapping exists that is linear

• ⇒ Approach: Find an equivalent mapping that is linear
  • Modify Skolem arguments?

Non-Linear SO tgd θ → [Equivalence Preserving Transformation] → Linear SO tgd θ′ → [Linearize] → nested GLAV Ω

Using Source Functional Dependencies

So far

• Only considered an SO tgd θ

• Have not considered additional knowledge that may be available

Source Schema
WorksOn (Department, Project, BudgetId)
Audit (BudgetId, Auditor)
City (Department, City)

Target Schema
Project (PId, BudgetId)
Dept (Dept, Year, Project, NumEmp)
Location (Department, DepId, City, State)
Budget (Project, Leader, Size)

∃f ∃g WorksOn(d, p, b) ∧ Audit(b, a) → Budget(p, f(d, p), g(b, a))

Source constraints

• Functional dependencies (FDs) ΣS that hold over the source
  • Primary keys (and other FDs if available)

• FDs imply dependencies between the arguments of Skolem terms
  • FD x → y can be used to augment Skolem arguments: f(x, z) → f(x, z, y)

WorksOn: Department, Project → BudgetId
Audit: BudgetId → Auditor

Implied FD1: d, p → b, a
Implied FD2: b → a

Augmenting the arguments of f using implied FD1 gives:

∃f ∃g WorksOn(d, p, b) ∧ Audit(b, a) → Budget(p, f(d, p, b, a), g(b, a))

Equivalence Preserving Transformation

Approach

• Augment Skolem arguments using implied FDs (Re-Skolemization)
  • Result θ′ that is equivalent as long as the FDs hold.

Non-Linear SO tgd θ and source FDs ΣS → [Re-Skolemize using implied FDs] → Linear SO tgd θ′ and source FDs ΣS → [Linearize] → nested GLAV Ω

Theorem: Re-Skolemization with FDs preserves equivalence

Given an implied source FD x → y valid over θ:

θ[f(x) ← f(x, y)] ∪ ΣS ≡ θ ∪ ΣS

Why Augmentation?

Does Re-Skolemization affect Linearity?

• Augmentation (θ[f(x) ← f(x, y)])
  • θaug: Result of applying augmentation until no longer possible

• Minimization (θ[f(x, y) ← f(x)]) [a]
  • θmin: Result of applying minimization until no longer possible

Theorem: Only augmentation preserves Linearity

Linear(θ) → Linear(θaug)
Linear(θ) ↛ Linear(θmin)

[a] B. Marnette et al., PVLDB 3 (2010).

LinearizeFDs Algorithm

Properties of the Algorithm

• Rewrites SO tgd into nested GLAV

• PTIME

• Size of resulting formula is linear in the size of the input

LinearizeFDs(θ, ΣS)

1 Compute implied FDs

2 Augment arguments of each Skolem term based on FDs
  • Using attribute closure
  • Result: θaug

3 Return Linearize(θaug)
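Step 2 can be illustrated with the textbook attribute-closure loop. The sketch below applies it to the Budget example from the talk; the function names and the dictionary representation of Skolem terms are assumptions made for the example, and the implied FDs are taken as given rather than derived from the source schema.

```python
def closure(attrs, fds):
    """Attribute closure of attrs under fds, given as (lhs, rhs) pairs of sets."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def augment(skolems, fds, order):
    """Extend each Skolem argument list to its attribute closure, listed in `order`."""
    return {
        name: tuple(v for v in order if v in closure(args, fds))
        for name, args in skolems.items()
    }

# Implied FDs over the clause variables, as listed on the slide.
fds = [({"d", "p"}, {"b", "a"}), ({"b"}, {"a"})]
# Skolem terms of WorksOn(d, p, b) ∧ Audit(b, a) → Budget(p, f(d, p), g(b, a)).
skolems = {"f": ("d", "p"), "g": ("b", "a")}
augmented = augment(skolems, fds, order=["d", "p", "b", "a"])
print(augmented)
```

After augmentation f takes (d, p, b, a) and g keeps (b, a), so the argument set of g is contained in that of f and the clause becomes linear.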


Mapping Generator and Experiments

STBenchmark

• Generator for data exchange scenarios [a]

• Schemas, Data and Mappings

• Construct complex mappings from simple primitives
  • e.g., Horizontal Partitioning (HP)
  • Parameterized and randomized (e.g., join path length)

Extensions

• Arbitrary Skolem terms (SO tgds)

• New primitives (e.g., Adding and Deleting Attributes, etc.)

• Combining primitives into more complex mappings
  • e.g., simulating composition and complex correlations

• Primary Keys (PKs) and Functional dependencies (FDs)

[a] B. Alexe et al., PVLDB 1 (2008).


Random Scenarios

• 12,500,000 randomly generated mapping scenarios

• Measure success rate

• Compare NBM, Linearize, LinearizeFDs, LinearizeMin

• NBM is only rewriting into GLAV!


Effect of Primary Keys

• Activate/Deactivate source PKs

• Vary amount of non-PK FDs

[Figure: bar chart of success rate (0% to 100%) for Linearize and LinearizeFDs, comparing No PKs and With PKs at source FDs = 0%, 25%, and 50%]


Conclusions

Rewriting SO tgds → nested GLAV

• Linearization
  • SO tgd is linear → can be rewritten

• Equivalence preserving Re-Skolemization
  • Using source FDs to augment Skolem arguments

Experimental and Theoretical Results

• Using FDs improves chance to rewrite
  • 78% increased success rate

• Primary keys are most effective
  • > 75% increased success rate

• Augmentation is better than minimization
  • about 16% increased success rate

Future Work

Integrate insights on Re-Skolemization into . . .

• Mapping operators such as
  • Composition
  • MapMerge

• Mapping generation

FO Rewritability of SO tgds

• Combine our sufficient condition with that of [NBM07] [a]
  • we know how to do it!

• Exploit Augmentation and Minimization together
  • to simplify and optimize SO mappings

• Use target FDs

[a] A. Nash et al., TODS 32 (2007).


Questions?


Appendix

6 References

7 Additional Notation

8 A Note on Complexity

9 Partitioning Scheme

10 Additional Experiments

11 STBenchmark 2.0

12 Additional Linearization Examples

13 Example Augmentation

References

R. Fagin, P. Kolaitis, R. J. Miller, L. Popa, Data Exchange: Semantics and Query Answering. Theor. Comput. Sci. 336 (2005).

R. Hull, M. Yoshikawa, ILOG: Declarative Creation and Manipulation of Object Identifiers. In VLDB (1990).

L. Popa, Y. Velegrakis, R. J. Miller, M. A. Hernandez, R. Fagin, Translating Web Data. In VLDB (2002).

A. Fuxman et al., Nested Mappings: Schema Mapping Reloaded. In VLDB (2006).

L. Libkin, C. Sirangelo, Data Exchange and Schema Mappings in Open and Closed Worlds. J. Comput. Syst. Sci. 77 (2011).

B. Alexe, M. A. Hernandez, L. Popa, W. C. Tan, MapMerge: Correlating Independent Schema Mappings. VLDB J. 21 (2012).

Y. Papakonstantinou, S. Abiteboul, H. Garcia-Molina, Object Fusion in Mediator Systems. In VLDB (1996).

R. Fagin, P. Kolaitis, L. Popa, W.-C. Tan, Composing Schema Mappings: Second-Order Dependencies to the Rescue. TODS 30 (2005).

B. ten Cate, P. Kolaitis, Structural Characterizations of Schema-Mapping Languages. In ICDT (2009).

I. Feinerer, R. Pichler, E. Sallinger, V. Savenkov, On the Undecidability of the Equivalence of Second-Order Tuple Generating Dependencies. In AMW (2011).

A. Nash, P. Bernstein, S. Melnik, Composition of Mappings Given by Embedded Dependencies. TODS 32 (2007).

P. C. Arocena, M. D’Angelo, B. Glavic, R. J. Miller, “STBenchmark 2.0”, tech. rep. (Uni. of Toronto, 2013).

D. Gabbay, R. Schmidt, A. Szalas, Second Order Quantifier Elimination: Foundations, Computational Aspects and Applications, (College Publications, 2008).

B. Marnette, G. Mecca, P. Papotti, Scalable Data Exchange with Functional Dependencies. PVLDB 3 (2010).

B. Alexe, W. C. Tan, Y. Velegrakis, STBenchmark: Towards a Benchmark for Mapping Systems. PVLDB 1 (2008).


Notation

GLAV (s-t tgds): ∀z, x (φ(z, x) → ∃y ψ(x, y))

∀d ∀p ∀b Works(d, p, b) → ∃y1 ∃y2 Project(y1, b, y2)

nested GLAV: Q(x, y)((φ1(x) → ψ1(x, y)) ∧ . . . ∧ (φn(x) → ψn(x, y))),
where Q(x, y) is a sequence of quantifiers, that is, ∀ for x and ∃ for y

∀d ∃y1 ∀p ∃y2 ∀b Works(d, p, b) → Project(y1, b, y2)

SO tgds: ∃f ( (∀x1 (φ1 → ψ1)) ∧ · · · ∧ (∀xn (φn → ψn)) )

Note: we usually omit universal quantifiers

∃f ∃g (Works(d, p, b) → Project(f(p), b, g(d)))


Model Checking

Complexity

• NP-complete for SO tgds vs. P for nested GLAV

• Are we only solving the simple cases?

Approach

• Find an SO tgd for which model checking is hard

• But can be rewritten using (implied) source FDs

Model Checking: 3-colorability

Schema Mappings

θ = ∀X, Y: E(X, Y) → C(f(X), g(Y))
           V(X, Y) → S(f(X), g(Y))

Not linear!

Instance

• The source relations encode an undirected graph G
  • For each edge (x, y) we create two tuples E(x, y) and E(y, x)
  • For each vertex x we create a tuple V(x, x)

• The target relations represent a coloring of the vertexes of G using three colors r, g, and b
  • C: (r, g), (r, b), (g, r), (g, b), (b, r), (b, g) - colors of adjacent nodes
  • S: (r, r), (g, g), (b, b) - colors of vertexes

Theorem: Model Checking is 3-colorability

• G is 3-colorable if θ holds over I

FD: X → Y

θ = ∀X, Y: E(X, Y) → C(f(X, Y), g(Y))
           V(X, Y) → S(f(X, Y), g(Y))

Linear!
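The reduction for the non-linear mapping θ with f(X) and g(Y) can be tried out on small graphs. The sketch below searches exhaustively for Skolem functions f and g over the three colors; it is exponential by design and only meant to illustrate why model checking the SO tgd amounts to 3-colorability. The function name `holds` and the graph encoding are assumptions made for this example.

```python
from itertools import product

COLORS = ("r", "g", "b")
# Target relations from the slide.
C = {(x, y) for x in COLORS for y in COLORS if x != y}  # colors of adjacent nodes
S = {(x, x) for x in COLORS}                             # colors of vertexes

def holds(vertices, edges):
    """Search for Skolem functions f, g such that both clauses of θ hold."""
    E = {(x, y) for x, y in edges} | {(y, x) for x, y in edges}
    V = {(x, x) for x in vertices}
    all_images = list(product(COLORS, repeat=len(vertices)))
    for f_img, g_img in product(all_images, repeat=2):
        f = dict(zip(vertices, f_img))
        g = dict(zip(vertices, g_img))
        edges_ok = all((f[x], g[y]) in C for x, y in E)
        vertices_ok = all((f[x], g[y]) in S for x, y in V)
        if edges_ok and vertices_ok:
            return True
    return False

triangle = holds([1, 2, 3], [(1, 2), (2, 3), (1, 3)])
k4 = holds([1, 2, 3, 4], [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)])
print(triangle, k4)
```

The V clause forces f(x) = g(x) for every vertex and the E clause forces adjacent vertexes to differ, so a satisfying pair (f, g) is exactly a proper 3-coloring.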


SO tgds are closed under conjunction

• Every SO tgd θ can be written as a set of clauses (φi → ψi)

• Splitting this set of clauses to form new SO tgds θ1, . . . , θn is equivalence preserving

• If θi and θj share no Skolems (they are uncorrelated)
  • then θi and θj can be rewritten independently

Maximal Partition

• Given an SO tgd θ

• Partition clauses into Π = π1, . . . , πn
  1 No πi and πj share any Skolems
  2 There is no Π′ with more elements than Π that fulfills condition 1)

Theorem: Rewritability and Maximal Partitions

Rewritable(θ) ⇔ ∀i: Rewritable(πi)
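One way to compute such a partition is to treat clauses as nodes, connect two clauses whenever they share a Skolem function, and take connected components. The union-find sketch below follows that idea; the function name and the input format (one set of Skolem names per clause) are assumptions for illustration, not the paper's procedure.

```python
def maximal_partition(clauses):
    """Group clause indices so that clauses sharing a Skolem function end up together.

    clauses: list of sets, each holding the Skolem function names used in one clause.
    """
    parent = list(range(len(clauses)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    owner = {}  # Skolem name -> index of the first clause that used it
    for i, names in enumerate(clauses):
        for name in names:
            j = owner.setdefault(name, i)
            parent[find(i)] = find(j)  # merge the two components

    groups = {}
    for i in range(len(clauses)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

# Clauses 0 and 1 share f; clause 2 uses only h; clause 3 has no Skolems.
print(maximal_partition([{"f"}, {"f", "g"}, {"h"}, set()]))
```

Connected components give the finest grouping in which no two groups share a Skolem, which is what condition 2 asks for.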


Real-Life Mappings

• Three real-life mapping scenarios from the literature

• Created SO tgds based on
  • Semantics of the schemas
  • Documented data transformations

• Compared all rewriting techniques


Towards STBenchmark 2.0

Noteworthy Features

• Support for arbitrary Skolem functions (SO tgds) and various Skolemization modes (e.g., Key, All and Random)

• Simulating some cases of composition using Skolem Noise

• Reuse of source schema elements using Source Reuse

• PKs and random multi-attribute FDs over the source

Usability Case?

• Thinking about comparing different notions of mapping inverse

• Any suggestions?


Notation

GLAV (s-t tgds): ∀z, x (φ(z, x) → ∃y ψ(x, y))

∀d ∀p ∀b Works(d, p, b) → ∃y1 ∃y2 Project(y1, b, y2)

nested GLAV: Q(x, y)((φ1(x) → ψ1(x, y)) ∧ . . . ∧ (φn(x) → ψn(x, y))), where Q(x, y) is a sequence of quantifiers, that is, ∀ for x and ∃ for y

∀d ∃y1 ∀p ∃y2 ∀b Works(d, p, b) → Project(y1, b, y2)

SO tgds: ∃f ((∀x1 (φ1 → ψ1)) ∧ · · · ∧ (∀xn (φn → ψn)))

Note: we usually omit universal quantifiers

∃f ∃g (Works(d, p, b) → Project(f(p), b, g(d)))
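To make the role of the Skolem functions concrete, the following sketch applies the last mapping, Works(d, p, b) → Project(f(p), b, g(d)), to a small source instance. It is a minimal illustration: the function name `apply_so_tgd` and the sample tuples are made up, and Skolem terms are modeled as Python tuples that stand in for labeled nulls.

```python
def apply_so_tgd(works):
    """Apply Works(d, p, b) -> Project(f(p), b, g(d)) to a source instance.

    A Skolem term such as ("f", p) acts as a labeled null: the same
    arguments always yield the same invented value.
    """
    project = set()
    for d, p, b in works:
        project.add((("f", p), b, ("g", d)))
    return project


# Hypothetical sample data, not from the talk.
works = {
    ("CS", "Web", "B1"),
    ("CS", "DB", "B2"),
    ("EE", "Web", "B3"),
}
project = apply_so_tgd(works)
```

Here the invented value f(Web) is shared by the CS and EE tuples, and g(CS) is shared by both CS tuples. The s-t tgd above places no such sharing requirement on y1 and y2, which is the extra precision the SO tgd expresses.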


Nesting and Linearization

Example: Skolems with Overlapping Arguments

Source Schema

WorksOn (Department, Project, BudgetId)
Audit (BudgetId, Auditor)
City (Department, City)

Target Schema

Project (PId, BudgetId)
Dept (Dept, Year, Project, NumEmp)
Location (Department, DepId, City, State)
Budget (Project, Leader, Size)

∃f ∃g (WorksOn(d, p, b) → Dept(d, f(d), p, g(d, p)))

We need to introduce two ∃ quantifiers without violating the dependencies modeled by f and g

∀d ∃vf ∀p ∃vg ∀b WorksOn(d, p, b) → Dept(d, vf, p, vg)
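The quantifier prefix above can be derived mechanically when the argument sets of the Skolem functions form a chain under set inclusion, here {d} ⊆ {d, p}. The sketch below is a simplified, single-clause illustration of that idea and not the full linearization algorithm. The function name `linearize` and its input format are assumptions.

```python
def linearize(universals, skolems):
    """Return a nested quantifier prefix, or None if the argument sets are not a chain.

    `universals` lists the universally quantified variables of the clause.
    `skolems` maps each Skolem function name to its tuple of argument variables.
    """
    ordered = sorted(skolems.items(), key=lambda item: len(item[1]))
    # Each argument set must contain the previous (smaller or equal) one.
    for (_, small), (_, large) in zip(ordered, ordered[1:]):
        if not set(small) <= set(large):
            return None

    prefix, bound = [], set()
    for name, args in ordered:
        for var in args:
            if var not in bound:
                prefix.append(f"∀{var}")
                bound.add(var)
        # The existential variable comes right after the variables it depends on.
        prefix.append(f"∃v{name}")
    # Variables that no Skolem function depends on are quantified last.
    prefix.extend(f"∀{var}" for var in universals if var not in bound)
    return " ".join(prefix)


prefix = linearize(["d", "p", "b"], {"f": ("d",), "g": ("d", "p")})
blocked = linearize(["d", "p", "b"], {"f": ("p",), "g": ("d",)})
```

For f(d) and g(d, p) this yields the prefix shown on the slide. For f(p) and g(d) the argument sets are incomparable, so the sketch returns None.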


Augmentation is better than Minimization

Source Schema

WorksOn (Department, Project, BudgetId)
Audit (BudgetId, Auditor)
City (Department, City)

Target Schema

Project (PId, BudgetId)
Dept (Dept, Year, Project, NumEmp)
Location (Department, DepId, City, State)
Budget (Project, Leader, Size)

WorksOn: Department, Project → BudgetId
Audit: BudgetId → Auditor

θ = ∃f ∃g WorksOn(d, p, b) ∧ Audit(b, a) → Budget(p, f(d, p), g(b, a))

θaug = ∃f ∃g . . . → Budget(p, f(d, p, b, a), g(b, a))

θmin = ∃f ∃g . . . → Budget(p, f(d, p), g(b))

FD1: d, p → b, a
FD2: b → a
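The two variants can be reproduced with a standard attribute-closure computation over the source FDs. The sketch below is a minimal illustration. The helper names `closure`, `augment`, and `minimize` are hypothetical, and the greedy minimization is one simple choice rather than the paper's procedure.

```python
def closure(attrs, fds):
    """Attribute closure of `attrs` under functional dependencies `fds`."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def augment(args, fds, available):
    """Add every available variable that the arguments functionally determine."""
    return closure(args, fds) & set(available)


def minimize(args, fds):
    """Greedily drop arguments that the remaining arguments determine."""
    kept = list(args)
    for var in list(args):
        rest = [v for v in kept if v != var]
        if var in closure(rest, fds):
            kept = rest
    return set(kept)


# FDs from the slide, written over the variables of θ:
# WorksOn: d, p -> b   and   Audit: b -> a
fds = [(("d", "p"), ("b",)), (("b",), ("a",))]
variables = ("d", "p", "b", "a")

f_aug = augment(("d", "p"), fds, variables)  # arguments of f in θaug
g_aug = augment(("b", "a"), fds, variables)  # arguments of g in θaug
f_min = minimize(("d", "p"), fds)            # arguments of f in θmin
g_min = minimize(("b", "a"), fds)            # arguments of g in θmin

# Linearization needs the argument sets to be related by set inclusion.
aug_is_chain = g_aug <= f_aug or f_aug <= g_aug
min_is_chain = g_min <= f_min or f_min <= g_min
```

After augmentation the arguments of g are contained in those of f, so the sets are related by inclusion. After minimization {d, p} and {b} are incomparable, which illustrates the slide's claim that augmentation is the better choice here.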
