Graph Indexing: Tree + Δ ≥ Graph

25
University of Illinois at University of Illinois at Urbana-Champaign Urbana-Champaign Graph Indexing: Tree + Graph Indexing: Tree + Δ Δ Graph Graph Peixiang Zhao Jeffrey Xu Yu Peixiang Zhao Jeffrey Xu Yu Philip S. Yu Philip S. Yu CS@UIUC SEEM@CUHK CS@UIUC SEEM@CUHK IBM T. J. Watson Research Center IBM T. J. Watson Research Center September 12 September 12 th th , 2007 , 2007 VLDB’07 Vienna, Austria VLDB’07 Vienna, Austria

description

Peixiang Zhao Jeffrey Xu Yu Philip S. Yu CS@UIUC SEEM@CUHK IBM T. J. Watson Research Center September 12 th , 2007. Graph Indexing: Tree + Δ ≥ Graph. VLDB’07 Vienna, Austria. Synopsis. Introduction Graph Containment Query Algorithmic Framework - PowerPoint PPT Presentation

Transcript of Graph Indexing: Tree + Δ ≥ Graph

Page 1: Graph Indexing: Tree +  Δ  ≥ Graph

University of Illinois at Urbana-ChampaignUniversity of Illinois at Urbana-Champaign

Graph Indexing: Tree + Graph Indexing: Tree + ΔΔ ≥ Graph ≥ Graph

Peixiang Zhao Jeffrey Xu Yu Philip S. YuPeixiang Zhao Jeffrey Xu Yu Philip S. Yu CS@UIUC SEEM@CUHK IBM T. J. Watson Research CenterCS@UIUC SEEM@CUHK IBM T. J. Watson Research Center

September 12September 12thth, 2007, 2007

VLDB’07 Vienna, AustriaVLDB’07 Vienna, Austria

Page 2: Graph Indexing: Tree +  Δ  ≥ Graph

Sept. 12Sept. 12thth, 2007, 2007VLDB’07 Vienna, AustriaVLDB’07 Vienna, Austria 22 of 25of 25

SynopsisSynopsis• IntroductionIntroduction

• Graph Containment Query• Algorithmic Framework

• Related WorkRelated Work• Tree + Tree + ΔΔ

• Indexability of frequent Trees• Discriminative graph feature selection: Δ

• Experimental StudyExperimental Study• ConclusionConclusion

Page 3: Graph Indexing: Tree +  Δ  ≥ Graph

Sept. 12Sept. 12thth, 2007, 2007VLDB’07 Vienna, AustriaVLDB’07 Vienna, Austria 33 of 25of 25

IntroductionIntroduction

• GraphGraph is a mathematical construct and a general data structure is a mathematical construct and a general data structure

representing relations among entitiesrepresenting relations among entities

• The emergence and the dominance of graphs asks for effective The emergence and the dominance of graphs asks for effective

graph data management and mining tools so that users can graph data management and mining tools so that users can

organize, access, and analyze graph data efficientlyorganize, access, and analyze graph data efficiently• Structural Pattern Mining: Given a graph database, what are the

potentially interesting structural patterns and how can we find them?

• Graph Indexing and Search: How can we index graphs and perform

searching, either exactly or approximately, in large graph databases?

Page 4: Graph Indexing: Tree +  Δ  ≥ Graph

Sept. 12Sept. 12thth, 2007, 2007VLDB’07 Vienna, AustriaVLDB’07 Vienna, Austria 44 of 25of 25

IntroductionIntroduction• Graph Containment QueryGraph Containment Query

• Given a graph database G = {g1, g2, …, gN} and a query graph q, find the set

• NP, since subgraph-isomorphism checking is NP-Complete

• Infeasible to check subgraph isomorphism sequentially for every gi in G, especially challenging when graphs in G are large, or G is large and diverse

• Graph indexing!

i i isup( q ) { g | q g ,g G }

Page 5: Graph Indexing: Tree +  Δ  ≥ Graph

Sept. 12Sept. 12thth, 2007, 2007VLDB’07 Vienna, AustriaVLDB’07 Vienna, Austria 55 of 25of 25

Graph Indexing: Algorithmic FrameworkGraph Indexing: Algorithmic Framework

• Index constructionIndex construction generates the index feature set generates the index feature set FF from the from the graph database graph database GG. For each feature . For each feature ff, , supsup((ff) is maintained) is maintained

• Query processingQuery processing is performed in a is performed in a filtering-verification filtering-verification fashion:fashion:• The filtering phase uses indexing features contained in q to compute the

candidate answer set

Every graph in Cq contains all q's indexed features. Therefore,

the query answer set, sup(q), is a subset of Cq

• The verification phase checks subgraph isomorphism for every graph in Cq. False positives are pruned and the true answer set sup(q) is returned

Page 6: Graph Indexing: Tree +  Δ  ≥ Graph

Sept. 12Sept. 12thth, 2007, 2007VLDB’07 Vienna, AustriaVLDB’07 Vienna, Austria 66 of 25of 25

Query Cost ModelQuery Cost Model• The cost of processing a graph containment query The cost of processing a graph containment query q q upon upon GG, ,

denoted denoted CC, can be modeled as below, can be modeled as below

• Cf : the filtering cost

• Cv : the verification cost (NP-Complete)

• AnalysisAnalysis1. The key issue to improve query performance is to minimize |Cq|

2. The indexing feature set F is quite relevant to Cf and |Cq|

3. Index construction performance: the feature selection cost Cfs to

construct F from among G

Page 7: Graph Indexing: Tree +  Δ  ≥ Graph

Sept. 12Sept. 12thth, 2007, 2007VLDB’07 Vienna, AustriaVLDB’07 Vienna, Austria 77 of 25of 25

Related WorkRelated Work• PathPath-based Indexing approach-based Indexing approach

• All existing paths up to a certain length lp are enumerated as indexing features

– Index can be constructed efficiently– Index size is quite large when lp is not small

– Limited pruning power, mainly because the structural information exhibited in graphs is lost when breaking graphs into paths

• GraphGrep (PODS’02)

• GraphGraph-based Indexing approach-based Indexing approach• Subgraphs of G with different characteristics are selected as indexing

features– A costly index construction process– Compact index structure– Great pruning power, since structural information of graph is well-preserved

• gIndex (SIGMOD’04, PODS’05), C-Tree (ICDE’06), GString (ICDE’07), GDIndex (ICDE’07), FG-Index (SIGMOD’07)

Page 8: Graph Indexing: Tree +  Δ  ≥ Graph

Sept. 12Sept. 12thth, 2007, 2007VLDB’07 Vienna, AustriaVLDB’07 Vienna, Austria 88 of 25of 25

An alternative approach: (An alternative approach: (Tree + Tree + ΔΔ))

• TreeTree--based Graph Indexingbased Graph Indexing• Tree: Better indexability in comparison with path and graph

– The majority of frequent graph-features of G are usually tree-features indeed

– Frequent tree-features and graph-features share similar distributions and

frequent tree-features have similar pruning power like graph-features

– tree mining can be done much more efficiently than graph mining on G

• Δ : On-demand select a small number of discriminative graph-features

without conducting costly graph mining beforehand

• Orders of magnitude smaller in index size, but performs much better

than existing approaches in indexing construction and query processing

Page 9: Graph Indexing: Tree +  Δ  ≥ Graph

Sept. 12Sept. 12thth, 2007, 2007VLDB’07 Vienna, AustriaVLDB’07 Vienna, Austria 99 of 25of 25

Indexability of Path, Tree and GraphIndexability of Path, Tree and Graph• Frequent features (paths, trees, graphs) expose intrinsic

characteristics of a graph database, G. They are representatives to discriminate between different groups of graphs in a graph database

• Which one should we index? Path, Tree or Graph?1. The frequent feature set size: | F |

2. The feature selection cost: Cfs

3. the candidate answer set size: |Cq|

Page 10: Graph Indexing: Tree +  Δ  ≥ Graph

Sept. 12Sept. 12thth, 2007, 2007VLDB’07 Vienna, AustriaVLDB’07 Vienna, Austria 1010 of 25of 25

The Frequent Feature Set Size:The Frequent Feature Set Size: | FF |• Evidences:Evidences:

• Among all frequent graph-features of G, a majority of them are trees indeed

– All subtrees of a frequent graph are frequent

– There is little chance that subtrees of frequent graph g coincide with those of frequent graph g’, due to the structural diversity and label variety

• Frequent paths share a very small portion, because a path-feature has a simple linear structure, which has little variety in structural complexity

• In terms of feature distributions, tree-features and graph-features share a very similar distribution w.r.t. feature size

Page 11: Graph Indexing: Tree +  Δ  ≥ Graph

Sept. 12Sept. 12thth, 2007, 2007VLDB’07 Vienna, AustriaVLDB’07 Vienna, Austria 1111 of 25of 25

Experiments on Two Datasets Experiments on Two Datasets w.r.t.w.r.t. | FF |

The Real Dataset

The Synthetic Dataset

Page 12: Graph Indexing: Tree +  Δ  ≥ Graph

Sept. 12Sept. 12thth, 2007, 2007VLDB’07 Vienna, AustriaVLDB’07 Vienna, Austria 1212 of 25of 25

The feature selection cost: The feature selection cost: CCfsfs

• Given a graph database, G, and a minimum support threshold, σ, to discover the frequent feature set F (FP / FT / FG ) from G

• Tree• A good compromise between

– the more expressive, but computationally harder general graph – the faster but less expressive path

• Specialization of general graph avoiding undesirable theoretical properties and algorithmic complexity incurred by graph

PathPath TreeTree GraphGraph

IsomorphismIsomorphism O(O(nn)) O(O(nn)) P or NPC (P or NPC (??))

Sub-IsomorphismSub-Isomorphism O(O(n + mn + m)) O(O(mm3/23/2nn/log/logmm)) NP-CompleteNP-Complete

Page 13: Graph Indexing: Tree +  Δ  ≥ Graph

Sept. 12Sept. 12thth, 2007, 2007VLDB’07 Vienna, AustriaVLDB’07 Vienna, Austria 1313 of 25of 25

The Candidate Answer Set Size: The Candidate Answer Set Size: |C|Cqq||

• We define the pruning power power(f) of a frequent feature f as

• The pruning power of a frequent feature set S = {f1, f2 , …, fn}

• Theorem 1: Given a frequent graph-feature g, and let its frequent sub-

tree set be T (g) = {t1, t2 , …, tn}. Then, power(g) ≥ power(T (g))

• Theorem 2: Given a frequent tree-feature t, and let its frequent sub-

path set be P (t) = {p1, p2 , …, pm}. Then, power(t) ≥ power(P (t))

Page 14: Graph Indexing: Tree +  Δ  ≥ Graph

Sept. 12Sept. 12thth, 2007, 2007VLDB’07 Vienna, AustriaVLDB’07 Vienna, Austria 1414 of 25of 25

Pruning PowerPruning Power• The pruning power of all frequent subtree features, T (g), of a

frequent graph-feature g can be similar to the pruning power of g

• There is a big gap between the pruning power of a graph-feature g and that of all its frequent sub-path features, P(g)

The Real Dataset

Page 15: Graph Indexing: Tree +  Δ  ≥ Graph

Sept. 12Sept. 12thth, 2007, 2007VLDB’07 Vienna, AustriaVLDB’07 Vienna, Austria 1515 of 25of 25

Indexability of Path, Tree and GraphIndexability of Path, Tree and Graph

• It is feasible and effective to select FT , instead of FG, as

indexing features for the graph containment query problem• The frequent tree-feature set, FT , dominates FG

• Discovering frequent tree-features from G can be done much more efficiently than mining frequent general graph-features

• FT can contribute similar pruning power like that provided by FG

Page 16: Graph Indexing: Tree +  Δ  ≥ Graph

Sept. 12Sept. 12thth, 2007, 2007VLDB’07 Vienna, AustriaVLDB’07 Vienna, Austria 1616 of 25of 25

Discriminative Graph Features Discriminative Graph Features • Consider a query graph q which contains a subgraph g

• If power(T (g)) ≈ power(g), there is no need to index the graph-feature g, because its subtrees jointly have the similar pruning power

• if power(g) >> power(T (g)), it will be necessary to select g as an index feature because g is more discriminative than T (g), in terms of pruning

• Discriminative graph-features (w.r.t. its subtree-features, controlled by ε0) are selected from queries on-demand, without mining the whole set of frequent graph-features from G beforehand• Discriminative graph-features are used as additional indexing features,

denoted Δ, which can also be reused further to answer subsequent queries

ΔΔ

Page 17: Graph Indexing: Tree +  Δ  ≥ Graph

Sept. 12Sept. 12thth, 2007, 2007VLDB’07 Vienna, AustriaVLDB’07 Vienna, Austria 1717 of 25of 25

Discriminative Graph SelectionDiscriminative Graph Selection• Given two graphs g, g’ q , where g g’

• If the gap between power(g’) and power(g) is large enough, g’ will be reclaimed from G;

• Otherwise, g is discriminative enough for pruning purpose, and there is no need to reclaim g’ in the presence of g

• Approximate the discriminative computation between g’ and g, in the presence of our knowledge on frequent tree-features discovered

Page 18: Graph Indexing: Tree +  Δ  ≥ Graph

Sept. 12Sept. 12thth, 2007, 2007VLDB’07 Vienna, AustriaVLDB’07 Vienna, Austria 1818 of 25of 25

Discriminative Graph SelectionDiscriminative Graph Selection• The occurrence probability of g in the graph database, G

• the conditional occurrence probability of g’, w.r.t. g, models the probability to select g’ from G in the presence of g

• The upper and lower bound of Pr(g’|g)

• The conditional occurrence probability of Pr(g’|g), is solely upper-bounded by T (g’)

Page 19: Graph Indexing: Tree +  Δ  ≥ Graph

Sept. 12Sept. 12thth, 2007, 2007VLDB’07 Vienna, AustriaVLDB’07 Vienna, Austria 1919 of 25of 25

Experimental StudiesExperimental Studies• The Real DatasetThe Real Dataset

• The AIDS antiviral screen dataset from Developmental Theroapeutics Program in NCI/NIH

• 42390 compounds retrieved from DTP's Drug Information System

• 63 kinds of atoms in this dataset, most of which are C, H, O, S, etc.

• Three kinds of bonds are popular in these compounds: single-bond, double-bond and aromatic-bond

• On average, compounds in the dataset has 43 vertices and 45 edges.

• The graph of maximum size has 221 vertices and 234 edges

Page 20: Graph Indexing: Tree +  Δ  ≥ Graph

Sept. 12Sept. 12thth, 2007, 2007VLDB’07 Vienna, AustriaVLDB’07 Vienna, Austria 2020 of 25of 25

Experimental StudiesExperimental Studies• The real dataset: index constructionThe real dataset: index construction

Page 21: Graph Indexing: Tree +  Δ  ≥ Graph

Sept. 12Sept. 12thth, 2007, 2007VLDB’07 Vienna, AustriaVLDB’07 Vienna, Austria 2121 of 25of 25

Experimental StudiesExperimental Studies• The real dataset: false positive ratio (|The real dataset: false positive ratio (|CqCq|/|sup(|/|sup(qq)|) )|) w.r.t.w.r.t. the the

database size (= 1,000; 2,000; 4,000; 8,000; 10,000)database size (= 1,000; 2,000; 4,000; 8,000; 10,000)

Page 22: Graph Indexing: Tree +  Δ  ≥ Graph

Sept. 12Sept. 12thth, 2007, 2007VLDB’07 Vienna, AustriaVLDB’07 Vienna, Austria 2222 of 25of 25

Experimental StudiesExperimental Studies

• The Synthetic DatasetThe Synthetic Dataset• Generated by a widely-used graph generator, which is

controlled by the following parameters:

Page 23: Graph Indexing: Tree +  Δ  ≥ Graph

Sept. 12Sept. 12thth, 2007, 2007VLDB’07 Vienna, AustriaVLDB’07 Vienna, Austria 2323 of 25of 25

Experimental StudiesExperimental Studies

• The synthetic dataset: false positive ratioThe synthetic dataset: false positive ratio

Page 24: Graph Indexing: Tree +  Δ  ≥ Graph

Sept. 12Sept. 12thth, 2007, 2007VLDB’07 Vienna, AustriaVLDB’07 Vienna, Austria 2424 of 25of 25

ConclusionConclusion• Graph indexing plays a critical role in graph containment

query processing on large graph databases

• Path-based and graph-based indexing approaches suffer from overly large index size, substantial index construction overhead and expensive query processing cost

• (Tree+Δ) is an effective and efficient graph indexing feature to answer graph containment queries• (Tree+Δ) holds a compact index structure, achieves good performance

in index construction and most importantly, provides satisfactory query performance for answering graph containment queries over large graph databases

Page 25: Graph Indexing: Tree +  Δ  ≥ Graph

University of Illinois at Urbana-ChampaignUniversity of Illinois at Urbana-Champaign

Thank youThank you

VLDB’07 Vienna, AustriaVLDB’07 Vienna, Austria