Database Implementation of a Model-Free Classifier
Konstantinos Morfonios
ADBIS 2007
University of Athens
Introduction
LOCUS
Parallel Execution
Experimental Evaluation
Conclusions & Future Work
Motivation
Introduction
Classification
Given a feature vector x = <x1, x2, …, xD>, predict its class ω = f(x) from a training set of labeled tuples:
<x1,1, x1,2, …, x1,D, ω1>
<x2,1, x2,2, …, x2,D, ω2>
<x3,1, x3,2, …, x3,D, ω1>
<x4,1, x4,2, …, x4,D, ω1>
…
Two families of classifiers:
• “Lazy” (e.g. Nearest Neighbors)
• “Eager” (e.g. Decision Trees)
Eager classifiers:
(+) Faster decisions
(-) Large/complex datasets
(-) Dynamic datasets
(-) Dynamic models
Motivation
Challenges:
• Large/complex datasets
• Dynamic datasets
• Dynamic models
Direction: Lazy (model-free), disk-based
Nearest Neighbors
Suffers from the “curse of dimensionality”:
• Not reliable [Beyer et al., ICDT 1999]
• Not indexable [Shaft et al., ICDT 2005]
LOCUS (Lazy Optimal Classifier of Unlimited Scalability)
• Category? Lazy
• Scaling? Based on simple SQL queries
• Accuracy? Converges to the optimal Bayes classifier
• Other features? Parallelizable
LOCUS
Example
x = <f1, f2>, with f1 ∈ [0, 20] and f2 ∈ [0, 10]; two classes, ω1 and ω2.
[Figure: training points of classes ω1 and ω2 on a grid; horizontal axis f1 (0 to 20), vertical axis f2 (0 to 10)]
Ideally: Dense space. The class of <7, 4> can be read off the training points at that position.
Reality:
• Many features
• Large domains
Hence a sparse space: ω(<7, 4>) = ?
3-NN: the three nearest neighbors of <7, 4> give ω1: 2, ω2: 1, so ω(<7, 4>) = ω1.
LOCUS: the training points in a neighborhood around <7, 4> give ω1: 7, ω2: 3, so ω(<7, 4>) = ω1.
Disk-based implementation
The training set is stored in a relation R(f1, f2, ω). To classify <x1, x2>, count the tuples of each class inside a box of size 2δ1 × 2δ2 centered at <x1, x2>:
SELECT ω, count(*)
FROM R
WHERE f1≥x1-δ1 AND f1≤x1+δ1 AND f2≥x2-δ2 AND f2≤x2+δ2
GROUP BY ω
For <7, 4> the query returns ω1: 7, ω2: 3, so ω(<7, 4>) = ω1.
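The range-count query on this slide can be tried end to end with an in-memory SQLite database. This is a minimal sketch under stated assumptions, not the authors' implementation: the column name `w` (standing in for ω), the helper `locus_classify`, the majority-vote step, and the sample points are all made up for illustration.

```python
import sqlite3

def locus_classify(conn, x1, x2, d1, d2):
    """Return (majority class, per-class counts) for query point <x1, x2>.

    Counts training tuples inside the box [x1-d1, x1+d1] x [x2-d2, x2+d2].
    Returns (None, {}) when the box contains no training tuples.
    """
    rows = conn.execute(
        "SELECT w, count(*) FROM R "
        "WHERE f1 >= ? AND f1 <= ? AND f2 >= ? AND f2 <= ? "
        "GROUP BY w",
        (x1 - d1, x1 + d1, x2 - d2, x2 + d2),
    ).fetchall()
    counts = dict(rows)
    if not counts:
        return None, counts
    return max(counts, key=counts.get), counts

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE R (f1 REAL, f2 REAL, w TEXT)")
conn.executemany("INSERT INTO R VALUES (?, ?, ?)", [
    (6, 3, "w1"), (7, 5, "w1"), (8, 4, "w1"), (9, 6, "w2"),  # inside the box
    (15, 9, "w2"), (1, 1, "w1"),                              # outside the box
])

# Hypothetical query point <7, 4> with half-widths delta1 = delta2 = 2.
label, counts = locus_classify(conn, 7, 4, d1=2, d2=2)
print(label, counts)
```

Nothing is built ahead of time here: each decision is a single aggregate query over the stored training tuples, which is what makes the approach lazy.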
What if R is large?
Classical optimization techniques for a well-known type of aggregate queries:
• Indexing
• Presorting
• Materialized views
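As an illustration of the indexing option, the sketch below adds a composite index to a SQLite table and checks that the range-count query returns the same counts before and after. The table layout, the column name `w` (for ω), the index name, and the random data are assumptions for this example; whether the optimizer actually uses the index depends on the database engine, so the plan is printed rather than assumed.

```python
import random
import sqlite3

QUERY = (
    "SELECT w, count(*) FROM R "
    "WHERE f1 >= ? AND f1 <= ? AND f2 >= ? AND f2 <= ? "
    "GROUP BY w ORDER BY w"
)
BOX = (5, 9, 2, 6)  # f1 in [5, 9], f2 in [2, 6]

random.seed(0)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE R (f1 REAL, f2 REAL, w TEXT)")
conn.executemany(
    "INSERT INTO R VALUES (?, ?, ?)",
    [(random.uniform(0, 20), random.uniform(0, 10), random.choice(["w1", "w2"]))
     for _ in range(10_000)],
)

before = conn.execute(QUERY, BOX).fetchall()

# Composite index on the feature columns, with the class label appended so
# the counts can potentially be answered from the index alone.
conn.execute("CREATE INDEX idx_r_f1_f2_w ON R (f1, f2, w)")
after = conn.execute(QUERY, BOX).fetchall()

# The optimizer decides whether to use the index; print its plan to check.
for row in conn.execute("EXPLAIN QUERY PLAN " + QUERY, BOX):
    print(row)
```

The index changes only how the answer is computed, never the answer itself, so it can be added or dropped as the dataset grows without touching the classifier.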
Method reliability?
LOCUS converges to the optimal Bayes classifier as the size of the dataset increases (proof in the paper)
What if a feature, say f2, is categorical (e.g. sex)?
Replace the range condition on f2 with an equality condition:
SELECT ω, count(*)
FROM R
WHERE f1≥x1-δ1 AND f1≤x1+δ1 AND f2=x2
GROUP BY ω
Not a problem, since generally in practice:
• Datasets combine categorical and numeric features
• Categorical features have small domains
Hence, they do not contribute to sparsity
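The mixed numeric/categorical case can be sketched the same way: a range condition on the numeric feature and an equality condition on the categorical one. The column name `w` (for ω), the helper `count_classes_mixed`, and the toy rows are hypothetical and serve only to show the shape of the query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# f1 is numeric, f2 is categorical (e.g. sex), w is the class label.
conn.execute("CREATE TABLE R (f1 REAL, f2 TEXT, w TEXT)")
conn.executemany("INSERT INTO R VALUES (?, ?, ?)", [
    (30, "F", "w1"), (32, "F", "w1"), (34, "F", "w2"),
    (31, "M", "w2"), (33, "M", "w2"), (60, "F", "w2"),
])

def count_classes_mixed(conn, x1, x2, d1):
    """Per-class counts with a range on f1 and exact match on f2."""
    rows = conn.execute(
        "SELECT w, count(*) FROM R "
        "WHERE f1 >= ? AND f1 <= ? AND f2 = ? "
        "GROUP BY w",
        (x1 - d1, x1 + d1, x2),
    ).fetchall()
    return dict(rows)

counts_f = count_classes_mixed(conn, 32, "F", d1=3)
counts_m = count_classes_mixed(conn, 32, "M", d1=3)
print(counts_f, counts_m)
```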
Parallel Execution
R is partitioned horizontally: R = R1 ∪ R2 ∪ R3 ∪ R4
The same SELECT query runs on each partition in parallel, producing four partial results:
• ω1: 5, ω2: 2
• ω1: 7, ω2: 1
• ω1: 5, ω2: 1
• ω1: 6, ω2: 0
Count is a distributive function, so the main server only adds up the partial counts: ω1: 23, ω2: 4
Properties:
• Small network traffic
• Load balancing
• Lightweight operations on the main server
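Because count is distributive, the merge step on the main server is just an addition of per-class counts. The sketch below uses the four partial results shown on the slide; the class labels `w1`/`w2` (for ω1/ω2), the helper `merge_counts`, and the final majority vote are illustrative assumptions, and the partial counts are hard-coded rather than fetched from real partitions.

```python
from collections import Counter

# Partial per-class counts, one dict per partition of R.
partial_counts = [
    {"w1": 5, "w2": 2},
    {"w1": 7, "w2": 1},
    {"w1": 5, "w2": 1},
    {"w1": 6, "w2": 0},
]

def merge_counts(partials):
    """Add up per-class counts from all partitions."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts, it does not overwrite
    return total

total = merge_counts(partial_counts)
label = max(total, key=total.get)
print(dict(total), label)
```

Each partition ships back only one small row per class, which is why the network traffic and the work on the main server stay small regardless of the size of R.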
Experimental Evaluation
LOCUS vs DTs and NNs (Weka)
Synthetic datasets
• Ten functions [Agrawal et al., IEEE TKDE 1993]
• D = 9
• N ∈ [5·10^3, 5·10^6]
Real-world datasets
• UCI Repository
[Figure: Classification error rate (synthetic datasets, N = 5·10^4). x-axis: Function (1 to 10); y-axis: Error rate (%); series: LOCUS, DT]
[Figure: Effect of dataset size on classification error rate of LOCUS (synthetic datasets, N ∈ [5·10^3, 5·10^6]). x-axis: Function (1 to 10); y-axis: Error rate (%); series: N = 5000, 50000, 500000, 5000000]
[Figure: Effect of dataset size on time scalability of LOCUS (synthetic datasets, N ∈ [5·10^3, 5·10^6]). x-axis: Training Set Size (1.E+03 to 1.E+07); y-axis: Average Decision Time (msec)]
[Figure: Classification error rate (real-world datasets). x-axis: Patient, Glass, Liver, BreastCancer, Diabetes, Letters, CovType 50000; y-axis: Error rate (%); series: LOCUS, DT]
[Figure: Effect of dataset size on classification error rate (dataset CovType, N ∈ [5·10^3, 5·10^5]). x-axis: DT, LOCUS; y-axis: Error rate (%); series: N = 5000, 50000, 500000]
Conclusions & Future Work
LOCUS
• Lazy (complex/dynamic datasets and models)
• Efficient (based on simple SQL queries)
• Reliable (converging to optimal)
• Parallelizable
Future work
• Similar techniques for feature selection and regression
• Implementation of a parallel version
Questions?
Thank you!