Transcript of I/O Data Engineering

Page 1: I/O Data Engineering

Kyriakos C. Chatzidimitriou - http://kyrcha.info

I/O Data Engineering: “Garbage in, garbage out”

November 2016

Page 2: I/O Data Engineering

Preamble

• This work is a fusion of ideas and material (slides, text, images, etc.) that I found on the internet or had and wrote on my own, regarding the area of input/output data engineering in data mining and machine learning.
• Attribution:
  • The slides are based on the accompanying PowerPoint slides of Data Mining: Practical Machine Learning Tools and Techniques, Witten et al., 4th ed., 2017, in particular Chapter 8, available at: http://www.cs.waikato.ac.nz/ml/weka/book.html
  • Slides from the Machine Learning MOOC by Prof. Andrew Ng: http://ml-class.org (PCA parts)
  • Slides from the Learning from Data MOOC by Prof. Yaser S. Abu-Mostafa and its support site: http://work.caltech.edu/telecourse.html (the digits dataset and the non-linear transformation)
  • Slides from the Pattern Recognition class by Prof. Andreas L. Symeonidis, ECE department, Aristotle University of Thessaloniki
  • A Tutorial on Principal Components Analysis by Lindsay I. Smith, February 2002: http://www.cs.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf
  • Introduction to Data Mining, Tan et al., 2006: http://www-users.cs.umn.edu/~kumar/dmbook/

Page 3: I/O Data Engineering

Successful data mining: just apply a learner! No?

• Select the learning algorithm
  • Scheme/parameter selection/tuning
  • Treat the selection process as part of the learning process to avoid optimistic performance estimates
• Estimate the expected true performance of a learning scheme
  • Split
  • Cross-validation
• Data engineering
  • Engineering the input data into a form suitable for the chosen learning scheme
    • Data engineering to make learning possible or easier
  • Engineering the output to make it more effective
    • Converting multi-class problems into two-class ones
    • Re-calibrating probability estimates

Page 4: I/O Data Engineering

Data Transformations – Outline

Topics covered:
1. Attribute Selection
2. Discretizing Numeric Attributes
3. Data Projection
4. Data Cleansing
5. Transforming Multiple Classes to Binary Ones

Page 5: I/O Data Engineering

It is a jungle out there!

[Word cloud of related terms: Data Transformations, Feature Selection, Feature Engineering, Data Engineering, Dimensionality Reduction, Principal Components Analysis, Pre- and post-processing, Data Cleansing, ETL, Feature Learning, Wrapper methods, Filter methods, Independent Component Analysis, Outlier Detection]

Page 6: I/O Data Engineering

Attribute/Feature Selection

Removing attributes that are not useful to the task at hand

Page 7: I/O Data Engineering

Motivation

• Experiments have shown that adding useless attributes causes the performance of learning schemes (decision trees and rules, linear regression, instance-based learners) to deteriorate
• How adding a random binary variable affects:
  • Divide-and-conquer tree learners and separate-and-conquer rule learners
    • At depths where only a small amount of data is available for picking a split, the random attribute will look good by chance
    • C4.5: 5-10% deterioration in performance for 1 random variable
  • Instance-based learners
    • Susceptible as well; reason: they work in local neighborhoods
    • The number of training instances needed to produce a predetermined level of performance for instance-based learning increases exponentially with the number of irrelevant attributes present
  • Naive Bayes
    • Not susceptible: it assumes by design that all attributes are independent of one another, an assumption that is just right for random “distracter” attributes
    • On the other hand, it pays a heavy price in other ways: its operation is damaged by adding redundant attributes, because independence is “thrown out of the window”
• Conclusion: Relevant attributes can also be harmful if they mislead the learning algorithm

Page 8: I/O Data Engineering

Advantages of Attribute Selection

• Improves the performance of learning algorithms
• Speeds them up
  • Although this may be outweighed by the computation involved in attribute selection itself
• Yields a more compact, more easily interpretable representation

Page 9: I/O Data Engineering

Attribute Selection Types

• Manually
  • The best way
  • Requires a deep understanding of the learning problem and what the attributes actually mean
• Filter method – Scheme-Independent Attribute Selection
  • Make an independent assessment based on general characteristics of the data
• Wrapper method – Scheme-Dependent Attribute Selection
  • Evaluate the subset using the machine learning algorithm that will ultimately be employed for learning

Page 10: I/O Data Engineering

Scheme-Independent Attribute Selection

• aka the filter approach to attribute selection: assess attributes based on general characteristics of the data
• Attributes are selected in a manner that is independent of the target machine learning scheme
• One method: find the smallest subset of attributes that separates the data
• Another method: use a fast learning scheme, different from the target learning scheme, to find relevant attributes
  • E.g., use attributes selected by C4.5, or the coefficients of a linear model, possibly applied recursively (recursive feature elimination)

(Image credit: By Lucien Mousin - Own work, GFDL, https://commons.wikimedia.org/w/index.php?curid=37776286)

Page 11: I/O Data Engineering

Recursive Feature Elimination

[Diagram: a dataset with features F1, F2, F3 is fed to the learning algorithm, which produces the ranking F2 F1 F3; the lowest-ranked feature F3 is dropped, the algorithm is re-run on F1, F2 and produces the ranking F1 F2; the final ranking is F1 F2 F3.]

The learning algorithm should produce a ranking, e.g. a linear SVM, where ranks are based on the size of the coefficients.
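This procedure is easy to reproduce with off-the-shelf tools; a minimal sketch (my own, not from the slides) using scikit-learn's RFE with a linear SVM on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.svm import LinearSVC

# Synthetic data: 5 features, of which only 2 are informative
X, y = make_classification(n_samples=200, n_features=5, n_informative=2,
                           n_redundant=0, random_state=0)

# Recursively drop the feature with the smallest |coefficient| in a linear SVM
selector = RFE(estimator=LinearSVC(), n_features_to_select=1, step=1)
selector.fit(X, y)

# ranking_ == 1 marks the feature kept until the very end (the best one)
print("Feature ranking (1 = best):", selector.ranking_)
```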

Page 12: I/O Data Engineering

Correlation-based Feature Selection (CFS)

• Correlation between attributes measured by symmetric uncertainty:

  $U(A,B) = 2\,\frac{H(A) + H(B) - H(A,B)}{H(A) + H(B)} \in [0,1]$

  where $H$ is the entropy function:

  $H(X) = -\sum_{S_X} p(x)\log(p(x))$

  $H(X,Y) = -\sum_{S_X}\sum_{S_Y} p(x,y)\log(p(x,y))$

• Goodness of a subset of attributes measured by

  $\frac{\sum_j U(A_j, C)}{\sqrt{\sum_i \sum_j U(A_i, A_j)}}$

  where $C$ is the class attribute, breaking ties in favour of smaller subsets.

Page 13: I/O Data Engineering

The Weather Data

Outlook Temperature Humidity Windy Play

Sunny Hot High False No

Sunny Hot High True No

Overcast Hot High False Yes

Rainy Mild High False Yes

Rainy Cool Normal False Yes

Rainy Cool Normal True No

Overcast Cool Normal True Yes

Sunny Mild High False No

Sunny Cool Normal False Yes

Rainy Mild Normal False Yes

Sunny Mild Normal True Yes

Overcast Mild High True Yes

Overcast Hot Normal False Yes

Rainy Mild High True No

Page 14: I/O Data Engineering

Symmetric Uncertainty Example Calculation (logarithms base 2)

• H(Outlook) = -5/14 log(5/14) - 4/14 log(4/14) - 5/14 log(5/14) = 1.577
• H(Temperature) = -4/14 log(4/14) - 6/14 log(6/14) - 4/14 log(4/14) = 1.556
• H(Outlook, Temperature) =
  - p(s,h) log p(s,h) - p(s,m) log p(s,m) - p(s,c) log p(s,c)
  - p(o,h) log p(o,h) - p(o,m) log p(o,m) - p(o,c) log p(o,c)
  - p(r,h) log p(r,h) - p(r,m) log p(r,m) - p(r,c) log p(r,c)
  = -2/14 log(2/14) - 2/14 log(2/14) - 1/14 log(1/14)
    - 2/14 log(2/14) - 1/14 log(1/14) - 1/14 log(1/14)
    - 0/14 log(0/14) - 3/14 log(3/14) - 2/14 log(2/14) = 2.896
• U(Outlook, Temperature) = 2*(1.577 + 1.556 - 2.896)/(1.577 + 1.556) ≈ 0.151
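These numbers are easy to verify programmatically; a minimal Python sketch (my own) that computes symmetric uncertainty from the weather data above:

```python
import numpy as np
from collections import Counter

def entropy(symbols):
    """Shannon entropy in bits of a sequence of symbols."""
    counts = np.array(list(Counter(symbols).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def symmetric_uncertainty(a, b):
    """U(A,B) = 2*(H(A) + H(B) - H(A,B)) / (H(A) + H(B))."""
    ha, hb = entropy(a), entropy(b)
    hab = entropy(list(zip(a, b)))   # joint entropy over value pairs
    return 2 * (ha + hb - hab) / (ha + hb)

outlook = ["sunny", "sunny", "overcast", "rainy", "rainy", "rainy", "overcast",
           "sunny", "sunny", "rainy", "sunny", "overcast", "overcast", "rainy"]
temp = ["hot", "hot", "hot", "mild", "cool", "cool", "cool",
        "mild", "cool", "mild", "mild", "mild", "hot", "mild"]

print(symmetric_uncertainty(outlook, temp))   # ~0.151, as calculated by hand
```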

Page 15: I/O Data Engineering

Attribute subsets for weather data

The number of possible attribute subsets increases exponentially with the number of attributes, making exhaustive search impractical on all but the simplest problems.

Page 16: I/O Data Engineering

Scheme-specific selection

• Wrapper approach to attribute selection: attributes are selected with the target scheme in the loop
• Implement a “wrapper” around the learning scheme
  • Evaluation criterion: cross-validation performance
• Time consuming in general
  • Greedy approach with k attributes: evaluation time multiplied by a factor of k² in the worst case
  • With prior ranking of attributes: complexity linear in k
• Can use a significance test (paired t-test) to stop cross-validation for a subset early if it is unlikely to “win” (race search)
• Can be used with forward selection, backward selection, prior ranking, or special-purpose schemata search
• Efficient for decision tables and Naïve Bayes (Selective Naïve Bayes)
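A minimal sketch of the wrapper idea (my own illustration, assuming scikit-learn): greedy forward selection with cross-validated accuracy as the evaluation criterion:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

def forward_selection(estimator, X, y, cv=10):
    """Greedy wrapper: repeatedly add the attribute that most improves
    cross-validated performance; stop when no candidate helps."""
    remaining = list(range(X.shape[1]))
    selected, best_score = [], -np.inf
    while remaining:
        results = [(cross_val_score(estimator, X[:, selected + [f]], y, cv=cv).mean(), f)
                   for f in remaining]
        score, best_f = max(results)
        if score <= best_score:      # no attribute improves the estimate: stop
            break
        best_score = score
        selected.append(best_f)
        remaining.remove(best_f)
    return selected, best_score

# Usage, e.g.: subset, acc = forward_selection(GaussianNB(), X, y)
```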

(Image credit: By Lastdreamer7591 - Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=37208688)

Page 17: I/O Data Engineering

Selective Naïve Bayes

• Uses the forward selection algorithm
  • Better able to detect a redundant attribute than backward elimination
• Uses as the quality metric of an attribute simply its performance on the training set
  • We know that training set performance is not a reliable indicator of test set performance
  • But Naïve Bayes is less likely to overfit
  • Plus, as discussed, it is robust to random variables

Page 18: I/O Data Engineering

Complexity example

• If I do 10-fold CV, I must train the algorithm 10 times = 10
• I should also repeat the 10-fold CV 10 times to obtain a more reliable estimate = 10*10
• If I have 10 features, the total search space is 2^10 = 1024 different subsets = 10*10*1024 = 102,400 training runs
• Then I should also tune the parameters of the learning algorithm…
• Or should I do that before…

Page 19: I/O Data Engineering

Searching the attribute space

• The number of attribute subsets is exponential in the number of attributes
• Common greedy approaches:
  • forward selection
  • backward elimination
• More sophisticated strategies:
  • Bidirectional search
  • Best-first search:
    • can find the optimum solution
    • does not just terminate when the performance starts to drop: it keeps a list of all attribute subsets evaluated so far, sorted in order of the performance measure, so that it can revisit an earlier configuration
  • Beam search: approximation to best-first search that keeps only a truncated list
  • Genetic algorithms

Page 20: I/O Data Engineering

Discretization

Transforming numeric attributes into discrete ones

Page 21: I/O Data Engineering

Motivation

• Essential if the task involves numeric attributes but the chosen learning scheme can only handle categorical ones
• Schemes that can handle numeric attributes often produce better results, or work faster, if the attributes are pre-discretized
• The converse situation, in which categorical attributes must be represented numerically, also occurs (although less often)

Page 22: I/O Data Engineering

Attribute discretization

• Discretization can be useful even if a learning algorithm can be run on numeric attributes directly
  • Avoids the normality assumption in Naïve Bayes and clustering
• Example of discretization we have already encountered: decision trees perform local discretization
• Global discretization can be advantageous because it is based on more data
• Apply the learner to
  • the k-valued discretized attribute, or to
  • k - 1 binary attributes that code the cut points
• The latter approach often works better when learning decision trees or rule sets

Page 23: I/O Data Engineering

Discretization: unsupervised

• Unsupervised discretization: determine intervals without knowing the class labels
  • When clustering, the only possible way!
• Two well-known strategies:
  • Equal-interval binning
  • Equal-frequency binning (also called histogram equalization)
• Unsupervised discretization is normally inferior to supervised schemes when applied in classification tasks
  • But equal-frequency binning works well with Naïve Bayes if the number of intervals is set to the square root of the size of the dataset (proportional k-interval discretization)

[Figure: the same data binned by equal interval width, equal frequency, and k-means]
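A minimal numpy sketch (my own, not from the slides) contrasting the two binning strategies on the temperature values from the weather data:

```python
import numpy as np

values = np.array([64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85])
k = 4  # number of bins

# Equal-interval binning: cut points evenly spaced over the value range
width_edges = np.linspace(values.min(), values.max(), k + 1)

# Equal-frequency binning: cut points at quantiles, so each bin
# receives (roughly) the same number of instances
freq_edges = np.quantile(values, np.linspace(0, 1, k + 1))

print("equal-width bins:", np.digitize(values, width_edges[1:-1]))
print("equal-freq  bins:", np.digitize(values, freq_edges[1:-1]))
```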

Page 24: I/O Data Engineering

Discretization: supervised

• The classic approach to supervised discretization is entropy-based
• This method builds a decision tree with pre-pruning on the attribute being discretized
  • Uses entropy as the splitting criterion
  • Uses the minimum description length (MDL) principle as the stopping criterion for pre-pruning
• Works well: still the state of the art
• To apply the MDL principle, the “theory” is
  • the splitting point (can be coded in log2[N - 1] bits)
  • plus the class distribution in each subset (a more involved expression)
• The description length is the number of bits needed for coding both the splitting point and the class distributions
• Compare description lengths before/after adding the split

Page 25: I/O Data Engineering

Example: temperature attribute

Temperature 64  65  68  69  70  71  72  72  75  75  80  81  83  85
Play        Yes No  Yes Yes Yes No  No  Yes Yes Yes No  Yes Yes No

Page 26: I/O Data Engineering

Final result

It can be shown theoretically that a cut point that minimizes the information value will never occur between two instances of the same class.

Page 27: I/O Data Engineering

Formula for the MDL stopping criterion

• Can be formulated in terms of the information gain
• Assume we have N instances
  • Original set: k classes, entropy E
  • First subset: k1 classes, entropy E1
  • Second subset: k2 classes, entropy E2
• If the information gain is greater than the expression on the right-hand side, we continue splitting:

  $\text{gain} > \frac{\log_2(N-1)}{N} + \frac{\log_2(3^k - 2) - kE + k_1 E_1 + k_2 E_2}{N}$

• Results in no discretization intervals for the temperature attribute in the weather data
  • It fails to play a role in the final decision structure
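As an illustration, a small Python sketch (my own, assuming base-2 logarithms and the Fayyad-Irani form of the criterion shown above) that checks whether a candidate split is accepted:

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (bits) of a label array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def mdl_accepts_split(y, y1, y2):
    """True if the MDL criterion allows splitting y into subsets y1 and y2."""
    N = len(y)
    E, E1, E2 = entropy(y), entropy(y1), entropy(y2)
    k, k1, k2 = len(set(y)), len(set(y1)), len(set(y2))
    gain = E - (len(y1) * E1 + len(y2) * E2) / N
    threshold = (np.log2(N - 1) + np.log2(3**k - 2) - k*E + k1*E1 + k2*E2) / N
    return gain > threshold

# The Play labels in temperature order; try splitting after the 8th instance
play = np.array("yes no yes yes yes no no yes yes yes no yes yes no".split())
print(mdl_accepts_split(play, play[:8], play[8:]))  # False: split rejected
```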

Page 28: I/O Data Engineering

Supervised discretization: other methods

• Can replace the top-down procedure by a bottom-up method
  • This bottom-up method has been applied in conjunction with the chi-squared test
  • Continue to merge intervals until they become significantly different
• Can use dynamic programming to find the optimum k-way split for a given additive criterion
  • Requires time quadratic in the number of instances
  • But can be done in linear time if the error rate is used instead of entropy
  • Error rate: count the number of errors that a discretization makes when predicting each training instance’s class, assuming that each interval receives the majority class
  • However, using the error rate is generally not a good idea when discretizing an attribute, as we will see

Page 29: I/O Data Engineering

Error-based vs. entropy-based

• Question: could the best discretization ever have two adjacent intervals with the same class?
• Wrong answer: No. For if so, we could
  • collapse the two,
  • free up an interval,
  • use it somewhere else
  • (This is what error-based discretization will do)
• Right answer: Surprisingly, yes
  • (and entropy-based discretization can do it)

Page 30: I/O Data Engineering

Error-based vs. entropy-based

A 2-class, 2-attribute problem:
• Class 1: a1 < 0.3, or a1 < 0.7 and a2 < 0.5
• Class 2: otherwise

[Figure: the best discretization of a1 and a2]

• Entropy-based discretization can detect the change of class distribution (from 100% to 50%)
• a2: no problem
• a1: the middle interval will get whatever label happens to occur most

Page 31: I/O Data Engineering

Data Projections and Dimensionality Reduction

Projecting data into a more suitable space

Page 32: I/O Data Engineering

Motivation

• Curse of dimensionality
• Visualization
• Adding new, synthetic attributes whose purpose is to present existing information in a form that is suitable for the machine learning scheme to pick up on

Page 33: I/O Data Engineering

Curse of Dimensionality

• When dimensions increase, data become increasingly sparse
• Density and distance between points, which are important criteria for clustering and outlier detection, lose their meaning
• Experiment: create 500 points and calculate the max and min distance between any pair of points as the dimensionality grows
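A small experiment along the lines of the slide (my own sketch, assuming 500 uniform random points per dimensionality): as the dimension grows, the largest and smallest pairwise distances become nearly indistinguishable.

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
for d in (2, 10, 100, 1000):
    X = rng.random((500, d))     # 500 uniform random points in [0,1]^d
    dists = pdist(X)             # all pairwise Euclidean distances
    # This relative gap shrinks towards 0 as d grows
    print(f"d={d:5d}  (max-min)/min = {(dists.max() - dists.min()) / dists.min():.2f}")
```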

Page 34: I/O Data Engineering

Projections

• Definition: a projection is a kind of function or mapping that transforms data in some way
• Simple transformations can often make a large difference in performance
• Example transformations (not necessarily for performance improvement):
  • Difference of two date attributes → age
  • Ratio of two numeric (ratio-scale) attributes
    • Useful for algorithms doing axis-parallel splits
  • Concatenating the values of nominal attributes
  • Encoding cluster membership
  • Adding noise to data
  • Removing data randomly or selectively
  • Obfuscating the data (anonymising)

Page 35: I/O Data Engineering

Digits dataset

From: Learning from Data MOOC, http://work.caltech.edu/telecourse.html

Page 36: I/O Data Engineering

Input representation

• ‘raw’ input: x = (x0, x1, x2, …, x256)
  • Linear model: (w0, w1, w2, …, w256)
• Features: extract useful information, e.g., intensity and symmetry: x = (x0, x1, x2)
  • Linear model: (w0, w1, w2)

From: Learning from Data MOOC, http://work.caltech.edu/telecourse.html

Page 37: I/O Data Engineering

Illustration of features

From: Learning from Data MOOC, http://work.caltech.edu/telecourse.html

Page 38: I/O Data Engineering

Another one

From: Learning from Data MOOC, http://work.caltech.edu/telecourse.html

Page 39: I/O Data Engineering

Methods

• Unsupervised
  • Principal Components Analysis (PCA)
  • Independent Component Analysis (ICA)
  • Random Projections
• Supervised
  • Partial Least Squares (PLS)
  • Linear Discriminant Analysis (LDA)

Page 40: I/O Data Engineering

Principal Components Analysis

aka PCA

Page 41: I/O Data Engineering

Principal component analysis at a glance

• Unsupervised method for identifying the important directions in a dataset
• We can then rotate the data into the new coordinate system given by those directions
• Finally, we can keep only the new dimensions that are of most importance
• PCA is a method for dimensionality reduction
• Algorithm:
  1. Find the direction (axis) of greatest variance
  2. Find the direction of greatest variance that is perpendicular to the previous direction, and repeat
• Implementation: find the eigenvectors of the covariance matrix of the data
  • The eigenvectors (sorted by eigenvalue) are the directions

Page 42: I/O Data Engineering

PCA problem formulation

Reduce from 2 dimensions to 1 dimension: find a direction (a vector $u^{(1)}$) onto which to project the data so as to minimize the projection error.

Reduce from n dimensions to k dimensions: find $k$ vectors $u^{(1)}, \dots, u^{(k)}$ onto which to project the data so as to minimize the projection error.

Page 43: I/O Data Engineering

Data in matrix form

• Let there be n instances with d attributes; every instance is described by d numerical values
• We represent our data as an n × d matrix A of real numbers
• We can use linear algebra to process the matrix
• Our goal is to produce a new n × k matrix B (k < d) such that:
  • it contains as much of the information in the original matrix A as possible
  • it reveals something about the structure of the data in A

Page 44: I/O Data Engineering

Principal Components

• The first principal component is the direction of the axis with the largest variance in the data
• The second principal component is the next orthogonal direction with the largest variance in the data
• And so on…
• The 1st PC accounts for the largest fraction of the variance
• The kth PC accounts for the kth largest fraction of the variance
• For n original dimensions, the covariance matrix is n × n and has up to n eigenvectors; thus, up to n PCs

Page 45: I/O Data Engineering

Example: 10-dimensional data

• Data is normally standardized or mean-centered for PCA
• Can also apply this recursively in a tree learner

Page 46: I/O Data Engineering

PCA example

• Dataset with 2 attributes, x1 and x2

Page 47: I/O Data Engineering

PCA example – Step 1: Get some data

X - Data:

x1   x2
2.5  2.4
0.5  0.7
2.2  2.9
1.9  2.2
3.1  3.0
2.3  2.7
2.0  1.6
1.0  1.1
1.5  1.6
1.1  0.9

Page 48: I/O Data Engineering

Data Pre-processing

Preprocessing (feature scaling/mean normalization):

Compute the mean of each feature, $\mu_j = \frac{1}{m}\sum_{i=1}^{m} x_j^{(i)}$, and replace each $x_j^{(i)}$ with $x_j^{(i)} - \mu_j$.

If different features are on different scales (e.g., size of a house, number of bedrooms), scale features to have a comparable range of values.

Page 49: I/O Data Engineering

PCA example – Step 2: Subtract the mean

X' - Mean normalization:

 x1     x2
 0.69   0.49
-1.31  -1.21
 0.39   0.99
 0.09   0.29
 1.29   1.09
 0.49   0.79
 0.19  -0.31
-0.81  -0.81
-0.31  -0.31
-0.71  -1.01

Page 50: I/O Data Engineering

Principal Component Analysis (PCA) algorithm

Reduce data from $n$ dimensions to $k$ dimensions.

Compute the “covariance matrix”: $\Sigma = \frac{1}{m}\sum_{i=1}^{m} x^{(i)} (x^{(i)})^T$

Compute the “eigenvectors” of matrix $\Sigma$:

[U,S,V] = svd(Sigma);

Page 51: I/O Data Engineering

PCA Example – Step 3: Calculate the covariance matrix

S = [ 0.6166  0.6154
      0.6154  0.7166 ]

• Given that the non-diagonal values are positive, we expect x1 and x2 to increase together (+ sign of cov(x1, x2))

Page 52: I/O Data Engineering

Principal Component Analysis (PCA) algorithm

From [U,S,V] = svd(Sigma), we get $U = [u^{(1)} \; u^{(2)} \cdots u^{(n)}] \in \mathbb{R}^{n \times n}$, whose columns are the principal directions; to reduce to $k$ dimensions, keep the first $k$ columns.

Page 53: I/O Data Engineering

PCA Example – Step 4: Compute the eigenvectors of S

• [U, D, V] = SVD(S)
• The 1st eigenvector (first column of U) has an eigenvalue of 1.2840277, while the 2nd has an eigenvalue of 0.0490834
• The eigenvectors are perpendicular to each other: orthogonal

Page 54: I/O Data Engineering

Choosing k, the number of principal components

Typically, choose $k$ to be the smallest value so that

$\frac{\frac{1}{m}\sum_{i=1}^{m}\lVert x^{(i)} - x_{approx}^{(i)}\rVert^2}{\frac{1}{m}\sum_{i=1}^{m}\lVert x^{(i)}\rVert^2} \le 0.01 \quad (1\%)$

i.e., “99% of variance is retained”.

Page 55: I/O Data Engineering

Choosing k, the number of principal components

[U,S,V] = svd(Sigma)

Pick the smallest value of $k$ for which

$\frac{\sum_{i=1}^{k} S_{ii}}{\sum_{i=1}^{n} S_{ii}} \ge 0.99$

(99% of variance retained)

Page 56: I/O Data Engineering

PCA Example – Step 5: Choosing components

• Choosing the 1st component retains more than 95% of the variance: 1.2840 / (1.2840 + 0.0491) ≈ 0.963

Page 57: I/O Data Engineering

Principal Component Analysis (PCA) algorithm

After mean normalization (ensure every feature has zero mean) and optionally feature scaling:

Sigma = (1/m) * X' * X;
[U,S,V] = svd(Sigma);
Ureduce = U(:,1:k);
z = Ureduce' * x;
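The same pipeline in Python (my own sketch, mirroring the Octave snippet above and reusing the 2-attribute example data from Step 1):

```python
import numpy as np

def pca_fit_transform(X, k):
    """Project the rows of X onto the top-k principal components."""
    mu = X.mean(axis=0)
    Xc = X - mu                       # mean normalization
    Sigma = (Xc.T @ Xc) / len(Xc)     # covariance matrix
    U, S, Vt = np.linalg.svd(Sigma)
    U_reduce = U[:, :k]               # top-k eigenvectors (columns)
    Z = Xc @ U_reduce                 # projected data, shape (m, k)
    retained = S[:k].sum() / S.sum()  # fraction of variance retained
    return Z, U_reduce, retained

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
              [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]])
Z, U_reduce, retained = pca_fit_transform(X, k=1)
print(f"variance retained: {retained:.3f}")   # ~0.96, as in Step 5
```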

Page 58: I/O Data Engineering

PCA Example – Step 6: Deriving the new data

• FinalData = RowFeatureVector × RowDataAdjust
• RowDataAdjust = (X')^T
  • the mean-normalized data, with every row being a dimension and every column a point (inverted form)
• RowFeatureVector = U^T
  • the eigenvectors in rows, with the most important one in the first row

Page 59: I/O Data Engineering

Supervised learning speedup

Given a labeled training set $(x^{(1)}, y^{(1)}), \dots, (x^{(m)}, y^{(m)})$:

Extract inputs into an unlabeled dataset $x^{(1)}, \dots, x^{(m)}$ and run PCA to get $z^{(1)}, \dots, z^{(m)}$

New training set: $(z^{(1)}, y^{(1)}), \dots, (z^{(m)}, y^{(m)})$

Note: the mapping $x^{(i)} \mapsto z^{(i)}$ should be defined by running PCA only on the training set. The same mapping can then be applied to the examples $x_{cv}^{(i)}$ and $x_{test}^{(i)}$ in the cross-validation and test sets.

Page 60: I/O Data Engineering

Applications of PCA

- Compression
  - Reduce memory/disk needed to store data
  - Speed up the learning algorithm
- Visualization

Page 61: I/O Data Engineering

Bad use of PCA: to prevent overfitting

Idea: use $z^{(i)}$ instead of $x^{(i)}$ to reduce the number of features to $k < n$; thus, fewer features, less likely to overfit.

This might work OK, but isn’t a good way to address overfitting. Use regularization instead.

Page 62: I/O Data Engineering

PCA is sometimes used where it shouldn’t be

Design of an ML system:
- Get a training set $(x^{(1)}, y^{(1)}), \dots, (x^{(m)}, y^{(m)})$
- Run PCA to reduce $x^{(i)}$ in dimension and get $z^{(i)}$
- Train logistic regression on $\{(z^{(i)}, y^{(i)})\}$
- Test on the test set: map $x_{test}^{(i)}$ to $z_{test}^{(i)}$, run the classifier on $\{(z_{test}^{(i)}, y_{test}^{(i)})\}$

How about doing the whole thing without using PCA?

Before implementing PCA, first try running whatever you want to do with the original/raw data $x^{(i)}$. Only if that doesn’t do what you want should you implement PCA and consider using $z^{(i)}$.

Page 63: I/O Data Engineering

Data Cleansing

Also known as Data Cleaning, Data Scrubbing, or Data Reconciliation

Page 64: I/O Data Engineering

What is data cleansing?

• "Detect and remove errors and inconsistencies from data in order to improve the quality of data" [Rahm]
• "The process of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database" [Wikipedia]
• An integral part of data processing and maintenance
• Usually a semi-automatic process, highly application specific

Page 65: I/O Data Engineering

How to

• Necessity of getting to know your data: understanding the meaning of all the different attributes, the conventions used in coding them, the significance of missing values and duplicate data, measurement noise, typographical errors, and the presence of systematic errors, even deliberate ones
• There are also automatic methods of cleansing data, of detecting outliers, and of spotting anomalies, including a class of techniques referred to as "one-class learning", in which only a single class of instances is available at training time

Page 66: I/O Data Engineering

Data Cleansing vs. Data Validation

• Data validation almost invariably means data is rejected from the system at entry, and it is performed at entry time rather than on batches of data
• Example: data validation in web forms

Page 67: I/O Data Engineering

Anomalies Classification

• Syntactical anomalies
  • Lexical errors (Gender: {M, M, F, 5'8})
  • Domain format errors (Smith, John vs. Smith John)
  • Irregularities: non-uniform use of values, units, abbreviations (examples: different currencies in the salaries, different use of abbreviations)
• Semantic anomalies
  • Integrity constraint violations (AGE >= 0)
  • Contradictions (AGE vs. CURRENT_DATE - DATE_OF_BIRTH)
  • Duplicates
  • Invalid tuples
• Coverage anomalies
  • Missing values
  • Missing tuples

Page 68: I/O Data Engineering

Data Quality Criteria: Accuracy + Uniqueness

• Accuracy = Integrity + Consistency + Density
• Integrity = Completeness + Validity
  • Completeness: |M in D| / |M|, e.g. I should have 1000 tuples and I have 500 => 50% (missing values)
  • Validity: |valid tuples in D| / |D|, e.g. of the 500 tuples I have in D, 400 are valid => 80% (illegal values)
• Consistency = Schema conformance + Uniformity
  • Schema conformance: tuples conforming to the syntactical structure / overall number of tuples (if it is in the database, then it conforms)
  • Uniformity: attributes with no irregularities (non-uniform use of values) / total number of attributes
• Density: missing values in the tuples of D / total values in D
• Uniqueness: tuples representing the same entity / total number of tuples

Page 69: I/O Data Engineering

Data Cleansing Operations

1. Format adaptation for tuples and values
2. Integrity constraint enforcement
3. Derivation of missing values from existing ones
4. Removing contradictions within or between tuples
5. Merging and eliminating duplicates
6. Detection of outliers, i.e. tuples and values having a high potential of being invalid

Page 70: I/O Data Engineering

Data Cleansing Process

1. Data auditing
2. Workflow specification, i.e. choose appropriate methods to automatically detect and remove anomalies
3. Workflow execution: apply the methods to the tuples in the data collection
4. Post-processing / control

Page 71: I/O Data Engineering

Data Auditing

• Data profiling: instance analysis
• Data mining: whole data collection analysis
• Examples:
  • Minimal, maximal values
  • Value range
  • Variance
  • Uniqueness
  • Null value occurrences
  • Typical string patterns (through RegExps, for example)
• Search for characteristics that could be used for the correction of anomalies

Page 72: I/O Data Engineering

Methods for Data Cleansing

• Parsing (syntax errors)
• Data transformation (source to target format)
• Integrity constraint enforcement (checking & maintenance)
• Duplicate elimination
• Statistical methods
  • Outliers
    • Detection: mean, std, range, clustering, association rules
    • Remedy: set to the average or another statistical value, censored, truncated (dropped)
  • Missing values
    • Detection: it's missing :)
    • Remedy: filling in (imputing) in a number of ways (mean, median, regression, propensity score, Markov-Chain-Monte-Carlo method), as sketched below
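For the two simplest remedies, a minimal numpy sketch (my own illustration with made-up values):

```python
import numpy as np

# A toy column with missing values encoded as NaN
x = np.array([23.0, 25.0, np.nan, 31.0, np.nan, 27.0])

# Mean imputation: replace every missing value with the column mean
mean_imputed = np.where(np.isnan(x), np.nanmean(x), x)

# Median imputation: more robust when the column contains outliers
median_imputed = np.where(np.isnan(x), np.nanmedian(x), x)

print(mean_imputed)    # NaNs replaced by 26.5
print(median_imputed)  # NaNs replaced by 26.0
```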

Page 73: I/O Data Engineering

Outlier (Error) Detection in Datasets

• Statistical: mean (μ), std (σ), range (Chebyshev's theorem); see the sketch below
  • Accept (μ - ε·σ) < f < (μ + ε·σ), where e.g. ε = 5, else reject
  • Needs training/testing data for finding the best ε ∈ {3, 4, 5, 6, etc.}
• Boxplots (univariate data)
• Clustering: high computational burden
• Pattern-based, i.e. find a pattern where 90% of the data exhibit the same characteristics
• Association rules: pattern = association rule with high confidence and support
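A minimal sketch of the statistical rule (my own illustration on synthetic data with two planted outliers):

```python
import numpy as np

def sigma_filter(values, eps=5.0):
    """Accept values inside (mu - eps*sigma, mu + eps*sigma); flag the rest.
    By Chebyshev's theorem, at most 1/eps^2 of any distribution lies outside."""
    mu, sigma = values.mean(), values.std()
    inside = np.abs(values - mu) < eps * sigma
    return values[inside], values[~inside]   # accepted, rejected

rng = np.random.default_rng(1)
data = np.concatenate([rng.normal(50, 5, 1000), [250.0, -120.0]])
accepted, rejected = sigma_filter(data, eps=5.0)
print("rejected:", rejected)   # the two planted outliers
```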

Page 74: I/O Data Engineering

Detecting anomalies

• Visualization can help to detect anomalies
• Automatic approach: apply a committee of different learning schemes, e.g.,
  • decision tree
  • nearest-neighbor learner
  • linear discriminant function
• Conservative consensus approach: delete instances incorrectly classified by all of them
• Problem: might sacrifice instances of small classes

Page 75: I/O Data Engineering

One-Class Learning

• Usually training data is available for all classes
• Some problems exhibit only a single class at training time
  • Test instances may belong to this class or to a new class not present at training time
• This is the problem of one-class classification
  • Predict either "target" or "unknown"
• Note that, in practice, some one-class problems can be re-formulated into two-class ones by collecting negative data
• Other applications truly do not have negative data, e.g., password hardening, nuclear plant operational status

Page 76: I/O Data Engineering

Outlier detection

• One-class classification is often used for outlier/anomaly/novelty detection
• First, a one-class model is built from the dataset
• Then, outliers are defined as instances that are classified as unknown
• Another method: identify outliers as instances that lie beyond a distance d from a percentage p of the training data
• Density estimation is a very useful approach for one-class classification and outlier detection (see the sketch below)
  • Estimate the density of the target class and mark low-probability test instances as outliers
  • The threshold can be adjusted to calibrate the sensitivity

Page 77: I/O Data Engineering

Transforming multiple classes to binary ones

Output processing

Page 78: I/O Data Engineering

Transforming multiple classes to binary ones

• Some learning algorithms only work with two-class problems
  • e.g., standard support vector machines
• Sophisticated multi-class variants exist in many cases but can be very slow or difficult to implement
• A common alternative is to transform multi-class problems into multiple two-class ones
• Simple methods:
  • Discriminate each class against the union of the others – one-vs.-rest
  • Build a classifier for every pair of classes – pairwise classification
• We will discuss error-correcting output codes, which can often improve on these

Page 79: I/O Data Engineering

Error-correcting output codes

• Multiclass problem → multiple binary problems
• Simple one-vs.-rest scheme: one-per-class coding

  class   class vector
  a       1000
  b       0100
  c       0010
  d       0001

  • If the base classifiers predict 1010, two code words (a and c) are equally close: the error cannot be corrected
• Idea: use error-correcting codes instead

  class   class vector
  a       1111111
  b       0000111
  c       0011001
  d       0101010

  • Base classifiers predict 1011111 → true class = a, the nearest code word (one bit away)
• Use bit vectors (codes) so that we have a large Hamming distance d between any pair of bit vectors
  • Can correct up to (d - 1)/2 single-bit errors
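Decoding by nearest code word is straightforward; a short sketch (my own) of the scheme in the tables above:

```python
# Error-correcting output code from the slide: one code word per class
code = {"a": "1111111", "b": "0000111", "c": "0011001", "d": "0101010"}

def decode(predicted_bits):
    """Assign the class whose code word has the smallest Hamming distance."""
    dist = {cls: sum(p != c for p, c in zip(predicted_bits, word))
            for cls, word in code.items()}
    return min(dist, key=dist.get)

print(decode("1011111"))  # 'a' (one bit away from 1111111)
```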

Page 80: I/O Data Engineering

The End

Kyriakos C. Chatzidimitriou
http://kyrcha.info
[email protected]

November 2016