1

Knowledge Mining (data mining)

C. Papatheodorou

Laboratory on Digital Libraries & Electronic Publishing

Department of Archives and Library Science,
Ionian University

2

Data Mining

- Mining knowledge from very large data collections
- Knowledge: rules, behavior patterns, and correlations among objects (non-obvious, latent, previously unknown, and useful)
- Object: consists of a set of attributes
- It is not:
  - (Deductive) query processing
  - Expert systems, small machine learning / statistical programs

3

Why Data Mining? Potential Applications

- Database analysis and decision support
  - Market analysis and management: target marketing, customer relationship management, market basket analysis, cross-selling, market segmentation
  - Risk analysis and management: forecasting, customer retention, improved underwriting, quality control, competitive analysis
  - Fraud detection and management
- Other applications
  - Text mining (newsgroups, email, documents) and Web analysis
  - Intelligent query answering

4

Market Analysis and Management (1)

- Where are the data sources for analysis?
  - Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies
- Target marketing
  - Find clusters of model customers who share the same characteristics: interest, income level, spending habits, etc.
- Determine customer purchasing patterns over time
  - Conversion of a single to a joint bank account: marriage, etc.
- Cross-market analysis
  - Associations/co-relations between product sales
  - Prediction based on the association information

5

Market Analysis and Management (2)

- Customer profiling
  - data mining can tell you what types of customers buy what products (clustering or classification)
- Identifying customer requirements
  - identifying the best products for different customers
  - use prediction to find what factors will attract new customers
- Provides summary information
  - various multidimensional summary reports
  - statistical summary information (data central tendency and variation)

6

Corporate Analysis and Risk Management

- Finance planning and asset evaluation
  - cash flow analysis and prediction
  - contingent claim analysis to evaluate assets
  - cross-sectional and time series analysis (financial-ratio, trend analysis, etc.)
- Resource planning
  - summarize and compare the resources and spending
- Competition
  - monitor competitors and market directions
  - group customers into classes and a class-based pricing procedure
  - set pricing strategy in a highly competitive market

7

Steps of a KDD Process

- Learning the application domain: relevant prior knowledge and goals of the application
- Creating a target data set: data selection
- Data cleaning and preprocessing (may take 60% of the effort!)
- Data reduction and transformation: find useful features, dimensionality/variable reduction, invariant representation
- Choosing the functions of data mining: summarization, classification, regression, association, clustering
- Choosing the mining algorithm(s)
- Data mining: search for patterns of interest
- Pattern evaluation and knowledge presentation: visualization, transformation, removing redundant patterns, etc.
- Use of discovered knowledge
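As a rough illustration of how these steps chain together, here is a minimal Python sketch. Everything in it is hypothetical and not from the slides: the records, the field names, the age groups, and the threshold are assumptions chosen only to make each step visible, and the "mining" step is a simple summarization.

```python
# Hypothetical raw records: (customer_id, age, amount_spent); None marks a missing value.
raw = [
    (1, 34, 120.0),
    (2, None, 80.0),
    (3, 45, 200.0),
    (4, 29, None),
    (5, 52, 20.0),
]

def select(records):
    """Data selection: keep only the task-relevant fields (age, amount)."""
    return [(age, amount) for _, age, amount in records]

def clean(records):
    """Data cleaning: drop records that have a missing value."""
    return [r for r in records if None not in r]

def transform(records):
    """Data reduction/transformation: replace the age by a coarse age group."""
    return [("under_40" if age < 40 else "40_plus", amount) for age, amount in records]

def mine(records):
    """Data mining (summarization): mean amount spent per age group."""
    totals = {}
    for group, amount in records:
        running_sum, count = totals.get(group, (0.0, 0))
        totals[group] = (running_sum + amount, count + 1)
    return {group: s / n for group, (s, n) in totals.items()}

def evaluate(patterns, min_mean=115.0):
    """Pattern evaluation: keep only groups whose mean is at least min_mean."""
    return {g: m for g, m in patterns.items() if m >= min_mean}

patterns = mine(transform(clean(select(raw))))
interesting = evaluate(patterns)
print(patterns)
print(interesting)
```

In a real project each function would be far more involved; the point is only that every stage consumes the output of the previous one.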

Data Mining: A KDD Process

Data mining: the core of the knowledge discovery process.

[Figure: KDD process flow: Databases => Data Cleaning and Data Integration => Data Warehouse => Selection => Task-relevant Data => Data Mining => Pattern Evaluation]

9

Data pre-processing

- Data preparation is a big issue for data mining
- Data preparation includes
  - Data cleaning and data integration
  - Data reduction and feature selection
  - Discretization
- Many methods have been developed, but this is still an active area of research
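Discretization, for instance, can be as simple as equal-width binning. The sketch below is one possible approach, not a method prescribed by the slides; the function name `equal_width_bins` and the list of ages are hypothetical.

```python
def equal_width_bins(values, k):
    """Assign each value to one of k equal-width intervals spanning [min, max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so a single bin is enough.
        return [0 for _ in values]
    width = (hi - lo) / k
    # The maximum value would land in bin k, so it is clamped into the last bin.
    return [min(int((v - lo) / width), k - 1) for v in values]

ages = [22, 25, 31, 38, 45, 52, 60]  # hypothetical numeric attribute
bins = equal_width_bins(ages, 3)
print(bins)
```

Equal-width binning is sensitive to outliers; equal-frequency binning or entropy-based splitting are common alternatives.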

10

Data pre-processing

11

Clustering

- Partition the data set into clusters; one can then store only the cluster representation
- Clustering can be hierarchical, with the clusters stored in multi-dimensional index tree structures
- There are many choices of clustering definitions and clustering algorithms
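One widely used choice is k-means, where the centroids serve as the stored cluster representation. The slides do not name a specific algorithm, so the following is only a minimal sketch under assumptions: 2-D points, hypothetical data, and the first k points taken as initial centroids to keep the run deterministic.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two 2-D points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

def kmeans(points, k, max_iterations=20):
    """Plain k-means; returns the centroids and the final partition."""
    centroids = list(points[:k])  # assumption: first k points as initial centroids
    clusters = []
    for _ in range(max_iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its members.
        updated = [
            (sum(x for x, _ in members) / len(members),
             sum(y for _, y in members) / len(members)) if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
        if updated == centroids:
            break  # converged: no centroid moved
        centroids = updated
    return centroids, clusters

points = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0), (2.0, 1.0), (8.0, 9.0), (9.0, 8.0)]
centroids, clusters = kmeans(points, 2)
print(centroids)
```

After clustering, the two centroids can stand in for the six points, which is the data reduction the slide refers to.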

12

Cluster Analysis

13

Classification

- Classification is an extensively studied problem (mainly in statistics, machine learning & neural networks)
- Classification is probably one of the most widely used data mining techniques, with many extensions
- Scalability is still an important issue for database applications, so combining classification with database techniques is a promising topic
- Research directions: classification of non-relational data, e.g., text, spatial, multimedia, etc.

14

Classification process

- Model construction: describing a set of predetermined classes
  - Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
  - The set of tuples used for model construction: the training set
  - The model is represented as classification rules, decision trees, or mathematical formulae
- Model usage: classifying future or unknown objects
  - Estimate the accuracy of the model
    - The known label of each test sample is compared with the model's classification
    - Accuracy rate is the percentage of test set samples that the model classifies correctly
    - The test set must be independent of the training set; otherwise the accuracy estimate is over-optimistic and over-fitting goes undetected

15

Classification Process (1): Model Construction

Training Data

NAME  RANK            YEARS  TENURED
Mike  Assistant Prof  3      no
Mary  Assistant Prof  7      yes
Bill  Professor       2      yes
Jim   Associate Prof  7      yes
Dave  Assistant Prof  6      no
Anne  Associate Prof  3      no

Classification Algorithms

Classifier (Model): IF rank = 'professor' OR years > 6 THEN tenured = 'yes'

16

Classification Process (2): Use the Model in Prediction

Classifier

Testing Data

NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

Unseen Data: (Jeff, Professor, 4)

Tenured?
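The two slides above can be replayed in a few lines of Python. The rule and the testing data are taken from the slides; the function name `predict_tenured` and the tuple layout are assumptions made for this sketch.

```python
def predict_tenured(rank, years):
    """The rule from the model-construction slide: professor OR more than 6 years."""
    return "yes" if rank == "Professor" or years > 6 else "no"

# Testing data from the slide: (name, rank, years, known label).
testing_data = [
    ("Tom", "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George", "Professor", 5, "yes"),
    ("Joseph", "Assistant Prof", 7, "yes"),
]

# Accuracy: share of test samples whose predicted label equals the known label.
correct = sum(
    predict_tenured(rank, years) == label for _, rank, years, label in testing_data
)
accuracy = correct / len(testing_data)
print(f"accuracy on the test set: {accuracy:.0%}")

# Classify the unseen sample (Jeff, Professor, 4).
jeff = predict_tenured("Professor", 4)
print("Jeff tenured?", jeff)
```

The rule misclassifies Merlisa (7 years, not tenured), so it scores 3 out of 4 on the test set even though it fits all six training tuples, and it predicts "yes" for Jeff.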

17

Supervised vs. Unsupervised Learning

- Supervised learning (classification)
  - Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
  - New data is classified based on the training set
- Unsupervised learning (clustering)
  - The class labels of the training data are unknown
  - Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data

18

Document category modelling

- Example: filtering spam email
- Task: classify incoming email as spam or legitimate (2 document categories)
- Simple blacklist and keyword-based methods have failed
- More intelligent, adaptive approaches are needed (e.g. naive Bayesian category modeling)

19

Document category modelling

Step 1 (linguistic pre-processing): Tokenization, removal of stopwords, stemming/lemmatization.

Step 2 (vector representation): bag-of-words or n-gram modeling (n=2,3).

Step 3 (feature selection): information gain evaluation.

Step 4 (machine learning): Bayesian modeling, using word/n-gram frequency.
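The steps above can be sketched as a small multinomial naive Bayes classifier over a bag of words. This is a simplified illustration, not the implementation behind the slides: the stopword list and training messages are hypothetical, stemming (Step 1) and information-gain feature selection (Step 3) are left out, and Laplace smoothing is an added assumption.

```python
import math
from collections import Counter

STOPWORDS = {"the", "a", "to", "is", "at", "you", "for", "your"}  # tiny hypothetical list

def tokenize(text):
    """Steps 1-2: lowercase, split on whitespace, drop stopwords (bag of words)."""
    return [w for w in text.lower().split() if w not in STOPWORDS]

def train(labelled_docs):
    """Step 4: collect word frequencies and document counts per category."""
    word_counts, doc_counts = {}, Counter()
    for text, label in labelled_docs:
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(tokenize(text))
    vocabulary = {w for counts in word_counts.values() for w in counts}
    return word_counts, doc_counts, vocabulary

def classify(text, model):
    """Return the category with the highest log posterior (Laplace smoothing)."""
    word_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in doc_counts:
        score = math.log(doc_counts[label] / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for w in tokenize(text):
            # Add-one smoothing keeps unseen words from zeroing the probability.
            score += math.log((word_counts[label][w] + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical training messages.
model = train([
    ("win free money now", "spam"),
    ("free prize claim your money", "spam"),
    ("meeting at noon tomorrow", "legitimate"),
    ("project report for the meeting", "legitimate"),
])
print(classify("claim your free money", model))
print(classify("report for the project meeting", model))
```

With realistic corpora the vocabulary would first be pruned by information gain, and n-grams could replace single words as features.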

20

What Is Association Mining?

- Association rule mining: finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories
- Applications: basket data analysis, cross-marketing, catalog design, loss-leader analysis, clustering, classification, etc.
- Example
  - Rule form: "Body => Head [support, confidence]"
  - buys(x, "diapers") => buys(x, "beers") [0.5%, 60%]

21

Association Rule: Basic Concepts

- Given: (1) a database of transactions, (2) each transaction is a list of items (purchased by a customer in a visit)
- Find: all rules that correlate the presence of one set of items with that of another set of items
  - E.g., 98% of people who purchase tires and auto accessories also get automotive services done
- Applications
  - * => Maintenance Agreement (what should the store do to boost Maintenance Agreement sales?)
  - Home Electronics => * (what other products should the store stock up on?)

22

Rule Measures: Support and Confidence

- Find all the rules X & Y => Z with minimum confidence and support
  - support, s: probability that a transaction contains {X & Y & Z}
  - confidence, c: conditional probability that a transaction having {X & Y} also contains Z

[Figure: Venn diagram of two overlapping sets, "Customer buys diaper" and "Customer buys beer"; the overlap is "Customer buys both"]

Find the rules with support and confidence equal to or greater than a given threshold
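Written as formulas, the two measures look as follows. The symbols D (the transaction database) and t (a single transaction) are notation introduced here for illustration and do not appear in the slides.

```latex
\[
\mathrm{support}(X \wedge Y \Rightarrow Z)
  = P(X, Y, Z)
  = \frac{\lvert \{\, t \in D : \{X, Y, Z\} \subseteq t \,\} \rvert}{\lvert D \rvert}
\]
\[
\mathrm{confidence}(X \wedge Y \Rightarrow Z)
  = P(Z \mid X, Y)
  = \frac{P(X, Y, Z)}{P(X, Y)}
\]
```

Confidence is therefore the support of the whole rule divided by the support of its body alone.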

23

Mining Association Rules: An Example

Min. support 50%
Min. confidence 50%

Transaction ID  Items Bought
2000            A,B,C
1000            A,C
4000            A,D
5000            B,E,F

Frequent Itemset  Support
{A}               75%
{B}               50%
{C}               50%
{A,C}             50%

For rule A => C:
  support = support({A, C}) = 50%
  confidence = support({A, C}) / support({A}) = 66.6%
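The numbers in this example can be checked with a short brute-force script. The transactions and thresholds come from the slide; the helper names `support` and `confidence` are assumptions for this sketch, and enumerating all itemsets up to size 2 is a shortcut that only works for such a tiny database (it is not the Apriori algorithm).

```python
from itertools import combinations

# Transactions from the slide: transaction ID -> items bought.
transactions = {
    2000: {"A", "B", "C"},
    1000: {"A", "C"},
    4000: {"A", "D"},
    5000: {"B", "E", "F"},
}

def support(itemset):
    """Fraction of transactions that contain every item of the itemset."""
    itemset = set(itemset)
    return sum(itemset <= items for items in transactions.values()) / len(transactions)

def confidence(body, head):
    """Support of body and head together, divided by the support of the body."""
    return support(set(body) | set(head)) / support(body)

min_support = 0.5
items = sorted(set().union(*transactions.values()))

# Brute force: test every itemset of size 1 and 2 against the minimum support.
frequent = {
    frozenset(candidate): support(candidate)
    for size in (1, 2)
    for candidate in combinations(items, size)
    if support(candidate) >= min_support
}

for itemset, s in frequent.items():
    print(sorted(itemset), f"{s:.0%}")
print(f"confidence(A => C) = {confidence({'A'}, {'C'}):.1%}")
print(f"confidence(C => A) = {confidence({'C'}, {'A'}):.1%}")
```

The script reproduces the four frequent itemsets of the table. The reverse rule C => A has the same support but a confidence of 100%, which shows that confidence is not symmetric.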

