1 Εξόρυξη Γνώσης (data mining) Χ. Παπαθεοδώρου Εργαστήριο...
-
date post
21-Dec-2015 -
Category
Documents
-
view
218 -
download
1
Embed Size (px)
Transcript of 1 Εξόρυξη Γνώσης (data mining) Χ. Παπαθεοδώρου Εργαστήριο...

1
Εξόρυξη Γνώσης(data mining)
Χ. Παπαθεοδώρου
Εργαστήριο Ψηφιακών Βιβλιοθηκών &
Ηλεκτρονικής Δημοσίευσης
Τμήμα Αρχειονομίας – Βιβλιοθηκονομίας,
Ιόνιο Πανεπιστήμιο

2
Data Mining Εξόρυξη γνώσης από πολύ μεγάλες συλλογές
δεδομένων Γνώση: κανόνες, πρότυπα συμπεριφοράς και
συσχετίσεις μεταξύ αντικειμένων (όχι προφανής,
λανθάνουσα, προηγουμένως άγνωστη, και χρήσιμη) Αντικείμενο: Αποτελείται από ένα σύνολο
χαρακτηριστικών Δεν είναι:
(Deductive) query processing. Expert systems, small machine learning /statistical
programs

3
Why Data Mining? Potential Applications
Database analysis and decision support Market analysis and management
target marketing, customer relation management, market
basket analysis, cross selling, market segmentation Risk analysis and management
Forecasting, customer retention, improved underwriting,
quality control, competitive analysis Fraud detection and management
Other Applications Text mining (news group, email, documents) and Web analysis. Intelligent query answering

4
Market Analysis and Management (1)
Where are the data sources for analysis? Credit card transactions, loyalty cards, discount coupons,
customer complaint calls, plus (public) lifestyle studies
Target marketing Find clusters of model customers who share the same
characteristics: interest, income level, spending habits, etc. Determine customer purchasing patterns over time
Conversion of single to a joint bank account: marriage, etc.
Cross-market analysis Associations/co-relations between product sales Prediction based on the association information

5
Market Analysis and Management (2) Customer profiling
data mining can tell you what types of customers
buy what products (clustering or classification)
Identifying customer requirements identifying the best products for different customers
use prediction to find what factors will attract new
customers
Provides summary information various multidimensional summary reports
statistical summary information (data central
tendency and variation)

6
Corporate Analysis and Risk Management
Finance planning and asset evaluation cash flow analysis and prediction contingent claim analysis to evaluate assets cross-sectional and time series analysis (financial-
ratio, trend analysis, etc.) Resource planning:
summarize and compare the resources and spending Competition:
monitor competitors and market directions group customers into classes and a class-based
pricing procedure set pricing strategy in a highly competitive market

7
Steps of a KDD Process Learning the application domain:
relevant prior knowledge and goals of application Creating a target data set: data selection Data cleaning and preprocessing: (may take 60% of
effort!) Data reduction and transformation:
Find useful features, dimensionality/variable reduction, invariant representation.
Choosing functions of data mining summarization, classification, regression, association,
clustering. Choosing the mining algorithm(s) Data mining: search for patterns of interest Pattern evaluation and knowledge presentation
visualization, transformation, removing redundant patterns, etc. Use of discovered knowledge

Data Mining: A KDD Process
Data mining: the core of knowledge discovery process.
Data Cleaning
Data Integration
Databases
Data Warehouse
Task-relevant Data
Selection
Data Mining
Pattern Evaluation

9
Data pre-processing
Data preparation is a big issue for data mining
Data preparation includes
Data cleaning and data integration
Data reduction and feature selection
Discretization
A lot a methods have been developed but still
an active area of research

10
Data pre-processing

11
Clustering
Partition data set into clusters, and one can store cluster representation only
Can have hierarchical clustering and be
stored in multi-dimensional index tree
structures
There are many choices of clustering
definitions and clustering algorithms

12
Cluster Analysis

13
Classification Classification is an extensively studied problem (mainly
in statistics, machine learning & neural networks)
Classification is probably one of the most widely used
data mining techniques with a lot of extensions
Scalability is still an important issue for database
applications: thus combining classification with
database techniques should be a promising topic
Research directions: classification of non-relational data,
e.g., text, spatial, multimedia, etc..

14
Classification process Model construction: describing a set of predetermined
classes Each tuple/sample is assumed to belong to a predefined class,
as determined by the class label attribute The set of tuples used for model construction: training set The model is represented as classification rules, decision trees,
or mathematical formulae Model usage: for classifying future or unknown objects
Estimate accuracy of the model The known label of test sample is compared with the
classified result from the model Accuracy rate is the percentage of test set samples that are
correctly classified by the model Test set is independent of training set, otherwise over-fitting
will occur

15
Classification Process (1): Model Construction
TrainingData
NAME RANK YEARS TENUREDMike Assistant Prof 3 noMary Assistant Prof 7 yesBill Professor 2 yesJim Associate Prof 7 yesDave Assistant Prof 6 noAnne Associate Prof 3 no
ClassificationAlgorithms
IF rank = ‘professor’OR years > 6THEN tenured = ‘yes’
Classifier(Model)

16
Classification Process (2): Use the Model in Prediction
Classifier
TestingData
NAME RANK YEARS TENUREDTom Assistant Prof 2 noMerlisa Associate Prof 7 noGeorge Professor 5 yesJoseph Assistant Prof 7 yes
Unseen Data
(Jeff, Professor, 4)
Tenured?

17
Supervised vs. Unsupervised Learning
Supervised learning (classification) Supervision: The training data (observations,
measurements, etc.) are accompanied by labels
indicating the class of the observations New data is classified based on the training set
Unsupervised learning (clustering) The class labels of training data is unknown Given a set of measurements, observations, etc. with
the aim of establishing the existence of classes or
clusters in the data

18
Document category modellingExample: Filtering spam email.Task: classify incoming email as spam
and legitimate (2 document categories).Simple blacklist and keyword-based
methods have failed.More intelligent, adaptive approaches
are needed (e.g. naive Bayesian category modeling).

19
Document category modelling
Step 1 (linguistic pre-processing): Tokenization, removal of stopwords, stemming/lemmatization.
Step 2 (vector representation): bag-of-words or n-gram modeling (n=2,3).
Step 3 (feature selection): information gain evaluation.
Step 4 (machine learning): Bayesian modeling, using word/n-gram frequency.

20
What Is Association Mining?
Association rule mining: Finding frequent patterns, associations, correlations, or
causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories.
Applications: Basket data analysis, cross-marketing, catalog design,
loss-leader analysis, clustering, classification, etc. Example.
Rule form: "Body ead [support, confidence]. buys(x, "diapers) buys(x, "beers) [0.5%, 60%]

21
Association Rule: Basic Concepts
Given: (1) database of transactions, (2) each transaction is a list of items (purchased by a customer in a visit)
Find: all rules that correlate the presence of one set of items with that of another set of items
E.g., 98% of people who purchase tires and auto accessories also get automotive services done
Applications * Maintenance Agreement (What the store should do to boost
Maintenance Agreement sales) Home Electronics * (What other products should the store
stocks up?)

22
Rule Measures: Support and Confidence
Find all the rules X & Y Z with minimum confidence and support
support, s, probability that a transaction contains {X & Y & Z}
confidence, c, conditional probability that a transaction having {X & Y} also contains Z
Customerbuys diaper
Customerbuys both
Customerbuys beer
Find the rules with support and confidence equal or grater than a given threshold

23
Mining Association RulesAn Example
For rule A C:support = support({A =>C}) = 50%confidence = support({A =>C})/support({A}) =
66.6%
Transaction ID Items Bought2000 A,B,C1000 A,C4000 A,D5000 B,E,F
Frequent Itemset Support{A} 75%{B} 50%{C} 50%{A,C} 50%
Min. support 50%Min. confidence 50%

24
References
U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, 1996.
J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan
Kaufmann, 2000.
T. Imielinski and H. Mannila. A database perspective on knowledge
discovery. Communications of ACM, 39:58-64, 1996.
G. Piatetsky-Shapiro, U. Fayyad, and P. Smith. From data mining to
knowledge discovery: An overview. In U.M. Fayyad, et al. (eds.), Advances
in Knowledge Discovery and Data Mining, 1-35. AAAI/MIT Press, 1996.
G. Piatetsky-Shapiro and W. J. Frawley. Knowledge Discovery in Databases.
AAAI/MIT Press, 1991.