ES205 Lecture Notes: Pattern Recognition

LECTURE NOTES 12_1
VERSION: 2.0

DATE: 2002-5-17

FROM ES 205 LECTURE #15 PATTERN RECOGNITION, DATA MINING AND KNOWLEDGE DISCOVERY

Lecture Notes 15: Pattern Recognition, Data Mining and Knowledge Discovery

I. An Example: Handwriting Recognition (HR)

The goal is to develop an algorithm that assigns any given image to one of two categories or rejects it. We define a class variable Y = Y_k, where k = 1, 2: Y_1 corresponds to 'a' and Y_2 corresponds to 'b'.

One trivial algorithm is to assign a label to every possible image and store the labels in a lookup table. However, constructing such a lookup table is totally infeasible. A typical image may have 128 × 128 pixels, which can be stored in a vector D; hence, the dimension of D is 16,384. For simplicity, we assume each pixel is either black or white. Even with such a highly simplified setting, we may have as many as 2^16384 different images! On the other hand, we only have a few thousand examples in the training set. We hope to squeeze out the information or "pattern" hidden in the training set, and to construct an algorithm that correctly classifies unseen image vectors. This is the problem of prediction or generalization.

The 16,384-dimensional vector D keeps all the information available; however, if we feed such a large number of input variables to a learning algorithm, we can hardly achieve any good results. This is because (i) it takes too much time to process them, and (ii) the final model can be very sensitive to the training set we choose, which deteriorates the generalization performance. As stated in the previous paragraph, because of the high dimensionality of the input vector, the total number of different images is extremely large. For each of these images, we may need several examples to achieve a confident estimation. Unfortunately, even for such a toy example, we can never have enough data to attain good estimates for all the possible images. This is often referred to as the curse of dimensionality.

Exercise: if we have N input variables, each of which has V different possible values, what is the total number of different input outcomes?

One way to solve this problem is to combine the input variables into a smaller number of new variables called features. Such a procedure is usually referred to as feature extraction.


For example, one of the possible features is "the ratio of the height of the character to its width," which we denote as X1.

Plotting the histogram of X1 according to the counts from our training set,

Histogram of X1 in the training set

the following decision rule can be formulated: if X1 exceeds some threshold θ read off the histogram, Y = Y2; otherwise, Y = Y1.
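As a concrete illustration, here is a minimal sketch of such a one-feature threshold classifier in Python. The threshold value and the function name are our own illustrative choices, not part of the notes.

```python
def classify_by_ratio(height, width, threshold=1.1):
    """Toy one-feature classifier based on X1 = height/width.

    'b' has an ascender, so its bounding box tends to be taller than
    that of 'a'; a large X1 therefore suggests class Y2 ('b').  The
    threshold here is a made-up placeholder; in practice it would be
    read off the training-set histogram of X1.
    """
    x1 = height / width
    return "Y2 ('b')" if x1 >= threshold else "Y1 ('a')"

print(classify_by_ratio(height=20, width=22))  # squat box -> Y1 ('a')
print(classify_by_ratio(height=30, width=18))  # tall box  -> Y2 ('b')
```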

Exercise: why is this the best decision rule for this problem, based on the information we have about X1?

We say that feature X1 gives some degree of discrimination between the two classes. Intuitively, if we consider a second feature X2, for example, “deviation along the vertical central axis,” we may improve the degree of discrimination.

Linear decision boundary

The line that separates Y1 and Y2 is known as the decision boundary. It does not have to be a straight line (a linear decision boundary); instead, some algorithms may learn a non-linear decision boundary that separates the instances in the feature space.



Non-linear decision boundary

If we keep on adding features, we might anticipate that the level of discrimination improves further and, therefore, that we achieve better performance. In practice, however, what we usually see is that, for many learning algorithms, adding features beyond a certain number can even worsen the generalization performance. One may wonder why using more information actually deteriorates performance. Again, this is due to the curse of dimensionality mentioned above (also known as over-fitting). This is a very important issue, and we shall discuss it in detail later.

HR (Handwriting Recognition) is a perfect example of a traditional pattern recognition problem. The basic idea is inductive learning, namely, "learning from examples." The same idea can easily be extended to data mining problems, where we are given a large database and want to discover "knowledge" from it to achieve certain goals.

II. Another Example: Direct Mailing Fund Raising (DMFR)

The task is to maximize the yield of a non-profit charity organization by mailing the fund raising package to “promising” donors. A large database of all the previous donors is available. For each individual donor, there are as many as 400 fields describing his or her demographic information and donation history. Altogether, the database contains the data of more than 100,000 people.

Mailing the package has a certain cost, say $c. Hence, if the person does not respond, we end up with a negative yield of −$c. If a donor contributes $d, the yield is $(d − c). Therefore, a "promising" donor is naturally one who will generate a positive yield, and the problem reduces to identifying promising donors in the database.

We shall point out some differences between DMFR and HR: (i) In DMFR, the features are already defined. The difficulty is that the dimension of the feature vector is still too high to be used in learning algorithms. Furthermore, among the 400 features, many could be totally irrelevant to the problem; hence, feature selection or feature reduction is necessary.

(ii) The task is no longer simply to classify a person into the two categories "likely to donate" or "not likely"; instead, we have to estimate how much (s)he is going to contribute. The HR problem is usually referred to as classification, while DMFR deals with a regression problem. There are many statistical textbooks on linear and non-linear regression algorithms. However, most of these methods can hardly be applied directly to very large data sets such as the one in our DMFR example.

We will re-visit these examples in the future when we talk about different learning algorithms.

III. Pattern Recognition: a Multistage Process


[Figure: the pattern recognition pipeline]
Raw Data → (pre-processing: feature extraction / selection) → Working Data → (inductive learning by examples) → Predictive Model → (prediction or generalization) → Generalization Results → (interpretation of the results) → Final Decision, with model validation running alongside the process.

As we see in the above examples, the problem of pattern recognition is a multistage procedure, summarized in the figure above.

Most textbooks focus only on the learning stage. Many inductive algorithms have been proposed, such as decision trees, neural networks, Bayesian (belief) networks, etc. However, in the real world, the interface between the end user, who owns the raw data and wants to make decisions from them, and those well-studied algorithms can be even more important and time-consuming.

Smart decisions hardly begin with wrong assumptions. If a problem can be detected and formulated at an early stage, we have a better chance of reaching a good solution. If one has to make a decision based upon some information, one expects that information to be accurate, clean, and adequate. Data pre-processing is the first step towards problem solving. All inductive learning algorithms rely heavily on the product of this stage, which we call the working data. To a large extent, this stage decides the direction of the final solution; hence, we cannot emphasize it enough.

In learning we often reduce the learning problem to an optimization problem (see Section IV.A below). In this area, we have the no-free-lunch theorem, which says that there is no universally applicable optimization algorithm that works well for all problems: on average (over all possible problems), without specialization, all algorithms perform the same as blind search. So do not be lured by fancy algorithms; instead, try to understand the problem better from the very beginning.

Question: can you give an example that by choosing a wrong feature, one makes a decision worse than blind pick, namely, without any information?


IV. Statistical Pattern Recognition: a Formal Treatment

A. Posing the problem from the viewpoint of optimization

A general optimization problem can be described as

$$\theta^* = \arg\min_{\theta \in \Theta} J(\theta) \tag{4.1}$$

where $\Theta$ is the parameter (feature) space and $J$ is the performance evaluation function. $\theta$ can be features, values of the parameters, or the structure of the model, for example, the connectivity of a neural net. An optimization algorithm searches the space $\Theta$ to locate the $\theta^*$ that minimizes $J(\theta)$.

Pattern recognition is a special form of optimization problem: given a set of examples, known as the training data set, a learning algorithm tries to construct a model, or hypothesis, that summarizes the data and can be used to predict the behavior of new examples. One natural performance evaluation function is the model's prediction accuracy.

To be more formal, we introduce the following notation. An object is described by a collection of N input variables X ∈ R^N, known as features; X is the value space of the features. An instance x of the object specifies a feasible value for each of the N features. We also have an output variable Y, called the target, which is the label of an instance. If Y takes categorical values (as in the HR example, where Y can be either 'a' or 'b'), the problem is classification. If it is a continuous variable (as in the DMFR example, where Y is the amount of money a person contributes), the problem is called regression. Both classification and regression can be seen as a kind of function approximation; in other words, finding a (usually non-linear) mapping from the feature space to the target variable that minimizes the classification or regression error.

A set of instances is collected in the hope that we may recognize some interesting patterns from it. A learning scheme may encounter two types of scenario. The first is called unsupervised learning, because the target Y is not given, or not even defined: we want to detect "some" interesting patterns in the training set, but we do not have a clear idea of what they are. The second is supervised learning, where the training set has M labeled instances, i.e., S = { (x_m, c_m) | m = 1, …, M }, where x_m is the mth instance in the training set and c_m is its label.

In our HR example, all the training images are pre-classified by human beings, while in the DMFR example, we use historical data as our "instructor."

In this lecture, we will mainly focus on the supervised classification problem. Unsupervised learning is also called clustering; the basic idea is to exploit the similarity among instances in the training set. One famous method is the K-nearest-neighbor algorithm [1]. Students are encouraged to read more about unsupervised learning if they are interested.

The task of a supervised learning algorithm is to induce, from the training set S, a predictive model, which is a mapping from the feature space to the target variable:

$$h: \mathcal{X} \to \{Y_1, \ldots, Y_K\} \quad \text{(classification)}$$

or

$$h: \mathcal{X} \to \mathbb{R} \quad \text{(regression)}$$

The learned model can be used to predict the labels of unseen instances. A common criterion for a learning algorithm is to minimize its prediction error rate. More generally, we can define a loss for mislabeling an instance and the corresponding risk r[h(X | S)]. Letting H denote the space of all possible models, a supervised learning algorithm searches H to locate a model h* that minimizes the expected loss,

$$h^* = \arg\min_{h \in H} E\{\, r[h(X \mid S)] \,\} \tag{4.2}$$

This is exactly an optimization problem of the form (4.1).

B. Framework of Statistical Methods

To tackle this problem, we need to discover the rules that govern the "labeling procedure" of instances. Due to the complexity of the feature space and the possibly intrinsic random nature of the problem, the statistical method is the most general and natural framework for formulating solutions to pattern recognition problems. We assume that the training set S is randomly sampled from a probability distribution P(X, Y). The learning procedure therefore concentrates on estimating this distribution from the available training data. Once this is done, and the cost of mislabeling is given, a decision rule is easy to formulate.

Let us first focus on the binary classification problem, where Y can only take values in {0, 1}. For a given instance X with label Y, if our learning algorithm predicts the label as h(X), the loss function of our decision is defined as

$$L[Y, h(X)] = l_1 \cdot \mathbf{1}(Y = 1,\, h(X) = 0) + l_0 \cdot \mathbf{1}(Y = 0,\, h(X) = 1)$$

Denoting

$$p(X) = P(Y = 1 \mid X),$$

the misclassification risk at a particular point X is defined as the expected loss of our model:

$$r[h(X)] = l_1\, p(X)\, \mathbf{1}(h(X) = 0) + l_0\, [1 - p(X)]\, \mathbf{1}(h(X) = 1)$$

To minimize our risk, we choose h(X) = 1 if $l_1\, p(X)$ is greater than $l_0\, [1 - p(X)]$, and h(X) = 0 otherwise. This is called the Bayesian decision rule and is denoted by $h_B(X)$:

$$h_B(X) = \mathbf{1}\big(\, l_1\, p(X) > l_0\, [1 - p(X)] \,\big) \tag{4.3}$$

where $\mathbf{1}(x)$ is an indicator function, namely, $\mathbf{1}(x) = 1$ if x is true and 0 otherwise.
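A minimal numerical sketch of rule (4.3); the losses, posterior values, and function name below are made up for illustration.

```python
def bayes_decide(p1, l0=1.0, l1=1.0):
    """Bayesian decision rule (4.3).

    p1 = P(Y = 1 | X); l1 is the loss for predicting 0 when Y = 1,
    l0 the loss for predicting 1 when Y = 0.  Choose h(X) = 1 exactly
    when doing so has the smaller expected loss.
    """
    return 1 if l1 * p1 > l0 * (1.0 - p1) else 0

# With zero-one loss the rule reduces to "pick the most likely class":
print(bayes_decide(0.7))                  # -> 1
print(bayes_decide(0.3))                  # -> 0
# Asymmetric losses shift the decision threshold away from 1/2:
print(bayes_decide(0.3, l0=1.0, l1=5.0))  # missing class 1 is costly -> 1
```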

[1] See Robert J. Schalkoff (1992), Pattern Recognition: Statistical, Structural and Neural Approaches.


Exercise: explain why the Bayesian decision rule is the optimal decision rule given our definition of the loss function.

Exercise: when we make the Bayesian decision, what is our misclassification risk? (This is known as the Bayesian risk.)

If $l_0 = l_1 = 1$, as in most cases, the loss function becomes

$$L[Y, h(X)] = \mathbf{1}\big( h(X) \neq Y \big)$$

which is often referred to as the zero-one loss in the machine learning literature. Under the zero-one loss, the Bayesian decision rule (4.3) reduces to picking the most likely class, i.e.,

$$h_B(X) = \mathbf{1}\big(\, p(X) > 1/2 \,\big) \tag{4.4}$$

Also, it is obvious that, if we make Bayesian decisions at all the possible points in the feature space, our overall misclassification risk will be minimized.

From the above discussion, the posterior distribution P(Y | X) can be seen as a discriminant function, and the Bayesian decision rule can be expressed as

$$h_B(X) = \arg\max_{k} P(Y = k \mid X) \tag{4.5}$$

Exercise: we should mention that P(Y | X) is not the unique choice of discriminant function that minimizes the risk. Can you give another discriminant function?

Now, the only thing left is how to estimate P(Y | X) from the training set S. In Bayesian statistics, according to Bayes' rule,

$$P(Y = k \mid X) = \frac{P(X \mid Y = k)\, P(Y = k)}{P(X)}$$

where P(Y) is the prior or unconditional probability that Y = k, and P(X | Y) is called the likelihood function or class density function.

As in the HR example, once we collect a large number of images, we can find the frequencies with which the two characters occur. For example, if 'a' occurs three times as often as 'b', we have P(Y = 1) = 0.75 and P(Y = 0) = 0.25. If we were forced to predict the label of a new image without being allowed to see it, the best we could do is predict 'a', because P(Y = 1) > P(Y = 0).

To be more formal, the maximum likelihood estimate of P(Y) can be calculated from the training set S as

$$\hat{P}(Y = k) = \frac{M_k}{M}$$

where $M_k$ is the number of instances in S with label k, assuming that the instances in S are sampled i.i.d. from the distribution P(X, Y).

Let us denote the prior probability that Y = k as

$$\pi_k = P(Y = k) \qquad (k = 0, 1)$$

and the class density function of X given Y = k as

$$p_k(X) = P(X \mid Y = k) \qquad (k = 0, 1)$$

Plugging these into Bayes' rule, we have

$$P(Y = 1 \mid X) = \frac{\pi_1\, p_1(X)}{\pi_0\, p_0(X) + \pi_1\, p_1(X)} \tag{4.6}$$

To be more general, for domains with K classes,

$$P(Y = k \mid X) = \frac{\pi_k\, p_k(X)}{\sum_{j=1}^{K} \pi_j\, p_j(X)} \tag{4.7}$$
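To make (4.6) concrete, here is a sketch that assumes, purely for illustration, one-dimensional Gaussian class densities for the height/width feature X1; all parameter values are made up.

```python
import math

def gaussian(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# Made-up priors and class densities (Y = 1 is 'a', Y = 0 is 'b'):
pi = {0: 0.25, 1: 0.75}
density = {0: lambda x: gaussian(x, mu=1.5, sigma=0.3),   # 'b' is tall
           1: lambda x: gaussian(x, mu=0.9, sigma=0.2)}   # 'a' is squat

def posterior(x):
    """Equations (4.6)/(4.7): P(Y = k | X = x) via Bayes' rule."""
    joint = {k: pi[k] * density[k](x) for k in pi}
    z = sum(joint.values())
    return {k: v / z for k, v in joint.items()}

print(posterior(1.0))   # mostly class 1 ('a')
print(posterior(1.6))   # mostly class 0 ('b')
```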

As we have seen, the prior probabilities can be estimated easily from the training set. However, estimating the class density function p_k(X) is not a trivial task, because we are dealing with a function of N + 1 variables. We will discuss such methods later, when we come to the learning models.

In some real-world domains, N can be very large. In our HR example, before feature extraction we have 16,384 input variables; in the DMFR example, N is about 400. When N is large, a learning algorithm is likely to run into the problem referred to as the curse of dimensionality.

C. The Curse of Dimensionality

If we have N features, and each feature takes V different values, the whole feature space is divided into a grid with many cells. Each instance in the training set falls into one cell and carries a label Y. The following figure shows the situation when only 2 features are present and each feature has 5 different values; shaded cells indicate Y = 1, and clear cells have Y = 0.

If we are given a new instance, we can determine its label by finding the cell it falls in and returning the majority (for a regression problem, the average) of Y over all the training data lying in that cell. However, the total number of cells in the grid is V^N, which grows exponentially with the dimension of the feature space. This phenomenon is termed the curse of dimensionality. If we are forced to work with a limited quantity of training data, as we are in practice, the grid becomes extremely sparse very quickly as we increase the dimension of the feature space. As a result, it provides a very poor representation of the mapping from X to Y. In pattern recognition this is also called over-fitting, i.e., having too few data points to determine too many parameters, with the result that the model has no generalization ability.
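The arithmetic behind this sparsity claim is easy to check; the training-set size below is made up for illustration.

```python
# Cells in the grid grow as V**N, while the training set stays fixed.
M = 100_000          # instances we might realistically have (made up)
V = 5                # values per feature, as in the 2-D figure above
for N in (2, 5, 10, 20):
    cells = V ** N
    print(f"N={N:2d}: {cells:>16,} cells, "
          f"{M / cells:.2e} training instances per cell on average")
```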



Most pattern recognition problems have to struggle with this dilemma: with too few features, the learned model may be too rough to provide discriminative power; with too many features, the model will over-fit the training data very quickly, leading to poor prediction performance.

Exercise: suppose you are given a set of M points (X_m, Y_m), and the task is to find a polynomial that fits the data and predicts Y given a new X. Can you explain the above dilemma in terms of this curve-fitting task?

V. Data preprocessing

In this lecture we emphasize feature extraction and selection heavily. The choice of data preprocessing can often have a significant impact on generalization performance. However, as you may have noticed, compared to the number of papers on learning models and algorithms, very few papers are devoted to this topic. The reasons may be that (i) features are very much problem-dependent, and it is very difficult to come up with an algorithm that automates the process and applies to a wide range of domains; and (ii) many theoreticians ignore this area because it does not seem to be a promising source of beautiful theoretical results. However, for most real-world problems this step can never be skipped.

A. Feature Extraction

The most important objective of data preprocessing is to reduce the dimensionality of the input vector. One way to do so is simply to discard part of the original inputs. A more sophisticated way is to form linear or non-linear combinations of the original inputs to create new features.

The latter is more "natural" if we think about how humans recognize patterns. For example, in our HR example, when a human looks at the image of a character, he never pays much attention to each individual pixel. Instead, he quickly notices the relative size or proportions of the character and the curvature of the strokes. This stage, namely, forming features from the original input data, is often referred to as feature extraction.

It is obvious that feature extraction can be highly problem-dependent. For different domains, the nature of the features can be totally different. In many real-world problems, this stage can only be partially automated, and the knowledge of human experts remains absolutely indispensable. For example, when computer scientists celebrated the win of Deep Blue over Kasparov, nobody could deny the contributions of the chess masters who actually taught Deep Blue its chess knowledge.

A combination of hard work and talent can achieve brilliant results. In the artificial or computational intelligence area, until very recently, it was still human beings who contributed the talent, while the computer remained a tool for the hard work.


Another difficulty is that optimality can hardly be defined. Because we are talking about (linear and non-linear) mappings from the raw data to the feature space, even with a finite number of original inputs, the number of possible features is still infinite. Furthermore, evaluating the goodness of these features is rather difficult.

Principal component analysis for feature extraction (Optional)

Let us restrict our attention to linear transformations and discuss the technique called principal component analysis (PCA). PCA is an unsupervised method; in other words, no information about the target variable in the training set is used.

The goal is to map a set T of original M-dimensional vectors d_t (t = 1, …, T) onto vectors x_t in an N-dimensional feature space (X_1, X_2, …, X_N), where M > N. A vector d can be represented as a linear combination of a set of M orthonormal vectors u_m:

$$\mathbf{d} = \sum_{m=1}^{M} x_m\, \mathbf{u}_m \tag{5.1}$$

where the vectors u_m satisfy the orthonormality relation $\mathbf{u}_i^T \mathbf{u}_j = 1$ if i = j, and 0 otherwise. Multiplying both sides of (5.1) by $\mathbf{u}_m^T$, we have

$$x_m = \mathbf{u}_m^T\, \mathbf{d} \qquad (m = 1, \ldots, M) \tag{5.2}$$

which is a simple rotation of the coordinate system from d to x. Suppose only N of the x_m's are retained and the other M − N coefficients are replaced by constants b_i; then

$$\tilde{\mathbf{d}} = \sum_{n=1}^{N} x_n\, \mathbf{u}_n + \sum_{i=N+1}^{M} b_i\, \mathbf{u}_i \tag{5.3}$$

By doing so, the degrees of freedom are reduced from M (in d) to N (in x).

For a vector d_t, the error introduced by (5.3) is

$$\mathbf{d}_t - \tilde{\mathbf{d}}_t = \sum_{i=N+1}^{M} (x_{ti} - b_i)\, \mathbf{u}_i$$

We minimize the sum of the squared errors over the whole training set T, namely

$$E_T = \frac{1}{2} \sum_{t=1}^{T} \sum_{i=N+1}^{M} (x_{ti} - b_i)^2 \tag{5.4}$$

Setting the derivative of E_T with respect to b_i to 0, we get [2]

$$b_i = \mathbf{u}_i^T\, \bar{\mathbf{d}} \qquad (i = N+1, \ldots, M)$$

and

[2] The readers are referred to C. M. Bishop (1995), Neural Networks for Pattern Recognition, for a more detailed treatment.


$$E_T = \frac{1}{2} \sum_{i=N+1}^{M} \mathbf{u}_i^T\, \Sigma\, \mathbf{u}_i \tag{5.5}$$

where

$$\bar{\mathbf{d}} = \frac{1}{T} \sum_{t=1}^{T} \mathbf{d}_t \tag{5.6}$$

$$\Sigma = \sum_{t=1}^{T} (\mathbf{d}_t - \bar{\mathbf{d}})(\mathbf{d}_t - \bar{\mathbf{d}})^T \tag{5.7}$$

Now the remaining task is to minimize E_T with respect to the choice of basis vectors u_i. In Bishop (1995), Appendix E, it is shown that the optimal orthonormal vectors turn out to be the eigenvectors of the covariance matrix Σ, so that

$$E_T = \frac{1}{2} \sum_{i=N+1}^{M} \lambda_i \tag{5.8}$$

where the λ_i are the eigenvalues of Σ. Thus, to minimize E_T we should discard the eigenvectors corresponding to the M − N smallest λ_i's; then the total error is minimized.

To summarize, in PCA we first compute the mean vector and covariance matrix Σ of the original data as in (5.6) and (5.7). Then the eigenvectors and eigenvalues of Σ are calculated, and the eigenvectors corresponding to the N largest eigenvalues are used as the basis to project the original input vectors d_t onto vectors x_t in the N-dimensional feature space.
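A minimal numpy sketch of the procedure just summarized; all variable and function names are our own.

```python
import numpy as np

def pca(data, n_components):
    """Project rows of `data` (one d_t per row) onto the top-N principal axes.

    Follows the recipe above: mean vector, covariance matrix, then the
    eigenvectors belonging to the N largest eigenvalues form the basis.
    """
    mean = data.mean(axis=0)                      # (5.6)
    centered = data - mean
    cov = centered.T @ centered / len(data)       # (5.7), up to scaling
    eigvals, eigvecs = np.linalg.eigh(cov)        # ascending eigenvalues
    basis = eigvecs[:, ::-1][:, :n_components]    # keep the N largest
    return centered @ basis, basis, eigvals[::-1]

rng = np.random.default_rng(0)
d = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))  # correlated toy data
x, basis, eigvals = pca(d, n_components=2)
print(x.shape, eigvals[:2])   # (200, 2) and the two largest eigenvalues
```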

Question: explain the basic idea of PCA, in what sense is PCA an optimal or sub-optimal method?


B. Feature Selection

The goal of feature selection is to reduce the dimensionality of the feature space by selecting a subset of the features and discarding the remainder. In many real-world domains, due to the large number of features, feature selection is absolutely necessary to avoid over-fitting. In a given database, many features can be totally irrelevant or redundant for predicting the target. Ideally, if we can identify these features and eliminate them from the training data, the learning performance should not be compromised.

Any feature selection algorithm must have two components: (i) a criterion that evaluates the goodness of a subset, and (ii) a search scheme for locating good candidate subsets. The most difficult part of the problem is the large number of candidate subsets: exhaustive search is completely infeasible.

Exercise: if the original data set has N0 features and we want to choose a subset of at most N features, what is the total number of candidate subsets? For example, N0 = 400, N = 50.

Another difficulty is that evaluating the goodness of a subset can be difficult and expensive. Research work can be divided along two lines: filter and wrapper methods [3]. The wrapper method searches through the space of feature subsets using the estimated accuracy of a certain induction algorithm as the measure of goodness for a subset. Thus, the feature selection step is "wrapped around" the learning algorithm, so that the bias introduced by the search scheme and the bias of the learning algorithm strongly interact. In the filter method, the criterion that judges whether a subset is good is somewhat "independent" of the induction learning procedure; in other words, the bias introduced by feature selection does not interact directly with the bias inherent in the learning model.

Gaining some insight into each individual feature

The first step is to gain some insight into each individual feature. There are many criteria that can be used to judge the goodness of a feature; of course, different criteria introduce different biases.

1. For the wrapper method, we first randomly sample a testing set from the training data. Then we train the learning model on the remaining data, using only the one feature as input. The goodness of that feature is then estimated as the accuracy of the learned model on the testing set.

2. The correlation coefficient. When we consider a single feature, the training data reduce to S_{1} = {(x_1, y_1), (x_2, y_2), …, (x_M, y_M)}. Given S_{1}, the correlation coefficient between the feature X and the target Y can be calculated as

$$\rho = \frac{\sum_{m=1}^{M} (x_m - \bar{x})(y_m - \bar{y})}{\sqrt{\sum_{m=1}^{M} (x_m - \bar{x})^2}\; \sqrt{\sum_{m=1}^{M} (y_m - \bar{y})^2}} \tag{5.9}$$

where $\bar{x} = \frac{1}{M}\sum_{m} x_m$ and $\bar{y} = \frac{1}{M}\sum_{m} y_m$.

The correlation coefficient only works for numerical data.

3. The χ² test. The χ² test is one of the most useful statistical tools for testing statistical significance. It gives the probability that a correlation observed in an I × J contingency table is due to chance. For two variables A and B, taking I and J different values respectively, the contingency table is an I × J matrix in which each cell o_ij is the observed count of the event "A = i and B = j" in the database.

An I × J contingency table:

        B = 1 … B = j … B = J | row sum
A = 1 | o_11  … o_1j  … o_1J | o_1·
A = 2 | o_21  … o_2j  … o_2J | o_2·
  …   |  …    …  …    …  …   |  …
A = i | o_i1  … o_ij  … o_iJ | o_i·
  …   |  …    …  …    …  …   |  …
A = I | o_I1  … o_Ij  … o_IJ | o_I·
------+----------------------+-----
 sum  | o_·1  … o_·j  … o_·J | o_··

We define $o_{i\cdot} = \sum_j o_{ij}$, $o_{\cdot j} = \sum_i o_{ij}$, and $o_{\cdot\cdot} = \sum_{i,j} o_{ij}$. If we assume A and B are independent, the expected count in cell ij is $e_{ij} = o_{i\cdot}\, o_{\cdot j} / o_{\cdot\cdot}$. The χ² statistic is defined as

$$\chi^2 = \sum_{i=1}^{I} \sum_{j=1}^{J} \frac{(o_{ij} - e_{ij})^2}{e_{ij}}$$

If the null hypothesis is true, namely A and B are truly independent, this statistic is distributed approximately according to a chi-squared distribution with d = (I − 1)(J − 1) degrees of freedom. The mean of the distribution is d and its variance is 2d.

The z-score of a statistic is how far it lies from the mean, in units of the standard deviation of its distribution under the null hypothesis. The z-score of a χ² statistic with d degrees of freedom is

$$z = \frac{\chi^2 - d}{\sqrt{2d}} \tag{5.10}$$

Generally, a z-score of less than 2 is not significant, while a z-score of more than 6 must be due to some deterministic relationship.

The χ² test only works for discrete features. For a continuous variable, we may need to discretize it first.
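A small sketch of the test on a 2 × 2 contingency table; the counts are made up.

```python
import math

def chi2_zscore(table):
    """Chi-squared statistic and its z-score (5.10) for an I x J table."""
    I, J = len(table), len(table[0])
    row = [sum(r) for r in table]
    col = [sum(table[i][j] for i in range(I)) for j in range(J)]
    total = sum(row)
    chi2 = sum((table[i][j] - row[i] * col[j] / total) ** 2
               / (row[i] * col[j] / total)
               for i in range(I) for j in range(J))
    d = (I - 1) * (J - 1)
    return chi2, (chi2 - d) / math.sqrt(2 * d)

# Made-up counts: rows are feature values, columns are class labels.
observed = [[30, 10],
            [20, 40]]
chi2, z = chi2_zscore(observed)
print(f"chi2 = {chi2:.2f}, z = {z:.2f}")  # z >> 2 -> dependence is significant
```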



One important point worth mentioning is that correlation does not necessarily imply causation. Without invoking a systematic theory of causation, here are some basic guidelines [4]:
(i) Causation implies correlation, but correlation does not imply causation.
(ii) Two variables may be correlated because they have a common cause.
(iii) If variable A causes variable B, then changing A changes B in general.
(iv) If A and B have a common cause, changing A does not necessarily change B.

4. The information gain and gain ratio

In information theory, the entropy of a discrete distribution is defined as

$$H(p) = -\sum_{i} p_i \log_2 p_i \quad \text{(bits)}$$

where $\sum_i p_i = 1$. Entropy measures the uncertainty of the distribution, i.e., on average, how many bits of information one needs to be sure about the true state. It is also referred to as negative information. The information conveyed by each event depends on its probability p_i and can be measured as $-\log_2 p_i$ bits.

Let $M_k$ be the number of examples in set S that belong to class k, where k = 1, …, K, and let $M = \sum_k M_k$ be the total number of instances in the set. The event "randomly select one example from set S and it belongs to class k" has probability $M_k / M$; thus, it conveys $-\log_2 (M_k / M)$ bits of information. The expected information of set S can be expressed as

$$\text{info}(S) = -\sum_{k=1}^{K} \frac{M_k}{M} \log_2 \frac{M_k}{M}$$

which is the entropy of set S when only the target variable Y is considered.

If our training set S is partitioned into V subsets S_1, …, S_V according to the V different values of a feature X, the expected information given such a partition would be

$$\text{info}_X(S) = \sum_{v=1}^{V} \frac{|S_v|}{|S|}\, \text{info}(S_v)$$

We therefore define

$$\text{gain}(X) = \text{info}(S) - \text{info}_X(S) \tag{5.11}$$

which is also known as the mutual information between feature X and the target variable Y.

The gain criterion has a strong bias in favor of features with many different values. This bias can be rectified by a kind of normalization. We define

$$\text{split\_info}(X) = -\sum_{v=1}^{V} \frac{|S_v|}{|S|} \log_2 \frac{|S_v|}{|S|}$$

which represents the potential information generated by dividing S into V partitions, whereas gain(X) measures the information relevant to classification that arises from such a partition. Hence

$$\text{gain\_ratio}(X) = \frac{\text{gain}(X)}{\text{split\_info}(X)} \tag{5.12}$$

expresses the proportion of information generated by the partition that is helpful for classification.

[4] Charles Elkan (1999), Lecture Notes for CS281r, Harvard University.

(5.9), (5.10), (5.11) and (5.12) give several criteria for measuring the "goodness" of a single feature in terms of predicting the target variable. In practice, by ordering the features according to any of these criteria, we get a rough idea of how good they are. These tests can also be extended to combinations of a small number (2 or 3) of features.
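A sketch of gain (5.11) and gain ratio (5.12) for one discrete feature; the toy feature values and labels are made up.

```python
import math
from collections import Counter

def entropy(labels):
    """info(S): entropy of the class labels, in bits."""
    m = len(labels)
    return -sum((c / m) * math.log2(c / m) for c in Counter(labels).values())

def gain_and_ratio(feature, labels):
    """gain(X) per (5.11) and gain_ratio(X) per (5.12)."""
    m = len(labels)
    partitions = {}
    for v, y in zip(feature, labels):
        partitions.setdefault(v, []).append(y)   # split S by feature value
    info_x = sum(len(p) / m * entropy(p) for p in partitions.values())
    gain = entropy(labels) - info_x
    split_info = -sum((len(p) / m) * math.log2(len(p) / m)
                      for p in partitions.values())
    return gain, gain / split_info

x = ["red", "red", "blue", "blue", "blue", "red"]   # made-up feature
y = [1, 1, 0, 0, 1, 0]                              # made-up labels
print(gain_and_ratio(x, y))
```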

Question: can we simply choose the top N out of the N0 features as our selection? Why?

Searching strategy

The answer to the above question is "No": sometimes the combination of two features is a strong predictor while neither is by itself, and the above strategy is likely to drop both of them. Conversely, two features may be highly correlated or even identical, and the strategy cannot detect such a relationship at all.

Now the question becomes how to search for good candidate subsets. As we have mentioned, because the search space is so large, exhaustive search is totally infeasible, and blind picking does not help much.

Several greedy search schemes have been proposed. One is forward selection (sketched below): beginning with an empty set, features are added one by one according to some goodness metric, such as information gain, until either the gain is no longer significant or the number of features reaches N. Another scheme is backward elimination, which begins with the full set and keeps dropping features that seem not to be useful. Most of these greedy search methods consider one feature at a time and do not backtrack: once a feature is selected or dropped, it is never considered again.
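A sketch of greedy forward selection; `score` stands in for whichever goodness criterion is chosen (gain, correlation, or wrapper accuracy), and the toy criterion below is made up.

```python
def forward_selection(all_features, score, max_features, min_improvement=1e-6):
    """Greedy forward selection: grow the subset one feature at a time.

    `score(subset)` returns the goodness of a candidate subset.  There is
    no backtracking: once a feature is added it is never reconsidered.
    """
    selected, best = [], float("-inf")
    while len(selected) < max_features:
        candidates = [f for f in all_features if f not in selected]
        if not candidates:
            break
        f_best = max(candidates, key=lambda f: score(selected + [f]))
        new = score(selected + [f_best])
        if new - best < min_improvement:   # gain no longer significant
            break
        selected.append(f_best)
        best = new
    return selected

# Toy criterion: pretend features 2 and 7 carry all the signal.
toy_score = lambda subset: len({2, 7} & set(subset)) - 0.01 * len(subset)
print(forward_selection(range(10), toy_score, max_features=5))  # -> [2, 7]
```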

Question: in what sense is the greedy search sub-optimal?

Answer: suppose we have two features X1 and X2 that are good predictors when present together, but neither is a good predictor by itself. The greedy search schemes tend to eliminate both of them.

Exercise: Can you propose a search scheme for the feature selection problem using the idea of Ordinal Optimization? (Just show some rough idea)

VI. Learning models

As we mentioned in Section IV, estimating the class density function (likelihood function) p_k(X) is very hard because the dimensionality of X can be very high.

Different learning models deal with this problem differently. Some algorithms estimate p_k(X) explicitly while others do it implicitly. Some assume the likelihood function belongs to a parametric probability family and use the training data to fit the parameters, while others use Monte-Carlo simulation or other non-parametric methods to estimate p_k(X).


6.1 Bayesian network, naïve Bayes classifier [5]

A Bayesian network is a graphical model for probabilistic reasoning among a set of variables. It has become a popular representation for encoding expert knowledge in expert systems. More recent research has focused on learning Bayesian networks from data and has shown that they are remarkably efficient models for some data mining problems.

A Bayesian network encodes the joint probability distribution of a large number of variables efficiently. To be more formal, a Bayesian network for a set of features X = (X_1, …, X_N) consists of (i) a network structure S that encodes conditional independence assertions about the features in X, and (ii) a set of local conditional probabilities associated with each feature.

The network structure S is a directed acyclic graph (DAG). Each node in S represents a feature X_n ∈ X. We use Π_n to denote the parents of node X_n; in other words, Π_n is the set of nodes each of which has a direct arc pointing to node X_n. The lack of an arc in a Bayesian network represents conditional independence.

The joint distribution for X is given by

$$P(X_1 = x_1, \ldots, X_N = x_N) = \prod_{n=1}^{N} P(X_n = x_n \mid \Pi_n = \pi_n) \tag{6.1}$$

or, more compactly,

$$p(\mathbf{x}) = \prod_{n=1}^{N} p(x_n \mid \pi_n) \tag{6.2}$$

How to build a Bayesian network

To illustrate the process of building a Bayesian network, let us consider our DMFR example. One possible choice of features for the problem could be: Class (C, whether the person is a promising donor or not), Age (A), State (S, in which state the person lives), Donation (D, the amount of money donated last time), and cluB (B, whether the person is a member of a certain club).

After choosing the features, we try to build an acyclic graph that encodes assertions of conditional independence. One approach is based on the chain rule of probability:

$$p(x_1, \ldots, x_N) = \prod_{n=1}^{N} p(x_n \mid x_1, \ldots, x_{n-1}) \tag{6.3}$$

For every feature X_n, there will be some subset $\Pi_n \subseteq \{X_1, \ldots, X_{n-1}\}$ such that X_n and the remaining variables in $\{X_1, \ldots, X_{n-1}\}$ are conditionally independent given $\Pi_n$. That is,

$$p(x_n \mid x_1, \ldots, x_{n-1}) = p(x_n \mid \pi_n) \tag{6.4}$$

[5] The readers are referred to David Heckerman (1996), A Tutorial on Learning with Bayesian Networks, for a more detailed presentation (available at ftp://ftp.research.microsoft.com/pub/dtg/david/tutorial.ps).



Combining (6.3) and (6.4), we obtain (6.2). Consequently, to determine the structure of a Bayesian network we (i) order the features in some way, and (ii) determine the parent set Π_n that satisfies (6.4) for each feature.

Question: what advantage do we gain by calculating the joint distribution p(x) via Equation (6.2)?

In our DMFR example, using the order (C, A, S, B, D), we might have the following conditional independencies (not necessarily real):

$$p(a \mid c) = p(a), \quad p(s \mid c, a) = p(s), \quad p(b \mid c, a, s) = p(b \mid c, a), \quad p(d \mid c, a, s, b) = p(d \mid c, s, b)$$

Thus, we have the following network structure:

A possible Bayesian network for the DMFR example (arcs: C → B, A → B, C → D, S → D, B → D); the local probabilities attached to the nodes are listed below.

The above approach has a serious drawback: if the order of the features is chosen carelessly, the resulting network structure may fail to reveal the conditional independencies among the features.

Another approach to constructing the Bayesian network does not require the ordering, but relies heavily on human knowledge to assert causal relationships among the features. In particular, we simply draw arcs from cause variables to their immediate effects. Because causal relationships typically correspond to conditional dependence, in most cases, if we have good prior knowledge about the features, this results in a good network structure.

The last step of constructing a Bayesian network is assessing the local conditional probabilities p(x_n | π_n), such as those listed in the figure below.

Theoretically, given the structure S of a Bayesian network and the local conditional probabilities p(x_n | π_n) (n = 1, …, N), the joint probability distribution is completely specified, and we can infer any probability of interest from it. For example, in the DMFR problem we may be interested in p(C | a, s, b, d), which by (6.2) is

$$p(c \mid a, s, b, d) = \frac{p(c)\, p(a)\, p(s)\, p(b \mid c, a)\, p(d \mid c, s, b)}{\sum_{c'} p(c')\, p(a)\, p(s)\, p(b \mid c', a)\, p(d \mid c', s, b)}$$


[Figure: a possible Bayesian network for the DMFR example]
Nodes: Class (C), Age (A), State (S), cluB (B), Donation (D); arcs: C → B, A → B, C → D, S → D, B → D. The local probabilities shown with the figure are:

p(C = y) = 0.05;  p(S = CAFL) = 0.2;  p(A < 30) = 0.25;  p(A = 30~50) = 0.40
p(B = y | C = y, A < 30) = 0.008;    p(B = y | C = n, A < 30) = 0.001
p(B = y | C = y, A = 30~50) = 0.015; p(B = y | C = n, A = 30~50) = 0.002
p(B = y | C = y, A > 50) = 0.010;    p(B = y | C = n, A > 50) = 0.002
p(D > 0 | c, s, b): a table over the configurations of C, S, and B
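A sketch of how (6.2) makes the joint, and hence a posterior such as p(C | a, b), computable by enumeration. For brevity we use only the C → B ← A fragment of the network; the CPT values follow the figure above, p(A > 50) = 0.35 is inferred from the two listed age probabilities, and all names are ours.

```python
# Priors and one CPT from the figure above (C -> B <- A fragment).
p_C = {"y": 0.05, "n": 0.95}
p_A = {"<30": 0.25, "30~50": 0.40, ">50": 0.35}
p_B_given_CA = {  # p(B = y | C, A)
    ("y", "<30"): 0.008, ("n", "<30"): 0.001,
    ("y", "30~50"): 0.015, ("n", "30~50"): 0.002,
    ("y", ">50"): 0.010, ("n", ">50"): 0.002,
}

def joint(c, a, b):
    """Equation (6.2) restricted to the C, A, B fragment."""
    pb = p_B_given_CA[(c, a)]
    return p_C[c] * p_A[a] * (pb if b == "y" else 1.0 - pb)

def posterior_C(a, b):
    """p(C | A = a, B = b), normalizing the joint over the values of C."""
    scores = {c: joint(c, a, b) for c in p_C}
    z = sum(scores.values())
    return {c: v / z for c, v in scores.items()}

print(posterior_C(a="30~50", b="y"))  # club membership raises p(C = y)
```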

Although conditional independence simplifies probability inference, exact inference in an arbitrary Bayesian network over discrete features is still NP-hard.

Another important issue is: can we automate the process of building a Bayesian network from training data? To be more specific, can we learn the local probabilities given the network structure S? Can we learn or improve the structure S from data? Unfortunately, if we do not restrict the form of the network or make simplifying assumptions, these are all NP-hard problems. Many papers are devoted to algorithms that learn special forms of Bayesian networks in polynomial time [6]. Due to limited time, we cannot enumerate them here. However, the simplest form of Bayesian network, often referred to as the naïve Bayesian classifier, is of particular interest, and we shall discuss it now.

Naïve Bayesian classifiers

The naïve Bayesian classifier assumes that all the features are conditionally independent given the value of the class variable; hence,

$$p_k(X) = P(X \mid Y = k) = \prod_{n=1}^{N} p(X_n \mid Y = k) \tag{6.5}$$

where $p(X_n \mid Y = k)$ is the marginal density function of each individual feature given the class variable.

Graphically, the naïve Bayesian classifier can be represented as the following DAG, in which Y is the sole parent of each feature node:

[Figure] Naïve Bayesian classifier: Y → X_1, X_2, X_3, …, X_N

For a reasonably large training set S, we should have enough data to obtain relatively accurate estimates

$$\hat{p}(X_n = v \mid Y = k) = \frac{M^{(n)}_{kv}}{M_k} \tag{6.6}$$

where $M_k$ is the number of training instances with label k and $M^{(n)}_{kv}$ is the number of those whose nth feature takes the value v.

[6] The reference section of the tutorial by David Heckerman is a good resource if the readers are interested in this topic.



In real-world domains, the independence assumption can be far from reality, which explains why the algorithm is called "naïve." Nevertheless, it leads to a considerable computational advantage, because (6.5) deals with N one-dimensional functions instead of one N-dimensional function. From (6.6), we can see that its computational and memory complexities are linear in N (the number of features), V (the largest number of possible values of a feature), and M (the number of instances in the training set). In theory, no algorithm can be simpler than this.

When all the features are truly independent of each other, (6.5) gives an unbiased estimate of p_k(X), and hence we make Bayesian decisions. However, if the independence assumption is violated, as it is in many domains, the estimate of the class density function can be highly distorted. Due to this distortion, naïve Bayesian classifiers were long considered primitive. In spite of these perceived limitations, researchers have gradually noticed that naïve Bayesian classifiers perform unexpectedly well in many artificial and real-world domains compared to more sophisticated algorithms such as decision trees and neural networks. Many experiments have shown that, even with clear dependence among the features, the naïve Bayesian classifier is still quite robust and competitive in terms of classification performance.

Question: can you give some intuition for why the naïve Bayesian classifier works so well even when the conditional independence assumption is violated? Hint: how many parameters do we estimate when learning with a naïve Bayesian classifier?

Besides the reason given as the anticipated answer to the above question, there is another important point: in naïve Bayesian classifiers, we make decisions based on the estimate of the posterior probability $\hat{P}(Y = k \mid X)$:

$$\hat{P}(Y = k \mid X) \propto \hat{\pi}_k \prod_{n=1}^{N} \hat{p}(X_n \mid Y = k) \tag{6.7}$$

We only care about the order of $\hat{P}(Y = k \mid X)$ across the classes k. According to the idea of ordinal optimization, order is much more robust than value; therefore, even if our estimates are very rough, the order may still be preserved, which leads to the optimal decision!
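A minimal sketch of a naïve Bayesian classifier over discrete features: it estimates (6.6) by counting and predicts via the unnormalized posterior (6.7). The class structure, the Laplace smoothing (added to avoid zero counts, assuming binary-valued features), and the toy data are ours.

```python
from collections import Counter, defaultdict

class NaiveBayes:
    """Naive Bayesian classifier for discrete features, per (6.5)-(6.7)."""

    def fit(self, X, y):
        self.class_count = Counter(y)
        self.M = len(y)
        # counts[(n, v, k)] = #instances whose nth feature equals v, label k
        self.counts = defaultdict(int)
        for row, k in zip(X, y):
            for n, v in enumerate(row):
                self.counts[(n, v, k)] += 1
        return self

    def predict(self, row):
        def score(k):  # unnormalized posterior (6.7); only the order matters
            s = self.class_count[k] / self.M  # prior estimate
            for n, v in enumerate(row):
                # Laplace smoothing (+1 / +2, assuming binary features)
                s *= (self.counts[(n, v, k)] + 1) / (self.class_count[k] + 2)
            return s
        return max(self.class_count, key=score)

# Toy data: (H/W bucket, Left Edge) -> letter, loosely echoing HR.
X = [("tall", "T"), ("tall", "F"), ("squat", "T"), ("squat", "F"), ("tall", "T")]
y = ["b", "b", "a", "a", "b"]
print(NaiveBayes().fit(X, y).predict(("squat", "T")))  # -> 'a'
```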

A lot of work has been done to improve the performance of the naïve Bayesian classifier while maintaining its nice properties. Some of these methods try to capture some of the dependence among features; some try to detect correlations among features and eliminate highly correlated ones; others involve the technique called boosting. Should the readers be interested in this topic, they may obtain references from the teaching fellow or by searching the literature.

6.2 Decision trees

Suppose in the HR example, after feature extraction, we have the following data set:

H/W   Total Pixels   Y Posi   Left Edge   Letter
1.0   75             1        T           'b'
0.8   80             7        T           'a'
0.7   85             6        F           'a'
0.9   72             8        F           'a'
1.0   69             2        F           'b'
1.4   71             6        F           'a'
1.5   65             6        F           'a'
1.3   75             7        T           'b'
1.4   68             5        T           'b'
1.4   70             3        T           'b'
1.8   72             6        T           'b'
1.7   83             1        F           'b'
2.0   64             1        T           'b'
2.0   81             2        F           'b'

We may come up with a decision tree like this (leaf counts are 'b' : 'a' instances):

H/W ≤ 1.0 → split on Y Posi:
    Y Posi ≤ 3 → 'b' (2:0)
    Y Posi > 3 → 'a' (0:3)
H/W 1.1~1.5 → split on Left Edge:
    Left Edge = T → 'b' (3:0)
    Left Edge = F → 'a' (0:2)
H/W > 1.5 → 'b' (4:0)

A possible decision tree for our database

The idea here is to partition the training set into single-class subsets. Of course there are many ways to do so. However, the tree-building process is not intended merely to find such a partition, but to build a tree that reveals the structure of the domain and has some predictive power. For this purpose, we need a significant number of examples at each leaf or, in other words, there should be as few partitions of the training set as possible. Unfortunately, finding the smallest decision tree consistent with a training set is NP-complete.
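A learned tree is just nested if-statements; here is a direct transcription of the tree above, checked against the training table (the function name and threshold boundaries are as we read them from the figure).

```python
def classify(hw, y_posi, left_edge):
    """Predict the letter using the decision tree shown above."""
    if hw <= 1.0:
        return "b" if y_posi <= 3 else "a"
    elif hw <= 1.5:
        return "b" if left_edge == "T" else "a"
    else:
        return "b"

# Check the tree against the training table above:
rows = [(1.0, 1, "T", "b"), (0.8, 7, "T", "a"), (0.7, 6, "F", "a"),
        (0.9, 8, "F", "a"), (1.0, 2, "F", "b"), (1.4, 6, "F", "a"),
        (1.5, 6, "F", "a"), (1.3, 7, "T", "b"), (1.4, 5, "T", "b"),
        (1.4, 3, "T", "b"), (1.8, 6, "T", "b"), (1.7, 1, "F", "b"),
        (2.0, 1, "T", "b"), (2.0, 2, "F", "b")]
print(all(classify(hw, yp, le) == letter for hw, yp, le, letter in rows))  # True
```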

Most decision tree construction algorithms are non-backtracking: at each point, once a feature is selected to partition the current set, usually based on some greedy local measure, the choice is cast in concrete, and the consequences of the other choices at that point are never considered again.

One famous decision tree algorithm, C4.5 [7], uses the information gain ratio as the criterion for deciding which feature to select: the feature that maximizes the information gain ratio at each decision point is used to partition the training set. Other criteria may also be used, which leads to different types of decision tree algorithms.

Decision tree algorithms thus have an intrinsic mechanism for selecting features. However, if the instances have a large number of features, the learned decision tree tends to become very tall, and the problem of over-fitting is likely to occur. Algorithms have been proposed to prune the tree, or to combine branches, so that predictive performance improves. Usually, after the tree is pruned, there is no guarantee that each leaf node is still a single-class subset.

[7] J. R. Quinlan (1993), C4.5: Programs for Machine Learning.



6.3 Neural networks [8]

Numerous efforts have been made to develop "intelligent" programs based on Von Neumann's computer architecture to solve pattern recognition and knowledge discovery problems like those in our HR and DMFR examples. Inspired by biological neural systems, artificial neural networks (ANNs), which can be viewed as parallel and distributed systems of a large number of interconnected simple processing units, were proposed as a model for tackling such problems.

Motivations: the human brain

Modern computers have outperformed humans in the domain of numerical computation and related symbolic manipulation. However, humans can solve complex perceptual problems (e.g., recognizing handwriting) at a speed that dwarfs the fastest supercomputer in the world. It is helpful to explore the biological neural system to see why such a remarkable difference exists.

A neuron is a special biological cell with information processing ability. It is composed of a cell body (soma) and two types of out-reaching branches: the axon and the dendrites. A neuron receives messages from other neurons through its dendrites and transmits signals generated by its cell body through the axon. A synapse is a place of contact between two neurons.

A biological neuron

Certain chemicals, called neurotransmitters, diffuse across the synaptic gap, and their effect is either to enhance or to inhibit the receiving neuron. The effectiveness of a synapse can be adjusted by the signals passing through it, so that synapses can learn from the activities they participate in. This may be responsible for the human capability of memory.

The cerebral cortex of the human brain contains about 10^11 neurons, which is approximately the number of stars in the Milky Way! Each neuron is connected to 10^3 ~ 10^4 other neurons. As far as we know, there are only about 10^2 different types of neurons, and they operate at much lower frequencies than electronic circuits. However, complex perceptual decisions are made by a human very quickly, which implies that the computation involved cannot take too many serial stages, and there cannot be direct transmission of bulk information. The key is parallel and distributed representation and processing.

[8] A good tutorial on artificial neural networks can be found in the IEEE Computer Special Issue on Neural Computing, March 1996: A. K. Jain et al., Artificial Neural Networks: a Tutorial.



We reproduce here the table from Jain et al. (1996), which is an excellent comparison between modern digital computers and biological neural systems.

                       Von Neumann computer            Biological neural system
Processor              complex / high speed            simple / low speed
                       one or a few                    large number
Memory                 separate from processor         integrated into processor
                       localized                       distributed
                       non-content addressable         content addressable
Computing              centralized                     distributed
                       sequential                      parallel
                       stored program                  self-learning
Reliability            very vulnerable                 robust
Expertise              numerical and symbolic          perceptual problems
                       manipulation
Operating environment  well-defined, well-constrained  poorly-defined, unconstrained

ANNs attempt to make use of some of the "organizational" principles believed to be at work in biological neural systems to deal with perceptual problems. Many researchers have realized that hardware implementation may be the only way to take full advantage of such organization. However, making a chip with such a huge number of interconnections is still a challenge.

Basic model, perceptron, and network architectures

In 1943, McCulloch and Pitts proposed a binary threshold unit as a computational model of a neuron, which is known as the perceptron.

A perceptron

Mathematically, a neuron computes a weighted sum of all the input signals; if the sum is greater than a certain threshold, the neuron outputs 1, and otherwise 0:

$$Y = f\left( \sum_{n=1}^{N} w_n X_n - \theta \right) \tag{6.8}$$

where f(·) is a unit step function, θ is the threshold, and w_n is the synaptic weight associated with the nth feature. Positive weights correspond to excitatory synapses, while negative weights correspond to inhibitory synapses.

The McCulloch and Pitts neuron uses the unit step function as its activation function. This can be generalized to other types of functions, such as piece-wise linear, sigmoid, or Gaussian, as shown in the figure below.


[Figure: a perceptron — inputs X_1, …, X_N with weights w_1, …, w_N feed a summing junction followed by a threshold device producing Y. Below: four common activation functions — threshold (unit step), piece-wise linear, sigmoid, Gaussian.]

Among all these functions, the sigmoid functions are the most widely used activations in ANNs. A sigmoid is strictly increasing and much smoother than the unit step function. The standard sigmoid is the logistic function

$$f(x) = \frac{1}{1 + e^{-x}} \tag{6.9}$$

An artificial neural network is an assembly of many interconnected neurons. To be more specific, it is a directed graph with each neuron as a node and their interconnections as the edges. According to their network architectures, ANNs may be classified into two major categories: (i) feedforward networks without any loop existing in the graph (ii) feedback (or recurrent) networks with loops (see the following figures).

[Figure: feedforward vs. feedback network architectures]

Different connectivities lead to different behaviors and require different mathematical tools. In this lecture, we do not have time to present the different types of ANNs, such as radial basis function nets, competitive networks, self-organizing maps, Hopfield networks, etc. Instead, we will mainly focus on learning single- and multi-layer perceptrons.

Learning rules and the back-propagation algorithm

A learning process, in the context of ANNs, can be viewed as updating the network connectivity and synapse weights so that the network can perform some task reasonably well. The connectivity and weights may sometimes be given by prior knowledge, but in most cases they are learned from training data. As we have mentioned, learning may be supervised or unsupervised, and sometimes hybrid.

Any learning theory should address several fundamental issues: (i) learning capacity, i.e., what is learnable and what is not; (ii) sample complexity, i.e., how much training data is necessary to achieve valid generalization; and (iii) time complexity.

For the first issue, learnability, one of the most significant results about ANNs is that 3-layer, or even 2-layer, feedforward networks (the input layer is not counted) with a large enough number of nonlinear hidden units are capable of approximating any continuous function with arbitrary accuracy. For the second issue, we really do not have much to say at this point; all we know is that the more nodes in the neural net, the more likely we are to overfit the data. The third issue depends on the algorithm deployed to train the neural network, and unfortunately many existing methods are very time-consuming.

The most important learning rule is the error-correction rule. If the output of the neural network is Ŷ while the desired label is Y, the error-correction rule uses the error signal (Y − Ŷ) to modify the weights so that this error is gradually reduced. The well-known algorithm for learning a single-layer perceptron is based on the error-correction rule. As mentioned, after the summing junction and before applying the threshold function, we have

$$u = \sum_{n=1}^{N} w_n X_n - \theta \tag{6.10}$$

and

$$\hat{Y} = f(u) \tag{6.11}$$

Therefore the linear equation $\sum_{n=1}^{N} w_n X_n - \theta = 0$ defines the decision boundary, which is a hyper-plane in the N-dimensional feature space.

Rosenblatt proved that, if the training data are drawn from two linearly separable classes, then the following learning procedure converges to zero error after a finite number of iterations. This is known as the perceptron convergence theorem.

Algorithm: learn a single perceptron
1. Initialize the weights and threshold to small random numbers.
2. Evaluate an instance x = (x_1, x_2, …, x_N)^T and compute the output ŷ(i).
3. Update the weights according to

$$w_n(i + 1) = w_n(i) + \eta\, [y - \hat{y}(i)]\, x_n$$

where i is the iteration index and η (0.0 < η < 1.0) is the step size; then return to step 2.
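A minimal sketch of this learning procedure; the threshold is folded into the weights as a bias term (a common trick), and the data below are made-up, linearly separable (the AND function).

```python
import random

def train_perceptron(data, eta=0.1, max_epochs=100):
    """Perceptron learning rule: w <- w + eta * (y - y_hat) * x.

    `data` is a list of (x, y) with x a feature list and y in {0, 1}.
    A constant 1 is appended to x so the last weight acts as -threshold.
    """
    n = len(data[0][0]) + 1
    w = [random.uniform(-0.05, 0.05) for _ in range(n)]
    for _ in range(max_epochs):
        errors = 0
        for x, y in data:
            xb = list(x) + [1.0]
            y_hat = 1 if sum(wi * xi for wi, xi in zip(w, xb)) > 0 else 0
            if y_hat != y:
                w = [wi + eta * (y - y_hat) * xi for wi, xi in zip(w, xb)]
                errors += 1
        if errors == 0:          # convergence theorem: this happens if the
            return w             # classes are linearly separable
    return w

data = [([0.0, 0.0], 0), ([0.0, 1.0], 0), ([1.0, 0.0], 0), ([1.0, 1.0], 1)]
print(train_perceptron(data))    # weights that realize the AND function
```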

Exercise: can you show the above algorithm converges? (No strict proof needed, just give some rough intuition.)

For different activation functions, different learning algorithms must be formulated. However, a single-layer perceptron can only learn linearly separable decision boundaries, as long as a monotonic activation function is used.

Many papers discuss other learning rules, such as Boltzmann machines, the Hebbian rule, and competitive learning, which are designed for training specific network structures. Detailed treatment of these algorithms is beyond the scope of this lecture.

The most popular feedforward network is the layered perceptron network. In the rest of the lecture, we shall focus on “learning multilayer perceptron networks.” It has been proved that multilayer feedforward networks are able to learn arbitrarily


complex decision boundaries. The invention of the back-propagation algorithm for learning the weights from data has resulted in many success stories and made multilayer feedforward networks the most popular architecture.

[Figure: a two-layer feedforward perceptron network, with layers I, J, K and weights w_ij, w_jk]

Let the training set be S = { (x_m, y_m) | m = 1, …, M }. The squared-error cost function, which is most commonly used in the ANN literature, can be defined as

$$SE = \frac{1}{2} \sum_{m=1}^{M} \left\| \mathbf{y}_m - \hat{\mathbf{y}}(\mathbf{x}_m) \right\|^2 \tag{6.12}$$

The back-propagation algorithm is a gradient-descent method for minimizing SE. The algorithm can be described as follows (optional; the algorithm itself is not required for this course):

Algorithm: backward propagation
1. Initialize the weights to small random numbers.
2. Randomly choose an instance x_t from the training set.
3. Propagate the signal forward through the network.
4. Compute the deltas in the output layer:

$$\delta_i^{L} = f'(u_i^{L})\, [y_i - o_i^{L}]$$

where $u_i^{l}$ is the net input (weighted sum) to the ith neuron in the lth layer, $o_i^{l} = f(u_i^{l})$ is its output, and f is the activation function; for the Lth (output) layer, $o_i^{L}$ is the output of the network.
5. Compute the deltas for the preceding layers by propagating the errors backwards:

$$\delta_i^{l} = f'(u_i^{l}) \sum_{j} w_{ij}^{l+1}\, \delta_j^{l+1}, \qquad l = (L-1), \ldots, 1$$

6. Update the weights using

$$\Delta w_{ij}^{l} = \eta\, \delta_j^{l}\, o_i^{l-1}$$

where η (0.0 < η < 1.0) is the step size.
7. Go to step 2 until some pre-defined criterion is reached.
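A compact numpy sketch of the algorithm for one hidden layer, using the logistic activation (6.9), whose derivative is f(u)[1 − f(u)], and the squared error (6.12); the names, the batch (rather than instance-by-instance) updates, and the toy XOR data are ours.

```python
import numpy as np

rng = np.random.default_rng(1)
sigmoid = lambda u: 1.0 / (1.0 + np.exp(-u))
add_bias = lambda A: np.hstack([A, np.ones((len(A), 1))])  # constant input

# Toy XOR problem: not linearly separable, so a hidden layer is needed.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.array([[0], [1], [1], [0]], dtype=float)

W1 = rng.normal(scale=0.5, size=(3, 4))   # (input + bias) -> hidden
W2 = rng.normal(scale=0.5, size=(5, 1))   # (hidden + bias) -> output
eta = 0.5

for step in range(20000):
    # Forward pass (step 3).
    H = sigmoid(add_bias(X) @ W1)
    O = sigmoid(add_bias(H) @ W2)
    # Deltas (steps 4-5); f'(u) = f(u)(1 - f(u)) for the logistic f.
    delta_out = (Y - O) * O * (1 - O)
    delta_hid = (delta_out @ W2[:-1].T) * H * (1 - H)
    # Weight updates (step 6).
    W2 += eta * add_bias(H).T @ delta_out
    W1 += eta * add_bias(X).T @ delta_hid

print(np.round(O.ravel(), 2))   # close to [0, 1, 1, 0]
```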

Question: can you explain why multilayer perceptron networks can approximate arbitrary decision boundary?

Answer: as we have mentioned, a single-layer perceptron forms a hyper-plane in the feature space. The behavior of a multilayer perceptron network can be viewed as


performing logic "AND" and "OR" operations on these hyper-planes, and the capability of forming complex decision boundaries is limited only by the number of hidden neurons. The figure below illustrates this idea: it shows how two linear decision boundaries can be combined into a non-linear boundary.

Other issues, such as how many layers are needed, how many neurons to use in each layer, and how large the training set should be, are still the "artistic part" of designing feedforward networks. Current theoretical results give only very loose guidelines.

Combining linear decision boundaries into a non-linear boundary

VII. Conclusions and Principles

We emphasize that pattern recognition and data mining are multistage procedures. We hope to automate the process as much as possible; however, some fundamental differences between modern digital computers and biological neural systems have limited the "intelligence" of computers. Hence, human knowledge is still indispensable for many real-world tasks. The key is how to balance the workload between human experts and computers so as to make full use of their respective expertise. Artificial neural networks may be a direction towards general-purpose intelligent machines; to take full advantage of them, distributed parallel hardware must be employed.

To conclude the lecture, we would like to emphasize again some of the principles for pattern recognition tasks:

1. Early bird’s bonus: the earlier a problem can be detected and formulated, the better chance we may have to reach a good solution.



2. No-free-lunch theorem: there is no universally applicable algorithm for all problems. Do not be lured by fancy algorithms; instead, try to understand the problem better.

3. Occam’s Razor: among all the possible models that fit the training data, choose the simpler ones, those with less number of features.

Copyright © by Yu-Chi Ho