Transcript of Theodoridis S., Koutroumbas K., Pattern Recognition, 4th ed., Academic Press, 2009
- 1. Academic Press is an imprint of Elsevier. 30 Corporate Drive, Suite 400, Burlington, MA 01803, USA; 525 B Street, Suite 1900, San Diego, California 92101-4495, USA; 84 Theobalds Road, London WC1X 8RR, UK. This book is printed on acid-free paper. Copyright © 2009, Elsevier Inc. All rights reserved. No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopy, recording, or any information storage and retrieval system, without permission in writing from the publisher. Permissions may be sought directly from Elsevier's Science & Technology Rights Department in Oxford, UK: phone: (+44) 1865 843830, fax: (+44) 1865 853333, E-mail: permissions@elsevier.com. You may also complete your request on-line via the Elsevier homepage (http://elsevier.com), by selecting "Support & Contact", then "Copyright and Permission", and then "Obtaining Permissions". Library of Congress Cataloging-in-Publication Data: Application submitted. British Library Cataloguing-in-Publication Data: A catalogue record for this book is available from the British Library. ISBN: 978-1-59749-272-0. For information on all Academic Press publications visit our Web site at www.books.elsevier.com. Printed in the United States of America. 09 10 11 12 13 14 15 16 5 4 3 2 1
- 2. Preface. This book is the outgrowth of our teaching advanced undergraduate and graduate courses over the past 20 years. These courses have been taught to different audiences, including students in electrical and electronics engineering, computer engineering, computer science, and informatics, as well as to an interdisciplinary audience of a graduate course on automation. This experience led us to make the book as self-contained as possible and to address students with different backgrounds. As prerequisite knowledge, the reader requires only basic calculus, elementary linear algebra, and some probability theory basics. A number of mathematical tools, such as probability and statistics as well as constrained optimization, needed by various chapters, are treated in four Appendices. The book is designed to serve as a text for advanced undergraduate and graduate students, and it can be used for either a one- or a two-semester course. Furthermore, it is intended to be used as a self-study and reference book for research and for the practicing scientist/engineer. This latter audience was also our second incentive for writing this book, due to the involvement of our group in a number of projects related to pattern recognition. SCOPE AND APPROACH. The goal of the book is to present in a unified way the most widely used techniques and methodologies for pattern recognition tasks. Pattern recognition is at the center of a number of application areas, including image analysis, speech and audio recognition, biometrics, bioinformatics, data mining, and information retrieval. Despite their differences, these areas share, to a large extent, a corpus of techniques that can be used in extracting, from the available data, information related to data categories, important hidden patterns, and trends. The emphasis in this book is on the most generic of the methods that are currently available. Having acquired the basic knowledge and understanding, the reader can subsequently move on to more specialized application-dependent techniques, which have been developed and reported in a vast number of research papers. Each chapter of the book starts with the basics and moves, progressively, to more advanced topics and reviews up-to-date techniques. We have made an effort to keep a balance between mathematical and descriptive presentation. This is not always an easy task. However, we strongly believe that in a topic such as pattern recognition, trying to bypass mathematics deprives the reader of understanding the essentials behind the methods and also the potential of developing new techniques, which fit the needs of the problem at hand that he or she has to tackle. In pattern recognition, the final adoption of an appropriate technique and algorithm is very much a problem-dependent task. Moreover, according to our experience, teaching pattern recognition is also a good excuse for the students to refresh and solidify
- 3. some of the mathematical basics they have been taught in earlier years. Repetitio est mater studiorum. NEW TO THIS EDITION. The new features of the fourth edition include the following. MATLAB codes and computer experiments are given at the end of most chapters. More examples and a number of new figures have been included to enhance the readability and pedagogic aspects of the book. New sections on some important topics of high current interest have been added, including: nonlinear dimensionality reduction, nonnegative matrix factorization, relevance feedback, robust regression, semi-supervised learning, spectral clustering, and clustering combination techniques. Also, a number of sections have been rewritten with more recent applications in mind. SUPPLEMENTS TO THE TEXT. Demonstrations based on MATLAB are available for download from the book Web site, www.elsevierdirect.com/9781597492720. Also available are electronic figures from the text and (for instructors only) a solutions manual for the end-of-chapter problems and exercises. The interested reader can download detailed proofs, which are sometimes slightly condensed in the book out of necessity. PowerPoint presentations are also available covering all chapters of the book. Our intention is to update the site regularly with more and/or improved versions of the MATLAB demonstrations. Suggestions are always welcome. Also at this Web site, a page will be available for typos, which are unavoidable, despite frequent careful reading. The authors would appreciate readers notifying them about any typos found.
- 4. ACKNOWLEDGMENTS. This book would have not been written without the constant support and help from a number of colleagues and students throughout the years. We are especially indebted to Kostas Berberidis, Velissaris Gezerlis, Xaris Georgiou, Kristina Georgoulakis, Leyteris Kofidis, Thanassis Liavas, Michalis Mavroforakis, Aggelos Pikrakis, Thanassis Rontogiannis, Margaritis Sdralis, Kostas Slavakis, and Theodoros Yiannakopoulos. The constant support provided by Yannis Kopsinis and Kostas Themelis from the early stages up to the final stage, with those long nights, has been invaluable. The book improved a great deal after the careful reading and the serious comments and suggestions of Alexandros Blnn, Dionissis Cavouras, Vassilis Digalakis, Vassilis Drakopoulos, Nikos Galatsanos, George Glentis, Spiros Hatzispyros, Evagelos Karkaletsis, Elias Koutsoupias, Aristides Likas, Gerassimos Mileounis, George Moustakides, George Paliouras, Stavros Perantonis, Takis Stamatopoulos, Nikos Vassilas, Manolis Zervakis, and Vassilis Zissimopoulos. The book has greatly gained and improved thanks to the comments of a number of people who provided feedback on the revision plan and/or comments on revised chapters: Tulay Adali, University of Maryland; Mehmet Celenk, Ohio University; Rama Chellappa, University of Maryland; Mark Clements, Georgia Institute of Technology; Robert Duin, Delft University of Technology; Miguel Figueroa, Villanueva University of Puerto Rico; Dimitris Gunopoulos, University of Athens; Mathias Kolsch, Naval Postgraduate School; Adam Krzyzak, Concordia University; Baoxin Li, Arizona State University; David Miller, Pennsylvania State University; Bernhard Schölkopf, Max Planck Institute; Hari Sundaram, Arizona State University; Harry Wechsler, George Mason University; and Alexander Zien, Max Planck Institute. We are greatly indebted to these colleagues for their time and their constructive criticisms. Our collaboration and friendship with Nikos Kalouptsidis have been a source of constant inspiration for all these years. We are both deeply indebted to him. Last but not least, K. Koutroumbas would like to thank Sophia, Dimitris-Marios, and Valentini-Theodora for their tolerance and support, and S. Theodoridis would like to thank Despina, Eva, and Eleni, his joyful and supportive harem.
- 5. CHAPTER 1: Introduction. 1.1 IS PATTERN RECOGNITION IMPORTANT? Pattern recognition is the scientific discipline whose goal is the classification of objects into a number of categories or classes. Depending on the application, these objects can be images or signal waveforms or any type of measurements that need to be classified. We will refer to these objects using the generic term patterns. Pattern recognition has a long history, but before the 1960s it was mostly the output of theoretical research in the area of statistics. As with everything else, the advent of computers increased the demand for practical applications of pattern recognition, which in turn set new demands for further theoretical developments. As our society evolves from the industrial to its postindustrial phase, automation in industrial production and the need for information handling and retrieval are becoming increasingly important. This trend has pushed pattern recognition to the high edge of today's engineering applications and research. Pattern recognition is an integral part of most machine intelligence systems built for decision making. Machine vision is an area in which pattern recognition is of importance. A machine vision system captures images via a camera and analyzes them to produce descriptions of what is imaged. A typical application of a machine vision system is in the manufacturing industry, either for automated visual inspection or for automation in the assembly line. For example, in inspection, manufactured objects on a moving conveyor may pass the inspection station, where the camera stands, and it has to be ascertained whether there is a defect. Thus, images have to be analyzed online, and a pattern recognition system has to classify the objects into the "defect" or "nondefect" class. After that, an action has to be taken, such as to reject the offending parts. In an assembly line, different objects must be located and recognized, that is, classified in one of a number of classes known a priori. Examples are the "screwdriver" class, the "German key" class, and so forth in a tools manufacturing unit. Then a robot arm can move the objects in the right place. Character (letter or number) recognition is another important area of pattern recognition, with major implications in automation and information handling. Optical character recognition (OCR) systems are already commercially available and more or less familiar to all of us. An OCR system has a "front-end" device consisting of a light source, a scan lens, a document transport, and a detector. At the output of
- 6. the light-sensitive detector, light-intensity variation is translated into "numbers," and an image array is formed. In the sequel, a series of image processing techniques are applied leading to line and character segmentation. The pattern recognition software then takes over to recognize the characters, that is, to classify each character in the correct "letter," "number," or "punctuation" class. Storing the recognized document has a twofold advantage over storing its scanned image. First, further electronic processing, if needed, is easy via a word processor, and second, it is much more efficient to store ASCII characters than a document image. Besides the printed character recognition systems, there is a great deal of interest invested in systems that recognize handwriting. A typical commercial application of such a system is in the machine reading of bank checks. The machine must be able to recognize the amounts in figures and digits and match them. Furthermore, it could check whether the payee corresponds to the account to be credited. Even if only half of the checks are manipulated correctly by such a machine, much labor can be saved from a tedious job. Another application is in automatic mail-sorting machines for postal code identification in post offices. Online handwriting recognition systems are another area of great commercial interest. Such systems will accompany pen computers, with which the entry of data will be done not via the keyboard but by writing. This complies with today's tendency to develop machines and computers with interfaces acquiring human-like skills. Computer-aided diagnosis is another important application of pattern recognition, aiming at assisting doctors in making diagnostic decisions. The final diagnosis is, of course, made by the doctor. Computer-assisted diagnosis has been applied to and is of interest for a variety of medical data, such as X-rays, computed tomographic images, ultrasound images, electrocardiograms (ECGs), and electroencephalograms (EEGs). The need for a computer-aided diagnosis stems from the fact that medical data are often not easily interpretable, and the interpretation can depend very much on the skill of the doctor. Let us take for example X-ray mammography for the detection of breast cancer. Although mammography is currently the best method for detecting breast cancer, 10 to 30% of women who have the disease and undergo mammography have negative mammograms. In approximately two thirds of these cases with false results the radiologist failed to detect the cancer, which was evident retrospectively. This may be due to poor image quality, eye fatigue of the radiologist, or the subtle nature of the findings. The percentage of correct classifications improves at a second reading by another radiologist. Thus, one can aim to develop a pattern recognition system in order to assist radiologists with a second opinion. Increasing confidence in the diagnosis based on mammograms would, in turn, decrease the number of patients with suspected breast cancer who have to undergo surgical breast biopsy, with its associated complications. Speech recognition is another area in which a great deal of research and development effort has been invested. Speech is the most natural means by which humans communicate and exchange information. Thus, the goal of building intelligent machines that recognize spoken information has been a long-standing one for scientists and engineers as well as science fiction writers. Potential applications of such machines are numerous. They can be used, for example, to improve efficiency
- 7. in a manufacturing environment, to control machines in hazardous environments remotely, and to help handicapped people to control machines by talking to them. A major effort, which has already had considerable success, is to enter data into a computer via a microphone. Software, built around a pattern (spoken sounds in this case) recognition system, recognizes the spoken text and translates it into ASCII characters, which are shown on the screen and can be stored in the memory. Entering information by "talking" to a computer is twice as fast as entry by a skilled typist. Furthermore, this can enhance our ability to communicate with deaf and dumb people. Data mining and knowledge discovery in databases is another key application area of pattern recognition. Data mining is of intense interest in a wide range of applications such as medicine and biology, market and financial analysis, business management, science exploration, image and music retrieval. Its popularity stems from the fact that in the age of information and knowledge society there is an ever increasing demand for retrieving information and turning it into knowledge. Moreover, this information exists in huge amounts of data in various forms including text, images, audio and video, stored in different places distributed all over the world. The traditional way of searching information in databases was the description-based model, where object retrieval was based on keyword description and subsequent word matching. However, this type of searching presupposes that a manual annotation of the stored information has previously been performed by a human. This is a very time-consuming job and, although feasible when the size of the stored information is limited, it is not possible when the amount of the available information becomes large. Moreover, the task of manual annotation becomes problematic when the stored information is widely distributed and shared by a heterogeneous mixture of sites and users. Content-based retrieval systems are becoming more and more popular, where information is sought based on "similarity" between an object, which is presented to the system, and objects stored in sites all over the world. In a content-based image retrieval (CBIR) system an image is presented to an input device (e.g., scanner). The system returns "similar" images based on a measured "signature," which can encode, for example, information related to color, texture, and shape. In a music content-based retrieval system, an example (i.e., an extract from a music piece) is presented to a microphone input device and the system returns similar music pieces. In this case, similarity is based on certain (automatically) measured cues that characterize a music piece, such as the music meter, the music tempo, and the location of certain repeated patterns. Mining for biomedical and DNA data analysis has enjoyed an explosive growth since the mid-1990s. All DNA sequences comprise four basic building elements, the nucleotides: adenine (A), cytosine (C), guanine (G), and thymine (T). Like the letters in our alphabets and the seven notes in music, these four nucleotides are combined to form long sequences in a twisted ladder form. Genes consist of, usually, hundreds of nucleotides arranged in a particular order. Specific gene-sequence patterns are related to particular diseases and play an important role in medicine. To this end, pattern recognition is a key area that offers a wealth of developed tools for similarity search and comparison between DNA sequences. Such comparisons
- 8. between healthy and diseased tissues are very important in medicine to identify critical differences between these two classes. The foregoing are only five examples from a much larger number of possible applications. Typically, we refer to fingerprint identification, signature authentication, text retrieval, and face and gesture recognition. The last applications have recently attracted much research interest and investment in an attempt to facilitate human-machine interaction and further enhance the role of computers in office automation, automatic personalization of environments, and so forth. Just to provoke imagination, it is worth pointing out that the MPEG-7 standard includes a provision for content-based video information retrieval from digital libraries of the type: search and find all video scenes in a digital library showing person X laughing. Of course, to achieve the final goals in all of these applications, pattern recognition is closely linked with other scientific disciplines, such as linguistics, computer graphics, machine vision, and database design. Having aroused the reader's curiosity about pattern recognition, we will next sketch the basic philosophy and methodological directions in which the various pattern recognition approaches have evolved and developed. 1.2 FEATURES, FEATURE VECTORS, AND CLASSIFIERS. Let us first simulate a simplified case mimicking a medical image classification task. Figure 1.1 shows two images, each having a distinct region inside it. The two regions are also themselves visually different. We could say that the region of Figure 1.1a results from a benign lesion, class A, and that of Figure 1.1b from a malignant one (cancer), class B. We will further assume that these are not the only patterns (images) that are available to us, but we have access to an image database with a number of patterns, some of which are known to originate from class A and some from class B.
FIGURE 1.1 Examples of image regions corresponding to (a) class A and (b) class B.
- 9. FIGURE 1.2 Plot of the mean value versus the standard deviation for a number of different images originating from class A and class B. In this case, a straight line separates the two classes.
The first step is to identify the measurable quantities that make these two regions distinct from each other. Figure 1.2 shows a plot of the mean value of the intensity in each region of interest versus the corresponding standard deviation around this mean. Each point corresponds to a different image from the available database. It turns out that class A patterns tend to spread in a different area from class B patterns. The straight line seems to be a good candidate for separating the two classes. Let us now assume that we are given a new image with a region in it and that we do not know to which class it belongs. It is reasonable to say that we measure the mean intensity and standard deviation in the region of interest and we plot the corresponding point. This is shown by the asterisk (*) in Figure 1.2. Then it is sensible to assume that the unknown pattern is more likely to belong to class A than to class B.
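This experiment is straightforward to mimic in MATLAB. The following minimal sketch uses assumed, simulated feature distributions (not the book's image database) to generate mean/standard-deviation pairs for the two classes, plot them, and draw an empirically chosen straight line; all numerical values are illustrative only:

    % Simulated mean/std feature pairs for two hypothetical classes
    N = 100;                                                   % patterns per class
    mA = 1.0 + 0.15*randn(N,1);  sA = 0.30 + 0.05*randn(N,1);  % class A features
    mB = 1.6 + 0.15*randn(N,1);  sB = 0.60 + 0.05*randn(N,1);  % class B features
    figure; hold on
    plot(mA, sA, 'o');  plot(mB, sB, '+');
    m = linspace(0.5, 2.1, 50);              % an empirically drawn decision line
    plot(m, -0.3*m + 0.85, '-');
    xlabel('mean value'); ylabel('standard deviation');
    legend('class A', 'class B', 'decision line');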
The preceding artificial classification task has outlined the rationale behind a large class of pattern recognition problems. The measurements used for the classification, the mean value and the standard deviation in this case, are known as features. In the more general case $l$ features $x_i$, $i = 1, 2, \ldots, l$, are used, and they form the feature vector $x = [x_1, x_2, \ldots, x_l]^T$, where $T$ denotes transposition. Each of the feature vectors identifies uniquely a single pattern (object). Throughout this book features and feature vectors will be treated as random variables and vectors, respectively. This is natural, as the measurements resulting from different patterns exhibit a random variation. This is due partly to the measurement noise of the measuring devices and partly to
- 10. the distinct characteristics of each pattern. For example, in X-ray imaging large variations are expected because of the differences in physiology among individuals. This is the reason for the scattering of the points in each class shown in Figure 1.2. The straight line in Figure 1.2 is known as the decision line, and it constitutes the classifier whose role is to divide the feature space into regions that correspond to either class A or class B. If a feature vector $x$, corresponding to an unknown pattern, falls in the class A region, it is classified as class A, otherwise as class B. This does not necessarily mean that the decision is correct. If it is not correct, a misclassification has occurred. In order to draw the straight line in Figure 1.2 we exploited the fact that we knew the labels (class A or B) for each point of the figure. The patterns (feature vectors) whose true class is known and which are used for the design of the classifier are known as training patterns (training feature vectors). Having outlined the definitions and the rationale, let us point out the basic questions arising in a classification task. How are the features generated? In the preceding example, we used the mean and the standard deviation, because we knew how the images had been generated. In practice, this is far from obvious. It is problem dependent, and it concerns the feature generation stage of the design of a classification system that performs a given pattern recognition task. What is the best number $l$ of features to use? This is also a very important task, and it concerns the feature selection stage of the classification system. In practice, a larger than necessary number of feature candidates is generated, and then the "best" of them is adopted. Having adopted the appropriate features for the specific task, how does one design the classifier? In the preceding example the straight line was drawn empirically, just to please the eye. In practice, this cannot be the case, and the line should be drawn optimally, with respect to an optimality criterion. Furthermore, problems for which a linear classifier (straight line or hyperplane in the $l$-dimensional space) can result in acceptable performance are not the rule. In general, the surfaces dividing the space in the various class regions are nonlinear. What type of nonlinearity must one adopt, and what type of optimizing criterion must be used in order to locate a surface in the right place in the $l$-dimensional feature space? These questions concern the classifier design stage. Finally, once the classifier has been designed, how can one assess the performance of the designed classifier? That is, what is the classification error rate? This is the task of the system evaluation stage. Figure 1.3 shows the various stages followed for the design of a classification system. As is apparent from the feedback arrows, these stages are not independent. On the contrary, they are interrelated and, depending on the results, one may go back
- 11. FIGURE 1.3 The basic stages involved in the design of a classification system: the sensor, feature generation, feature selection, classifier design, and system evaluation stages, connected by feedback arrows.
to redesign earlier stages in order to improve the overall performance. Furthermore, there are some methods that combine stages, for example, the feature selection and the classifier design stage, in a common optimization task. Although the reader has already been exposed to a number of basic problems at the heart of the design of a classification system, there are still a few things to be said. 1.3 SUPERVISED, UNSUPERVISED, AND SEMI-SUPERVISED LEARNING. In the example of Figure 1.1, we assumed that a set of training data were available, and the classifier was designed by exploiting this a priori known information. This is known as supervised pattern recognition or, in the more general context of machine learning, as supervised learning. However, this is not always the case, and there is another type of pattern recognition tasks for which training data, of known class labels, are not available. In this type of problem, we are given a set of feature vectors $x$ and the goal is to unravel the underlying similarities and cluster (group) similar vectors together. This is known as unsupervised pattern recognition or unsupervised learning or clustering. Such tasks arise in many applications in social sciences and engineering, such as remote sensing, image segmentation, and image and speech coding. Let us pick two such problems. In multispectral remote sensing, the electromagnetic energy emanating from the earth's surface is measured by sensitive scanners located aboard a satellite, an aircraft, or a space station. This energy may be reflected solar energy (passive) or the reflected part of the energy transmitted from the vehicle (active) in order to "interrogate" the earth's surface. The scanners are sensitive to a number of wavelength bands of the electromagnetic radiation. Different properties of the earth's surface contribute to the reflection of the energy in the different bands. For example, in the visible-infrared range properties such as the mineral and moisture contents of soils, the sedimentation of water, and the moisture content of vegetation are the main contributors to the reflected energy. In contrast, at the thermal end of the infrared, it is the thermal capacity and thermal properties of the surface and near subsurface that contribute to the reflection. Thus, each band measures different properties
- 12. of the same patch of the earth's surface. In this way, images of the earth's surface corresponding to the spatial distribution of the reflected energy in each band can be created. The task now is to exploit this information in order to identify the various ground cover types, that is, built-up land, agricultural land, forest, fire burn, water, and diseased crop. To this end, one feature vector $x$ for each cell from the "sensed" earth's surface is formed. The elements $x_i$, $i = 1, 2, \ldots, l$, of the vector are the corresponding image pixel intensities in the various spectral bands. In practice, the number of spectral bands varies. A clustering algorithm can be employed to reveal the groups in which feature vectors are clustered in the $l$-dimensional feature space. Points that correspond to the same ground cover type, such as water, are expected to cluster together and form groups. Once this is done, the analyst can identify the type of each cluster by associating a sample of points in each group with available reference ground data, that is, maps or visits. Figure 1.4 demonstrates the procedure. Clustering is also widely used in the social sciences in order to study and correlate survey and statistical data and draw useful conclusions, which will then lead to the right actions. Let us again resort to a simplified example and assume that we are interested in studying whether there is any relation between a country's gross national product (GNP) and the level of people's illiteracy, on the one hand, and children's mortality rate on the other. In this case, each country is represented by a three-dimensional feature vector whose coordinates are indices measuring the quantities of interest. A clustering algorithm will then reveal a rather compact cluster corresponding to countries that exhibit low GNPs, high illiteracy levels, and high children's mortality expressed as a population percentage.
FIGURE 1.4 (a) An illustration of various types of ground cover and (b) clustering of the respective features for multispectral imaging using two bands.
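A tiny, toolbox-free MATLAB sketch of such a clustering step follows; the two-band "pixel" vectors are simulated from three assumed ground-cover groups, and a basic k-means loop (one simple clustering scheme among many, not code from the book's site) groups them:

    N = 200;                                         % simulated pixels per cover type
    X = [0.3*randn(N,2) + repmat([1 1], N, 1);       % e.g., "water"
         0.3*randn(N,2) + repmat([3 1], N, 1);       % e.g., "soil"
         0.3*randn(N,2) + repmat([2 3], N, 1)];      % e.g., "vegetation"
    k = 3;
    perm = randperm(size(X,1));
    C = X(perm(1:k), :);                             % random initial centroids
    for it = 1:50
        D = zeros(size(X,1), k);
        for j = 1:k                                  % squared distance to each centroid
            D(:,j) = sum((X - repmat(C(j,:), size(X,1), 1)).^2, 2);
        end
        [dmin, idx] = min(D, [], 2);                 % assign each vector to its nearest centroid
        for j = 1:k
            C(j,:) = mean(X(idx == j, :), 1);        % recompute the centroids
        end
    end
    plot(X(:,1), X(:,2), '.'); hold on; plot(C(:,1), C(:,2), 'ks');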
- 13. A major issue in unsupervised pattern recognition is that of defining the similarity between two feature vectors and choosing an appropriate measure for it. Another issue of importance is choosing an algorithmic scheme that will cluster (group) the vectors on the basis of the adopted similarity measure. In general, different algorithmic schemes may lead to different results, which the expert has to interpret. Semi-supervised learning/pattern recognition for designing a classification system shares the same goals as the supervised case; however, now the designer has at his or her disposal a set of patterns of unknown class origin, in addition to the training patterns, whose true class is known. We usually refer to the former ones as unlabeled and the latter as labeled data. Semi-supervised pattern recognition can be of importance when the system designer has access to a rather limited number of labeled data. In such cases, recovering additional information from the unlabeled samples, related to the general structure of the data at hand, can be useful in improving the system design. Semi-supervised learning finds its way also to clustering tasks. In this case, labeled data are used as constraints in the form of must-links and cannot-links. In other words, the clustering task is constrained to assign certain points to the same cluster or to exclude certain points from being assigned to the same cluster. From this perspective, semi-supervised learning provides a priori knowledge that the clustering algorithm has to respect. 1.4 MATLAB PROGRAMS. At the end of most of the chapters there is a number of MATLAB programs and computer experiments. The MATLAB codes provided are not intended to form part of a software package, but they are to serve a purely pedagogical goal. Most of these codes are given to our students, who are asked to play with them and discover the "secrets" associated with the corresponding methods. This is also the reason that for most of the cases the data used are simulated data around the Gaussian distribution. They have been produced carefully in order to guide the students in understanding the basic concepts. This is also the reason that the provided codes correspond to those of the techniques and algorithms that, in our opinion, comprise the backbone of each chapter and that the student has to understand in a first reading. Whenever the required MATLAB code was available (at the time this book was prepared) in a MATLAB toolbox, we chose to use the associated MATLAB function and explain how to use its arguments. No doubt, each instructor has his or her own preferences, experiences, and unique way of viewing teaching. The provided routines are written in a way that can run on other data sets as well. In a separate accompanying book we provide a more complete list of MATLAB codes embedded in a user-friendly Graphical User Interface (GUI) and also involving more realistic examples using real images and audio signals.
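As a flavor of such simulated data, the following sketch (our own illustration with assumed parameters, not one of the book's routines) draws vectors from a two-dimensional Gaussian distribution using a Cholesky factor of the desired covariance matrix:

    m = [0; 1];                                  % desired mean vector
    S = [0.8 0.2; 0.2 0.4];                      % desired covariance (positive definite)
    N = 500;
    A = chol(S, 'lower');                        % factor S = A*A'
    X = repmat(m, 1, N) + A*randn(2, N);         % columns are vectors ~ N(m, S)
    plot(X(1,:), X(2,:), '.'); axis equal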
- 14. 1.5 OUTLINE OF THE BOOK. Chapters 2-10 deal with supervised pattern recognition and Chapters 11-16 deal with the unsupervised case. Semi-supervised learning is introduced in Chapter 10. The goal of each chapter is to start with the basics, definitions, and approaches, and move progressively to more advanced issues and recent techniques. To what extent the various topics covered in the book will be presented in a first course on pattern recognition depends very much on the course's focus, on the students' background, and, of course, on the lecturer. In the following outline of the chapters, we give our view and the topics that we cover in a first course on pattern recognition. No doubt, other views do exist and may be better suited to different audiences. At the end of each chapter, a number of problems and computer exercises are provided. Chapter 2 is focused on Bayesian classification and techniques for estimating unknown probability density functions. In a first course on pattern recognition, the sections related to Bayesian inference, the maximum entropy, and the expectation maximization (EM) algorithm are omitted. Special focus is put on the Bayesian classification, the minimum distance (Euclidean and Mahalanobis) and the nearest neighbor classifiers, and the naive Bayes classifier. Bayesian networks are briefly introduced. Chapter 3 deals with the design of linear classifiers. The sections dealing with the probability estimation property of the mean square solution as well as the bias-variance dilemma are only briefly mentioned in our first course. The basic philosophy underlying the support vector machines can also be explained, although a deeper treatment requires mathematical tools (summarized in Appendix C) that most of the students are not familiar with during a first course class. On the contrary, emphasis is put on the linear separability issue, the perceptron algorithm, and the mean square and least squares solutions. After all, these topics have a much broader horizon and applicability. Support vector machines are briefly introduced. The geometric interpretation offers students a better understanding of the SVM theory. Chapter 4 deals with the design of nonlinear classifiers. The section dealing with exact classification is bypassed in a first course. The proof of the backpropagation algorithm is usually very boring for most of the students, and we bypass its details. A description of its rationale is given, and the students experiment with it using MATLAB. The issues related to cost functions are bypassed. Pruning is discussed with an emphasis on generalization issues. Emphasis is also given to Cover's theorem and radial basis function (RBF) networks. The nonlinear support vector machines, decision trees, and combining classifiers are only briefly touched via a discussion on the basic philosophy behind their rationale. Chapter 5 deals with the feature selection stage, and we have made an effort to present most of the well-known techniques. In a first course we put emphasis on the t-test. This is because hypothesis testing also has a broad horizon, and at the same time it is easy for the students to apply it in computer exercises. Then, depending on time constraints, divergence, Bhattacharyya distance, and scatter matrices are presented and commented on, although their more detailed treatment
- 15. is for a more advanced course. Emphasis is given to Fisher's linear discriminant method (LDA) for the two-class case. Chapter 6 deals with the feature generation stage using transformations. The Karhunen-Loève transform and the singular value decomposition are first introduced as dimensionality reduction techniques. Both methods are briefly covered in the second semester. In the sequel the independent component analysis (ICA), nonnegative matrix factorization, and nonlinear dimensionality reduction techniques are presented. Then the discrete Fourier transform (DFT), discrete cosine transform (DCT), discrete sine transform (DST), Hadamard, and Haar transforms are defined. The rest of the chapter focuses on the discrete time wavelet transform. The incentive is to give all the necessary information so that a newcomer in the wavelet field can grasp the basics and be able to develop software, based on filter banks, in order to generate features. All these techniques are bypassed in a first course. Chapter 7 deals with feature generation focused on image and audio classification. The sections concerning local linear transforms, moments, parametric models, and fractals are not covered in a first course. Emphasis is placed on first- and second-order statistics features as well as the run-length method. The chain code for shape description is also taught. Computer exercises are then offered to generate these features and use them for classification for some case studies. In a one-semester course there is no time to cover more topics. Chapter 8 deals with template matching. Dynamic programming (DP) and the Viterbi algorithm are presented and then applied to speech recognition. In a two-semester course, emphasis is given to the DP and the Viterbi algorithm. The edit distance seems to be a good case for the students to grasp the basics. Correlation matching is taught and the basic philosophy behind deformable template matching can also be presented. Chapter 9 deals with context-dependent classification. Hidden Markov models are introduced and applied to communications and speech recognition. This chapter is bypassed in a first course. Chapter 10 deals with system evaluation and semi-supervised learning. The various error rate estimation techniques are discussed, and a case study with real data is treated. The leave-one-out method and the resubstitution methods are emphasized in the second semester, and students practice with computer exercises. Semi-supervised learning is bypassed in a first course. Chapter 11 deals with the basic concepts of clustering. It focuses on definitions as well as on the major stages involved in a clustering task. The various types of data encountered in clustering applications are reviewed, and the most commonly used proximity measures are provided. In a first course, only the most widely used proximity measures are covered (e.g., $l_p$ norms, inner product, Hamming distance). Chapter 12 deals with sequential clustering algorithms. These include some of the simplest clustering schemes, and they are well suited for a first course to introduce students to the basics of clustering and allow them to experiment with
- 16. the computer. The sections related to estimation of the number of clusters and neural network implementations are bypassed. Chapter 13 deals with hierarchical clustering algorithms. In a first course, only the general agglomerative scheme is considered, with an emphasis on single link and complete link algorithms, based on matrix theory. Agglomerative algorithms based on graph theory concepts as well as the divisive schemes are bypassed. Chapter 14 deals with clustering algorithms based on cost function optimization, using tools from differential calculus. Hard clustering and fuzzy and possibilistic schemes are considered, based on various types of cluster representatives, including point representatives, hyperplane representatives, and shell-shaped representatives. In a first course, most of these algorithms are bypassed, and emphasis is given to the isodata algorithm. Chapter 15 features a high degree of modularity. It deals with clustering algorithms based on different ideas, which cannot be grouped under a single philosophy. Spectral clustering, competitive learning, branch and bound, simulated annealing, and genetic algorithms are some of the schemes treated in this chapter. These are bypassed in a first course. Chapter 16 deals with the clustering validity stage of a clustering procedure. It contains rather advanced concepts and is omitted in a first course. Emphasis is given to the definitions of internal, external, and relative criteria and the random hypotheses used in each case. Indices, adopted in the framework of external and internal criteria, are presented, and examples are provided showing the use of these indices. Syntactic pattern recognition methods are not treated in this book. Syntactic pattern recognition methods differ in philosophy from the methods discussed in this book and, in general, are applicable to different types of problems. In syntactic pattern recognition, the structure of the patterns is of paramount importance, and pattern recognition is performed on the basis of a set of pattern primitives, a set of rules in the form of a grammar, and a recognizer called an automaton. Thus, we were faced with a dilemma: either to increase the size of the book substantially, or to provide a short overview (which, however, exists in a number of other books), or to omit it. The last option seemed to be the most sensible choice.
- 17. CHAPTER 2: Classifiers Based on Bayes Decision Theory. 2.1 INTRODUCTION. This is the first chapter, out of three, dealing with the design of the classifier in a pattern recognition system. The approach to be followed builds upon probabilistic arguments stemming from the statistical nature of the generated features. As has already been pointed out in the introductory chapter, this is due to the statistical variation of the patterns as well as to the noise in the measuring sensors. Adopting this reasoning as our kickoff point, we will design classifiers that classify an unknown pattern in the most probable of the classes. Thus, our task now becomes that of defining what "most probable" means. Given a classification task of $M$ classes, $\omega_1, \omega_2, \ldots, \omega_M$, and an unknown pattern, which is represented by a feature vector $x$, we form the $M$ conditional probabilities $P(\omega_i|x)$, $i = 1, 2, \ldots, M$. Sometimes, these are also referred to as a posteriori probabilities. In words, each of them represents the probability that the unknown pattern belongs to the respective class $\omega_i$, given that the corresponding feature vector takes the value $x$. Who could then argue that these conditional probabilities are not sensible choices to quantify the term most probable? Indeed, the classifiers to be considered in this chapter compute either the maximum of these $M$ values or, equivalently, the maximum of an appropriately defined function of them. The unknown pattern is then assigned to the class corresponding to this maximum. The first task we are faced with is the computation of the conditional probabilities. The Bayes rule will once more prove its usefulness! A major effort in this chapter will be devoted to techniques for estimating probability density functions (pdf), based on the available experimental evidence, that is, the feature vectors corresponding to the patterns of the training set. 2.2 BAYES DECISION THEORY. We will initially focus on the two-class case. Let $\omega_1, \omega_2$ be the two classes in which our patterns belong. In the sequel, we assume that the a priori probabilities
- 18. $P(\omega_1), P(\omega_2)$ are known. This is a very reasonable assumption, because even if they are not known, they can easily be estimated from the available training feature vectors. Indeed, if $N$ is the total number of available training patterns, and $N_1, N_2$ of them belong to $\omega_1$ and $\omega_2$, respectively, then $P(\omega_1) \approx N_1/N$ and $P(\omega_2) \approx N_2/N$. The other statistical quantities assumed to be known are the class-conditional probability density functions $p(x|\omega_i)$, $i = 1, 2$, describing the distribution of the feature vectors in each of the classes. If these are not known, they can also be estimated from the available training data, as we will discuss later on in this chapter. The pdf $p(x|\omega_i)$ is sometimes referred to as the likelihood function of $\omega_i$ with respect to $x$. Here we should stress the fact that an implicit assumption has been made. That is, the feature vectors can take any value in the $l$-dimensional feature space. In the case that feature vectors can take only discrete values, density functions $p(x|\omega_i)$ become probabilities and will be denoted by $P(x|\omega_i)$. We now have all the ingredients to compute our conditional probabilities, as stated in the introduction. To this end, let us recall from our probability course basics the Bayes rule (Appendix A)

$P(\omega_i|x) = \dfrac{p(x|\omega_i) P(\omega_i)}{p(x)}$   (2.1)

where $p(x)$ is the pdf of $x$ and for which we have (Appendix A)

$p(x) = \sum_{i=1}^{2} p(x|\omega_i) P(\omega_i)$   (2.2)

The Bayes classification rule can now be stated as:

If $P(\omega_1|x) > P(\omega_2|x)$, $x$ is classified to $\omega_1$;
if $P(\omega_1|x) < P(\omega_2|x)$, $x$ is classified to $\omega_2$.   (2.3)

The case of equality is detrimental and the pattern can be assigned to either of the two classes. Using (2.1), the decision can equivalently be based on the inequalities

$p(x|\omega_1) P(\omega_1) \gtrless p(x|\omega_2) P(\omega_2)$   (2.4)

$p(x)$ is not taken into account, because it is the same for all classes and it does not affect the decision. Furthermore, if the a priori probabilities are equal, that is, $P(\omega_1) = P(\omega_2) = 1/2$, Eq. (2.4) becomes

$p(x|\omega_1) \gtrless p(x|\omega_2)$   (2.5)

Thus, the search for the maximum now rests on the values of the conditional pdfs evaluated at $x$. Figure 2.1 presents an example of two equiprobable classes and shows the variations of $p(x|\omega_i)$, $i = 1, 2$, as functions of $x$ for the simple case of a single feature ($l = 1$). The dotted line at $x_0$ is a threshold partitioning the feature space into two regions, $R_1$ and $R_2$. According to the Bayes decision rule, for all values of $x$ in $R_1$ the classifier decides $\omega_1$ and for all values in $R_2$ it decides $\omega_2$. However, it is obvious from the figure that decision errors are unavoidable. Indeed, there is
- 19. FIGURE 2.1 Example of the two regions $R_1$ and $R_2$ formed by the Bayesian classifier for the case of two equiprobable classes, showing $p(x|\omega_1)$, $p(x|\omega_2)$, and the threshold $x_0$.
a finite probability for an $x$ to lie in the $R_2$ region and at the same time to belong in class $\omega_1$. Then our decision is in error. The same is true for points originating from class $\omega_2$. It does not take much thought to see that the total probability, $P_e$, of committing a decision error for the case of two equiprobable classes is given by

$P_e = \dfrac{1}{2}\int_{-\infty}^{x_0} p(x|\omega_2)\,dx + \dfrac{1}{2}\int_{x_0}^{+\infty} p(x|\omega_1)\,dx$   (2.6)

which is equal to the total shaded area under the curves in Figure 2.1. We have now touched on a very important issue. Our starting point to arrive at the Bayes classification rule was rather empirical, via our interpretation of the term most probable. We will now see that this classification test, though simple in its formulation, has a sounder mathematical interpretation. Minimizing the Classification Error Probability. We will show that the Bayesian classifier is optimal with respect to minimizing the classification error probability. Indeed, the reader can easily verify, as an exercise, that moving the threshold away from $x_0$, in Figure 2.1, always increases the corresponding shaded area under the curves. Let us now proceed with a more formal proof. Proof. Let $R_1$ be the region of the feature space in which we decide in favor of $\omega_1$ and $R_2$ be the corresponding region for $\omega_2$. Then an error is made if $x \in R_1$, although it belongs to $\omega_2$, or if $x \in R_2$, although it belongs to $\omega_1$. That is,

$P_e = P(x \in R_2, \omega_1) + P(x \in R_1, \omega_2)$   (2.7)
- 20. where $P(\cdot, \cdot)$ is the joint probability of two events. Recalling, once more, our probability basics (Appendix A), this becomes

$P_e = P(x \in R_2|\omega_1)P(\omega_1) + P(x \in R_1|\omega_2)P(\omega_2) = P(\omega_1)\int_{R_2} p(x|\omega_1)\,dx + P(\omega_2)\int_{R_1} p(x|\omega_2)\,dx$   (2.8)

or, using the Bayes rule,

$P_e = \int_{R_2} P(\omega_1|x)p(x)\,dx + \int_{R_1} P(\omega_2|x)p(x)\,dx$   (2.9)

It is now easy to see that the error is minimized if the partitioning regions $R_1$ and $R_2$ of the feature space are chosen so that

$R_1{:}\ P(\omega_1|x) > P(\omega_2|x)$, $\quad R_2{:}\ P(\omega_2|x) > P(\omega_1|x)$   (2.10)

Indeed, since the union of the regions $R_1, R_2$ covers all the space, from the definition of a probability density function we have that

$\int_{R_1} P(\omega_1|x)p(x)\,dx + \int_{R_2} P(\omega_1|x)p(x)\,dx = P(\omega_1)$   (2.11)

Combining Eqs. (2.9) and (2.11), we get

$P_e = P(\omega_1) - \int_{R_1} \left(P(\omega_1|x) - P(\omega_2|x)\right) p(x)\,dx$   (2.12)

This suggests that the probability of error is minimized if $R_1$ is the region of space in which $P(\omega_1|x) > P(\omega_2|x)$. Then, $R_2$ becomes the region where the reverse is true. So far, we have dealt with the simple case of two classes. Generalizations to the multiclass case are straightforward. In a classification task with $M$ classes, $\omega_1, \omega_2, \ldots, \omega_M$, an unknown pattern, represented by the feature vector $x$, is assigned to class $\omega_i$ if

$P(\omega_i|x) > P(\omega_j|x) \quad \forall j \neq i$   (2.13)

It turns out that such a choice also minimizes the classification error probability (Problem 2.1).
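As an illustration, the following minimal MATLAB sketch applies rule (2.13) to a simulated two-class, single-feature problem; the Gaussian class-conditional pdfs, the priors, and the sample size are all assumed here for the sake of the example:

    P1 = 0.5; P2 = 0.5;                      % a priori probabilities (assumed)
    m1 = 0; m2 = 1; s = sqrt(0.5);           % parameters of the assumed pdfs
    p = @(x, m) exp(-(x - m).^2 / (2*s^2)) / (sqrt(2*pi)*s);
    N = 10000;
    labels = (rand(N, 1) > P1) + 1;          % draw the true class of each pattern
    x = (labels == 1).*(m1 + s*randn(N, 1)) + (labels == 2).*(m2 + s*randn(N, 1));
    % Bayes rule: decide omega_2 when p(x|omega_2)P2 exceeds p(x|omega_1)P1
    decision = 1 + (p(x, m2)*P2 > p(x, m1)*P1);
    Pe_estimate = mean(decision ~= labels)   % simulation estimate of P_e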
Minimizing the Average Risk. The classification error probability is not always the best criterion to be adopted for minimization. This is because it assigns the same importance to all errors. However, there are cases in which some wrong decisions may have more serious implications than others. For example, it is much more serious for a doctor to make a wrong decision and a malignant tumor to be diagnosed as a benign one, than the other way round. If a benign tumor is diagnosed as a malignant one, the wrong decision will be cleared out during subsequent clinical examinations. However, the results
- 21. from the wrong decision concerning a malignant tumor may be fatal. Thus, in such cases it is more appropriate to assign a penalty term to weigh each error. For our example, let us denote by $\omega_1$ the class of malignant tumors and as $\omega_2$ the class of the benign ones. Let, also, $R_1, R_2$ be the regions in the feature space where we decide in favor of $\omega_1$ and $\omega_2$, respectively. The error probability $P_e$ is given by Eq. (2.8). Instead of selecting $R_1$ and $R_2$ so that $P_e$ is minimized, we will now try to minimize a modified version of it, that is,

$r = \lambda_{12} P(\omega_1) \int_{R_2} p(x|\omega_1)\,dx + \lambda_{21} P(\omega_2) \int_{R_1} p(x|\omega_2)\,dx$   (2.14)

where each of the two terms that contributes to the overall error probability is weighted according to its significance. For our case, the reasonable choice would be to have $\lambda_{12} > \lambda_{21}$. Thus, errors due to the assignment of patterns originating from class $\omega_1$ to class $\omega_2$ will have a larger effect on the cost function than the errors associated with the second term in the summation. Let us now consider an $M$-class problem and let $R_j$, $j = 1, 2, \ldots, M$, be the regions of the feature space assigned to classes $\omega_j$, respectively. Assume now that a feature vector $x$ that belongs to class $\omega_k$ lies in $R_i$, $i \neq k$. Then this vector is misclassified in $\omega_i$ and an error is committed. A penalty term $\lambda_{ki}$, known as loss, is associated with this wrong decision. The matrix $L$, which has at its $(k, i)$ location the corresponding penalty term, is known as the loss matrix.[1] Observe that, in contrast to the philosophy behind Eq. (2.14), we have now allowed weights across the diagonal of the loss matrix ($\lambda_{kk}$), which correspond to correct decisions. In practice, these are usually set equal to zero, although we have considered them here for the sake of generality. The risk or loss associated with $\omega_k$ is defined as

$r_k = \sum_{i=1}^{M} \lambda_{ki} \int_{R_i} p(x|\omega_k)\,dx$   (2.15)

Observe that the integral is the overall probability of a feature vector from class $\omega_k$ being classified in $\omega_i$. This probability is weighted by $\lambda_{ki}$. Our goal now is to choose the partitioning regions $R_j$ so that the average risk

$r = \sum_{k=1}^{M} r_k P(\omega_k) = \sum_{i=1}^{M} \int_{R_i} \sum_{k=1}^{M} \lambda_{ki}\, p(x|\omega_k) P(\omega_k)\,dx$   (2.16)

is minimized. This is achieved if each of the integrals is minimized, which is equivalent to selecting partitioning regions so that

$x \in R_i$ if $l_i \equiv \sum_{k=1}^{M} \lambda_{ki}\, p(x|\omega_k) P(\omega_k) < l_j \equiv \sum_{k=1}^{M} \lambda_{kj}\, p(x|\omega_k) P(\omega_k) \quad \forall j \neq i$   (2.17)

[1] The terminology comes from the general decision theory.
- 22. It is obvious that if $\lambda_{ki} = 1 - \delta_{ki}$, where $\delta_{ki}$ is Kronecker's delta ($\delta_{ki} = 0$ if $k \neq i$ and $1$ if $k = i$), then minimizing the average risk becomes equivalent to minimizing the classification error probability. The two-class case. For this specific case we obtain

$l_1 = \lambda_{11}\, p(x|\omega_1) P(\omega_1) + \lambda_{21}\, p(x|\omega_2) P(\omega_2)$
$l_2 = \lambda_{12}\, p(x|\omega_1) P(\omega_1) + \lambda_{22}\, p(x|\omega_2) P(\omega_2)$   (2.18)

We assign $x$ to $\omega_1$ if $l_1 < l_2$, that is,

$(\lambda_{21} - \lambda_{22})\, p(x|\omega_2) P(\omega_2) < (\lambda_{12} - \lambda_{11})\, p(x|\omega_1) P(\omega_1)$   (2.19)

It is natural to assume that $\lambda_{ij} > \lambda_{ii}$ (correct decisions are penalized much less than wrong ones). Adopting this assumption, the decision rule (2.17) for the two-class case now becomes

$x \in \omega_1 (\omega_2)$ if $l_{12} \equiv \dfrac{p(x|\omega_1)}{p(x|\omega_2)} > (<)\ \dfrac{P(\omega_2)}{P(\omega_1)}\, \dfrac{\lambda_{21} - \lambda_{22}}{\lambda_{12} - \lambda_{11}}$   (2.20)

The ratio $l_{12}$ is known as the likelihood ratio and the preceding test as the likelihood ratio test. Let us now investigate Eq. (2.20) a little further and consider the case of Figure 2.1. Assume that the loss matrix is of the form

$L = \begin{pmatrix} 0 & \lambda_{12} \\ \lambda_{21} & 0 \end{pmatrix}$

If misclassification of patterns that come from $\omega_2$ is considered to have serious consequences, then we must choose $\lambda_{21} > \lambda_{12}$. Thus, patterns are assigned to class $\omega_2$ if

$p(x|\omega_2) > p(x|\omega_1)\, \dfrac{\lambda_{12}}{\lambda_{21}}$

where $P(\omega_1) = P(\omega_2) = 1/2$ has been assumed. That is, $p(x|\omega_1)$ is multiplied by a factor less than 1, and the effect of this is to move the threshold in Figure 2.1 to the left of $x_0$. In other words, region $R_2$ is increased while $R_1$ is decreased. The opposite would be true if $\lambda_{21} < \lambda_{12}$. An alternative cost that sometimes is used for two-class problems is the Neyman-Pearson criterion. The error for one of the classes is now constrained to be fixed and equal to a chosen value (Problem 2.6). Such a decision rule has been used, for example, in radar detection problems. The task there is to detect a target in the presence of noise. One type of error is the so-called false alarm, that is, to mistake the noise for a signal (target) present. Of course, the other type of error is to miss the signal and to decide in favor of the noise (missed detection). In many cases the error probability of false alarm is set equal to a predetermined threshold.
- 23. Example 2.1. In a two-class problem with a single feature $x$ the pdfs are Gaussians with variance $\sigma^2 = 1/2$ for both classes and mean values 0 and 1, respectively, that is,

$p(x|\omega_1) = \dfrac{1}{\sqrt{\pi}} \exp(-x^2)$, $\quad p(x|\omega_2) = \dfrac{1}{\sqrt{\pi}} \exp(-(x-1)^2)$

If $P(\omega_1) = P(\omega_2) = 1/2$, compute the threshold value $x_0$ (a) for minimum error probability and (b) for minimum risk if the loss matrix is

$L = \begin{pmatrix} 0 & 0.5 \\ 1.0 & 0 \end{pmatrix}$

Taking into account the shape of the Gaussian function graph (Appendix A), the threshold for the minimum probability case will be

$x_0{:}\ \exp(-x^2) = \exp(-(x-1)^2)$

Taking the logarithm of both sides, we end up with $x_0 = 1/2$. In the minimum risk case we get

$x_0{:}\ \exp(-x^2) = 2 \exp(-(x-1)^2)$

or $x_0 = (1 - \ln 2)/2 < 1/2$; that is, the threshold moves to the left of $1/2$. If the two classes are not equiprobable, then it is easily verified that if $P(\omega_1) > (<)\ P(\omega_2)$ the threshold moves to the right (left). That is, we expand the region in which we decide in favor of the most probable class, since it is better to make fewer errors for the most probable class.
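The thresholds of the example are easy to verify numerically. The following short MATLAB sketch evaluates the likelihood ratio test (2.20) under the example's pdfs and loss matrix; the grid search is just one simple way to locate the threshold:

    L = [0 0.5; 1.0 0];  P1 = 0.5;  P2 = 0.5;           % loss matrix and priors
    s2 = 0.5;  m1 = 0;  m2 = 1;
    p = @(x, m) exp(-(x - m).^2 / (2*s2)) / sqrt(2*pi*s2);
    thr = (P2/P1) * (L(2,1) - L(2,2)) / (L(1,2) - L(1,1));  % right-hand side of (2.20), = 2 here
    x = -1:0.001:2;
    to_w1 = p(x, m1) ./ p(x, m2) > thr;                 % decide omega_1 where l_12 > thr
    x0 = x(find(to_w1, 1, 'last'))                      % approx. (1 - log(2))/2 = 0.1534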
2.3 DISCRIMINANT FUNCTIONS AND DECISION SURFACES. It is by now clear that minimizing either the risk or the error probability or the Neyman-Pearson criterion is equivalent to partitioning the feature space into $M$ regions, for a task with $M$ classes. If regions $R_i, R_j$ happen to be contiguous, then they are separated by a decision surface in the multidimensional feature space. For the minimum error probability case, this is described by the equation

$P(\omega_i|x) - P(\omega_j|x) = 0$   (2.21)

From the one side of the surface this difference is positive, and from the other it is negative. Sometimes, instead of working directly with probabilities (or risk functions), it may be more convenient, from a mathematical point of view, to work with an equivalent function of them, for example, $g_i(x) \equiv f(P(\omega_i|x))$, where $f(\cdot)$ is a monotonically increasing function. $g_i(x)$ is known as a discriminant function. The decision test (2.13) is now stated as: classify $x$ in $\omega_i$ if

$g_i(x) > g_j(x) \quad \forall j \neq i$   (2.22)

The decision surfaces, separating contiguous regions, are described by

$g_{ij}(x) \equiv g_i(x) - g_j(x) = 0, \quad i, j = 1, 2, \ldots, M, \ i \neq j$   (2.23)
So far, we have approached the classification problem via Bayesian probabilistic arguments, and the goal was to minimize the classification error probability or the risk. However, as we will soon see, not all problems are well suited to such approaches. For example, in many cases the involved pdfs are complicated and their estimation is not an easy task. In such cases, it may be preferable to compute decision surfaces directly by means of alternative costs, and this will be our focus in Chapters 3 and 4. Such approaches give rise to discriminant functions and decision surfaces, which are entities with no (necessary) relation to Bayesian classification, and they are, in general, suboptimal with respect to Bayesian classifiers. In the following we will focus on a particular family of decision surfaces associated with the Bayesian classification for the specific case of Gaussian density functions.

2.4 BAYESIAN CLASSIFICATION FOR NORMAL DISTRIBUTIONS
2.4.1 The Gaussian Probability Density Function

One of the most commonly encountered probability density functions in practice is the Gaussian or normal probability density function. The major reasons for its popularity are its computational tractability and the fact that it models adequately a large number of cases. One of the most celebrated theorems in statistics is the central limit theorem. The theorem states that if a random variable is the outcome of a summation of a number of independent random variables, its pdf approaches the Gaussian function as the number of summands tends to infinity (see Appendix A). In practice, it is most common to assume that the sum of random variables is distributed according to a Gaussian pdf, for a sufficiently large number of summing terms. The one-dimensional or univariate Gaussian, as it is sometimes called, is defined by

$p(x) = \frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$ (2.24)

The parameters $\mu$ and $\sigma^2$ turn out to have a specific meaning. The mean value of the random variable $x$ is equal to $\mu$, that is,

$\mu = E[x] \equiv \int_{-\infty}^{+\infty} x\,p(x)\,dx$ (2.25)

where $E[\cdot]$ denotes the mean (or expected) value of a random variable. The parameter $\sigma^2$ is equal to the variance of $x$, that is,

$\sigma^2 = E[(x-\mu)^2] \equiv \int_{-\infty}^{+\infty} (x-\mu)^2\,p(x)\,dx$ (2.26)
FIGURE 2.2 Graphs for the one-dimensional Gaussian pdf. (a) Mean value $\mu = 0$, $\sigma^2 = 1$; (b) $\mu = 1$ and $\sigma^2 = 0.2$. The larger the variance, the broader the graph is. The graphs are symmetric, and they are centered at the respective mean value.

Figure 2.2a shows the graph of the Gaussian function for $\mu = 0$ and $\sigma^2 = 1$, and Figure 2.2b the case for $\mu = 1$ and $\sigma^2 = 0.2$. The larger the variance, the broader the graph, which is symmetric, and it is always centered at $\mu$ (see Appendix A for some more properties). The multivariate generalization of a Gaussian pdf in
the l-dimensional space is given by

$p(x) = \frac{1}{(2\pi)^{l/2}|\Sigma|^{1/2}}\exp\left(-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)\right)$ (2.27)

where $\mu = E[x]$ is the mean value and $\Sigma$ is the $l \times l$ covariance matrix (Appendix A) defined as

$\Sigma = E[(x-\mu)(x-\mu)^T]$ (2.28)

where $|\Sigma|$ denotes the determinant of $\Sigma$. It is readily seen that for $l = 1$ the multivariate Gaussian coincides with the univariate one. Sometimes, the symbol $N(\mu, \Sigma)$ is used to denote a Gaussian pdf with mean value $\mu$ and covariance $\Sigma$. To get a better feeling of what the multivariate Gaussian looks like, let us focus on some cases in the two-dimensional space, where nature allows us the luxury of visualization. For this case we have

$\mu = E\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix}$ (2.29)

$\Sigma = \begin{bmatrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{12} & \sigma_2^2 \end{bmatrix}$ (2.30)

where $E[x_i] = \mu_i$, $i = 1, 2$, and by definition $\sigma_{12} = E[(x_1 - \mu_1)(x_2 - \mu_2)]$, which is known as the covariance between the random variables $x_1$ and $x_2$ and it is a measure
of their mutual statistical correlation. If the variables are statistically independent, their covariance is zero (Appendix A). Obviously, the diagonal elements of $\Sigma$ are the variances of the respective elements of the random vector. Figures 2.3-2.6 show the graphs for four instances of a two-dimensional Gaussian probability density function. Figure 2.3a corresponds to a Gaussian with a diagonal covariance matrix

$\Sigma = \begin{bmatrix} 3 & 0 \\ 0 & 3 \end{bmatrix}$

FIGURE 2.3 (a) The graph of a two-dimensional Gaussian pdf and (b) the corresponding isovalue curves for a diagonal $\Sigma$ with $\sigma_1^2 = \sigma_2^2$. The graph has a spherical symmetry showing no preference in any direction.

FIGURE 2.4 (a) The graph of a two-dimensional Gaussian pdf and (b) the corresponding isovalue curves for a diagonal $\Sigma$ with $\sigma_1^2 > \sigma_2^2$. The graph is elongated along the $x_1$ direction.
FIGURE 2.5 (a) The graph of a two-dimensional Gaussian pdf and (b) the corresponding isovalue curves for a diagonal $\Sigma$ with $\sigma_1^2 < \sigma_2^2$. The graph is elongated along the $x_2$ direction.

FIGURE 2.6 (a) The graph of a two-dimensional Gaussian pdf and (b) the corresponding isovalue curves for a case of a nondiagonal $\Sigma$. Playing with the values of the elements of $\Sigma$, one can achieve different shapes and orientations.

That is, both features $x_1$ and $x_2$
have variance equal to 3 and their covariance is zero. The graph of
the Gaussian is symmetric. For this case the isovalue curves (i.e.,
curves of equal probability density values) are circles
(hyperspheres in the general l-dimensional space) and are shown in
Figure 2.3b. The case shown in Figure 2.4a corresponds to the
covariance matrix $\Sigma = \begin{bmatrix} \sigma_1^2 & 0 \\ 0 & \sigma_2^2 \end{bmatrix}$ with $\sigma_1^2 = 15 > \sigma_2^2 = 3$. The graph of the
Gaussian is now elongated along the x1-axis, which is the direction
of the larger variance. The isovalue curves, shown
in Figure 2.4b, are ellipses. Figures 2.5a and 2.5b correspond to the case with $\sigma_1^2 = 3 < \sigma_2^2 = 15$. Figures 2.6a and 2.6b correspond to the more general case where

$\Sigma = \begin{bmatrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{12} & \sigma_2^2 \end{bmatrix}$

with $\sigma_1^2 = 15$, $\sigma_2^2 = 3$, $\sigma_{12} = 6$. Playing with $\sigma_1^2$, $\sigma_2^2$, and $\sigma_{12}$, one can achieve different shapes and different orientations. The
isovalue curves are ellipses of different orientations and with
different ratios of major to minor axis lengths. Let us consider,
as an example, the case of a zero mean random vector with a
diagonal covariance matrix. To compute the isovalue curves is
equivalent to computing the curves of constant values for the
exponent, that is,

$x^T \Sigma^{-1} x = [x_1, x_2]\begin{bmatrix} \frac{1}{\sigma_1^2} & 0 \\ 0 & \frac{1}{\sigma_2^2} \end{bmatrix}\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = C$ (2.31)

or

$\frac{x_1^2}{\sigma_1^2} + \frac{x_2^2}{\sigma_2^2} = C$ (2.32)

for some constant C. This is the equation of an ellipse whose axes are determined by the variances of the involved features. As we will soon see, the
principal axes of the ellipses are controlled by the
eigenvectors/eigenvalues of the covariance matrix. As we know from
linear algebra (and it is easily checked), the eigenvalues of a diagonal matrix, which was the case for our example, are equal to the respective elements across its diagonal.
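A short numeric sketch of this connection between isovalue curves and the eigenstructure of $\Sigma$ (the covariance values are arbitrary, chosen only for illustration):

```python
import numpy as np

# An arbitrary 2x2 covariance matrix (illustrative values only)
Sigma = np.array([[15.0, 6.0],
                  [6.0, 3.0]])

# For a zero-mean vector, the isovalue curve x^T Sigma^{-1} x = C is an
# ellipse whose principal axes point along the eigenvectors of Sigma and
# whose semi-axis lengths are sqrt(lambda_k * C) for eigenvalues lambda_k.
C = 1.0
eigvals, eigvecs = np.linalg.eigh(Sigma)  # eigh: Sigma is symmetric
print(eigvals)               # eigenvalues, in ascending order
print(eigvecs)               # columns are the orthonormal eigenvectors
print(np.sqrt(eigvals * C))  # semi-axis lengths of the C-isovalue ellipse
```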
2.4.2 The Bayesian Classifier for Normally Distributed Classes

Our goal in this section
is to study the optimal Bayesian classifier when the involved pdfs, $p(x|\omega_i)$, $i = 1, 2, \ldots, M$ (likelihood functions of $\omega_i$ with respect to $x$), describing the data distribution in each one of the classes, are multivariate normal distributions, that is, $N(\mu_i, \Sigma_i)$, $i = 1, 2, \ldots, M$. Because of the exponential form of the involved densities, it is preferable to work with the following discriminant functions, which involve the (monotonic) logarithmic function $\ln(\cdot)$:

$g_i(x) = \ln(p(x|\omega_i)P(\omega_i)) = \ln p(x|\omega_i) + \ln P(\omega_i)$ (2.33)

or

$g_i(x) = -\frac{1}{2}(x-\mu_i)^T \Sigma_i^{-1} (x-\mu_i) + \ln P(\omega_i) + c_i$ (2.34)

where $c_i$ is a constant equal to $-(l/2)\ln 2\pi - (1/2)\ln|\Sigma_i|$. Expanding, we obtain

$g_i(x) = -\frac{1}{2}x^T \Sigma_i^{-1} x + \frac{1}{2}x^T \Sigma_i^{-1}\mu_i - \frac{1}{2}\mu_i^T \Sigma_i^{-1}\mu_i + \frac{1}{2}\mu_i^T \Sigma_i^{-1} x + \ln P(\omega_i) + c_i$ (2.35)
In general, this is a nonlinear quadratic form. Take, for example, the case of $l = 2$ and assume that

$\Sigma_i = \begin{bmatrix} \sigma_i^2 & 0 \\ 0 & \sigma_i^2 \end{bmatrix}$

Then (2.35) becomes

$g_i(x) = -\frac{1}{2\sigma_i^2}(x_1^2 + x_2^2) + \frac{1}{\sigma_i^2}(\mu_{i1}x_1 + \mu_{i2}x_2) - \frac{1}{2\sigma_i^2}(\mu_{i1}^2 + \mu_{i2}^2) + \ln P(\omega_i) + c_i$ (2.36)

and obviously the associated decision curves $g_i(x) - g_j(x) = 0$ are quadrics (i.e., ellipsoids, parabolas, hyperbolas, pairs of lines). That is, in such cases, the Bayesian classifier is a quadratic classifier, in the sense that the partition of the feature space is performed via quadric decision surfaces. For $l > 2$ the decision surfaces are hyperquadrics. Figure 2.7a shows the decision curve corresponding to $P(\omega_1) = P(\omega_2)$, $\mu_1 = [0, 0]^T$ and $\mu_2 = [4, 0]^T$. The covariance matrices for the two classes are

$\Sigma_1 = \begin{bmatrix} 0.3 & 0.0 \\ 0.0 & 0.35 \end{bmatrix}, \quad \Sigma_2 = \begin{bmatrix} 1.2 & 0.0 \\ 0.0 & 1.85 \end{bmatrix}$

For the case of Figure 2.7b the classes are also equiprobable with $\mu_1 = [0, 0]^T$, $\mu_2 = [3.2, 0]^T$ and covariance matrices

$\Sigma_1 = \begin{bmatrix} 0.1 & 0.0 \\ 0.0 & 0.75 \end{bmatrix}, \quad \Sigma_2 = \begin{bmatrix} 0.75 & 0.0 \\ 0.0 & 0.1 \end{bmatrix}$

Figure 2.8 shows the two pdfs for the case of Figure 2.7a. The red color is used for class $\omega_1$ and indicates the points where $p(x|\omega_1) > p(x|\omega_2)$. The gray color is similarly used for class $\omega_2$. It is readily observed that the decision curve is an ellipse, as shown in Figure 2.7a. The setup corresponding to Figure 2.7b is shown in Figure 2.9. In this case, the decision curve is a hyperbola.

FIGURE 2.7 Examples of quadric decision curves. Playing with the covariance matrices of the Gaussian functions, different decision curves result, that is, ellipsoids, parabolas, hyperbolas, pairs of lines.
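The quadratic discriminant (2.34) translates directly into code. A minimal sketch, using the parameters quoted for Figure 2.7a; the implementation itself is ours and only illustrative:

```python
import numpy as np

def gaussian_discriminant(x, mu, Sigma, prior):
    """g_i(x) of Eq. (2.34), including the constant c_i."""
    l = len(mu)
    d = x - mu
    Sinv = np.linalg.inv(Sigma)
    return (-0.5 * d @ Sinv @ d + np.log(prior)
            - 0.5 * l * np.log(2 * np.pi)
            - 0.5 * np.log(np.linalg.det(Sigma)))

# Parameters of Figure 2.7a: equiprobable classes
mu1, Sigma1 = np.array([0.0, 0.0]), np.array([[0.3, 0.0], [0.0, 0.35]])
mu2, Sigma2 = np.array([4.0, 0.0]), np.array([[1.2, 0.0], [0.0, 1.85]])

x = np.array([1.5, 0.5])  # an arbitrary test point
g1 = gaussian_discriminant(x, mu1, Sigma1, 0.5)
g2 = gaussian_discriminant(x, mu2, Sigma2, 0.5)
print(1 if g1 > g2 else 2)  # decide by the maximum discriminant
```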
FIGURE 2.8 An example of the pdfs of two equiprobable classes in the two-dimensional space. The feature vectors in both classes are normally distributed with different covariance matrices. In this case, the decision curve is an ellipse and it is shown in Figure 2.7a. The coloring indicates the areas where the value of the respective pdf is larger.

Decision Hyperplanes
The only quadratic contribution in (2.35) comes from the term $x^T \Sigma_i^{-1} x$. If we now assume that the covariance matrix is the same in all classes, that is, $\Sigma_i = \Sigma$, the quadratic term will be the same in all discriminant functions. Hence, it does not enter into the comparisons for computing the maximum, and it cancels out in the decision surface equations. The same is true for the constants $c_i$. Thus, they can be omitted and we may redefine $g_i(x)$ as

$g_i(x) = w_i^T x + w_{i0}$ (2.37)
FIGURE 2.9 An example of the pdfs of two equiprobable classes in the two-dimensional space. The feature vectors in both classes are normally distributed with different covariance matrices. In this case, the decision curve is a hyperbola and it is shown in Figure 2.7b.

where

$w_i = \Sigma^{-1}\mu_i$ (2.38)

and

$w_{i0} = \ln P(\omega_i) - \frac{1}{2}\mu_i^T \Sigma^{-1}\mu_i$ (2.39)

Hence $g_i(x)$ is a linear function of
x and the respective decision surfaces are hyperplanes. Let us
investigate this a bit more.
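A brief sketch of Eqs. (2.37)-(2.39); the common covariance matrix and the means below are assumed values, chosen only for illustration:

```python
import numpy as np

def linear_discriminant_params(mu, Sigma, prior):
    """w_i and w_i0 of Eqs. (2.38)-(2.39) for a common covariance Sigma."""
    Sinv = np.linalg.inv(Sigma)
    w = Sinv @ mu                              # Eq. (2.38)
    w0 = np.log(prior) - 0.5 * mu @ Sinv @ mu  # Eq. (2.39)
    return w, w0

# Assumed example: two equiprobable classes sharing one covariance matrix
Sigma = np.array([[1.1, 0.3], [0.3, 1.9]])
w1, w10 = linear_discriminant_params(np.array([0.0, 0.0]), Sigma, 0.5)
w2, w20 = linear_discriminant_params(np.array([3.0, 3.0]), Sigma, 0.5)

x = np.array([1.0, 2.2])
print(1 if w1 @ x + w10 > w2 @ x + w20 else 2)  # g_i(x) = w_i^T x + w_i0
```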
Diagonal covariance matrix with equal elements: Assume that the individual features, constituting the feature vector, are mutually uncorrelated and of the same variance ($E[(x_i - \mu_i)(x_j - \mu_j)] = \sigma^2\delta_{ij}$). Then, as discussed in Appendix A, $\Sigma = \sigma^2 I$, where $I$ is the $l$-dimensional identity matrix, and (2.37) becomes

$g_i(x) = \frac{1}{\sigma^2}\mu_i^T x + w_{i0}$ (2.40)

Thus, the corresponding decision hyperplanes can now be written as (verify it)

$g_{ij}(x) \equiv g_i(x) - g_j(x) = w^T(x - x_0) = 0$ (2.41)

where

$w = \mu_i - \mu_j$ (2.42)

and

$x_0 = \frac{1}{2}(\mu_i + \mu_j) - \sigma^2 \ln\left(\frac{P(\omega_i)}{P(\omega_j)}\right)\frac{\mu_i - \mu_j}{\|\mu_i - \mu_j\|^2}$ (2.43)

where $\|x\| = \sqrt{x_1^2 + x_2^2 + \cdots + x_l^2}$ denotes the Euclidean norm of $x$. Thus, the decision surface is a hyperplane passing through
the point $x_0$. Obviously, if $P(\omega_i) = P(\omega_j)$, then $x_0 = \frac{1}{2}(\mu_i + \mu_j)$, and the hyperplane passes through the average of $\mu_i$, $\mu_j$, that is, the middle point of the segment joining the mean values. On the other hand, if $P(\omega_j) > P(\omega_i)$ ($P(\omega_i) > P(\omega_j)$) the hyperplane is located closer to $\mu_i$ ($\mu_j$). In other words, the area of the region where we decide in favor of the more probable of the two classes is increased. The geometry is illustrated in Figure 2.10 for the two-dimensional case and for two cases, that is, $P(\omega_j) = P(\omega_i)$ (black line) and $P(\omega_j) > P(\omega_i)$ (red line). We observe that for both cases the decision hyperplane (straight line) is orthogonal to $\mu_i - \mu_j$. Indeed, for any point $x$ lying on the decision hyperplane, the vector $x - x_0$ also lies on the hyperplane and

$g_{ij}(x) = 0 \Rightarrow w^T(x - x_0) = (\mu_i - \mu_j)^T(x - x_0) = 0$

That is, $\mu_i - \mu_j$ is orthogonal to the decision hyperplane. Furthermore, if $\sigma^2$ is small with respect to $\|\mu_i - \mu_j\|$, the location of the hyperplane is rather insensitive to the values of $P(\omega_i)$, $P(\omega_j)$. This is expected, because small variance
indicates that the random vectors are clustered within a small
radius around their mean values. Thus a small shift of the decision
hyperplane has a small effect on the result. Figure 2.11
illustrates this. For each class, the circles around the means
indicate regions where samples have a high probability, say
98%,
FIGURE 2.10 Decision lines for normally distributed vectors with $\Sigma = \sigma^2 I$. The black line corresponds to the case of $P(\omega_j) = P(\omega_i)$ and it passes through the middle point of the line segment joining the mean values of the two classes. The red line corresponds to the case of $P(\omega_j) > P(\omega_i)$ and it is closer to $\mu_i$, leaving more room to the more probable of the two classes. If we had assumed $P(\omega_j) < P(\omega_i)$, the decision line would have moved closer to $\mu_j$.

FIGURE 2.11 Decision line (a) for compact and (b) for noncompact classes. When classes are compact around their mean values, the location of the hyperplane is rather insensitive to the values of $P(\omega_1)$ and $P(\omega_2)$. This is not the case for noncompact classes, where a small movement of the hyperplane to the right or to the left may be more critical.
of being found. The case of Figure 2.11a corresponds to small
variance, and that of Figure 2.11b to large variance. No doubt the
location of the decision hyperplane in Figure 2.11b is much more
critical than that in Figure 2.11a.
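A small numeric sketch of Eqs. (2.42)-(2.43) (all parameter values assumed) makes the effect of the priors on $x_0$ visible, while $w$, and hence the orientation of the hyperplane, stays fixed:

```python
import numpy as np

def hyperplane_sigma2I(mu_i, mu_j, sigma2, P_i, P_j):
    """w and x_0 of Eqs. (2.42)-(2.43) for Sigma = sigma^2 I."""
    diff = mu_i - mu_j
    w = diff                                                   # Eq. (2.42)
    x0 = (0.5 * (mu_i + mu_j)
          - sigma2 * np.log(P_i / P_j) * diff / (diff @ diff))  # Eq. (2.43)
    return w, x0

mu_i, mu_j = np.array([0.0, 0.0]), np.array([4.0, 0.0])
for P_i in (0.5, 0.8):  # equal vs. unequal priors
    w, x0 = hyperplane_sigma2I(mu_i, mu_j, sigma2=1.0, P_i=P_i, P_j=1 - P_i)
    # midpoint [2, 0] for equal priors; shifted toward mu_j for P_i > P_j
    print(x0)
```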
Nondiagonal covariance matrix: Following algebraic arguments similar to those used before, we end up with hyperplanes described by

$g_{ij}(x) = w^T(x - x_0) = 0$ (2.44)

where

$w = \Sigma^{-1}(\mu_i - \mu_j)$ (2.45)

and

$x_0 = \frac{1}{2}(\mu_i + \mu_j) - \ln\left(\frac{P(\omega_i)}{P(\omega_j)}\right)\frac{\mu_i - \mu_j}{\|\mu_i - \mu_j\|_{\Sigma^{-1}}^2}$ (2.46)

where $\|x\|_{\Sigma^{-1}} \equiv (x^T \Sigma^{-1} x)^{1/2}$ denotes the so-called $\Sigma^{-1}$ norm of $x$. The comments made before for the case of the diagonal covariance matrix are still valid, with one exception. The decision hyperplane is no longer orthogonal to the vector $\mu_i - \mu_j$ but to its linear transformation $\Sigma^{-1}(\mu_i - \mu_j)$. Figure 2.12 shows two Gaussian pdfs with
equal covariance matrices, describing the data distribution of two
equiprobable classes. In both classes, the data are distributed
around their mean values in exactly the same way and the optimal
decision curve is a straight line.

Minimum Distance Classifiers

We will now view the task from a slightly different angle. Assuming equiprobable classes with the same covariance matrix, $g_i(x)$ in (2.34) is simplified to

$g_i(x) = -\frac{1}{2}(x-\mu_i)^T \Sigma^{-1} (x-\mu_i)$ (2.47)

where constants have been neglected.

$\Sigma = \sigma^2 I$: In this case maximum $g_i(x)$ implies minimum Euclidean distance:

$d_E = \|x - \mu_i\|$ (2.48)

Thus, feature vectors are assigned to classes according to their Euclidean distance from the respective mean points. Can you verify that this result ties in with the geometry of the hyperplanes discussed before? Figure 2.13a shows curves of equal distance $d_E = c$ from the mean points of each class. They are obviously circles of radius $c$ (hyperspheres in the general case).

Nondiagonal $\Sigma$: For this case maximizing $g_i(x)$ is equivalent to minimizing the $\Sigma^{-1}$ norm, known as the Mahalanobis distance:

$d_m = \left((x-\mu_i)^T \Sigma^{-1} (x-\mu_i)\right)^{1/2}$ (2.49)

In this case, the constant distance $d_m = c$ curves are ellipses (hyperellipses). Indeed, the covariance matrix is symmetric and, as discussed in Appendix B, it can always be diagonalized by a unitary transform

$\Sigma = \Phi \Lambda \Phi^T$ (2.50)
FIGURE 2.12 An example of two Gaussian pdfs with the same covariance matrix in the two-dimensional space. Each one of them is associated with one of two equiprobable classes. In this case, the decision curve is a straight line.

where $\Phi^T = \Phi^{-1}$ and $\Lambda$ is the diagonal matrix whose elements are the eigenvalues of $\Sigma$. $\Phi$ has as its columns the corresponding (orthonormal) eigenvectors of $\Sigma$:

$\Phi = [v_1, v_2, \ldots, v_l]$ (2.51)

Combining (2.49) and (2.50), we obtain

$(x - \mu_i)^T \Phi \Lambda^{-1} \Phi^T (x - \mu_i) = c^2$ (2.52)

Define $x' = \Phi^T x$. The coordinates of $x'$ are equal to $v_k^T x$, $k = 1, 2, \ldots, l$, that is, the projections of $x$ onto the eigenvectors. In other words, they are the coordinates of $x$ with respect to a new coordinate system whose axes are determined by $v_k$, $k = 1, 2, \ldots, l$. Equation (2.52) can now be written as

$\frac{(x_1' - \mu_{i1}')^2}{\lambda_1} + \cdots + \frac{(x_l' - \mu_{il}')^2}{\lambda_l} = c^2$ (2.53)
FIGURE 2.13 Curves of (a) equal Euclidean distance and (b) equal Mahalanobis distance from the mean points of each class. In the two-dimensional space, they are circles in the case of Euclidean distance and ellipses in the case of Mahalanobis distance. Observe that in the latter case the decision line is no longer orthogonal to the line segment joining the mean values. It turns according to the shape of the ellipses.

This is the equation of a hyperellipsoid in the new coordinate system. Figure 2.13b shows the $l = 2$ case. The center of mass of the ellipse is at $\mu_i$, and the principal axes are aligned with the corresponding eigenvectors and have lengths $2\sqrt{\lambda_k}c$, respectively. Thus, all points having the same Mahalanobis distance from a specific point are located on an ellipse.
Example 2.2

In a two-class, two-dimensional classification task, the feature vectors are generated by two normal distributions sharing the same covariance matrix

$\Sigma = \begin{bmatrix} 1.1 & 0.3 \\ 0.3 & 1.9 \end{bmatrix}$

and the mean vectors are $\mu_1 = [0, 0]^T$, $\mu_2 = [3, 3]^T$, respectively.

(a) Classify the vector $[1.0, 2.2]^T$ according to the Bayesian classifier.

It suffices to compute the Mahalanobis distance of $[1.0, 2.2]^T$ from the two mean vectors. Thus,

$d_m^2(\mu_1, x) = (x - \mu_1)^T \Sigma^{-1} (x - \mu_1) = [1.0, 2.2]\begin{bmatrix} 0.95 & -0.15 \\ -0.15 & 0.55 \end{bmatrix}\begin{bmatrix} 1.0 \\ 2.2 \end{bmatrix} = 2.952$

Similarly,

$d_m^2(\mu_2, x) = [-2.0, -0.8]\begin{bmatrix} 0.95 & -0.15 \\ -0.15 & 0.55 \end{bmatrix}\begin{bmatrix} -2.0 \\ -0.8 \end{bmatrix} = 3.672$ (2.54)

Thus, the vector is assigned to the class with mean vector $[0, 0]^T$. Notice that the given vector $[1.0, 2.2]^T$ is closer to $[3, 3]^T$ with respect to the Euclidean distance.
(b) Compute the principal axes of the ellipse centered at $[0, 0]^T$ that corresponds to a constant Mahalanobis distance $d_m = \sqrt{2.952}$ from the center.

To this end, we first calculate the eigenvalues of $\Sigma$:

$\det\begin{bmatrix} 1.1 - \lambda & 0.3 \\ 0.3 & 1.9 - \lambda \end{bmatrix} = \lambda^2 - 3\lambda + 2 = 0$

or $\lambda_1 = 1$ and $\lambda_2 = 2$. To compute the eigenvectors we substitute these values into the equation $(\Sigma - \lambda I)v = 0$ and we obtain the unit norm eigenvectors

$v_1 = \begin{bmatrix} \frac{3}{\sqrt{10}} \\ -\frac{1}{\sqrt{10}} \end{bmatrix}, \quad v_2 = \begin{bmatrix} \frac{1}{\sqrt{10}} \\ \frac{3}{\sqrt{10}} \end{bmatrix}$

It can easily be seen that they are mutually orthogonal. The principal axes of the ellipse are parallel to $v_1$ and $v_2$ and have lengths $2\sqrt{\lambda_1}d_m = 3.436$ and $2\sqrt{\lambda_2}d_m = 4.859$, respectively.
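The numbers in Example 2.2 are easy to reproduce; a minimal check in Python:

```python
import numpy as np

Sigma = np.array([[1.1, 0.3], [0.3, 1.9]])
Sinv = np.linalg.inv(Sigma)  # [[0.95, -0.15], [-0.15, 0.55]]
mu1, mu2 = np.array([0.0, 0.0]), np.array([3.0, 3.0])
x = np.array([1.0, 2.2])

# Part (a): squared Mahalanobis distances from the two means
d1_sq = (x - mu1) @ Sinv @ (x - mu1)
d2_sq = (x - mu2) @ Sinv @ (x - mu2)
print(d1_sq, d2_sq)  # 2.952, 3.672: assign x to class 1

# Part (b): eigenstructure of Sigma and axis lengths 2*sqrt(lambda_k)*d_m
eigvals, eigvecs = np.linalg.eigh(Sigma)
print(eigvals)                                # 1.0, 2.0
print(2 * np.sqrt(eigvals) * np.sqrt(d1_sq))  # ~3.436, ~4.859
```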
Remarks

In practice, it is quite common to assume that the data in each class are adequately described by a Gaussian distribution. As a consequence, the associated Bayesian classifier is either linear or quadratic in nature, depending on the adopted assumptions concerning the covariance matrices, that is, on whether they are all equal or different. In statistics, this approach to the classification task is known as linear discriminant analysis (LDA) or quadratic discriminant analysis (QDA), respectively. Maximum likelihood is usually the method mobilized for the estimation of the unknown parameters that define the mean values and the covariance matrices (see Section 2.5 and Problem 2.19).

A major problem associated with LDA, and even more with QDA, is the large number of unknown parameters that have to be estimated in the case of high-dimensional spaces. For example, there are $l$ parameters in each of the mean vectors and approximately $l^2/2$ in each (symmetric) covariance matrix. Besides the high demand for computational resources, obtaining good estimates of a large number of parameters dictates a large number of training points, N. This is a major issue that also embraces the design of other types of classifiers, for most of the cases, and we will come to it in greater detail in Chapter 5. In an effort to reduce the number of parameters to be estimated, a number of approximate techniques have been suggested over the years, including [Kimu 87, Hoff 96, Frie 89, Liu 04]. Linear discrimination will be approached from a different perspective in Section 5.8.

LDA and QDA exhibit good performance in a large set of diverse applications and are considered to be among the most popular classifiers. No doubt, it is hard to accept that in all these cases the Gaussian assumption provides a reasonable modeling for the data statistics. The secret of the success seems
to lie in the fact that linear or quadratic decision surfaces offer a reasonably good partition of the space, from the classification point of view. Moreover, as pointed out in [Hast 01], the estimates associated with Gaussian models have some good statistical properties (i.e., bias-variance trade-off, Section 3.5.3) compared to other techniques.

2.5 ESTIMATION OF UNKNOWN PROBABILITY DENSITY FUNCTIONS

So far, we have assumed that
the probability density functions are known. However, this is not the most common case. In many problems, the underlying pdf has to be estimated from the available data. There are various ways to approach the problem. Sometimes we may know the type of the pdf (e.g., Gaussian, Rayleigh), but we do not know certain parameters, such as the mean values or the variances. In contrast, in other cases we may not have information about the type of the pdf, but we may know certain statistical parameters, such as the mean value and the variance. Depending on the available information, different approaches can be adopted. This will be our focus in the next subsections.

2.5.1 Maximum Likelihood Parameter Estimation

Let us
consider an M-class problem with feature vectors distributed according to $p(x|\omega_i)$, $i = 1, 2, \ldots, M$. We assume that these likelihood functions are given in a parametric form and that the corresponding parameters form the vectors $\theta_i$, which are unknown. To show the dependence on $\theta_i$ we write $p(x|\omega_i; \theta_i)$. Our goal is to estimate the unknown parameters using a set of known feature vectors in each class. If we further assume that data from one class do not affect the parameter estimation of the others, we can formulate the problem independent of classes and simplify our notation. At the end, one has to solve one such problem for each class independently.

Let $x_1, x_2, \ldots, x_N$ be random samples drawn from pdf $p(x; \theta)$. We form the joint pdf $p(X; \theta)$, where $X = \{x_1, \ldots, x_N\}$ is the set of the samples. Assuming statistical independence between the different samples, we have

$p(X; \theta) \equiv p(x_1, x_2, \ldots, x_N; \theta) = \prod_{k=1}^{N} p(x_k; \theta)$ (2.55)

This is a function of $\theta$, and it is also known as the likelihood function of $\theta$ with respect to $X$. The maximum likelihood (ML) method estimates $\theta$ so that the likelihood function takes its maximum value, that is,

$\hat{\theta}_{ML} = \arg\max_{\theta} \prod_{k=1}^{N} p(x_k; \theta)$ (2.56)
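When no closed-form maximizer is available, (2.56) can be attacked numerically. A toy sketch (synthetic data; the grid search is only for illustration, and we work with the logarithm for numerical stability, anticipating (2.58)) for a unit-variance Gaussian with unknown mean:

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(loc=1.5, scale=1.0, size=500)  # true theta_0 = 1.5

def log_likelihood(theta, x):
    """Sum of ln p(x_k; theta) for a unit-variance Gaussian mean theta."""
    return np.sum(-0.5 * np.log(2 * np.pi) - 0.5 * (x - theta) ** 2)

thetas = np.linspace(-1.0, 4.0, 5001)
L = np.array([log_likelihood(t, samples) for t in thetas])
theta_ml = thetas[np.argmax(L)]
print(theta_ml, samples.mean())  # grid maximizer vs. closed-form ML (sample mean)
```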
A necessary condition that $\hat{\theta}_{ML}$ must satisfy in order to be a maximum is that the gradient of the likelihood function with respect to $\theta$ be zero, that is,

$\frac{\partial \prod_{k=1}^{N} p(x_k; \theta)}{\partial \theta} = 0$ (2.57)

Because of the monotonicity of the logarithmic function, we define the log-likelihood function as

$L(\theta) \equiv \ln \prod_{k=1}^{N} p(x_k; \theta)$ (2.58)

and (2.57) is equivalent to

$\frac{\partial L(\theta)}{\partial \theta} = \sum_{k=1}^{N} \frac{\partial \ln p(x_k; \theta)}{\partial \theta} = \sum_{k=1}^{N} \frac{1}{p(x_k; \theta)} \frac{\partial p(x_k; \theta)}{\partial \theta} = 0$ (2.59)

Figure 2.14 illustrates the method for the single unknown
parameter case. The ML estimate corresponds to the peak of the log-likelihood function.

Maximum likelihood estimation has some very desirable properties. If $\theta_0$ is the true value of the unknown parameter in $p(x; \theta)$, it can be shown that under generally valid conditions the following are true [Papo 91].

The ML estimate is asymptotically unbiased, which by definition means that

$\lim_{N \to \infty} E[\hat{\theta}_{ML}] = \theta_0$ (2.60)

Alternatively, we say that the estimate converges in the mean to the true value. The meaning of this is as follows. The estimate $\hat{\theta}_{ML}$ is itself a random vector, because for different sample sets $X$ different estimates will result. An estimate is called unbiased if its mean is the true value of the unknown parameter. In the ML case this is true only asymptotically ($N \to \infty$).

FIGURE 2.14 The maximum likelihood estimator $\hat{\theta}_{ML}$ corresponds to the peak of $p(X; \theta)$.
The ML estimate is asymptotically consistent, that is, it satisfies

$\lim_{N \to \infty} \text{prob}\{\|\hat{\theta}_{ML} - \theta_0\| \leq \epsilon\} = 1$ (2.61)

where $\epsilon$ is arbitrarily small. Alternatively, we say that the estimate converges in probability. In other words, for large N it is highly probable that the resulting estimate will be arbitrarily close to the true value. A stronger condition for consistency is also true:

$\lim_{N \to \infty} E[\|\hat{\theta}_{ML} - \theta_0\|^2] = 0$ (2.62)

In such cases we say that the estimate converges in the mean square. In words, for large N, the variance of the ML estimates tends to zero. Consistency is very important for an estimator, because it may be unbiased, but the resulting estimates may still exhibit large variations around the mean. In such cases we have little confidence in the result obtained from a single set X.

The ML estimate is asymptotically efficient; that is, it achieves the Cramer-Rao lower bound (Appendix A). This is the lowest value of variance that any estimate can achieve.

The pdf of the ML estimate as $N \to \infty$ approaches the Gaussian distribution with mean $\theta_0$ [Cram 46]. This property is an offspring of (a) the central limit theorem (Appendix A) and (b) the fact that the ML estimate is related to the sum of random variables, that is, $\partial \ln(p(x_k; \theta))/\partial \theta$ (Problem 2.16).

In summary, the ML estimator is unbiased, is
normally distributed, and has the minimum possible variance.
However, all these nice properties are valid only for large values
of N.

Example 2.3

Assume that N data points, $x_1, x_2, \ldots, x_N$, have been generated by a one-dimensional Gaussian pdf of known mean, $\mu$, but of unknown variance. Derive the ML estimate of the variance.

The log-likelihood function for this case is given by

$L(\sigma^2) = \ln \prod_{k=1}^{N} p(x_k; \sigma^2) = \ln \prod_{k=1}^{N} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x_k - \mu)^2}{2\sigma^2}\right)$

or

$L(\sigma^2) = -\frac{N}{2}\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{k=1}^{N}(x_k - \mu)^2$

Taking the derivative of the above with respect to $\sigma^2$ and equating to zero, we obtain

$-\frac{N}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{k=1}^{N}(x_k - \mu)^2 = 0$
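Solving this last equation for $\sigma^2$ yields the estimate $\hat{\sigma}^2_{ML} = \frac{1}{N}\sum_{k=1}^{N}(x_k - \mu)^2$. A quick numeric confirmation on synthetic data (the setup values are our own):

```python
import numpy as np

rng = np.random.default_rng(1)
mu = 2.0                                        # known mean, as in Example 2.3
x = rng.normal(loc=mu, scale=3.0, size=10_000)  # true variance is 9

# Closed-form ML estimate obtained by solving the last equation for sigma^2
sigma2_ml = np.mean((x - mu) ** 2)
print(sigma2_ml)  # close to 9 for large N

# Cross-check: sigma2_ml also maximizes L(sigma^2) over a grid
def L(s2):
    return -0.5 * len(x) * np.log(2 * np.pi * s2) - np.sum((x - mu) ** 2) / (2 * s2)

grid = np.linspace(5.0, 13.0, 8001)
print(grid[np.argmax([L(s) for s in grid])])  # agrees with sigma2_ml
```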