
CS229 Lecture notes

    Andrew Ng

    Part IV

Generative Learning algorithms

So far, we've mainly been talking about learning algorithms that model p(y|x; θ), the conditional distribution of y given x. For instance, logistic regression modeled p(y|x; θ) as h_θ(x) = g(θ^T x), where g is the sigmoid function. In these notes, we'll talk about a different type of learning algorithm.

Consider a classification problem in which we want to learn to distinguish between elephants (y = 1) and dogs (y = 0), based on some features of an animal. Given a training set, an algorithm like logistic regression or the perceptron algorithm (basically) tries to find a straight line, that is, a decision boundary, that separates the elephants and dogs. Then, to classify a new animal as either an elephant or a dog, it checks on which side of the decision boundary it falls, and makes its prediction accordingly.

Here's a different approach. First, looking at elephants, we can build a model of what elephants look like. Then, looking at dogs, we can build a separate model of what dogs look like. Finally, to classify a new animal, we can match the new animal against the elephant model, and match it against the dog model, to see whether the new animal looks more like the elephants or more like the dogs we had seen in the training set.

Algorithms that try to learn p(y|x) directly (such as logistic regression), or algorithms that try to learn mappings directly from the space of inputs X to the labels {0, 1} (such as the perceptron algorithm), are called discriminative learning algorithms. Here, we'll talk about algorithms that instead try to model p(x|y) (and p(y)). These algorithms are called generative learning algorithms. For instance, if y indicates whether an example is a dog (0) or an elephant (1), then p(x|y = 0) models the distribution of dogs' features, and p(x|y = 1) models the distribution of elephants' features.

After modeling p(y) (called the class priors) and p(x|y), our algorithm can then use Bayes rule to derive the posterior distribution on y given x:

$$p(y|x) = \frac{p(x|y)\,p(y)}{p(x)}.$$

Here, the denominator is given by p(x) = p(x|y = 1)p(y = 1) + p(x|y = 0)p(y = 0) (you should be able to verify that this is true from the standard properties of probabilities), and thus can also be expressed in terms of the quantities p(x|y) and p(y) that we've learned. Actually, if we're calculating p(y|x) in order to make a prediction, then we don't actually need to calculate the denominator, since

$$\arg\max_y p(y|x) = \arg\max_y \frac{p(x|y)\,p(y)}{p(x)} = \arg\max_y p(x|y)\,p(y).$$
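To make this prediction rule concrete, here is a minimal Python sketch (not from the notes). The one-dimensional Gaussian class-conditionals and the priors below are made-up placeholders for whatever p(x|y) and p(y) have actually been learned; the point is only that the arg max uses p(x|y)p(y) and never evaluates p(x).

```python
import numpy as np

# Minimal sketch of the generative prediction rule above (illustrative only).
# The class-conditional densities and priors are made-up placeholders for
# whatever model of p(x|y) and p(y) has actually been learned.

def p_x_given_y(x, y):
    # Hypothetical learned densities: p(x|y=0) = N(0, 1), p(x|y=1) = N(2, 1).
    mu = 0.0 if y == 0 else 2.0
    return np.exp(-0.5 * (x - mu) ** 2) / np.sqrt(2.0 * np.pi)

p_y = {0: 0.7, 1: 0.3}  # hypothetical class priors p(y)

def predict(x):
    # arg max_y p(x|y) p(y); the shared denominator p(x) is never computed.
    scores = {y: p_x_given_y(x, y) * p_y[y] for y in (0, 1)}
    return max(scores, key=scores.get)

print(predict(0.4))  # near the y=0 mean, and p(y=0) is larger: prints 0
print(predict(2.5))  # near the y=1 mean: prints 1
```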

    1 Gaussian discriminant analysis

The first generative learning algorithm that we'll look at is Gaussian discriminant analysis (GDA). In this model, we'll assume that p(x|y) is distributed according to a multivariate normal distribution. Let's talk briefly about the properties of multivariate normal distributions before moving on to the GDA model itself.

    1.1 The multivariate normal distribution

The multivariate normal distribution in n dimensions, also called the multivariate Gaussian distribution, is parameterized by a mean vector μ ∈ R^n and a covariance matrix Σ ∈ R^{n×n}, where Σ ≥ 0 is symmetric and positive semi-definite. Also written N(μ, Σ), its density is given by:

$$p(x; \mu, \Sigma) = \frac{1}{(2\pi)^{n/2}\,|\Sigma|^{1/2}} \exp\!\left(-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)\right).$$

In the equation above, |Σ| denotes the determinant of the matrix Σ.

For a random variable X distributed N(μ, Σ), the mean is (unsurprisingly) given by μ:

$$\mathrm{E}[X] = \int_x x\, p(x; \mu, \Sigma)\, dx = \mu$$

The covariance of a vector-valued random variable Z is defined as Cov(Z) = E[(Z − E[Z])(Z − E[Z])^T]. This generalizes the notion of the variance of a real-valued random variable. The covariance can also be defined as Cov(Z) = E[ZZ^T] − (E[Z])(E[Z])^T. (You should be able to prove to yourself that these two definitions are equivalent.) If X ∼ N(μ, Σ), then

$$\mathrm{Cov}(X) = \Sigma.$$
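The density formula and the mean/covariance facts above can be sanity-checked numerically. The sketch below (assuming numpy and scipy are available; the particular μ and Σ are arbitrary example values) evaluates the density directly from the formula, compares it against scipy.stats.multivariate_normal, and checks E[X] ≈ μ and Cov(X) ≈ Σ on samples.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Sketch: evaluate the N(mu, Sigma) density directly from the formula above,
# compare with scipy's implementation, and check E[X] = mu, Cov(X) = Sigma on
# samples. The particular mu and Sigma are arbitrary example values.
mu = np.array([1.0, -0.5])
Sigma = np.array([[1.0, 0.8],
                  [0.8, 3.0]])

def gaussian_pdf(x, mu, Sigma):
    n = mu.shape[0]
    diff = x - mu
    norm = (2.0 * np.pi) ** (n / 2.0) * np.linalg.det(Sigma) ** 0.5
    return np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff)) / norm

x = np.array([0.3, 0.1])
print(gaussian_pdf(x, mu, Sigma))
print(multivariate_normal(mean=mu, cov=Sigma).pdf(x))  # should agree

rng = np.random.default_rng(0)
samples = rng.multivariate_normal(mu, Sigma, size=200_000)
print(samples.mean(axis=0))           # approximately mu
print(np.cov(samples, rowvar=False))  # approximately Sigma
```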

Here are some examples of what the density of a Gaussian distribution looks like:

[Figure: three surface plots of two-dimensional Gaussian densities, plotted over x1, x2 ∈ [−3, 3].]

    The left-most figure shows a Gaussian with mean zero (that is, the 2x1zero-vector) and covariance matrix = I (the 2x2 identity matrix). A Gaus-sian with zero mean and identity covariance is also called the standard nor-mal distribution. The middle figure shows the density of a Gaussian withzero mean and = 0.6I; and in the rightmost figure shows one with , = 2I.We see that as becomes larger, the Gaussian becomes more spread-out,and as it becomes smaller, the distribution becomes more compressed.

Let's look at some more examples.

[Figure: three more surface plots of zero-mean Gaussian densities over x1, x2 ∈ [−3, 3].]

The figures above show Gaussians with mean 0, and with covariance matrices respectively

$$\Sigma = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix};\quad \Sigma = \begin{bmatrix} 1 & 0.5 \\ 0.5 & 1 \end{bmatrix};\quad \Sigma = \begin{bmatrix} 1 & 0.8 \\ 0.8 & 1 \end{bmatrix}.$$

The leftmost figure shows the familiar standard normal distribution, and we see that as we increase the off-diagonal entry in Σ, the density becomes more compressed towards the 45° line (given by x1 = x2). We can see this more clearly when we look at the contours of the same three densities:

[Figure: contour plots of the same three densities, on axes x1, x2 ∈ [−3, 3].]

Here's one last set of examples generated by varying Σ:

[Figure: contour plots of three more Gaussian densities, on axes x1, x2 ∈ [−3, 3].]

    The plots above used, respectively,

$$\Sigma = \begin{bmatrix} 1 & -0.5 \\ -0.5 & 1 \end{bmatrix};\quad \Sigma = \begin{bmatrix} 1 & -0.8 \\ -0.8 & 1 \end{bmatrix};\quad \Sigma = \begin{bmatrix} 3 & 0.8 \\ 0.8 & 1 \end{bmatrix}.$$

From the leftmost and middle figures, we see that by decreasing the off-diagonal elements of the covariance matrix, the density now becomes compressed again, but in the opposite direction. Lastly, as we vary the parameters, more generally the contours will form ellipses (the rightmost figure showing an example).
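Contour figures like the ones above are straightforward to regenerate. The following sketch (assuming matplotlib and scipy are available; the covariance value is just one illustrative choice) plots the contours of a zero-mean Gaussian with a positive off-diagonal entry.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import multivariate_normal

# Sketch: contours of a zero-mean Gaussian with a positive off-diagonal entry,
# in the spirit of the figures above (the covariance value is illustrative).
Sigma = np.array([[1.0, 0.8],
                  [0.8, 1.0]])
xs = np.linspace(-3, 3, 200)
X1, X2 = np.meshgrid(xs, xs)
grid = np.dstack([X1, X2])  # shape (200, 200, 2): one (x1, x2) point per cell
Z = multivariate_normal(mean=[0.0, 0.0], cov=Sigma).pdf(grid)

plt.contour(X1, X2, Z)
plt.gca().set_aspect("equal")  # elongation along the 45-degree line is visible
plt.xlabel("x1")
plt.ylabel("x2")
plt.show()
```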

As our last set of examples, fixing Σ = I, by varying μ, we can also move the mean of the density around.

[Figure: three surface plots of Gaussian densities with identity covariance and different means.]

The figures above were generated using Σ = I, and respectively

$$\mu = \begin{bmatrix} 1 \\ 0 \end{bmatrix};\quad \mu = \begin{bmatrix} -0.5 \\ 0 \end{bmatrix};\quad \mu = \begin{bmatrix} -1 \\ -1.5 \end{bmatrix}.$$


    1.2 The Gaussian Discriminant Analysis model

When we have a classification problem in which the input features x are continuous-valued random variables, we can then use the Gaussian Discriminant Analysis (GDA) model, which models p(x|y) using a multivariate normal distribution. The model is:

$$y \sim \mathrm{Bernoulli}(\phi)$$
$$x \mid y = 0 \sim \mathcal{N}(\mu_0, \Sigma)$$
$$x \mid y = 1 \sim \mathcal{N}(\mu_1, \Sigma)$$

Writing out the distributions, this is:

$$p(y) = \phi^{y}(1-\phi)^{1-y}$$
$$p(x \mid y = 0) = \frac{1}{(2\pi)^{n/2}\,|\Sigma|^{1/2}} \exp\!\left(-\frac{1}{2}(x-\mu_0)^T \Sigma^{-1} (x-\mu_0)\right)$$
$$p(x \mid y = 1) = \frac{1}{(2\pi)^{n/2}\,|\Sigma|^{1/2}} \exp\!\left(-\frac{1}{2}(x-\mu_1)^T \Sigma^{-1} (x-\mu_1)\right)$$

Here, the parameters of our model are φ, Σ, μ0 and μ1. (Note that while there are two different mean vectors μ0 and μ1, this model is usually applied using only one covariance matrix Σ.) The log-likelihood of the data is given by

$$\ell(\phi, \mu_0, \mu_1, \Sigma) = \log \prod_{i=1}^{m} p(x^{(i)}, y^{(i)}; \phi, \mu_0, \mu_1, \Sigma) = \log \prod_{i=1}^{m} p(x^{(i)} \mid y^{(i)}; \mu_0, \mu_1, \Sigma)\, p(y^{(i)}; \phi).$$


By maximizing ℓ with respect to the parameters, we find the maximum likelihood estimate of the parameters (see problem set 1) to be:

$$\phi = \frac{1}{m}\sum_{i=1}^{m} 1\{y^{(i)} = 1\}$$
$$\mu_0 = \frac{\sum_{i=1}^{m} 1\{y^{(i)} = 0\}\, x^{(i)}}{\sum_{i=1}^{m} 1\{y^{(i)} = 0\}}$$
$$\mu_1 = \frac{\sum_{i=1}^{m} 1\{y^{(i)} = 1\}\, x^{(i)}}{\sum_{i=1}^{m} 1\{y^{(i)} = 1\}}$$
$$\Sigma = \frac{1}{m}\sum_{i=1}^{m} \left(x^{(i)} - \mu_{y^{(i)}}\right)\left(x^{(i)} - \mu_{y^{(i)}}\right)^T.$$
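As an illustration (not the course's reference implementation), these estimates translate directly into a few lines of numpy. The fit_gda helper and the synthetic data below are hypothetical and only there to exercise the formulas.

```python
import numpy as np

# Sketch of the maximum likelihood estimates above, computed with numpy.
# X is an (m, n) design matrix, y an (m,) vector of 0/1 labels.
def fit_gda(X, y):
    m = X.shape[0]
    phi = np.mean(y == 1)
    mu0 = X[y == 0].mean(axis=0)
    mu1 = X[y == 1].mean(axis=0)
    # Shared covariance: outer products of each x^(i) minus its class mean.
    centered = X - np.where((y == 1)[:, None], mu1, mu0)
    Sigma = centered.T @ centered / m
    return phi, mu0, mu1, Sigma

# Quick check on synthetic data drawn from the GDA model itself.
rng = np.random.default_rng(0)
Sigma_true = np.array([[1.0, 0.4], [0.4, 1.5]])
y = (rng.random(5000) < 0.3).astype(int)
X = np.where((y == 1)[:, None],
             rng.multivariate_normal([2.0, 2.0], Sigma_true, size=5000),
             rng.multivariate_normal([0.0, 0.0], Sigma_true, size=5000))
print(fit_gda(X, y))  # estimates close to 0.3, [0, 0], [2, 2], Sigma_true
```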

Pictorially, what the algorithm is doing can be seen as follows:

[Figure: training set with contours of the two fitted Gaussians and the straight-line decision boundary.]

Shown in the figure are the training set, as well as the contours of the two Gaussian distributions that have been fit to the data in each of the two classes. Note that the two Gaussians have contours that are the same shape and orientation, since they share a covariance matrix Σ, but they have different means μ0 and μ1. Also shown in the figure is the straight line giving the decision boundary at which p(y = 1|x) = 0.5. On one side of the boundary, we'll predict y = 1 to be the most likely outcome, and on the other side, we'll predict y = 0.
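A sketch of the corresponding prediction step, with hypothetical parameter values standing in for fitted ones (for example, the output of a fitting routine like the fit_gda sketch above): classify x by comparing p(x|y = 1)p(y = 1) against p(x|y = 0)p(y = 0).

```python
import numpy as np
from scipy.stats import multivariate_normal

# Sketch: GDA prediction from already-fitted parameters. The values below are
# hypothetical stand-ins for the output of a fitting step.
phi = 0.3
mu0, mu1 = np.array([0.0, 0.0]), np.array([2.0, 2.0])
Sigma = np.array([[1.0, 0.4], [0.4, 1.5]])

def predict(x):
    # Compare p(x|y=1)p(y=1) with p(x|y=0)p(y=0); this is the same as asking
    # whether the posterior p(y=1|x) exceeds 0.5.
    s1 = multivariate_normal(mu1, Sigma).pdf(x) * phi
    s0 = multivariate_normal(mu0, Sigma).pdf(x) * (1 - phi)
    return int(s1 > s0)

print(predict(np.array([0.2, -0.1])))  # near mu0: prints 0
print(predict(np.array([1.9, 2.2])))   # near mu1: prints 1
```

Because both classes share the same Σ, the log of the ratio of these two scores is linear in x, which is why the decision boundary in the figure is a straight line.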

    1.3 Discussion: GDA and logistic regression

The GDA model has an interesting relationship to logistic regression. If we view the quantity p(y = 1|x; φ, μ0, μ1, Σ) as a function of x, we'll find that it can be expressed in the form of a logistic (sigmoid) function applied to a linear function of x, which is exactly the form that logistic regression uses to model p(y = 1|x).