
EXAM IN STATISTICAL MACHINE LEARNING
STATISTISK MASKININLÄRNING

DATE AND TIME: March 10, 2017, 8.00–13.00

RESPONSIBLE TEACHER: Fredrik Lindsten

NUMBER OF PROBLEMS: 5

AIDING MATERIAL: Calculator, mathematical handbooks

Some general instructions and information:

Your solutions can be given in Swedish or in English.

Only write on one page of the paper.

Write your exam code and a page number on all pages.

Do not use a red pen.

Use separate sheets of paper for the different problems (i.e. the numbered problems, 1–5).

When asked to pair e.g. plots with corresponding formulas, the order of the plots/formulas is always randomly generated using the function sample(n) in R, where n is the number of options. Thus, it is not possible to infer the correct answer from the way in which the problem is presented.

Except for Problem 1, an answer without a proper motivation will score zero points!

Good luck!

Some useful formulas

Pages 1–3 contain some expressions that may or may not be useful for solving the exam problems. This is not a complete list of formulas used in the course! Consequently, some of the problems may require knowledge about certain expressions not listed here. Furthermore, the formulas listed below are not all self-explanatory, meaning that you need to be familiar with the expressions to be able to interpret them. Thus, the list should be viewed as a support for solving the problems, rather than as a comprehensive collection of formulas.

Marginalization and conditioning of probability densities: For a partitioned random vector $Z = (Z_1^T \; Z_2^T)^T$ with joint probability density function $p(z) = p(z_1, z_2)$, the marginal probability density function of $Z_1$ is

$$p(z_1) = \int_{Z_2} p(z_1, z_2)\, dz_2$$

and the conditional probability density function for $Z_1$ given $Z_2 = z_2$ is

$$p(z_1 \mid z_2) = \frac{p(z_1, z_2)}{p(z_2)} = \frac{p(z_2 \mid z_1)\, p(z_1)}{p(z_2)}.$$

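The Gaussian distribution: The probability density function of the p-dimensional Gaussian distribution is

$$\mathcal{N}(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{p/2} \sqrt{\det \Sigma}} \exp\left(-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\right), \quad x \in \mathbb{R}^p.$$

For a Gaussian random vector $X \sim \mathcal{N}(\mu, \Sigma)$ partitioned according to

$$X = \begin{pmatrix} X_a \\ X_b \end{pmatrix}, \quad \mu = \begin{pmatrix} \mu_a \\ \mu_b \end{pmatrix}, \quad \Sigma = \begin{pmatrix} \Sigma_a & \Sigma_{ab} \\ \Sigma_{ab}^T & \Sigma_b \end{pmatrix},$$

it holds that the marginal probability density of $X_a$ is $p(x_a) = \mathcal{N}(x_a \mid \mu_a, \Sigma_a)$ and the conditional density of $X_a$ given $X_b = x_b$ is $p(x_a \mid x_b) = \mathcal{N}(x_a \mid \mu_{a|b}, \Sigma_{a|b})$, where

$$\mu_{a|b} = \mu_a + \Sigma_{ab} \Sigma_b^{-1} (x_b - \mu_b), \qquad \Sigma_{a|b} = \Sigma_a - \Sigma_{ab} \Sigma_b^{-1} \Sigma_{ab}^T.$$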
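As a rough illustration of the conditioning formulas above, the following NumPy sketch computes the conditional mean and covariance of $X_a$ given $X_b = x_b$. The function name `condition_gaussian` and all numbers are assumptions made for this example only; they are not part of the exam.

```python
import numpy as np

def condition_gaussian(mu_a, mu_b, S_a, S_ab, S_b, x_b):
    """Mean and covariance of X_a given X_b = x_b for jointly Gaussian (X_a, X_b)."""
    # gain = S_ab S_b^{-1}; solve a linear system instead of inverting S_b
    # (valid because S_b is symmetric).
    gain = np.linalg.solve(S_b, S_ab.T).T
    mu_cond = mu_a + gain @ (x_b - mu_b)
    S_cond = S_a - gain @ S_ab.T
    return mu_cond, S_cond

# Toy one-dimensional blocks (hypothetical values).
mu_a, mu_b = np.array([0.0]), np.array([1.0])
S_a, S_ab, S_b = np.array([[2.0]]), np.array([[1.0]]), np.array([[4.0]])
mu_cond, S_cond = condition_gaussian(mu_a, mu_b, S_a, S_ab, S_b, x_b=np.array([3.0]))
```

Observing $X_b$ shifts the mean of $X_a$ in proportion to the cross-covariance and can only shrink its covariance.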


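If we have that $X_a$, as well as $X_b$ conditioned on $X_a = x_a$, are Gaussian distributed according to

$$p(x_a) = \mathcal{N}(x_a \mid \mu_a, \Sigma_a), \qquad p(x_b \mid x_a) = \mathcal{N}(x_b \mid M x_a + b, \Sigma_{b|a}),$$

where $M$ is a matrix (of appropriate dimension) and $b$ is a constant vector, then the joint distribution of $X_a$ and $X_b$ is given by

$$p(x_a, x_b) = \mathcal{N}\left( \begin{pmatrix} x_a \\ x_b \end{pmatrix} \,\middle|\, \begin{pmatrix} \mu_a \\ M\mu_a + b \end{pmatrix}, \begin{pmatrix} \Sigma_a & \Sigma_a M^T \\ M\Sigma_a & \Sigma_{b|a} + M \Sigma_a M^T \end{pmatrix} \right).$$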

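Sum of identically distributed variables: For identically distributed random variables $\{Z_i\}_{i=1}^n$ with mean $\mu$, variance $\sigma^2$ and average correlation between distinct variables $\rho$, it holds that

$$\mathrm{E}\left[\frac{1}{n}\sum_{i=1}^n Z_i\right] = \mu \quad \text{and} \quad \mathrm{Var}\left(\frac{1}{n}\sum_{i=1}^n Z_i\right) = \frac{1-\rho}{n}\sigma^2 + \rho\sigma^2.$$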
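The variance expression can be checked numerically. The sketch below assumes, purely for illustration, that every pair of distinct variables has the same correlation $\rho$, builds the corresponding covariance matrix, and compares the exact variance of the average with the formula. The values of `n`, `sigma2` and `rho` are arbitrary.

```python
import numpy as np

n, sigma2, rho = 5, 2.0, 0.3  # hypothetical values

# Covariance matrix with variance sigma2 on the diagonal and
# covariance rho * sigma2 between every pair of distinct variables.
C = sigma2 * ((1 - rho) * np.eye(n) + rho * np.ones((n, n)))

w = np.full(n, 1.0 / n)   # weights of the plain average
var_exact = w @ C @ w     # Var(w^T Z) = w^T C w
var_formula = (1 - rho) / n * sigma2 + rho * sigma2
```

As $n$ grows, only the first term vanishes; the term $\rho\sigma^2$ remains, which is why averaging correlated variables reduces the variance only up to a point.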

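Linear regression and regularization:

The least-squares estimate of $\beta$ in the linear regression model $Y = \beta^T X + \varepsilon$ is given by $\hat{\beta}_{LS} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}$, where

$$\mathbf{X} = \begin{pmatrix} x_1^T \\ \vdots \\ x_N^T \end{pmatrix}, \quad \text{and} \quad \mathbf{y} = \begin{pmatrix} y_1 \\ \vdots \\ y_N \end{pmatrix}.$$

Ridge regression uses the regularization term $\lambda \|\beta\|_2^2 = \lambda \sum_{j=0}^p \beta_j^2$. The ridge regression estimate is $\hat{\beta}_{RR} = (\mathbf{X}^T \mathbf{X} + \lambda I)^{-1} \mathbf{X}^T \mathbf{y}$.

LASSO uses the regularization term $\lambda \|\beta\|_1 = \lambda \sum_{j=0}^p |\beta_j|$. (The LASSO estimate does not admit a simple closed form expression.)

For a probabilistic linear regression model with $\varepsilon \sim \mathcal{N}(0, \sigma^2)$ and prior distribution $p(\beta) = \mathcal{N}(\beta \mid \mu_0, \Sigma_0)$ the posterior distribution is $p(\beta \mid \mathbf{y}) = \mathcal{N}(\beta \mid \mu_N, \Sigma_N)$ with

$$\mu_N = \Sigma_N \left( \Sigma_0^{-1} \mu_0 + \sigma^{-2} \mathbf{X}^T \mathbf{y} \right), \qquad \Sigma_N = \left( \Sigma_0^{-1} + \sigma^{-2} \mathbf{X}^T \mathbf{X} \right)^{-1}.$$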
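The closed-form estimates above translate almost directly into NumPy. The sketch below is one possible implementation on synthetic data; the helper names (`least_squares`, `ridge`, `posterior`) and the data-generating parameters are assumptions for the example. It also illustrates a standard connection: with prior mean zero and prior covariance $(\sigma^2/\lambda) I$, the posterior mean coincides with the ridge estimate.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 50
# Synthetic data (hypothetical); the first column of ones is the intercept.
X = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + 0.1 * rng.normal(size=N)

def least_squares(X, y):
    # Solves the normal equations (X^T X) beta = X^T y.
    return np.linalg.solve(X.T @ X, X.T @ y)

def ridge(X, y, lam):
    # (X^T X + lam I)^{-1} X^T y; the intercept is penalized too,
    # matching the sum from j = 0 in the formula.
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def posterior(X, y, mu0, Sigma0, sigma2):
    # Posterior mean and covariance of beta under a Gaussian prior.
    Sigma0_inv = np.linalg.inv(Sigma0)
    SigmaN = np.linalg.inv(Sigma0_inv + X.T @ X / sigma2)
    muN = SigmaN @ (Sigma0_inv @ mu0 + X.T @ y / sigma2)
    return muN, SigmaN

beta_ls = least_squares(X, y)
beta_rr = ridge(X, y, lam=1.0)

sigma2, lam = 0.01, 1.0
muN, SigmaN = posterior(X, y, np.zeros(3), (sigma2 / lam) * np.eye(3), sigma2)
```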

Maximum likelihood: The maximum likelihood estimate is given by $\hat{\theta}_{ML} = \arg\max_\theta \ell(\theta)$ where $\ell(\theta) = \log p(\mathbf{y} \mid \theta) = \sum_{i=1}^N \log p(y_i \mid \theta)$ is the log-likelihood function (the last equality holds when the N training data points are independent).

Logistic regression: The logistic regression model uses a linear regression for the log-odds. In the binary classification context we thus have

$$\log\left( \frac{\Pr(Y = 1 \mid X)}{\Pr(Y = 0 \mid X)} \right) = \beta^T X.$$
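Solving the log-odds relation for the class probability gives the logistic (sigmoid) function. The short sketch below uses made-up coefficients and an assumed helper name `prob_class1` to show that the two views agree.

```python
import numpy as np

def prob_class1(beta, x):
    # Pr(Y = 1 | X = x) obtained by inverting the log-odds relation.
    return 1.0 / (1.0 + np.exp(-(beta @ x)))

beta = np.array([0.5, -1.0])  # hypothetical coefficients
x = np.array([1.0, 2.0])      # hypothetical input

p1 = prob_class1(beta, x)
log_odds = np.log(p1 / (1.0 - p1))  # should recover beta^T x
```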

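Discriminant Analysis: The linear discriminant analysis (LDA) classifier assigns a test input $X = x$ to the class $k$ for which

$$\hat{\delta}_k(x) = x^T \hat{\Sigma}^{-1} \hat{\mu}_k - \frac{1}{2} \hat{\mu}_k^T \hat{\Sigma}^{-1} \hat{\mu}_k + \log \hat{\pi}_k$$

is largest, where $\hat{\pi}_k = N_k / N$ and $\hat{\mu}_k = \frac{1}{N_k} \sum_{i : y_i = k} x_i$ for $k = 1, \ldots, K$, and

$$\hat{\Sigma} = \frac{1}{N - K} \sum_{k=1}^K \sum_{i : y_i = k} (x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^T.$$

For quadratic discriminant analysis (QDA), the discriminant function is instead

$$\hat{\delta}_k(x) = -\frac{1}{2} \log |\hat{\Sigma}_k| - \frac{1}{2} (x - \hat{\mu}_k)^T \hat{\Sigma}_k^{-1} (x - \hat{\mu}_k) + \log \hat{\pi}_k,$$

where $\hat{\Sigma}_k = \frac{1}{N_k - 1} \sum_{i : y_i = k} (x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^T$ for $k = 1, \ldots, K$.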
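To make the LDA expressions concrete, here is a minimal NumPy sketch that estimates the class priors, class means and pooled covariance, and then evaluates the linear discriminant for each class. The function names `lda_fit` and `lda_predict` and the toy data set are assumptions for illustration, not a reference implementation.

```python
import numpy as np

def lda_fit(X, y):
    """Estimate priors, class means and the pooled covariance (divisor N - K)."""
    classes = np.unique(y)
    N, p = X.shape
    priors, means = [], []
    Sigma = np.zeros((p, p))
    for k in classes:
        Xk = X[y == k]
        mu_k = Xk.mean(axis=0)
        priors.append(len(Xk) / N)
        means.append(mu_k)
        Sigma += (Xk - mu_k).T @ (Xk - mu_k)  # within-class scatter
    Sigma /= N - len(classes)
    return classes, np.array(priors), np.array(means), Sigma

def lda_predict(x, classes, priors, means, Sigma):
    """Return the class with the largest linear discriminant value."""
    Sinv_mu = np.linalg.solve(Sigma, means.T)  # column k holds Sigma^{-1} mu_k
    deltas = x @ Sinv_mu - 0.5 * np.sum(means.T * Sinv_mu, axis=0) + np.log(priors)
    return classes[np.argmax(deltas)]

# Two well-separated toy classes (hypothetical data).
X = np.array([[0, 0], [1, 0], [0, 1], [5, 5], [6, 5], [5, 6]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])
model = lda_fit(X, y)
```

A QDA variant would keep one covariance estimate per class (with divisor $N_k - 1$) and add the log-determinant term.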

Loss functions for classification: Misclassification loss: $I(y \neq G(x))$. Exponential loss: $\exp(-y\,C(x))$ where $G(x) = \mathrm{sign}(C(x))$.

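Gaussian processes: If the function $f$ is distributed according to a Gaussian process, $f \sim \mathcal{GP}(m, k)$, it holds that for any arbitrary selection of inputs $\{x^{(1)}, x^{(2)}, \ldots, x^{(n)}\}$ the output values $\{f(x^{(1)}), f(x^{(2)}), \ldots, f(x^{(n)})\}$ are jointly Gaussian,

$$\begin{pmatrix} f(x^{(1)}) \\ \vdots \\ f(x^{(n)}) \end{pmatrix} \sim \mathcal{N}\left( \begin{pmatrix} m(x^{(1)}) \\ \vdots \\ m(x^{(n)}) \end{pmatrix}, \begin{pmatrix} k(x^{(1)}, x^{(1)}) & \cdots & k(x^{(1)}, x^{(n)}) \\ \vdots & \ddots & \vdots \\ k(x^{(n)}, x^{(1)}) & \cdots & k(x^{(n)}, x^{(n)}) \end{pmatrix} \right).$$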

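For a Gaussian process regression model $Y = f(X) + \varepsilon$ with $f \sim \mathcal{GP}(m, k)$ and $\varepsilon \sim \mathcal{N}(0, \sigma^2)$, the prediction model is given by

$$f(x_\star) \mid \mathbf{y} \sim \mathcal{N}\left( m(x_\star) + s^T(\mathbf{y} - m(\mathbf{X})),\; k(x_\star, x_\star) - s^T k(\mathbf{X}, x_\star) \right),$$

where $s^T = k(x_\star, \mathbf{X})\,(k(\mathbf{X}, \mathbf{X}) + \sigma^2 I_N)^{-1}$.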
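The prediction formulas can be sketched in a few lines of NumPy. The example below assumes scalar inputs, a zero mean function and a squared-exponential kernel; none of these choices comes from the exam, and the names `sq_exp_kernel` and `gp_predict` are made up for the illustration.

```python
import numpy as np

def sq_exp_kernel(A, B, ell=1.0):
    # Squared-exponential kernel for 1-D inputs (an assumed choice of k).
    d2 = (A[:, None] - B[None, :]) ** 2
    return np.exp(-0.5 * d2 / ell**2)

def gp_predict(X, y, x_star, sigma2, m=lambda x: np.zeros_like(x)):
    K = sq_exp_kernel(X, X)
    k_star = sq_exp_kernel(X, x_star)  # k(X, x*), shape (n, n_star)
    # s^T = k(x*, X) (K + sigma2 I)^{-1}, computed via a linear solve
    # (the matrix is symmetric, so transposing the solution is valid).
    sT = np.linalg.solve(K + sigma2 * np.eye(len(X)), k_star).T
    mean = m(x_star) + sT @ (y - m(X))
    cov = sq_exp_kernel(x_star, x_star) - sT @ k_star
    return mean, cov

X = np.array([-1.0, 0.0, 1.0])  # hypothetical training inputs
y = np.sin(X)                   # hypothetical training outputs
mean, cov = gp_predict(X, y, np.array([0.0, 3.0]), sigma2=1e-6)
```

With almost no noise, the predictive variance is close to zero at the training input 0 and close to the prior variance at the distant input 3.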


1. This problem is composed of 10 true-or-false statements. You only have to classify these as either true or false. For this problem (only!) no motivation is required. Each correct answer scores 1 point and each incorrect answer scores -1 point (capped at 0 for the whole problem).

i. Regression problems have only quantitative inputs.

ii. The following (so-called probit) classifier is linear:

$$G(x) = I(\Phi(\beta^T x) > 0.2)$$

where $\Phi(x) = \int_{-\infty}^{x} \mathcal{N}(z \mid 0, 1)\, dz$ is the cumulative distribution function of the standard Gaussian distribution, $I(\cdot)$ is the indicator function, and the class labels are 0 and 1.

iii. Ensemble methods can be used to reduce both bias and variance (compared to the base model used).

iv. The input partitioning shown in Figure 1 could not have been generated by recursive binary splitting.

[Figure 1: Input partitioning for Problem 1.iv. Axes: X1 and X2.]

v. The model $Y = \beta_0 + \beta_1 X_1 + \beta_0 \beta_1 X_2 + \varepsilon$ (where $\beta_0$ and $\beta_1$ are model parameters) is a linear regression model.

vi. Over-fitting for a k-NN classifier occurs when there is too much data, so that the k-neighborhoods become too localized in the input space.

vii. Maximum likelihood estimation is another word for least-squares estimation.

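viii. The mean-squared-error (in $f(X; T)$ w.r.t. $f(X)$) can be decomposed into the sum of the squared model bias and the model variance, i.e.

$$\mathrm{E}\left[ (f(X; T) - f(X))^2 \right] = \left( \mathrm{E}\left[ f(X; T) - f(X) \right] \right)^2 + \mathrm{E}\left[ \left( f(X; T) - \mathrm{E}[f(X; T)] \right)^2 \right].$$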

ix. AdaBoost can use logistic regression as a base classifier.

x. The k-NN classifier is sensitive to the scaling of the input variables.

(10p)


2. A large Swedish retailer of alcoholic beverages has asked you to build a model for predicting the price of a wine based on various types of information, such as the wine's chemical composition, region, etc. They have collected a database with their data, containing the following columns:

id: a unique identification number for each row of the database, specified as an integer value

grape: one out of 6 different grapes (e.g. Syrah, Zinfandel, Merlot, etc.)

alcohol: percentage of alcohol in the wine, specified as a real number between 0 and 1 (e.g. 0.135 for 13.5%)

year: production year of the wine, specified as an integer in the range 2010–2016

region: one out of 78 different wine regions (e.g. Bordeaux, Burgundy, Rhône, etc.)

proline: concentration of the amino acid proline in mg/l, specified as an integer (typically in the range 200–2000)

timestamp: a time stamp specifying the time when the row was entered into the database, in the format YYYY-MM-DD HH:MM:SS

price: the current price of the wine in SEK, specified as an integer (typically in the range 80–400)

(a) The customer wants to try a simple model first, like a linear regression or a logistic regression. Which one of these two methods do you suggest? (1p)

(b) For each column of the customer's database as listed above, specify whether you would consider that variable as an input of the model, an output of the model, or neither. (4p)

(c) For each of the inputs and outputs of your model (from the previous question), specify whether that variable is best viewed as quantitative or qualitative. (3p)

(d) The customer has previously tried to use a CART (Classification And Regression Trees) model for this problem, but ran into problems when trying to learn this model. In fact, they were not able to train the CART model at all! The issue was traced to the variable region. Considering specifically this variable, the splitting criterion they considered was to compute

$$\hat{I} = \arg\min_{I \subset R} \sum_{i : x_i \in I} L(y_i, c_1) + \sum_{i : x_i \notin I} L(y_i, c_2)$$

where $L$ is a given loss function and $c_1$ and $c_2$ are constant predictions for the two regions obtained in the split. Furthermore, $R$ is the set of all possible values for region, i.e.

$$R = \{\text{Bordeaux}, \text{Burgundy}, \text{Rhône}, \ldots\}$$

and the op