Linear Regression - University of California, Irvine


Page 1: Linear Regression - University of California, Irvine

Linear Regression

PROF XIAOHUI XIE, SPRING 2019

CS 273P Machine Learning and Data Mining

Slides courtesy of Alex Ihler

Page 2: Linear Regression - University of California, Irvine

Machine Learning

Linear Regression via Least Squares

Gradient Descent Algorithms

Direct Minimization of Squared Error

Regression with Non-linear Features

Bias, Variance, & Validation

Regularized Linear Regression

Page 3: Linear Regression - University of California, Irvine

Supervised learning

• Notation
– Features x
– Targets y
– Predictions ŷ = f(x ; θ)
– Parameters θ

Program (“Learner”)

Characterized by some “parameters” θ

Procedure (using θ) that outputs a prediction

Training data (examples)

Features

Learning algorithm

Change θ / Improve performance

Feedback / Target values Score performance

(“cost function”)

“predict”

“train”

Page 4: Linear Regression - University of California, Irvine

Linear regression

• Define form of function f(x) explicitly
• Find a good f(x) within that family

[Plot: target y vs. feature x, with a fitted line]

“Predictor”: Evaluate line:
    r = θ0 + θ1 x1
    return r

Page 5: Linear Regression - University of California, Irvine

Notation

Define “feature” x0 = 1 (constant). Then

ŷ = f(x ; θ) = θ0 x0 + θ1 x1 + … + θn xn = θ · x

Page 6: Linear Regression - University of California, Irvine

Supervised learning

• Notation
– Features x
– Targets y
– Predictions ŷ = f(x ; θ)
– Parameters θ

Program (“Learner”)

Characterized by some “parameters” θ

Procedure (using θ) that outputs a prediction

Training data (examples)

Features

Learning algorithm

Change θ / Improve performance

Feedback / Target values Score performance

(“cost function”)

“predict”

“train”

Page 7: Linear Regression - University of California, Irvine

Measuring error

[Plot: a fitted line with one observation marked, its prediction on the line, and the error or “residual” between them]

Page 8: Linear Regression - University of California, Irvine

Mean squared error
• How can we quantify the error? The mean squared error (MSE):

    J(θ) = (1/m) Σ_j ( y(j) - θ · x(j) )²

• Could choose something else, of course…
– Computationally convenient (more later)
– Measures the variance of the residuals
– Corresponds to likelihood under a Gaussian model of the “noise”

Page 9: Linear Regression - University of California, Irvine

MSE cost function
• Rewrite using matrix form: J(θ) = (1/m) eᵀ e, where e = y - X θᵀ is the vector of residuals

# Python / NumPy:
e = Y - X.dot( theta.T )
J = e.T.dot( e ) / m       # = np.mean( e ** 2 )
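As a quick check of the matrix form above, here is a small self-contained NumPy sketch; the toy data and parameter values are made up for illustration.

import numpy as np

m = 5
X = np.column_stack([np.ones(m), np.array([1., 2., 3., 4., 5.])])   # x0 = 1 plus one feature
Y = np.array([[2.1], [3.9], [6.2], [8.1], [9.8]])                    # m x 1 targets
theta = np.array([[0.5, 2.0]])                                       # 1 x n parameter row vector

e = Y - X.dot(theta.T)               # m x 1 residuals
J = e.T.dot(e).item() / m            # mean squared error; same as np.mean(e ** 2)
print(J)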

Page 10: Linear Regression - University of California, Irvine

Supervised learning

• Notation
– Features x
– Targets y
– Predictions ŷ = f(x ; θ)
– Parameters θ

Program (“Learner”)

Characterized by some “parameters” θ

Procedure (using θ) that outputs a prediction

Training data (examples)

Features

Learning algorithm

Change θ / Improve performance

Feedback / Target values Score performance

(“cost function”)

“predict”

“train”

Page 11: Linear Regression - University of California, Irvine

Visualizing the cost function

[Plots: the cost surface J(θ) over (θ0, θ1), its contour plot, and the fitted lines corresponding to several (θ0, θ1) settings]

Page 12: Linear Regression - University of California, Irvine

Finding good parameters
• Want to find parameters which minimize our error…
• Think of a cost “surface”: error residual for that θ…

Page 13: Linear Regression - University of California, Irvine

Machine Learning

Gradient Descent Algorithms

Linear Regression via Least Squares

Direct Minimization of Squared Error

Regression with Non-linear Features

Bias, Variance, & Validation

Regularized Linear Regression

Page 14: Linear Regression - University of California, Irvine

Gradient descent


• How to change θ to improve J(θ)?

• Choose a direction in which J(θ) is decreasing

Page 15: Linear Regression - University of California, Irvine

• How to change θ to improve J(θ)?

• Choose a direction in which J(θ) is decreasing

• Derivative

• Positive => increasing
• Negative => decreasing

Gradient descent

Page 16: Linear Regression - University of California, Irvine

Gradient descent in more dimensions

• Gradient vector: ∇θ J(θ) = [ ∂J/∂θ0 , ∂J/∂θ1 , … ]

Indicates direction of steepest ascent (negative = steepest descent)

Page 17: Linear Regression - University of California, Irvine

Gradient descent
• Initialization
• Step size α
– Can change over iterations
• Gradient direction
• Stopping condition

Initialize θ
Do {
    θ ← θ - α ∇θ J(θ)
} while (α ||∇θ J|| > ε)

Page 18: Linear Regression - University of California, Irvine

Gradient for the MSE
• MSE: J(θ) = (1/m) Σ_j ( y(j) - θ · x(j) )²
• ∇θ J = ?

Page 19: Linear Regression - University of California, Irvine

Gradient for the MSE
• MSE: J(θ) = (1/m) Σ_j ( y(j) - θ · x(j) )²
• ∇θ J: ∂J/∂θi = -(2/m) Σ_j ( y(j) - θ · x(j) ) xi(j)

Page 20: Linear Regression - University of California, Irvine

Gradient descent
• Initialization
• Step size α
– Can change over iterations
• Gradient direction
• Stopping condition

Initialize θ
Do {
    θ ← θ - α ∇θ J(θ)
} while (α ||∇θ J|| > ε)

In the MSE gradient, the residual ( y(j) - θ · x(j) ) gives the error magnitude & direction for datum j, and x(j) gives the sensitivity to each parameter.

Page 21: Linear Regression - University of California, Irvine

Derivative of MSE
• Rewrite using matrix form: ∇θ J = -(2/m) (y - X θᵀ)ᵀ X
– The residual term gives the error magnitude & direction for each datum j; the features give the sensitivity to each θi

e = Y - X.dot( theta.T )       # error residual
DJ = - e.T.dot( X ) * 2.0/m    # compute the gradient
theta -= alpha * DJ            # take a step
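Putting the pieces from this slide into a full batch gradient descent loop gives something like the sketch below; the toy data, step size, and iteration count are illustrative, not from the slides.

import numpy as np

m = 20
x = np.linspace(0, 10, m)
X = np.column_stack([np.ones(m), x])                 # constant feature x0 = 1
Y = (3.0 + 2.0 * x + np.random.randn(m)).reshape(m, 1)

theta = np.zeros((1, X.shape[1]))                    # 1 x n row vector
alpha = 0.02                                         # step size

for it in range(2000):
    e = Y - X.dot(theta.T)                           # error residuals (m x 1)
    DJ = -2.0 / m * e.T.dot(X)                       # gradient of the MSE (1 x n)
    theta -= alpha * DJ                              # take a step
print(theta)                                         # roughly [3, 2] for this toy data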

Page 22: Linear Regression - University of California, Irvine

Gradient descent on cost function

[Plots: the descent path on the contours of J(θ0, θ1), and the corresponding regression line on the data at each step]

Page 23: Linear Regression - University of California, Irvine

Comments on gradient descent
• Very general algorithm
– We’ll see it many times
• Local minima
– Sensitive to starting point

Page 24: Linear Regression - University of California, Irvine

Comments on gradient descent
• Very general algorithm
– We’ll see it many times
• Local minima
– Sensitive to starting point
• Step size
– Too large? Too small? Automatic ways to choose?
– May want step size to decrease with iteration
– Common choices:
• Fixed
• Linear: C/(iteration)
• Line search / backoff (Armijo, etc.)
• Newton’s method

Page 25: Linear Regression - University of California, Irvine

Newton’s method
• Want to find the roots of f(x)
– “Root”: value of x for which f(x) = 0
• Initialize to some point x
• Compute the tangent at x & compute where it crosses the x-axis: x ← x - f(x) / f′(x)
• Optimization: find roots of ∇J(θ)
– Does not always converge; sometimes unstable
– If converges, usually very fast
– Works well for smooth, non-pathological functions, locally quadratic
– For n large, may be computationally hard: O(n²) storage, O(n³) time

(Multivariate: ∇J(θ) = gradient vector, ∇²J(θ) = matrix of 2nd derivatives, a/b = a b⁻¹, matrix inverse)

(“Step size” = 1/∇²J ; inverse curvature)
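A one-dimensional sketch of the root-finding step described above (the example function is mine, not from the slides); to use it for optimization, apply it to ∇J, i.e., pass J′ and J″.

def newton(f, fprime, x0, tol=1e-8, max_iters=50):
    """Find a root of f by repeatedly jumping to where the tangent crosses zero."""
    x = x0
    for _ in range(max_iters):
        step = f(x) / fprime(x)      # tangent at x hits zero at x - f(x)/f'(x)
        x = x - step
        if abs(step) < tol:
            break
    return x

# Example: the root of f(x) = x**2 - 2 is sqrt(2)
print(newton(lambda x: x * x - 2.0, lambda x: 2.0 * x, x0=1.0))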

Page 26: Linear Regression - University of California, Irvine

Stochastic / Online gradient descent

• MSE: J(θ) = (1/m) Σ_j J_j(θ), where J_j(θ) = ( y(j) - θ · x(j) )²
• Gradient: ∇θ J(θ) = (1/m) Σ_j ∇θ J_j(θ)
• Stochastic (or “online”) gradient descent:
– Use updates based on an individual datum j, chosen at random
– At optima, ∇θ J(θ) = (1/m) Σ_j ∇θ J_j(θ) = 0 (average over the data), though the individual ∇θ J_j need not be zero

Page 27: Linear Regression - University of California, Irvine

Online gradient descent
• Update based on one datum, and its residual, at a time

[Plots: the per-datum updates on the contours of J(θ0, θ1), and the corresponding regression line on the data]

Initialize θ
Do {
    for j=1:m
        θ ← θ - α ∇θ J_j(θ)
} while (not done)

Pages 28–32: Linear Regression - University of California, Irvine

Online gradient descent (continued)

[Animation frames: successive online gradient descent updates shown on the same plots, using the same pseudocode as Page 27]

Page 33: Linear Regression - University of California, Irvine

Online gradient descent
• Benefits
– Lots of data = many more updates per pass
– Computationally faster
• Disadvantages
– No longer strictly “descent”
– Stopping conditions may be harder to evaluate (can use “running estimates” of J(·), etc.)

Initialize θ
Do {
    for j=1:m
        θ ← θ - α ∇θ J_j(θ)
} while (not done)
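A minimal NumPy version of this pseudocode for the MSE, one datum per update; the toy data, step size, and number of passes are illustrative assumptions.

import numpy as np

m = 50
x = np.linspace(0, 10, m)
X = np.column_stack([np.ones(m), x])                 # x0 = 1 plus one feature
Y = 3.0 + 2.0 * x + np.random.randn(m)

theta = np.zeros(2)
alpha = 0.005
for epoch in range(200):                             # "while (not done)"
    for j in np.random.permutation(m):               # for j = 1:m, in random order
        ej = Y[j] - X[j].dot(theta)                  # residual for datum j
        theta += alpha * 2.0 * ej * X[j]             # step along -grad of J_j
print(theta)                                         # roughly [3, 2]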

Page 34: Linear Regression - University of California, Irvine

Machine Learning

Direct Minimization of Squared Error

Linear Regression via Least Squares

Gradient Descent Algorithms

Regression with Non-linear Features

Bias, Variance, & Validation

Regularized Linear Regression

Page 35: Linear Regression - University of California, Irvine

MSE Minimum
• Consider a simple problem
– One feature, two data points
– Two unknowns: θ0, θ1
– Two equations:
    y(1) = θ0 + θ1 x(1)
    y(2) = θ0 + θ1 x(2)
• Can solve this system directly (two equations, two unknowns)
• However, most of the time, m > n
– There may be no linear function that hits all the data exactly
– Instead, solve directly for the minimum of the MSE function

Page 36: Linear Regression - University of California, Irvine

MSE Minimum
• Simplify with some algebra: setting ∇θ J = 0 gives θ = yᵀ X (XᵀX)⁻¹
• X (XᵀX)⁻¹ is called the “pseudo-inverse” (of Xᵀ)
• If Xᵀ is square and full rank, this is exactly the inverse
• If m > n: overdetermined; gives the minimum-MSE fit

Page 37: Linear Regression - University of California, Irvine

Matlab MSE
• This is easy to solve in Matlab…

% y = [y1 ; … ; ym]
% X = [x1_0 … x1_n ; x2_0 … x2_n ; …]

% Solution 1: “manual”
th = y' * X * inv(X' * X);

% Solution 2: “mrdivide”
th = y' / X';    % th * X' = y'  =>  th = y' / X'

Page 38: Linear Regression - University of California, Irvine

Python MSE
• This is easy to solve in Python / NumPy…

# y = np.matrix( [[y1], … , [ym]] )
# X = np.matrix( [[x1_0, … , x1_n], [x2_0, … , x2_n], …] )

# Solution 1: “manual”
th = y.T * X * np.linalg.inv(X.T * X)

# Solution 2: “least squares solve” (lstsq returns a tuple; the solution is its first element)
th = np.linalg.lstsq(X, y)[0]
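A usage sketch with plain ndarrays rather than np.matrix (my choice, since np.matrix is discouraged in current NumPy); it compares the manual pseudo-inverse solution with np.linalg.lstsq on toy data.

import numpy as np

x = np.array([1., 2., 3., 4., 5.])
X = np.column_stack([np.ones_like(x), x])            # constant feature x0 = 1
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# "Manual" pseudo-inverse solution:  theta = y^T X (X^T X)^{-1}
th1 = y.dot(X).dot(np.linalg.inv(X.T.dot(X)))

# Least-squares solver (returns solution, residuals, rank, singular values):
th2, *_ = np.linalg.lstsq(X, y, rcond=None)

print(th1, th2)                                      # the two solutions agree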

Page 39: Linear Regression - University of California, Irvine

Normal equations

∇θ J = 0  =>  (y - X θᵀ)ᵀ X = 0

• Interpretation:
– (y - X θᵀ) = (y - ŷ) is the vector of errors on each example
– X contains the features we have to work with for each example
– Dot product = 0: orthogonal

Page 40: Linear Regression - University of California, Irvine

Normal equations

∇θ J = 0  =>  (y - X θᵀ)ᵀ X = 0

• Interpretation:
– (y - X θᵀ) = (y - ŷ) is the vector of errors on each example
– X contains the features we have to work with for each example
– Dot product = 0: orthogonal

• Example:
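A small numerical check of this interpretation (toy data of my own): at the least-squares solution, the residual vector is orthogonal to every column of X.

import numpy as np

x = np.array([1., 2., 3., 4., 5.])
X = np.column_stack([np.ones_like(x), x])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

theta, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X.dot(theta)             # residuals, y - yhat
print(e.dot(X))                  # approximately [0, 0]: residuals orthogonal to the features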

Page 41: Linear Regression - University of California, Irvine

Effects of MSE choice
• Sensitivity to outliers
– 16² cost for this one datum
– Heavy penalty for large errors

[Plots: data with one outlying point and the resulting least-squares fit]

Page 42: Linear Regression - University of California, Irvine

L1 error: Mean Absolute Error

J(θ) = (1/m) Σ_j | y(j) - θ · x(j) |

[Plot: fits to the same data: L2 on the original data, L1 on the original data, and L1 on the outlier data]

Page 43: Linear Regression - University of California, Irvine

Cost functions for regression

(MSE)  J(θ) = (1/m) Σ_j ( y(j) - ŷ(j) )²
(MAE)  J(θ) = (1/m) Σ_j | y(j) - ŷ(j) |
(???)  Something else entirely…

Arbitrary cost functions cannot be solved in closed form; use gradient descent
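For example, the MAE has no closed-form minimizer, but (sub)gradient descent still applies: the derivative of |y - θ·x| with respect to θ is -sign(residual)·x wherever it is defined. A rough sketch (toy data, step-size schedule, and iteration count are my own choices):

import numpy as np

x = np.array([1., 2., 3., 4., 5., 6.])
X = np.column_stack([np.ones_like(x), x])
y = np.array([2., 4., 6., 8., 10., 40.])             # the last point is an outlier

theta = np.zeros(2)
for it in range(20000):
    e = y - X.dot(theta)
    grad = -np.sign(e).dot(X) / len(y)               # subgradient of the MAE
    theta -= 0.1 / np.sqrt(it + 1.0) * grad          # diminishing step size
print(theta)                                         # roughly the inlier line: intercept ~0, slope ~2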

Page 44: Linear Regression - University of California, Irvine

Machine Learning

Regression with Non-linear Features

Linear Regression via Least Squares

Gradient Descent Algorithms

Direct Minimization of Squared Error

Bias, Variance, & Validation

Regularized Linear Regression

Page 45: Linear Regression - University of California, Irvine

More dimensions?

[3-D plots: target y against two features x1 and x2, with a fitted plane]

Page 46: Linear Regression - University of California, Irvine

Nonlinear functions
• What if our hypotheses are not lines?
– Ex: higher-order polynomials

Page 47: Linear Regression - University of California, Irvine

Nonlinear functions
• Single feature x, predict target y:
    ŷ = θ0 + θ1 x + θ2 x² + θ3 x³
• Sometimes useful to think of a “feature transform” Φ:
    Add features: Φ(x) = [1, x, x², x³]
    Linear regression in the new features: ŷ = θ · Φ(x)
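A sketch of this feature-transform view in NumPy: build Φ(x) = [1, x, x², x³] and then solve an ordinary linear least-squares problem in the new features (the data and degree are illustrative).

import numpy as np

def poly_features(x, degree):
    """Stack columns [1, x, x**2, ..., x**degree]."""
    return np.column_stack([x ** p for p in range(degree + 1)])

x = np.linspace(-1, 1, 30)
y = np.sin(3 * x) + 0.1 * np.random.randn(30)

Phi = poly_features(x, degree=3)                     # feature transform
theta, *_ = np.linalg.lstsq(Phi, y, rcond=None)      # linear regression in the new features
yhat = Phi.dot(theta)                                # predictions are still linear in theta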

Page 48: Linear Regression - University of California, Irvine

Higher-order polynomials
• Fit in the same way
• More “features”

Page 49: Linear Regression - University of California, Irvine

Features
• In general, can use any features we think are useful
• Other information about the problem
– Anything you can encode as fixed-length vectors of numbers
• Polynomial functions
– Features [1, x, x², x³, …]
• Other functions
– 1/x, sqrt(x), x1 * x2, …
• “Linear regression” = linear in the parameters
– Features we can make as complex as we want!

Page 50: Linear Regression - University of California, Irvine

Higher-order polynomials
• Are more features better?
• “Nested” hypotheses
– 2nd order more general than 1st,
– 3rd order more general than 2nd, …
• Fits the observed data better

Page 51: Linear Regression - University of California, Irvine

Overfitting and complexity
• More complex models will always fit the training data better
• But they may “overfit” the training data, learning complex relationships that are not really present

[Plots: the same data fit by a simple model and by a complex model]

Page 52: Linear Regression - University of California, Irvine

Test data
• After training the model
• Go out and get more data from the world
– New observations (x, y)
• How well does our model perform?

Page 53: Linear Regression - University of California, Irvine

Training versus test error
• Plot MSE as a function of model complexity
– Polynomial order
• Training error decreases
– More complex function fits the training data better
• What about new, “test” data?
• 0th to 1st order
– Error decreases
– Underfitting
• Higher order
– Error increases
– Overfitting

[Plot: mean squared error vs. polynomial order, for the training data and for new “test” data]

Page 54: Linear Regression - University of California, Irvine

Machine Learning

Bias, Variance, & Validation

Linear Regression via Least Squares

Gradient Descent Algorithms

Direct Minimization of Squared Error

Regression with Non-linear Features

Regularized Linear Regression

Page 55: Linear Regression - University of California, Irvine

Inductive bias
• The assumptions needed to predict examples we haven’t seen
• Makes us “prefer” one model over another
• Polynomial functions; smooth functions; etc.
• Some bias is necessary for learning!

[Plots: the same data fit by a simple model and by a complex model]

Page 56: Linear Regression - University of California, Irvine

Bias & variance

[Figure: “the world”, the data we observe, and three different possible data sets]

Page 57: Linear Regression - University of California, Irvine

Bias & variance

[Figure: “the world”, the data we observe, and three different possible data sets]

Each would give different predictors for any polynomial degree:

Page 58: Linear Regression - University of California, Irvine

Detecting overfitting
• Overfitting effect
– Do better on training data than on future data
– Need to choose the “right” complexity
• One solution: “hold-out” data
• Separate our data into two sets
– Training
– Test
• Learn only on the training data
• Use the test data to estimate generalization quality
– Model selection
• All good competitions use this formulation
– Often multiple splits: one by the judges, then another by you

Page 59: Linear Regression - University of California, Irvine

Model selection
• Which of these models fits the data best?
– p=0 (constant); p=1 (linear); p=3 (cubic); …
• Or, should we use KNN? Other methods?
• Model selection problem
– Can’t use training data to decide (esp. if models are nested!)
• Want to estimate the expected loss on new, unseen data

[Plots: fits with p=0, p=1, p=3]

J = loss function (MSE)
D = training data set

Page 60: Linear Regression - University of California, Irvine

Hold-out method
• Validation data
– “Hold out” some data for evaluation (e.g., 70/30 split)
– Train only on the remainder
• Some problems, if we have few data:
– Few data in hold-out: noisy estimate of the error
– More hold-out data leaves less for training!

Data (x(i), y(i)): (88, 79), (32, -2), (27, 30), (68, 73), (7, -16), (20, 43), (53, 77), (17, 16), (87, 94)

[Figure: the data split into training data and held-out validation data; the model is trained on the first part and scored on the second]

MSE = 331.8
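A hold-out sketch in NumPy using the data above; the random split, the 70/30 fraction, and the polynomial degree being scored are illustrative choices, so the resulting MSE will not match the slide's number exactly.

import numpy as np

def holdout_mse(x, y, degree, frac_train=0.7, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    ntr = int(frac_train * len(y))
    tr, va = idx[:ntr], idx[ntr:]                    # training / validation indices

    Phi = np.column_stack([x ** p for p in range(degree + 1)])
    theta, *_ = np.linalg.lstsq(Phi[tr], y[tr], rcond=None)   # train on one part
    e = y[va] - Phi[va].dot(theta)                            # score on the held-out part
    return np.mean(e ** 2)

x = np.array([88., 32., 27., 68., 7., 20., 53., 17., 87.])
y = np.array([79., -2., 30., 73., -16., 43., 77., 16., 94.])
print(holdout_mse(x, y, degree=1))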

Page 61: Linear Regression - University of California, Irvine

Cross-validation method
• K-fold cross-validation
– Divide data into K disjoint sets
– Hold out one set (= M / K data) for evaluation
– Train on the others (= M*(K-1) / K data)

Data (x(i), y(i)): (88, 79), (32, -2), (27, 30), (68, 73), (7, -16), (20, 43), (53, 77), (17, 16), (87, 94)

[Figure: the data split three ways, each fold held out once]

Split 1: MSE = 331.8
Split 2: MSE = 361.2
Split 3: MSE = 669.8

3-Fold X-Val MSE = 464.1

Page 62: Linear Regression - University of California, Irvine

Cross-validation method
• K-fold cross-validation
– Divide data into K disjoint sets
– Hold out one set (= M / K data) for evaluation
– Train on the others (= M*(K-1) / K data)

Data (x(i), y(i)): (88, 79), (32, -2), (27, 30), (68, 73), (7, -16), (20, 43), (53, 77), (17, 16), (87, 94)

[Figure: the same data split three ways, each fold held out once]

Split 1: MSE = 280.5
Split 2: MSE = 3081.3
Split 3: MSE = 1640.1

3-Fold X-Val MSE = 1667.3

Page 63: Linear Regression - University of California, Irvine

Cross-validation
• Advantages:
– Lets us use more (M) validation data (= less noisy estimate of test performance)
• Disadvantages:
– More work
• Trains K models instead of just one
– Doesn’t evaluate any particular predictor
• Evaluates K different models & averages
• Scores hyperparameters / procedure, not an actual, specific predictor!
• Also: still estimating error for M’ < M data…
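A plain-NumPy K-fold sketch of the procedure described above; the fold assignment and the polynomial degree being scored are illustrative, so the numbers will differ from the slides.

import numpy as np

def kfold_mse(x, y, degree, K=3, seed=0):
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), K)        # K disjoint index sets
    Phi = np.column_stack([x ** p for p in range(degree + 1)])

    scores = []
    for k in range(K):
        va = folds[k]                                         # hold out fold k
        tr = np.hstack([folds[i] for i in range(K) if i != k])
        theta, *_ = np.linalg.lstsq(Phi[tr], y[tr], rcond=None)
        scores.append(np.mean((y[va] - Phi[va].dot(theta)) ** 2))
    return np.mean(scores), scores                            # average + per-fold MSEs

x = np.array([88., 32., 27., 68., 7., 20., 53., 17., 87.])
y = np.array([79., -2., 30., 73., -16., 43., 77., 16., 94.])
print(kfold_mse(x, y, degree=1, K=3))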

Page 64: Linear Regression - University of California, Irvine

Learning curves
• Plot performance as a function of training size
– Assess impact of fewer data on performance
– Ex: MSE0 - MSE (regression) or 1 - Err (classification)
• Few data
– More data significantly improve performance
• “Enough” data
– Performance saturates
• If the slope is high, decreasing m (for validation / cross-validation) might have a big impact…

[Plot: performance (e.g., 1 / MSE) vs. training data size m]

Page 65: Linear Regression - University of California, Irvine

Leave-one-out cross-validation
• When K = M (the number of data), we get
– Train on all data except one
– Evaluate on the left-out datum
– Repeat M times (each data point held out once) and average

Data (x(i), y(i)): (88, 79), (32, -2), (27, 30), (68, 73), (7, -16), (20, 43), (53, 77), (17, 16), (87, 94)

[Figure: each datum held out in turn]

MSE = …
MSE = …
LOO X-Val MSE = …

Page 66: Linear Regression - University of California, Irvine

Cross-validation Issues
• Need to balance:
– Computational burden (multiple trainings)
– Accuracy of estimated performance / error
• Single hold-out set:
– Estimates performance with M’ < M data (important? learning curve?)
– Need enough data to trust the performance estimate
– Estimates performance of a particular, trained learner
• K-fold cross-validation
– K times as much work, computationally
– Better estimates, still of performance with M’ < M data
• Leave-one-out cross-validation
– M times as much work, computationally
– M’ = M-1, but overall error estimate may have high variance

Page 67: Linear Regression - University of California, Irvine

Machine Learning

Regularized Linear Regression

Linear Regression via Least Squares

Gradient Descent Algorithms

Direct Minimization of Squared Error

Regression with Non-linear Features

Bias, Variance, & Validation

Page 68: Linear Regression - University of California, Irvine

What to do about under/overfitting?
• Ways to increase complexity?
– Add features, parameters
– We’ll see more…
• Ways to decrease complexity?
– Remove features (“feature selection”)
– “Fail to fully memorize data”
• Partial training
• Regularization

[Plot: predictive error vs. model complexity, showing error on training data, error on test data, the ideal range for model complexity, and the underfitting / overfitting regimes]

Page 69: Linear Regression - University of California, Irvine

Linear regression

• Linear model, two data points
• Quadratic model, two data points?
– Infinitely many settings of θ with zero error
– How to choose among them?
• Higher-order coefficients = 0?
– Uses knowledge of where the features came from…
• Could choose, e.g., the setting with minimum magnitude ||θ||
• A type of bias: tells us which models to prefer

Page 70: Linear Regression - University of California, Irvine

0th Order Polynomial

N=0

Page 71: Linear Regression - University of California, Irvine

1st Order Polynomial

N=1

Page 72: Linear Regression - University of California, Irvine

3rd Order Polynomial

N=3

Page 73: Linear Regression - University of California, Irvine

9th Order Polynomial

N=9

Page 74: Linear Regression - University of California, Irvine

Estimated Polynomial Coefficients

[Table: estimated regression coefficients θ for polynomial orders N = 0, 1, 3, 9]

Page 75: Linear Regression - University of California, Irvine

Regularization
• Can modify our cost function J to add a “preference” for certain parameter values, e.g. an L2 penalty (“ridge regression”):

    J(θ) = MSE(θ) + α Σ_i θi²

• New solution (derive the same way)
– Problem is now well-posed for any degree
• Notes:
– “Shrinks” the parameters toward zero
– Alpha large: we prefer small θ to small MSE
– Regularization term is independent of the data: paying more attention to it reduces our model variance
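A ridge-regression sketch. Here I take the data term as the unaveraged sum of squared errors, which gives the clean closed form θ = yᵀX(XᵀX + αI)⁻¹; with a mean-squared-error data term the penalty inside the inverse effectively scales by m. The toy data and α values are illustrative.

import numpy as np

def ridge_fit(X, y, alpha):
    """Minimize ||y - X theta^T||^2 + alpha * ||theta||^2 (bias penalized too, for simplicity)."""
    n = X.shape[1]
    return y.dot(X).dot(np.linalg.inv(X.T.dot(X) + alpha * np.eye(n)))

x = np.array([1., 2., 3., 4., 5.])
X = np.column_stack([np.ones_like(x), x])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

print(ridge_fit(X, y, alpha=0.0))        # alpha = 0: ordinary least squares
print(ridge_fit(X, y, alpha=10.0))       # larger alpha shrinks the parameters toward zero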

Page 76: Linear Regression - University of California, Irvine

Regression: Zero Regularization

Page 77: Linear Regression - University of California, Irvine

Regression: Moderate Regularization

Page 78: Linear Regression - University of California, Irvine

Regression: Big Regularization

Page 79: Linear Regression - University of California, Irvine

Impact of Regularization Parameter

[Plot: error as a function of ln(α)]

Page 80: Linear Regression - University of California, Irvine

Estimated Polynomial Coefficients

[Table: estimated regression coefficients θ for α = 0 (zero), α medium, and α big]

Page 81: Linear Regression - University of California, Irvine

Regularization
• Compare between unregularized & regularized results

[Plots: fits with α = 0 (unregularized) and α = 1]

Page 82: Linear Regression - University of California, Irvine

Different regularization functions
• More generally, for the Lp regularizer: R(θ) = Σ_i |θi|^p
– p = 2: quadratic (ridge)
– p = 1: “lasso”
– L0 = limit as p → 0 : “number of nonzero weights”, a natural notion of complexity
– L∞ = limit as p → ∞ : “maximum parameter value”

[Figure: isosurfaces ||θ||p = constant for p = 0.5, 1, 2, 4]

Page 83: Linear Regression - University of California, Irvine

Different regularization functions• More generally, for the Lp regularizer:

[Figure: p = 2, p = 1, p = 0.3]

Page 84: Linear Regression - University of California, Irvine

Regularization: L2 vs L1
• Estimate balances data term & regularization term

Minimizes data term

Minimizes regularization

Minimizes combination

Page 85: Linear Regression - University of California, Irvine

Regularization: L2 vs L1
• Estimate balances data term & regularization term
• Lasso tends to generate sparser solutions than a quadratic regularizer.

Data term only: all 𝜃i non-zero

Regularized estimate: some 𝜃i may be zero

Page 86: Linear Regression - University of California, Irvine

Gradient-Based Optimization

Laplacian prior: L1 regularization (lasso regression)
Gaussian prior: L2 regularization (ridge regression)

(Informal intuition: the gradient of the L1 objective is not defined at zero)

Objective Function:
Negative Gradient:

• L2 makes (all) coefficients smaller
• L1 makes (some) coefficients exactly zero: feature selection
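A sketch contrasting the two penalties inside a gradient-based loop: the data term contributes the usual MSE gradient, while the penalty adds 2αθ (L2 / ridge) or α·sign(θ) (L1 / lasso, a subgradient since the L1 term is not differentiable at zero). Data, α, step size, and iteration count are illustrative. Note that plain (sub)gradient descent only pushes small coefficients toward zero; getting exactly-zero lasso coefficients usually calls for methods such as coordinate descent or proximal updates.

import numpy as np

def fit_penalized(X, y, alpha, penalty="l2", lr=0.01, iters=5000):
    theta = np.zeros(X.shape[1])
    m = len(y)
    for _ in range(iters):
        e = y - X.dot(theta)
        grad = -2.0 / m * e.dot(X)                   # gradient of the MSE data term
        if penalty == "l2":
            grad = grad + 2.0 * alpha * theta        # ridge: shrinks all coefficients
        else:
            grad = grad + alpha * np.sign(theta)     # lasso: constant push toward zero
        theta -= lr * grad
    return theta

x = np.linspace(0, 1, 40)
X = np.column_stack([np.ones_like(x), x, x ** 2, x ** 3])
y = 1.0 + 2.0 * x + 0.05 * np.random.randn(40)

print(fit_penalized(X, y, alpha=0.1, penalty="l2"))  # all coefficients made smaller
print(fit_penalized(X, y, alpha=0.1, penalty="l1"))  # higher-order coefficients pushed toward zero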