CPSC 340: Machine Learning and Data Mining, Regularization, Fall 2015 (schmidtm/Courses/340-F15/L14.pdf)

Transcript
Page 1

CPSC 340: Machine Learning and Data Mining

Regularization

Fall 2015

Page 2

Admin

• No tutorials/class Monday (holiday).

Page 3

Radial Basis Functions

• An alternative to polynomial bases is radial basis functions (RBFs):

– Basis functions that depend on distances to training points.

– A non-parametric basis.

• The most common example is the Gaussian RBF (see the sketch after this list).

• The variance σ² controls how much nearby vs. far points contribute.

– Affects the fundamental trade-off.

• There are universal consistency results for these functions:

– In terms of bias-variance, they achieve the irreducible error as ‘n’ goes to infinity.
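The Gaussian RBF formula itself appears only as an image on the slide and is not reproduced in this transcript. As a minimal illustrative sketch (the function name, sigma value, and random data below are placeholders, not from the slides), the basis can be built from distances to the training points and then used with least squares:

    import numpy as np

    def gaussian_rbf_features(X, centers, sigma):
        # Z[i, j] = exp(-||X[i] - centers[j]||^2 / (2 * sigma^2))   (Gaussian RBF)
        sq_dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        return np.exp(-sq_dists / (2.0 * sigma ** 2))

    # Non-parametric basis: the training points themselves are the centers.
    n, d = 50, 2
    X = np.random.randn(n, d)
    y = np.random.randn(n)
    Z = gaussian_rbf_features(X, X, sigma=1.0)   # n-by-n basis matrix
    w = np.linalg.lstsq(Z, y, rcond=None)[0]     # ordinary least squares in the new basis
    y_hat = Z @ w                                # fitted values on the training data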

Page 5

Predicting the Future

• In principle, we can use any features x_i that we think are relevant.

• This makes it tempting to use time as a feature, and predict the future.

https://gravityandlevity.wordpress.com/2009/04/22/the-fastest-possible-mile/
https://overthehillsports.wordpress.com/tag/hicham-el-guerrouj/

Page 6

Predicting 100m times 500 years in the future?

https://plus.maths.org/content/sites/plus.maths.org/files/articles/2011/usain/graph2.gif

Page 7

Predicting 100m times 400 years in the future?

https://plus.maths.org/content/sites/plus.maths.org/files/articles/2011/usain/graph2.gif
http://www.washingtonpost.com/blogs/london-2012-olympics/wp/2012/08/08/report-usain-bolt-invited-to-tryout-for-manchester-united/

Page 8

No Free Lunch, Consistency, and the Future

Page 11

Ockham’s Razor vs. No Free Lunch

• Ockham’s razor is a problem-solving principle:

– “Among competing hypotheses, the one with the fewest assumptions should be selected.”

– Suggests we should select linear model.

• Fundamental theorem of ML:

– If two models have the same training error, pick the one less likely to overfit.

– A formal version of Ockham’s problem-solving principle.

– Also suggests we should select linear model.

• No free lunch theorem:

– There exist possible datasets where you should select the green model.

Page 12

No Free Lunch, Consistency, and the Future

Page 19

Application: Climate Models

• Has the Earth warmed up over the last 100 years? (Consistency zone)

– The data clearly says ‘yes’.

• Will the Earth continue to warm over the next 100 years? (Really NFL zone)

– We should be more skeptical about models that predict future events.

https://en.wikipedia.org/wiki/Global_warming

Page 20

Application: Climate Models

• So should we all become global warming skeptics?

• If we average over models that overfit in *different* ways, we expect the test error to be lower, so this gives more confidence:

– We should be skeptical of individual models, but predictions that agree across models with different data/assumptions are more likely to be true.

• If all near-future predictions agree, they are likely to be accurate.

• As we go further into the future, the variance of the average will be higher.

https://en.wikipedia.org/wiki/Global_warming
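To make the variance argument concrete, here is a small illustrative sketch (the function name, polynomial degree, and use of bootstrap resampling are my own choices, not from the slides) that averages the predictions of several models fit to different resamplings of the data:

    import numpy as np

    def averaged_predictions(x, y, x_test, n_models=50, degree=7, seed=0):
        # Fit one (possibly overfitting) polynomial per bootstrap sample,
        # then average the predictions; the average has lower variance.
        rng = np.random.default_rng(seed)
        n = len(y)
        preds = []
        for _ in range(n_models):
            idx = rng.integers(0, n, size=n)          # resampled "alternative dataset"
            coeffs = np.polyfit(x[idx], y[idx], degree)
            preds.append(np.polyval(coeffs, x_test))
        return np.mean(preds, axis=0)

Where the individual models agree, averaging changes little; where they disagree, the average smooths out their individual errors.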

Page 21

Regularization

• Ridge regression is a very common variation on least squares (its objective is written out after this list).

• The extra term is called L2-regularization:

– Objective balances getting low error vs. having small slope ‘w’.

– E.g., you allow a small increase in error if it makes slope ‘w’ much smaller.

• Regularization parameter λ > 0 controls level of regularization:

– A high λ makes the L2-norm penalty more important relative to the data-fitting term.

– Theory says the choice should be in the range O(1) to O(n^{-1/2}).

– In practice, set by validation set or cross-validation.
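The ridge objective itself appears only as an image on the slide. In the notation the course uses for least squares, it is typically written as follows (the 1/2 factors are a common convention and may differ from the slide):

    f(w) = \tfrac{1}{2}\,\|Xw - y\|^2 + \tfrac{\lambda}{2}\,\|w\|^2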

Page 22

Why use L2-Regularization?

• It’s a weird thing to do, but L2-regularization is magic.

• 6 reasons to use L2-regularization:

1. Does not require X’X to be invertible.

2. Solution ‘w’ is unique.

3. Solution ‘w’ is less sensitive to changes in X or y (like ensemble methods)

4. Makes iterative (large-scale) methods for computing ‘w’ converge faster.

5. Significant decrease in variance, and often only small increase in bias.

• This means you typically have lower test error.

6. Stein’s paradox: if d ≥ 3, ‘shrinking’ the estimate moves us closer to the ‘true’ w.

Page 25

Shrinking is Weird and Magical

• We throw darts at a target:

– Assume we don’t always hit the exact center.

– Assume the darts follow a symmetric pattern around center.

• Shrinkage of the darts:

1. Choose some arbitrary location ‘0’.

2. Measure distances from darts to ‘0’.

3. Move misses towards ‘0’, by a small amount proportional to the distances.

• On average, darts will be closer to center.
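The dart figures are not in the transcript. As one concrete, hedged illustration of the claim (using the James-Stein shrinkage factor as the ‘small amount proportional to distance’, with sigma² = 1 and an arbitrary centre; these choices are mine, not the slide’s):

    import numpy as np

    rng = np.random.default_rng(0)
    d, trials = 5, 100000                  # the effect requires d >= 3 (Stein's paradox)
    center = np.full(d, 2.0)               # true centre of the dart board (arbitrary)

    X = center + rng.standard_normal((trials, d))        # darts scattered around the centre
    shrink = np.maximum(0.0, 1.0 - (d - 2) / (X ** 2).sum(axis=1))   # James-Stein factor
    X_shrunk = shrink[:, None] * X                        # move each dart towards '0'

    print(((X - center) ** 2).sum(axis=1).mean())         # average squared distance, raw darts
    print(((X_shrunk - center) ** 2).sum(axis=1).mean())  # smaller on average after shrinking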

Page 26

Ridge Regression Calculation
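The derivation on this slide did not survive extraction. The standard calculation it works toward: setting the gradient of f(w) = (1/2)||Xw - y||² + (λ/2)||w||² to zero gives the linear system (X'X + λI)w = X'y, which has a unique solution whenever λ > 0. A minimal sketch (variable names are illustrative):

    import numpy as np

    def ridge_fit(X, y, lam):
        # Solve (X'X + lam*I) w = X'y; unique for any lam > 0, even if X'X is singular.
        d = X.shape[1]
        return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

    # Works even with more features than examples, where X'X is not invertible.
    X = np.random.randn(10, 20)
    y = np.random.randn(10)
    w = ridge_fit(X, y, lam=1.0)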

Page 28

Least Squares with Outliers

• Consider least squares problem with outliers:

• Least squares is very sensitive to outliers.

Page 30

Least Squares with Outliers

• Squaring the error shrinks small errors and magnifies large errors (e.g., an error of 0.1 becomes 0.01, while an error of 10 becomes 100):

• Outliers (large error) influence ‘w’ much more than other points.

– Good if outlier means ‘plane crashes’, bad if it means ‘data entry error’.

Page 31

Robust Regression

• Robust regression objectives put less focus on far-away points.

• For example, use absolute error:

• Now decreasing ‘small’ and ‘large’ errors is equally important.

• Norms are a nice way to write least squares vs. least absolute error:
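The formulas themselves appear only as images on the slides. Written out (up to the usual 1/2 convention), the absolute-error objective from the earlier bullet and the two norm forms are:

    f(w) = \sum_{i=1}^{n} |w^T x_i - y_i|    % absolute error (the robust objective above)
    f(w) = \|Xw - y\|_2^2                    % least squares: sum of squared errors
    f(w) = \|Xw - y\|_1                      % least absolute error: the same robust objective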

Page 32

Summary

• Predicting the future is hard; ensemble predictions are more reliable.

• Regularization improves test error because it is magic.

• Outliers can cause least squares to perform poorly.

• Robust regression using the L1-norm is less sensitive to outliers.

• Next time:

– How to find the L1-norm solution, and what if features are irrelevant?