
07 - Bootstrap and Splines
Data Mining | SYS 6018 | Fall 2019
07-bootstrap.pdf

Contents

1 Introduction to the Bootstrap
  1.1 Required R Packages
  1.2 Uncertainty in a test statistic
2 Bootstrapping Regression Parameters
  2.1 Bootstrap the β's
3 Basis Function Modeling
  3.1 Piecewise Polynomials
  3.2 B-Splines
  3.3 Bootstrap Confidence Interval for f(x)
4 More Bagging
  4.1 Out-of-Bag Samples
  4.2 Number of Bootstrap Simulations
5 More Resources
  5.1 Variations of the Bootstrap


    1 Introduction to the Bootstrap

    1.1 Required R Packages

We will be using the following R packages:

• boot for help with bootstrapping
• broom for tidy extraction of model components
• splines for working with B-splines
• tidyverse for data manipulation and visualization

library(boot)
library(broom)
library(splines)
library(tidyverse)

    1.2 Uncertainty in a test statistic

There is often interest in understanding the uncertainty in the estimated value of a test statistic.

    • For example, let p be the actual/true proportion of customers who will use your company’s coupon.

    • To estimate p, you decide to take a sample of n = 200 customers and find that x = 10 or p̂ = 10/200 = 0.05 = 5% redeemed the coupon.

    1.2.1 Confidence Interval

• It is common to calculate the 95% confidence interval (CI)

    CI(p) = p̂ ± 2 · SE(p̂)
          = p̂ ± 2 · √(p̂(1 − p̂)/n)
          = 0.05 ± 0.03

• This calculation is based on the assumption that p̂ is approximately normally distributed with mean equal to the unknown true p, i.e., p̂ ∼ N(p, √(p(1 − p)/n)).

    1.2.2 Bayesian Posterior Distribution

    In the Bayesian world, you’d probably specify a Beta prior for p, i.e., p ∼ Beta(a, b) and calculate the posterior distribution p | x = 10 ∼ Beta(a+ x, b+ n− x) which would fully characterize the uncertainty.
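With the uniform Beta(1, 1) prior, the posterior is available in closed form; a minimal sketch of that calculation:

```r
#-- Beta-Binomial posterior for p, with a uniform Beta(1,1) prior
a = 1; b = 1                 # prior: p ~ Beta(1, 1)
n = 200; x = 10              # observed data
a.post = a + x               # posterior shape 1: a + x
b.post = b + n - x           # posterior shape 2: b + n - x

a.post / (a.post + b.post)             # posterior mean, 11/202 ≈ 0.054
qbeta(c(.025, .975), a.post, b.post)   # central 95% credible interval
```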


[Figure: density of p comparing the 95% CI normal approximation, the Beta posterior, and the prior; Prior ~ Beta(1,1) [Uniform]]

    1.2.3 The Bootstrap

• The Bootstrap is a way to assess the uncertainty in a test statistic using resampling.

• The idea is to simulate the data from the empirical distribution, which puts a point mass of 1/n at each observed data point (i.e., sample the original data with replacement).

  – It is important to simulate n observations (the same size as the original data) because the uncertainty in the test statistic is a function of n.

[Figure: the original sample and three bootstrap samples (sample 1, sample 2, sample 3), each drawn with replacement from the original data, plotted over values 0–10]


• Then, calculate the test statistic for each bootstrap sample. The variability in the collection of bootstrap test statistics should approximate the variability in the test statistic itself.

Algorithm: Nonparametric/Empirical Bootstrap

Observe data D = [X₁, X₂, …, Xₙ] (n observations).
Calculate a test statistic θ̂ = θ̂(D), which is a function of D.
Repeat steps 1 and 2 M times:

1. Simulate D*, a new data set of n observations, by sampling from D with replacement.
2. Calculate the bootstrap test statistic θ̂* = θ̂(D*).

The bootstrapped samples θ̂*_1, θ̂*_2, …, θ̂*_M can be used to estimate the distribution of θ̂.

• Or properties of the distribution, like standard deviation (standard error), percentiles, etc.

#-- Original Data
x = c(rep(1, 10), rep(0, 190))  # 10 successes, 190 failures
n = length(x)                   # length of observed data

#-- Bootstrap Distribution
M = 2000            # number of bootstrap samples
p = numeric(M)      # initialize vector for test statistic
set.seed(201910)    # set random seed
for(m in 1:M){
  #- sample from empirical distribution
  ind = sample(n, replace=TRUE)  # sample indices with replacement
  xboot = x[ind]                 # bootstrap sample
  #- calculate proportion of successes
  p[m] = mean(xboot)             # calculate test statistic
}

#-- Bootstrap percentile-based confidence interval
quantile(p, probs=c(.025, .975))  # 95% bootstrap interval
#> 2.5% 97.5%
#> 0.02 0.08
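The same bootstrap can also be delegated to the boot package loaded earlier; a minimal sketch, where the statistic function name prop.stat is ours:

```r
library(boot)

#-- Same data as above
x = c(rep(1, 10), rep(0, 190))     # 10 successes, 190 failures

#- statistic function: boot() passes the data and the resampled indices
prop.stat = function(data, ind) mean(data[ind])

set.seed(201910)
b = boot(x, statistic=prop.stat, R=2000)  # 2000 bootstrap replicates
boot.ci(b, type="perc")                   # percentile interval, close to (0.02, 0.08)
```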

[Figure: density of p comparing the bootstrap distribution, the 95% CI normal approximation, the Beta posterior, and the prior; Prior ~ Beta(1,1) [Uniform]]


• Notice that in the above example the bootstrap distribution is close to the Bayesian posterior distribution (using the uninformative Uniform prior).

• This is no accident: there is a close correspondence between the bootstrap-derived distribution and the Bayesian posterior distribution under uninformative priors.

  – See ESL 8.4 for more details.

    2 Bootstrapping Regression Parameters

The bootstrap is not limited to univariate test statistics; it can also be used with multivariate test statistics.

    Consider the uncertainty in estimates of the parameters (i.e., β coefficients) of a regression model.

[Figure: scatterplot of the training data, y vs. x, with x ∈ [0, 1] and y roughly between −5 and 10]

data_train = tibble(x, y)        # create a data frame/tibble
m1 = lm(y ~ x, data=data_train)  # fit simple OLS
broom::tidy(m1, conf.int=TRUE)   # OLS estimated coefficients
#> # A tibble: 2 x 7
#>   term        estimate std.error statistic  p.value conf.low conf.high
#> 1 (Intercept)     6.48     0.584     11.1  5.39e-19     5.32      7.64
#> 2 x              -7.37     1.06      -6.97 3.69e-10    -9.47     -5.27

vcov(m1)  # variance matrix
#>             (Intercept)       x
#> (Intercept)      0.3414 -0.5359
#> x               -0.5359  1.1183

    2.1 Bootstrap the β’s


#-- Bootstrap Distribution
M = 2000                 # number of bootstrap samples
beta = matrix(NA, M, 2)  # initialize matrix for test statistics
set.seed(201910)         # set random seed
for(m in 1:M){
  #- sample from empirical distribution
  ind = sample(n, replace=TRUE)      # sample indices with replacement
  data.boot = data_train[ind, ]      # bootstrap sample
  #- fit regression model
  m.boot = lm(y ~ x, data=data.boot) # fit simple OLS
  #- save test statistics
  beta[m, ] = coef(m.boot)
}
#- convert to tibble (and add column names)
beta = as_tibble(beta, .name_repair = "unique") %>%
  setNames(c('intercept', 'slope'))

#- Plot
ggplot(beta, aes(intercept, slope)) + geom_point() +
  geom_point(data=tibble(intercept=coef(m1)[1],
                         slope=coef(m1)[2]), color="red", size=4)

#- bootstrap estimate
var(beta)  # variance matrix
#>           intercept   slope
#> intercept    0.5821 -0.8747
#> slope       -0.8747  1.4918

apply(beta, 2, sd)  # standard errors (sqrt of diagonal)
#> intercept     slope
#>    0.7629    1.2214
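The β bootstrap can likewise be delegated to boot::boot(). Since the lecture's (x, y) data are not reproduced here, this sketch simulates a stand-in data set (the simulated coefficients 6.5 and −7.4 are our assumption, chosen near the OLS fit above):

```r
library(boot)

#-- Simulated stand-in for the lecture's (x, y) data [assumption]
set.seed(201910)
x = runif(100)
y = 6.5 - 7.4*x + rnorm(100, sd=2)
data_train = data.frame(x, y)

#- statistic: the OLS coefficients fit to the resampled rows
beta.stat = function(data, ind) coef(lm(y ~ x, data=data[ind, ]))

b = boot(data_train, statistic=beta.stat, R=2000)
apply(b$t, 2, sd)   # bootstrap standard errors for (intercept, slope)
```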

[Figure: scatterplot of the bootstrap (intercept, slope) pairs, with the OLS estimate marked as a large red point]


    3 Basis Function Modeling

For a univariate x, a linear basis expansion is

    f̂(x) = ∑_j θ̂_j b_j(x)

where b_j(x) is the value of the jth basis function at x and θ_j is the coefficient to be estimated.

• The b_j(x) are usually specified in advance (i.e., not estimated).

    Examples:

• Linear Regression

    f̂(x) = β̂₀ + β̂₁x

    b₁(x) = 1,  b₂(x) = x

  [Figure: the two basis functions b1 and b2 over x ∈ [0, 2]]

• Polynomial Regression

    f̂(x) = ∑_{j=1}^d β̂_j x^j

    b_j(x) = x^j

  [Figure: the polynomial basis functions b1–b4 over x ∈ [0, 2]]
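A polynomial basis like the one above can be built explicitly with poly() and passed to lm(); a minimal sketch on simulated data (the response used here is our assumption):

```r
#-- Build an explicit degree-4 polynomial basis and fit with lm()
set.seed(201910)
x = runif(100, 0, 2)
y = sin(pi*x) + rnorm(100, sd=.3)  # simulated response [assumption]

X = poly(x, degree=4, raw=TRUE)    # basis matrix with columns x, x^2, x^3, x^4
fit = lm(y ~ X)                    # lm() adds the intercept automatically
length(coef(fit))                  # intercept + 4 basis coefficients
#> [1] 5
```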


    3.1 Piecewise Polynomials


    3.2 B-Splines

[Figure: four panels showing B-spline basis functions of degrees 3, 2, 1, and 0 over x ∈ [0, 1]]

    Like ESL Fig 5.20, B-splines (knots shown by vertical dashed lines)
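B-spline bases like these come from the splines package loaded earlier; a minimal sketch of building one with bs() and using it in a regression (the simulated response is our assumption):

```r
library(splines)

#-- Cubic B-spline basis on simulated data [assumption]
set.seed(201910)
x = seq(0, 1, length=200)
y = sin(2*pi*x) + rnorm(200, sd=.3)

B = bs(x, df=6, degree=3)  # 6 cubic B-spline basis columns (3 interior knots)
dim(B)
#> [1] 200   6
fit = lm(y ~ B)            # regression that is linear in the basis functions
```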

• A degree = 0 B-spline is a regressogram basis; it leads to a piecewise constant fit.

• A degree = 3 B-spline (called a cubic spline) is similar in shape to a Gaussian pdf, but the B-spline has finite support and facilitates quick computation (due to th