class: title-slide, top, center

# Data 311: Machine Learning
## Lecture 13: Jackknife and Bootstrap

Dr. Irene Vrbik

.left[.footnote[Updated: 2021-11-01] ]

---

# Introduction

- Last class we discussed some cross-validation resampling techniques, namely LOOCV and `\(k\)`-fold CV.

- Today we discuss two other popular resampling methods: the .display[bootstrap] and the .display[jackknife]

- We will primarily do this in the context of linear regression

???

might seem silly since we have the CLT, but they'll be easy to understand (harder example in assignment 3)

- we'll see another example using bootstrap in lecture 16

---

# Recap

- Last lecture we saw how we could use cross-validation to:
  - assess model performance (e.g. estimate test MSE),
  - perform model selection (e.g. choose `\(k\)` for kNN)

- Consider that, following CV, we have `\(k\)`/`\(n\)` different fits of the model; which model should we report?

**Answer**: _none of them._

- While CV is used to help _select_ a model, the actual reported model would be the one fit with all the data.

???

CV is used strictly as a way of predicting the true MSE of a model.

---

class: slide-yellow, middle, center

# Jackknife

---

# Jackknife

- The .display[jackknife] approach can actually use additional information gained from those CV fits to provide insights into the parameter estimates.

- Essentially, the jackknife is LOOCV applied to estimating the bias and variance of parameter estimates.

- We will see that the bootstrap is used over the jackknife (and that the jackknife is somewhat obsolete `\(^{[1]}\)` at this point); however, it serves as a nice segue between CV and the bootstrap.

.footnote[[1] in fact the topic of the jackknife is not even covered in the ISLR textbook]

---

name: jackknife

# Jackknife

1. Let `\(i=1\)` be the index of the first observation
1. Remove observation `\(i\)` from the data
1. Estimate your model, report any parameter estimates, say `\(\hat \alpha_i\)`
1. Set `\(i=i+1\)`, return to 2 if `\(i \leq n\)`.
1. 
Now we have `\(n\)` estimates of parameter `\(\alpha\)`. We can estimate the standard error and bias of an estimator `\(\hat \alpha\)` by

.pull-left[
.medium[
`\begin{equation}
\hat{\text{SE}}(\hat \alpha) = \sqrt{\frac{n-1}{n} \sum_{i=1}^{n} (\hat \alpha_i - \overline{\hat \alpha})^2}
\end{equation}`
]
]

.pull-right[
`\begin{equation}
\hat{\text{Bias}}(\hat \alpha) = (n-1)(\overline {\hat \alpha} - \hat \alpha)
\end{equation}`

.right[
where `\(\overline {\hat \alpha} = \sum_{i=1}^n {\hat \alpha_i}/{n}\)` and `\(\hat \alpha\)` is the estimate obtained on the full data.
]
]

???

Here it is algorithmically: (we will not prove it)

---

# Jackknife Simulation

- Before applying the jackknife in a modelling scenario, let's look at it in a simple, intro-stats-type example

- Estimating `\(\mu\)` from a normal distribution!

- We generate a sample of size 25 from the normal distribution with the following (known) parameters `\(\mu=0\)` and `\(\sigma = 10\)`

- `\(\overline X\)` is a traditional estimator for `\(\mu\)` (among other estimators)

---

# Jackknife Simulation

- From intro stats we know that `\(\overline X\)` is an unbiased estimate of `\(\mu\)` with `\(SE(\bar X) = \dfrac{\sigma}{\sqrt{n}} = \frac{10}{\sqrt{25}} = 2\)`

- We know this because `\(X\)` and `\(\bar X\)` are especially neatly intertwined when `\(X\)` is normal.

- But suppose we lived in a world where nice statistical theory didn't provide us this knowledge (and even in the real world we won't know the SE for several other estimators!)

- The jackknife provides us a way of estimating `\(SE(\bar X)\)`

---

# Jackknife Simulation

.medium[
Using the formula for .display[SE] from [this](#jackknife) slide and recalling that the true theoretical `\(SE(\overline X) = 2\)` ... 
]

```r
set.seed(5141)
x <- rnorm(25, 0, 10)
xbarfull <- mean(x)  # alpha hat
xbarjack <- NA       # alphai hat for i = 1,..n
for(i in 1:25) xbarjack[i] <- mean(x[-i])
xbarhat = mean(xbarjack)
# (n-1)/n sum alphai hat - the average alpha hat
sqrt((25-1)/(25)*sum((xbarjack-xbarhat)^2))
```

```
## [1] 1.873911
```

???

- so if I didn't have this theoretical result I could use this to get a reasonable estimate of it

---

# Jackknife Simulation

Using a different seed...

```r
set.seed(511)
x <- rnorm(25, 0, 10)
xbarfull <- mean(x)
xbarjack <- NA
for(i in 1:25){
  xbarjack[i] <- mean(x[-i])
}
xbarhat = mean(xbarjack)
sqrt((25-1)/(25)* sum((xbarjack-xbarhat)^2))
```

```
## [1] 2.314983
```

???

- proof of concept
- sometimes our estimate is below, sometimes it's above

---

# Jackknife Simulation

.medium[
Using the formula for .display[Bias] from [this](#jackknife) slide and recalling that this is an unbiased estimator (bias is 0) ...
]

```r
set.seed(5141)
x <- rnorm(25, 0, 10)
xbarfull <- mean(x)  # alpha hat
xbarjack <- NA       # alphai hat for i = 1,..n
for(i in 1:25) xbarjack[i] <- mean(x[-i])
xbarhat = mean(xbarjack)
# (n-1)*(average alphai hat - alpha hat)
(25-1)*(xbarhat - xbarfull)
```

```
## [1] 0
```

---

# Jackknife Simulation

Using a different seed

```r
set.seed(511)
x <- rnorm(25, 0, 10)
xbarfull <- mean(x)  # alpha hat
xbarjack <- NA       # alphai hat for i = 1,..n
for(i in 1:25) xbarjack[i] <- mean(x[-i])
xbarhat = mean(xbarjack)
# (n-1)*(average alphai hat - alpha hat)
(25-1)*(xbarhat - xbarfull)
```

```
## [1] -0.000000000000001332268
```

???

mathematically this should be equal to 0, so this really small value is just a rounding error

---

# Jackknife Simulation -- Faster

.pull-left[
Alternatively, we could just use the `jackknife` function available in the **bootstrap** package. Usage:

```r
jackknife(x, theta, ...)
```

where `x` is a vector of data and `theta` is the function to be jackknifed.
]

.pull-right[
Redoing our example:

```r
library(bootstrap)
set.seed(5141)
x <- rnorm(25, 0, 10)
jfit <- jackknife(x, mean)
jfit$jack.bias
```

```
## [1] 0
```

```r
jfit$jack.se
```

```
## [1] 1.873911
```
]

???

- same as before
- messier when we get more complicated (e.g. `\(\beta_1\)` for lm)

---

# Jackknife Modelling Example

## SLR Example

- Bringing this back around to machine learning, we can use the jackknife to investigate the bias and variance of our estimators within any particular model

- For example, `\(\hat \beta_0\)` and `\(\hat \beta_1\)` from SLR. We simulate 30 observations from
`$$Y = 2X + \epsilon$$`
where `\(\epsilon \sim N(0, 0.25^2)\)`

---

name:sampling-dist

# SLR Example

## Sampling distributions of coefficients

It can be shown that

`\begin{equation}
\hat\beta_0 \sim N\left(\beta_0, \sigma^2\left[\dfrac{1}{n} + \dfrac{\bar x^2}{S_{xx}}\right]\right) \quad \hat\beta_1 \sim N\left(\beta_1, \dfrac{\sigma^2}{S_{xx}}\right)
\end{equation}`

where `\(S_{xx} = \sum_{i=1}^n (x_i - \overline x)^2\)` = `\(\sum_{i=1}^n x_i^2 - \dfrac{(\sum_{i=1}^n x_i)^2}{n}\)`

Notice that these are both unbiased with known standard errors.

---

# SLR Example

```r
set.seed(123978)
x <- runif(30, 0, 1); y <- 2*x + rnorm(30, sd=0.25)
fullfit <- lm(y~x)
jlist <- matrix(NA, nrow=30, ncol=2)
for(i in 1:length(x)) jlist[i,] <- lm(y[-i]~x[-i])$coef
```

.pull-left[
```r
# Bias for intercept
(30-1)*(mean(jlist[,1]) -fullfit$coefficients[1])
```

```
##  (Intercept) 
## -0.001610663
```
]

.pull-right[
```r
# Bias for slope
(30-1)*(mean(jlist[,2]) -fullfit$coefficients[2])
```

```
##            x 
## 0.0002091236
```
]

???
near 0 as these are unbiased

---

# SLR Example

Intercept `\(\beta_0\)` standard errors:

```r
# Jackknife SE for intercept:
sqrt(((30-1)/30)*sum((jlist[,1]-mean(jlist[,1]))^2))
```

```
## [1] 0.07949095
```

```r
# True SE for intercept
xbar <- mean(x)
sxx <- sum((x-xbar)^2)
var.b0 = 0.25^2*(1/30 + xbar^2/sxx)
sqrt(var.b0)
```

```
## [1] 0.08779294
```

---

# SLR Example

Slope `\(\beta_1\)` standard errors:

```r
# Jackknife SE for slope:
(jk.se.b1 = sqrt(((30-1)/30)* sum((jlist[,2]-mean(jlist[,2]))^2)) )
```

```
## [1] 0.1603949
```

```r
# True SE for slope
sqrt(0.25^2/sxx)
```

```
## [1] 0.162064
```

<!-- .25/ sqrt( 30 / 12 ) -->

---

# Modelling Example

- Again, traditional estimators `\(\hat \beta_0\)` and `\(\hat \beta_1\)` are unbiased...so it's good to see the jackknife bias estimates near 0.

- We also see that the jackknife estimates for the SEs of both estimators appear reasonably close to the known SEs

- With some additional work, we could use these SEs to provide jackknife versions of confidence intervals...but we'll leave that aside, as we're about to learn a "simpler" option.

---

# Comments about Jackknife

- We considered the LOO jackknife; however, it is generalizable to "leave `\(k\)` out", just as CV was.

- The jackknife is a nonparametric method, but that does not mean that it is void of assumptions or rigorous statistical theory. The biggest assumption is that observations are .blue[iid] `\(^{1}\)`.

- It has been shown to be _consistent_ `\(^{2}\)` (asymptotically "good") for many common estimators, but not _all_ estimators (e.g. the median).

.footnote[1 independent and identically distributed <br>
2 the procedure with unlimited data should identify the underlying truth]

???
The jackknife is a nonparametric method for estimating things like bias and variance of an estimator

---

class: slide-yellow, middle, center

# Bootstrap

---

# Bootstrap

- The (nonparametric) .display[bootstrap] is a tremendously versatile method for estimating the standard errors and bias of parameter estimates and has largely replaced the jackknife.

- Instead of simulating new data from a known model (which will be impossible in real scenarios), we can randomly sample .alert[with replacement] from our observed sample!

- Oddly enough, with a bit of work, this provides good estimates of the bias and standard error of our estimators.

---

# Bootstrap

1. Set `\(j=1\)` as the index for the bootstrap sample number
1. Take the `\(j^{\text{th}}\)` random sample of size `\(n\)` from your observations `\(\{x_1, x_2, \ldots, x_n\}\)` .alert[with replacement]
1. Fit model, estimate parameter `\(\alpha\)` with `\(\hat \alpha_j\)` (for example)
1. Set `\(j=j+1\)`, and if `\(j\leq B\)` return to step 2
1. Estimate the standard error and/or bias of the estimator

.pull-left[
.small[
`\begin{equation}
\hat{\text{SE}}(\hat \alpha) = \sqrt{\frac{1}{B-1}\sum_{j=1}^{B} (\hat \alpha_j - \overline{\hat \alpha})^2}
\end{equation}`
]
]

.pull-right[
`\begin{align}
\hat{\text{Bias}}(\hat \alpha) = \overline {\hat \alpha} - \hat \alpha
\end{align}`

.right[
.small[
where `\(\overline {\hat \alpha} = \sum_{j=1}^B {\hat \alpha_j}/{B}\)` and `\(\hat \alpha\)` is the estimate obtained on the full data.
]
]
]

.footnote[
N.B. 
`\(B = 1000\)` or `\(5000\)` are standard choices
]

---

name: islr-bootstrap
class: hide-logo

<div class="figure" style="text-align: center">
<img src="img/ISLR2/Chapter5/5_11.png" alt="Visualization of Bootstrap (Figure 5.11 from ISLR)" width="60%" />
<p class="caption">Visualization of Bootstrap (Figure 5.11 from ISLR)</p>
</div>

---

# Bootstrap Simulation

- We simulate 30 observations from
`$$Y = 2X + \epsilon$$`
where `\(\epsilon \sim N(0, 0.25^2)\)`

- In a simulation, we can simply continue generating data to investigate things like the distribution of `\(\hat \beta_1\)`

- So we do it...

---

name: bootstrap:sim1

# Bootstrap Simulation

```r
set.seed(2)
newxy <- list()
modnew <- list()
coefs <- NA
for(i in 1:1000){
  newx <- runif(30, 0, 1)
  newy <- 2*newx + rnorm(30, sd=0.25)
  newxy[[i]] <- cbind(newx, newy)
  modnew[[i]] <- lm(newxy[[i]][,2]~newxy[[i]][,1])
  coefs[i] <- modnew[[i]]$coefficients[2]
}
```

---

# Bootstrap Simulation

- Now we have fit 1000 linear models to 1000 new simulations (of size 30) from the exact same generative scheme

- We stored all `\(\hat \beta_1\)` in `coefs`

```r
sd(coefs)
```

```
## [1] 0.1604527
```

The above should give us a pretty good estimate of the SE of `\(\hat\beta_1\)`

---

# Bootstrap Simulation

- If we could do this in real life, this would be our gold standard.

- BUT with real data, you will never have the option to just simulate/collect a bunch more of it.

- Enter the bootstrap...

---

# Bootstrap Simulation

```r
set.seed(311)
newboots <- list()
bootsmod <- list()
bootcoef <- NA
xy <- cbind(x,y)
for(i in 1:1000){
  newboots[[i]] <- xy[sample(1:30, 30, replace=TRUE),]
  bootsmod[[i]] <- lm(newboots[[i]][,2]~newboots[[i]][,1])
  bootcoef[i] <- bootsmod[[i]]$coefficient[2]
}
```

Comparing with our [previous code](#bootstrap:sim1) ....

---

# Bootstrap Simulation

... 
notice that we are not regenerating new x and y values; the highlighted lines are what the bootstrap replaces:

```r
set.seed(2)
newxy <- list()
modnew <- list()
coefs <- NA
for(i in 1:1000){
* newx <- runif(30, 0, 1)
* newy <- 2*newx + rnorm(30, sd=0.25)
* newxy[[i]] <- cbind(newx, newy)
  modnew[[i]] <- lm(newxy[[i]][,2]~newxy[[i]][,1])
  coefs[i] <- modnew[[i]]$coefficients[2]
}
```

---

# Bootstrap Simulation

```r
for(i in 1:1000){
* newboots[[i]] <- xy[sample(1:30, 30, replace=TRUE),]
  bootsmod[[i]] <- lm(newboots[[i]][,2]~newboots[[i]][,1])
  bootcoef[i] <- bootsmod[[i]]$coefficient[2]}
```

.pull-left[
```r
head(newboots[[i]])
```

```
##          x     y
## [1,] 0.106 0.314
## [2,] 0.487 1.195
## [3,] 0.761 1.181
## [4,] 0.761 1.181
## [5,] 0.551 1.064
## [6,] 0.304 0.695
```
]

.pull-right[
- `newboots[[i]]` stores `\(n\)` observations sampled with replacement from the original data set

- it's akin to `\(Z^{*i}\)` in the visualization on [this](#islr-bootstrap) slide
]

---

# Bootstrap Simulation

```r
for(i in 1:1000){
* newboots[[i]] <- xy[sample(1:30, 30, replace=TRUE),]
  bootsmod[[i]] <- lm(newboots[[i]][,2]~newboots[[i]][,1])
  bootcoef[i] <- bootsmod[[i]]$coefficient[2]}
```

.pull-left-narrow[
.small[
```r
bootcoef[[i]]
```

```
## [1] 1.758587
```
]
]

.pull-right-wide[
- `bootcoef[[i]]` stores the estimate of the slope based on the `\(i\)`th bootstrap sample `\(Z^{*i}\)`

- `bootcoef[[i]]` is akin to `\(\hat\alpha^{*i}\)` in the visualization on [this](#islr-bootstrap) slide
]

---

# Bootstrap Simulation

- Now we have fit 1000 linear models to 1000 _bootstrapped_ samples (of size 30). Comparing with the jackknife ...

.pull-left[
**Bootstrap estimate**:

```r
sd(bootcoef)
```

```
## [1] 0.1543347
```
]

.pull-right[
**Jackknife estimate:**

```r
jk.se.b1
```

```
## [1] 0.1603949
```
]

---

# Bootstrap Simulation

- Now we have fit 1000 linear models to 1000 _bootstrapped_ samples (of size 30). Comparing with the gold standard ... 
.pull-left[
**Bootstrap estimate**:

```r
sd(bootcoef)
```

```
## [1] 0.1543347
```
]

.pull-right[
**Gold standard estimate:**

```r
sd(coefs)
```

```
## [1] 0.1604527
```
]

???

- close to the gold standard value, which we obtained by repeatedly generating new data
- impressive no?

---

# Bootstrap Simulation

- Now we have fit 1000 linear models to 1000 _bootstrapped_ samples (of size 30). Comparing with the true value

.pull-left[
**Bootstrap estimate**:

```r
sd(bootcoef)
```

```
## [1] 0.1543347
```

<!-- 0.25/sqrt(30/12) -->
]

.pull-right[
**True value:**

```r
xbar <- mean(x)
sxx <- sum((x-xbar)^2)
var.b1 = 0.25^2/sxx
(true.se.b1 = sqrt(var.b1))
```

```
## [1] 0.162064
```
]

---

# Bootstrap Simulation

- Now we have fit 1000 linear models to 1000 _bootstrapped_ samples (of size 30). Comparing with the estimated SE from one fit

.pull-left[
**Bootstrap estimate**:

```r
sd(bootcoef)
```

```
## [1] 0.1543347
```
]

.pull-right[
**From one fit:**

```r
mod1 <- lm(y~x)
coeftab <- summary(mod1)$coefficients
(mod1.se.b1 <- coeftab[2,2])
```

```
## [1] 0.1434842
```
]

---

# Bootstrap Simulation

The value from the previous page was pulled from the summary table of the regression model fitted to the complete data ...

```r
mod1 <- lm(y~x)
summary(mod1)
```

<table class="table" style="margin-left: auto; margin-right: auto;">
<thead>
<tr>
<th style="text-align:left;"> </th>
<th style="text-align:right;"> Estimate </th>
<th style="text-align:right;"> Std. 
Error </th>
<th style="text-align:right;"> t value </th>
<th style="text-align:right;"> Pr(>|t|) </th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align:left;font-weight: bold;color: black !important;background-color: white !important;"> (Intercept) </td>
<td style="text-align:right;color: black !important;background-color: white !important;"> 0.0579822 </td>
<td style="text-align:right;color: purple !important;background-color: #EAEEFF !important;color: black !important;background-color: white !important;"> 0.0777279 </td>
<td style="text-align:right;color: black !important;background-color: white !important;"> 0.7459639 </td>
<td style="text-align:right;color: black !important;background-color: white !important;"> 0.4619065 </td>
</tr>
<tr>
<td style="text-align:left;font-weight: bold;"> x </td>
<td style="text-align:right;"> 1.8664054 </td>
<td style="text-align:right;color: purple !important;background-color: #EAEEFF !important;"> 0.1434842 </td>
<td style="text-align:right;"> 13.0077406 </td>
<td style="text-align:right;"> 0.0000000 </td>
</tr>
</tbody>
</table>

---

# Summary

Below are the estimates for the SE( `\(\hat \beta_1\)`):

<table class="table" style="margin-left: auto; margin-right: auto;">
<thead><tr>
<th style="border-bottom:hidden;padding-bottom:0; padding-left:3px;padding-right:3px;text-align: center; " colspan="1"><div style="border-bottom: 1px solid #ddd; padding-bottom: 5px; ">Method</div></th>
<th style="border-bottom:hidden;padding-bottom:0; padding-left:3px;padding-right:3px;text-align: center; " colspan="1"><div style="border-bottom: 1px solid #ddd; padding-bottom: 5px; ">SE value</div></th>
</tr></thead>
<tbody>
<tr>
<td style="text-align:left;"> True value </td>
<td style="text-align:left;"> 0.1621 </td>
</tr>
<tr>
<td style="text-align:left;"> Multiple Simulations </td>
<td style="text-align:left;"> 0.1605 </td>
</tr>
<tr>
<td style="text-align:left;"> Bootstrap </td>
<td style="text-align:left;"> 0.1543 </td>
</tr>
<tr>
<td 
style="text-align:left;"> Jackknife </td>
<td style="text-align:left;"> 0.1604 </td>
</tr>
<tr>
<td style="text-align:left;"> One fit </td>
<td style="text-align:left;"> 0.1435 </td>
</tr>
</tbody>
</table>

???

multiple sims = gold standard

---

class:center, hide-logo, inverse

So while it may not have been obvious off the bat that this method would work, hopefully you are convinced by this simple example.

<img src="https://media.giphy.com/media/D8vbqAZMb8WTm/giphy.gif" width="70%" />

---

# Comments

- This should give you some confidence in the power of bootstrapping!

- Bootstrap and jackknife are used for the same general purposes, but jackknife is older and was primarily used at a time when computing resources were limited.

- As such, bootstrapping is the predominant nonparametric method used for bias and standard error calculations.

---

# Taking it Further

- We alluded to the potential of computing confidence intervals with the jackknife, but with the bootstrap one option for doing so is quite straightforward.

- The `\(B\)` bootstrap estimates of parameter `\(\alpha\)` provide an empirical estimate of the distribution of estimator `\(\hat \alpha\)`.

- So, for example, we can look at a histogram of bootstrapped `\(\hat \beta_1\)`, `hist(bootcoef)`, from our previous example...

---

# Empirical Distribution of `\(\hat \beta_1\)`

<img src="r-figures/lec13/unnamed-chunk-34-1.png" width="85%" style="display: block; margin: auto;" />

???

contains 1000 bootstrap `\(\hat \beta_1\)` estimates...

---

# Bootstrap CI

Suppose I want a 95% bootstrap CI for `\(\beta_1\)`...

<img src="r-figures/lec13/unnamed-chunk-35-1.png" width="85%" style="display: block; margin: auto;" />

???

remember truth is at 2

---

# Bootstrap CI

The simplest method for providing a CI is to take percentiles.

<img src="r-figures/lec13/unnamed-chunk-36-1.png" width="85%" style="display: block; margin: auto;" />

---

# Bootstrap CI

The simplest method for providing a CI is to take percentiles.
<img src="r-figures/lec13/unnamed-chunk-37-1.png" width="85%" style="display: block; margin: auto;" />

---

# Bootstrap CI

.pull-left[
To find the lower bound:

```r
quantile(bootcoef, .025)
```

```
##     2.5% 
## 1.567437
```
]

.pull-right[
To find the upper bound:

```r
quantile(bootcoef, .975)
```

```
##    97.5% 
## 2.182977
```
]

So we have a 95% CI for the true `\(\beta_1\)` lying within the interval:
`$$(1.5674, 2.183)$$`

---

# Bootstrap CI

<img src="r-figures/lec13/unnamed-chunk-40-1.png" width="85%" style="display: block; margin: auto;" />

---

class:inverse

# Bootstrap Acclaim

[Statistical ‘rock star’ wins coveted international prize](https://www.nature.com/articles/d41586-018-07395-w)

> _The bootstrap is rarely the star of statistics, but it is the best supporting actor_

.right[ \- Dr. Bradley Efron]

You can hear Dr. Brad Efron speak about the bootstrap [here](https://www.youtube.com/watch?v=H2tOhMaXWvI&ab_channel=BBVAFoundation)
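
---

# Appendix: Percentile CI From Scratch

As a recap of the steps above, here is a minimal, self-contained sketch of the percentile bootstrap CI for the slope. It regenerates the same simulated data as the slides (seed 123978), but the resampling itself is not re-seeded, so the interval endpoints will differ slightly from the (1.5674, 2.183) computed earlier. The names `B`, `boot_slopes`, and `ci` are illustrative, not from the lecture code.

```r
# Simulate the same SLR data used throughout the lecture
set.seed(123978)
x <- runif(30, 0, 1)
y <- 2*x + rnorm(30, sd = 0.25)

B <- 1000
boot_slopes <- numeric(B)
for (j in 1:B) {
  idx <- sample(1:30, 30, replace = TRUE)   # resample rows with replacement
  boot_slopes[j] <- coef(lm(y[idx] ~ x[idx]))[2]
}

# Percentile method: take the middle 95% of the bootstrap distribution
(ci <- quantile(boot_slopes, c(0.025, 0.975)))
```

Because the interval comes straight from the empirical distribution of `\(\hat \beta_1^{*}\)`, no normality assumption is needed -- that is the appeal of the percentile method.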