class: title-slide, top, center

# Data 311: Machine Learning
## Lecture 13: Jackknife and Bootstrap

Dr. Irene Vrbik

.left[.footnote[Updated: 2021-11-01] ]

---

# Introduction

- Last class we discussed some cross-validation resampling techniques, namely LOOCV and `\(k\)`-fold CV.

- Today we discuss two other popular resampling methods: the .display[bootstrap] and the .display[jackknife]

- We will primarily do this in the context of linear regression

???

might seem silly since we have the CLT, but they'll be easy to understand (harder example in assignment 3)

- we'll see another example using bootstrap in lecture 16

---

# Recap

- Last lecture we saw how we could use cross-validation to:
  - assess model performance (e.g. estimate test MSE),
  - perform model selection (e.g. choose `\(k\)` for kNN)

- Consider that, following CV, we have `\(k\)`/`\(n\)` different fits of the model; which model should we report?

**Answer**: _none of them._

- While CV is used to help _select_ a model, the actual reported model would be the one fit with all the data.

???

CV is used strictly as a way of predicting the true MSE of a model.

---

class: slide-yellow, middle, center

# Jackknife

---

# Jackknife

- The .display[jackknife] approach can actually use additional information gained from those CV fits to provide insights into the parameter estimates.

- Essentially, the jackknife is LOOCV applied to estimating the bias and variance of parameter estimates.

- We will see that the bootstrap is used over the jackknife (and that the jackknife is somewhat obsolete `\(^{[1]}\)` at this point); however, it serves as a nice segue between CV and the bootstrap.

.footnote[[1] in fact the topic of the jackknife is not even covered in the ISLR textbook]

---

name: jackknife

# Jackknife

1. Let `\(i=1\)` be the index of the first observation
1. Remove observation `\(i\)` from the data
1. Estimate your model, report any parameter estimates, say `\(\hat \alpha_i\)`
1. Set `\(i=i+1\)`, return to 2 if `\(i \leq n\)`.
1. 
Now we have `\(n\)` estimates of parameter `\(\alpha\)`. We can estimate the standard error and bias of an estimator `\(\hat \alpha\)` by

.pull-left[
.medium[
`\begin{equation}
\hat{\text{SE}}(\hat \alpha) = \sqrt{\frac{n-1}{n} \sum_{i=1}^{n} (\hat \alpha_i - \overline{\hat \alpha})^2}
\end{equation}`
]
]

.pull-right[
`\begin{equation}
\hat{\text{Bias}}(\hat \alpha) = (n-1)(\overline {\hat \alpha} - \hat \alpha)
\end{equation}`

.right[
where `\(\overline {\hat \alpha} = \sum_{i=1}^n {\hat \alpha_i}/{n}\)` and `\(\hat \alpha\)` is the estimate obtained on the full data.
]
]

???

Here it is algorithmically: (we will not prove it)

---

# Jackknife Simulation

- Before applying the jackknife in a modelling scenario, let's look at it in a simple, intro-stats-type example

- Estimating `\(\mu\)` from a normal distribution!

- We generate a sample of size 25 from the normal distribution with the following (known) parameters `\(\mu=0\)` and `\(\sigma = 10\)`

- `\(\overline X\)` is a traditional estimator for `\(\mu\)` (among other estimators)

---

# Jackknife Simulation

- From intro stats we know that `\(\overline X\)` is an unbiased estimate of `\(\mu\)` with `\(SE(\bar X) = \dfrac{\sigma}{\sqrt{n}} = \frac{10}{\sqrt{25}} = 2\)`

- We know this because `\(X\)` and `\(\bar X\)` are especially neatly intertwined when `\(X\)` is normal.

- But suppose we lived in a world where nice statistical theory didn't provide us this knowledge (and even in the real world we won't know the SE for several other estimators!)

- The jackknife provides us a way of estimating `\(SE(\bar X)\)`

---

# Jackknife Simulation

.medium[
Using the formula for .display[SE] from [this](#jackknife) slide and recalling that the true theoretical `\(SE(\overline X) = 2\)` ... 
]

```r
set.seed(5141)
x <- rnorm(25, 0, 10)
xbarfull <- mean(x)  # alpha hat
xbarjack <- NA       # alphai hat for i = 1,..n
for(i in 1:25) xbarjack[i] <- mean(x[-i])
xbarhat = mean(xbarjack)
# (n-1)/n sum alphai hat - the average alpha hat
sqrt((25-1)/(25)*sum((xbarjack-xbarhat)^2))
```

```
## [1] 1.873911
```

???

- so if I didn't have this theoretical result I could use this to get a reasonable estimate of it

---

# Jackknife Simulation

Using a different seed...

```r
set.seed(511)
x <- rnorm(25, 0, 10)
xbarfull <- mean(x)
xbarjack <- NA
for(i in 1:25){
  xbarjack[i] <- mean(x[-i])
}
xbarhat = mean(xbarjack)
sqrt((25-1)/(25)* sum((xbarjack-xbarhat)^2))
```

```
## [1] 2.314983
```

???

- proof of concept
- sometimes our estimate is below, sometimes it's above

---

# Jackknife Simulation

.medium[
Using the formula for .display[Bias] from [this](#jackknife) slide and recalling that this is an unbiased estimator (bias is 0) ...
]

```r
set.seed(5141)
x <- rnorm(25, 0, 10)
xbarfull <- mean(x)  # alpha hat
xbarjack <- NA       # alphai hat for i = 1,..n
for(i in 1:25) xbarjack[i] <- mean(x[-i])
xbarhat = mean(xbarjack)
# (n-1)*(average alphai hat - alpha hat)
(25-1)*(xbarhat - xbarfull)
```

```
## [1] 0
```

---

# Jackknife Simulation

Using a different seed

```r
set.seed(511)
x <- rnorm(25, 0, 10)
xbarfull <- mean(x)  # alpha hat
xbarjack <- NA       # alphai hat for i = 1,..n
for(i in 1:25) xbarjack[i] <- mean(x[-i])
xbarhat = mean(xbarjack)
# (n-1)*(average alphai hat - alpha hat)
(25-1)*(xbarhat - xbarfull)
```

```
## [1] -0.000000000000001332268
```

???

mathematically this should be equal to 0, so this really small value is just a rounding error

---

# Jackknife Simulation -- Faster

.pull-left[
Alternatively, we could just use the `jackknife` function available in the **bootstrap** package. Usage:

```r
jackknife(x, theta, ...)
```

where `x` is a vector of data and `theta` is the function to be jackknifed.
]

.pull-right[
Redoing our example:

```r
library(bootstrap)
set.seed(5141)
x <- rnorm(25, 0, 10)
jfit <- jackknife(x, mean)
jfit$jack.bias
```

```
## [1] 0
```

```r
jfit$jack.se
```

```
## [1] 1.873911
```
]

???

- same as before
- messier when we get more complicated (e.g. `\(\beta_1\)` for lm)

---

# Jackknife Modelling Example

## SLR Example

- Bringing this back around to machine learning, we can use the jackknife to investigate the bias and variance of our estimators within any particular model

- For example, `\(\hat \beta_0\)` and `\(\hat \beta_1\)` from SLR. We simulate 30 observations from
`$$Y = 2X + \epsilon$$`
where `\(\epsilon \sim N(0, 0.25^2)\)`

---

name:sampling-dist

# SLR Example

## Sampling distributions of coefficients

It can be shown that

`\begin{equation}
\hat\beta_0 \sim N\left(\beta_0, \sigma^2\left[\dfrac{1}{n} + \dfrac{\bar x^2}{S_{xx}}\right]\right) \quad \hat\beta_1 \sim N\left(\beta_1, \dfrac{\sigma^2}{S_{xx}}\right)
\end{equation}`

where `\(S_{xx} = \sum_{i=1}^n (x_i - \overline x)^2\)` = `\(\sum_{i=1}^n x_i^2 - \dfrac{(\sum_{i=1}^n x_i)^2}{n}\)`

Notice that these are both unbiased with known standard errors.

---

# SLR Example

```r
set.seed(123978)
x <- runif(30, 0, 1); y <- 2*x + rnorm(30, sd=0.25)
fullfit <- lm(y~x)
jlist <- matrix(NA, nrow=30, ncol=2)
for(i in 1:length(x)) jlist[i,] <- lm(y[-i]~x[-i])$coef
```

.pull-left[
```r
# Bias for intercept
(30-1)*(mean(jlist[,1]) -fullfit$coefficients[1])
```

```
##  (Intercept) 
## -0.001610663
```
]

.pull-right[
```r
# Bias for slope
(30-1)*(mean(jlist[,2]) -fullfit$coefficients[2])
```

```
##            x 
## 0.0002091236
```
]

???
near 0 as these are unbiased

---

# SLR Example

Intercept `\(\beta_0\)` standard errors:

```r
# Jackknife SE for intercept:
sqrt(((30-1)/30)*sum((jlist[,1]-mean(jlist[,1]))^2))
```

```
## [1] 0.07949095
```

```r
# True SE for intercept
xbar <- mean(x)
sxx <- sum((x-xbar)^2)
var.b0 = 0.25^2*(1/30 + xbar^2/sxx)
sqrt(var.b0)
```

```
## [1] 0.08779294
```

---

# SLR Example

Slope `\(\beta_1\)` standard errors:

```r
# Jackknife SE for slope:
(jk.se.b1 = sqrt(((30-1)/30)* sum((jlist[,2]-mean(jlist[,2]))^2)) )
```

```
## [1] 0.1603949
```

```r
# True SE for slope
sqrt(0.25^2/sxx)
```

```
## [1] 0.162064
```

<!-- .25/ sqrt( 30 / 12 ) -->

---

# Modelling Example

- Again, traditional estimators `\(\hat \beta_0\)` and `\(\hat \beta_1\)` are unbiased...so it's good to see the jackknife bias estimates near 0.

- We also see that the jackknife estimates for the SEs of both estimators appear reasonably close to the known SEs

- With some additional work, we could use these SEs to provide jackknife versions of confidence intervals...but we'll leave that aside, as we're about to learn a "simpler" option.

---

# Comments about Jackknife

- We considered the LOO jackknife; however, it is generalizable to "leave `\(k\)` out", just as CV was.

- The jackknife is a nonparametric method, but that does not mean that it is void of assumptions or rigorous statistical theory. The biggest assumption is that observations are .blue[iid] `\(^{1}\)`.

- It has been shown to be _consistent_ `\(^{2}\)` (asymptotically "good") for many common estimators, but not _all_ estimators (e.g. the median).

.footnote[1 independent and identically distributed <br>
2 the procedure with unlimited data should identify the underlying truth]

???
The jackknife is a nonparametric method for estimating things like bias and variance of an estimator

---

class: slide-yellow, middle, center

# Bootstrap

---

# Bootstrap

- The (nonparametric) .display[bootstrap] is a tremendously versatile method for estimating the standard errors and bias of parameter estimates and has largely replaced the jackknife.

- Instead of simulating new data from a known model (which will be impossible in real scenarios), we can randomly sample .alert[with replacement] from our observed sample!

- Oddly enough, with a bit of work, this provides good estimates of the bias and standard error of our estimators.

---

# Bootstrap

1. Set `\(j=1\)` as the index for the bootstrap sample number
1. Take the `\(j^{\text{th}}\)` random sample of size `\(n\)` from your observations `\(\{x_1, x_2, \ldots, x_n\}\)` .alert[with replacement]
1. Fit model, estimate parameter `\(\alpha\)` with `\(\hat \alpha_j\)` (for example)
1. Set `\(j=j+1\)`, and if `\(j\leq B\)` return to step 2
1. Estimate the standard error and/or bias of the estimator

.pull-left[
.small[
`\begin{equation}
\hat{\text{SE}}(\hat \alpha) = \sqrt{\frac{1}{B-1}\sum_{j=1}^{B} (\hat \alpha_j - \overline{\hat \alpha})^2}
\end{equation}`
]
]

.pull-right[
`\begin{align}
\hat{\text{Bias}}(\hat \alpha) = \overline {\hat \alpha} - \hat \alpha
\end{align}`

.right[
.small[
where `\(\overline {\hat \alpha} = \sum_{j=1}^B {\hat \alpha_j}/{B}\)` and `\(\hat \alpha\)` is the estimate obtained on the full data.
]
]
]

.footnote[
N.B. 
`\(B = 1000\)` or `\(5000\)` are standard choices
]

---

name: islr-bootstrap
class: hide-logo

<div class="figure" style="text-align: center">
<img src="img/ISLR2/Chapter5/5_11.png" alt="Visualization of Bootstrap (Figure 5.11 from ISLR)" width="60%" />
<p class="caption">Visualization of Bootstrap (Figure 5.11 from ISLR)</p>
</div>

---

# Bootstrap Simulation

- We simulate 30 observations from
`$$Y = 2X + \epsilon$$`
where `\(\epsilon \sim N(0, 0.25^2)\)`

- In a simulation, we can simply continue generating data to investigate things like the distribution of `\(\hat \beta_1\)`

- So we do it...

---

name: bootstrap:sim1

# Bootstrap Simulation

```r
set.seed(2)
newxy <- list()
modnew <- list()
coefs <- NA
for(i in 1:1000){
  newx <- runif(30, 0, 1)
  newy <- 2*newx + rnorm(30, sd=0.25)
  newxy[[i]] <- cbind(newx, newy)
  modnew[[i]] <- lm(newxy[[i]][,2]~newxy[[i]][,1])
  coefs[i] <- modnew[[i]]$coefficients[2]
}
```

---

# Bootstrap Simulation

- Now we have fit 1000 linear models to 1000 new simulations (of size 30) from the exact same generative scheme

- We stored all `\(\hat \beta_1\)` in `coefs`

```r
sd(coefs)
```

```
## [1] 0.1604527
```

The above should give us a pretty good estimate of the SE of `\(\hat\beta_1\)`

---

# Bootstrap Simulation

- If we could do this in real life, this would be our gold standard.

- BUT with real data, you will never have the option to just simulate/collect a bunch more of it.

- Enter the bootstrap...

---

# Bootstrap Simulation

```r
set.seed(311)
newboots <- list()
bootsmod <- list()
bootcoef <- NA
xy <- cbind(x,y)
for(i in 1:1000){
  newboots[[i]] <- xy[sample(1:30, 30, replace=TRUE),]
  bootsmod[[i]] <- lm(newboots[[i]][,2]~newboots[[i]][,1])
  bootcoef[i] <- bootsmod[[i]]$coefficient[2]
}
```

Comparing with our [previous code](#bootstrap:sim1) ....

---

# Bootstrap Simulation

... 
notice that we are not regenerating new x and y values; the highlighted lines are what the bootstrap replaces:

```r
set.seed(2)
newxy <- list()
modnew <- list()
coefs <- NA
for(i in 1:1000){
* newx <- runif(30, 0, 1)
* newy <- 2*newx + rnorm(30, sd=0.25)
* newxy[[i]] <- cbind(newx, newy)
  modnew[[i]] <- lm(newxy[[i]][,2]~newxy[[i]][,1])
  coefs[i] <- modnew[[i]]$coefficients[2]
}
```

---

# Bootstrap Simulation

```r
for(i in 1:1000){
* newboots[[i]] <- xy[sample(1:30, 30, replace=TRUE),]
  bootsmod[[i]] <- lm(newboots[[i]][,2]~newboots[[i]][,1])
  bootcoef[i] <- bootsmod[[i]]$coefficient[2]}
```

.pull-left[
```r
head(newboots[[i]])
```

```
##          x     y
## [1,] 0.106 0.314
## [2,] 0.487 1.195
## [3,] 0.761 1.181
## [4,] 0.761 1.181
## [5,] 0.551 1.064
## [6,] 0.304 0.695
```
]

.pull-right[
- `newboots[[i]]` stores `\(n\)` observations sampled with replacement from the original data set

- it's akin to `\(Z^{*i}\)` in the visualization on [this](#islr-bootstrap) slide
]

---

# Bootstrap Simulation

```r
for(i in 1:1000){
* newboots[[i]] <- xy[sample(1:30, 30, replace=TRUE),]
  bootsmod[[i]] <- lm(newboots[[i]][,2]~newboots[[i]][,1])
  bootcoef[i] <- bootsmod[[i]]$coefficient[2]}
```

.pull-left-narrow[
.small[
```r
bootcoef[[i]]
```

```
## [1] 1.758587
```
]
]

.pull-right-wide[
- `bootcoef[[i]]` stores the estimate of the slope based on the `\(i\)`th bootstrap sample `\(Z^{*i}\)`

- `bootcoef[[i]]` is akin to `\(\hat\alpha^{*i}\)` in the visualization on [this](#islr-bootstrap) slide
]

---

# Bootstrap Simulation

- Now we have fit 1000 linear models to 1000 _bootstrapped_ samples (of size 30). Comparing with the jackknife ...

.pull-left[
**Bootstrap estimate**:

```r
sd(bootcoef)
```

```
## [1] 0.1543347
```
]

.pull-right[
**Jackknife estimate:**

```r
jk.se.b1
```

```
## [1] 0.1603949
```
]

---

# Bootstrap Simulation

- Now we have fit 1000 linear models to 1000 _bootstrapped_ samples (of size 30). Comparing with the gold standard ... 
.pull-left[
**Bootstrap estimate**:

```r
sd(bootcoef)
```

```
## [1] 0.1543347
```
]

.pull-right[
**Gold standard estimate:**

```r
sd(coefs)
```

```
## [1] 0.1604527
```
]

???

- close to the gold standard value, which we obtained by repeatedly generating new data
- impressive no?

---

# Bootstrap Simulation

- Now we have fit 1000 linear models to 1000 _bootstrapped_ samples (of size 30). Comparing with the true value

.pull-left[
**Bootstrap estimate**:

```r
sd(bootcoef)
```

```
## [1] 0.1543347
```

<!-- 0.25/sqrt(30/12) -->
]

.pull-right[
**True value:**

```r
xbar <- mean(x)
sxx <- sum((x-xbar)^2)
var.b1 = 0.25^2/sxx
(true.se.b1 = sqrt(var.b1))
```

```
## [1] 0.162064
```
]

---

# Bootstrap Simulation

- Now we have fit 1000 linear models to 1000 _bootstrapped_ samples (of size 30). Comparing with the estimated SE from one fit

.pull-left[
**Bootstrap estimate**:

```r
sd(bootcoef)
```

```
## [1] 0.1543347
```
]

.pull-right[
**From one fit:**

```r
mod1 <- lm(y~x)
coeftab <- summary(mod1)$coefficients
(mod1.se.b1 <- coeftab[2,2])
```

```
## [1] 0.1434842
```
]

---

# Bootstrap Simulation

The value from the previous page was pulled from the summary table of the regression model fitted to the complete data ...

```r
mod1 <- lm(y~x)
summary(mod1)
```

<table class="table" style="margin-left: auto; margin-right: auto;">
<thead>
<tr>
<th style="text-align:left;"> </th>
<th style="text-align:right;"> Estimate </th>
<th style="text-align:right;"> Std. 
Error </th>
<th style="text-align:right;"> t value </th>
<th style="text-align:right;"> Pr(>|t|) </th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align:left;font-weight: bold;color: black !important;background-color: white !important;"> (Intercept) </td>
<td style="text-align:right;color: black !important;background-color: white !important;"> 0.0579822 </td>
<td style="text-align:right;color: purple !important;background-color: #EAEEFF !important;color: black !important;background-color: white !important;"> 0.0777279 </td>
<td style="text-align:right;color: black !important;background-color: white !important;"> 0.7459639 </td>
<td style="text-align:right;color: black !important;background-color: white !important;"> 0.4619065 </td>
</tr>
<tr>
<td style="text-align:left;font-weight: bold;"> x </td>
<td style="text-align:right;"> 1.8664054 </td>
<td style="text-align:right;color: purple !important;background-color: #EAEEFF !important;"> 0.1434842 </td>
<td style="text-align:right;"> 13.0077406 </td>
<td style="text-align:right;"> 0.0000000 </td>
</tr>
</tbody>
</table>

---

# Summary

Below are the estimates for the SE( `\(\hat \beta_1\)`):

<table class="table" style="margin-left: auto; margin-right: auto;">
<thead><tr>
<th style="border-bottom:hidden;padding-bottom:0; padding-left:3px;padding-right:3px;text-align: center; " colspan="1"><div style="border-bottom: 1px solid #ddd; padding-bottom: 5px; ">Method</div></th>
<th style="border-bottom:hidden;padding-bottom:0; padding-left:3px;padding-right:3px;text-align: center; " colspan="1"><div style="border-bottom: 1px solid #ddd; padding-bottom: 5px; ">SE value</div></th>
</tr></thead>
<tbody>
<tr>
<td style="text-align:left;"> True value </td>
<td style="text-align:left;"> 0.1621 </td>
</tr>
<tr>
<td style="text-align:left;"> Multiple Simulations </td>
<td style="text-align:left;"> 0.1605 </td>
</tr>
<tr>
<td style="text-align:left;"> Bootstrap </td>
<td style="text-align:left;"> 0.1543 </td>
</tr>
<tr>
<td 
style="text-align:left;"> Jackknife </td>
<td style="text-align:left;"> 0.1604 </td>
</tr>
<tr>
<td style="text-align:left;"> One fit </td>
<td style="text-align:left;"> 0.1435 </td>
</tr>
</tbody>
</table>

???

multiple sims = gold standard

---

class:center, hide-logo, inverse

So while it may not have been obvious off the bat that this method would work, hopefully you are convinced by this simple example.

<img src="https://media.giphy.com/media/D8vbqAZMb8WTm/giphy.gif" width="70%" />

---

# Comments

- This should give you some confidence in the power of bootstrapping!

- Bootstrap and jackknife are used for the same general purposes, but jackknife is older and was primarily used at a time when computing resources were limited.

- As such, bootstrapping is the predominant nonparametric method used for bias and standard error calculations.

---

# Taking it Further

- We alluded to the potential of computing confidence intervals with the jackknife, but with the bootstrap one option for doing so is quite straightforward.

- The `\(B\)` bootstrap estimates of parameter `\(\alpha\)` provide an empirical estimate of the distribution of estimator `\(\hat \alpha\)`.

- So, for example, we can look at a histogram of bootstrapped `\(\hat \beta_1\)`, `hist(bootcoef)`, from our previous example...

---

# Empirical Distribution of `\(\hat \beta_1\)`

<img src="r-figures/lec13/unnamed-chunk-34-1.png" width="85%" style="display: block; margin: auto;" />

???

contains 1000 bootstrap `\(\hat \beta_1\)` estimates...

---

# Bootstrap CI

Suppose I want a 95% bootstrap CI for `\(\beta_1\)`...

<img src="r-figures/lec13/unnamed-chunk-35-1.png" width="85%" style="display: block; margin: auto;" />

???

remember truth is at 2

---

# Bootstrap CI

The simplest method for providing a CI is to take percentiles.

<img src="r-figures/lec13/unnamed-chunk-36-1.png" width="85%" style="display: block; margin: auto;" />

---

# Bootstrap CI

The simplest method for providing a CI is to take percentiles.
<img src="r-figures/lec13/unnamed-chunk-37-1.png" width="85%" style="display: block; margin: auto;" />

---

# Bootstrap CI

.pull-left[
To find the lower bound:

```r
quantile(bootcoef, .025)
```

```
##     2.5% 
## 1.567437
```
]

.pull-right[
To find the upper bound:

```r
quantile(bootcoef, .975)
```

```
##    97.5% 
## 2.182977
```
]

So we have a 95% CI for the true `\(\beta_1\)` lying within the interval:
`$$(1.5674, 2.183)$$`

---

# Bootstrap CI

<img src="r-figures/lec13/unnamed-chunk-40-1.png" width="85%" style="display: block; margin: auto;" />

---

class:inverse

# Bootstrap Acclaim

[Statistical ‘rock star’ wins coveted international prize](https://www.nature.com/articles/d41586-018-07395-w)

> _The bootstrap is rarely the star of statistics, but it is the best supporting actor_

.right[ \- Dr. Bradley Efron]

You can hear Dr. Brad Efron speak about the bootstrap [here](https://www.youtube.com/watch?v=H2tOhMaXWvI&ab_channel=BBVAFoundation)
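
---

# Appendix: Percentile CI From Scratch

As a recap of the steps above, here is a minimal, self-contained sketch of the percentile bootstrap CI for the slope. It regenerates the same simulated data as the slides (seed 123978), but the resampling itself is not re-seeded, so the interval endpoints will differ slightly from the (1.5674, 2.183) computed earlier. The names `B`, `boot_slopes`, and `ci` are illustrative, not from the lecture code.

```r
# Simulate the same SLR data used throughout the lecture
set.seed(123978)
x <- runif(30, 0, 1)
y <- 2*x + rnorm(30, sd = 0.25)

B <- 1000
boot_slopes <- numeric(B)
for (j in 1:B) {
  idx <- sample(1:30, 30, replace = TRUE)   # resample rows with replacement
  boot_slopes[j] <- coef(lm(y[idx] ~ x[idx]))[2]
}

# Percentile method: take the middle 95% of the bootstrap distribution
(ci <- quantile(boot_slopes, c(0.025, 0.975)))
```

Because the interval comes straight from the empirical distribution of `\(\hat \beta_1^{*}\)`, no normality assumption is needed -- that is the appeal of the percentile method.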