class: title-slide, top, center

# Data 311: Machine Learning

## Lecture 11: Clustering

Dr. Irene Vrbik

.left[.footnote[Updated: 2021-11-01]
]

---

# Motivation

Measurements of the waiting time until eruption and the eruption duration for the Old Faithful geyser in Yellowstone National Park.

<img src="r-figures/lec11/unnamed-chunk-2-1.png" width="54%" style="display: block; margin: auto;" />

???

- longer build-up of pressure --> longer eruption time
- No "true groups"
- Natural clustering

---

# Motivation

Simulated data

<img src="r-figures/lec11/unnamed-chunk-3-1.png" width="50%" style="display: block; margin: auto;" />

---

# Wine

- We will also look at the `wine` data from the **gclus** package
- These data comprise 178 Italian wines from three different cultivars that correspond to the wine varietals: `Barolo`, `Grignolino`, `Barbera`
- Recorded are 13 measurements (e.g. alcohol content, malic acid, ash, ...)
- This dataset is often used to test and compare the performance of various classification algorithms.

---

```r
library(gclus)
data(wine)
head(wine)
```

```
##   Class Alcohol Malic  Ash Alcalinity Magnesium Phenols Flavanoids Nonflavanoid
## 1     1   14.23  1.71 2.43       15.6       127    2.80       3.06         0.28
## 2     1   13.20  1.78 2.14       11.2       100    2.65       2.76         0.26
## 3     1   13.16  2.36 2.67       18.6       101    2.80       3.24         0.30
## 4     1   14.37  1.95 2.50       16.8       113    3.85       3.49         0.24
## 5     1   13.24  2.59 2.87       21.0       118    2.80       2.69         0.39
## 6     1   14.20  1.76 2.45       15.2       112    3.27       3.39         0.34
##   Proanthocyanins Intensity  Hue OD280 Proline
## 1            2.29      5.64 1.04  3.92    1065
## 2            1.28      4.38 1.05  3.40    1050
## 3            2.81      5.68 1.03  3.17    1185
## 4            2.18      7.80 0.86  3.45    1480
## 5            1.82      4.32 1.04  2.93     735
## 6            1.97      6.75 1.05  2.85    1450
```

---
class: full-slide-fig, hide-logo

<img src="r-figures/lec11/unnamed-chunk-5-1.png" width="95%" style="display: block; margin: auto;" />

???
- multi-class harder to see

---

# Motivation

The goal of .display[clustering] (a form of unsupervised learning) is to find groups such that all observations are

- more _similar_ to observations _inside their group_ and
- more _dissimilar_ to observations in _other groups_.

Types of clustering algorithms:

- Hierarchical
- Partitioning
- Mixture models

???

- within group similarity is maximized
- mixture models (last lecture?)

---
class: slide-yellow, center, middle

# Hierarchical Clustering

---
layout: true

# Hierarchical Clustering

---

- Once we have information on the distances between all observations in our data set, we're ready to come up with ways to group those observations.

<!-- - I genuinely believe if I threw all of you padded cells with the task of coming up with a method, this would be the one you would come up with! On the face of it, at least, it's pretty straightforward. -->

- Hierarchical clustering (HC), or hierarchical cluster analysis, is in many ways the most straightforward method for finding groups.
- As its name suggests, this method builds a _hierarchy_ of clusters.

???

- **Agglomerative** "bottom-up" approach: each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.
- Not **Divisive** "top-down" approach: all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy.

---

.display[Hierarchical clustering] can be boiled down to the following steps:

1. Start with all observations in their own groups ( `\(n\)` unique groups)
2. Join the two closest groups (at first, the two closest observations), leaving `\(n-1\)` groups
3. Recalculate distances (more on this soon)
4. Repeat 2) and 3) until you are left with only 1 group.
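---

## Sketch in R

The steps above can be sketched as a short loop. This is only an illustration under stated assumptions: the toy data `X` is hypothetical (not from the lecture), and the distance between two groups is taken to be the smallest distance between their members, which is one of several possible choices.

```r
# Hypothetical 1-D toy data, chosen so that every merge is unambiguous.
X <- matrix(c(0, 1, 5, 6.5, 20), ncol = 1,
            dimnames = list(paste0("p", 1:5), NULL))
D <- as.matrix(dist(X))            # pairwise Euclidean distances

# Step 1: every observation starts in its own group.
groups <- as.list(rownames(X))
merge_heights <- numeric(0)

while (length(groups) > 1) {
  # Steps 2-3: find the closest pair of groups. The group distance here is
  # the smallest member-to-member distance (an assumption of this sketch).
  best <- c(NA, NA)
  best_d <- Inf
  for (i in seq_len(length(groups) - 1)) {
    for (j in (i + 1):length(groups)) {
      d_ij <- min(D[groups[[i]], groups[[j]]])
      if (d_ij < best_d) {
        best_d <- d_ij
        best <- c(i, j)
      }
    }
  }
  # Join the two closest groups and record the distance at which they merged.
  groups[[best[1]]] <- c(groups[[best[1]]], groups[[best[2]]])
  groups[[best[2]]] <- NULL
  merge_heights <- c(merge_heights, best_d)
  # Step 4: the loop repeats until one group is left.
}

merge_heights
```

In practice one would probably call `hclust()` from the **stats** package rather than write this loop by hand; the loop is only meant to make the four steps concrete.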
--- ## Simple Example <img src="r-figures/lec11/unnamed-chunk-6-1.png" width="54%" style="display: block; margin: auto;" /> --- ## Simple Example Example distance matrix: ```r dist(x) ``` ``` ## a b c d ## b 1.1180340 ## c 0.5000000 0.7071068 ## d 4.1231056 3.0413813 3.6400549 ## e 4.0311289 2.9154759 3.6055513 1.1180340 ``` --- <table class="table" style="margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:left;"> a </th> <th style="text-align:left;"> b </th> <th style="text-align:left;"> c </th> <th style="text-align:left;"> d </th> <th style="text-align:left;"> e </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;font-weight: bold;color: black !important;background-color: white !important;"> a </td> <td style="text-align:left;color: purple !important;background-color: #EAEEFF !important;color: black !important;background-color: white !important;"> 0 </td> <td style="text-align:left;color: black !important;background-color: white !important;"> </td> <td style="text-align:left;color: black !important;background-color: white !important;"> </td> <td style="text-align:left;color: black !important;background-color: white !important;"> </td> <td style="text-align:left;color: black !important;background-color: white !important;"> </td> </tr> <tr> <td style="text-align:left;font-weight: bold;"> b </td> <td style="text-align:left;color: purple !important;background-color: #EAEEFF !important;"> 1.118 </td> <td style="text-align:left;"> 0 </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> </td> </tr> <tr> <td style="text-align:left;font-weight: bold;color: black !important;background-color: white !important;"> c </td> <td style="text-align:left;color: purple !important;background-color: #EAEEFF !important;color: black !important;background-color: white !important;"> 0.5 </td> <td style="text-align:left;color: black !important;background-color: white 
!important;"> 0.707 </td> <td style="text-align:left;color: black !important;background-color: white !important;"> 0 </td> <td style="text-align:left;color: black !important;background-color: white !important;"> </td> <td style="text-align:left;color: black !important;background-color: white !important;"> </td> </tr> <tr> <td style="text-align:left;font-weight: bold;color: black !important;background-color: white !important;"> d </td> <td style="text-align:left;color: purple !important;background-color: #EAEEFF !important;color: black !important;background-color: white !important;"> 4.123 </td> <td style="text-align:left;color: black !important;background-color: white !important;"> 3.041 </td> <td style="text-align:left;color: black !important;background-color: white !important;"> 3.64 </td> <td style="text-align:left;color: black !important;background-color: white !important;"> 0 </td> <td style="text-align:left;color: black !important;background-color: white !important;"> </td> </tr> <tr> <td style="text-align:left;font-weight: bold;color: black !important;background-color: white !important;"> e </td> <td style="text-align:left;color: purple !important;background-color: #EAEEFF !important;color: black !important;background-color: white !important;"> 4.031 </td> <td style="text-align:left;color: black !important;background-color: white !important;"> 2.915 </td> <td style="text-align:left;color: black !important;background-color: white !important;"> 3.606 </td> <td style="text-align:left;color: black !important;background-color: white !important;"> 1.118 </td> <td style="text-align:left;color: black !important;background-color: white !important;"> 0 </td> </tr> </tbody> </table> So we join observation (a) with observation (c) at the second step of our process. 
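---

## Simple Example

Finding the closest pair can also be done programmatically. The sketch below re-enters the distances printed by `dist(x)` (the coordinates of `x` are not shown in these slides, so the matrix is typed in by hand) and looks up the smallest off-diagonal entry. The names `D`, `D_off`, and `closest` are illustrative.

```r
# Distances copied from the dist(x) output for the simple example.
labs <- c("a", "b", "c", "d", "e")
D <- matrix(0, 5, 5, dimnames = list(labs, labs))
D[lower.tri(D)] <- c(1.1180340, 0.5000000, 4.1231056, 4.0311289,
                     0.7071068, 3.0413813, 2.9154759,
                     3.6400549, 3.6055513,
                     1.1180340)
D <- D + t(D)                      # fill in the upper triangle (symmetric)

# Ignore the zero diagonal, then locate the smallest remaining entry.
D_off <- D
diag(D_off) <- Inf
pos <- which(D_off == min(D_off), arr.ind = TRUE)[1, ]
closest <- c(rownames(D)[pos["row"]], colnames(D)[pos["col"]])

closest      # the pair of observations to join first
min(D_off)   # the distance at which they are joined
```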
---
layout: false

# Step 1
## 5 groups

<img src="r-figures/lec11/unnamed-chunk-9-1.png" width="50%" style="display: block; margin: auto;" />

---

# Step 2
## 4 groups

<img src="r-figures/lec11/unnamed-chunk-10-1.png" width="50%" style="display: block; margin: auto;" />

---

# Recalculating Distances

Now that we've joined the two closest observations ... how do we determine the distance between that group and the others?

<img src="r-figures/lec11/unnamed-chunk-11-1.png" width="50%" style="display: block; margin: auto;" />

---

# Recalculating Distances

For example, what's the distance from the purple group (with observations a, c) to the blue group (with solo observation b)?

<img src="r-figures/lec11/unnamed-chunk-12-1.png" width="50%" style="display: block; margin: auto;" />

---

# Recalculating Distances

There are 3 common ways of recalculating distances between groups (also called linkages):

1. .display[Single linkage:] define distance as that between the _closest_ observations, one from each group. For this example, `\(d_{\{ac\}\{b\}}=d_{cb}\)`.
1. .display[Complete linkage:] define distance as that between the _furthest_ observations, one from each group. For this example, `\(d_{\{ac\}\{b\}}=d_{ab}\)`.
1. .display[Average linkage:] define distance as the average of all pairwise distances between observations in one group and observations in the other. For this example, `\(d_{\{ac\}\{b\}}=\frac{d_{ab}+d_{cb}}{2}\)`.

???
Complete linkage might seem odd but it often works well --- class: slide-footnote # Single Linkage .pull-left[ <img src="r-figures/lec11/unnamed-chunk-13-1.png" width="95%" style="display: block; margin: auto;" /> `\(d_{\{ac\}\{b\}}=d_{cb}\)` ] .pull-right[ <table class="table" style="margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:left;"> a </th> <th style="text-align:left;"> b </th> <th style="text-align:left;"> c </th> <th style="text-align:left;"> d </th> <th style="text-align:left;"> e </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;font-weight: bold;color: black !important;background-color: white !important;"> a </td> <td style="text-align:left;color: black !important;background-color: white !important;"> 0 </td> <td style="text-align:left;color: purple !important;background-color: #EAEEFF !important;color: black !important;background-color: white !important;"> </td> <td style="text-align:left;color: black !important;background-color: white !important;"> </td> <td style="text-align:left;color: black !important;background-color: white !important;"> </td> <td style="text-align:left;color: black !important;background-color: white !important;"> </td> </tr> <tr> <td style="text-align:left;font-weight: bold;color: black !important;background-color: white !important;"> b </td> <td style="text-align:left;color: black !important;background-color: white !important;"> 1.118 </td> <td style="text-align:left;color: purple !important;background-color: #EAEEFF !important;color: black !important;background-color: white !important;"> 0 </td> <td style="text-align:left;color: black !important;background-color: white !important;"> </td> <td style="text-align:left;color: black !important;background-color: white !important;"> </td> <td style="text-align:left;color: black !important;background-color: white !important;"> </td> </tr> <tr> <td style="text-align:left;font-weight: bold;"> c </td> <td 
style="text-align:left;"> 0.5 </td> <td style="text-align:left;color: purple !important;background-color: #EAEEFF !important;"> 0.707 </td> <td style="text-align:left;"> 0 </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> </td> </tr> <tr> <td style="text-align:left;font-weight: bold;color: black !important;background-color: white !important;"> d </td> <td style="text-align:left;color: black !important;background-color: white !important;"> 4.123 </td> <td style="text-align:left;color: purple !important;background-color: #EAEEFF !important;color: black !important;background-color: white !important;"> 3.041 </td> <td style="text-align:left;color: black !important;background-color: white !important;"> 3.64 </td> <td style="text-align:left;color: black !important;background-color: white !important;"> 0 </td> <td style="text-align:left;color: black !important;background-color: white !important;"> </td> </tr> <tr> <td style="text-align:left;font-weight: bold;color: black !important;background-color: white !important;"> e </td> <td style="text-align:left;color: black !important;background-color: white !important;"> 4.031 </td> <td style="text-align:left;color: purple !important;background-color: #EAEEFF !important;color: black !important;background-color: white !important;"> 2.915 </td> <td style="text-align:left;color: black !important;background-color: white !important;"> 3.606 </td> <td style="text-align:left;color: black !important;background-color: white !important;"> 1.118 </td> <td style="text-align:left;color: black !important;background-color: white !important;"> 0 </td> </tr> </tbody> </table> <br> <table class="table" style="margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:left;"> ac </th> <th style="text-align:left;"> b </th> <th style="text-align:left;"> d </th> <th style="text-align:left;"> e </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;font-weight: bold;"> ac 
</td> <td style="text-align:left;"> 0 </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> </td> </tr> <tr> <td style="text-align:left;font-weight: bold;"> b </td> <td style="text-align:left;"> 0.707 </td> <td style="text-align:left;"> 0 </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> </td> </tr> <tr> <td style="text-align:left;font-weight: bold;"> d </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> 0 </td> <td style="text-align:left;"> </td> </tr> <tr> <td style="text-align:left;font-weight: bold;"> e </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> 0 </td> </tr> </tbody> </table> ] --- class: slide-footnote # Single Linkage .pull-left[ <img src="r-figures/lec11/unnamed-chunk-16-1.png" width="95%" style="display: block; margin: auto;" /> `\(d_{\{ac\}\{e\}}=d_{ce}\)` ] .pull-right[ <table class="table" style="margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:left;"> a </th> <th style="text-align:left;"> b </th> <th style="text-align:left;"> c </th> <th style="text-align:left;"> d </th> <th style="text-align:left;"> e </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;font-weight: bold;color: black !important;background-color: white !important;"> a </td> <td style="text-align:left;color: black !important;background-color: white !important;"> 0 </td> <td style="text-align:left;color: black !important;background-color: white !important;"> </td> <td style="text-align:left;color: purple !important;background-color: #EAEEFF !important;color: black !important;background-color: white !important;"> </td> <td style="text-align:left;color: black !important;background-color: white !important;"> </td> <td style="text-align:left;color: black !important;background-color: white 
!important;"> </td> </tr> <tr> <td style="text-align:left;font-weight: bold;color: black !important;background-color: white !important;"> b </td> <td style="text-align:left;color: black !important;background-color: white !important;"> 1.118 </td> <td style="text-align:left;color: black !important;background-color: white !important;"> 0 </td> <td style="text-align:left;color: purple !important;background-color: #EAEEFF !important;color: black !important;background-color: white !important;"> </td> <td style="text-align:left;color: black !important;background-color: white !important;"> </td> <td style="text-align:left;color: black !important;background-color: white !important;"> </td> </tr> <tr> <td style="text-align:left;font-weight: bold;color: black !important;background-color: white !important;"> c </td> <td style="text-align:left;color: black !important;background-color: white !important;"> 0.5 </td> <td style="text-align:left;color: black !important;background-color: white !important;"> 0.707 </td> <td style="text-align:left;color: purple !important;background-color: #EAEEFF !important;color: black !important;background-color: white !important;"> 0 </td> <td style="text-align:left;color: black !important;background-color: white !important;"> </td> <td style="text-align:left;color: black !important;background-color: white !important;"> </td> </tr> <tr> <td style="text-align:left;font-weight: bold;color: black !important;background-color: white !important;"> d </td> <td style="text-align:left;color: black !important;background-color: white !important;"> 4.123 </td> <td style="text-align:left;color: black !important;background-color: white !important;"> 3.041 </td> <td style="text-align:left;color: purple !important;background-color: #EAEEFF !important;color: black !important;background-color: white !important;"> 3.64 </td> <td style="text-align:left;color: black !important;background-color: white !important;"> 0 </td> <td style="text-align:left;color: black 
!important;background-color: white !important;"> </td> </tr> <tr> <td style="text-align:left;font-weight: bold;"> e </td> <td style="text-align:left;"> 4.031 </td> <td style="text-align:left;"> 2.915 </td> <td style="text-align:left;color: purple !important;background-color: #EAEEFF !important;"> 3.606 </td> <td style="text-align:left;"> 1.118 </td> <td style="text-align:left;"> 0 </td> </tr> </tbody> </table> <br> <table class="table" style="margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:left;"> ac </th> <th style="text-align:left;"> b </th> <th style="text-align:left;"> d </th> <th style="text-align:left;"> e </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;font-weight: bold;"> ac </td> <td style="text-align:left;"> 0 </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> </td> </tr> <tr> <td style="text-align:left;font-weight: bold;"> b </td> <td style="text-align:left;"> 0.707 </td> <td style="text-align:left;"> 0 </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> </td> </tr> <tr> <td style="text-align:left;font-weight: bold;"> d </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> 0 </td> <td style="text-align:left;"> </td> </tr> <tr> <td style="text-align:left;font-weight: bold;"> e </td> <td style="text-align:left;"> 3.606 </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> 0 </td> </tr> </tbody> </table> ] --- class: slide-footnote # Single Linkage .pull-left[ <img src="r-figures/lec11/unnamed-chunk-19-1.png" width="95%" style="display: block; margin: auto;" /> `\(d_{\{ac\}\{d\}}=d_{cd}\)` ] .pull-right[ <table class="table" style="margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:left;"> a </th> <th style="text-align:left;"> b </th> <th 
style="text-align:left;"> c </th> <th style="text-align:left;"> d </th> <th style="text-align:left;"> e </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;font-weight: bold;color: black !important;background-color: white !important;"> a </td> <td style="text-align:left;color: black !important;background-color: white !important;"> 0 </td> <td style="text-align:left;color: black !important;background-color: white !important;"> </td> <td style="text-align:left;color: purple !important;background-color: #EAEEFF !important;color: black !important;background-color: white !important;"> </td> <td style="text-align:left;color: black !important;background-color: white !important;"> </td> <td style="text-align:left;color: black !important;background-color: white !important;"> </td> </tr> <tr> <td style="text-align:left;font-weight: bold;color: black !important;background-color: white !important;"> b </td> <td style="text-align:left;color: black !important;background-color: white !important;"> 1.118 </td> <td style="text-align:left;color: black !important;background-color: white !important;"> 0 </td> <td style="text-align:left;color: purple !important;background-color: #EAEEFF !important;color: black !important;background-color: white !important;"> </td> <td style="text-align:left;color: black !important;background-color: white !important;"> </td> <td style="text-align:left;color: black !important;background-color: white !important;"> </td> </tr> <tr> <td style="text-align:left;font-weight: bold;color: black !important;background-color: white !important;"> c </td> <td style="text-align:left;color: black !important;background-color: white !important;"> 0.5 </td> <td style="text-align:left;color: black !important;background-color: white !important;"> 0.707 </td> <td style="text-align:left;color: purple !important;background-color: #EAEEFF !important;color: black !important;background-color: white !important;"> 0 </td> <td style="text-align:left;color: black 
!important;background-color: white !important;"> </td> <td style="text-align:left;color: black !important;background-color: white !important;"> </td> </tr> <tr> <td style="text-align:left;font-weight: bold;"> d </td> <td style="text-align:left;"> 4.123 </td> <td style="text-align:left;"> 3.041 </td> <td style="text-align:left;color: purple !important;background-color: #EAEEFF !important;"> 3.64 </td> <td style="text-align:left;"> 0 </td> <td style="text-align:left;"> </td> </tr> <tr> <td style="text-align:left;font-weight: bold;color: black !important;background-color: white !important;"> e </td> <td style="text-align:left;color: black !important;background-color: white !important;"> 4.031 </td> <td style="text-align:left;color: black !important;background-color: white !important;"> 2.915 </td> <td style="text-align:left;color: purple !important;background-color: #EAEEFF !important;color: black !important;background-color: white !important;"> 3.606 </td> <td style="text-align:left;color: black !important;background-color: white !important;"> 1.118 </td> <td style="text-align:left;color: black !important;background-color: white !important;"> 0 </td> </tr> </tbody> </table> <br> <table class="table" style="margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:left;"> ac </th> <th style="text-align:left;"> b </th> <th style="text-align:left;"> d </th> <th style="text-align:left;"> e </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;font-weight: bold;color: black !important;background-color: white !important;"> ac </td> <td style="text-align:left;color: purple !important;background-color: #EAEEFF !important;color: black !important;background-color: white !important;"> 0 </td> <td style="text-align:left;color: black !important;background-color: white !important;"> </td> <td style="text-align:left;color: black !important;background-color: white !important;"> </td> <td style="text-align:left;color: black 
!important;background-color: white !important;"> </td> </tr> <tr> <td style="text-align:left;font-weight: bold;color: black !important;background-color: white !important;"> b </td> <td style="text-align:left;color: purple !important;background-color: #EAEEFF !important;color: black !important;background-color: white !important;"> 0.707 </td> <td style="text-align:left;color: black !important;background-color: white !important;"> 0 </td> <td style="text-align:left;color: black !important;background-color: white !important;"> </td> <td style="text-align:left;color: black !important;background-color: white !important;"> </td> </tr> <tr> <td style="text-align:left;font-weight: bold;"> d </td> <td style="text-align:left;color: purple !important;background-color: #EAEEFF !important;"> 3.64 </td> <td style="text-align:left;"> 3.041 </td> <td style="text-align:left;"> 0 </td> <td style="text-align:left;"> </td> </tr> <tr> <td style="text-align:left;font-weight: bold;color: black !important;background-color: white !important;"> e </td> <td style="text-align:left;color: purple !important;background-color: #EAEEFF !important;color: black !important;background-color: white !important;"> 3.606 </td> <td style="text-align:left;color: black !important;background-color: white !important;"> 2.915 </td> <td style="text-align:left;color: black !important;background-color: white !important;"> 1.118 </td> <td style="text-align:left;color: black !important;background-color: white !important;"> 0 </td> </tr> </tbody> </table> ] --- # Step 3 ## 3 groups .pull-left[ <img src="r-figures/lec11/unnamed-chunk-22-1.png" width="65%" style="display: block; margin: auto;" /> ] .pull-right[ <table class="table" style="margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:left;"> ac </th> <th style="text-align:left;"> b </th> <th style="text-align:left;"> d </th> <th style="text-align:left;"> e </th> </tr> </thead> <tbody> <tr> <td 
style="text-align:left;font-weight: bold;color: black !important;background-color: white !important;"> ac </td> <td style="text-align:left;color: purple !important;background-color: #EAEEFF !important;color: black !important;background-color: white !important;"> 0 </td> <td style="text-align:left;color: black !important;background-color: white !important;"> </td> <td style="text-align:left;color: black !important;background-color: white !important;"> </td> <td style="text-align:left;color: black !important;background-color: white !important;"> </td> </tr> <tr> <td style="text-align:left;font-weight: bold;"> b </td> <td style="text-align:left;color: purple !important;background-color: #EAEEFF !important;"> 0.707 </td> <td style="text-align:left;"> 0 </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> </td> </tr> <tr> <td style="text-align:left;font-weight: bold;color: black !important;background-color: white !important;"> d </td> <td style="text-align:left;color: purple !important;background-color: #EAEEFF !important;color: black !important;background-color: white !important;"> 3.64 </td> <td style="text-align:left;color: black !important;background-color: white !important;"> 3.041 </td> <td style="text-align:left;color: black !important;background-color: white !important;"> 0 </td> <td style="text-align:left;color: black !important;background-color: white !important;"> </td> </tr> <tr> <td style="text-align:left;font-weight: bold;color: black !important;background-color: white !important;"> e </td> <td style="text-align:left;color: purple !important;background-color: #EAEEFF !important;color: black !important;background-color: white !important;"> 3.606 </td> <td style="text-align:left;color: black !important;background-color: white !important;"> 2.915 </td> <td style="text-align:left;color: black !important;background-color: white !important;"> 1.118 </td> <td style="text-align:left;color: black !important;background-color: white !important;"> 0 
</td> </tr> </tbody> </table> ] Closest groups are the purple group \{ac\} and blue group \{b\}. --- # Step 3 ## 3 groups .pull-left[ <img src="r-figures/lec11/unnamed-chunk-24-1.png" width="65%" style="display: block; margin: auto;" /> ] .pull-right[ <table class="table" style="margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:left;"> abc </th> <th style="text-align:left;"> d </th> <th style="text-align:left;"> e </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;font-weight: bold;"> abc </td> <td style="text-align:left;"> 0 </td> <td style="text-align:left;"> </td> <td style="text-align:left;"> </td> </tr> <tr> <td style="text-align:left;font-weight: bold;"> d </td> <td style="text-align:left;"> 3.041 </td> <td style="text-align:left;"> 0 </td> <td style="text-align:left;"> </td> </tr> <tr> <td style="text-align:left;font-weight: bold;"> e </td> <td style="text-align:left;"> 2.915 </td> <td style="text-align:left;"> 1.118 </td> <td style="text-align:left;"> 0 </td> </tr> </tbody> </table> ] We join those now and recalculate the distance matrix. 
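---

# Recalculating Distances
## Sketch in R

The recalculation step can be written as a small function. This is a sketch only: `group_dist` is a hypothetical helper (not a base R function), and the distances are typed in from the `dist(x)` output of the simple example.

```r
# Distances copied from the dist(x) output for the simple example.
labs <- c("a", "b", "c", "d", "e")
D <- matrix(0, 5, 5, dimnames = list(labs, labs))
D[lower.tri(D)] <- c(1.1180340, 0.5000000, 4.1231056, 4.0311289,
                     0.7071068, 3.0413813, 2.9154759,
                     3.6400549, 3.6055513,
                     1.1180340)
D <- D + t(D)

# Distance between two groups of observations under a given linkage.
group_dist <- function(D, g1, g2,
                       linkage = c("single", "complete", "average")) {
  linkage <- match.arg(linkage)
  pairwise <- D[g1, g2, drop = FALSE]   # all between-group distances
  switch(linkage,
         single   = min(pairwise),      # closest pair
         complete = max(pairwise),      # furthest pair
         average  = mean(pairwise))     # mean over all pairs
}

# {a, c} versus {b} under each of the three linkages
sapply(c("single", "complete", "average"),
       function(l) group_dist(D, c("a", "c"), "b", l))

# Single-linkage distances from {a, b, c} to d and to e
round(c(d = group_dist(D, c("a", "b", "c"), "d"),
        e = group_dist(D, c("a", "b", "c"), "e")), 3)
```

The last call should agree with the `abc` row entries of the recalculated matrix (3.041 and 2.915).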
--- # Step 4 ## 2 groups .pull-left[ <img src="r-figures/lec11/unnamed-chunk-26-1.png" width="65%" style="display: block; margin: auto;" /> ] .pull-right[ <table class="table" style="margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:left;"> abc </th> <th style="text-align:left;"> d </th> <th style="text-align:left;"> e </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;font-weight: bold;color: black !important;background-color: white !important;"> abc </td> <td style="text-align:left;color: black !important;background-color: white !important;"> 0 </td> <td style="text-align:left;color: purple !important;background-color: #EAEEFF !important;color: black !important;background-color: white !important;"> </td> <td style="text-align:left;color: black !important;background-color: white !important;"> </td> </tr> <tr> <td style="text-align:left;font-weight: bold;color: black !important;background-color: white !important;"> d </td> <td style="text-align:left;color: black !important;background-color: white !important;"> 3.041 </td> <td style="text-align:left;color: purple !important;background-color: #EAEEFF !important;color: black !important;background-color: white !important;"> 0 </td> <td style="text-align:left;color: black !important;background-color: white !important;"> </td> </tr> <tr> <td style="text-align:left;font-weight: bold;"> e </td> <td style="text-align:left;"> 2.915 </td> <td style="text-align:left;color: purple !important;background-color: #EAEEFF !important;"> 1.118 </td> <td style="text-align:left;"> 0 </td> </tr> </tbody> </table> ] Closest groups are the yellow group \{e\} and green group \{d\}. 
--- # Step 4 ## 2 groups .pull-left[ <img src="r-figures/lec11/unnamed-chunk-28-1.png" width="65%" style="display: block; margin: auto;" /> ] .pull-right[ <table class="table" style="margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:left;"> abc </th> <th style="text-align:left;"> de </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;font-weight: bold;"> abc </td> <td style="text-align:left;"> 0 </td> <td style="text-align:left;"> </td> </tr> <tr> <td style="text-align:left;font-weight: bold;"> de </td> <td style="text-align:left;"> 2.915 </td> <td style="text-align:left;"> 0 </td> </tr> </tbody> </table> ] We join those now and recalculate the distance matrix. --- # Step 5 ## 1 group .pull-left[ <img src="r-figures/lec11/unnamed-chunk-30-1.png" width="65%" style="display: block; margin: auto;" /> ] .pull-right[ <table class="table" style="margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:left;"> abc </th> <th style="text-align:left;"> de </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;font-weight: bold;color: black !important;background-color: white !important;"> abc </td> <td style="text-align:left;color: purple !important;background-color: #EAEEFF !important;color: black !important;background-color: white !important;"> 0 </td> <td style="text-align:left;color: black !important;background-color: white !important;"> </td> </tr> <tr> <td style="text-align:left;font-weight: bold;"> de </td> <td style="text-align:left;color: purple !important;background-color: #EAEEFF !important;"> 2.915 </td> <td style="text-align:left;"> 0 </td> </tr> </tbody> </table> ] Closest groups are \{abc\} and \{de\} (only groups left to join). 
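---

# Simple Example
## Checking with `hclust`

The hand calculation can be checked against R's built-in `hclust()`. As a sketch, the distances below are typed in from the `dist(x)` output and converted with `as.dist()`; the object names `D` and `hc` are illustrative.

```r
# Distances copied from the dist(x) output for the simple example.
labs <- c("a", "b", "c", "d", "e")
D <- matrix(0, 5, 5, dimnames = list(labs, labs))
D[lower.tri(D)] <- c(1.1180340, 0.5000000, 4.1231056, 4.0311289,
                     0.7071068, 3.0413813, 2.9154759,
                     3.6400549, 3.6055513,
                     1.1180340)
D <- D + t(D)

# Agglomerative clustering with single linkage on the precomputed distances.
hc <- hclust(as.dist(D), method = "single")

# Distances at which the successive merges happen.
round(hc$height, 3)
```

The heights should match the distances at which groups were joined in the walkthrough: 0.5, 0.707, 1.118, and 2.915.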
---

# Step 5
## 1 group

.pull-left[
- We join those now
- Notice how the final step has a single group, with all observations belonging to that single group
]

.pull-right[
<img src="r-figures/lec11/unnamed-chunk-32-1.png" width="85%" style="display: block; margin: auto;" />
]

---

# Hierarchical Clustering

- What we've done is provide a _hierarchy_ of potential solutions to the following non-trivial questions:
    - What would 2 groups look like in this data?
    - What would 3 groups look like in this data?
    - What would 4 groups look like in this data?
- This doesn't answer the more important question: _how many groups are in this data and what do they look like?_
- Also, do we really need to go through the algorithm step-by-step to see the process?

---

# Hierarchical Clustering

- The solution to both issues is one and the same.
- We can summarize the algorithm graphically, using what's called a .display[dendrogram]
- On the X-axis, we just have our observations
- On the Y-axis, we have the distance at which the groups are combined

---

# Dendrogram for our Simple Example

<img src="r-figures/lec11/unnamed-chunk-33-1.png" width="85%" style="display: block; margin: auto;" />

---

# Hierarchical Clustering

- Dendrograms show which observations were joined at what distance.
- They can also be used to determine how many groups are present in the data.
- Large "jumps" between combining groups signify large distances between groups.
- So generally, we can "cut" at the largest jump in distance, and that will determine the number of groups, as well as each observation's group membership.

---

# Dendrogram for our Simple Example

<img src="r-figures/lec11/unnamed-chunk-34-1.png" width="85%" style="display: block; margin: auto;" />

---

# Dendrogram for our Simple Example

Which matches our intuition ...
.pull-left[
<img src="r-figures/lec11/unnamed-chunk-35-1.png" width="95%" style="display: block; margin: auto;" />
]

.pull-right[
<img src="r-figures/lec11/unnamed-chunk-36-1.png" width="95%" style="display: block; margin: auto;" />
]

---

# Old Faithful

- Let's apply HC to some more interesting data sets...

<img src="r-figures/lec11/unnamed-chunk-37-1.png" width="54%" style="display: block; margin: auto;" />

---
class: full-fig-slide

Here is the dendrogram on the .alert[raw] distance matrix (calculated using single linkage)

<img src="r-figures/lec11/unnamed-chunk-38-1.png" width="99%" style="display: block; margin: auto;" />

???

- eruption time: 1.6-5.1 min (var 1.30)
- waiting time: 43-96 min (var 184.82)
- essentially clustering on waiting time

---
class: full-fig-slide

Here is the two group solution from that model:

<img src="r-figures/lec11/unnamed-chunk-39-1.png" width="99%" style="display: block; margin: auto;" />

---
class: full-fig-slide

In case you couldn't find it, here's the solo observation in group 2

<img src="r-figures/lec11/unnamed-chunk-40-1.png" width="99%" style="display: block; margin: auto;" />

---
class: full-fig-slide

Here is the three group solution from that model:

<img src="r-figures/lec11/unnamed-chunk-41-1.png" width="99%" style="display: block; margin: auto;" />

---
class: full-fig-slide

Here is the four group solution from that model:

<img src="r-figures/lec11/unnamed-chunk-42-1.png" width="99%" style="display: block; margin: auto;" />

---
class: full-fig-slide

Here is the dendrogram on the _standardized_ Euclidean distance matrix (calculated using single linkage and the `scale` function)

<img src="r-figures/lec11/unnamed-chunk-43-1.png" width="99%" style="display: block; margin: auto;" />

---
class: full-fig-slide

# Old Faithful

Once we perform HC on the scaled data, we get a two-group solution that matches our intuition much better ...
<img src="r-figures/lec11/unnamed-chunk-44-1.png" width="54%" style="display: block; margin: auto;" />

---

# Tori

.pull-left-narrow[
For "easy" data sets, the linkage method may not affect the end result...
]

.pull-right-wide[
<img src="r-figures/lec11/unnamed-chunk-45-1.png" width="80%" style="display: block; margin: auto;" />
]

???

- same scale for x and y, so we don't need to standardize, but you can if you want.

---

# Tori

<img src="r-figures/lec11/unnamed-chunk-46-1.png" width="99%" style="display: block; margin: auto;" />

For example, both produce the identical four-group solution.

---

# Tori

.pull-left-narrow[
Which, again, gives the solution we expect...
]

.pull-right-wide[
<img src="r-figures/lec11/unnamed-chunk-47-1.png" width="80%" style="display: block; margin: auto;" />
]

---
class: full-fig-slide, hide-logo

<img src="r-figures/lec11/unnamed-chunk-48-1.png" width="95%" style="display: block; margin: auto;" />

---
class: full-fig-slide, hide-logo

# Wine

<img src="r-figures/lec11/unnamed-chunk-49-1.png" width="100%" style="display: block; margin: auto;" />

Linkage method .alert[will] affect more difficult clustering problems...

???

- Single-linkage clustering is susceptible to an effect known as "chaining" whereby poorly separated, but distinct clusters are merged at an early stage (eg in lab)
- average will lead to a bunch of solitary obs groups (generally bad sign)
- Complete linkage gives the only clear-ish clustering result, suggesting probably 2 or 3 groups. We'll come back to this later.
- Dendrogram well behaved for complete
- way more, but these are popular

---

# Hierarchical Clustering: Problems

## Some cons

- Distance matrices (of size `\(n \times n\)`, symmetric) must be calculated.
- For very large samples, this can be time consuming (even for computers).
- Results are often sensitive to what distance type and what linkage method are used.

.blue[**Pro**]

- easy to understand and implement

???
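---

# Hierarchical Clustering: Scaling in R

The Old Faithful comparison (raw versus standardized distances, single linkage) can be sketched with the built-in `faithful` data. Treat this as an illustration under assumptions: the figures in these slides may have been produced with different code, and the object names below are made up for the example.

```r
# Built-in Old Faithful data: eruptions (min) and waiting (min)
x_raw    <- faithful
x_scaled <- scale(faithful)          # centre and scale each column

# Single-linkage HC on Euclidean distances, raw vs. standardized
hc_raw    <- hclust(dist(x_raw),    method = "single")
hc_scaled <- hclust(dist(x_scaled), method = "single")

# Cut each dendrogram into two groups
g_raw    <- cutree(hc_raw,    k = 2)
g_scaled <- cutree(hc_scaled, k = 2)

# Compare group sizes and cross-tabulate the two solutions
table(g_raw)
table(g_scaled)
table(raw = g_raw, scaled = g_scaled)

# plot(hc_scaled)  # draws the dendrogram
```

Swapping `method = "single"` for `"complete"` or `"average"` is a quick way to see how sensitive the result is to the linkage choice.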
---
class: slide-yellow, center, middle

# `\(k\)`-means clustering

---

# `\(k\)`-means Clustering

- Moving on from hierarchical methods, we'll now consider a method based on partitioning data.
- `\(k\)`-means clustering is a popular method that requires the user to provide the number of groups they are looking for --- though we will discuss one way of seeking evidence for the number of groups.
- In this algorithm, each observation will belong to the cluster whose mean is closest.

???

- most popular
- over-used and abused
- we will talk about how to choose `\(k\)` later

---

# `\(k\)`-means Algorithm

## Steps:

1. Randomly select `\(k\)` (the number of groups) points in your data. These will serve as the first .display[centroids].
2. Assign all observations to their closest centroid. You now have `\(k\)` groups.
3. Calculate the means of each group. These are your new centroids.
4. Repeat 2) and 3) until nothing changes anymore.

???

- common style of clustering algorithm
- 2) assumes group labels are missing, centroids are known
- 3) assumes centroids are missing, labels are known
- each loop is called an iteration
- minimizing within group sum of squares

---

## Step 1

Randomly select 2 centroids ( `\(k\)` = 2)

<img src="r-figures/lec11/unnamed-chunk-50-1.png" width="60%" style="display: block; margin: auto;" />

---

## Step 2

Assign obs. to their closest centroid

<img src="r-figures/lec11/unnamed-chunk-51-1.png" width="60%" style="display: block; margin: auto;" />

---

## Step 3

Recalculate the means of each group. These are your new centroids.

<img src="r-figures/lec11/unnamed-chunk-52-1.png" width="60%" style="display: block; margin: auto;" />

---

## Step 4

Repeat 2) and 3) until nothing changes anymore.

<img src="r-figures/lec11/unnamed-chunk-53-.gif" width="60%" style="display: block; margin: auto;" />

---

## Step 1

Randomly select 3 observations ( `\(k\)` = 3).
These will be your centroids for the 3 groups

<img src="r-figures/lec11/unnamed-chunk-54-1.png" width="60%" style="display: block; margin: auto;" />

???

- wrong number of groups

---

## Step 2

Assign obs. to their closest centroid

<img src="r-figures/lec11/unnamed-chunk-55-1.png" width="60%" style="display: block; margin: auto;" />

---

## Step 3

Recalculate the means of each group. These are your new centroids.

<img src="r-figures/lec11/unnamed-chunk-56-1.png" width="60%" style="display: block; margin: auto;" />

---

## Step 4

Repeat 2) and 3) until nothing changes anymore.

<img src="r-figures/lec11/unnamed-chunk-57-.gif" width="60%" style="display: block; margin: auto;" />

---

Let's do the 3-group `\(k\)`-means again:

<img src="r-figures/lec11/unnamed-chunk-58-.gif" width="60%" style="display: block; margin: auto;" />

---

## 3 group comparison

This should make the difference in results clearer...

.pull-left[
<img src="r-figures/lec11/unnamed-chunk-59-1.png" width="98%" style="display: block; margin: auto;" />
]

.pull-right[
<img src="r-figures/lec11/unnamed-chunk-60-1.png" width="98%" style="display: block; margin: auto;" />
]

---

# Tori data

## Step 1

<img src="r-figures/lec11/unnamed-chunk-61-1.png" width="54%" style="display: block; margin: auto;" />

---

## Step 2

Assign obs. to their closest centroid

<img src="r-figures/lec11/unnamed-chunk-62-1.png" width="60%" style="display: block; margin: auto;" />

---

## Step 3

Recalculate the means of each group. These are your new centroids.

<img src="r-figures/lec11/unnamed-chunk-63-1.png" width="60%" style="display: block; margin: auto;" />

---

## Step 4

Repeat 2) and 3) until nothing changes anymore.

<img src="r-figures/lec11/unnamed-chunk-64-.gif" width="60%" style="display: block; margin: auto;" />

---

# Why does the `\(k\)`-means algorithm work?

1. Randomly select `\(k\)` (the number of groups) points in your data. These will serve as the first centroids.
2. Assign all observations to their closest centroid.
   You now have `\(k\)` groups.
3. Calculate the means of each group. These are your new centroids.
4. Repeat 2) and 3) until nothing changes anymore.

Steps 2 and 3 each reduce (or leave unchanged) the within-group sum of squared distances between observations and their centroids, so repeating them converges to a (local) minimum.

---
class: slide-small

.pull-left[
# Pros

- Computationally efficient (fast to run), even for large data sets.
- Only an `\(n \times k\)` distance matrix is needed (distances between `\(n\)` observations and `\(k\)` centroids).
- Relatively straightforward concept.
- Often provides clearer groups than HC.
]

.pull-right[
# Cons

- As we saw, where the algorithm (randomly) starts can affect the results (local optimization rather than global)
- Groups will be found no matter what --- even if there are no groups present in the data.
]

???

- global optimizers will find the best solution no matter what, but they are very expensive
- we'll do a bunch of random starts and compare

---

# Wine

- Let's return to the `wine` dataset
- Let's apply HC and `\(k\)`-means and then investigate the results...
- Recall complete linkage suggested probably 2 or 3 groups.

---

<img src="r-figures/lec11/unnamed-chunk-65-1.png" width="99%" style="display: block; margin: auto;" />

---
class: full-slide-fig, hide-logo

<img src="r-figures/lec11/unnamed-chunk-66-1.png" width="95%" style="display: block; margin: auto;" />

---
class: full-slide-fig, hide-logo

<img src="r-figures/lec11/unnamed-chunk-67-1.png" width="95%" style="display: block; margin: auto;" />

---
class: full-slide-fig, hide-logo

<img src="r-figures/lec11/unnamed-chunk-68-1.png" width="95%" style="display: block; margin: auto;" />

---
class: full-slide-fig, hide-logo

<img src="r-figures/lec11/unnamed-chunk-69-1.png" width="95%" style="display: block; margin: auto;" />

---

# Comparing Results

- By looking at the plots, it seems as though the results are pretty similar between `\(k\)`-means and HC.
- The resolution makes it tough to tell, but the solutions are not identical.
- Both methods are finding most of the group structure, but...
  - `\(k\)`-means only misclassifies 6 of the 178 wines
  - HC misclassifies 29 of them

???

- we actually have known classes (but hid them in training)
- benchmark data set to see how we did
- barolo is more expensive (food authenticity problem)

---

# Comparing Results

.pull-left[
## HC

<table class="table" style="margin-left: auto; margin-right: auto;">
 <thead>
  <tr> <th style="text-align:left;"> </th> <th style="text-align:right;"> 1 </th> <th style="text-align:right;"> 2 </th> <th style="text-align:right;"> 3 </th> </tr>
 </thead>
 <tbody>
  <tr> <td style="text-align:left;"> Barbera </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 48 </td> </tr>
  <tr> <td style="text-align:left;"> Barolo </td> <td style="text-align:right;"> 51 </td> <td style="text-align:right;"> 8 </td> <td style="text-align:right;"> 0 </td> </tr>
  <tr> <td style="text-align:left;"> Grignolino </td> <td style="text-align:right;"> 18 </td> <td style="text-align:right;"> 50 </td> <td style="text-align:right;"> 3 </td> </tr>
 </tbody>
</table>
]

.pull-right[
## `\(k\)`-means

<table class="table" style="margin-left: auto; margin-right: auto;">
 <thead>
  <tr> <th style="text-align:left;"> </th> <th style="text-align:right;"> 1 </th> <th style="text-align:right;"> 2 </th> <th style="text-align:right;"> 3 </th> </tr>
 </thead>
 <tbody>
  <tr> <td style="text-align:left;"> Barbera </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 48 </td> <td style="text-align:right;"> 0 </td> </tr>
  <tr> <td style="text-align:left;"> Barolo </td> <td style="text-align:right;"> 59 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> </tr>
  <tr> <td style="text-align:left;"> Grignolino </td> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 65 </td> </tr>
 </tbody>
</table>
<br>
]

.alert[Notice] how
the correctly labeled observations do not necessarily fall on the main diagonal.

---

# Partitioning and Hierarchical

- Neither HC nor `\(k\)`-means can really be considered "statistical" in the sense of:
  - What model are we assuming?
  - How can we test whether the model is accurate?
- They are basically _ad hoc_ methods that mostly follow from intuition and tend to perform alright.
- Eventually, we'll see clustering via mixture models, the unsupervised equivalent of discriminant analysis.

???

goodness of fit test (even if we don't have the truth)

---

# Dealing with Random Starts

- As mentioned, `\(k\)`-means minimizes the within group sum of squares (WSS); however, it does so locally (not globally) <!-- so as a local minimizer, not global. -->
- Furthermore, since it's initialized randomly, results can change from run to run.
- Importantly: it's fast! So we can run it many times, and choose the solution with the smallest within-group sum of squares for those runs.
- In fact, there's a built-in argument in R...

---

# Iris

- This famous (Fisher's or Anderson's) iris data set gives the measurements in centimeters of the following four variables:
  - sepal length and width
  - petal length and width
- Data set contains the measurements on 50 flowers from each of 3 species of iris.
- The species are Iris `setosa`, `versicolor`, and `virginica`.

---

<img src="r-figures/lec11/unnamed-chunk-73-1.png" width="75%" style="display: block; margin: auto;" />

---

# Iris Example

.pull-left[
```r
set.seed(1)
kiris <- kmeans(iris[,-5], 3)
table(iris[,5], kiris$cluster)
```

```
##
##               1  2  3
##   setosa      0  0 50
##   versicolor 48  2  0
##   virginica  14 36  0
```
]

.pull-right[
```r
set.seed(2)
kiris <- kmeans(iris[,-5], 3)
table(iris[,5], kiris$cluster)
```

```
##
##               1  2  3
##   setosa      0 50  0
##   versicolor 48  0  2
##   virginica  14  0 36
```
]

???
These are the same solution (up to a relabelling of the clusters)

---

# Iris Example

.pull-left[
```r
set.seed(3)
kiris <- kmeans(iris[,-5], 3)
table(iris[,5], kiris$cluster)
```

```
##
##               1  2  3
##   setosa     33  0 17
##   versicolor  0 46  4
##   virginica   0 50  0
```
]

.pull-right[
```r
set.seed(4)
kiris <- kmeans(iris[,-5], 3)
table(iris[,5], kiris$cluster)
```

```
##
##               1  2  3
##   setosa      0  0 50
##   versicolor 48  2  0
##   virginica  14 36  0
```
]

???

These are not the same solution

---

# Iris Example

```r
kiris <- kmeans(iris[,-5], 3, nstart=25)
table(iris[,5], kiris$cluster)
```

```
##
##               1  2  3
##   setosa     50  0  0
##   versicolor  0 48  2
##   virginica   0 14 36
```

With `nstart=25`, `kmeans` tries 25 random starts and reports the solution with the smallest within-group sum of squares.

---

# Determining Number of Groups

- For hierarchical, we have an argument for the number of groups present in the data (largest jump in dendrogram).
- For `\(k\)`-means, we need to specify `\(k\)` but can still provide some guidance
- We can plot the WSS (within-group sum of squares) for differing numbers of groups

---

# WSS

.pull-left[
.medium[
```r
# one column of cluster labels per value of k
clustore <- matrix(0, nrow=150, ncol=10)
wsstore <- NULL
x <- scale(iris[,-5])   # standardize the four measurements
for(i in 1:10){
  dum <- kmeans(x, i, nstart=25)
  clustore[, i] <- dum$cluster
  wsstore[i] <- dum$tot.withinss   # total within-group SS for k = i
}
```

```r
plot(wsstore)
```
]
]

.pull-right[
<img src="r-figures/lec11/unnamed-chunk-81-1.png" width="99%" style="display: block; margin: auto;" />
]

We look for the "elbow" in the so-called "scree" plot.

???

fairly subjective

---

# No groups?

Also as previously noted, `\(k\)`-means can seem like it finds groups, even when they may not exist.

<img src="r-figures/lec11/unnamed-chunk-82-1.png" width="54%" style="display: block; margin: auto;" />

---

# No groups?
<img src="r-figures/lec11/unnamed-chunk-83-1.png" width="54%" style="display: block; margin: auto;" />

---

# WSS

.pull-left[
.medium[
```r
clustore <- matrix(0, nrow=nrow(nogroups), ncol=10)
wsstore <- NULL
x <- scale(nogroups)
for(i in 1:10){
  dum <- kmeans(x, i, nstart=25)
  clustore[, i] <- dum$cluster
  wsstore[i] <- dum$tot.withinss
}
```
]
]

.pull-right[
<img src="r-figures/lec11/unnamed-chunk-85-1.png" width="99%" style="display: block; margin: auto;" />
]
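---

# WSS: a helper

The WSS loop used for `iris` and for the no-groups data can be wrapped in a small function. A sketch only: `wss_curve()` and the simulated `fake` data are hypothetical names introduced here (the `nogroups` object itself is not reproduced).

```r
# Hypothetical helper: total within-group SS for k = 1, ..., kmax
wss_curve <- function(x, kmax = 10, nstart = 25) {
  x <- scale(x)                                   # standardize first, as in the slides
  vapply(seq_len(kmax),
         function(k) kmeans(x, centers = k, nstart = nstart)$tot.withinss,
         numeric(1))
}

set.seed(311)
# Simulated data with no group structure (uniform noise); an assumption,
# standing in for the `nogroups` data
fake <- cbind(runif(200), runif(200))

wss <- wss_curve(fake)
plot(wss, type = "b", xlab = "k", ylab = "WSS")
```

With no real groups we would expect WSS to fall smoothly as `\(k\)` grows, with no clear elbow, though judging this remains fairly subjective.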