
SOA ASA Exam: Predictive Analysis (PA) – 6. Principal Components and Cluster Analyses

Principal Components and Cluster Analyses

LEARNING OBJECTIVES

The candidate will be able to apply cluster and principal components analysis to enhance supervised learning.

The Candidate will be able to:

  • Understand and apply K-means clustering.
  • Understand and apply hierarchical clustering.
  • Understand and apply principal component analysis.

 

Chapter Overview

As you can tell from its name, Exam PA is mostly concerned with developing models to “predict” a target variable of interest. In the final chapter of Part II of this study manual, we switch our attention from supervised learning methods to unsupervised learning methods, which ignore the target variable (if present) and look solely at the predictor variables in the dataset to extract their structural relationships. We will learn two unsupervised learning techniques, principal components analysis (PCA) and cluster analysis, which are advanced data exploration tools with the following merits:

  • Exploratory data analysis: These tools lend themselves to high-dimensional datasets, which are characterized by an overwhelming number of variables compared to observations. To make sense of these datasets, it is necessary to explore and visualize the relationships not only between pairs of variables, but also among a large group of variables on a holistic basis. For this purpose, traditional bivariate data exploration tools like scatterplots and correlation matrices are ineffective. Fortunately, PCA and cluster analysis will come to our rescue.
  • Unsupervised learning may help supervised learning: PCA and cluster analysis are quite often a means to an end. Besides summarizing large, complex datasets, they also have the potential of generating useful features which are by-products of our data exploration process. These features may be fed into a predictive model in an attempt to better predict a target variable. 

 

Principal Components Analysis

Theoretical Foundations

Motivation

In Section 2.2 and some previous case studies, we performed exploratory data analysis with the aid of summary statistics and graphical displays. At that time, the dataset involved only a few variables, so bivariate data exploration worked well for helping us understand the relationships between different pairs of variables. In practice, however, our dataset can involve hundreds or even thousands of variables, many of which are correlated in subtle ways.

For such a high-dimensional dataset, unraveling the relationships among the large number of variables by bivariate data exploration is not only a daunting task – if there are p features, then there are “p choose 2” pairs to examine – but also unlikely to show us the “big picture” that lies behind the data. More advanced data exploration tools are warranted.

Principal components analysis (PCA) is an advanced data analytic technique that transforms a high-dimensional dataset into a smaller, much more manageable set of representative (hence the qualifier “principal”) variables that capture most of the information in the original dataset. Known as principal components (PCs), these composite variables are linear combinations of the existing variables generated such that they are mutually uncorrelated and collectively simplify the dataset, reducing its dimension and making it more amenable to data exploration and visualization. This technique especially lends itself to highly correlated data, for which a few PCs (as few as two or three) are enough to represent most of the information in the full dataset. In Exam PA, PCA is a useful data pre-processing step that mainly serves two purposes: Exploratory data analysis (which includes data exploration and visualization) and feature generation.

It is almost impossible to learn PCA without formulas, so in what follows we will describe the mechanics of PCA with the aid of a small amount of mathematical notation in keeping with ISLR. In Exam PA, you will mostly be tested on conceptual issues rather than technical details, but the mathematics we present will help you understand the output of a PCA more fully and make sense of the conceptual issues.

 

What exactly are the PCs?

Throughout this chapter, suppose for concreteness that we are given n observations, each containing measurements of p features, X1, X2, …, Xp (in this chapter, we prefer the term “features” to “predictors” because there is nothing to predict in an unsupervised setting; the target variable, if present, plays no role here and is ignored). In many practical situations, p can be very large (e.g., 100 or more). The data matrix is the n x p matrix given by:

\(\boldsymbol{X}=\left( \begin{matrix}
{{x}_{11}} & {{x}_{12}} & \cdots  & {{x}_{1p}}  \\
{{x}_{21}} & {{x}_{22}} & \cdots  & {{x}_{2p}}  \\
\vdots  & \vdots  & \ddots  & \vdots   \\
{{x}_{n1}} & {{x}_{n2}} & \cdots  & {{x}_{np}}  \\
\end{matrix} \right)\)

Here, for each entry \(x_{ij}\):

  • i is the index for the observations, and
  • j is the index for the variables or features.

Typically, the observations of the features have been centered to have a zero mean: \(\bar{x}_j=\dfrac{\sum\nolimits_{i=1}^{n}{x_{ij}}}{n}=0\), for j = 1, 2, …, p.

If not, we can use the mean-centered values \(x'_{ij}=x_{ij}-\bar{x}_j\), which have a zero mean, and perform PCA on these adjusted feature values.

Mathematically, the PCs are composite variables which are normalized linear combinations of the original set of features. For m = 1, 2, … , M, where M (≤ p) is the number of PCs to use, the mth PC is given by the linear combination:

\(Z_m=\phi_{1m}X_1+\phi_{2m}X_2+\cdots+\phi_{pm}X_p=\sum\limits_{j=1}^{p}{\phi_{jm}X_j}\)

where the coefficients \(\phi_{1m}\), \(\phi_{2m}\), …, \(\phi_{pm}\) are the loadings (a.k.a. weights) of the mth PC corresponding to X1, …, Xp.

You can see that we are combining the p features into a single metric, which has the advantage of quickly summarizing the data so long as the loadings are well-chosen, and we will see very shortly how the loadings are determined for each PC. For now, keep in mind that:

  • The PCs are constructed from the features, so the sum in the linear combination is taken over j, the index for features, but not i.
  • The generic symbol \(\phi_{jm}\) for PC loadings is indexed as follows: 
    • j index for the variables (features)
    • m index for PC

Given the feature values, we can compute the realizations of Zm by:

\(z_{im}=\phi_{1m}x_{i1}+\phi_{2m}x_{i2}+\cdots+\phi_{pm}x_{ip}=\sum\limits_{j=1}^{p}{\phi_{jm}x_{ij}}\), i = 1, …, n

We refer to \(z_{1m}, z_{2m}, \ldots, z_{nm}\) as the scores of the mth PC.

If you are comfortable with matrix multiplications, you may find it convenient to collect the p loadings of a PC using the PC loading vector given by:

\(\phi_m=\left( \begin{matrix}
\phi_{1m}  \\
\phi_{2m}  \\
\vdots   \\
\phi_{pm}  \\
\end{matrix} \right)\), one loading for each feature

and the PC scores:

\(\boldsymbol{Z}_m=\left( \begin{matrix}
z_{1m}  \\
z_{2m}  \\
\vdots   \\
z_{nm}  \\
\end{matrix} \right)\), one score for each observation

As long as the feature values are mean-centered, the scores of each PC also have a zero mean, i.e., \(\bar{z}_m=\dfrac{1}{n}\sum\limits_{i=1}^{n}{z_{im}}=0\) for all m.

Translating the realizations of Zm into matrix form, we can relate the scores and loadings of the mth PC and the feature values via:

\(\boldsymbol{Z}_m = \boldsymbol{X\phi}_m\)

=> Scores = Data matrix × Loadings

This matrix formula, though not given in ISLR or the PA e-learning modules, is easy to remember and dispenses with the need to deal with which subscript (i, j, or m) is being summed over. It is also appealing from a linear algebra perspective as it stresses the important fact that the PC scores result from rotating the original feature values (hosted by the data matrix X) in a suitable direction determined by the loading vector.
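To see this matrix relationship concretely, here is a minimal R sketch (the matrix below is simulated purely for illustration and is not a dataset from this manual) verifying that multiplying the centered data matrix by the loading matrix returned by prcomp() reproduces the PC scores:

# Toy verification of Scores = (centered data matrix) x (loadings)
set.seed(1)
X <- matrix(rnorm(20), nrow = 5, ncol = 4)           # n = 5 observations, p = 4 features
pca <- prcomp(X, center = TRUE, scale. = FALSE)
X.centered <- scale(X, center = TRUE, scale = FALSE) # mean-center each column
scores.by.hand <- X.centered %*% pca$rotation        # data matrix times loading vectors
max(abs(scores.by.hand - pca$x))                     # essentially zero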

 

PC Loadings

How do we go about choosing the PC loadings? For simplicity, let’s start with the first PC and determine its loadings \(\phi_{11}\), \(\phi_{21}\), …, \(\phi_{p1}\). To ensure that the first PC captures as much information in the original dataset as possible, the PC loadings are such that they maximize the sample variance of Z1 (remember, the larger the spread, the more information we have), i.e., they solve:

\(\underset{\phi_{11},\ldots,\phi_{p1}}{\max}\ \dfrac{1}{n}\sum\limits_{i=1}^{n}{z^2_{i1}}=\underset{\phi_{11},\ldots,\phi_{p1}}{\max}\ \dfrac{1}{n}\sum\limits_{i=1}^{n}{\left(\sum\limits_{j=1}^{p}{\phi_{j1}x_{ij}}\right)^2}\)

subject to the normalization constraint:

\(\sum\limits_{j=1}^{p}{\phi^2_{j1}}=1\)

This constraint makes the variance maximization meaningful; without it, we can scale up the \(\phi_{j1}\)’s to inflate the variance arbitrarily. There is no need to worry about how this constrained maximization problem is solved in R, but it is important to keep in mind that the PC loadings are defined to maximize variance (information).

Geometrically, the p loadings \(\phi_{11}\), \(\phi_{21}\), …, \(\phi_{p1}\), represent a line in the p-dimensional feature space along which the data vary the most, in the sense that the projected points of the n observations on this line, which are the PC scores, are as spread out as possible among all possible lines. The left panel of Figure 6.1.1 illustrates this with a simplistic dataset with n = 5 and p = 2.

Given the first PC, subsequent PCs are defined in the same way, as maximal-variance linear combinations, with the additional constraint that they must be orthogonal to (or uncorrelated with) the previous PCs. To put this in mathematical language, the loadings of the mth PC with m ≥ 2 maximize:

\(\underset{\phi_{1m},\ldots,\phi_{pm}}{\max}\ \dfrac{1}{n}\sum\limits_{i=1}^{n}{z^2_{im}}=\underset{\phi_{1m},\ldots,\phi_{pm}}{\max}\ \dfrac{1}{n}\sum\limits_{i=1}^{n}{\left(\sum\limits_{j=1}^{p}{\phi_{jm}x_{ij}}\right)^2}\)

subject to the constraints:

  • normalization constraint: \(\sum\limits_{j=1}^{p}{\phi^2_{jm}}=1\)
  • orthogonality constraints: \(\phi^{T}_r\phi_m=\sum\limits_{j=1}^{p}{\phi_{jr}\phi_{jm}}=0\), for r = 1, …, m-1

The orthogonality constraints are imposed so that the subsequent PCs are uncorrelated with the previous PCs and serve to measure different aspects of the variables in the dataset. We will see this in action when we get to the case study in Subsection 6.1.3.

In the right panel of Figure 6.1.1, we have added the first and second PC directions for the same toy dataset. We can see that the two PC directions are mutually perpendicular, with the angle between the two lines being 90°.
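As a quick numerical check of the normalization and orthogonality constraints (again on simulated data, purely for illustration), the loading vectors returned by prcomp() have unit norm and are mutually orthogonal:

# Squared loadings sum to 1 within each PC; different PCs are orthogonal
set.seed(1)
X <- matrix(rnorm(40), nrow = 10, ncol = 4)
pca <- prcomp(X)
colSums(pca$rotation^2)                       # each equals 1 (normalization)
round(t(pca$rotation) %*% pca$rotation, 10)   # identity matrix (orthogonality)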

 

PCA Application

Data visualization

Via PCA, we successfully reduce the dimension of the dataset from p variables to a much smaller set of variables (more precisely, the M PCs) that together retain most of the information measured by variance. With its dimension reduced, the dataset becomes much easier to explore and visualize. Using M = 2 PCs, for instance, we can plot the scores of the first PC against the scores of the second PC to gain a low-dimensional view of the data in an attempt to uncover some latent patterns.

 

Feature Generation

In the context of Exam PA, the most important application of PCA is to create useful features for predicting a given target variable. Once we have settled on the number of PCs to use (typically, we will use just the first few) , the original variables X1, …, Xp are replaced by the PCs Z1, …, ZM, which capture most of the information in the dataset and serve as predictors for the target variable. These predictors are, by construction, mutually uncorrelated, so the effect of one variable being confounded by another variable due to strong correlations is no longer an issue. By reducing the dimension of the data and the complexity of the model, we hope to optimize the bias-variance trade-off and improve the prediction accuracy of the model. We will see how to create new features based on PCA in the case study in Subsection 6.1.3.

When it comes to fitting a predictive model using the PCs as predictors, it is important to delete the original variables X1, …, Xp to avoid any duplication of information. If the original variables were retained, this would run counter to the original aim of using PCA. More seriously, the PCs are linear combinations of X1, …, Xp, so the co-existence of the two groups of variables will result in perfect collinearity and a rank-deficient model if you are fitting a GLM.
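As a hedged illustration of this workflow (the data frame, target, and model below are simulated and hypothetical, used solely for this sketch), one might attach the first two PC scores to the data, drop the original features, and fit a GLM on the PCs:

# Simulated data: 5 numeric features and a hypothetical numeric target
set.seed(42)
dat <- as.data.frame(matrix(rnorm(500), ncol = 5))
dat$target <- rnorm(100)

# Run PCA on the features only, then keep the first two PCs and drop the originals
pca <- prcomp(dat[, 1:5], center = TRUE, scale. = TRUE)
dat.pc <- data.frame(target = dat$target, pca$x[, 1:2])  # columns PC1, PC2

# Fit a GLM using only the PCs as predictors
model <- glm(target ~ PC1 + PC2, data = dat.pc, family = gaussian())
summary(model)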

 

Additional PCA Issues

Now that we have a basic understanding of how PCA works, this subsection will delve into some more subtle PCA issues which are common items in Exam PA.

 

Proportion of Variance Explained

Arguably the biggest issue surrounding the use of PCA is how to choose M, the number of PCs to use. To address this important issue, it is useful to quantify how much information is captured by the PCs (equivalently, how much information is lost if we use the PCs in place of the original features). This can be done by assessing the proportion of variance explained by each of the M PCs in comparison to the total variance.

Total Variance

The total variance present in the data is the sum of the sample variances of the p variables:

\(\sum\limits_{j=1}^{p}{\underbrace{\dfrac{1}{n}\sum\limits_{i=1}^{n}{x^2_{ij}}}_{\text{sample variance of feature }j}}\)

which, if the data values have been scaled to have unit standard deviation, equals \(\sum\limits_{j=1}^{p}{1}=p\).

 

Variance Explained by the mth PC

As with the variables, the scores of each PC also have zero mean, so the variance explained by the mth PC equals the (sample) second moment:

\(\dfrac{1}{n}\sum\limits_{i=1}^{n}{z^2_{im}}=\dfrac{1}{n}\sum\limits_{i=1}^{n}{\left(\sum\limits_{j=1}^{p}{\phi_{jm}x_{ij}}\right)^2}\)

 

As a result, the proportion of variance explained (PVE) by the mth PC is given by:

\(PVE_m=\dfrac{\text{variance explained by mth PC}}{\text{total variance}}=\dfrac{\sum\limits_{i=1}^{n}{z^2_{im}}}{\sum\limits_{i=1}^{n}{\sum\limits_{j=1}^{p}{x^2_{ij}}}}\)

For standardized data, the total variance equals p, so \(PVE_m=\dfrac{\sum\limits_{i=1}^{n}{z^2_{im}}}{np}\).

These proportions for m = 1, …, M satisfy several properties:

  • It can be shown that each PVEm is between 0 and 1, i.e., it is indeed a proportion, and that the PVEs of all p PCs sum to 1.
  • By the definition of PCs, the PVEm's are monotonically decreasing in m, i.e., PVE1 ≥ PVE2 ≥ … ≥ PVEM, because subsequent PCs have more and more orthogonality constraints to comply with and therefore less flexibility with the choice of the PC loadings. As a result, the first PC can explain the greatest amount of variance and is the most informative, followed by the second PC, third PC, and so forth.
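In R, the PVEs can be computed directly from the standard deviations stored in a prcomp object (its sdev component). The following sketch uses simulated data purely for illustration:

# PVE and cumulative PVE from a prcomp object
set.seed(1)
X <- matrix(rnorm(200), nrow = 50, ncol = 4)
pca <- prcomp(X, center = TRUE, scale. = TRUE)
pve <- pca$sdev^2 / sum(pca$sdev^2)    # variance explained by each PC / total variance
pve                                    # PVE_1, ..., PVE_p (decreasing)
cumsum(pve)                            # cumulative PVE, as reported by summary(pca)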

 

Choosing the number of PCs by a scree plot

Given the PVEs, a scree plot provides a simple visual inspection method for selecting the number of PCs to use. The horizontal axis of such a plot indexes the PCs, whereas the vertical axis shows the PVEs in order. As discussed above, the PVE will decrease as we move along the horizontal axis of a scree plot, and we will choose the number of PCs required to explain a sizable proportion of variance. This is typically done by eyeballing the scree plot and looking for the point at which the PVE of a subsequent PC has dropped off to a sufficiently low level. That point is known as an elbow in the scree plot. The PCs beyond the elbow have a very small PVE and can be safely dropped without losing much information.

There is often some degree of subjectivity when it comes to locating the elbow of a scree plot, so the model solution of a typical PA exam will likely have a range of acceptable values for the number of PCs to use. As long as you justify your choice with the reasoning above, you will be fine.

 

Importance of centering and scaling

Another issue when one applies PCA is whether to mean-center and/or scale the variables in the dataset to have unit variance.

Centering

In theory, whether the variables have their sample means subtracted to yield mean zero should have no effect on the PC loadings, which are defined to maximize the (sample) variance of the PC scores; the variance remains unchanged when the values of a variable are shifted by the same constant. As a result, mean-centering is often not of great concern when we perform PCA.

(Note: In practice, prcomp(), the R function we will use to perform PCA, appears to maximize not the variance of the PC scores, but their second moment, which varies when the variables are mean-centered. This may be why prcomp() centers the variables to have mean zero by default.)

 

Scaling

The results of a PCA depend very much on the scale of the variables. Technically speaking:

    • If we conduct a PCA using the variables on their original scale, the PC loadings are determined based on the (sample) covariance matrix of the variables.
    • If we conduct a PCA for the standardized variables (i.e., they are divided by their standard deviation to have unit standard deviation), the PC loadings are determined based on the (sample) correlation matrix of the variables.

If no scaling is done and the variables are of vastly different orders of magnitude (e.g., one variable is measured in km and one is in cm), then those variables with an unusually large variance on their scale will receive a large PC loading and dominate the corresponding PC – remember that the loadings are defined to maximize the sample variance, so it makes sense to attach a large weight to a high-variance variable. However, there is no guarantee that this variable explains much of the underlying pattern in the data. Because it is not desirable for the results of a PCA to depend on an arbitrary choice of scaling, scaling the variables to produce unit standard deviation prior to performing a PCA is often recommended.
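The covariance-versus-correlation distinction can be checked numerically. In the sketch below (using the USArrests data that appears in the case study later in this chapter), the loadings from an unscaled PCA match the eigenvectors of the sample covariance matrix, and the loadings from a scaled PCA match the eigenvectors of the sample correlation matrix, up to sign:

# Unscaled PCA <-> eigenvectors of the covariance matrix (up to sign)
pca.unscaled <- prcomp(USArrests, center = TRUE, scale. = FALSE)
pca.unscaled$rotation
eigen(cov(USArrests))$vectors

# Scaled PCA <-> eigenvectors of the correlation matrix (up to sign)
pca.scaled <- prcomp(USArrests, center = TRUE, scale. = TRUE)
pca.scaled$rotation
eigen(cor(USArrests))$vectors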

 

EXAM NOTE

Task 5 of the June 17-18, 2020 PA exam has a small item asking you to:

explain why scale will affect the PCA results.

 

PCA with categorical features

PCA, in its current form, can only be applied to numeric variables. To perform PCA on categorical variables, they have to be explicitly binarized (with the use of the dummyVars() function in the caret package we used in Chapters 3 and 4) in advance to create the dummy variables, which are numeric, taking values of either 0 or 1. Then all of the discussions above apply equally well. This was done in Task 3 of the June 2019 PA exam.
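Here is a hedged sketch of that workflow; the small data frame and its column names are made up purely for illustration:

# Binarize factor variables with caret::dummyVars(), then run PCA on the numeric result
library(caret)
dat <- data.frame(x1 = rnorm(12),
                  grp = factor(rep(c("A", "B", "C"), length.out = 12)))  # toy data
dummies <- dummyVars(~ ., data = dat)                      # set up the binarization
dat.bin <- as.data.frame(predict(dummies, newdata = dat))  # all-numeric data frame
pca <- prcomp(dat.bin, center = TRUE, scale. = TRUE)
pca$rotation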

 

Drawbacks of PCA

Let’s end this subsection with some limitations of PCA, which you may be asked to acknowledge on the exam.

Interpretability

The greatest disadvantage of PCA in Exam PA is that it may not lead to interpretable results. This is because it is not always an easy task to make sense of the PCs, which can be complicated linear combinations of the original features (what effects, exactly, do they capture on the target variable?). The loss of interpretability is a particular concern in PA exam projects where your client is interested in factors affecting the target variable or actionable insights from your work. Not being able to interpret your newly generated features will be a huge downside.

 

Not good for non-linear relationships

By construction, PCA involves using linear transformations to summarize and visualize high-dimensional datasets where the variables are highly linearly correlated. If the variables are related in a non-linear fashion (e.g., polynomial and interaction relationships), then PCA may not be a good tool to use.

 

Target variable is ignored (Less important)

When we use PCA in a supervised learning setting, the PC loadings and scores are generated completely independently of the target variable. We are implicitly assuming that the directions in which the features exhibit the most variation (given by the loading vectors) are also the directions most associated with the target variable. Often there is no guarantee that this is true.

(This drawback is partially remedied by partial least squares, which is a supervised alternative to PCA you may have learned in Exam SRM. The PA e-learning modules, however, make no mention of this technique.)

 

Not feature selection (Less important)

Although PCA reduces the dimensionality of the model, it is not exactly a feature selection method. This is because each PC is a linear combination of all of the original p features, none of which is removed (except in the unlikely case when some of the PC loadings are exactly zero). From a practical point of view, all of the original variables must still be collected for model construction, so no operational efficiency is achieved.

 

Simple Case Study

In this subsection, we illustrate the concepts presented in Subsections 6.1.1 and 6.1.2 by means of a simple case study. After completing this case study, you should be able to:

  • Perform a PCA using the prcomp() function in R.
  • Extract useful information from a prcomp object and interpret the output of a PCA.
  • Generate and interpret a biplot.
  • Realize the effect of scaling on the results of a PCA.
  • Use the output of a PCA to create new features for prediction.

 

Data Description

This case study centers on the well-known USArrests dataset that comes with base R (if you are interested, run ?USArrests to learn more about the dataset). For each of the n = 50 US states in 1973, this dataset hosts information about p = 4 numeric variables. (Yes, this dataset is not high-dimensional by any standard, but it is a very useful tool for illustrating PCA!)

  • Murder: The number of arrests for murder per 100,000 residents
  • Assault: The number of arrests for assault per 100,000 residents
  • Rape: The number of arrests for rape per 100,000 residents
  • UrbanPop: The percent (from 0 to 100) of the population residing in urban areas

One would expect that the three crime-related variables are correlated at least to some extent and it may make sense to use PCA to combine them into a single variable measuring overall crime levels and reduce the dimension of the data from 4 to 2. We can then understand and visualize the data much more easily.

 

TASK 1: Explore the data

Investigate the distribution of the four variables using graphical displays and summary statistics. Perform the following:

  • Comment on their range of values and the relative size of their standard deviations.
  • Determine if any of these variables should be transformed and, if so, what transformation should be made.
  • Create a correlation matrix for the four variables, comment on the correlation structure of the four variables, and explain why a PCA may be useful for summarizing the four variables.

 

Univariate Data Exploration

Before performing a PCA, let’s first run CHUNK 1 to load the dataset and print a summary for each of the four variables.

# CHUNK 1 
data(USArrests) 
summary(USArrests)
    Murder          Assault         UrbanPop          Rape      
Min.   : 0.800   Min.   : 45.0   Min.   :32.00   Min.   : 7.30  
1st Qu.: 4.075   1st Qu.:109.0   1st Qu.:54.50   1st Qu.:15.07  
Median : 7.250   Median :159.0   Median :66.00   Median :20.10  
Mean   : 7.788   Mean   :170.8   Mean   :65.54   Mean   :21.23  
3rd Qu.:11.250   3rd Qu.:249.0   3rd Qu.:77.75   3rd Qu.:26.18  
Max.   :17.400   Max.   :337.0   Max.   :91.00   Max.   :46.00

We can see that the four variables are on drastically different scales, even among the three crime-related variables. For example, Murder ranges from 0.800 to 17.400, but Assault ranges from 45.0 to 337.0.

In CHUNK 2, we use the apply() function to “apply” the sd() function to return the standard deviation of each variable (which is not available from summary above).

# CHUNK 2 
apply(USArrests, 2, sd)
  Murder   Assault  UrbanPop      Rape 
4.355510 83.337661 14.474763  9.366385

That the four standard deviations differ substantially makes it imperative to scale the four variables to have unit standard deviation prior to performing a PCA. This can be easily achieved by setting an option in an R function, as we will see soon.

# CHUNK 3 
library(ggplot2)

# names(USArrests) extracts the column names of the USArrests data 
for (i in names(USArrests)) { 
    plot <- ggplot(USArrests, aes(x = USArrests[, i])) + 
        geom_histogram() + 
        xlab(i)
    print(plot)
} 

Figure 6.1.3: Histograms for the four variables in the USArrests dataset

We used a for loop to produce a histogram for each variable (see Figure 6.1.3). None of the four graphs indicates a particularly problematic distribution, so no variable transformations seem warranted. (There is a small amount of right skewness in the distribution of Rape. For simplicity, we leave Rape as is. If you are interested, you may apply a log transformation and see how it affects the PCA we are going to perform.)

 

Bivariate Data Exploration

Finally, in CHUNK 4 we create a correlation matrix for the four variables in an attempt to understand their correlation structure.

# CHUNK 4 
cor(USArrests)
##              Murder   Assault   UrbanPop      Rape
## Murder   1.00000000 0.8018733 0.06957262 0.5635788
## Assault  0.80187331 1.0000000 0.25887170 0.6652412
## UrbanPop 0.06957262 0.2588717 1.00000000 0.4113412
## Rape     0.56357883 0.6652412 0.41134124 1.0000000

Examining the pairwise correlations, we can see that:

  • Murder, Assault, and Rape are strongly positively correlated with one another, with all pairwise correlations greater than 0.5. This is perhaps not surprising because the three variables all measure similar quantities – crime rates. The strong correlations suggest that PCA may be an effective technique to “compress” the three crime-related variables into a single measure of overall crime levels while retaining most of the information. If this can be done, then the three dimensions represented by the three variables can be effectively summarized in one dimension and the data, whose dimension will be reduced from 4 to 2, can be visualized much more easily.
  • UrbanPop does not seem to have a strong linear relationship with the three crime-related variables; it has a negligible correlation with Murder and a moderately weak correlation with Assault and Rape. In other words, urbanization levels and crime levels do not correlate closely.

 

TASK 2: Conduct a principal components analysis

Task 1 suggests that PCA may be a good way to summarize the four variables in the USArrests data.

  • Perform a PCA on the four variables. You should decide whether to scale the variables in view of your observations in Task 1.
  • Verify that the first PC score of Alabama is calculated correctly.
  • Interpret the loadings of the first two principal components.

 

Implementing a PCA

There are several functions in R for doing PCA. The one we will use in Exam PA is the prcomp() function, which is part of base R (though it is not the most versatile one). This function takes a numeric matrix or a data frame carrying the variables to which PCA is to be applied (if you want to perform PCA on only certain variables in the data frame, subset it using the square bracket [, ] notation).

To incorporate mean-centering and scaling, we set center = TRUE and scale. = TRUE. By default, center = TRUE and scale. = FALSE, so the latter option is necessary to override the no-scaling default.

In CHUNK 5, we run a PCA on the USArrests data and print out the PC loading vectors and scores. As we saw in CHUNK 2, the four variables have vastly different variances, so we will scale them to have unit standard deviation.

# CHUNK 5 
PCA <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# PC loadings 
PCA$rotation
##                 PC1        PC2        PC3         PC4
## Murder   -0.5358995  0.4181809 -0.3412327  0.64922780
## Assault  -0.5831836  0.1879856 -0.2681484 -0.74340748
## UrbanPop -0.2781909 -0.8728062 -0.3780158  0.13387773
## Rape     -0.5434321 -0.1673186  0.8177779  0.08902432
# PC scores 
PCA$x
##                        PC1         PC2         PC3          PC4
## Alabama        -0.97566045  1.12200121 -0.43980366  0.154696581
## Alaska         -1.93053788  1.06242692  2.01950027 -0.434175454
## Arizona        -1.74544285 -0.73845954  0.05423025 -0.826264240
## Arkansas        0.13999894  1.10854226  0.11342217 -0.180973554
## California     -2.49861285 -1.52742672  0.59254100 -0.338559240
## Colorado       -1.49934074 -0.97762966  1.08400162  0.001450164
## Connecticut     1.34499236 -1.07798362 -0.63679250 -0.117278736
## Delaware       -0.04722981 -0.32208890 -0.71141032 -0.873113315
## Florida        -2.98275967  0.03883425 -0.57103206 -0.095317042
## Georgia        -1.62280742  1.26608838 -0.33901818  1.065974459
## Hawaii          0.90348448 -1.55467609  0.05027151  0.893733198
## Idaho           1.62331903  0.20885253  0.25719021 -0.494087852
## Illinois       -1.36505197 -0.67498834 -0.67068647 -0.120794916
## Indiana         0.50038122 -0.15003926  0.22576277  0.420397595
## Iowa            2.23099579 -0.10300828  0.16291036  0.017379470
## Kansas          0.78887206 -0.26744941  0.02529648  0.204421034
## Kentucky        0.74331256  0.94880748 -0.02808429  0.663817237
## Louisiana      -1.54909076  0.86230011 -0.77560598  0.450157791
## Maine           2.37274014  0.37260865 -0.06502225 -0.327138529
## Maryland       -1.74564663  0.42335704 -0.15566968 -0.553450589
## Massachusetts   0.48128007 -1.45967706 -0.60337172 -0.177793902
## Michigan       -2.08725025 -0.15383500  0.38100046  0.101343128
## Minnesota       1.67566951 -0.62590670  0.15153200  0.066640316
## Mississippi    -0.98647919  2.36973712 -0.73336290  0.213342049
## Missouri       -0.68978426 -0.26070794  0.37365033  0.223554811
## Montana         1.17353751  0.53147851  0.24440796  0.122498555
## Nebraska        1.25291625 -0.19200440  0.17380930  0.015733156
## Nevada         -2.84550542 -0.76780502  1.15168793  0.311354436
## New Hampshire   2.35995585 -0.01790055  0.03648498 -0.032804291
## New Jersey     -0.17974128 -1.43493745 -0.75677041  0.240936580
## New Mexico     -1.96012351  0.14141308  0.18184598 -0.336121113
## New York       -1.66566662 -0.81491072 -0.63661186 -0.013348844
## North Carolina -1.11208808  2.20561081 -0.85489245 -0.944789648
## North Dakota    2.96215223  0.59309738  0.29824930 -0.251434626
## Ohio            0.22369436 -0.73477837 -0.03082616  0.469152817
## Oklahoma        0.30864928 -0.28496113 -0.01515592  0.010228476
## Oregon         -0.05852787 -0.53596999  0.93038718 -0.235390872
## Pennsylvania    0.87948680 -0.56536050 -0.39660218  0.355452378
## Rhode Island    0.85509072 -1.47698328 -1.35617705 -0.607402746
## South Carolina -1.30744986  1.91397297 -0.29751723 -0.130145378
## South Dakota    1.96779669  0.81506822  0.38538073 -0.108470512
## Tennessee      -0.98969377  0.85160534  0.18619262  0.646302674
## Texas          -1.34151838 -0.40833518 -0.48712332  0.636731051
## Utah            0.54503180 -1.45671524  0.29077592 -0.081486749
## Vermont         2.77325613  1.38819435  0.83280797 -0.143433697
## Virginia        0.09536670  0.19772785  0.01159482  0.209246429
## Washington      0.21472339 -0.96037394  0.61859067 -0.218628161
## West Virginia   2.08739306  1.41052627  0.10372163  0.130583080
## Wisconsin       2.05881199 -0.60512507 -0.13746933  0.182253407
## Wyoming         0.62310061  0.31778662 -0.23824049 -0.164976866

Interpretation of PCA Output

Note that a prcomp object like PCA is a list hosting a lot of useful information that we can access using the dollar sign $ notation. The two most important components are:

  • rotation: This is the p x M numeric matrix of PC loadings indexed by variables (row) and PCs (column). Each column of the matrix provides the corresponding PC loading vector \(\phi_m\). The matrix is called rotation because the PC loadings define the “rotated” coordinate system.
  • x: This is the n x M numeric matrix of PC scores indexed by observations (row) and PCs (column). Despite its nomenclature, it is not the original data matrix X, but the new data matrix whose columns are the PC scores computed according to \(\boldsymbol{Z}_m = \boldsymbol{X\phi}_m\).

 

In CHUNK 5, we can see, for example, that the loadings of the first PC are \(\phi_{11}\) = -0.5358995, \(\phi_{21}\) = -0.5831836, \(\phi_{31}\) = -0.2781909, \(\phi_{41}\) = -0.5434321, which means that the first PC is defined by:

Z1 = -0.5358995(Murder) - 0.5831836(Assault) - 0.2781909(UrbanPop) - 0.5434321(Rape)

where the four variables have been centered and standardized. Of course, we have \(\phi^2_{11}+\phi^2_{21}+\phi^2_{31}+\phi^2_{41}=1\).

To check that the PCA output makes sense, let’s verify by hand the first PC score for Alabama, the very first observation in the USArrests data. The variable values of Alabama are:

USArrests[1, ]
        Murder Assault UrbanPop Rape
Alabama   13.2     236       58 21.2

Upon centering and standardization (the means and standard deviations are taken from CHUNKs 1 and 2), Alabama has:

  • Murder: \(\dfrac{x_{11}-\bar{x}_1}{s_1}=\dfrac{13.2-7.788}{4.355510}=1.242564\)
  • Assault: \(\dfrac{x_{12}-\bar{x}_2}{s_2}=\dfrac{236-170.760}{83.337661}=0.782839\)
  • UrbanPop: \(\dfrac{x_{13}-\bar{x}_3}{s_3}=\dfrac{58-65.540}{14.474763}=-0.520907\)
  • Rape: \(\dfrac{x_{14}-\bar{x}_4}{s_4}=\dfrac{21.2-21.232}{9.366385}=-0.003416\)

So PC1 = z11 = -0.5358995(1.242564) - 0.5831836(0.782839) - 0.2781909(-0.520907) - 0.5434321(-0.003416) = -0.9757

which is the (1, 1)-entry of the x matrix (subject to some rounding error). Other PC scores in the x matrix are computed in a similar fashion.
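The same check can be done in one line in R; the sketch below standardizes the data with scale() and forms the dot product of Alabama's standardized feature values with the first loading vector:

# Recompute Alabama's first PC score directly from the standardized data and the loadings
sum(scale(USArrests)["Alabama", ] * PCA$rotation[, 1])   # should match PCA$x["Alabama", 1]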

 

Interpretation of the PCs

The results of a PCA become particularly meaningful if the PCs can be interpreted in a way that makes sense in the context of the given business problem. In other words, the PCs have special meaning and capture specific aspects of the data. This allows us to understand how the original variables contribute to the construction of different PCs and uncover potentially interesting relationships among the variables that would otherwise have gone unnoticed.

Examining the relative signs and magnitudes of the PC loadings in CHUNK 5, we can see that:

  • PC1:
    • Relative Signs: The first PC attaches roughly the same (negative) weight to each of Assault, Murder, and Rape, which are strongly positively correlated with one another (recall the correlation matrix in CHUNK 4), and a much smaller weight to UrbanPop, so it can be interpreted as a measure of overall crime rates.
    • Magnitudes: The smaller (more negative) the PC score, the higher the crime rates. In Task 4 below, we will use this insight to combine these three crime-related variables as a single variable without losing much information.
  • PC2:
    • Relative Signs: In contrast, the second PC puts a very heavy (negative) weight on UrbanPop and a rather small weight on the three crime-related variables, so it is mainly a measure of urbanization level.
    • Magnitudes: The smaller (more negative) the PC score, the more urbanized the state.

 

Exercise (Further practice with interpreting PCs)

Consider the PC loadings in CHUNK 5 again.

    1. As far as possible, provide an interpretation of the third principal component.
    2. Among the 50 states, Alaska has the largest score for the third principal component. Without going back to the USArrests data (don’t cheat!!), describe the characteristics of Alaska.

 

Solution

    1. The third PC has a large positive loading on Rape and smaller (but non-negligible) negative loadings of similar size on Murder, Assault, and UrbanPop. As a result, we may interpret the third PC roughly as a contrast between Rape (large positive loading) and the sum of Murder, Assault, and UrbanPop (smaller negative loadings).
    2. Based on the interpretation in part i), Alaska is expected to have a large value of Rape in comparison to the sum of Murder, Assault, and UrbanPop.

Remark

    • You will see in CHUNK 8 below that the third PC accounts for only 8.914% of the variance of the data, so it is not a very useful PC. However, you can check that Alaska really has a relatively high value of Rape (but its values of Murder and Assault are not low).
    • To make your life easy, a typical PA exam project will likely involve a dataset for which the PCs can be interpreted in a rather straightforward manner, e.g.:
      • An overall or weighted average of a group of variables, e.g., PC1 in the USArrests data and PC1 in the June 2020 PA exams.
      • An index for a certain variable, e.g., PC2 in the USArrests data.
      • A contrast between a certain group of variables, e.g., PC3 in the USArrests data, PC2 in the June 17-18, 2020 PA exam, and PC1 in the June 2019 PA exam.

 

TASK 3: Generate and interpret a biplot

Create a biplot of the first two principal components. Then do the following:

  • Use the biplot to describe the characteristics of the following states:
    • Florida
    • California
    • Virginia
    • Iowa
  • Use the biplot to describe the dependence structure of the four variables.

 

What is a biplot?

As soon as the PCs have been computed, we can plot their scores against one another to produce a low-dimensional view of the data. In particular, plotting the scores of the first two PCs against each other allows us to visualize the data in a two-dimensional scatterplot and inspect which observations and variables are similar to each other. An example of such a plot is a biplot, which overlays the scores and loading vectors of the first two PCs in a single graph.

 

EXAM NOTE

Although the PA modules don’t really explain what a biplot is, the PA exam expects you to know how to interpret a biplot. Task 5 of the June 17-18, 2020 PA exam, for example, is about interpreting a biplot.

 

In CHUNK 6, we use the aptly named biplot() function to create a biplot (see Figure 6.1.4) for the USArrests data.

# CHUNK 6 
# cex argument indicates the amount by which plotting symbols should be scaled 
# cex = 0.6 means 40% smaller 
# scale = 0 ensures that the arrows are scaled to represent the PC loadings 
biplot(PCA, scale = 0, cex = 0.6)

Figure 6.1.4: The biplot for the USArrests dataset with scaling

The scores of the first two PCs are plotted against each other for the 50 states (identified by their names) along with the directions represented by the first two PC loading vectors. The four axes of the biplot correspond to:

  • Bottom: scores of the first PC
  • Left: scores of the second PC
  • Top: loadings on the first PC (red scale)
  • Right: loadings on the second PC (red scale)

For instance, Murder is represented by the arrow heading to the northwest and passing through (\(\phi_{11}\), \(\phi_{12}\)) = (-0.5358995, 0.4181809), where the two loadings are from CHUNK 5.

 

Interpreting a Biplot

A biplot is a very helpful graphical tool as it reveals a lot of useful information about the variables as well as observations. Let’s use Figure 6.1.4 to illustrate:

How to use the biplot to interpret the PCs?

We saw in CHUNK 5 that the first PC attaches roughly the same weight to each of Assault, Murder, and Rape, so it can be interpreted as a measure of overall crime rates. This is also reflected in the biplot, where the coordinates of Murder, Assault, and Rape on the top axis (which is for the first PC loadings) are highly similar, whereas the coordinate of UrbanPop is much closer to zero. In contrast, the coordinate of UrbanPop on the right axis (which is for the second PC loadings) is very negative and overshadows the loadings for the crime-related variables, so the second PC is mainly a measure of urbanization level.

 

What can we say about some of the observations?

A biplot also tells us something about the characteristics of the observations based on the relative size of their PC scores. Looking at Figure 6.1.4, we can see which states have a relatively high crime rate and high level of urbanization. Florida, for example, is sitting on the far left end on the bottom axis, i.e., its first PC score is very negative, and its second PC score is virtually zero. As a result, it is a high-crime area (remember, the loadings of the first PC are all negative, so the higher the value of Murder, Assault, and Rape for a state, the more negative its first PC score) and has an average level of urbanization. We can check that this is indeed the case if we go back to the original values of Florida:

USArrests ["Florida", ]
        Murder Assault UrbanPop Rape
Florida   15.4     335       80 31.9
# For comparison
apply(USArrests, 2, mean)
Murder  Assault UrbanPop     Rape 
 7.788  170.760   65.540   21.232

As expected, the three crime-related variables are all well above their means (each by roughly one to two standard deviations), suggesting serious crime levels (but its UrbanPop is also quite large!).

Similarly, we can describe California, Virginia, and Iowa as follows:

  • California: very negative first PC score and very negative second PC score, indicating a high crime rate and a high urbanization level.
  • Virginia: first and second PC scores both almost zero, indicating an average crime rate and an average urbanization level.
  • Iowa: very positive first PC score and second PC score almost zero, indicating a low crime rate and an average urbanization level.

Note that the PC scores are centered to have mean zero, so a PC1 score close to zero indicates an average crime level, not a low crime level.

 

Which variables are more correlated with each other?

We can also use the locations of the variables in a biplot to gain a visual understanding of their correlation structure. That the three crime-related variables in Figure 6.1.4 are quite close to one another indicates that these three variables are rather positively correlated. Moreover, UrbanPop is quite far away from the three crime-related variables and so is less correlated with them. These findings are consistent with the correlation matrix we saw in CHUNK 4.

 

One interesting observation about Figure 6.1.4 you can make is that the scores of the first PC range from -3 to 3 (on the bottom axis) whereas the scores of the second PC range only from -1.5 to 2.5 (on the left axis). The larger spread of the scores of the first PC is expected if you recall that the first PC is, by construction, always more informative and has a higher variance than the second PC.

 

Biplot for unscaled variables

To appreciate the impact of scaling versus no scaling, let's look at the biplot we would obtain if we performed the PCA without scaling (not recommended!) and see how things dramatically change. This is done in CHUNK 7 (notice the option scale. = FALSE in the prcomp() function) with the biplot shown in Figure 6.1.5.

# CHUNK 7 
PCA.unscaled <- prcomp(USArrests, scale. = FALSE) 
PCA.unscaled$rotation
                PC1         PC2         PC3         PC4
Murder   0.04170432 -0.04482166  0.07989066 -0.99492173
Assault  0.99522128 -0.05876003 -0.06756974  0.03893830
UrbanPop 0.04633575  0.97685748 -0.20054629 -0.05816914
Rape     0.07515550  0.20071807  0.97408059  0.07232502
biplot(PCA.unscaled, scale = 0, cex = 0.6)

Unlike Figure 6.1.4, the first and second PC loading vectors now place almost all of their weights respectively on Assault and UrbanPop, which is not surprising given that Assault has a disproportionately large standard deviation, followed by UrbanPop, as we saw in CHUNK 2. In this case, the first PC is no longer an overall measure of crime levels. This shows that in the absence of scaling, the results of a PCA may be heavily distorted and may fail to capture the patterns underlying the data. Standardization of variables is therefore typically performed to prevent this from happening.

 

TASK 4: Use observations from the PCA to generate a new feature

  • Recommend the number of principal components to use. Justify your choice.
  • Regardless of your recommendation, generate one new feature based on the first principal component.

 

The USArrests data has four variables, so running a PCA naturally leads to four PCs. How many of them should we actually use to summarize the data well? We may get some idea by looking at the PVEs. In CHUNK 8, we apply the summary() function to PCA, a prcomp object, and output the PVE and the cumulative PVE for each of the four PCs.

# CHUNK 8 
summary(PCA)
Importance of components:
                          PC1    PC2     PC3     PC4
Standard deviation     1.5749 0.9949 0.59713 0.41645
Proportion of Variance 0.6201 0.2474 0.08914 0.04336
Cumulative Proportion  0.6201 0.8675 0.95664 1.00000

For instance, the first PC explains 62.01% of the variability and the second PC explains 24.74%, for a cumulative PVE of 86.75%. The fact that the PVE of the first PC is so high can be attributed to the large positive correlations among the three crime variables (and there are only four variables in the USArrests data!), as we saw in CHUNK 4. As the first two PCs together are able to explain most of the variability in the data (in other words, the biplot in Figure 6.1.4 is a pretty accurate summary of the USArrests data with only two dimensions) and they are easy to interpret (from Task 2), we would recommend using the first two PCs.

Although not really necessary, we can visualize the variances explained by the PCs using a scree plot, produced by the self-explanatory screeplot() function. When applied to a prcomp object like PCA, this function plots the variance explained by the PCs, the number of which can be controlled by the npcs argument, and you can choose to have a line plot (with the extra option type = "lines") or a bar plot (type = "barplot"). Figure 6.1.6 shows the scree plot for the USArrests data.

Figure 6.1.6: A scree plot for the USArrests data

# CHUNK 8 (Cont.)
screeplot(PCA, npcs = 4, type = "lines")

# CHUNK 8 (Cont.)
screeplot(PCA, npcs = 4, type = "barplot")

 

Feature Generation Based on PCA

Now that we have decided to use two PCs, how can we use them to create new useful features? We have seen in CHUNK 5 and Figure 6.1.4 that the three crime-related variables are the dominating variables in the first PC, all sharing approximately the same weight. The first PC alone can also explain a large proportion of the variability of the data, so combining the three crime-related variables as a single feature measuring overall crime rates is a judicious move.

There are three different ways, suggested by different sources, to combine the crime-related variables.

Method 1 (Adopted by the June 2020 PA exam)

The most straightforward and common method is to simply take the scores of the first PC as a new feature. This is easily achieved in CHUNK 10, where we extract the first column of the x matrix, insert it into the USArrests data as a new feature called PC1, and print out the first six rows of the new dataset.

# CHUNK 10 
USArrests.1 <- USArrests # make a new copy of USArrests 
USArrests.1$PC1 <- PCA$x[ , 1] 
head(USArrests.1)
           Murder Assault UrbanPop Rape        PC1
Alabama      13.2     236       58 21.2 -0.9756604
Alaska       10.0     263       48 44.5 -1.9305379
Arizona       8.1     294       80 31.0 -1.7454429
Arkansas      8.8     190       50 19.5  0.1399989
California    9.0     276       91 40.6 -2.4986128
Colorado      7.9     204       78 38.7 -1.4993407

 

Method 2 (Suggested by the June 2019 PA exam)

When you apply PCA to a lot of variables (far more than 4), using the first PC entirely may result in a new feature that is not easy to interpret. In the case of the USArrests data, the first PC has Murder, Assault, and Rape as the dominating variables and is mainly a measure of overall crime rates, so we may simply drop UrbanPop and use the loadings of the three crime-related variables to define a new feature.

In CHUNK 11, we manually create the new feature called crime using the PC loadings of Murder, Assault, and Rape.

EXAM NOTE

The June 2019 PA exam model solution says that

[u]sing every number in the first principal component is not as beneficial as it will be difficult to explain to the client what it represents. By using only a few of the [variables] it is clearer what the new variable is measuring.

# CHUNK 11 
USArrests.2 <- USArrests

# the scale() function will convert the USArrests data to a numeric matrix
# so we use the as.data.frame() function to change it back to a data frame
USArrests.scaled <- as.data.frame(scale(USArrests.2)) 

USArrests.2$crime <- PCA$rotation[1, 1] * USArrests.scaled$Murder + 
  PCA$rotation[2, 1] * USArrests.scaled$Assault + 
  PCA$rotation[4, 1] * USArrests.scaled$Rape 

head(USArrests.2)
           Murder Assault UrbanPop Rape      crime
Alabama      13.2     236       58 21.2 -1.1205719
Alaska       10.0     263       48 44.5 -2.2676396
Arizona       8.1     294       80 31.0 -1.4675357
Arkansas      8.8     190       50 19.5 -0.1586647
California    9.0     276       91 40.6 -2.0092964
Colorado      7.9     204       78 38.7 -1.2598717

Notice that it is necessary to scale the three crime-related variables before setting up the new feature. The scaling can be done by scaling the entire USArrests dataset (with the new data frame called USArrests.scaled) or scaling each of the three crime-related variables separately using the scale() function.

 

Method 3 (Suggested by Module 8 of PA e-learning modules; not recommended in general)

The third method, which is illustrated in the sample project section of PA e-learning Module 8, is to rerun a PCA only on the three crime-related variables and use the first resulting PC as the new feature. By design, the first PC found this way will only involve the three crime-related variables, which are our interest, but not UrbanPop. This is done in CHUNK 12.

# CHUNK 12 
USArrests.3 <- USArrests 

# Run a PCA on only the three crime-related variables 
PCA.3 <- prcomp(USArrests.3[, c(1, 2, 4)], center = TRUE, scale. = TRUE) 
PCA.3$rotation
               PC1        PC2        PC3
Murder  -0.5826006  0.5339532 -0.6127565
Assault -0.6079818  0.2140236  0.7645600
Rape    -0.5393836 -0.8179779 -0.1999436
USArrests.3$PC1 <- PCA.3$x[, 1] 
head(USArrests.3) 
           Murder Assault UrbanPop Rape        PC1
Alabama      13.2     236       58 21.2 -1.1980278
Alaska       10.0     263       48 44.5 -2.3087473
Arizona       8.1     294       80 31.0 -1.5033307
Arkansas      8.8     190       50 19.5 -0.1759894
California    9.0     276       91 40.6 -2.0452358
Colorado      7.9     204       78 38.7 -1.2634133

Comparing the results in CHUNKs 10, 11, and 12, we can see that the three new features possess approximately the same values. Which method should we use on the exam? The June 2020 exams explicitly ask that you “create a new feature based on the first principal component” (June 16 and 19 exams) or “Run the code [given in the Rmd file] to attach the principal components you will use to the data frame” (June 17-18 exams), so it is a no-brainer to use Method 1. If you are given rather vague guidance like “generate a new feature based on your PCA output” and there are a large number of variables, my personal recommendation is to follow the June 2019 PA exam model solution and use the second method, although it is far from wrong to take the first PC. The third method is rather ad hoc and generally not recommended.

 

Deleting the Original Variables

Irrespective of which feature generation method you adopt, it is of utmost importance to drop the variables that contribute to the new feature. Failure to do so will result in a duplication of information in the data and raise the issue of multicollinearity when a GLM is fitted. Using the second feature generation method as an example, we can delete the three crime-related variables in CHUNK 13

# CHUNK 13 
USArrests.2$Murder <- NULL 
USArrests.2$Assault <- NULL 
USArrests.2$Rape <- NULL
# OR
# USArrests.2[, c(1, 2, 4)] <- NULL
head(USArrests.2)
           UrbanPop      crime
Alabama          58 -1.1205719
Alaska           48 -2.2676396
Arizona          80 -1.4675357
Arkansas         50 -0.1586647
California       91 -2.0092964
Colorado         78 -1.2598717

After the deletion, the USArrests.2 dataset only has two features, UrbanPop and crime, as desired.

If you prefer to keep the original variables (perhaps they have other uses in the rest of the exam project), then you have to take extra care to exclude them from the formula when fitting predictive models. It is too easy to make silly but serious mistakes on the exam!

 

EXAM NOTE

The June 2019 PA exam model solution says that:

[m]any candidates did not drop the contributing variables to the new feature, leading to a rank-deficient model in [a later task].

 

Cluster Analysis

Cluster analysis takes a different approach to data exploration. It works by partitioning the observations in a dataset into a set of distinct homogeneous subgroups, better known as clusters, with the goal of revealing hidden patterns in the data. Observations within each cluster share similar characteristics (feature values) whereas observations in different clusters are rather different from one another. Of course, we will have to quantify similarity and we will do so very shortly. Here are two daily life applications of cluster analysis (rephrased from ISLR) with a positive business impact:

  • Biomedical studies: Suppose that we have a set of biomedical data where the observations correspond to tissue samples for patients with breast cancer and the features are different clinical and gene measurements collected for each tissue sample. Via cluster analysis, we may be able to identify unknown subtypes of breast cancer on the basis of the measurements taken on the tissue samples. This is an unsupervised learning problem because we are attempting to discover structure in the data, not to predict a (categorical) target variable, and we do not know how many subtypes of breast cancer there are beforehand. (For a categorical target variable, we know how many levels it has.)
  • Market segmentation: Now suppose that you operate a company and have a dataset of customers and their information (e.g., purchase history, geographical location, age, gender). Cluster analysis may allow us to partition the data into a few subgroups of customers with similar shopping preferences. We can then better target our sales efforts and have advertisements tailor-made for these customers to improve sales.

In Exam PA, we will learn two commonly used clustering methods, k-means clustering and hierarchical clustering, and look at their underlying rationale and practical implementations.

Note that although clustering itself is an unsupervised learning technique, the group assignment created as a result of clustering is a factor variable that may serve as a useful feature for supervised learning. This new feature can be used in place of the original variables for constructing a predictive model. In the data exploration part of some past PA exam projects, candidates were asked to perform clustering and use the output to create a new feature, e.g., Task 5 of the December 7, 2020 exam and Task 3 of the Hospital Readmissions sample project.

 

k-Means Clustering

The aim of k-means clustering is to assign each observation in a dataset into one and only one of k clusters. The number of clusters, k, is specified upfront, after which the algorithm automatically searches for the best (in some sense) configuration of the k clusters.

 

k-Means Clustering Algorithm

How do we define a “good” grouping? Intuitively, the k clusters are chosen such that the variation of the observations inside each cluster is relatively small while the variation between clusters should be large. The smaller the within-cluster variation, the better the discrimination works. The k-means clustering algorithm formalizes this idea in search of the best similarity grouping. Without inundating you with mathematical formulas that play a minimal role in Exam PA, the k-means clustering algorithm can be described as follows.

  • Initialization: Randomly select k points in the feature space. These k points will serve as the initial cluster centers.
  • Iteration:
    • Step 1. Assign each observation to the cluster with the closest center in terms of Euclidean distance.
    • Step 2. Recalculate the center of each of the k clusters.
    • Step 3. Repeat Steps 1 and 2 until the cluster assignments no longer change.

Because the algorithm involves iteratively calculating the k means (more precisely, the k centers) of the clusters, such a clustering method is aptly called “k-means” (instead of “k-cluster”) clustering.

 

Random Initial Assignments

In each iteration of the k-means clustering algorithm, the within-cluster variation of the data is automatically reduced. At the completion of the algorithm, we are guaranteed to arrive at a local (but not necessarily global) optimum. The algorithm, however, relies on the initial cluster assignments, which are made randomly. A different set of initial cluster centers may give rise to a different final set of clusters and a different local optimum. For some datasets, the presence of outliers may also distort the cluster arrangements. If one of the initial centers is too close to the outlier, then only that outlier will get assigned to that center and form its own cluster well separated from the rest of the data.

To increase the chance of identifying a global optimum and obtaining a representative cluster grouping, it is always advisable to run the k-means algorithm many times (e.g., 20 to 50) with different randomly chosen initial cluster centers, and then select the best solution, i.e., the cluster assignments that result in the lowest within-cluster variation. The R function we will use to run a k-means cluster analysis has a built-in argument to allow for this, as we will see in Subsection 6.2.3.
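One standard option is base R's kmeans(), whose nstart argument runs the algorithm with multiple random starts and keeps the best result; whether this is exactly the function treated in Subsection 6.2.3 is left to that subsection, and the feature matrix X below is a hypothetical placeholder.

    # Run k-means with 25 random sets of initial centers and keep the best solution
    set.seed(2023)
    km <- kmeans(X, centers = 3, nstart = 25)

    km$cluster        # final cluster assignment of each observation
    km$tot.withinss   # total within-cluster variation of the best run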

 

Choosing the Number of Clusters: The Elbow Method

Thus far we have performed k-means clustering with a pre-specified value of k. In practice, how should we go about selecting the value of k? The elbow method is a possible solution, based on the fact that a good grouping is characterized by a small within-cluster variation but a large between-cluster variation. Following this idea, we can plot the ratio of the between-cluster variation to the total variation in the data (details of the calculation are not covered in the PA modules) against the value of k. When the proportion of variance explained has plateaued (i.e., the increase that comes with an additional cluster starts to drop off), we are said to have reached an “elbow,” and the corresponding value of k provides an appropriate number of clusters to segment the data.
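A minimal R sketch of the elbow method, again with a hypothetical feature matrix X (the grid of 10 candidate values of k is arbitrary), might look as follows:

    # Proportion of variance explained (between-cluster SS / total SS) for k = 1, ..., 10
    set.seed(2023)
    k_values <- 1:10
    prop_explained <- sapply(k_values, function(k) {
      km <- kmeans(X, centers = k, nstart = 25)
      km$betweenss / km$totss
    })

    # Plot the ratio against k and look for the "elbow," beyond which adding
    # another cluster yields little extra explained variation
    plot(k_values, prop_explained, type = "b",
         xlab = "Number of clusters k", ylab = "Between-cluster SS / total SS")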

 

Hierarchical Clustering

The k-means clustering algorithm we used in Subsection 6.2.1 requires that the number of clusters k be specified at the start. This can be a problem because in many situations it is not a priori clear how many clusters are needed. Hierarchical clustering, the second clustering method we will study, operates differently. Not only does it not require the choice of k in advance, but the cluster groupings it produces can also be displayed using a dendrogram, a tree-based visualization of the “hierarchy” of clusters produced and the resulting cluster compositions.

 

Inter-Cluster Dissimilarity

Hierarchical clustering consists of a series of fusions (or mergers) of observations in the data. This bottom-up clustering method starts with the individual observations, each treated as a separate cluster, and successively fuses the closest pair of clusters, one pair at a time. The process goes on iteratively until all clusters are eventually fused into a single cluster containing all of the observations.

How do we measure the distance or dissimilarity between two clusters? The distance between two observations is straightforward: One often uses the Euclidean distance, as in k-means clustering. The notion of the distance between two clusters, at least one of which contains more than one observation, is somewhat more ambiguous and relies on the linkage used, which defines the dissimilarity between two groups of observations. Four commonly used types of linkage are:

  • Complete: the maximal pairwise distance between observations in one cluster and observations in the other cluster, i.e., look at the furthest pair of observations.
  • Single: the minimal pairwise distance between observations in one cluster and observations in the other cluster, i.e., look at the closest pair of observations.
  • Average: the average of all pairwise distances between observations in one cluster and observations in the other cluster.
  • Centroid: the distance between the two cluster centroids (or centers).

Note that average linkage and centroid linkage, while similar on the surface, are different in terms of the order in which distance calculations and averaging are performed.

  • For average linkage, we first compute all pairwise distances, then take an average.
  • For centroid linkage, we first take an average of the feature values to get the two centroids, then compute the distance between the two centroids.

The choice of the linkage can have a dramatic impact on the way the observations are clustered.

Using complete linkage can result in a large distance between two intuitively close clusters just because a single pair of observations from the two clusters is far apart. Conversely, using single linkage can result in a small distance between two intuitively distant clusters just because a single pair of observations from the two clusters is close together.

Average linkage strikes a balance between these two extremes. In practice, complete and average linkage are commonly used, as they tend to result in more balanced (i.e., the clusters formed have a similar number of observations) and visually appealing clusters. Single linkage, on the other hand, may result in extended, trailing clusters in which single observations are fused one at a time.
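To make the four linkage definitions concrete, here is a tiny R sketch that computes each of them by hand for two hypothetical clusters of two observations each; note in particular that average and centroid linkage give different answers even though both involve averaging.

    # Two toy clusters in two dimensions
    cluster_A <- matrix(c(0, 0,
                          0, 2), ncol = 2, byrow = TRUE)
    cluster_B <- matrix(c(3, 0,
                          3, 2), ncol = 2, byrow = TRUE)

    # All pairwise distances between the two clusters
    pairwise <- as.matrix(dist(rbind(cluster_A, cluster_B)))[1:2, 3:4]

    max(pairwise)    # complete linkage: sqrt(13), about 3.61
    min(pairwise)    # single linkage: 3
    mean(pairwise)   # average linkage: about 3.30

    # Centroid linkage: average the feature values first, then take one distance
    sqrt(sum((colMeans(cluster_A) - colMeans(cluster_B))^2))   # exactly 3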

 

Hierarchical Clustering Algorithm

Now that we know how to measure inter-cluster dissimilarities using a given linkage, we are ready to implement the hierarchical clustering algorithm and fuse the clusters iteratively.
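In base R, this fusion process can be carried out with hclust(), applied to a matrix of pairwise distances produced by dist(); the feature matrix X below is again a hypothetical placeholder.

    # Pairwise Euclidean distances between the (scaled) observations
    d <- dist(X)

    # Agglomerative hierarchical clustering with complete linkage;
    # other common choices for method include "single" and "average"
    hc <- hclust(d, method = "complete")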

 

Dendrogram

A very attractive aspect of hierarchical clustering is that we can represent the evolution of the clusters over the course of the algorithm vividly with the aid of a dendrogram, an upside-down tree showing the sequence of fusions, with the inter-cluster dissimilarity at which each fusion takes place measured on the vertical axis. From this visual representation, we can easily tell which clusters are in close proximity to one another (in terms of feature values). Clusters that are joined towards the bottom of the dendrogram are rather similar to one another, while those that fuse towards the top are rather far apart.

A useful by-product of a dendrogram is that we can view the cluster compositions at once for every desired number of clusters, from 1 to n. In short, one picture tells it all! Once a dendrogram is constructed, we can determine the clusters by making a horizontal cut across the dendrogram. The resulting clusters are the distinct branches immediately below the cut. The lower the cut, the more clusters we create. We can get any desired number of clusters from 1 (when there is no cut) to n (when the height of the cut is 0) by varying the height of the cut.

  • Nested clusters: The clusters formed by cutting the dendrogram at one height must be nested within those formed by cutting the dendrogram at a greater height.
  • One dendrogram suffices: The height of the cut plays the same role in determining the number of clusters as k in k-means clustering, with one important difference: The height need not be specified in advance. Only after the dendrogram is constructed does the number of clusters have to be chosen to determine the composition of the clusters. If you want to change the number of clusters in the future, all you need to do is vary the height of the horizontal cut across the dendrogram, and no reconstruction is needed, as the short sketch after this list illustrates. Compare this with k-means clustering, where changing the value of k requires rerunning the whole algorithm.
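Continuing with the hypothetical hc object fitted earlier, a minimal sketch of this workflow is:

    plot(hc)                        # draw the dendrogram

    # Cut the dendrogram by the desired number of clusters ...
    clusters_k <- cutree(hc, k = 4)
    table(clusters_k)               # resulting cluster sizes

    # ... or by the height of the horizontal cut (2.5 is an arbitrary height)
    clusters_h <- cutree(hc, h = 2.5)

    # Changing the number of clusters later only requires another call to
    # cutree(); the dendrogram itself never has to be rebuilt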

 

k-Means vs. Hierarchical Clustering

Table 6.1: Key differences between k-means clustering and hierarchical clustering

  • Is randomization needed? k-means clustering: yes, to determine the initial cluster centers. Hierarchical clustering: no.
  • Is the number of clusters pre-specified? k-means clustering: yes, k must be specified upfront. Hierarchical clustering: no.
  • Are the clusters nested? k-means clustering: no. Hierarchical clustering: yes, the clusters form a hierarchy.

 

Scaling of Variables Matters

For both k-means clustering and hierarchical clustering, the Euclidean distance calculations depend very much on the scale on which the feature values are measured.

  • Without scaling: If the variables are not in the same unit and one variable is of a much larger order of magnitude, then that variable will dominate the distance calculations and exert a disproportionate impact on the cluster arrangements. As an example, take p = 2 and think of X1 ranging narrowly from 0 to 1 and X2 ranging widely from 0 to 1000. In this case, much of the Euclidean distance between any pair of observations is attributed to their X2 values; their X1 values do not really matter.
  • With scaling: In contrast, when the features are standardized, we essentially attach equal importance to each feature when performing distance calculations, which is more desirable for most applications.

Hence, just as with PCA, the variables are typically scaled to have unit standard deviation prior to performing cluster analysis, unless there is a clear reason for not doing so.
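As a minimal sketch, assuming df is a hypothetical data frame containing only numeric features, the standardization step and the subsequent clustering calls might look like this:

    # Standardize every feature to mean 0 and standard deviation 1
    X_scaled <- scale(df)
    apply(X_scaled, 2, sd)    # check: each column now has standard deviation 1

    # Cluster on the standardized features
    km <- kmeans(X_scaled, centers = 3, nstart = 25)
    hc <- hclust(dist(X_scaled), method = "average")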