
SOA ASA Exam: Predictive Analysis (PA) – 3.2 Linear Models Case Study 2: Feature Selection and Regularization


Learning Objectives

After completing this case study, you should be able to:

  • Fit a multiple linear regression model with both numeric and categorical (factor) predictors.
  • Detect and accommodate interactions between predictors which can be quantitative or qualitative.
  • Perform explicit binarization of categorical predictors using the dummyVars() function from the caret package and understand why doing so may be beneficial.
  • Perform stepwise selection using the stepAIC() function from the MASS package and be familiar with the different options allowed by this function.
  • Generate and interpret diagnostic plots for a linear model.
  • Implement regularized regression using the glmnet() and cv.glmnet() functions from the glmnet package.

 

Stage 1: Define the Business Problem

Objectives

Our goal here is to identify and interpret key factors that relate to a higher or lower Balance with the aid of appropriate linear models.

 

Stage 2: Data Collection

Data Design

Relevance

Read in data and remove irrelevant variables.

# CHUNK 1
library(ISLR)
data(Credit)
Credit$ID <- NULL

 

Data Description

The Credit dataset contains n = 400 observations and 11 variables. Numeric predictors are listed first, followed by categorical ones. The target variable is the last variable in the dataset, Balance, an integer-valued variable that ranges from 0 to 1,999 and represents the credit card balance (in $) for an individual. The other variables describe the financial and demographic factors of an individual and can be grouped as follows:

  • The first six variables, Income, Limit, Rating, Cards, Age, and Education, are numeric predictors for Balance.
  • The other four variables, Gender, Student, Married, and Ethnicity, are categorical predictors. The first three variables are binary and Ethnicity has three levels.

TASK 1: Consider a data issue

  • Identify a variable that may have potential ethical concerns.
  • Discuss the benefits and concerns related to using this variable in a model for this business problem. Regardless of any concerns, continue to use this variable in subsequent analyses.

 

Data Quality Issues

Ethnicity

Before doing any predictive analysis, one of the variables that we may have to pay attention to is Ethnicity, which can be a controversial variable to use.

On the positive side, this variable may provide potentially useful information for understanding or predicting credit card balance. Depending on the ultimate objective of the study, the ability to make more accurate predictions for credit card balance may have huge societal benefits (e.g., issuing credit cards to appropriate prospective cardholders).

On the negative side, incorporating ethnicity into the prediction procedure may be a sensitive move as making ethnicity-based predictions may be criticized on grounds of discrimination and raise legal concerns.

For this case study, we will take Ethnicity as an ordinary predictor and see if it may prove useful for predicting Balance.

 

Stage 3: Exploratory Data Analysis

TASK 2: Explore the numeric variables

  • Investigate the distribution of the target variable.
  • Create a correlation matrix for all the numeric variables (including the target variable) in the dataset.
    • Examine the pairwise correlations, both between the predictors and in relation to the target variable, and state your observations.
    • Explain whether it is reasonable to delete the Limit variable. Regardless of your response, delete this variable from further consideration.
  • Create visual representations between the numeric variables and the target variable to identify variables most likely to predict the target.

 

Descriptive Statistics: Numeric Predictors

Data Exploration: Distribution of the Target Variable

Even if it is not directly requested in the task statement, it is a good idea to explore the distribution of the target variable for virtually all data exploration tasks. The nature of the target variable may suggest useful transformations and determine which predictive analytic techniques should be applied.

 

Common Mistakes

  • Failed to consider the skewed distribution of the target variable before launching into the analysis.

 

To begin with, let’s explore the distributional characteristics of the target variable Balance.

summary(Credit)

     Income           Limit           Rating          Cards            Age          Education        Gender    Student   Married              Ethnicity      Balance       
 Min.   : 10.35   Min.   :  855   Min.   : 93.0   Min.   :1.000   Min.   :23.00   Min.   : 5.00    Male :193   No :360   No :155   African American: 99   Min.   :   0.00  
 1st Qu.: 21.01   1st Qu.: 3088   1st Qu.:247.2   1st Qu.:2.000   1st Qu.:41.75   1st Qu.:11.00   Female:207   Yes: 40   Yes:245   Asian           :102   1st Qu.:  68.75  
 Median : 33.12   Median : 4622   Median :344.0   Median :3.000   Median :56.00   Median :14.00                                    Caucasian       :199   Median : 459.50  
 Mean   : 45.22   Mean   : 4736   Mean   :354.9   Mean   :2.958   Mean   :55.67   Mean   :13.45                                                           Mean   : 520.01  
 3rd Qu.: 57.47   3rd Qu.: 5873   3rd Qu.:437.2   3rd Qu.:4.000   3rd Qu.:70.00   3rd Qu.:16.00                                                           3rd Qu.: 863.00  
 Max.   :186.63   Max.   :13913   Max.   :982.0   Max.   :9.000   Max.   :98.00   Max.   :20.00                                                           Max.   :1999.00

Findings

From the summary in CHUNK 1, the median, mean, and maximum of Balance are 459.50, 520.01, and 1,999, respectively. The fact that the mean exceeds the median suggests that the distribution of Balance is skewed to the right (although the skewness does not seem severe).
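If you prefer a numeric check over eyeballing the mean-median gap, you can compute a moment-based sample skewness directly. Below is a minimal sketch (assuming the Credit data frame from CHUNK 1 is loaded); a positive value confirms right skewness:

# Moment-based sample skewness of Balance
bal <- Credit$Balance
mean((bal - mean(bal))^3) / sd(bal)^3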

 

Graphical Display: Numeric Predictors

Histogram

# CHUNK 2
library(ggplot2)
ggplot(Credit, aes(x = Balance)) +
    geom_histogram()

nrow(Credit[Credit$Balance <= 0, ])
[1] 90

Findings

As confirmed by the histogram of Balance constructed in CHUNK 2, the distribution of Balance is skewed to the right. However, there is a large number of observations (90, to be exact) for which Balance = 0. This makes the log transformation problematic, so we will leave Balance as is.

 

Exercise: Skewness-Correction Transformation

Suggest a possible transformation to reduce the right skewness of Balance. Do not apply this transformation.

Solution

While the log transformation is inapplicable to Balance due to the presence of zero values, the square root transformation still works well because Balance is non-negative.

Remark

Try to change Balance in CHUNK 2 to sqrt(Balance) and you will find that the distribution of the square root of Balance becomes more symmetric, although the mass at zero remains prominent.
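As a quick sketch of that check (assuming ggplot2 is loaded as in CHUNK 2; we only visualize the transformation and do not overwrite Balance):

# Histogram of the square-root-transformed target (exploration only)
ggplot(Credit, aes(x = sqrt(Balance))) +
    geom_histogram()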

Correlation Matrix

Now let’s turn to the relationship between each numeric predictor and the target variable, which involves bivariate data exploration, with the aid of summary statistics and graphical displays. To have a quick glance at which of the six numeric variables are likely to have significant predictive power, in CHUNK 3 we generate the correlation matrix for these variables and the target variable. To make the matrix easier to inspect, all pairwise correlations are rounded to 4 decimal places.

# CHUNK 3 
# Calculate the correlation matrix for numeric variables 
# The numeric predictors are in the first 6 columns 
cor.matrix <- cor(Credit[, c(1:6, 11)])
print("Correlation Matrix")
round(cor.matrix, digits = 4)

[1] "Correlation Matrix"
           Income   Limit  Rating   Cards    Age Education Balance
Income     1.0000  0.7921  0.7914 -0.0183 0.1753   -0.0277  0.4637
Limit      0.7921  1.0000  0.9969  0.0102 0.1009   -0.0235  0.8617
Rating     0.7914  0.9969  1.0000  0.0532 0.1032   -0.0301  0.8636
Cards     -0.0183  0.0102  0.0532  1.0000 0.0429   -0.0511  0.0865
Age        0.1753  0.1009  0.1032  0.0429 1.0000    0.0036  0.0018
Education -0.0277 -0.0235 -0.0301 -0.0511 0.0036    1.0000 -0.0081
Balance    0.4637  0.8617  0.8636  0.0865 0.0018   -0.0081  1.0000

 

Findings

    • Which variables are predictive of Balance:
      One obvious observation is related to the last row or column of the correlation matrix, which contains the correlation between Balance and each of the six numeric predictors.
      Looking at the sign and size of these six correlations, we can say that Limit (0.8617) and Rating (0.8636) have very strongly positive linear relationships with Balance, and Income (0.4637) exerts a moderately positive linear effect on Balance.
      The linear relationships between Balance and the other four numeric variables are weak at best given the negligible correlations.
      At this preliminary stage, Limit, Rating, and Income appear to be important predictors affecting Balance favorably, which aligns with our intuition.
    • Correlations between variables:
      A less obvious observation has to do with the correlations within the six numeric predictors, which are contained in the first six rows and columns of the correlation matrix. There are some notable correlations concerning three variables:

      • Between Income and Limit (0.7921)
      • Between Income and Rating (0.7914)
      • Between Limit and Rating (0.9969)

These large correlations are a concern: they suggest duplication of the information contained in the three variables and may complicate the interpretation of the regression coefficients. The last correlation is the most alarming, indicating that Limit and Rating are almost perfectly (and positively) collinear. This is further confirmed by plotting one variable against the other, as we do in CHUNK 4. The scatterplot reveals that the two variables are virtually perfectly related in a linear fashion, so one contributes little additional information when the other is present.
Therefore, we will keep one and only one of the two predictors, say Rating, and delete Limit (this choice can be justified by the slightly higher correlation between Rating and Balance), as we do in the last line of CHUNK 4.

Scatterplots: geom_point()

# CHUNK 4
ggplot(Credit, aes(x = Limit, y = Rating)) +
    geom_point()

Findings:

    • See above.

 

# Delete Limit 
Credit$Limit <- NULL

 

Scatterplots between numeric variables and the target variable

The pairwise correlations in CHUNK 3 are only a measure of the strength of the linear relationships between the numeric variables and Balance. We now use visual displays to gain a more holistic view on their relationships and further confirm the predictive power of Income and Rating (and Limit, if it still remains).

The best way to visualize the relationship between two numeric variables is a scatterplot. In CHUNK 5, we use a for loop to make a scatterplot, with a smoothed line superimposed, for Balance against each of the five remaining numeric predictors.

# CHUNK 5 
# first save the names of the five numeric predictors as a vector 
vars.numeric <- colnames(Credit[ , 1:5]) 

for (i in vars.numeric) {
    plot <- ggplot(Credit, aes(x = Credit[, i], y = Balance)) +
        geom_point() +
        geom_smooth(method = "lm", se = FALSE) +
        labs(x = i)
    print(plot)
}
 

Findings

We can see that Balance has an unquestionably strong positive linear relationship with Rating. There is also a positive linear relationship between Balance and Income, though not as strong. In contrast, there are hardly any systematic patterns between Balance and each of Cards, Age, and Education: the fitted regression lines are almost flat, meaning that Balance is not sensitive to changes in these three predictors. All of these findings are consistent with our earlier deductions based on the correlation matrix in CHUNK 3. (By the way, you can see a series of vertical strips in the scatterplots for Cards and Education because these two variables are integer-valued with few distinct values.)

 

Descriptive Statistics: Categorical Predictors

TASK 3: Explore the factor variables

Investigate the distribution of the factor variables. Identify a factor variable that is likely to predict the target. For this factor variable (there is no need to discuss the other variables in your response), perform the following:

  • Show key descriptive statistics.
  • Create visual representations (e.g., graphs, charts).
  • Explain why you selected this variable and how the variable relates to the target variable.

In this task, we will investigate the four categorical predictors, Gender, Student, Married, and Ethnicity, and identify one of them which is likely to be predictive. Some preliminary descriptive statistics of these four variables can be found in the summary in CHUNK 1. For instance, 360 (90%) of the observations are non-students while 40 (10%) of them are students.

In CHUNK 6, we produce the mean (and median) of Balance split by different levels of each of the four factor variables.

# CHUNK 6
library(tidyverse)

# Save the names of the categorical predictors as a vector
vars.categorical <- c("Gender", "Student", "Married", "Ethnicity")

for (i in vars.categorical) {
    x <- Credit %>%
        group_by(.data[[i]]) %>%  # .data[[i]] replaces the deprecated group_by_(i)
        summarise(
            mean = mean(Balance),
            median = median(Balance),
            n = n()
        )
    print(x) 
}
# A tibble: 2 × 4
  Gender    mean median     n
  <fct>    <dbl>  <int> <int>
1 " Male"   510.    463   193
2 "Female"  530.    456   207
# A tibble: 2 × 4
  Student  mean median     n
  <fct>   <dbl>  <dbl> <int>
1 No       480.    424   360
2 Yes      877.    953    40
# A tibble: 2 × 4
  Married  mean median     n
  <fct>   <dbl>  <int> <int>
1 No       523.    467   155
2 Yes      518.    454   245
# A tibble: 3 × 4
  Ethnicity         mean median     n
  <fct>            <dbl>  <dbl> <int>
1 African American  531     480    99
2 Asian             512.    414   102
3 Caucasian         518.    465   199

Findings

Of the four categorical predictors, only Student stands out as making a significant difference to the mean of Balance over its two levels, “Yes” and “No”, with students (876.8250) having a much higher average Balance than non-students (480.3694). The mean of Balance is more or less the same across the levels of Gender, Married, and Ethnicity. These three variables appear to be ineffective in differentiating the observations of Balance.

 

Graphical Display: Categorical Predictors

Split Boxplots

Let’s supplement our conclusions above with some graphical displays. Since Balance is a numeric variable but we are now dealing with categorical predictors, we will make use of split boxplots, as we do in CHUNK 7.

# CHUNK 7
for (i in vars.categorical) {
    plot <- ggplot(Credit, aes(x = Credit[, i], y = Balance)) +
        geom_boxplot() +
    labs(x = i)
    print(plot)
}

Findings

As we expect, the box for students is noticeably higher than that for non-students, with a similar spread, and the boxes are very similar across different factor levels for each of Gender, Married, and Ethnicity. These observations support our earlier findings that Student seems to be the only factor variable with significant predictive power.

Overall, the most promising predictors for Balance are Income (numeric), Rating (numeric), and Student (factor).

 

TASK 4: Consider two graphs

Your assistant has built the following two plots and asked for your comments:

  1. A boxplot of Age by Student.
  2. A scatterplot of Balance against Income colored by Student.

Run the code to make them.

  • Include them in your response.
  • State your observations and discuss the impact, if any, of each plot on your later modeling.

Occasionally on the exam, you are asked to focus on specific graphs that highlight subtle but important aspects of the
data that may go unnoticed in a standard data exploration exercise. The two graphs constructed in CHUNK 8 are good examples.

Age against Student

# CHUNK 8
ggplot(Credit, aes(x = Student, y = Age)) +
    geom_boxplot()
Findings

The boxplot of Age by Student shows that students tend to be younger (a lower median age). This makes intuitive sense, although the median age of students, which is close to 50, seems unreasonably high (well, this dataset is simulated!). However, this also means that there is some co-dependence (but not interaction!) between Age and Student, and the effects of these two variables on Balance may be distorted when they serve as predictors at the same time. Fortunately, this concern may not be relevant after all, as we saw in Task 2 that Age is unlikely to be an important predictor of Balance.
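A quick numeric confirmation of this age gap (a sketch, assuming the tidyverse is loaded as in CHUNK 6):

# Median and mean Age split by Student status
Credit %>%
    group_by(Student) %>%
    summarise(median.age = median(Age), mean.age = mean(Age))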

 

Balance against Income by Student

# CHUNK 8
ggplot(Credit, aes(x = Income, y = Balance, color = Student)) +
    geom_point() +
    geom_smooth(method = "lm", se = FALSE)

Findings

The scatterplot shows Balance against Income, with the observations colored by Student and a smoothed line superimposed for each level of Student. This plot highlights the possible interaction between Income and Student in affecting Balance.

Recall that an interaction means that the expected effect of one predictor on the target variable depends on the value or level of another predictor. This is reflected in the scatterplot, where the slopes of the two smoothed lines differ quite remarkably, with the non-students being more sensitive to Income than students perhaps due to their financial status. To accommodate the interaction between Income and Student, we will include the interaction variable Income:Student when constructing linear models in later tasks.
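If you want to see this interaction in model form before Task 5, you could fit a small two-predictor model. This is only a sketch that deliberately ignores the other variables; the full model comes in CHUNK 10:

# Income*Student expands to Income + Student + Income:Student;
# the Income:StudentYes row estimates how much the Income slope
# differs for students relative to non-students
summary(lm(Balance ~ Income * Student, data = Credit))$coefficients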

Note

It is a good idea to briefly describe what an interaction is whenever this concept arises before explaining why there is an interaction. This not only helps structure your response, but also provides a good opportunity for you to demonstrate to the SOA your knowledge of predictive analytics!

 

Stage 4: Model Construction and Evaluation

Model Evaluation: Training/Test Set Split

TASK 5: Explore the effect of releveling factor variables

  • Split the data into training and test sets.
  • Fit a multiple linear regression model for Balance on all variables (except Limit) on the training set. Be sure to take note of the implications in Task 4. Interpret the coefficient estimates for Married and for the Asian level of Ethnicity.
  • Relevel the factor variables so that the most frequent level becomes the baseline level and refit the multiple linear regression model. Interpret the coefficient estimates for Married and for the Asian level of Ethnicity again.

 

Creation of training and test sets: createDataPartition()

To pave the way for constructing and evaluating the predictive performance of different linear models that will come later, we start this subsection by setting up the training and test sets, again using the createDataPartition() function. This is done in CHUNK 9.

createDataPartition(<target_variable>, p = <proportion of training data>, list = <FALSE to return a matrix of row indices rather than a list>)

# CHUNK 9
library(caret)
set.seed(8964)
partition <- createDataPartition(Credit$Balance, p = 0.75, list = FALSE)
data.train <- Credit[partition, ]
data.test <- Credit[-partition, ]
print("TRAIN")
mean(data.train$Balance)
print("TEST")
mean(data.test$Balance)
[1] "TRAIN"
> mean(data.train$Balance)
[1] 520.4352

[1] "TEST"
> mean(data.test$Balance)
[1] 518.7374

Again, the mean of Balance in the training set is pretty close to that in the test set due to the built-in stratification.

 

Model Construction

Step 1: Fitting the First Model

In CHUNK 10, we fit the linear model for Balance regressed on all variables together with the interaction between Income and Student identified in Task 4 on the training set, save the fitted model as model.full (meaning the “full model”), and output a model summary.

Note that when specifying the interaction term in the formula, it is enough to use the colon operator instead of the asterisk because the placeholder “.” already includes the main effects terms. (In fact, if you change Income:Student to Income*Student, the model summary will remain the same.)

# CHUNK 10 
model.full <- lm(Balance ~ . + Income:Student, data = data.train)
summary(model.full)
Call:
lm(formula = Balance ~ . + Income:Student, data = data.train)

Residuals:
    Min      1Q  Median      3Q     Max 
-189.17  -76.46  -15.65   66.69  296.17 

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)    
(Intercept)        -552.34353   40.34543 -13.690   <2e-16 ***
Income               -7.93864    0.29465 -26.943   <2e-16 ***
Rating                4.01246    0.06416  62.540   <2e-16 ***
Cards                 2.89772    4.69168   0.618   0.5373    
Age                  -0.78778    0.35994  -2.189   0.0294 *  
Education             0.88922    1.92035   0.463   0.6437    
GenderFemale        -18.37030   12.00867  -1.530   0.1272    
StudentYes          395.67744   30.72302  12.879   <2e-16 ***
MarriedYes          -20.46469   12.39735  -1.651   0.0999 .  
EthnicityAsian        6.53269   17.13794   0.381   0.7033    
EthnicityCaucasian   11.95386   14.82444   0.806   0.4207    
Income:StudentYes     0.50102    0.50028   1.001   0.3174    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 103.1 on 289 degrees of freedom
Multiple R-squared:  0.952,	Adjusted R-squared:  0.9502 
F-statistic: 521.4 on 11 and 289 DF,  p-value: < 2.2e-16

 

# of Levels of the Categorical Predictors

Note that the lm() function automatically turns the four categorical predictors into dummy variables, which then enter the equation of the multiple linear regression model; we need not create the dummy variables beforehand. These dummy variables are named in the form of

<categorical_predictor><level>

For example, StudentYes is a dummy variable that equals 1 if the observation in question is a student and 0 otherwise. A categorical predictor with r levels is represented by r-1 dummy variables, each an indicator of one and only one of the r-1 non-baseline levels of the predictor, so:

    • Gender, Student, and Married are each represented by 1 dummy variable, and
    • Ethnicity by 2 dummy variables.

The level that is left out is the baseline level, which, by default, is the alpha-numerically first level.

For instance, the baseline level of Student is “No“, which comes before “Yes” in alphabetical order; this explains why instead of StudentNo, the StudentYes dummy variable appears in the model output.

Similarly, the baseline levels of Married and Ethnicity are “No“, and “African American“, respectively. (If you go back to the frequency tables of the four categorical predictors in CHUNK 1, you can see that the baseline levels are listed first).
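You can inspect the dummy coding that lm() sets up behind the scenes with the model.matrix() function; here is a minimal sketch:

# One dummy for the binary Student, two for the three-level Ethnicity;
# the baseline (alphanumerically first) level of each factor is left out
head(model.matrix(~ Student + Ethnicity, data = Credit))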

 

Coefficient Estimates

The coefficient estimates for the categorical predictors have to be interpreted differently from those for numeric predictors. Taking the estimates for MarriedYes and EthnicityAsian for illustration, we can say:

    • The credit card balance of married individuals is lower than that of unmarried (baseline) by 20.46469 on average, holding all other predictors fixed.
    • The credit card balance of Asians is higher than that of African Americans (baseline) by 6.53269 on average, holding all other predictors fixed.

The model summary also agrees with the findings in the data exploration tasks we completed (Tasks 2 and 3) and shows that:

Income, Rating, Student (or equivalently, StudentYes) are highly significant.

Perhaps to your astonishment, the interaction term between Income and Student is not quite significant. This does not necessarily contradict what we saw in Task 4, where we examined the relationships among Balance, Income, and Student, and ignored other variables.

In contrast, the linear model we have just fitted takes into account all of the variables in the Credit dataset. The model says that the interaction becomes insignificant when the effects of other variables have been accounted for.

By the way, you can see that:

    • The standard error of the coefficient estimate of the “Asian” level (17.13794) of the Ethnicity variable is larger than that of the “Caucasian” level (14.82444). This should come as no surprise to us, considering the larger number of observations in the latter level (see CHUNK 1 or CHUNK 6).

Recall that sparse (resp. populous) levels tend to have larger (resp. smaller) standard errors because the coefficient estimates for those levels are based on less (resp. more) data.

 

Releveling

In most of the PA sample and exam projects, a code chunk is supplied for you to relevel factor variables so that their baseline level is not the alpha-numerically first level, but the level with the most observations. While releveling has no effect on the predicted values, changing the baseline levels will:

  • affect the coefficient estimates of the dummy variables (which are now dummy variables with respect to another baseline level) and
  • possibly the significance statistics, such as t-statistics for the dummy variables and the corresponding p-values.

Remember, the p-values measure how different the non-baseline levels are from the baseline level, and changing the baseline level certainly changes the measurement of such differences.

In general, for linear models and GLMs the usual practice is to set the baseline level to the most populous level to improve the accuracy of our measures of significance. In CHUNK 11, we relevel the four categorical predictors using a for loop and the relevel() function.

# CHUNK 11
for (i in vars.categorical){
  # Use the table() function to calculate the frequencies for each factor
  table <- as.data.frame(table(Credit[, i]))
  # Determine the level with the highest frequency
  max <- which.max(table[, 2])
  # Save the name of the level with the highest frequency
  level.name <- as.character(table[max, 1])
  # Set the baseline level to the most populous level 
  Credit[, i] <- relevel(Credit[, i], ref = level.name) 
}

summary(Credit[, vars.categorical])
   Gender    Student   Married              Ethnicity  
Female:207   No :360   Yes:245   Caucasian       :199  
 Male :193   Yes: 40   No :155   African American: 99  
                                 Asian           :102

As you can see, the baseline levels of Gender, Married, and Ethnicity have been changed respectively from “Male”, “No”, and “African American” to “Female”, “Yes”, and “Caucasian”, which now come first in the frequency tables above.
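A one-line check that the releveling behaved as intended (a sketch, reusing vars.categorical from CHUNK 6):

# The first level of each factor is its baseline level
sapply(Credit[, vars.categorical], function(x) levels(x)[1])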

 

What happens to the model output after doing the releveling? Run CHUNK 12 to refit the full model.

# CHUNK 12
# To make sure factors in the training set are releveled
data.train <- Credit[partition, ]
model.full <- lm(Balance ~ . + Income:Student, data = data.train)
summary(model.full)
Call:
lm(formula = Balance ~ . + Income:Student, data = data.train)

Residuals:
    Min      1Q  Median      3Q     Max 
-189.17  -76.46  -15.65   66.69  296.17 

Coefficients:
                            Estimate Std. Error t value Pr(>|t|)    
(Intercept)               -579.22466   40.08600 -14.450   <2e-16 ***
Income                      -7.93864    0.29465 -26.943   <2e-16 ***
Rating                       4.01246    0.06416  62.540   <2e-16 ***
Cards                        2.89772    4.69168   0.618   0.5373    
Age                         -0.78778    0.35994  -2.189   0.0294 *  
Education                    0.88922    1.92035   0.463   0.6437    
Gender Male                 18.37030   12.00867   1.530   0.1272    
StudentYes                 395.67744   30.72302  12.879   <2e-16 ***
MarriedNo                   20.46469   12.39735   1.651   0.0999 .  
EthnicityAfrican American  -11.95386   14.82444  -0.806   0.4207    
EthnicityAsian              -5.42117   14.68523  -0.369   0.7123    
Income:StudentYes            0.50102    0.50028   1.001   0.3174    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 103.1 on 289 degrees of freedom
Multiple R-squared:  0.952,	Adjusted R-squared:  0.9502 
F-statistic: 521.4 on 11 and 289 DF,  p-value: < 2.2e-16

Comparing the output in CHUNK 12 with that in CHUNK 10, we can see that the coefficient estimates of the dummy variables for Married and Ethnicity have changed, and so have the p-values (e.g., for EthnicityAsian). We can say that:

  • The credit card balance of unmarried individuals is higher than that of married individuals (baseline) by 20.46469 on average, holding all other predictors fixed. The size of the coefficient estimate remains unchanged, but its sign has flipped (-20.46469 -> +20.46469).
  • The credit card balance of Asians is lower than that of Caucasians (baseline) by 5.42117 on average, holding all other predictors fixed. If you are interested, this coefficient estimate can be computed from the two coefficient estimates in CHUNK 10: 6.53269 – 11.95386 = -5.42117, as verified in the sketch below.
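The arithmetic in the last bullet can be checked directly:

# Asian vs. Caucasian = (Asian vs. African American) - (Caucasian vs. African American),
# using the two CHUNK 10 estimates
6.53269 - 11.95386
[1] -5.42117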

While there is no change in the p-value for Student, the p-value for the Asian level of Ethnicity is different from that in CHUNK 10. In general, the p-values for factor variables with three or more levels may depend on the choice of the baseline level. If we perform stepwise (backward) selection with binarization and try to drop the most statistically insignificant feature, that is, the one with the highest p-value, then the choice of the baseline levels of the factor variables may affect which feature is to be dropped first. This explains the importance of releveling.

 

TASK 6: Binarize Factor Variables Manually

  • Describe the advantages and disadvantages of binarizing the factor variables before performing feature selection.
  • Regardless, binarize the factor variables and refit the multiple linear regression model in Task 5.

 

One more preparatory step: Binarization of factor variables

Before implementing feature selection, it is sometimes desirable to explicitly binarize categorical predictors with three or more levels. The lm() function is capable of automatically converting categorical predictors into dummy variables behind the scenes, as we saw in CHUNKs 10 and 12, so whether or not we binarize the categorical predictors in advance has no impact on the fitted linear model. Nevertheless, explicit binarization can make the feature selection process we are about to perform more meaningful.

  • Advantages: Many feature selection functions in R treat factor variables as a single feature and either retain the variable with all of its levels or remove the variable completely. This “all or nothing” feature selection approach ignores the possibility that individual levels of a factor with three or more levels may be:
    • Insignificant with regard to the baseline level (and hence could be combined with it to form a bigger baseline level)
    • Insignificantly different from other levels (in which case they could be combined into a new, bigger level).

With binarization, we are able to add or drop individual factor levels when doing feature selection.

(Note: This advantage is not very relevant to this case study because only the Ethnicity variable has more than two levels, and neither of its dummy variables is statistically significant. We do binarization here mainly as an illustration.)

  • Disadvantages: Binarization of factor variables is not without disadvantages.
    • As each factor level is considered a separate feature to add or drop, the stepwise selection procedure may take considerably longer to complete. The increase in computational burden is non-trivial when there are a fairly large number of factor levels.
    • We may see some non-intuitive or nonsensical results when only a handful of the levels of a categorical predictor are selected.

Compared to generating new features such as \(x^2\), \(x^3\), \(\log(x)\), \(\exp(x)\), \(\sin(x)\) for a numeric variable x, binarization of categorical variables is conceptually more subtle and programmatically more difficult to implement in R.

The PA modules suggest using the dummyVars() function in the caret package and we will follow suit. Among the four categorical variables Gender, Student, Married, and Ethnicity, only Ethnicity has more than two levels and has to be binarized by two dummy variables. The binarization of Ethnicity can be done by running CHUNK 13.

# CHUNK 13
library(caret)
binarizer <- dummyVars(paste("~", paste(vars.categorical, collapse = "+")),
                       data = Credit, fullRank = TRUE)
# OR type out the categorical predictors one by one 
#binarizer <- dummyVars(~ Gender + Student + Married + Ethnicity, 
#                       data = Credit, fullRank = TRUE)
binarized_vars <- data.frame(predict(binarizer, newdata = Credit))
head(binarized_vars)
  Gender..Male Student.Yes Married.No Ethnicity.African.American Ethnicity.Asian
1            1           0          0                          0               0
2            0           1          0                          0               1
3            1           0          1                          0               1
4            0           0          1                          0               1
5            1           0          0                          0               0
6            1           0          1                          0               0

Explanation of Coding

The dummyVars() function takes a formula (there is no target variable) as the first argument and a data frame carrying the categorical variable(s) you want to binarize as the second argument.

In paste(vars.categorical, collapse = "+"), we use the paste() function to “paste” together the names of the four categorical predictors Gender, Student, Married, and Ethnicity (recall that vars.categorical is defined in CHUNK 6), separated by the + sign.

Then another application of the paste() function inserts the tilde sign before the “sum” of the names of the categorical predictors and forms the overall formula.

(Note: The use of the paste() function is suggested in the PA e-learning modules and projects. If you find the paste() function too difficult to work with, you may simply type out the formula ~ Gender + Student + Married + Ethnicity.)
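To see exactly what the nested paste() calls build, you can run the inner expression on its own:

# The formula string passed to dummyVars()
paste("~", paste(vars.categorical, collapse = "+"))
[1] "~ Gender+Student+Married+Ethnicity"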

The fullRank option can be set to TRUE or FALSE depending on your purpose:

    • Setting it to TRUE is appropriate for regression and allows you to have a full rank parameterization of the model matrix, with the baseline levels of the categorical predictors left out.
    • When the option is set to FALSE (the default value), the baseline levels are retained and a rank-deficient model matrix results. This is more appropriate for principal components and cluster analyses, which will be covered in Chapter 6.

Below is the outcome of setting fullRank = FALSE:

# CHUNK 13
library(caret)
binarizer <- dummyVars(paste("~", paste(vars.categorical, collapse = "+")),
                       data = Credit, fullRank = FALSE)
# OR type out the categorical predictors one by one 
#binarizer <- dummyVars(~ Gender + Student + Married + Ethnicity, 
#                       data = Credit, fullRank = FALSE) 
binarized_vars <- data.frame(predict(binarizer, newdata = Credit))
head(binarized_vars)
  Gender.Female Gender..Male Student.No Student.Yes Married.Yes Married.No Ethnicity.Caucasian Ethnicity.African.American Ethnicity.Asian
1             0            1          1           0           1          0                   1                          0               0
2             1            0          0           1           1          0                   0                          0               1
3             0            1          1           0           0          1                   0                          0               1
4             1            0          1           0           0          1                   0                          0               1
5             0            1          1           0           1          0                   1                          0               0
6             0            1          1           0           0          1                   1                          0               0

 

As we said above, it is enough to binarize only Ethnicity as it is the only categorical predictor with three or more levels. For illustration purposes, we have binarized all of the four categorical predictors to show how multiple factor variables can be dealt with at the same time; this is likely the case you will encounter in the exam. As we expect, the binarization procedure creates two dummy variables for Ethnicity, called Ethnicity.African.American and Ethnicity.Asian (recall that the baseline level after doing the releveling is Caucasian), indicating whether Ethnicity equals “African American” or “Asian“, respectively.

The binarized_vars data frame contains all of the dummy variables. In CHUNK 14, we combine it with the original Credit dataset via the cbind() function and rename it with a .bin suffix to distinguish it from the original non-binarized dataset.

# CHUNK 14 
Credit.bin <- cbind(Credit, binarized_vars) 
head(Credit.bin) 
   Income Rating Cards Age Education Gender Student Married Ethnicity Balance Gender.Female Gender..Male Student.No Student.Yes Married.Yes Married.No Ethnicity.Caucasian Ethnicity.African.American Ethnicity.Asian
1  14.891    283     2  34        11   Male      No     Yes Caucasian     333             0            1          1           0           1          0                   1                          0               0
2 106.025    483     3  82        15 Female     Yes     Yes     Asian     903             1            0          0           1           1          0                   0                          0               1
3 104.593    514     4  71        11   Male      No      No     Asian     580             0            1          1           0           0          1                   0                          0               1
4 148.924    681     3  36        11 Female      No      No     Asian     964             1            0          1           0           0          1                   0                          0               1
5  55.882    357     2  68        16   Male      No     Yes Caucasian     331             0            1          1           0           1          0                   1                          0               0
6  80.180    569     4  77        10   Male      No      No Caucasian    1151             0            1          1           0           0          1                   1                          0               0

Given the binarized variables, the original categorical predictors are no longer needed as they duplicate the information contained in the dummy variables. They can be dropped by setting them to NULL.

# CHUNK 14 (Cont.) 
Credit.bin$Gender <- NULL 
Credit.bin$Student <- NULL 
Credit.bin$Married <- NULL 
Credit.bin$Ethnicity <- NULL 
head(Credit.bin) 
   Income Rating Cards Age Education Balance Gender.Female Gender..Male Student.No Student.Yes Married.Yes Married.No Ethnicity.Caucasian Ethnicity.African.American Ethnicity.Asian
1  14.891    283     2  34        11     333             0            1          1           0           1          0                   1                          0               0
2 106.025    483     3  82        15     903             1            0          0           1           1          0                   0                          0               1
3 104.593    514     4  71        11     580             0            1          1           0           0          1                   0                          0               1
4 148.924    681     3  36        11     964             1            0          1           0           0          1                   0                          0               1
5  55.882    357     2  68        16     331             0            1          1           0           1          0                   1                          0               0
6  80.180    569     4  77        10    1151             0            1          1           0           0          1                   1                          0               0

Now that the expanded dataset with binarized variables is available, we again split it into the training and test sets, using the same partition vector in CHUNK 9. There is no need to call the createDataPartition() function again as partition already contains the set of observations (rows) that should be sent to the training and test sets, and the same training/test set split should be used to evaluate all of our candidate models.

# CHUNK 14 (Cont.) 
data.train.bin <- Credit.bin[partition, ] 
data.test.bin <- Credit.bin[-partition, ] 

 

A code chunk for performing binarization of factor variables will be provided to help you. You merely have to change the placeholders to names of the variables in the dataset you are working on. Here is how it looks.

library(caret) 

# insert the names of the variables to be binarized 
factor_names <- c("<var1>", "<var2>", ... ) 
factor_vars <- <data>[, factor_names] 

# dummyVars is not compatible with factors 
for (var in factor_names) { 
    factor_vars[, var] <- as.character(factor_vars[, var]) 
} 

binarizer <- caret::dummyVars(paste("~", paste(factor_names, collapse = "+" )), 
                              data = factor_vars, fullRank = TRUE) 
binarized_vars <- data.frame(predict(binarizer, newdata = factor_vars)) 
head(binarized_vars) 

Notice two features of the SOA’s code not present in our illustrative code given in CHUNK 13.

  • The first part of the SOA’s code converts the factor variables to character variables. The PA modules and exams say that this conversion is done because “dummyVars is not compatible with factors,” which does not seem true. Without this conversion, the binarization continues to work well, as we saw in CHUNK 13 above, and the linear models can still be fit properly, as we will see soon.
  • The command caret::dummyVars explicitly specifies that the dummyVars() function from the caret package is used.

It is very likely that future exams that require the use of binarization will also provide all of the necessary code for you. (If not, type ?dummyVars to seek help. Remember to load the caret package first).

 

Step 1 Revisited: Refitting the Linear Model with Binarized Factors

Now that the training and test sets have been binarized, let’s end this task by refitting the linear model on the binarized training set, where all variables are numeric, and print a summary of the model to make sure everything looks right.

# CHUNK 15 
# The interaction term is now Income:Student.Yes, not Income:Student 
model.full.bin <- lm(Balance ~ . + Income:Student.Yes, data = data.train.bin) 
summary(model.full.bin)
Call:
lm(formula = Balance ~ . + Income:Student.Yes, data = data.train.bin)

Residuals:
    Min      1Q  Median      3Q     Max 
-189.17  -76.46  -15.65   66.69  296.17 

Coefficients: (4 not defined because of singularities)
                             Estimate Std. Error t value Pr(>|t|)    
(Intercept)                -150.13340   48.40440  -3.102  0.00211 ** 
Income                       -7.93864    0.29465 -26.943  < 2e-16 ***
Rating                        4.01246    0.06416  62.540  < 2e-16 ***
Cards                         2.89772    4.69168   0.618  0.53731    
Age                          -0.78778    0.35994  -2.189  0.02942 *  
Education                     0.88922    1.92035   0.463  0.64368    
Gender.Female               -18.37030   12.00867  -1.530  0.12717    
Gender..Male                       NA         NA      NA       NA    
Student.No                 -395.67744   30.72302 -12.879  < 2e-16 ***
Student.Yes                        NA         NA      NA       NA    
Married.Yes                 -20.46469   12.39735  -1.651  0.09988 .  
Married.No                         NA         NA      NA       NA    
Ethnicity.Caucasian           5.42117   14.68523   0.369  0.71228    
Ethnicity.African.American   -6.53269   17.13794  -0.381  0.70335    
Ethnicity.Asian                    NA         NA      NA       NA    
Income:Student.Yes            0.50102    0.50028   1.001  0.31743    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 103.1 on 289 degrees of freedom
Multiple R-squared:  0.952,	Adjusted R-squared:  0.9502 
F-statistic: 521.4 on 11 and 289 DF,  p-value: < 2.2e-16

The fit is identical to that in CHUNK 12 (the same residual summary, R-squared, and F-statistic), showing that binarization has no effect on the quality of the fitted model; only which dummy of each redundant pair is retained differs, with the extra dummies reported as NA. However, the two dummy variables for Ethnicity will be considered separately when we do stepwise selection in the next task.

 

TASK 7: Select features using stepwise selection

Some of the features in the full model may lack predictive power and result in overfitting.

  • Perform stepwise selection using the stepAIC() function to determine which features should be retained.
    • When using this function, there are two decisions to make:
      • Forward (direction = “forward”) vs. backward (direction = “backward”)
      • AIC (k = 2) vs. BIC (k = log(nrow(data.train))).

Describe each decision and the implications in choosing one option versus the other.

    • Employ forward selection using BIC.
  • Run the summary function on the linear model selected by this process. Provide the summary output and list the variables selected.

 

Step 2: Removing Non-Predictive Features

As CHUNK 15 shows, only a few features are statistically significant. While including all of the features improves the fit to the training data, doing so may make the model unnecessarily complex and result in overfitting (i.e., poor fit to unseen test data). A sensible way to improve the model's predictive power (and interpretability) is to drop statistically insignificant features and retain genuinely predictive ones. This is what this task is about. In R, dropping insignificant features sequentially can be achieved by several functions, the most important of which in Exam PA is the stepAIC() function.

 

Feature Selection by the stepAIC() Function

Available from the MASS package, the stepAIC() function automates stepwise model selection, in each step adding (forward selection) or dropping (backward selection) the feature that produces the greatest improvement in the model according to a certain criterion (the AIC by default). The process terminates when no addition or deletion can improve the model further.

Although not much about the stepAIC() function is presented in the PA e-learning modules, it is used in many of the PA exam projects (almost all PA projects involve feature selection in some way!). The December 2020, June 2020, and June 2019 PA exams even test some of the options of this function, so it pays to look at the two important options below, which control the selection process and the selection criterion:

  • direction = "backward" (default) or direction = "forward":
    With the stepAIC() function, we can either implement:

    • Backward selection, where we start with the full model (with interaction and polynomial terms inserted if desired) and in each step drop the feature that leads to the greatest decrease in the model AIC (or BIC).
    • Forward selection, where we start with a primitive model (usually the model with only the intercept) and in each step add the feature that leads to the greatest decrease in the model AIC (or BIC).
  • k = 2 (default) or k = log(nrow(data.train)):
    Despite the nomenclature of the stepAIC() function, it can perform stepwise selection using not only the AIC, but also the BIC as the selection criterion.
    To change the selection criterion from the AIC to the BIC, adjust the k argument, which is the multiple of the number of degrees of freedom that serves as the penalty term. Since:

    • \(\text{AIC}=-2l+\boxed{2}p\) and
    • \(\text{BIC}=-2l+\boxed{\ln(n_{tr})}p\),

The AIC and BIC are captured by setting k = 2 (default) and k = log(nrow(data.train)), respectively.
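To preview how these two options combine, here is a sketch of a forward selection run using the BIC (the objects data.train.bin and model.full.bin come from CHUNKs 14 and 15; the names model.null and model.forward.BIC are illustrative):

library(MASS)
# Start from the intercept-only model; scope bounds the search, and
# formula(terms(...)) expands the "." so every candidate feature is listed explicitly
model.null <- lm(Balance ~ 1, data = data.train.bin)
model.forward.BIC <- stepAIC(model.null,
                             direction = "forward",
                             scope = list(lower = formula(model.null),
                                          upper = formula(terms(model.full.bin))),
                             k = log(nrow(data.train.bin)))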

 

AIC vs. BIC, backward vs. forward: Which ones to use?

When performing stepwise feature selection, which selection criterion and selection process to use is an important modeling decision. Naturally, none of the selection criteria and selection processes are universally superior and an argument in favor of a particular criterion or process can be put forward based on the context of the given business problem.

AIC vs. BIC

Recall that the AIC and BIC differ only in terms of the size of the penalty that applies to the number of parameters.

      • For the AIC, the penalty per parameter is 2;
      • For the BIC, the penalty is the natural logarithm of the size of the training set, or \(\ln(n_{tr}) = \ln 301 = 5.7071\), which is greater.

For a more complex model to beat a less complex model in terms of the BIC, the goodness of fit of the former has to far surpass that of the latter to make up for the increase in the penalty term. Thus, the BIC tends to favor a model with fewer features and represents a more conservative approach to feature selection.

Since our goal in this case study is to identify key factors that relate to the target variable Balance, it makes sense to take a conservative approach and work with as few features as necessary in the final model using the BIC as the selection criterion.

You can also make an argument (though not as convincing) in support of the AIC, saying that due to its propensity for retaining more features than other selection criteria, the client you work with will be given more factors to consider and the risk of missing important features will be smaller.

Backward vs. Forward Selections

The decision between forward and backward selection can be justified with similar reasoning. With the selection criterion held fixed, forward selection is more likely to produce a final model with fewer features (because the model we start with is the null model, which has no features) and better aligns with our goal of identifying key factors affecting Balance.

Based on the above considerations, it makes sense to employ forward selection with the BIC in this case study, and this is what we are asked to perform according to the task statement. For comparison, we will also conduct backward selection with the AIC. If this were a real exam, you would only need to do forward selection with the BIC.

 

Backward selection with the AIC

To begin with, let’s run CHUNK 16 to perform stepwise backward selection using the AIC, which is easier coding-wise. As backward selection and AIC are the defaults of the stepAIC() function, it is enough to apply the function directly to the binarized full model constructed in CHUNK 15 without specifying additional options. We then save the final model as an lm object named model.backward.AIC for future manipulation, and print a summary of the final model.

# CHUNK 16 
# install.packages("MASS") # uncomment this line the first time you use MASS 
library(MASS) 
model.backward.AIC <- stepAIC(model.full.bin)
Start:  AIC=2802.25
Balance ~ Income + Rating + Cards + Age + Education + Gender.Female + 
    Gender..Male + Student.No + Student.Yes + Married.Yes + Married.No + 
    Ethnicity.Caucasian + Ethnicity.African.American + Ethnicity.Asian + 
    Income:Student.Yes


Step:  AIC=2802.25
Balance ~ Income + Rating + Cards + Age + Education + Gender.Female + 
    Gender..Male + Student.No + Student.Yes + Married.Yes + Married.No + 
    Ethnicity.Caucasian + Ethnicity.African.American + Income:Student.Yes


Step:  AIC=2802.25
Balance ~ Income + Rating + Cards + Age + Education + Gender.Female + 
    Gender..Male + Student.No + Student.Yes + Married.Yes + Ethnicity.Caucasian + 
    Ethnicity.African.American + Income:Student.Yes


Step:  AIC=2802.25
Balance ~ Income + Rating + Cards + Age + Education + Gender.Female + 
    Gender..Male + Student.Yes + Married.Yes + Ethnicity.Caucasian + 
    Ethnicity.African.American + Income:Student.Yes


Step:  AIC=2802.25
Balance ~ Income + Rating + Cards + Age + Education + Gender.Female + 
    Student.Yes + Married.Yes + Ethnicity.Caucasian + Ethnicity.African.American + 
    Income:Student.Yes

                             Df Sum of Sq      RSS    AIC
- Ethnicity.Caucasian         1      1448  3071419 2800.4
- Ethnicity.African.American  1      1543  3071515 2800.4
- Education                   1      2278  3072249 2800.5
- Cards                       1      4052  3074024 2800.6
- Income:Student.Yes          1     10654  3080626 2801.3
<none>                                     3069972 2802.3
- Gender.Female               1     24859  3094830 2802.7
- Married.Yes                 1     28946  3098918 2803.1
- Age                         1     50885  3120857 2805.2
- Rating                      1  41548705 44618677 3605.9

Step:  AIC=2800.39
Balance ~ Income + Rating + Cards + Age + Education + Gender.Female + 
    Student.Yes + Married.Yes + Ethnicity.African.American + 
    Income:Student.Yes

                             Df Sum of Sq      RSS    AIC
- Education                   1      2255  3073675 2798.6
- Cards                       1      3911  3075331 2798.8
- Ethnicity.African.American  1      5595  3077014 2798.9
- Income:Student.Yes          1     10176  3081595 2799.4
<none>                                     3071419 2800.4
- Gender.Female               1     25165  3096584 2800.8
- Married.Yes                 1     29351  3100771 2801.3
- Age                         1     51462  3122881 2803.4
- Rating                      1  41550276 44621695 3603.9

Step:  AIC=2798.61
Balance ~ Income + Rating + Cards + Age + Gender.Female + Student.Yes + 
    Married.Yes + Ethnicity.African.American + Income:Student.Yes

                             Df Sum of Sq      RSS    AIC
- Cards                       1      3865  3077540 2797.0
- Ethnicity.African.American  1      5586  3079261 2797.2
- Income:Student.Yes          1     10214  3083888 2797.6
<none>                                     3073675 2798.6
- Gender.Female               1     25155  3098830 2799.1
- Married.Yes                 1     28499  3102174 2799.4
- Age                         1     51652  3125327 2801.6
- Rating                      1  41550265 44623939 3601.9

Step:  AIC=2796.99
Balance ~ Income + Rating + Age + Gender.Female + Student.Yes + 
    Married.Yes + Ethnicity.African.American + Income:Student.Yes

                             Df Sum of Sq      RSS    AIC
- Ethnicity.African.American  1      5579  3083119 2795.5
- Income:Student.Yes          1     10767  3088307 2796.0
<none>                                     3077540 2797.0
- Gender.Female               1     23659  3101198 2797.3
- Married.Yes                 1     27917  3105457 2797.7
- Age                         1     49660  3127200 2799.8
- Rating                      1  42161655 45239195 3604.0

Step:  AIC=2795.54
Balance ~ Income + Rating + Age + Gender.Female + Student.Yes + 
    Married.Yes + Income:Student.Yes

                     Df Sum of Sq      RSS    AIC
- Income:Student.Yes  1      9873  3092992 2794.5
<none>                             3083119 2795.5
- Gender.Female       1     23435  3106554 2795.8
- Married.Yes         1     24716  3107835 2795.9
- Age                 1     50672  3133791 2798.4
- Rating              1  42165674 45248793 3602.1

Step:  AIC=2794.5
Balance ~ Income + Rating + Age + Gender.Female + Student.Yes + 
    Married.Yes

                Df Sum of Sq      RSS    AIC
<none>                        3092992 2794.5
- Gender.Female  1     22314  3115306 2794.7
- Married.Yes    1     25027  3118019 2794.9
- Age            1     49133  3142125 2797.2
- Student.Yes    1   4931600  8024592 3079.5
- Income         1   8357794 11450786 3186.5
- Rating         1  42247226 45340218 3600.7
summary(model.backward.AIC)
Call:
lm(formula = Balance ~ Income + Rating + Age + Gender.Female + 
    Student.Yes + Married.Yes, data = data.train.bin)

Residuals:
    Min      1Q  Median      3Q     Max 
-195.16  -76.45  -14.03   65.33  285.30 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)   -529.16392   26.85954 -19.701   <2e-16 ***
Income          -7.87666    0.27946 -28.186   <2e-16 ***
Rating           4.01180    0.06331  63.370   <2e-16 ***
Age             -0.77016    0.35638  -2.161   0.0315 *  
Gender.Female  -17.32103   11.89326  -1.456   0.1464    
Student.Yes    418.87031   19.34645  21.651   <2e-16 ***
Married.Yes    -18.73773   12.14860  -1.542   0.1241    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 102.6 on 294 degrees of freedom
Multiple R-squared:  0.9517,	Adjusted R-squared:  0.9507 
F-statistic: 964.8 on 6 and 294 DF,  p-value: < 2.2e-16

When executed, stepAIC() prints the results of every iteration, showing how the model AIC (displayed in the AIC column, sorted in ascending order) would change if each remaining feature were dropped; at each step, the feature whose removal produces the lowest AIC is deleted.

For instance, in the first iteration, dropping Ethnicity.Asian (“- Ethnicity.Asian” in the first row of the table means deleting Ethnicity.Asian) leads to the lowest model AIC (= 2800.4), down from the current AIC of 2802.25 (the line corresponding to <none> is the current model, in which none of the variables is dropped). In the second iteration, Education is dropped next, resulting in the lowest AIC.

Overall, the features are dropped in the following order: Ethnicity.Asian, Education, Cards, Ethnicity.African.American, and Income:Student.Yes, after which the process terminates.

The following six features remain in the final model: Income, Rating, Age, Gender.Female, Student.Yes, and Married.Yes.

All of the coefficient estimates in the final model, except those of Gender.Female and Married.Yes, are statistically significant, and the retention of Income, Rating, and Student.Yes is in agreement with what we found in Tasks 2 and 3.

Note

    • Candidates should “run the final model” after performing stepwise model selection and “check that everything looks right.” You can make simple comments such as how many or which features are in the final model, whether their retention matches what you observed in the data exploration part (not always the case), and whether the coefficient estimates are statistically significant (not always the case, especially with the AIC).

 

Exercise: Stepwise selection for non-binarized data

In CHUNK 16 above, we performed backward stepwise selection for the linear model fitted to the binarized training set.

Adapt the code to perform backward stepwise selection for the linear model fitted to the non-binarized training set.

Pay attention to the Ethnicity variable.

Solution: Stepwise selection for non-binarized data

The linear model fitted to the non-binarized training set was constructed in CHUNK 10 and named model.full. We now apply the stepAIC() function to this lm object and see what happens in the first iteration.

# CHUNK 17 
model.no.binarize <- stepAIC(model.full, steps = 1)
Start:  AIC=2802.25
Balance ~ Income + Rating + Cards + Age + Education + Gender + 
    Student + Married + Ethnicity + Income:Student

                 Df Sum of Sq      RSS    AIC
- Ethnicity       2      7042  3077014 2798.9
- Education       1      2278  3072249 2800.5
- Cards           1      4052  3074024 2800.6
- Income:Student  1     10654  3080626 2801.3
<none>                         3069972 2802.3
- Gender          1     24859  3094830 2802.7
- Married         1     28946  3098918 2803.1
- Age             1     50885  3120857 2805.2
- Rating          1  41548705 44618677 3605.9

Step:  AIC=2798.94
Balance ~ Income + Rating + Cards + Age + Education + Gender + 
    Student + Married + Income:Student

Without binarization, the entire Ethnicity variable, which carries two degrees of freedom (see the Df column), one for each of the two non-baseline levels, is treated as a single predictor and is considered for removal as a whole.

Although this makes no material difference in this case study, in other cases it may be undesirable: we ignore the possibility that one of the two non-baseline levels of Ethnicity is insignificant while the other is significant (relative to the baseline).
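If you want the full backward-selection run rather than just the first iteration, simply omit the steps argument (a one-line variant of CHUNK 17, not shown in the original; the object name model.no.binarize.full is ours):

# Illustration only: run backward selection on the non-binarized model to completion
model.no.binarize.full <- stepAIC(model.full)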

 

Forward Selection with the BIC

Performing forward selection requires more coding effort. Besides the additional option direction = "forward" to indicate that we are doing forward selection, we have to specify, in the scope argument of the stepAIC() function, a two-component list defining the range of models to be examined in the stepwise selection process.

The most sophisticated model, often the full model with all of the features, is given in the upper component while the most primitive model, usually the null model, is given in the lower component.

To override the AIC default and use the BIC as the selection criterion, we add the option k = log(nrow(data.train.bin)) in the stepAIC() function.

Now run CHUNK 18 to select the final model based on forward selection and the BIC, save it as an lm object named model.forward.BIC, and print a summary.

# CHUNK 18 
# first fit the null model (i.e., model with no predictors) 
model.null <- lm(Balance ~ 1, data = data.train.bin)

model.forward.BIC <- stepAIC(model.null, 
                     direction = "forward", 
                     scope = list(upper = model.full.bin, 
                     lower = model.null), 
                     k = log(nrow(data.train.bin))) 
Start:  AIC=3698.12
Balance ~ 1

                             Df Sum of Sq      RSS    AIC
+ Rating                      1  47082983 16908423 3303.2
+ Income                      1  12910007 51081399 3636.0
+ Student.No                  1   5425335 58566071 3677.2
+ Student.Yes                 1   5425335 58566071 3677.2
<none>                                    63991406 3698.1
+ Cards                       1    163291 63828115 3703.1
+ Gender.Female               1    127201 63864205 3703.2
+ Gender..Male                1    127201 63864205 3703.2
+ Age                         1     79880 63911526 3703.5
+ Education                   1     13121 63978284 3703.8
+ Married.Yes                 1     10920 63980486 3703.8
+ Married.No                  1     10920 63980486 3703.8
+ Ethnicity.Asian             1      3049 63988357 3703.8
+ Ethnicity.Caucasian         1      2324 63989082 3703.8
+ Ethnicity.African.American  1         0 63991406 3703.8

Step:  AIC=3303.21
Balance ~ Rating

                             Df Sum of Sq      RSS    AIC
+ Income                      1   8650956  8257467 3093.2
+ Student.No                  1   4817765 12090659 3208.0
+ Student.Yes                 1   4817765 12090659 3208.0
+ Age                         1    707560 16200863 3296.1
<none>                                    16908423 3303.2
+ Married.Yes                 1    174547 16733876 3305.8
+ Married.No                  1    174547 16733876 3305.8
+ Education                   1     71402 16837022 3307.6
+ Ethnicity.Asian             1     49620 16858804 3308.0
+ Gender.Female               1     31887 16876536 3308.4
+ Gender..Male                1     31887 16876536 3308.4
+ Cards                       1     30049 16878374 3308.4
+ Ethnicity.Caucasian         1     18541 16889883 3308.6
+ Ethnicity.African.American  1      4304 16904120 3308.8

Step:  AIC=3093.2
Balance ~ Rating + Income

                             Df Sum of Sq     RSS    AIC
+ Student.No                  1   5069114 3188353 2812.5
+ Student.Yes                 1   5069114 3188353 2812.5
<none>                                    8257467 3093.2
+ Married.Yes                 1    127904 8129563 3094.2
+ Married.No                  1    127904 8129563 3094.2
+ Age                         1     95208 8162259 3095.4
+ Education                   1     44496 8212971 3097.3
+ Ethnicity.Asian             1     40115 8217352 3097.4
+ Cards                       1     22019 8235448 3098.1
+ Ethnicity.African.American  1      9464 8248003 3098.6
+ Ethnicity.Caucasian         1      8003 8249464 3098.6
+ Gender.Female               1       244 8257223 3098.9
+ Gender..Male                1       244 8257223 3098.9

Step:  AIC=2812.47
Balance ~ Rating + Income + Student.No

                             Df Sum of Sq     RSS    AIC
<none>                                    3188353 2812.5
+ Age                         1     46257 3142096 2813.8
+ Gender.Female               1     24128 3164225 2815.9
+ Gender..Male                1     24128 3164225 2815.9
+ Married.Yes                 1     23759 3164595 2815.9
+ Married.No                  1     23759 3164595 2815.9
+ Ethnicity.Caucasian         1      4275 3184078 2817.8
+ Ethnicity.African.American  1      2346 3186007 2817.9
+ Education                   1      1600 3186753 2818.0
+ Cards                       1       865 3187488 2818.1
+ Ethnicity.Asian             1       747 3187606 2818.1
summary(model.forward.BIC)
Call:
lm(formula = Balance ~ Rating + Income + Student.No, data = data.train.bin)

Residuals:
     Min       1Q   Median       3Q      Max 
-220.292  -78.120   -7.473   63.273  299.895 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -167.99297   23.70608  -7.086    1e-11 ***
Rating         4.01634    0.06353  63.223   <2e-16 ***
Income        -7.97413    0.27691 -28.797   <2e-16 ***
Student.No  -421.17339   19.38206 -21.730   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 103.6 on 297 degrees of freedom
Multiple R-squared:  0.9502,	Adjusted R-squared:  0.9497 
F-statistic:  1888 on 3 and 297 DF,  p-value: < 2.2e-16

The output parallels that in CHUNK 16 and shows how the model BIC changes as features are added one at a time; at each step, the feature shown in the first row, whose addition yields the smallest BIC, is added. In the first iteration, the addition of Rating (“+ Rating” in the first row of the table means adding Rating) leads to the smallest BIC (3303.2, compared with 3698.12). Notice that the AIC column in fact carries the BIC values of the different linear models because we have overridden the AIC default.

Features are added in the following order: Rating, Income, Student.No.

Not surprisingly, the resulting model is simpler, with only these three features, compared with six features for the model returned by backward selection with the AIC. All three features are statistically significant.

 

Model Evaluation: Performance Metrics

TASK 8: Select and validate the recommended model

  • Evaluate the RMSE of the models in the preceding subsection against the test set.
  • Make a recommendation as to which model should be used.
  • Generate and interpret diagnostic plots for the recommended model to check the model assumptions.

 

Select a model: Which model is most predictive?

Now that we have different linear models of varying degrees of complexity fitted to the training set, it is time to compare their predictive performance on the common test set and see which one is the most predictive. We are particularly interested in whether reducing the complexity of a linear model by stepwise selection algorithms can enhance prediction accuracy. A natural measure of the performance of linear models, where the target variable is numeric, is the test RMSE. The smaller the test RMSE, the better. CHUNK 19 computes and prints the test RMSE of the four models we have constructed.
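For a numeric target, the test RMSE is simply the square root of the average squared prediction error over the test set, \(\sqrt{\frac{1}{n_{\text{test}}}\sum_{i=1}^{n_{\text{test}}}(y_i - \hat{y}_i)^2}\). The short sketch below (the function name rmse.manual is ours, introduced for illustration only) reproduces what caret's RMSE() function computes:

# Illustration only: RMSE computed from first principles
rmse.manual <- function(observed, predicted) {
  sqrt(mean((observed - predicted)^2))  # root mean squared error
}
# Should agree with caret's RMSE(), e.g.:
# rmse.manual(data.test$Balance, predict(model.full.bin, newdata = data.test.bin))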

# CHUNK 19 
RMSE(data.test$Balance, predict(model.null, newdata = data.test.bin)) 
RMSE(data.test$Balance, predict(model.full.bin, newdata = data.test.bin)) 
RMSE(data.test$Balance, predict(model.backward.AIC, newdata = data.test.bin)) 
RMSE(data.test$Balance, predict(model.forward.BIC, newdata = data.test.bin)) 
> RMSE(data.test$Balance, predict(model.null, newdata = data.test.bin)) 
[1] 453.3665

> RMSE(data.test$Balance, predict(model.full.bin, newdata = data.test.bin)) 
[1] 105.0656

> RMSE(data.test$Balance, predict(model.backward.AIC, newdata = data.test.bin)) 
[1] 104.2375

> RMSE(data.test$Balance, predict(model.forward.BIC, newdata = data.test.bin)) 
[1] 102.6785

For your convenience, the following table reports the test RMSE extracted from the R output for each of the four models.

Model Test RMSE
Null model (benchmark) 453.3665
Full model 105.0656
Final model based on backward selection and AIC 104.2375
Final model based on forward selection and BIC 102.6785 (smallest)

Observe that the two reduced models resulting from stepwise selection have a (slightly) lower test RMSE than the full model despite having far fewer features. Comparing the two reduced models, we see that the model based on forward selection and the BIC is not only simpler, in the sense that it is nested within the model based on backward selection and the AIC, but also more predictive, in the sense that its test RMSE is lower.

On the basis of both prediction accuracy and interpretability, the final model based on forward selection and BIC is justifiably the most promising model among all of the linear models we have considered and we will recommend using this model.

 

Stage 5: Model Validation

Model Diagnostics

Now that the model based on forward selection and BIC appears to be the best among the four linear models, let’s diagnose it carefully for any anomalies with the use of appropriate graphical displays in CHUNK 20.

When the plot() function (this is one of the rare instances in Exam PA in which we make use of R’s base graphics functions) is applied to an lm object (or glm object) such as model.forward.BIC, four diagnostic plots are generated, the first two of which are most easily interpreted.

 

Diagnostic plots for the final model based on forward selection and the BIC fitted to the training set of the Credit data.

# CHUNK 20 
plot(model.forward.BIC) 

 

“Residuals vs Fitted” Plot

The “Residuals vs Fitted” plot graphs the residuals of the model against the fitted values on the training set. If the linear model has sufficiently captured the true relationship between the target variable and the predictors, then there should be no discernible patterns in this plot, where the residuals are dispersed in an erratic manner.

The plot also serves as an informal test of the homogeneity of the error variance. If the spread of the residuals increases with the magnitude of the fitted values, exhibiting a funnel shape, then the homoscedasticity assumption is called into question.

Unfortunately, while the homoscedasticity assumption does not seem to be violated, there is a clear U-shape in the “Residuals vs Fitted” plot, suggesting that there are non-linear patterns, beyond interaction effects, that the fitted model fails to accommodate. One may try to refine the model by incorporating polynomial terms such as \(\text{Rating}^2\) (although polynomial relationships do not seem very pronounced) and interaction terms such as Income × Rating. (Exercise: Try this and see how much the prediction accuracy improves. The improvement can be substantial!)
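As a rough illustration of this refinement (this chunk is not part of the original case study; the exact terms added are one plausible choice), we could augment the recommended model with a squared Rating term and an Income × Rating interaction and recompute the test RMSE:

# Illustration only: one possible non-linear refinement of the recommended model
model.refined <- lm(Balance ~ Rating + I(Rating^2) + Income + Income:Rating + Student.No,
                    data = data.train.bin)
RMSE(data.test$Balance, predict(model.refined, newdata = data.test.bin))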

 

“Normal Q-Q” Plot

The “Normal Q-Q” plot graphs the quantiles of the standardized residuals against the theoretical standard normal quantiles and can be used to check the normality assumption. If the points lie close to the 45° line through the origin, then the normality assumption is supported. In CHUNK 20, the Q-Q plot shows that the normal distribution assumption holds for most observations, but not in the two extreme tails. This suggests using a response distribution with fatter tails than the normal distribution, the topic of Chapter 4.

 

Stage 6: Model Maintenance

 

Stage 4: Model Construction and Evaluation (Using Regularization)

TASK 9: Effects of regularization

Elastic net could also be used to perform feature selection.

  • Explain whether using alpha = 0 is appropriate for this case study.
  • Regardless of your conclusion, use elastic net with alpha equal to 0.5 to fit a regularized regression model with the following values of lambda:
    0, 10, 100, 500, 1000

State your observations. Be sure to use the same variables (after releveling) as the full model in Task 5.

 

Model Construction

An alternative to using stepwise selection for identifying useful features is regularization, which reduces the magnitude of the coefficient estimates via the use of a penalty term and serves to prevent overfitting. Variables with limited predictive power will receive a coefficient estimate that is small, if not exactly zero, and therefore are removed from the model.

In this subsection, we will fit a regularized regression model with alpha equal to 0.5 (i.e., an even mix between ridge regression and lasso) to the Credit dataset and compare its performance to that of the models based on the stepwise selection.

Because our goal in this case study is to identify key factors affecting Balance, using alpha equal to 0, which is ridge regression and does not eliminate any variables, is not appropriate.
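For orientation, recall the form of the elastic net objective (a standard formulation, stated here for reference rather than taken from the original chunks): the coefficient estimates minimize

\[\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 + \lambda\sum_{j=1}^{p}\left[\frac{1-\alpha}{2}\,\beta_j^2 + \alpha\,\lvert\beta_j\rvert\right],\]

so that α = 0 gives ridge regression (coefficients are shrunk but never set exactly to zero) and α = 1 gives the lasso (coefficients can be shrunk exactly to zero).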

 

Fitting Regularized Regression Models: The glmnet() function.

To implement regularization, we will use the glmnet package (available on Exam PA), whose main function is the glmnet() function. This function can be used to fit ridge regression models, lasso models, and elastic nets, but its syntax differs from that of the lm() function we have been using.

Instead of directly inputting a formula of the form Y ~ X1 + ... + Xp, we have to pass in a data matrix x (design matrix) carrying the values of the features and a response vector y. The design matrix can be created using the model.matrix() function from a given formula and a data frame.

Run CHUNK 21 to set up the design matrix for the training part of the Credit dataset and look at the first six rows. The formula should involve all of the potentially useful features, including the interaction between Income and Student, not just the predictive features identified in Task 7.

(Remember, an important aim of regularization is to reduce model complexity and, in some cases, select features. Performing regularization only on predictive features runs counter to this aim!)

# CHUNK 21 
X.train <- model.matrix(Balance ~ . + Income:Student, data = data.train) 
head(X.train) # print out a few rows of the design matrix
  (Intercept)  Income Rating Cards Age Education Gender Male StudentYes MarriedNo EthnicityAfrican American EthnicityAsian Income:StudentYes
1           1  14.891    283     2  34        11           1          0         0                         0              0             0.000
2           1 106.025    483     3  82        15           0          1         0                         0              1           106.025
3           1 104.593    514     4  71        11           1          0         1                         0              1             0.000
6           1  80.180    569     4  77        10           1          0         1                         0              0             0.000
8           1  71.408    512     2  87         9           1          0         1                         0              1             0.000
9           1  15.125    266     5  66        13           0          0         1                         0              0             0.000

A particular advantage of the model.matrix() function is that categorical predictors are automatically transformed into dummy variables, just as with the dummyVars() function, and the resulting design matrix is of full rank. This up-front conversion is important because, unlike the lm() function, the glmnet() function works only on quantitative inputs, not character or factor variables. In other words, explicit binarization of categorical variables is required for the glmnet() function to work.

As an example, you can see that Ethnicity has been automatically converted into two dummy variables corresponding to “African American” or “Asian” (remember that “Caucasian” is the baseline level of Ethnicity after releveling in Task 5).

As soon as a design matrix has been set up, it can be passed to the glmnet() function along with the vector of response values and the values of the hyperparameters α and λ. Run CHUNK 22 to fit five regularized regression models with α = 0.5 and λ = 0, 10, 100, 500, 1000 (one model for each value of λ) to the training set, and save the five fitted models collectively as an object named m.lambda.

Note that the family argument is set to gaussian because we are dealing with linear models with normal errors.

# CHUNK 22 
#install.packages("glmnet") # uncomment this Line the first time you use gLmnet 
library(glmnet) 
m.lambda <- glmnet(x = X.train, 
                   y = data.train$Balance, 
                   family = "gaussian", 
                   lambda = c(0, 10, 100, 500, 1000), 
                   alpha = 0.5)

To appreciate the effect of regularization on the fitted model, it is informative to extract the coefficient estimates and see how they vary with the value of λ. This can be done in two ways:

  • (Suggested in the PA e-learning modules) A glmnet object is a list, two components of which are:
    • a0: This is a vector of length length(lambda) hosting the intercept estimates, one for each value of λ.
    • beta: This is a matrix of coefficient estimates for the predictors, excluding the intercept, arranged by columns according to different values of λ.

Then we use two separate commands to access the estimated intercept and slope coefficients.

# CHUNK 22 (Cont.) 
# first way to extract coefficient estimates 
m.lambda$a0
      s0        s1        s2        s3        s4 
520.4352  301.2973 -224.9669 -506.0179 -579.1925
m.lambda$beta
12 x 5 sparse Matrix of class "dgCMatrix"
                          s0        s1         s2           s3          s4
(Intercept)                . .           .          .            .        
Income                     . .           .         -6.63360837  -7.9320588
Rating                     . 0.6182676   2.034434   3.71042914   4.0114839
Cards                      . .           .          0.51433943   2.9161946
Age                        . .           .         -0.70550849  -0.7885101
Education                  . .           .          .            0.8874960
Gender Male                . .           .          3.08510246  18.3671409
StudentYes                 . .         228.762987 396.62014836 396.2181065
MarriedNo                  . .           .          9.36968005  20.4646766
EthnicityAfrican American  . .           .          .          -11.9281110
EthnicityAsian             . .           .          .           -5.3983474
Income:StudentYes          . .           .          0.01473463   0.4929168

 

  • (Easier and suggested in ISLR) Applying the coef() function to a glmnet object returns a matrix of coefficient estimates, including the intercept, arranged by columns according to different values of λ.
# CHUNK 22 (Cont.) 
# second way 
round(coef(m.lambda), digits = 4)
13 x 5 sparse Matrix of class "dgCMatrix"
                                s0       s1        s2        s3        s4
(Intercept)               520.4352 301.2973 -224.9669 -506.0179 -579.1925
(Intercept)                 .        .         .         .         .     
Income                      .        .         .        -6.6336   -7.9321
Rating                      .        0.6183    2.0344    3.7104    4.0115
Cards                       .        .         .         0.5143    2.9162
Age                         .        .         .        -0.7055   -0.7885
Education                   .        .         .         .         0.8875
Gender Male                 .        .         .         3.0851   18.3671
StudentYes                  .        .       228.7630  396.6201  396.2181
MarriedNo                   .        .         .         9.3697   20.4647
EthnicityAfrican American   .        .         .         .       -11.9281
EthnicityAsian              .        .         .         .        -5.3983
Income:StudentYes           .        .         .         0.0147    0.4929

 

Regardless of which way you use to access the coefficient estimates, the values of λ are arranged in descending order across the five columns from left to right, with s0, s1, s2, s3, and s4 representing λ = 1000, 500, 100, 10, 0, respectively.

Dots in the output represent features whose coefficient estimates are exactly zero and which are thus dropped from the model. Looking at the “paths” of the coefficients as λ increases from 0, 10, 100, 500, to 1000 (right to left), we can see that the effect of regularization becomes stronger and stronger, and more and more coefficient estimates shrink to exactly zero, an illustration of the feature selection property of the elastic net.

  • When λ = 0, the coefficient estimates are (almost!) identical to the ordinary least squares estimates in CHUNK 12. This comes as no surprise to us because running regularized regression with λ = 0 is the same as doing ordinary least squares estimation.
  • When λ = 10, the coefficient estimates of Education, EthnicityAfrican American, and EthnicityAsian are shrunk exactly to zero; as we saw in CHUNK 12, these three features are highly insignificant. The coefficient estimates of the other features also become smaller in absolute value (e.g., from 18.3671 to 3.0851 for GenderMale).
  • When λ = 100, regularization becomes considerably stronger and a lot more features are eliminated. Only Rating and StudentYes are retained.
  • When λ = 500, regularization is overwhelming and only Rating remains. From CHUNK 12, Rating is the most significant feature, with the largest t-statistic.
  • When λ = 1000, the regularization penalty is so large that all of the coefficient estimates except the intercept become exactly zero, in which case the model is effectively the null model with only the intercept. The estimated intercept is \(\hat{\beta}_0 = 520.4352\), which is also the mean value of Balance in the training set, as you can check using the command mean(data.train$Balance).
# CHUNK 22 (Cont.) 
mean(data.train$Balance)
[1] 520.4352
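To see these coefficient paths graphically, a glmnet object fitted over a fine grid of λ values can be passed to the plot() function (a minimal sketch, not part of the original chunks; the object name m.path is ours):

# Illustration only: visualize the coefficient paths
m.path <- glmnet(x = X.train,
                 y = data.train$Balance,
                 family = "gaussian",
                 alpha = 0.5)  # default grid of 100 lambda values
plot(m.path, xvar = "lambda", label = TRUE)  # coefficient estimates vs. log(lambda)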

 

TASK 10: Perform feature selection with regularization

  • Use cross-validation to set the value of lambda for fitting the regularized model (alpha = 0.5) in Task 9.
    • List the features used in the resulting model. Compare the selected features to those using stepwise regression in Task 7. Comment on any differences.
    • Calculate the RMSE on the test set.
    • Recommend which model should be used. Justify your choice.
  • Suggest another method to set the value of lambda. There is no need to implement this method.

 

Hyperparameter tuning: The cv.glmnet() function.

The glmnet() function requires that value(s) of λ be supplied a priori and does not directly tell us the optimal value of λ. In fact, in Task 9 we experimented with five pre-specified values of λ, mainly to explore the impact of regularization, but did not attempt to optimize the value of λ. Fortunately, the companion function cv.glmnet() (with “cv” standing for “cross-validation”) fills this gap. This function performs k-fold cross-validation (10-fold cross-validation, by default), calculates the cross-validation error for a range of values of λ, and produces the optimal value of λ as a by-product.

The cv.glmnet() function is commonly accompanied by an application of the plot() function to a cv.glmnet object. This returns a cross-validation curve, that is, the cross-validation error plotted as a function of the values of λ used. Such a plot has several useful features:

  • The red dots represent the cross-validation errors corresponding to a range of values of λ (or log λ, for plotting purposes), and the vertical lines around the dots are the standard error bands.
  • The numbers at the top of the graph indicate how many features have non-zero coefficient estimates at a given value of λ (or logλ).
  • The first vertical dotted line from the left corresponds to the value of λ (or logλ) at which the cross-validation error is minimized. To directly extract such an optimal value of λ, we can take out the lambda.min component of a cv.glmnet object, which is a list.
  • The second vertical dotted line is not discussed in the PA e-learning modules, but was used in the June 16 and 19, 2020 PA exams. It has to do with the one-standard-error rule, which you may have come across when studying for Exam SRM.
    In the current context, the rule corresponds to the greatest value of λ, or equivalently, the simplest regularized regression model whose cross-validation error is within one standard error of the minimum error (remember, the higher the value of λ, the more regularization there is and the simpler the model becomes).
    The idea is that models within one standard error of the minimum have more or less the same predictive performance, and we may opt for the simplest model among them. The value of λ based on the one-standard-error rule is contained in the lambda.1se (where “1se” stands for “1 standard error”) component of a cv.glmnet object.

 

Exercise (An illustration of the one-standard-error rule for regularized regression)

You are fitting regularized linear models to the same dataset using a grid of five λ values. The following table shows the cross-validation error and the associated standard error for each λ value.

λ Cross-Validation Error Standard Error
0.5 10.4512 2.4931
1.0 4.1872 1.9462
1.5 2.9745 1.2476
2.0 3.1968 1.5390
2.5 4.6403 1.9473

Determine the best value of λ using the one-standard-error rule.

Solution

The smallest cross-validation error is attained when λ = 1.5. Since 2.9745 + 1.2476 = 4.2221 and the simplest model (remember, the larger the value of λ, the simpler the regularized model) whose cross-validation error is lower than 4.2221 corresponds to λ = 2.0, the best value of λ to use according to the one-standard-error rule is λ = 2.0.
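The same calculation can be scripted (a minimal sketch using vectors hard-coded from the table above; the object names are ours):

# Illustration only: the one-standard-error rule applied to the table above
lambda <- c(0.5, 1.0, 1.5, 2.0, 2.5)
cv.error <- c(10.4512, 4.1872, 2.9745, 3.1968, 4.6403)
se <- c(2.4931, 1.9462, 1.2476, 1.5390, 1.9473)
threshold <- min(cv.error) + se[which.min(cv.error)]  # 2.9745 + 1.2476 = 4.2221
max(lambda[cv.error <= threshold])  # largest qualifying lambda: 2.0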

 

Fitting the regularized model with λ = lambda.min

Now let’s run CHUNK 23 to plot the cross-validation error curve for our regularized model. Note that a random seed is set prior to the call of the cv.glmnet() function to make the choice of the cross-validation folds reproducible (this random seed need not be the same as that for the createDataPartition() function in CHUNK 9); remember that cross-validation involves randomly dividing the training data into 10 folds.

# CHUNK 23 
set.seed(1111) 
m <- cv.glmnet(x = X.train, 
               y = data.train$Balance, 
               family = "gaussian", 
               alpha = 0.5) 
plot(m) 
Figure 3.4.7: The cross-validation error curve for the regularized model fitted to the training set of the Credit data.

 

m$lambda.min
m$lambda.1se
> m$lambda.min
[1] 0.8885539

> m$lambda.1se
[1] 7.550517

The value of λ that minimizes the cross-validation error is 0.8885539. Such a small value means that hardly any regularization is done to shrink the coefficient estimates. In fact, at this value none of the eleven features is eliminated.

We now refit the regularized model on the training set with λ equal to 0.8885539, save it as an object named m.min, make predictions on the test set, and finally evaluate the model’s RMSE.

These are all done in CHUNK 24. Notice that when the predict() function is applied to a glmnet object, the newx argument should be used instead of the newdata argument, and the model matrix for the test set should be supplied. To learn more, run ?predict.glmnet.

# CHUNK 24
# Fit the regularized model using lambda.min
m.min <- glmnet(x = X.train, 
                y = data.train$Balance, 
                family = "gaussian", 
                lambda = m$lambda.min, 
                alpha = 0.5) 
m.min$beta 
12 x 1 sparse Matrix of class "dgCMatrix"
                                   s0
(Intercept)                 .        
Income                     -7.8131829
Rating                      3.9839963
Cards                       2.6915109
Age                        -0.7829239
Education                   0.7557949
Gender Male                17.0049498
StudentYes                396.3329355
MarriedNo                  19.4182020
EthnicityAfrican American  -9.9708020
EthnicityAsian             -3.6479011
Income:StudentYes           0.4458288
# Set up the design matrix for the test set
X.test <- model.matrix(Balance ~ . + Income:Student, data = data.test) 
# Make predictions on the test set and calculate the test RMSE
m.min.pred <- predict(m.min, newx = X.test) 
RMSE(data.test$Balance, m.min.pred) 
[1] 103.9201

As we saw in Figure 3.4.7, all of the eleven features are retained at the lambda.min value. The test RMSE of the resulting model is 103.9201, which is slightly higher than that of the final model in Task 7 (which is 102.6785).

In addition to inferior predictive performance, the regularized model is also much harder to interpret due to its substantially larger size, whereas the model in Task 7 only has three features. As a result, it is reasonable to recommend the model in Task 7 as the final model owing to its better prediction accuracy and interpretability.

 

Fitting the regularized model with λ = lambda.1se

Instead of using the value of λ that minimizes the cross-validation error, we can also select λ based on the one-standard-error rule and set it equal to lambda.1se, or 7.550517 (much larger than lambda.min). The task statement says that “there is no need to implement this method,” so if this were a real exam task, we could describe what the one-standard-error rule is and stop.

For the purposes of illustration, let’s also fit the resulting regularized model and see how it performs. CHUNK 25 below is a version of CHUNK 24 with m.min and lambda.min changed to m.1se and lambda.1se, respectively.

# CHUNK 25 
# Fit the regularized model using lambda.1se
m.1se <- glmnet(x = X.train, 
                y = data.train$Balance, 
                family = "gaussian", 
                lambda = m$lambda.1se, 
                alpha = 0.5) 

# List the coefficient estimates 
m.1se$beta
12 x 1 sparse Matrix of class "dgCMatrix"
                                   s0
(Intercept)                 .        
Income                     -6.9378441
Rating                      3.7813573
Cards                       1.1075391
Age                        -0.7305808
Education                   .        
Gender Male                 6.7717419
StudentYes                397.5199346
MarriedNo                  11.6768447
EthnicityAfrican American   .        
EthnicityAsian              .        
Income:StudentYes           0.1148304

 

# Make predictions on the test set and calculate the test RMSE
m.1se.pred <- predict(m.1se, newx = X.test)
RMSE(data.test$Balance, m.1se.pred)
[1] 102.8241

The fitted model has eight features, consistent with what we saw in Figure 3.4.7, and Education, EthnicityAfrican American, and EthnicityAsian are removed. The test RMSE has dropped to 102.8241, a slight improvement from that of the model based on lambda.min.

However, the model based on the one-standard-error rule still cannot compete with the model in Task 7 in terms of either prediction accuracy or interpretability. This case study therefore further reinforces the message that to construct an effective predictive model, it is necessary to choose the right level of model complexity and select the right features. Putting all variables into your model does not always help!

Note

If you are asked to tune a regularized regression model on the exam, the default is to select the value of λ that minimizes the cross-validation error (i.e., take λ = lambda.min). The one-standard-error rule is certainly valid, but it is a prudent exam tactic to write what the SOA graders expect, unless you are specifically asked for an alternative tuning strategy.

 

Remark: How to choose the optimal value of α?

As powerful as the cv.glmnet() function is, it only allows us to pick the best value of λ for a given value of α; it has no built-in mechanism for tuning α. To optimize over α, we have to resort to trial and error and go as follows:

  1. Set up a grid of values of α. In an exam setting, it is sensible to experiment with a few values of α, e.g., 0, 0.5, 1. If more values are desired, use a for loop.
  2. For each value of α, we select the corresponding optimal value of λ and evaluate the corresponding cross-validation error.
  3. Adopt the value of α that produces the smallest cross-validation error. (Note: Instead of cross-validation error, here it is fine and perhaps easier coding-wise to choose the value of α that produces the smallest training error because α is not directly related to model complexity.)

In essence, we break down the two-variable optimization problem by first minimizing over λ for a given α, then minimizing over α.
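A minimal sketch of this procedure (not part of the original chunks; the α grid and object names are our choices):

# Illustration only: tune alpha by trial and error over a small grid
alphas <- c(0, 0.5, 1)
cv.errors <- numeric(length(alphas))
for (i in seq_along(alphas)) {
  set.seed(1111)  # reuse the same folds so the comparison is fair
  cv.fit <- cv.glmnet(x = X.train,
                      y = data.train$Balance,
                      family = "gaussian",
                      alpha = alphas[i])
  cv.errors[i] <- min(cv.fit$cvm)  # CV error at lambda.min for this alpha
}
alphas[which.min(cv.errors)]  # adopt the alpha with the smallest CV error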