Learning Objectives
After completing this case study, you should be able to:
- Fit a multiple linear regression model with both numeric and categorical (factor) predictors.
- Detect and accommodate interactions between predictors, whether quantitative or qualitative.
- Perform explicit binarization of categorical predictors using the dummyVars() function from the caret package and understand why doing so may be beneficial.
- Perform stepwise selection using the stepAIC() function from the MASS package and be familiar with the different options allowed by this function.
- Generate and interpret diagnostic plots for a linear model.
- Implement regularized regression using the glmnet() and cv.glmnet() functions from the glmnet package.
Stage 1: Define the Business Problem
Objectives
Our goal here is to identify and interpret key factors that relate to a higher or lower Balance with the aid of appropriate linear models.
Stage 2: Data Collection
Data Design
Relevance
Read in data and remove irrelevant variables.
# CHUNK 1
library(ISLR)
data(Credit)
Credit$ID <- NULL
Data Description
The Credit dataset contains n = 400 observations and 11 variables. Numeric predictors are listed first, followed by categorical ones. The target variable is the last variable in the dataset, Balance, an integer-valued variable that ranges from 0 to 1,999 and represents the credit card balance (in $) for an individual. The other variables describe the financial and demographic factors of an individual and can be grouped as follows:
- The first six variables, Income, Limit, Rating, Cards, Age, and Education, are numeric predictors for Balance.
- The other four variables, Gender, Student, Married, and Ethnicity, are categorical predictors. The first three variables are binary and Ethnicity has three levels.
TASK 1: Consider a data issue
- Identify a variable that may have potential ethical concerns.
- Discuss the benefits and concerns related to using this variable in a model for this business problem. Regardless of any concerns, continue to use this variable in subsequent analyses.
Data Quality Issues
Ethnicity
Before doing any predictive analysis, we should pay attention to one variable in particular: Ethnicity, which can be a controversial variable to use.
On the positive side, this variable may provide potentially useful information for understanding or predicting credit card balance. Depending on the ultimate objective of the study, the ability to make more accurate predictions for credit card balance may have huge societal benefits (e.g., issuing credit cards to appropriate prospective cardholders).
On the negative side, incorporating ethnicity into the prediction procedure may be a sensitive move as making ethnicity-based predictions may be criticized on grounds of discrimination and raise legal concerns.
For this case study, we will take Ethnicity as an ordinary predictor and see if it may prove useful for predicting Balance.
Stage 3: Exploratory Data Analysis
TASK 2: Explore the numeric variables
- Investigate the distribution of the target variable.
- Create a correlation matrix for all the numeric variables (including the target variable) in the dataset.
- Examine the pairwise correlations, both between the predictors and in relation to the target variable, and state your observations.
- Explain whether it is reasonable to delete the Limit variable. Regardless of your response, delete this variable from further consideration.
- Create visual representations between the numeric variables and the target variable to identify variables most likely to predict the target.
Descriptive Statistics: Numeric Predictors
Data Exploration: Distribution of the Target Variable
Even if it is not directly requested in the task statement, it is a good idea to explore the distribution of the target variable for virtually all data exploration tasks. The nature of the target variable may suggest useful transformations and determine which predictive analytic techniques should be applied.
Common Mistakes
- Failing to consider the skewed distribution of the target variable before launching into the analysis.
To begin with, let's explore the distributional characteristics of the target variable Balance.
summary(Credit)
     Income           Limit           Rating          Cards            Age          Education        Gender    Student   Married              Ethnicity      Balance
 Min.   : 10.35   Min.   :  855   Min.   : 93.0   Min.   :1.000   Min.   :23.00   Min.   : 5.00   Male  :193   No :360   No :155   African American: 99   Min.   :   0.00
 1st Qu.: 21.01   1st Qu.: 3088   1st Qu.:247.2   1st Qu.:2.000   1st Qu.:41.75   1st Qu.:11.00   Female:207   Yes: 40   Yes:245   Asian           :102   1st Qu.:  68.75
 Median : 33.12   Median : 4622   Median :344.0   Median :3.000   Median :56.00   Median :14.00                                    Caucasian       :199   Median : 459.50
 Mean   : 45.22   Mean   : 4736   Mean   :354.9   Mean   :2.958   Mean   :55.67   Mean   :13.45                                                           Mean   : 520.01
 3rd Qu.: 57.47   3rd Qu.: 5873   3rd Qu.:437.2   3rd Qu.:4.000   3rd Qu.:70.00   3rd Qu.:16.00                                                           3rd Qu.: 863.00
 Max.   :186.63   Max.   :13913   Max.   :982.0   Max.   :9.000   Max.   :98.00   Max.   :20.00                                                           Max.   :1999.00
Findings
From the summary in CHUNK 1, the median, mean, and maximum of Balance are 459.50, 520.01, and 1,999, respectively. The fact that the mean exceeds the median suggests that the distribution of Balance is skewed to the right (although the skewness does not seem severe).
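If you want to quantify the skewness rather than just compare the mean and median, a rough moment-based estimate can be computed in base R (this quick check is our own addition, not one of the numbered chunks):

# Third standardized moment of Balance; a positive value indicates right skewness
x <- Credit$Balance
mean((x - mean(x))^3) / sd(x)^3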
Graphical Display: Numeric Predictors
Histogram
# CHUNK 2
library(ggplot2)
ggplot(Credit, aes(x = Balance)) +
  geom_histogram()

nrow(Credit[Credit$Balance <= 0, ])

[1] 90
Findings
As confirmed by the histogram of Balance constructed in CHUNK 2, the distribution of Balance is skewed to the right. However, there are a large number of observations (90, to be exact) for which Balance = 0. This makes the log transformation problematic, so we will leave Balance as is.
Exercise: Skewness-Correction Transformation
Suggest a possible transformation to reduce the right skewness of Balance. Do not apply this transformation.
Solution
While the log transformation is inapplicable to Balance due to the presence of zero values, the square root transformation still works well because Balance is non-negative.
Remark
Try changing Balance in CHUNK 2 to sqrt(Balance) and you will find that the distribution of the square root of Balance becomes more symmetric, although the mass at zero remains prominent.
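For instance, the modified CHUNK 2 would look as follows:

# Variant of CHUNK 2 with the square root transformation applied
ggplot(Credit, aes(x = sqrt(Balance))) +
  geom_histogram()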
Correlation Matrix
Now let's turn to the relationship between each numeric predictor and the target variable, which involves bivariate data exploration with the aid of summary statistics and graphical displays. To get a quick sense of which of the six numeric variables are likely to have significant predictive power, in CHUNK 3 we generate the correlation matrix for these variables and the target variable. To make the matrix easier to inspect, all pairwise correlations are rounded to 4 decimal places.
# CHUNK 3
# Calculate the correlation matrix for numeric variables
# The numeric predictors are in the first 6 columns
cor.matrix <- cor(Credit[, c(1:6, 11)])
print("Correlation Matrix")
round(cor.matrix, digits = 4)

[1] "Correlation Matrix"
           Income   Limit  Rating   Cards     Age Education Balance
Income     1.0000  0.7921  0.7914 -0.0183  0.1753   -0.0277  0.4637
Limit      0.7921  1.0000  0.9969  0.0102  0.1009   -0.0235  0.8617
Rating     0.7914  0.9969  1.0000  0.0532  0.1032   -0.0301  0.8636
Cards     -0.0183  0.0102  0.0532  1.0000  0.0429   -0.0511  0.0865
Age        0.1753  0.1009  0.1032  0.0429  1.0000    0.0036  0.0018
Education -0.0277 -0.0235 -0.0301 -0.0511  0.0036    1.0000 -0.0081
Balance    0.4637  0.8617  0.8636  0.0865  0.0018   -0.0081  1.0000
Findings
- Which variables are predictive of Balance: One obvious observation relates to the last row (or column) of the correlation matrix, which contains the correlation between Balance and each of the six numeric predictors. Judging from the sign and size of these six correlations, Limit (0.8617) and Rating (0.8636) have very strong positive linear relationships with Balance, and Income (0.4637) exerts a moderately positive linear effect on Balance. The linear relationships between Balance and the other four numeric variables are weak at best, given the negligible correlations. At this preliminary stage, Limit, Rating, and Income appear to be important predictors affecting Balance favorably, which aligns with our intuition.
- Correlations between variables: A less obvious observation has to do with the correlations among the six numeric predictors, which are contained in the first six rows and columns of the correlation matrix. There are some notable correlations involving three variables:
  - between Income and Limit (0.7921)
  - between Income and Rating (0.7914)
  - between Limit and Rating (0.9969)
These large correlations can be of concern as they suggest duplication of information contained in the three variables and may complicate the interpretation of the regression coefficients. The last correlation is the most alarming, indicating that Limit and Rating are almost perfectly positively collinear. This is further confirmed by plotting the two variables against each other, as we do in CHUNK 4. The scatterplot reveals that the two variables are virtually perfectly related in a linear fashion, so one contributes little additional information when the other is present. Therefore, we will keep one and only one of the two predictors, say Rating, and delete Limit (a choice justified by the slightly higher correlation between Rating and Balance), as we do in the last line of CHUNK 4.
Scatterplots: geom_point()

# CHUNK 4
ggplot(Credit, aes(x = Limit, y = Rating)) +
  geom_point()
Findings:
- See above.
# Delete Limit
Credit$Limit <- NULL
Scatterplots between numeric variables and the target variable
The pairwise correlations in CHUNK 3 are only a measure of the strength of the linear relationships between the numeric variables and Balance. We now use visual displays to gain a more holistic view of these relationships and further confirm the predictive power of Income and Rating (and Limit, if it still remains).
The best way to visualize the relationship between two numeric variables is a scatterplot. In CHUNK 5, we use a for loop to make a scatterplot, with a smoothed line superimposed, of Balance against each of the five remaining numeric predictors.
# CHUNK 5
# first save the names of the five numeric predictors as a vector
vars.numeric <- colnames(Credit[, 1:5])
for (i in vars.numeric) {
  plot <- ggplot(Credit, aes(x = Credit[, i], y = Balance)) +
    geom_point() +
    geom_smooth(method = "lm", se = FALSE) +
    labs(x = i)
  print(plot)
}
Findings
We can see that Balance has an unquestionably strong positive linear relationship with Rating. There is also a positive linear relationship between Balance and Income, but it is not as strong, and there are hardly any systematic patterns between Balance and each of Cards, Age, and Education – the fitted regression lines are almost flat, meaning that Balance is not sensitive to changes in these three predictors. All of these findings are consistent with our earlier deductions based on the correlation matrix in CHUNK 3. (By the way, you can see a series of vertical strips in the scatterplots for Cards and Education because these two variables are integer-valued with few distinct values.)
Descriptive Statistics: Categorical Predictors
TASK 3: Explore the factor variables
Investigate the distribution of the factor variables. Identify a factor variable that is likely to predict the target. For this factor variable (there is no need to discuss the other variables in your response), perform the following:
- Show key descriptive statistics.
- Create visual representations (e.g., graphs, charts).
- Explain why you selected this variable and how the variable relates to the target variable.
In this task, we will investigate the four categorical predictors, Gender, Student, Married, and Ethnicity, and identify one of them that is likely to be predictive. Some preliminary descriptive statistics of these four variables can be found in the summary in CHUNK 1. For instance, 360 (90%) of the observations are non-students while 40 (10%) are students.
In CHUNK 6, we produce the mean (and median) of Balance split by the levels of each of the four factor variables.
# CHUNK 6
library(tidyverse)

# Save the names of the categorical predictors as a vector
vars.categorical <- c("Gender", "Student", "Married", "Ethnicity")

for (i in vars.categorical) {
  x <- Credit %>%
    group_by_(i) %>%
    summarise(
      mean = mean(Balance),
      median = median(Balance),
      n = n()
    )
  print(x)
}
# A tibble: 2 × 4
  Gender    mean median     n
  <fct>    <dbl>  <int> <int>
1 " Male"   510.    463   193
2 "Female"  530.    456   207
# A tibble: 2 × 4
  Student  mean median     n
  <fct>   <dbl>  <dbl> <int>
1 No       480.    424   360
2 Yes      877.    953    40
# A tibble: 2 × 4
  Married  mean median     n
  <fct>   <dbl>  <int> <int>
1 No       523.    467   155
2 Yes      518.    454   245
# A tibble: 3 × 4
  Ethnicity         mean median     n
  <fct>            <dbl>  <dbl> <int>
1 African American  531     480    99
2 Asian             512.    414   102
3 Caucasian         518.    465   199
Findings
Of the four categorical predictors, only Student stands out as making a significant difference to the mean of Balance over its two levels, "Yes" and "No", with students (876.8250) having a much higher average Balance than non-students (480.3694). The mean of Balance is more or less the same across the levels of Gender, Married, and Ethnicity; these three variables appear to be ineffective in differentiating the observations of Balance.
Graphical Display: Categorical Predictors
Split Boxplots
Let's supplement our conclusions above with some graphical displays. Since Balance is a numeric variable but we are now dealing with categorical predictors, we will make use of split boxplots, as we do in CHUNK 7.
# CHUNK 7
for (i in vars.categorical) {
  plot <- ggplot(Credit, aes(x = Credit[, i], y = Balance)) +
    geom_boxplot() +
    labs(x = i)
  print(plot)
}
Findings
As expected, the box for students is noticeably higher than that for non-students, with a similar spread, while the boxes are very similar across the factor levels of each of Gender, Married, and Ethnicity. These observations support our earlier finding that Student seems to be the only factor variable with significant predictive power.
Overall, the most promising predictors for Balance are Income (numeric), Rating (numeric), and Student (factor).
TASK 4: Consider two graphs
Your assistant has built the following two plots and asked for your comments:
- A boxplot of Age by Student.
- A scatterplot of Balance against Income colored by Student.
Run the code to make them.
- Include them in your response.
- State your observations and discuss the impact, if any, of each plot on your later modeling.
Occasionally on the exam, you are asked to focus on specific graphs that highlight subtle but important aspects of the data that may go unnoticed in a standard data exploration exercise. The two graphs constructed in CHUNK 8 are good examples.
Age against Student
# CHUNK 8
ggplot(Credit, aes(x = Student, y = Age)) +
  geom_boxplot()
Findings
The boxplot of Age by Student shows that students tend to be younger (a lower median age). This makes intuitive sense, although the median age of students, which is close to 50, seems unreasonably high (well, this dataset is simulated!). However, it also means that there is some co-dependence (but not interaction!) between Age and Student, and the effects of these two variables on Balance may be distorted when they serve as predictors at the same time. Fortunately, this concern may not be relevant after all, as we saw in Task 2 that Age is unlikely to be an important predictor of Balance.
Balance against Income by Student
# CHUNK 8 (Cont.)
ggplot(Credit, aes(x = Income, y = Balance, color = Student)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
Findings
The scatterplot shows Balance against Income, with the observations colored by Student and a smoothed line superimposed for each level of Student. It highlights the possible interaction between Income and Student in affecting Balance.
Recall that an interaction means that the expected effect of one predictor on the target variable depends on the value or level of another predictor. This is reflected in the scatterplot, where the slopes of the two smoothed lines differ quite remarkably, with non-students being more sensitive to Income than students, perhaps due to their financial status. To accommodate the interaction between Income and Student, we will include the interaction variable Income:Student when constructing linear models in later tasks.
Note
It is a good idea to briefly describe what an interaction is whenever this concept arises before explaining why there is an interaction. This not only helps structure your response, but also provides a good opportunity for you to demonstrate to the SOA your knowledge of predictive analytics!
Stage 4: Model Construction and Evaluation
Model Evaluation: Training/Test Set Split
TASK 5: Explore the effect of releveling factor variables
- Split the data into training and test sets.
- Fit a multiple linear regression model for Balance on all variables (except Limit) on the training set. Be sure to take note of the implications in Task 4. Interpret the coefficient estimates for Married and for the Asian level of Ethnicity.
- Relevel the factor variables so that the most frequent level becomes the baseline level and refit the multiple linear regression model. Interpret the coefficient estimates for Married and for the Asian level of Ethnicity again.
Creation of training and test sets: createDataPartition()
To pave the way for constructing and evaluating the predictive performance of the different linear models that will come later, we start this subsection by setting up the training and test sets, again with the createDataPartition() function. This is done in CHUNK 9.

createDataPartition(<target variable>, p = <proportion of training data>, list = <FALSE to return a matrix of indices rather than a list>)
# CHUNK 9
library(caret)
set.seed(8964)
partition <- createDataPartition(Credit$Balance, p = 0.75, list = FALSE)
data.train <- Credit[partition, ]
data.test <- Credit[-partition, ]

print("TRAIN")
mean(data.train$Balance)
print("TEST")
mean(data.test$Balance)
[1] "TRAIN" > mean(data.train$Balance) [1] 520.4352 [1] "TEST" > mean(data.test$Balance) [1] 518.7374
Again, the mean of Balance in the training set is pretty close to that in the test set due to the built-in stratification.
Model Construction
Step 1: Fitting the First Model
In CHUNK 10, we fit the linear model for Balance regressed on all variables, together with the interaction between Income and Student identified in Task 4, on the training set; save the fitted model as model.full (meaning the "full model"); and output a model summary.
Note that when specifying the interaction term in the formula, it is enough to use the colon operator instead of the asterisk because the placeholder "." already includes the main effects terms. (In fact, if you change Income:Student to Income*Student, the model summary will remain the same.)
# CHUNK 10
model.full <- lm(Balance ~ . + Income:Student, data = data.train)
summary(model.full)
Call:
lm(formula = Balance ~ . + Income:Student, data = data.train)

Residuals:
    Min      1Q  Median      3Q     Max 
-189.17  -76.46  -15.65   66.69  296.17 

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)    
(Intercept)        -552.34353   40.34543 -13.690   <2e-16 ***
Income               -7.93864    0.29465 -26.943   <2e-16 ***
Rating                4.01246    0.06416  62.540   <2e-16 ***
Cards                 2.89772    4.69168   0.618   0.5373    
Age                  -0.78778    0.35994  -2.189   0.0294 *  
Education             0.88922    1.92035   0.463   0.6437    
GenderFemale        -18.37030   12.00867  -1.530   0.1272    
StudentYes          395.67744   30.72302  12.879   <2e-16 ***
MarriedYes          -20.46469   12.39735  -1.651   0.0999 .  
EthnicityAsian        6.53269   17.13794   0.381   0.7033    
EthnicityCaucasian   11.95386   14.82444   0.806   0.4207    
Income:StudentYes     0.50102    0.50028   1.001   0.3174    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 103.1 on 289 degrees of freedom
Multiple R-squared:  0.952,  Adjusted R-squared:  0.9502 
F-statistic: 521.4 on 11 and 289 DF,  p-value: < 2.2e-16
Number of Levels of the Categorical Predictors
Note that the lm() function automatically turns the four categorical predictors into dummy variables, which then enter the equation of the multiple linear regression model; we need not create the dummy variables beforehand. These dummy variables are named in the form of

<categorical_predictor><level>

For example, StudentYes is a dummy variable that equals 1 if the observation in question is a student and 0 otherwise. A categorical predictor with r levels is represented by r - 1 dummy variables, each of which is an indicator of one and only one of the r - 1 non-baseline levels of the predictor, so:
- Gender, Student, and Married are each represented by 1 dummy variable, and Ethnicity by 2 dummy variables.
The level that is left out is the baseline level, which, by default, is the alphanumerically first level. For instance, the baseline level of Student is "No", which comes before "Yes" in alphabetical order; this explains why the StudentYes dummy variable, rather than StudentNo, appears in the model output. Similarly, the baseline levels of Married and Ethnicity are "No" and "African American", respectively. (If you go back to the frequency tables of the four categorical predictors in CHUNK 1, you can see that the baseline levels are listed first.)
Coefficient Estimates
The coefficient estimates for the categorical predictors have to be interpreted differently from those for numeric predictors. Taking the estimates for MarriedYes and EthnicityAsian for illustration, we can say:
- The credit card balance of married individuals is lower than that of unmarried individuals (baseline) by 20.46469 on average, holding all other predictors fixed.
- The credit card balance of Asians is higher than that of African Americans (baseline) by 6.53269 on average, holding all other predictors fixed.
The model summary also agrees with the findings in the data exploration tasks we completed (Tasks 2 and 3) and shows that:
Income, Rating, and Student (or equivalently, StudentYes) are highly significant.
Perhaps to your astonishment, the interaction term between Income and Student is not quite significant. This does not necessarily contradict what we saw in Task 4, where we examined the relationships among Balance, Income, and Student while ignoring all other variables. In contrast, the linear model we have just fitted takes into account all of the variables in the Credit dataset. The model says that the interaction becomes insignificant once the effects of the other variables have been accounted for.
By the way, you can see that:
- The standard error of the coefficient estimate of the "Asian" level (17.13794) of the Ethnicity variable is larger than that of the "Caucasian" level (14.82444). This should come as no surprise, considering the larger number of observations in the latter level (see CHUNK 1 or CHUNK 6). Recall that sparse (resp. populous) levels tend to have larger (resp. smaller) standard errors because the coefficient estimates for those levels are based on less (resp. more) data.
Releveling
In most of the PA sample and exam projects, a code chunk is supplied for you to relevel factor variables so that their baseline level is not the alphanumerically first level, but the level with the most observations. While releveling has no effect on the predicted values, changing the baseline levels will:
- affect the coefficient estimates of the dummy variables (which are now dummy variables with respect to another baseline level), and
- possibly affect the significance statistics, such as the t-statistics for the dummy variables and the corresponding p-values.
Remember, the p-values measure how different the non-baseline levels are from the baseline level, and changing the baseline level certainly changes the measurement of such differences.
In general, for linear models and GLMs the usual practice is to set the baseline level to the most populous level to improve the accuracy of our measures of significance. In CHUNK 11, we relevel the four categorical predictors using a for loop and the relevel() function.
# CHUNK 11
for (i in vars.categorical) {
  # Use the table() function to calculate the frequencies for each factor
  table <- as.data.frame(table(Credit[, i]))
  # Determine the level with the highest frequency
  max <- which.max(table[, 2])
  # Save the name of the level with the highest frequency
  level.name <- as.character(table[max, 1])
  # Set the baseline level to the most populous level
  Credit[, i] <- relevel(Credit[, i], ref = level.name)
}

summary(Credit[, vars.categorical])
    Gender    Student   Married                Ethnicity  
 Female:207   No :360   Yes:245   Caucasian       :199    
 Male  :193   Yes: 40   No :155   African American: 99    
                                  Asian           :102    
As you can see, the baseline levels of Gender, Married, and Ethnicity have been changed from "Male", "No", and "African American" to "Female", "Yes", and "Caucasian", respectively, which now come first in the frequency tables above.
What happens to the model output after doing the releveling? Run CHUNK 12 to refit the full model.
# CHUNK 12
# To make sure factors in the training set are releveled
data.train <- Credit[partition, ]

model.full <- lm(Balance ~ . + Income:Student, data = data.train)
summary(model.full)
Call:
lm(formula = Balance ~ . + Income:Student, data = data.train)

Residuals:
    Min      1Q  Median      3Q     Max 
-189.17  -76.46  -15.65   66.69  296.17 

Coefficients:
                            Estimate Std. Error t value Pr(>|t|)    
(Intercept)               -579.22466   40.08600 -14.450   <2e-16 ***
Income                      -7.93864    0.29465 -26.943   <2e-16 ***
Rating                       4.01246    0.06416  62.540   <2e-16 ***
Cards                        2.89772    4.69168   0.618   0.5373    
Age                         -0.78778    0.35994  -2.189   0.0294 *  
Education                    0.88922    1.92035   0.463   0.6437    
Gender Male                 18.37030   12.00867   1.530   0.1272    
StudentYes                 395.67744   30.72302  12.879   <2e-16 ***
MarriedNo                   20.46469   12.39735   1.651   0.0999 .  
EthnicityAfrican American  -11.95386   14.82444  -0.806   0.4207    
EthnicityAsian              -5.42117   14.68523  -0.369   0.7123    
Income:StudentYes            0.50102    0.50028   1.001   0.3174    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 103.1 on 289 degrees of freedom
Multiple R-squared:  0.952,  Adjusted R-squared:  0.9502 
F-statistic: 521.4 on 11 and 289 DF,  p-value: < 2.2e-16
Comparing the output in CHUNK 12 with that in CHUNK 10, we can see that the coefficient estimates of the dummy variables for Married and Ethnicity have changed, as have the p-values for EthnicityAsian. We can say that:
- The credit card balance of unmarried individuals is higher than that of married individuals (baseline) by 20.46469 on average, holding all other predictors fixed. The size of the coefficient estimate remains unchanged, but it is now positive (-20.46469 -> +20.46469).
- The credit card balance of Asians is lower than that of Caucasians (baseline) by 5.42117 on average, holding all other predictors fixed. If you are interested, this coefficient estimate can be computed from the two coefficient estimates in CHUNK 10: 6.53269 - 11.95386 = -5.42117.
While there is no change in the p-value for Student, the p-value for the Asian level of Ethnicity is different from that in CHUNK 10. In general, the p-values for factor variables with three or more levels may depend on the choice of the baseline level. If we perform stepwise (backward) selection with binarization and try to drop the most statistically insignificant feature, that is, the one with the highest p-value, then the choice of the baseline levels of the factor variables may affect which feature is dropped first. This explains the importance of releveling.
TASK 6: Binarize Factor Variables Manually
- Describe the advantages and disadvantages of binarizing the factor variables before performing feature selection.
- Regardless, binarize the factor variables and refit the multiple linear regression model in Task 5.
One more preparatory step: Binarization of factor variables
Before implementing feature selection, it is sometimes desirable to explicitly binarize categorical predictors with three or more levels. The lm() function is capable of automatically converting categorical predictors into dummy variables behind the scenes, as we saw in CHUNKs 10 and 12, so whether or not we binarize the categorical predictors in advance has no impact on the fitted linear model. Nevertheless, explicit binarization can make the feature selection process we are about to perform more meaningful.
- Advantages: Many feature selection functions in R treat factor variables as a single feature and either retain the variable with all of its levels or remove the variable completely. This "all or nothing" approach to feature selection ignores the possibility that individual levels of a factor with three or more levels may be:
  - insignificant with respect to the baseline level (and hence could be combined with it to form a bigger baseline level), or
  - insignificantly different from other levels (in which case they could be combined into a new, bigger level).
With binarization, we are able to add or drop individual factor levels when doing feature selection. (Note: This advantage is not very relevant to this case study because only the Ethnicity variable has more than two levels and neither of its dummy variables is statistically significant. We do binarization here mainly as illustration.)
- Disadvantages: Binarization of factor variables is not without disadvantages.
- As each factor level is considered a separate feature to add or drop, the stepwise selection procedure may take considerably longer to complete. The increase in computational burden is non-trivial when there are a fairly large number of factor levels.
- We may see some non-intuitive or nonsensical results when only a handful of the levels of a categorical predictor are selected.
Compared to generating new features such as \(x^2\), \(x^3\), \(\log(x)\), \(\exp(x)\), or \(\sin(x)\) for a numeric variable \(x\), binarization of categorical variables is conceptually more subtle and programmatically more difficult to implement in R.
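For contrast, here is how easily such numeric features can be added directly in a model formula (a throwaway illustration of our own, not one of the case study's candidate models):

# Quadratic and log features for Income, added inline in the formula;
# I() protects arithmetic inside a formula
lm(Balance ~ Income + I(Income^2) + log(Income), data = data.train)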
The PA modules suggest using the dummyVars() function in the caret package, and we will follow suit. Among the four categorical variables Gender, Student, Married, and Ethnicity, only Ethnicity has more than two levels and has to be binarized by two dummy variables. The binarization of Ethnicity can be done by running CHUNK 13.
# CHUNK 13
library(caret)
binarizer <- dummyVars(paste("~", paste(vars.categorical, collapse = "+")),
                       data = Credit, fullRank = TRUE)
# OR type out the categorical predictors one by one
#binarizer <- dummyVars(~ Gender + Student + Married + Ethnicity,
#                       data = Credit, fullRank = TRUE)
binarized_vars <- data.frame(predict(binarizer, newdata = Credit))
head(binarized_vars)
  Gender..Male Student.Yes Married.No Ethnicity.African.American Ethnicity.Asian
1            1           0          0                          0               0
2            0           1          0                          0               1
3            1           0          1                          0               1
4            0           0          1                          0               1
5            1           0          0                          0               0
6            1           0          1                          0               0
Explanation of Coding
The dummyVars() function takes a formula (there is no target variable) as the first argument and a data frame carrying the categorical variable(s) you want to binarize as the second argument. In paste(vars.categorical, collapse = "+"), we use the paste() function to "paste" together the names of the four categorical predictors Gender, Student, Married, and Ethnicity (recall that vars.categorical is defined in CHUNK 6), separated by + signs.
Then another application of the paste() function inserts the tilde sign before the "sum" of the names of the categorical predictors and forms the overall formula.
(Note: The use of the paste() function is suggested in the PA e-learning modules and projects. If you find the paste() function too difficult to work with, you may simply type out the formula ~ Gender + Student + Married + Ethnicity.)
The fullRank option can be set to TRUE or FALSE depending on your purpose:
- Setting it to TRUE is appropriate for regression and gives you a full-rank parameterization of the model matrix, with the baseline levels of the categorical predictors left out.
- When the option is set to FALSE (the default value), the baseline levels are retained and a rank-deficient model matrix results. This is more appropriate for principal components and cluster analyses, which will be covered in Chapter 6.
Below is the outcome of setting fullRank = FALSE.

# CHUNK 13 (Cont.)
library(caret)
binarizer <- dummyVars(paste("~", paste(vars.categorical, collapse = "+")),
                       data = Credit, fullRank = FALSE)
# OR type out the categorical predictors one by one
#binarizer <- dummyVars(~ Gender + Student + Married + Ethnicity,
#                       data = Credit, fullRank = FALSE)
binarized_vars <- data.frame(predict(binarizer, newdata = Credit))
head(binarized_vars)
  Gender.Female Gender..Male Student.No Student.Yes Married.Yes Married.No Ethnicity.Caucasian Ethnicity.African.American Ethnicity.Asian
1             0            1          1           0           1          0                   1                          0               0
2             1            0          0           1           1          0                   0                          0               1
3             0            1          1           0           0          1                   0                          0               1
4             1            0          1           0           0          1                   0                          0               1
5             0            1          1           0           1          0                   1                          0               0
6             0            1          1           0           0          1                   1                          0               0
As we said above, it is enough to binarize only Ethnicity, as it is the only categorical predictor with three or more levels. For illustration purposes, we have binarized all four categorical predictors to show how multiple factor variables can be dealt with at the same time; this is likely the case you will encounter in the exam. As expected, the binarization procedure creates two dummy variables for Ethnicity, called Ethnicity.African.American and Ethnicity.Asian (recall that the baseline level after releveling is Caucasian), indicating whether Ethnicity equals "African American" or "Asian", respectively.
The binarized_vars data frame contains all of the dummy variables. In CHUNK 14, we combine it with the original Credit dataset via the cbind() function and name the result with a .bin suffix to distinguish it from the original non-binarized dataset.
# CHUNK 14
Credit.bin <- cbind(Credit, binarized_vars)
head(Credit.bin)
   Income Rating Cards Age Education Gender Student Married Ethnicity Balance Gender.Female Gender..Male Student.No Student.Yes Married.Yes Married.No Ethnicity.Caucasian Ethnicity.African.American Ethnicity.Asian
1  14.891    283     2  34        11   Male      No     Yes Caucasian     333             0            1          1           0           1          0                   1                          0               0
2 106.025    483     3  82        15 Female     Yes     Yes     Asian     903             1            0          0           1           1          0                   0                          0               1
3 104.593    514     4  71        11   Male      No      No     Asian     580             0            1          1           0           0          1                   0                          0               1
4 148.924    681     3  36        11 Female      No      No     Asian     964             1            0          1           0           0          1                   0                          0               1
5  55.882    357     2  68        16   Male      No     Yes Caucasian     331             0            1          1           0           1          0                   1                          0               0
6  80.180    569     4  77        10   Male      No      No Caucasian    1151             0            1          1           0           0          1                   1                          0               0
Given the binarized variables, the original categorical predictors are no longer needed, as they duplicate the information contained in the dummy variables. They can be dropped by assigning them NULL.
# CHUNK 14 (Cont.)
Credit.bin$Gender <- NULL
Credit.bin$Student <- NULL
Credit.bin$Married <- NULL
Credit.bin$Ethnicity <- NULL
head(Credit.bin)
   Income Rating Cards Age Education Balance Gender.Female Gender..Male Student.No Student.Yes Married.Yes Married.No Ethnicity.Caucasian Ethnicity.African.American Ethnicity.Asian
1  14.891    283     2  34        11     333             0            1          1           0           1          0                   1                          0               0
2 106.025    483     3  82        15     903             1            0          0           1           1          0                   0                          0               1
3 104.593    514     4  71        11     580             0            1          1           0           0          1                   0                          0               1
4 148.924    681     3  36        11     964             1            0          1           0           0          1                   0                          0               1
5  55.882    357     2  68        16     331             0            1          1           0           1          0                   1                          0               0
6  80.180    569     4  77        10    1151             0            1          1           0           0          1                   1                          0               0
Now that the expanded dataset with binarized variables is available, we again split it into training and test sets, using the same partition vector from CHUNK 9. There is no need to call the createDataPartition() function again, as partition already contains the set of observations (rows) that should be sent to the training and test sets, and the same training/test split should be used to evaluate all of our candidate models.
# CHUNK 14 (Cont.)
data.train.bin <- Credit.bin[partition, ]
data.test.bin <- Credit.bin[-partition, ]
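As a quick sanity check (our own addition), you can confirm that the binarized split contains exactly the same observations as the original split:

# Both splits are driven by the same partition vector,
# so the row names should match exactly
identical(rownames(data.train), rownames(data.train.bin))  # expect TRUE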
A code chunk for performing binarization of factor variables will be provided to help you. You merely have to change the placeholders to names of the variables in the dataset you are working on. Here is how it looks.
library(caret)
# insert the names of the variables to be binarized
factor_names <- c("<var1>", "<var2>", ... )
factor_vars <- <data>[, factor_names]
# dummyVars is not compatible with factors
for (var in factor_names) {
  factor_vars[, var] <- as.character(factor_vars[, var])
}
binarizer <- caret::dummyVars(paste("~", paste(factor_names, collapse = "+")),
                              data = factor_vars, fullRank = TRUE)
binarized_vars <- data.frame(predict(binarizer, newdata = factor_vars))
head(binarized_vars)
Notice two features of the SOA's code that are not present in our illustrative code in CHUNK 13.
- The first part of the SOA's code converts the factor variables to character variables. The PA modules and exams say that this conversion is done because "dummyVars is not compatible with factors," which does not seem true. Without this conversion, the binarization continues to work well, as we saw in CHUNK 13 above, and the linear models can still be fit properly, as we will see soon.
- The command caret::dummyVars explicitly specifies that the dummyVars() function from the caret package is used.
It is very likely that future exams requiring binarization will also provide all of the necessary code for you. (If not, type ?dummyVars to seek help. Remember to load the caret package first.)
Step 1 Revisited: Refitting the Linear Model with Binarized Factors
Now that the training and test sets have been binarized, let’s end this task by refitting the linear model on the binarized training set, where all variables are numeric, and print a summary of the model to make sure everything looks right.
# CHUNK 15
# The interaction term is now Income:Student.Yes, not Income:Student
model.full.bin <- lm(Balance ~ . + Income:Student.Yes, data = data.train.bin)
summary(model.full.bin)
Call:
lm(formula = Balance ~ . + Income:Student.Yes, data = data.train.bin)

Residuals:
    Min      1Q  Median      3Q     Max 
-189.17  -76.46  -15.65   66.69  296.17 

Coefficients: (4 not defined because of singularities)
                             Estimate Std. Error t value Pr(>|t|)    
(Intercept)                -150.13340   48.40440  -3.102  0.00211 ** 
Income                       -7.93864    0.29465 -26.943  < 2e-16 ***
Rating                        4.01246    0.06416  62.540  < 2e-16 ***
Cards                         2.89772    4.69168   0.618  0.53731    
Age                          -0.78778    0.35994  -2.189  0.02942 *  
Education                     0.88922    1.92035   0.463  0.64368    
Gender.Female               -18.37030   12.00867  -1.530  0.12717    
Gender..Male                       NA         NA      NA       NA    
Student.No                 -395.67744   30.72302 -12.879  < 2e-16 ***
Student.Yes                        NA         NA      NA       NA    
Married.Yes                 -20.46469   12.39735  -1.651  0.09988 .  
Married.No                         NA         NA      NA       NA    
Ethnicity.Caucasian           5.42117   14.68523   0.369  0.71228    
Ethnicity.African.American   -6.53269   17.13794  -0.381  0.70335    
Ethnicity.Asian                    NA         NA      NA       NA    
Income:Student.Yes            0.50102    0.50028   1.001  0.31743    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 103.1 on 289 degrees of freedom
Multiple R-squared:  0.952,  Adjusted R-squared:  0.9502 
F-statistic: 521.4 on 11 and 289 DF,  p-value: < 2.2e-16
The summary output is identical to that in CHUNK 12, showing that binarization has no effect on the quality of the fitted model. However, the two dummy variables for Ethnicity will be considered separately when we do stepwise selection in the next task.
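If you want to verify that binarization has no effect on the fitted model, one quick check (our own addition) is to compare the fitted values of the two models:

# The fitted values of the releveled model (CHUNK 12) and the
# binarized model (CHUNK 15) should agree up to numerical precision
all.equal(unname(fitted(model.full)), unname(fitted(model.full.bin)))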
TASK 7: Select features using stepwise selection
Some of the features in the full model may lack predictive power and result in overfitting.
- Perform stepwise selection using the stepAIC() function to determine which features should be retained. When using this function, there are two decisions to make:
  - forward (direction = "forward") vs. backward (direction = "backward") selection, and
  - AIC (k = 2) vs. BIC (k = log(nrow(data.train))).
  Describe each decision and the implications of choosing one option versus the other.
- Employ forward selection using BIC.
- Run the summary function on the linear model selected by this process. Provide the summary output and list the variables selected.
Step 2: Removing Non-Predictive Features
As CHUNK 15 shows, only a few features are statistically significant. While including all of the features will improve the fit to the training data, it may make the model too complex and result in overfitting (i.e., poor fit to unseen test data). A sensible way to improve the model's predictive power (and interpretability) is to drop statistically insignificant features and retain the genuinely predictive ones. This is what this task is about. In R, dropping insignificant features sequentially can be achieved by several functions, the most important of which in Exam PA is the stepAIC() function.
Feature Selection by the stepAIC() Function
Available from the MASS package, the stepAIC() function automates stepwise model selection, in each step adding (forward selection) or dropping (backward selection) the feature that produces the greatest improvement in the model according to a certain criterion (the AIC by default). The process terminates when no more variables can be added or dropped to improve the model.
Although not much about the stepAIC() function is presented in the PA e-learning modules, it is used in many of the PA exam projects (almost all PA projects involve feature selection in some way!). The December 2020, June 2020, and June 2019 PA exams even test some of the options of the function that can be changed to customize the feature selection process, so it pays to look at the two important options below that control the selection process and the selection criterion:
- direction = "backward" (default) or direction = "forward": With the stepAIC() function, we can implement either:
  - backward selection, where we start with the full model (with interaction and polynomial terms inserted if desired) and in each step drop the feature that leads to the greatest decrease in the model AIC (or BIC), or
  - forward selection, where we start with a primitive model (usually the model with only the intercept) and add the feature, one at a time, that leads to the greatest decrease in the model AIC (or BIC).
- k = 2 (default) or k = log(nrow(data.train)): Despite the nomenclature of the stepAIC() function, it can perform stepwise selection using not only the AIC, but also the BIC as the selection criterion. To change the selection criterion from the AIC to the BIC, adjust the k argument, which is the multiple of the number of degrees of freedom that serves as the penalty term. Since
  \(\text{AIC}=-2l+\boxed{2}p\) and \(\text{BIC}=-2l+\boxed{\ln(n_{tr})}p\),
  the AIC and BIC are captured by setting k = 2 (default) and k = log(nrow(data.train)), respectively.
AIC vs. BIC, backward vs. forward: Which ones to use?
When performing stepwise feature selection, which selection criterion and selection process to use is an important modeling decision. Naturally, none of the selection criteria and selection processes are universally superior and an argument in favor of a particular criterion or process can be put forward based on the context of the given business problem.
AIC vs. BIC
Recall that the AIC and BIC differ only in terms of the size of the penalty that applies to the number of parameters.
- For the AIC, the penalty per parameter is 2.
- For the BIC, the penalty per parameter is the natural logarithm of the size of the training set, \(\ln(n_{tr})=\ln 301 = 5.7071\), which is greater.
For a more complex model to beat a less complex model in terms of the BIC, the goodness of fit of the former has to far surpass that of the latter to make up for the increase in the penalty term. Thus, the BIC tends to favor a model with fewer features and represents a more conservative approach to feature selection.
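You can see the two penalties at work by computing both criteria for the same model with base R's AIC() and BIC() functions (our own illustration):

AIC(model.full.bin)                                 # penalty k = 2
BIC(model.full.bin)                                 # penalty k = log(n)
# BIC() is equivalent to AIC() with the penalty changed:
AIC(model.full.bin, k = log(nrow(data.train.bin)))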
Since our goal in this case study is to identify key factors that relate to the target variable Balance, it makes sense to take a conservative approach and work with as few features as necessary in the final model, using the BIC as the selection criterion.
You can also make an argument (though not as convincing) in support of the AIC, saying that due to its propensity for retaining more features than other selection criteria, the client you work with will be given more factors to consider and the risk of missing important features will be smaller.
Backward vs. Forward Selections
The decision between forward and backward selection can be justified with similar reasons. With the selection criterion held fixed, forward selection is more likely to produce a final model with fewer features (because the model you start with is the null model, which has no features) and better aligns with our goal of identifying key factors affecting Balance.
Based on the above considerations, it makes sense to employ forward selection with the BIC in this case study, and this is what we are asked to perform according to the task statement. For comparison, we will also conduct backward selection with the AIC. If this were a real exam, you would only need to do forward selection with the BIC.
Backward selection with the AIC
To begin with, let's run CHUNK 16 to perform stepwise backward selection using the AIC, which is easier coding-wise. As backward selection and the AIC are the defaults of the stepAIC() function, it is enough to apply the function directly to the binarized full model constructed in CHUNK 15, without specifying additional options. We then save the final model as an lm object named model.backward.AIC for future manipulation, and print a summary of the final model.

# CHUNK 16
# install.packages("MASS")  # uncomment this line the first time you use MASS
library(MASS)
model.backward.AIC <- stepAIC(model.full.bin)
Start:  AIC=2802.25
Balance ~ Income + Rating + Cards + Age + Education + Gender.Female + 
    Gender..Male + Student.No + Student.Yes + Married.Yes + Married.No + 
    Ethnicity.Caucasian + Ethnicity.African.American + Ethnicity.Asian + 
    Income:Student.Yes

Step:  AIC=2802.25
Balance ~ Income + Rating + Cards + Age + Education + Gender.Female + 
    Gender..Male + Student.No + Student.Yes + Married.Yes + Married.No + 
    Ethnicity.Caucasian + Ethnicity.African.American + Income:Student.Yes

Step:  AIC=2802.25
Balance ~ Income + Rating + Cards + Age + Education + Gender.Female + 
    Gender..Male + Student.No + Student.Yes + Married.Yes + 
    Ethnicity.Caucasian + Ethnicity.African.American + Income:Student.Yes

Step:  AIC=2802.25
Balance ~ Income + Rating + Cards + Age + Education + Gender.Female + 
    Gender..Male + Student.Yes + Married.Yes + Ethnicity.Caucasian + 
    Ethnicity.African.American + Income:Student.Yes

Step:  AIC=2802.25
Balance ~ Income + Rating + Cards + Age + Education + Gender.Female + 
    Student.Yes + Married.Yes + Ethnicity.Caucasian + 
    Ethnicity.African.American + Income:Student.Yes

                             Df Sum of Sq      RSS    AIC
- Ethnicity.Caucasian         1      1448  3071419 2800.4
- Ethnicity.African.American  1      1543  3071515 2800.4
- Education                   1      2278  3072249 2800.5
- Cards                       1      4052  3074024 2800.6
- Income:Student.Yes          1     10654  3080626 2801.3
<none>                                     3069972 2802.3
- Gender.Female               1     24859  3094830 2802.7
- Married.Yes                 1     28946  3098918 2803.1
- Age                         1     50885  3120857 2805.2
- Rating                      1  41548705 44618677 3605.9

Step:  AIC=2800.39
Balance ~ Income + Rating + Cards + Age + Education + Gender.Female + 
    Student.Yes + Married.Yes + Ethnicity.African.American + Income:Student.Yes

                             Df Sum of Sq      RSS    AIC
- Education                   1      2255  3073675 2798.6
- Cards                       1      3911  3075331 2798.8
- Ethnicity.African.American  1      5595  3077014 2798.9
- Income:Student.Yes          1     10176  3081595 2799.4
<none>                                     3071419 2800.4
- Gender.Female               1     25165  3096584 2800.8
- Married.Yes                 1     29351  3100771 2801.3
- Age                         1     51462  3122881 2803.4
- Rating                      1  41550276 44621695 3603.9

Step:  AIC=2798.61
Balance ~ Income + Rating + Cards + Age + Gender.Female + Student.Yes + 
    Married.Yes + Ethnicity.African.American + Income:Student.Yes

                             Df Sum of Sq      RSS    AIC
- Cards                       1      3865  3077540 2797.0
- Ethnicity.African.American  1      5586  3079261 2797.2
- Income:Student.Yes          1     10214  3083888 2797.6
<none>                                     3073675 2798.6
- Gender.Female               1     25155  3098830 2799.1
- Married.Yes                 1     28499  3102174 2799.4
- Age                         1     51652  3125327 2801.6
- Rating                      1  41550265 44623939 3601.9

Step:  AIC=2796.99
Balance ~ Income + Rating + Age + Gender.Female + Student.Yes + 
    Married.Yes + Ethnicity.African.American + Income:Student.Yes

                             Df Sum of Sq      RSS    AIC
- Ethnicity.African.American  1      5579  3083119 2795.5
- Income:Student.Yes          1     10767  3088307 2796.0
<none>                                     3077540 2797.0
- Gender.Female               1     23659  3101198 2797.3
- Married.Yes                 1     27917  3105457 2797.7
- Age                         1     49660  3127200 2799.8
- Rating                      1  42161655 45239195 3604.0

Step:  AIC=2795.54
Balance ~ Income + Rating + Age + Gender.Female + Student.Yes + 
    Married.Yes + Income:Student.Yes

                     Df Sum of Sq      RSS    AIC
- Income:Student.Yes  1      9873  3092992 2794.5
<none>                             3083119 2795.5
- Gender.Female       1     23435  3106554 2795.8
- Married.Yes         1     24716  3107835 2795.9
- Age                 1     50672  3133791 2798.4
- Rating              1  42165674 45248793 3602.1

Step:  AIC=2794.5
Balance ~ Income + Rating + Age + Gender.Female + Student.Yes + 
    Married.Yes

                Df Sum of Sq      RSS    AIC
<none>                         3092992 2794.5
- Gender.Female  1     22314  3115306 2794.7
- Married.Yes    1     25027  3118019 2794.9
- Age            1     49133  3142125 2797.2
- Student.Yes    1   4931600  8024592 3079.5
- Income         1   8357794 11450786 3186.5
- Rating         1  42247226 45340218 3600.7
summary(model.backward.AIC)
Call:
lm(formula = Balance ~ Income + Rating + Age + Gender.Female + 
    Student.Yes + Married.Yes, data = data.train.bin)

Residuals:
    Min      1Q  Median      3Q     Max 
-195.16  -76.45  -14.03   65.33  285.30 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)   -529.16392   26.85954 -19.701   <2e-16 ***
Income          -7.87666    0.27946 -28.186   <2e-16 ***
Rating           4.01180    0.06331  63.370   <2e-16 ***
Age             -0.77016    0.35638  -2.161   0.0315 *  
Gender.Female  -17.32103   11.89326  -1.456   0.1464    
Student.Yes    418.87031   19.34645  21.651   <2e-16 ***
Married.Yes    -18.73773   12.14860  -1.542   0.1241    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 102.6 on 294 degrees of freedom
Multiple R-squared:  0.9517,  Adjusted R-squared:  0.9507 
F-statistic: 964.8 on 6 and 294 DF,  p-value: < 2.2e-16
When executed, stepAIC() prints the results of all iterations, showing how the model AIC (displayed in the AIC column in ascending order) changes when the most insignificant feature is dropped in each step. (The four initial Step: AIC=2802.25 lines simply remove, one at a time, the four aliased dummy variables, Ethnicity.Asian, Married.No, Student.No, and Gender..Male, whose coefficients were NA in CHUNK 15; removing them changes nothing.)
For instance, in the first iteration, dropping Ethnicity.Caucasian ("- Ethnicity.Caucasian" in the first row of the table means deleting Ethnicity.Caucasian) leads to the lowest model AIC (= 2800.4), down from the current AIC of 2802.25 (the line corresponding to <none> is the current model, where none of the variables are dropped). In the second iteration, Education is dropped next to produce the lowest AIC.
Overall, the features are dropped in the following order: Ethnicity.Caucasian, Education, Cards, Ethnicity.African.American, and Income:Student.Yes, after which the process terminates.
The following six features remain in the final model: Income, Rating, Age, Gender.Female, Student.Yes, and Married.Yes.
All of the coefficient estimates in the final model, except those of Gender.Female and Married.Yes, are statistically significant, and the retention of Income, Rating, and Student.Yes agrees with what we found in Tasks 2 and 3.
Note
- Candidates should "run the final model" after performing stepwise model selection and "check that everything looks right." You can make simple comments like how many or which features are in the final model, whether their retention matches what you observed in the data exploration part (not always the case), and whether the coefficient estimates are statistically significant (not always the case, especially with the AIC).
Exercise: Stepwise selection for non-binarized data
In CHUNK 16 above, we performed backward stepwise selection for the linear model fitted to the binarized training set. Adapt the code to perform backward stepwise selection for the linear model fitted to the non-binarized training set. Pay attention to the Ethnicity variable.
Solution: Stepwise selection for non-binarized data
The linear model fitted to the non-binarized training set is constructed in CHUNK 10 and named model.full. We now apply the stepAIC() function to this lm object and see what happens in the first iteration.
# CHUNK 17
model.no.binarize <- stepAIC(model.full, steps = 1)
Start:  AIC=2802.25
Balance ~ Income + Rating + Cards + Age + Education + Gender + 
    Student + Married + Ethnicity + Income:Student

                 Df Sum of Sq      RSS    AIC
- Ethnicity       2      7042  3077014 2798.9
- Education       1      2278  3072249 2800.5
- Cards           1      4052  3074024 2800.6
- Income:Student  1     10654  3080626 2801.3
<none>                         3069972 2802.3
- Gender          1     24859  3094830 2802.7
- Married         1     28946  3098918 2803.1
- Age             1     50885  3120857 2805.2
- Rating          1  41548705 44618677 3605.9

Step:  AIC=2798.94
Balance ~ Income + Rating + Cards + Age + Education + Gender + 
    Student + Married + Income:Student
Without binarization, the entire Ethnicity variable, which is associated with two degrees of freedom (look at the Df column), one for each of the two non-baseline levels, is treated as a single predictor and considered for dropping as a whole. Although this does not make a material difference in this case study, in other cases it may not be desirable, as we would be ignoring the possibility that only one of the two non-baseline levels of Ethnicity is insignificant while the other is significant (with respect to the baseline).
Forward Selection with the BIC
Performing forward selection requires more coding effort. Besides the additional option direction = "forward" to indicate that we are doing forward selection, we have to specify, in the scope argument of the stepAIC() function, a two-component list defining the range of models to be examined in the stepwise selection process. The most sophisticated model, often the full model with all of the features, is given in the upper component, while the most primitive model, usually the null model, is given in the lower component. To override the AIC default and use the BIC as the selection criterion, we add the option k = log(nrow(data.train.bin)) in the stepAIC() function.
Now run CHUNK 18 to select the final model based on forward selection and the BIC, save it as an lm object named model.forward.BIC, and print a summary.
# CHUNK 18
# first fit the null model (i.e., model with no predictors)
model.null <- lm(Balance ~ 1, data = data.train.bin)
model.forward.BIC <- stepAIC(model.null,
                             direction = "forward",
                             scope = list(upper = model.full.bin, lower = model.null),
                             k = log(nrow(data.train.bin)))
Start:  AIC=3698.12
Balance ~ 1

                             Df Sum of Sq      RSS    AIC
+ Rating                      1  47082983 16908423 3303.2
+ Income                      1  12910007 51081399 3636.0
+ Student.No                  1   5425335 58566071 3677.2
+ Student.Yes                 1   5425335 58566071 3677.2
<none>                                    63991406 3698.1
+ Cards                       1    163291 63828115 3703.1
+ Gender.Female               1    127201 63864205 3703.2
+ Gender..Male                1    127201 63864205 3703.2
+ Age                         1     79880 63911526 3703.5
+ Education                   1     13121 63978284 3703.8
+ Married.Yes                 1     10920 63980486 3703.8
+ Married.No                  1     10920 63980486 3703.8
+ Ethnicity.Asian             1      3049 63988357 3703.8
+ Ethnicity.Caucasian         1      2324 63989082 3703.8
+ Ethnicity.African.American  1         0 63991406 3703.8

Step:  AIC=3303.21
Balance ~ Rating

                             Df Sum of Sq      RSS    AIC
+ Income                      1   8650956  8257467 3093.2
+ Student.No                  1   4817765 12090659 3208.0
+ Student.Yes                 1   4817765 12090659 3208.0
+ Age                         1    707560 16200863 3296.1
<none>                                    16908423 3303.2
+ Married.Yes                 1    174547 16733876 3305.8
+ Married.No                  1    174547 16733876 3305.8
+ Education                   1     71402 16837022 3307.6
+ Ethnicity.Asian             1     49620 16858804 3308.0
+ Gender.Female               1     31887 16876536 3308.4
+ Gender..Male                1     31887 16876536 3308.4
+ Cards                       1     30049 16878374 3308.4
+ Ethnicity.Caucasian         1     18541 16889883 3308.6
+ Ethnicity.African.American  1      4304 16904120 3308.8

Step:  AIC=3093.2
Balance ~ Rating + Income

                             Df Sum of Sq      RSS    AIC
+ Student.No                  1   5069114  3188353 2812.5
+ Student.Yes                 1   5069114  3188353 2812.5
<none>                                     8257467 3093.2
+ Married.Yes                 1    127904  8129563 3094.2
+ Married.No                  1    127904  8129563 3094.2
+ Age                         1     95208  8162259 3095.4
+ Education                   1     44496  8212971 3097.3
+ Ethnicity.Asian             1     40115  8217352 3097.4
+ Cards                       1     22019  8235448 3098.1
+ Ethnicity.African.American  1      9464  8248003 3098.6
+ Ethnicity.Caucasian         1      8003  8249464 3098.6
+ Gender.Female               1       244  8257223 3098.9
+ Gender..Male                1       244  8257223 3098.9

Step:  AIC=2812.47
Balance ~ Rating + Income + Student.No

                             Df Sum of Sq     RSS    AIC
<none>                                    3188353 2812.5
+ Age                         1     46257 3142096 2813.8
+ Gender.Female               1     24128 3164225 2815.9
+ Gender..Male                1     24128 3164225 2815.9
+ Married.Yes                 1     23759 3164595 2815.9
+ Married.No                  1     23759 3164595 2815.9
+ Ethnicity.Caucasian         1      4275 3184078 2817.8
+ Ethnicity.African.American  1      2346 3186007 2817.9
+ Education                   1      1600 3186753 2818.0
+ Cards                       1       865 3187488 2818.1
+ Ethnicity.Asian             1       747 3187606 2818.1
summary(model.forward.BIC)
Call:
lm(formula = Balance ~ Rating + Income + Student.No, data = data.train.bin)

Residuals:
     Min       1Q   Median       3Q      Max 
-220.292  -78.120   -7.473   63.273  299.895 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -167.99297   23.70608  -7.086    1e-11 ***
Rating         4.01634    0.06353  63.223   <2e-16 ***
Income        -7.97413    0.27691 -28.797   <2e-16 ***
Student.No  -421.17339   19.38206 -21.730   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 103.6 on 297 degrees of freedom
Multiple R-squared:  0.9502,  Adjusted R-squared:  0.9497 
F-statistic:  1888 on 3 and 297 DF,  p-value: < 2.2e-16
The output parallels that in CHUNK 16 and shows how the BIC changes as candidate features are added one at a time; at each iteration, the feature whose addition most reduces the BIC appears in the first row of the table. In the first iteration, the addition of Rating ("+ Rating" in the first row means adding Rating) leads to the smallest BIC (3303.2, compared with 3698.12 for the null model). Notice that the AIC column in fact carries the values of the BIC for the different linear models because we have overridden the AIC default.
Features are added in the following order: Rating, Income, Student.No. Not surprisingly, the resulting model, with only these three features, is simpler than the six-feature model returned by backward selection with the AIC. All three features are statistically significant.
Model Evaluation: Performance Metrics
TASK 8: Select and validate the recommended model
- Evaluate the RMSE of the models in the preceding subsection against the test set.
- Make a recommendation as to which model should be used.
- Generate and interpret diagnostic plots for the recommended model to check the model assumptions.
Select a model: Which model is most predictive?
Now that we have fitted linear models of varying complexity to the training set, it is time to compare their predictive performance on the common test set and see which one is the most predictive. We are particularly interested in whether reducing the complexity of a linear model by stepwise selection can enhance prediction accuracy. A natural performance measure for linear models, where the target variable is numeric, is the test RMSE (its definition is recalled below): the smaller the test RMSE, the better. CHUNK 19 computes and prints the test RMSE of the four models we have constructed.
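For reference, if \(y_i\) and \(\hat{y}_i\) denote the observed and predicted values of Balance for the \(i\)th observation in a test set of size \(n_{\text{test}}\), the test RMSE is

\[\text{RMSE} = \sqrt{\frac{1}{n_{\text{test}}} \sum_{i=1}^{n_{\text{test}}} \left(y_i - \hat{y}_i\right)^2}.\]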
# CHUNK 19
RMSE(data.test$Balance, predict(model.null, newdata = data.test.bin))
RMSE(data.test$Balance, predict(model.full.bin, newdata = data.test.bin))
RMSE(data.test$Balance, predict(model.backward.AIC, newdata = data.test.bin))
RMSE(data.test$Balance, predict(model.forward.BIC, newdata = data.test.bin))
> RMSE(data.test$Balance, predict(model.null, newdata = data.test.bin))
[1] 453.3665
> RMSE(data.test$Balance, predict(model.full.bin, newdata = data.test.bin))
[1] 105.0656
> RMSE(data.test$Balance, predict(model.backward.AIC, newdata = data.test.bin))
[1] 104.2375
> RMSE(data.test$Balance, predict(model.forward.BIC, newdata = data.test.bin))
[1] 102.6785
For your convenience, the following table reports the test RMSE extracted from the R output for each of the four models.
Model | Test RMSE
Null model (benchmark) | 453.3665 |
Full model | 105.0656 |
Final model based on backward selection and AIC | 104.2375 |
Final model based on forward selection and BIC | 102.6785 (smallest) |
Observe that the two reduced models resulting from stepwise selection have a (slightly) lower test RMSE than the full model despite having far fewer features. Comparing the two reduced models, the model based on forward selection and the BIC is not only simpler, in the sense that it is nested within the model based on backward selection and the AIC, but also more predictive, in the sense that its test RMSE is lower.
On the basis of both prediction accuracy and interpretability, the final model based on forward selection and the BIC is justifiably the most promising of all the linear models we have considered, and we recommend using this model.
Stage 5: Model Validation
Model Diagnostics
Now that the model based on forward selection and the BIC appears to be the best among the four linear models, let’s diagnose it carefully for any anomalies with the use of appropriate graphical displays in CHUNK 20.
When the plot() function (this is one of the rare instances in Exam PA in which we make use of R’s base graphics functions) is applied to an lm object (or a glm object) such as model.forward.BIC, four diagnostic plots are generated, the first two of which are the most easily interpreted.
Diagnostic plots for the final model based on forward selection and the BIC fitted to the training set of the Credit data.
# CHUNK 20
plot(model.forward.BIC)
“Residuals vs Fitted” Plot
The “Residuals vs Fitted” plot graphs the residuals of the model against the fitted values on the training set. If the linear model has sufficiently captured the true relationship between the target variable and the predictors, then the residuals should show no discernible patterns and be dispersed erratically around zero.
The plot also serves as an informal test of the homogeneity of error variance. If the fluctuation of the residuals increases with the magnitude of the fitted values, exhibiting a funnel shape, then the homoscedasticity assumption is in doubt.
Unfortunately, while the homoscedasticity assumption does not seem to be violated, there is a clear U-shape in the “Residuals vs Fitted” plot, suggesting that there are non-linear patterns, beyond interaction effects, that the fitted model fails to accommodate. One may try to refine the model by incorporating polynomial terms in Rating (although polynomial relationships don’t seem very pronounced) and interaction terms such as Income × Rating; a sketch of one such refinement follows. (Exercise: Try this and see how much the prediction accuracy improves. The improvement can be substantial!)
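Here is a minimal sketch of one possible refinement, assuming the binarized training and test sets data.train.bin and data.test.bin and the RMSE() function used in earlier chunks; the object name model.refined and the particular terms added are illustrative choices, not a prescribed solution.

# A possible refinement (illustrative only): add a squared Rating term
# and an Income:Rating interaction to the recommended model
model.refined <- lm(Balance ~ Rating + I(Rating^2) + Income + Student.No +
                      Income:Rating,
                    data = data.train.bin)
# Re-examine the test RMSE of the refined model
RMSE(data.test$Balance, predict(model.refined, newdata = data.test.bin))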
“Normal Q-Q” Plot
The “Normal Q-Q” plot graphs the quantiles of the standardized residuals against the theoretical standard normal quantiles and can be used to check the normality assumption. If the points lie close to the 45° line through the origin, the normality assumption is supported. In CHUNK 20, the Q-Q plot shows that the normal distribution assumption holds for most observations, but not in the two extreme tails. This suggests using a response distribution with a fatter tail than the normal distribution, the topic of Chapter 4.
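As an aside, if you wish to examine one diagnostic panel at a time rather than all four, base R’s plot() method for lm objects accepts a which argument:

# Reproduce individual diagnostic panels (optional)
plot(model.forward.BIC, which = 1)  # "Residuals vs Fitted" only
plot(model.forward.BIC, which = 2)  # "Normal Q-Q" only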
Stage 6: Model Maintenance
Stage 4: Model Construction and Evaluation (Using Regularization)
TASK 9: Effects of regularization
Elastic net could also be used to perform feature selection.
- Explain whether using alpha = 0 is appropriate for this case study.
- Regardless of your conclusion, use elastic net with alpha equal to 0.5 to fit a regularized regression model with the following values of lambda: 0, 10, 100, 500, 1000. State your observations. Be sure to use the same variables (after releveling) as the full model in Task 5.
Model Construction
An alternative to stepwise selection for identifying useful features is regularization, which shrinks the magnitude of the coefficient estimates via a penalty term and serves to prevent overfitting. Variables with limited predictive power receive small coefficient estimates, and those whose estimates are shrunk exactly to zero are effectively removed from the model.
In this subsection, we will fit a regularized regression model with alpha equal to 0.5 (i.e., an even mix between ridge regression and lasso) to the Credit dataset and compare its performance to that of the models based on stepwise selection. Because our goal in this case study is to identify key factors affecting Balance, using alpha equal to 0, which corresponds to ridge regression and does not eliminate any variables, is not appropriate.
Fitting Regularized Regression Models: The glmnet() function.
To implement regularization, we will use the glmnet package (available on Exam PA), whose main function is glmnet(). This function can be used to fit ridge regression models, lasso models, and elastic nets more generally, but its syntax differs from that of the lm() function we have been using. Instead of directly inputting a formula of the form Y ~ X1 + ... + Xp, we have to pass in a data matrix x (the design matrix) carrying the values of the features and a response vector y. The design matrix can be created from a given formula and data frame using the model.matrix() function.
Run CHUNK 21 to set up the design matrix for the training part of the Credit dataset and look at its first six rows. The formula should involve all of the potentially useful features, including the interaction between Income and Student, not just the predictive features identified in Task 7. (Remember, an important aim of regularization is to reduce model complexity and, in some cases, select features. Performing regularization only on features already known to be predictive runs counter to this aim!)
# CHUNK 21
X.train <- model.matrix(Balance ~ . + Income:Student, data = data.train)
head(X.train)  # print out a few rows of the design matrix
  (Intercept)  Income Rating Cards Age Education Gender Male StudentYes MarriedNo EthnicityAfrican American EthnicityAsian Income:StudentYes
1           1  14.891    283     2  34        11           1          0         0                         0              0             0.000
2           1 106.025    483     3  82        15           0          1         0                         0              1           106.025
3           1 104.593    514     4  71        11           1          0         1                         0              1             0.000
6           1  80.180    569     4  77        10           1          0         1                         0              0             0.000
8           1  71.408    512     2  87         9           1          0         1                         0              1             0.000
9           1  15.125    266     5  66        13           0          0         1                         0              0             0.000
A particular advantage of the model.matrix() function is that categorical predictors are automatically transformed into dummy variables, just as with the dummyVars() function, and the resulting design matrix is of full rank. This up-front conversion is important because, unlike the lm() function, the glmnet() function works only on quantitative inputs, not character or factor variables. In other words, explicit binarization of categorical variables is required for the glmnet() function to work.
As an example, you can see that Ethnicity has been automatically converted into two dummy variables corresponding to “African American” and “Asian” (remember that “Caucasian” is the baseline level of Ethnicity after the releveling in Task 5).
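If you want to double-check the encoding, a quick command (not one of the numbered chunks) lists the columns of the design matrix:

# Optional check: column names of the design matrix
colnames(X.train)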
Once a design matrix has been set up, it can be passed to the glmnet() function along with the vector of response values and the values of the hyperparameters α and λ. Run CHUNK 22 to fit five regularized regression models with α = 0.5 and λ = 0, 10, 100, 500, 1000 (one model for each value of λ) to the training set, and save the five fitted models collectively as an object named m.lambda. Note that the family argument is set to "gaussian" because we are dealing with linear models with normal errors.
# CHUNK 22
# install.packages("glmnet")  # uncomment this line the first time you use glmnet
library(glmnet)

m.lambda <- glmnet(x = X.train, y = data.train$Balance,
                   family = "gaussian",
                   lambda = c(0, 10, 100, 500, 1000),
                   alpha = 0.5)
To appreciate the effect of regularization on the fitted model, it is informative to extract the coefficient estimates and see how they vary with the value of λ. This can be done in two ways:

- (Suggested in the PA e-learning modules) A glmnet object is a list, two components of which are:
  - a0: a vector of length length(lambda) hosting the intercept estimates, one for each value of λ.
  - beta: a matrix of coefficient estimates for the predictors, excluding the intercept, arranged by columns according to the different values of λ.

We then use two separate commands to access the estimated intercept and slope coefficients.
# CHUNK 22 (Cont.)
# first way to extract coefficient estimates
m.lambda$a0
       s0        s1        s2        s3        s4
 520.4352  301.2973 -224.9669 -506.0179 -579.1925
m.lambda$beta
12 x 5 sparse Matrix of class "dgCMatrix"
                           s0        s1         s2           s3          s4
(Intercept)                 .         .          .            .           .
Income                      .         .          .  -6.63360837  -7.9320588
Rating                      . 0.6182676   2.034434   3.71042914   4.0114839
Cards                       .         .          .   0.51433943   2.9161946
Age                         .         .          .  -0.70550849  -0.7885101
Education                   .         .          .            .   0.8874960
Gender Male                 .         .          .   3.08510246  18.3671409
StudentYes                  .         . 228.762987 396.62014836 396.2181065
MarriedNo                   .         .          .   9.36968005  20.4646766
EthnicityAfrican American   .         .          .            . -11.9281110
EthnicityAsian              .         .          .            .  -5.3983474
Income:StudentYes           .         .          .   0.01473463   0.4929168
- (Easier, and suggested in ISLR) Applying the coef() function to a glmnet object returns a matrix of coefficient estimates, including the intercept, arranged by columns according to the different values of λ.
# CHUNK 22 (Cont.)
# second way
round(coef(m.lambda), digits = 4)
13 x 5 sparse Matrix of class "dgCMatrix"
                                s0       s1        s2        s3        s4
(Intercept)               520.4352 301.2973 -224.9669 -506.0179 -579.1925
(Intercept)                      .        .         .         .         .
Income                           .        .         .   -6.6336   -7.9321
Rating                           .   0.6183    2.0344    3.7104    4.0115
Cards                            .        .         .    0.5143    2.9162
Age                              .        .         .   -0.7055   -0.7885
Education                        .        .         .         .    0.8875
Gender Male                      .        .         .    3.0851   18.3671
StudentYes                       .        .  228.7630  396.6201  396.2181
MarriedNo                        .        .         .    9.3697   20.4647
EthnicityAfrican American        .        .         .         .  -11.9281
EthnicityAsian                   .        .         .         .   -5.3983
Income:StudentYes                .        .         .    0.0147    0.4929
Regardless of which way you access the coefficient estimates, the values of λ are arranged in descending order across the five columns from left to right, with s0, s1, s2, s3, and s4 representing λ = 1000, 500, 100, 10, 0, respectively. Dots in the output represent features whose coefficient estimates are exactly zero and which are thus dropped from the model. Looking at the “paths” of the coefficients as λ increases from 0 through 10, 100, and 500 to 1000 (right to left), we can see that the effect of regularization becomes more and more pronounced, and more and more coefficient estimates shrink to zero, an illustration of the feature selection property of the elastic net.
- When λ = 0, the coefficient estimates are (almost!) identical to the ordinary least squares estimates in CHUNK 12. This comes as no surprise because running regularized regression with λ = 0 is the same as performing ordinary least squares estimation.
- When λ = 10, Education, EthnicityAfrican American, and EthnicityAsian are dropped due to their statistical insignificance. As we saw in CHUNK 12, these three features are highly insignificant. The coefficient estimates of the other features also become smaller in absolute value (e.g., from 18.3671 to 3.0851 for Gender Male).
- When λ = 100, regularization becomes considerably stronger and many more features are eliminated. Only Rating and StudentYes are retained.
- When λ = 500, regularization is overwhelming and only Rating remains. From CHUNK 12, Rating is the most significant feature, with the largest t-statistic.
- When λ = 1000, the regularization penalty is so large that all of the coefficient estimates except the intercept become exactly zero, in which case the model is effectively the null model with only the intercept. The estimated intercept is \(\hat{\beta}_0 = 520.4352\), which is also the mean of Balance in the training set, as you can check using the command mean(data.train$Balance).
# CHUNK 22 (Cont.)
mean(data.train$Balance)
[1] 520.4352
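Although not part of the numbered chunks, one way to visualize these coefficient paths over a fine grid of λ values is to let glmnet() generate its own λ sequence and then apply the plot() method for glmnet objects (a sketch; the object name m.path is illustrative):

# Refit over glmnet's automatic lambda grid and plot the coefficient paths
m.path <- glmnet(x = X.train, y = data.train$Balance,
                 family = "gaussian", alpha = 0.5)
plot(m.path, xvar = "lambda", label = TRUE)  # coefficients vs. log(lambda)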
TASK 10: Perform feature selection with regularization
- Use cross-validation to set the value of lambda for fitting the regularized model (alpha = 0.5) in Task 9.
- List the features used in the resulting model. Compare the selected features to those using stepwise regression in Task 7. Comment on any differences.
- Calculate the RMSE on the test set.
- Recommend which model should be used. Justify your choice.
- Suggest another method to set the value of lambda. There is no need to implement this method.
Hyperparameter tuning: The cv.glmnet() function.
The glmnet() function requires that value(s) of λ be supplied a priori and does not directly tell us the optimal value of λ. In fact, in Task 9 we experimented with five pre-specified values of λ, mainly to explore the impact of regularization, but did not attempt to optimize λ. Fortunately, the companion function cv.glmnet() (with “cv” standing for “cross-validation”) fills this gap. This function performs k-fold cross-validation (10-fold, by default), calculates the cross-validation error for a range of values of λ, and produces the optimal value of λ as a by-product.
The cv.glmnet() function is commonly accompanied by an application of the plot() function to the resulting cv.glmnet object. This returns a cross-validation curve, that is, the cross-validation error plotted as a function of the values of λ used. Such a plot has several useful features:
- The red dots represent the cross-validation errors corresponding to a range of values of λ (or log λ, for plotting purposes), and the vertical lines around the dots are the standard error bands.
- The numbers at the top of the graph indicate how many features have non-zero coefficient estimates at a given value of λ (or log λ).
- The first vertical dotted line from the left corresponds to the value of λ (or log λ) at which the cross-validation error is minimized. To extract this optimal value of λ directly, we can take out the lambda.min component of a cv.glmnet object, which is a list.
- The second vertical dotted line is not discussed in the PA e-learning modules, but was used in the June 16 and 19, 2020 PA exams. It has to do with the one-standard-error rule, which you may have come across when studying for Exam SRM. In the current context, the rule selects the greatest value of λ, or equivalently, the simplest regularized regression model, whose cross-validation error is within one standard error of the minimum error (remember, the higher the value of λ, the more regularization there is and the simpler the model becomes). The idea is that models within one standard error of the minimum have more or less the same predictive performance, so we may opt for the simplest among them. The value of λ based on the one-standard-error rule is contained in the lambda.1se (where “1se” stands for “one standard error”) component of a cv.glmnet object.
Exercise (An illustration of the one-standard-error rule for regularized regression)
You are fitting regularized linear models to the same dataset using a grid of five λ values. The following table shows the cross-validation error and the associated standard error for each λ value.
λ | Cross-Validation Error | Standard Error |
0.5 | 10.4512 | 2.4931 |
1.0 | 4.1872 | 1.9462 |
1.5 | 2.9745 | 1.2476 |
2.0 | 3.1968 | 1.5390 |
2.5 | 4.6403 | 1.9473 |
Determine the best value of λ using the one-standard-error rule.
Solution
The smallest cross-validation error is attained when λ = 1.5. Since 2.9745 + 1.2476 = 4.2221 and the simplest model (remember, the larger the value of λ, the simpler the regularized model) whose cross-validation error is lower than 4.2221 corresponds to λ = 2.0, the best value of λ to use according to the one-standard-error rule is λ = 2.0.
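The same calculation is easy to script. Here is a minimal sketch for the table above (the vector names lambda, cv.error, and std.error are illustrative):

# One-standard-error rule applied to the tabulated values
lambda    <- c(0.5, 1.0, 1.5, 2.0, 2.5)
cv.error  <- c(10.4512, 4.1872, 2.9745, 3.1968, 4.6403)
std.error <- c(2.4931, 1.9462, 1.2476, 1.5390, 1.9473)
threshold <- min(cv.error) + std.error[which.min(cv.error)]  # 4.2221
max(lambda[cv.error <= threshold])  # returns 2.0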
Fitting the regularized model with λ = lambda.min
Now let’s run CHUNK 23 to plot the cross-validation error curve for our regularized model. Note that a random seed is set prior to the call of the cv.glmnet() function to make the choice of the cross-validation folds reproducible (this random seed need not be the same as that for the createDataPartition() function in CHUNK 9); remember that cross-validation involves randomly dividing the training data into 10 folds.
# CHUNK 23
set.seed(1111)

m <- cv.glmnet(x = X.train, y = data.train$Balance,
               family = "gaussian", alpha = 0.5)
plot(m)
m$lambda.min
m$lambda.1se
> m$lambda.min
[1] 0.8885539
> m$lambda.1se
[1] 7.550517
The value of λ that minimizes the cross-validation error is 0.8885539. Such a small value means that hardly any regularization is done to shrink the coefficient estimates. In fact, at this value none of the eleven features is eliminated.
We now refit the regularized model on the training set with λ equal to 0.8885539, save it as an object named m.min, make predictions on the test set, and finally evaluate the model’s test RMSE. All of this is done in CHUNK 24. Notice that when the predict() function is applied to a glmnet object, the newx argument should be used instead of the newdata argument, and the model matrix for the test set should be supplied. To learn more, run ?predict.glmnet.
# CHUNK 24
# Fit the regularized model using lambda.min
m.min <- glmnet(x = X.train, y = data.train$Balance,
                family = "gaussian",
                lambda = m$lambda.min, alpha = 0.5)
m.min$beta
12 x 1 sparse Matrix of class "dgCMatrix"
                                   s0
(Intercept)                  .
Income                      -7.8131829
Rating                       3.9839963
Cards                        2.6915109
Age                         -0.7829239
Education                    0.7557949
Gender Male                 17.0049498
StudentYes                 396.3329355
MarriedNo                   19.4182020
EthnicityAfrican American   -9.9708020
EthnicityAsian              -3.6479011
Income:StudentYes            0.4458288
# Set up the design matrix for the test set
X.test <- model.matrix(Balance ~ . + Income:Student, data = data.test)

# Make predictions on the test set and calculate the test RMSE
m.min.pred <- predict(m.min, newx = X.test)
RMSE(data.test$Balance, m.min.pred)
[1] 103.9201
As we saw in Figure 3.4.7, all of the eleven features are retained at the lambda.min value. The test RMSE of the resulting model is 103.9201, which is slightly higher than that of the final model in Task 7 (which is 102.6785).
In addition to inferior predictive performance, the regularized model is also much harder to interpret due to its substantially larger size, whereas the model in Task 7 only has three features. As a result, it is reasonable to recommend the model in Task 7 as the final model owing to its better prediction accuracy and interpretability.
Fitting the regularized model with λ = lambda.1se
Instead of using the value of λ that minimizes the cross-validation error, we can also select λ based on the one-standard-error rule and set it equal to lambda.1se, or 7.550517 (which is much larger than lambda.min). The task statement says that “there is no need to implement this method,” so if this were a real exam task, we could simply describe the one-standard-error rule and stop. For the purposes of illustration, let’s nonetheless fit the resulting regularized model and see how it performs. CHUNK 25 below is a version of CHUNK 24 with m.min and lambda.min changed to m.1se and lambda.1se, respectively.
# CHUNK 25
# Fit the regularized model using lambda.1se
m.1se <- glmnet(x = X.train, y = data.train$Balance,
                family = "gaussian",
                lambda = m$lambda.1se, alpha = 0.5)

# List the coefficient estimates
m.1se$beta
12 x 1 sparse Matrix of class "dgCMatrix"
                                   s0
(Intercept)                  .
Income                      -6.9378441
Rating                       3.7813573
Cards                        1.1075391
Age                         -0.7305808
Education                    .
Gender Male                  6.7717419
StudentYes                 397.5199346
MarriedNo                   11.6768447
EthnicityAfrican American    .
EthnicityAsian               .
Income:StudentYes            0.1148304
# Make predictions on the test set and calculate the test RMSE
m.1se.pred <- predict(m.1se, newx = X.test)
RMSE(data.test$Balance, m.1se.pred)
[1] 102.8241
The fitted model has eight features, consistent with what we saw in Figure 3.4.7; Education, EthnicityAfrican American, and EthnicityAsian are removed. The test RMSE has dropped to 102.8241, a slight improvement over that of the model based on lambda.min.
However, the model based on the one-standard-error rule still cannot compete with the model in Task 7 in terms of prediction accuracy or interpretability. This case study therefore further reinforces the message that to construct an effective predictive model, it is necessary to choose the right level of model complexity and select the right features; putting all variables into your model does not always help! For convenience, the test RMSEs of all models considered in this case study are collected below.
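Model | Test RMSE
Null model (benchmark) | 453.3665
Full model | 105.0656
Backward selection with AIC | 104.2375
Forward selection with BIC (recommended) | 102.6785
Elastic net, α = 0.5, λ = lambda.min | 103.9201
Elastic net, α = 0.5, λ = lambda.1se | 102.8241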
Note
If you are asked to tune a regularized regression model on the exam, the default is to select the value of λ that minimizes the cross-validation error (i.e., take λ = lambda.min). The one-standard-error rule is certainly valid, but it is a prudent exam tactic to write what the SOA graders expect, unless you are specifically asked for an alternative tuning strategy.
Remark: How to choose the optimal value of α?
As powerful as the cv.glmnet() function is, it only allows us to pick the best value of λ for a given value of α; it has no built-in mechanism for tuning α. To optimize over α, we have to resort to trial and error and proceed as follows (see the sketch after this list):
- Set up a grid of values of α. In an exam setting, it is sensible to experiment with a few values of α, e.g., 0, 0.5, 1. If more values are desired, use a for loop.
- For each value of α, select the corresponding optimal value of λ and evaluate the corresponding cross-validation error.
- Adopt the value of α that produces the smallest cross-validation error. (Note: Instead of cross-validation error, here it is fine and perhaps easier coding-wise to choose the value of α that produces the smallest training error because α is not directly related to model complexity.)
In essence, we break down the two-variable optimization problem by first minimizing over λ for a given α, then minimizing over α.
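Below is a minimal sketch of this trial-and-error procedure under the assumptions of this case study; the object names folds, alphas, and cv.errors are illustrative. Fixing the fold assignments via the foldid argument of cv.glmnet() makes the cross-validation errors comparable across values of α.

# Tune alpha by looping over a small grid (illustrative sketch)
set.seed(1111)
folds <- sample(rep(1:10, length.out = nrow(X.train)))  # common folds for all alphas
alphas <- c(0, 0.5, 1)
cv.errors <- sapply(alphas, function(a) {
  m.a <- cv.glmnet(x = X.train, y = data.train$Balance,
                   family = "gaussian", alpha = a, foldid = folds)
  min(m.a$cvm)  # CV error at the best lambda for this alpha
})
alphas[which.min(cv.errors)]  # adopted value of alpha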