[mathjax]
Regularization
What is regularization?
- Reduce model complexity: Reduces the magnitude of the coefficient estimates via the use of a penalty term and serves to prevent overfitting.
- An alternative to using stepwise selection for identifying useful features.
How does regularization work?
- Variables with limited predictive power receive coefficient estimates that are small, if not exactly zero; features whose coefficients are shrunk exactly to zero are effectively removed from the model.
Choice of α
- If the goal is to identify key factors affecting the target variable, then using α = 0 (ridge regression), which does not eliminate any variables, is not appropriate.
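For reference, here is a sketch of the penalized objective that elastic net regularization minimizes (written for a normal target, following the scaling conventions used by glmnet; α = 0 gives ridge regression and α = 1 gives the lasso):
\[\frac{1}{2n_{tr}}\sum_{i=1}^{n_{tr}}\Big(y_i-\beta_0-\sum_{j=1}^{p}\beta_jX_{ij}\Big)^2+\lambda\Big[\frac{1-\alpha}{2}\sum_{j=1}^{p}\beta_j^2+\alpha\sum_{j=1}^{p}|\beta_j|\Big]\]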
Interactions
Interpretation
There is a significant interaction between [A] and [B], meaning that:
- The effect of [A] on [Y] varies for … with and without [B].
3.2 Linear Models Case Study 2: Feature Selection and Regularization
Learning Objectives
After completing this case study, you should be able to:
- Fit a multiple linear regression model with both numeric and categorical (factor) predictors.
- Detect and accommodate interactions between predictors which can be quantitative or qualitative.
- Perform explicit binarization of categorical predictors and understand why doing so may be beneficial.
library(caret)
dummyVars()
- Perform stepwise selection and be familiar with the different options allowed by this function.
library(MASS)
stepAIC()
- Generate and interpret diagnostic plots for a linear model.
- Implement regularized regression.
library(glmnet)
glmnet()
cv.glmnet()
Stage 1: Define the Business Problem
Identify and interpret key factors that relate to a higher or lower Balance.
Stage 2: Data Collection
Data Design
1. Read in data and remove irrelevant variables.
# CHUNK 1
library(ISLR)
data(Credit)
Credit$ID <- NULL
Data Description
Predictors | Variables |
Categorical – Binary | Gender, Student, Married |
Categorical – 3 levels | Ethnicity |
Numerical | Income, Limit, Rating, Cards, Age, Education |
TASK 1: Consider a data issue
Data Quality Issue
Using Ethnicity as a predictor may be sensitive: making ethnicity-based predictions can be criticized on grounds of discrimination and may raise legal concerns.
Stage 3: Exploratory Data Analysis
TASK 2: Explore the numeric variables
Descriptive Statistics: Numeric Predictors
Always explore the distribution of the target variable first. Doing so:
- may suggest useful transformations;
- avoids overlooking a skewed distribution of the target variable;
- helps determine which predictive analytic techniques are appropriate.
How to explore the skewness?
summary()
summary(Credit)
     Income           Limit           Rating          Cards            Age          Education        Gender    Student   Married              Ethnicity      Balance
 Min.   : 10.35   Min.   :  855   Min.   : 93.0   Min.   :1.000   Min.   :23.00   Min.   : 5.00   Male  :193   No :360   No :155   African American: 99   Min.   :   0.00
 1st Qu.: 21.01   1st Qu.: 3088   1st Qu.:247.2   1st Qu.:2.000   1st Qu.:41.75   1st Qu.:11.00   Female:207   Yes: 40   Yes:245   Asian           :102   1st Qu.:  68.75
 Median : 33.12   Median : 4622   Median :344.0   Median :3.000   Median :56.00   Median :14.00                                    Caucasian       :199   Median : 459.50
 Mean   : 45.22   Mean   : 4736   Mean   :354.9   Mean   :2.958   Mean   :55.67   Mean   :13.45                                                           Mean   : 520.01
 3rd Qu.: 57.47   3rd Qu.: 5873   3rd Qu.:437.2   3rd Qu.:4.000   3rd Qu.:70.00   3rd Qu.:16.00                                                           3rd Qu.: 863.00
 Max.   :186.63   Max.   :13913   Max.   :982.0   Max.   :9.000   Max.   :98.00   Max.   :20.00                                                           Max.   :1999.00
Mean | Median | Key | Conclusion |
520.01 | 459.50 | Mean > Median | The distribution of Balance is slightly skewed to the right |
Graphical Display: Numeric Predictors
Histogram
ggplot(<data>, aes(x = <target_variable>)) + geom_histogram()
# CHUNK 2
library(ggplot2)
ggplot(Credit, aes(x = Balance)) + geom_histogram()

nrow(Credit[Credit$Balance <= 0, ])
[1] 90
Findings
- The distribution of Balance is slightly skewed to the right.
- There are a large number of observations (90, to be exact) for which Balance = 0, making the log transformation unavailable.
  - The square root transformation still works because Balance is non-negative. The distribution of the square root of Balance becomes more symmetric, although the mass at zero remains prominent.
Bivariate Data Exploration: Correlation Matrix
cor()
# CHUNK 3
# Calculate the correlation matrix for numeric variables
# The numeric predictors are in the first 6 columns
cor.matrix <- cor(Credit[, c(1:6, 11)])
print("Correlation Matrix")
round(cor.matrix, digits = 4)

[1] "Correlation Matrix"
           Income   Limit  Rating   Cards     Age Education Balance
Income     1.0000  0.7921  0.7914 -0.0183  0.1753   -0.0277  0.4637
Limit      0.7921  1.0000  0.9969  0.0102  0.1009   -0.0235  0.8617
Rating     0.7914  0.9969  1.0000  0.0532  0.1032   -0.0301  0.8636
Cards     -0.0183  0.0102  0.0532  1.0000  0.0429   -0.0511  0.0865
Age        0.1753  0.1009  0.1032  0.0429  1.0000    0.0036  0.0018
Education -0.0277 -0.0235 -0.0301 -0.0511  0.0036    1.0000 -0.0081
Balance    0.4637  0.8617  0.8636  0.0865  0.0018   -0.0081  1.0000
Findings
- Correlations between the predictors and the target variable:
  - Important predictors:
    - Limit (0.8617) and Rating (0.8636) have very strong positive linear relationships with Balance.
    - Income (0.4637) has a moderately positive linear relationship with Balance.
  - Negligible predictors: the other four numeric variables are weak predictors at best, given their negligible correlations with Balance.
- Correlations between the predictors:
  - Income and Limit (0.7921) → large correlation, suggesting duplication of information.
  - Income and Rating (0.7914) → large correlation, suggesting duplication of information.
  - Limit and Rating (0.9969) → almost perfectly positively correlated → highly likely duplication of information.
Feature Selection: Income and Rating or Limit
Scatterplots
Since Limit and Rating are almost perfectly positively correlated, plot the two variables against each other to confirm their linear relationship.
ggplot(<data>, aes(x = <var>, y = <var>)) + geom_point()

# CHUNK 4
ggplot(Credit, aes(x = Limit, y = Rating)) + geom_point()
Findings
- The two variables are virtually perfectly related in a linear fashion and do not contribute much information when the other variable is present.
  → We will therefore keep one and only one of the two predictors: we retain Rating and delete Limit, since Rating has a slightly higher correlation with Balance.

# Delete Limit
Credit$Limit <- NULL
Feature Selection: Income and Rating
Scatterplots between Numeric Variables and Target Variable
The pairwise correlations in CHUNK 3 are only a measure of the strength of the linear relationships between the numeric variables and Balance.
To further confirm the predictive power of Income and Rating, we use a for loop to make a scatterplot, with a smoothed line superimposed, of Balance against each of the five remaining numeric predictors.

# CHUNK 5
# First save the names of the five numeric predictors as a vector
vars.numeric <- colnames(Credit[, 1:5])
for (i in vars.numeric) {
  plot <- ggplot(Credit, aes(x = Credit[, i], y = Balance)) +
    geom_point() +
    geom_smooth(method = "lm", se = FALSE) +
    labs(x = i)
  print(plot)
}
Findings
- Balance and Rating have an unquestionably strong positive linear relationship.
- Balance and Income have a positive linear relationship.
- Balance and each of Cards, Age, and Education hardly show any systematic patterns.
  - The fitted regression lines are almost flat → Balance is not sensitive to changes in these three predictors.
- All of these findings are consistent with our earlier deductions based on the correlation matrix in CHUNK 3.
TASK 3: Explore the factor variables Investigate the distribution of the factor variables. Identify a factor variable that is likely to predict the target. For this factor variable (there is no need to discuss the other variables in your response), perform the following:
Descriptive Statistics: Categorical Predictors
In CHUNK 6, we produce the mean (and median) of Balance split by the levels of each of the four factor variables.

# CHUNK 6
library(tidyverse)
# Save the names of the categorical predictors as a vector
vars.categorical <- c("Gender", "Student", "Married", "Ethnicity")
for (i in vars.categorical) {
  x <- Credit %>%
    group_by_(i) %>%
    summarise(
      mean = mean(Balance),
      median = median(Balance),
      n = n()
    )
  print(x)
}
# A tibble: 2 × 4 Gender mean median n <fct> <dbl> <int> <int> 1 " Male" 510. 463 193 2 "Female" 530. 456 207 # A tibble: 2 × 4 Student mean median n <fct> <dbl> <dbl> <int> 1 No 480. 424 360 2 Yes 877. 953 40 # A tibble: 2 × 4 Married mean median n <fct> <dbl> <int> <int> 1 No 523. 467 155 2 Yes 518. 454 245 # A tibble: 3 × 4 Ethnicity mean median n <fct> <dbl> <dbl> <int> 1 African American 531 480 99 2 Asian 512. 414 102 3 Caucasian 518. 465 199
Findings
- Student makes a significant difference to the mean of Balance over its two levels, "Yes" and "No":
  - students have a much higher average Balance (876.8250) than non-students (480.3694).
- The mean of Balance is more or less the same across the levels of Gender, Married, and Ethnicity.
  - These three variables appear to be ineffective in differentiating the observations of Balance.
Graphical Display: Categorical Predictors
Split Boxplots
ggplot(<data>, aes(x = <categorical predictors>, y = <target variable>)) + geom_boxplot()
# CHUNK 7
for (i in vars.categorical) {
  plot <- ggplot(Credit, aes(x = Credit[, i], y = Balance)) +
    geom_boxplot() +
    labs(x = i)
  print(plot)
}
Findings
- The box for students is noticeably higher than that for non-students, with a similar spread.
- The boxes are very similar across the levels of each of Gender, Married, and Ethnicity.
All in all, Student seems to be the only factor variable with significant predictive power.
Feature Selection: Student
Overall, the most promising predictors for Balance are: Income (numeric), Rating (numeric), and Student (factor).
TASK 4: Consider two graphs Your assistant has built the following two plots and asked for your comments:
Run the code to make them.
Occasionally on the exam, you are asked to focus on specific graphs that highlight subtle but important aspects of the data.
Split Boxplots: Age vs. Student
# CHUNK 8
ggplot(Credit, aes(x = Student, y = Age)) + geom_boxplot()
Findings
- Those who are students tend to have a smaller age (a lower median).
- The median age of students, which is close to 50, means that there is some co-dependence (but not interaction!) between Age and Student.
  → The effects of these two variables on Balance may be distorted when they serve as predictors at the same time.
Scatterplots: Balance vs. Income by Student

# CHUNK 8
ggplot(Credit, aes(x = Income, y = Balance, color = Student)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
Findings
- There is a possible interaction between Income and Student in affecting Balance.
  - The slopes of the two smoothed lines differ quite remarkably, with non-students being more sensitive to Income than students, perhaps due to their financial status.
Feature Generation: Income:Student
Stage 4: Model Construction and Evaluation
TASK 5: Explore the effect of releveling factor variables
Model Evaluation: Training/Test Set Split
Creation of training and test sets: createDataPartition()
createDataPartition(<target_variable>, p = <proportion of training data>, list = <if vector>)
# CHUNK 9
library(caret)
set.seed(8964)
partition <- createDataPartition(Credit$Balance, p = 0.75, list = FALSE)
data.train <- Credit[partition, ]
data.test <- Credit[-partition, ]
print("TRAIN")
mean(data.train$Balance)
print("TEST")
mean(data.test$Balance)

[1] "TRAIN"
[1] 520.4352
[1] "TEST"
[1] 518.7374
Model Construction – Step 1: Fitting the First Model
# CHUNK 10
model.full <- lm(Balance ~ . + Income:Student, data = data.train)
summary(model.full)
Call: lm(formula = Balance ~ . + Income:Student, data = data.train) Residuals: Min 1Q Median 3Q Max -189.17 -76.46 -15.65 66.69 296.17 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -552.34353 40.34543 -13.690 <2e-16 *** Income -7.93864 0.29465 -26.943 <2e-16 *** Rating 4.01246 0.06416 62.540 <2e-16 *** Cards 2.89772 4.69168 0.618 0.5373 Age -0.78778 0.35994 -2.189 0.0294 * Education 0.88922 1.92035 0.463 0.6437 GenderFemale -18.37030 12.00867 -1.530 0.1272 StudentYes 395.67744 30.72302 12.879 <2e-16 *** MarriedYes -20.46469 12.39735 -1.651 0.0999 . EthnicityAsian 6.53269 17.13794 0.381 0.7033 EthnicityCaucasian 11.95386 14.82444 0.806 0.4207 Income:StudentYes 0.50102 0.50028 1.001 0.3174 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 103.1 on 289 degrees of freedom Multiple R-squared: 0.952, Adjusted R-squared: 0.9502 F-statistic: 521.4 on 11 and 289 DF, p-value: < 2.2e-16
Number of Levels of the Categorical Predictors
- The lm() function automatically turns the 4 categorical predictors into dummy variables.
  - The dummy variables are named in the form <categorical_predictor><level>.
  - A categorical predictor with r levels is represented by r − 1 dummy variables: Gender, Student, and Married are each represented by 1 dummy variable, and Ethnicity by 2 dummy variables.
  - The level that is left out is the baseline level, which, by default, is the alphanumerically first level.
Coefficient Estimates – Interpretation for Categorical Predictors
- MarriedYes: The credit card balance of married individuals is lower than that of unmarried individuals (baseline) by 20.46469 on average, holding all other predictors fixed.
- EthnicityAsian: The credit card balance of Asians is higher than that of African Americans (baseline) by 6.53269 on average, holding all other predictors fixed.
Significance
- Income, Rating, and Student (or equivalently, StudentYes) are highly significant.
- Income:Student is not quite significant: the interaction becomes insignificant once the effects of the other variables have been accounted for.
Standard Errors
- The standard error of the coefficient estimate of the "Asian" level (17.13794) of the Ethnicity variable is larger than that of the "Caucasian" level (14.82444).
  - This should come as no surprise, considering the larger number of observations in the latter level (see CHUNK 1 or CHUNK 6).
  - Recall that:
    - sparse levels tend to have larger standard errors, because the coefficient estimates for those levels are based on less data;
    - populous levels tend to have smaller standard errors, because the coefficient estimates for those levels are based on more data.
Model Construction – Releveling
Purpose
To relevel the factor variables so that the baseline level of each is the level with the most observations, rather than the alphanumerically first level.
Effects
- has no effect on the predicted values.
- affects the coefficient estimates of the dummy variables.
- possibly changes the significance statistics, such as the t-statistics for the dummy variables and the corresponding p-values.
  - In general, the p-values for factor variables with three or more levels may depend on the choice of the baseline level.
  - If we perform stepwise (backward) selection with binarization and try to drop the most statistically insignificant feature, that is, the one with the highest p-value, then the choice of the baseline levels of the factor variables may affect which feature is dropped first. This explains the importance of releveling.
Why set the baseline level to the most populous level?
- To improve the accuracy of our measures of significance.
Releveling
# CHUNK 11
for (i in vars.categorical) {
  # Use the table() function to calculate the frequencies for each factor
  table <- as.data.frame(table(Credit[, i]))
  # Determine the level with the highest frequency
  max <- which.max(table[, 2])
  # Save the name of the level with the highest frequency
  level.name <- as.character(table[max, 1])
  # Set the baseline level to the most populous level
  Credit[, i] <- relevel(Credit[, i], ref = level.name)
}
summary(Credit[, vars.categorical])

    Gender    Student   Married               Ethnicity
 Female:207   No :360   Yes:245   Caucasian       :199
 Male  :193   Yes: 40   No :155   African American: 99
                                  Asian           :102
Baseline level changes after releveling:
- Gender: Male → Female
- Married: No → Yes
- Ethnicity: African American → Caucasian
Refitting the Full Model
Run CHUNK 12 to refit the full model:

# CHUNK 12
# To make sure factors in the training set are releveled
data.train <- Credit[partition, ]
model.full <- lm(Balance ~ . + Income:Student, data = data.train)
summary(model.full)
Call: lm(formula = Balance ~ . + Income:Student, data = data.train) Residuals: Min 1Q Median 3Q Max -189.17 -76.46 -15.65 66.69 296.17 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -579.22466 40.08600 -14.450 <2e-16 *** Income -7.93864 0.29465 -26.943 <2e-16 *** Rating 4.01246 0.06416 62.540 <2e-16 *** Cards 2.89772 4.69168 0.618 0.5373 Age -0.78778 0.35994 -2.189 0.0294 * Education 0.88922 1.92035 0.463 0.6437 Gender Male 18.37030 12.00867 1.530 0.1272 StudentYes 395.67744 30.72302 12.879 <2e-16 *** MarriedNo 20.46469 12.39735 1.651 0.0999 . EthnicityAfrican American -11.95386 14.82444 -0.806 0.4207 EthnicityAsian -5.42117 14.68523 -0.369 0.7123 Income:StudentYes 0.50102 0.50028 1.001 0.3174 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 103.1 on 289 degrees of freedom Multiple R-squared: 0.952, Adjusted R-squared: 0.9502 F-statistic: 521.4 on 11 and 289 DF, p-value: < 2.2e-1
Findings: Comparing CHUNK 12 with CHUNK 10
Change in the coefficient estimate of the dummy variable for Married:
- Coefficient: MarriedYes → MarriedNo; Estimate: −20.46469 → +20.46469
- → The credit card balance of unmarried individuals is higher than that of married individuals (baseline) by 20.46469 on average, holding all other predictors fixed. The size of the coefficient estimate remains unchanged, but its sign flips (−20.46469 → +20.46469).
Change in the coefficient estimate of the dummy variable for Ethnicity:
- Coefficient: EthnicityAsian; Estimate: 6.53269 (baseline: African American) → −5.42117 (baseline: Caucasian)
- → The credit card balance of Asians is lower than that of Caucasians (baseline) by 5.42117 on average, holding all other predictors fixed. Note that 6.53269 − 11.95386 = −5.42117.
Change in the p-value for EthnicityAsian: 0.7033 → 0.7123
TASK 6: Binarize Factor Variables Manually
Model Construction – One more preparatory step – Binarization of factor variables
Purpose
To binarize categorical predictors with three or more levels.
Effects
- Whether or not we binarize the categorical predictors in advance has no impact on the fitted linear model.
- Binarization can make the feature selection process we are going to perform more meaningful.
Advantages
- With binarization, we are able to add or drop individual factor levels when doing feature selection.
  - Many feature selection functions in R treat a factor variable as a single feature and either retain the variable with all of its levels or remove the variable completely. This "all or nothing" approach ignores the possibility that individual levels of a factor with three or more levels may be:
    - insignificant with respect to the baseline level (and hence could be combined with it to form a bigger baseline level); or
    - insignificantly different from other levels (in which case they could be combined into a new, bigger level).
Disadvantages
- The stepwise selection procedure may take considerably longer to complete, since each factor level is considered a separate feature to add or drop.
- We may see some non-intuitive or nonsensical results when only a handful of the levels of a categorical predictor are selected.
Binarization
Only Ethnicity has more than two levels and therefore has to be binarized into two dummy variables. The binarization of Ethnicity can be done by running CHUNK 13.

# CHUNK 13
library(caret)
binarizer <- dummyVars(paste("~", paste(vars.categorical, collapse = "+")),
                       data = Credit, fullRank = TRUE)
# OR type out the categorical predictors one by one
#binarizer <- dummyVars(~ Gender + Student + Married + Ethnicity,
#                       data = Credit, fullRank = TRUE)
binarized_vars <- data.frame(predict(binarizer, newdata = Credit))
head(binarized_vars)
Gender..Male Student.Yes Married.No Ethnicity.African.American Ethnicity.Asian 1 1 0 0 0 0 2 0 1 0 0 1 3 1 0 1 0 1 4 0 0 1 0 1 5 1 0 0 0 0 6 1 0 1 0 0
Explanation of the Code
- The dummyVars() function takes:
  - a formula (with no target variable) as the first argument; and
  - a data frame carrying the categorical variable(s) you want to binarize as the second argument.
- The paste() function: in paste(vars.categorical, collapse = "+"), we "paste" the names of the four categorical predictors Gender, Student, Married, and Ethnicity (recall that vars.categorical is defined in CHUNK 6) and separate them by the + sign.
- The fullRank option can be set to TRUE or FALSE depending on your purpose:
  - fullRank = TRUE is appropriate for regression and gives a full-rank parameterization of the model matrix, with the baseline levels of the categorical predictors left out.
  - fullRank = FALSE (the default) retains the baseline levels, which results in a rank-deficient model matrix. This setting is more appropriate for principal components and cluster analyses.
Below is the outcome of setting fullRank = FALSE:

# CHUNK 13
binarizer <- dummyVars(paste("~", paste(vars.categorical, collapse = "+")),
                       data = Credit, fullRank = FALSE)
binarized_vars <- data.frame(predict(binarizer, newdata = Credit))
head(binarized_vars)
Gender.Female Gender..Male Student.No Student.Yes Married.Yes Married.No Ethnicity.Caucasian Ethnicity.African.American Ethnicity.Asian 1 0 1 1 0 1 0 1 0 0 2 1 0 0 1 1 0 0 0 1 3 0 1 1 0 0 1 0 0 1 4 1 0 1 0 0 1 0 0 1 5 0 1 1 0 1 0 1 0 0 6 0 1 1 0 0 1 1 0 0
The binarization procedure creates two dummy variables for Ethnicity:
- Ethnicity.African.American and Ethnicity.Asian, indicating whether Ethnicity equals "African American" or "Asian", respectively.
Refitting the Full Model
In CHUNK 14, we combine the binarized variables with the original Credit dataset via the cbind() function.

# CHUNK 14
Credit.bin <- cbind(Credit, binarized_vars)
head(Credit.bin)
Income Rating Cards Age Education Gender Student Married Ethnicity Balance Gender.Female Gender..Male Student.No Student.Yes Married.Yes Married.No Ethnicity.Caucasian Ethnicity.African.American Ethnicity.Asian 1 14.891 283 2 34 11 Male No Yes Caucasian 333 0 1 1 0 1 0 1 0 0 2 106.025 483 3 82 15 Female Yes Yes Asian 903 1 0 0 1 1 0 0 0 1 3 104.593 514 4 71 11 Male No No Asian 580 0 1 1 0 0 1 0 0 1 4 148.924 681 3 36 11 Female No No Asian 964 1 0 1 0 0 1 0 0 1 5 55.882 357 2 68 16 Male No Yes Caucasian 331 0 1 1 0 1 0 1 0 0 6 80.180 569 4 77 10 Male No No Caucasian 1151 0 1 1 0 0 1 1 0 0
Given the binarized variables, the original categorical predictors are no longer needed as they duplicate the information contained in the dummy variables.
# CHUNK 14 (Cont.)
Credit.bin$Gender <- NULL
Credit.bin$Student <- NULL
Credit.bin$Married <- NULL
Credit.bin$Ethnicity <- NULL
head(Credit.bin)
Income Rating Cards Age Education Balance Gender.Female Gender..Male Student.No Student.Yes Married.Yes Married.No Ethnicity.Caucasian Ethnicity.African.American Ethnicity.Asian 1 14.891 283 2 34 11 333 0 1 1 0 1 0 1 0 0 2 106.025 483 3 82 15 903 1 0 0 1 1 0 0 0 1 3 104.593 514 4 71 11 580 0 1 1 0 0 1 0 0 1 4 148.924 681 3 36 11 964 1 0 1 0 0 1 0 0 1 5 55.882 357 2 68 16 331 0 1 1 0 1 0 1 0 0 6 80.180 569 4 77 10 1151 0 1 1 0 0 1 1 0 0
Model Evaluation: Training/Test Set Split
Now that the expanded dataset with binarized variables is available, we again split it into the training and test sets.
# CHUNK 14 (Cont.)
data.train.bin <- Credit.bin[partition, ]
data.test.bin <- Credit.bin[-partition, ]
Model Construction – Step 1 Revisited: Refitting the Linear Model with Binarized Factors
# CHUNK 15
# The interaction term is now Income:Student.Yes, not Income:Student
model.full.bin <- lm(Balance ~ . + Income:Student.Yes, data = data.train.bin)
summary(model.full.bin)
Call: lm(formula = Balance ~ . + Income:Student.Yes, data = data.train.bin) Residuals: Min 1Q Median 3Q Max -189.17 -76.46 -15.65 66.69 296.17 Coefficients: (4 not defined because of singularities) Estimate Std. Error t value Pr(>|t|) (Intercept) -150.13340 48.40440 -3.102 0.00211 ** Income -7.93864 0.29465 -26.943 < 2e-16 *** Rating 4.01246 0.06416 62.540 < 2e-16 *** Cards 2.89772 4.69168 0.618 0.53731 Age -0.78778 0.35994 -2.189 0.02942 * Education 0.88922 1.92035 0.463 0.64368 Gender.Female -18.37030 12.00867 -1.530 0.12717 Gender..Male NA NA NA NA Student.No -395.67744 30.72302 -12.879 < 2e-16 *** Student.Yes NA NA NA NA Married.Yes -20.46469 12.39735 -1.651 0.09988 . Married.No NA NA NA NA Ethnicity.Caucasian 5.42117 14.68523 0.369 0.71228 Ethnicity.African.American -6.53269 17.13794 -0.381 0.70335 Ethnicity.Asian NA NA NA NA Income:Student.Yes 0.50102 0.50028 1.001 0.31743 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 103.1 on 289 degrees of freedom Multiple R-squared: 0.952, Adjusted R-squared: 0.9502 F-statistic: 521.4 on 11 and 289 DF, p-value: < 2.2e-16
Findings:
- The fitted model is equivalent to the one in CHUNK 12 (same residual standard error, R², and F-statistic), showing that binarization has no effect on the quality of the fitted model.
TASK 7: Select features using stepwise selection Some of the features in the full model may lack predictive power and result in overfitting.
Describe each decision and the implications in choosing one option versus the other.
Model Construction – Step 2: Removing Non-Predictive Features Using stepAIC()
How does stepAIC() work?
The MASS::stepAIC() function automates stepwise model selection. In each step it adds (forward selection) or drops (backward selection) the feature that produces the greatest improvement in the model according to a certain criterion (the AIC by default). The process terminates when no more variables can be added or dropped to improve the model.
The stepAIC() Function
- AIC with backward selection: stepAIC(<full model>)
- AIC with forward selection: stepAIC(<empty model>, direction = "forward", scope = list(upper = <full model>, lower = <empty model>))
- BIC with backward selection: stepAIC(<full model>, k = log(nrow(<data>)))
- BIC with forward selection: stepAIC(<empty model>, direction = "forward", scope = list(upper = <full model>, lower = <empty model>), k = log(nrow(<data>)))
Note:
- k is the penalty coefficient: k = 2 for the AIC (the default) or k = log(nrow(<data>)) for the BIC, based on:
  - \(\text{AIC}=-2l+\boxed{2}p\) and
  - \(\text{BIC}=-2l+\boxed{\ln(n_{tr})}p\)
Backward vs. Forward Selection
- Forward selection is more likely to produce a final model with fewer features.
AIC vs. BIC
- The BIC is more conservative than the AIC because its penalty coefficient ln(n_tr) = ln 301 = 5.7071 > 2.
- The BIC tends to favor a model with fewer features than the AIC:
  - using the BIC as the selection criterion will retain fewer features,
  - giving fewer factors to consider,
  - but carrying a higher risk of missing important features.
Backward selection with the AIC
# CHUNK 16
#install.packages("MASS") # uncomment this line the first time you use MASS
library(MASS)
model.backward.AIC <- stepAIC(model.full.bin)
Start: AIC=2802.25 Balance ~ Income + Rating + Cards + Age + Education + Gender.Female + Gender..Male + Student.No + Student.Yes + Married.Yes + Married.No + Ethnicity.Caucasian + Ethnicity.African.American + Ethnicity.Asian + Income:Student.Yes Step: AIC=2802.25 Balance ~ Income + Rating + Cards + Age + Education + Gender.Female + Gender..Male + Student.No + Student.Yes + Married.Yes + Married.No + Ethnicity.Caucasian + Ethnicity.African.American + Income:Student.Yes Step: AIC=2802.25 Balance ~ Income + Rating + Cards + Age + Education + Gender.Female + Gender..Male + Student.No + Student.Yes + Married.Yes + Ethnicity.Caucasian + Ethnicity.African.American + Income:Student.Yes Step: AIC=2802.25 Balance ~ Income + Rating + Cards + Age + Education + Gender.Female + Gender..Male + Student.Yes + Married.Yes + Ethnicity.Caucasian + Ethnicity.African.American + Income:Student.Yes Step: AIC=2802.25 Balance ~ Income + Rating + Cards + Age + Education + Gender.Female + Student.Yes + Married.Yes + Ethnicity.Caucasian + Ethnicity.African.American + Income:Student.Yes Df Sum of Sq RSS AIC - Ethnicity.Caucasian 1 1448 3071419 2800.4 - Ethnicity.African.American 1 1543 3071515 2800.4 - Education 1 2278 3072249 2800.5 - Cards 1 4052 3074024 2800.6 - Income:Student.Yes 1 10654 3080626 2801.3 <none> 3069972 2802.3 - Gender.Female 1 24859 3094830 2802.7 - Married.Yes 1 28946 3098918 2803.1 - Age 1 50885 3120857 2805.2 - Rating 1 41548705 44618677 3605.9 Step: AIC=2800.39 Balance ~ Income + Rating + Cards + Age + Education + Gender.Female + Student.Yes + Married.Yes + Ethnicity.African.American + Income:Student.Yes Df Sum of Sq RSS AIC - Education 1 2255 3073675 2798.6 - Cards 1 3911 3075331 2798.8 - Ethnicity.African.American 1 5595 3077014 2798.9 - Income:Student.Yes 1 10176 3081595 2799.4 <none> 3071419 2800.4 - Gender.Female 1 25165 3096584 2800.8 - Married.Yes 1 29351 3100771 2801.3 - Age 1 51462 3122881 2803.4 - Rating 1 41550276 44621695 3603.9 Step: AIC=2798.61 Balance ~ Income + Rating + Cards + Age + Gender.Female + Student.Yes + Married.Yes + Ethnicity.African.American + Income:Student.Yes Df Sum of Sq RSS AIC - Cards 1 3865 3077540 2797.0 - Ethnicity.African.American 1 5586 3079261 2797.2 - Income:Student.Yes 1 10214 3083888 2797.6 <none> 3073675 2798.6 - Gender.Female 1 25155 3098830 2799.1 - Married.Yes 1 28499 3102174 2799.4 - Age 1 51652 3125327 2801.6 - Rating 1 41550265 44623939 3601.9 Step: AIC=2796.99 Balance ~ Income + Rating + Age + Gender.Female + Student.Yes + Married.Yes + Ethnicity.African.American + Income:Student.Yes Df Sum of Sq RSS AIC - Ethnicity.African.American 1 5579 3083119 2795.5 - Income:Student.Yes 1 10767 3088307 2796.0 <none> 3077540 2797.0 - Gender.Female 1 23659 3101198 2797.3 - Married.Yes 1 27917 3105457 2797.7 - Age 1 49660 3127200 2799.8 - Rating 1 42161655 45239195 3604.0 Step: AIC=2795.54 Balance ~ Income + Rating + Age + Gender.Female + Student.Yes + Married.Yes + Income:Student.Yes Df Sum of Sq RSS AIC - Income:Student.Yes 1 9873 3092992 2794.5 <none> 3083119 2795.5 - Gender.Female 1 23435 3106554 2795.8 - Married.Yes 1 24716 3107835 2795.9 - Age 1 50672 3133791 2798.4 - Rating 1 42165674 45248793 3602.1 Step: AIC=2794.5 Balance ~ Income + Rating + Age + Gender.Female + Student.Yes + Married.Yes Df Sum of Sq RSS AIC <none> 3092992 2794.5 - Gender.Female 1 22314 3115306 2794.7 - Married.Yes 1 25027 3118019 2794.9 - Age 1 49133 3142125 2797.2 - Student.Yes 1 4931600 8024592 3079.5 - Income 1 8357794 11450786 3186.5 - Rating 1 42247226 45340218 3600.7
summary(model.backward.AIC)
Call: lm(formula = Balance ~ Income + Rating + Age + Gender.Female + Student.Yes + Married.Yes, data = data.train.bin) Residuals: Min 1Q Median 3Q Max -195.16 -76.45 -14.03 65.33 285.30 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -529.16392 26.85954 -19.701 <2e-16 *** Income -7.87666 0.27946 -28.186 <2e-16 *** Rating 4.01180 0.06331 63.370 <2e-16 *** Age -0.77016 0.35638 -2.161 0.0315 * Gender.Female -17.32103 11.89326 -1.456 0.1464 Student.Yes 418.87031 19.34645 21.651 <2e-16 *** Married.Yes -18.73773 12.14860 -1.542 0.1241 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 102.6 on 294 degrees of freedom Multiple R-squared: 0.9517, Adjusted R-squared: 0.9507 F-statistic: 964.8 on 6 and 294 DF, p-value: < 2.2e-16
Findings:
- The aliased dummy variables (Ethnicity.Asian, Married.No, Student.No, Gender..Male), whose coefficients could not be estimated in CHUNK 15, are removed first without changing the AIC. Features are then dropped in the following order: Ethnicity.Caucasian, Education, Cards, Ethnicity.African.American, Income:Student.Yes, after which the process terminates.
- The following six features remain in the final model: Income, Rating, Age, Gender.Female, Student.Yes, Married.Yes.
- All of the coefficient estimates in the final model, except those of Gender.Female and Married.Yes, are statistically significant. The retention of Income, Rating, and Student.Yes is in agreement with what we found in Tasks 2 and 3.
Forward Selection with the BIC
# CHUNK 18
# First fit the null model (i.e., the model with no predictors)
model.null <- lm(Balance ~ 1, data = data.train.bin)
model.forward.BIC <- stepAIC(model.null,
                             direction = "forward",
                             scope = list(upper = model.full.bin, lower = model.null),
                             k = log(nrow(data.train.bin)))
Start: AIC=3698.12 Balance ~ 1 Df Sum of Sq RSS AIC + Rating 1 47082983 16908423 3303.2 + Income 1 12910007 51081399 3636.0 + Student.No 1 5425335 58566071 3677.2 + Student.Yes 1 5425335 58566071 3677.2 <none> 63991406 3698.1 + Cards 1 163291 63828115 3703.1 + Gender.Female 1 127201 63864205 3703.2 + Gender..Male 1 127201 63864205 3703.2 + Age 1 79880 63911526 3703.5 + Education 1 13121 63978284 3703.8 + Married.Yes 1 10920 63980486 3703.8 + Married.No 1 10920 63980486 3703.8 + Ethnicity.Asian 1 3049 63988357 3703.8 + Ethnicity.Caucasian 1 2324 63989082 3703.8 + Ethnicity.African.American 1 0 63991406 3703.8 Step: AIC=3303.21 Balance ~ Rating Df Sum of Sq RSS AIC + Income 1 8650956 8257467 3093.2 + Student.No 1 4817765 12090659 3208.0 + Student.Yes 1 4817765 12090659 3208.0 + Age 1 707560 16200863 3296.1 <none> 16908423 3303.2 + Married.Yes 1 174547 16733876 3305.8 + Married.No 1 174547 16733876 3305.8 + Education 1 71402 16837022 3307.6 + Ethnicity.Asian 1 49620 16858804 3308.0 + Gender.Female 1 31887 16876536 3308.4 + Gender..Male 1 31887 16876536 3308.4 + Cards 1 30049 16878374 3308.4 + Ethnicity.Caucasian 1 18541 16889883 3308.6 + Ethnicity.African.American 1 4304 16904120 3308.8 Step: AIC=3093.2 Balance ~ Rating + Income Df Sum of Sq RSS AIC + Student.No 1 5069114 3188353 2812.5 + Student.Yes 1 5069114 3188353 2812.5 <none> 8257467 3093.2 + Married.Yes 1 127904 8129563 3094.2 + Married.No 1 127904 8129563 3094.2 + Age 1 95208 8162259 3095.4 + Education 1 44496 8212971 3097.3 + Ethnicity.Asian 1 40115 8217352 3097.4 + Cards 1 22019 8235448 3098.1 + Ethnicity.African.American 1 9464 8248003 3098.6 + Ethnicity.Caucasian 1 8003 8249464 3098.6 + Gender.Female 1 244 8257223 3098.9 + Gender..Male 1 244 8257223 3098.9 Step: AIC=2812.47 Balance ~ Rating + Income + Student.No Df Sum of Sq RSS AIC <none> 3188353 2812.5 + Age 1 46257 3142096 2813.8 + Gender.Female 1 24128 3164225 2815.9 + Gender..Male 1 24128 3164225 2815.9 + Married.Yes 1 23759 3164595 2815.9 + Married.No 1 23759 3164595 2815.9 + Ethnicity.Caucasian 1 4275 3184078 2817.8 + Ethnicity.African.American 1 2346 3186007 2817.9 + Education 1 1600 3186753 2818.0 + Cards 1 865 3187488 2818.1 + Ethnicity.Asian 1 747 3187606 2818.1
summary(model.forward.BIC)
Call: lm(formula = Balance ~ Rating + Income + Student.No, data = data.train.bin) Residuals: Min 1Q Median 3Q Max -220.292 -78.120 -7.473 63.273 299.895 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -167.99297 23.70608 -7.086 1e-11 *** Rating 4.01634 0.06353 63.223 <2e-16 *** Income -7.97413 0.27691 -28.797 <2e-16 *** Student.No -421.17339 19.38206 -21.730 <2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 103.6 on 297 degrees of freedom Multiple R-squared: 0.9502, Adjusted R-squared: 0.9497 F-statistic: 1888 on 3 and 297 DF, p-value: < 2.2e-16
Findings:
- Features are added in the following order: Rating, Income, Student.No.
- The final model is simpler, with fewer features, and all of its coefficient estimates are statistically significant.
TASK 8: Select and validate the recommended model
Model Evaluation: Performance Metrics
Model Selection – test RMSE
We are particularly interested in whether reducing the complexity of a linear model by stepwise selection algorithms can enhance prediction accuracy. A natural measure of the performance of linear models, where the target variable is numeric, is the test RMSE.
# CHUNK 19
RMSE(data.test$Balance, predict(model.null, newdata = data.test.bin))
RMSE(data.test$Balance, predict(model.full.bin, newdata = data.test.bin))
RMSE(data.test$Balance, predict(model.backward.AIC, newdata = data.test.bin))
RMSE(data.test$Balance, predict(model.forward.BIC, newdata = data.test.bin))
> RMSE(data.test$Balance, predict(model.null, newdata = data.test.bin)) [1] 453.3665 > RMSE(data.test$Balance, predict(model.full.bin, newdata = data.test.bin)) [1] 105.0656 > RMSE(data.test$Balance, predict(model.backward.AIC, newdata = data.test.bin)) [1] 104.2375 > RMSE(data.test$Balance, predict(model.forward.BIC, newdata = data.test.bin)) [1] 102.6785
Findings:
Model | Test RMSE |
Null model (benchmark) | 453.3665 |
Full model | 105.0656 |
Final model based on backward selection and AIC | 104.2375 |
Final model based on forward selection and BIC | 102.6785 (smallest) |
On the basis of both prediction accuracy and interpretability, the final model based on forward selection and BIC is justifiably the most promising model among all of the linear models we have considered and we will recommend using this model.
Stage 5: Model Validation
Model Diagnostics – plot()
# CHUNK 20
plot(model.forward.BIC)
1. “Residuals vs Fitted” Plot
Findings: no discernible patterns.
- The residuals are dispersed in an erratic manner.
- The plot does not exhibit a funnel shape, in which the amount of fluctuation of the residuals would increase with the magnitude of the fitted values. This check serves as an informal test of the homogeneity of the error variance.
- The plot does not exhibit a U-shape. A clear U-shape would suggest that there are non-linear patterns, beyond interaction effects, that fail to be accommodated by the fitted model.
Conclusion:
- One may still try to refine the model by incorporating polynomial terms in Rating (although polynomial relationships don't seem very pronounced) and interaction terms such as Income × Rating; a hedged sketch of such a refinement follows.
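A minimal sketch of such a refinement, reusing the training and test sets from the earlier chunks; the quadratic term in Rating and the Income:Rating interaction are illustrative choices, not a recommended final model:

# Illustrative refinement only: quadratic term in Rating plus an Income:Rating interaction
model.refined <- lm(Balance ~ Rating + I(Rating^2) + Income + Student.No + Income:Rating,
                    data = data.train.bin)
# Compare its test RMSE with that of the Task 7 model (102.6785)
RMSE(data.test.bin$Balance, predict(model.refined, newdata = data.test.bin))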
2. “Normal Q-Q” Plot
What does a Q-Q plot plot?
- For a linear model: the (standardized) raw residuals \(e_i=y_i-\hat{\mu}_i\).
- For a GLM: the deviance residuals \(d_i=\text{sign}(y_i-\hat{\mu}_i)\sqrt{D_i}\), where \(D_i\) is the contribution of the ith observation to the deviance.
Purpose
- check the distributional assumptions of the model.
- assess the normality of the residuals.
Scenarios
- The residuals lie closely and randomly on the 45° straight line passing through the origin → the normality assumption is supported:
  - the residuals are approximately normally distributed;
  - the residuals have no systematic patterns when considered on their own and with respect to the predictors (if there are systematic patterns, the model is not capturing some important features of the data);
  - the residuals have approximately constant variance upon standardization.
- The residuals deviate substantially upward at the two ends of the data → the data appear more skewed than the specified distribution:
  - there are a lot more large residuals than would be anticipated under the specified distribution.
Conclusion:
- The upward deviations at the two ends suggest using a response distribution with a fatter tail than the normal distribution.
Stage 4: Model Construction and Evaluation (Using Regularization)
Model Construction – Fitting Regularized Regression Models
The glmnet::glmnet() Function
- Create a design matrix using model.matrix(y ~ . + x_1:x_2, data = <data set>).
- In glmnet(), set family = "gaussian" if the errors are normal.
- In glmnet(), set lambda = c(λ_1, λ_2, ..., λ_n) to fit one model for each value of λ.

glmnet(x = <design matrix>, y = <target variable>, family = "gaussian",
       lambda = c(0, 10, 100, 500, 1000), alpha = 0.5)
Create a Design Matrix
# CHUNK 21
X.train <- model.matrix(Balance ~ . + Income:Student, data = data.train)
head(X.train) # print out a few rows of the design matrix
(Intercept) Income Rating Cards Age Education Gender Male StudentYes MarriedNo EthnicityAfrican American EthnicityAsian Income:StudentYes 1 1 14.891 283 2 34 11 1 0 0 0 0 0.000 2 1 106.025 483 3 82 15 0 1 0 0 1 106.025 3 1 104.593 514 4 71 11 1 0 1 0 1 0.000 6 1 80.180 569 4 77 10 1 0 1 0 0 0.000 8 1 71.408 512 2 87 9 1 0 1 0 1 0.000 9 1 15.125 266 5 66 13 0 0 1 0 0 0.000
Fitting a Regularized Regression Model

# CHUNK 22
#install.packages("glmnet") # uncomment this line the first time you use glmnet
library(glmnet)
m.lambda <- glmnet(x = X.train, y = data.train$Balance,
                   family = "gaussian",
                   lambda = c(0, 10, 100, 500, 1000),
                   alpha = 0.5)

Extract the coefficient estimates to see how they vary with the value of λ.
1. Using the properties of the glmnet object

# CHUNK 22 (Cont.)
# First way to extract coefficient estimates
m.lambda$a0
s0 s1 s2 s3 s4 520.4352 301.2973 -224.9669 -506.0179 -579.1925
m.lambda$beta
12 x 5 sparse Matrix of class "dgCMatrix" s0 s1 s2 s3 s4 (Intercept) . . . . . Income . . . -6.63360837 -7.9320588 Rating . 0.6182676 2.034434 3.71042914 4.0114839 Cards . . . 0.51433943 2.9161946 Age . . . -0.70550849 -0.7885101 Education . . . . 0.8874960 Gender Male . . . 3.08510246 18.3671409 StudentYes . . 228.762987 396.62014836 396.2181065 MarriedNo . . . 9.36968005 20.4646766 EthnicityAfrican American . . . . -11.9281110 EthnicityAsian . . . . -5.3983474 Income:StudentYes . . . 0.01473463 0.4929168
2. Applying the coef() function to a glmnet object

# CHUNK 22 (Cont.)
# Second way
round(coef(m.lambda), digits = 4)
13 x 5 sparse Matrix of class "dgCMatrix" s0 s1 s2 s3 s4 (Intercept) 520.4352 301.2973 -224.9669 -506.0179 -579.1925 (Intercept) . . . . . Income . . . -6.6336 -7.9321 Rating . 0.6183 2.0344 3.7104 4.0115 Cards . . . 0.5143 2.9162 Age . . . -0.7055 -0.7885 Education . . . . 0.8875 Gender Male . . . 3.0851 18.3671 StudentYes . . 228.7630 396.6201 396.2181 MarriedNo . . . 9.3697 20.4647 EthnicityAfrican American . . . . -11.9281 EthnicityAsian . . . . -5.3983 Income:StudentYes . . . 0.0147 0.4929
Findings:
Column | lambda | Conclusion |
s4 | 0 | No shrinkage: all 11 features have non-zero coefficient estimates (essentially the ordinary least squares fit). |
s3 | 10 | Education and the two Ethnicity dummy variables are shrunk to exactly zero; 8 features remain. |
s2 | 100 | Only Rating and StudentYes retain non-zero coefficient estimates. |
s1 | 500 | Only Rating retains a non-zero coefficient estimate. |
s0 | 1000 | All coefficient estimates other than the intercept are shrunk to zero, leaving the intercept-only model. |
As λ increases, the coefficient estimates shrink toward zero and more of them become exactly zero.
TASK 10: Perform feature selection with regularization
Hyperparameter Tuning: The cv.glmnet() Function
cv.glmnet()
Purpose
This function performs k-fold cross-validation (10-fold by default), calculates the cross-validation error for a range of values of λ, and produces the optimal value of λ as a by-product.
plot(<cv.glmnet object>)
Purpose
Returns the cross-validation curve: the cross-validation error plotted as a function of the values of λ used.
- Red dots: the cross-validation error corresponding to a range of values of λ (or log λ).
- Vertical lines around the dots: the standard error bands.
- The numbers at the top of the graph: how many features have non-zero coefficient estimates at a given value of λ (or log λ).
- The first vertical dotted line from the left: the value of λ (or log λ) at which the cross-validation error is minimized (lambda.min).
- The second vertical dotted line from the left: lambda.1se, the greatest value of λ (equivalently, the simplest regularized regression model) whose cross-validation error is within one standard error of the minimum error.
If you are asked to tune a regularized regression model on the exam, the default is to select the value of λ that minimizes the cross-validation error (i.e., take λ = lambda.min).
# CHUNK 23
set.seed(1111)
m <- cv.glmnet(x = X.train,
               y = data.train$Balance,
               family = "gaussian",
               alpha = 0.5)
plot(m)

m$lambda.min
[1] 0.8885539
m$lambda.1se
[1] 7.550517
Findings:
- The first dotted line from the left has λ = m$lambda.min = 0.8885539.
- The second dotted line from the left has λ = m$lambda.1se = 7.550517.
Fitting the Regularized Model Using lambda.min
This value of λ minimizes the cross-validation error.

# CHUNK 24
# Fit the regularized model using lambda.min
m.min <- glmnet(x = X.train,
                y = data.train$Balance,
                family = "gaussian",
                lambda = m$lambda.min,
                alpha = 0.5)
m.min$beta
12 x 1 sparse Matrix of class "dgCMatrix" s0 (Intercept) . Income -7.8131829 Rating 3.9839963 Cards 2.6915109 Age -0.7829239 Education 0.7557949 Gender Male 17.0049498 StudentYes 396.3329355 MarriedNo 19.4182020 EthnicityAfrican American -9.9708020 EthnicityAsian -3.6479011 Income:StudentYes 0.4458288
# Set up the design matrix for the test set
X.test <- model.matrix(Balance ~ . + Income:Student, data = data.test)

# Make predictions on the test set and calculate the test RMSE
m.min.pred <- predict(m.min, newx = X.test)
RMSE(data.test$Balance, m.min.pred)
[1] 103.9201
Findings:
- All 11 features are retained at the lambda.min value, as we observed from plot(m).
- The test RMSE of the resulting model is 103.9201, which is slightly higher than that of the final model in Task 7 (102.6785).
- The regularized model is also much harder to interpret due to its substantially larger size, whereas the model in Task 7 has only 3 features.
- As a result, it is reasonable to recommend the model in Task 7 as the final model owing to its better prediction accuracy and interpretability.
Fitting the Regularized Model with λ = lambda.1se
The value of λ based on the one-standard-error rule is equal to lambda.1se, or 7.550517.

# CHUNK 25
# Fit the regularized model using lambda.1se
m.1se <- glmnet(x = X.train,
                y = data.train$Balance,
                family = "gaussian",
                lambda = m$lambda.1se,
                alpha = 0.5)
# List the coefficient estimates
m.1se$beta
12 x 1 sparse Matrix of class "dgCMatrix" s0 (Intercept) . Income -6.9378441 Rating 3.7813573 Cards 1.1075391 Age -0.7305808 Education . Gender Male 6.7717419 StudentYes 397.5199346 MarriedNo 11.6768447 EthnicityAfrican American . EthnicityAsian . Income:StudentYes 0.1148304
# Make predictions on the test set and calculate the test RMSE
m.1se.pred <- predict(m.1se, newx = X.test)
RMSE(data.test$Balance, m.1se.pred)
[1] 102.8241
Findings:
- The fitted model has 8 features, as we observed from plot(m); Education, EthnicityAfrican American, and EthnicityAsian are removed.
- The test RMSE has dropped to 102.8241, a slight improvement over that of the model based on lambda.min.
- The regularized model still has slightly worse prediction accuracy, as well as worse interpretability, than the model in Task 7 (test RMSE 102.6785 with only 3 features).
- As a result, it is reasonable to recommend the model in Task 7 as the final model owing to its better prediction accuracy and interpretability.
How to choose the optimal value of α?
To optimize over α, we have to resort to trial and error, proceeding as follows:
1. Set up a grid of values of α (given in the question beforehand).
2. For each value of α, select the corresponding optimal value of λ and evaluate the corresponding cross-validation error.
3. Adopt the value of α that produces the smallest cross-validation error.
In essence, we break down the two-variable optimization problem by first minimizing over λ for a given α, then minimizing over α.
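A minimal sketch of this tuning loop, reusing X.train and data.train from the earlier chunks; the grid of α values and the seed are illustrative assumptions:

# Illustrative alpha-tuning loop (the grid of alpha values is an assumption)
library(glmnet)
alpha.grid <- c(0, 0.25, 0.5, 0.75, 1)
cv.errors <- numeric(length(alpha.grid))
set.seed(42)  # illustrative seed
for (j in seq_along(alpha.grid)) {
  cv.fit <- cv.glmnet(x = X.train, y = data.train$Balance,
                      family = "gaussian", alpha = alpha.grid[j])
  # Cross-validation error at the optimal lambda for this alpha
  cv.errors[j] <- min(cv.fit$cvm)
}
alpha.grid[which.min(cv.errors)]  # adopt the alpha with the smallest CV error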
4.1. Generalized Linear Models
GLMs
Equation
\(g(\mu)=\eta:=\beta_0+\beta_1X_1+…+\beta_pX_p\), where
- \(g(\cdot)\): The link function “linking” the target mean µ to the linear combination of the predictors
- monotonic: allows us to invert the link function to \(\mu=g^{-1}(\eta)=g^{-1}(\beta_0+\beta_1X_1+…+\beta_pX_p)\)
A linear model is obtained with the identity link g(µ) = µ, giving \(\mu=\eta=\beta_0+\beta_1X_1+…+\beta_pX_p\).
GLMs vs. Linear Models
- A GLM is more flexible than a linear model in two respects:
  - Distribution of the target variable: the target variable is no longer confined to the class of normal random variables; it can be any member of the exponential family of distributions.
  - Relationship between the target mean and the linear predictor: a GLM sets a function of the target mean to be linearly related to the predictors.
- Distribution of the target variable:
GLMs vs. Linear Models on Transformed Data
- The transformations are applied differently (externally vs. internally), as illustrated in the sketch below.
  - Linear models on transformed data: we transform the target observations to bring them closer to normal, so that they can be reasonably described by a normal linear model.
  - GLMs: the data are modeled directly; the target variable itself is not transformed, and the transformation (the link) plays its role only within the GLM.
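A minimal sketch of the contrast on toy simulated data (the variable names, the gamma target, and the log link are illustrative assumptions, not part of the case study):

# Toy positive, right-skewed target
set.seed(1)
x <- runif(200)
y <- rgamma(200, shape = 2, rate = 2 / exp(1 + 2 * x))

# Linear model on transformed data: the target itself is transformed (externally)
fit.lm <- lm(log(y) ~ x)

# GLM: the data are modeled directly; the log link transforms the mean (internally)
fit.glm <- glm(y ~ x, family = Gamma(link = "log"))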
Typical Relationships between Target Means and X
Equation | Relationship | Component |
\(\mu=X^{\beta}\) | The target mean has a power relationship with X | β is the exponent |
Selection of Target Distributions
To choose one that best aligns with the characteristics of a given target variable.
- Positive, continuous, and right-skewed data:
- Examples: claim amounts, income, and amount of insurance coverage
- Target Distribution: gamma and inverse Gaussian, to capture the skewness of the target variable directly without the use of transformations.
- gamma has positively related mean and variance, a desirable characteristic to model claim severity.
- inverse Gaussian has a similar behavior but is more highly skewed than the gamma.
- Binary data:
- Examples: classification problems in which the target variable indicates the occurrence (1) or non-occurrence (0) of an event of interest.
- Target Distribution: The Binomial (Bernoulli) distribution is probably the only reasonable choice.
- Count data:
- Examples: The number of times a certain event of interest happens over a reference time period. It assumes only non-negative integer values, possibly with a right skew.
- Target Distribution: The Poisson distribution is one of the most natural candidate distributions.
- One of its drawbacks is that it requires mean = variance.
- Aggregate data:
- Examples: Aggregate claim losses, which tend to have a large mass at zero as most policies have no claims and a continuous distribution skewed to the right reflecting total claim sizes when there are claims.
- Target Distribution: The Tweedie distribution. A random variable S follows the Tweedie distribution if it admits the compound Poisson-gamma representation.
- a Poisson sum of gamma random variables
- It has a discrete probability mass at zero and a probability density function on the positive real line.
Selection of Link Functions
Appropriateness of Predictions
Case | Positive mean | Mean in the unit interval |
Distribution | Poisson, gamma, and inverse Gaussian | binomial |
Mean | positive and unbounded | between 0 and 1 |
Link Function | log link: g(µ) = ln µ | logit link: \(\eta=g(\pi)=\ln \dfrac{\pi}{1-\pi}=\ln(\text{odds})\) |
Benefits | model predictions are guaranteed to be positive | predicted probabilities are guaranteed to lie between 0 and 1 |
Interpretability
Definition
A good link function should have ease of interpretation.
What is ease of interpretation?
Easy to form statements that describe the effects of the predictors of the model on the target mean in terms of the model coefficients.
Can answer
If a continuous predictor X increases by 1 unit, by how much will the target mean change?
Canonical Link
Advantage
- Simplifies the mathematics of the estimation procedure and makes it more likely to converge.
Important factors to consider
- Whether the predictions provided by the link align with the characteristics of the target variable.
- Whether the resulting GLM is easy to interpret.
Common Canonical Link
Target Distribution | Canonical Link | Mathematical Form | Other Link(s) |
Normal | Identity | \(g(\mu)=\mu\) | – |
Binomial | Logit | \(g(\pi)=\ln \left(\dfrac{\pi}{1-\pi}\right)\) | Probit, cloglog |
Poisson | Natural Log | \(g(\mu)=\ln \mu\) | – |
Gamma | Inverse | \(g(\mu)=\mu^{-1}\) | Log |
Inverse Gaussian | Squared Inverse | \(g(\mu)=\mu^{-2}\) | Log |
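For reference, a minimal sketch of how these target distribution and link combinations are specified through R's glm(); the simulated data and variable names are illustrative only:

# Toy data to illustrate the family/link syntax (illustrative only)
set.seed(1)
x  <- runif(100)
yc <- rgamma(100, shape = 2, rate = 2 / exp(1 + x))     # positive, continuous
yb <- rbinom(100, size = 1, prob = plogis(-1 + 2 * x))  # binary
yn <- rpois(100, lambda = exp(0.5 + x))                 # counts

fit.normal   <- glm(yc ~ x, family = gaussian(link = "identity"))
fit.binom    <- glm(yb ~ x, family = binomial(link = "logit"))
fit.poisson  <- glm(yn ~ x, family = poisson(link = "log"))
fit.gamma    <- glm(yc ~ x, family = Gamma(link = "inverse"))
fit.invgauss <- glm(yc ~ x, family = inverse.gaussian(link = "1/mu^2"))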
Interpretation of GLM Coefficients
- The interpretation depends on the choice of the link function; the target distribution plays no role in the interpretation.
Log Link
\(\ln\mu=\eta \Rightarrow \mu=e^{\eta}=e^{\beta_0+\beta_1X_1+\cdots+\beta_jX_j+\cdots+\beta_pX_p}=e^{\beta_0}\times e^{\beta_1X_1}\times\cdots\times e^{\beta_jX_j}\times\cdots\times e^{\beta_pX_p}\)
For a unit increase in Xj:
Case | Change | Interpretation |
Numeric Xj | Multiplicative: \(\text{new }\mu = \boxed{e^{\beta_j}}\times\text{old }\mu\) | The target mean is multiplied by \(e^{\beta_j}\), holding all other predictors fixed. |
Numeric Xj | Percentage: \(\dfrac{(e^{\beta_j}-1)\mu}{\mu}=e^{\beta_j}-1\) | The target mean changes by \(100(e^{\beta_j}-1)\)%. |
Categorical Xj (non-baseline level) | Multiplicative: \(\text{new }\mu = e^{\beta_j}\times\) (target mean at the baseline level) | The target mean at that level is \(e^{\beta_j}\) times the target mean at the baseline level. |
Categorical Xj (non-baseline level) | Percentage: \(100(e^{\beta_j}-1)\)% | The target mean at that level differs from that at the baseline level by \(100(e^{\beta_j}-1)\)%. |
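An illustrative worked example (the coefficient value is made up): if a log-link GLM has \(\hat{\beta}_j = 0.05\) for a numeric predictor \(X_j\), then a unit increase in \(X_j\) multiplies the target mean by \(e^{0.05}\approx 1.0513\), i.e., increases it by about \(100(e^{0.05}-1)\%\approx 5.13\%\), holding all other predictors fixed.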
Logit Link
For a unit increase in Xj:
Case | Change | Interpretation |
Numeric Xj | Multiplicative: \(\text{new odds} = \boxed{e^{\beta_j}}\times\text{old odds}\) | The odds are multiplied by \(e^{\beta_j}\), holding all other predictors fixed. |
Numeric Xj | Percentage: \(100(e^{\beta_j}-1)\)% of the old odds | The odds change by \(100(e^{\beta_j}-1)\)%. |
Weights and Offsets
Purpose
Both weights and offsets are designed to incorporate a measure of exposure into a GLM to improve the fitting.
Methodology: Weights
To take advantage of the fact that different observations in the data may have different exposures and thus different degrees of precision, we can attach a higher weight to observations with a larger exposure so that these more credible observations will carry more weight in the estimation of the model coefficients. The goal is to improve the reliability of the fitting procedure.
Methodology: Offsets
Assumption: the target mean is directly proportional to the exposure.
Let Ei be the exposure for the ith observation, then the model equation is:
\(\ln\mu_i=\boxed{\ln E_i}+\eta_i=\ln E_i+(\beta_0+\beta_1X_{i1}+…+\beta_pX_{ip})\), where
- the logged exposure \(\ln E_i\) is called an offset.
- The coefficient of \(\ln E_i\) is set a priori to 1 (it is not estimated).
The above equation can be written in two ways to show further insights:
- Upon exponentiation, it is equivalent to \(\mu_i=E_i e^{\eta_i}\), which shows that µi and Ei are in direct proportion.
-
\(\ln(\dfrac{\mu_i}{E_i})=\eta_i=\beta_0+\beta_1X_{i1}+…+\beta_pX_{ip}\)
In this form, we are modeling the occurrence rate of the event of interest, or the expected number of events per exposure.
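A minimal sketch of a Poisson GLM with an offset on toy simulated data (all names and values illustrative):

# Toy claim-count data: counts observed over exposures of different sizes
set.seed(2)
exposure <- runif(300, 0.5, 2)   # e.g., policy years
x        <- rnorm(300)
counts   <- rpois(300, lambda = exposure * exp(-1 + 0.3 * x))

# ln(exposure) enters as an offset; its coefficient is fixed at 1, not estimated
fit.offset <- glm(counts ~ x, family = poisson(link = "log"),
                  offset = log(exposure))
summary(fit.offset)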
Offsets vs. Weights
Offsets and weights impose vastly different structures on the mean and variance of the target variable.
Key Differences:
- the form of the target variable in the dataset (average vs. aggregate).
- how the exposure affects the target variable (mean vs. variance).
Differences:
- Weights: To use weights properly, the observations of the target variable should be averaged by exposure. Due to the averaging, the variance of each observation is inversely related to the size of the exposure, which serves as the weight for that observation. However, the weights do not affect the mean of the target variable.
- Offsets: For offsets, the observations are values aggregated over the exposure units. The exposure, when serving as an offset, is in direct proportion to the mean of the target variable, but otherwise leaves its variance unaffected.
(Remember that for a GLM, the mean and variance of the target variable are generally related, so an offset also affects the variance, but only as a by-product)
Weights | Offsets | |
Target Variable: Form | averaging observations | aggregating observations |
Target Variable: Mean | no effects | the exposure is directly proportional to the target mean |
Target Variable: Variance | the variance is inversely related to the size of the exposure | no effects |
Special Case: Weights and Offsets
For Poisson regression with the log link, weights and offsets produce identical results: the following two generic models are equivalent in the sense that they share the same coefficient estimates, standard errors, p-values, and, most importantly, predictions (a simulated sketch follows the list):
- Model 1: A Poisson regression model fitted to Y (count) using ln E as the offset
- Model 2: A Poisson regression model fitted to Y/ E (frequency) using E as the weight
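The sketch below (simulated data, not from the manual) illustrates the equivalence; R warns about non-integer counts when fitting Model 2, but the coefficient estimates are unaffected:

set.seed(42)
n <- 500
x <- runif(n)
E <- runif(n, 0.5, 2)                            # exposures
y <- rpois(n, lambda = E * exp(-1 + 0.5 * x))    # counts whose mean is proportional to E
df <- data.frame(y, x, E)

# Model 1: Poisson regression on the count y with ln(E) as an offset
m1 <- glm(y ~ x, family = poisson(link = "log"), offset = log(E), data = df)

# Model 2: Poisson regression on the frequency y/E with E as the weight
m2 <- glm(y / E ~ x, family = poisson(link = "log"), weights = E, data = df)

cbind(coef(m1), coef(m2))                        # identical coefficient estimates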
Model Quantities and Diagnostic Tools
Coefficient Estimates
Global Goodness-of-Fit Measure: Deviance
R2
-
- R2 is not a goodness-of-fit measure for GLM because it is based on the normal distribution.
Deviance
-
- R2 measures how a linear model compares with the most primitive linear model (the intercept-only linear model);
- Deviance measures the extent to which the GLM departs from the most elaborate GLM, which is known as the saturated model.
- The saturated model, lSAT , has the same target distribution as the fitted GLM, l.
- It has as many model parameters as the number of training observations.
- It perfectly fits each training observation and is a very flexible GLM.
Formula:
\(D=2(l_{SAT}-l)\)
Interpretation:
-
- The lower the deviance, the closer the GLM is to the model with a perfect fit, and the better its goodness of fit on the training set.
Model Statistics:
- For a normal target variable with variance σ2, \(D=\dfrac{\sum_{i=1}^{n_{tr}}(y_i-\hat{y}_i)^2}{\sigma^2}\).
In this sense, the deviance can be viewed as a generalization of the residual sum of squares that works even for non-normal target variables.
- The deviance can only be used to compare GLMs having the same target distribution
(so that they share the same maximized loglikelihood of the saturated model, \(l_{SAT}\)).
- The deviance residual is an important model diagnostic tool for GLMs (see the next subsection).
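A quick R sketch of these facts, using an arbitrary GLM fitted to the built-in cars dataset purely for illustration (not one of the case-study models):

fit <- glm(dist ~ speed, data = cars, family = Gamma(link = "log"))  # any illustrative GLM

deviance(fit)                              # the residual deviance D reported by summary(fit)
sum(residuals(fit, type = "deviance")^2)   # same value: the squared deviance residuals sum to D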
Local Goodness-of-Fit Measure: Deviance Residuals
Raw residuals for GLM?
-
- The target distributions of GLMs are generally not normal.
- Variances are not constant.
Formula
\(d_i=\text{sign}(y_i-\hat{\mu}_i)\sqrt{D_i}\), where \(D_i\) is the contribution of the ith observation to the deviance, so that \(\sum_{i=1}^{n_{tr}}d^2_i=D\)
Properties (parallel those of the raw residuals in a linear model)
(By examining Q-Q plots of deviance residuals)
-
- They are approximately normally distributed for most target distributions in the linear exponential family (the binomial distribution is a notable exception).
- They have no systematic patterns when considered on their own and with respect to the predictors.
- They have approximately constant variance upon standardization (using the standard error implied by the GLM).
The deviance residuals don’t have to be normally distributed for the model to be valid, so the normality / non-normality of the residuals doesn’t necessarily tell you anything.
Penalized Likelihood Measures: AIC and BIC
Formula
-
- \(AIC=-2l+2p\)
- \(BIC=-2l+p\ln (n_{tr})\)
where \(l\) is the maximized loglikelihood of the model and \(p\) is the number of estimated parameters.
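For any fitted GLM (for example, the illustrative fit object from the sketch above), R computes both measures directly:

AIC(fit)      # -2*(maximized loglikelihood) + 2p
BIC(fit)      # -2*(maximized loglikelihood) + p*log(n_tr)
logLik(fit)   # the maximized loglikelihood l and the parameter count p behind both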
Performance Metrics for Classifiers
RMSE?
-
- RMSE is the most commonly used performance metric for numeric target variables.
- GLMs accommodate both numeric and categorical target variables, so other performance metrics are needed for classifiers.
Confusion Matrix for Binary Classifier
Predicted probabilities into the predicted classes
To translate the predicted probabilities into the predicted classes, we need a pre-specified cutoff (or threshold).
-
-
- If the predicted probability of the event of interest for an observation is higher than the cutoff, then the event is predicted to occur.
- If the predicted probability of the event of interest for an observation is lower than the cutoff, then the event is predicted not to occur.
-
Prediction | Reference (= Actual) | 
– | + | 
– | TN | FN |
+ | FP | TP |
-
-
- TP = true positive: The classifier predicts that the event occurs (“positive”) and indeed it does (“true”).
- TN = true negative: The classifier predicts that the event does not occur (“negative”) and indeed it does not (“true”).
- FP = false positive: The classifier predicts that the event occurs (“positive”), but it does not (“false”).
- FN = false negative: The classifier predicts that the event does not occur (“negative”), but it does (“false”).
-
Performance Metrics
\(\text{Classification Error Rate}=\dfrac{FN+FP}{n}\) = the proportion of misclassified observations
\(\text{Accuracy}=\dfrac{TN+TP}{n}\) = the proportion of correctly classified observations (= 1 − classification error rate)
\(\text{Sensitivity}=\dfrac{TP}{TP+FN}\) = the relative frequency of correctly predicting the event of interest when the event does take place
\(\text{Specificity}=\dfrac{TN}{TN+FP}\) = the relative frequency of correctly predicting a non-event when the event does not take place
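A tiny sketch with made-up counts shows how the four metrics follow from the entries of the confusion matrix:

TP <- 40; FN <- 10; TN <- 30; FP <- 20
n  <- TP + FN + TN + FP

(FN + FP) / n        # classification error rate = 0.3
(TN + TP) / n        # accuracy = 0.7
TP / (TP + FN)       # sensitivity = 0.8
TN / (TN + FP)       # specificity = 0.6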
Effects of Changing the Cutoff
A trade-off between having high sensitivity and having high specificity
-
-
- Ideally, the classifier and the cutoff should be such that both the sensitivity and specificity are close to 1.
- In general, sensitivity and specificity, as a function of the cutoff, are in conflict with each other.
- The determination of the precise cutoff requires weighing the benefits of correct classifications (TN and TP) and the costs of misclassifications (FN and FP).
-
An optimal cutoff can then be chosen by maximizing the overall profit or minimizing the overall cost as a function of the cutoff.
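The sketch below illustrates this idea; p_hat and actual are simulated stand-ins for a classifier's predicted probabilities and the observed classes, and the two misclassification costs are made-up values:

set.seed(1)
p_hat  <- runif(200)                      # hypothetical predicted probabilities
actual <- rbinom(200, 1, p_hat)           # hypothetical observed classes
cost.fp <- 1; cost.fn <- 5                # assume a false negative costs 5 times a false positive

cutoffs <- seq(0.05, 0.95, by = 0.05)
total.cost <- sapply(cutoffs, function(k) {
  pred <- as.numeric(p_hat > k)
  fp <- sum(pred == 1 & actual == 0)
  fn <- sum(pred == 0 & actual == 1)
  cost.fp * fp + cost.fn * fn
})
cutoffs[which.min(total.cost)]            # cutoff minimizing the total misclassification cost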
Extreme Case 1
Cutoff = 0: every predicted probability exceeds the cutoff, so every observation is predicted positive (sensitivity = 1, specificity = 0).
Prediction | Reference (= Actual) | |
– | + | |
– | 0 | 0 |
+ | n_ | n+ |
where:
-
-
- n_ = TN + FP is the total number of negative observations.
- n+ = FN + TP is the total number of positive observations.
- n = n_ + n+ = TN + FP + FN + TP is the total number of observations.
-
Increasing the Cutoff
As the cutoff increases, more observations are predicted to be negative, so the entries in the confusion matrix move from the second row to the first row.
Prediction | Reference (= Actual) | |
– | + | |
– | TN | FN |
+ | FP | TP |
-
-
- Sensitivity decreases but specificity increases.
-
Extreme Case 2
Cutoff = 1: no predicted probability exceeds the cutoff, so every observation is predicted negative.
Prediction | Reference (= Actual) | |
– | + | |
– | n_ | n+ |
+ | 0 | 0 |
-
-
- Sensitivity = 0 and specificity = 1.
-
Receiver Operating Characteristic (ROC) Curves
Definition
A graphical tool to plot the sensitivity against the specificity of a given classifier for each cutoff ranging from 0 to 1. (x, y) := (specificity, sensitivity).
A good classifier should be such that:
-
- its sensitivity and specificity are both close to 1 for a wide range of cutoffs, so its ROC curve bends toward the upper left and approaches the top-left corner point (1, 1). The closer the curve is to the top-left corner, the better the predictive ability.
Predictive Performance: Area under the Curve (AUC)
How do we evaluate predictive performance? The higher the AUC, the better the predictive performance.
AUC = 1
Such a classifier:
-
-
- separates all of the observations into the two classes correctly whenever the cutoff is not 0 or 1, with:
- Sensitivity = 1
- Specificity = 1
-
AUC = 0.5
The naive classifier classifies the observations purely at random (say, predicting the positive class with probability p for every observation), without using the information contained in the predictors.
Prediction | Reference (= Actual) | |
– | + | |
– | (1 – p)n_ | (1 – p)n+ |
+ | pn_ | pn+ |
-
-
- Sensitivity = \(\dfrac{pn_{+}}{pn_{+}+(1-p)n_{+}}=p\)
- Specificity = \(\dfrac{(1-p)n_{-}}{(1-p)n_{-}+pn_{-}}=1-p\)
Because sensitivity = 1 − specificity for every choice of p, the ROC curve is the 45° diagonal line and the AUC equals 0.5.
-
Sidebar: Unbalanced data
Why unbalanced data is problematic?
With unbalanced data, a classifier implicitly places more weight on the majority class and tries to match the training observations in that class, without paying enough attention to the minority class.
The problem is worse if the minority class is the class of interest, i.e., the positive class.
Undersampling
Definition
Undersampling produces roughly balanced data by drawing fewer observations from the negative class (“undersampled”) and retaining all of the positive class observations.
Pros
With relatively more data from the positive class (in proportion), this will improve the classifier’s ability to pick up the signal leading to the positive class.
Cons
The drawback is that the classifier, now based on less data and therefore less information about the negative class, could be less robust and more prone to overfitting.
Oversampling
Definition
Oversampling keeps all of the original data, but oversamples (with replacement) the positive class to reduce the imbalance between the two classes.
Some of the positive class observations will appear more than once in the balanced data.
Data Set
Oversampling should be performed on the training data, following the training/test set split.
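A minimal sketch (toy data, not from the case study) of oversampling the positive class after the training/test split; only the training set is rebalanced:

set.seed(1)
train.toy <- data.frame(y = rbinom(1000, 1, 0.05), x = rnorm(1000))  # unbalanced toy training set

pos <- train.toy[train.toy$y == 1, ]
neg <- train.toy[train.toy$y == 0, ]

# Resample the positive class with replacement until the two classes are balanced
pos.over <- pos[sample(nrow(pos), nrow(neg), replace = TRUE), ]
train.bal <- rbind(neg, pos.over)

table(train.bal$y)   # equal class counts; the test set is left untouched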
4.2. Generalized Linear Models Case Study 1
Learning Objectives
- Select appropriate distributions and link functions for a positive, continuous target variable with a right skew.
- Fit a GLM using the glm() function in R and specify the options of this function appropriately.
- Make predictions for GLMs using the predict() function and compare the predictive performance of different GLMs.
- Generate and interpret diagnostic plots for a GLM.
Stage 1: Define the Business Problem
Objectives
To build GLMs to predict the size of settled claims using related risk factors in the dataset, select the most promising GLM, and quantify its predictive accuracy.
Modeling Options
- Log-transform the claim size and fit a normal linear model.
- GLM with the log link function and normal distribution
- GLM with the gamma or inverse Gaussian distribution
TASK 1: Data pre-processing Determine which numeric variables in the data, if any, should be treated as factors. Do the factor conversion. |
Stage 2: Data Collection
# CHUNK 1 persinj <- read.csv("persinj.csv") summary(persinj)
amt inj legrep op_time Min. : 10 Min. :1.00 Min. :0.0000 Min. : 0.10 1st Qu.: 6297 1st Qu.:1.00 1st Qu.:0.0000 1st Qu.:23.00 Median : 13854 Median :1.00 Median :1.0000 Median :45.90 Mean : 38367 Mean :1.83 Mean :0.6366 Mean :46.33 3rd Qu.: 35123 3rd Qu.:2.00 3rd Qu.:1.0000 3rd Qu.:69.30 Max. :4485797 Max. :9.00 Max. :1.0000 Max. :99.10
Findings:
- We need to convert
inj
from numeric to categorical variable, since theinj
variable is merely a label of group membership indicating the injury group to which each policy belongs.
Convert inj
to a Factor
# CHUNK 2 persinj$inj <- as.factor(persinj$inj) summary(persinj)
amt inj legrep op_time Min. : 10 1:15638 Min. :0.0000 Min. : 0.10 1st Qu.: 6297 2: 3376 1st Qu.:0.0000 1st Qu.:23.00 Median : 13854 3: 1133 Median :1.0000 Median :45.90 Mean : 38367 4: 189 Mean :0.6366 Mean :46.33 3rd Qu.: 35123 5: 188 3rd Qu.:1.0000 3rd Qu.:69.30 Max. :4485797 6: 256 Max. :1.0000 Max. :99.10 9: 1256
Findings:
- Since the baseline level (“1”) has the most observations, there is no need to relevel
inj
.
Stage 3: Exploratory Data Analysis
N/A.
TASK 2: Fit a linear model as a benchmark
|
Stage 4: Model Construction and Evaluation
Training/Test Set Split
library(caret) set.seed(2019) partition <- createDataPartition(y = persinj$amt, p = .75, list = FALSE) data.train <- persinj[partition, ] data.test <- persinj[-partition, ] summary(data.train$amt) summary(data.test$amt)
Validating the training and test sets: their summary statistics (in particular the means) should be close.
summary(data.train$amt) summary(data.test$amt)
> summary(data.train$amt) Min. 1st Qu. Median Mean 3rd Qu. Max. 20 6297 13852 38670 35118 4485797 > summary(data.test$amt) Min. 1st Qu. Median Mean 3rd Qu. Max. 10 6299 13858 37458 35140 1450818
Findings:
- The maximum claim sizes of the two sets differ considerably even with the use of stratified sampling.
- The two means are not as close to each other as we expect.
- The two medians, 13,852 and 13,858, are very close to each other.
Benchmark Model: Normal Linear model on Log of Claim Size
ggplot(persinj, aes(x = op_time, y = log(amt), color = legrep)) + geom_point(alpha = 0.25) + geom_smooth(method = "lm", se = FALSE)
Findings:
- It seems plausible that legal representation and operational time interact with each other to affect
amt
, so we will incorporate an interaction term for these two predictors.
Found interaction term: legrep:op_time
Fitting the benchmark model
Our benchmark model is the linear model on the log of the claim size fitted using ordinary least squares (OLS).
# CHUNK 4 glm.ols <- glm(log(amt) ~ inj + legrep * op_time, family = gaussian(link = "identity"), data = data.train) summary(glm.ols)
Call: glm(formula = log(amt) ~ inj + legrep * op_time, family = gaussian(link = "identity"), data = data.train) Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 7.5212622 0.0269021 279.579 < 2e-16 *** inj2 0.5912759 0.0237858 24.858 < 2e-16 *** inj3 0.8161053 0.0385945 21.146 < 2e-16 *** inj4 0.7958558 0.0901834 8.825 < 2e-16 *** inj5 0.6462279 0.0894089 7.228 5.12e-13 *** inj6 0.3823662 0.0777895 4.915 8.95e-07 *** inj9 -0.8151999 0.0367039 -22.210 < 2e-16 *** legrep 0.9120834 0.0339725 26.848 < 2e-16 *** op_time 0.0358738 0.0005093 70.431 < 2e-16 *** legrep:op_time -0.0101808 0.0006384 -15.946 < 2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for gaussian family taken to be 1.146774) Null deviance: 35056 on 16527 degrees of freedom Residual deviance: 18942 on 16518 degrees of freedom AIC: 49180 Number of Fisher Scoring iterations: 2
Findings:
-
- The summary of a GLM does not show the R2 and F-statistic, which are specific to linear models.
- It does show the deviance (labeled Residual deviance) and AIC of the OLS linear model, in addition to the coefficient estimates, standard errors, and p-values.
Making predictions and Calculating RMSE
# CHUNK 5 # Make predictions pred.ols <- exp(predict(glm.ols, newdata = data.test)) head(pred.ols) # RMSE RMSE(data.test$amt, pred.ols) RMSE(data.test$amt, mean(data.train$amt))
> head(pred.ols) 3 10 13 21 24 27 1853.5345 1853.5345 4609.6856 820.2833 820.2833 820.2833 > RMSE(data.test$amt, pred.ols) [1] 72888.33 > RMSE(data.test$amt, mean(data.train$amt)) [1] 79370.9
Assessing the Performance of OLS Model:
Test RMSE of the OLS model = 72,888.33 > Test RMSE of the intercept-only model = 79,370.9
The OLS model represents a decent improvement over the intercept-only model.
TASK 3: Select a distribution and link function
|
GLM 1: Log-Link Normal GLM on Claim Size
Merits:
- This GLM ensures positive predictions.
Demerits:
- It allows for the possibility that the observations of the target variable are negative.
Fitting the GLM model
# CHUNK 6 glm.log <- glm(amt ~ inj + legrep * op_time, family = gaussian(link = "log"), data = data.train) summary(glm.log)
Call: glm(formula = amt ~ inj + legrep * op_time, family = gaussian(link = "log"), data = data.train) Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 8.559238 0.120163 71.230 < 2e-16 *** inj2 0.572982 0.031018 18.473 < 2e-16 *** inj3 0.825502 0.034683 23.801 < 2e-16 *** inj4 0.883796 0.057430 15.389 < 2e-16 *** inj5 1.531415 0.034374 44.552 < 2e-16 *** inj6 1.506266 0.042987 35.040 < 2e-16 *** inj9 -0.552178 0.146506 -3.769 0.000164 *** legrep 0.068766 0.138410 0.497 0.619318 op_time 0.027185 0.001508 18.029 < 2e-16 *** legrep:op_time 0.003195 0.001721 1.857 0.063383 . --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for gaussian family taken to be 6897153068) Null deviance: 1.4770e+14 on 16527 degrees of freedom Residual deviance: 1.1393e+14 on 16518 degrees of freedom AIC: 421348 Number of Fisher Scoring iterations: 8
Making predictions and Calculating RMSE
pred.log <- predict(glm.log, newdata = data.test, type = "response") head(pred.log) RMSE(data.test$amt, pred.log)
> head(pred.log) 3 10 13 21 24 27 5228.901 5228.901 5602.912 3010.251 3010.251 3010.251 > RMSE(data.test$amt, pred.log) [1] 69623.89
Findings:
-
- Test RMSE decreases to 69,623.89
The normal GLM with log link outperforms the OLS linear model in terms of prediction accuracy.
GLM 2: Log-Link Gamma GLM
Justification of using log-link Gamma GLM
Gamma distribution | The target variable, amt, is a positive, continuous variable with a right skew. |
Log link | Ensures that all predictions are positive, a characteristic of the target variable, and makes the model coefficients easy to interpret – exponentiated coefficients correspond to multiplicative changes in the target mean. |
Training the model
Run CHUNK 7
to train a gamma GLM with the log link.
# CHUNK 7 glm.gamma <- glm(amt ~ inj + legrep * op_time, family = Gamma(link = "log"), data = data.train) summary(glm.gamma)
Call: glm(formula = amt ~ inj + legrep * op_time, family = Gamma(link = "log"), data = data.train) Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 8.2056640 0.0331403 247.604 < 2e-16 *** inj2 0.5975174 0.0293014 20.392 < 2e-16 *** inj3 0.8568408 0.0475440 18.022 < 2e-16 *** inj4 1.1029945 0.1110956 9.928 < 2e-16 *** inj5 1.5036755 0.1101415 13.652 < 2e-16 *** inj6 0.9004443 0.0958278 9.396 < 2e-16 *** inj9 -0.6321655 0.0452150 -13.981 < 2e-16 *** legrep 0.4393948 0.0418502 10.499 < 2e-16 *** op_time 0.0344292 0.0006275 54.871 < 2e-16 *** legrep:op_time -0.0049104 0.0007865 -6.243 4.38e-10 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for Gamma family taken to be 1.740277) Null deviance: 33141 on 16527 degrees of freedom Residual deviance: 16705 on 16518 degrees of freedom AIC: 365833 Number of Fisher Scoring iterations: 8
RMSE(data.test$amt, predict(glm.gamma, newdata = data.test, type = "response"))
> RMSE(data.test$amt, predict(glm.gamma, newdata = data.test, type = "response")) [1] 68831.79
Findings:
-
- The test RMSE drops further to 68,831.79 compared with the normal GLM with the log link. This shows that the gamma GLM is the most predictive among the models we have fitted with respect to the test RMSE.
GLM 3: Inverse Gaussian GLM
Justification of using the log-link inverse Gaussian GLM
Inverse Gaussian distribution | The target variable, amt, is a positive, continuous variable with a right skew. |
Log link | Ensures that all predictions are positive, a characteristic of the target variable, and makes the model coefficients easy to interpret. |
The inverse Gaussian distribution often has a fatter tail than a gamma distribution. Its canonical link function is the inverse square link \(g(\mu)=\dfrac{1}{\mu^2}\), which ensures positive predictions but is not easy to interpret, so the log link is also commonly used.
Training the model
In CHUNK 8
, we try to fit an inverse Gaussian GLM with a log link, but to no avail.
# CHUNK 8 glm.ig <- glm(amt ~ inj + legrep * op_time, data = data.train, family = inverse.gaussian(link = "log"))
Error: inner loop 1; cannot correct step size In addition: Warning messages: 1: step size truncated due to divergence 2: step size truncated due to divergence
The fitting cannot be completed due to convergence issues.
Model: log-link Gamma GLM
TASK 4: Validate the model
|
Model Validation
Gamma GLM vs. Benchmark OLS model
Benchmark OLS Model | Log-Link Gamma GLM | |
test RMSE | 72,888.33 | 68,831.79 |
The summary of the gamma GLM shows that all coefficient estimates are statistically significant, with extremely small p-values, so it is wise to retain all existing features in the model.
Model Diagnostics
In CHUNK 9
we make some diagnostic plots for the gamma GLM to see if the model assumptions are largely satisfied.
# CHUNK 9 plot(glm.gamma, which = 1) qqnorm(residuals(glm.gamma)) qqline(residuals(glm.gamma))
Findings:
- “Residuals vs Fitted” Plot
- The residuals mostly scatter around 0 in a structureless manner, but with a sizable amount of fluctuation.
- The spread of the positive residuals tends to be much larger than that of the negative residuals, with a few outliers having unusually large positive residuals (observations 5422, 10224, and 18870).
- Q-Q Plot
- The plot looks fine in the middle portion and on the left end, but seems problematic on the right end, with the points deviating quite markedly from the reference line.
- The fact that the rightmost points deviate significantly upward means that there are a lot more large, positive standardized deviance residuals than under a normal distribution, and by extension, the distributions of the deviance residuals and the data are more skewed than the gamma distribution.
- It appears that a fatter-tailed model may perform better.
(Note: There is also a Q-Q plot produced by running plot (glm.gamma
), but that plot is for standardized Pearson residuals, which, unlike deviance residuals, are not approximately normal even if a correct GLM is specified. This is why we use separate code to produce a Q-Q plot for deviance residuals.)
TASK 5: Interpret the model Run the selected model from Task 3 on the full dataset.
|
Fitting the selected model with full dataset
In CHUNK 10
, we rerun the gamma GLM on the full dataset (data = persinj
) to provide a more robust prediction for new, future data (the larger the data set used to train a model, the more reliable the fitting) and print the model summary.
# CHUNK 10 glm.final <- glm(amt ~ inj + legrep * op_time, family = Gamma(link = "log"), data = persinj) summary(glm.final)
Call: glm(formula = amt ~ inj + legrep * op_time, family = Gamma(link = "log"), data = persinj) Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 8.2033907 0.0279443 293.562 < 2e-16 *** inj2 0.6281073 0.0248995 25.226 < 2e-16 *** inj3 0.8882054 0.0402472 22.069 < 2e-16 *** inj4 1.1199670 0.0951566 11.770 < 2e-16 *** inj5 1.3963478 0.0955128 14.619 < 2e-16 *** inj6 0.8867918 0.0816012 10.867 < 2e-16 *** inj9 -0.6205268 0.0381881 -16.249 < 2e-16 *** legrep 0.4437842 0.0354423 12.521 < 2e-16 *** op_time 0.0343052 0.0005303 64.685 < 2e-16 *** legrep:op_time -0.0050443 0.0006663 -7.571 3.86e-14 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for Gamma family taken to be 1.675808) Null deviance: 44010 on 22035 degrees of freedom Residual deviance: 22242 on 22026 degrees of freedom AIC: 487535 Number of Fisher Scoring iterations: 8
Findings:
- The model summary shows that injury code, legal representation, and operational time are all statistically significant factors affecting the expected claim size of a personal injury insurance.
- There is a significant interaction between legal representation and operational time, meaning that:
the effect of operational time on the expected claim size varies for injuries with and without legal representation.
Exponentiating the Coefficients
Since we are using the log link, we need to exponentiate the coefficients to get the multiplicative changes.
# Exponentiate the coefficients to get the multiplicative changes exp(coef(glm.final))
(Intercept) inj2 inj3 inj4 inj5 inj6 inj9 legrep op_time legrep:op_time 3653.3167103 1.8740602 2.4307635 3.0647532 4.0404165 2.4273297 0.5376611 1.5585941 1.0349004 0.9949684
Interpretation
Since the log link is used,
-
- The expected claim size is multiplied by \(e^{\beta_j}\) for every unit increase in a numeric predictor with coefficient βj, holding everything else constant (equivalently, the percentage change is \(100(e^{\beta_j}-1)\)%).
- The expected claim size is multiplied by \(e^{\beta_j}\) when a qualitative predictor moves from its baseline level to a new level represented by a dummy variable with coefficient βj, holding everything else constant (equivalently, the percentage change is \(100(e^{\beta_j}-1)\)%).
More precisely,
-
- Injury Code
- With the exception of injuries with injury code 9 (which represents unclassified injuries), the expected claim size for injuries with codes 2, 3, 4, 5, 6 is higher than that for injuries with code 1 (baseline).
- For instance, the expected claim size for injuries with code 5 is estimated to be \(e^{1.3963478}=4.0404\) times that of injuries with code 1.
- That the estimated coefficients for codes 2, 3, 4, 5, and 6 are all positive and mostly increasing across the codes makes intuitive sense, as we would expect that the more serious the injury, the larger the expected claim size.
However, we would expect the estimated coefficient for code 6 to be the largest as code 6 corresponds to fatalities. This is not the case for the gamma GLM.
- Operational Time
- Operational time is positively associated with the expected claim size, though the effect is not as positive for injuries with legal representation.
- A unit increase in operational time is associated with a multiplicative increase of \(e^{0.0343052}=1.0349004\) in the expected claim size for injuries without legal representation. The multiplicative increase reduces to \(e^{0.0343052-0.0050443}=1.029693\) for injuries with legal representation. The positive effect is reasonable because larger claims are often more difficult to quantify and require more time to settle, leading to longer delays.
- Legal Representation
- For most injuries, those with legal representation have a higher expected claim size than those without legal representation, unless the injuries take an extraordinarily long time to settle (operational time higher than 88 to be precise).
This is intuitive as clients with the assistance of the legal advice are in a better position to fight for a larger settled claim size.
EXAM NOTE
When you are asked to interpret the results of a GLM, the following three-part structure can be quite useful:
-
- First interpret the precise values of the estimated coefficients, e.g., every unit increase in a continuous predictor is associated with a multiplicative change of \(e^{\beta_j}\) in the expected value of the target variable, holding everything else fixed.
- Comment on whether the sign of the estimated coefficients conforms to intuition. Do the results make sense? You can use your common knowledge in this part.
- Relate the findings to the “big picture” and explain how they can help the client make better decisions.
4.3. Case Study 2: GLMs for Binary Target Variables
Learning Objectives
Compared to GLMs for numeric target variables, GLM-based classifiers enjoy some subtly unique features, which will be revealed in the course of this case study. At the completion of this section, you should be able to:
- Combine factor levels to reduce the dimension of the data.
- Select appropriate link functions for binary target variables.
- Implement different kinds of GLMs for binary target variables in R.
- Incorporate an offset into a logistic regression model.
- Interpret the results of a fitted logistic regression model.
Stage 1: Define the Business Problem
Objective
To construct appropriate GLMs to identify key factors associated with claim occurrence.
TASK 1: Edit the data Perform data-processing by:
|
Stage 2: Data Collection
Read in Data Set
Run CHUNK 1
to load the insuranceData
package and the data, and print a summary of each variable.
# CHUNK 1 library (insuranceData) data(dataCar) summary(dataCar) Or # CHUNK 1 # Prior to R v4.0, the option stringsAsFactors was TRUE by default, i.e. string-like variables would automatically get converted to factors when being read in by dataCar <- read.csv("vehins.csv") # For R v4.0+, dataCar <- read.csv("vehins.csv", stringsAsFactors=TRUE) summary(dataCar)
clm exposure veh_value numclaims claimcst veh_body veh_age gender area agecat Min. :0.00000 Min. :0.002738 Min. : 0.000 Min. :0.00000 Min. : 0.0 Length:67856 Min. :1.000 Length:67856 Length:67856 Min. :1.000 1st Qu.:0.00000 1st Qu.:0.219028 1st Qu.: 1.010 1st Qu.:0.00000 1st Qu.: 0.0 Class :character 1st Qu.:2.000 Class :character Class :character 1st Qu.:2.000 Median :0.00000 Median :0.446270 Median : 1.500 Median :0.00000 Median : 0.0 Mode :character Median :3.000 Mode :character Mode :character Median :3.000 Mean :0.06814 Mean :0.468651 Mean : 1.777 Mean :0.07276 Mean : 137.3 Mean :2.674 Mean :3.485 3rd Qu.:0.00000 3rd Qu.:0.709103 3rd Qu.: 2.150 3rd Qu.:0.00000 3rd Qu.: 0.0 3rd Qu.:4.000 3rd Qu.:5.000 Max. :1.00000 Max. :0.999316 Max. :34.560 Max. :4.00000 Max. :55922.1 Max. :4.000 Max. :6.000
Findings:
- The mean of
clm
(which is treated in R as a numeric variable taking only two values, 0 and 1) is 0.06814, meaning that 6.814% of the 67,856 vehicle insurance policies had at least one claim over the policy period. veh_value
ranges from 0 to 34.560 and its mean (1.777) is higher than its median (1.500), suggesting that its distribution is moderately right-skewed.
Variable | Description | Characteristics |
veh_value |
Vehicle value (in $10,000s) | Numeric with three decimal places, from 0 to 34.560 |
exposure |
Exposure of policy | Numeric from 0 to 1 |
clm |
Claim occurrence | Integer from 0 to 1 (0 = no, 1 = yes) |
numclaims |
Number of claims | Integer from 0 to 4 |
claimcst |
Claim size | Positive value from 0 to 55,922.1 (= 0 if no claim) |
veh_body |
Vehicle body type | Factor with 13 levels: BUS, CONVT (= convertible), COUPE, HBACK (= hatchback), HDTOP (= hardtop), MCARA (= motorized caravan), MIBUS (= minibus), PANVN (= panel van), RDSTR (= roadster), SEDAN, STNWG (= station wagon), TRUCK, UTE (= utility) |
veh_age |
Vehicle age | Integer from 1 (youngest) to 4 |
gender |
Gender of policyholder | A factor with two levels: M (= male) and F (= female) |
area |
Area of residence of policyholder |
Factor with 6 levels: A, B, C, D, E, F |
agecat |
Age band of policyholder | Integer from 1 (youngest) to 6 |
X_OBSTAT_ |
(Unknown) | A factor with one level “01101 0 0 0” |
Data Quality Issue
Remove Irrelevant Data and Target Leakage
Remark: The variables that serve as predictors of the GLMs must be available before claims are observed.
There are two variables whose values are known at the same time as or after claims are submitted, meaning that they cannot function as predictors of clm
in practice.
-
numclaims
, the number of claims, cannot be known before a claim is submitted.claimcst
, claim size, cannot be known before a claim is submitted.
# CHUNK 2 table(dataCar$clm, dataCar$numclaims)
0 1 2 3 4 0 63232 0 0 0 0 1 0 4333 271 18 2
In the second part of CHUNK 2
, we drop numclaims
and claimcst
to avoid target leakage, as well as the mysterious variable X_OBSTAT_
from the dataset.
# CHUNK 2 (cont.) dataCar$claimcst <- NULL dataCar$numclaims <- NULL dataCar$X_OBSTAT_ <- NULL
Remove Unreasonable Data
The dataset also contains a small number of observations (53, to be exact) which involve a zero vehicle value. Although veh_value
is measured in $10,000s, a value of zero for veh_value
means that the value of the vehicle is lower than 0.001 × $10,000 = $10, which does not make practical sense.
# CHUNK 2 (cont.) nrow(dataCar[dataCar$veh_value == 0, ]) dataCar <- dataCar[dataCar$veh_value > 0, ]
[1] 53
Factorize Numerical Variables for Group Data
Since veh_age
and agecat
are labels of age groups and they do not represent exact ages, it may be desirable to treat them as factors instead of numeric variables.
In CHUNK 3
, the conversion of veh_age
and agecat
from integers to factors can be achieved by the as.factor()
function.
# CHUNK 3 # Before factor conversion class(dataCar$agecat) class(dataCar$veh_age) dataCar$agecat <- as.factor(dataCar$agecat) dataCar$veh_age <- as.factor(dataCar$veh_age) # After factor conversion class(dataCar$agecat) class(dataCar$veh_age)
> # CHUNK 3 # Before factor conversion > class(dataCar$agecat) [1] "integer" > class(dataCar$veh_age) [1] "integer" > > dataCar$agecat <- as.factor(dataCar$agecat) > dataCar$veh_age <- as.factor(dataCar$veh_age) > > # After factor conversion > class(dataCar$agecat) [1] "factor" > class(dataCar$veh_age) [1] "factor"
Relevel Categorical Variables
The rest of CHUNK 3
relevels all of the categorical predictors, including veh_age
and agecat
, so that their baseline level is the level with the most observations.
vars.cat <- c("veh_age", "veh_body", "gender", "area", "agecat") for (i in vars.cat){ table <- as.data.frame(table(dataCar[, i])) max <- which.max(table[, 2]) level.name <- as.character(table[max, 1]) dataCar[, i] <- relevel(dataCar[, i], ref = level.name) } summary(dataCar)
clm exposure veh_value claimcst veh_body veh_age gender area agecat Min. :0.00000 Min. :0.002738 Min. : 0.180 Min. : 0.0 SEDAN :22232 3:20060 F:38584 C:20530 4:16179 1st Qu.:0.00000 1st Qu.:0.219028 1st Qu.: 1.010 1st Qu.: 0.0 HBACK :18915 1:12254 M:29219 A:16302 1: 5734 Median :0.00000 Median :0.446270 Median : 1.500 Median : 0.0 STNWG :16234 2:16582 B:13328 2:12868 Mean :0.06811 Mean :0.468481 Mean : 1.778 Mean : 137.1 UTE : 4586 4:18907 D: 8161 3:15757 3rd Qu.:0.00000 3rd Qu.:0.709103 3rd Qu.: 2.150 3rd Qu.: 0.0 TRUCK : 1742 E: 5907 5:10722 Max. :1.00000 Max. :0.999316 Max. :34.560 Max. :55922.1 HDTOP : 1579 F: 3575 6: 6543 (Other): 2515
Findings:
- For all of the categorical predictors, their baseline level is listed first and contains the greatest number of observations, as desired.
TASK 2: Perform univariate exploration of the non-factor variables Use graphical displays and summary statistics to determine if any of the nonfactor variable(s) except exposure should be transformed and, if so, what transformation should be made. Do your recommended transformation(s), if any, and delete the original variable(s). Note: For this task, there is no need to analyze the exposure variable, which will be treated later. |
Stage 3: Exploratory Data Analysis
In CHUNK 4
, for such a strictly positive and skewed variable, the log transformation is commonly used, so we create a log-transformed version of veh_value
called log_veh_value
, whose distribution is much more symmetric and bell-shaped, and delete the original veh_value
in the second part of CHUNK 4
.
# CHUNK 4 library(ggplot2) ggplot(dataCar, aes(x = veh_value)) + geom_histogram() ggplot(dataCar, aes(x = log(veh_value))) + geom_histogram() dataCar$log_veh_value <- log(dataCar$veh_value) dataCar$veh_value <- NULL
Figure 4.3.1: A histogram for veh_value
(left) and the log of veh_value
(right) in the dataCar
dataset.
TASK 3: Explore the relationship of each predictor to Use graphical displays and summary statistics to form preliminary conclusions regarding which variables are likely to have significant predictive power. |
Numeric predictors
In CHUNK 6
, we get the mean or median of log_veh_value
and exposure
over the two levels of clm
.
# CHUNK 6 library(tidyverse) dataCar %>% group_by_("clm") %>% summarise( mean= mean(log_veh_value), median= median(log_veh_value), n = n() )
# A tibble: 2 × 4 clm mean median n <int> <dbl> <dbl> <int> 1 0 0.382 0.399 63185 2 1 0.448 0.451 4618
dataCar %>% group_by_("clm") %>% summarise( mean = mean(exposure), median = median(exposure), n = n() )
# A tibble: 2 × 4 clm mean median n <int> <dbl> <dbl> <int> 1 0 0.458 0.427 63185 2 1 0.611 0.637 4618
Findings:
- As we expect, the means and medians are higher when there is a claim, so:
- Interpretation: a policyholder with a larger vehicle value or a larger exposure has a higher tendency to submit an insurance claim.
In CHUNK 5
, we produce boxplots for log_veh_value
and exposure
indexed by the two levels (0 and 1) of clm
.
# CHUNK 5 ggplot(dataCar, aes(x = factor(clm), y = log_veh_value)) + geom_boxplot(fill = "red") ggplot(dataCar, aes(x = factor(clm), y = exposure)) + geom_boxplot(fill = "red")
Figure 4.3.2: Split boxplots for log_veh_value
and exposure
by clm
in the dataCar
dataset.
Findings:
- It appears that claim occurrence (
clm = 1
) is associated with a slightly higher value oflog_veh_value
and a much higher value ofexposure
.
Categorical Predictors
The other predictors, veh_body
, veh_age
(after conversion), gender
, area
, and agecat
(after conversion), are all categorical variables.
For categorical-categorical combinations, filled bar charts are useful. In CHUNK 7
, we use a for loop to construct the bar charts for each categorical predictor color-filled by clm
.
# CHUNK 7 for (i in vars.cat) { plot <- ggplot(dataCar, aes(x = dataCar[, i], fill = factor(clm))) + geom_bar(position = "fill") + labs(x = i, y = "percent") + theme(axis.text.x = element_text(angle = 90, hjust = 1)) print(plot) }
Figure 4.3.3: A bar chart for agecat
by clm
in the dataCar
dataset
Findings:
veh_age
: Higher rate of claim occurrence for group 2.veh_body
: Much higher rate forBUS
and lower rate forCONVT
,MIBUS
, andUTE
.area
: Higher rate for areaF
.agecat
: The claim occurrence rate appears to decrease withagecat
.
We can also obtain a more accurate understanding of how the categorical predictors affect clm
by looking at the mean of clm
(which is the claim occurrence rate) by different levels of the predictors, as in CHUNK 8
.
# CHUNK 8 for (i in vars.cat) { print(i) x <- dataCar %>% group_by_(i) %>% summarise(mean = mean(clm), n = n()) print(x) }
[1] "veh_age" # A tibble: 4 × 3 veh_age mean n <fct> <dbl> <int> 1 3 0.0679 20060 2 1 0.0672 12254 3 2 0.0759 16582 4 4 0.0620 18907 [1] "veh_body" # A tibble: 13 × 3 veh_body mean n <fct> <dbl> <int> 1 SEDAN 0.0664 22232 2 BUS 0.184 38 3 CONVT 0.0370 81 4 COUPE 0.0872 780 5 HBACK 0.0668 18915 6 HDTOP 0.0823 1579 7 MCARA 0.116 121 8 MIBUS 0.0601 716 9 PANVN 0.0824 752 10 RDSTR 0.0741 27 11 STNWG 0.0721 16234 12 TRUCK 0.0683 1742 13 UTE 0.0567 4586 [1] "gender" # A tibble: 2 × 3 gender mean n <fct> <dbl> <int> 1 F 0.0686 38584 2 M 0.0675 29219 [1] "area" # A tibble: 6 × 3 area mean n <fct> <dbl> <int> 1 C 0.0688 20530 2 A 0.0664 16302 3 B 0.0723 13328 4 D 0.0607 8161 5 E 0.0652 5907 6 F 0.0783 3575 [1] "agecat" # A tibble: 6 × 3 agecat mean n <fct> <dbl> <int> 1 4 0.0682 16179 2 1 0.0863 5734 3 2 0.0724 12868 4 3 0.0705 15757 5 5 0.0572 10722 6 6 0.0556 6543
The descriptive statistics are consistent with what we observe from the filled bar charts in CHUNK 7
.
TASK 4: Reduce the number of factor levels where appropriate Several of the factor variables have a small number of observations at some of the levels. Consider using knowledge of the factor levels as well as evidence from Task 3 to combine some of them into factor levels with more observations. |
Combining Categorical Predictors
We combine factor levels based on (all of below):
- the similarity of means/medians
- extremely low counts
Arguably the most problematic categorical predictor is veh_body
, which has the largest number of levels (13 levels) out of the five factor variables. Some levels like “BUS
” and “RDSTR
” have very few observations, so our priority is to combine these sparse levels with more populous levels with a similar claim occurrence rate.
Combining BUS
and MCARA
into BUS-MCARA
BUS-MCARA
: “BUS
” is extremely sparse, but it has the highest claim occurrence rate. The level with the second highest rate is “MCARA
“, which is quite sparse as well. In the first two lines of CHUNK 9
, let’s combine these two levels into a larger level called “BUS-MCARA
“.
# CHUNK 9 # Combine "BUS" and "MCARA" as one level called "BUS-MCARA" levels(dataCar$veh_body) [levels(dataCar$veh_body) == "BUS"] <- "BUS-MCARA" levels(dataCar$veh_body) [levels(dataCar$veh_body) == "MCARA"] <- "BUS-MCARA"
Combining others into OTHER
There are other sparse levels such as “CONVT
” and ”RDSTR
“. We can observe that the claim occurrence rates in these two levels are quite similar to the rates in other levels (except “BUS
” and “MCARA
“) , ranging from about 0.05 to about 0.08, so we may combine all these levels as one big level called “OTHER
” . That’s what the rest of CHUNK 9
does.
# CHUNK 9 (Cont.) # Combine levels other than "BUS-MCARA" into one level called "OTHER" levels(dataCar$veh_body) [!(levels(dataCar$veh_body) == "BUS-MCARA")] <- "OTHER" table(dataCar$veh_body)
OTHER BUS-MCARA 67644 159
After consolidating its factor levels, veh_body
only has two levels, “OTHER
” (baseline level) and “BUS-MCARA
“. The baseline level has relatively low claim occurrence rate while the “BUS-MCARA
” level has a much higher rate, even though this level has much fewer observations.
TASK 5: Select an interaction Select one pair of features that may be included as an interaction variable in your GLM. Do this by first proposing two variables that are likely to interact and then using appropriate graphical displays to confirm the existence of an interaction. Continue until a promising interaction has been identified. Include your selected interaction variables when constructing a GLM in subsequent tasks. |
Interactions
In CHUNK 10
, we explore the possible interaction between:
log_veh_value
(numeric) and veh_body
(factor)
via the use of boxplots for log_veh_value
split by clm
(factor) faceted by veh_body
(see Figure 4.3.4).
# CHUNK 10 ggplot(dataCar, aes(x = factor(clm), y = log_veh_value)) + geom_boxplot() + facet_wrap(~ veh_body)
Figure 4.3.4: Faceted split boxplots for log_veh_value
by clm
and veh_body
in the dataCar
data.
Findings:
For OTHER
vehicles, claim occurrence is associated with a slightly higher log_veh_value
, but the association is much more positive for BUS-MCARA
vehicles. In other words,
log_veh_value
exerts a much stronger effect on claim occurrence forBUS-MCARA
vehicles than forOTHER
vehicles.
Discover interaction term: log_veh_value:veh_body
TASK 6: Select a link function
Note:
|
Stage 4: Model Construction and Evaluation
Creation of Training and Test sets
In CHUNK 11
, createDataPartition()
function is used to split the data into the training (75%) and test (25%) sets.
# CHUNK 11 library(caret) set.seed(4769) partition <- createDataPartition(y = dataCar$clm, p = 0.75, list = FALSE) data.train <- dataCar[partition, ] data.test <- dataCar[-partition, ] mean(data.train$clm) mean(data.test$clm)
> mean(data.train$clm) [1] 0.06870784 > mean(data.test$clm) [1] 0.06631268
Validating the data split process: To check that the two sets are representative, we observe that the claim occurrence rate in the training set is 6.87% and that in the test set is 6.63%. That the two rates are similar shows that the built-in stratification of the target variable works well.
Model Distribution and Link Function
Distribution: Since the target variable clm
is binary, the binomial distribution is the only reasonable candidate for the distribution of the target variable.
Link Function:
- Log link: can be safely ruled out, as the mean of the binary target variable, given by p = e^η in this case, does not necessarily lie between 0 and 1 (predictions, although non-negative, may be greater than 1, which is impossible for a probability).
- Logit link: is most commonly used and most interpretable due to its intimate connections with the odds of an event.
- Probit, Cauchit, and Complementary Log-Log link: all produce results which are much more difficult to interpret because of the rather complex relationship between p and η.
Distribution: Binomial
Link Function: Logit link and Probit link for comparison
Fitting the Model with the Link Functions: Logit and Probit
In CHUNK 12
, we fit separately the logistic regression model and probit regression model to the training set.
Because we are specifically asked not to use the exposure
variable, we use the formula:
clm ~ . - exposure + log_veh_value:veh_body
,
meaning to regress clm
on all variables in the training set except for exposure
(using - exposure
), with the interaction term between log_veh_value
and veh_body
added.
# CHUNK 12 logit <- glm(clm ~ . - exposure + log_veh_value:veh_body, data = data.train, family = binomial) probit <- glm(clm ~ . - exposure + log_veh_value:veh_body, data = data.train, family = binomial(link = "probit"))
Which link function to use?
The most natural way to compare these two GLMs is to look at their predictive performance on the test set. As the task statement explicitly told us not to do so, we instead look at metrics computed on the training set that assess not only the goodness of fit of a GLM to the training set, but also the complexity of the model, and strike a balance between the two.
In the second part of CHUNK 12
, we use the AIC()
function to return the AIC of each of the two GLMs.
# CHUNK 12 (Cont.) AIC(logit) AIC(probit)
> # CHUNK 12 (Cont.) > AIC(logit) [1] 25354.09 > AIC(probit) [1] 25353.79
The two AIC values are very close to each other, but the logistic regression model is much more interpretable than the probit model, so we will use the logit link in subsequent tasks.
Link Function := Logit Link
TASK 7: Specify an offset
|
Setting Offset: exposure
exposure
is strongly positively related to claim occurrence. This comes as no surprise since exposure records the effective duration of each vehicle insurance policy in the dataset (according to the data dictionary) and captures the amount of risk each policy is exposed to.
Other things equal, the larger the value of exposure
, the higher the claim occurrence probability.
One good way to put exposure
to good use is to take it as an offset: ln(exposure), so that the model equation reads:
ln(odds) = ln(exposure) + η,
where η is the linear predictor without the offset, or equivalently, odds = exposure x eη.
Refitting the Model with an Offset
In CHUNK 13
, we refit the logistic regression model for clm
using ln(exposure) as an offset and output its AIC and summary.
# CHUNK 13 logit.offset <- glm(clm ~ . - exposure + log_veh_value:veh_body, data = data.train, offset = log(exposure), family = binomial) AIC(logit.offset) summary(logit.offset)
> AIC(logit.offset) [1] 24459.17 > summary(logit.offset) Call: glm(formula = clm ~ . - exposure + log_veh_value:veh_body, family = binomial, data = data.train, offset = log(exposure)) Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -1.8401822 0.0553407 -33.252 < 2e-16 *** veh_bodyBUS-MCARA 1.2064692 0.5839503 2.066 0.038824 * veh_age1 -0.0237603 0.0577435 -0.411 0.680720 veh_age2 0.0707487 0.0489945 1.444 0.148735 veh_age4 -0.0258168 0.0530979 -0.486 0.626818 genderM -0.0471309 0.0367956 -1.281 0.200233 areaA -0.0418634 0.0489577 -0.855 0.392500 areaB 0.0839797 0.0503891 1.667 0.095589 . areaD -0.1630404 0.0637039 -2.559 0.010487 * areaE -0.1027563 0.0706512 -1.454 0.145831 areaF 0.0636621 0.0798244 0.798 0.425145 agecat1 0.2960493 0.0653958 4.527 5.98e-06 *** agecat2 0.0788806 0.0537650 1.467 0.142339 agecat3 -0.0003662 0.0514177 -0.007 0.994317 agecat5 -0.2342049 0.0604906 -3.872 0.000108 *** agecat6 -0.2749970 0.0746007 -3.686 0.000228 *** log_veh_value 0.1470760 0.0400693 3.671 0.000242 *** veh_bodyBUS-MCARA:log_veh_value -0.5307286 0.6035585 -0.879 0.379221 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 24567 on 50852 degrees of freedom Residual deviance: 24423 on 50835 degrees of freedom AIC: 24459 Number of Fisher Scoring iterations: 5
Findings:
-
- The AIC of the logistic regression model with an offset is 24,459.17, down from 25,354.09 (
CHUNK 12
) when an offset is absent.
This indicates that the incorporation of the offset leads to a moderate amount of improvement on the quality of the logistic regression model.
TASK 8: Generate confusion matrices and AUCs
|
Constructing Confusion Matrices
Determine the Cutoff
From CHUNK 3
, only 6.811% of the policies in the whole dataset have a claim, so using a cutoff of 0.5 (the most intuitive value) would result in almost every policy being predicted to have no claim.
To make the logistic regression model produce more meaningful results, we set the cutoff to 0.06811.
Construct Confusion Matrix on the Training Set
In CHUNK 14
, we produce the predicted class probabilities and predicted class labels for the logistic regression model, and generate a confusion matrix on each of the training set and the test set using a cutoff of 0.06811.
It is important to note that:
-
- The first two arguments of the
confusionMatrix()
function must be factors, so we need to apply thefactor()
function to:class.train
,train$clm
,class.test
,test$clm
.
- By default, R will take the alpha-numerically first level (which is “0”, or no claims) as the level of interest. By adding the option
positive = "1"
to theconfusionMatrix()
function, we explicitly specify the factor level that corresponds to the “positive” result to be “1” - For some unknown reason, very few past PA exams do so. Without specifying the
positive
argument, the “sensitivity” and “specificity” in the output are, in fact, the specificity and sensitivity we are interested in.
- The first two arguments of the
# CHUNK 14 cutoff <- 0.06811 # you can try other values # Generate predicted probabilities on the training set pred.train <- predict(logit.offset, type = "response") # Turn predicted probabilities into predicted classes class.train <- 1*(pred.train > cutoff) # OR ifelse(pred.train > cutoff, 1, 0) confusionMatrix(factor(class.train), factor(data.train$clm), positive = "1")
Confusion Matrix and Statistics Reference Prediction 0 1 0 25474 1017 1 21885 2477 Accuracy : 0.5496 95% CI : (0.5453, 0.554) No Information Rate : 0.9313 P-Value [Acc > NIR] : 1 Kappa : 0.0655 Mcnemar's Test P-Value : <2e-16 Sensitivity : 0.70893 Specificity : 0.53789 Pos Pred Value : 0.10167 Neg Pred Value : 0.96161 Prevalence : 0.06871 Detection Rate : 0.04871 Detection Prevalence : 0.47907 Balanced Accuracy : 0.62341 'Positive' Class : 1
Construct Confusion Matrix on the Test Set
# CHUNK 14 (Cont.) # Generate predicted probabilities on the test set pred.test <- predict(logit.offset, newdata = data.test, type = "response") class.test <- 1*(pred.test > cutoff) confusionMatrix(factor(class.test), factor(data.test$clm), positive = "1")
Confusion Matrix and Statistics Reference Prediction 0 1 0 8483 332 1 7343 792 Accuracy : 0.5472 95% CI : (0.5397, 0.5547) No Information Rate : 0.9337 P-Value [Acc > NIR] : 1 Kappa : 0.0617 Mcnemar's Test P-Value : <2e-16 Sensitivity : 0.70463 Specificity : 0.53602 Pos Pred Value : 0.09736 Neg Pred Value : 0.96234 Prevalence : 0.06631 Detection Rate : 0.04673 Detection Prevalence : 0.47994 Balanced Accuracy : 0.62032 'Positive' Class : 1
Findings:
From the two confusion matrices, we can see that:
Sensitivity vs. specificity:
In both matrices, the accuracy is only marginally higher than 0.5, but the accuracy does not take into account the fact that the insurance company should be more interested in correctly classifying policyholders with a claim (who will file for payment) than correctly classifying policyholders without a claim. In other words,
the sensitivity of the logistic regression model (with “1”, claim occurrence, as the positive class) should be more of interest to the insurance company than the specificity.
We can see that the model has a decent sensitivity (close to 0.70) on both the training and test sets whereas its specificity is relatively low (about 0.54), meaning that
the model does a pretty good job of identifying policies with a claim, but it is not very good at identifying no-claim policies.
If time allows, it is a good move to tune the cutoff and try a wide range of values to decide on one that results in a better balance between the sensitivity and specificity.
Training vs. test:
The accuracy, sensitivity, and specificity of the model on the test set are all slightly lower than the corresponding quantities on the training set. This is an indication of a small (but not serious) extent of overfitting for the logistic regression model.
Determining AUCs and ROC Plots
The AUC of a classifier can be computed by the roc()
and auc()
functions from the pROC
package. In CHUNK 15
, we make an ROC plot and obtain the AUC of the logistic regression model on both the training and test sets.
# CHUNK 15 library(pROC) roc.train <- roc(data.train$clm, pred.train) auc(roc.train) roc.test <- roc(data.test$clm, pred.test) auc(roc.test)
> auc(roc.train) Area under the curve: 0.6647 > auc(roc.test) Area under the curve: 0.6581
Figure 4.3.5: The ROC plots for the full logistic regression model on the training set (left) and test set (right) in the dataCar
data.
The command par(pty = "s") can be used to generate a square (which is what “s” stands for) plotting region, which improves the appearance of the ROC plot.
Findings:
- The two AUCs are close to 0.66, which means that the model is making better (but perhaps not way better) predictions than by chance.
- The fact that the test AUC is slightly lower than the training AUC supports our earlier conclusion that there is a small extent of overfitting for the logistic regression model.
TASK 9: Select features using stepwise selection Some of the features may lack predictive power and result in overfitting.
When finished with this task, this will be your final model form. |
Binarization
In CHUNK 16, we use the dummyVars()
function from the caret
package to binarize the categorical predictors with three or more levels, which include agecat
, area
, and veh_age
.
# CHUNK 16 library(caret) binarizer <- dummyVars(~ agecat + area + veh_age, data = dataCar, fullRank = TRUE) binarized.vars <- data.frame(predict(binarizer, newdata = dataCar)) head(binarized.vars)
agecat.1 agecat.2 agecat.3 agecat.5 agecat.6 area.A area.B area.D area.E area.F veh_age.1 veh_age.2 veh_age.4 1 0 1 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 1 0 0 0 0 0 1 0 3 0 1 0 0 0 0 0 0 1 0 0 1 0 4 0 1 0 0 0 0 0 1 0 0 0 1 0 5 0 1 0 0 0 0 0 0 0 0 0 0 1 6 0 0 0 0 0 0 0 0 0 0 0 0 0
> binarizer$lvls $agecat [1] "4" "1" "2" "3" "5" "6" $area [1] "C" "A" "B" "D" "E" "F" $veh_age [1] "3" "1" "2" "4"
Then we attach the binarized variables to the dataCar
dataset (with a .bin
suffix), delete the original categorical variables, and reuse the partition vector generated in CHUNK 11
to partition
the dataset into the training and test sets, now with binarized variables.
# CHUNK 16 (Cont.) # Attach the binarized variables dataCar.bin <- cbind(dataCar, binarized.vars) # Remove the original factor variables dataCar.bin$agecat <- NULL dataCar.bin$area <- NULL dataCar.bin$veh_age <- NULL # Set up the binarized training and test sets data.train.bin <- dataCar.bin[partition, ] data.test.bin <- dataCar.bin[-partition, ]
In CHUNK 17
, we fit the logistic regression model (with interaction) to the binarized training set and print a model summary.
# CHUNK 17 # Fit the Logistic regression model to the binarized training set logit.full <- glm(clm ~ . - exposure + log_veh_value:veh_body, data = data.train.bin, offset = log(exposure), family = binomial) summary (logit.full)
Call: glm(formula = clm ~ . - exposure + log_veh_value:veh_body, family = binomial, data = data.train.bin, offset = log(exposure)) Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -1.8401822 0.0553407 -33.252 < 2e-16 *** veh_bodyBUS-MCARA 1.2064692 0.5839503 2.066 0.038824 * genderM -0.0471309 0.0367956 -1.281 0.200233 log_veh_value 0.1470760 0.0400693 3.671 0.000242 *** agecat.1 0.2960493 0.0653958 4.527 5.98e-06 *** agecat.2 0.0788806 0.0537650 1.467 0.142339 agecat.3 -0.0003662 0.0514177 -0.007 0.994317 agecat.5 -0.2342049 0.0604906 -3.872 0.000108 *** agecat.6 -0.2749970 0.0746007 -3.686 0.000228 *** area.A -0.0418634 0.0489577 -0.855 0.392500 area.B 0.0839797 0.0503891 1.667 0.095589 . area.D -0.1630404 0.0637039 -2.559 0.010487 * area.E -0.1027563 0.0706512 -1.454 0.145831 area.F 0.0636621 0.0798244 0.798 0.425145 veh_age.1 -0.0237603 0.0577435 -0.411 0.680720 veh_age.2 0.0707487 0.0489945 1.444 0.148735 veh_age.4 -0.0258168 0.0530979 -0.486 0.626818 veh_bodyBUS-MCARA:log_veh_value -0.5307286 0.6035585 -0.879 0.379221 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 24567 on 50852 degrees of freedom Residual deviance: 24423 on 50835 degrees of freedom AIC: 24459 Number of Fisher Scoring iterations: 5
The summary output is identical to that in CHUNK 13
, showing that the binarization is done properly and does not have any effect on the model.
Stepwise Selection
In CHUNK 18
, we run the stepAIC()
function to perform forward selection using the BIC as requested, which represents a more conservative approach to feature selection, allowing the insurance company to identify key factors affecting claim occurrence rates.
# CHUNK 18 library(MASS) logit.null <- glm(clm ~ 1, data= data.train.bin, offset= log(exposure), family= binomial) logit.reduced <- stepAIC(logit.null, direction= "forward", k = log(nrow(data.train.bin)), scope = list(upper = logit.full, lower = logit.null)) summary(logit.reduced)
Start: AIC=24578.19 clm ~ 1 Df Deviance AIC + log_veh_value 1 24532 24553 + agecat.1 1 24533 24554 + agecat.5 1 24541 24563 + agecat.6 1 24542 24564 + veh_age.4 1 24549 24571 + veh_age.2 1 24554 24575 <none> 24567 24578 + area.D 1 24558 24580 + agecat.2 1 24559 24581 + area.F 1 24561 24583 + veh_body 1 24562 24583 + area.B 1 24562 24584 + area.E 1 24566 24587 + gender 1 24566 24588 + veh_age.1 1 24566 24588 + area.A 1 24566 24588 + agecat.3 1 24567 24589 Step: AIC=24553.41 clm ~ log_veh_value Df Deviance AIC + agecat.1 1 24498 24531 + agecat.5 1 24505 24538 + agecat.6 1 24512 24544 <none> 24532 24553 + area.D 1 24522 24554 + area.B 1 24524 24557 + agecat.2 1 24524 24557 + veh_age.2 1 24527 24560 + veh_body 1 24527 24560 + gender 1 24528 24561 + area.F 1 24529 24561 + area.E 1 24529 24562 + veh_age.1 1 24530 24563 + veh_age.4 1 24531 24563 + area.A 1 24531 24564 + agecat.3 1 24532 24564 Step: AIC=24530.58 clm ~ log_veh_value + agecat.1 Df Deviance AIC + agecat.5 1 24479 24522 + agecat.6 1 24482 24526 + agecat.2 1 24485 24528 <none> 24498 24531 + area.D 1 24488 24532 + area.B 1 24490 24534 + veh_body 1 24494 24537 + veh_age.2 1 24494 24537 + gender 1 24495 24538 + area.F 1 24496 24539 + area.E 1 24496 24539 + agecat.3 1 24496 24539 + veh_age.1 1 24496 24540 + veh_age.4 1 24497 24541 + area.A 1 24497 24541 Step: AIC=24522.16 clm ~ log_veh_value + agecat.1 + agecat.5 Df Deviance AIC + agecat.6 1 24457 24511 <none> 24479 24522 + area.D 1 24469 24524 + area.B 1 24470 24525 + agecat.2 1 24472 24526 + veh_body 1 24474 24528 + veh_age.2 1 24475 24529 + gender 1 24476 24530 + area.E 1 24476 24530 + area.F 1 24477 24531 + veh_age.1 1 24477 24531 + veh_age.4 1 24478 24532 + area.A 1 24478 24532 + agecat.3 1 24479 24533 Step: AIC=24511.36 clm ~ log_veh_value + agecat.1 + agecat.5 + agecat.6 Df Deviance AIC <none> 24457 24511 + area.D 1 24448 24514 + area.B 1 24449 24514 + veh_body 1 24452 24517 + veh_age.2 1 24452 24517 + agecat.2 1 24454 24519 + area.E 1 24455 24520 + gender 1 24455 24520 + veh_age.1 1 24456 24521 + area.F 1 24456 24521 + veh_age.4 1 24456 24521 + agecat.3 1 24456 24522 + area.A 1 24456 24522 > summary(logit.reduced) Call: glm(formula = clm ~ log_veh_value + agecat.1 + agecat.5 + agecat.6, family = binomial, data = data.train.bin, offset = log(exposure)) Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -1.84794 0.02533 -72.959 < 2e-16 *** log_veh_value 0.15913 0.02917 5.455 4.90e-08 *** agecat.1 0.27479 0.05845 4.701 2.59e-06 *** agecat.5 -0.25995 0.05300 -4.905 9.36e-07 *** agecat.6 -0.30815 0.06856 -4.495 6.97e-06 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 24567 on 50852 degrees of freedom Residual deviance: 24457 on 50848 degrees of freedom AIC: 24467 Number of Fisher Scoring iterations: 5
The final model carries only four features:
log_veh_value
, agecat.1
, agecat.5
, and agecat.6.
CHUNK 19
generates the AUC for the final model.
# CHUNK 19 pred.reduced <- predict(logit.reduced, newdata = data.test.bin, type= "response") roc.reduced <- roc(data.test$clm, pred.reduced) auc(roc.reduced)
Area under the curve: 0.6595
The test AUC of 0.6595 is slightly higher than that of the full model (0.6581). Not only does the final model have slightly better prediction performance, it is also much simpler, with only four features. In terms of both prediction performance and interpretability, the final model is superior, and we will use it in the final task.
Interpretation of Model Results
TASK 10: Interpret the model
|
In CHUNK 20
, we refit the reduced logistic regression model to the full dataset (note the argument data = dataCar.bin
, not data = data.train.bin
) and save it as a glm
object named logit.final
.
# CHUNK 20 logit.final <- glm(clm ~ log_veh_value + agecat.1 + agecat.5 + agecat.6, data= dataCar.bin, offset= log(exposure), family= binomial) summary(logit.final) # Exponentiate the coefficients to get the multiplicative effects exp(coef(logit.final))
Call: glm(formula = clm ~ log_veh_value + agecat.1 + agecat.5 + agecat.6, family = binomial, data = dataCar.bin, offset = log(exposure)) Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -1.86185 0.02205 -84.438 < 2e-16 *** log_veh_value 0.15843 0.02532 6.257 3.92e-10 *** agecat.1 0.25731 0.05138 5.008 5.51e-07 *** agecat.5 -0.25433 0.04603 -5.525 3.29e-08 *** agecat.6 -0.23858 0.05784 -4.125 3.72e-05 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 32572 on 67802 degrees of freedom Residual deviance: 32445 on 67798 degrees of freedom AIC: 32455 Number of Fisher Scoring iterations: 5 > # Exponentiate the coefficients to get the multiplicative effects > exp(coef(logit.final)) (Intercept) log_veh_value agecat.1 agecat.5 agecat.6 0.1553857 1.1716752 1.2934442 0.7754369 0.7877484
There are two ways to interpret the output of a fitted logistic regression model.
Odds-Based Interpretation
Table 4.4 provides the odds-based interpretation of the final model using the value of each estimated coefficient. The statements there are in terms of multiplicative changes in the odds of claim occurrence. Equivalently, we can also state the findings in terms of percentage changes.
Table 4.4: Odds-based interpretation of the reduced logistic regression model fitted to the full dataCar
dataset.
Feature | Coefficient Estimate | Interpretation
log_veh_value | 0.15843 | A unit increase in the log of vehicle value is associated with a multiplicative increase of \(e^{0.15843} = 1.1717\) (> 1) in the odds of claim occurrence, holding all other features fixed. The fact that this increase exceeds 1 makes intuitive sense: the more valuable the vehicle, the more likely a driver is to submit insurance claims in the case of accidental damage.
agecat.1 | 0.25731 | The odds of claim occurrence for policyholders in age category 1 are \(e^{0.25731} = 1.2934\) (> 1) times those in the baseline level (which is now categories 2, 3, and 4 combined). It is possible that younger drivers are more reckless and tend to have higher accident rates.
agecat.5 | -0.25433 | The odds of claim occurrence for policyholders in age category 5 are \(e^{-0.25433} = 0.7754\) (< 1) times those in the baseline level. This also aligns with intuition, as more mature drivers are more careful and tend to have lower accident rates.
agecat.6 | -0.23858 | The odds of claim occurrence for policyholders in age category 6 are \(e^{-0.23858} = 0.7877\) (< 1) times those in the baseline level.
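The same multiplicative effects can be restated as percentage changes in the odds. A minimal sketch using the logit.final object from CHUNK 20 (the rounding to two decimal places is our own choice):

# Percentage change in the odds of claim occurrence associated with each feature
round(100 * (exp(coef(logit.final)) - 1), 2)
# e.g., a unit increase in log_veh_value is associated with roughly a 17.17% increase
# in the odds, while being in age category 5 lowers the odds by about 22.46%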
Probability-Based Interpretation
Steps:
1. Look up the mean values of the numeric predictors and the modal (i.e., most common) levels of the categorical predictors.
# CHUNK 21 # To look at mean or modal values summary(dataCar)
clm exposure veh_body veh_age gender area agecat log_veh_value Min. :0.00000 Min. :0.002738 OTHER :67644 3:20060 F:38584 C:20530 4:16179 Min. :-1.71480 1st Qu.:0.00000 1st Qu.:0.219028 BUS-MCARA: 159 1:12254 M:29219 A:16302 1: 5734 1st Qu.: 0.00995 Median :0.00000 Median :0.446270 2:16582 B:13328 2:12868 Median : 0.40547 Mean :0.06811 Mean :0.468481 4:18907 D: 8161 3:15757 Mean : 0.38675 3rd Qu.:0.00000 3rd Qu.:0.709103 E: 5907 5:10722 3rd Qu.: 0.76547 Max. :1.00000 Max. :0.999316 F: 3575 6: 6543 Max. : 3.54270
2. Create a new data frame holding the feature combinations we want to explore.
new.data <- data.frame( exposure = c(0.468481, 0.515329, 0.468481, 0.468481, 0.468481, 0.468481), log_veh_value = c(0.38675, 0.38675, 0.42543, 0.38675, 0.38675, 0.38675), agecat.1 = c(0, 0, 0, 1, 0, 0), agecat.5 = c(0, 0, 0, 0, 1, 0), agecat.6 = c(0, 0, 0, 0, 0, 1) ) new.data
exposure log_veh_value agecat.1 agecat.5 agecat.6 1 0.468481 0.38675 0 0 0 2 0.515329 0.38675 0 0 0 3 0.468481 0.42543 0 0 0 4 0.468481 0.38675 1 0 0 5 0.468481 0.38675 0 1 0 6 0.468481 0.38675 0 0 1
3. In each of the five remaining rows, we alter the value of one and only one feature (exposure
, log_veh_value
, agecat
, in that order) to look at the effect of that predictor on the predicted claim occurrence probability.
predict(logit.final, newdata = new.data, type = "response")
1 2 3 4 5 6 0.07183549 0.07845544 0.07224517 0.09099701 0.05661722 0.05746447
Table 4.5 summarizes the results (the feature that is changed is shown in bold).
exposure | log_veh_value | agecat.1 | agecat.5 | agecat.6 | Predicted Probability | Interpretation
0.468481 | 0.38675 | 0 | 0 | 0 | 0.07183549 | (baseline feature combination)
0.515329 | 0.38675 | 0 | 0 | 0 | 0.07845544 | A 10% increase in exposure increases the claim occurrence probability by 0.0785 - 0.0718 = 0.0066.
0.468481 | 0.42543 | 0 | 0 | 0 | 0.07224517 | A 10% increase in log_veh_value increases the claim occurrence probability by 0.0722 - 0.0718 = 0.0004.
0.468481 | 0.38675 | 1 | 0 | 0 | 0.09099701 | Being in age category 1 increases the claim occurrence probability by 0.0910 - 0.0718 = 0.0192 compared to being in the baseline age category.
0.468481 | 0.38675 | 0 | 1 | 0 | 0.05661722 | Being in age category 5 decreases the claim occurrence probability by 0.0718 - 0.0566 = 0.0152 compared to being in the baseline age category.
0.468481 | 0.38675 | 0 | 0 | 1 | 0.05746447 | Being in age category 6 decreases the claim occurrence probability by 0.0718 - 0.0575 = 0.0143 compared to being in the baseline age category.
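As a check on the probability-based interpretation, the first predicted probability in Table 4.5 can be reproduced by hand from the CHUNK 20 coefficient estimates (a minimal sketch; the linear predictor includes the offset log(exposure) and the inverse logit is then applied):

# Reproduce the first predicted probability in Table 4.5
# (coefficients taken from the CHUNK 20 summary output)
eta <- log(0.468481) +          # offset: log of the mean exposure
  (-1.86185) +                  # intercept
  0.15843 * 0.38675             # log_veh_value at its mean; all age dummies are 0
p <- 1 / (1 + exp(-eta))        # inverse logit
p                               # approximately 0.0718, matching row 1 of Table 4.5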
4.4. Generalized Linear Models Case Study 3: GLMs for Count and Aggregate Loss Variables
Learning Objectives
- Select appropriate distributions and link functions for count and severity variables.
- Identify appropriate offsets and weights for count and severity variables.
- Implement GLMs for count and severity variables in R.
- Assess the quality of a Poisson GLM using the Pearson goodness-of-fit statistic.
- Combine the GLMs for count and severity variables to make predictions for an aggregate loss variable.
Stage 1: Define the Business Problem
Objective
To identify important determinants of the two target variables and to use these determinants to make accurate predictions with GLMs.
Stage 2: Data Collection
Full Data Set
Run CHUNK 1
to read in the Swedish dataset and print a summary of each variable. There are 2182 observations and 7 variables.
# CHUNK 1 Swedish <- read.csv("SwedishMotorInsurance.csv", stringsAsFactors=TRUE) summary(Swedish)
Kilometres Zone Bonus Make Insured Claims Payment Min. :1.000 Min. :1.00 Min. :1.000 Min. :1.000 Min. : 0.01 Min. : 0.00 Min. : 0 1st Qu.:2.000 1st Qu.:2.00 1st Qu.:2.000 1st Qu.:3.000 1st Qu.: 21.61 1st Qu.: 1.00 1st Qu.: 2989 Median :3.000 Median :4.00 Median :4.000 Median :5.000 Median : 81.53 Median : 5.00 Median : 27404 Mean :2.986 Mean :3.97 Mean :4.015 Mean :4.992 Mean : 1092.20 Mean : 51.87 Mean : 257008 3rd Qu.:4.000 3rd Qu.:6.00 3rd Qu.:6.000 3rd Qu.:7.000 3rd Qu.: 389.78 3rd Qu.: 21.00 3rd Qu.: 111954 Max. :5.000 Max. :7.00 Max. :7.000 Max. :9.000 Max. :127687.27 Max. :3338.00 Max. :18245026
Findings:
- The summary does not suggest any missing or abnormal values.
Data Description
Variable | Description | Characteristics |
Kilometres |
Distance driven by a vehicle in km per year, grouped into five categories (it is not exactly the distance driven, but a higher number does mean a longer distance) | Integer from 1 to 5
|
Zone |
Geographic zone of a vehicle | Integer from 1 to 7 |
Bonus |
No claims bonus, equal to the number of years, plus one, since the last claim | Integer from 1 to 7 |
Make |
Type of a vehicle | Integer from 1 to 8 representing eight different common car models. All other models are combined in class 9. |
Insured |
Number of policyholder years (fraction of the year that the policyholder has a contract with the issuing company) | Numeric from 0.01 to 127687.27 |
Claims |
Number of claims | Integer from 0 to 3338
Payment |
Sum of payments in Swedish kroner | Integer from 0 to 18245026 |
TASK 1: Select offsets Select an appropriate offset for each of the two target variables, claim counts and aggregate payments.
|
Choosing Offsets for Claims
and Payment
There are two target variables in this case study,
- the number of claims (
Claims
) and - the aggregate payments (
Payment
).
The Number of Claims: Claims
It is reasonable to expect that Claims is proportional to the number of policyholder years (Insured), which provides an exposure basis for Claims.
Other things equal, the longer a policyholder is covered by an automobile insurance policy, the larger the number of claims submitted.
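With a log link (to be used in the GLM stage), this proportionality is encoded by supplying \(\ln(\text{Insured})\) as an offset, i.e., a predictor whose coefficient is fixed at 1:
\(\ln(E[\text{Claims}]) = \ln(\text{Insured}) + \eta \quad\Longleftrightarrow\quad E[\text{Claims}] = \text{Insured}\times e^{\eta},\)
so that, other things equal, the expected number of claims scales in direct proportion to the number of policyholder years.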
The Aggregate Payments: Payment
In the same vein, we expect that aggregate payments vary in proportion to the number of claims.
Each claim generates a payment (see CHUNK 3
) and it makes intuitive sense that:
With more claims submitted, the sum of payments becomes larger.
At this stage, we can conjecture that:
Insured
may be a good offset forClaims
andClaims
in turn may serve as a good offset forPayment
In CHUNK 2
, we construct two scatterplots, one for Claims
against Insured
and one for Payment
against Claims
.
# CHUNK 2 ggplot(Swedish, aes(x = Insured, y = Claims)) + geom_point() ggplot(Swedish, aes(x = Claims, y = Payment)) + geom_point()
Findings:
- It is clear that there is a strong proportional relationship between
Claims
andInsured
and betweenPayment
andClaims
.
This supports the use ofInsured
andClaims
as offsets (to be included on a suitable scale) forClaims
andPayment
, respectively.
Although not strictly required by the task statement, we can make two additional observations involving Claims
, Insured
, and Payment
:
1. If Claims = 0
, then Payment = 0
.
# CHUNK 3 summary(Swedish[Swedish$Claims == 0, ]$Payment) Swedish.severity <- Swedish[Swedish$Claims > 0, ]
Min. 1st Qu. Median Mean 3rd Qu. Max. 0 0 0 0 0 0
2. Create two variables:
- ClaimsPerinsured: the number of claims per policyholder year
- PaymentPerClaim: the payment per claim (i.e., the claim severity)
# CHUNK 4 Swedish$ClaimsPerinsured <- Swedish$Claims / Swedish$Insured Swedish.severity$PaymentPerClaim <- Swedish.severity$Payment / Swedish.severity$Claims
Generate scatterplots for these two variables:
# CHUNK 4 (cont.) ggplot(Swedish, aes(x = Insured, y = ClaimsPerinsured)) + geom_point() ggplot(Swedish.severity, aes(x = Claims, y = PaymentPerClaim)) + geom_point()
Findings:
- The size as well as the variability of
ClaimsPerinsured
andPaymentPerClaim
shrink significantly withInsured
andClaims
, respectively, showing that averaging goes a long way towards stabilizing the two target variables.
TASK 2: Edit and explore the data Edit and explore the data to prepare for further analysis and identify potentially predictive variables.
|
Data Quality Issue
Factorize Zone
and Make
as they do not convey any numerical order:
# CHUNK 5 Swedish$Zone <- as.factor(Swedish$Zone) Swedish$Make <- as.factor(Swedish$Make) Swedish.severity$Zone <- as.factor(Swedish.severity$Zone) Swedish.severity$Make <- as.factor(Swedish.severity$Make)
Stage 3: Exploratory Data Analysis
Distributions of the Target Variables
In CHUNK 6
, we explore the distributions of the two offset-corrected target variables, ClaimsPerinsured
and PaymentPerClaim
.
# CHUNK 6 summary(Swedish$ClaimsPerinsured) summary(Swedish.severity$PaymentPerClaim) ggplot(Swedish, aes(x = ClaimsPerinsured)) + geom_histogram()
> summary(Swedish$ClaimsPerinsured) Min. 1st Qu. Median Mean 3rd Qu. Max. 0.00000 0.02648 0.05123 0.06914 0.08906 1.66667 > summary(Swedish.severity$PaymentPerClaim) Min. 1st Qu. Median Mean 3rd Qu. Max. 72 2643 4375 5206 5985 31442
Findings:
- Both
ClaimsPerinsured
andPaymentPerClaim
are non-negative continuous variables (whileClaims
takes only non-negative integers, the division byInsured
makesClaimsPerinsured
continuous). The former ranges from 0 (corresponding to no claims) to 1.66667 while the latter ranges from 72 to 31,442. - Even after being accounted for by the offsets, the two target variables exhibit a pronounced right skew, especially for
PaymentPerClaim
, as shown in the histograms and reflected by their means substantially exceeding their medians. We will take the right skew into account when developing GLMs later.
Potentially Useful Predictors
In search of potentially predictive variables out of Kilometres
, Bonus
, Zone
, and Make
, CHUNK 7
provides the means and medians of the two offset-corrected target variables split by distinct values or levels of the four predictors. (Split boxplots can also be used here.)
# CHUNK 7 library(tidyverse) vars <- c("Kilometres", "Bonus", "Zone", "Make") for (i in vars) { x <- Swedish%>% group_by_(i) %>% summarise( mean = mean(ClaimsPerinsured), median = median(ClaimsPerinsured), n = n() ) print(x) } # A tibble: 5 × 4 Kilometres mean median n <int> <dbl> <dbl> <int> 1 1 0.0561 0.0425 439 2 2 0.0651 0.0488 441 3 3 0.0718 0.0552 441 4 4 0.0705 0.0555 434 5 5 0.0827 0.0574 427 # A tibble: 7 × 4 Bonus mean median n <int> <dbl> <dbl> <int> 1 1 0.129 0.117 307 2 2 0.0792 0.0741 312 3 3 0.0676 0.0591 310 4 4 0.0659 0.0514 310 5 5 0.0550 0.0454 313 6 6 0.0524 0.0447 315 7 7 0.0364 0.0350 315 # A tibble: 7 × 4 Zone mean median n <fct> <dbl> <dbl> <int> 1 1 0.104 0.0862 315 2 2 0.0795 0.0645 315 3 3 0.0722 0.0575 315 4 4 0.0575 0.0473 315 5 5 0.0626 0.0469 313 6 6 0.0569 0.0435 315 7 7 0.0504 0 294 # A tibble: 9 × 4 Make mean median n <fct> <dbl> <dbl> <int> 1 1 0.0761 0.0614 245 2 2 0.0802 0.0607 245 3 3 0.0576 0.0389 242 4 4 0.0333 0.0216 238 5 5 0.0919 0.0691 244 6 6 0.0543 0.0430 244 7 7 0.0838 0.0555 242 8 8 0.0729 0.0487 237 9 9 0.0712 0.0630 245 |
# CHUNK 7 (cont.) for (i in vars) { x <- Swedish.severity%>% group_by_(i) %>% summarise( mean= mean(PaymentPerClaim), median= median(PaymentPerClaim), n = n() ) print(x) } # A tibble: 5 × 4 Kilometres mean median n <int> <dbl> <dbl> <int> 1 1 4912. 4239. 381 2 2 5068. 4650. 399 3 3 5445. 4494. 378 4 4 5065. 3901. 328 5 5 5601. 4207. 311 # A tibble: 7 × 4 Bonus mean median n <int> <dbl> <dbl> <int> 1 1 4913. 4195. 252 2 2 5451. 4238. 246 3 3 4899. 3974. 246 4 4 5257. 4095. 240 5 5 5188. 3900 247 6 6 5251. 4464. 270 7 7 5440. 5107. 296 # A tibble: 7 × 4 Zone mean median n <fct> <dbl> <dbl> <int> 1 1 4959. 4399. 295 2 2 4569. 4142. 295 3 3 5167. 4500. 293 4 4 5411. 4904. 306 5 5 5625. 4145. 236 6 6 5702. 4775. 264 7 7 5016. 2605. 108 # A tibble: 9 × 4 Make mean median n <fct> <dbl> <dbl> <int> 1 1 5083. 4828. 227 2 2 4854. 4313. 205 3 3 5950. 4438. 178 4 4 4877. 3367 155 5 5 4986. 3797. 206 6 6 5193. 4482. 212 7 7 4759. 3429 197 8 8 6603. 4387 173 9 9 4849. 4791. 244
|
# Remove redundant variables
rm(i, vars, x)
Findings:
ClaimsPerinsured
:
It appears that ClaimsPerinsured
has a monotonic relationship with both Kilometres
and Bonus
, being increasing in Kilometres
and decreasing in Bonus
.
These findings make intuitive sense as:
a longer distance traveled indicates a more worn-out vehicle and a smaller
Bonus
suggests poorer driving experience of a policyholder.
The average of ClaimsPerinsured
is also particularly low in certain geographical zones such as 4, 5, 6, and 7, and for certain car models such as 3, 4, and 6.
PaymentPerClaim
:
The effects of the four predictors on PaymentPerClaim
are not as pronounced as for ClaimsPerinsured
. Unlike ClaimsPerinsured
, Kilometres
and Bonus
do not seem to exert a monotonic effect on PaymentPerClaim
.
We can see that PaymentPerClaim
has relatively close means and medians across different values of Kilometres
and Bonus
, and relatively high means and medians for Zone 4 and Zone 6, and relatively low medians for cars of Make 4 and Make 7.
TASK 3: Construct a GLM for
|
Stage 4: Model Construction and Selection
Creation of Training and Test sets
In CHUNK 8
, we do the stratified sampling based on the number of claims per insured, ClaimsPerinsured
.
# CHUNK 8 library(caret) set.seed(14) train_ind <- createDataPartition(Swedish$ClaimsPerinsured, p = 0.7, list = FALSE) data.train <- Swedish[train_ind, ] data.test <- Swedish[-train_ind, ] # Takeout the severity portions of the training and test sets data.train.severity <- data.train[data.train$Claims > 0, ] data.train.severity$PaymentPerClaim <- data.train.severity$Payment / data.train.severity$Claims data.test.severity <- data.test[data.test$Claims > 0, ] data.test.severity$PaymentPerClaim <- data.test.severity$Payment / data.test.severity$Claims print("TRAIN") mean(data.train$ClaimsPerinsured) mean(data.train.severity$PaymentPerClaim) print("TEST") mean(data.test$ClaimsPerinsured) mean(data.test.severity$PaymentPerClaim)
> print("TRAIN") [1] "TRAIN" > mean(data.train$ClaimsPerinsured) [1] 0.06896561 > mean(data.train.severity$PaymentPerClaim) [1] 5253.53 > print("TEST") [1] "TEST" > mean(data.test$ClaimsPerinsured) [1] 0.06955017 > mean(data.test.severity$PaymentPerClaim) [1] 5094.656
Findings:
- The average number of claims per insured on the training set and that on the test set are very close to each other.
Fitting the Model: A GLM for Claims
with Kilometres
and Bonus
as numeric variables.
- Distribution: the Poisson distribution is a suitable target distribution, because
Claims
is a non-negative integer-valued variable with a right skew. - Link: log link is most commonly employed in conjunction with the Poisson distribution as:
- it ensures that the model predictions are always non-negative
- makes for an interpretable model, and
- is the canonical link.
In CHUNK 9
, we fit a log-link Poisson GLM for Claims
on the training set using Kilometres
, Zone
, Bonus
, and Make
as predictors and the logarithm of Insured
as an offset. Here Kilometres
and Bonus
are treated as numeric variables.
# CHUNK 9
count.numeric <- glm(Claims ~ Kilometres + Zone + Bonus + Make,
                     data = data.train, offset = log(Insured),
                     family = poisson)
summary(count.numeric)
Call: glm(formula = Claims ~ Kilometres + Zone + Bonus + Make, family = poisson, data = data.train, offset = log(Insured)) Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -1.813830 0.016746 -108.311 < 2e-16 *** Kilometres 0.131218 0.003028 43.330 < 2e-16 *** Zone2 -0.230780 0.011698 -19.728 < 2e-16 *** Zone3 -0.382588 0.012185 -31.397 < 2e-16 *** Zone4 -0.573191 0.010981 -52.197 < 2e-16 *** Zone5 -0.328492 0.018044 -18.205 < 2e-16 *** Zone6 -0.522192 0.014480 -36.063 < 2e-16 *** Zone7 -0.727917 0.050782 -14.334 < 2e-16 *** Bonus -0.201320 0.001609 -125.103 < 2e-16 *** Make2 0.047689 0.026055 1.830 0.0672 . Make3 -0.268146 0.033077 -8.107 5.20e-16 *** Make4 -0.687657 0.027191 -25.290 < 2e-16 *** Make5 0.154635 0.023003 6.722 1.79e-11 *** Make6 -0.350481 0.020309 -17.258 < 2e-16 *** Make7 -0.061202 0.027085 -2.260 0.0238 * Make8 -0.074676 0.040067 -1.864 0.0624 . Make9 -0.069484 0.011548 -6.017 1.78e-09 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for poisson family taken to be 1) Null deviance: 23800.5 on 1529 degrees of freedom Residual deviance: 2681.6 on 1513 degrees of freedom AIC: 8059.4 Number of Fisher Scoring iterations: 4
Note:
Payment
andClaimsPerinsured
are excluded because they cannot serve as predictors of Claims. Insured is moved to the offset argument on the log scale to play its role as an exposure.
Findings:
- The model summary shows that the coefficient estimates for
Kilometres
andBonus
are positive and negative, respectively, consistent with what we saw inCHUNK 7
. - All features, including the individual factor levels of
Zone
andMake
(except for Levels 2 and 8), are statistically significant, indicating that they are valuable for accounting forClaims
and obviating the need for feature removal.
Digression: Can we model ClaimsPerinsured
directly?
Note that because of the log link, the equation of the Poisson GLM we have just fitted is equivalent to:
\(\ln(E[\text{ClaimsPerinsured}])=\ln\left(\dfrac{E[\text{Claims}]}{\text{Insured}}\right)=\eta\),
where η is the linear predictor comprising Kilometres
, Zone
, Bonus
, and Make
. You may wonder:
To save work, can we instead fit a log-link Poisson GLM directly to
ClaimsPerinsured
without having to supplylog(Insured)
as an offset?
This is explored in CHUNK 10
.
# CHUNK 10 freq <- glm(ClaimsPerinsured ~ Kilometres + Zone + Bonus + Make, data= data.train, weight = Insured, family = poisson) summary(freq)
> summary(freq) Call: glm(formula = ClaimsPerinsured ~ Kilometres + Zone + Bonus + Make, family = poisson, data = data.train, weights = Insured) Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -1.813830 0.016747 -108.311 < 2e-16 *** Kilometres 0.131218 0.003028 43.330 < 2e-16 *** Zone2 -0.230780 0.011698 -19.728 < 2e-16 *** Zone3 -0.382588 0.012185 -31.397 < 2e-16 *** Zone4 -0.573191 0.010981 -52.197 < 2e-16 *** Zone5 -0.328492 0.018044 -18.205 < 2e-16 *** Zone6 -0.522192 0.014480 -36.063 < 2e-16 *** Zone7 -0.727917 0.050784 -14.334 < 2e-16 *** Bonus -0.201320 0.001609 -125.103 < 2e-16 *** Make2 0.047689 0.026055 1.830 0.0672 . Make3 -0.268146 0.033077 -8.107 5.20e-16 *** Make4 -0.687657 0.027191 -25.290 < 2e-16 *** Make5 0.154635 0.023003 6.722 1.79e-11 *** Make6 -0.350481 0.020309 -17.258 < 2e-16 *** Make7 -0.061202 0.027085 -2.260 0.0238 * Make8 -0.074676 0.040067 -1.864 0.0624 . Make9 -0.069484 0.011548 -6.017 1.78e-09 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for poisson family taken to be 1) Null deviance: 23800.5 on 1529 degrees of freedom Residual deviance: 2681.6 on 1513 degrees of freedom AIC: Inf Number of Fisher Scoring iterations: 6
Notice that:
y = ClaimsPerinsured
and no offset is specified.- Due to the averaging by
Insured
, the variability ofClaimsPerinsured
is lowered and decreases with the value ofInsured
- With one exception, the summary for this model is identical to that in
CHUNK 9
, indicating that the two fitted models are essentially the same. - AIC= Inf , because of the non-integer values of
ClaimsPerinsured
(whereas a Poisson random variable only takes non-negative integer values). This inconsistency between the nature ofClaimsPerinsured
and a genuine Poisson random variable prohibits theglm()
function from computing the AIC of the frequency model.
As a result, no feature selection based on the AIC can be performed.
A GLM for Claims with Kilometres
and Bonus
as factor variables
# CHUNK 11 count.factor <- glm(Claims ~ factor(Kilometres) + Zone + factor(Bonus) + Make, data = data.train, offset = log(Insured), family= poisson) summary(count.factor)
Call: glm(formula = Claims ~ factor(Kilometres) + Zone + factor(Bonus) + Make, family = poisson, data = data.train, offset = log(Insured)) Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -1.793205 0.016512 -108.598 < 2e-16 *** factor(Kilometres)2 0.194910 0.010023 19.446 < 2e-16 *** factor(Kilometres)3 0.294007 0.009877 29.766 < 2e-16 *** factor(Kilometres)4 0.370948 0.014373 25.809 < 2e-16 *** factor(Kilometres)5 0.559839 0.014991 37.345 < 2e-16 *** Zone2 -0.239285 0.011750 -20.364 < 2e-16 *** Zone3 -0.382898 0.012331 -31.053 < 2e-16 *** Zone4 -0.574708 0.011185 -51.382 < 2e-16 *** Zone5 -0.337239 0.018218 -18.511 < 2e-16 *** Zone6 -0.532676 0.014746 -36.123 < 2e-16 *** Zone7 -0.708291 0.050845 -13.930 < 2e-16 *** factor(Bonus)2 -0.453835 0.015951 -28.453 < 2e-16 *** factor(Bonus)3 -0.699796 0.017949 -38.987 < 2e-16 *** factor(Bonus)4 -0.823508 0.018985 -43.376 < 2e-16 *** factor(Bonus)5 -0.906880 0.017397 -52.129 < 2e-16 *** factor(Bonus)6 -0.983759 0.014285 -68.864 < 2e-16 *** factor(Bonus)7 -1.332578 0.010936 -121.853 < 2e-16 *** Make2 0.049512 0.026094 1.897 0.0578 . Make3 -0.269633 0.033111 -8.143 3.85e-16 *** Make4 -0.661879 0.027371 -24.182 < 2e-16 *** Make5 0.152141 0.023016 6.610 3.84e-11 *** Make6 -0.342875 0.020333 -16.863 < 2e-16 *** Make7 -0.059942 0.027092 -2.213 0.0269 * Make8 -0.086094 0.040131 -2.145 0.0319 * Make9 -0.067240 0.011726 -5.734 9.80e-09 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for poisson family taken to be 1) Null deviance: 23800.5 on 1529 degrees of freedom Residual deviance: 2016.8 on 1505 degrees of freedom AIC: 7410.6 Number of Fisher Scoring iterations: 4
Findings:
- The estimates of the dummy variables of
Kilometres
are all positive and increase in value from level 2 to level 5, while the opposite relationship is observed for those of Bonus from level 2 to level 7, consistent with the monotonic effects ofKilometres
andBonus
onClaimsPerinsured
we observed earlier. - The binarization of
Kilometres
andBonus
, however, allows us to capture the effects of these two variables more effectively across different factor levels, at the cost of an increase in the number of features in the model and computational burden.
Using the Pearson statistic as an arbitrator
In CHUNK 12
, we generate the predictions of the two Poisson GLMs on the test set and evaluate the Pearson goodness-of-fit statistic for both models. Note that the predictions are for Claims
, not ClaimsPerinsured
, and have taken the offset Insured
into account.
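In symbols, the quantity computed in CHUNK 12 for each model is the Pearson goodness-of-fit statistic averaged over the test observations,
\(\dfrac{1}{n_{\text{test}}}\sum\limits_{i}\dfrac{(y_i-\hat{\mu}_i)^2}{\hat{\mu}_i},\)
where \(y_i\) is the observed claim count and \(\hat{\mu}_i\) the predicted count for the ith test observation; the smaller the statistic, the better the fit on the test set.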
# CHUNK 12
count.numeric.pred <- predict(count.numeric, newdata = data.test, type = "response")
sum((data.test$Claims - count.numeric.pred)^2 / count.numeric.pred) / nrow(data.test)

count.factor.pred <- predict(count.factor, newdata = data.test, type = "response")
sum((data.test$Claims - count.factor.pred)^2 / count.factor.pred) / nrow(data.test)
> count.numeric.pred <- predict(count.numeric, newdata = data.test, type = "response") > > sum((data.test$Claims - count.numeric.pred)^2 / count.numeric.pred) / nrow(data.test) [1] -18.54014 > > count.factor.pred <- predict(count.factor, newdata = data.test, type = "response") > > sum((data.test$Claims - count.factor.pred)^2 / count.factor.pred) / nrow(data.test) [1] -19.71478
Findings:
- Treating
Kilometres
andBonus
as factor variables leads to a lower Pearson statistic on the test set and appears to be a judicious move.
The increase in flexibility offered by the dummy variables outweighs the increase in model complexity.
TASK 4: Construct a GLM for claim severity
|
A GLM for claim severity: PaymentPerClaim
- Distribution: the Gamma distribution is a suitable target distribution, because
PaymentPerClaim
is a non-negative and continuous number with a right skew. - Link: log link is most commonly employed in conjunction with the Gamma distribution as:
- it ensures that the model predictions are always non-negative
- makes for an interpretable model.
(Note that, unlike the Poisson case, the log link is not the canonical link for the gamma distribution; the canonical link is the inverse link, which does not guarantee positive predictions.)
- Weight:
CHUNK 4
indicated that the variability ofPaymentPerClaim
decreases with the value ofClaims
. This motivates the use ofClaims
as a weight when a gamma GLM is fitted forPaymentPerClaim
.
Consistent with the claim count model in CHUNK 11
, we are binarizing Kilometres
and Bonus
.
# CHUNK 13 severity <- glm(PaymentPerClaim ~ factor(Kilometres) + Zone + factor(Bonus) + Make, data = data.train.severity, weight = Claims, family = Gamma(link = "log")) summary(severity)
Call: glm(formula = PaymentPerClaim ~ factor(Kilometres) + Zone + factor(Bonus) + Make, family = Gamma(link = "log"), data = data.train.severity, weights = Claims) Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 8.39309 0.02801 299.694 < 2e-16 *** factor(Kilometres)2 0.01253 0.01722 0.728 0.466926 factor(Kilometres)3 0.01459 0.01703 0.857 0.391764 factor(Kilometres)4 0.03893 0.02481 1.569 0.116860 factor(Kilometres)5 0.03361 0.02584 1.301 0.193644 Zone2 0.04370 0.02023 2.160 0.030966 * Zone3 0.04870 0.02122 2.294 0.021930 * Zone4 0.13249 0.01932 6.857 1.11e-11 *** Zone5 0.04064 0.03141 1.294 0.196051 Zone6 0.15418 0.02541 6.068 1.72e-09 *** Zone7 0.04576 0.08753 0.523 0.601199 factor(Bonus)2 0.06609 0.02748 2.405 0.016312 * factor(Bonus)3 0.08396 0.03094 2.713 0.006757 ** factor(Bonus)4 0.08018 0.03274 2.449 0.014445 * factor(Bonus)5 0.06293 0.03001 2.097 0.036201 * factor(Bonus)6 0.08939 0.02465 3.627 0.000298 *** factor(Bonus)7 0.12111 0.01887 6.420 1.94e-10 *** Make2 -0.03875 0.04498 -0.861 0.389150 Make3 0.05197 0.05710 0.910 0.362879 Make4 -0.20358 0.04691 -4.340 1.54e-05 *** Make5 -0.09568 0.03963 -2.414 0.015907 * Make6 -0.05159 0.03501 -1.474 0.140771 Make7 -0.11042 0.04666 -2.367 0.018099 * Make8 0.25276 0.06912 3.657 0.000266 *** Make9 -0.05821 0.02014 -2.890 0.003915 ** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for Gamma family taken to be 2.964663) Null deviance: 3801.1 on 1259 degrees of freedom Residual deviance: 3185.2 on 1235 degrees of freedom AIC: 1257914 Number of Fisher Scoring iterations: 5
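The effect of the weight argument can be made precise. In a weighted GLM, the dispersion is divided by the weight, so with \(w_i=\text{Claims}_i\) the fitted gamma model assumes
\(\mathrm{Var}(\text{PaymentPerClaim}_i)=\dfrac{\phi\,\mu_i^2}{w_i}=\dfrac{\phi\,\mu_i^2}{\text{Claims}_i},\)
which reflects the observation from CHUNK 4 that severities averaged over more claims are less variable.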
Findings:
- Most features in the fitted GLM are statistically significant; the notable exception is
Kilometres
.
# CHUNK 14 library(MASS) severity.final <- stepAIC(severity)
Start: AIC=1257914 PaymentPerClaim ~ factor(Kilometres) + Zone + factor(Bonus) + Make Df Deviance AIC - factor(Kilometres) 4 3195.4 1257909 <none> 3185.2 1257914 - factor(Bonus) 6 3312.1 1257945 - Make 8 3334.2 1257948 - Zone 6 3396.3 1257973 Step: AIC=1258151 PaymentPerClaim ~ Zone + factor(Bonus) + Make
summary(severity.final)
Call: glm(formula = PaymentPerClaim ~ Zone + factor(Bonus) + Make, family = Gamma(link = "log"), data = data.train.severity, weights = Claims) Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 8.40279 0.02680 313.506 < 2e-16 *** Zone2 0.04430 0.02026 2.186 0.028972 * Zone3 0.04808 0.02121 2.267 0.023554 * Zone4 0.13333 0.01912 6.972 5.06e-12 *** Zone5 0.03985 0.03123 1.276 0.202207 Zone6 0.15619 0.02510 6.222 6.71e-10 *** Zone7 0.04550 0.08779 0.518 0.604383 factor(Bonus)2 0.06995 0.02748 2.546 0.011023 * factor(Bonus)3 0.08940 0.03087 2.896 0.003844 ** factor(Bonus)4 0.08657 0.03262 2.654 0.008062 ** factor(Bonus)5 0.06956 0.02986 2.329 0.020009 * factor(Bonus)6 0.09553 0.02444 3.909 9.79e-05 *** factor(Bonus)7 0.12545 0.01875 6.692 3.32e-11 *** Make2 -0.03607 0.04503 -0.801 0.423242 Make3 0.05835 0.05709 1.022 0.306992 Make4 -0.21254 0.04663 -4.558 5.66e-06 *** Make5 -0.09632 0.03975 -2.423 0.015542 * Make6 -0.05523 0.03504 -1.576 0.115251 Make7 -0.11148 0.04680 -2.382 0.017376 * Make8 0.25775 0.06916 3.727 0.000203 *** Make9 -0.05990 0.01993 -3.006 0.002702 ** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for Gamma family taken to be 2.985245) Null deviance: 3801.1 on 1259 degrees of freedom Residual deviance: 3195.4 on 1239 degrees of freedom AIC: 1258151 Number of Fisher Scoring iterations: 5
Findings:
Kilometres
is removed when a backward stepwise selection based on the AIC is implemented.
In CHUNK 15
we apply the reduced GLM to produce predictions for Payment
on the severity portion of the test set.
# CHUNK 15 severity.final.pred <- data.test.severity$Claims * predict(severity.final, newdata = data.test.severity, type= "response") RMSE(data.test.severity$Payment, severity.final.pred) RMSE(data.test.severity$Payment, mean(data.train.severity$Payment))
> severity.final.pred <- data.test.severity$Claims * + predict(severity.final, newdata = data.test.severity, type= "response") > > RMSE(data.test.severity$Payment, severity.final.pred) [1] 69882.03 > > RMSE(data.test.severity$Payment, mean(data.train.severity$Payment)) [1] 1108407
Findings:
- The test RMSE is 69,882.03, which seems large.
- To put this value in perspective, we compare it with the test RMSE if we use the mean of
Payment
on the training set as the predicted value.
The test RMSE in this case becomes 1,108,407, a much larger value.
Our gamma GLM is therefore capable of reducing the test RMSE to only about 6.30% of the RMSE when no predictors are available. - The predictions made in
CHUNK 15
on the severity portion of the test set for aggregate payments were made assuming that the values ofClaims
were available, but in reality this is not the case. The number of claims will not be available until the policy term ends.
Predictions
TASK 5: Combine the two models to make a prediction for aggregate payments Run the recommended GLMs in Tasks 3 and 4 on the full dataset.
|
Combining frequency and severity GLMs to predict aggregate payments
Refit the two GLMs in Tasks 3 and 4 to the full Swedish
dataset.
# CHUNK 16 count.full <- glm(Claims ~ factor(Kilometres) + Zone + factor(Bonus) + Make, data = Swedish, offset = log(Insured), family = poisson) summary(count.full)
Call: glm(formula = Claims ~ factor(Kilometres) + Zone + factor(Bonus) + Make, family = poisson, data = Swedish, offset = log(Insured)) Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -1.812840 0.013757 -131.775 < 2e-16 *** factor(Kilometres)2 0.212586 0.007524 28.255 < 2e-16 *** factor(Kilometres)3 0.320226 0.008661 36.974 < 2e-16 *** factor(Kilometres)4 0.404657 0.012054 33.571 < 2e-16 *** factor(Kilometres)5 0.575954 0.012830 44.892 < 2e-16 *** Zone2 -0.238168 0.009496 -25.082 < 2e-16 *** Zone3 -0.386395 0.009670 -39.959 < 2e-16 *** Zone4 -0.581902 0.008654 -67.243 < 2e-16 *** Zone5 -0.326128 0.014530 -22.446 < 2e-16 *** Zone6 -0.526234 0.011877 -44.309 < 2e-16 *** Zone7 -0.730999 0.040698 -17.962 < 2e-16 *** factor(Bonus)2 -0.478993 0.012094 -39.607 < 2e-16 *** factor(Bonus)3 -0.693172 0.013508 -51.316 < 2e-16 *** factor(Bonus)4 -0.827397 0.014584 -56.735 < 2e-16 *** factor(Bonus)5 -0.925632 0.013968 -66.269 < 2e-16 *** factor(Bonus)6 -0.993457 0.011629 -85.429 < 2e-16 *** factor(Bonus)7 -1.327406 0.008685 -152.845 < 2e-16 *** Make2 0.076245 0.021239 3.590 0.000331 *** Make3 -0.247413 0.025094 -9.859 < 2e-16 *** Make4 -0.653524 0.024185 -27.022 < 2e-16 *** Make5 0.154924 0.020235 7.656 1.91e-14 *** Make6 -0.335581 0.017375 -19.314 < 2e-16 *** Make7 -0.055940 0.023343 -2.396 0.016554 * Make8 -0.043933 0.031604 -1.390 0.164493 Make9 -0.068054 0.009956 -6.836 8.17e-12 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for poisson family taken to be 1) Null deviance: 34070.6 on 2181 degrees of freedom Residual deviance: 2966.1 on 2157 degrees of freedom AIC: 10654 Number of Fisher Scoring iterations: 4
severity.full <- glm(Payment/Claims ~ Zone + factor(Bonus) + Make, data = Swedish.severity, weight = Claims, family = Gamma(link = "log")) summary(severity.full)
Call: glm(formula = Payment/Claims ~ Zone + factor(Bonus) + Make, family = Gamma(link = "log"), data = Swedish.severity, weights = Claims) Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 8.41085 0.02222 378.570 < 2e-16 *** Zone2 0.02297 0.01640 1.401 0.161428 Zone3 0.04770 0.01670 2.855 0.004348 ** Zone4 0.12963 0.01497 8.661 < 2e-16 *** Zone5 0.05073 0.02509 2.022 0.043293 * Zone6 0.14652 0.02053 7.137 1.39e-12 *** Zone7 0.02259 0.07025 0.321 0.747879 factor(Bonus)2 0.04698 0.02085 2.254 0.024346 * factor(Bonus)3 0.07390 0.02325 3.178 0.001508 ** factor(Bonus)4 0.06240 0.02507 2.489 0.012893 * factor(Bonus)5 0.03987 0.02396 1.664 0.096308 . factor(Bonus)6 0.07606 0.01987 3.828 0.000134 *** factor(Bonus)7 0.12130 0.01480 8.194 4.78e-16 *** Make2 -0.03177 0.03664 -0.867 0.386081 Make3 0.08932 0.04327 2.064 0.039159 * Make4 -0.17496 0.04138 -4.228 2.48e-05 *** Make5 -0.08771 0.03493 -2.511 0.012120 * Make6 -0.04243 0.02995 -1.417 0.156670 Make7 -0.12068 0.04030 -2.995 0.002784 ** Make8 0.21863 0.05442 4.017 6.14e-05 *** Make9 -0.05673 0.01711 -3.316 0.000931 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for Gamma family taken to be 2.979105) Null deviance: 5417.7 on 1796 degrees of freedom Residual deviance: 4547.3 on 1776 degrees of freedom AIC: 1878541 Number of Fisher Scoring iterations: 5
According to the data dictionary, the combination of feature values is:
Kilometres | Zone | Bonus | Make | Insured
2 | 1 | 1 | 6 | 350
This policyholder is created in CHUNK 17
.
# CHUNK 17 sample <- data.frame(Kilometres = 2, Zone = "1", Bonus = 1, Make = "6", Insured = 350) sample
Kilometres Zone Bonus Make Insured 1 2 1 1 6 350
Step 1. Predict the number of claims
# CHUNK 18 # Predicted number of claims N.pred <- predict(count.full, newdata = sample, type = "response") N.pred
1 50.50629
We may use the coefficient estimates in CHUNK 16
to compute the predicted linear predictor for the given policyholder:
η = -1.812840 + 0.212586 + 0 + 0 -0.335581 = -1.935835
Together with the offset, the predicted value of Claims
is:
N = 350 × exp(-1.935835) = 50.5063
Step 2. Predict the claim severity
# CHUNK 18 (Cont.) # Predicted claim severity X.pred <- predict(severity.full, newdata = sample, type = "response") X.pred
1 4308.826
Step 3. Multiply the two predictions to get the predicted aggregate payments
# CHUNK 18 (Cont.)
# Predicted aggregate payments
S.pred <- N.pred * X.pred
S.pred
1 217622.8
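Multiplying the two predictions treats claim counts and claim severities as independent (the standard frequency-severity decomposition), so that for the aggregate payments \(S=\sum_{k=1}^{N}X_k\),
\(E[S]=E[N]\,E[X]\approx 50.5063\times 4308.826\approx 217{,}623,\)
matching the value of S.pred above.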
An alternative prediction method
An alternative is to fit a target distribution, such as the Tweedie distribution, to the aggregate payments as a single target variable (a sketch of such a model follows the list below). This alternative modeling approach has the following advantages:
- It is more efficient to fit and maintain a single (possibly more complex) model than two separate models.
- It does not require independence between claim counts and claim severity.
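As a rough sketch of what this alternative might look like (not part of the original chunks): the tweedie() family constructor from the statmod package can be passed to glm() to fit a log-link Tweedie GLM directly to Payment. The variance power var.power = 1.5 below is an illustrative assumption (a value strictly between 1 and 2 corresponds to a compound Poisson-gamma distribution) and would need to be tuned in practice.

# A minimal sketch, assuming the statmod package is installed
library(statmod)
tweedie.full <- glm(Payment ~ factor(Kilometres) + Zone + factor(Bonus) + Make,
                    data = Swedish,
                    offset = log(Insured),
                    family = tweedie(var.power = 1.5, link.power = 0))  # link.power = 0 gives the log link
summary(tweedie.full)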
Principal Components Analysis (PCA)
Concepts
Definition: PCA
Principal components analysis (PCA) is an advanced data analytic technique that transforms a high-dimensional dataset into a smaller, much more manageable set of representative (principal) variables that capture most of the information in the original dataset.
Definition: PCs
The composite variables, known as principal components (PCs), are linear combinations of the existing variables, generated such that they are mutually uncorrelated and collectively simplify the dataset, reducing its dimension and making it more amenable to data exploration and visualization.
The Loadings of the mth PC
For m = 1, 2, … , M, where M (≤ p) is the number of PCs to use, the mth PC is given by the linear combination:
\(Z_m=\phi_{1m}X_1+\phi_{2m}X_2+\cdots+\phi_{pm}X_p=\sum\limits_{j=1}^{p}{\phi_{jm}X_j}\)
where the coefficients \(\phi_{1m}\), \(\phi_{2m}\), …, \(\phi_{pm}\) are the loadings (a.k.a. weights) of the mth PC corresponding to X1, … Xp.
Finally,
\(\phi_m=\left( \begin{matrix}
\phi_{1m} \\
\phi_{2m} \\
\vdots \\
\phi_{pm} \\
\end{matrix} \right)\), one loading for each feature
The Scores of the mth PC
Given the feature values, we can compute the realizations of Zm by:
\(Z_{im}=\phi_{1m}X_{i1}+\phi_{2m}X_{i2}+\cdots+\phi_{pm}X_{ip}=\sum\limits_{j=1}^{p}{\phi_{jm}X_{ij}}\), i = 1, …, n
We refer to Z1m, Z2m, .. . , Znm as the scores of the mth PC.
Finally,
\(Z_m=\left( \begin{matrix}
Z_{1m} \\
Z_{2m} \\
\vdots \\
Z_{nm} \\
\end{matrix} \right)\), one score for each observation
Properties of the mth PC
- Scores = Loadings × X: \(\boldsymbol{Z}_m = \boldsymbol{X\phi}_m\)
- The mean score of each PC is zero, i.e., \(\bar{z}_m=\dfrac{1}{n}\sum\limits_{i=1}^{n}{z_{im}}=0\) for all m, as long as the feature values are mean-centered.
- The loadings of the first PC are chosen to maximize the sample variance of \(Z_1\), i.e., to capture as much information in the data as possible, subject to the normalization \(\sum\limits_{j=1}^{p}{\phi^2_{j1}}=1\).
- Each subsequent PC is uncorrelated with the preceding PCs and measures a different aspect of the variables in the dataset. Geometrically, the PC directions are mutually perpendicular, with the angle between any two of them being 90°.
Purpose of PCA
Exploratory data analysis (which includes data exploration and visualization) and feature generation.
- reduce the dimension of the dataset and the complexity of the model
- generate features by replacing the original variables X1, …, Xp with the PCs Z1, …, ZM
Proportion of Variance Explained (PVE)
Total Variance
Not Scaled:
\(\underbrace{\sum\limits_{j=1}^{p}}_{\text{sum over}\\\text{all features }}\underbrace{\dfrac{1}{n}\sum\limits_{i=1}^{n}{x^2_{ij}}}_{\text{sample variance}\\\text{of feature j}}\)
Standardized:
\(\sum\limits_{j=1}^{p}{1}=p\)
Proportion of Variance Explained (PVE) by the mth PC
Not Scaled
\(PVE_m=\dfrac{\text{variance explained by mth PC}}{\text{total variance}}=\dfrac{\sum\limits_{i=1}^{n}{z^2_{im}}}{\sum\limits_{i=1}^{n}{\sum\limits_{j=1}^{p}{x^2_{ij}}}}\)
Standardized
\(\dfrac{\sum\limits_{i=1}^{n}{z^2_{im}}}{np}\)
Properties
- PVEm ∈ (0, 1), and the PVEs of all p PCs sum to 1.
- The PVEm's are monotonically decreasing in m, i.e., PVE1 ≥ PVE2 ≥ … ≥ PVEM.
Applying PCA
Issues
-
- Centering: mean-centering is often not of great concern when we perform PCA.
- Scaling: The results of a PCA depend very much on the scale of the variables.
- If we conduct a PCA using the variables on their original scale, the PC loadings are determined based on the (sample) covariance matrix of the variables.
- If we conduct a PCA for the standardized variables (i.e., they are divided by their standard deviation to have unit standard deviation), the PC loadings are determined based on the (sample) correlation matrix of the variables.
How does scaling affect PCA results?
If no scaling is done and the variables are of vastly different orders of magnitude, then those variables with an unusually large variance on their scale will receive a large PC loading and dominate the corresponding PC. However, there is no guarantee that this variable explains much of the underlying pattern in the data.
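A quick way to see this is to compare the first PC loading vector with and without scaling. A minimal sketch using the USArrests data analyzed in the case study below (Assault has by far the largest variance there, so it dominates the unscaled first PC):

data(USArrests)
# Covariance-based PCA (no scaling): the first PC is driven almost entirely by Assault
prcomp(USArrests, center = TRUE, scale. = FALSE)$rotation[, 1]
# Correlation-based PCA (scaled variables): the loadings are far more balanced
prcomp(USArrests, center = TRUE, scale. = TRUE)$rotation[, 1]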
Categorical Variables
PCA, in its current form, can only be applied to numeric variables.
To perform PCA on categorical variables, they have to be explicitly binarized in advance (using the dummyVars() function from the caret package) to create dummy variables, which are numeric and take values of either 0 or 1.
Drawbacks of PCA
Interpretability
PCA may not produce interpretable results because the PCs are complicated linear combinations of the original features. This may not align with the client's interest in identifying the factors affecting the target variable or in obtaining actionable insights.
Not good for non-linear relationships
If the variables are related in a non-linear fashion, then PCA may not be a good tool to use.
Simple Case Study
In this subsection, we illustrate the concepts presented in Subsections 6.1.1 and 6.1.2 by means of a simple case study. After completing this case study, you should be able to:
- Perform a PCA using the
prcomp()
function in R. - Extract useful information from a
prcomp
object and interpret the output of a PCA. - Generate and interpret a biplot.
- Realize the effect of scaling on the results of a PCA.
- Use the output of a PCA to create new features for prediction.
TASK 1: Explore the data Investigate the distribution of the four variables using graphical displays and summary statistics. Perform the following:
|
Explore the Data
Comment on their range of values and the relative size of their standard deviations.
# CHUNK 1 data(USArrests) summary(USArrests)
Murder Assault UrbanPop Rape Min. : 0.800 Min. : 45.0 Min. :32.00 Min. : 7.30 1st Qu.: 4.075 1st Qu.:109.0 1st Qu.:54.50 1st Qu.:15.07 Median : 7.250 Median :159.0 Median :66.00 Median :20.10 Mean : 7.788 Mean :170.8 Mean :65.54 Mean :21.23 3rd Qu.:11.250 3rd Qu.:249.0 3rd Qu.:77.75 3rd Qu.:26.18 Max. :17.400 Max. :337.0 Max. :91.00 Max. :46.00
Findings:
- We can see that the four variables are on drastically different scales, even among the three crime-related variables. For example, Murder ranges from 0.800 to 17.400, but Assault is from 45.0 to 337.0.
Determine if any of these variables should be transformed and, if so, what transformation should be made.
In CHUNK 2
, we use the apply()
function to “apply” the sd()
function to return the standard deviation of each variable (which is not available from summary above).
# CHUNK 2 apply(USArrests, 2, sd)
Murder Assault UrbanPop Rape 4.355510 83.337661 14.474763 9.366385
Findings:
- That the four standard deviations differ substantially makes it imperative to scale the four variables to have unit standard deviation prior to performing a PCA.
# CHUNK 3 library(ggplot2) # names(USArrests) extracts the column names of the USArrests data for (i in names(USArrests)) { plot <- ggplot(USArrests, aes(x = USArrests[, i])) + geom_histogram() + xlab(i) print(plot) }
Figure 6.1.3: Histograms for the four variables in the USArrests
dataset
Findings:
- None of the four graphs indicates a particularly problematic distribution, so no variable transformations seem warranted. (There is a small amount of right skewness in the distribution of
Rape
. For simplicity, we leaveRape
as is)
Create a correlation matrix for the four variables, comment on the correlation structure of the four variables, and explain why a PCA may be useful for summarizing the four variables.
In CHUNK 4
we create a correlation matrix for the four variables in an attempt to understand their correlation structure.
# CHUNK 4 cor(USArrests)
## Murder Assault UrbanPop Rape ## Murder 1.00000000 0.8018733 0.06957262 0.5635788 ## Assault 0.80187331 1.0000000 0.25887170 0.6652412 ## UrbanPop 0.06957262 0.2588717 1.00000000 0.4113412 ## Rape 0.56357883 0.6652412 0.41134124 1.0000000
Examining the pairwise correlations, we can see that:
Murder
,Assault
, andRape
are strongly positively correlated with one another, with all pairwise correlations greater than 0.5.- This is perhaps not surprising because the three variables all measure similar quantities – crime rates.
- The strong correlations suggest that PCA may be an effective technique to “compress” the three crime-related variables into a single measure of overall crime levels while retaining most of the information.
UrbanPop
does not seem to have a strong linear relationship with the three crime-related variables; it has a negligible correlation withMurder
and a moderately weak correlation withAssault
andRape
. In other words, urbanization levels and crime levels do not correlate closely.
TASK 2: Conduct a principal components analysis Task 1 suggests that PCA may be a good way to summarize the four variables in the
|
Perform a PCA on the four variables. You should decide whether to scale the variables in view of your observations in Task 1.
Implementing a PCA
In CHUNK 5
, we run a PCA on the USArrests
data and print out the PC loading vectors and scores.
As we saw in CHUNK 2
, the four variables have vastly different variances, so we will scale them to have unit standard deviation.
# CHUNK 5 PCA <- prcomp(USArrests, center = TRUE, scale. = TRUE) # PC loadings PCA$rotation
## PC1 PC2 PC3 PC4 ## Murder -0.5358995 0.4181809 -0.3412327 0.64922780 ## Assault -0.5831836 0.1879856 -0.2681484 -0.74340748 ## UrbanPop -0.2781909 -0.8728062 -0.3780158 0.13387773 ## Rape -0.5434321 -0.1673186 0.8177779 0.08902432
# PC scores PCA$x
## PC1 PC2 PC3 PC4 ## Alabama -0.97566045 1.12200121 -0.43980366 0.154696581 ## Alaska -1.93053788 1.06242692 2.01950027 -0.434175454 ## Arizona -1.74544285 -0.73845954 0.05423025 -0.826264240 ## Arkansas 0.13999894 1.10854226 0.11342217 -0.180973554 ## California -2.49861285 -1.52742672 0.59254100 -0.338559240 ## Colorado -1.49934074 -0.97762966 1.08400162 0.001450164 ## Connecticut 1.34499236 -1.07798362 -0.63679250 -0.117278736 ## Delaware -0.04722981 -0.32208890 -0.71141032 -0.873113315 ## Florida -2.98275967 0.03883425 -0.57103206 -0.095317042 ## Georgia -1.62280742 1.26608838 -0.33901818 1.065974459 ## Hawaii 0.90348448 -1.55467609 0.05027151 0.893733198 ## Idaho 1.62331903 0.20885253 0.25719021 -0.494087852 ## Illinois -1.36505197 -0.67498834 -0.67068647 -0.120794916 ## Indiana 0.50038122 -0.15003926 0.22576277 0.420397595 ## Iowa 2.23099579 -0.10300828 0.16291036 0.017379470 ## Kansas 0.78887206 -0.26744941 0.02529648 0.204421034 ## Kentucky 0.74331256 0.94880748 -0.02808429 0.663817237 ## Louisiana -1.54909076 0.86230011 -0.77560598 0.450157791 ## Maine 2.37274014 0.37260865 -0.06502225 -0.327138529 ## Maryland -1.74564663 0.42335704 -0.15566968 -0.553450589 ## Massachusetts 0.48128007 -1.45967706 -0.60337172 -0.177793902 ## Michigan -2.08725025 -0.15383500 0.38100046 0.101343128 ## Minnesota 1.67566951 -0.62590670 0.15153200 0.066640316 ## Mississippi -0.98647919 2.36973712 -0.73336290 0.213342049 ## Missouri -0.68978426 -0.26070794 0.37365033 0.223554811 ## Montana 1.17353751 0.53147851 0.24440796 0.122498555 ## Nebraska 1.25291625 -0.19200440 0.17380930 0.015733156 ## Nevada -2.84550542 -0.76780502 1.15168793 0.311354436 ## New Hampshire 2.35995585 -0.01790055 0.03648498 -0.032804291 ## New Jersey -0.17974128 -1.43493745 -0.75677041 0.240936580 ## New Mexico -1.96012351 0.14141308 0.18184598 -0.336121113 ## New York -1.66566662 -0.81491072 -0.63661186 -0.013348844 ## North Carolina -1.11208808 2.20561081 -0.85489245 -0.944789648 ## North Dakota 2.96215223 0.59309738 0.29824930 -0.251434626 ## Ohio 0.22369436 -0.73477837 -0.03082616 0.469152817 ## Oklahoma 0.30864928 -0.28496113 -0.01515592 0.010228476 ## Oregon -0.05852787 -0.53596999 0.93038718 -0.235390872 ## Pennsylvania 0.87948680 -0.56536050 -0.39660218 0.355452378 ## Rhode Island 0.85509072 -1.47698328 -1.35617705 -0.607402746 ## South Carolina -1.30744986 1.91397297 -0.29751723 -0.130145378 ## South Dakota 1.96779669 0.81506822 0.38538073 -0.108470512 ## Tennessee -0.98969377 0.85160534 0.18619262 0.646302674 ## Texas -1.34151838 -0.40833518 -0.48712332 0.636731051 ## Utah 0.54503180 -1.45671524 0.29077592 -0.081486749 ## Vermont 2.77325613 1.38819435 0.83280797 -0.143433697 ## Virginia 0.09536670 0.19772785 0.01159482 0.209246429 ## Washington 0.21472339 -0.96037394 0.61859067 -0.218628161 ## West Virginia 2.08739306 1.41052627 0.10372163 0.130583080 ## Wisconsin 2.05881199 -0.60512507 -0.13746933 0.182253407 ## Wyoming 0.62310061 0.31778662 -0.23824049 -0.164976866
Verify that the first PC score of Alabama is calculated correctly.
In CHUNK 5
, the loadings of the first PC are \(\phi_{11}\) = -0.5358995, \(\phi_{21}\) = -0.5831836, \(\phi_{31}\) = -0.2781909, \(\phi_{41}\) = -0.5434321, which means that the first PC is defined by:
Z1 = -0.5358995(Murder) – 0.5831836(Assault) – 0.2781909(UrbanPop) – 0.5434321(Rape )
where the four variables have been centered and standardized and\(\phi^2_{11}+\phi^2_{21}+\phi^2_{31}+\phi^2_{41}=1\).
USArrests[1, ]
Murder Assault UrbanPop Rape Alabama 13.2 236 58 21.2
Upon centering and standardization (the means and standard deviations are taken from CHUNKs 1 and 2
), Alabama has:
- Murder1 = \(\dfrac{x_1-\mu_{11}}{\sigma_{11}}=\dfrac{13.2-7.788}{4.35551}=1.242564\)
- Assault1 = \(\dfrac{x_1-\mu_{12}}{\sigma_{12}}=\dfrac{236-170.76}{83.337661}=0.782839\)
- UrbanPop1 = \(\dfrac{x_1-\mu_{13}}{\sigma_{13}}=\dfrac{58-65.54}{14.474763}=-0.520907\)
- Rape1 = \(\dfrac{x_1-\mu_{14}}{\sigma_{14}}=\dfrac{21.2-21.232}{9.366385}=-0.003416\)
So PC1 = z11 = -0.5358995(1.242564) - 0.5831836(0.782839) - 0.2781909(-0.520907) - 0.5434321(-0.003416) = -0.9757
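The same check can be done directly in R (a minimal verification, not part of the original chunks), by taking the inner product of Alabama's standardized values with the first loading vector:

# Standardize the data and take the inner product with the first PC loadings
sum(scale(USArrests)["Alabama", ] * PCA$rotation[, 1])
# should match PCA$x["Alabama", 1] = -0.9756604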
Interpret the loadings of the first two principal components.
Interpretation of PC’s
The PCs have special meaning and capture specific aspects of the data.
- PC1:
- Relative Signs: The first PC attaches roughly the same (negative) weight to each of
Assault
,Murder
, andRape
, which are strongly positively correlated with one another (recall the correlation matrix inCHUNK 4
), and a much smaller weight toUrbanPop
, so it can be interpreted as a measure of overall crime rates. - Magnitudes: The smaller (more negative) the PC score, the higher the crime rates. In Task 4 below, we will use this insight to combine these three crime-related variables as a single variable without losing much information.
- Relative Signs: The first PC attaches roughly the same (negative) weight to each of
- PC2:
- Relative Signs: In contrast, the second PC puts a very heavy (negative) weight on
UrbanPop
and a rather small weight on the three crime-related variables, so it is mainly a measure of urbanization level. - Magnitudes: The smaller (more negative) the PC score, the more urbanized the state.
- Relative Signs: In contrast, the second PC puts a very heavy (negative) weight on
TASK 3: Generate and interpret a biplot Create a biplot of the first two principal components. Then do the following:
|
Make a Biplot
# CHUNK 6 # cex argument indicates the amount by which piotting symbois should be scal.ed # cex = 0.6 means 40% smaiier # scale = 0 ensures that the arrows are scaled to represent the PC loadings biplot(PCA, scale = 0, cex = 0.6)
Figure 6.1.4: The biplot for the USArrests
dataset with scaling
We can describe California, Virginia, and Iowa as follows:
State | First PC Score | Second PC Score | Characteristics |
California | Very Negative | Very Negative |
High crime rate High urbanization level |
Virginia | Almost Zero | Almost Zero |
Average crime rate Average urbanization level |
Iowa | Very Positive | Almost Zero |
Low crime rate Average urbanization level |
Note that the PC scores are centered to have mean zero, so a PC1 score close to zero indicates an average crime level, not a low crime level.
Which variables are more correlated with each other?
- That the arrows for the three crime-related variables in Figure 6.1.4 are quite close to one another indicates that these three variables are rather positively correlated.
- Moreover, UrbanPop is quite far away from, and so is less correlated with, the three crime-related variables.
- These findings are consistent with the correlation matrix we saw in CHUNK 4.
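For a quick confirmation without flipping back to CHUNK 4, the correlation matrix can be recomputed in one line (assuming the USArrests data is loaded):
# Murder, Assault, and Rape are strongly correlated with one another; UrbanPop much less so
round(cor(USArrests), 2)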
TASK 4: Use observations from the PCA to generate a new feature
Recommend the number of principal components to use. Justify your choice.
In CHUNK 8, we apply the summary() function to PCA, a prcomp object, to output the proportion of variance explained (PVE) and the cumulative PVE for each of the four PCs.
# CHUNK 8
summary(PCA)
Importance of components: PC1 PC2 PC3 PC4 Standard deviation 1.5749 0.9949 0.59713 0.41645 Proportion of Variance 0.6201 0.2474 0.08914 0.04336 Cumulative Proportion 0.6201 0.8675 0.95664 1.00000
The first PC explains 62.01% of the variability and the second PC explains 24.74%, for a cumulative PVE of 86.75%. As the first two PCs together are able to explain most of the variability in the data and they are easy to interpret, we would recommend using the first two PCs.
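These figures can also be recovered directly from the PC standard deviations stored in the prcomp object. A short sketch, assuming PCA is the object from CHUNK 5:
# PVE from the PC standard deviations
pve <- PCA$sdev^2 / sum(PCA$sdev^2)  # each PC's variance divided by the total variance
round(pve, 4)                        # proportion of variance explained
round(cumsum(pve), 4)                # cumulative PVE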
Alternatively, we can use the scree plot for the USArrests data.
Figure 6.1.6: A scree plot for the USArrests data
# CHUNK 8 (Cont.)
screeplot(PCA, npcs = 4, type = "lines")

# CHUNK 8 (Cont.)
screeplot(PCA, npcs = 4, type = "barplot")
Feature Generation Based on PCA
- Take the scores of the first PC (computed from all four variables) as a new feature.
In CHUNK 10, we extract the first column of the x matrix (the matrix of PC scores), insert it into the USArrests data as a new feature called PC1, and print out the first six rows of the new dataset.
# CHUNK 10
USArrests.1 <- USArrests  # make a new copy of USArrests
USArrests.1$PC1 <- PCA$x[, 1]
head(USArrests.1)
Murder Assault UrbanPop Rape PC1 Alabama 13.2 236 58 21.2 -0.9756604 Alaska 10.0 263 48 44.5 -1.9305379 Arizona 8.1 294 80 31.0 -1.7454429 Arkansas 8.8 190 50 19.5 0.1399989 California 9.0 276 91 40.6 -2.4986128 Colorado 7.9 204 78 38.7 -1.4993407
- Keep only the terms of the first PC for the variables it dominates (the three crime-related variables), drop the UrbanPop term, and use the resulting linear combination of the standardized variables as the new feature. (More recommended)
# CHUNK 11
USArrests.2 <- USArrests
# the scale() function will convert the USArrests data to a numeric matrix,
# so we use the as.data.frame() function to change it back to a data frame
USArrests.scaled <- as.data.frame(scale(USArrests.2))
USArrests.2$crime <- PCA$rotation[1, 1] * USArrests.scaled$Murder +
  PCA$rotation[2, 1] * USArrests.scaled$Assault +
  PCA$rotation[4, 1] * USArrests.scaled$Rape
head(USArrests.2)
Murder Assault UrbanPop Rape crime Alabama 13.2 236 58 21.2 -1.1205719 Alaska 10.0 263 48 44.5 -2.2676396 Arizona 8.1 294 80 31.0 -1.4675357 Arkansas 8.8 190 50 19.5 -0.1586647 California 9.0 276 91 40.6 -2.0092964 Colorado 7.9 204 78 38.7 -1.2598717
- Rerun a PCA on only the variables of interest (the three crime-related variables) and use the first resulting PC as the new feature.
# CHUNK 12
USArrests.3 <- USArrests
# Run a PCA on only the three crime-related variables
PCA.3 <- prcomp(USArrests.3[, c(1, 2, 4)], center = TRUE, scale. = TRUE)
PCA.3$rotation
PC1 PC2 PC3 Murder -0.5826006 0.5339532 -0.6127565 Assault -0.6079818 0.2140236 0.7645600 Rape -0.5393836 -0.8179779 -0.1999436
USArrests.3$PC1 <- PCA.3$x[, 1]
head(USArrests.3)
Murder Assault UrbanPop Rape PC1 Alabama 13.2 236 58 21.2 -1.1980278 Alaska 10.0 263 48 44.5 -2.3087473 Arizona 8.1 294 80 31.0 -1.5033307 Arkansas 8.8 190 50 19.5 -0.1759894 California 9.0 276 91 40.6 -2.0452358 Colorado 7.9 204 78 38.7 -1.2634133
Deleting the Original Variables
# CHUNK 13
USArrests.2$Murder <- NULL
USArrests.2$Assault <- NULL
USArrests.2$Rape <- NULL
# OR
# USArrests.2[, c(1, 2, 4)] <- NULL
head(USArrests.2)
UrbanPop crime Alabama 58 -1.1205719 Alaska 48 -2.2676396 Arizona 80 -1.4675357 Arkansas 50 -0.1586647 California 91 -2.0092964 Colorado 78 -1.2598717
Cluster Analysis
k-Means Clustering
The aim of k-means clustering is to assign each observation in a dataset to one and only one of k clusters.
Good k
k is chosen such that:
- The variation of the observations within each cluster is relatively small; the smaller the within-cluster variation, the better the clusters separate the data into homogeneous groups (see the sketch after this list).
- The variation between clusters should be large.
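One way to examine this trade-off, sketched below, is to fit k-means (using R's built-in kmeans() function, whose algorithm is described next) for several candidate values of k and compare the total within-cluster variation; the candidate range 1 to 6 and nstart = 20 are illustrative choices, not prescribed here.
# Sketch: total within-cluster variation for several candidate values of k (illustrative only)
set.seed(1)
wss <- sapply(1:6, function(k) kmeans(scale(USArrests), centers = k, nstart = 20)$tot.withinss)
plot(1:6, wss, type = "b", xlab = "k", ylab = "Total within-cluster SS")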
k-Means Clustering Algorithm
- Initialization: Randomly select k points in the feature space. These k points will serve as the initial cluster centers.
- Iteration:
- Step 1. Assign each observation to the cluster with the closest center in terms of Euclidean distance.
- Step 2. Recalculate the center of each of the k clusters.
- Step 3. Repeat Steps 1 and 2 until the cluster assignments no longer change.
The algorithm iteratively calculates the k means (more precisely, the k centers) of the clusters.
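To make the three steps concrete, here is a bare-bones sketch of the iteration written directly in R. The data (the scaled USArrests matrix), k = 2, the seed, and the use of k randomly chosen observations as the initial centers are all assumptions for illustration; in practice you would simply call kmeans().
# Illustrative sketch of the k-means iteration (not how you would do this in practice)
set.seed(1)
X <- scale(USArrests)                               # standardize the four variables
k <- 2
centers <- X[sample(nrow(X), k), , drop = FALSE]    # initialization: k observations as starting centers
assign.old <- rep(0, nrow(X))
repeat {
  # Step 1: assign each observation to the cluster with the closest center (squared Euclidean distance)
  d <- sapply(1:k, function(j) colSums((t(X) - centers[j, ])^2))
  assign.new <- max.col(-d)
  # Step 3: stop once the cluster assignments no longer change
  if (all(assign.new == assign.old)) break
  # Step 2: recalculate each cluster center as the mean of its current members
  # (assumes no cluster becomes empty, which holds here for k = 2)
  centers <- do.call(rbind, lapply(1:k, function(j) colMeans(X[assign.new == j, , drop = FALSE])))
  assign.old <- assign.new
}
table(assign.new)                                   # cluster sizes after convergence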
Issues
- Random initial assignments could lead to the algorithm reaching a local optimum, but not necessarily the global optimum; different initial cluster assignments can lead to different local optima.
- Outliers can distort the cluster arrangements.
Resolution:
- Run the k-means algorithm many times (e.g., 20 to 50) with different initial cluster assignments.
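In R, this is what the nstart argument of kmeans() does: it runs the algorithm with many random initializations and reports the run with the smallest total within-cluster variation. A minimal sketch, assuming the scaled USArrests data and k = 2 (both illustrative choices):
# Sketch: many random starts via nstart (illustrative only)
set.seed(1)
km <- kmeans(scale(USArrests), centers = 2, nstart = 20)
km$tot.withinss    # total within-cluster variation (want small)
km$betweenss       # between-cluster variation (want large)
table(km$cluster)  # cluster sizes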
Hierarchical Clustering
The cluster groupings can be displayed using a dendrogram, a tree-based visualization of the “hierarchy” of clusters produced and the resulting cluster compositions.
The linkage defines the dissimilarity between two groups of observations. Four commonly used types of linkage are:
Linkage | The Inter-cluster Dissimilarity Is … |
Complete | The maximal pairwise distance between observations in one cluster and observations in the other cluster, i.e., look at the furthest pair of observations |
Single | The minimal pairwise distance between observations in one cluster and observations in the other cluster, i.e., look at the closest pair of observations |
Average | The average of all pairwise distances between observations in one cluster and observations in the other cluster |
Centroid | The distance between the two cluster centroids (or centers) |
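For completeness, here is a minimal sketch of hierarchical clustering in R applied to the scaled USArrests data; the choice of complete linkage and of cutting the dendrogram into two clusters are assumptions for illustration only.
# Sketch: hierarchical clustering with a chosen linkage (illustrative only)
d <- dist(scale(USArrests))            # pairwise Euclidean distances
hc <- hclust(d, method = "complete")   # also try "single", "average", or "centroid"
plot(hc, cex = 0.6)                    # dendrogram
table(cutree(hc, k = 2))               # cluster sizes after cutting the tree into 2 clusters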
k-Means vs. Hierarchical Clustering
Table 6.1: Key differences between k-means clustering and hierarchical clustering
Item | k-Means Clustering | Hierarchical Clustering |
Is randomization needed? | Yes, needed for determining initial cluster centers | No |
Is the number of clusters pre-specified? | Yes, k needs to be specified | No |
Are the clusters nested? | No | Yes, a hierarchy of clusters |