[mathjax]
Regularization
What is regularization?
- Reduce model complexity: Reduces the magnitude of the coefficient estimates via the use of a penalty term and serves to prevent overfitting.
- An alternative to using stepwise selection for identifying useful features.
How does regularization work?
- Variables with limited predictive power receive coefficient estimates that are small, if not exactly zero; features whose coefficients are shrunk exactly to zero are effectively removed from the model.
Choice of α
- If the goal is to identify key factors affecting the target variable, then using α = 0 (ridge regression), which does not eliminate any variables, is not appropriate.
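For reference, here is a sketch of the penalized objective that elastic net regularization minimizes (written for a normal target, following the scaling conventions used by glmnet; α = 0 gives ridge regression and α = 1 gives the lasso):
\[\frac{1}{2n_{tr}}\sum_{i=1}^{n_{tr}}\Big(y_i-\beta_0-\sum_{j=1}^{p}\beta_jX_{ij}\Big)^2+\lambda\Big[\frac{1-\alpha}{2}\sum_{j=1}^{p}\beta_j^2+\alpha\sum_{j=1}^{p}|\beta_j|\Big]\]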
Interactions
Interpretation
There is a significant interaction between [A] and [B], meaning that:
- The effect of [A] on [Y] varies for … with and without [B].
3.2 Linear Models Case Study 2: Feature Selection and Regularization
Learning Objectives
After completing this case study, you should be able to:
- Fit a multiple linear regression model with both numeric and categorical (factor) predictors.
- Detect and accommodate interactions between predictors which can be quantitative or qualitative.
- Perform explicit binarization of categorical predictors and understand why doing so may be beneficial.
library(caret)
dummyVars()
- Perform stepwise selection and be familiar with the different options allowed by this function.
library(MASS)
stepAIC()
- Generate and interpret diagnostic plots for a linear model.
- Implement regularized regression.
library(glmnet)
glmnet()
cv.glmnet()
Stage 1: Define the Business Problem
Identify and interpret key factors that relate to a higher or lower Balance.
Stage 2: Data Collection
Data Design
1. Read in data and remove irrelevant variables.
# CHUNK 1
library(ISLR)
data(Credit)
Credit$ID <- NULL
Data Description
Predictors | Variables |
Categorical – Binary | Gender, Student, Married |
Categorical – 3 levels | Ethnicity |
Numerical | Income, Limit, Rating, Cards, Age, Education |
TASK 1: Consider a data issue
Data Quality Issue
Using Ethnicity as a predictor may be sensitive: making ethnicity-based predictions can be criticized on grounds of discrimination and may raise legal concerns.
Stage 3: Exploratory Data Analysis
TASK 2: Explore the numeric variables
Descriptive Statistics: Numeric Predictors
Always explore the distribution of the target variable first. Doing so:
- may suggest useful transformations;
- avoids overlooking a skewed distribution of the target variable;
- helps determine which predictive analytic techniques are appropriate.
How to explore the skewness?
summary()
summary(Credit)
     Income           Limit           Rating          Cards            Age          Education        Gender    Student   Married              Ethnicity      Balance
 Min.   : 10.35   Min.   :  855   Min.   : 93.0   Min.   :1.000   Min.   :23.00   Min.   : 5.00   Male  :193   No :360   No :155   African American: 99   Min.   :   0.00
 1st Qu.: 21.01   1st Qu.: 3088   1st Qu.:247.2   1st Qu.:2.000   1st Qu.:41.75   1st Qu.:11.00   Female:207   Yes: 40   Yes:245   Asian           :102   1st Qu.:  68.75
 Median : 33.12   Median : 4622   Median :344.0   Median :3.000   Median :56.00   Median :14.00                                    Caucasian       :199   Median : 459.50
 Mean   : 45.22   Mean   : 4736   Mean   :354.9   Mean   :2.958   Mean   :55.67   Mean   :13.45                                                           Mean   : 520.01
 3rd Qu.: 57.47   3rd Qu.: 5873   3rd Qu.:437.2   3rd Qu.:4.000   3rd Qu.:70.00   3rd Qu.:16.00                                                           3rd Qu.: 863.00
 Max.   :186.63   Max.   :13913   Max.   :982.0   Max.   :9.000   Max.   :98.00   Max.   :20.00                                                           Max.   :1999.00
Mean | Median | Key | Conclusion |
520.01 | 459.50 | Mean > Median | The distribution of Balance is slightly skewed to the right |
Graphical Display: Numeric Predictors
Histogram
ggplot(<data>, aes(x = <target_variable>)) + geom_histogram()
# CHUNK 2
library(ggplot2)
ggplot(Credit, aes(x = Balance)) + geom_histogram()

nrow(Credit[Credit$Balance <= 0, ])
[1] 90
Findings
- The distribution of Balance is slightly skewed to the right.
- There are a large number of observations (90, to be exact) for which Balance = 0, making the log transformation unavailable.
  - The square root transformation still works because Balance is non-negative. The distribution of the square root of Balance becomes more symmetric, although the mass at zero remains prominent.
Bivariate Data Exploration: Correlation Matrix
cor()
# CHUNK 3
# Calculate the correlation matrix for numeric variables
# The numeric predictors are in the first 6 columns
cor.matrix <- cor(Credit[, c(1:6, 11)])
print("Correlation Matrix")
round(cor.matrix, digits = 4)

[1] "Correlation Matrix"
           Income   Limit  Rating   Cards     Age Education Balance
Income     1.0000  0.7921  0.7914 -0.0183  0.1753   -0.0277  0.4637
Limit      0.7921  1.0000  0.9969  0.0102  0.1009   -0.0235  0.8617
Rating     0.7914  0.9969  1.0000  0.0532  0.1032   -0.0301  0.8636
Cards     -0.0183  0.0102  0.0532  1.0000  0.0429   -0.0511  0.0865
Age        0.1753  0.1009  0.1032  0.0429  1.0000    0.0036  0.0018
Education -0.0277 -0.0235 -0.0301 -0.0511  0.0036    1.0000 -0.0081
Balance    0.4637  0.8617  0.8636  0.0865  0.0018   -0.0081  1.0000
Findings
- Correlations between the predictors and the target variable:
  - Important predictors:
    - Limit (0.8617) and Rating (0.8636) have very strong positive linear relationships with Balance.
    - Income (0.4637) has a moderately positive linear relationship with Balance.
  - Negligible predictors: the other four numeric variables are weak predictors at best, given their negligible correlations with Balance.
- Correlations between the predictors:
  - Income and Limit (0.7921) → large correlation, suggesting duplication of information.
  - Income and Rating (0.7914) → large correlation, suggesting duplication of information.
  - Limit and Rating (0.9969) → almost perfectly positively correlated → highly likely duplication of information.
Feature Selection: Income and Rating or Limit
Scatterplots
Since Limit and Rating are almost perfectly positively correlated, plot the two variables against each other to confirm their linear relationship.
ggplot(<data>, aes(x = <var>, y = <var>)) + geom_point()

# CHUNK 4
ggplot(Credit, aes(x = Limit, y = Rating)) + geom_point()
Findings
- The two variables are virtually perfectly related in a linear fashion and do not contribute much information when the other variable is present.
  → We will therefore keep one and only one of the two predictors: we retain Rating and delete Limit, since Rating has a slightly higher correlation with Balance.

# Delete Limit
Credit$Limit <- NULL
Feature Selection: Income and Rating
Scatterplots between Numeric Variables and Target Variable
The pairwise correlations in CHUNK 3 are only a measure of the strength of the linear relationships between the numeric variables and Balance.
To further confirm the predictive power of Income and Rating, we use a for loop to make a scatterplot, with a smoothed line superimposed, of Balance against each of the five remaining numeric predictors.

# CHUNK 5
# First save the names of the five numeric predictors as a vector
vars.numeric <- colnames(Credit[, 1:5])
for (i in vars.numeric) {
  plot <- ggplot(Credit, aes(x = Credit[, i], y = Balance)) +
    geom_point() +
    geom_smooth(method = "lm", se = FALSE) +
    labs(x = i)
  print(plot)
}
Findings
- Balance and Rating have an unquestionably strong positive linear relationship.
- Balance and Income have a positive linear relationship.
- Balance and each of Cards, Age, and Education hardly show any systematic patterns.
  - The fitted regression lines are almost flat → Balance is not sensitive to changes in these three predictors.
- All of these findings are consistent with our earlier deductions based on the correlation matrix in CHUNK 3.
TASK 3: Explore the factor variables Investigate the distribution of the factor variables. Identify a factor variable that is likely to predict the target. For this factor variable (there is no need to discuss the other variables in your response), perform the following:
Descriptive Statistics: Categorical Predictors
In CHUNK 6, we produce the mean (and median) of Balance split by the levels of each of the four factor variables.

# CHUNK 6
library(tidyverse)
# Save the names of the categorical predictors as a vector
vars.categorical <- c("Gender", "Student", "Married", "Ethnicity")
for (i in vars.categorical) {
  x <- Credit %>%
    group_by_(i) %>%
    summarise(
      mean = mean(Balance),
      median = median(Balance),
      n = n()
    )
  print(x)
}
# A tibble: 2 × 4 Gender mean median n <fct> <dbl> <int> <int> 1 " Male" 510. 463 193 2 "Female" 530. 456 207 # A tibble: 2 × 4 Student mean median n <fct> <dbl> <dbl> <int> 1 No 480. 424 360 2 Yes 877. 953 40 # A tibble: 2 × 4 Married mean median n <fct> <dbl> <int> <int> 1 No 523. 467 155 2 Yes 518. 454 245 # A tibble: 3 × 4 Ethnicity mean median n <fct> <dbl> <dbl> <int> 1 African American 531 480 99 2 Asian 512. 414 102 3 Caucasian 518. 465 199
Findings
- Student makes a significant difference to the mean of Balance over its two levels, "Yes" and "No":
  - students have a much higher average Balance (876.8250) than non-students (480.3694).
- The mean of Balance is more or less the same across the levels of Gender, Married, and Ethnicity.
  - These three variables appear to be ineffective in differentiating the observations of Balance.
Graphical Display: Categorical Predictors
Split Boxplots
ggplot(<data>, aes(x = <categorical predictors>, y = <target variable>)) + geom_boxplot()
# CHUNK 7
for (i in vars.categorical) {
  plot <- ggplot(Credit, aes(x = Credit[, i], y = Balance)) +
    geom_boxplot() +
    labs(x = i)
  print(plot)
}
Findings
- The box for students is noticeably higher than that for non-students, with a similar spread.
- The boxes are very similar across the levels of each of Gender, Married, and Ethnicity.
All in all, Student seems to be the only factor variable with significant predictive power.
Feature Selection: Student
Overall, the most promising predictors for Balance are: Income (numeric), Rating (numeric), and Student (factor).
TASK 4: Consider two graphs Your assistant has built the following two plots and asked for your comments:
Run the code to make them.
Occasionally on the exam, you are asked to focus on specific graphs that highlight subtle but important aspects of the data.
Split Boxplots: Age vs. Student
# CHUNK 8
ggplot(Credit, aes(x = Student, y = Age)) + geom_boxplot()
Findings
- Those who are students tend to have a smaller age (a lower median).
- The median age of students, which is close to 50, means that there is some co-dependence (but not interaction!) between Age and Student.
  → The effects of these two variables on Balance may be distorted when they serve as predictors at the same time.
Scatterplots: Balance vs. Income by Student

# CHUNK 8
ggplot(Credit, aes(x = Income, y = Balance, color = Student)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
Findings
- There is a possible interaction between Income and Student in affecting Balance.
  - The slopes of the two smoothed lines differ quite remarkably, with non-students being more sensitive to Income than students, perhaps due to their financial status.
Feature Generation: Income:Student
Stage 4: Model Construction and Evaluation
TASK 5: Explore the effect of releveling factor variables
Model Evaluation: Training/Test Set Split
Creation of training and test sets: createDataPartition()
createDataPartition(<target_variable>, p = <proportion of training data>, list = <if vector>)
# CHUNK 9
library(caret)
set.seed(8964)
partition <- createDataPartition(Credit$Balance, p = 0.75, list = FALSE)
data.train <- Credit[partition, ]
data.test <- Credit[-partition, ]
print("TRAIN")
mean(data.train$Balance)
print("TEST")
mean(data.test$Balance)

[1] "TRAIN"
[1] 520.4352
[1] "TEST"
[1] 518.7374
Model Construction – Step 1: Fitting the First Model
# CHUNK 10
model.full <- lm(Balance ~ . + Income:Student, data = data.train)
summary(model.full)
Call: lm(formula = Balance ~ . + Income:Student, data = data.train) Residuals: Min 1Q Median 3Q Max -189.17 -76.46 -15.65 66.69 296.17 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -552.34353 40.34543 -13.690 <2e-16 *** Income -7.93864 0.29465 -26.943 <2e-16 *** Rating 4.01246 0.06416 62.540 <2e-16 *** Cards 2.89772 4.69168 0.618 0.5373 Age -0.78778 0.35994 -2.189 0.0294 * Education 0.88922 1.92035 0.463 0.6437 GenderFemale -18.37030 12.00867 -1.530 0.1272 StudentYes 395.67744 30.72302 12.879 <2e-16 *** MarriedYes -20.46469 12.39735 -1.651 0.0999 . EthnicityAsian 6.53269 17.13794 0.381 0.7033 EthnicityCaucasian 11.95386 14.82444 0.806 0.4207 Income:StudentYes 0.50102 0.50028 1.001 0.3174 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 103.1 on 289 degrees of freedom Multiple R-squared: 0.952, Adjusted R-squared: 0.9502 F-statistic: 521.4 on 11 and 289 DF, p-value: < 2.2e-16
Number of Levels of the Categorical Predictors
- The lm() function automatically turns the 4 categorical predictors into dummy variables.
  - The dummy variables are named in the form <categorical_predictor><level>.
  - A categorical predictor with r levels is represented by r − 1 dummy variables: Gender, Student, and Married are each represented by 1 dummy variable, and Ethnicity by 2 dummy variables.
  - The level that is left out is the baseline level, which, by default, is the alphanumerically first level.
Coefficient Estimates – Interpretation for Categorical Predictors
- MarriedYes: The credit card balance of married individuals is lower than that of unmarried individuals (baseline) by 20.46469 on average, holding all other predictors fixed.
- EthnicityAsian: The credit card balance of Asians is higher than that of African Americans (baseline) by 6.53269 on average, holding all other predictors fixed.
Significance
- Income, Rating, and Student (or equivalently, StudentYes) are highly significant.
- Income:Student is not quite significant: the interaction becomes insignificant once the effects of the other variables have been accounted for.
Standard Errors
- The standard error of the coefficient estimate of the "Asian" level (17.13794) of the Ethnicity variable is larger than that of the "Caucasian" level (14.82444).
  - This should come as no surprise, considering the larger number of observations in the latter level (see CHUNK 1 or CHUNK 6).
  - Recall that:
    - sparse levels tend to have larger standard errors, because the coefficient estimates for those levels are based on less data;
    - populous levels tend to have smaller standard errors, because the coefficient estimates for those levels are based on more data.
Model Construction – Releveling
Purpose
To relevel the factor variables so that the baseline level of each is the level with the most observations, rather than the alphanumerically first level.
Effects
- has no effect on the predicted values.
- affects the coefficient estimates of the dummy variables.
- possibly changes the significance statistics, such as the t-statistics for the dummy variables and the corresponding p-values.
  - In general, the p-values for factor variables with three or more levels may depend on the choice of the baseline level.
  - If we perform stepwise (backward) selection with binarization and try to drop the most statistically insignificant feature, that is, the one with the highest p-value, then the choice of the baseline levels of the factor variables may affect which feature is dropped first. This explains the importance of releveling.
Why set the baseline level to the most populous level?
- To improve the accuracy of our measures of significance.
Releveling
# CHUNK 11
for (i in vars.categorical) {
  # Use the table() function to calculate the frequencies for each factor
  table <- as.data.frame(table(Credit[, i]))
  # Determine the level with the highest frequency
  max <- which.max(table[, 2])
  # Save the name of the level with the highest frequency
  level.name <- as.character(table[max, 1])
  # Set the baseline level to the most populous level
  Credit[, i] <- relevel(Credit[, i], ref = level.name)
}
summary(Credit[, vars.categorical])

    Gender    Student   Married               Ethnicity
 Female:207   No :360   Yes:245   Caucasian       :199
 Male  :193   Yes: 40   No :155   African American: 99
                                  Asian           :102
Baseline level changes after releveling:
- Gender: Male → Female
- Married: No → Yes
- Ethnicity: African American → Caucasian
Refitting the Full Model
Run CHUNK 12 to refit the full model:

# CHUNK 12
# To make sure factors in the training set are releveled
data.train <- Credit[partition, ]
model.full <- lm(Balance ~ . + Income:Student, data = data.train)
summary(model.full)
Call: lm(formula = Balance ~ . + Income:Student, data = data.train) Residuals: Min 1Q Median 3Q Max -189.17 -76.46 -15.65 66.69 296.17 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -579.22466 40.08600 -14.450 <2e-16 *** Income -7.93864 0.29465 -26.943 <2e-16 *** Rating 4.01246 0.06416 62.540 <2e-16 *** Cards 2.89772 4.69168 0.618 0.5373 Age -0.78778 0.35994 -2.189 0.0294 * Education 0.88922 1.92035 0.463 0.6437 Gender Male 18.37030 12.00867 1.530 0.1272 StudentYes 395.67744 30.72302 12.879 <2e-16 *** MarriedNo 20.46469 12.39735 1.651 0.0999 . EthnicityAfrican American -11.95386 14.82444 -0.806 0.4207 EthnicityAsian -5.42117 14.68523 -0.369 0.7123 Income:StudentYes 0.50102 0.50028 1.001 0.3174 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 103.1 on 289 degrees of freedom Multiple R-squared: 0.952, Adjusted R-squared: 0.9502 F-statistic: 521.4 on 11 and 289 DF, p-value: < 2.2e-1
Findings: Comparing CHUNK 12 with CHUNK 10
Change in the coefficient estimate of the dummy variable for Married:
- Coefficient: MarriedYes → MarriedNo; Estimate: −20.46469 → +20.46469
- → The credit card balance of unmarried individuals is higher than that of married individuals (baseline) by 20.46469 on average, holding all other predictors fixed. The size of the coefficient estimate remains unchanged, but its sign flips (−20.46469 → +20.46469).
Change in the coefficient estimate of the dummy variable for Ethnicity:
- Coefficient: EthnicityAsian; Estimate: 6.53269 (baseline: African American) → −5.42117 (baseline: Caucasian)
- → The credit card balance of Asians is lower than that of Caucasians (baseline) by 5.42117 on average, holding all other predictors fixed. Note that 6.53269 − 11.95386 = −5.42117.
Change in the p-value for EthnicityAsian: 0.7033 → 0.7123
TASK 6: Binarize Factor Variables Manually
Model Construction – One more preparatory step – Binarization of factor variables
Purpose
To binarize categorical predictors with three or more levels.
Effects
- Whether or not we binarize the categorical predictors in advance has no impact on the fitted linear model.
- Binarization can make the feature selection process we are going to perform more meaningful.
Advantages
- With binarization, we are able to add or drop individual factor levels when doing feature selection.
  - Many feature selection functions in R treat a factor variable as a single feature and either retain the variable with all of its levels or remove the variable completely. This "all or nothing" approach ignores the possibility that individual levels of a factor with three or more levels may be:
    - insignificant with respect to the baseline level (and hence could be combined with it to form a bigger baseline level); or
    - insignificantly different from other levels (in which case they could be combined into a new, bigger level).
Disadvantages
- The stepwise selection procedure may take considerably longer to complete, since each factor level is considered a separate feature to add or drop.
- We may see some non-intuitive or nonsensical results when only a handful of the levels of a categorical predictor are selected.
Binarization
Only Ethnicity has more than two levels and therefore has to be binarized into two dummy variables. The binarization of Ethnicity can be done by running CHUNK 13.

# CHUNK 13
library(caret)
binarizer <- dummyVars(paste("~", paste(vars.categorical, collapse = "+")),
                       data = Credit, fullRank = TRUE)
# OR type out the categorical predictors one by one
#binarizer <- dummyVars(~ Gender + Student + Married + Ethnicity,
#                       data = Credit, fullRank = TRUE)
binarized_vars <- data.frame(predict(binarizer, newdata = Credit))
head(binarized_vars)
Gender..Male Student.Yes Married.No Ethnicity.African.American Ethnicity.Asian 1 1 0 0 0 0 2 0 1 0 0 1 3 1 0 1 0 1 4 0 0 1 0 1 5 1 0 0 0 0 6 1 0 1 0 0
Explanation of the Code
- The dummyVars() function takes:
  - a formula (with no target variable) as the first argument; and
  - a data frame carrying the categorical variable(s) you want to binarize as the second argument.
- The paste() function: in paste(vars.categorical, collapse = "+"), we "paste" the names of the four categorical predictors Gender, Student, Married, and Ethnicity (recall that vars.categorical is defined in CHUNK 6) and separate them by the + sign.
- The fullRank option can be set to TRUE or FALSE depending on your purpose:
  - fullRank = TRUE is appropriate for regression and gives a full-rank parameterization of the model matrix, with the baseline levels of the categorical predictors left out.
  - fullRank = FALSE (the default) retains the baseline levels, which results in a rank-deficient model matrix. This setting is more appropriate for principal components and cluster analyses.
Below is the outcome of setting fullRank = FALSE:

# CHUNK 13
binarizer <- dummyVars(paste("~", paste(vars.categorical, collapse = "+")),
                       data = Credit, fullRank = FALSE)
binarized_vars <- data.frame(predict(binarizer, newdata = Credit))
head(binarized_vars)
Gender.Female Gender..Male Student.No Student.Yes Married.Yes Married.No Ethnicity.Caucasian Ethnicity.African.American Ethnicity.Asian 1 0 1 1 0 1 0 1 0 0 2 1 0 0 1 1 0 0 0 1 3 0 1 1 0 0 1 0 0 1 4 1 0 1 0 0 1 0 0 1 5 0 1 1 0 1 0 1 0 0 6 0 1 1 0 0 1 1 0 0
The binarization procedure creates two dummy variables for Ethnicity:
- Ethnicity.African.American and Ethnicity.Asian, indicating whether Ethnicity equals "African American" or "Asian", respectively.
Refitting the Full Model
In CHUNK 14, we combine the binarized variables with the original Credit dataset via the cbind() function.

# CHUNK 14
Credit.bin <- cbind(Credit, binarized_vars)
head(Credit.bin)
Income Rating Cards Age Education Gender Student Married Ethnicity Balance Gender.Female Gender..Male Student.No Student.Yes Married.Yes Married.No Ethnicity.Caucasian Ethnicity.African.American Ethnicity.Asian 1 14.891 283 2 34 11 Male No Yes Caucasian 333 0 1 1 0 1 0 1 0 0 2 106.025 483 3 82 15 Female Yes Yes Asian 903 1 0 0 1 1 0 0 0 1 3 104.593 514 4 71 11 Male No No Asian 580 0 1 1 0 0 1 0 0 1 4 148.924 681 3 36 11 Female No No Asian 964 1 0 1 0 0 1 0 0 1 5 55.882 357 2 68 16 Male No Yes Caucasian 331 0 1 1 0 1 0 1 0 0 6 80.180 569 4 77 10 Male No No Caucasian 1151 0 1 1 0 0 1 1 0 0
Given the binarized variables, the original categorical predictors are no longer needed as they duplicate the information contained in the dummy variables.
# CHUNK 14 (Cont.)
Credit.bin$Gender <- NULL
Credit.bin$Student <- NULL
Credit.bin$Married <- NULL
Credit.bin$Ethnicity <- NULL
head(Credit.bin)
Income Rating Cards Age Education Balance Gender.Female Gender..Male Student.No Student.Yes Married.Yes Married.No Ethnicity.Caucasian Ethnicity.African.American Ethnicity.Asian 1 14.891 283 2 34 11 333 0 1 1 0 1 0 1 0 0 2 106.025 483 3 82 15 903 1 0 0 1 1 0 0 0 1 3 104.593 514 4 71 11 580 0 1 1 0 0 1 0 0 1 4 148.924 681 3 36 11 964 1 0 1 0 0 1 0 0 1 5 55.882 357 2 68 16 331 0 1 1 0 1 0 1 0 0 6 80.180 569 4 77 10 1151 0 1 1 0 0 1 1 0 0
Model Evaluation: Training/Test Set Split
Now that the expanded dataset with binarized variables is available, we again split it into the training and test sets.
# CHUNK 14 (Cont.)
data.train.bin <- Credit.bin[partition, ]
data.test.bin <- Credit.bin[-partition, ]
Model Construction – Step 1 Revisited: Refitting the Linear Model with Binarized Factors
# CHUNK 15
# The interaction term is now Income:Student.Yes, not Income:Student
model.full.bin <- lm(Balance ~ . + Income:Student.Yes, data = data.train.bin)
summary(model.full.bin)
Call: lm(formula = Balance ~ . + Income:Student.Yes, data = data.train.bin) Residuals: Min 1Q Median 3Q Max -189.17 -76.46 -15.65 66.69 296.17 Coefficients: (4 not defined because of singularities) Estimate Std. Error t value Pr(>|t|) (Intercept) -150.13340 48.40440 -3.102 0.00211 ** Income -7.93864 0.29465 -26.943 < 2e-16 *** Rating 4.01246 0.06416 62.540 < 2e-16 *** Cards 2.89772 4.69168 0.618 0.53731 Age -0.78778 0.35994 -2.189 0.02942 * Education 0.88922 1.92035 0.463 0.64368 Gender.Female -18.37030 12.00867 -1.530 0.12717 Gender..Male NA NA NA NA Student.No -395.67744 30.72302 -12.879 < 2e-16 *** Student.Yes NA NA NA NA Married.Yes -20.46469 12.39735 -1.651 0.09988 . Married.No NA NA NA NA Ethnicity.Caucasian 5.42117 14.68523 0.369 0.71228 Ethnicity.African.American -6.53269 17.13794 -0.381 0.70335 Ethnicity.Asian NA NA NA NA Income:Student.Yes 0.50102 0.50028 1.001 0.31743 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 103.1 on 289 degrees of freedom Multiple R-squared: 0.952, Adjusted R-squared: 0.9502 F-statistic: 521.4 on 11 and 289 DF, p-value: < 2.2e-16
Findings:
- The fitted model is equivalent to the one in CHUNK 12 (same residual standard error, R², and F-statistic), showing that binarization has no effect on the quality of the fitted model.
TASK 7: Select features using stepwise selection Some of the features in the full model may lack predictive power and result in overfitting.
Describe each decision and the implications in choosing one option versus the other.
Model Construction – Step 2: Removing Non-Predictive Features Using stepAIC()
How does stepAIC() work?
The MASS::stepAIC() function automates stepwise model selection. In each step it adds (forward selection) or drops (backward selection) the feature that produces the greatest improvement in the model according to a certain criterion (the AIC by default). The process terminates when no more variables can be added or dropped to improve the model.
The stepAIC() Function
- AIC with backward selection: stepAIC(<full model>)
- AIC with forward selection: stepAIC(<empty model>, direction = "forward", scope = list(upper = <full model>, lower = <empty model>))
- BIC with backward selection: stepAIC(<full model>, k = log(nrow(<data>)))
- BIC with forward selection: stepAIC(<empty model>, direction = "forward", scope = list(upper = <full model>, lower = <empty model>), k = log(nrow(<data>)))
Note:
- k is the penalty coefficient: k = 2 for the AIC (the default) or k = log(nrow(<data>)) for the BIC, based on:
  - \(\text{AIC}=-2l+\boxed{2}p\) and
  - \(\text{BIC}=-2l+\boxed{\ln(n_{tr})}p\)
Backward vs. Forward Selection
- Forward selection is more likely to produce a final model with fewer features.
AIC vs. BIC
- The BIC is more conservative than the AIC because its penalty coefficient ln(n_tr) = ln 301 = 5.7071 > 2.
- The BIC tends to favor a model with fewer features than the AIC:
  - using the BIC as the selection criterion will retain fewer features,
  - giving fewer factors to consider,
  - but carrying a higher risk of missing important features.
Backward selection with the AIC
# CHUNK 16
#install.packages("MASS") # uncomment this line the first time you use MASS
library(MASS)
model.backward.AIC <- stepAIC(model.full.bin)
Start: AIC=2802.25 Balance ~ Income + Rating + Cards + Age + Education + Gender.Female + Gender..Male + Student.No + Student.Yes + Married.Yes + Married.No + Ethnicity.Caucasian + Ethnicity.African.American + Ethnicity.Asian + Income:Student.Yes Step: AIC=2802.25 Balance ~ Income + Rating + Cards + Age + Education + Gender.Female + Gender..Male + Student.No + Student.Yes + Married.Yes + Married.No + Ethnicity.Caucasian + Ethnicity.African.American + Income:Student.Yes Step: AIC=2802.25 Balance ~ Income + Rating + Cards + Age + Education + Gender.Female + Gender..Male + Student.No + Student.Yes + Married.Yes + Ethnicity.Caucasian + Ethnicity.African.American + Income:Student.Yes Step: AIC=2802.25 Balance ~ Income + Rating + Cards + Age + Education + Gender.Female + Gender..Male + Student.Yes + Married.Yes + Ethnicity.Caucasian + Ethnicity.African.American + Income:Student.Yes Step: AIC=2802.25 Balance ~ Income + Rating + Cards + Age + Education + Gender.Female + Student.Yes + Married.Yes + Ethnicity.Caucasian + Ethnicity.African.American + Income:Student.Yes Df Sum of Sq RSS AIC - Ethnicity.Caucasian 1 1448 3071419 2800.4 - Ethnicity.African.American 1 1543 3071515 2800.4 - Education 1 2278 3072249 2800.5 - Cards 1 4052 3074024 2800.6 - Income:Student.Yes 1 10654 3080626 2801.3 <none> 3069972 2802.3 - Gender.Female 1 24859 3094830 2802.7 - Married.Yes 1 28946 3098918 2803.1 - Age 1 50885 3120857 2805.2 - Rating 1 41548705 44618677 3605.9 Step: AIC=2800.39 Balance ~ Income + Rating + Cards + Age + Education + Gender.Female + Student.Yes + Married.Yes + Ethnicity.African.American + Income:Student.Yes Df Sum of Sq RSS AIC - Education 1 2255 3073675 2798.6 - Cards 1 3911 3075331 2798.8 - Ethnicity.African.American 1 5595 3077014 2798.9 - Income:Student.Yes 1 10176 3081595 2799.4 <none> 3071419 2800.4 - Gender.Female 1 25165 3096584 2800.8 - Married.Yes 1 29351 3100771 2801.3 - Age 1 51462 3122881 2803.4 - Rating 1 41550276 44621695 3603.9 Step: AIC=2798.61 Balance ~ Income + Rating + Cards + Age + Gender.Female + Student.Yes + Married.Yes + Ethnicity.African.American + Income:Student.Yes Df Sum of Sq RSS AIC - Cards 1 3865 3077540 2797.0 - Ethnicity.African.American 1 5586 3079261 2797.2 - Income:Student.Yes 1 10214 3083888 2797.6 <none> 3073675 2798.6 - Gender.Female 1 25155 3098830 2799.1 - Married.Yes 1 28499 3102174 2799.4 - Age 1 51652 3125327 2801.6 - Rating 1 41550265 44623939 3601.9 Step: AIC=2796.99 Balance ~ Income + Rating + Age + Gender.Female + Student.Yes + Married.Yes + Ethnicity.African.American + Income:Student.Yes Df Sum of Sq RSS AIC - Ethnicity.African.American 1 5579 3083119 2795.5 - Income:Student.Yes 1 10767 3088307 2796.0 <none> 3077540 2797.0 - Gender.Female 1 23659 3101198 2797.3 - Married.Yes 1 27917 3105457 2797.7 - Age 1 49660 3127200 2799.8 - Rating 1 42161655 45239195 3604.0 Step: AIC=2795.54 Balance ~ Income + Rating + Age + Gender.Female + Student.Yes + Married.Yes + Income:Student.Yes Df Sum of Sq RSS AIC - Income:Student.Yes 1 9873 3092992 2794.5 <none> 3083119 2795.5 - Gender.Female 1 23435 3106554 2795.8 - Married.Yes 1 24716 3107835 2795.9 - Age 1 50672 3133791 2798.4 - Rating 1 42165674 45248793 3602.1 Step: AIC=2794.5 Balance ~ Income + Rating + Age + Gender.Female + Student.Yes + Married.Yes Df Sum of Sq RSS AIC <none> 3092992 2794.5 - Gender.Female 1 22314 3115306 2794.7 - Married.Yes 1 25027 3118019 2794.9 - Age 1 49133 3142125 2797.2 - Student.Yes 1 4931600 8024592 3079.5 - Income 1 8357794 11450786 3186.5 - Rating 1 42247226 45340218 3600.7
summary(model.backward.AIC)
Call: lm(formula = Balance ~ Income + Rating + Age + Gender.Female + Student.Yes + Married.Yes, data = data.train.bin) Residuals: Min 1Q Median 3Q Max -195.16 -76.45 -14.03 65.33 285.30 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -529.16392 26.85954 -19.701 <2e-16 *** Income -7.87666 0.27946 -28.186 <2e-16 *** Rating 4.01180 0.06331 63.370 <2e-16 *** Age -0.77016 0.35638 -2.161 0.0315 * Gender.Female -17.32103 11.89326 -1.456 0.1464 Student.Yes 418.87031 19.34645 21.651 <2e-16 *** Married.Yes -18.73773 12.14860 -1.542 0.1241 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 102.6 on 294 degrees of freedom Multiple R-squared: 0.9517, Adjusted R-squared: 0.9507 F-statistic: 964.8 on 6 and 294 DF, p-value: < 2.2e-16
Findings:
- The aliased dummy variables (Ethnicity.Asian, Married.No, Student.No, Gender..Male), whose coefficients could not be estimated in CHUNK 15, are removed first without changing the AIC. Features are then dropped in the following order: Ethnicity.Caucasian, Education, Cards, Ethnicity.African.American, Income:Student.Yes, after which the process terminates.
- The following six features remain in the final model: Income, Rating, Age, Gender.Female, Student.Yes, Married.Yes.
- All of the coefficient estimates in the final model, except those of Gender.Female and Married.Yes, are statistically significant. The retention of Income, Rating, and Student.Yes is in agreement with what we found in Tasks 2 and 3.
Forward Selection with the BIC
# CHUNK 18
# First fit the null model (i.e., the model with no predictors)
model.null <- lm(Balance ~ 1, data = data.train.bin)
model.forward.BIC <- stepAIC(model.null,
                             direction = "forward",
                             scope = list(upper = model.full.bin, lower = model.null),
                             k = log(nrow(data.train.bin)))
Start: AIC=3698.12 Balance ~ 1 Df Sum of Sq RSS AIC + Rating 1 47082983 16908423 3303.2 + Income 1 12910007 51081399 3636.0 + Student.No 1 5425335 58566071 3677.2 + Student.Yes 1 5425335 58566071 3677.2 <none> 63991406 3698.1 + Cards 1 163291 63828115 3703.1 + Gender.Female 1 127201 63864205 3703.2 + Gender..Male 1 127201 63864205 3703.2 + Age 1 79880 63911526 3703.5 + Education 1 13121 63978284 3703.8 + Married.Yes 1 10920 63980486 3703.8 + Married.No 1 10920 63980486 3703.8 + Ethnicity.Asian 1 3049 63988357 3703.8 + Ethnicity.Caucasian 1 2324 63989082 3703.8 + Ethnicity.African.American 1 0 63991406 3703.8 Step: AIC=3303.21 Balance ~ Rating Df Sum of Sq RSS AIC + Income 1 8650956 8257467 3093.2 + Student.No 1 4817765 12090659 3208.0 + Student.Yes 1 4817765 12090659 3208.0 + Age 1 707560 16200863 3296.1 <none> 16908423 3303.2 + Married.Yes 1 174547 16733876 3305.8 + Married.No 1 174547 16733876 3305.8 + Education 1 71402 16837022 3307.6 + Ethnicity.Asian 1 49620 16858804 3308.0 + Gender.Female 1 31887 16876536 3308.4 + Gender..Male 1 31887 16876536 3308.4 + Cards 1 30049 16878374 3308.4 + Ethnicity.Caucasian 1 18541 16889883 3308.6 + Ethnicity.African.American 1 4304 16904120 3308.8 Step: AIC=3093.2 Balance ~ Rating + Income Df Sum of Sq RSS AIC + Student.No 1 5069114 3188353 2812.5 + Student.Yes 1 5069114 3188353 2812.5 <none> 8257467 3093.2 + Married.Yes 1 127904 8129563 3094.2 + Married.No 1 127904 8129563 3094.2 + Age 1 95208 8162259 3095.4 + Education 1 44496 8212971 3097.3 + Ethnicity.Asian 1 40115 8217352 3097.4 + Cards 1 22019 8235448 3098.1 + Ethnicity.African.American 1 9464 8248003 3098.6 + Ethnicity.Caucasian 1 8003 8249464 3098.6 + Gender.Female 1 244 8257223 3098.9 + Gender..Male 1 244 8257223 3098.9 Step: AIC=2812.47 Balance ~ Rating + Income + Student.No Df Sum of Sq RSS AIC <none> 3188353 2812.5 + Age 1 46257 3142096 2813.8 + Gender.Female 1 24128 3164225 2815.9 + Gender..Male 1 24128 3164225 2815.9 + Married.Yes 1 23759 3164595 2815.9 + Married.No 1 23759 3164595 2815.9 + Ethnicity.Caucasian 1 4275 3184078 2817.8 + Ethnicity.African.American 1 2346 3186007 2817.9 + Education 1 1600 3186753 2818.0 + Cards 1 865 3187488 2818.1 + Ethnicity.Asian 1 747 3187606 2818.1
summary(model.forward.BIC)
Call: lm(formula = Balance ~ Rating + Income + Student.No, data = data.train.bin) Residuals: Min 1Q Median 3Q Max -220.292 -78.120 -7.473 63.273 299.895 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -167.99297 23.70608 -7.086 1e-11 *** Rating 4.01634 0.06353 63.223 <2e-16 *** Income -7.97413 0.27691 -28.797 <2e-16 *** Student.No -421.17339 19.38206 -21.730 <2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 103.6 on 297 degrees of freedom Multiple R-squared: 0.9502, Adjusted R-squared: 0.9497 F-statistic: 1888 on 3 and 297 DF, p-value: < 2.2e-16
Findings:
- Features are added in the following order: Rating, Income, Student.No.
- The final model is simpler, with fewer features, and all of its coefficient estimates are statistically significant.
TASK 8: Select and validate the recommended model
Model Evaluation: Performance Metrics
Model Selection – test RMSE
We are particularly interested in whether reducing the complexity of a linear model by stepwise selection algorithms can enhance prediction accuracy. A natural measure of the performance of linear models, where the target variable is numeric, is the test RMSE.
# CHUNK 19
RMSE(data.test$Balance, predict(model.null, newdata = data.test.bin))
RMSE(data.test$Balance, predict(model.full.bin, newdata = data.test.bin))
RMSE(data.test$Balance, predict(model.backward.AIC, newdata = data.test.bin))
RMSE(data.test$Balance, predict(model.forward.BIC, newdata = data.test.bin))
> RMSE(data.test$Balance, predict(model.null, newdata = data.test.bin)) [1] 453.3665 > RMSE(data.test$Balance, predict(model.full.bin, newdata = data.test.bin)) [1] 105.0656 > RMSE(data.test$Balance, predict(model.backward.AIC, newdata = data.test.bin)) [1] 104.2375 > RMSE(data.test$Balance, predict(model.forward.BIC, newdata = data.test.bin)) [1] 102.6785
Findings:
Model | Test RMSE |
Null model (benchmark) | 453.3665 |
Full model | 105.0656 |
Final model based on backward selection and AIC | 104.2375 |
Final model based on forward selection and BIC | 102.6785 (smallest) |
On the basis of both prediction accuracy and interpretability, the final model based on forward selection and BIC is justifiably the most promising model among all of the linear models we have considered and we will recommend using this model.
Stage 5: Model Validation
Model Diagnostics – plot()
# CHUNK 20
plot(model.forward.BIC)
1. “Residuals vs Fitted” Plot
Findings: no discernible patterns.
- The residuals are dispersed in an erratic manner.
- The plot does not exhibit a funnel shape, in which the amount of fluctuation of the residuals would increase with the magnitude of the fitted values. This check serves as an informal test of the homogeneity of the error variance.
- The plot does not exhibit a U-shape. A clear U-shape would suggest that there are non-linear patterns, beyond interaction effects, that fail to be accommodated by the fitted model.
Conclusion:
- One may still try to refine the model by incorporating polynomial terms in Rating (although polynomial relationships don't seem very pronounced) and interaction terms such as Income × Rating; a hedged sketch of such a refinement follows.
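A minimal sketch of such a refinement, reusing the training and test sets from the earlier chunks; the quadratic term in Rating and the Income:Rating interaction are illustrative choices, not a recommended final model:

# Illustrative refinement only: quadratic term in Rating plus an Income:Rating interaction
model.refined <- lm(Balance ~ Rating + I(Rating^2) + Income + Student.No + Income:Rating,
                    data = data.train.bin)
# Compare its test RMSE with that of the Task 7 model (102.6785)
RMSE(data.test.bin$Balance, predict(model.refined, newdata = data.test.bin))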
2. “Normal Q-Q” Plot
What does a Q-Q plot plot?
- For a linear model: the (standardized) raw residuals \(e_i=y_i-\hat{\mu}_i\).
- For a GLM: the deviance residuals \(d_i=\text{sign}(y_i-\hat{\mu}_i)\sqrt{D_i}\), where \(D_i\) is the contribution of the ith observation to the deviance.
Purpose
- check the distributional assumptions of the model.
- assess the normality of the residuals.
Scenarios
- The residuals lie closely and randomly on the 45° straight line passing through the origin → the normality assumption is supported:
  - the residuals are approximately normally distributed;
  - the residuals have no systematic patterns when considered on their own and with respect to the predictors (if there are systematic patterns, the model is not capturing some important features of the data);
  - the residuals have approximately constant variance upon standardization.
- The residuals deviate substantially upward at the two ends of the data → the data appear more skewed than the specified distribution:
  - there are a lot more large residuals than would be anticipated under the specified distribution.
Conclusion:
- The upward deviations at the two ends suggest using a response distribution with a fatter tail than the normal distribution.
Stage 4: Model Construction and Evaluation (Using Regularization)
Model Construction – Fitting Regularized Regression Models
The glmnet::glmnet() Function
- Create a design matrix using model.matrix(y ~ . + x_1:x_2, data = <data set>).
- In glmnet(), set family = "gaussian" if the errors are normal.
- In glmnet(), set lambda = c(λ_1, λ_2, ..., λ_n) to fit one model for each value of λ.

glmnet(x = <design matrix>, y = <target variable>, family = "gaussian",
       lambda = c(0, 10, 100, 500, 1000), alpha = 0.5)
Create a Design Matrix
# CHUNK 21
X.train <- model.matrix(Balance ~ . + Income:Student, data = data.train)
head(X.train) # print out a few rows of the design matrix
(Intercept) Income Rating Cards Age Education Gender Male StudentYes MarriedNo EthnicityAfrican American EthnicityAsian Income:StudentYes 1 1 14.891 283 2 34 11 1 0 0 0 0 0.000 2 1 106.025 483 3 82 15 0 1 0 0 1 106.025 3 1 104.593 514 4 71 11 1 0 1 0 1 0.000 6 1 80.180 569 4 77 10 1 0 1 0 0 0.000 8 1 71.408 512 2 87 9 1 0 1 0 1 0.000 9 1 15.125 266 5 66 13 0 0 1 0 0 0.000
Fitting a Regularized Regression Model

# CHUNK 22
#install.packages("glmnet") # uncomment this line the first time you use glmnet
library(glmnet)
m.lambda <- glmnet(x = X.train, y = data.train$Balance,
                   family = "gaussian",
                   lambda = c(0, 10, 100, 500, 1000),
                   alpha = 0.5)

Extract the coefficient estimates to see how they vary with the value of λ.
1. Using the properties of the glmnet object

# CHUNK 22 (Cont.)
# First way to extract coefficient estimates
m.lambda$a0
s0 s1 s2 s3 s4 520.4352 301.2973 -224.9669 -506.0179 -579.1925
m.lambda$beta
12 x 5 sparse Matrix of class "dgCMatrix" s0 s1 s2 s3 s4 (Intercept) . . . . . Income . . . -6.63360837 -7.9320588 Rating . 0.6182676 2.034434 3.71042914 4.0114839 Cards . . . 0.51433943 2.9161946 Age . . . -0.70550849 -0.7885101 Education . . . . 0.8874960 Gender Male . . . 3.08510246 18.3671409 StudentYes . . 228.762987 396.62014836 396.2181065 MarriedNo . . . 9.36968005 20.4646766 EthnicityAfrican American . . . . -11.9281110 EthnicityAsian . . . . -5.3983474 Income:StudentYes . . . 0.01473463 0.4929168
2. Applying the coef() function to a glmnet object

# CHUNK 22 (Cont.)
# Second way
round(coef(m.lambda), digits = 4)
13 x 5 sparse Matrix of class "dgCMatrix" s0 s1 s2 s3 s4 (Intercept) 520.4352 301.2973 -224.9669 -506.0179 -579.1925 (Intercept) . . . . . Income . . . -6.6336 -7.9321 Rating . 0.6183 2.0344 3.7104 4.0115 Cards . . . 0.5143 2.9162 Age . . . -0.7055 -0.7885 Education . . . . 0.8875 Gender Male . . . 3.0851 18.3671 StudentYes . . 228.7630 396.6201 396.2181 MarriedNo . . . 9.3697 20.4647 EthnicityAfrican American . . . . -11.9281 EthnicityAsian . . . . -5.3983 Income:StudentYes . . . 0.0147 0.4929
Findings:
Column | lambda | Conclusion |
s4 | 0 | No shrinkage: all 11 features have non-zero coefficient estimates (essentially the ordinary least squares fit). |
s3 | 10 | Education and the two Ethnicity dummy variables are shrunk to exactly zero; 8 features remain. |
s2 | 100 | Only Rating and StudentYes retain non-zero coefficient estimates. |
s1 | 500 | Only Rating retains a non-zero coefficient estimate. |
s0 | 1000 | All coefficient estimates other than the intercept are shrunk to zero, leaving the intercept-only model. |
As λ increases, the coefficient estimates shrink toward zero and more of them become exactly zero.
TASK 10: Perform feature selection with regularization
Hyperparameter Tuning: The cv.glmnet() Function
cv.glmnet()
Purpose
This function performs k-fold cross-validation (10-fold by default), calculates the cross-validation error for a range of values of λ, and produces the optimal value of λ as a by-product.
plot(<cv.glmnet object>)
Purpose
Returns the cross-validation curve: the cross-validation error plotted as a function of the values of λ used.
- Red dots: the cross-validation error corresponding to a range of values of λ (or log λ).
- Vertical lines around the dots: the standard error bands.
- The numbers at the top of the graph: how many features have non-zero coefficient estimates at a given value of λ (or log λ).
- The first vertical dotted line from the left: the value of λ (or log λ) at which the cross-validation error is minimized (lambda.min).
- The second vertical dotted line from the left: lambda.1se, the greatest value of λ (equivalently, the simplest regularized regression model) whose cross-validation error is within one standard error of the minimum error.
If you are asked to tune a regularized regression model on the exam, the default is to select the value of λ that minimizes the cross-validation error (i.e., take λ = lambda.min).
# CHUNK 23
set.seed(1111)
m <- cv.glmnet(x = X.train,
               y = data.train$Balance,
               family = "gaussian",
               alpha = 0.5)
plot(m)

m$lambda.min
[1] 0.8885539
m$lambda.1se
[1] 7.550517
Findings:
- The first dotted line from the left has λ = m$lambda.min = 0.8885539.
- The second dotted line from the left has λ = m$lambda.1se = 7.550517.
Fitting the Regularized Model Using lambda.min
This value of λ minimizes the cross-validation error.

# CHUNK 24
# Fit the regularized model using lambda.min
m.min <- glmnet(x = X.train,
                y = data.train$Balance,
                family = "gaussian",
                lambda = m$lambda.min,
                alpha = 0.5)
m.min$beta
12 x 1 sparse Matrix of class "dgCMatrix" s0 (Intercept) . Income -7.8131829 Rating 3.9839963 Cards 2.6915109 Age -0.7829239 Education 0.7557949 Gender Male 17.0049498 StudentYes 396.3329355 MarriedNo 19.4182020 EthnicityAfrican American -9.9708020 EthnicityAsian -3.6479011 Income:StudentYes 0.4458288
# Set up the design matrix for the test set
X.test <- model.matrix(Balance ~ . + Income:Student, data = data.test)

# Make predictions on the test set and calculate the test RMSE
m.min.pred <- predict(m.min, newx = X.test)
RMSE(data.test$Balance, m.min.pred)
[1] 103.9201
Findings:
- All 11 features are retained at the lambda.min value, as we observed from plot(m).
- The test RMSE of the resulting model is 103.9201, which is slightly higher than that of the final model in Task 7 (102.6785).
- The regularized model is also much harder to interpret due to its substantially larger size, whereas the model in Task 7 has only 3 features.
- As a result, it is reasonable to recommend the model in Task 7 as the final model owing to its better prediction accuracy and interpretability.
Fitting the Regularized Model with λ = lambda.1se
The value of λ based on the one-standard-error rule is equal to lambda.1se, or 7.550517.

# CHUNK 25
# Fit the regularized model using lambda.1se
m.1se <- glmnet(x = X.train,
                y = data.train$Balance,
                family = "gaussian",
                lambda = m$lambda.1se,
                alpha = 0.5)
# List the coefficient estimates
m.1se$beta
12 x 1 sparse Matrix of class "dgCMatrix" s0 (Intercept) . Income -6.9378441 Rating 3.7813573 Cards 1.1075391 Age -0.7305808 Education . Gender Male 6.7717419 StudentYes 397.5199346 MarriedNo 11.6768447 EthnicityAfrican American . EthnicityAsian . Income:StudentYes 0.1148304
# Make predictions on the test set and calculate the test RMSE
m.1se.pred <- predict(m.1se, newx = X.test)
RMSE(data.test$Balance, m.1se.pred)
[1] 102.8241
Findings:
- The fitted model has 8 features, as we observed from plot(m); Education, EthnicityAfrican American, and EthnicityAsian are removed.
- The test RMSE has dropped to 102.8241, a slight improvement over that of the model based on lambda.min.
- The regularized model still has slightly worse prediction accuracy, as well as worse interpretability, than the model in Task 7 (test RMSE 102.6785 with only 3 features).
- As a result, it is reasonable to recommend the model in Task 7 as the final model owing to its better prediction accuracy and interpretability.
How to choose the optimal value of α?
To optimize over α, we have to resort to trial and error, proceeding as follows:
1. Set up a grid of values of α (given in the question beforehand).
2. For each value of α, select the corresponding optimal value of λ and evaluate the corresponding cross-validation error.
3. Adopt the value of α that produces the smallest cross-validation error.
In essence, we break down the two-variable optimization problem by first minimizing over λ for a given α, then minimizing over α.
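A minimal sketch of this tuning loop, reusing X.train and data.train from the earlier chunks; the grid of α values and the seed are illustrative assumptions:

# Illustrative alpha-tuning loop (the grid of alpha values is an assumption)
library(glmnet)
alpha.grid <- c(0, 0.25, 0.5, 0.75, 1)
cv.errors <- numeric(length(alpha.grid))
set.seed(42)  # illustrative seed
for (j in seq_along(alpha.grid)) {
  cv.fit <- cv.glmnet(x = X.train, y = data.train$Balance,
                      family = "gaussian", alpha = alpha.grid[j])
  # Cross-validation error at the optimal lambda for this alpha
  cv.errors[j] <- min(cv.fit$cvm)
}
alpha.grid[which.min(cv.errors)]  # adopt the alpha with the smallest CV error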
4.1. Generalized Linear Models
GLMs
Equation
\(g(\mu)=\eta:=\beta_0+\beta_1X_1+…+\beta_pX_p\), where
- \(g(\cdot)\): The link function “linking” the target mean µ to the linear combination of the predictors
- monotonic: allows us to invert the link function to \(\mu=g^{-1}(\eta)=g^{-1}(\beta_0+\beta_1X_1+…+\beta_pX_p)\)
A linear model is obtained with the identity link g(µ) = µ, giving \(\mu=\eta=\beta_0+\beta_1X_1+…+\beta_pX_p\).
GLMs vs. Linear Models
- A GLM is more flexible than a linear model in two respects:
  - Distribution of the target variable: the target variable is no longer confined to the class of normal random variables; it can be any member of the exponential family of distributions.
  - Relationship between the target mean and the linear predictor: a GLM sets a function of the target mean to be linearly related to the predictors.
- Distribution of the target variable:
GLMs vs. Linear Models on Transformed Data
- The transformations are applied differently (externally vs. internally), as illustrated in the sketch below.
  - Linear models on transformed data: we transform the target observations to bring them closer to normal, so that they can be reasonably described by a normal linear model.
  - GLMs: the data are modeled directly; the target variable itself is not transformed, and the transformation (the link) plays its role only within the GLM.
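A minimal sketch of the contrast on toy simulated data (the variable names, the gamma target, and the log link are illustrative assumptions, not part of the case study):

# Toy positive, right-skewed target
set.seed(1)
x <- runif(200)
y <- rgamma(200, shape = 2, rate = 2 / exp(1 + 2 * x))

# Linear model on transformed data: the target itself is transformed (externally)
fit.lm <- lm(log(y) ~ x)

# GLM: the data are modeled directly; the log link transforms the mean (internally)
fit.glm <- glm(y ~ x, family = Gamma(link = "log"))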
Typical Relationships between Target Means and X
Equation | Relationship | Component |
\(\mu=X^{\beta}\) | The target mean has a power relationship with X | β is the exponent |
Selection of Target Distributions
To choose one that best aligns with the characteristics of a given target variable.
- Positive, continuous, and right-skewed data:
- Examples: claim amounts, income, and amount of insurance coverage
- Target Distribution: gamma and inverse Gaussian, to capture the skewness of the target variable directly without the use of transformations.
- gamma has positively related mean and variance, a desirable characteristic to model claim severity.
- inverse Gaussian has a similar behavior but is more highly skewed than the gamma.
- Binary data:
- Examples: classification problems in which the target variable indicates the occurrence (1) or non-occurrence (0) of an event of interest.
- Target Distribution: The Binomial (Bernoulli) distribution is probably the only reasonable choice.
- Count data:
- Examples: The number of times a certain event of interest happens over a reference time period. It assumes only non-negative integer values, possibly with a right skew.
- Target Distribution: The Poisson distribution is one of the most natural candidate distributions.
- One of its drawbacks is that it requires mean = variance.
- Aggregate data:
- Examples: Aggregate claim losses, which tend to have a large mass at zero as most policies have no claims and a continuous distribution skewed to the right reflecting total claim sizes when there are claims.
- Target Distribution: The Tweedie distribution. A random variable S follows the Tweedie distribution if it admits the compound Poisson-gamma representation.
- a Poisson sum of gamma random variables
- It has a discrete probability mass at zero and a probability density function on the positive real line.
Selection of Link Functions
Appropriateness of Predictions
Case | Positive mean | Mean in the unit interval |
Distribution | Poisson, gamma, and inverse Gaussian | binomial |
Mean | positive and unbounded | between 0 and 1 |
Link Function | log link: g(µ) = ln µ | logit link: \(\eta=g(\pi)=\ln \dfrac{\pi}{1-\pi}=\ln(\text{odds})\) |
Benefits | model predictions are guaranteed to be positive | predicted probabilities are guaranteed to lie between 0 and 1 |
Interpretability
Definition
A good link function should have ease of interpretation.
What is ease of interpretation?
Easy to form statements that describe the effects of the predictors of the model on the target mean in terms of the model coefficients.
Can answer
If a continuous predictor X increases by 1 unit, by how much will the target mean change?
Canonical Link
Advantage
- Simplifies the mathematics of the estimation procedure and makes it more likely to converge.
Important factors to consider
- Whether the predictions provided by the link align with the characteristics of the target variable.
- Whether the resulting GLM is easy to interpret.
Common Canonical Link
Target Distribution | Canonical Link | Mathematical Form | Other Link(s) |
Normal | Identity | \(g(\mu)=\mu\) | – |
Binomial | Logit | \(g(\pi)=\ln \left(\dfrac{\pi}{1-\pi}\right)\) | Probit, cloglog |
Poisson | Natural Log | \(g(\mu)=\ln \mu\) | – |
Gamma | Inverse | \(g(\mu)=\mu^{-1}\) | Log |
Inverse Gaussian | Squared Inverse | \(g(\mu)=\mu^{-2}\) | Log |
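For reference, a minimal sketch of how these target distribution and link combinations are specified through R's glm(); the simulated data and variable names are illustrative only:

# Toy data to illustrate the family/link syntax (illustrative only)
set.seed(1)
x  <- runif(100)
yc <- rgamma(100, shape = 2, rate = 2 / exp(1 + x))     # positive, continuous
yb <- rbinom(100, size = 1, prob = plogis(-1 + 2 * x))  # binary
yn <- rpois(100, lambda = exp(0.5 + x))                 # counts

fit.normal   <- glm(yc ~ x, family = gaussian(link = "identity"))
fit.binom    <- glm(yb ~ x, family = binomial(link = "logit"))
fit.poisson  <- glm(yn ~ x, family = poisson(link = "log"))
fit.gamma    <- glm(yc ~ x, family = Gamma(link = "inverse"))
fit.invgauss <- glm(yc ~ x, family = inverse.gaussian(link = "1/mu^2"))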
Interpretation of GLM Coefficients
- The interpretation depends on the choice of the link function; the target distribution plays no role in the interpretation.
Log Link
\(\ln\mu=\eta \Rightarrow \mu=e^{\eta}=e^{\beta_0+\beta_1X_1+\cdots+\beta_jX_j+\cdots+\beta_pX_p}=e^{\beta_0}\times e^{\beta_1X_1}\times\cdots\times e^{\beta_jX_j}\times\cdots\times e^{\beta_pX_p}\)
For a unit increase in Xj:
Case | Change | Interpretation |
Numeric Xj | Multiplicative: \(\text{new }\mu = \boxed{e^{\beta_j}}\times\text{old }\mu\) | The target mean is multiplied by \(e^{\beta_j}\), holding all other predictors fixed. |
Numeric Xj | Percentage: \(\dfrac{(e^{\beta_j}-1)\mu}{\mu}=e^{\beta_j}-1\) | The target mean changes by \(100(e^{\beta_j}-1)\)%. |
Categorical Xj (non-baseline level) | Multiplicative: \(\text{new }\mu = e^{\beta_j}\times\) (target mean at the baseline level) | The target mean at that level is \(e^{\beta_j}\) times the target mean at the baseline level. |
Categorical Xj (non-baseline level) | Percentage: \(100(e^{\beta_j}-1)\)% | The target mean at that level differs from that at the baseline level by \(100(e^{\beta_j}-1)\)%. |
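An illustrative worked example (the coefficient value is made up): if a log-link GLM has \(\hat{\beta}_j = 0.05\) for a numeric predictor \(X_j\), then a unit increase in \(X_j\) multiplies the target mean by \(e^{0.05}\approx 1.0513\), i.e., increases it by about \(100(e^{0.05}-1)\%\approx 5.13\%\), holding all other predictors fixed.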
Logit Link
For a unit increase in Xj:
Case | Change | Interpretation |
Numeric Xj | Multiplicative: \(\text{new odds} = \boxed{e^{\beta_j}}\times\text{old odds}\) | The odds are multiplied by \(e^{\beta_j}\), holding all other predictors fixed. |
Numeric Xj | Percentage: \(100(e^{\beta_j}-1)\)% of the old odds | The odds change by \(100(e^{\beta_j}-1)\)%. |
Weights and Offsets
Purpose
Both weights and offsets are designed to incorporate a measure of exposure into a GLM to improve the fitting.
Methodology: Weights
To take advantage of the fact that different observations in the data may have different exposures and thus different degrees of precision, we can attach a higher weight to observations with a larger exposure so that these more credible observations will carry more weight in the estimation of the model coefficients. The goal is to improve the reliability of the fitting procedure.
Methodology: Offsets
Assumption: the target mean is directly proportional to the exposure.
Let Ei be the exposure for the ith observation, then the model equation is:
\(\ln\mu_i=\boxed{\ln E_i}+\eta_i=\ln E_i+(\beta_0+\beta_1X_{i1}+…+\beta_pX_{ip})\), where
- the logged exposure \(\ln E_i\) is called an offset.
- The coefficient of \(\ln E_i\) is set a priori to 1 (it is not estimated).
The above equation can be written in two ways to show further insights:
- Upon exponentiation, it is equivalent to \(\mu_i=E_i e^{\eta_i}\), which shows that µi and Ei are in direct proportion.
-
\(\ln(\dfrac{\mu_i}{E_i})=\eta_i=\beta_0+\beta_1X_{i1}+…+\beta_pX_{ip}\)
In this form, we are modeling the occurrence rate of the event of interest, or the expected number of events per exposure.
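A minimal sketch of a Poisson GLM with an offset on toy simulated data (all names and values illustrative):

# Toy claim-count data: counts observed over exposures of different sizes
set.seed(2)
exposure <- runif(300, 0.5, 2)   # e.g., policy years
x        <- rnorm(300)
counts   <- rpois(300, lambda = exposure * exp(-1 + 0.3 * x))

# ln(exposure) enters as an offset; its coefficient is fixed at 1, not estimated
fit.offset <- glm(counts ~ x, family = poisson(link = "log"),
                  offset = log(exposure))
summary(fit.offset)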
Offsets vs. Weights
Offsets and weights impose vastly different structures on the mean and variance of the target variable.
Key Differences:
- the form of the target variable in the dataset (average vs. aggregate).
- how the exposure affects the target variable (mean vs. variance).
Differences:
- Weights: To use weights properly, the observations of the target variable should be averaged by exposure. Due to the averaging, the variance of each observation is inversely related to the size of the exposure, which serves as the weight for that observation. However, the weights do not affect the mean of the target variable.
- Offsets: For offsets, the observations are values aggregated over the exposure units. The exposure, when serving as an offset, is in direct proportion to the mean of the target variable, but otherwise leaves its variance unaffected.
(Remember that for a GLM, the mean and variance of the target variable are generally related, so an offset also affects the variance, but only as a by-product)
Weights | Offsets | |
Target Variable: Form | averaging observations | aggregating observations |
Target Variable: Mean | no effects | the exposure is directly proportional to the target mean |
Target Variable: Variance | the variance is inversely related to the size of the exposure | no effects |
Special Case: Weights and Offsets
For Poisson regression with the log link, weights and offsets produce identical results: the following two generic models are equivalent in the sense that they share the same coefficient estimates, standard errors, p-values, and, most importantly, predictions (a simulated sketch follows the list):
- Model 1: A Poisson regression model fitted to Y (count) using ln E as the offset
- Model 2: A Poisson regression model fitted to Y/ E (frequency) using E as the weight
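The sketch below (simulated data, not from the manual) illustrates the equivalence; R warns about non-integer counts when fitting Model 2, but the coefficient estimates are unaffected:

set.seed(42)
n <- 500
x <- runif(n)
E <- runif(n, 0.5, 2)                            # exposures
y <- rpois(n, lambda = E * exp(-1 + 0.5 * x))    # counts whose mean is proportional to E
df <- data.frame(y, x, E)

# Model 1: Poisson regression on the count y with ln(E) as an offset
m1 <- glm(y ~ x, family = poisson(link = "log"), offset = log(E), data = df)

# Model 2: Poisson regression on the frequency y/E with E as the weight
m2 <- glm(y / E ~ x, family = poisson(link = "log"), weights = E, data = df)

cbind(coef(m1), coef(m2))                        # identical coefficient estimates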
Model Quantities and Diagnostic Tools
Coefficient Estimates
Global Goodness-of-Fit Measure: Deviance
R2
-
- R2 is not a goodness-of-fit measure for GLM because it is based on the normal distribution.
Deviance
-
- R2 measures how a linear model compares with the most primitive linear model (the intercept-only linear model);
- Deviance measures the extent to which the GLM departs from the most elaborate GLM, which is known as the saturated model.
- The saturated model, lSAT , has the same target distribution as the fitted GLM, l.
- It has as many model parameters as the number of training observations.
- It perfectly fits each training observation and is a very flexible GLM.
Formula:
\(D=2(l_{SAT}-l)\)
Interpretation:
-
- The lower the deviance, the closer the GLM is to the model with a perfect fit, and the better its goodness of fit on the training set.
Model Statistics:
- For a normal target variable with variance σ2, \(D=\dfrac{\sum_{i=1}^{n_{tr}}(y_i-\hat{y}_i)^2}{\sigma^2}\).
In this sense, the deviance can be viewed as a generalization of the residual sum of squares that works even for non-normal target variables.
- The deviance can only be used to compare GLMs having the same target distribution
(so that they share the same maximized loglikelihood of the saturated model, \(l_{SAT}\)).
- The deviance residual is an important model diagnostic tool for GLMs (see the next subsection).
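A quick R sketch of these facts, using an arbitrary GLM fitted to the built-in cars dataset purely for illustration (not one of the case-study models):

fit <- glm(dist ~ speed, data = cars, family = Gamma(link = "log"))  # any illustrative GLM

deviance(fit)                              # the residual deviance D reported by summary(fit)
sum(residuals(fit, type = "deviance")^2)   # same value: the squared deviance residuals sum to D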
Local Goodness-of-Fit Measure: Deviance Residuals
Raw residuals for GLM?
-
- The target distributions of GLMs are generally not normal.
- Variances are not constant.
Formula
\(d_i=\text{sign}(y_i-\hat{\mu}_i)\sqrt{D_i}\), where \(D_i\) is the contribution of the ith observation to the deviance, so that \(\sum_{i=1}^{n_{tr}}d^2_i=D\)
Properties (parallel those of the raw residuals in a linear model)
(By examining Q-Q plots of deviance residuals)
-
- They are approximately normally distributed for most target distributions in the linear exponential family (the binomial distribution is a notable exception).
- They have no systematic patterns when considered on their own and with respect to the predictors.
- They have approximately constant variance upon standardization (using the standard error implied by the GLM).
The deviance residuals don’t have to be normally distributed for the model to be valid, so the normality / non-normality of the residuals doesn’t necessarily tell you anything.
Penalized Likelihood Measures: AIC and BIC
Formula
-
- \(AIC=-2l+2p\)
- \(BIC=-2l+p\ln (n_{tr})\)
where \(l\) is the maximized loglikelihood of the model and \(p\) is the number of estimated parameters.
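For any fitted GLM (for example, the illustrative fit object from the sketch above), R computes both measures directly:

AIC(fit)      # -2*(maximized loglikelihood) + 2p
BIC(fit)      # -2*(maximized loglikelihood) + p*log(n_tr)
logLik(fit)   # the maximized loglikelihood l and the parameter count p behind both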
Performance Metrics for Classifiers
RMSE?
-
- RMSE is the most commonly used performance metric for numeric target variables.
- GLMs accommodate both numeric and categorical target variables, so other performance metrics are needed for classifiers.
Confusion Matrix for Binary Classifier
Predicted probabilities into the predicted classes
To translate the predicted probabilities into the predicted classes, we need a pre-specified cutoff (or threshold).
-
-
- If the predicted probability of the event of interest for an observation is higher than the cutoff, then the event is predicted to occur.
- If the predicted probability of the event of interest for an observation is lower than the cutoff, then the event is predicted not to occur.
-
Prediction | Reference (= Actual) | 
– | + | 
– | TN | FN |
+ | FP | TP |
-
-
- TP = true positive: The classifier predicts that the event occurs (“positive”) and indeed it does (“true”).
- TN = true negative: The classifier predicts that the event does not occur (“negative”) and indeed it does not (“true”).
- FP = false positive: The classifier predicts that the event occurs (“positive”), but it does not (“false”).
- FN = false negative: The classifier predicts that the event does not occur (“negative”), but it does (“false”).
-
Performance Metrics
\(\text{Classification Error Rate}=\dfrac{FN+FP}{n}\) = the proportion of misclassified observations
\(\text{Accuracy}=\dfrac{TN+TP}{n}\) = the proportion of correctly classified observations (= 1 − classification error rate)
\(\text{Sensitivity}=\dfrac{TP}{TP+FN}\) = the relative frequency of correctly predicting the event of interest when the event does take place
\(\text{Specificity}=\dfrac{TN}{TN+FP}\) = the relative frequency of correctly predicting a non-event when the event does not take place
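A tiny sketch with made-up counts shows how the four metrics follow from the entries of the confusion matrix:

TP <- 40; FN <- 10; TN <- 30; FP <- 20
n  <- TP + FN + TN + FP

(FN + FP) / n        # classification error rate = 0.3
(TN + TP) / n        # accuracy = 0.7
TP / (TP + FN)       # sensitivity = 0.8
TN / (TN + FP)       # specificity = 0.6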
Effects of Changing the Cutoff
A trade-off between having high sensitivity and having high specificity
-
-
- Ideally, the classifier and the cutoff should be such that both the sensitivity and specificity are close to 1.
- In general, sensitivity and specificity, as a function of the cutoff, are in conflict with each other.
- The determination of the precise cutoff requires weighing the benefits of correct classifications (TN and TP) and the costs of misclassifications (FN and FP).
-
An optimal cutoff can then be chosen by maximizing the overall profit or minimizing the overall cost as a function of the cutoff.
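The sketch below illustrates this idea; p_hat and actual are simulated stand-ins for a classifier's predicted probabilities and the observed classes, and the two misclassification costs are made-up values:

set.seed(1)
p_hat  <- runif(200)                      # hypothetical predicted probabilities
actual <- rbinom(200, 1, p_hat)           # hypothetical observed classes
cost.fp <- 1; cost.fn <- 5                # assume a false negative costs 5 times a false positive

cutoffs <- seq(0.05, 0.95, by = 0.05)
total.cost <- sapply(cutoffs, function(k) {
  pred <- as.numeric(p_hat > k)
  fp <- sum(pred == 1 & actual == 0)
  fn <- sum(pred == 0 & actual == 1)
  cost.fp * fp + cost.fn * fn
})
cutoffs[which.min(total.cost)]            # cutoff minimizing the total misclassification cost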
Extreme Case 1
Cutoff = 0: every predicted probability exceeds the cutoff, so every observation is predicted positive (sensitivity = 1, specificity = 0).
Prediction | Reference (= Actual) | |
– | + | |
– | 0 | 0 |
+ | n_ | n+ |
where:
-
-
- n_ = TN + FP is the total number of negative observations.
- n+ = FN + TP is the total number of positive observations.
- n = n_ + n+ = TN + FP + FN + TP is the total number of observations.
-
Increasing the Cutoff
As the cutoff increases, more observations are predicted to be negative, so the entries in the confusion matrix move from the second row to the first row.
Prediction | Reference (= Actual) | |
– | + | |
– | TN | FN |
+ | FP | TP |
-
-
- Sensitivity decreases but specificity increases.
-
Extreme Case 2
Cutoff = 1: no predicted probability exceeds the cutoff, so every observation is predicted negative.
Prediction | Reference (= Actual) | |
– | + | |
– | n_ | n+ |
+ | 0 | 0 |
-
-
- Sensitivity = 0 and specificity = 1.
-
Receiver Operating Characteristic (ROC) Curves
Definition
A graphical tool to plot the sensitivity against the specificity of a given classifier for each cutoff ranging from 0 to 1. (x, y) := (specificity, sensitivity).
A good classifier should be such that:
-
- its sensitivity and specificity are both close to 1 for a wide range of cutoffs, so its ROC curve bends toward the upper left and approaches the top-left corner point (1, 1). The closer the curve is to the top-left corner, the better the predictive ability.
Predictive Performance: Area under the Curve (AUC)
How do we evaluate predictive performance? The higher the AUC, the better the predictive performance.
AUC = 1
Such a classifier:
-
-
- separates all of the observations into the two classes correctly whenever the cutoff is not 0 or 1, with:
- Sensitivity = 1
- Specificity = 1
-
AUC = 0.5
The naive classifier classifies the observations purely at random (say, predicting the positive class with probability p for every observation), without using the information contained in the predictors.
Prediction | Reference (= Actual) | |
– | + | |
– | (1 – p)n_ | (1 – p)n+ |
+ | pn_ | pn+ |
-
-
- Sensitivity = \(\dfrac{pn_{+}}{pn_{+}+(1-p)n_{+}}=p\)
- Specificity = \(\dfrac{(1-p)n_{-}}{(1-p)n_{-}+pn_{-}}=1-p\)
Because sensitivity = 1 − specificity for every choice of p, the ROC curve is the 45° diagonal line and the AUC equals 0.5.
-
Sidebar: Unbalanced data
Why unbalanced data is problematic?
With unbalanced data, a classifier implicitly places more weight on the majority class and tries to match the training observations in that class, without paying enough attention to the minority class.
The problem is worse if the minority class is the class of interest, i.e., the positive class.
Undersampling
Definition
Undersampling produces roughly balanced data by drawing fewer observations from the negative class (“undersampled”) and retaining all of the positive class observations.
Pros
With relatively more data from the positive class (in proportion), this will improve the classifier’s ability to pick up the signal leading to the positive class.
Cons
The drawback is that the classifier, now based on less data and therefore less information about the negative class, could be less robust and more prone to overfitting.
Oversampling
Definition
Oversampling keeps all of the original data, but oversamples (with replacement) the positive class to reduce the imbalance between the two classes.
Some of the positive class observations will appear more than once in the balanced data.
Data Set
Oversampling should be performed on the training data, following the training/test set split.
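A minimal sketch (toy data, not from the case study) of oversampling the positive class after the training/test split; only the training set is rebalanced:

set.seed(1)
train.toy <- data.frame(y = rbinom(1000, 1, 0.05), x = rnorm(1000))  # unbalanced toy training set

pos <- train.toy[train.toy$y == 1, ]
neg <- train.toy[train.toy$y == 0, ]

# Resample the positive class with replacement until the two classes are balanced
pos.over <- pos[sample(nrow(pos), nrow(neg), replace = TRUE), ]
train.bal <- rbind(neg, pos.over)

table(train.bal$y)   # equal class counts; the test set is left untouched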
4.2. Generalized Linear Models Case Study 1
Learning Objectives
- Select appropriate distributions and link functions for a positive, continuous target variable with a right skew.
- Fit a GLM using the glm() function in R and specify the options of this function appropriately.
- Make predictions for GLMs using the predict() function and compare the predictive performance of different GLMs.
- Generate and interpret diagnostic plots for a GLM.
Stage 1: Define the Business Problem
Objectives
To build GLMs to predict the size of settled claims using related risk factors in the dataset, select the most promising GLM, and quantify its predictive accuracy.
Modeling Options
- Log-transform the claim size and fit a normal linear model.
- GLM with the log link function and normal distribution
- GLM with the gamma or inverse Gaussian distribution
TASK 1: Data pre-processing Determine which numeric variables in the data, if any, should be treated as factors. Do the factor conversion. |
Stage 2: Data Collection
# CHUNK 1 persinj <- read.csv("persinj.csv") summary(persinj)
amt inj legrep op_time Min. : 10 Min. :1.00 Min. :0.0000 Min. : 0.10 1st Qu.: 6297 1st Qu.:1.00 1st Qu.:0.0000 1st Qu.:23.00 Median : 13854 Median :1.00 Median :1.0000 Median :45.90 Mean : 38367 Mean :1.83 Mean :0.6366 Mean :46.33 3rd Qu.: 35123 3rd Qu.:2.00 3rd Qu.:1.0000 3rd Qu.:69.30 Max. :4485797 Max. :9.00 Max. :1.0000 Max. :99.10
Findings:
- We need to convert
inj
from numeric to categorical variable, since theinj
variable is merely a label of group membership indicating the injury group to which each policy belongs.
Convert inj
to a Factor
# CHUNK 2 persinj$inj <- as.factor(persinj$inj) summary(persinj)
amt inj legrep op_time Min. : 10 1:15638 Min. :0.0000 Min. : 0.10 1st Qu.: 6297 2: 3376 1st Qu.:0.0000 1st Qu.:23.00 Median : 13854 3: 1133 Median :1.0000 Median :45.90 Mean : 38367 4: 189 Mean :0.6366 Mean :46.33 3rd Qu.: 35123 5: 188 3rd Qu.:1.0000 3rd Qu.:69.30 Max. :4485797 6: 256 Max. :1.0000 Max. :99.10 9: 1256
Findings:
- Since the baseline level (“1”) has the most observations, there is no need to relevel
inj
.
Stage 3: Exploratory Data Analysis
N/A.
TASK 2: Fit a linear model as a benchmark
|
Stage 4: Model Construction and Evaluation
Training/Test Set Split
library(caret) set.seed(2019) partition <- createDataPartition(y = persinj$amt, p = .75, list = FALSE) data.train <- persinj[partition, ] data.test <- persinj[-partition, ] summary(data.train$amt) summary(data.test$amt)
Validating the training and test sets: their summary statistics (in particular the means) should be close.
summary(data.train$amt) summary(data.test$amt)
> summary(data.train$amt) Min. 1st Qu. Median Mean 3rd Qu. Max. 20 6297 13852 38670 35118 4485797 > summary(data.test$amt) Min. 1st Qu. Median Mean 3rd Qu. Max. 10 6299 13858 37458 35140 1450818
Findings:
- The maximum claim sizes of the two sets differ considerably even with the use of stratified sampling.
- The two means are not as close to each other as we expect.
- The two medians, 13,852 and 13,858, are very close to each other.
Benchmark Model: Normal Linear model on Log of Claim Size
ggplot(persinj, aes(x = op_time, y = log(amt), color = legrep)) + geom_point(alpha = 0.25) + geom_smooth(method = "lm", se = FALSE)
Findings:
- It seems plausible that legal representation and operational time interact with each other to affect
amt
, so we will incorporate an interaction term for these two predictors.
Found interaction term: legrep:op_time
Fitting the benchmark model
Our benchmark model is the linear model on the log of the claim size fitted using ordinary least squares (OLS).
# CHUNK 4 glm.ols <- glm(log(amt) ~ inj + legrep * op_time, family = gaussian(link = "identity"), data = data.train) summary(glm.ols)
Call: glm(formula = log(amt) ~ inj + legrep * op_time, family = gaussian(link = "identity"), data = data.train) Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 7.5212622 0.0269021 279.579 < 2e-16 *** inj2 0.5912759 0.0237858 24.858 < 2e-16 *** inj3 0.8161053 0.0385945 21.146 < 2e-16 *** inj4 0.7958558 0.0901834 8.825 < 2e-16 *** inj5 0.6462279 0.0894089 7.228 5.12e-13 *** inj6 0.3823662 0.0777895 4.915 8.95e-07 *** inj9 -0.8151999 0.0367039 -22.210 < 2e-16 *** legrep 0.9120834 0.0339725 26.848 < 2e-16 *** op_time 0.0358738 0.0005093 70.431 < 2e-16 *** legrep:op_time -0.0101808 0.0006384 -15.946 < 2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for gaussian family taken to be 1.146774) Null deviance: 35056 on 16527 degrees of freedom Residual deviance: 18942 on 16518 degrees of freedom AIC: 49180 Number of Fisher Scoring iterations: 2
Findings:
-
- The summary of a GLM does not show the R2 and F-statistic, which are specific to linear models.
- It does show the deviance (labeled Residual deviance) and AIC of the OLS linear model, in addition to the coefficient estimates, standard errors, and p-values.
Making predictions and Calculating RMSE
# CHUNK 5 # Make predictions pred.ols <- exp(predict(glm.ols, newdata = data.test)) head(pred.ols) # RMSE RMSE(data.test$amt, pred.ols) RMSE(data.test$amt, mean(data.train$amt))
> head(pred.ols) 3 10 13 21 24 27 1853.5345 1853.5345 4609.6856 820.2833 820.2833 820.2833 > RMSE(data.test$amt, pred.ols) [1] 72888.33 > RMSE(data.test$amt, mean(data.train$amt)) [1] 79370.9
Assessing the Performance of OLS Model:
Test RMSE of the OLS model = 72,888.33 > Test RMSE of the intercept-only model = 79,370.9
The OLS model represents a decent improvement over the intercept-only model.
TASK 3: Select a distribution and link function
|
GLM 1: Log-Link Normal GLM on Claim Size
Merits:
- This GLM ensures positive predictions.
Demerits:
- It allows for the possibility that the observations of the target variable are negative.
Fitting the GLM model
# CHUNK 6 glm.log <- glm(amt ~ inj + legrep * op_time, family = gaussian(link = "log"), data = data.train) summary(glm.log)
Call: glm(formula = amt ~ inj + legrep * op_time, family = gaussian(link = "log"), data = data.train) Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 8.559238 0.120163 71.230 < 2e-16 *** inj2 0.572982 0.031018 18.473 < 2e-16 *** inj3 0.825502 0.034683 23.801 < 2e-16 *** inj4 0.883796 0.057430 15.389 < 2e-16 *** inj5 1.531415 0.034374 44.552 < 2e-16 *** inj6 1.506266 0.042987 35.040 < 2e-16 *** inj9 -0.552178 0.146506 -3.769 0.000164 *** legrep 0.068766 0.138410 0.497 0.619318 op_time 0.027185 0.001508 18.029 < 2e-16 *** legrep:op_time 0.003195 0.001721 1.857 0.063383 . --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for gaussian family taken to be 6897153068) Null deviance: 1.4770e+14 on 16527 degrees of freedom Residual deviance: 1.1393e+14 on 16518 degrees of freedom AIC: 421348 Number of Fisher Scoring iterations: 8
Making predictions and Calculating RMSE
pred.log <- predict(glm.log, newdata = data.test, type = "response") head(pred.log) RMSE(data.test$amt, pred.log)
> head(pred.log) 3 10 13 21 24 27 5228.901 5228.901 5602.912 3010.251 3010.251 3010.251 > RMSE(data.test$amt, pred.log) [1] 69623.89
Findings:
-
- Test RMSE decreases to 69,623.89
The normal GLM with log link outperforms the OLS linear model in terms of prediction accuracy.
GLM 2: Log-Link Gamma GLM
Justification of using log-link Gamma GLM
Gamma distribution | The target variable, amt, is a positive, continuous variable with a right skew. |
Log link | Ensures that all predictions are positive, a characteristic of the target variable, and makes the model coefficients easy to interpret – exponentiated coefficients correspond to multiplicative changes in the target mean. |
Training the model
Run CHUNK 7
to train a gamma GLM with the log link.
# CHUNK 7 glm.gamma <- glm(amt ~ inj + legrep * op_time, family = Gamma(link = "log"), data = data.train) summary(glm.gamma)
Call: glm(formula = amt ~ inj + legrep * op_time, family = Gamma(link = "log"), data = data.train) Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 8.2056640 0.0331403 247.604 < 2e-16 *** inj2 0.5975174 0.0293014 20.392 < 2e-16 *** inj3 0.8568408 0.0475440 18.022 < 2e-16 *** inj4 1.1029945 0.1110956 9.928 < 2e-16 *** inj5 1.5036755 0.1101415 13.652 < 2e-16 *** inj6 0.9004443 0.0958278 9.396 < 2e-16 *** inj9 -0.6321655 0.0452150 -13.981 < 2e-16 *** legrep 0.4393948 0.0418502 10.499 < 2e-16 *** op_time 0.0344292 0.0006275 54.871 < 2e-16 *** legrep:op_time -0.0049104 0.0007865 -6.243 4.38e-10 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for Gamma family taken to be 1.740277) Null deviance: 33141 on 16527 degrees of freedom Residual deviance: 16705 on 16518 degrees of freedom AIC: 365833 Number of Fisher Scoring iterations: 8
RMSE(data.test$amt, predict(glm.gamma, newdata = data.test, type = "response"))
> RMSE(data.test$amt, predict(glm.gamma, newdata = data.test, type = "response")) [1] 68831.79
Findings:
-
- The test RMSE drops further to 68,831.79 compared with the normal GLM with the log link. This shows that the gamma GLM is the most predictive among the models we have fitted with respect to the test RMSE.
GLM 3: Inverse Gaussian GLM
Justification of using the log-link inverse Gaussian GLM
Inverse Gaussian distribution | The target variable, amt, is a positive, continuous variable with a right skew. |
Log link | Ensures that all predictions are positive, a characteristic of the target variable, and makes the model coefficients easy to interpret. |
The inverse Gaussian distribution often has a fatter tail than a gamma distribution. Its canonical link function is the inverse square link \(g(\mu)=\dfrac{1}{\mu^2}\), which ensures positive predictions but is not easy to interpret, so the log link is also commonly used.
Training the model
In CHUNK 8
, we try to fit an inverse Gaussian GLM with a log link, but to no avail.
# CHUNK 8 glm.ig <- glm(amt ~ inj + legrep * op_time, data = data.train, family = inverse.gaussian(link = "log"))
Error: inner loop 1; cannot correct step size In addition: Warning messages: 1: step size truncated due to divergence 2: step size truncated due to divergence
The fitting cannot be completed due to convergence issues.
Model: log-link Gamma GLM
TASK 4: Validate the model
|
Model Validation
Gamma GLM vs. Benchmark OLS model
Benchmark OLS Model | Log-Link Gamma GLM | |
test RMSE | 72,888.33 | 68,831.79 |
The summary of the gamma GLM shows that all coefficient estimates are statistically significant, with extremely small p-values, so it is wise to retain all existing features in the model.
Model Diagnostics
In CHUNK 9
we make some diagnostic plots for the gamma GLM to see if the model assumptions are largely satisfied.
# CHUNK 9 plot(glm.gamma, which = 1) qqnorm(residuals(glm.gamma)) qqline(residuals(glm.gamma))
Findings:
- “Residuals vs Fitted” Plot
- The residuals mostly scatter around 0 in a structureless manner, but with a sizable amount of fluctuation.
- The spread of the positive residuals tends to be much larger than that of the negative residuals, with a few outliers having unusually large positive residuals (observations 5422, 10224, and 18870).
- Q-Q Plot
- The plot looks fine in the middle portion and on the left end, but seems problematic on the right end, with the points deviating quite markedly from the reference line.
- The fact that the rightmost points deviate significantly upward means that there are a lot more large, positive standardized deviance residuals than under a normal distribution, and by extension, the distributions of the deviance residuals and the data are more skewed than the gamma distribution.
- It appears that a fatter-tailed model may perform better.
(Note: There is also a Q-Q plot produced by running plot (glm.gamma
), but that plot is for standardized Pearson residuals, which, unlike deviance residuals, are not approximately normal even if a correct GLM is specified. This is why we use separate code to produce a Q-Q plot for deviance residuals.)
TASK 5: Interpret the model Run the selected model from Task 3 on the full dataset.
|
Fitting the selected model with full dataset
In CHUNK 10
, we rerun the gamma GLM on the full dataset (data = persinj
) to provide a more robust prediction for new, future data (the larger the data set used to train a model, the more reliable the fitting) and print the model summary.
# CHUNK 10 glm.final <- glm(amt ~ inj + legrep * op_time, family = Gamma(link = "log"), data = persinj) summary(glm.final)
Call: glm(formula = amt ~ inj + legrep * op_time, family = Gamma(link = "log"), data = persinj) Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 8.2033907 0.0279443 293.562 < 2e-16 *** inj2 0.6281073 0.0248995 25.226 < 2e-16 *** inj3 0.8882054 0.0402472 22.069 < 2e-16 *** inj4 1.1199670 0.0951566 11.770 < 2e-16 *** inj5 1.3963478 0.0955128 14.619 < 2e-16 *** inj6 0.8867918 0.0816012 10.867 < 2e-16 *** inj9 -0.6205268 0.0381881 -16.249 < 2e-16 *** legrep 0.4437842 0.0354423 12.521 < 2e-16 *** op_time 0.0343052 0.0005303 64.685 < 2e-16 *** legrep:op_time -0.0050443 0.0006663 -7.571 3.86e-14 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for Gamma family taken to be 1.675808) Null deviance: 44010 on 22035 degrees of freedom Residual deviance: 22242 on 22026 degrees of freedom AIC: 487535 Number of Fisher Scoring iterations: 8
Findings:
- The model summary shows that injury code, legal representation, and operational time are all statistically significant factors affecting the expected claim size of a personal injury insurance.
- There is a significant interaction between legal representation and operational time, meaning that:
the effect of operational time on the expected claim size varies for injuries with and without legal representation.
Exponentiating the Coefficients
Since we are using the log link, we need to exponentiate the coefficients to get the multiplicative changes.
# Exponentiate the coefficients to get the multiplicative changes exp(coef(glm.final))
(Intercept) inj2 inj3 inj4 inj5 inj6 inj9 legrep op_time legrep:op_time 3653.3167103 1.8740602 2.4307635 3.0647532 4.0404165 2.4273297 0.5376611 1.5585941 1.0349004 0.9949684
Interpretation
Since the log link is used,
-
- The expected claim size is multiplied by \(e^{\beta_j}\) for every unit increase in a numeric predictor with coefficient βj, holding everything else constant (equivalently, the percentage change is \(100(e^{\beta_j}-1)\)%).
- The expected claim size is multiplied by \(e^{\beta_j}\) when a qualitative predictor moves from its baseline level to a new level represented by a dummy variable with coefficient βj, holding everything else constant (equivalently, the percentage change is \(100(e^{\beta_j}-1)\)%).
More precisely,
-
- Injury Code
- With the exception of injuries with injury code 9 (which represents unclassified injuries), the expected claim size for injuries with codes 2, 3, 4, 5, 6 is higher than that for injuries with code 1 (baseline).
- For instance, the expected claim size for injuries with code 5 is estimated to be \(e^{1.3963478}=4.0404\) times that of injuries with code 1.
- That the estimated coefficients for codes 2, 3, 4, 5, and 6 are all positive and mostly increasing across the codes makes intuitive sense, as we would expect that the more serious the injury, the larger the expected claim size.
However, we would expect the estimated coefficient for code 6 to be the largest as code 6 corresponds to fatalities. This is not the case for the gamma GLM.
- Operational Time
- Operational time is positively associated with the expected claim size, though the effect is not as positive for injuries with legal representation.
- A unit increase in operational time is associated with a multiplicative increase of \(e^{0.0343052}=1.0349004\) in the expected claim size for injuries without legal representation. The multiplicative increase reduces to \(e^{0.0343052-0.0050443}=1.029693\) for injuries with legal representation. The positive effect is reasonable because larger claims are often more difficult to quantify and require more time to settle, leading to longer delays.
- Legal Representation
- For most injuries, those with legal representation have a higher expected claim size than those without legal representation, unless the injuries take an extraordinarily long time to settle (operational time higher than 88 to be precise).
This is intuitive as clients with the assistance of the legal advice are in a better position to fight for a larger settled claim size.
EXAM NOTE
When you are asked to interpret the results of a GLM, the following three-part structure can be quite useful:
-
- First interpret the precise values of the estimated coefficients, e.g., every unit increase in a continuous predictor is associated with a multiplicative change of \(e^{\beta_j}\) in the expected value of the target variable, holding everything else fixed.
- Comment on whether the sign of the estimated coefficients conforms to intuition. Do the results make sense? You can use your common knowledge in this part.
- Relate the findings to the “big picture” and explain how they can help the client make better decisions.
4.3. Case Study 2: GLMs for Binary Target Variables
Learning Objectives
Compared to GLMs for numeric target variables, GLM-based classifiers enjoy some subtly unique features, which will be revealed in the course of this case study. At the completion of this section, you should be able to:
- Combine factor levels to reduce the dimension of the data.
- Select appropriate link functions for binary target variables.
- Implement different kinds of GLMs for binary target variables in R.
- Incorporate an offset into a logistic regression model.
- Interpret the results of a fitted logistic regression model.
Stage 1: Define the Business Problem
Objective
To construct appropriate GLMs to identify key factors associated with claim occurrence.
TASK 1: Edit the data Perform data-processing by:
|
Stage 2: Data Collection
Read in Data Set
Run CHUNK 1
to load the insuranceData
package and the data, and print a summary of each variable.
# CHUNK 1 library (insuranceData) data(dataCar) summary(dataCar) Or # CHUNK 1 # Prior to R v4.0, the option stringsAsFactors was TRUE by default, i.e. string-like variables would automatically get converted to factors when being read in by dataCar <- read.csv("vehins.csv") # For R v4.0+, dataCar <- read.csv("vehins.csv", stringsAsFactors=TRUE) summary(dataCar)
clm exposure veh_value numclaims claimcst veh_body veh_age gender area agecat Min. :0.00000 Min. :0.002738 Min. : 0.000 Min. :0.00000 Min. : 0.0 Length:67856 Min. :1.000 Length:67856 Length:67856 Min. :1.000 1st Qu.:0.00000 1st Qu.:0.219028 1st Qu.: 1.010 1st Qu.:0.00000 1st Qu.: 0.0 Class :character 1st Qu.:2.000 Class :character Class :character 1st Qu.:2.000 Median :0.00000 Median :0.446270 Median : 1.500 Median :0.00000 Median : 0.0 Mode :character Median :3.000 Mode :character Mode :character Median :3.000 Mean :0.06814 Mean :0.468651 Mean : 1.777 Mean :0.07276 Mean : 137.3 Mean :2.674 Mean :3.485 3rd Qu.:0.00000 3rd Qu.:0.709103 3rd Qu.: 2.150 3rd Qu.:0.00000 3rd Qu.: 0.0 3rd Qu.:4.000 3rd Qu.:5.000 Max. :1.00000 Max. :0.999316 Max. :34.560 Max. :4.00000 Max. :55922.1 Max. :4.000 Max. :6.000
Findings:
- The mean of
clm
(which is treated in R as a numeric variable taking only two values, 0 and 1) is 0.06814, meaning that 6.814% of the 67,856 vehicle insurance policies had at least one claim over the policy period. veh_value
ranges from 0 to 34.560 and its mean (1.777) is higher than its median (1.500), suggesting that its distribution is moderately right-skewed.
Variable | Description | Characteristics |
veh_value |
Vehicle value (in $10,000s) | Numeric with three decimal places, from 0 to 34.560 |
exposure |
Exposure of policy | Numeric from 0 to 1 |
clm |
Claim occurrence | Integer from 0 to 1 (0 = no, 1 = yes) |
numclaims |
Number of claims | Integer from 0 to 4 |
claimcst |
Claim size | Positive value from 0 to 55,922.1 (= 0 if no claim) |
veh_body |
Vehicle body type | Factor with 13 levels: BUS, CONVT (= convertible), COUPE, HBACK (= hatchback), HDTOP (= hardtop), MCARA (= motorized caravan), MIBUS (= minibus), PANVN (= panel van), RDSTR (= roadster), SEDAN, STNWG (= station wagon), TRUCK, UTE (= utility) |
veh_age |
Vehicle age | Integer from 1 (youngest) to 4 |
gender |
Gender of policyholder | A factor with two levels: M (= male) and F (= female) |
area |
Area of residence of policyholder |
Factor with 6 levels: A, B, C, D, E, F |
agecat |
Age band of policyholder | Integer from 1 (youngest) to 6 |
X_OBSTAT_ |
(Unknown) | A factor with one level “01101 0 0 0” |
Data Quality Issue
Remove Irrelevant Data and Target Leakage
Remark: The variables that serve as predictors of the GLMs must be available before claims are observed.
There are two variables whose values are known at the same time as or after claims are submitted, meaning that they cannot function as predictors of clm
in practice.
-
numclaims
, the number of claims, cannot be known before a claim is submitted.claimcst
, claim size, cannot be known before a claim is submitted.
# CHUNK 2 table(dataCar$clm, dataCar$numclaims)
0 1 2 3 4 0 63232 0 0 0 0 1 0 4333 271 18 2
In the second part of CHUNK 2
, we drop numclaims
and claimcst
to avoid target leakage, as well as the mysterious variable X_OBSTAT_
from the dataset.
# CHUNK 2 (cont.) dataCar$claimcst <- NULL dataCar$numclaims <- NULL dataCar$X_OBSTAT_ <- NULL
Remove Unreasonable Data
The dataset also contains a small number of observations (53, to be exact) which involve a zero vehicle value. Although veh_value
is measured in $10,000s, a value of zero for veh_value
means that the value of the vehicle is lower than 0.001 × $10,000 = $10, which does not make practical sense.
# CHUNK 2 (cont.) nrow(dataCar[dataCar$veh_value == 0, ]) dataCar <- dataCar[dataCar$veh_value > 0, ]
[1] 53
Factorize Numerical Variables for Group Data
Since veh_age
and agecat
are labels of age groups and they do not represent exact ages, it may be desirable to treat them as factors instead of numeric variables.
In CHUNK 3
, the conversion of veh_age
and agecat
from integers to factors can be achieved by the as.factor()
function.
# CHUNK 3 # Before factor conversion class(dataCar$agecat) class(dataCar$veh_age) dataCar$agecat <- as.factor(dataCar$agecat) dataCar$veh_age <- as.factor(dataCar$veh_age) # After factor conversion class(dataCar$agecat) class(dataCar$veh_age)
> # CHUNK 3 # Before factor conversion > class(dataCar$agecat) [1] "integer" > class(dataCar$veh_age) [1] "integer" > > dataCar$agecat <- as.factor(dataCar$agecat) > dataCar$veh_age <- as.factor(dataCar$veh_age) > > # After factor conversion > class(dataCar$agecat) [1] "factor" > class(dataCar$veh_age) [1] "factor"
Relevel Categorical Variables
The rest of CHUNK 3
relevels all of the categorical predictors, including veh_age
and agecat
, so that their baseline level is the level with the most observations.
vars.cat <- c("veh_age", "veh_body", "gender", "area", "agecat") for (i in vars.cat){ table <- as.data.frame(table(dataCar[, i])) max <- which.max(table[, 2]) level.name <- as.character(table[max, 1]) dataCar[, i] <- relevel(dataCar[, i], ref = level.name) } summary(dataCar)
clm exposure veh_value claimcst veh_body veh_age gender area agecat Min. :0.00000 Min. :0.002738 Min. : 0.180 Min. : 0.0 SEDAN :22232 3:20060 F:38584 C:20530 4:16179 1st Qu.:0.00000 1st Qu.:0.219028 1st Qu.: 1.010 1st Qu.: 0.0 HBACK :18915 1:12254 M:29219 A:16302 1: 5734 Median :0.00000 Median :0.446270 Median : 1.500 Median : 0.0 STNWG :16234 2:16582 B:13328 2:12868 Mean :0.06811 Mean :0.468481 Mean : 1.778 Mean : 137.1 UTE : 4586 4:18907 D: 8161 3:15757 3rd Qu.:0.00000 3rd Qu.:0.709103 3rd Qu.: 2.150 3rd Qu.: 0.0 TRUCK : 1742 E: 5907 5:10722 Max. :1.00000 Max. :0.999316 Max. :34.560 Max. :55922.1 HDTOP : 1579 F: 3575 6: 6543 (Other): 2515
Findings:
- For all of the categorical predictors, their baseline level is listed first and contains the greatest number of observations, as desired.
TASK 2: Perform univariate exploration of the non-factor variables Use graphical displays and summary statistics to determine if any of the nonfactor variable(s) except exposure should be transformed and, if so, what transformation should be made. Do your recommended transformation(s), if any, and delete the original variable(s). Note: For this task, there is no need to analyze the exposure variable, which will be treated later. |
Stage 3: Exploratory Data Analysis
In CHUNK 4
, for such a strictly positive and skewed variable, the log transformation is commonly used, so we create a log-transformed version of veh_value
called log_veh_value
, whose distribution is much more symmetric and bell-shaped, and delete the original veh_value
in the second part of CHUNK 4
.
# CHUNK 4 library(ggplot2) ggplot(dataCar, aes(x = veh_value)) + geom_histogram() ggplot(dataCar, aes(x = log(veh_value))) + geom_histogram() dataCar$log_veh_value <- log(dataCar$veh_value) dataCar$veh_value <- NULL
Figure 4.3.1: A histogram for veh_value
(left) and the log of veh_value
(right) in the dataCar
dataset.
TASK 3: Explore the relationship of each predictor to Use graphical displays and summary statistics to form preliminary conclusions regarding which variables are likely to have significant predictive power. |
Numeric predictors
In CHUNK 6
, we get the mean or median of log_veh_value
and exposure
over the two levels of clm
.
# CHUNK 6 library(tidyverse) dataCar %>% group_by_("clm") %>% summarise( mean= mean(log_veh_value), median= median(log_veh_value), n = n() )
# A tibble: 2 × 4 clm mean median n <int> <dbl> <dbl> <int> 1 0 0.382 0.399 63185 2 1 0.448 0.451 4618
dataCar %>% group_by_("clm") %>% summarise( mean = mean(exposure), median = median(exposure), n = n() )
# A tibble: 2 × 4 clm mean median n <int> <dbl> <dbl> <int> 1 0 0.458 0.427 63185 2 1 0.611 0.637 4618
Findings:
- As we expect, the means and medians are higher when there is a claim, so:
- Interpretation: a policyholder with a larger vehicle value or a larger exposure has a higher tendency to submit an insurance claim.
In CHUNK 5
, we produce boxplots for log_veh_value
and exposure
indexed by the two levels (0 and 1) of clm
.
# CHUNK 5 ggplot(dataCar, aes(x = factor(clm), y = log_veh_value)) + geom_boxplot(fill = "red") ggplot(dataCar, aes(x = factor(clm), y = exposure)) + geom_boxplot(fill = "red")
Figure 4.3.2: Split boxplots for log_veh_value
and exposure
by clm
in the dataCar
dataset.
Findings:
- It appears that claim occurrence (
clm = 1
) is associated with a slightly higher value oflog_veh_value
and a much higher value ofexposure
.
Categorical Predictors
The other predictors, veh_body
, veh_age
(after conversion), gender
, area
, and agecat
(after conversion), are all categorical variables.
For categorical-categorical combinations, filled bar charts are useful. In CHUNK 7
, we use a for loop to construct the bar charts for each categorical predictor color-filled by clm
.
# CHUNK 7 for (i in vars.cat) { plot <- ggplot(dataCar, aes(x = dataCar[, i], fill = factor(clm))) + geom_bar(position = "fill") + labs(x = i, y = "percent") + theme(axis.text.x = element_text(angle = 90, hjust = 1)) print(plot) }
Figure 4.3.3: A bar chart for agecat
by clm
in the dataCar
dataset
Findings:
veh_age
: Higher rate of claim occurrence for group 2.veh_body
: Much higher rate forBUS
and lower rate forCONVT
,MIBUS
, andUTE
.area
: Higher rate for areaF
.agecat
: The claim occurrence rate appears to decrease withagecat
.
We can also obtain a more accurate understanding of how the categorical predictors affect clm
by looking at the mean of clm
(which is the claim occurrence rate) by different levels of the predictors, as in CHUNK 8
.
# CHUNK 8 for (i in vars.cat) { print(i) x <- dataCar %>% group_by_(i) %>% summarise(mean = mean(clm), n = n()) print(x) }
[1] "veh_age" # A tibble: 4 × 3 veh_age mean n <fct> <dbl> <int> 1 3 0.0679 20060 2 1 0.0672 12254 3 2 0.0759 16582 4 4 0.0620 18907 [1] "veh_body" # A tibble: 13 × 3 veh_body mean n <fct> <dbl> <int> 1 SEDAN 0.0664 22232 2 BUS 0.184 38 3 CONVT 0.0370 81 4 COUPE 0.0872 780 5 HBACK 0.0668 18915 6 HDTOP 0.0823 1579 7 MCARA 0.116 121 8 MIBUS 0.0601 716 9 PANVN 0.0824 752 10 RDSTR 0.0741 27 11 STNWG 0.0721 16234 12 TRUCK 0.0683 1742 13 UTE 0.0567 4586 [1] "gender" # A tibble: 2 × 3 gender mean n <fct> <dbl> <int> 1 F 0.0686 38584 2 M 0.0675 29219 [1] "area" # A tibble: 6 × 3 area mean n <fct> <dbl> <int> 1 C 0.0688 20530 2 A 0.0664 16302 3 B 0.0723 13328 4 D 0.0607 8161 5 E 0.0652 5907 6 F 0.0783 3575 [1] "agecat" # A tibble: 6 × 3 agecat mean n <fct> <dbl> <int> 1 4 0.0682 16179 2 1 0.0863 5734 3 2 0.0724 12868 4 3 0.0705 15757 5 5 0.0572 10722 6 6 0.0556 6543
The descriptive statistics are consistent with what we observe from the filled bar charts in CHUNK 7
.
TASK 4: Reduce the number of factor levels where appropriate Several of the factor variables have a small number of observations at some of the levels. Consider using knowledge of the factor levels as well as evidence from Task 3 to combine some of them into factor levels with more observations. |
Combining Categorical Predictors
We combine factor levels based on (all of below):
- the similarity of means/medians
- extremely low counts
Arguably the most problematic categorical predictor is veh_body
, which has the largest number of levels (13 levels) out of the five factor variables. Some levels like “BUS
” and “RDSTR
” have very few observations, so our priority is to combine these sparse levels with more populous levels with a similar claim occurrence rate.
Combining BUS
and MCARA
into BUS-MCARA
BUS-MCARA
: “BUS
” is extremely sparse, but it has the highest claim occurrence rate. The level with the second highest rate is “MCARA
“, which is quite sparse as well. In the first two lines of CHUNK 9
, let’s combine these two levels into a larger level called “BUS-MCARA
“.
# CHUNK 9 # Combine "BUS" and "MCARA" as one level called "BUS-MCARA" levels(dataCar$veh_body) [levels(dataCar$veh_body) == "BUS"] <- "BUS-MCARA" levels(dataCar$veh_body) [levels(dataCar$veh_body) == "MCARA"] <- "BUS-MCARA"
Combining others into OTHER
There are other sparse levels such as “CONVT
” and ”RDSTR
“. We can observe that the claim occurrence rates in these two levels are quite similar to the rates in other levels (except “BUS
” and “MCARA
“) , ranging from about 0.05 to about 0.08, so we may combine all these levels as one big level called “OTHER
” . That’s what the rest of CHUNK 9
does.
# CHUNK 9 (Cont.) # Combine levels other than "BUS-MCARA" into one level called "OTHER" levels(dataCar$veh_body) [!(levels(dataCar$veh_body) == "BUS-MCARA")] <- "OTHER" table(dataCar$veh_body)
OTHER BUS-MCARA 67644 159
After consolidating its factor levels, veh_body
only has two levels, “OTHER
” (baseline level) and “BUS-MCARA
“. The baseline level has relatively low claim occurrence rate while the “BUS-MCARA
” level has a much higher rate, even though this level has much fewer observations.
TASK 5: Select an interaction Select one pair of features that may be included as an interaction variable in your GLM. Do this by first proposing two variables that are likely to interact and then using appropriate graphical displays to confirm the existence of an interaction. Continue until a promising interaction has been identified. Include your selected interaction variables when constructing a GLM in subsequent tasks. |
Interactions
In CHUNK 10
, we explore the possible interaction between:
log_veh_value
(numeric) and veh_body
(factor)
via the use of boxplots for log_veh_value
split by clm
(factor) faceted by veh_body
(see Figure 4.3.4).
# CHUNK 10 ggplot(dataCar, aes(x = factor(clm), y = log_veh_value)) + geom_boxplot() + facet_wrap(~ veh_body)
Figure 4.3.4: Faceted split boxplots for log_veh_value
by clm
and veh_body
in the dataCar
data.
Findings:
For OTHER
vehicles, claim occurrence is associated with a slightly higher log_veh_value
, but the association is much more positive for BUS-MCARA
vehicles. In other words,
log_veh_value
exerts a much stronger effect on claim occurrence forBUS-MCARA
vehicles than forOTHER
vehicles.
Discover interaction term: log_veh_value:veh_body
TASK 6: Select a link function
Note:
|
Stage 4: Model Construction and Evaluation
Creation of Training and Test sets
In CHUNK 11
, createDataPartition()
function is used to split the data into the training (75%) and test (25%) sets.
# CHUNK 11 library(caret) set.seed(4769) partition <- createDataPartition(y = dataCar$clm, p = 0.75, list = FALSE) data.train <- dataCar[partition, ] data.test <- dataCar[-partition, ] mean(data.train$clm) mean(data.test$clm)
> mean(data.train$clm) [1] 0.06870784 > mean(data.test$clm) [1] 0.06631268
Validating the data split process: To check that the two sets are representative, we observe that the claim occurrence rate in the training set is 6.87% and that in the test set is 6.63%. That the two rates are similar shows that the built-in stratification of the target variable works well.
Model Distribution and Link Function
Distribution: Since the target variable clm
is binary, the binomial distribution is the only reasonable candidate for the distribution of the target variable.
Link Function:
- Log link: can be safely ruled out, as the mean of the binary target variable, given by p = e^η in this case, does not necessarily lie between 0 and 1 (predictions, although non-negative, may be greater than 1, which is impossible for a probability).
- Logit link: is most commonly used and most interpretable due to its intimate connections with the odds of an event.
- Probit, Cauchit, and Complementary Log-Log link: all produce results which are much more difficult to interpret because of the rather complex relationship between p and η.
Distribution: Binomial
Link Function: Logit link and Probit link for comparison
Fitting the Model with the Link Functions: Logit and Probit
In CHUNK 12
, we fit separately the logistic regression model and probit regression model to the training set.
Because we are specifically asked not to use the exposure
variable, we use the formula:
clm ~ . - exposure + log_veh_value:veh_body
,
meaning to regress clm
on all variables in the training set except for exposure
(using - exposure
), with the interaction term between log_veh_value
and veh_body
added.
# CHUNK 12 logit <- glm(clm ~ . - exposure + log_veh_value:veh_body, data = data.train, family = binomial) probit <- glm(clm ~ . - exposure + log_veh_value:veh_body, data = data.train, family = binomial(link = "probit"))
Which link function to use?
The most natural way to compare these two GLMs is to look at their predictive performance on the test set. As the task statement explicitly told us not to do so, we instead look at metrics computed on the training set that assess not only the goodness of fit of a GLM to the training set, but also the complexity of the model, and strike a balance between the two.
In the second part of CHUNK 12
, we use the AIC()
function to return the AIC of each of the two GLMs.
# CHUNK 12 (Cont.) AIC(logit) AIC(probit)
> # CHUNK 12 (Cont.) > AIC(logit) [1] 25354.09 > AIC(probit) [1] 25353.79
The two AIC values are very close to each other, but the logistic regression model is much more interpretable than the probit model, so we will use the logit link in subsequent tasks.
Link Function := Logit Link
TASK 7: Specify an offset
|
Setting Offset: exposure
exposure
is strongly positively related to claim occurrence. This comes as no surprise since exposure records the effective duration of each vehicle insurance policy in the dataset (according to the data dictionary) and captures the amount of risk each policy is exposed to.
Other things equal, the larger the value of exposure
, the higher the claim occurrence probability.
One good way to put exposure
to good use is to take it as an offset: ln(exposure), so that the model equation reads:
ln(odds) = ln(exposure) + η,
where η is the linear predictor without the offset, or equivalently, odds = exposure x eη.
Refitting the Model with an Offset
In CHUNK 13
, we refit the logistic regression model for clm
using ln(exposure) as an offset and output its AIC and summary.
# CHUNK 13 logit.offset <- glm(clm ~ . - exposure + log_veh_value:veh_body, data = data.train, offset = log(exposure), family = binomial) AIC(logit.offset) summary(logit.offset)
> AIC(logit.offset) [1] 24459.17 > summary(logit.offset) Call: glm(formula = clm ~ . - exposure + log_veh_value:veh_body, family = binomial, data = data.train, offset = log(exposure)) Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -1.8401822 0.0553407 -33.252 < 2e-16 *** veh_bodyBUS-MCARA 1.2064692 0.5839503 2.066 0.038824 * veh_age1 -0.0237603 0.0577435 -0.411 0.680720 veh_age2 0.0707487 0.0489945 1.444 0.148735 veh_age4 -0.0258168 0.0530979 -0.486 0.626818 genderM -0.0471309 0.0367956 -1.281 0.200233 areaA -0.0418634 0.0489577 -0.855 0.392500 areaB 0.0839797 0.0503891 1.667 0.095589 . areaD -0.1630404 0.0637039 -2.559 0.010487 * areaE -0.1027563 0.0706512 -1.454 0.145831 areaF 0.0636621 0.0798244 0.798 0.425145 agecat1 0.2960493 0.0653958 4.527 5.98e-06 *** agecat2 0.0788806 0.0537650 1.467 0.142339 agecat3 -0.0003662 0.0514177 -0.007 0.994317 agecat5 -0.2342049 0.0604906 -3.872 0.000108 *** agecat6 -0.2749970 0.0746007 -3.686 0.000228 *** log_veh_value 0.1470760 0.0400693 3.671 0.000242 *** veh_bodyBUS-MCARA:log_veh_value -0.5307286 0.6035585 -0.879 0.379221 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 24567 on 50852 degrees of freedom Residual deviance: 24423 on 50835 degrees of freedom AIC: 24459 Number of Fisher Scoring iterations: 5
Findings:
-
- The AIC of the logistic regression model with an offset is 24,459.17, down from 25,354.09 (
CHUNK 12
) when an offset is absent.
This indicates that the incorporation of the offset leads to a moderate amount of improvement on the quality of the logistic regression model.
TASK 8: Generate confusion matrices and AUCs
|
Constructing Confusion Matrices
Determine the Cutoff
From CHUNK 3
, only 6.811% of the policies in the whole dataset have a claim, so using a cutoff of 0.5 (the most intuitive value) would result in almost every policy being predicted to have no claim.
To make the logistic regression model produce more meaningful results, we set the cutoff to 0.06811.
Construct Confusion Matrix on the Training Set
In CHUNK 14
, we produce the predicted class probabilities and predicted class labels for the logistic regression model, and generate a confusion matrix on each of the training set and the test set using a cutoff of 0.06811.
It is important to note that:
-
- The first two arguments of the
confusionMatrix()
function must be factors, so we need to apply thefactor()
function to:class.train
,train$clm
,class.test
,test$clm
.
- By default, R will take the alpha-numerically first level (which is “0”, or no claims) as the level of interest. By adding the option
positive = "1"
to theconfusionMatrix()
function, we explicitly specify the factor level that corresponds to the “positive” result to be “1” - For some unknown reason, very few past PA exams do so. Without specifying the
positive
argument, the “sensitivity” and “specificity” in the output are, in fact, the specificity and sensitivity we are interested in.
- The first two arguments of the
# CHUNK 14 cutoff <- 0.06811 # you can try other values # Generate predicted probabilities on the training set pred.train <- predict(logit.offset, type = "response") # Turn predicted probabilities into predicted classes class.train <- 1*(pred.train > cutoff) # OR ifelse(pred.train > cutoff, 1, 0) confusionMatrix(factor(class.train), factor(data.train$clm), positive = "1")
Confusion Matrix and Statistics Reference Prediction 0 1 0 25474 1017 1 21885 2477 Accuracy : 0.5496 95% CI : (0.5453, 0.554) No Information Rate : 0.9313 P-Value [Acc > NIR] : 1 Kappa : 0.0655 Mcnemar's Test P-Value : <2e-16 Sensitivity : 0.70893 Specificity : 0.53789 Pos Pred Value : 0.10167 Neg Pred Value : 0.96161 Prevalence : 0.06871 Detection Rate : 0.04871 Detection Prevalence : 0.47907 Balanced Accuracy : 0.62341 'Positive' Class : 1
Construct Confusion Matrix on the Test Set
# CHUNK 14 (Cont.) # Generate predicted probabilities on the test set pred.test <- predict(logit.offset, newdata = data.test, type = "response") class.test <- 1*(pred.test > cutoff) confusionMatrix(factor(class.test), factor(data.test$clm), positive = "1")
Confusion Matrix and Statistics Reference Prediction 0 1 0 8483 332 1 7343 792 Accuracy : 0.5472 95% CI : (0.5397, 0.5547) No Information Rate : 0.9337 P-Value [Acc > NIR] : 1 Kappa : 0.0617 Mcnemar's Test P-Value : <2e-16 Sensitivity : 0.70463 Specificity : 0.53602 Pos Pred Value : 0.09736 Neg Pred Value : 0.96234 Prevalence : 0.06631 Detection Rate : 0.04673 Detection Prevalence : 0.47994 Balanced Accuracy : 0.62032 'Positive' Class : 1
Findings:
From the two confusion matrices, we can see that:
Sensitivity vs. specificity:
In both matrices, the accuracy is only marginally higher than 0.5, but the accuracy does not take into account the fact that the insurance company should be more interested in correctly classifying policyholders with a claim (who will file for payment) than correctly classifying policyholders without a claim. In other words,
the sensitivity of the logistic regression model (with “1”, claim occurrence, as the positive class) should be more of interest to the insurance company than the specificity.
We can see that the model has a decent sensitivity (close to 0.70) on both the training and test sets whereas its specificity is relatively low (about 0.54), meaning that
the model does a pretty good job of identifying policies with a claim, but it is not very good at identifying no-claim policies.
If time allows, it is a good move to tune the cutoff and try a wide range of values to decide on one that results in a better balance between the sensitivity and specificity.
Training vs. test:
The accuracy, sensitivity, and specificity of the model on the test set are all slightly lower than the corresponding quantities on the training set. This is an indication of a small (but not serious) extent of overfitting for the logistic regression model.
Determining AUCs and ROC Plots
The AUC of a classifier can be computed by the roc()
and auc()
functions from the pROC
package. In CHUNK 15
, we make an ROC plot and obtain the AUC of the logistic regression model on both the training and test sets.
# CHUNK 15 library(pROC) roc.train <- roc(data.train$clm, pred.train) auc(roc.train) roc.test <- roc(data.test$clm, pred.test) auc(roc.test)
> auc(roc.train) Area under the curve: 0.6647 > auc(roc.test) Area under the curve: 0.6581
Figure 4.3.5: The ROC plots for the full logistic regression model on the training set (left) and test set (right) in the dataCar
data.
The command par(pty = "s") can be used to generate a square (which is what “s” stands for) plotting region, which improves the appearance of the ROC plot.
Findings:
- The two AUCs are close to 0.66, which means that the model is making better (but perhaps not way better) predictions than by chance.
- The fact that the test AUC is slightly lower than the training AUC supports our earlier conclusion that there is a small extent of overfitting for the logistic regression model.
TASK 9: Select features using stepwise selection Some of the features may lack predictive power and result in overfitting.
When finished with this task, this will be your final model form. |
Binarization
In CHUNK 16, we use the dummyVars()
function from the caret
package to binarize the categorical predictors with three or more levels, which include agecat
, area
, and veh_age
.
# CHUNK 16 library(caret) binarizer <- dummyVars(~ agecat + area + veh_age, data = dataCar, fullRank = TRUE) binarized.vars <- data.frame(predict(binarizer, newdata = dataCar)) head(binarized.vars)
agecat.1 agecat.2 agecat.3 agecat.5 agecat.6 area.A area.B area.D area.E area.F veh_age.1 veh_age.2 veh_age.4 1 0 1 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 1 0 0 0 0 0 1 0 3 0 1 0 0 0 0 0 0 1 0 0 1 0 4 0 1 0 0 0 0 0 1 0 0 0 1 0 5 0 1 0 0 0 0 0 0 0 0 0 0 1 6 0 0 0 0 0 0 0 0 0 0 0 0 0
> binarizer$lvls $agecat [1] "4" "1" "2" "3" "5" "6" $area [1] "C" "A" "B" "D" "E" "F" $veh_age [1] "3" "1" "2" "4"
Then we attach the binarized variables to the dataCar
dataset (with a .bin
suffix), delete the original categorical variables, and reuse the partition vector generated in CHUNK 11
to partition
the dataset into the training and test sets, now with binarized variables.
# CHUNK 16 (Cont.) # Attach the binarized variables dataCar.bin <- cbind(dataCar, binarized.vars) # Remove the original factor variables dataCar.bin$agecat <- NULL dataCar.bin$area <- NULL dataCar.bin$veh_age <- NULL # Set up the binarized training and test sets data.train.bin <- dataCar.bin[partition, ] data.test.bin <- dataCar.bin[-partition, ]
In CHUNK 17
, we fit the logistic regression model (with interaction) to the binarized training set and print a model summary.
# CHUNK 17 # Fit the Logistic regression model to the binarized training set logit.full <- glm(clm ~ . - exposure + log_veh_value:veh_body, data = data.train.bin, offset = log(exposure), family = binomial) summary (logit.full)
Call: glm(formula = clm ~ . - exposure + log_veh_value:veh_body, family = binomial, data = data.train.bin, offset = log(exposure)) Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -1.8401822 0.0553407 -33.252 < 2e-16 *** veh_bodyBUS-MCARA 1.2064692 0.5839503 2.066 0.038824 * genderM -0.0471309 0.0367956 -1.281 0.200233 log_veh_value 0.1470760 0.0400693 3.671 0.000242 *** agecat.1 0.2960493 0.0653958 4.527 5.98e-06 *** agecat.2 0.0788806 0.0537650 1.467 0.142339 agecat.3 -0.0003662 0.0514177 -0.007 0.994317 agecat.5 -0.2342049 0.0604906 -3.872 0.000108 *** agecat.6 -0.2749970 0.0746007 -3.686 0.000228 *** area.A -0.0418634 0.0489577 -0.855 0.392500 area.B 0.0839797 0.0503891 1.667 0.095589 . area.D -0.1630404 0.0637039 -2.559 0.010487 * area.E -0.1027563 0.0706512 -1.454 0.145831 area.F 0.0636621 0.0798244 0.798 0.425145 veh_age.1 -0.0237603 0.0577435 -0.411 0.680720 veh_age.2 0.0707487 0.0489945 1.444 0.148735 veh_age.4 -0.0258168 0.0530979 -0.486 0.626818 veh_bodyBUS-MCARA:log_veh_value -0.5307286 0.6035585 -0.879 0.379221 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 24567 on 50852 degrees of freedom Residual deviance: 24423 on 50835 degrees of freedom AIC: 24459 Number of Fisher Scoring iterations: 5
The summary output is identical to that in CHUNK 13
, showing that the binarization is done properly and does not have any effect on the model.
Stepwise Selection
In CHUNK 18
, we run the stepAIC()
function to perform forward selection using the BIC as requested, which represents a more conservative approach to feature selection, allowing the insurance company to identify key factors affecting claim occurrence rates.
# CHUNK 18 library(MASS) logit.null <- glm(clm ~ 1, data= data.train.bin, offset= log(exposure), family= binomial) logit.reduced <- stepAIC(logit.null, direction= "forward", k = log(nrow(data.train.bin)), scope = list(upper = logit.full, lower = logit.null)) summary(logit.reduced)
Start: AIC=24578.19 clm ~ 1 Df Deviance AIC + log_veh_value 1 24532 24553 + agecat.1 1 24533 24554 + agecat.5 1 24541 24563 + agecat.6 1 24542 24564 + veh_age.4 1 24549 24571 + veh_age.2 1 24554 24575 <none> 24567 24578 + area.D 1 24558 24580 + agecat.2 1 24559 24581 + area.F 1 24561 24583 + veh_body 1 24562 24583 + area.B 1 24562 24584 + area.E 1 24566 24587 + gender 1 24566 24588 + veh_age.1 1 24566 24588 + area.A 1 24566 24588 + agecat.3 1 24567 24589 Step: AIC=24553.41 clm ~ log_veh_value Df Deviance AIC + agecat.1 1 24498 24531 + agecat.5 1 24505 24538 + agecat.6 1 24512 24544 <none> 24532 24553 + area.D 1 24522 24554 + area.B 1 24524 24557 + agecat.2 1 24524 24557 + veh_age.2 1 24527 24560 + veh_body 1 24527 24560 + gender 1 24528 24561 + area.F 1 24529 24561 + area.E 1 24529 24562 + veh_age.1 1 24530 24563 + veh_age.4 1 24531 24563 + area.A 1 24531 24564 + agecat.3 1 24532 24564 Step: AIC=24530.58 clm ~ log_veh_value + agecat.1 Df Deviance AIC + agecat.5 1 24479 24522 + agecat.6 1 24482 24526 + agecat.2 1 24485 24528 <none> 24498 24531 + area.D 1 24488 24532 + area.B 1 24490 24534 + veh_body 1 24494 24537 + veh_age.2 1 24494 24537 + gender 1 24495 24538 + area.F 1 24496 24539 + area.E 1 24496 24539 + agecat.3 1 24496 24539 + veh_age.1 1 24496 24540 + veh_age.4 1 24497 24541 + area.A 1 24497 24541 Step: AIC=24522.16 clm ~ log_veh_value + agecat.1 + agecat.5 Df Deviance AIC + agecat.6 1 24457 24511 <none> 24479 24522 + area.D 1 24469 24524 + area.B 1 24470 24525 + agecat.2 1 24472 24526 + veh_body 1 24474 24528 + veh_age.2 1 24475 24529 + gender 1 24476 24530 + area.E 1 24476 24530 + area.F 1 24477 24531 + veh_age.1 1 24477 24531 + veh_age.4 1 24478 24532 + area.A 1 24478 24532 + agecat.3 1 24479 24533 Step: AIC=24511.36 clm ~ log_veh_value + agecat.1 + agecat.5 + agecat.6 Df Deviance AIC <none> 24457 24511 + area.D 1 24448 24514 + area.B 1 24449 24514 + veh_body 1 24452 24517 + veh_age.2 1 24452 24517 + agecat.2 1 24454 24519 + area.E 1 24455 24520 + gender 1 24455 24520 + veh_age.1 1 24456 24521 + area.F 1 24456 24521 + veh_age.4 1 24456 24521 + agecat.3 1 24456 24522 + area.A 1 24456 24522 > summary(logit.reduced) Call: glm(formula = clm ~ log_veh_value + agecat.1 + agecat.5 + agecat.6, family = binomial, data = data.train.bin, offset = log(exposure)) Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -1.84794 0.02533 -72.959 < 2e-16 *** log_veh_value 0.15913 0.02917 5.455 4.90e-08 *** agecat.1 0.27479 0.05845 4.701 2.59e-06 *** agecat.5 -0.25995 0.05300 -4.905 9.36e-07 *** agecat.6 -0.30815 0.06856 -4.495 6.97e-06 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 24567 on 50852 degrees of freedom Residual deviance: 24457 on 50848 degrees of freedom AIC: 24467 Number of Fisher Scoring iterations: 5
The final model carries only four features:
log_veh_value
, agecat.1
, agecat.5
, and agecat.6.
CHUNK 19
generates the AUC for the final model.
# CHUNK 19 pred.reduced <- predict(logit.reduced, newdata = data.test.bin, type= "response") roc.reduced <- roc(data.test$clm, pred.reduced) auc(roc.reduced)
Area under the curve: 0.6595
The test AUC of 0.6595 is slightly higher than that of the full model (0.6581). Not only does the final model have slightly better prediction performance, it is also much simpler, with only four features. In terms of both prediction performance and interpretability, the final model is superior, and we will use it in the final task.
Interpretation of Model Results
TASK 10: Interpret the model
|
In CHUNK 20
, we refit the reduced logistic regression model to the full dataset (note the argument data = dataCar.bin
, not data = data.train.bin
) and save it as a glm
object named logit.final
.
# CHUNK 20 logit.final <- glm(clm ~ log_veh_value + agecat.1 + agecat.5 + agecat.6, data= dataCar.bin, offset= log(exposure), family= binomial) summary(logit.final) # Exponentiate the coefficients to get the multiplicative effects exp(coef(logit.final))
Call: glm(formula = clm ~ log_veh_value + agecat.1 + agecat.5 + agecat.6, family = binomial, data = dataCar.bin, offset = log(exposure)) Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -1.86185 0.02205 -84.438 < 2e-16 *** log_veh_value 0.15843 0.02532 6.257 3.92e-10 *** agecat.1 0.25731 0.05138 5.008 5.51e-07 *** agecat.5 -0.25433 0.04603 -5.525 3.29e-08 *** agecat.6 -0.23858 0.05784 -4.125 3.72e-05 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 32572 on 67802 degrees of freedom Residual deviance: 32445 on 67798 degrees of freedom AIC: 32455 Number of Fisher Scoring iterations: 5 > # Exponentiate the coefficients to get the multiplicative effects > exp(coef(logit.final)) (Intercept) log_veh_value agecat.1 agecat.5 agecat.6 0.1553857 1.1716752 1.2934442 0.7754369 0.7877484
There are two ways to interpret the output of a fitted logistic regression model.
Odds-Based Interpretation
Table 4.4 provides the odds-based interpretation of the final model using the value of each estimated coefficient. The statements there are in terms of multiplicative changes in the odds of claim occurrence. Equivalently, we can also state the findings in terms of percentage changes.
Table 4.4: Odds-based interpretation of the reduced logistic regression model fitted to the full dataCar
dataset.
Feature | Coefficient Estimate | Interpretation
log_veh_value | 0.15843 | A unit increase in the log of vehicle value is associated with a multiplicative increase of \(e^{0.15843} = 1.1717\) (> 1) in the odds of claim occurrence, holding all other features fixed. The fact that this increase exceeds 1 makes intuitive sense: the more valuable the vehicle, the more likely a driver is to submit insurance claims in the case of accidental damage.
agecat.1 | 0.25731 | The odds of claim occurrence for policyholders in age category 1 are \(e^{0.25731} = 1.2934\) (> 1) times those in the baseline level (which is now categories 2, 3, and 4 combined). It is possible that younger drivers are more reckless and tend to have higher accident rates.
agecat.5 | -0.25433 | The odds of claim occurrence for policyholders in age category 5 are \(e^{-0.25433} = 0.7754\) (< 1) times those in the baseline level. This also aligns with intuition, as more mature drivers are more careful and tend to have lower accident rates.
agecat.6 | -0.23858 | The odds of claim occurrence for policyholders in age category 6 are \(e^{-0.23858} = 0.7877\) (< 1) times those in the baseline level.
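The same multiplicative effects can be restated as percentage changes in the odds. A minimal sketch using the logit.final object from CHUNK 20 (the rounding to two decimal places is our own choice):

# Percentage change in the odds of claim occurrence associated with each feature
round(100 * (exp(coef(logit.final)) - 1), 2)
# e.g., a unit increase in log_veh_value is associated with roughly a 17.17% increase
# in the odds, while being in age category 5 lowers the odds by about 22.46%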
Probability-Based Interpretation
Steps:
1. Look up the mean values of the numeric predictors and the modal (i.e., most common) levels of the categorical predictors.
# CHUNK 21 # To look at mean or modal values summary(dataCar)
clm exposure veh_body veh_age gender area agecat log_veh_value Min. :0.00000 Min. :0.002738 OTHER :67644 3:20060 F:38584 C:20530 4:16179 Min. :-1.71480 1st Qu.:0.00000 1st Qu.:0.219028 BUS-MCARA: 159 1:12254 M:29219 A:16302 1: 5734 1st Qu.: 0.00995 Median :0.00000 Median :0.446270 2:16582 B:13328 2:12868 Median : 0.40547 Mean :0.06811 Mean :0.468481 4:18907 D: 8161 3:15757 Mean : 0.38675 3rd Qu.:0.00000 3rd Qu.:0.709103 E: 5907 5:10722 3rd Qu.: 0.76547 Max. :1.00000 Max. :0.999316 F: 3575 6: 6543 Max. : 3.54270
2. Create a new data frame holding the feature combinations we want to explore.
new.data <- data.frame( exposure = c(0.468481, 0.515329, 0.468481, 0.468481, 0.468481, 0.468481), log_veh_value = c(0.38675, 0.38675, 0.42543, 0.38675, 0.38675, 0.38675), agecat.1 = c(0, 0, 0, 1, 0, 0), agecat.5 = c(0, 0, 0, 0, 1, 0), agecat.6 = c(0, 0, 0, 0, 0, 1) ) new.data
exposure log_veh_value agecat.1 agecat.5 agecat.6 1 0.468481 0.38675 0 0 0 2 0.515329 0.38675 0 0 0 3 0.468481 0.42543 0 0 0 4 0.468481 0.38675 1 0 0 5 0.468481 0.38675 0 1 0 6 0.468481 0.38675 0 0 1
3. In each of the five remaining rows, we alter the value of one and only one feature (exposure
, log_veh_value
, agecat
, in that order) to look at the effect of that predictor on the predicted claim occurrence probability.
predict(logit.final, newdata = new.data, type = "response")
1 2 3 4 5 6 0.07183549 0.07845544 0.07224517 0.09099701 0.05661722 0.05746447
Table 4.5 summarizes the results (the feature that is changed is shown in bold).
exposure | log_veh_value | agecat.1 | agecat.5 | agecat.6 | Predicted Probability | Interpretation
0.468481 | 0.38675 | 0 | 0 | 0 | 0.07183549 | (baseline feature combination)
0.515329 | 0.38675 | 0 | 0 | 0 | 0.07845544 | A 10% increase in exposure increases the claim occurrence probability by 0.0785 - 0.0718 = 0.0066.
0.468481 | 0.42543 | 0 | 0 | 0 | 0.07224517 | A 10% increase in log_veh_value increases the claim occurrence probability by 0.0722 - 0.0718 = 0.0004.
0.468481 | 0.38675 | 1 | 0 | 0 | 0.09099701 | Being in age category 1 increases the claim occurrence probability by 0.0910 - 0.0718 = 0.0192 compared to being in the baseline age category.
0.468481 | 0.38675 | 0 | 1 | 0 | 0.05661722 | Being in age category 5 decreases the claim occurrence probability by 0.0718 - 0.0566 = 0.0152 compared to being in the baseline age category.
0.468481 | 0.38675 | 0 | 0 | 1 | 0.05746447 | Being in age category 6 decreases the claim occurrence probability by 0.0718 - 0.0575 = 0.0143 compared to being in the baseline age category.
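As a check on the probability-based interpretation, the first predicted probability in Table 4.5 can be reproduced by hand from the CHUNK 20 coefficient estimates (a minimal sketch; the linear predictor includes the offset log(exposure) and the inverse logit is then applied):

# Reproduce the first predicted probability in Table 4.5
# (coefficients taken from the CHUNK 20 summary output)
eta <- log(0.468481) +          # offset: log of the mean exposure
  (-1.86185) +                  # intercept
  0.15843 * 0.38675             # log_veh_value at its mean; all age dummies are 0
p <- 1 / (1 + exp(-eta))        # inverse logit
p                               # approximately 0.0718, matching row 1 of Table 4.5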
4.4. Generalized Linear Models Case Study 3: GLMs for Count and Aggregate Loss Variables
Learning Objectives
- Select appropriate distributions and link functions for count and severity variables.
- Identify appropriate offsets and weights for count and severity variables.
- Implement GLMs for count and severity variables in R.
- Assess the quality of a Poisson GLM using the Pearson goodness-of-fit statistic.
- Combine the GLMs for count and severity variables to make predictions for an aggregate loss variable.
Stage 1: Define the Business Problem
Objective
To identify important determinants of the two target variables and to use these determinants to make accurate predictions with GLMs.
Stage 2: Data Collection
Full Data Set
Run CHUNK 1
to read in the Swedish dataset and print a summary of each variable. There are 2182 observations and 7 variables.
# CHUNK 1 Swedish <- read.csv("SwedishMotorInsurance.csv", stringsAsFactors=TRUE) summary(Swedish)
Kilometres Zone Bonus Make Insured Claims Payment Min. :1.000 Min. :1.00 Min. :1.000 Min. :1.000 Min. : 0.01 Min. : 0.00 Min. : 0 1st Qu.:2.000 1st Qu.:2.00 1st Qu.:2.000 1st Qu.:3.000 1st Qu.: 21.61 1st Qu.: 1.00 1st Qu.: 2989 Median :3.000 Median :4.00 Median :4.000 Median :5.000 Median : 81.53 Median : 5.00 Median : 27404 Mean :2.986 Mean :3.97 Mean :4.015 Mean :4.992 Mean : 1092.20 Mean : 51.87 Mean : 257008 3rd Qu.:4.000 3rd Qu.:6.00 3rd Qu.:6.000 3rd Qu.:7.000 3rd Qu.: 389.78 3rd Qu.: 21.00 3rd Qu.: 111954 Max. :5.000 Max. :7.00 Max. :7.000 Max. :9.000 Max. :127687.27 Max. :3338.00 Max. :18245026
Findings:
- The summary does not suggest any missing or abnormal values.
Data Description
Variable | Description | Characteristics |
Kilometres |
Distance driven by a vehicle in km per year, grouped into five categories (it is not exactly the distance driven, but a higher number does mean a longer distance) | Integer from 1 to 5
|
Zone |
Geographic zone of a vehicle | Integer from 1 to 7 |
Bonus |
No claims bonus, equal to the number of years, plus one, since the last claim | Integer from 1 to 7 |
Make |
Type of a vehicle | Integer from 1 to 8 representing eight different common car models. All other models are combined in class 9. |
Insured |
Number of policyholder years (fraction of the year that the policyholder has a contract with the issuing company) | Numeric from 0.01 to 127687.27 |
Claims |
Number of claims | Integer from 0 to 3338
Payment |
Sum of payments in Swedish kroner | Integer from 0 to 18245026 |
TASK 1: Select offsets Select an appropriate offset for each of the two target variables, claim counts and aggregate payments.
|
Choosing Offsets for Claims
and Payment
There are two target variables in this case study,
- the number of claims (
Claims
) and - the aggregate payments (
Payment
).
The Number of Claims: Claims
It is reasonable to expect that Claims is proportional to the number of policyholder years (Insured), which provides an exposure basis for Claims.
Other things equal, the longer a policyholder is covered by an automobile insurance policy, the larger the number of claims submitted.
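With a log link (to be used in the GLM stage), this proportionality is encoded by supplying \(\ln(\text{Insured})\) as an offset, i.e., a predictor whose coefficient is fixed at 1:
\(\ln(E[\text{Claims}]) = \ln(\text{Insured}) + \eta \quad\Longleftrightarrow\quad E[\text{Claims}] = \text{Insured}\times e^{\eta},\)
so that, other things equal, the expected number of claims scales in direct proportion to the number of policyholder years.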
The Aggregate Payments: Payment
In the same vein, we expect that aggregate payments vary in proportion to the number of claims.
Each claim generates a payment (see CHUNK 3
) and it makes intuitive sense that:
With more claims submitted, the sum of payments becomes larger.
At this stage, we can conjecture that:
Insured
may be a good offset forClaims
andClaims
in turn may serve as a good offset forPayment
In CHUNK 2
, we construct two scatterplots, one for Claims
against Insured
and one for Payment
against Claims
.
# CHUNK 2 ggplot(Swedish, aes(x = Insured, y = Claims)) + geom_point() ggplot(Swedish, aes(x = Claims, y = Payment)) + geom_point()
Findings:
- It is clear that there is a strong proportional relationship between
Claims
andInsured
and betweenPayment
andClaims
.
This supports the use ofInsured
andClaims
as offsets (to be included on a suitable scale) forClaims
andPayment
, respectively.
Although not strictly required by the task statement, we can make two additional observations involving Claims
, Insured
, and Payment
:
1. If Claims = 0
, then Payment = 0
.
# CHUNK 3 summary(Swedish[Swedish$Claims == 0, ]$Payment) Swedish.severity <- Swedish[Swedish$Claims > 0, ]
Min. 1st Qu. Median Mean 3rd Qu. Max. 0 0 0 0 0 0
2. Create two variables:
- ClaimsPerinsured: the number of claims per policyholder year
- PaymentPerClaim: the payment per claim (i.e., the claim severity)
# CHUNK 4 Swedish$ClaimsPerinsured <- Swedish$Claims / Swedish$Insured Swedish.severity$PaymentPerClaim <- Swedish.severity$Payment / Swedish.severity$Claims
Generate scatterplots for these two variables:
# CHUNK 4 (cont.) ggplot(Swedish, aes(x = Insured, y = ClaimsPerinsured)) + geom_point() ggplot(Swedish.severity, aes(x = Claims, y = PaymentPerClaim)) + geom_point()
Findings:
- The size as well as the variability of
ClaimsPerinsured
andPaymentPerClaim
shrink significantly withInsured
andClaims
, respectively, showing that averaging goes a long way towards stabilizing the two target variables.
TASK 2: Edit and explore the data Edit and explore the data to prepare for further analysis and identify potentially predictive variables.
|
Data Quality Issue
Factorize Zone
and Make
as they do not convey any numerical order:
# CHUNK 5 Swedish$Zone <- as.factor(Swedish$Zone) Swedish$Make <- as.factor(Swedish$Make) Swedish.severity$Zone <- as.factor(Swedish.severity$Zone) Swedish.severity$Make <- as.factor(Swedish.severity$Make)
Stage 3: Exploratory Data Analysis
Distributions of the Target Variables
In CHUNK 6
, we explore the distributions of the two offset-corrected target variables, ClaimsPerinsured
and PaymentPerClaim
.
# CHUNK 6 summary(Swedish$ClaimsPerinsured) summary(Swedish.severity$PaymentPerClaim) ggplot(Swedish, aes(x = ClaimsPerinsured)) + geom_histogram()
> summary(Swedish$ClaimsPerinsured) Min. 1st Qu. Median Mean 3rd Qu. Max. 0.00000 0.02648 0.05123 0.06914 0.08906 1.66667 > summary(Swedish.severity$PaymentPerClaim) Min. 1st Qu. Median Mean 3rd Qu. Max. 72 2643 4375 5206 5985 31442
Findings:
- Both
ClaimsPerinsured
andPaymentPerClaim
are non-negative continuous variables (whileClaims
takes only non-negative integers, the division byInsured
makesClaimsPerinsured
continuous). The former ranges from 0 (corresponding to no claims) to 1.66667 while the latter ranges from 72 to 31,442. - Even after being accounted for by the offsets, the two target variables exhibit a pronounced right skew, especially for
PaymentPerClaim
, as shown in the histograms and reflected by their means substantially exceeding their medians. We will take the right skew into account when developing GLMs later.
Potentially Useful Predictors
In search of potentially predictive variables out of Kilometres
, Bonus
, Zone
, and Make
, CHUNK 7
provides the means and medians of the two offset-corrected target variables split by distinct values or levels of the four predictors. (Split boxplots can also be used here.)
# CHUNK 7 library(tidyverse) vars <- c("Kilometres", "Bonus", "Zone", "Make") for (i in vars) { x <- Swedish%>% group_by_(i) %>% summarise( mean = mean(ClaimsPerinsured), median = median(ClaimsPerinsured), n = n() ) print(x) } # A tibble: 5 × 4 Kilometres mean median n <int> <dbl> <dbl> <int> 1 1 0.0561 0.0425 439 2 2 0.0651 0.0488 441 3 3 0.0718 0.0552 441 4 4 0.0705 0.0555 434 5 5 0.0827 0.0574 427 # A tibble: 7 × 4 Bonus mean median n <int> <dbl> <dbl> <int> 1 1 0.129 0.117 307 2 2 0.0792 0.0741 312 3 3 0.0676 0.0591 310 4 4 0.0659 0.0514 310 5 5 0.0550 0.0454 313 6 6 0.0524 0.0447 315 7 7 0.0364 0.0350 315 # A tibble: 7 × 4 Zone mean median n <fct> <dbl> <dbl> <int> 1 1 0.104 0.0862 315 2 2 0.0795 0.0645 315 3 3 0.0722 0.0575 315 4 4 0.0575 0.0473 315 5 5 0.0626 0.0469 313 6 6 0.0569 0.0435 315 7 7 0.0504 0 294 # A tibble: 9 × 4 Make mean median n <fct> <dbl> <dbl> <int> 1 1 0.0761 0.0614 245 2 2 0.0802 0.0607 245 3 3 0.0576 0.0389 242 4 4 0.0333 0.0216 238 5 5 0.0919 0.0691 244 6 6 0.0543 0.0430 244 7 7 0.0838 0.0555 242 8 8 0.0729 0.0487 237 9 9 0.0712 0.0630 245 |
# CHUNK 7 (cont.) for (i in vars) { x <- Swedish.severity%>% group_by_(i) %>% summarise( mean= mean(PaymentPerClaim), median= median(PaymentPerClaim), n = n() ) print(x) } # A tibble: 5 × 4 Kilometres mean median n <int> <dbl> <dbl> <int> 1 1 4912. 4239. 381 2 2 5068. 4650. 399 3 3 5445. 4494. 378 4 4 5065. 3901. 328 5 5 5601. 4207. 311 # A tibble: 7 × 4 Bonus mean median n <int> <dbl> <dbl> <int> 1 1 4913. 4195. 252 2 2 5451. 4238. 246 3 3 4899. 3974. 246 4 4 5257. 4095. 240 5 5 5188. 3900 247 6 6 5251. 4464. 270 7 7 5440. 5107. 296 # A tibble: 7 × 4 Zone mean median n <fct> <dbl> <dbl> <int> 1 1 4959. 4399. 295 2 2 4569. 4142. 295 3 3 5167. 4500. 293 4 4 5411. 4904. 306 5 5 5625. 4145. 236 6 6 5702. 4775. 264 7 7 5016. 2605. 108 # A tibble: 9 × 4 Make mean median n <fct> <dbl> <dbl> <int> 1 1 5083. 4828. 227 2 2 4854. 4313. 205 3 3 5950. 4438. 178 4 4 4877. 3367 155 5 5 4986. 3797. 206 6 6 5193. 4482. 212 7 7 4759. 3429 197 8 8 6603. 4387 173 9 9 4849. 4791. 244
|
# Remove redundant variables
rm(i, vars, x)
Findings:
ClaimsPerinsured
:
It appears that ClaimsPerinsured
has a monotonic relationship with both Kilometres
and Bonus
, being increasing in Kilometres
and decreasing in Bonus
.
These findings make intuitive sense as:
a longer distance traveled indicates a more worn-out vehicle and a smaller
Bonus
suggests poorer driving experience of a policyholder.
The average of ClaimsPerinsured
is also particularly low in certain geographical zones such as 4, 5, 6, and 7, and for certain car models such as 3, 4, and 6.
PaymentPerClaim
:
The effects of the four predictors on PaymentPerClaim
are not as pronounced as for ClaimsPerinsured
. Unlike ClaimsPerinsured
, Kilometres
and Bonus
do not seem to exert a monotonic effect on PaymentPerClaim
.
We can see that PaymentPerClaim
has relatively close means and medians across different values of Kilometres
and Bonus
, and relatively high means and medians for Zone 4 and Zone 6, and relatively low medians for cars of Make 4 and Make 7.
TASK 3: Construct a GLM for
|
Stage 4: Model Construction and Selection
Creation of Training and Test sets
In CHUNK 8
, we do the stratified sampling based on the number of claims per insured, ClaimsPerinsured
.
# CHUNK 8 library(caret) set.seed(14) train_ind <- createDataPartition(Swedish$ClaimsPerinsured, p = 0.7, list = FALSE) data.train <- Swedish[train_ind, ] data.test <- Swedish[-train_ind, ] # Takeout the severity portions of the training and test sets data.train.severity <- data.train[data.train$Claims > 0, ] data.train.severity$PaymentPerClaim <- data.train.severity$Payment / data.train.severity$Claims data.test.severity <- data.test[data.test$Claims > 0, ] data.test.severity$PaymentPerClaim <- data.test.severity$Payment / data.test.severity$Claims print("TRAIN") mean(data.train$ClaimsPerinsured) mean(data.train.severity$PaymentPerClaim) print("TEST") mean(data.test$ClaimsPerinsured) mean(data.test.severity$PaymentPerClaim)
> print("TRAIN") [1] "TRAIN" > mean(data.train$ClaimsPerinsured) [1] 0.06896561 > mean(data.train.severity$PaymentPerClaim) [1] 5253.53 > print("TEST") [1] "TEST" > mean(data.test$ClaimsPerinsured) [1] 0.06955017 > mean(data.test.severity$PaymentPerClaim) [1] 5094.656
Findings:
- The average number of claims per insured on the training set and that on the test set are very close to each other.
Fitting the Model: A GLM for Claims
with Kilometres
and Bonus
as numeric variables.
- Distribution: the Poisson distribution is a suitable target distribution, because
Claims
is a non-negative integer-valued variable with a right skew. - Link: log link is most commonly employed in conjunction with the Poisson distribution as:
- it ensures that the model predictions are always non-negative
- makes for an interpretable model, and
- is the canonical link.
In CHUNK 9
, we fit a log-link Poisson GLM for Claims
on the training set using Kilometres
, Zone
, Bonus
, and Make
as predictors and the logarithm of Insured
as an offset. Here Kilometres
and Bonus
are treated as numeric variables.
# CHUNK 9
count.numeric <- glm(Claims ~ Kilometres + Zone + Bonus + Make,
                     data = data.train, offset = log(Insured),
                     family = poisson)
summary(count.numeric)
Call: glm(formula = Claims ~ Kilometres + Zone + Bonus + Make, family = poisson, data = data.train, offset = log(Insured)) Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -1.813830 0.016746 -108.311 < 2e-16 *** Kilometres 0.131218 0.003028 43.330 < 2e-16 *** Zone2 -0.230780 0.011698 -19.728 < 2e-16 *** Zone3 -0.382588 0.012185 -31.397 < 2e-16 *** Zone4 -0.573191 0.010981 -52.197 < 2e-16 *** Zone5 -0.328492 0.018044 -18.205 < 2e-16 *** Zone6 -0.522192 0.014480 -36.063 < 2e-16 *** Zone7 -0.727917 0.050782 -14.334 < 2e-16 *** Bonus -0.201320 0.001609 -125.103 < 2e-16 *** Make2 0.047689 0.026055 1.830 0.0672 . Make3 -0.268146 0.033077 -8.107 5.20e-16 *** Make4 -0.687657 0.027191 -25.290 < 2e-16 *** Make5 0.154635 0.023003 6.722 1.79e-11 *** Make6 -0.350481 0.020309 -17.258 < 2e-16 *** Make7 -0.061202 0.027085 -2.260 0.0238 * Make8 -0.074676 0.040067 -1.864 0.0624 . Make9 -0.069484 0.011548 -6.017 1.78e-09 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for poisson family taken to be 1) Null deviance: 23800.5 on 1529 degrees of freedom Residual deviance: 2681.6 on 1513 degrees of freedom AIC: 8059.4 Number of Fisher Scoring iterations: 4
Note:
Payment
andClaimsPerinsured
are excluded because they cannot serve as predictors of Claims. Insured is moved to the offset argument on the log scale to play its role as an exposure.
Findings:
- The model summary shows that the coefficient estimates for
Kilometres
andBonus
are positive and negative, respectively, consistent with what we saw inCHUNK 7
. - All features, including the individual factor levels of
Zone
andMake
(except for Levels 2 and 8), are statistically significant, indicating that they are valuable for accounting forClaims
and obviating the need for feature removal.
Digression: Can we model ClaimsPerinsured
directly?
Note that because of the log link, the equation of the Poisson GLM we have just fitted is equivalent to:
\(\ln(E[\text{ClaimsPerinsured}])=\ln\left(\dfrac{E[\text{Claims}]}{\text{Insured}}\right)=\eta\),
where η is the linear predictor comprising Kilometres
, Zone
, Bonus
, and Make
. You may wonder:
To save work, can we instead fit a log-link Poisson GLM directly to
ClaimsPerinsured
without having to supplylog(Insured)
as an offset?
This is explored in CHUNK 10
.
# CHUNK 10 freq <- glm(ClaimsPerinsured ~ Kilometres + Zone + Bonus + Make, data= data.train, weight = Insured, family = poisson) summary(freq)
> summary(freq) Call: glm(formula = ClaimsPerinsured ~ Kilometres + Zone + Bonus + Make, family = poisson, data = data.train, weights = Insured) Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -1.813830 0.016747 -108.311 < 2e-16 *** Kilometres 0.131218 0.003028 43.330 < 2e-16 *** Zone2 -0.230780 0.011698 -19.728 < 2e-16 *** Zone3 -0.382588 0.012185 -31.397 < 2e-16 *** Zone4 -0.573191 0.010981 -52.197 < 2e-16 *** Zone5 -0.328492 0.018044 -18.205 < 2e-16 *** Zone6 -0.522192 0.014480 -36.063 < 2e-16 *** Zone7 -0.727917 0.050784 -14.334 < 2e-16 *** Bonus -0.201320 0.001609 -125.103 < 2e-16 *** Make2 0.047689 0.026055 1.830 0.0672 . Make3 -0.268146 0.033077 -8.107 5.20e-16 *** Make4 -0.687657 0.027191 -25.290 < 2e-16 *** Make5 0.154635 0.023003 6.722 1.79e-11 *** Make6 -0.350481 0.020309 -17.258 < 2e-16 *** Make7 -0.061202 0.027085 -2.260 0.0238 * Make8 -0.074676 0.040067 -1.864 0.0624 . Make9 -0.069484 0.011548 -6.017 1.78e-09 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for poisson family taken to be 1) Null deviance: 23800.5 on 1529 degrees of freedom Residual deviance: 2681.6 on 1513 degrees of freedom AIC: Inf Number of Fisher Scoring iterations: 6
Notice that:
y = ClaimsPerinsured
and no offset is specified.- Due to the averaging by
Insured
, the variability ofClaimsPerinsured
is lowered and decreases with the value ofInsured
- With one exception, the summary for this model is identical to that in
CHUNK 9
, indicating that the two fitted models are essentially the same. - AIC= Inf , because of the non-integer values of
ClaimsPerinsured
(whereas a Poisson random variable only takes non-negative integer values). This inconsistency between the nature ofClaimsPerinsured
and a genuine Poisson random variable prohibits theglm()
function from computing the AIC of the frequency model.
As a result, no feature selection based on the AIC can be performed.
A GLM for Claims with Kilometres
and Bonus
as factor variables
# CHUNK 11 count.factor <- glm(Claims ~ factor(Kilometres) + Zone + factor(Bonus) + Make, data = data.train, offset = log(Insured), family= poisson) summary(count.factor)
Call: glm(formula = Claims ~ factor(Kilometres) + Zone + factor(Bonus) + Make, family = poisson, data = data.train, offset = log(Insured)) Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -1.793205 0.016512 -108.598 < 2e-16 *** factor(Kilometres)2 0.194910 0.010023 19.446 < 2e-16 *** factor(Kilometres)3 0.294007 0.009877 29.766 < 2e-16 *** factor(Kilometres)4 0.370948 0.014373 25.809 < 2e-16 *** factor(Kilometres)5 0.559839 0.014991 37.345 < 2e-16 *** Zone2 -0.239285 0.011750 -20.364 < 2e-16 *** Zone3 -0.382898 0.012331 -31.053 < 2e-16 *** Zone4 -0.574708 0.011185 -51.382 < 2e-16 *** Zone5 -0.337239 0.018218 -18.511 < 2e-16 *** Zone6 -0.532676 0.014746 -36.123 < 2e-16 *** Zone7 -0.708291 0.050845 -13.930 < 2e-16 *** factor(Bonus)2 -0.453835 0.015951 -28.453 < 2e-16 *** factor(Bonus)3 -0.699796 0.017949 -38.987 < 2e-16 *** factor(Bonus)4 -0.823508 0.018985 -43.376 < 2e-16 *** factor(Bonus)5 -0.906880 0.017397 -52.129 < 2e-16 *** factor(Bonus)6 -0.983759 0.014285 -68.864 < 2e-16 *** factor(Bonus)7 -1.332578 0.010936 -121.853 < 2e-16 *** Make2 0.049512 0.026094 1.897 0.0578 . Make3 -0.269633 0.033111 -8.143 3.85e-16 *** Make4 -0.661879 0.027371 -24.182 < 2e-16 *** Make5 0.152141 0.023016 6.610 3.84e-11 *** Make6 -0.342875 0.020333 -16.863 < 2e-16 *** Make7 -0.059942 0.027092 -2.213 0.0269 * Make8 -0.086094 0.040131 -2.145 0.0319 * Make9 -0.067240 0.011726 -5.734 9.80e-09 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for poisson family taken to be 1) Null deviance: 23800.5 on 1529 degrees of freedom Residual deviance: 2016.8 on 1505 degrees of freedom AIC: 7410.6 Number of Fisher Scoring iterations: 4
Findings:
- The estimates of the dummy variables of
Kilometres
are all positive and increase in value from level 2 to level 5, while the opposite relationship is observed for those of Bonus from level 2 to level 7, consistent with the monotonic effects ofKilometres
andBonus
onClaimsPerinsured
we observed earlier. - The binarization of
Kilometres
andBonus
, however, allows us to capture the effects of these two variables more effectively across different factor levels, at the cost of an increase in the number of features in the model and computational burden.
Using the Pearson statistic as an arbitrator
In CHUNK 12
, we generate the predictions of the two Poisson GLMs on the test set and evaluate the Pearson goodness-of-fit statistic for both models. Note that the predictions are for Claims
, not ClaimsPerinsured
, and have taken the offset Insured
into account.
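In symbols, the quantity computed in CHUNK 12 for each model is the Pearson goodness-of-fit statistic averaged over the test observations,
\(\dfrac{1}{n_{\text{test}}}\sum\limits_{i}\dfrac{(y_i-\hat{\mu}_i)^2}{\hat{\mu}_i},\)
where \(y_i\) is the observed claim count and \(\hat{\mu}_i\) the predicted count for the ith test observation; the smaller the statistic, the better the fit on the test set.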
# CHUNK 12
count.numeric.pred <- predict(count.numeric, newdata = data.test, type = "response")
sum((data.test$Claims - count.numeric.pred)^2 / count.numeric.pred) / nrow(data.test)

count.factor.pred <- predict(count.factor, newdata = data.test, type = "response")
sum((data.test$Claims - count.factor.pred)^2 / count.factor.pred) / nrow(data.test)
> count.numeric.pred <- predict(count.numeric, newdata = data.test, type = "response") > > sum((data.test$Claims - count.numeric.pred)^2 / count.numeric.pred) / nrow(data.test) [1] -18.54014 > > count.factor.pred <- predict(count.factor, newdata = data.test, type = "response") > > sum((data.test$Claims - count.factor.pred)^2 / count.factor.pred) / nrow(data.test) [1] -19.71478
Findings:
- Treating
Kilometres
andBonus
as factor variables leads to a lower Pearson statistic on the test set and appears to be a judicious move.
The increase in flexibility offered by the dummy variables outweighs the increase in model complexity.
TASK 4: Construct a GLM for claim severity
|
A GLM for claim severity: PaymentPerClaim
- Distribution: the Gamma distribution is a suitable target distribution, because
PaymentPerClaim
is a non-negative and continuous number with a right skew. - Link: log link is most commonly employed in conjunction with the Gamma distribution as:
- it ensures that the model predictions are always non-negative
- makes for an interpretable model.
(Note that, unlike the Poisson case, the log link is not the canonical link for the gamma distribution; the canonical link is the inverse link, which does not guarantee positive predictions.)
- Weight:
CHUNK 4
indicated that the variability ofPaymentPerClaim
decreases with the value ofClaims
. This motivates the use ofClaims
as a weight when a gamma GLM is fitted forPaymentPerClaim
.
Consistent with the claim count model in CHUNK 11
, we are binarizing Kilometres
and Bonus
.
# CHUNK 13 severity <- glm(PaymentPerClaim ~ factor(Kilometres) + Zone + factor(Bonus) + Make, data = data.train.severity, weight = Claims, family = Gamma(link = "log")) summary(severity)
Call: glm(formula = PaymentPerClaim ~ factor(Kilometres) + Zone + factor(Bonus) + Make, family = Gamma(link = "log"), data = data.train.severity, weights = Claims) Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 8.39309 0.02801 299.694 < 2e-16 *** factor(Kilometres)2 0.01253 0.01722 0.728 0.466926 factor(Kilometres)3 0.01459 0.01703 0.857 0.391764 factor(Kilometres)4 0.03893 0.02481 1.569 0.116860 factor(Kilometres)5 0.03361 0.02584 1.301 0.193644 Zone2 0.04370 0.02023 2.160 0.030966 * Zone3 0.04870 0.02122 2.294 0.021930 * Zone4 0.13249 0.01932 6.857 1.11e-11 *** Zone5 0.04064 0.03141 1.294 0.196051 Zone6 0.15418 0.02541 6.068 1.72e-09 *** Zone7 0.04576 0.08753 0.523 0.601199 factor(Bonus)2 0.06609 0.02748 2.405 0.016312 * factor(Bonus)3 0.08396 0.03094 2.713 0.006757 ** factor(Bonus)4 0.08018 0.03274 2.449 0.014445 * factor(Bonus)5 0.06293 0.03001 2.097 0.036201 * factor(Bonus)6 0.08939 0.02465 3.627 0.000298 *** factor(Bonus)7 0.12111 0.01887 6.420 1.94e-10 *** Make2 -0.03875 0.04498 -0.861 0.389150 Make3 0.05197 0.05710 0.910 0.362879 Make4 -0.20358 0.04691 -4.340 1.54e-05 *** Make5 -0.09568 0.03963 -2.414 0.015907 * Make6 -0.05159 0.03501 -1.474 0.140771 Make7 -0.11042 0.04666 -2.367 0.018099 * Make8 0.25276 0.06912 3.657 0.000266 *** Make9 -0.05821 0.02014 -2.890 0.003915 ** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for Gamma family taken to be 2.964663) Null deviance: 3801.1 on 1259 degrees of freedom Residual deviance: 3185.2 on 1235 degrees of freedom AIC: 1257914 Number of Fisher Scoring iterations: 5
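The effect of the weight argument can be made precise. In a weighted GLM, the dispersion is divided by the weight, so with \(w_i=\text{Claims}_i\) the fitted gamma model assumes
\(\mathrm{Var}(\text{PaymentPerClaim}_i)=\dfrac{\phi\,\mu_i^2}{w_i}=\dfrac{\phi\,\mu_i^2}{\text{Claims}_i},\)
which reflects the observation from CHUNK 4 that severities averaged over more claims are less variable.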
Findings:
- Most features in the fitted GLM are statistically significant; the notable exception is
Kilometres
.
# CHUNK 14 library(MASS) severity.final <- stepAIC(severity)
Start: AIC=1257914 PaymentPerClaim ~ factor(Kilometres) + Zone + factor(Bonus) + Make Df Deviance AIC - factor(Kilometres) 4 3195.4 1257909 <none> 3185.2 1257914 - factor(Bonus) 6 3312.1 1257945 - Make 8 3334.2 1257948 - Zone 6 3396.3 1257973 Step: AIC=1258151 PaymentPerClaim ~ Zone + factor(Bonus) + Make
summary(severity.final)
Call: glm(formula = PaymentPerClaim ~ Zone + factor(Bonus) + Make, family = Gamma(link = "log"), data = data.train.severity, weights = Claims) Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 8.40279 0.02680 313.506 < 2e-16 *** Zone2 0.04430 0.02026 2.186 0.028972 * Zone3 0.04808 0.02121 2.267 0.023554 * Zone4 0.13333 0.01912 6.972 5.06e-12 *** Zone5 0.03985 0.03123 1.276 0.202207 Zone6 0.15619 0.02510 6.222 6.71e-10 *** Zone7 0.04550 0.08779 0.518 0.604383 factor(Bonus)2 0.06995 0.02748 2.546 0.011023 * factor(Bonus)3 0.08940 0.03087 2.896 0.003844 ** factor(Bonus)4 0.08657 0.03262 2.654 0.008062 ** factor(Bonus)5 0.06956 0.02986 2.329 0.020009 * factor(Bonus)6 0.09553 0.02444 3.909 9.79e-05 *** factor(Bonus)7 0.12545 0.01875 6.692 3.32e-11 *** Make2 -0.03607 0.04503 -0.801 0.423242 Make3 0.05835 0.05709 1.022 0.306992 Make4 -0.21254 0.04663 -4.558 5.66e-06 *** Make5 -0.09632 0.03975 -2.423 0.015542 * Make6 -0.05523 0.03504 -1.576 0.115251 Make7 -0.11148 0.04680 -2.382 0.017376 * Make8 0.25775 0.06916 3.727 0.000203 *** Make9 -0.05990 0.01993 -3.006 0.002702 ** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for Gamma family taken to be 2.985245) Null deviance: 3801.1 on 1259 degrees of freedom Residual deviance: 3195.4 on 1239 degrees of freedom AIC: 1258151 Number of Fisher Scoring iterations: 5
Findings:
Kilometres
is removed when a backward stepwise selection based on the AIC is implemented.
In CHUNK 15
we apply the reduced GLM to produce predictions for Payment
on the severity portion of the test set.
# CHUNK 15 severity.final.pred <- data.test.severity$Claims * predict(severity.final, newdata = data.test.severity, type= "response") RMSE(data.test.severity$Payment, severity.final.pred) RMSE(data.test.severity$Payment, mean(data.train.severity$Payment))
> severity.final.pred <- data.test.severity$Claims * + predict(severity.final, newdata = data.test.severity, type= "response") > > RMSE(data.test.severity$Payment, severity.final.pred) [1] 69882.03 > > RMSE(data.test.severity$Payment, mean(data.train.severity$Payment)) [1] 1108407
Findings:
- The test RMSE is 69,882.03, which seems large.
- To put this value in perspective, we compare it with the test RMSE if we use the mean of
Payment
on the training set as the predicted value.
The test RMSE in this case becomes 1,108,407, a much larger value.
Our gamma GLM is therefore capable of reducing the test RMSE to only about 6.30% of the RMSE when no predictors are available. - The predictions made in
CHUNK 15
on the severity portion of the test set for aggregate payments were made assuming that the values ofClaims
were available, but in reality this is not the case. The number of claims will not be available until the policy term ends.
Predictions
TASK 5: Combine the two models to make a prediction for aggregate payments Run the recommended GLMs in Tasks 3 and 4 on the full dataset.
|
Combining frequency and severity GLMs to predict aggregate payments
Refit the two GLMs in Tasks 3 and 4 to the full Swedish
dataset.
# CHUNK 16 count.full <- glm(Claims ~ factor(Kilometres) + Zone + factor(Bonus) + Make, data = Swedish, offset = log(Insured), family = poisson) summary(count.full)
Call: glm(formula = Claims ~ factor(Kilometres) + Zone + factor(Bonus) + Make, family = poisson, data = Swedish, offset = log(Insured)) Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -1.812840 0.013757 -131.775 < 2e-16 *** factor(Kilometres)2 0.212586 0.007524 28.255 < 2e-16 *** factor(Kilometres)3 0.320226 0.008661 36.974 < 2e-16 *** factor(Kilometres)4 0.404657 0.012054 33.571 < 2e-16 *** factor(Kilometres)5 0.575954 0.012830 44.892 < 2e-16 *** Zone2 -0.238168 0.009496 -25.082 < 2e-16 *** Zone3 -0.386395 0.009670 -39.959 < 2e-16 *** Zone4 -0.581902 0.008654 -67.243 < 2e-16 *** Zone5 -0.326128 0.014530 -22.446 < 2e-16 *** Zone6 -0.526234 0.011877 -44.309 < 2e-16 *** Zone7 -0.730999 0.040698 -17.962 < 2e-16 *** factor(Bonus)2 -0.478993 0.012094 -39.607 < 2e-16 *** factor(Bonus)3 -0.693172 0.013508 -51.316 < 2e-16 *** factor(Bonus)4 -0.827397 0.014584 -56.735 < 2e-16 *** factor(Bonus)5 -0.925632 0.013968 -66.269 < 2e-16 *** factor(Bonus)6 -0.993457 0.011629 -85.429 < 2e-16 *** factor(Bonus)7 -1.327406 0.008685 -152.845 < 2e-16 *** Make2 0.076245 0.021239 3.590 0.000331 *** Make3 -0.247413 0.025094 -9.859 < 2e-16 *** Make4 -0.653524 0.024185 -27.022 < 2e-16 *** Make5 0.154924 0.020235 7.656 1.91e-14 *** Make6 -0.335581 0.017375 -19.314 < 2e-16 *** Make7 -0.055940 0.023343 -2.396 0.016554 * Make8 -0.043933 0.031604 -1.390 0.164493 Make9 -0.068054 0.009956 -6.836 8.17e-12 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for poisson family taken to be 1) Null deviance: 34070.6 on 2181 degrees of freedom Residual deviance: 2966.1 on 2157 degrees of freedom AIC: 10654 Number of Fisher Scoring iterations: 4
severity.full <- glm(Payment/Claims ~ Zone + factor(Bonus) + Make, data = Swedish.severity, weight = Claims, family = Gamma(link = "log")) summary(severity.full)
Call: glm(formula = Payment/Claims ~ Zone + factor(Bonus) + Make, family = Gamma(link = "log"), data = Swedish.severity, weights = Claims) Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 8.41085 0.02222 378.570 < 2e-16 *** Zone2 0.02297 0.01640 1.401 0.161428 Zone3 0.04770 0.01670 2.855 0.004348 ** Zone4 0.12963 0.01497 8.661 < 2e-16 *** Zone5 0.05073 0.02509 2.022 0.043293 * Zone6 0.14652 0.02053 7.137 1.39e-12 *** Zone7 0.02259 0.07025 0.321 0.747879 factor(Bonus)2 0.04698 0.02085 2.254 0.024346 * factor(Bonus)3 0.07390 0.02325 3.178 0.001508 ** factor(Bonus)4 0.06240 0.02507 2.489 0.012893 * factor(Bonus)5 0.03987 0.02396 1.664 0.096308 . factor(Bonus)6 0.07606 0.01987 3.828 0.000134 *** factor(Bonus)7 0.12130 0.01480 8.194 4.78e-16 *** Make2 -0.03177 0.03664 -0.867 0.386081 Make3 0.08932 0.04327 2.064 0.039159 * Make4 -0.17496 0.04138 -4.228 2.48e-05 *** Make5 -0.08771 0.03493 -2.511 0.012120 * Make6 -0.04243 0.02995 -1.417 0.156670 Make7 -0.12068 0.04030 -2.995 0.002784 ** Make8 0.21863 0.05442 4.017 6.14e-05 *** Make9 -0.05673 0.01711 -3.316 0.000931 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for Gamma family taken to be 2.979105) Null deviance: 5417.7 on 1796 degrees of freedom Residual deviance: 4547.3 on 1776 degrees of freedom AIC: 1878541 Number of Fisher Scoring iterations: 5
According to the data dictionary, the combination of feature values is:
Kilometres | Zone | Bonus | Make | Insured
2 | 1 | 1 | 6 | 350
This policyholder is created in CHUNK 17
.
# CHUNK 17 sample <- data.frame(Kilometres = 2, Zone = "1", Bonus = 1, Make = "6", Insured = 350) sample
Kilometres Zone Bonus Make Insured 1 2 1 1 6 350
Step 1. Predict the number of claims
# CHUNK 18 # Predicted number of claims N.pred <- predict(count.full, newdata = sample, type = "response") N.pred
1 50.50629
We may use the coefficient estimates in CHUNK 16
to compute the predicted linear predictor for the given policyholder:
η = -1.812840 + 0.212586 + 0 + 0 -0.335581 = -1.935835
Together with the offset, the predicted value of Claims
is:
N = 350 × exp(-1.935835) = 50.5063
Step 2. Predict the claim severity
# CHUNK 18 (Cont.) # Predicted claim severity X.pred <- predict(severity.full, newdata = sample, type = "response") X.pred
1 4308.826
Step 3. Multiply the two predictions to get the predicted aggregate payments
# CHUNK 18 (Cont.)
# Predicted aggregate payments
S.pred <- N.pred * X.pred
S.pred
1 217622.8
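Multiplying the two predictions treats claim counts and claim severities as independent (the standard frequency-severity decomposition), so that for the aggregate payments \(S=\sum_{k=1}^{N}X_k\),
\(E[S]=E[N]\,E[X]\approx 50.5063\times 4308.826\approx 217{,}623,\)
matching the value of S.pred above.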
An alternative prediction method
An alternative is to fit a target distribution, such as the Tweedie distribution, to the aggregate payments as a single target variable (a sketch of such a model follows the list below). This alternative modeling approach has the following advantages:
- It is more efficient to fit and maintain a single (possibly more complex) model than two separate models.
- It does not require independence between claim counts and claim severity.
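As a rough sketch of what this alternative might look like (not part of the original chunks): the tweedie() family constructor from the statmod package can be passed to glm() to fit a log-link Tweedie GLM directly to Payment. The variance power var.power = 1.5 below is an illustrative assumption (a value strictly between 1 and 2 corresponds to a compound Poisson-gamma distribution) and would need to be tuned in practice.

# A minimal sketch, assuming the statmod package is installed
library(statmod)
tweedie.full <- glm(Payment ~ factor(Kilometres) + Zone + factor(Bonus) + Make,
                    data = Swedish,
                    offset = log(Insured),
                    family = tweedie(var.power = 1.5, link.power = 0))  # link.power = 0 gives the log link
summary(tweedie.full)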
Principal Components Analysis (PCA)
Concepts
Definition: PCA
Principal components analysis (PCA) is an advanced data analytic technique that transforms a high-dimensional dataset into a smaller, much more manageable set of representative (principal) variables that capture most of the information in the original dataset.
Definition: PCs
The composite variables, known as principal components (PCs), are linear combinations of the existing variables, generated such that they are mutually uncorrelated and collectively simplify the dataset, reducing its dimension and making it more amenable to data exploration and visualization.
The Loadings of the mth PC
For m = 1, 2, … , M, where M (≤ p) is the number of PCs to use, the mth PC is given by the linear combination:
\(Z_m=\phi_{1m}X_1+\phi_{2m}X_2+\cdots+\phi_{pm}X_p=\sum\limits_{j=1}^{p}{\phi_{jm}X_j}\)
where the coefficients \(\phi_{1m}\), \(\phi_{2m}\), …, \(\phi_{pm}\) are the loadings (a.k.a. weights) of the mth PC corresponding to X1, … Xp.
Finally,
\(\phi_m=\left( \begin{matrix}
\phi_{1m} \\
\phi_{2m} \\
\vdots \\
\phi_{pm} \\
\end{matrix} \right)\), one loading for each feature
The Scores of the mth PC
Given the feature values, we can compute the realizations of Zm by:
\(Z_{im}=\phi_{1m}X_{i1}+\phi_{2m}X_{i2}+\cdots+\phi_{pm}X_{ip}=\sum\limits_{j=1}^{p}{\phi_{jm}X_{ij}}\), i = 1, …, n
We refer to Z1m, Z2m, .. . , Znm as the scores of the mth PC.
Finally,
\(Z_m=\left( \begin{matrix}
Z_{1m} \\
Z_{2m} \\
\vdots \\
Z_{nm} \\
\end{matrix} \right)\), one score for each observation
Properties of the mth PC
- Scores = Loadings × X: \(\boldsymbol{Z}_m = \boldsymbol{X\phi}_m\)
- The mean score of each PC is zero, i.e., \(\bar{z}_m=\dfrac{1}{n}\sum\limits_{i=1}^{n}{z_{im}}=0\) for all m, as long as the feature values are mean-centered.
- The loadings of the first PC are chosen to maximize the sample variance of \(Z_1\), i.e., to capture as much information in the data as possible, subject to the normalization \(\sum\limits_{j=1}^{p}{\phi^2_{j1}}=1\).
- Each subsequent PC is uncorrelated with the preceding PCs and measures a different aspect of the variables in the dataset. Geometrically, the PC directions are mutually perpendicular, with the angle between any two of them being 90°.
Purpose of PCA
Exploratory data analysis (which includes data exploration and visualization) and feature generation.
- reduce the dimension of the dataset and the complexity of the model
- generate features by replacing the original variables X1, …, Xp with the PCs Z1, …, ZM
Proportion of Variance Explained (PVE)
Total Variance
Not Scaled:
\(\underbrace{\sum\limits_{j=1}^{p}}_{\text{sum over}\\\text{all features }}\underbrace{\dfrac{1}{n}\sum\limits_{i=1}^{n}{x^2_{ij}}}_{\text{sample variance}\\\text{of feature j}}\)
Standardized:
\(\sum\limits_{j=1}^{p}{1}=p\)
Proportion of Variance Explained (PVE) by the mth PC
Not Scaled
\(PVE_m=\dfrac{\text{variance explained by mth PC}}{\text{total variance}}=\dfrac{\sum\limits_{i=1}^{n}{z^2_{im}}}{\sum\limits_{i=1}^{n}{\sum\limits_{j=1}^{p}{x^2_{ij}}}}\)
Standardized
\(\dfrac{\sum\limits_{i=1}^{n}{z^2_{im}}}{np}\)
Properties
- PVEm ∈ (0, 1), and the PVEs of all p PCs sum to 1.
- The PVEm's are monotonically decreasing in m, i.e., PVE1 ≥ PVE2 ≥ … ≥ PVEM.
Applying PCA
Issues
-
- Centering: mean-centering is often not of great concern when we perform PCA.
- Scaling: The results of a PCA depend very much on the scale of the variables.
- If we conduct a PCA using the variables on their original scale, the PC loadings are determined based on the (sample) covariance matrix of the variables.
- If we conduct a PCA for the standardized variables (i.e., they are divided by their standard deviation to have unit standard deviation), the PC loadings are determined based on the (sample) correlation matrix of the variables.
How does scaling affect PCA results?
If no scaling is done and the variables are of vastly different orders of magnitude, then those variables with an unusually large variance on their scale will receive a large PC loading and dominate the corresponding PC. However, there is no guarantee that this variable explains much of the underlying pattern in the data.
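A quick way to see this is to compare the first PC loading vector with and without scaling. A minimal sketch using the USArrests data analyzed in the case study below (Assault has by far the largest variance there, so it dominates the unscaled first PC):

data(USArrests)
# Covariance-based PCA (no scaling): the first PC is driven almost entirely by Assault
prcomp(USArrests, center = TRUE, scale. = FALSE)$rotation[, 1]
# Correlation-based PCA (scaled variables): the loadings are far more balanced
prcomp(USArrests, center = TRUE, scale. = TRUE)$rotation[, 1]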
Categorical Variables
PCA, in its current form, can only be applied to numeric variables.
To perform PCA on categorical variables, they have to be explicitly binarized in advance (using the dummyVars() function from the caret package) to create dummy variables, which are numeric and take values of either 0 or 1.
Drawbacks of PCA
Interpretability
PCA may not produce interpretable results because the PCs are complicated linear combinations of the original features. This may not align with the client's interest in identifying the factors affecting the target variable or in obtaining actionable insights.
Not good for non-linear relationships
If the variables are related in a non-linear fashion, then PCA may not be a good tool to use.
Simple Case Study
In this subsection, we illustrate the concepts presented in Subsections 6.1.1 and 6.1.2 by means of a simple case study. After completing this case study, you should be able to:
- Perform a PCA using the
prcomp()
function in R. - Extract useful information from a
prcomp
object and interpret the output of a PCA. - Generate and interpret a biplot.
- Realize the effect of scaling on the results of a PCA.
- Use the output of a PCA to create new features for prediction.
TASK 1: Explore the data Investigate the distribution of the four variables using graphical displays and summary statistics. Perform the following:
|
Explore the Data
Comment on their range of values and the relative size of their standard deviations.
# CHUNK 1 data(USArrests) summary(USArrests)
Murder Assault UrbanPop Rape Min. : 0.800 Min. : 45.0 Min. :32.00 Min. : 7.30 1st Qu.: 4.075 1st Qu.:109.0 1st Qu.:54.50 1st Qu.:15.07 Median : 7.250 Median :159.0 Median :66.00 Median :20.10 Mean : 7.788 Mean :170.8 Mean :65.54 Mean :21.23 3rd Qu.:11.250 3rd Qu.:249.0 3rd Qu.:77.75 3rd Qu.:26.18 Max. :17.400 Max. :337.0 Max. :91.00 Max. :46.00
Findings:
- We can see that the four variables are on drastically different scales, even among the three crime-related variables. For example, Murder ranges from 0.800 to 17.400, but Assault is from 45.0 to 337.0.
Determine if any of these variables should be transformed and, if so, what transformation should be made.
In CHUNK 2
, we use the apply()
function to “apply” the sd()
function to return the standard deviation of each variable (which is not available from summary above).
# CHUNK 2 apply(USArrests, 2, sd)
Murder Assault UrbanPop Rape 4.355510 83.337661 14.474763 9.366385
Findings:
- That the four standard deviations differ substantially makes it imperative to scale the four variables to have unit standard deviation prior to performing a PCA.
# CHUNK 3 library(ggplot2) # names(USArrests) extracts the column names of the USArrests data for (i in names(USArrests)) { plot <- ggplot(USArrests, aes(x = USArrests[, i])) + geom_histogram() + xlab(i) print(plot) }
Figure 6.1.3: Histograms for the four variables in the USArrests
dataset
Findings:
- None of the four graphs indicates a particularly problematic distribution, so no variable transformations seem warranted. (There is a small amount of right skewness in the distribution of
Rape
. For simplicity, we leaveRape
as is)
Create a correlation matrix for the four variables, comment on the correlation structure of the four variables, and explain why a PCA may be useful for summarizing the four variables.
In CHUNK 4
we create a correlation matrix for the four variables in an attempt to understand their correlation structure.
# CHUNK 4 cor(USArrests)
## Murder Assault UrbanPop Rape ## Murder 1.00000000 0.8018733 0.06957262 0.5635788 ## Assault 0.80187331 1.0000000 0.25887170 0.6652412 ## UrbanPop 0.06957262 0.2588717 1.00000000 0.4113412 ## Rape 0.56357883 0.6652412 0.41134124 1.0000000
Examining the pairwise correlations, we can see that:
Murder
,Assault
, andRape
are strongly positively correlated with one another, with all pairwise correlations greater than 0.5.- This is perhaps not surprising because the three variables all measure similar quantities – crime rates.
- The strong correlations suggest that PCA may be an effective technique to “compress” the three crime-related variables into a single measure of overall crime levels while retaining most of the information.
UrbanPop
does not seem to have a strong linear relationship with the three crime-related variables; it has a negligible correlation withMurder
and a moderately weak correlation withAssault
andRape
. In other words, urbanization levels and crime levels do not correlate closely.
TASK 2: Conduct a principal components analysis Task 1 suggests that PCA may be a good way to summarize the four variables in the
|
Perform a PCA on the four variables. You should decide whether to scale the variables in view of your observations in Task 1.
Implementing a PCA
In CHUNK 5
, we run a PCA on the USArrests
data and print out the PC loading vectors and scores.
As we saw in CHUNK 2
, the four variables have vastly different variances, so we will scale them to have unit standard deviation.
# CHUNK 5 PCA <- prcomp(USArrests, center = TRUE, scale. = TRUE) # PC loadings PCA$rotation
## PC1 PC2 PC3 PC4 ## Murder -0.5358995 0.4181809 -0.3412327 0.64922780 ## Assault -0.5831836 0.1879856 -0.2681484 -0.74340748 ## UrbanPop -0.2781909 -0.8728062 -0.3780158 0.13387773 ## Rape -0.5434321 -0.1673186 0.8177779 0.08902432
# PC scores PCA$x
## PC1 PC2 PC3 PC4 ## Alabama -0.97566045 1.12200121 -0.43980366 0.154696581 ## Alaska -1.93053788 1.06242692 2.01950027 -0.434175454 ## Arizona -1.74544285 -0.73845954 0.05423025 -0.826264240 ## Arkansas 0.13999894 1.10854226 0.11342217 -0.180973554 ## California -2.49861285 -1.52742672 0.59254100 -0.338559240 ## Colorado -1.49934074 -0.97762966 1.08400162 0.001450164 ## Connecticut 1.34499236 -1.07798362 -0.63679250 -0.117278736 ## Delaware -0.04722981 -0.32208890 -0.71141032 -0.873113315 ## Florida -2.98275967 0.03883425 -0.57103206 -0.095317042 ## Georgia -1.62280742 1.26608838 -0.33901818 1.065974459 ## Hawaii 0.90348448 -1.55467609 0.05027151 0.893733198 ## Idaho 1.62331903 0.20885253 0.25719021 -0.494087852 ## Illinois -1.36505197 -0.67498834 -0.67068647 -0.120794916 ## Indiana 0.50038122 -0.15003926 0.22576277 0.420397595 ## Iowa 2.23099579 -0.10300828 0.16291036 0.017379470 ## Kansas 0.78887206 -0.26744941 0.02529648 0.204421034 ## Kentucky 0.74331256 0.94880748 -0.02808429 0.663817237 ## Louisiana -1.54909076 0.86230011 -0.77560598 0.450157791 ## Maine 2.37274014 0.37260865 -0.06502225 -0.327138529 ## Maryland -1.74564663 0.42335704 -0.15566968 -0.553450589 ## Massachusetts 0.48128007 -1.45967706 -0.60337172 -0.177793902 ## Michigan -2.08725025 -0.15383500 0.38100046 0.101343128 ## Minnesota 1.67566951 -0.62590670 0.15153200 0.066640316 ## Mississippi -0.98647919 2.36973712 -0.73336290 0.213342049 ## Missouri -0.68978426 -0.26070794 0.37365033 0.223554811 ## Montana 1.17353751 0.53147851 0.24440796 0.122498555 ## Nebraska 1.25291625 -0.19200440 0.17380930 0.015733156 ## Nevada -2.84550542 -0.76780502 1.15168793 0.311354436 ## New Hampshire 2.35995585 -0.01790055 0.03648498 -0.032804291 ## New Jersey -0.17974128 -1.43493745 -0.75677041 0.240936580 ## New Mexico -1.96012351 0.14141308 0.18184598 -0.336121113 ## New York -1.66566662 -0.81491072 -0.63661186 -0.013348844 ## North Carolina -1.11208808 2.20561081 -0.85489245 -0.944789648 ## North Dakota 2.96215223 0.59309738 0.29824930 -0.251434626 ## Ohio 0.22369436 -0.73477837 -0.03082616 0.469152817 ## Oklahoma 0.30864928 -0.28496113 -0.01515592 0.010228476 ## Oregon -0.05852787 -0.53596999 0.93038718 -0.235390872 ## Pennsylvania 0.87948680 -0.56536050 -0.39660218 0.355452378 ## Rhode Island 0.85509072 -1.47698328 -1.35617705 -0.607402746 ## South Carolina -1.30744986 1.91397297 -0.29751723 -0.130145378 ## South Dakota 1.96779669 0.81506822 0.38538073 -0.108470512 ## Tennessee -0.98969377 0.85160534 0.18619262 0.646302674 ## Texas -1.34151838 -0.40833518 -0.48712332 0.636731051 ## Utah 0.54503180 -1.45671524 0.29077592 -0.081486749 ## Vermont 2.77325613 1.38819435 0.83280797 -0.143433697 ## Virginia 0.09536670 0.19772785 0.01159482 0.209246429 ## Washington 0.21472339 -0.96037394 0.61859067 -0.218628161 ## West Virginia 2.08739306 1.41052627 0.10372163 0.130583080 ## Wisconsin 2.05881199 -0.60512507 -0.13746933 0.182253407 ## Wyoming 0.62310061 0.31778662 -0.23824049 -0.164976866
Verify that the first PC score of Alabama is calculated correctly.
In CHUNK 5
, the loadings of the first PC are \(\phi_{11}\) = -0.5358995, \(\phi_{21}\) = -0.5831836, \(\phi_{31}\) = -0.2781909, \(\phi_{41}\) = -0.5434321, which means that the first PC is defined by:
Z1 = -0.5358995(Murder) – 0.5831836(Assault) – 0.2781909(UrbanPop) – 0.5434321(Rape )
where the four variables have been centered and standardized and\(\phi^2_{11}+\phi^2_{21}+\phi^2_{31}+\phi^2_{41}=1\).
USArrests[1, ]
Murder Assault UrbanPop Rape Alabama 13.2 236 58 21.2
Upon centering and standardization (the means and standard deviations are taken from CHUNKs 1 and 2
), Alabama has:
- Murder1 = \(\dfrac{x_1-\mu_{11}}{\sigma_{11}}=\dfrac{13.2-7.788}{4.35551}=1.242564\)
- Assault1 = \(\dfrac{x_1-\mu_{12}}{\sigma_{12}}=\dfrac{236-170.76}{83.337661}=0.782839\)
- UrbanPop1 = \(\dfrac{x_1-\mu_{13}}{\sigma_{13}}=\dfrac{58-65.54}{14.474763}=-0.520907\)
- Rape1 = \(\dfrac{x_1-\mu_{14}}{\sigma_{14}}=\dfrac{21.2-21.232}{9.366385}=-0.003416\)
So PC1 = z11 = -0.5358995(1.242564) - 0.5831836(0.782839) - 0.2781909(-0.520907) - 0.5434321(-0.003416) = -0.9757
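The same check can be done directly in R (a minimal verification, not part of the original chunks), by taking the inner product of Alabama's standardized values with the first loading vector:

# Standardize the data and take the inner product with the first PC loadings
sum(scale(USArrests)["Alabama", ] * PCA$rotation[, 1])
# should match PCA$x["Alabama", 1] = -0.9756604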
Interpret the loadings of the first two principal components.
Interpretation of PC’s
The PCs have special meaning and capture specific aspects of the data.
- PC1:
- Relative Signs: The first PC attaches roughly the same (negative) weight to each of
Assault
,Murder
, andRape
, which are strongly positively correlated with one another (recall the correlation matrix inCHUNK 4
), and a much smaller weight toUrbanPop
, so it can be interpreted as a measure of overall crime rates. - Magnitudes: The smaller (more negative) the PC score, the higher the crime rates. In Task 4 below, we will use this insight to combine these three crime-related variables as a single variable without losing much information.
- Relative Signs: The first PC attaches roughly the same (negative) weight to each of
- PC2:
- Relative Signs: In contrast, the second PC puts a very heavy (negative) weight on
UrbanPop
and a rather small weight on the three crime-related variables, so it is mainly a measure of urbanization level. - Magnitudes: The smaller (more negative) the PC score, the more urbanized the state.
- Relative Signs: In contrast, the second PC puts a very heavy (negative) weight on
TASK 3: Generate and interpret a biplot Create a biplot of the first two principal components. Then do the following:
|
Make a Biplot
# CHUNK 6 # cex argument indicates the amount by which piotting symbois should be scal.ed # cex = 0.6 means 40% smaiier # scale = 0 ensures that the arrows are scaled to represent the PC loadings biplot(PCA, scale = 0, cex = 0.6)
Figure 6.1.4: The biplot for the USArrests
dataset with scaling
We can describe California, Virginia, and Iowa as follows:
State | First PC Score | Second PC Score | Characteristics |
California | Very Negative | Very Negative |
High crime rate High urbanization level |
Virginia | Almost Zero | Almost Zero |
Average crime rate Average urbanization level |
Iowa | Very Positive | Almost Zero |
Low crime rate Average urbanization level |
Note that the PC scores are centered to have mean zero, so a PC1 score close to zero indicates an average crime level, not a low crime level.
Which variables are more correlated with each other?
- That the arrows for the three crime-related variables in Figure 6.1.4 are quite close to one another indicates that these three variables are rather positively correlated.
- Moreover, UrbanPop is quite far away from, and so is less correlated with, the three crime-related variables.
- These findings are consistent with the correlation matrix we saw in CHUNK 4.
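For a quick confirmation without flipping back to CHUNK 4, the correlation matrix can be recomputed in one line (assuming the USArrests data is loaded):
# Murder, Assault, and Rape are strongly correlated with one another; UrbanPop much less so
round(cor(USArrests), 2)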
TASK 4: Use observations from the PCA to generate a new feature
Recommend the number of principal components to use. Justify your choice.
In CHUNK 8, we apply the summary() function to PCA, a prcomp object, to output the proportion of variance explained (PVE) and the cumulative PVE for each of the four PCs.
# CHUNK 8
summary(PCA)
Importance of components: PC1 PC2 PC3 PC4 Standard deviation 1.5749 0.9949 0.59713 0.41645 Proportion of Variance 0.6201 0.2474 0.08914 0.04336 Cumulative Proportion 0.6201 0.8675 0.95664 1.00000
The first PC explains 62.01% of the variability and the second PC explains 24.74%, for a cumulative PVE of 86.75%. As the first two PCs together are able to explain most of the variability in the data and they are easy to interpret, we would recommend using the first two PCs.
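These figures can also be recovered directly from the PC standard deviations stored in the prcomp object. A short sketch, assuming PCA is the object from CHUNK 5:
# PVE from the PC standard deviations
pve <- PCA$sdev^2 / sum(PCA$sdev^2)  # each PC's variance divided by the total variance
round(pve, 4)                        # proportion of variance explained
round(cumsum(pve), 4)                # cumulative PVE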
Alternatively, we can use the scree plot for the USArrests data.
Figure 6.1.6: A scree plot for the USArrests data
# CHUNK 8 (Cont.)
screeplot(PCA, npcs = 4, type = "lines")

# CHUNK 8 (Cont.)
screeplot(PCA, npcs = 4, type = "barplot")
Feature Generation Based on PCA
- Take the scores of the first PC (computed from all four variables) as a new feature.
In CHUNK 10, we extract the first column of the x matrix (the matrix of PC scores), insert it into the USArrests data as a new feature called PC1, and print out the first six rows of the new dataset.
# CHUNK 10
USArrests.1 <- USArrests  # make a new copy of USArrests
USArrests.1$PC1 <- PCA$x[, 1]
head(USArrests.1)
Murder Assault UrbanPop Rape PC1 Alabama 13.2 236 58 21.2 -0.9756604 Alaska 10.0 263 48 44.5 -1.9305379 Arizona 8.1 294 80 31.0 -1.7454429 Arkansas 8.8 190 50 19.5 0.1399989 California 9.0 276 91 40.6 -2.4986128 Colorado 7.9 204 78 38.7 -1.4993407
- Keep only the terms of the first PC for the variables it dominates (the three crime-related variables), drop the UrbanPop term, and use the resulting linear combination of the standardized variables as the new feature. (More recommended)
# CHUNK 11
USArrests.2 <- USArrests
# the scale() function will convert the USArrests data to a numeric matrix,
# so we use the as.data.frame() function to change it back to a data frame
USArrests.scaled <- as.data.frame(scale(USArrests.2))
USArrests.2$crime <- PCA$rotation[1, 1] * USArrests.scaled$Murder +
  PCA$rotation[2, 1] * USArrests.scaled$Assault +
  PCA$rotation[4, 1] * USArrests.scaled$Rape
head(USArrests.2)
Murder Assault UrbanPop Rape crime Alabama 13.2 236 58 21.2 -1.1205719 Alaska 10.0 263 48 44.5 -2.2676396 Arizona 8.1 294 80 31.0 -1.4675357 Arkansas 8.8 190 50 19.5 -0.1586647 California 9.0 276 91 40.6 -2.0092964 Colorado 7.9 204 78 38.7 -1.2598717
- Rerun a PCA on only the variables of interest (the three crime-related variables) and use the first resulting PC as the new feature.
# CHUNK 12
USArrests.3 <- USArrests
# Run a PCA on only the three crime-related variables
PCA.3 <- prcomp(USArrests.3[, c(1, 2, 4)], center = TRUE, scale. = TRUE)
PCA.3$rotation
PC1 PC2 PC3 Murder -0.5826006 0.5339532 -0.6127565 Assault -0.6079818 0.2140236 0.7645600 Rape -0.5393836 -0.8179779 -0.1999436
USArrests.3$PC1 <- PCA.3$x[, 1]
head(USArrests.3)
Murder Assault UrbanPop Rape PC1 Alabama 13.2 236 58 21.2 -1.1980278 Alaska 10.0 263 48 44.5 -2.3087473 Arizona 8.1 294 80 31.0 -1.5033307 Arkansas 8.8 190 50 19.5 -0.1759894 California 9.0 276 91 40.6 -2.0452358 Colorado 7.9 204 78 38.7 -1.2634133
Deleting the Original Variables
# CHUNK 13
USArrests.2$Murder <- NULL
USArrests.2$Assault <- NULL
USArrests.2$Rape <- NULL
# OR
# USArrests.2[, c(1, 2, 4)] <- NULL
head(USArrests.2)
UrbanPop crime Alabama 58 -1.1205719 Alaska 48 -2.2676396 Arizona 80 -1.4675357 Arkansas 50 -0.1586647 California 91 -2.0092964 Colorado 78 -1.2598717
Cluster Analysis
k-Means Clustering
The aim of k-means clustering is to assign each observation in a dataset to one and only one of k clusters.
Good k
k is chosen such that:
- The variation of the observations within each cluster is relatively small; the smaller the within-cluster variation, the better the clusters separate the data into homogeneous groups (see the sketch after this list).
- The variation between clusters should be large.
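One way to examine this trade-off, sketched below, is to fit k-means (using R's built-in kmeans() function, whose algorithm is described next) for several candidate values of k and compare the total within-cluster variation; the candidate range 1 to 6 and nstart = 20 are illustrative choices, not prescribed here.
# Sketch: total within-cluster variation for several candidate values of k (illustrative only)
set.seed(1)
wss <- sapply(1:6, function(k) kmeans(scale(USArrests), centers = k, nstart = 20)$tot.withinss)
plot(1:6, wss, type = "b", xlab = "k", ylab = "Total within-cluster SS")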
k-Means Clustering Algorithm
- Initialization: Randomly select k points in the feature space. These k points will serve as the initial cluster centers.
- Iteration:
- Step 1. Assign each observation to the cluster with the closest center in terms of Euclidean distance.
- Step 2. Recalculate the center of each of the k clusters.
- Step 3. Repeat Steps 1 and 2 until the cluster assignments no longer change.
The algorithm iteratively calculates the k means (more precisely, the k centers) of the clusters.
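To make the three steps concrete, here is a bare-bones sketch of the iteration written directly in R. The data (the scaled USArrests matrix), k = 2, the seed, and the use of k randomly chosen observations as the initial centers are all assumptions for illustration; in practice you would simply call kmeans().
# Illustrative sketch of the k-means iteration (not how you would do this in practice)
set.seed(1)
X <- scale(USArrests)                               # standardize the four variables
k <- 2
centers <- X[sample(nrow(X), k), , drop = FALSE]    # initialization: k observations as starting centers
assign.old <- rep(0, nrow(X))
repeat {
  # Step 1: assign each observation to the cluster with the closest center (squared Euclidean distance)
  d <- sapply(1:k, function(j) colSums((t(X) - centers[j, ])^2))
  assign.new <- max.col(-d)
  # Step 3: stop once the cluster assignments no longer change
  if (all(assign.new == assign.old)) break
  # Step 2: recalculate each cluster center as the mean of its current members
  # (assumes no cluster becomes empty, which holds here for k = 2)
  centers <- do.call(rbind, lapply(1:k, function(j) colMeans(X[assign.new == j, , drop = FALSE])))
  assign.old <- assign.new
}
table(assign.new)                                   # cluster sizes after convergence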
Issues
- Random initial assignments could lead to the algorithm reaching a local optimum, but not necessarily the global optimum; different initial cluster assignments can lead to different local optima.
- Outliers can distort the cluster arrangements.
Resolution:
- Run the k-means algorithm many times (e.g., 20 to 50) with different initial cluster assignments.
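In R, this is what the nstart argument of kmeans() does: it runs the algorithm with many random initializations and reports the run with the smallest total within-cluster variation. A minimal sketch, assuming the scaled USArrests data and k = 2 (both illustrative choices):
# Sketch: many random starts via nstart (illustrative only)
set.seed(1)
km <- kmeans(scale(USArrests), centers = 2, nstart = 20)
km$tot.withinss    # total within-cluster variation (want small)
km$betweenss       # between-cluster variation (want large)
table(km$cluster)  # cluster sizes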
Hierarchical Clustering
The cluster groupings can be displayed using a dendrogram, a tree-based visualization of the “hierarchy” of clusters produced and the resulting cluster compositions.
The linkage defines the dissimilarity between two groups of observations. Four commonly used types of linkage are:
Linkage | The Inter-cluster Dissimilarity Is … |
Complete | The maximal pairwise distance between observations in one cluster and observations in the other cluster, i.e., look at the furthest pair of observations |
Single | The minimal pairwise distance between observations in one cluster and observations in the other cluster, i.e., look at the closest pair of observations |
Average | The average of all pairwise distances between observations in one cluster and observations in the other cluster |
Centroid | The distance between the two cluster centroids (or centers) |
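For completeness, here is a minimal sketch of hierarchical clustering in R applied to the scaled USArrests data; the choice of complete linkage and of cutting the dendrogram into two clusters are assumptions for illustration only.
# Sketch: hierarchical clustering with a chosen linkage (illustrative only)
d <- dist(scale(USArrests))            # pairwise Euclidean distances
hc <- hclust(d, method = "complete")   # also try "single", "average", or "centroid"
plot(hc, cex = 0.6)                    # dendrogram
table(cutree(hc, k = 2))               # cluster sizes after cutting the tree into 2 clusters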
k-Means vs. Hierarchical Clustering
Table 6.1: Key differences between k-means clustering and hierarchical clustering
Item | k-Means Clustering | Hierarchical Clustering |
Is randomization needed? | Yes, needed for determining initial cluster centers | No |
Is the number of clusters pre-specified? | Yes, k needs to be specified | No |
Are the clusters nested? | No | Yes, a hierarchy of clusters |