
SOA ASA Exam: Predictive Analytics (PA) – 4.3. Generalized Linear Models Case Study 2


Case Study 2: GLMs for Binary Target Variables

Learning Objectives

Compared to GLMs for numeric target variables, GLM-based classifiers enjoy some subtly unique features, which will be revealed in the course of this case study. At the completion of this section, you should be able to:

  • Combine factor levels to reduce the dimension of the data.
  • Select appropriate link functions for binary target variables.
  • Implement different kinds of GLMs for binary target variables in R.
  • Incorporate an offset into a logistic regression model.
  • Interpret the results of a fitted logistic regression model.

 

Background

In this case study, we will examine the dataCar dataset in the insuranceData package. This dataset is based on a total of n = 67,856 one-year vehicle insurance policies taken out in 2004 or 2005. The variables in this dataset pertain to different characteristics of the policyholders and their vehicles. The target variable is clm, a binary variable equal to 1 if a claim occurred over the policy period and 0 otherwise.

 

Stage 1: Define the Business Problem

Objective

Our objective here is to construct appropriate GLMs to identify key factors associated with claim occurrence. Such factors will provide insurance companies offering vehicle insurance policies with useful information for understanding the claims-generating mechanism and, by extension, setting fair premium rates for their policyholders.

 

TASK 1: Edit the data

Perform data-processing by:

  • Removing questionable observations and variables that are not suitable as predictors for clm.
  • Changing veh_age and agecat to factor variables. Discuss the advantage(s) and disadvantage(s) of the factor conversions. Regardless, convert the two variables to factors.

 

Stage 2: Data Collection

Run CHUNK 1 to load the insuranceData package and the data, and print a summary of each variable.

# CHUNK 1
library(insuranceData) 
data(dataCar) 
summary(dataCar)

Or, if the data is provided as a CSV file:

# CHUNK 1
# Prior to R v4.0, the option stringsAsFactors was TRUE by default, i.e.,
# string-like variables were automatically converted to factors when read in by read.csv()
dataCar <- read.csv("vehins.csv")
# For R v4.0+, the conversion must be requested explicitly
dataCar <- read.csv("vehins.csv", stringsAsFactors = TRUE)
summary(dataCar)
     clm             exposure          veh_value        numclaims          claimcst         veh_body            veh_age         gender              area               agecat     
Min.   :0.00000   Min.   :0.002738   Min.   : 0.000   Min.   :0.00000   Min.   :    0.0   Length:67856       Min.   :1.000   Length:67856       Length:67856       Min.   :1.000  
1st Qu.:0.00000   1st Qu.:0.219028   1st Qu.: 1.010   1st Qu.:0.00000   1st Qu.:    0.0   Class :character   1st Qu.:2.000   Class :character   Class :character   1st Qu.:2.000  
Median :0.00000   Median :0.446270   Median : 1.500   Median :0.00000   Median :    0.0   Mode  :character   Median :3.000   Mode  :character   Mode  :character   Median :3.000  
Mean   :0.06814   Mean   :0.468651   Mean   : 1.777   Mean   :0.07276   Mean   :  137.3                      Mean   :2.674                                         Mean   :3.485  
3rd Qu.:0.00000   3rd Qu.:0.709103   3rd Qu.: 2.150   3rd Qu.:0.00000   3rd Qu.:    0.0                      3rd Qu.:4.000                                         3rd Qu.:5.000  
Max.   :1.00000   Max.   :0.999316   Max.   :34.560   Max.   :4.00000   Max.   :55922.1                      Max.   :4.000                                         Max.   :6.000

Findings:

  • The mean of clm (which is treated in R as a numeric variable taking only two values, 0 and 1) is 0.06814, meaning that 6.814% of the 67,856 vehicle insurance policies had at least one claim over the policy period.

 

Variable      Description                        Characteristics
veh_value     Vehicle value (in $10,000s)        Numeric with three decimal places, from 0 to 34.560
exposure      Exposure of policy                 Numeric from 0 to 1
clm           Claim occurrence                   Integer, 0 (= no) or 1 (= yes)
numclaims     Number of claims                   Integer from 0 to 4
claimcst      Claim size                         Non-negative value from 0 to 55,922.1 (= 0 if no claim)
veh_body      Vehicle body type                  Factor with 13 levels: BUS, CONVT (= convertible), COUPE, HBACK (= hatchback), HDTOP (= hardtop), MCARA (= motorized caravan), MIBUS (= minibus), PANVN (= panel van), RDSTR (= roadster), SEDAN, STNWG (= station wagon), TRUCK, UTE (= utility)
veh_age       Vehicle age                        Integer from 1 (youngest) to 4
gender        Gender of policyholder             Factor with two levels: M (= male) and F (= female)
area          Area of residence of policyholder  Factor with 6 levels: A, B, C, D, E, F
agecat        Age band of policyholder           Integer from 1 (youngest) to 6
X_OBSTAT_     (Unknown)                          Factor with one level "01101 0 0 0"

 

Data Quality Issue

Removing inappropriate observations and variables

Because the goal of this case study is to develop GLMs that allow us to predict claim occurrence based on policyholders’ characteristics, it is vitally important to make sure that the variables that serve as predictors of the GLMs are available before claims are observed. If you check the data dictionary (knowledge of the variables is important!), there are two variables whose values are known at the same time as or after claims are submitted, meaning that they cannot function as predictors of clm in practice. They are numclaims and claimcst, representing the number of claims and claim size, respectively. By definition, these two variables already imply clm. Indeed, we can check that if numclaims is 1 or more, or if claimcst is strictly positive, then clm is necessarily 1, indicating the presence of at least one claim, as the table in the first line of CHUNK 2 shows.

# CHUNK 2
table(dataCar$clm, dataCar$numclaims)
      0     1     2     3     4
0 63232     0     0     0     0
1     0  4333   271    18     2

If we build a model for clm using numclaims or claimcst as predictors, we can make almost perfect predictions because of these deterministic relationships, but this superb prediction performance is misleading: it is an instance of the target leakage issue discussed earlier. numclaims and claimcst “leak” information about clm that would not be available at the time predictions are made.

In the second part of CHUNK 2, we drop numclaims and claimcst to avoid target leakage, as well as the mysterious variable X_OBSTAT_, from the dataset.

# CHUNK 2 (cont.)
dataCar$claimcst <- NULL 
dataCar$numclaims <- NULL 
dataCar$X_OBSTAT_ <- NULL

The dataset also contains a small number of observations (53, to be exact) with a vehicle value of zero. Since veh_value is measured in $10,000s and recorded to three decimal places, a value of zero means that the vehicle is worth less than 0.001 × $10,000 = $10, which does not make practical sense. For this reason, let’s also remove these questionable observations.

# CHUNK 2 (cont.)
nrow(dataCar[dataCar$veh_value == 0, ])
[1] 53
dataCar <- dataCar[dataCar$veh_value > 0, ]

 

Factor Conversions

Of all the predictors for clm, only veh_value, exposure, veh_age, and agecat are numeric variables (at this stage). Since veh_age and agecat are labels of age groups and they do not represent exact ages, for modeling purposes it may be desirable to treat them as factors instead of numeric variables. As with most decisions in Exam PA, this comes with advantages and disadvantages:

Advantages

Even though there is an order among the age categories (e.g., vehicles with veh_age = 1 are the newest, and higher values of veh_age indicate older vehicles), treating the two variables as numeric, each with a single regression coefficient attached, imposes the undesirable restriction that every unit increase in each of the two variables is associated with the same change in the linear predictor, holding all other variables fixed (e.g., the change in the linear predictor when agecat moves from 1 to 2 equals the change when it moves from 2 to 3). This restriction is lifted if veh_age and agecat are treated as categorical predictors represented by dummy variables with separate coefficients, giving the GLM more flexibility to capture the effects of these two variables on the target variable.
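To make this restriction concrete, here is a minimal Python sketch (illustrative only; the case study itself works in R) contrasting the single-coefficient numeric coding with dummy coding. The function names, the coefficient value, and the choice of baseline level 4 (the most populous agecat level) are our own assumptions.

```python
def numeric_contribution(agecat, beta=0.25):
    # Numeric coding: one coefficient, so each unit increase in agecat
    # shifts the linear predictor by the same amount (beta).
    return beta * agecat

def dummy_encode(agecat, levels=(1, 2, 3, 4, 5, 6), baseline=4):
    # Factor coding: one indicator per non-baseline level, each of which
    # receives its own coefficient, so the effects need not be equally spaced.
    return [1 if agecat == lv else 0 for lv in levels if lv != baseline]

# Numeric coding forces equal steps: 1 -> 2 changes the predictor as much as 2 -> 3.
step_12 = numeric_contribution(2) - numeric_contribution(1)
step_23 = numeric_contribution(3) - numeric_contribution(2)
print(step_12 == step_23)   # True

# Dummy coding frees each level to have its own effect.
print(dummy_encode(2))      # [0, 1, 0, 0, 0]  (indicators for levels 1, 2, 3, 5, 6)
```

The length of the dummy vector (five indicators for six levels) is exactly the dimension inflation the Disadvantages paragraph warns about.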

 

Disadvantages

Converting veh_age and agecat into factors inflates the dimension of the data, as each non-baseline factor level must be represented by its own dummy variable, and may result in overfitting. Stepwise selection with binarization, or regularization, can help reduce the number of factor levels, but the model-fitting procedure may take much longer to complete than when the two variables are numeric.

 

The conversion of veh_age and agecat from integers to factors can be achieved by the as.factor() function, as in CHUNK 3.

# CHUNK 3 
# Before factor conversion 
class(dataCar$agecat)
class(dataCar$veh_age)

dataCar$agecat <- as.factor(dataCar$agecat)
dataCar$veh_age <- as.factor(dataCar$veh_age)

# After factor conversion
class(dataCar$agecat)
class(dataCar$veh_age)
> # CHUNK 3 # Before factor conversion
> class(dataCar$agecat)
[1] "integer"

> class(dataCar$veh_age)
[1] "integer"
> 
> dataCar$agecat <- as.factor(dataCar$agecat)
> dataCar$veh_age <- as.factor(dataCar$veh_age)
> 
> # After factor conversion
> class(dataCar$agecat)
[1] "factor"

> class(dataCar$veh_age)
[1] "factor"

 

Relevel Categorical Variables

The rest of CHUNK 3 relevels all of the categorical predictors, including veh_age and agecat, so that their baseline level is the level with the most observations.

vars.cat <- c("veh_age", "veh_body", "gender", "area", "agecat") 

for (i in vars.cat){
  table <- as.data.frame(table(dataCar[, i]))
  max <- which.max(table[, 2]) 
  level.name <- as.character(table[max, 1]) 
  dataCar[, i] <- relevel(dataCar[, i], ref = level.name) 
}

summary(dataCar)
     clm             exposure          veh_value          veh_body     veh_age   gender    area      agecat   
Min.   :0.00000   Min.   :0.002738   Min.   : 0.180   SEDAN  :22232   3:20060   F:38584   C:20530   4:16179  
1st Qu.:0.00000   1st Qu.:0.219028   1st Qu.: 1.010   HBACK  :18915   1:12254   M:29219   A:16302   1: 5734  
Median :0.00000   Median :0.446270   Median : 1.500   STNWG  :16234   2:16582             B:13328   2:12868  
Mean   :0.06811   Mean   :0.468481   Mean   : 1.778   UTE    : 4586   4:18907             D: 8161   3:15757  
3rd Qu.:0.00000   3rd Qu.:0.709103   3rd Qu.: 2.150   TRUCK  : 1742                       E: 5907   5:10722  
Max.   :1.00000   Max.   :0.999316   Max.   :34.560   HDTOP  : 1579                       F: 3575   6: 6543  
                                                      (Other): 2515                                       

For all of the categorical predictors, their baseline level is listed first and contains the greatest number of observations, as desired.

 

The next two tasks relate to univariate and bivariate data exploration.

 

TASK 2: Perform univariate exploration of the non-factor variables

Use graphical displays and summary statistics to determine if any of the non-factor variable(s) except exposure should be transformed and, if so, what transformation should be made. Perform your recommended transformation(s), if any, and delete the original variable(s).

Note: For this task, there is no need to analyze the exposure variable, which will be treated later.

 

Stage 3: Exploratory Data Analysis

We are left with veh_value as the only numeric variable (besides exposure) to analyze. It ranges from 0.180 to 34.560 and its mean (1.778) is higher than its median (1.500), suggesting that its distribution is moderately right-skewed.

This is confirmed by the histogram (see the left panel of Figure 4.3.1) in CHUNK 4. For such a strictly positive and skewed variable, the log transformation is commonly used, so we create a log-transformed version of veh_value called log_veh_value, whose distribution is much more symmetric and bell-shaped, and delete the original veh_value in the second part of CHUNK 4.

 

# CHUNK 4 
library(ggplot2)
ggplot(dataCar, aes(x = veh_value)) + 
   geom_histogram() 
ggplot(dataCar, aes(x = log(veh_value))) + 
   geom_histogram() 
dataCar$log_veh_value <- log(dataCar$veh_value) 
dataCar$veh_value <- NULL

Figure 4.3.1: A histogram for veh_value (left) and the log of veh_value (right) in the dataCar dataset.

 

TASK 3: Explore the relationship of each predictor to clm

Use graphical displays and summary statistics to form preliminary conclusions regarding which variables are likely to have significant predictive power.

 

We already explored the distribution of the target variable clm in CHUNK 1 and the distribution of veh_value in CHUNK 4. Now we turn to the relationship between clm and each predictor, numeric and categorical.

 

Numeric predictors

First we look at the relationship between clm and the two numeric variables, log_veh_value and exposure. For a categorical-numeric pair, recall that boxplots for the numeric variable split by the levels of the categorical variable will be useful.

CHUNK 5 produces these boxplots for log_veh_value and exposure indexed by the two levels (0 and 1) of clm (see Figure 4.3.2). Do remember to apply the factor() function to clm so that it is treated as a factor when making the boxplots. It appears that claim occurrence (clm = 1) is associated with a slightly higher value of log_veh_value and a much higher value of exposure.

These positive relationships between the two predictors and clm will be quantified by a GLM in later tasks. In addition to split boxplots, we can also look at the mean or median of log_veh_value and exposure over the two levels of clm, as we will do in CHUNK 6.

# CHUNK 5 
ggplot(dataCar, aes(x = factor(clm), y = log_veh_value)) + 
   geom_boxplot(fill = "red") 
ggplot(dataCar, aes(x = factor(clm), y = exposure)) + 
   geom_boxplot(fill = "red")

Figure 4.3.2: Split boxplots for log_veh_value and exposure by clm in the dataCar dataset.

# CHUNK 6 
library(tidyverse) 
dataCar %>%
  group_by_("clm") %>%
  summarise( 
    mean = mean(log_veh_value), 
    median = median(log_veh_value), 
    n = n()
  )
# A tibble: 2 × 4
    clm  mean median     n
  <int> <dbl>  <dbl> <int>
1     0 0.382  0.399 63185
2     1 0.448  0.451  4618

dataCar %>% 
  group_by_("clm") %>% 
  summarise(
    mean = mean(exposure), 
    median = median(exposure), 
    n = n()
  )
# A tibble: 2 × 4
    clm  mean median     n
  <int> <dbl>  <dbl> <int>
1     0 0.458  0.427 63185
2     1 0.611  0.637  4618

As we expect, the means and medians are higher when there is a claim, so a policyholder with a larger vehicle value or a larger exposure has a higher tendency to submit an insurance claim.

 

Categorical Predictors

The other predictors, veh_body, veh_age (after conversion), gender, area, and agecat (after conversion), are all categorical variables. For categorical-categorical combinations, filled bar charts are useful. In CHUNK 7, we use a for loop to construct the bar charts for each categorical predictor color-filled by clm.

# CHUNK 7
for (i in vars.cat) {
  plot <- ggplot(dataCar, aes(x = dataCar[, i], fill = factor(clm))) +
    geom_bar(position = "fill") +
    labs(x = i, y = "percent") +
    theme(axis.text.x = element_text(angle = 90, hjust = 1))
  print(plot)
}

 

Figure 4.3.3: A bar chart for agecat by clm in the dataCar dataset

 

If a categorical variable is an important predictor for clm, then the proportions of 0 (in red) and 1 (in green) should vary significantly across different levels of the categorical variable. More noticeable differences were observed for the following variables:

  • veh_age: Higher rate of claim occurrence for group 2.
  • veh_body: Much higher rate for BUS and lower rate for CONVT, MIBUS, and UTE.
  • area: Higher rate for area F.
  • agecat: The claim occurrence rate appears to decrease with agecat.

We can also obtain a more accurate understanding of how the categorical predictors affect clm by looking at the mean of clm (which is the claim occurrence rate) by different levels of the predictors, as in CHUNK 8. The descriptive statistics are consistent with what we observe from the filled bar charts in CHUNK 7.

# CHUNK 8
for (i in vars.cat) {
  print(i)
  x <- dataCar %>%
    group_by_(i) %>%
    summarise(mean = mean(clm),
              n = n()) 
  print(x)
}
[1] "veh_age"
# A tibble: 4 × 3
  veh_age   mean     n
  <fct>    <dbl> <int>
1 3       0.0679 20060
2 1       0.0672 12254
3 2       0.0759 16582
4 4       0.0620 18907

[1] "veh_body"
# A tibble: 13 × 3
   veh_body   mean     n
   <fct>     <dbl> <int>
 1 SEDAN    0.0664 22232
 2 BUS      0.184     38
 3 CONVT    0.0370    81
 4 COUPE    0.0872   780
 5 HBACK    0.0668 18915
 6 HDTOP    0.0823  1579
 7 MCARA    0.116    121
 8 MIBUS    0.0601   716
 9 PANVN    0.0824   752
10 RDSTR    0.0741    27
11 STNWG    0.0721 16234
12 TRUCK    0.0683  1742
13 UTE      0.0567  4586

[1] "gender"
# A tibble: 2 × 3
  gender   mean     n
  <fct>   <dbl> <int>
1 F      0.0686 38584
2 M      0.0675 29219

[1] "area"
# A tibble: 6 × 3
  area    mean     n
  <fct>  <dbl> <int>
1 C     0.0688 20530
2 A     0.0664 16302
3 B     0.0723 13328
4 D     0.0607  8161
5 E     0.0652  5907
6 F     0.0783  3575

[1] "agecat"
# A tibble: 6 × 3
  agecat   mean     n
  <fct>   <dbl> <int>
1 4      0.0682 16179
2 1      0.0863  5734
3 2      0.0724 12868
4 3      0.0705 15757
5 5      0.0572 10722
6 6      0.0556  6543

 

TASK 4: Reduce the number of factor levels where appropriate

Several of the factor variables have a small number of observations at some of the levels. Consider using knowledge of the factor levels as well as evidence from Task 3 to combine some of them into factor levels with more observations.

 

Combining Categorical Predictors

The dataCar dataset features a number of categorical predictors, some of which have a large number of levels. This significantly increases the dimension of the data and may dilute the predictive power of the GLMs we construct. To reduce the dimension of the data, we can combine factor levels with similar claim occurrence rates or with very few observations to form more populous levels, based on the claim occurrence rates grouped by factor level shown in CHUNK 8. Arguably the most problematic categorical predictor is veh_body, which has the largest number of levels (13) of the five factor variables. Some levels, like “BUS” and “RDSTR”, have very few observations, so our priority is to combine these sparse levels with more populous levels having a similar claim occurrence rate.

 

Combining BUS and MCARA into BUS-MCARA

“BUS” is extremely sparse, but it has the highest claim occurrence rate. The level with the second highest rate is “MCARA”, which is quite sparse as well. In the first two lines of CHUNK 9, we combine these two levels into a larger level called “BUS-MCARA”.

# CHUNK 9
# Combine "BUS" and "MCARA" as one level called "BUS-MCARA"
levels(dataCar$veh_body)[levels(dataCar$veh_body) == "BUS"] <- "BUS-MCARA" 
levels(dataCar$veh_body)[levels(dataCar$veh_body) == "MCARA"] <- "BUS-MCARA"

The code, adapted from the December 8, 2020 exam, is another illustration of logical subsetting we learned in Chapter 1. To understand the code, note that levels(dataCar$veh_body) is a vector with 13 elements representing the 13 levels of veh_body. In the first line, we use a logical test to extract the index corresponding to “BUS” (which is, in fact, the second element) and change the name of the factor level to “BUS-MCARA“. The same idea applies to the second line of code.

 

Combining others into OTHER

There are other sparse levels, such as “CONVT” and “RDSTR”. The claim occurrence rates in these levels are quite similar to the rates in the remaining levels (except “BUS” and “MCARA”), ranging from about 0.05 to about 0.08, so we may combine all of these levels into one big level called “OTHER”. That is what the rest of CHUNK 9 does.

# CHUNK 9 (Cont.) 
# Combine levels other than "BUS-MCARA" as "OTHER" 
levels(dataCar$veh_body)[!(levels(dataCar$veh_body) == "BUS-MCARA")] <- "OTHER"

table(dataCar$veh_body)
OTHER BUS-MCARA 
67644       159

After consolidating its factor levels, veh_body has only two levels, “OTHER” (the baseline level) and “BUS-MCARA”. The baseline level has a relatively low claim occurrence rate, while the “BUS-MCARA” level has a much higher rate, even though it has far fewer observations.

 

EXAM NOTE

    • The above is only one way to reduce the factor levels of veh_body; there are many other possibilities. On the exam, be sure to fully justify your rationale for the combinations you make.
    • The June 2019 exam model solution says that when doing combinations, “[t]he best candidates found multiple cases where factor levels could be combined, considered the similarity of means/medians, and provided adequate rationale for combining levels. It was not sufficient to only combine those levels with extremely low counts.”

 

TASK 5: Select an interaction

Select one pair of features that may be included as an interaction variable in your GLM. Do this by first proposing two variables that are likely to interact and then using appropriate graphical displays to confirm the existence of an interaction. Continue until a promising interaction has been identified. Include your selected interaction variables when constructing a GLM in subsequent tasks.

 

Interactions

Interaction between log_veh_value (numeric) and veh_body (factor)

Recall that an interaction is indicated when changing the value or level of one variable alters the relationship between the target variable and another variable. In CHUNK 10, we explore the possible interaction between log_veh_value (numeric) and veh_body (factor) via the use of boxplots for log_veh_value split by clm (factor) faceted by veh_body (see Figure 4.3.4). It is possible that the effect of log_veh_value on claim occurrence differs for the two kinds of vehicles.

# CHUNK 10
ggplot(dataCar, aes(x = factor(clm), y = log_veh_value)) + 
  geom_boxplot() + 
  facet_wrap(~ veh_body)

Figure 4.3.4: Faceted split boxplots for log_veh_value by clm and veh_body in the dataCar data.

We can see that for OTHER vehicles, claim occurrence is associated with a slightly higher log_veh_value, but the association is much more positive for BUS-MCARA vehicles. In other words, log_veh_value exerts a much stronger effect on claim occurrence for BUS-MCARA vehicles than for OTHER vehicles. We will include the interaction between log_veh_value and veh_body when constructing GLMs in later tasks.

 

Stage 4: Model Construction and Selection

We proceed to construct GLMs for explaining the claim occurrence rate based on other risk factors in the vehicle insurance dataset.

 

TASK 6: Select a link function

  • Split the data into training and test sets. With the target variable being only 0 or 1, the binomial distribution is the only reasonable choice. For the glm() function in R, there are five link functions that can be used with the binomial distribution. They are shown below, where η is the linear predictor and p is the probability that the target variable equals 1.
    • Logit (link = "logit"): \(\ln{\dfrac{p}{1-p}}=\eta\) or \(p=\dfrac{e^\eta}{1+e^\eta}\).
    • Probit (link = "probit"): \(\Phi^{-1}(p)=\eta\) or \(p=\Phi(\eta)\), where Φ is the standard normal cumulative distribution function.
    • Cauchit (link = "cauchit"): \(\tan\left[\pi\left(p-\frac{1}{2}\right)\right]=\eta\) or \(p=\dfrac{1}{2}+\dfrac{\arctan\eta}{\pi}\).
    • Log (link = "log"): \(\ln p=\eta\) or \(p=e^\eta\).
    • Complementary log-log (link = "cloglog"): \(\ln[-\ln(1-p)]=\eta\) or \(p=1-e^{-e^\eta}\).
  • Evaluate two potential link functions for applying a GLM to the training set.

Note:

  • Explain, prior to fitting the models, why your choices are reasonable for this problem.
  • Fit both models using the feature developed in Task 5 and select the better link function, justifying your choice based on metrics on the training set. Use only that link function in subsequent tasks.

 

Since the target variable clm is binary, the binomial distribution is the only reasonable candidate for the distribution of the target variable. It remains to specify an appropriate link function. Among the five link functions above, the log link can be safely ruled out: the mean of the binary target variable, given by \(p=e^\eta\) under that link, does not necessarily lie between 0 and 1 (predictions, although non-negative, may exceed 1, which is impossible for a probability). While the four remaining link functions all ensure a mean lying in the unit interval, the logit link is the most commonly used and the most interpretable, due to its intimate connection with the odds of an event (discussed on page 292). The probit, cauchit, and complementary log-log links all produce results that are more difficult to interpret because of the more complex relationship between p and η. In fact, the logit link is the default choice in R and is the canonical link function for the binomial response distribution. In what follows, we will explore the logit and probit links (you could replace probit with the cauchit or complementary log-log link).
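To see these links side by side, the following Python sketch (illustrative only; the function names are ours) evaluates each inverse link at a sample value of η and shows why the log link can yield an invalid mean:

```python
import math
from statistics import NormalDist

# Inverse link functions: each maps the linear predictor eta to the mean p.
def inv_logit(eta):
    return math.exp(eta) / (1 + math.exp(eta))

def inv_probit(eta):
    return NormalDist().cdf(eta)           # standard normal CDF

def inv_cauchit(eta):
    return 0.5 + math.atan(eta) / math.pi

def inv_cloglog(eta):
    return 1 - math.exp(-math.exp(eta))

def inv_log(eta):
    return math.exp(eta)                   # NOT guaranteed to stay below 1

eta = 0.3
for f in (inv_logit, inv_probit, inv_cauchit, inv_cloglog):
    assert 0 < f(eta) < 1                  # always valid probabilities
print(inv_log(eta))                        # about 1.35, an invalid probability
```

All four retained links also map η = 0 to p = 0.5, which is one reason their fitted values often look similar in practice.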

 

Creation of Training and Test sets

Before building any GLMs, let’s use the createDataPartition() function again to split the data into the training (75%) and test (25%) sets, as we do in CHUNK 11.

# CHUNK 11
library(caret)
set.seed(4769)
partition <- createDataPartition(y = dataCar$clm, p = 0.75, list = FALSE)
data.train <- dataCar[partition, ]
data.test <- dataCar[-partition, ]
mean(data.train$clm)
mean(data.test$clm)
> mean(data.train$clm)
[1] 0.06870784
> mean(data.test$clm)
[1] 0.06631268

Validating the data split process

To check that the two sets are representative, we observe that the claim occurrence rate in the training set is 6.87% and that in the test set is 6.63%. That the two rates are similar shows that the built-in stratification of the target variable works well.

 

Fitting the Model with the Link Functions: Logit and Probit

In CHUNK 12, we fit the logistic regression model and the probit regression model separately to the training set. The default link function for the binomial distribution is the logit link, so it does not have to be specified, but for the probit model the argument link = "probit" has to be added to the family specification in the glm() function. Because exposure will receive special treatment later (as an offset in Task 7), we use the formula clm ~ . - exposure + log_veh_value:veh_body, meaning that clm is regressed on all variables in the training set except exposure, with the interaction between log_veh_value and veh_body added (don’t miss the interaction found in Task 5!).

# CHUNK 12
logit <- glm(clm ~ . - exposure + log_veh_value:veh_body, 
             data = data.train, 
             family = binomial) 
probit <- glm(clm ~ . - exposure + log_veh_value:veh_body, 
              data = data.train, 
              family = binomial(link = "probit"))

 

Which link function to use?

The most natural way to compare these two GLMs would be to look at their predictive performance on the test set. As the task statement instructs us to justify the choice using training-set metrics instead, we look at metrics computed on the training set that assess not only the goodness of fit of a GLM but also its complexity, and strike a balance between the two. The deviance alone is not a good measure, as it reflects only the goodness of fit of the two GLMs on the training set. A better metric is the AIC, which penalizes a GLM by the number of parameters it carries (the BIC can also be used). Recall that the lower the AIC, the better the model.
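As a quick numerical illustration of how the AIC is built, here is a Python sketch; the log-likelihood value is back-solved from the AIC reported in the next part of CHUNK 12 (assuming 18 estimated coefficients, as in the model summary of CHUNK 13), not taken from R output.

```python
def aic(log_likelihood, n_params):
    # AIC = 2k - 2 ln(L): training-set fit penalized by model size;
    # lower values indicate a better balance of fit and complexity.
    return 2 * n_params - 2 * log_likelihood

# 18 coefficients and ln(L) ≈ -12659.045 reproduce the logit model's AIC.
print(round(aic(-12659.045, 18), 2))   # 25354.09
```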

In the second part of CHUNK 12, we use the AIC() function to return the AIC of each of the two GLMs.

# CHUNK 12 (Cont.) 
AIC(logit)
AIC(probit)
> # CHUNK 12 (Cont.)
> AIC(logit)
[1] 25354.09
> AIC(probit)
[1] 25353.79

The two AIC values are very close to each other, but the logistic regression model is much more interpretable than the probit model, so we will use the logit link in subsequent tasks.

 

Link Function: Logit

 

TASK 7: Specify an offset

  • Explain why the exposure variable may serve as an offset for predicting clm.
  • Fit a GLM using exposure as an offset and the link function selected in Task 6. Be sure to include the offset on an appropriate scale and explain your assumption.

 

The two GLMs in Task 6 ignored the exposure variable in the model building process. Recall from Figure 4.3.2 that exposure is strongly positively related to claim occurrence. This comes as no surprise since exposure records the effective duration of each vehicle insurance policy in the dataset (according to the data dictionary) and captures the amount of risk each policy is exposed to.

Some of the policies came into force partway into the year, and some were canceled before the end of the policy year. Other things equal, the larger the value of exposure, the higher the claim occurrence probability. A good way to put exposure to use is to take it as an offset and revise the construction of the logistic regression model in Task 6. Since the model involves the logit of the claim occurrence probability, i.e., the log of the odds of claim occurrence, we impose exposure on the log scale, so that the model equation reads

\(\ln(\text{odds}) = \ln(\text{exposure}) + \eta\),

where η is the linear predictor without the offset, or equivalently, \(\text{odds} = \text{exposure}\times e^{\eta}\).

This means that the odds of claim occurrence, which runs on a scale from 0 to +∞, is directly proportional to the value of exposure, which is a rather reasonable assumption.
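The proportionality between the odds and exposure can be sanity-checked numerically. A small Python sketch (the value of η and the function names are hypothetical):

```python
import math

def claim_odds(exposure, eta):
    # ln(odds) = ln(exposure) + eta  <=>  odds = exposure * exp(eta)
    return exposure * math.exp(eta)

def odds_to_prob(odds):
    # invert odds = p / (1 - p)
    return odds / (1 + odds)

eta = -1.5                       # hypothetical linear predictor value
full = claim_odds(1.0, eta)      # policy in force for the full year
half = claim_odds(0.5, eta)      # policy in force for half a year

print(half / full)               # 0.5: odds scale in proportion to exposure
print(odds_to_prob(full) > odds_to_prob(half))   # True: probability moves the same way
```

Note that it is the odds, not the probability itself, that is proportional to exposure; the probability stays safely below 1 no matter how large the odds become.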

In CHUNK 13, we refit the logistic regression model for clm using exposure as an offset and output its AIC and summary. Notice the extra option offset = log(exposure) in the glm() function that specifies the logarithm of exposure (not exposure itself!) as an offset.

# CHUNK 13 
logit.offset <- glm(clm ~ . - exposure + log_veh_value:veh_body, 
                    data = data.train, 
                    offset = log(exposure), 
                    family = binomial)

AIC(logit.offset) 
summary(logit.offset)
> AIC(logit.offset) 
[1] 24459.17
> summary(logit.offset)

Call:
glm(formula = clm ~ . - exposure + log_veh_value:veh_body, family = binomial, 
    data = data.train, offset = log(exposure))

Coefficients:
                                  Estimate Std. Error z value Pr(>|z|)    
(Intercept)                     -1.8401822  0.0553407 -33.252  < 2e-16 ***
veh_bodyBUS-MCARA                1.2064692  0.5839503   2.066 0.038824 *  
veh_age1                        -0.0237603  0.0577435  -0.411 0.680720    
veh_age2                         0.0707487  0.0489945   1.444 0.148735    
veh_age4                        -0.0258168  0.0530979  -0.486 0.626818    
genderM                         -0.0471309  0.0367956  -1.281 0.200233    
areaA                           -0.0418634  0.0489577  -0.855 0.392500    
areaB                            0.0839797  0.0503891   1.667 0.095589 .  
areaD                           -0.1630404  0.0637039  -2.559 0.010487 *  
areaE                           -0.1027563  0.0706512  -1.454 0.145831    
areaF                            0.0636621  0.0798244   0.798 0.425145    
agecat1                          0.2960493  0.0653958   4.527 5.98e-06 ***
agecat2                          0.0788806  0.0537650   1.467 0.142339    
agecat3                         -0.0003662  0.0514177  -0.007 0.994317    
agecat5                         -0.2342049  0.0604906  -3.872 0.000108 ***
agecat6                         -0.2749970  0.0746007  -3.686 0.000228 ***
log_veh_value                    0.1470760  0.0400693   3.671 0.000242 ***
veh_bodyBUS-MCARA:log_veh_value -0.5307286  0.6035585  -0.879 0.379221    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 24567  on 50852  degrees of freedom
Residual deviance: 24423  on 50835  degrees of freedom
AIC: 24459

Number of Fisher Scoring iterations: 5

The AIC of the logistic regression model with an offset is 24,459.17, down from 25,354.09 (CHUNK 12) when an offset is absent. This indicates that the incorporation of the offset leads to a moderate amount of improvement on the quality of the logistic regression model.
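As an aside, the reason for taking the logarithm of exposure can be seen from the model equation. With the logit link, the linear predictor is the log-odds, so an offset of \(\ln(\text{exposure})\) makes the odds of claim occurrence directly proportional to the exposure:

\[
\ln\frac{p}{1-p} = \ln(\text{exposure}) + \beta_0 + \beta_1 x_1 + \cdots
\quad\Longleftrightarrow\quad
\frac{p}{1-p} = \text{exposure} \times e^{\beta_0 + \beta_1 x_1 + \cdots}.
\]

All else equal, a policy in force for twice as long therefore has twice the estimated odds of a claim, which is an intuitively sensible way to account for unequal policy periods.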

 

TASK 8: Generate confusion matrices and AUCs

  • Generate confusion matrices (based on an appropriate cutoff), plot the ROC curves, and calculate AUCs for the model fitted in Task 7 on the training set and the test set.
  • Based on these metrics, comment on the quality of the model.

 

Constructing Confusion Matrices

To get a feel for how well the logistic regression model fitted in Task 7 is doing, we can construct a confusion matrix on the training set and another one on the test set corresponding to a certain cutoff. In R, a confusion matrix can be conveniently created by the confusionMatrix() function from the caret package. This function takes a vector of predicted classes and a vector of observed classes as its first two arguments (if you are unsure which vector should come first on the exam, run ?confusionMatrix to consult the function’s documentation), and returns the confusion matrix together with a number of summary statistics.

To obtain the vector of predicted classes from the vector of predicted probabilities, we notice from CHUNK 3 that only 6.811% of the policies in the whole dataset have a claim, so using a cutoff of 0.5 (the most intuitive value) would result in almost every policy being predicted to have no claim. To make the logistic regression model produce more meaningful predictions, we set the cutoff to the claim prevalence, 0.06811.
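As a sanity check, this cutoff is simply the overall claim prevalence, which can be recomputed directly (a quick sketch, assuming dataCar has been loaded and cleaned as in the earlier chunks):

```r
# clm is coded 0/1, so its mean is the proportion of policies with a claim
mean(dataCar$clm)  # approximately 0.06811
```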

In CHUNK 14, we produce the predicted class probabilities and predicted class labels for the logistic regression model, and generate a confusion matrix on each of the training set and the test set using a cutoff of 0.06811.

# CHUNK 14 
cutoff <- 0.06811 # you can try other values 

# Generate predicted probabilities on the training set 
pred.train <- predict(logit.offset, type = "response") 
# Turn predicted probabilities into predicted classes 
class.train <- 1*(pred.train > cutoff) # OR ifelse(pred.train > cutoff, 1, 0) 
confusionMatrix(factor(class.train), factor(data.train$clm), positive = "1")
Confusion Matrix and Statistics

          Reference
Prediction     0     1
         0 25474  1017
         1 21885  2477
                                         
               Accuracy : 0.5496         
                 95% CI : (0.5453, 0.554)
    No Information Rate : 0.9313         
    P-Value [Acc > NIR] : 1              
                                         
                  Kappa : 0.0655         
                                         
 Mcnemar's Test P-Value : <2e-16         
                                         
            Sensitivity : 0.70893        
            Specificity : 0.53789        
         Pos Pred Value : 0.10167        
         Neg Pred Value : 0.96161        
             Prevalence : 0.06871        
         Detection Rate : 0.04871        
   Detection Prevalence : 0.47907        
      Balanced Accuracy : 0.62341        
                                         
       'Positive' Class : 1 

 

# CHUNK 14 (Cont.)
# Generate predicted probabilities on the test set
pred.test <- predict(logit.offset, newdata = data.test, type = "response")
class.test <- 1*(pred.test > cutoff)
confusionMatrix(factor(class.test), factor(data.test$clm), positive = "1")
Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0 8483  332
         1 7343  792
                                          
               Accuracy : 0.5472          
                 95% CI : (0.5397, 0.5547)
    No Information Rate : 0.9337          
    P-Value [Acc > NIR] : 1               
                                          
                  Kappa : 0.0617          
                                          
 Mcnemar's Test P-Value : <2e-16          
                                          
            Sensitivity : 0.70463         
            Specificity : 0.53602         
         Pos Pred Value : 0.09736         
         Neg Pred Value : 0.96234         
             Prevalence : 0.06631         
         Detection Rate : 0.04673         
   Detection Prevalence : 0.47994         
      Balanced Accuracy : 0.62032         
                                          
       'Positive' Class : 1 

It is important to note that:

  • The first two arguments of the confusionMatrix() function must be factors, so we need to apply the factor() function to class.train, data.train$clm, class.test, and data.test$clm.
  • By default, R will take the alpha-numerically first level (which is “0”, or no claims) as the level of interest. By adding the option positive = "1" to the confusionMatrix() function, we explicitly specify the factor level that corresponds to the “positive” result to be “1”
  • Curiously, very few past PA exams specify the positive argument. Without it, the “sensitivity” and “specificity” in the output are, in fact, the specificity and sensitivity we are interested in, respectively, so the output has to be read with care.

 

Findings:

From the two confusion matrices, we can see that:

  • Sensitivity vs. specificity: In both matrices, the accuracy is only marginally higher than 0.5, but the accuracy does not take into account the fact that the insurance company should be more interested in correctly classifying policyholders with a claim (who will file for payment) than correctly classifying policyholders without a claim. In other words, the sensitivity of the logistic regression model (with “1” as the positive class, as in the R output above) should be of more interest to the insurance company than the specificity. We can see that the model has a decent sensitivity (close to 0.70) on both the training and test sets whereas its specificity is relatively low, meaning that the model does a pretty good job of identifying policies with a claim, but it is not very good at identifying no-claim policies. If time allows, it is a good move to tune the cutoff over a wide range of values and decide on one that strikes a better balance between sensitivity and specificity.
  • Training vs. test: The accuracy, sensitivity, and specificity of the model on the test set are all slightly lower than the corresponding quantities on the training set. This is an indication of a small (but not serious) extent of overfitting for the logistic regression model.
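The suggested cutoff tuning can be sketched as follows (an illustration only, assuming pred.train and data.train from the preceding chunks are in memory):

```r
# Evaluate sensitivity and specificity on the training set
# over a grid of candidate cutoffs
for (cf in seq(0.02, 0.20, by = 0.02)) {
  pred.class <- 1*(pred.train > cf)
  sens <- mean(pred.class[data.train$clm == 1] == 1)  # true positive rate
  spec <- mean(pred.class[data.train$clm == 0] == 0)  # true negative rate
  cat(sprintf("cutoff = %.2f: sensitivity = %.3f, specificity = %.3f\n",
              cf, sens, spec))
}
```

A cutoff could then be chosen where the two rates cross or where a business-driven trade-off between missed claims and false alarms is achieved.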

 

Determining AUCs and ROC Plots

Confusion matrices are based on a pre-specified cutoff. As we learned in Subsection 4.1.4, the AUC is another performance metric of a classifier, aggregated over all possible cutoffs from 0 to 1. In R, the AUC of a classifier can be computed by the roc() and auc() functions from the pROC package (available on the exam). The roc() function takes the observed levels of the target variable and the predicted probabilities, in that order, and produces a list, which can be passed to the plot() function to plot the ROC curve and to the auc() function to return the AUC of the classifier.

In CHUNK 15, we make an ROC plot (see Figure 4.3.5) and obtain the AUC of the logistic regression model on both the training and test sets.

# CHUNK 15 
library(pROC)

roc.train <- roc(data.train$clm, pred.train)
plot(roc.train)
auc(roc.train)

roc.test <- roc(data.test$clm, pred.test)
plot(roc.test)
auc(roc.test)
> auc(roc.train)
Area under the curve: 0.6647

> auc(roc.test)
Area under the curve: 0.6581

The command par(pty = "s"), used on a few PA exams (but not in the PA modules), generates a square (which is what “s” stands for) plotting region, which improves the appearance of the ROC plot.

The two AUCs are close to 0.66, which means that the model is making better (but perhaps not way better) predictions than by chance. The fact that the test AUC is slightly lower than the training AUC supports our earlier conclusion that there is a small extent of overfitting for the logistic regression model.

Figure 4.3.5: The ROC plots for the full logistic regression model on the training set (left) and test set (right) in the dataCar data.


 

TASK 9: Select features using stepwise selection

Some of the features may lack predictive power and result in overfitting.

  • Explain whether or not it is a reasonable suggestion to binarize the factor variables before doing stepwise selection in this problem.
  • Perform forward selection using the BIC.
  • Calculate the AUC of the final model on the test set.

When finished with this task, this will be your final model form.

 

Now that we have fitted a logistic regression model (equipped with a good offset) and examined its preliminary performance, it is time to simplify the model by retaining only those features with strong predictive power, thereby preventing overfitting. Recall from the model summary in CHUNK 13 that the model currently has a total of 17 features (excluding the intercept), many of which are dummy variables representing the categorical predictors. The model summary shows that only a few of these features are statistically significant, so feature selection will likely be fruitful.

 

Binarization

Before we use a function like stepAIC() (which works for GLMs, not just for linear models) to perform stepwise selection on the fitted logistic regression model, it is desirable to explicitly binarize the categorical predictors with three or more levels, as noted in Subsection 3.4.2. This allows the feature selection function to treat each level of a multi-level categorical predictor separately and add or remove one dummy variable at a time, instead of having to add or remove all of the dummy variables of the categorical predictor at once. As we can see from the summary output in CHUNK 13, only certain levels of area and agecat are significant relative to the baseline, so it may be sensible to combine the insignificant levels with the baseline and retain only the significant levels for prediction. Explicit binarization makes this possible.

In CHUNK 16, we use the dummyVars() function from the caret package, first seen in Subsection 3.4.2, to binarize the categorical predictors with three or more levels, namely agecat, area, and veh_age.

# CHUNK 16 
library(caret) 
binarizer <- dummyVars(~ agecat + area + veh_age, 
                       data = dataCar, 
                       fullRank = TRUE) 
binarized.vars <- data.frame(predict(binarizer, newdata = dataCar)) 
head(binarized.vars)
  agecat.1 agecat.2 agecat.3 agecat.5 agecat.6 area.A area.B area.D area.E area.F veh_age.1 veh_age.2 veh_age.4
1        0        1        0        0        0      0      0      0      0      0         0         0         0
2        0        0        0        0        0      1      0      0      0      0         0         1         0
3        0        1        0        0        0      0      0      0      1      0         0         1         0
4        0        1        0        0        0      0      0      1      0      0         0         1         0
5        0        1        0        0        0      0      0      0      0      0         0         0         1
6        0        0        0        0        0      0      0      0      0      0         0         0         0

Then we attach the binarized variables to the dataCar dataset (with a .bin suffix), delete the original categorical variables, and reuse the partition vector generated in CHUNK 11 to partition the dataset into the training and test sets, now with binarized variables.

# CHUNK 16 (Cont.) 
# Attach the binarized variables 
dataCar.bin <- cbind(dataCar, binarized.vars) 
# Remove the original factor variables 
dataCar.bin$agecat <- NULL 
dataCar.bin$area <- NULL 
dataCar.bin$veh_age <- NULL 
# Set up the binarized training and test sets 
data.train.bin <- dataCar.bin[partition, ] 
data.test.bin <- dataCar.bin[-partition, ] 

In CHUNK 17, we fit the logistic regression model (with interaction) to the binarized training set and print a model summary.

# CHUNK 17 
# Fit the Logistic regression model to the binarized training set 
logit.full <- glm(clm ~ . - exposure + log_veh_value:veh_body, 
                  data = data.train.bin, 
                  offset = log(exposure), 
                  family = binomial)
summary(logit.full) 
Call:
glm(formula = clm ~ . - exposure + log_veh_value:veh_body, family = binomial, 
    data = data.train.bin, offset = log(exposure))

Coefficients:
                                  Estimate Std. Error z value Pr(>|z|)    
(Intercept)                     -1.8401822  0.0553407 -33.252  < 2e-16 ***
veh_bodyBUS-MCARA                1.2064692  0.5839503   2.066 0.038824 *  
genderM                         -0.0471309  0.0367956  -1.281 0.200233    
log_veh_value                    0.1470760  0.0400693   3.671 0.000242 ***
agecat.1                         0.2960493  0.0653958   4.527 5.98e-06 ***
agecat.2                         0.0788806  0.0537650   1.467 0.142339    
agecat.3                        -0.0003662  0.0514177  -0.007 0.994317    
agecat.5                        -0.2342049  0.0604906  -3.872 0.000108 ***
agecat.6                        -0.2749970  0.0746007  -3.686 0.000228 ***
area.A                          -0.0418634  0.0489577  -0.855 0.392500    
area.B                           0.0839797  0.0503891   1.667 0.095589 .  
area.D                          -0.1630404  0.0637039  -2.559 0.010487 *  
area.E                          -0.1027563  0.0706512  -1.454 0.145831    
area.F                           0.0636621  0.0798244   0.798 0.425145    
veh_age.1                       -0.0237603  0.0577435  -0.411 0.680720    
veh_age.2                        0.0707487  0.0489945   1.444 0.148735    
veh_age.4                       -0.0258168  0.0530979  -0.486 0.626818    
veh_bodyBUS-MCARA:log_veh_value -0.5307286  0.6035585  -0.879 0.379221    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 24567  on 50852  degrees of freedom
Residual deviance: 24423  on 50835  degrees of freedom
AIC: 24459

Number of Fisher Scoring iterations: 5

The summary output is identical to that in CHUNK 13, showing that the binarization is done properly and does not have any effect on the model, but the automatic stepwise selection procedure we are going to carry out will recognize each level of the categorical predictors as a separate feature.
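If desired, this equivalence can also be verified programmatically (a quick sketch, assuming logit.offset from CHUNK 13 and logit.full are both in memory; the dummy variable names differ between the two fits, so we compare the overall model fits rather than the coefficient names):

```r
# The two fits should be identical up to the naming of the dummy variables
all.equal(deviance(logit.offset), deviance(logit.full))  # TRUE
all.equal(AIC(logit.offset), AIC(logit.full))            # TRUE
```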

 

Stepwise Selection

In CHUNK 18, we run the stepAIC() function on the full logistic regression model. We have been asked to do forward selection using the BIC, which is requested via the option k = log(nrow(data.train.bin)); the BIC penalty per parameter is log(50,853) ≈ 10.84, much heavier than the AIC’s penalty of 2. Since the BIC represents a more conservative approach to feature selection, it helps the insurance company identify only the key factors affecting claim occurrence rates. Note that although the stepwise output below is labeled “AIC,” the values shown are in fact BIC values because of the k argument.

# CHUNK 18 
library(MASS)
logit.null <- glm(clm ~ 1,
                  data = data.train.bin,
                  offset = log(exposure),
                  family = binomial)
logit.reduced <- stepAIC(logit.null,
                         direction = "forward", 
                         k = log(nrow(data.train.bin)),
                         scope = list(upper = logit.full, lower = logit.null))
summary(logit.reduced)
Start:  AIC=24578.19
clm ~ 1

                Df Deviance   AIC
+ log_veh_value  1    24532 24553
+ agecat.1       1    24533 24554
+ agecat.5       1    24541 24563
+ agecat.6       1    24542 24564
+ veh_age.4      1    24549 24571
+ veh_age.2      1    24554 24575
<none>                24567 24578
+ area.D         1    24558 24580
+ agecat.2       1    24559 24581
+ area.F         1    24561 24583
+ veh_body       1    24562 24583
+ area.B         1    24562 24584
+ area.E         1    24566 24587
+ gender         1    24566 24588
+ veh_age.1      1    24566 24588
+ area.A         1    24566 24588
+ agecat.3       1    24567 24589

Step:  AIC=24553.41
clm ~ log_veh_value

            Df Deviance   AIC
+ agecat.1   1    24498 24531
+ agecat.5   1    24505 24538
+ agecat.6   1    24512 24544
<none>            24532 24553
+ area.D     1    24522 24554
+ area.B     1    24524 24557
+ agecat.2   1    24524 24557
+ veh_age.2  1    24527 24560
+ veh_body   1    24527 24560
+ gender     1    24528 24561
+ area.F     1    24529 24561
+ area.E     1    24529 24562
+ veh_age.1  1    24530 24563
+ veh_age.4  1    24531 24563
+ area.A     1    24531 24564
+ agecat.3   1    24532 24564

Step:  AIC=24530.58
clm ~ log_veh_value + agecat.1

            Df Deviance   AIC
+ agecat.5   1    24479 24522
+ agecat.6   1    24482 24526
+ agecat.2   1    24485 24528
<none>            24498 24531
+ area.D     1    24488 24532
+ area.B     1    24490 24534
+ veh_body   1    24494 24537
+ veh_age.2  1    24494 24537
+ gender     1    24495 24538
+ area.F     1    24496 24539
+ area.E     1    24496 24539
+ agecat.3   1    24496 24539
+ veh_age.1  1    24496 24540
+ veh_age.4  1    24497 24541
+ area.A     1    24497 24541

Step:  AIC=24522.16
clm ~ log_veh_value + agecat.1 + agecat.5

            Df Deviance   AIC
+ agecat.6   1    24457 24511
<none>            24479 24522
+ area.D     1    24469 24524
+ area.B     1    24470 24525
+ agecat.2   1    24472 24526
+ veh_body   1    24474 24528
+ veh_age.2  1    24475 24529
+ gender     1    24476 24530
+ area.E     1    24476 24530
+ area.F     1    24477 24531
+ veh_age.1  1    24477 24531
+ veh_age.4  1    24478 24532
+ area.A     1    24478 24532
+ agecat.3   1    24479 24533

Step:  AIC=24511.36
clm ~ log_veh_value + agecat.1 + agecat.5 + agecat.6

            Df Deviance   AIC
<none>            24457 24511
+ area.D     1    24448 24514
+ area.B     1    24449 24514
+ veh_body   1    24452 24517
+ veh_age.2  1    24452 24517
+ agecat.2   1    24454 24519
+ area.E     1    24455 24520
+ gender     1    24455 24520
+ veh_age.1  1    24456 24521
+ area.F     1    24456 24521
+ veh_age.4  1    24456 24521
+ agecat.3   1    24456 24522
+ area.A     1    24456 24522


> summary(logit.reduced)

Call:
glm(formula = clm ~ log_veh_value + agecat.1 + agecat.5 + agecat.6, 
    family = binomial, data = data.train.bin, offset = log(exposure))

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)   -1.84794    0.02533 -72.959  < 2e-16 ***
log_veh_value  0.15913    0.02917   5.455 4.90e-08 ***
agecat.1       0.27479    0.05845   4.701 2.59e-06 ***
agecat.5      -0.25995    0.05300  -4.905 9.36e-07 ***
agecat.6      -0.30815    0.06856  -4.495 6.97e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 24567  on 50852  degrees of freedom
Residual deviance: 24457  on 50848  degrees of freedom
AIC: 24467

Number of Fisher Scoring iterations: 5

The final model carries only four features:

log_veh_value, agecat.1, agecat.5, and agecat.6

The interaction term is not added. The small size of the final model comes as no surprise because the BIC favors simpler models and tends to retain only highly significant features. Indeed, in the summary output of the final model, all four retained features are highly statistically significant.

CHUNK 19 generates the AUC for the final model.

# CHUNK 19 
pred.reduced <- predict(logit.reduced, newdata = data.test.bin, type= "response") 
roc.reduced <- roc(data.test$clm, pred.reduced) 
auc(roc.reduced)
Area under the curve: 0.6595

The AUC, 0.6595, is slightly higher than that of the full model (0.6581). Not only does the final model have slightly better prediction performance, it is also much simpler, carrying only four features. In terms of both prediction performance and interpretability, the final model is superior, so we will use it in the final task.

 

Interpretation of Model Results

TASK 10: Interpret the model

  • Run the selected model from Task 9 on the full dataset and provide the output.
  • Interpret the results of the model by examining the effects of changes in the key variables on:
    • Odds of claim occurrence
    • Probability of claim occurrence
  • Describe the results in a manner that will provide useful information to an insurance company.

To wrap up our model development procedure, we refit the reduced logistic regression model to the full dataset (note the argument data = dataCar.bin, not data = data.train.bin) and save it as a glm object named logit.final. This refitting is done because in a real-world application new data will be available in the future on a periodic basis and these data can be combined with the existing data to improve the robustness of the model. For moderately sized datasets, this refitting may give rise to a meaningful improvement in the accuracy of the parameter estimates with the use of extra observations. The results are in CHUNK 20.

# CHUNK 20 
logit.final <- glm(clm ~ log_veh_value + agecat.1 + agecat.5 + agecat.6, 
                   data = dataCar.bin, 
                   offset = log(exposure), 
                   family = binomial) 
summary(logit.final)

# Exponentiate the coefficients to get the multiplicative effects 
exp(coef(logit.final))
Call:
glm(formula = clm ~ log_veh_value + agecat.1 + agecat.5 + agecat.6, 
    family = binomial, data = dataCar.bin, offset = log(exposure))

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)   -1.86185    0.02205 -84.438  < 2e-16 ***
log_veh_value  0.15843    0.02532   6.257 3.92e-10 ***
agecat.1       0.25731    0.05138   5.008 5.51e-07 ***
agecat.5      -0.25433    0.04603  -5.525 3.29e-08 ***
agecat.6      -0.23858    0.05784  -4.125 3.72e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 32572  on 67802  degrees of freedom
Residual deviance: 32445  on 67798  degrees of freedom
AIC: 32455

Number of Fisher Scoring iterations: 5


> # Exponentiate the coefficients to get the multiplicative effects 
> exp(coef(logit.final)) 
  (Intercept) log_veh_value      agecat.1      agecat.5      agecat.6 
    0.1553857     1.1716752     1.2934442     0.7754369     0.7877484

There are two ways to interpret the output of a fitted logistic regression model.

 

Odds-Based Interpretation

Table 4.4 provides the odds-based interpretation of the final model using the value of each estimated coefficient. The statements there are in terms of multiplicative changes in the odds of claim occurrence. Equivalently, we can state the findings in terms of percentage changes. For example, the odds of claim occurrence are estimated to increase by e^0.15843 − 1 = 17.17% for every unit increase in log_veh_value, holding all other features fixed.

Table 4.4: Odds-based interpretation of the reduced logistic regression model fitted to the full dataCar dataset.

Feature: log_veh_value (coefficient estimate 0.15843)
A unit increase in the log of vehicle value is associated with a multiplicative increase of e^0.15843 = 1.1717 (> 1) in the odds of claim occurrence, holding all other features fixed. The fact that this factor exceeds 1 makes intuitive sense: the more valuable the vehicle, the more likely a driver will submit insurance claims in the case of accidental damage.

Feature: agecat.1 (coefficient estimate 0.25731)
The odds of claim occurrence for policyholders in age category 1 are e^0.25731 = 1.2934 (> 1) times those in the baseline level (which is now age categories 2, 3, and 4 combined). It is possible that younger drivers are more reckless and tend to have higher accident rates.

Feature: agecat.5 (coefficient estimate -0.25433)
The odds of claim occurrence for policyholders in age category 5 are e^-0.25433 = 0.7754 (< 1) times those in the baseline level. This also aligns with intuition, as more mature drivers are more careful and tend to have lower accident rates.

Feature: agecat.6 (coefficient estimate -0.23858)
The odds of claim occurrence for policyholders in age category 6 are e^-0.23858 = 0.7877 (< 1) times those in the baseline level.
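The odds ratios in Table 4.4 (together with Wald confidence intervals) can be reproduced directly in R; a quick sketch, assuming logit.final from CHUNK 20 is in memory:

```r
# Multiplicative effects on the odds, with 95% Wald confidence intervals
exp(cbind(OR = coef(logit.final), confint.default(logit.final)))
```

An interval lying entirely above (below) 1 corresponds to a feature that significantly increases (decreases) the odds of claim occurrence.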

 

Probability-Based Interpretation

Translating the results of a logistic regression model into statements for the probability of the event of interest is harder. This is because the probability p = e^η / (1 + e^η) = 1 / (1 + e^−η) is a non-linear, rather complex function of the linear predictor η, which makes it more difficult to describe how much the probability will change (in multiplicative or percentage terms) in response to a unit increase in a predictor.
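To see this non-linearity concretely, the following sketch (hypothetical numbers, using base R’s plogis(), which computes e^η / (1 + e^η)) shows that the same 0.5 increase in the linear predictor moves the probability by very different amounts depending on the starting value of η:

```r
# plogis(eta) = exp(eta) / (1 + exp(eta)), the inverse logit
plogis(-2.5 + 0.5) - plogis(-2.5)  # small change near a low baseline probability
plogis( 0.0 + 0.5) - plogis( 0.0)  # much larger change near p = 0.5
```

This is why probability-based statements must be anchored to a specific baseline policyholder, as done below.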

One interesting way to make probability-based statements, as suggested in the December 2019 PA exam and the Hospital Readmissions sample project, is to use as the baseline case the “average policyholder,” who has the mean values of numeric predictors and modal (i.e., most common) levels of categorical predictors. Then we change the value of one and only one feature at a time and examine the isolated impact of that change on the claim occurrence probability.

In CHUNK 21, we create a data frame called new.data holding the feature combinations we want to explore. The first row of the data frame contains the feature values of the “average policyholder,” who has exposure = 0.468481, log_veh_value = 0.38675, and agecat.1, agecat.5, and agecat.6 all equal to 0 (the policyholder is in the baseline age category). In each of the five remaining rows, we alter the value of one and only one feature (a 10% increase in exposure, a 10% increase in log_veh_value, then each agecat dummy in turn) to look at the isolated effect of that predictor on the predicted claim occurrence probability.

# CHUNK 21 
# To look at mean or modal values 
summary(dataCar)
     clm             exposure             veh_body     veh_age   gender    area      agecat    log_veh_value     
Min.   :0.00000   Min.   :0.002738   OTHER    :67644   3:20060   F:38584   C:20530   4:16179   Min.   :-1.71480  
1st Qu.:0.00000   1st Qu.:0.219028   BUS-MCARA:  159   1:12254   M:29219   A:16302   1: 5734   1st Qu.: 0.00995  
Median :0.00000   Median :0.446270                     2:16582             B:13328   2:12868   Median : 0.40547  
Mean   :0.06811   Mean   :0.468481                     4:18907             D: 8161   3:15757   Mean   : 0.38675  
3rd Qu.:0.00000   3rd Qu.:0.709103                                         E: 5907   5:10722   3rd Qu.: 0.76547  
Max.   :1.00000   Max.   :0.999316                                         F: 3575   6: 6543   Max.   : 3.54270

 

new.data <- data.frame(
  exposure = c(0.468481, 0.515329, 0.468481, 0.468481, 0.468481, 0.468481),
  log_veh_value = c(0.38675, 0.38675, 0.42543, 0.38675, 0.38675, 0.38675),
  agecat.1 = c(0, 0, 0, 1, 0, 0), 
  agecat.5 = c(0, 0, 0, 0, 1, 0), 
  agecat.6 = c(0, 0, 0, 0, 0, 1)
)
new.data
  exposure log_veh_value agecat.1 agecat.5 agecat.6
1 0.468481       0.38675        0        0        0
2 0.515329       0.38675        0        0        0
3 0.468481       0.42543        0        0        0
4 0.468481       0.38675        1        0        0
5 0.468481       0.38675        0        1        0
6 0.468481       0.38675        0        0        1

 

predict(logit.final, newdata = new.data, type = "response")
         1          2          3          4          5          6 
0.07183549 0.07845544 0.07224517 0.09099701 0.05661722 0.05746447

Table 4.5 summarizes the results (in each row after the first, the value that is changed from the “average policyholder” is marked with *).

exposure    log_veh_value  agecat.1  agecat.5  agecat.6  Predicted Probability
0.468481    0.38675        0         0         0         0.07183549
0.515329*   0.38675        0         0         0         0.07845544
0.468481    0.42543*       0         0         0         0.07224517
0.468481    0.38675        1*        0         0         0.09099701
0.468481    0.38675        0         1*        0         0.05661722
0.468481    0.38675        0         0         1*        0.05746447

 

We can see that:

  • A 10% increase in exposure increases the claim occurrence probability by 0.0785 − 0.0718 = 0.0067.
  • A 10% increase in log_veh_value increases the claim occurrence probability by 0.0722 − 0.0718 = 0.0004.
  • Being in age category 1 increases the claim occurrence probability by 0.0910 − 0.0718 = 0.0192 compared to being in the baseline age category.
  • Being in age category 5 decreases the claim occurrence probability by 0.0718 − 0.0566 = 0.0152 compared to being in the baseline age category.
  • Being in age category 6 decreases the claim occurrence probability by 0.0718 − 0.0575 = 0.0143 compared to being in the baseline age category.

The signs of these changes align well with the signs of the coefficient estimates in CHUNK 20. For example, an increase in log_veh_value and being in age category 1 contribute positively to the claim occurrence probability, whereas being in age categories 5 and 6 exerts a negative effect. Regardless of whether our statements are based on odds or probability, overall our results shed light on the directional impact of the key risk factors affecting the claim-generating mechanism and inform the insurance company as to what characteristics of its policyholders and their vehicles should enter its ratemaking plan.