Case Study 2: GLMs for Binary Target Variables
Learning Objectives
Compared to GLMs for numeric target variables, GLM-based classifiers enjoy some subtly unique features, which will be revealed in the course of this case study. At the completion of this section, you should be able to:
- Combine factor levels to reduce the dimension of the data.
- Select appropriate link functions for binary target variables.
- Implement different kinds of GLMs for binary target variables in R.
- Incorporate an offset into a logistic regression model.
- Interpret the results of a fitted logistic regression model.
Background
In this case study, we will examine the dataCar dataset in the insuranceData package. This dataset is based on a total of n = 67,856 one-year vehicle insurance policies taken out in 2004 or 2005. The variables in this dataset pertain to different characteristics of the policyholders and their vehicles. The target variable is clm, a binary variable equal to 1 if a claim occurred over the policy period and 0 otherwise.
Stage 1: Define the Business Problem
Objective
Our objective here is to construct appropriate GLMs to identify key factors associated with claim occurrence. Such factors will provide insurance companies offering vehicle insurance policies with useful information for understanding the claims-generating mechanism and, by extension, setting fair premium rates for their policyholders.
TASK 1: Edit the data
Perform data processing by:
Stage 2: Data Collection
Run CHUNK 1
to load the insuranceData
package and the data, and print a summary of each variable.
# CHUNK 1
library(insuranceData)
data(dataCar)
summary(dataCar)

Or

# CHUNK 1
# Prior to R v4.0, the option stringsAsFactors was TRUE by default, i.e.
# string-like variables would automatically get converted to factors when
# being read in by
dataCar <- read.csv("vehins.csv")
# For R v4.0+,
dataCar <- read.csv("vehins.csv", stringsAsFactors = TRUE)
summary(dataCar)
      clm             exposure          veh_value        numclaims          claimcst
 Min.   :0.00000   Min.   :0.002738   Min.   : 0.000   Min.   :0.00000   Min.   :    0.0
 1st Qu.:0.00000   1st Qu.:0.219028   1st Qu.: 1.010   1st Qu.:0.00000   1st Qu.:    0.0
 Median :0.00000   Median :0.446270   Median : 1.500   Median :0.00000   Median :    0.0
 Mean   :0.06814   Mean   :0.468651   Mean   : 1.777   Mean   :0.07276   Mean   :  137.3
 3rd Qu.:0.00000   3rd Qu.:0.709103   3rd Qu.: 2.150   3rd Qu.:0.00000   3rd Qu.:    0.0
 Max.   :1.00000   Max.   :0.999316   Max.   :34.560   Max.   :4.00000   Max.   :55922.1

   veh_body            veh_age         gender              area              agecat
 Length:67856       Min.   :1.000   Length:67856       Length:67856       Min.   :1.000
 Class :character   1st Qu.:2.000   Class :character   Class :character   1st Qu.:2.000
 Mode  :character   Median :3.000   Mode  :character   Mode  :character   Median :3.000
                    Mean   :2.674                                         Mean   :3.485
                    3rd Qu.:4.000                                         3rd Qu.:5.000
                    Max.   :4.000                                         Max.   :6.000
Findings:
- The mean of clm (which is treated in R as a numeric variable taking only two values, 0 and 1) is 0.06814, meaning that 6.814% of the 67,856 vehicle insurance policies had at least one claim over the policy period.
| Variable | Description | Characteristics |
| --- | --- | --- |
| veh_value | Vehicle value (in $10,000s) | Numeric with three decimal places, from 0 to 34.560 |
| exposure | Exposure of policy | Numeric from 0 to 1 |
| clm | Claim occurrence | Integer from 0 to 1 (0 = no, 1 = yes) |
| numclaims | Number of claims | Integer from 0 to 4 |
| claimcst | Claim size | Positive value from 0 to 55,922.1 (= 0 if no claim) |
| veh_body | Vehicle body type | Factor with 13 levels: BUS, CONVT (= convertible), COUPE, HBACK (= hatchback), HDTOP (= hardtop), MCARA (= motorized caravan), MIBUS (= minibus), PANVN (= panel van), RDSTR (= roadster), SEDAN, STNWG (= station wagon), TRUCK, UTE (= utility) |
| veh_age | Vehicle age | Integer from 1 (youngest) to 4 |
| gender | Gender of policyholder | Factor with two levels: M (= male) and F (= female) |
| area | Area of residence of policyholder | Factor with 6 levels: A, B, C, D, E, F |
| agecat | Age band of policyholder | Integer from 1 (youngest) to 6 |
| X_OBSTAT_ | (Unknown) | Factor with one level "01101 0 0 0" |
Data Quality Issue
Removing inappropriate observations and variables
Because the goal of this case study is to develop GLMs that predict claim occurrence based on policyholders' characteristics, it is vitally important to make sure that the variables serving as predictors of the GLMs are available before claims are observed. If you check the data dictionary (knowledge of the variables is important!), there are two variables whose values become known at the same time as or after claims are submitted, meaning that they cannot function as predictors of clm in practice: numclaims and claimcst, representing the number of claims and the claim size, respectively. By definition, these two variables already imply clm. Indeed, we can check that if numclaims is 1 or more, or if claimcst is strictly positive, then clm is necessarily 1, indicating the presence of at least one claim, as the table in the first line of CHUNK 2 shows.
# CHUNK 2
table(dataCar$clm, dataCar$numclaims)

        0     1     2     3     4
  0 63232     0     0     0     0
  1     0  4333   271    18     2
If we build a model for clm using numclaims or claimcst as predictors, we are able to make near-perfect predictions due to their strong relationships with the target, but this superb prediction performance is misleading due to the issue of target leakage discussed earlier: numclaims and claimcst "leak" information about clm.
In the second part of CHUNK 2, we drop numclaims and claimcst to avoid target leakage, as well as the mysterious variable X_OBSTAT_, from the dataset.

# CHUNK 2 (cont.)
dataCar$claimcst <- NULL
dataCar$numclaims <- NULL
dataCar$X_OBSTAT_ <- NULL
The dataset also contains a small number of observations (53, to be exact) with a zero vehicle value. Since veh_value is measured in $10,000s and recorded to three decimal places, a recorded value of zero means that the vehicle is worth less than 0.001 × $10,000 = $10, which does not make practical sense. For this reason, let's also remove these questionable observations.

# CHUNK 2 (cont.)
nrow(dataCar[dataCar$veh_value == 0, ])
dataCar <- dataCar[dataCar$veh_value > 0, ]

[1] 53
Factor Conversions
Of all the predictors for clm, only veh_value, exposure, veh_age, and agecat are numeric variables (at this stage). Since veh_age and agecat are labels of age groups and do not represent exact ages, for modeling purposes it may be desirable to treat them as factors instead of numeric variables. As with most decisions in Exam PA, this comes with advantages and disadvantages:
Advantages
Even though there is an order among the age categories (e.g., vehicles with veh_age = 1 are the newest, and the higher the value of veh_age, the older the vehicle), treating the two variables as numeric, each with a single regression coefficient attached, imposes the undesirable restriction that every unit increase in each variable is associated with the same change in the linear predictor, holding all other variables fixed (e.g., the change in the linear predictor when agecat moves from 1 to 2 equals the change when it moves from 2 to 3). This restriction is lifted if veh_age and agecat are treated as categorical predictors represented by dummy variables with different coefficients, which gives a GLM more flexibility to capture the effects of these two variables on the target variable.
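To make the dummy-variable point concrete, here is a minimal sketch on toy data (not the dataCar dataset) contrasting the design matrix produced when an ordinal code is kept numeric with the one produced after factor conversion:

```r
# Hypothetical vector of age-group codes, for illustration only
agecat.toy <- c(1, 2, 3, 2, 1)

# Kept numeric: a single column, so one coefficient applies to
# every unit increase in the code
model.matrix(~ agecat.toy)

# Converted to a factor: one dummy column per non-baseline level,
# each of which gets its own coefficient in a GLM
model.matrix(~ as.factor(agecat.toy))
```

The second design matrix has one more column, which is exactly the extra flexibility (and extra dimension) discussed above.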
Disadvantages
Converting veh_age and agecat into factors inflates the dimension of the data, as each non-baseline factor level has to be represented by a dummy variable, and may result in overfitting. Stepwise selection with binarization or regularization can help reduce the large number of factor levels, but the model-fitting procedure may take much longer to complete than when the two variables are numeric.
The conversion of veh_age and agecat from integers to factors can be achieved by the as.factor() function, as in CHUNK 3.

# CHUNK 3
# Before factor conversion
class(dataCar$agecat)
class(dataCar$veh_age)

dataCar$agecat <- as.factor(dataCar$agecat)
dataCar$veh_age <- as.factor(dataCar$veh_age)

# After factor conversion
class(dataCar$agecat)
class(dataCar$veh_age)

> class(dataCar$agecat)
[1] "integer"
> class(dataCar$veh_age)
[1] "integer"
>
> dataCar$agecat <- as.factor(dataCar$agecat)
> dataCar$veh_age <- as.factor(dataCar$veh_age)
>
> class(dataCar$agecat)
[1] "factor"
> class(dataCar$veh_age)
[1] "factor"
Relevel Categorical Variables
The rest of CHUNK 3 relevels all of the categorical predictors, including veh_age and agecat, so that their baseline level is the level with the most observations.
vars.cat <- c("veh_age", "veh_body", "gender", "area", "agecat")
for (i in vars.cat) {
  table <- as.data.frame(table(dataCar[, i]))
  max <- which.max(table[, 2])
  level.name <- as.character(table[max, 1])
  dataCar[, i] <- relevel(dataCar[, i], ref = level.name)
}
summary(dataCar)
      clm             exposure          veh_value         claimcst           veh_body     veh_age   gender     area       agecat
 Min.   :0.00000   Min.   :0.002738   Min.   : 0.180   Min.   :    0.0   SEDAN  :22232   3:20060   F:38584   C:20530   4:16179
 1st Qu.:0.00000   1st Qu.:0.219028   1st Qu.: 1.010   1st Qu.:    0.0   HBACK  :18915   1:12254   M:29219   A:16302   1: 5734
 Median :0.00000   Median :0.446270   Median : 1.500   Median :    0.0   STNWG  :16234   2:16582             B:13328   2:12868
 Mean   :0.06811   Mean   :0.468481   Mean   : 1.778   Mean   :  137.1   UTE    : 4586   4:18907             D: 8161   3:15757
 3rd Qu.:0.00000   3rd Qu.:0.709103   3rd Qu.: 2.150   3rd Qu.:    0.0   TRUCK  : 1742             E: 5907   5:10722
 Max.   :1.00000   Max.   :0.999316   Max.   :34.560   Max.   :55922.1   HDTOP  : 1579             F: 3575   6: 6543
                                                                         (Other): 2515
For all of the categorical predictors, their baseline level is listed first and contains the greatest number of observations, as desired.
The next two tasks relate to univariate and bivariate data exploration.
TASK 2: Perform univariate exploration of the non-factor variables
Use graphical displays and summary statistics to determine if any of the non-factor variable(s) except exposure should be transformed and, if so, what transformation should be made. Do your recommended transformation(s), if any, and delete the original variable(s).
Note: For this task, there is no need to analyze the exposure variable, which will be treated later.
Stage 3: Exploratory Data Analysis
We are left with veh_value as the only numeric variable (besides exposure) to analyze. It ranges from 0.180 to 34.560 and its mean (1.778) is higher than its median (1.500), suggesting that its distribution is moderately right-skewed.
This is confirmed by the histogram (see the left panel of Figure 4.3.1) in CHUNK 4. For such a strictly positive and skewed variable, the log transformation is commonly used, so we create a log-transformed version of veh_value called log_veh_value, whose distribution is much more symmetric and bell-shaped, and delete the original veh_value in the second part of CHUNK 4.
# CHUNK 4
library(ggplot2)
ggplot(dataCar, aes(x = veh_value)) +
  geom_histogram()
ggplot(dataCar, aes(x = log(veh_value))) +
  geom_histogram()

dataCar$log_veh_value <- log(dataCar$veh_value)
dataCar$veh_value <- NULL
Figure 4.3.1: A histogram for veh_value (left) and the log of veh_value (right) in the dataCar dataset.
TASK 3: Explore the relationship of each predictor to clm
Use graphical displays and summary statistics to form preliminary conclusions regarding which variables are likely to have significant predictive power.
We already explored the distribution of the target variable clm in CHUNK 1 and the distribution of veh_value in CHUNK 4. Now we turn to the relationship between clm and each predictor, numeric and categorical.
Numeric predictors
First we look at the relationship between clm and the two numeric variables, log_veh_value and exposure. For a categorical-numeric pair, recall that boxplots for the numeric variable split by the levels of the categorical variable are useful.
CHUNK 5 produces these boxplots for log_veh_value and exposure indexed by the two levels (0 and 1) of clm (see Figure 4.3.2). Do remember to apply the factor() function to clm so that it is treated as a factor when making the boxplots. It appears that claim occurrence (clm = 1) is associated with a slightly higher value of log_veh_value and a much higher value of exposure.
These positive relationships between the two predictors and clm will be quantified by a GLM in later tasks. In addition to split boxplots, we can also look at the mean or median of log_veh_value and exposure over the two levels of clm, as we do in CHUNK 6.
# CHUNK 6
library(tidyverse)
dataCar %>%
  group_by(clm) %>%
  summarise(
    mean = mean(log_veh_value),
    median = median(log_veh_value),
    n = n()
  )

# A tibble: 2 × 4
    clm  mean median     n
  <int> <dbl>  <dbl> <int>
1     0 0.382  0.399 63185
2     1 0.448  0.451  4618

dataCar %>%
  group_by(clm) %>%
  summarise(
    mean = mean(exposure),
    median = median(exposure),
    n = n()
  )

# A tibble: 2 × 4
    clm  mean median     n
  <int> <dbl>  <dbl> <int>
1     0 0.458  0.427 63185
2     1 0.611  0.637  4618
As expected, the means and medians are higher when there is a claim, so a policyholder with a larger vehicle value or a larger exposure has a higher tendency to submit an insurance claim.
# CHUNK 5
ggplot(dataCar, aes(x = factor(clm), y = log_veh_value)) +
  geom_boxplot(fill = "red")
ggplot(dataCar, aes(x = factor(clm), y = exposure)) +
  geom_boxplot(fill = "red")
Figure 4.3.2: Split boxplots for log_veh_value and exposure by clm in the dataCar dataset.
Categorical Predictors
The other predictors, veh_body, veh_age (after conversion), gender, area, and agecat (after conversion), are all categorical variables. For categorical-categorical combinations, filled bar charts are useful. In CHUNK 7, we use a for loop to construct a bar chart for each categorical predictor color-filled by clm.
# CHUNK 7
for (i in vars.cat) {
  plot <- ggplot(dataCar, aes(x = dataCar[, i], fill = factor(clm))) +
    geom_bar(position = "fill") +
    labs(x = i, y = "percent") +
    theme(axis.text.x = element_text(angle = 90, hjust = 1))
  print(plot)
}
Figure 4.3.3: A bar chart for agecat by clm in the dataCar dataset.
If a categorical variable is an important predictor for clm, then the proportions of 0 (in red) and 1 (in green) should vary significantly across the levels of that variable. More noticeable differences were observed for the following variables:
- veh_age: Higher rate of claim occurrence for group 2.
- veh_body: Much higher rate for BUS and lower rates for CONVT, MIBUS, and UTE.
- area: Higher rate for area F.
- agecat: The claim occurrence rate appears to decrease with agecat.
We can also obtain a more accurate understanding of how the categorical predictors affect clm by looking at the mean of clm (which is the claim occurrence rate) by level of each predictor, as in CHUNK 8. The descriptive statistics are consistent with what we observe from the filled bar charts in CHUNK 7.
# CHUNK 8
for (i in vars.cat) {
  print(i)
  x <- dataCar %>%
    group_by(.data[[i]]) %>%
    summarise(mean = mean(clm), n = n())
  print(x)
}

[1] "veh_age"
# A tibble: 4 × 3
  veh_age   mean     n
  <fct>    <dbl> <int>
1 3       0.0679 20060
2 1       0.0672 12254
3 2       0.0759 16582
4 4       0.0620 18907
[1] "veh_body"
# A tibble: 13 × 3
   veh_body   mean     n
   <fct>     <dbl> <int>
 1 SEDAN    0.0664 22232
 2 BUS      0.184     38
 3 CONVT    0.0370    81
 4 COUPE    0.0872   780
 5 HBACK    0.0668 18915
 6 HDTOP    0.0823  1579
 7 MCARA    0.116    121
 8 MIBUS    0.0601   716
 9 PANVN    0.0824   752
10 RDSTR    0.0741    27
11 STNWG    0.0721 16234
12 TRUCK    0.0683  1742
13 UTE      0.0567  4586
[1] "gender"
# A tibble: 2 × 3
  gender   mean     n
  <fct>   <dbl> <int>
1 F      0.0686 38584
2 M      0.0675 29219
[1] "area"
# A tibble: 6 × 3
  area    mean     n
  <fct>  <dbl> <int>
1 C     0.0688 20530
2 A     0.0664 16302
3 B     0.0723 13328
4 D     0.0607  8161
5 E     0.0652  5907
6 F     0.0783  3575
[1] "agecat"
# A tibble: 6 × 3
  agecat   mean     n
  <fct>   <dbl> <int>
1 4      0.0682 16179
2 1      0.0863  5734
3 2      0.0724 12868
4 3      0.0705 15757
5 5      0.0572 10722
6 6      0.0556  6543
TASK 4: Reduce the number of factor levels where appropriate
Several of the factor variables have a small number of observations at some of their levels. Consider using knowledge of the factor levels as well as evidence from Task 3 to combine some of them into factor levels with more observations.
Combining Categorical Predictors
The dataCar dataset features a number of categorical predictors, some of which have a large number of levels. This significantly increases the dimension of the data and may dilute the predictive power of the GLMs we construct. To reduce the dimension of the data, we can combine factor levels that have either similar claim occurrence rates or very few observations into more populous levels, based on the claim occurrence rates grouped by factor level shown in CHUNK 8. Arguably the most problematic categorical predictor is veh_body, which has the largest number of levels (13) of the five factor variables. Some levels, like "BUS" and "RDSTR", have very few observations, so our priority is to combine these sparse levels with more populous levels having similar claim occurrence rates.
Combining BUS and MCARA into BUS-MCARA
"BUS" is extremely sparse, but it has the highest claim occurrence rate. The level with the second-highest rate is "MCARA", which is quite sparse as well. In the first two lines of CHUNK 9, let's combine these two levels into a larger level called "BUS-MCARA".
# CHUNK 9
# Combine "BUS" and "MCARA" as one level called "BUS-MCARA"
levels(dataCar$veh_body)[levels(dataCar$veh_body) == "BUS"] <- "BUS-MCARA"
levels(dataCar$veh_body)[levels(dataCar$veh_body) == "MCARA"] <- "BUS-MCARA"
The code, adapted from the December 8, 2020 exam, is another illustration of the logical subsetting we learned in Chapter 1. To understand the code, note that levels(dataCar$veh_body) is a vector with 13 elements representing the 13 levels of veh_body. In the first line, we use a logical test to extract the index corresponding to "BUS" (which is, in fact, the second element) and change the name of that factor level to "BUS-MCARA". The same idea applies to the second line of code.
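The same renaming idiom can be tried on a small made-up factor; note that assigning an existing level name via levels()<- merges the two levels into one:

```r
# Hypothetical factor, for illustration only
f <- factor(c("BUS", "SEDAN", "MCARA", "SEDAN"))

# Rename "BUS" and "MCARA" to a common name; R merges the two levels
levels(f)[levels(f) == "BUS"] <- "BUS-MCARA"
levels(f)[levels(f) == "MCARA"] <- "BUS-MCARA"

table(f)  # BUS-MCARA: 2, SEDAN: 2
```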
Combining the other levels into OTHER
There are other sparse levels, such as "CONVT" and "RDSTR". The claim occurrence rates in these levels are quite similar to the rates in the remaining levels (except "BUS" and "MCARA"), ranging from about 0.05 to about 0.08, so we may combine all of them into one big level called "OTHER". That's what the rest of CHUNK 9 does.
# CHUNK 9 (cont.)
# Combine levels other than "BUS-MCARA" as "OTHER"
levels(dataCar$veh_body)[!(levels(dataCar$veh_body) == "BUS-MCARA")] <- "OTHER"
table(dataCar$veh_body)

    OTHER BUS-MCARA
    67644       159
After consolidating its factor levels, veh_body has only two levels, "OTHER" (the baseline level) and "BUS-MCARA". The baseline level has a relatively low claim occurrence rate, while the "BUS-MCARA" level has a much higher rate, even though it has far fewer observations.
EXAM NOTE
- The above is only one way, by no means the only way, to reduce the factor levels of veh_body; there are many other possibilities. On the exam, be sure to fully justify your rationale for the combinations you make.
- The June 2019 exam model solution says that when doing combinations, "[t]he best candidates found multiple cases where factor levels could be combined, considered the similarity of means/medians, and provided adequate rationale for combining levels. It was not sufficient to only combine those levels with extremely low counts."
TASK 5: Select an interaction
Select one pair of features that may be included as an interaction variable in your GLM. Do this by first proposing two variables that are likely to interact and then using appropriate graphical displays to confirm the existence of an interaction. Continue until a promising interaction has been identified. Include your selected interaction variables when constructing a GLM in subsequent tasks.
Interactions
Interaction between log_veh_value (numeric) and veh_body (factor)
Recall that an interaction is indicated when changing the value or level of one variable alters the relationship between the target variable and another variable. In CHUNK 10, we explore the possible interaction between log_veh_value (numeric) and veh_body (factor) via boxplots for log_veh_value split by clm (factor) and faceted by veh_body (see Figure 4.3.4). It is possible that the effect of log_veh_value on claim occurrence differs between the two kinds of vehicles.
# CHUNK 10
ggplot(dataCar, aes(x = factor(clm), y = log_veh_value)) +
  geom_boxplot() +
  facet_wrap(~ veh_body)
Figure 4.3.4: Faceted split boxplots for log_veh_value by clm and veh_body in the dataCar data.
We can see that for OTHER vehicles, claim occurrence is associated with a slightly higher log_veh_value, but the association is much more positive for BUS-MCARA vehicles. In other words, log_veh_value exerts a much stronger effect on claim occurrence for BUS-MCARA vehicles than for OTHER vehicles. We will include the interaction between log_veh_value and veh_body when constructing GLMs in later tasks.
Stage 4: Model Construction and Selection
We proceed to construct GLMs for explaining the claim occurrence rate based on other risk factors in the vehicle insurance dataset.
TASK 6: Select a link function
Note:
Since the target variable clm is binary, the binomial distribution is the only reasonable candidate for the distribution of the target variable. It remains to specify an appropriate link function. Among the five link functions above, the log link can be safely ruled out: the mean of the binary target variable, given by p = e^η under the log link, does not necessarily lie between 0 and 1 (predictions, although non-negative, may be greater than 1, which is impossible for a probability). While the four remaining link functions all ensure a mean lying in the unit interval for the binary target, the logit link is the most commonly used and most interpretable due to its intimate connection with the odds of an event (discussed on page 292). The probit, cauchit, and complementary log-log link functions all produce results that are much more difficult to interpret because of the rather complex relationship between p and η. In fact, the logit link is the default choice in R and is the canonical link function for the binomial response distribution. In what follows, we will explore the logit and probit links (you can replace probit with the cauchit or complementary log-log link).
Creation of Training and Test Sets
Before building any GLMs, let's use the createDataPartition() function again to split the data into training (75%) and test (25%) sets, as we do in CHUNK 11.
# CHUNK 11
library(caret)
set.seed(4769)
partition <- createDataPartition(y = dataCar$clm, p = 0.75, list = FALSE)
data.train <- dataCar[partition, ]
data.test <- dataCar[-partition, ]
mean(data.train$clm)
mean(data.test$clm)

> mean(data.train$clm)
[1] 0.06870784
> mean(data.test$clm)
[1] 0.06631268
Validating the data split process
To check that the two sets are representative, we observe that the claim occurrence rate in the training set is 6.87% and that in the test set is 6.63%. That the two rates are similar shows that the built-in stratification of the target variable works well.
Fitting the Models with the Logit and Probit Links
In CHUNK 12, we fit the logistic regression model and the probit regression model separately to the training set. The default link function for the binomial distribution is the logit link, so it does not have to be specified, but for the probit model the argument link = "probit" has to be added to the glm() function. Because we are specifically asked not to use the exposure variable, we use the formula clm ~ . - exposure + log_veh_value:veh_body, meaning to regress clm on all variables in the training set except exposure, with the interaction term between log_veh_value and veh_body added (don't miss the interaction found in Task 5!).
# CHUNK 12
logit <- glm(clm ~ . - exposure + log_veh_value:veh_body,
             data = data.train, family = binomial)
probit <- glm(clm ~ . - exposure + log_veh_value:veh_body,
              data = data.train, family = binomial(link = "probit"))
Which link function should we use?
The most natural way to compare these two GLMs would be to look at their predictive performance on the test set. As the task statement explicitly tells us not to do so, we instead look at metrics computed on the training set that assess not only the goodness of fit of a GLM to the training set but also the complexity of the model, and strike a balance between the two. The deviance is generally not a good measure here, as it focuses solely on the goodness of fit of the two GLMs on the training set. A better metric is the AIC, which penalizes a GLM by the number of parameters it carries (you can also use the BIC). Recall that the lower the AIC, the better the model.
In the second part of CHUNK 12, we use the AIC() function to return the AIC of each of the two GLMs.

# CHUNK 12 (cont.)
AIC(logit)
AIC(probit)

> AIC(logit)
[1] 25354.09
> AIC(probit)
[1] 25353.79
The two AIC values are very close to each other (the probit model's is marginally lower), but the logistic regression model is much more interpretable than the probit model, so we will use the logit link in subsequent tasks.
Selected link function: logit
TASK 7: Specify an offset
The two GLMs in Task 6 ignored the exposure variable in the model-building process. Recall from Figure 4.3.2 that exposure is strongly positively related to claim occurrence. This comes as no surprise, since exposure records the effective duration of each vehicle insurance policy in the dataset (according to the data dictionary) and captures the amount of risk each policy is exposed to.
Some of the policies came into force partway into the year, and some were canceled before the end of the policy year. Other things equal, the larger the value of exposure, the higher the claim occurrence probability. One good way to put exposure to use is to take it as an offset and revise the construction of the logistic regression model in Task 6. Since the model involves the logit of the claim occurrence probability, or the log of the odds of claim occurrence, we impose exposure on the log scale so that the model equation reads

ln(odds) = ln(exposure) + η,

where η is the linear predictor without the offset, or equivalently,

odds = exposure × e^η.

This means that the odds of claim occurrence, which runs on a scale from 0 to +∞, is directly proportional to the value of exposure, which is a rather reasonable assumption.
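The proportionality can be verified numerically with a toy linear predictor (the value of η below is made up for illustration):

```r
eta <- -2.5                       # hypothetical linear predictor (no offset)
exposure <- c(0.25, 0.5, 1.0)     # three exposure levels

p <- plogis(log(exposure) + eta)  # claim probability with log(exposure) offset
odds <- p / (1 - p)               # implied odds of claim occurrence

odds / exposure                   # constant across rows: equals exp(eta)
```

Doubling the exposure therefore exactly doubles the implied odds of a claim, which is what the offset formulation encodes.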
In CHUNK 13, we refit the logistic regression model for clm using exposure as an offset and output its AIC and summary. Notice the extra argument offset = log(exposure) in the glm() function, which specifies the logarithm of exposure (not exposure itself!) as the offset.

# CHUNK 13
logit.offset <- glm(clm ~ . - exposure + log_veh_value:veh_body,
                    data = data.train,
                    offset = log(exposure),
                    family = binomial)
AIC(logit.offset)
summary(logit.offset)
> AIC(logit.offset)
[1] 24459.17
> summary(logit.offset)

Call:
glm(formula = clm ~ . - exposure + log_veh_value:veh_body, family = binomial,
    data = data.train, offset = log(exposure))

Coefficients:
                                  Estimate Std. Error z value Pr(>|z|)
(Intercept)                     -1.8401822  0.0553407 -33.252  < 2e-16 ***
veh_bodyBUS-MCARA                1.2064692  0.5839503   2.066 0.038824 *
veh_age1                        -0.0237603  0.0577435  -0.411 0.680720
veh_age2                         0.0707487  0.0489945   1.444 0.148735
veh_age4                        -0.0258168  0.0530979  -0.486 0.626818
genderM                         -0.0471309  0.0367956  -1.281 0.200233
areaA                           -0.0418634  0.0489577  -0.855 0.392500
areaB                            0.0839797  0.0503891   1.667 0.095589 .
areaD                           -0.1630404  0.0637039  -2.559 0.010487 *
areaE                           -0.1027563  0.0706512  -1.454 0.145831
areaF                            0.0636621  0.0798244   0.798 0.425145
agecat1                          0.2960493  0.0653958   4.527 5.98e-06 ***
agecat2                          0.0788806  0.0537650   1.467 0.142339
agecat3                         -0.0003662  0.0514177  -0.007 0.994317
agecat5                         -0.2342049  0.0604906  -3.872 0.000108 ***
agecat6                         -0.2749970  0.0746007  -3.686 0.000228 ***
log_veh_value                    0.1470760  0.0400693   3.671 0.000242 ***
veh_bodyBUS-MCARA:log_veh_value -0.5307286  0.6035585  -0.879 0.379221
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 24567  on 50852  degrees of freedom
Residual deviance: 24423  on 50835  degrees of freedom
AIC: 24459

Number of Fisher Scoring iterations: 5
The AIC of the logistic regression model with an offset is 24,459.17, down from 25,354.09 (CHUNK 12) when the offset is absent. This indicates that incorporating the offset leads to a moderate improvement in the quality of the logistic regression model.
TASK 8: Generate confusion matrices and AUCs
Constructing Confusion Matrices
To get a feel for how well the logistic regression model fitted in Task 7 is doing, we can construct a confusion matrix on the training set and another on the test set, corresponding to a certain cutoff. In R, a confusion matrix can be created by the confusionMatrix() function from the caret package. This function takes a vector of predicted classes and a vector of observed classes as its first two arguments (if you are doubtful about which vector should come first on the exam, you can run ?confusionMatrix to consult the function's documentation) and returns the confusion matrix.
To turn the vector of predicted probabilities into a vector of predicted classes, we need a cutoff. We noticed from CHUNK 3 that only 6.811% of the policies in the whole dataset have a claim, so using a cutoff of 0.5 (the most intuitive value) would result in almost every policy being predicted to have no claim. To make the logistic regression model produce more meaningful results, we set the cutoff to 0.06811.
In CHUNK 14, we produce the predicted class probabilities and predicted class labels for the logistic regression model, and generate a confusion matrix on each of the training and test sets using a cutoff of 0.06811.

# CHUNK 14
cutoff <- 0.06811  # you can try other values

# Generate predicted probabilities on the training set
pred.train <- predict(logit.offset, type = "response")

# Turn predicted probabilities into predicted classes
class.train <- 1*(pred.train > cutoff)  # OR ifelse(pred.train > cutoff, 1, 0)

confusionMatrix(factor(class.train), factor(data.train$clm), positive = "1")
Confusion Matrix and Statistics

          Reference
Prediction     0     1
         0 25474  1017
         1 21885  2477

               Accuracy : 0.5496
                 95% CI : (0.5453, 0.554)
    No Information Rate : 0.9313
    P-Value [Acc > NIR] : 1

                  Kappa : 0.0655
 Mcnemar's Test P-Value : <2e-16

            Sensitivity : 0.70893
            Specificity : 0.53789
         Pos Pred Value : 0.10167
         Neg Pred Value : 0.96161
             Prevalence : 0.06871
         Detection Rate : 0.04871
   Detection Prevalence : 0.47907
      Balanced Accuracy : 0.62341

       'Positive' Class : 1
# CHUNK 14 (cont.)
# Generate predicted probabilities on the test set
pred.test <- predict(logit.offset, newdata = data.test, type = "response")
class.test <- 1*(pred.test > cutoff)
confusionMatrix(factor(class.test), factor(data.test$clm), positive = "1")
Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0 8483  332
         1 7343  792

               Accuracy : 0.5472
                 95% CI : (0.5397, 0.5547)
    No Information Rate : 0.9337
    P-Value [Acc > NIR] : 1

                  Kappa : 0.0617
 Mcnemar's Test P-Value : <2e-16

            Sensitivity : 0.70463
            Specificity : 0.53602
         Pos Pred Value : 0.09736
         Neg Pred Value : 0.96234
             Prevalence : 0.06631
         Detection Rate : 0.04673
   Detection Prevalence : 0.47994
      Balanced Accuracy : 0.62032

       'Positive' Class : 1
It is important to note that:
- The first two arguments of the confusionMatrix() function must be factors, so we apply the factor() function to class.train, data.train$clm, class.test, and data.test$clm.
- By default, R takes the alpha-numerically first level (here "0", i.e., no claims) as the level of interest. By adding positive = "1" to the confusionMatrix() function, we explicitly specify "1" as the factor level corresponding to a "positive" result. For some unknown reason, very few past PA exams do so. Without specifying the positive argument, the "sensitivity" and "specificity" in the output are, in fact, the specificity and sensitivity we are interested in.
Findings:
From the two confusion matrices, we can see that:
- Sensitivity vs. specificity: In both matrices, the accuracy is only marginally higher than 0.5, but accuracy does not take into account the fact that the insurance company should be more interested in correctly classifying policyholders with a claim (who will file for payment) than policyholders without a claim. In other words, the sensitivity of the logistic regression model (with "1" set as the positive level) should be of more interest to the insurance company than the specificity. We can see that the model has a decent sensitivity (close to 0.70) on both the training and test sets whereas its specificity is relatively low, meaning that the model does a pretty good job of identifying policies with a claim but is not very good at identifying no-claim policies. If time allows, it is a good move to tune the cutoff over a wide range of values and choose one that strikes a better balance between sensitivity and specificity.
- Training vs. test: The accuracy, sensitivity, and specificity of the model on the test set are all slightly lower than the corresponding quantities on the training set. This is an indication of a small (but not serious) extent of overfitting for the logistic regression model.
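The cutoff tuning mentioned above can be sketched as follows. This is a minimal illustration with made-up vectors, not the case study's data: in practice, pred would come from predict(logit.offset, type = "response") and obs would be data.train$clm, and the cutoff grid would be much finer.

```r
# Hypothetical predicted probabilities and observed classes (made up for illustration)
pred <- c(0.02, 0.05, 0.07, 0.10, 0.04, 0.09, 0.06, 0.08)
obs  <- c(0,    0,    1,    1,    0,    1,    0,    1)

# For each candidate cutoff, compute the sensitivity and specificity by hand
metrics <- sapply(c(0.05, 0.065, 0.08), function(cutoff) {
  class.pred <- as.numeric(pred > cutoff)
  c(cutoff      = cutoff,
    sensitivity = sum(class.pred == 1 & obs == 1) / sum(obs == 1),
    specificity = sum(class.pred == 0 & obs == 0) / sum(obs == 0))
})
t(metrics)  # one row per cutoff; pick the cutoff with the preferred balance
```

Raising the cutoff trades sensitivity for specificity, so the table produced this way makes the trade-off explicit.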
Determining AUCs and ROC Plots
Confusion matrices are based on a pre-specified cutoff. As we learned in Subsection 4.1.4, the AUC is another performance metric of a classifier, aggregated over all possible cutoffs from 0 to 1. In R, the AUC of a classifier can be computed by the roc() and auc() functions from the pROC package (available on the exam). The roc() function takes the observed levels of the target variable and the predicted probabilities, supplied in that order, and produces a list, which can be passed to the plot() function to plot the ROC curve and to the auc() function to return the AUC of the classifier.
In CHUNK 15
, we make an ROC plot (see Figure 4.3.5) and obtain the AUC of the logistic regression model on both the training and test sets.
# CHUNK 15
library(pROC)

roc.train <- roc(data.train$clm, pred.train)
auc(roc.train)

roc.test <- roc(data.test$clm, pred.test)
auc(roc.test)
> auc(roc.train)
Area under the curve: 0.6647
> auc(roc.test)
Area under the curve: 0.6581
The command par(pty = "s")
, used on a few PA exams (but not in the PA modules), generates a square (which is what “s” stands for) plotting region, which improves the appearance of the ROC plot.
The two AUCs are close to 0.66, which means that the model is making better (but perhaps not way better) predictions than by chance. The fact that the test AUC is slightly lower than the training AUC supports our earlier conclusion that there is a small extent of overfitting for the logistic regression model.
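To make the "better than chance" reading of the AUC concrete, here is a small hand computation (with made-up vectors, not the dataCar predictions) of the AUC's probabilistic interpretation: the chance that a randomly chosen event receives a higher predicted probability than a randomly chosen non-event, which is what pROC::auc() computes up to tie handling.

```r
# Made-up observed classes and predicted probabilities
obs  <- c(0, 0, 1, 1, 0, 1)
pred <- c(0.10, 0.40, 0.35, 0.80, 0.20, 0.55)

pos <- pred[obs == 1]  # predicted probabilities for events
neg <- pred[obs == 0]  # predicted probabilities for non-events

# All (event, non-event) pairs; ties count one half
pairs <- expand.grid(p = pos, n = neg)
auc.manual <- mean(pairs$p > pairs$n) + 0.5 * mean(pairs$p == pairs$n)
auc.manual  # a classifier guessing at random would score 0.5
```

An AUC of 0.5 corresponds to random guessing and 1 to perfect ranking, which is why values around 0.66 indicate modest but genuine predictive power.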
Figure 4.3.5: The ROC plots for the full logistic regression model on the training set (left) and test set (right) in the dataCar
data.
TASK 9: Select features using stepwise selection Some of the features may lack predictive power and result in overfitting.
When finished with this task, this will be your final model form. |
Now that we have fitted a logistic regression model (equipped with a good offset) and examined its preliminary performance, it is time to simplify the model by retaining only the features with strong predictive power, thereby preventing overfitting. Recall from the model summary in CHUNK 13 that the model currently has a total of 17 features (excluding the intercept), many of which are dummy variables representing the categorical predictors. The summary shows that only a few features are statistically significant, so feature selection will likely be meaningful.
Binarization
Before we use a function like stepAIC() (which works well for GLMs, not just for linear models) to perform stepwise selection on the fitted logistic regression model, it is desirable to explicitly binarize the categorical predictors with three or more levels, as noted in Subsection 3.4.2. This allows the feature selection function to view each level of a multi-level categorical predictor separately and add or remove one level at a time, instead of having to add or remove all of the predictor's dummy variables at once. As we can see from the summary output in CHUNK 13, only certain levels of area and agecat are significant with respect to the baseline, so it may be sensible to combine the insignificant levels with the baseline and use only the significant levels for prediction. Explicit binarization makes this possible.
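The full-rank dummy encoding described above can also be sketched with base R's model.matrix(), which produces one column per non-baseline level just as the caret approach does. The toy data frame below is hypothetical, not taken from dataCar:

```r
# Toy 3-level factor: level "1" is the baseline, so two dummy columns result
df <- data.frame(agecat = factor(c(1, 2, 3, 2, 1)))

# model.matrix() expands the factor; drop the intercept column to keep
# only the dummies (full-rank encoding, same idea as fullRank = TRUE)
dummies <- model.matrix(~ agecat, data = df)[, -1]
dummies
```

Each row has at most one 1, marking the observation's level, and baseline observations are all zeros.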
In CHUNK 16, we use the dummyVars() function from the caret package, first seen in Subsection 3.4.2, to binarize the categorical predictors with three or more levels, which include agecat, area, and veh_age.
# CHUNK 16
library(caret)

binarizer <- dummyVars(~ agecat + area + veh_age,
                       data = dataCar, fullRank = TRUE)
binarized.vars <- data.frame(predict(binarizer, newdata = dataCar))
head(binarized.vars)
  agecat.1 agecat.2 agecat.3 agecat.5 agecat.6 area.A area.B area.D area.E area.F veh_age.1 veh_age.2 veh_age.4
1        0        1        0        0        0      0      0      0      0      0         0         0         0
2        0        0        0        0        0      1      0      0      0      0         0         1         0
3        0        1        0        0        0      0      0      0      1      0         0         1         0
4        0        1        0        0        0      0      0      1      0      0         0         1         0
5        0        1        0        0        0      0      0      0      0      0         0         0         1
6        0        0        0        0        0      0      0      0      0      0         0         0         0
Then we attach the binarized variables to the dataCar
dataset (with a .bin
suffix), delete the original categorical variables, and reuse the partition vector generated in CHUNK 11
to partition
the dataset into the training and test sets, now with binarized variables.
# CHUNK 16 (Cont.)
# Attach the binarized variables
dataCar.bin <- cbind(dataCar, binarized.vars)

# Remove the original factor variables
dataCar.bin$agecat <- NULL
dataCar.bin$area <- NULL
dataCar.bin$veh_age <- NULL

# Set up the binarized training and test sets
data.train.bin <- dataCar.bin[partition, ]
data.test.bin <- dataCar.bin[-partition, ]
In CHUNK 17
, we fit the logistic regression model (with interaction) to the binarized training set and print a model summary.
# CHUNK 17
# Fit the logistic regression model to the binarized training set
logit.full <- glm(clm ~ . - exposure + log_veh_value:veh_body,
                  data = data.train.bin,
                  offset = log(exposure),
                  family = binomial)
summary(logit.full)
Call:
glm(formula = clm ~ . - exposure + log_veh_value:veh_body, family = binomial,
    data = data.train.bin, offset = log(exposure))

Coefficients:
                                  Estimate Std. Error z value Pr(>|z|)
(Intercept)                     -1.8401822  0.0553407 -33.252  < 2e-16 ***
veh_bodyBUS-MCARA                1.2064692  0.5839503   2.066 0.038824 *
genderM                         -0.0471309  0.0367956  -1.281 0.200233
log_veh_value                    0.1470760  0.0400693   3.671 0.000242 ***
agecat.1                         0.2960493  0.0653958   4.527 5.98e-06 ***
agecat.2                         0.0788806  0.0537650   1.467 0.142339
agecat.3                        -0.0003662  0.0514177  -0.007 0.994317
agecat.5                        -0.2342049  0.0604906  -3.872 0.000108 ***
agecat.6                        -0.2749970  0.0746007  -3.686 0.000228 ***
area.A                          -0.0418634  0.0489577  -0.855 0.392500
area.B                           0.0839797  0.0503891   1.667 0.095589 .
area.D                          -0.1630404  0.0637039  -2.559 0.010487 *
area.E                          -0.1027563  0.0706512  -1.454 0.145831
area.F                           0.0636621  0.0798244   0.798 0.425145
veh_age.1                       -0.0237603  0.0577435  -0.411 0.680720
veh_age.2                        0.0707487  0.0489945   1.444 0.148735
veh_age.4                       -0.0258168  0.0530979  -0.486 0.626818
veh_bodyBUS-MCARA:log_veh_value -0.5307286  0.6035585  -0.879 0.379221
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 24567  on 50852  degrees of freedom
Residual deviance: 24423  on 50835  degrees of freedom
AIC: 24459

Number of Fisher Scoring iterations: 5
The summary output is identical to that in CHUNK 13
, showing that the binarization is done properly and does not have any effect on the model, but the automatic stepwise selection procedure we are going to carry out will recognize each level of the categorical predictors as a separate feature.
Stepwise Selection
In CHUNK 18, we run the stepAIC() function on the full logistic regression model. We have been asked to perform forward selection using the BIC. Since the BIC represents a more conservative approach to feature selection, it helps the insurance company identify the key factors affecting claim occurrence rates.
# CHUNK 18
library(MASS)

logit.null <- glm(clm ~ 1,
                  data = data.train.bin,
                  offset = log(exposure),
                  family = binomial)

logit.reduced <- stepAIC(logit.null,
                         direction = "forward",
                         k = log(nrow(data.train.bin)),
                         scope = list(upper = logit.full, lower = logit.null))

summary(logit.reduced)
Start:  AIC=24578.19
clm ~ 1

                Df Deviance   AIC
+ log_veh_value  1    24532 24553
+ agecat.1       1    24533 24554
+ agecat.5       1    24541 24563
+ agecat.6       1    24542 24564
+ veh_age.4      1    24549 24571
+ veh_age.2      1    24554 24575
<none>                24567 24578
+ area.D         1    24558 24580
+ agecat.2       1    24559 24581
+ area.F         1    24561 24583
+ veh_body       1    24562 24583
+ area.B         1    24562 24584
+ area.E         1    24566 24587
+ gender         1    24566 24588
+ veh_age.1      1    24566 24588
+ area.A         1    24566 24588
+ agecat.3       1    24567 24589

Step:  AIC=24553.41
clm ~ log_veh_value

            Df Deviance   AIC
+ agecat.1   1    24498 24531
+ agecat.5   1    24505 24538
+ agecat.6   1    24512 24544
<none>            24532 24553
+ area.D     1    24522 24554
+ area.B     1    24524 24557
+ agecat.2   1    24524 24557
+ veh_age.2  1    24527 24560
+ veh_body   1    24527 24560
+ gender     1    24528 24561
+ area.F     1    24529 24561
+ area.E     1    24529 24562
+ veh_age.1  1    24530 24563
+ veh_age.4  1    24531 24563
+ area.A     1    24531 24564
+ agecat.3   1    24532 24564

Step:  AIC=24530.58
clm ~ log_veh_value + agecat.1

            Df Deviance   AIC
+ agecat.5   1    24479 24522
+ agecat.6   1    24482 24526
+ agecat.2   1    24485 24528
<none>            24498 24531
+ area.D     1    24488 24532
+ area.B     1    24490 24534
+ veh_body   1    24494 24537
+ veh_age.2  1    24494 24537
+ gender     1    24495 24538
+ area.F     1    24496 24539
+ area.E     1    24496 24539
+ agecat.3   1    24496 24539
+ veh_age.1  1    24496 24540
+ veh_age.4  1    24497 24541
+ area.A     1    24497 24541

Step:  AIC=24522.16
clm ~ log_veh_value + agecat.1 + agecat.5

            Df Deviance   AIC
+ agecat.6   1    24457 24511
<none>            24479 24522
+ area.D     1    24469 24524
+ area.B     1    24470 24525
+ agecat.2   1    24472 24526
+ veh_body   1    24474 24528
+ veh_age.2  1    24475 24529
+ gender     1    24476 24530
+ area.E     1    24476 24530
+ area.F     1    24477 24531
+ veh_age.1  1    24477 24531
+ veh_age.4  1    24478 24532
+ area.A     1    24478 24532
+ agecat.3   1    24479 24533

Step:  AIC=24511.36
clm ~ log_veh_value + agecat.1 + agecat.5 + agecat.6

            Df Deviance   AIC
<none>            24457 24511
+ area.D     1    24448 24514
+ area.B     1    24449 24514
+ veh_body   1    24452 24517
+ veh_age.2  1    24452 24517
+ agecat.2   1    24454 24519
+ area.E     1    24455 24520
+ gender     1    24455 24520
+ veh_age.1  1    24456 24521
+ area.F     1    24456 24521
+ veh_age.4  1    24456 24521
+ agecat.3   1    24456 24522
+ area.A     1    24456 24522

> summary(logit.reduced)

Call:
glm(formula = clm ~ log_veh_value + agecat.1 + agecat.5 + agecat.6,
    family = binomial, data = data.train.bin, offset = log(exposure))

Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept)   -1.84794    0.02533 -72.959  < 2e-16 ***
log_veh_value  0.15913    0.02917   5.455 4.90e-08 ***
agecat.1       0.27479    0.05845   4.701 2.59e-06 ***
agecat.5      -0.25995    0.05300  -4.905 9.36e-07 ***
agecat.6      -0.30815    0.06856  -4.495 6.97e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 24567  on 50852  degrees of freedom
Residual deviance: 24457  on 50848  degrees of freedom
AIC: 24467

Number of Fisher Scoring iterations: 5
The final model carries only four features: log_veh_value, agecat.1, agecat.5, and agecat.6. The interaction variable is not added. The small size of the final model comes as no surprise because the BIC favors simpler models and tends to retain only extremely significant features. As we can see in the summary output of the final model, all four retained features are highly statistically significant.
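The only thing that turns stepAIC() into a BIC-based search is the penalty argument k = log(n). A quick self-contained check of this relationship, using a small simulated logistic fit (hypothetical data, not the dataCar model): for binary 0/1 responses the deviance equals minus twice the log-likelihood, so AIC and BIC differ only in the per-parameter penalty.

```r
# Simulate a toy logistic regression
set.seed(1)
n <- 200
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-1 + 0.5 * x))
fit <- glm(y ~ x, family = binomial)

p <- length(coef(fit))                     # number of estimated parameters
aic.manual <- fit$deviance + 2 * p         # AIC penalty: 2 per parameter
bic.manual <- fit$deviance + log(n) * p    # BIC penalty: log(n) per parameter

c(AIC(fit), aic.manual, BIC(fit), bic.manual)
```

Since log(50853) is roughly 10.8, each extra feature must reduce the deviance by about 10.8 (rather than 2) to survive, which explains why the BIC search stops at only four features.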
CHUNK 19
generates the AUC for the final model.
# CHUNK 19
pred.reduced <- predict(logit.reduced, newdata = data.test.bin, type = "response")
roc.reduced <- roc(data.test$clm, pred.reduced)
auc(roc.reduced)
Area under the curve: 0.6595
The AUC, 0.6595, is slightly higher than that of the full model (0.6581). Not only does the final model have slightly better prediction performance, it is also much simpler, with only four features. In terms of both prediction performance and interpretability, the final model is superior, and we will use it in the final task.
Interpretation of Model Results
TASK 10: Interpret the model
|
To wrap up our model development procedure, we refit the reduced logistic regression model to the full dataset (note the argument data = dataCar.bin
, not data = data.train.bin
) and save it as a glm
object named logit.final
. This refitting is done because in a real-world application new data will be available in the future on a periodic basis and these data can be combined with the existing data to improve the robustness of the model. For moderately sized datasets, this refitting may give rise to a meaningful improvement in the accuracy of the parameter estimates with the use of extra observations. The results are in CHUNK 20
.
# CHUNK 20
logit.final <- glm(clm ~ log_veh_value + agecat.1 + agecat.5 + agecat.6,
                   data = dataCar.bin,
                   offset = log(exposure),
                   family = binomial)
summary(logit.final)

# Exponentiate the coefficients to get the multiplicative effects
exp(coef(logit.final))
Call:
glm(formula = clm ~ log_veh_value + agecat.1 + agecat.5 + agecat.6,
    family = binomial, data = dataCar.bin, offset = log(exposure))

Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept)   -1.86185    0.02205 -84.438  < 2e-16 ***
log_veh_value  0.15843    0.02532   6.257 3.92e-10 ***
agecat.1       0.25731    0.05138   5.008 5.51e-07 ***
agecat.5      -0.25433    0.04603  -5.525 3.29e-08 ***
agecat.6      -0.23858    0.05784  -4.125 3.72e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 32572  on 67802  degrees of freedom
Residual deviance: 32445  on 67798  degrees of freedom
AIC: 32455

Number of Fisher Scoring iterations: 5

> # Exponentiate the coefficients to get the multiplicative effects
> exp(coef(logit.final))
  (Intercept) log_veh_value      agecat.1      agecat.5      agecat.6
    0.1553857     1.1716752     1.2934442     0.7754369     0.7877484
There are two ways to interpret the output of a fitted logistic regression model.
Odds-Based Interpretation
Table 4.4 provides the odds-based interpretation of the final model using the value of each estimated coefficient. The statements there are in terms of multiplicative changes in the odds of claim occurrence. Equivalently, we can also state the findings in terms of percentage changes. For example, the odds of claim occurrence are estimated to increase by e^0.15843 - 1 = 17.17% for every unit increase in log_veh_value, holding all other features fixed.
Table 4.4: Odds-based interpretation of the reduced logistic regression model fitted to the full dataCar
dataset.
Feature | Coefficient Estimate | Interpretation |
log_veh_value | 0.15843 | A unit increase in the log of vehicle value is associated with a multiplicative increase of e^0.15843 = 1.1717 (> 1) in the odds of claim occurrence, holding all other features fixed. The fact that this multiplier exceeds 1 makes intuitive sense: the more valuable the vehicle, the more likely a driver is to submit an insurance claim in the case of accidental damage. |
agecat.1 | 0.25731 | The odds of claim occurrence for policyholders in age category 1 are e^0.25731 = 1.2934 (> 1) times those in the baseline level (which is now categories 2, 3, and 4 combined). It is possible that younger drivers are more reckless and tend to have higher accident rates. |
agecat.5 | -0.25433 | The odds of claim occurrence for policyholders in age category 5 are e^-0.25433 = 0.7754 (< 1) times those in the baseline level. This aligns with intuition, as more mature drivers are more careful and tend to have lower accident rates. |
agecat.6 | -0.23858 | The odds of claim occurrence for policyholders in age category 6 are e^-0.23858 = 0.7877 (< 1) times those in the baseline level. |
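The arithmetic behind Table 4.4 can be checked directly by exponentiating the coefficient estimates. The values below are hard-coded from the CHUNK 20 output rather than refitted, so this is just a verification sketch:

```r
# Coefficient estimates quoted in CHUNK 20 (hard-coded, not refitted)
b <- c(log_veh_value = 0.15843, agecat.1 = 0.25731,
       agecat.5 = -0.25433, agecat.6 = -0.23858)

odds.multiplier <- exp(b)          # multiplicative effect on the odds
pct.change <- 100 * (exp(b) - 1)   # the same effect as a percentage change

round(odds.multiplier, 4)
round(pct.change, 2)
```

The two forms are interchangeable: a multiplier of 1.1717 on the odds is exactly a 17.17% increase.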
Probability-Based Interpretation
Translating the results of a logistic regression model into statements for the probability of the event of interest is harder. This is because the probability p = e^η / (1 + e^η) = 1 / (1 + e^(-η)) is a non-linear, rather complex function of the linear predictor, which makes it more difficult to describe how much the probability will change (in multiplicative or percentage terms) in response to a unit increase in a predictor.
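A small numerical illustration of this non-linearity (the η values below are chosen arbitrarily): the same one-unit increase in the linear predictor moves the probability by very different amounts depending on the starting point.

```r
# plogis() is R's built-in inverse logit: 1 / (1 + exp(-eta))
eta <- c(-3, -1, 0, 1, 3)
p <- plogis(eta)
p.plus1 <- plogis(eta + 1)  # probability after a one-unit increase in eta

round(cbind(eta, p, change = p.plus1 - p), 4)
```

The change in probability is largest near p = 0.5 and shrinks toward the tails, so no single additive or multiplicative statement about p holds across all policyholders.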
One interesting way to make probability-based statements, as suggested in the December 2019 PA exam and the Hospital Readmissions sample project, is to use as the baseline case the “average policyholder,” who has the mean values of numeric predictors and modal (i.e., most common) levels of categorical predictors. Then we change the value of one and only one feature at a time and examine the isolated impact of that change on the claim occurrence probability.
In CHUNK 21, we create a data frame called new.data containing the feature combinations we want to explore. The first row of the data frame contains the feature values of the "average policyholder," who has exposure = 0.468481, log_veh_value = 0.38675, and agecat.1, agecat.5, and agecat.6 all equal to 0 (i.e., the policyholder is in the baseline age category). In each of the five remaining rows, we alter the value of one and only one feature (exposure, log_veh_value, agecat, in that order) to look at the isolated effect of that predictor on the predicted claim occurrence probability.
# CHUNK 21
# To look at mean or modal values
summary(dataCar)
      clm              exposure              veh_body     veh_age   gender    area      agecat    log_veh_value
 Min.   :0.00000   Min.   :0.002738   OTHER    :67644   3:20060   F:38584   C:20530   4:16179   Min.   :-1.71480
 1st Qu.:0.00000   1st Qu.:0.219028   BUS-MCARA:  159   1:12254   M:29219   A:16302   1: 5734   1st Qu.: 0.00995
 Median :0.00000   Median :0.446270                     2:16582             B:13328   2:12868   Median : 0.40547
 Mean   :0.06811   Mean   :0.468481                     4:18907             D: 8161   3:15757   Mean   : 0.38675
 3rd Qu.:0.00000   3rd Qu.:0.709103                               E: 5907   5:10722   3rd Qu.: 0.76547
 Max.   :1.00000   Max.   :0.999316                               F: 3575   6: 6543   Max.   : 3.54270
new.data <- data.frame(
  exposure = c(0.468481, 0.515329, 0.468481, 0.468481, 0.468481, 0.468481),
  log_veh_value = c(0.38675, 0.38675, 0.42543, 0.38675, 0.38675, 0.38675),
  agecat.1 = c(0, 0, 0, 1, 0, 0),
  agecat.5 = c(0, 0, 0, 0, 1, 0),
  agecat.6 = c(0, 0, 0, 0, 0, 1)
)
new.data
exposure log_veh_value agecat.1 agecat.5 agecat.6 1 0.468481 0.38675 0 0 0 2 0.515329 0.38675 0 0 0 3 0.468481 0.42543 0 0 0 4 0.468481 0.38675 1 0 0 5 0.468481 0.38675 0 1 0 6 0.468481 0.38675 0 0 1
predict(logit.final, newdata = new.data, type = "response")
1 2 3 4 5 6 0.07183549 0.07845544 0.07224517 0.09099701 0.05661722 0.05746447
Table 4.5 summarizes the results (the feature that is changed is shown in bold).
exposure | log_veh_value | agecat.1 | agecat.5 | agecat.6 | Predicted Probability |
0.468481 | 0.38675 | 0 | 0 | 0 | 0.07183549 |
0.515329 | 0.38675 | 0 | 0 | 0 | 0.07845544 |
0.468481 | 0.42543 | 0 | 0 | 0 | 0.07224517 |
0.468481 | 0.38675 | 1 | 0 | 0 | 0.09099701 |
0.468481 | 0.38675 | 0 | 1 | 0 | 0.05661722 |
0.468481 | 0.38675 | 0 | 0 | 1 | 0.05746447 |
We can see that:
- A 10% increase in exposure increases the claim occurrence probability by 0.0785 - 0.0718 = 0.0067.
- A 10% increase in log_veh_value increases the claim occurrence probability by 0.0722 - 0.0718 = 0.0004.
- Being in age category 1 increases the claim occurrence probability by 0.0910 - 0.0718 = 0.0192 compared to being in the baseline age category.
- Being in age category 5 decreases the claim occurrence probability by 0.0718 - 0.0566 = 0.0152 compared to being in the baseline age category.
- Being in age category 6 decreases the claim occurrence probability by 0.0718 - 0.0575 = 0.0143 compared to being in the baseline age category.
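As a sanity check, the first predicted probability above can be reproduced by hand from the rounded CHUNK 20 coefficients, remembering to include the log(exposure) offset in the linear predictor. Because the coefficients are rounded, the result matches predict() only approximately:

```r
# Linear predictor for the "average policyholder": intercept plus the
# log_veh_value coefficient times its mean, plus the log(exposure) offset;
# all agecat dummies are 0 (baseline age category)
eta <- -1.86185 + 0.15843 * 0.38675 + log(0.468481)

p <- plogis(eta)  # inverse logit of the linear predictor
round(p, 4)       # approximately the 0.0718 reported by predict()
```

This makes explicit that predict(..., type = "response") for this model is simply the inverse logit of the coefficient terms plus the offset.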
The signs of these changes align well with the signs of the coefficient estimates in CHUNK 20
. For example, an increase in log_veh_value
and being in age category 1 contribute positively to the claim occurrence probability, whereas being in age categories 5 and 6 exerts a negative effect. Regardless of whether our statements are based on odds or probability, overall our results shed light on the directional impact of the key risk factors affecting the claim-generating mechanism and inform the insurance company as to what characteristics of its policyholders and their vehicles should enter its ratemaking plan.