Case Study 3: GLMs for Count and Aggregate Loss Variables
Learning Objectives
- Select appropriate distributions and link functions for count and severity variables.
- Identify appropriate offsets and weights for count and severity variables.
- Implement GLMs for count and severity variables in R.
- Assess the quality of a Poisson GLM using the Pearson goodness-of-fit statistic.
- Combine the GLMs for count and severity variables to make predictions for an aggregate loss variable.
Background
Compiled by the Swedish Committee on the Analysis of Risk Premium in Motor Insurance, the dataset in this case study describes third-party automobile insurance claims for the year 1977 (third-party claims entail payments to someone other than the policyholder and the insurer).
This dataset involves grouped observations, with each row of the dataset corresponding to a collection of (homogeneous) policyholders sharing the same set of predictor values rather than a single policyholder. The variables include various automobile and policyholder characteristics such as the distance driven by a vehicle, geographic area, and recent driver claims experience (see Table 4.6 for the data dictionary). Unlike Section 4.3, where our interest is only the probability of making a claim, the outcomes of interest in this case study are twofold:
- The number of claims (the Claims variable)
- The sum of claim payments (the Payment variable)
Note that although our ultimate interest is the aggregate payments arising from an automobile policy, modeling the number of claims is an important stepping stone not only because doing so can shed light on the claims-generating mechanism, but also because the number of claims may serve as a valuable input for modeling aggregate payments, as we will see later.
Stage 1: Define the Business Problem
Objective
Our objectives in this case study are to identify important determinants of the two target variables and to use these determinants to make accurate predictions with GLMs. Even though the dataset we use dates back to 1977, our findings may have implications for automobile ratemaking in this day and age by helping automobile insurers set fair premiums based on useful rating variables.
Stage 2: Data Collection
Run CHUNK 1 to read in the data, store it in a data frame named Swedish, and print a summary of each variable. There are 2182 observations and 7 variables. The summary does not suggest any missing or abnormal values.
# CHUNK 1
Swedish <- read.csv("SwedishMotorInsurance.csv", stringsAsFactors = TRUE)
summary(Swedish)
   Kilometres        Zone           Bonus            Make          Insured             Claims            Payment
 Min.   :1.000   Min.   :1.00   Min.   :1.000   Min.   :1.000   Min.   :     0.01   Min.   :   0.00   Min.   :       0
 1st Qu.:2.000   1st Qu.:2.00   1st Qu.:2.000   1st Qu.:3.000   1st Qu.:    21.61   1st Qu.:   1.00   1st Qu.:    2989
 Median :3.000   Median :4.00   Median :4.000   Median :5.000   Median :    81.53   Median :   5.00   Median :   27404
 Mean   :2.986   Mean   :3.97   Mean   :4.015   Mean   :4.992   Mean   :  1092.20   Mean   :  51.87   Mean   :  257008
 3rd Qu.:4.000   3rd Qu.:6.00   3rd Qu.:6.000   3rd Qu.:7.000   3rd Qu.:   389.78   3rd Qu.:  21.00   3rd Qu.:  111954
 Max.   :5.000   Max.   :7.00   Max.   :7.000   Max.   :9.000   Max.   :127687.27   Max.   :3338.00   Max.   :18245026
| Variable | Description | Characteristics |
|---|---|---|
| Kilometres | Distance driven by a vehicle in km per year, grouped into five categories (it is not exactly the distance driven, but a higher number does mean a longer distance) | Integer from 1 to 5 |
| Zone | Geographic zone of a vehicle | Integer from 1 to 7 |
| Bonus | No claims bonus, equal to the number of years, plus one, since the last claim | Integer from 1 to 7 |
| Make | Type of a vehicle | Integer from 1 to 8 representing eight different common car models; all other models are combined in class 9 |
| Insured | Number of policyholder years (fraction of the year that the policyholder has a contract with the issuing company) | Numeric from 0.01 to 127687.27 |
| Claims | Number of claims | Integer from 0 to 3338 |
| Payment | Sum of payments in Swedish kronor | Integer from 0 to 18245026 |
TASK 1: Select offsets
Select an appropriate offset for each of the two target variables, claim counts and aggregate payments.
There are two target variables in this case study:
- the number of claims (Claims), and
- the aggregate payments (Payment).

For the number of claims, it is reasonable to expect that it is proportional to the number of policyholder years (Insured), which provides an exposure basis. Other things equal, the longer a policyholder is covered by an automobile insurance policy, the larger the number of claims submitted.

In the same vein, we expect that aggregate payments vary in proportion to the number of claims. Each claim generates a payment (see CHUNK 3) and it makes intuitive sense that with more claims submitted, the sum of payments becomes larger.

At this stage, we can conjecture that:
- Insured may be a good offset for Claims, and
- Claims in turn may serve as a good offset for Payment.

In CHUNK 2, we construct two scatterplots, one for Claims against Insured and one for Payment against Claims.
# CHUNK 2
library(ggplot2)  # attach ggplot2 in case it is not already loaded

ggplot(Swedish, aes(x = Insured, y = Claims)) +
  geom_point()

ggplot(Swedish, aes(x = Claims, y = Payment)) +
  geom_point()
Findings:
- It is clear that there is a strong proportional relationship between Claims and Insured and between Payment and Claims. This supports the use of Insured and Claims as offsets (to be included on a suitable scale) for Claims and Payment, respectively.

Offset for Claims: Insured
Offset for Payment: Claims

Although not strictly required by the task statement, we can make two additional observations involving Claims, Insured, and Payment:
1. Regarding the relationship between Payment and Claims, notice that if Claims is zero, then Payment must be zero as well. This is intuitively clear (when there are no claims, there are no payments) and is confirmed in CHUNK 3, which shows that Payment is degenerate at zero in the subset of Swedish with no claims. Since claim payments are observed only when there are one or more claims, we have created (via logical subsetting) a subset of Swedish where Claims is strictly positive, called Swedish.severity, which will be useful for modeling claim payments later.
# CHUNK 3
summary(Swedish[Swedish$Claims == 0, ]$Payment)

Swedish.severity <- Swedish[Swedish$Claims > 0, ]
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
      0       0       0       0       0       0
2. To facilitate later use, in CHUNK 4 we define the two offset-corrected target variables, called ClaimsPerinsured and PaymentPerClaim, which are the number of claims per policyholder year and the payment per claim, respectively. Traditionally, in the actuarial literature PaymentPerClaim is referred to as claim severity, which is the size of one payment and is not to be confused with the aggregate payments (what you learned in Exam STAM should help!). Note that PaymentPerClaim is only defined on Swedish.severity (or else the division by a zero Claims would create computational trouble).
# CHUNK 4
Swedish$ClaimsPerinsured <- Swedish$Claims / Swedish$Insured

Swedish.severity$PaymentPerClaim <- Swedish.severity$Payment / Swedish.severity$Claims
The following two scatterplots show that the size as well as the variability of ClaimsPerinsured and PaymentPerClaim shrink significantly with Insured and Claims, respectively, showing that averaging goes a long way towards stabilizing the two target variables. We will take advantage of these findings when we construct GLMs in the later part of this case study.
# CHUNK 4 (cont.)
ggplot(Swedish, aes(x = Insured, y = ClaimsPerinsured)) +
  geom_point()

ggplot(Swedish.severity, aes(x = Claims, y = PaymentPerClaim)) +
  geom_point()
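A heuristic way to see why averaging stabilizes the two targets (under simplifying assumptions we have not yet imposed on the data): if Claims has mean and variance proportional to the exposure Insured, as for a Poisson count, then

\(\operatorname{Var}\left(\dfrac{\text{Claims}}{\text{Insured}}\right)=\dfrac{\operatorname{Var}(\text{Claims})}{\text{Insured}^2}\propto\dfrac{1}{\text{Insured}}\),

and if Payment behaves roughly like the sum of Claims independent payments of comparable variability, then \(\operatorname{Var}(\text{PaymentPerClaim})\propto 1/\text{Claims}\). Both variances shrink as the offset variable grows, which matches the funnel shapes in the scatterplots.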
TASK 2: Edit and explore the data
Edit and explore the data to prepare for further analysis and identify potentially predictive variables.
Data Quality Issue
Factor Conversions
In the Swedish dataset, all variables are represented by numeric values. However, Zone and Make are merely numerical labels of the geographical zone and the type of a vehicle and do not convey any numerical order (e.g., there is no indication that Zone 2 is larger than Zone 1 in any sense). These two variables are better converted to factors, as we do in CHUNK 5.
# CHUNK 5
Swedish$Zone <- as.factor(Swedish$Zone)
Swedish$Make <- as.factor(Swedish$Make)

Swedish.severity$Zone <- as.factor(Swedish.severity$Zone)
Swedish.severity$Make <- as.factor(Swedish.severity$Make)
As for Kilometres and Bonus, these two variables do have a numerical order (e.g., Kilometres = 2 indicates a longer distance traveled than Kilometres = 1), but they can also be converted to factors represented by dummy variables to capture their effects on the target variables more effectively. We will leave them as numeric variables for now and decide whether changing them to factors is worthwhile when it comes to the modeling stage.
Stage 3: Exploratory Data Analysis
Univariate Exploration
Distributions of the Target Variables
In CHUNK 6, we explore the distributions of the two offset-corrected target variables, ClaimsPerinsured and PaymentPerClaim, using summary statistics (via the summary() function) in combination with graphical displays (histograms here).
# CHUNK 6
summary(Swedish$ClaimsPerinsured)
summary(Swedish.severity$PaymentPerClaim)

ggplot(Swedish, aes(x = ClaimsPerinsured)) +
  geom_histogram()
> summary(Swedish$ClaimsPerinsured)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
 0.00000  0.02648  0.05123  0.06914  0.08906  1.66667

> summary(Swedish.severity$PaymentPerClaim)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
     72    2643    4375    5206    5985   31442
Both ClaimsPerinsured and PaymentPerClaim are non-negative continuous variables (while Claims takes only non-negative integers, the division by Insured makes ClaimsPerinsured continuous). The former ranges from 0 (corresponding to no claims) to 1.66667, while the latter ranges from 72 to 31,442.
Even after being adjusted by the offsets, the two target variables exhibit a pronounced right skew, especially PaymentPerClaim, as shown in the histograms and reflected by their means substantially exceeding their medians. We will take this right skew into account when developing GLMs later.
Bivariate Exploration
Potentially Useful Predictors
Having investigated the two target variables, we go on to perform bivariate data exploration in search of potentially predictive variables out of Kilometres, Bonus, Zone, and Make. CHUNK 7 provides the means and medians of the two offset-corrected target variables split by the distinct values or levels of the four predictors. (Split boxplots can also be used here.)
# CHUNK 7
library(tidyverse)

vars <- c("Kilometres", "Bonus", "Zone", "Make")
for (i in vars) {
  x <- Swedish %>%
    group_by_(i) %>%  # group_by_() accepts the column name as a string
    summarise(
      mean = mean(ClaimsPerinsured),
      median = median(ClaimsPerinsured),
      n = n()
    )
  print(x)
}

# A tibble: 5 × 4
  Kilometres   mean median     n
       <int>  <dbl>  <dbl> <int>
1          1 0.0561 0.0425   439
2          2 0.0651 0.0488   441
3          3 0.0718 0.0552   441
4          4 0.0705 0.0555   434
5          5 0.0827 0.0574   427

# A tibble: 7 × 4
  Bonus   mean median     n
  <int>  <dbl>  <dbl> <int>
1     1 0.129  0.117    307
2     2 0.0792 0.0741   312
3     3 0.0676 0.0591   310
4     4 0.0659 0.0514   310
5     5 0.0550 0.0454   313
6     6 0.0524 0.0447   315
7     7 0.0364 0.0350   315

# A tibble: 7 × 4
  Zone    mean median     n
  <fct>  <dbl>  <dbl> <int>
1 1     0.104  0.0862   315
2 2     0.0795 0.0645   315
3 3     0.0722 0.0575   315
4 4     0.0575 0.0473   315
5 5     0.0626 0.0469   313
6 6     0.0569 0.0435   315
7 7     0.0504 0        294

# A tibble: 9 × 4
  Make    mean median     n
  <fct>  <dbl>  <dbl> <int>
1 1     0.0761 0.0614   245
2 2     0.0802 0.0607   245
3 3     0.0576 0.0389   242
4 4     0.0333 0.0216   238
5 5     0.0919 0.0691   244
6 6     0.0543 0.0430   244
7 7     0.0838 0.0555   242
8 8     0.0729 0.0487   237
9 9     0.0712 0.0630   245
# CHUNK 7 (cont.)
for (i in vars) {
  x <- Swedish.severity %>%
    group_by_(i) %>%
    summarise(
      mean = mean(PaymentPerClaim),
      median = median(PaymentPerClaim),
      n = n()
    )
  print(x)
}

# A tibble: 5 × 4
  Kilometres  mean median     n
       <int> <dbl>  <dbl> <int>
1          1 4912.  4239.   381
2          2 5068.  4650.   399
3          3 5445.  4494.   378
4          4 5065.  3901.   328
5          5 5601.  4207.   311

# A tibble: 7 × 4
  Bonus  mean median     n
  <int> <dbl>  <dbl> <int>
1     1 4913.  4195.   252
2     2 5451.  4238.   246
3     3 4899.  3974.   246
4     4 5257.  4095.   240
5     5 5188.  3900    247
6     6 5251.  4464.   270
7     7 5440.  5107.   296

# A tibble: 7 × 4
  Zone   mean median     n
  <fct> <dbl>  <dbl> <int>
1 1     4959.  4399.   295
2 2     4569.  4142.   295
3 3     5167.  4500.   293
4 4     5411.  4904.   306
5 5     5625.  4145.   236
6 6     5702.  4775.   264
7 7     5016.  2605.   108

# A tibble: 9 × 4
  Make   mean median     n
  <fct> <dbl>  <dbl> <int>
1 1     5083.  4828.   227
2 2     4854.  4313.   205
3 3     5950.  4438.   178
4 4     4877.  3367    155
5 5     4986.  3797.   206
6 6     5193.  4482.   212
7 7     4759.  3429    197
8 8     6603.  4387    173
9 9     4849.  4791.   244
# Remove redundant variables
rm(i, vars, x)
- ClaimsPerinsured: It appears that ClaimsPerinsured has a monotonic relationship with both Kilometres and Bonus, increasing in Kilometres and decreasing in Bonus. These findings make intuitive sense, as a longer distance traveled indicates a more worn-out vehicle and a smaller Bonus suggests poorer driving experience of a policyholder. The average of ClaimsPerinsured is also particularly low in certain geographical zones, such as 4, 5, 6, and 7, and for certain car models, such as 3, 4, and 6.
- PaymentPerClaim: The effects of the four predictors on PaymentPerClaim are not as pronounced as for ClaimsPerinsured. Unlike ClaimsPerinsured, Kilometres and Bonus do not seem to exert a monotonic effect on PaymentPerClaim. We can see that PaymentPerClaim has relatively close means and medians across different values of Kilometres and Bonus, relatively high means and medians for Zone 4 and Zone 6, and relatively low medians for cars of Make 4 and Make 7.
TASK 3: Construct a GLM for claim counts
Stage 4: Model Construction and Selection
Creation of Training and Test sets
Before constructing GLMs for Claims and Payment, we perform the usual step of splitting the data into the training and test sets. The training set (70% of the data) will be used for fitting the GLMs and the test set (30% of the data) for evaluating their prediction accuracy. There is a subtlety in the current context: we have two target variables here, so which variable should we base the stratified sampling on when doing the training/test set split?

In CHUNK 8, we do the stratified sampling based on the number of claims per insured, ClaimsPerinsured, and you will see why very shortly.
# CHUNK 8
library(caret)
set.seed(14)

train_ind <- createDataPartition(Swedish$ClaimsPerinsured, p = 0.7, list = FALSE)
data.train <- Swedish[train_ind, ]
data.test <- Swedish[-train_ind, ]

# Take out the severity portions of the training and test sets
data.train.severity <- data.train[data.train$Claims > 0, ]
data.train.severity$PaymentPerClaim <- data.train.severity$Payment / data.train.severity$Claims
data.test.severity <- data.test[data.test$Claims > 0, ]
data.test.severity$PaymentPerClaim <- data.test.severity$Payment / data.test.severity$Claims

print("TRAIN")
mean(data.train$ClaimsPerinsured)
mean(data.train.severity$PaymentPerClaim)

print("TEST")
mean(data.test$ClaimsPerinsured)
mean(data.test.severity$PaymentPerClaim)
> print("TRAIN")
[1] "TRAIN"
> mean(data.train$ClaimsPerinsured)
[1] 0.06896561
> mean(data.train.severity$PaymentPerClaim)
[1] 5253.53
> print("TEST")
[1] "TEST"
> mean(data.test$ClaimsPerinsured)
[1] 0.06955017
> mean(data.test.severity$PaymentPerClaim)
[1] 5094.656
The average number of claims per insured on the training set and that on the test set are very close to each other, which is a result of the built-in stratification. To our delight, the two sets also produce a similar average claim severity, which can be attributed to the strong proportional relationship between Payment and Claims we saw in CHUNK 2 and the comparable mean values of ClaimsPerinsured on the training and test sets.
Fitting the Model
A GLM for Claims with Kilometres and Bonus as numeric variables
Because Claims is a non-negative, integer-valued variable with a right skew, the Poisson distribution is a suitable target distribution. We will adopt the log link, which is most commonly employed in conjunction with the Poisson distribution as it ensures that the model predictions are always non-negative, makes for an interpretable model, and is the canonical link.
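With the log link and log(Insured) as an offset, the model to be fitted takes the form

\(\ln E[\text{Claims}]=\ln(\text{Insured})+\eta\), or equivalently \(E[\text{Claims}]=\text{Insured}\times e^{\eta}\),

so the expected claim count scales in proportion to the exposure Insured, exactly as conjectured in Task 1.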
In CHUNK 9, we fit a log-link Poisson GLM for Claims on the training set using Kilometres, Zone, Bonus, and Make as predictors and the logarithm of Insured as an offset. Here Kilometres and Bonus are treated as numeric variables. The GLM is saved as an object named count.numeric to distinguish it from other GLMs we will fit later.
# CHUNK 9
count.numeric <- glm(Claims ~ Kilometres + Zone + Bonus + Make,
                     data = data.train,
                     offset = log(Insured),
                     family = poisson)
summary(count.numeric)
Call:
glm(formula = Claims ~ Kilometres + Zone + Bonus + Make, family = poisson,
    data = data.train, offset = log(Insured))

Coefficients:
             Estimate Std. Error  z value Pr(>|z|)
(Intercept) -1.813830   0.016746 -108.311  < 2e-16 ***
Kilometres   0.131218   0.003028   43.330  < 2e-16 ***
Zone2       -0.230780   0.011698  -19.728  < 2e-16 ***
Zone3       -0.382588   0.012185  -31.397  < 2e-16 ***
Zone4       -0.573191   0.010981  -52.197  < 2e-16 ***
Zone5       -0.328492   0.018044  -18.205  < 2e-16 ***
Zone6       -0.522192   0.014480  -36.063  < 2e-16 ***
Zone7       -0.727917   0.050782  -14.334  < 2e-16 ***
Bonus       -0.201320   0.001609 -125.103  < 2e-16 ***
Make2        0.047689   0.026055    1.830   0.0672 .
Make3       -0.268146   0.033077   -8.107 5.20e-16 ***
Make4       -0.687657   0.027191  -25.290  < 2e-16 ***
Make5        0.154635   0.023003    6.722 1.79e-11 ***
Make6       -0.350481   0.020309  -17.258  < 2e-16 ***
Make7       -0.061202   0.027085   -2.260   0.0238 *
Make8       -0.074676   0.040067   -1.864   0.0624 .
Make9       -0.069484   0.011548   -6.017 1.78e-09 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 23800.5  on 1529  degrees of freedom
Residual deviance:  2681.6  on 1513  degrees of freedom
AIC: 8059.4

Number of Fisher Scoring iterations: 4
Note that:
- In the formula argument of the glm() function, we deliberately exclude Payment and ClaimsPerinsured, which cannot serve as predictors of Claims, as well as Insured, which will play its role in the offset argument.
- The family argument of the glm() function is set to poisson (not Poisson!) to specify a Poisson GLM. The log link is the canonical (and default) link for the Poisson distribution and so does not have to be specified explicitly. (If you prefer, you can replace family = poisson by family = poisson() or family = poisson(link = "log").)
- Do remember to put Insured on the log scale, hence the offset = log(Insured) command (not offset = Insured!).
The model summary shows that the coefficient estimates for Kilometres and Bonus are positive and negative, respectively, consistent with what we saw in CHUNK 7. All features, including the individual factor levels of Zone and Make (except for levels 2 and 8 of Make), are statistically significant, indicating that they are valuable for accounting for Claims and obviating the need for feature removal.
Digression: Can we model ClaimsPerinsured directly?

Note that because of the log link, the equation of the Poisson GLM we have just fitted is equivalent to:

\(\ln\big(E[\text{ClaimsPerinsured}]\big)=\ln\left(\dfrac{E[\text{Claims}]}{\text{Insured}}\right)=\eta\),

where \(\eta\) is the linear predictor comprising Kilometres, Zone, Bonus, and Make. You may wonder:

To save work, can we instead fit a log-link Poisson GLM directly to ClaimsPerinsured without having to supply log(Insured) as an offset?

This is explored in CHUNK 10.
# CHUNK 10
freq <- glm(ClaimsPerinsured ~ Kilometres + Zone + Bonus + Make,
            data = data.train,
            weights = Insured,
            family = poisson)
summary(freq)
> summary(freq)

Call:
glm(formula = ClaimsPerinsured ~ Kilometres + Zone + Bonus + Make,
    family = poisson, data = data.train, weights = Insured)

Coefficients:
             Estimate Std. Error  z value Pr(>|z|)
(Intercept) -1.813830   0.016747 -108.311  < 2e-16 ***
Kilometres   0.131218   0.003028   43.330  < 2e-16 ***
Zone2       -0.230780   0.011698  -19.728  < 2e-16 ***
Zone3       -0.382588   0.012185  -31.397  < 2e-16 ***
Zone4       -0.573191   0.010981  -52.197  < 2e-16 ***
Zone5       -0.328492   0.018044  -18.205  < 2e-16 ***
Zone6       -0.522192   0.014480  -36.063  < 2e-16 ***
Zone7       -0.727917   0.050784  -14.334  < 2e-16 ***
Bonus       -0.201320   0.001609 -125.103  < 2e-16 ***
Make2        0.047689   0.026055    1.830   0.0672 .
Make3       -0.268146   0.033077   -8.107 5.20e-16 ***
Make4       -0.687657   0.027191  -25.290  < 2e-16 ***
Make5        0.154635   0.023003    6.722 1.79e-11 ***
Make6       -0.350481   0.020309  -17.258  < 2e-16 ***
Make7       -0.061202   0.027085   -2.260   0.0238 *
Make8       -0.074676   0.040067   -1.864   0.0624 .
Make9       -0.069484   0.011548   -6.017 1.78e-09 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 23800.5  on 1529  degrees of freedom
Residual deviance:  2681.6  on 1513  degrees of freedom
AIC: Inf

Number of Fisher Scoring iterations: 6
Notice that:
- As opposed to CHUNK 9, the target variable in the formula argument of the glm() function becomes ClaimsPerinsured and no offset is specified.
- Due to the averaging by Insured, the variability of ClaimsPerinsured is lowered and decreases with the value of Insured, as we saw in CHUNK 4. To reflect the different precision of the observations caused by different values of Insured, the command weights = Insured is in place to insert Insured as a weight in the model fitting procedure. Observations with a higher value of Insured will receive a higher weight.
With one exception, the summary for this model is identical to that in CHUNK 9, indicating that the two fitted models are essentially the same (the equivalence was first noted on page 296).
That notable exception is that the AIC of the model for ClaimsPerinsured (the frequency model) is Inf, while that of the model for Claims (the count model) is a normal value. When running CHUNK 10 in RStudio, you should also see various warning messages that stem from the non-integer values of ClaimsPerinsured (whereas a Poisson random variable takes only non-negative integer values). This inconsistency between the nature of ClaimsPerinsured and a genuine Poisson random variable prohibits the glm() function from computing the AIC of the frequency model. As a result, no feature selection based on the AIC can be performed.
Since the count model and the frequency model produce identical fitted results and the former is much easier to work with, there is no reason to dwell on the latter and we will no longer consider the frequency model in the rest of this case study.
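Although it is not one of the numbered chunks, a quick sanity check along the following lines confirms that the two fitted models carry the same coefficient estimates:

# Supplementary check (not an original CHUNK): the count and frequency models
# share the same estimated coefficients (up to numerical tolerance)
all.equal(coef(count.numeric), coef(freq), tolerance = 1e-6)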
A GLM for Claims with Kilometres and Bonus as factor variables

Now we refit the Poisson count model with Kilometres and Bonus treated as factor variables and name the fitted model count.factor. Notice that Kilometres and Bonus are replaced by factor(Kilometres) and factor(Bonus) in the model formula.
# CHUNK 11
count.factor <- glm(Claims ~ factor(Kilometres) + Zone + factor(Bonus) + Make,
                    data = data.train,
                    offset = log(Insured),
                    family = poisson)
summary(count.factor)
Call:
glm(formula = Claims ~ factor(Kilometres) + Zone + factor(Bonus) + Make,
    family = poisson, data = data.train, offset = log(Insured))

Coefficients:
                     Estimate Std. Error  z value Pr(>|z|)
(Intercept)         -1.793205   0.016512 -108.598  < 2e-16 ***
factor(Kilometres)2  0.194910   0.010023   19.446  < 2e-16 ***
factor(Kilometres)3  0.294007   0.009877   29.766  < 2e-16 ***
factor(Kilometres)4  0.370948   0.014373   25.809  < 2e-16 ***
factor(Kilometres)5  0.559839   0.014991   37.345  < 2e-16 ***
Zone2               -0.239285   0.011750  -20.364  < 2e-16 ***
Zone3               -0.382898   0.012331  -31.053  < 2e-16 ***
Zone4               -0.574708   0.011185  -51.382  < 2e-16 ***
Zone5               -0.337239   0.018218  -18.511  < 2e-16 ***
Zone6               -0.532676   0.014746  -36.123  < 2e-16 ***
Zone7               -0.708291   0.050845  -13.930  < 2e-16 ***
factor(Bonus)2      -0.453835   0.015951  -28.453  < 2e-16 ***
factor(Bonus)3      -0.699796   0.017949  -38.987  < 2e-16 ***
factor(Bonus)4      -0.823508   0.018985  -43.376  < 2e-16 ***
factor(Bonus)5      -0.906880   0.017397  -52.129  < 2e-16 ***
factor(Bonus)6      -0.983759   0.014285  -68.864  < 2e-16 ***
factor(Bonus)7      -1.332578   0.010936 -121.853  < 2e-16 ***
Make2                0.049512   0.026094    1.897   0.0578 .
Make3               -0.269633   0.033111   -8.143 3.85e-16 ***
Make4               -0.661879   0.027371  -24.182  < 2e-16 ***
Make5                0.152141   0.023016    6.610 3.84e-11 ***
Make6               -0.342875   0.020333  -16.863  < 2e-16 ***
Make7               -0.059942   0.027092   -2.213   0.0269 *
Make8               -0.086094   0.040131   -2.145   0.0319 *
Make9               -0.067240   0.011726   -5.734 9.80e-09 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 23800.5  on 1529  degrees of freedom
Residual deviance:  2016.8  on 1505  degrees of freedom
AIC: 7410.6

Number of Fisher Scoring iterations: 4
In the model summary, Kilometres and Bonus are now replaced by a collection of dummy variables representing their distinct levels other than the baseline levels. The signs and magnitudes of the coefficient estimates of these dummy variables are in agreement with the coefficient estimates of Kilometres and Bonus in CHUNK 9. The estimates of the dummy variables of Kilometres are all positive and increase in value from level 2 to level 5, while the opposite relationship is observed for those of Bonus from level 2 to level 7, consistent with the monotonic effects of Kilometres and Bonus on ClaimsPerinsured we observed earlier. The binarization of Kilometres and Bonus, however, allows us to capture the effects of these two variables more effectively across different factor levels, at the cost of an increase in the number of features in the model and in computational burden.
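As a supplementary check (not one of the numbered chunks), the AICs can be extracted directly, and because the numeric treatment of Kilometres and Bonus is a special case of the factor treatment, the two count models are nested, so a likelihood ratio test is also available:

# Supplementary check (not an original CHUNK): compare the two count models
AIC(count.numeric)  # 8059.4, from CHUNK 9
AIC(count.factor)   # 7410.6, from CHUNK 11

# Likelihood ratio test; the numeric model is nested within the factor model
anova(count.numeric, count.factor, test = "Chisq")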
Using the Pearson statistic as an arbitrator

Comparing CHUNKs 9 and 11, we can see that the GLM with binarized Kilometres and Bonus has a lower AIC and therefore appears to be the better model. The superiority of binarizing Kilometres and Bonus is also reflected by the Pearson goodness-of-fit statistic, defined by:
\(\dfrac{1}{n}\sum_{i=1}^{n}\dfrac{(y_i-\hat{y}_i)^2}{\hat{y}_i}\),

where \(y_i\) is the \(i\)th observation of the training or test set, \(\hat{y}_i\) is the corresponding fitted or predicted value, and \(n\) is the number of observations in the set.
Featured in the June 16, 2020 PA exam and covered in Exam SRM, the Pearson statistic is commonly used to assess the goodness of fit of a count model. It takes the square of the discrepancy between the observed value and the predicted value but unlike the MSE, it scales the squared discrepancy by the predicted value to avoid the statistic being disproportionately dominated by large predicted values, which are not uncommon with right-skewed data. Like the AIC, the smaller the Pearson statistic, the better the model.
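The statistic is easy to code. Here is a minimal sketch as a reusable helper (the name pearson_stat is ours, not part of the original chunks); CHUNK 12 below computes the same quantity inline:

# Hypothetical helper (not an original CHUNK): average Pearson statistic
pearson_stat <- function(observed, predicted) {
  mean((observed - predicted)^2 / predicted)
}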
In CHUNK 12, we generate the predictions of the two Poisson GLMs on the test set and evaluate the Pearson goodness-of-fit statistic for both models. Note that the predictions are for Claims, not ClaimsPerinsured, and have taken the offset Insured into account.
# CHUNK 12
count.numeric.pred <- predict(count.numeric, newdata = data.test, type = "response")
sum((data.test$Claims - count.numeric.pred)^2 / count.numeric.pred) / nrow(data.test)

count.factor.pred <- predict(count.factor, newdata = data.test, type = "response")
sum((data.test$Claims - count.factor.pred)^2 / count.factor.pred) / nrow(data.test)
> count.numeric.pred <- predict(count.numeric, newdata = data.test, type = "response")
> sum((data.test$Claims - count.numeric.pred)^2 / count.numeric.pred) / nrow(data.test)
[1] 2.311969

> count.factor.pred <- predict(count.factor, newdata = data.test, type = "response")
> sum((data.test$Claims - count.factor.pred)^2 / count.factor.pred) / nrow(data.test)
[1] 1.602462
In congruence with the AIC-based rankings, treating Kilometres and Bonus as factor variables leads to a lower Pearson statistic on the test set and appears to be a judicious move. The increase in flexibility offered by the dummy variables outweighs the increase in model complexity.
TASK 4: Construct a GLM for claim severity
A GLM for claim severity

We now turn to building GLMs for claim severity, or PaymentPerClaim. As CHUNK 6 showed, PaymentPerClaim is a positive, continuous variable with a right skew, so a gamma distribution is a good choice for the target distribution (inverse Gaussian is another choice).

Furthermore, CHUNK 4 indicated that the variability of PaymentPerClaim decreases with the value of Claims. This motivates the use of Claims as a weight when a gamma GLM is fitted for PaymentPerClaim.
In CHUNK 13, we fit a log-link gamma GLM for PaymentPerClaim on the severity portion of the training set, with Claims serving as a weight. Consistent with the claim count model in CHUNK 11, we binarize Kilometres and Bonus. (If you are interested, feel free to compare the performance of the GLM with numeric Kilometres and Bonus against that with binarized Kilometres and Bonus; a brief sketch of such a comparison appears right after CHUNK 13's output below. A fuller comparison is omitted here to reduce repetition.)
# CHUNK 13
severity <- glm(PaymentPerClaim ~ factor(Kilometres) + Zone + factor(Bonus) + Make,
                data = data.train.severity,
                weights = Claims,
                family = Gamma(link = "log"))
summary(severity)
Call:
glm(formula = PaymentPerClaim ~ factor(Kilometres) + Zone + factor(Bonus) + Make,
    family = Gamma(link = "log"), data = data.train.severity, weights = Claims)

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)
(Intercept)          8.39309    0.02801 299.694  < 2e-16 ***
factor(Kilometres)2  0.01253    0.01722   0.728 0.466926
factor(Kilometres)3  0.01459    0.01703   0.857 0.391764
factor(Kilometres)4  0.03893    0.02481   1.569 0.116860
factor(Kilometres)5  0.03361    0.02584   1.301 0.193644
Zone2                0.04370    0.02023   2.160 0.030966 *
Zone3                0.04870    0.02122   2.294 0.021930 *
Zone4                0.13249    0.01932   6.857 1.11e-11 ***
Zone5                0.04064    0.03141   1.294 0.196051
Zone6                0.15418    0.02541   6.068 1.72e-09 ***
Zone7                0.04576    0.08753   0.523 0.601199
factor(Bonus)2       0.06609    0.02748   2.405 0.016312 *
factor(Bonus)3       0.08396    0.03094   2.713 0.006757 **
factor(Bonus)4       0.08018    0.03274   2.449 0.014445 *
factor(Bonus)5       0.06293    0.03001   2.097 0.036201 *
factor(Bonus)6       0.08939    0.02465   3.627 0.000298 ***
factor(Bonus)7       0.12111    0.01887   6.420 1.94e-10 ***
Make2               -0.03875    0.04498  -0.861 0.389150
Make3                0.05197    0.05710   0.910 0.362879
Make4               -0.20358    0.04691  -4.340 1.54e-05 ***
Make5               -0.09568    0.03963  -2.414 0.015907 *
Make6               -0.05159    0.03501  -1.474 0.140771
Make7               -0.11042    0.04666  -2.367 0.018099 *
Make8                0.25276    0.06912   3.657 0.000266 ***
Make9               -0.05821    0.02014  -2.890 0.003915 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for Gamma family taken to be 2.964663)

    Null deviance: 3801.1  on 1259  degrees of freedom
Residual deviance: 3185.2  on 1235  degrees of freedom
AIC: 1257914

Number of Fisher Scoring iterations: 5
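As hinted above, a minimal sketch of the numeric-versus-binarized comparison for the severity model could look as follows (severity.numeric is our own name and is not defined in the original chunks):

# Supplementary sketch (not an original CHUNK): severity GLM with numeric
# Kilometres and Bonus, to be compared with the binarized model above
severity.numeric <- glm(PaymentPerClaim ~ Kilometres + Zone + Bonus + Make,
                        data = data.train.severity,
                        weights = Claims,
                        family = Gamma(link = "log"))
AIC(severity.numeric)
AIC(severity)  # the binarized model from CHUNK 13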
Most features in the GLM fitted in CHUNK 13 are statistically significant; the notable exception is Kilometres, which is removed when a backward stepwise selection based on the AIC is implemented, as shown in CHUNK 14. (For simplicity, we do not perform explicit binarization.)
# CHUNK 14
library(MASS)
severity.final <- stepAIC(severity)
Start:  AIC=1257914
PaymentPerClaim ~ factor(Kilometres) + Zone + factor(Bonus) + Make

                     Df Deviance     AIC
- factor(Kilometres)  4   3195.4 1257909
<none>                    3185.2 1257914
- factor(Bonus)       6   3312.1 1257945
- Make                8   3334.2 1257948
- Zone                6   3396.3 1257973

Step:  AIC=1258151
PaymentPerClaim ~ Zone + factor(Bonus) + Make
summary(severity.final)
Call:
glm(formula = PaymentPerClaim ~ Zone + factor(Bonus) + Make,
    family = Gamma(link = "log"), data = data.train.severity, weights = Claims)

Coefficients:
               Estimate Std. Error t value Pr(>|t|)
(Intercept)     8.40279    0.02680 313.506  < 2e-16 ***
Zone2           0.04430    0.02026   2.186 0.028972 *
Zone3           0.04808    0.02121   2.267 0.023554 *
Zone4           0.13333    0.01912   6.972 5.06e-12 ***
Zone5           0.03985    0.03123   1.276 0.202207
Zone6           0.15619    0.02510   6.222 6.71e-10 ***
Zone7           0.04550    0.08779   0.518 0.604383
factor(Bonus)2  0.06995    0.02748   2.546 0.011023 *
factor(Bonus)3  0.08940    0.03087   2.896 0.003844 **
factor(Bonus)4  0.08657    0.03262   2.654 0.008062 **
factor(Bonus)5  0.06956    0.02986   2.329 0.020009 *
factor(Bonus)6  0.09553    0.02444   3.909 9.79e-05 ***
factor(Bonus)7  0.12545    0.01875   6.692 3.32e-11 ***
Make2          -0.03607    0.04503  -0.801 0.423242
Make3           0.05835    0.05709   1.022 0.306992
Make4          -0.21254    0.04663  -4.558 5.66e-06 ***
Make5          -0.09632    0.03975  -2.423 0.015542 *
Make6          -0.05523    0.03504  -1.576 0.115251
Make7          -0.11148    0.04680  -2.382 0.017376 *
Make8           0.25775    0.06916   3.727 0.000203 ***
Make9          -0.05990    0.01993  -3.006 0.002702 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for Gamma family taken to be 2.985245)

    Null deviance: 3801.1  on 1259  degrees of freedom
Residual deviance: 3195.4  on 1239  degrees of freedom
AIC: 1258151

Number of Fisher Scoring iterations: 5
Lastly, in CHUNK 15 we apply the reduced GLM to produce predictions for Payment on the severity portion of the test set. Because the target variable of the model is PaymentPerClaim, we have to multiply the model predictions by Claims on the test set to get the predictions for Payment.
# CHUNK 15
severity.final.pred <- data.test.severity$Claims *
  predict(severity.final, newdata = data.test.severity, type = "response")

RMSE(data.test.severity$Payment, severity.final.pred)
RMSE(data.test.severity$Payment, mean(data.train.severity$Payment))
> severity.final.pred <- data.test.severity$Claims *
+   predict(severity.final, newdata = data.test.severity, type = "response")
>
> RMSE(data.test.severity$Payment, severity.final.pred)
[1] 69882.03
>
> RMSE(data.test.severity$Payment, mean(data.train.severity$Payment))
[1] 1108407
The test RMSE is 69,882.03, which seems large. To put this value in perspective, we compare it with the test RMSE that results from using the mean of Payment on the training set as the predicted value. The test RMSE in that case becomes 1,108,407, a much larger value. Our gamma GLM is therefore capable of reducing the test RMSE to only about 6.30% of the RMSE when no predictors are available.
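Here RMSE() is caret's helper for the root mean squared error, \(\sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i-\hat{y}_i)^2}\). The first figure above could equally well be computed by hand:

# Equivalent manual computation of the test RMSE for the gamma GLM
sqrt(mean((data.test.severity$Payment - severity.final.pred)^2))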
Predictions
TASK 5: Combine the two models to make a prediction for aggregate payments
Run the recommended GLMs in Tasks 3 and 4 on the full dataset.
Combining frequency and severity GLMs to predict aggregate payments
Let's refit the two GLMs in Tasks 3 and 4 to the full Swedish dataset. Here is the summary output.
# CHUNK 16
count.full <- glm(Claims ~ factor(Kilometres) + Zone + factor(Bonus) + Make,
                  data = Swedish,
                  offset = log(Insured),
                  family = poisson)
summary(count.full)
Call:
glm(formula = Claims ~ factor(Kilometres) + Zone + factor(Bonus) + Make,
    family = poisson, data = Swedish, offset = log(Insured))

Coefficients:
                     Estimate Std. Error  z value Pr(>|z|)
(Intercept)         -1.812840   0.013757 -131.775  < 2e-16 ***
factor(Kilometres)2  0.212586   0.007524   28.255  < 2e-16 ***
factor(Kilometres)3  0.320226   0.008661   36.974  < 2e-16 ***
factor(Kilometres)4  0.404657   0.012054   33.571  < 2e-16 ***
factor(Kilometres)5  0.575954   0.012830   44.892  < 2e-16 ***
Zone2               -0.238168   0.009496  -25.082  < 2e-16 ***
Zone3               -0.386395   0.009670  -39.959  < 2e-16 ***
Zone4               -0.581902   0.008654  -67.243  < 2e-16 ***
Zone5               -0.326128   0.014530  -22.446  < 2e-16 ***
Zone6               -0.526234   0.011877  -44.309  < 2e-16 ***
Zone7               -0.730999   0.040698  -17.962  < 2e-16 ***
factor(Bonus)2      -0.478993   0.012094  -39.607  < 2e-16 ***
factor(Bonus)3      -0.693172   0.013508  -51.316  < 2e-16 ***
factor(Bonus)4      -0.827397   0.014584  -56.735  < 2e-16 ***
factor(Bonus)5      -0.925632   0.013968  -66.269  < 2e-16 ***
factor(Bonus)6      -0.993457   0.011629  -85.429  < 2e-16 ***
factor(Bonus)7      -1.327406   0.008685 -152.845  < 2e-16 ***
Make2                0.076245   0.021239    3.590 0.000331 ***
Make3               -0.247413   0.025094   -9.859  < 2e-16 ***
Make4               -0.653524   0.024185  -27.022  < 2e-16 ***
Make5                0.154924   0.020235    7.656 1.91e-14 ***
Make6               -0.335581   0.017375  -19.314  < 2e-16 ***
Make7               -0.055940   0.023343   -2.396 0.016554 *
Make8               -0.043933   0.031604   -1.390 0.164493
Make9               -0.068054   0.009956   -6.836 8.17e-12 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 34070.6  on 2181  degrees of freedom
Residual deviance:  2966.1  on 2157  degrees of freedom
AIC: 10654

Number of Fisher Scoring iterations: 4
severity.full <- glm(Payment/Claims ~ Zone + factor(Bonus) + Make,
                     data = Swedish.severity,
                     weights = Claims,
                     family = Gamma(link = "log"))
summary(severity.full)
Call:
glm(formula = Payment/Claims ~ Zone + factor(Bonus) + Make,
    family = Gamma(link = "log"), data = Swedish.severity, weights = Claims)

Coefficients:
               Estimate Std. Error t value Pr(>|t|)
(Intercept)     8.41085    0.02222 378.570  < 2e-16 ***
Zone2           0.02297    0.01640   1.401 0.161428
Zone3           0.04770    0.01670   2.855 0.004348 **
Zone4           0.12963    0.01497   8.661  < 2e-16 ***
Zone5           0.05073    0.02509   2.022 0.043293 *
Zone6           0.14652    0.02053   7.137 1.39e-12 ***
Zone7           0.02259    0.07025   0.321 0.747879
factor(Bonus)2  0.04698    0.02085   2.254 0.024346 *
factor(Bonus)3  0.07390    0.02325   3.178 0.001508 **
factor(Bonus)4  0.06240    0.02507   2.489 0.012893 *
factor(Bonus)5  0.03987    0.02396   1.664 0.096308 .
factor(Bonus)6  0.07606    0.01987   3.828 0.000134 ***
factor(Bonus)7  0.12130    0.01480   8.194 4.78e-16 ***
Make2          -0.03177    0.03664  -0.867 0.386081
Make3           0.08932    0.04327   2.064 0.039159 *
Make4          -0.17496    0.04138  -4.228 2.48e-05 ***
Make5          -0.08771    0.03493  -2.511 0.012120 *
Make6          -0.04243    0.02995  -1.417 0.156670
Make7          -0.12068    0.04030  -2.995 0.002784 **
Make8           0.21863    0.05442   4.017 6.14e-05 ***
Make9          -0.05673    0.01711  -3.316 0.000931 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for Gamma family taken to be 2.979105)

    Null deviance: 5417.7  on 1796  degrees of freedom
Residual deviance: 4547.3  on 1776  degrees of freedom
AIC: 1878541

Number of Fisher Scoring iterations: 5
Note that the predictions made in CHUNK 15 on the severity portion of the test set for aggregate payments were made assuming that the values of Claims were available. In practice, one would predict the aggregate payments of a prospective policyholder based only on his/her rating variables, without knowing the number of claims he/she will submit; the number of claims will not be available until the policy term ends.
To illustrate how to predict the aggregate payments for a future policyholder, whose Claims is yet to be observed, we consider the sample policyholder described in the task statement. According to the data dictionary, the combination of feature values is:

| Kilometres | Zone | Bonus | Make | Insured |
|---|---|---|---|---|
| 2 | 1 | 1 | 6 | 350 |
This policyholder is created in CHUNK 17. Note that Zone and Make are created as character values, with the levels surrounded by quotes, because we converted these two variables to factors in CHUNK 5, before the GLMs were fitted. Kilometres and Bonus, however, are turned into factors inside the formulas of the GLMs, so they are supplied as numbers.
# CHUNK 17
sample <- data.frame(Kilometres = 2, Zone = "1", Bonus = 1, Make = "6", Insured = 350)
sample
  Kilometres Zone Bonus Make Insured
1          2    1     1    6     350
To predict the aggregate payments for this sample policyholder, we will apply the two GLMs to make predictions for the number of claims and claim severity separately, then combine the two predictions to form an overall prediction.
Step 1. Let’s start with predicting the number of claims.
# CHUNK 18
# Predicted number of claims
N.pred <- predict(count.full, newdata = sample, type = "response")
N.pred
1 50.50629
To check the prediction manually, we use the coefficient estimates in CHUNK 16 to compute the predicted linear predictor for the given policyholder:

\(\eta = -1.812840 + 0.212586 + 0 + 0 - 0.335581 = -1.935835\)
Together with the offset, the predicted value of Claims is \(\hat{N} = 350\,e^{-1.935835} = 50.5063\), which agrees with the value produced by R.
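The same arithmetic can be carried out in R (a supplementary line, not part of CHUNK 18):

# Manual check of the predicted claim count; matches N.pred above
350 * exp(-1.812840 + 0.212586 - 0.335581)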
Step 2. Next, we turn to the predicted claim severity.
# CHUNK 18 (cont.)
# Predicted claim severity
X.pred <- predict(severity.full, newdata = sample, type = "response")
X.pred
1 4308.826
The claim severity GLM says that 4,308.826 is the predicted size of one claim for this sample policyholder.
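An analogous manual check works here. The sample policyholder sits at the baseline levels of Zone and Bonus in severity.full, so only the intercept and the Make6 coefficient from CHUNK 16 enter (again a supplementary line, not part of CHUNK 18):

# Manual check of the predicted claim severity; matches X.pred above
exp(8.41085 - 0.04243)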
Step 3. Finally, we multiply the two predictions to get the predicted aggregate payments.
# CHUNK 18 (cont.)
# Predicted aggregate payments
S.pred <- N.pred * X.pred
S.pred
1 217622.8
An alternative prediction method
In the procedure above, we are basically decomposing the aggregate payments into two components, the claim counts and the claim severity, and predicting these two variables separately using different GLMs. Then we multiply the two predictions to form the overall prediction for the aggregate payments, using the following formula you may have learned from Exam STAM:
\(E[\text{aggregate payments}] = E[\text{number of claims}] \times E[\text{claim severity}]\),
which relies on the important assumption that claim counts and claim severity are independent random variables.
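To see where this formula comes from, write the aggregate payments in collective risk form \(S = X_1 + \dots + X_N\), where \(N\) is the number of claims and the claim sizes \(X_i\) are i.i.d. and independent of \(N\). Conditioning on \(N\) gives

\(E[S] = E\big[E[S \mid N]\big] = E\big[N\,E[X]\big] = E[N]\,E[X]\).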
An alternative is to fit a target distribution like the Tweedie distribution, discussed briefly in Subsection 4.1.1, to the aggregate payments (as a single target variable) directly and make a prediction using this distribution (see the illustrative sketch after the list below). This alternative modeling approach has the following advantages:
- It is more efficient to fit and maintain a single (possibly more complex) model than two separate models.
- It does not require the independence between claim counts and claim severity.
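For illustration only, such a fit might be sketched as follows. This is not one of the case study's chunks; it assumes the statmod package, whose tweedie() family can be passed to glm(), and the variance power p (set to 1.5 here) would in practice need to be tuned, for example by profile likelihood.

# Illustrative sketch only (not an original CHUNK); assumes statmod is installed
library(statmod)

agg.tweedie <- glm(Payment ~ factor(Kilometres) + Zone + factor(Bonus) + Make,
                   data = Swedish,
                   offset = log(Insured),  # exposure offset, as in the count model
                   family = tweedie(var.power = 1.5, link.power = 0))  # log link, 1 < p < 2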
EXAM NOTE
As we mentioned in Subsection 4.1.1, it is unlikely that you will actually fit a Tweedie distribution in Exam PA because the R package for Tweedie will not be available on the exam. A conceptual exam item, however, may ask you to describe the Tweedie distribution and how it can serve as an alternative for modeling aggregate payments.