
SOA ASA Exam: Predictive Analysis (PA) – 4.4. Generalized Linear Models Case Study 3


Case Study 3: GLMs for Count and Aggregate Loss Variables

Learning Objectives

  • Select appropriate distributions and link functions for count and severity variables.
  • Identify appropriate offsets and weights for count and severity variables.
  • Implement GLMs for count and severity variables in R.
  • Assess the quality of a Poisson GLM using the Pearson goodness-of-fit statistic.
  • Combine the GLMs for count and severity variables to make predictions for an aggregate loss variable.

 

Background

Compiled by the Swedish Committee on the Analysis of Risk Premium in Motor Insurance, the dataset in this case study describes third-party automobile insurance claims for the year 1977 (third-party claims entail payments to someone other than the policyholder and the insurer).

This dataset involves grouped observations, with each row of the dataset corresponding to a collection of (homogeneous) policyholders sharing the same set of predictor values rather than a single policyholder. The variables include various automobile and policyholder characteristics such as the distance driven by a vehicle, geographic area, and recent driver claims experience (see Table 4.6 for the data dictionary). Unlike Section 4.3, where our interest is only the probability of making a claim, the outcomes of interest in this case study are twofold:

  • The number of claims (the Claims variable)
  • The sum of claim payments (the Payment variable)

Note that although our ultimate interest is the aggregate payments arising from an automobile policy, modeling the number of claims is an important stepping stone not only because doing so can shed light on the claims-generating mechanism, but also because the number of claims may serve as a valuable input for modeling aggregate payments, as we will see later.

 

Stage 1: Define the Business Problem

Objective

Our objectives in this case study are to identify important determinants of the two target variables and to use these determinants to make accurate predictions with GLMs. Even though the dataset we use dates back to 1977, our findings may have implications for automobile ratemaking in this day and age by helping automobile insurers set fair premiums based on useful rating variables.

 

Stage 2: Data Collection

Run CHUNK 1 to read the data into a data frame named Swedish and print a summary of each variable. There are 2182 observations and 7 variables. The summary does not suggest any missing or abnormal values.

# CHUNK 1
Swedish <- read.csv("SwedishMotorInsurance.csv", stringsAsFactors=TRUE)
summary(Swedish)
   Kilometres         Zone          Bonus            Make          Insured              Claims           Payment        
 Min.   :1.000   Min.   :1.00   Min.   :1.000   Min.   :1.000   Min.   :     0.01   Min.   :   0.00   Min.   :       0  
 1st Qu.:2.000   1st Qu.:2.00   1st Qu.:2.000   1st Qu.:3.000   1st Qu.:    21.61   1st Qu.:   1.00   1st Qu.:    2989  
 Median :3.000   Median :4.00   Median :4.000   Median :5.000   Median :    81.53   Median :   5.00   Median :   27404  
 Mean   :2.986   Mean   :3.97   Mean   :4.015   Mean   :4.992   Mean   :  1092.20   Mean   :  51.87   Mean   :  257008  
 3rd Qu.:4.000   3rd Qu.:6.00   3rd Qu.:6.000   3rd Qu.:7.000   3rd Qu.:   389.78   3rd Qu.:  21.00   3rd Qu.:  111954  
 Max.   :5.000   Max.   :7.00   Max.   :7.000   Max.   :9.000   Max.   :127687.27   Max.   :3338.00   Max.   :18245026

 

Table 4.6: Data dictionary

  • Kilometres: Distance driven by a vehicle per year, grouped into five categories (not the exact distance, but a higher number does mean a longer distance). Integer from 1 to 5:
    1. less than 1,000 km
    2. 1,000 – 15,000 km
    3. 15,000 – 20,000 km
    4. 20,000 – 25,000 km
    5. more than 25,000 km
  • Zone: Geographic zone of a vehicle. Integer from 1 to 7.
  • Bonus: No-claims bonus, equal to the number of years since the last claim, plus one. Integer from 1 to 7.
  • Make: Type of vehicle. Integer from 1 to 9; values 1 to 8 represent eight common car models, and all other models are combined in class 9.
  • Insured: Number of policyholder years (the fraction of the year that the policyholder has a contract with the issuing company). Numeric from 0.01 to 127,687.27.
  • Claims: Number of claims. Integer from 0 to 3,338.
  • Payment: Sum of payments in Swedish kronor. Integer from 0 to 18,245,026.

 

TASK 1: Select offsets

Select an appropriate offset for each of the two target variables, claim counts and aggregate payments.

  • First propose the two candidate variables that are likely to serve their purpose based on their meaning.
  • Confirm their usefulness via appropriate graphical displays. Include these displays in your response.

 

There are two target variables in this case study,

  • the number of claims (Claims) and
  • the aggregate payments (Payment).

For the number of claims, it is reasonable to expect that it is proportional to the number of policyholder years (Insured), which provides an exposure basis.

Other things equal, the longer a policyholder is covered by an automobile insurance policy, the larger the number of claims submitted.

In the same vein, we expect that aggregate payments vary in proportion to the number of claims. Each claim generates a payment (see CHUNK 3) and it makes intuitive sense that with more claims submitted, the sum of payments becomes larger.

At this stage, we can conjecture that:

Insured may be a good offset for Claims, and Claims in turn may serve as a good offset for Payment.

In CHUNK 2, we construct two scatterplots, one for Claims against Insured and one for Payment against Claims.

# CHUNK 2
ggplot(Swedish, aes(x = Insured, y = Claims)) + geom_point()
ggplot(Swedish, aes(x = Claims, y = Payment)) + geom_point()

 

Findings:

  • It is clear that there is a strong proportional relationship between Claims and Insured and between Payment and Claims. This supports the use of Insured and Claims as offsets (to be included on a suitable scale) for Claims and Payment, respectively.

Offset for Claims: Insured
Offset for Payment: Claims
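Though not demanded by the task, a quick numeric supplement to the scatterplots is to compute the correlation between each target and its proposed offset; this is a minimal sketch and was not part of the original chunks:

# Near-proportional relationships should show up as correlations close to 1
cor(Swedish$Insured, Swedish$Claims)
cor(Swedish$Claims, Swedish$Payment)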

 

Although not strictly required by the task statement, we can make two additional observations involving Claims, Insured, and Payment:

1. Regarding the relationship between Payment and Claims, notice that if Claims is zero, then Payment must be zero as well. This is intuitively clear (no claims means no payments) and is confirmed in CHUNK 3, which shows that Payment is degenerate at zero in the subset of Swedish with no claims. Since claim payments are observed only when there are one or more claims, we also create (via logical subsetting) a subset of Swedish where Claims is strictly positive, called Swedish.severity, which will be useful for modeling claim payments later.

# CHUNK 3 
summary(Swedish[Swedish$Claims == 0, ]$Payment)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      0       0       0       0       0       0 
Swedish.severity <- Swedish[Swedish$Claims > 0, ]

 

2. To facilitate later use, in CHUNK 4 we define the two offset-corrected target variables, called ClaimsPerinsured and PaymentPerClaim, which are the number of claims per policyholder year and the payment per claim, respectively.

Traditionally, in the actuarial literature PaymentPerClaim is referred to as claim severity, which is the size of one payment and not to be confused with the aggregate payments (what you learned in Exam STAM should help!).

Note that PaymentPerClaim is only defined on Swedish.severity (or else the division by zero caused by a zero Claims would create computational trouble).

# CHUNK 4 
Swedish$ClaimsPerinsured <- Swedish$Claims / Swedish$Insured 
Swedish.severity$PaymentPerClaim <- Swedish.severity$Payment / Swedish.severity$Claims

The following two scatterplots show that both the size and the variability of ClaimsPerinsured and PaymentPerClaim shrink markedly as Insured and Claims grow, respectively; averaging goes a long way towards stabilizing the two target variables. We will take advantage of these findings when we construct GLMs later in this case study.

# CHUNK 4 (cont.)
ggplot(Swedish, aes(x = Insured, y = ClaimsPerinsured)) + geom_point()
ggplot(Swedish.severity, aes(x = Claims, y = PaymentPerClaim)) + geom_point()

 

TASK 2: Edit and explore the data

Edit and explore the data to prepare for further analysis and identify potentially predictive variables.

  • Determine if any numeric variables should be converted to factor variables. If there are variables that can function as both numeric and factor variables, leave them as numeric and explore the factor conversions in Task 3.
  • Investigate the distribution of the two target variables corrected by the offsets identified in Task 1.
  • Identify variables ( other than the offsets in Task 1) that are likely to have significant predictive power for each of the two target variables.

 

Data Quality Issue

Factor Conversions

In the Swedish dataset, all variables are represented by numeric values. However, Zone and Make are merely numerical labels of the geographical zone and the type of a vehicle and do not convey any numerical order (e.g., there is no indication that Zone 2 is larger than Zone 1 in any sense). These two variables are better converted to factors, as we do in CHUNK 5.

# CHUNK 5 
Swedish$Zone <- as.factor(Swedish$Zone) 
Swedish$Make <- as.factor(Swedish$Make) 
Swedish.severity$Zone <- as.factor(Swedish.severity$Zone) 
Swedish.severity$Make <- as.factor(Swedish.severity$Make) 

As for Kilometres and Bonus, these two variables do have a numerical order (e.g., Kilometres = 2 indicates a longer distance traveled than Kilometres = 1), but they can still be converted to factors, represented by dummy variables, to capture their effects on the target variables more flexibly. We will leave them as numeric variables for now and decide whether changing them to factors is worthwhile when it comes to the modeling stage.

 

Stage 3: Exploratory Data Analysis

Univariate Exploration

Distributions of the Target Variables

In CHUNK 6, we explore the distributions of the two offset-corrected target variables, ClaimsPerinsured and PaymentPerClaim, using summary statistics (via the summary() function) in combination with graphical displays (histograms here).

# CHUNK 6 
summary(Swedish$ClaimsPerinsured)
summary(Swedish.severity$PaymentPerClaim)

ggplot(Swedish, aes(x = ClaimsPerinsured)) + geom_histogram()
ggplot(Swedish.severity, aes(x = PaymentPerClaim)) + geom_histogram()
> summary(Swedish$ClaimsPerinsured)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
0.00000 0.02648 0.05123 0.06914 0.08906 1.66667 

> summary(Swedish.severity$PaymentPerClaim)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
     72    2643    4375    5206    5985   31442

Both ClaimsPerinsured and PaymentPerClaim are non-negative continuous variables (while Claims takes only non-negative integers, the division by Insured makes ClaimsPerinsured continuous). The former ranges from 0 (corresponding to no claims) to 1.66667 while the latter ranges from 72 to 31,442.

Even after correcting for the offsets, the two target variables exhibit a pronounced right skew, especially PaymentPerClaim, as shown in the histograms and reflected by their means substantially exceeding their medians. We will take the right skew into account when developing GLMs later.
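As an optional supplement (not part of the original chunks), the right skew of PaymentPerClaim can also be inspected on a log scale, which foreshadows the log link we will use later:

# Histogram of PaymentPerClaim on a log10 axis; the skew is far less pronounced on this scale
ggplot(Swedish.severity, aes(x = PaymentPerClaim)) +
  geom_histogram(bins = 30) +
  scale_x_log10()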

 

Bivariate Exploration

Potentially Useful Predictors

Having investigated the two target variables, we go on to perform bivariate data exploration in search of potentially predictive variables out of Kilometres, Bonus, Zone, and Make. CHUNK 7 provides the means and medians of the two offset-corrected target variables split by distinct values or levels of the four predictors. (Split boxplots can also be used here.)

# CHUNK 7 
library(tidyverse) 
vars <- c("Kilometres", "Bonus", "Zone", "Make") 
for (i in vars) { 
  x <- Swedish %>% 
    group_by_(i) %>% 
    summarise( 
      mean = mean(ClaimsPerinsured), 
      median = median(ClaimsPerinsured), 
      n = n() 
    ) 
  print(x) 
}
# A tibble: 5 × 4
  Kilometres   mean median     n
       <int>  <dbl>  <dbl> <int>
1          1 0.0561 0.0425   439
2          2 0.0651 0.0488   441
3          3 0.0718 0.0552   441
4          4 0.0705 0.0555   434
5          5 0.0827 0.0574   427
# A tibble: 7 × 4
  Bonus   mean median     n
  <int>  <dbl>  <dbl> <int>
1     1 0.129  0.117    307
2     2 0.0792 0.0741   312
3     3 0.0676 0.0591   310
4     4 0.0659 0.0514   310
5     5 0.0550 0.0454   313
6     6 0.0524 0.0447   315
7     7 0.0364 0.0350   315
# A tibble: 7 × 4
  Zone    mean median     n
  <fct>  <dbl>  <dbl> <int>
1 1     0.104  0.0862   315
2 2     0.0795 0.0645   315
3 3     0.0722 0.0575   315
4 4     0.0575 0.0473   315
5 5     0.0626 0.0469   313
6 6     0.0569 0.0435   315
7 7     0.0504 0        294
# A tibble: 9 × 4
  Make    mean median     n
  <fct>  <dbl>  <dbl> <int>
1 1     0.0761 0.0614   245
2 2     0.0802 0.0607   245
3 3     0.0576 0.0389   242
4 4     0.0333 0.0216   238
5 5     0.0919 0.0691   244
6 6     0.0543 0.0430   244
7 7     0.0838 0.0555   242
8 8     0.0729 0.0487   237
9 9     0.0712 0.0630   245
# CHUNK 7 (cont.)
for (i in vars) { 
  x <- Swedish.severity %>% 
    group_by_(i) %>% 
    summarise( 
      mean = mean(PaymentPerClaim), 
      median = median(PaymentPerClaim), 
      n = n() 
    ) 
  print(x) 
}
# A tibble: 5 × 4
  Kilometres  mean median     n
       <int> <dbl>  <dbl> <int>
1          1 4912.  4239.   381
2          2 5068.  4650.   399
3          3 5445.  4494.   378
4          4 5065.  3901.   328
5          5 5601.  4207.   311
# A tibble: 7 × 4
  Bonus  mean median     n
  <int> <dbl>  <dbl> <int>
1     1 4913.  4195.   252
2     2 5451.  4238.   246
3     3 4899.  3974.   246
4     4 5257.  4095.   240
5     5 5188.  3900    247
6     6 5251.  4464.   270
7     7 5440.  5107.   296
# A tibble: 7 × 4
  Zone   mean median     n
  <fct> <dbl>  <dbl> <int>
1 1     4959.  4399.   295
2 2     4569.  4142.   295
3 3     5167.  4500.   293
4 4     5411.  4904.   306
5 5     5625.  4145.   236
6 6     5702.  4775.   264
7 7     5016.  2605.   108
# A tibble: 9 × 4
  Make   mean median     n
  <fct> <dbl>  <dbl> <int>
1 1     5083.  4828.   227
2 2     4854.  4313.   205
3 3     5950.  4438.   178
4 4     4877.  3367    155
5 5     4986.  3797.   206
6 6     5193.  4482.   212
7 7     4759.  3429    197
8 8     6603.  4387    173
9 9     4849.  4791.   244

 

# Remove redundant variables 
rm(i, vars, x)

  • ClaimsPerinsured: It appears that ClaimsPerinsured has a monotonic relationship with both Kilometres and Bonus, increasing in Kilometres and decreasing in Bonus.
    These findings make intuitive sense as a longer distance traveled indicates a more worn-out vehicle and a smaller Bonus suggests poorer driving experience of a policyholder.
    The average of ClaimsPerinsured is also particularly low in certain geographical zones such as 4, 5, 6, and 7, and for certain car models such as 3, 4, and 6.
  • PaymentPerClaim: The effects of the four predictors on PaymentPerClaim are not as pronounced as for ClaimsPerinsured. Unlike ClaimsPerinsured, Kilometres and Bonus do not seem to exert a monotonic effect on PaymentPerClaim.
    We can see that PaymentPerClaim has relatively close means and medians across different values of Kilometres and Bonus, relatively high means and medians for Zone 4 and Zone 6, and relatively low medians for cars of Make 4 and Make 7.
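As mentioned before CHUNK 7, split boxplots offer an alternative graphical view of the same relationships; a minimal sketch (not part of the original chunks):

# Split boxplots of the offset-corrected targets against selected predictors
ggplot(Swedish, aes(x = factor(Bonus), y = ClaimsPerinsured)) + geom_boxplot()
ggplot(Swedish.severity, aes(x = Zone, y = PaymentPerClaim)) + geom_boxplot()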

 

TASK 3: Construct a GLM for Claims

  • Split the data into the training and test sets.
  • Select a distribution and link function to use for constructing a GLM for Claims. Justify your choice based on the characteristics of the target variable.
  • Fit two GLMs to the training data using the combination of distribution and link function you propose, one treating the “controversial” variables in Task 2 as numeric variables and one treating them as factor variables. Be sure to include the offsets identified in Task 1. Provide the summary output of the two GLMs in your response.
  • Perform feature selection using an appropriate method if desirable.
  • Evaluate the Pearson goodness-of-fit statistic of the two GLMs on the test set and recommend which model to use.

 

Stage 4: Model Construction and Selection

Creation of Training and Test sets

Before constructing GLMs for Claims and Payment, we perform the usual step of splitting the data into training and test sets. The training set (70% of the data) will be used for fitting the GLMs and the test set (30% of the data) for evaluating their prediction accuracy. There is a subtlety in the current context: we have two target variables, so which one should we base the stratified sampling on when doing the training/test split?

In CHUNK 8, we do the stratified sampling based on the number of claims per insured, ClaimsPerinsured, and you will see why very shortly.

# CHUNK 8 
library(caret) 
set.seed(14)
train_ind <- createDataPartition(Swedish$ClaimsPerinsured, p = 0.7, list = FALSE)
data.train <- Swedish[train_ind, ]
data.test <- Swedish[-train_ind, ]
# Take out the severity portions of the training and test sets 
data.train.severity <- data.train[data.train$Claims > 0, ]
data.train.severity$PaymentPerClaim <- data.train.severity$Payment / data.train.severity$Claims
data.test.severity <- data.test[data.test$Claims > 0, ]
data.test.severity$PaymentPerClaim <- data.test.severity$Payment / data.test.severity$Claims
print("TRAIN") 
mean(data.train$ClaimsPerinsured)
mean(data.train.severity$PaymentPerClaim)
print("TEST")
mean(data.test$ClaimsPerinsured)
mean(data.test.severity$PaymentPerClaim)
> print("TRAIN") 
[1] "TRAIN"
> mean(data.train$ClaimsPerinsured)
[1] 0.06896561
> mean(data.train.severity$PaymentPerClaim)
[1] 5253.53
> print("TEST")
[1] "TEST"
> mean(data.test$ClaimsPerinsured)
[1] 0.06955017
> mean(data.test.severity$PaymentPerClaim)
[1] 5094.656

The average number of claims per insured on the training set and that on the test set are very close to each other, which is a result of the built-in stratification. To our delight, the two sets also produce a similar average claim severity, which can be attributed to the strong proportional relationship between Payment and Claims we saw in CHUNK 2 and the comparable mean values of ClaimsPerinsured on the training and test sets.

 

Fitting the Model

A GLM for Claims with Kilometres and Bonus as numeric variables.

Because Claims is a non-negative integer-valued variable with a right skew, the Poisson distribution is a suitable target distribution. We will adopt the log link, which is most commonly employed in conjunction with the Poisson distribution as it ensures that the model predictions are always non-negative, makes for an interpretable model, and is the canonical link.

In CHUNK 9, we fit a log-link Poisson GLM for Claims on the training set using Kilometres, Zone, Bonus, and Make as predictors and the logarithm of Insured as an offset. Here Kilometres and Bonus are treated as numeric variables. The GLM is saved as an object named count.numeric to distinguish it from other GLMs we will fit later.

# CHUNK 9 
count.numeric <- glm(Claims ~ Kilometres + Zone + Bonus + Make,
                     data = data.train,
                     offset = log(Insured),
                     family = poisson)
summary(count.numeric)
Call:
glm(formula = Claims ~ Kilometres + Zone + Bonus + Make, family = poisson, 
    data = data.train, offset = log(Insured))

Coefficients:
             Estimate Std. Error  z value Pr(>|z|)    
(Intercept) -1.813830   0.016746 -108.311  < 2e-16 ***
Kilometres   0.131218   0.003028   43.330  < 2e-16 ***
Zone2       -0.230780   0.011698  -19.728  < 2e-16 ***
Zone3       -0.382588   0.012185  -31.397  < 2e-16 ***
Zone4       -0.573191   0.010981  -52.197  < 2e-16 ***
Zone5       -0.328492   0.018044  -18.205  < 2e-16 ***
Zone6       -0.522192   0.014480  -36.063  < 2e-16 ***
Zone7       -0.727917   0.050782  -14.334  < 2e-16 ***
Bonus       -0.201320   0.001609 -125.103  < 2e-16 ***
Make2        0.047689   0.026055    1.830   0.0672 .  
Make3       -0.268146   0.033077   -8.107 5.20e-16 ***
Make4       -0.687657   0.027191  -25.290  < 2e-16 ***
Make5        0.154635   0.023003    6.722 1.79e-11 ***
Make6       -0.350481   0.020309  -17.258  < 2e-16 ***
Make7       -0.061202   0.027085   -2.260   0.0238 *  
Make8       -0.074676   0.040067   -1.864   0.0624 .  
Make9       -0.069484   0.011548   -6.017 1.78e-09 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 23800.5  on 1529  degrees of freedom
Residual deviance:  2681.6  on 1513  degrees of freedom
AIC: 8059.4

Number of Fisher Scoring iterations: 4

Note that:

  • In the formula argument of the glm() function, we deliberately exclude Payment and ClaimsPerinsured, which cannot serve as predictors of Claims, as well as Insured, which will play its role in the offset argument.
  • The family argument of the glm() function is set to poisson (not Poisson!) to specify a Poisson GLM. The log link is the canonical (and default) link for the Poisson distribution and so does not have to be specified explicitly. (If you prefer, you can replace family = poisson by family = poisson() or family = poisson(link = "log").)
  • Do remember to put Insured on the log scale, hence the offset = log(Insured) command (not offset = Insured!).

The model summary shows that the coefficient estimates for Kilometres and Bonus are positive and negative, respectively, consistent with what we saw in CHUNK 7. All features, including the individual factor levels of Zone and Make (except for levels 2 and 8), are statistically significant, indicating that they are valuable for explaining Claims; no feature removal appears necessary.
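Because of the log link, the fitted coefficients act multiplicatively on the expected claim count. As a quick illustration (not part of the original chunks), we can exponentiate the Kilometres coefficient:

# Multiplicative effect of a one-unit increase in Kilometres on the expected claim count
exp(coef(count.numeric)["Kilometres"])
# approximately exp(0.131218) = 1.140: each higher Kilometres category raises the expected
# number of claims by about 14%, holding the other predictors and the offset fixed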

 

Digression: Can we model ClaimsPerinsured directly?

Note that because of the log link, the equation of the Poisson GLM we have just fitted is equivalent to:

\(\ln(E[\text{ClaimsPerinsured}])=\ln\left(\dfrac{E[\text{Claims}]}{\text{Insured}}\right)=\eta\),

where η is the linear predictor comprising Kilometres, Zone, Bonus, and Make. You may wonder:

To save work, can we instead fit a log-link Poisson GLM directly to ClaimsPerinsured without having to supply log(Insured) as an offset?

This is explored in CHUNK 10.

# CHUNK 10 
freq <- glm(ClaimsPerinsured ~ Kilometres + Zone + Bonus + Make, 
            data = data.train, 
            weight = Insured, 
            family = poisson) 
summary(freq)
> summary(freq)

Call:
glm(formula = ClaimsPerinsured ~ Kilometres + Zone + Bonus + 
    Make, family = poisson, data = data.train, weights = Insured)

Coefficients:
             Estimate Std. Error  z value Pr(>|z|)    
(Intercept) -1.813830   0.016747 -108.311  < 2e-16 ***
Kilometres   0.131218   0.003028   43.330  < 2e-16 ***
Zone2       -0.230780   0.011698  -19.728  < 2e-16 ***
Zone3       -0.382588   0.012185  -31.397  < 2e-16 ***
Zone4       -0.573191   0.010981  -52.197  < 2e-16 ***
Zone5       -0.328492   0.018044  -18.205  < 2e-16 ***
Zone6       -0.522192   0.014480  -36.063  < 2e-16 ***
Zone7       -0.727917   0.050784  -14.334  < 2e-16 ***
Bonus       -0.201320   0.001609 -125.103  < 2e-16 ***
Make2        0.047689   0.026055    1.830   0.0672 .  
Make3       -0.268146   0.033077   -8.107 5.20e-16 ***
Make4       -0.687657   0.027191  -25.290  < 2e-16 ***
Make5        0.154635   0.023003    6.722 1.79e-11 ***
Make6       -0.350481   0.020309  -17.258  < 2e-16 ***
Make7       -0.061202   0.027085   -2.260   0.0238 *  
Make8       -0.074676   0.040067   -1.864   0.0624 .  
Make9       -0.069484   0.011548   -6.017 1.78e-09 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 23800.5  on 1529  degrees of freedom
Residual deviance:  2681.6  on 1513  degrees of freedom
AIC: Inf

Number of Fisher Scoring iterations: 6

Notice that:

  • As opposed to CHUNK 9, the target variable in the formula argument of the glm() function becomes ClaimsPerinsured and no offset is specified.
  • Due to the averaging by Insured, the variability of ClaimsPerinsured is lowered and decreases with the value of Insured, as we saw in CHUNK 4. To reflect the different precision of the observations caused by different values of Insured, the command weight = Insured is in place to insert Insured as a weight in the model fitting procedure. Observations with a higher value of Insured will receive a higher weight.

With one exception, the summary for this model is identical to that in CHUNK 9, indicating that the two fitted models are essentially the same (the equivalence was first noted on page 296).

That notable exception is that the AIC of the model for ClaimsPerinsured (the frequency model) is Inf while that of the model for Claims (the count model) is a finite value. When running CHUNK 10 in RStudio, you should also see various warning messages that stem from the non-integer values of ClaimsPerinsured (whereas a Poisson random variable takes only non-negative integer values). This inconsistency between the nature of ClaimsPerinsured and a genuine Poisson random variable prevents the glm() function from computing the AIC of the frequency model. As a result, no feature selection based on the AIC can be performed.

Since the count model and the frequency model produce identical fitted results and the former is much easier to work with, there is no reason to dwell on the latter and we will no longer consider the frequency model in the rest of this case study.

 

A GLM for Claims with Kilometres and Bonus as factor variables. Now we refit the Poisson count model with Kilometres and Bonus treated as factor variables and name the fitted model as count.factor. Notice that Kilometres and Bonus are replaced by factor(Kilometres) and factor(Bonus) in the model formula.

# CHUNK 11 
count.factor <- glm(Claims ~ factor(Kilometres) + Zone + factor(Bonus) + Make, 
                    data = data.train, 
                    offset = log(Insured), 
                    family = poisson) 
summary(count.factor)
Call:
glm(formula = Claims ~ factor(Kilometres) + Zone + factor(Bonus) + 
    Make, family = poisson, data = data.train, offset = log(Insured))

Coefficients:
                     Estimate Std. Error  z value Pr(>|z|)    
(Intercept)         -1.793205   0.016512 -108.598  < 2e-16 ***
factor(Kilometres)2  0.194910   0.010023   19.446  < 2e-16 ***
factor(Kilometres)3  0.294007   0.009877   29.766  < 2e-16 ***
factor(Kilometres)4  0.370948   0.014373   25.809  < 2e-16 ***
factor(Kilometres)5  0.559839   0.014991   37.345  < 2e-16 ***
Zone2               -0.239285   0.011750  -20.364  < 2e-16 ***
Zone3               -0.382898   0.012331  -31.053  < 2e-16 ***
Zone4               -0.574708   0.011185  -51.382  < 2e-16 ***
Zone5               -0.337239   0.018218  -18.511  < 2e-16 ***
Zone6               -0.532676   0.014746  -36.123  < 2e-16 ***
Zone7               -0.708291   0.050845  -13.930  < 2e-16 ***
factor(Bonus)2      -0.453835   0.015951  -28.453  < 2e-16 ***
factor(Bonus)3      -0.699796   0.017949  -38.987  < 2e-16 ***
factor(Bonus)4      -0.823508   0.018985  -43.376  < 2e-16 ***
factor(Bonus)5      -0.906880   0.017397  -52.129  < 2e-16 ***
factor(Bonus)6      -0.983759   0.014285  -68.864  < 2e-16 ***
factor(Bonus)7      -1.332578   0.010936 -121.853  < 2e-16 ***
Make2                0.049512   0.026094    1.897   0.0578 .  
Make3               -0.269633   0.033111   -8.143 3.85e-16 ***
Make4               -0.661879   0.027371  -24.182  < 2e-16 ***
Make5                0.152141   0.023016    6.610 3.84e-11 ***
Make6               -0.342875   0.020333  -16.863  < 2e-16 ***
Make7               -0.059942   0.027092   -2.213   0.0269 *  
Make8               -0.086094   0.040131   -2.145   0.0319 *  
Make9               -0.067240   0.011726   -5.734 9.80e-09 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 23800.5  on 1529  degrees of freedom
Residual deviance:  2016.8  on 1505  degrees of freedom
AIC: 7410.6

Number of Fisher Scoring iterations: 4

In the model summary, Kilometres and Bonus are now replaced by a collection of dummy variables representing their distinct levels other than the baseline levels. The signs and magnitudes of the coefficient estimates of these dummy variables are in agreement with the coefficient estimates of Kilometres and Bonus in CHUNK 9. The estimates of the dummy variables of Kilometres are all positive and increase in value from level 2 to level 5, while the opposite relationship is observed for those of Bonus from level 2 to level 7, consistent with the monotonic effects of Kilometres and Bonus on ClaimsPerinsured we observed earlier. The binarization of Kilometres and Bonus, however, allows us to capture the effects of these two variables more effectively across different factor levels, at the cost of an increase in the number of features in the model and computational burden. 

 

Using the Pearson statistic as an arbitrator

Comparing CHUNKs 9 and 11, we can see that the GLM with binarized Kilometres and Bonus has a lower AIC and therefore appears to be the better model. The superiority of binarizing Kilometres and Bonus is also reflected by the Pearson goodness-of-fit statistic, defined by:

\(\dfrac{1}{n}\sum_{i=1}^{n}\dfrac{(y_i-\hat{y}_i)^2}{\hat{y}_i}\),

where \(y_i\) is the \(i\)th observation of the training or test set, \(\hat{y}_i\) is the corresponding fitted or predicted value, and \(n\) is the number of observations in the set.

Featured in the June 16, 2020 PA exam and covered in Exam SRM, the Pearson statistic is commonly used to assess the goodness of fit of a count model. It takes the square of the discrepancy between the observed value and the predicted value but unlike the MSE, it scales the squared discrepancy by the predicted value to avoid the statistic being disproportionately dominated by large predicted values, which are not uncommon with right-skewed data. Like the AIC, the smaller the Pearson statistic, the better the model.
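For convenience, the statistic can be wrapped in a small helper function; this is a minimal sketch (the name pearson_stat is ours, not from the original chunks):

# Average Pearson goodness-of-fit statistic for a count model
pearson_stat <- function(y, y.hat) {
  mean((y - y.hat)^2 / y.hat)
}

With this helper, the two computations in CHUNK 12 below reduce to pearson_stat(data.test$Claims, count.numeric.pred) and pearson_stat(data.test$Claims, count.factor.pred).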

In CHUNK 12, we generate the predictions of the two Poisson GLMs on the test set and evaluate the Pearson goodness-of-fit statistic for both models. Note that the predictions are for Claims, not ClaimsPerinsured, and have taken the offset Insured into account.

# CHUNK 12 
count.numeric.pred <- predict(count.numeric, newdata = data.test, type = "response")

sum((data.test$Claims - count.numeric.pred)^2 / count.numeric.pred) / nrow(data.test)

count.factor.pred <- predict(count.factor, newdata = data.test, type = "response")

sum((data.test$Claims - count.factor.pred)^2 / count.factor.pred) / nrow(data.test)
> count.numeric.pred <- predict(count.numeric, newdata = data.test, type = "response")
> 
> sum((data.test$Claims - count.numeric.pred)^2 / count.numeric.pred) / nrow(data.test)
[1] 2.311969
> 
> count.factor.pred <- predict(count.factor, newdata = data.test, type = "response")
> 
> sum((data.test$Claims - count.factor.pred)^2 / count.factor.pred) / nrow(data.test)
[1] 1.602462

In congruence with the AIC-based rankings, treating Kilometres and Bonus as factor variables leads to a lower Pearson statistic on the test set and appears a judicious move. The increase in flexibility offered by the dummy variables outweighs the increase in model complexity.

 

TASK 4: Construct a GLM for claim severity

  • Select a distribution and link function to use for constructing a GLM for claim severity. Justify your choice based on the characteristics of
    the target variable.
  • Explain why Claims may serve as a weight for constructing a GLM for claim severity.
  • Fit a GLM for claim severity to the training data using the combination of distribution and link function you propose, and Claims as a weight.
  • Perform feature selection using an appropriate method if desirable.
  • Provide the summary output of the resulting GLM in your response.
  • Use the GLM to make predictions for Payment (not claim severity) on the test set and evaluate the test root mean squared error. For this purpose, assume that the values of Claims on the test set are available.

 

A GLM for claim severity

We now turn to building GLMs for claim severity, or PaymentPerClaim. As CHUNK 6 showed, PaymentPerClaim is a positive, continuous variable with a right skew, so a gamma distribution is a good choice for the target distribution (inverse Gaussian is another choice).

Furthermore, CHUNK 4 indicated that the variability of PaymentPerClaim decreases with the value of Claims. This motivates the use of Claims as a weight when a gamma GLM is fitted for PaymentPerClaim.
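In GLM terms, the rationale is the standard variance relation for averaged responses: PaymentPerClaim is the average of Claims individual claim payments, so

\(\text{Var}(\text{PaymentPerClaim}_i)=\dfrac{\phi\,V(\mu_i)}{\text{Claims}_i}\).

Supplying Claims as the weight therefore tells the fitting procedure that rows with more claims are measured with greater precision.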

In CHUNK 13, we fit a log-link gamma GLM for PaymentPerClaim on the severity portion of the training set with Claims serving as a weight. Consistent with the claim count model in CHUNK 11, we are binarizing Kilometres and Bonus. (If you are interested, feel free to compare the performance of the GLM with numeric Kilometres and Bonus and with binarized Kilometres and Bonus. Such a comparison is omitted here to reduce repetition.)

# CHUNK 13 
severity <- glm(PaymentPerClaim ~ factor(Kilometres) + Zone + factor(Bonus) + Make, 
                data = data.train.severity, 
                weight = Claims, 
                family = Gamma(link = "log"))
summary(severity)
Call:
glm(formula = PaymentPerClaim ~ factor(Kilometres) + Zone + factor(Bonus) + 
    Make, family = Gamma(link = "log"), data = data.train.severity, 
    weights = Claims)

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)          8.39309    0.02801 299.694  < 2e-16 ***
factor(Kilometres)2  0.01253    0.01722   0.728 0.466926    
factor(Kilometres)3  0.01459    0.01703   0.857 0.391764    
factor(Kilometres)4  0.03893    0.02481   1.569 0.116860    
factor(Kilometres)5  0.03361    0.02584   1.301 0.193644    
Zone2                0.04370    0.02023   2.160 0.030966 *  
Zone3                0.04870    0.02122   2.294 0.021930 *  
Zone4                0.13249    0.01932   6.857 1.11e-11 ***
Zone5                0.04064    0.03141   1.294 0.196051    
Zone6                0.15418    0.02541   6.068 1.72e-09 ***
Zone7                0.04576    0.08753   0.523 0.601199    
factor(Bonus)2       0.06609    0.02748   2.405 0.016312 *  
factor(Bonus)3       0.08396    0.03094   2.713 0.006757 ** 
factor(Bonus)4       0.08018    0.03274   2.449 0.014445 *  
factor(Bonus)5       0.06293    0.03001   2.097 0.036201 *  
factor(Bonus)6       0.08939    0.02465   3.627 0.000298 ***
factor(Bonus)7       0.12111    0.01887   6.420 1.94e-10 ***
Make2               -0.03875    0.04498  -0.861 0.389150    
Make3                0.05197    0.05710   0.910 0.362879    
Make4               -0.20358    0.04691  -4.340 1.54e-05 ***
Make5               -0.09568    0.03963  -2.414 0.015907 *  
Make6               -0.05159    0.03501  -1.474 0.140771    
Make7               -0.11042    0.04666  -2.367 0.018099 *  
Make8                0.25276    0.06912   3.657 0.000266 ***
Make9               -0.05821    0.02014  -2.890 0.003915 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for Gamma family taken to be 2.964663)

    Null deviance: 3801.1  on 1259  degrees of freedom
Residual deviance: 3185.2  on 1235  degrees of freedom
AIC: 1257914

Number of Fisher Scoring iterations: 5

Most features in the fitted GLM are statistically significant; the notable exception is Kilometres, which is removed when a backward stepwise selection based on the AIC is implemented, as shown in CHUNK 14. (For simplicity, we do not perform explicit binarization.)

# CHUNK 14 
library(MASS) 
severity.final <- stepAIC(severity)
Start:  AIC=1257914
PaymentPerClaim ~ factor(Kilometres) + Zone + factor(Bonus) + 
    Make

                     Df Deviance     AIC
- factor(Kilometres)  4   3195.4 1257909
<none>                    3185.2 1257914
- factor(Bonus)       6   3312.1 1257945
- Make                8   3334.2 1257948
- Zone                6   3396.3 1257973

Step:  AIC=1258151
PaymentPerClaim ~ Zone + factor(Bonus) + Make
summary(severity.final)
Call:
glm(formula = PaymentPerClaim ~ Zone + factor(Bonus) + Make, 
    family = Gamma(link = "log"), data = data.train.severity, 
    weights = Claims)

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)     8.40279    0.02680 313.506  < 2e-16 ***
Zone2           0.04430    0.02026   2.186 0.028972 *  
Zone3           0.04808    0.02121   2.267 0.023554 *  
Zone4           0.13333    0.01912   6.972 5.06e-12 ***
Zone5           0.03985    0.03123   1.276 0.202207    
Zone6           0.15619    0.02510   6.222 6.71e-10 ***
Zone7           0.04550    0.08779   0.518 0.604383    
factor(Bonus)2  0.06995    0.02748   2.546 0.011023 *  
factor(Bonus)3  0.08940    0.03087   2.896 0.003844 ** 
factor(Bonus)4  0.08657    0.03262   2.654 0.008062 ** 
factor(Bonus)5  0.06956    0.02986   2.329 0.020009 *  
factor(Bonus)6  0.09553    0.02444   3.909 9.79e-05 ***
factor(Bonus)7  0.12545    0.01875   6.692 3.32e-11 ***
Make2          -0.03607    0.04503  -0.801 0.423242    
Make3           0.05835    0.05709   1.022 0.306992    
Make4          -0.21254    0.04663  -4.558 5.66e-06 ***
Make5          -0.09632    0.03975  -2.423 0.015542 *  
Make6          -0.05523    0.03504  -1.576 0.115251    
Make7          -0.11148    0.04680  -2.382 0.017376 *  
Make8           0.25775    0.06916   3.727 0.000203 ***
Make9          -0.05990    0.01993  -3.006 0.002702 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for Gamma family taken to be 2.985245)

    Null deviance: 3801.1  on 1259  degrees of freedom
Residual deviance: 3195.4  on 1239  degrees of freedom
AIC: 1258151

Number of Fisher Scoring iterations: 5

Lastly, in CHUNK 15 we apply the reduced GLM to produce predictions for Payment on the severity portion of the test set. Because the target variable of the model is PaymentPerClaim, we have to multiply the model predictions by Claims on the test set to get the predictions for Payment.

# CHUNK 15
severity.final.pred <- data.test.severity$Claims * 
  predict(severity.final, newdata = data.test.severity, type = "response")

RMSE(data.test.severity$Payment, severity.final.pred)

RMSE(data.test.severity$Payment, mean(data.train.severity$Payment))
> severity.final.pred <- data.test.severity$Claims * 
+   predict(severity.final, newdata = data.test.severity, type= "response")
> 
> RMSE(data.test.severity$Payment, severity.final.pred)
[1] 69882.03
> 
> RMSE(data.test.severity$Payment, mean(data.train.severity$Payment))
[1] 1108407

The test RMSE is 69,882.03, which seems large. To put this value in perspective, we compare it with the test RMSE if we use the mean of Payment on the training set as the predicted value. The test RMSE in this case becomes 1,108,407, a much larger value. Our gamma GLM is therefore capable of reducing the test RMSE to only about 6.30% of the RMSE when no predictors are available.

 

Predictions

TASK 5: Combine the two models to make a prediction for aggregate payments

Run the recommended GLMs in Tasks 3 and 4 on the full dataset.

  • Include the model output in your response.
  • Explain step by step how to predict the aggregate payments for a group of policyholders with the following information:
    • Drives between 1,000 and 15,000 kilometers per year
    • Lives in Stockholm
    • Had an accident within the past year
    • Drives a car of model 6
    • Has 350 policyholder years
  • Propose an alternative modeling method that can be used to predict aggregate payments.

 

Combining frequency and severity GLMs to predict aggregate payments

Let’s refit the two GLMs in Tasks 3 and 4 to the full Swedish dataset. Here is the summary output.

# CHUNK 16 
count.full <- glm(Claims ~ factor(Kilometres) + Zone + factor(Bonus) + Make, 
                  data = Swedish, 
                  offset = log(Insured), 
                  family = poisson) 
summary(count.full)
Call:
glm(formula = Claims ~ factor(Kilometres) + Zone + factor(Bonus) + 
    Make, family = poisson, data = Swedish, offset = log(Insured))

Coefficients:
                     Estimate Std. Error  z value Pr(>|z|)    
(Intercept)         -1.812840   0.013757 -131.775  < 2e-16 ***
factor(Kilometres)2  0.212586   0.007524   28.255  < 2e-16 ***
factor(Kilometres)3  0.320226   0.008661   36.974  < 2e-16 ***
factor(Kilometres)4  0.404657   0.012054   33.571  < 2e-16 ***
factor(Kilometres)5  0.575954   0.012830   44.892  < 2e-16 ***
Zone2               -0.238168   0.009496  -25.082  < 2e-16 ***
Zone3               -0.386395   0.009670  -39.959  < 2e-16 ***
Zone4               -0.581902   0.008654  -67.243  < 2e-16 ***
Zone5               -0.326128   0.014530  -22.446  < 2e-16 ***
Zone6               -0.526234   0.011877  -44.309  < 2e-16 ***
Zone7               -0.730999   0.040698  -17.962  < 2e-16 ***
factor(Bonus)2      -0.478993   0.012094  -39.607  < 2e-16 ***
factor(Bonus)3      -0.693172   0.013508  -51.316  < 2e-16 ***
factor(Bonus)4      -0.827397   0.014584  -56.735  < 2e-16 ***
factor(Bonus)5      -0.925632   0.013968  -66.269  < 2e-16 ***
factor(Bonus)6      -0.993457   0.011629  -85.429  < 2e-16 ***
factor(Bonus)7      -1.327406   0.008685 -152.845  < 2e-16 ***
Make2                0.076245   0.021239    3.590 0.000331 ***
Make3               -0.247413   0.025094   -9.859  < 2e-16 ***
Make4               -0.653524   0.024185  -27.022  < 2e-16 ***
Make5                0.154924   0.020235    7.656 1.91e-14 ***
Make6               -0.335581   0.017375  -19.314  < 2e-16 ***
Make7               -0.055940   0.023343   -2.396 0.016554 *  
Make8               -0.043933   0.031604   -1.390 0.164493    
Make9               -0.068054   0.009956   -6.836 8.17e-12 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 34070.6  on 2181  degrees of freedom
Residual deviance:  2966.1  on 2157  degrees of freedom
AIC: 10654

Number of Fisher Scoring iterations: 4
severity.full <- glm(Payment/Claims ~ Zone + factor(Bonus) + Make, 
                     data = Swedish.severity, 
                     weight = Claims,
                     family = Gamma(link = "log"))
summary(severity.full)
Call:
glm(formula = Payment/Claims ~ Zone + factor(Bonus) + Make, family = Gamma(link = "log"), 
    data = Swedish.severity, weights = Claims)

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)     8.41085    0.02222 378.570  < 2e-16 ***
Zone2           0.02297    0.01640   1.401 0.161428    
Zone3           0.04770    0.01670   2.855 0.004348 ** 
Zone4           0.12963    0.01497   8.661  < 2e-16 ***
Zone5           0.05073    0.02509   2.022 0.043293 *  
Zone6           0.14652    0.02053   7.137 1.39e-12 ***
Zone7           0.02259    0.07025   0.321 0.747879    
factor(Bonus)2  0.04698    0.02085   2.254 0.024346 *  
factor(Bonus)3  0.07390    0.02325   3.178 0.001508 ** 
factor(Bonus)4  0.06240    0.02507   2.489 0.012893 *  
factor(Bonus)5  0.03987    0.02396   1.664 0.096308 .  
factor(Bonus)6  0.07606    0.01987   3.828 0.000134 ***
factor(Bonus)7  0.12130    0.01480   8.194 4.78e-16 ***
Make2          -0.03177    0.03664  -0.867 0.386081    
Make3           0.08932    0.04327   2.064 0.039159 *  
Make4          -0.17496    0.04138  -4.228 2.48e-05 ***
Make5          -0.08771    0.03493  -2.511 0.012120 *  
Make6          -0.04243    0.02995  -1.417 0.156670    
Make7          -0.12068    0.04030  -2.995 0.002784 ** 
Make8           0.21863    0.05442   4.017 6.14e-05 ***
Make9          -0.05673    0.01711  -3.316 0.000931 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for Gamma family taken to be 2.979105)

    Null deviance: 5417.7  on 1796  degrees of freedom
Residual deviance: 4547.3  on 1776  degrees of freedom
AIC: 1878541

Number of Fisher Scoring iterations: 5

Note that the predictions made in CHUNK 15 on the severity portion of the test set for aggregate payments were made assuming that the values of Claims were available. In practice, one would predict the aggregate payments of a prospective policyholder based only on his/her rating variables, without knowing the number of claims he/she will submit; the number of claims will not be available until the policy term ends.

To illustrate how to predict the aggregate payments for a future policyholder, whose Claims is yet to be observed, we consider the sample policyholder described in the task statement. According to the data dictionary, the combination of feature values is:

  • Kilometres: 2
  • Zone: 1
  • Bonus: 1
  • Make: 6
  • Insured: 350

This policyholder is created in CHUNK 17. Note that Zone and Make are supplied as character values (with their levels in quotes) because we converted these variables to factors in CHUNK 5, before the GLMs were fitted; Kilometres and Bonus, in contrast, are turned into factors inside the model formulas, so they are supplied here as numbers.

# CHUNK 17 
sample <- data.frame(Kilometres = 2, Zone = "1", Bonus = 1, Make = "6", Insured = 350)
sample
 Kilometres Zone Bonus Make Insured
1          2    1     1    6     350

To predict the aggregate payments for this sample policyholder, we will apply the two GLMs to make predictions for the number of claims and claim severity separately, then combine the two predictions to form an overall prediction.

Step 1. Let’s start with predicting the number of claims.

# CHUNK 18 
# Predicted number of claims 
N.pred <- predict(count.full, newdata = sample, type = "response") 
N.pred
       1 
50.50629

To check the prediction manually, we use the coefficient estimates in CHUNK 16 to compute the linear predictor for this policyholder (intercept plus the dummy for Kilometres level 2 and the dummy for Make 6; Zone 1 and Bonus 1 are baseline levels and contribute zero):

\(\eta = -1.812840 + 0.212586 + 0 + 0 - 0.335581 = -1.935835\)

Together with the offset, the predicted value of Claims is \(350e^{-1.935835} = 50.5063\), which agrees with the value produced by R.
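The same check can be carried out in R by extracting the fitted coefficients (the coefficient names below come from the CHUNK 16 summary):

# Rebuild the linear predictor for the sample policyholder from the fitted coefficients;
# Zone 1 and Bonus 1 are baseline levels and contribute zero
b <- coef(count.full)
eta <- b["(Intercept)"] + b["factor(Kilometres)2"] + b["Make6"]
unname(350 * exp(eta))  # approximately 50.5063, matching N.pred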

 

Step 2. Next, we turn to the predicted claim severity. 

# CHUNK 18 (Cont.) 
# Predicted claim severity 
X.pred <- predict(severity.full, newdata = sample, type = "response")
X.pred
       1 
4308.826

The claim severity GLM says that 4,308.826 is the predicted size of one claim for this sample policyholder.

 

Step 3. Finally, we multiply the two predictions to get the predicted aggregate payments.

# CHUNK 18 (Cont.) 
# Predicted aggregate payments
S.pred <- N.pred * X.pred
S.pred
       1 
217622.8

 

An alternative prediction method

In the procedure above, we are essentially decomposing the aggregate payments into two components, the claim counts and the claim severity, predicting these two variables separately with different GLMs, and then multiplying the two predictions to form the overall prediction for the aggregate payments, using the following formula you may have learned from Exam STAM:

\(E[\text{aggregate payments}] = E[\text{number of claims}] \times E[\text{claim severity}],\)

which relies on the important assumption that claim counts and claim severity are independent random variables.

An alternative is to fit a target distribution like the Tweedie distribution discussed briefly in Sub-section 4.1.1 to the aggregate payments (as a single target variable) directly and make a prediction using this distribution. This alternative modeling approach has the following advantages:

  • It is more efficient to fit and maintain a single (possibly more complex) model than two separate models.
  • It does not require the independence between claim counts and claim severity.

 

EXAM NOTE

As we mentioned in Subsection 4.1.1, it is unlikely that you will actually fit a Tweedie distribution in Exam PA because the R package for Tweedie will not be available on the exam. A conceptual exam item, however, may ask you to describe the Tweedie distribution and how it can serve as an alternative for modeling aggregate payments.
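Outside the exam environment, one way to fit such a model is the tweedie() family supplied by the statmod package. The sketch below is for illustration only; it assumes statmod is installed and picks the index parameter var.power = 1.5 arbitrarily (in practice this would be tuned or profiled):

# Illustration only -- not reproducible in the Exam PA environment
library(statmod)  # provides the tweedie() family for glm()
# var.power between 1 and 2 corresponds to a compound Poisson-gamma (Tweedie) distribution;
# link.power = 0 corresponds to the log link
agg.tweedie <- glm(Payment ~ factor(Kilometres) + Zone + factor(Bonus) + Make,
                   data = data.train,
                   offset = log(Insured),
                   family = tweedie(var.power = 1.5, link.power = 0))
summary(agg.tweedie)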