Case Study 1: GLMs for Continuous Target Variables
Learning Objectives
- Select appropriate distributions and link functions for a positive, continuous target variable with a right skew.
- Fit a GLM using the `glm()` function in R and specify the options of this function appropriately.
- Make predictions for GLMs using the `predict()` function and compare the predictive performance of different GLMs.
- Generate and interpret diagnostic plots for a GLM.
Preparatory Steps
Background
The `persinj` dataset contains information on n = 22,036 settled personal injury insurance claims reported during the period from July 1989 to the end of 1999. Claims settled with zero payment were not included.
Objective
Our objective here is to build GLMs to predict the size of settled claims using related risk factors in the dataset, select the most promising GLM, and quantify its predictive accuracy. For claim size variables, which are continuous, positive-valued and often highly skewed, common modeling options include:
- Apply a log transformation to claim size and fit a normal linear model to the log-transformed claim size.
- Build a GLM with the normal distribution and a link function such as the log link to ensure that the target mean is non-negative.
- Build a GLM with a continuous target distribution that is positive-valued and can capture the skewness of claim size. The gamma and inverse Gaussian distributions are reasonable choices of the target distribution in this regard.
We will explore all of these modeling options in this section and show that a gamma GLM outperforms traditional models based on the normal distribution.
TASK 1: Data pre-processing
Determine which numeric variables in the data, if any, should be treated as factors. Do the factor conversion.
```r
# CHUNK 1
persinj <- read.csv("persinj.csv")
summary(persinj)
```

```
      amt               inj            legrep          op_time     
 Min.   :     10   Min.   :1.00   Min.   :0.0000   Min.   : 0.10  
 1st Qu.:   6297   1st Qu.:1.00   1st Qu.:0.0000   1st Qu.:23.00  
 Median :  13854   Median :1.00   Median :1.0000   Median :45.90  
 Mean   :  38367   Mean   :1.83   Mean   :0.6366   Mean   :46.33  
 3rd Qu.:  35123   3rd Qu.:2.00   3rd Qu.:1.0000   3rd Qu.:69.30  
 Max.   :4485797   Max.   :9.00   Max.   :1.0000   Max.   :99.10  
```
Converting `inj` from a numeric variable to a categorical variable

The `inj` variable is merely a label of group membership indicating the injury group to which each policy belongs. Treating `inj` as a numeric variable, with a single regression coefficient attached, implicitly assumes that the change in the linear predictor is the same for every consecutive change in `inj` (i.e., from 1 to 2, from 2 to 3, etc.), which can be a severe restriction. It is therefore important to treat `inj` as a factor (equivalently, a categorical predictor) represented by dummy variables in a GLM.
```r
# CHUNK 2
persinj$inj <- as.factor(persinj$inj)
summary(persinj)
```

```
      amt          inj           legrep          op_time     
 Min.   :     10   1:15638   Min.   :0.0000   Min.   : 0.10  
 1st Qu.:   6297   2: 3376   1st Qu.:0.0000   1st Qu.:23.00  
 Median :  13854   3: 1133   Median :1.0000   Median :45.90  
 Mean   :  38367   4:  189   Mean   :0.6366   Mean   :46.33  
 3rd Qu.:  35123   5:  188   3rd Qu.:1.0000   3rd Qu.:69.30  
 Max.   :4485797   6:  256   Max.   :1.0000   Max.   :99.10  
                   9: 1256                                   
```
Since the baseline level ("1") already has the most observations, there is no need to relevel `inj`.
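As a quick illustrative aside (not one of the numbered chunks), the dummy-variable encoding induced by the factor conversion can be inspected with `model.matrix()`, and `relevel()` would be the tool if a different baseline were ever desired:

```r
# Inspect the dummy variables R will create for the factor inj
head(model.matrix(~ inj, data = persinj))

# If a different baseline were wanted (not needed here), e.g., level "2":
# persinj$inj <- relevel(persinj$inj, ref = "2")
```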
Model Construction and Evaluation
TASK 2: Fit a linear model as a benchmark
Setting up the training and test sets
Before we fit and evaluate any GLMs, we should establish the training (75%) and test (25%) sets with the aid of the `createDataPartition()` function, as in CHUNK 3.
```r
# CHUNK 3
library(caret)
set.seed(2019)
partition <- createDataPartition(y = persinj$amt, p = .75, list = FALSE)
data.train <- persinj[partition, ]
data.test <- persinj[-partition, ]
summary(data.train$amt)
summary(data.test$amt)
```

```
> summary(data.train$amt)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
     20    6297   13852   38670   35118 4485797 
> summary(data.test$amt)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
     10    6299   13858   37458   35140 1450818 
```
Findings:
- The maximum claim sizes of the two sets differ considerably, even with the use of stratified sampling.
- The two means are not as close to each other as we might expect.
- The two medians, 13,852 and 13,858, are very close to each other.
Benchmark Model: Normal Linear Model on Log of Claim Size
As a benchmark for all of the GLMs we are going to construct, our first candidate model is a linear model on the log of the claim size, fitted using ordinary least squares (OLS).
```r
ggplot(persinj, aes(x = op_time, y = log(amt), color = legrep)) +
  geom_point(alpha = 0.25) +
  geom_smooth(method = "lm", se = FALSE)
```
Findings:
- It seems plausible that legal representation and operational time interact with each other to affect `amt`, so we will incorporate an interaction term for these two predictors.
In CHUNK 4, we use the `glm()` function to fit the OLS linear model to the training set. As you can guess, the `glm()` function is for fitting GLMs (while the `lm()` function is for linear models). Its syntax closely resembles that of `lm()`, but in addition to the model formula and the data frame hosting your variables, we also have to specify the family of distributions to which the target variable belongs, as well as the link function to be used (if it is not the default link), via a command of the form `family = <family>(link = "<link>")`.
```r
# CHUNK 4
glm.ols <- glm(log(amt) ~ inj + legrep * op_time,
               family = gaussian(link = "identity"),
               data = data.train)
summary(glm.ols)
```

```
Call:
glm(formula = log(amt) ~ inj + legrep * op_time, family = gaussian(link = "identity"),
    data = data.train)

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)     7.5212622  0.0269021 279.579  < 2e-16 ***
inj2            0.5912759  0.0237858  24.858  < 2e-16 ***
inj3            0.8161053  0.0385945  21.146  < 2e-16 ***
inj4            0.7958558  0.0901834   8.825  < 2e-16 ***
inj5            0.6462279  0.0894089   7.228 5.12e-13 ***
inj6            0.3823662  0.0777895   4.915 8.95e-07 ***
inj9           -0.8151999  0.0367039 -22.210  < 2e-16 ***
legrep          0.9120834  0.0339725  26.848  < 2e-16 ***
op_time         0.0358738  0.0005093  70.431  < 2e-16 ***
legrep:op_time -0.0101808  0.0006384 -15.946  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 1.146774)

    Null deviance: 35056  on 16527  degrees of freedom
Residual deviance: 18942  on 16518  degrees of freedom
AIC: 49180

Number of Fisher Scoring iterations: 2
```
Findings:
- Instead of showing the \(R^2\) and F-statistic, which are specific to linear models, the model summary above shows the deviance (labeled Residual deviance) and the AIC of the OLS linear model, in addition to the coefficient estimates, standard errors, and p-values. (Both quantities can also be extracted directly; see the sketch below.)
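As a small aside, here is a minimal sketch (using `glm.ols` from CHUNK 4) showing how to pull out these quantities programmatically; the last line computes a "proportion of deviance explained" pseudo-\(R^2\), which is not reported by `summary()` itself:

```r
deviance(glm.ols)  # residual deviance
AIC(glm.ols)       # Akaike information criterion

# A pseudo-R^2: proportion of deviance explained by the model
1 - glm.ols$deviance / glm.ols$null.deviance
```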
We then use the fitted OLS model to make predictions on the test set and calculate the test RMSE.
```r
# CHUNK 5
pred.ols <- exp(predict(glm.ols, newdata = data.test))
head(pred.ols)
RMSE(data.test$amt, pred.ols)
RMSE(data.test$amt, mean(data.train$amt))
```

```
> head(pred.ols)
        3        10        13        21        24        27 
1853.5345 1853.5345 4609.6856  820.2833  820.2833  820.2833 
> RMSE(data.test$amt, pred.ols)
[1] 72888.33
> RMSE(data.test$amt, mean(data.train$amt))
[1] 79370.9
```
Assessing the Performance of the OLS Model:
To get a feel for how the OLS model performs, we can compare its test RMSE, which is 72,888.33, with the test RMSE of the intercept-only model fitted to the training set, which uses the mean of `amt` on the training set (not the test set!) as the prediction for every observation. The test RMSE of the latter model is 79,370.9, so the OLS model represents a decent improvement over the intercept-only model. (A by-hand check of what caret's `RMSE()` computes appears below.)
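For the avoidance of doubt, here is a minimal sketch verifying that `RMSE()` from caret agrees with the defining formula \(\sqrt{\tfrac{1}{n}\sum_{i}(y_i-\hat{y}_i)^2}\), using the objects from CHUNK 5:

```r
# RMSE from first principles; should match RMSE(data.test$amt, pred.ols)
sqrt(mean((data.test$amt - pred.ols)^2))
```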
TASK 3: Select a distribution and link function
GLM 1: Log-Link Normal GLM on Claim Size
This GLM ensures positive predictions, but a drawback is that the normal distribution still allows the observations of the target variable to be negative, which is inconsistent with claim sizes.
```r
# CHUNK 6
glm.log <- glm(amt ~ inj + legrep * op_time,
               family = gaussian(link = "log"),
               data = data.train)
summary(glm.log)
```

```
Call:
glm(formula = amt ~ inj + legrep * op_time, family = gaussian(link = "log"),
    data = data.train)

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)     8.559238   0.120163  71.230  < 2e-16 ***
inj2            0.572982   0.031018  18.473  < 2e-16 ***
inj3            0.825502   0.034683  23.801  < 2e-16 ***
inj4            0.883796   0.057430  15.389  < 2e-16 ***
inj5            1.531415   0.034374  44.552  < 2e-16 ***
inj6            1.506266   0.042987  35.040  < 2e-16 ***
inj9           -0.552178   0.146506  -3.769 0.000164 ***
legrep          0.068766   0.138410   0.497 0.619318    
op_time         0.027185   0.001508  18.029  < 2e-16 ***
legrep:op_time  0.003195   0.001721   1.857 0.063383 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 6897153068)

    Null deviance: 1.4770e+14  on 16527  degrees of freedom
Residual deviance: 1.1393e+14  on 16518  degrees of freedom
AIC: 421348

Number of Fisher Scoring iterations: 8
```
```r
pred.log <- predict(glm.log, newdata = data.test, type = "response")
head(pred.log)
RMSE(data.test$amt, pred.log)
```

```
> head(pred.log)
       3       10       13       21       24       27 
5228.901 5228.901 5602.912 3010.251 3010.251 3010.251 
> RMSE(data.test$amt, pred.log)
[1] 69623.89
```
The use of `type = "response"`:
- By default, the `predict()` function generates predictions for the GLM on the scale of the linear predictor, i.e., the log scale here, due to the use of the log link. To transform the predictions to the original scale of the target variable, we include the option `type = "response"`, which obviates the need to exponentiate the predictions explicitly. (A quick equivalence check follows this list.)
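As a quick sanity check (a sketch using the objects from CHUNK 6), the two prediction scales are related by exponentiation under the log link:

```r
# Link-scale predictions, exponentiated by hand...
pred.link <- exp(predict(glm.log, newdata = data.test))

# ...agree with the response-scale predictions from type = "response"
all.equal(as.numeric(pred.link), as.numeric(pred.log))
```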
Findings:
- To our delight, the test RMSE decreases to 69,623.89, suggesting that the normal GLM with the log link outperforms the OLS linear model in terms of prediction accuracy.
GLM 2: Log-Link Gamma GLM
We now completely do away with the normal distribution and adopt a distribution that aligns with the characteristics of the target variable, `amt`, which is a positive, continuous variable with a right skew.
In this regard, the gamma distribution is an appropriate choice. Although the canonical link for the gamma distribution is the inverse link \(g(\mu)=\dfrac{1}{\mu}\), the log link is much more commonly used: it not only ensures that all predictions are positive, a characteristic of the target variable, but also makes the model coefficients easy to interpret, since exponentiated coefficients correspond to multiplicative changes in the target mean. Thus, we will use a gamma GLM with a log link.
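For contrast, the canonical inverse link could be requested as in the sketch below (an illustrative aside, not one of the numbered chunks; convergence is not guaranteed on this data, and the resulting coefficients would lack the multiplicative interpretation):

```r
# Gamma GLM with the canonical (inverse) link -- shown for contrast only
glm.gamma.inv <- glm(amt ~ inj + legrep * op_time,
                     family = Gamma(link = "inverse"),
                     data = data.train)
```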
Exam Note:
The June 2019 PA exam model solution says that when asked to select a link function for a target variable that is positive, continuous, and right-skewed,
"[m]any candidates went with the canonical link function just because it is canonical. That is a weak justification."
If the link function you select happens to be the canonical link, you can cite this fact as an additional reason to support your choice, but a link being canonical does not by itself make it a good link. There are more important factors to consider, such as interpretability and whether the link function guarantees reasonable predictions.
Run CHUNK 7 to train a gamma GLM with the log link (notice the option `family = Gamma(link = "log")`, with the letter G capitalized), produce predictions on the test set, and calculate the test RMSE.
```r
# CHUNK 7
glm.gamma <- glm(amt ~ inj + legrep * op_time,
                 family = Gamma(link = "log"),
                 data = data.train)
summary(glm.gamma)
```

```
Call:
glm(formula = amt ~ inj + legrep * op_time, family = Gamma(link = "log"),
    data = data.train)

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)     8.2056640  0.0331403 247.604  < 2e-16 ***
inj2            0.5975174  0.0293014  20.392  < 2e-16 ***
inj3            0.8568408  0.0475440  18.022  < 2e-16 ***
inj4            1.1029945  0.1110956   9.928  < 2e-16 ***
inj5            1.5036755  0.1101415  13.652  < 2e-16 ***
inj6            0.9004443  0.0958278   9.396  < 2e-16 ***
inj9           -0.6321655  0.0452150 -13.981  < 2e-16 ***
legrep          0.4393948  0.0418502  10.499  < 2e-16 ***
op_time         0.0344292  0.0006275  54.871  < 2e-16 ***
legrep:op_time -0.0049104  0.0007865  -6.243 4.38e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for Gamma family taken to be 1.740277)

    Null deviance: 33141  on 16527  degrees of freedom
Residual deviance: 16705  on 16518  degrees of freedom
AIC: 365833

Number of Fisher Scoring iterations: 8
```
```r
RMSE(data.test$amt, predict(glm.gamma, newdata = data.test, type = "response"))
```

```
> RMSE(data.test$amt, predict(glm.gamma, newdata = data.test, type = "response"))
[1] 68831.79
```
Findings:
- It is gratifying to see that the test RMSE drops further to 68,831.79, compared with 69,623.89 for the normal GLM with the log link. This makes the gamma GLM the most predictive of the models we have fitted so far with respect to the test RMSE.
GLM 3: Inverse Gaussian GLM
An alternative to a gamma GLM is an inverse Gaussian GLM. The inverse Gaussian distribution often has a fatter tail than a gamma distribution. Its canonical link function is the inverse square link \(g(\mu)=\dfrac{1}{\mu^2}\), which ensures positive predictions but is not easy to interpret, so the log link is also commonly used.
In CHUNK 8, we try to fit an inverse Gaussian GLM with a log link, but to no avail.
```r
# CHUNK 8
glm.ig <- glm(amt ~ inj + legrep * op_time,
              data = data.train,
              family = inverse.gaussian(link = "log"))
```

```
Error: inner loop 1; cannot correct step size
In addition: Warning messages:
1: step size truncated due to divergence 
2: step size truncated due to divergence 
```
It turns out that the fitting cannot be completed due to convergence issues.
This is not a mistake on our part – there was also no convergence when an inverse Gaussian GLM was fitted in the June 2019 PA exam! To make the fitting work, we would have to tweak the optimization procedure (remember that GLM parameters are estimated by maximum likelihood, which involves numerical optimization), which is likely beyond the scope of Exam PA; one possible tweak is sketched below. As the inverse Gaussian GLM cannot be evaluated, we will stick with the gamma GLM developed earlier and use it for subsequent tasks.
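If you did want to experiment, one possible (and not guaranteed) tweak is to give the optimizer a sensible warm start and more iterations; the sketch below seeds the fit with the gamma GLM's coefficients via `glm()`'s `start` and `control` arguments:

```r
# A possible workaround (not guaranteed to converge): warm-start the
# IRLS algorithm at the gamma GLM's coefficients and allow more iterations
glm.ig <- glm(amt ~ inj + legrep * op_time,
              data = data.train,
              family = inverse.gaussian(link = "log"),
              start = coef(glm.gamma),
              control = glm.control(maxit = 100))
```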
TASK 4: Validate the model
Model Validation
Gamma GLM vs. Benchmark OLS Model
From CHUNK 7, the test RMSE of the gamma GLM with the log link is 68,831.79, which is not only lower than that of the OLS model (72,888.33) but also the lowest among all of the models we have fitted. The model summary of the gamma GLM shows that all coefficient estimates are statistically significant, with extremely small p-values, so it is wise to retain all existing features in the model.
Model Diagnostics
As a final check, in CHUNK 9 we make some diagnostic plots for the gamma GLM to see whether the model assumptions are largely satisfied. The results are not perfect, though.
```r
# CHUNK 9
plot(glm.gamma, which = 1)
qqnorm(residuals(glm.gamma))
qqline(residuals(glm.gamma))
```
Findings:
- “Residuals vs Fitted” Plot
- The plot shows that the residuals mostly scatter around 0 in a structureless manner, but with a sizable amount of fluctuation.
- The spread of the positive residuals tends to be much larger than that of the negative residuals, with a few outliers having unusually large positive residuals (observations 5422, 10224, and 18870).
- Q-Q Plot
- The Q-Q plot allows us to assess the normality of the standardized deviance residuals, with a straight line passing through the 25th and 75th theoretical percentiles shown for reference (in contrast to the 45° straight line for linear models).
- The plot looks fine in the middle portion and on the left end, but seems problematic on the right end, with the points deviating quite markedly from the reference line.
- The fact that the rightmost points deviate significantly upward means that there are many more large, positive standardized deviance residuals than a normal distribution would produce; by extension, the data are more right-skewed than the fitted gamma distribution implies.
- It appears that a fatter-tailed model may perform better.
(Note: There is also a Q-Q plot produced by running `plot(glm.gamma)`, but that plot is for standardized Pearson residuals, which, unlike deviance residuals, are not approximately normal even if a correct GLM is specified. This is why we use separate code to produce a Q-Q plot for deviance residuals.)
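As a side note, the `residuals()` function lets you request either residual type explicitly via its `type` argument ("deviance" is the default for GLM objects); a minimal sketch:

```r
# Deviance residuals (the default for GLM objects)
head(residuals(glm.gamma, type = "deviance"))

# Pearson residuals, for comparison
head(residuals(glm.gamma, type = "pearson"))
```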
TASK 5: Interpret the model
Run the selected model from Task 3 on the full dataset.
In CHUNK 10, we rerun the gamma GLM on the full dataset (note the argument `data = persinj`, not `data = data.train`) to provide more robust predictions for new, future data (the larger the dataset used to train a model, the more reliable the fit) and print the model summary.
```r
# CHUNK 10
glm.final <- glm(amt ~ inj + legrep * op_time,
                 family = Gamma(link = "log"),
                 data = persinj)
summary(glm.final)
```

```
Call:
glm(formula = amt ~ inj + legrep * op_time, family = Gamma(link = "log"),
    data = persinj)

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)     8.2033907  0.0279443 293.562  < 2e-16 ***
inj2            0.6281073  0.0248995  25.226  < 2e-16 ***
inj3            0.8882054  0.0402472  22.069  < 2e-16 ***
inj4            1.1199670  0.0951566  11.770  < 2e-16 ***
inj5            1.3963478  0.0955128  14.619  < 2e-16 ***
inj6            0.8867918  0.0816012  10.867  < 2e-16 ***
inj9           -0.6205268  0.0381881 -16.249  < 2e-16 ***
legrep          0.4437842  0.0354423  12.521  < 2e-16 ***
op_time         0.0343052  0.0005303  64.685  < 2e-16 ***
legrep:op_time -0.0050443  0.0006663  -7.571 3.86e-14 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for Gamma family taken to be 1.675808)

    Null deviance: 44010  on 22035  degrees of freedom
Residual deviance: 22242  on 22026  degrees of freedom
AIC: 487535

Number of Fisher Scoring iterations: 8
```
```r
# Exponentiate the coefficients to get the multiplicative changes
exp(coef(glm.final))
```

```
   (Intercept)           inj2           inj3           inj4           inj5 
  3653.3167103      1.8740602      2.4307635      3.0647532      4.0404165 
          inj6           inj9         legrep        op_time legrep:op_time 
     2.4273297      0.5376611      1.5585941      1.0349004      0.9949684 
```
Findings:
- The model summary shows that injury code, legal representation, and operational time are all statistically significant factors affecting the expected size of a settled personal injury claim.
- There is a significant interaction between legal representation and operational time, meaning that the effect of operational time on the expected claim size differs between injuries with and without legal representation. Since the log link is used, the expected claim size is multiplied by \(e^{\beta_j}\) for every unit increase in a numeric predictor with coefficient \(\beta_j\), or when a qualitative predictor moves from its baseline level to a new level represented by a dummy variable with coefficient \(\beta_j\), holding everything else constant (equivalently, the percentage change is \(e^{\beta_j}-1\)).
In this current context, we can interpret the coefficient estimates more precisely as follows:
- With the exception of injuries with injury code 9 (which represents unclassified injuries), the expected claim size for injuries with codes 2, 3, 4, 5, 6 is higher than that for injuries with code 1 (baseline).
- For instance, the expected claim size for injuries with code 5 is estimated to be \(e^{1.3963478} = 4.0404\) times that of injuries with code 1. That the estimated coefficients for codes 2, 3, 4, 5, 6 are all positive and mostly increasing across the codes makes intuitive sense: the more serious the injury, the larger we would expect the claim size to be. However, we would expect the estimated coefficient for code 6 to be the largest, as code 6 corresponds to fatalities; this is not the case for the gamma GLM.
- Operational time is positively associated with the expected claim size, though the effect is not as positive for injuries with legal representation. A unit increase in operational time is associated with a multiplicative increase of \(e^{0.0343052} = 1.0349004\) in the expected claim size for injuries without legal representation. The multiplicative increase reduces to \(e^{0.0343052-0.0050443} = 1.029693\) for injuries with legal representation. The positive effect is reasonable because larger claims are often more difficult to quantify and require more time to settle, leading to longer delays.
- For most injuries, those with legal representation have a higher expected claim size than those without legal representation, unless the injuries take an extraordinarily long time to settle (operational time higher than 88, to be precise; see the quick check after this list). Again, this is intuitive, as clients with the assistance of legal advice are in a better position to fight for a larger settled claim size.
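Where does the threshold of 88 come from? With the log link, the legal-representation effect on the linear predictor is \(0.4437842 - 0.0050443 \times \text{op\_time}\), which changes sign at roughly \(0.4437842 / 0.0050443 \approx 88\). A quick check using `glm.final` from CHUNK 10:

```r
# Operational time at which the legal-representation effect flips sign
unname(coef(glm.final)["legrep"] / -coef(glm.final)["legrep:op_time"])
```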
The findings above generally align with our intuition and shed light on the directional impact of different risk factors on the settled claim amount. Using these findings, policyholders may, for example, find it advantageous to employ legal representation in the hope of having a larger settled claim.
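To make the interpretation concrete, here is a small illustrative sketch (the profile below is hypothetical) showing how the fitted coefficients translate into a predicted claim size:

```r
# Hypothetical profile: injury code 1, with legal representation,
# and an operational time of 50
newclaim <- data.frame(inj = factor("1", levels = levels(persinj$inj)),
                       legrep = 1,
                       op_time = 50)
predict(glm.final, newdata = newclaim, type = "response")

# Equivalently, by hand on the link scale:
# exp(8.2033907 + 0.4437842 + (0.0343052 - 0.0050443) * 50)
```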
EXAM NOTE
When you are asked to interpret the results of a GLM, the following three-part structure can be quite useful:
- First interpret the precise values of the estimated coefficients, e.g., every unit increase in a continuous predictor is associated with a multiplicative change of \(e^{\beta_j}\) in the expected value of the target variable, holding everything else fixed.
- Comment on whether the sign of the estimated coefficients conforms to intuition. Do the results make sense? You can use your common knowledge in this part.
- Relate the findings to the “big picture” and explain how they can help clients make a better decision in life.