
SOA ASA Exam: Predictive Analysis (PA) – 4.1. Generalized Linear Models


EXAM PA LEARNING OBJECTIVES

Learning Objectives

The Candidate will be able to describe and select a Generalized Linear Model (GLM) for a given data set and regression or classification problem.

Learning Outcomes

The Candidate will be able to:

  • Understand the specifications of the GLM and the model assumptions.
  • Create new features appropriate for GLMs.
  • Interpret model coefficients, interaction terms, offsets, and weights.
  • Select and validate a GLM appropriately.
  • Explain the concepts of bias, variance, model complexity, and the bias-variance trade-off.

In Exam PA, there are often tasks that require you to describe, in high-level terms, what a GLM is and the pros and cons of a GLM relative to other predictive models, so the conceptual aspects of GLMs will be useful not only for understanding the practical implementations of GLMs in the next three sections, but also for tackling exam items.

Because all of the feature generation (e.g., binarization of categorical predictors, introduction of polynomial and interaction terms) and feature selection techniques (e.g., stepwise selection algorithms and regularization) for linear models generalize to GLMs in essentially the same way and everything we learned about the bias-variance trade-off for linear models also applies here, our focus in this section is on issues that are specific to GLMs but absent for linear models. These issues include:

  • Selection of target distributions and link functions
  • Use of offsets and weights
  • Model quantities and diagnostic tools for a GLM
  • Evaluation of GLMs for binary target variables

 

GLMs

Dating back to the 1970s, GLMs are a predictive analytics tool whose importance continues to rise, especially in general insurance settings, where most variables are non-normal in nature but are amenable to generalized linear modeling. If one word is to summarize the virtues of a GLM relative to a linear model, that word is probably “flexible”. Compared to linear models, GLMs provide considerable flexibility and substantially widen the scope of applications in two respects:

  • Distribution of the target variable:
    The target variable in a GLM is no longer confined to the class of normal random variables; it needs only be a member of the so-called exponential family of distributions. The mathematical details of the exponential family are not important for Exam PA; it is enough to know that this is a rich class of distributions that includes a number of discrete and continuous distributions commonly encountered in practice, such as the normal, Poisson, binomial, gamma, and inverse Gaussian distributions. For many of these distributions, the mean and variance are intricately related (e.g., for Poisson, the mean and variance coincide; for gamma, the variance is proportional to the square of the mean). GLMs therefore provide a unifying approach to modeling binary, discrete, and continuous target variables with different mean-variance relationships, and to handling both regression and classification problems, all within the same statistical framework.
  • Relationship between the target mean and linear predictors:
    Instead of equating the mean of the target variable directly with the linear combination of predictors, a GLM sets a function of the target mean to be linearly related to the predictors. This allows us to impose linearity on a different scale of our choosing and analyze various situations in which the effects of the predictors on the target mean are more complex than merely additive in nature.

Mathematically, the equation of a GLM is of the form:

\(g(\mu)=\eta:=\beta_0+\beta_1X_1+…+\beta_pX_p\), where:

  • \(g(\cdot)\): The link function “linking” the target mean µ to the linear combination of the predictors, sometimes referred to as the linear predictor η for convenience. The link function can be any monotonic function (e.g., identity, log, inverse); its monotonicity allows us to invert the link function and make the target mean µ the subject:

\(\mu=g^{-1}(\eta)=g^{-1}(\beta_0+\beta_1X_1+…+\beta_pX_p)\),

Together, the distribution of the target variable and the link function g fully define a GLM:

GLM:         target distribution (a member of the exponential family) + link function

In the special case when the target variable is normally distributed and the link function is the identity function g(µ) = µ, we are back to the linear model setting in Chapter 3 with equation \(\mu=\eta=\beta_0+\beta_1X_1+…+\beta_pX_p\).
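As a quick illustration of this special case, the following minimal R sketch (assuming a hypothetical data frame `dat` with target `y` and predictors `x1` and `x2`) fits the same model both as a linear model and as a GLM with a normal target and identity link; the two sets of coefficient estimates agree.

```r
# Hypothetical data frame `dat` with target y and predictors x1, x2
fit_lm  <- lm(y ~ x1 + x2, data = dat)
fit_glm <- glm(y ~ x1 + x2, family = gaussian(link = "identity"), data = dat)

coef(fit_lm)   # identical to coef(fit_glm): the GLM reduces to the linear model here
coef(fit_glm)
```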

 

GLMs vs. Linear Models on Transformed Data

One of the most common misconceptions about GLMs is to confuse them with linear models fitted to transformed data. While both modeling approaches involve the use of transformations, the transformations are applied differently (externally vs. internally), leading to aesthetically similar but fundamentally different predictive models. The rationale behind transforming the data (e.g., by the log transformation) is to bring the target observations closer to normal so that they can be reasonably described by a normal linear model.

For GLMs, however, the data is modeled directly, using the appropriate target distribution. The target variable is not transformed and the transformation plays its role only within the GLM itself. To see the differences between the two approaches more clearly, consider the following two models, both of which entail the log transformation:

Model 1 (Linear model fitted to \(\ln Y\), i.e., GLM with log-transformed target and identity link):

Suppose that we have fitted a simple linear regression model with equation:

\(\ln Y=\beta_0+\beta_1 X+\varepsilon\), where \(\varepsilon \sim N(0,\sigma^2)\)

or, upon exponentiation,

\(Y=e^{\beta_0+\beta_1 X+\varepsilon}\)

As \(e^\varepsilon\) is lognormally distributed, so is Y. In particular, the target variable can take strictly positive values only.

Model 2 (GLM with normal target and log link):

Now suppose that we have fitted a GLM with a normal target variable Y, a single predictor X, and the log link. The model equation is:

\(\ln \mu=\beta_0+\beta_1 X\)

or, upon exponentiation,

\(\mu=e^{\beta_0+\beta_1 X}\)

Thus the target variable Y follows a normal distribution with mean \(\mu=e^{(\beta_0+\beta_1 X)}\). Although its mean is constrained to be non-negative, the target variable itself can take positive or negative values as a normal random variable.

Note that in this model, the log transformation is applied to the target mean only; the target variable is left untransformed.

As you can see, the two modeling methods imply entirely different behavior of the target variable. Which model should we recommend? That depends on which one is easier to interpret or has better predictive performance. Neither model yields universally superior answers; the dataset we work with will decide the winner.
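To make the contrast concrete, here is a minimal R sketch of the two approaches, assuming a hypothetical data frame `dat` with a positive target `Y` and a predictor `X`. (The normal GLM with a log link can occasionally need starting values; the call below shows the default form.)

```r
# Model 1: linear model fitted to the log-transformed target (identity link on ln Y)
model1 <- lm(log(Y) ~ X, data = dat)

# Model 2: GLM with a normal (gaussian) target and a log link; Y itself is untransformed
model2 <- glm(Y ~ X, family = gaussian(link = "log"), data = dat)

# Predictions from model1 are on the log scale and must be exponentiated;
# predictions from model2 are already on the original scale of Y.
head(exp(predict(model1)))
head(predict(model2, type = "response"))
```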

 

Exercise (Log link with a logged predictor)

Your supervisor has asked you to construct a log-link GLM using the natural logarithm of X, a positive numeric variable, as the single predictor.

  1. Investigate the relationship between the target mean and X in this GLM.
  2. Describe the advantage and disadvantage of this specification compared to using the original, non-transformed X as the predictor.

(Note: Unless otherwise requested on the exam, just use the non-transformed variable as the predictor in a GLM.)

 

Solution

The relationship between the target mean and X

The model equation is \(\ln \mu=\beta_0+\beta_1 \ln X\). Exponentiating both sides gives:

\(\mu=e^{\beta_0}e^{\beta_1 \ln X}=e^{\beta_0}X^{\beta_1}\)

which shows that the target mean has a power relationship with X, with the coefficient β1 being the exponent.

 

Advantage

Original form of X (model equation \(\mu=e^{\beta_0+\beta_1 X}\)):

  • The target mean has a convex, exponential relationship with X: a positive β1 implies a positive relationship and a negative β1 a negative relationship.
  • Consequently, the target mean must be increasing at a faster and faster rate, or decreasing at a slower and slower rate, as X increases.

Logged version of X (model equation \(\ln \mu=\beta_0+\beta_1 \ln X\), i.e., \(\mu=e^{\beta_0}X^{\beta_1}\)):

  • The target mean has a power relationship with X, with the coefficient β1 being the exponent.
  • If β1 > 1, then µ is increasing in X at an increasing rate, similar to \(\mu=e^{\beta_0+\beta_1 X}\) with a positive β1.
  • If β1 = 1, then \(\mu=e^{\beta_0} X\) is directly linear in X, which is not possible for \(\mu=e^{\beta_0+\beta_1 X}\).
  • If 0 < β1 < 1, then µ is increasing in X, but at a decreasing rate. This increasing concave relationship cannot be captured by \(\mu=e^{\beta_0+\beta_1 X}\).
  • If β1 < 0, then µ is decreasing in X at a slower and slower rate, just like \(\mu=e^{\beta_0+\beta_1 X}\) with a negative β1.

In short, using the logged version of X allows for a wider variety of shapes for the target mean as a function of X: the target mean can be increasing in X at either an increasing or a decreasing rate, or decreasing in X.

Disadvantage

The log transformation is applicable only to strictly positive variables, so the specification above may not work for all kinds of numeric variables.

Also, it is a priori uncertain which function, \(\mu=e^{\beta_0+\beta_1 X}\) or \(\mu=e^{\beta_0}X^{\beta_1}\), better captures the true relationship between µ and X.

 

Selection of Target Distributions and Link Functions

A very common item in Exam PA is the selection of an appropriate target distribution (i.e., distribution of the target variable) and a link function with justification for a given situation. The importance of this exam item is evidenced by almost all past PA exams having a task about selecting a target distribution and link function.

 

Component 1: The target distribution

With respect to the choice of the target distribution, a powerful feature of a GLM is that it can accommodate a variety of distributions and we are at liberty to choose one that best aligns with the characteristics of a given target variable. Here are some common examples of variables that can be easily analyzed in the GLM framework:

  • Positive, continuous, and right-skewed data:
    In insurance applications and financial studies, variables that are positive, continuous, and right-skewed, such as claim amounts, income, and amount of insurance coverage, play a prominent role. Although a linear model applied to the log-transformed version of these variables may work fine in some cases, GLMs supply a variety of target distributions, like gamma and inverse Gaussian, that capture the skewness of the target variable directly without the use of transformations.
    For gamma (and for most distributions in the GLM framework), the mean and variance are positively related, which is a desirable characteristic when it comes to modeling claim severity and makes gamma arguably the most widely used distribution for this purpose. The inverse Gaussian distribution has a similar behavior, but is more highly skewed than the gamma distribution.
  • Binary data:
    When the target variable is binary (usually coded as 0 or 1 for convenience), which happens in a classification problem with the target variable indicating the occurrence (1) or non-occurrence (0) of an event of interest, the binomial (Bernoulli, to be precise) distribution is probably the only reasonable choice for the target distribution. Because the mean of such a variable is the probability that the event of interest occurs, a GLM for the mean is in effect modeling the event probability, a property unique to binary target variables. Examples of binary target variables include whether or not a policyholder lapses or submits a claim, whether or not a submitted claim is fraudulent, whether or not an actuarial student passes Exam PA, and whether or not an individual has contracted COVID-19.
  • Count data: Now consider target variables that represent the number of times a certain event of interest happens over a reference time period. These variables assume only non-negative integer values, possibly with a right skew, so the Poisson distribution is one of the most natural candidate distributions. One of its drawbacks is that it requires that its mean and variance be equal, as we have learned from Exam P. When the variance of the target variable exceeds its mean,
    a situation known as overdispersion, the Poisson distribution can be tweaked to account for this. Alternatively, other count distributions such as negative binomial (not discussed in the PA modules) can be explored.

All of the three types of variables above have been tested in past PA exams. The variable type below is less frequently tested (it was eventually tested in the June 21, 2021 exam!) and more subtle.

  • Aggregate data:
    Besides discrete (e.g., binomial, Poisson) and continuous distributions (e.g., normal, gamma, inverse Gaussian), the exponential family of distributions also has a member with a mixture of discrete and continuous components: the Tweedie distribution. We say that a random variable S follows the Tweedie distribution if it admits the compound Poisson-gamma representation \(S=\sum^{N}_{i=1}{X_i}=X_1+…+X_N\), where N is a Poisson random variable and the Xi's are i.i.d. gamma random variables independent of N. The Tweedie distribution enjoys the following distinguishing characteristics:

    • As a Poisson sum of gamma random variables, Tweedie is an “in-between” distribution of Poisson and gamma, which are combined into a single distribution (mathematical details not required for Exam PA).
    • It has a discrete probability mass at zero and a probability density function on the positive real line. Such a mixed nature makes the Tweedie distribution particularly suitable for modeling aggregate claim losses, which tend to have a large mass at zero as most policies have no claims (the size of the mass can be controlled by specifying the Poisson parameter λ appropriately) and a continuous distribution skewed to the right reflecting total claim sizes when there are claims (as gamma is right-skewed).

There are no examples in the PA modules demonstrating how a Tweedie distribution is fitted in R and the cplm package designed specifically for Tweedie will not be available on the exam. In fact, the Tweedie distribution is mentioned in one and only one slide in the PA modules (Slide 28 of Module 6), so it is probably not something to be heavily tested on. If there are exam questions set on the Tweedie distribution, then they should be conceptual ones testing what Tweedie is and how it can serve as an alternative model for loss modeling.

 

Exercise (Selection of Target Distributions)

For each of the target variables below, suggest a target distribution that is suitable for modeling the variable. Briefly explain why that target distribution is appropriate.

  1. House prices
  2. Policy renewal retention
  3. Number of claims submitted by a policyholder per policy year
  4. The daily amount of revenue of an amusement park (if there are no visitors, the revenue will be zero)

 

Solution

  1. The gamma or inverse Gaussian distributions both work well as they produce only positive outcomes and are right-skewed in nature, consistent with the character of house prices.
  2. The binomial distribution is arguably the only appropriate distribution for policy renewal retention, which is a yes/no outcome.
  3. The Poisson distribution may be suitable for modeling the number of claims, which is a count variable, although it is vulnerable to the problem of overdispersion.
  4. The daily amount of revenue of an amusement park is a compound variable combining the number of visitors and the amount of money spent by each visitor. A reasonable distribution may be the Tweedie distribution, which places a probability mass at zero, corresponding to no visitors, and a probability density function on the positive real line representing the positive revenue received.
    Remark: The Tweedie distribution is by no means limited to actuarial contexts. It is widely applicable to aggregate variables with a probability mass at zero and a right skew on the positive line.

 

Component 2: The Link Function

Given the target distribution, we can proceed to specify the link function. There are several important considerations informing the choice of the link function:

Appropriateness of Predictions

Arguably the most important criterion for a good link function is that the range of values of the target mean implied by the GLM, which is \(\mu=g^{-1}(\eta)\) with \(\eta\in\mathbb{R}\), should be consistent with the support of the target distribution. Let’s illustrate what we mean with two examples:

Example 1: Positive mean

For the Poisson, gamma, and inverse Gaussian distributions, the target mean is a priori known to be positive and unbounded from above, in which case a link function such as the log link g(µ) = ln µ is a good candidate as it ensures that the mean of the GLM, given by \(\mu=e^{\eta}\), is always positive, and any positive value can be captured. Remember from calculus that the exponential function \(e^x\) is always positive, whether the argument x is positive or negative.

The log link also has the advantage of being easy to interpret because it generates a multiplicative model, as we will see in the “Interpretation of GLM coefficients” paragraph below.

 

Example 2: Unit-valued mean

For binary target variables, the target mean (not the target variable itself) is the probability of the event of interest, which is always between 0 and 1. Taking this into account, the link function should ensure that the mean implied by the GLM is unit-valued. A common choice is the logit link given by:

\(g(\pi)=\ln \dfrac{\pi}{1-\pi}=\ln(\text{odds})\),

where π is the mean of the binary target variable and \(\pi/(1-\pi)\), known as the odds of the event of interest, is the ratio of the probability of occurrence to the probability of non-occurrence; it provides a measure of likelihood on a scale from 0 to +∞ (not from 0 to 1). Inverting the model equation \(g(\pi)=\eta\), we can express the target mean in terms of the linear predictor as:

\(\pi=\dfrac{e^{\eta}}{1+e^{\eta}}=\dfrac{1}{1+e^{-\eta}}\)

which is an S-shaped function of η always valued between 0 and 1 (note that \(e^{-\eta}>0\) always). As we will see below, the logit link is also easy to interpret due to its connection with the log link. Terminology-wise, a GLM for a binary target variable with the logit link is called a logistic regression model (somewhat of a misnomer; it is actually a classification model).
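In R, such a logistic regression model is fitted with glm() using the binomial family. The sketch below assumes a hypothetical data frame `dat` with a 0/1 target `claim_ind` and predictors `age` and `gender`.

```r
# Logistic regression: binomial target with the logit link
fit_logit <- glm(claim_ind ~ age + gender,
                 family = binomial(link = "logit"),
                 data = dat)

# Predicted event probabilities pi = 1 / (1 + exp(-eta)), always between 0 and 1
p_hat <- predict(fit_logit, type = "response")
```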

 

Exercise: Selection of target distribution and link function

Determine which of the following pairs of distribution and link function is the most appropriate for modeling whether or not a person is hospitalized.

    (A) Normal distribution, identity link function.
    (B) Normal distribution, logit link function.
    (C) Binomial distribution, identity link function.
    (D) Binomial distribution, logit link function.
    (E) It cannot be determined from the information given.

 

Solution

Whether a person is hospitalized or not is a binary variable, which is best modeled by a binomial (more precisely, Bernoulli) distribution, leaving only Answers (C) and (D). The link function should be one that restricts the Bernoulli target mean to the range zero to one. Among the identity and logit links, only the logit link has this property. (Answer: (D))

 

Interpretability

Ease of interpretation is another property that defines a good link function. We say that a GLM is easy to interpret if it is easy to form statements that describe the effects of the predictors of the model on the target mean in terms of the model coefficients, e.g., if a continuous predictor X increases by 1 unit, by how much will the target mean change? An interpretable model will allow its users to appreciate the model results much more easily.

Some link functions are more interpretable than others and this makes them more commonly used in applied work. We will address the interpretation of a GLM more fully in the “Interpretation of GLM coefficients” paragraph below.

 

Canonical link (Less important)

One more way to help with the specification of the link function is to look at the canonical link function that is associated with each target distribution. The canonical link functions for a number of common target distributions are tabulated in the table below.

Target Distribution | Canonical Link | Mathematical Form | Other Link(s)
Normal | Identity | \(g(\mu)=\mu\) | —
Binomial | Logit | \(g(\pi)=\ln\left(\dfrac{\pi}{1-\pi}\right)\) | Probit, cloglog
Poisson | Natural log | \(g(\mu)=\ln \mu\) | —
Gamma | Inverse | \(g(\mu)=\mu^{-1}\) | Log
Inverse Gaussian | Squared inverse | \(g(\mu)=\mu^{-2}\) | Log

Canonical links have the advantages of simplifying the mathematics of the estimation procedure and making it more likely to converge, but these merits alone do not mean that canonical links should always be used. More important factors to consider are the two considerations described above:

  1. Whether the predictions provided by the link align with the characteristics of the target variable
  2. Whether the resulting GLM is easy to interpret

In the case of the gamma GLM (which was tested in the June 2019 PA exam), for example, the canonical link, which is the inverse link, neither guarantees positive predictions nor is easy to interpret. As a result, the log link is much more commonly used.
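To make the contrast concrete, here is a hedged R sketch of a gamma GLM fitted with its canonical (inverse) link and with the more common log link, assuming a hypothetical data frame `dat` with a positive severity target `sev` and predictors `age` and `region`.

```r
# Canonical link for the gamma family: inverse (the default)
fit_inverse <- glm(sev ~ age + region, family = Gamma(link = "inverse"), data = dat)

# More commonly used in practice: log link (positive predictions, multiplicative effects)
fit_log <- glm(sev ~ age + region, family = Gamma(link = "log"), data = dat)
```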

 

Exercise

Determine whether each of the following statements is true or false and briefly explain why.

    1. The link function is to transform the target variable of a GLM so that the resulting distribution more closely resembles a normal distribution.
    2. The main reason for using the log link in a GLM is to reduce the skewness of a non-negative, right-skewed target variable.
    3. If some of the observations of the target variable are zero, then the log link cannot be used because ln 0 is not well-defined.

 

Solution

All of these statements are false and reflect some common but serious misconceptions!

    1. This is one of the most common misconceptions about link functions and GLMs. Remember that in a GLM, the link function is applied to the mean of the target variable; the target variable itself is left untransformed. In fact, there is no need to make the target variable resemble a normal random variable. Just choose a distribution from the exponential family to match the characteristics of the target variable.
    2. The log link is chosen mainly because it ensures appropriate predictions and eases model interpretation. The skewness of the target variable can be accommodated by a suitable target distribution (e.g., gamma and inverse Gaussian).
    3. This is related to statement 1. It is fine for some of the observations of the target variable to be zero because the log link is not applied to the target observations.

 

Interpretation of GLM Coefficients

How do we interpret the coefficients of a GLM? That depends crucially on the choice of the link function (but the target distribution plays no role), which determines the functional relationship between the target mean and the features. For some link functions, the coefficients may be more difficult to explain and interpret. Apart from the identity link, here are two commonly used link functions that are particularly appealing in terms of interpretability:

Log Link

The log link is one of the most popular link functions in the GLM arena not only because it always ensures positive predictions, but also because it is easy to interpret. When the log link is used, the target mean and linear predictor are related via ln µ = η, or, upon exponentiation,

\(\mu=e^\eta =e^{\beta_0+\beta_1 X_1+…+\beta_j X_j+…+\beta_p X_p}=e^{\beta_0}\times e^{\beta_1 X_1}\times…\times e^{\beta_j X_j}\times…\times e^{\beta_p X_p}\),

which breaks down the target mean into a series of separate multiplicative terms involving the features.

Case 1

Suppose that Xj is a numeric predictor with coefficient βj.

1. Multiplicative changes: When all other variables are held fixed, a unit increase in Xj is associated with a multiplicative change in the target mean by a factor of \(e^{\beta_j}\), i.e.,

\(\text{new }\mu = \boxed{e^{\beta_j}}\times\text{old }\mu\)

If βj is positive (resp. negative), then \(e^{\beta_j}>1\) (resp. \(e^{\beta_j}<1\)) and so the target mean gets amplified (resp. shrunken).

2. Percentage changes: Equivalently, the algebraic change in the target mean associated with a unit increase in Xj is \((e^{\beta_j}-1)\mu\), where μ is the target mean before the change, so the percentage (or proportional) change in the target mean is:

% change in target mean = \(\dfrac{(e^{\beta_j}-1)\mu}{\mu}=e^{\beta_j}-1\)

If βj is positive (resp. negative), then \(e^{\beta_j}-1>0\) (resp. \(e^{\beta_j}-1<0\)) and so the target mean increases (resp. decreases).

The two ways above to interpret the regression coefficients are equivalent. Feel free to choose either way on the exam. (In GLM settings, we prefer to make interpretations based on multiplicative and percentage changes in the target mean because they do not depend on the value of the initial target mean.)

Case 2

Suppose that Xj is the dummy variable of a certain non-baseline level with coefficient βj. In this case, βj admits a similar interpretation.

        • At the baseline level, Xj = 0, so:
          \(\mu=e^{\beta_0}\times e^{\beta_1 X_1}\times…\times e^{\beta_j (0)}\times…\times e^{\beta_p X_p}\)
        • At the non-baseline level represented by Xj, we have Xj = 1, so:
          \(\mu=e^{\beta_0}\times e^{\beta_1 X_1}\times…\times e^{\beta_j (1)}\times…\times e^{\beta_p X_p}\)

Comparing the two means, we see that the target mean when the categorical predictor lies in the non-baseline level is \(e^{\beta_j}\) times that when the categorical predictor is in the baseline level, holding all other predictors fixed. Equivalently, the target mean at the non-baseline level is \(100(e^{\beta_j}-1)\)% higher than that at the baseline level.
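In R, these multiplicative and percentage interpretations are easily obtained by exponentiating the estimated coefficients. The sketch below assumes `fit` is any fitted log-link GLM object.

```r
# Multiplicative change in the target mean per unit increase in each predictor
# (or relative to the baseline level, for dummy variables)
exp(coef(fit))

# Equivalent percentage change in the target mean
100 * (exp(coef(fit)) - 1)
```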

 

Logit Link

When the logit link is used, almost always for binary data, the model equation is:

\(\ln(\text{odds})=\eta\)

or, upon exponentiation,

\(\text{odds}=e^\eta\)

which is just another form of the log link, applied to the odds (rather than the target mean), so the interpretations above phrased in terms of multiplicative or percentage changes in the odds of the event of interest apply equally.

For example, a unit increase in a numeric predictor with coefficient βj is associated with a multiplicative change of \(e^{\beta_j}\) in the odds and a percentage change of \(100(e^{\beta_j}-1)\)%.

 

Weights and Offsets

Many feature generation and selection techniques that apply to linear models carry over easily to GLMs to set up the right-hand side of the model equation \(g(\mu)=\beta_0+\beta_1X_1+…+\beta_pX_p\). Some of the X’s can be polynomial or interaction terms to capture more complex, non-linear relationships, and the same dummy variable approach explained in Subsection 3.2.3 is used to binarize categorical predictors.

Weights and offsets, however, are modeling tools that are more commonly used with GLMs. Both of these tools are designed to incorporate a measure of exposure into a GLM to improve the fitting, but in different ways.

 

 

Weights

So far, we have implicitly assumed that every observation of the target variable in the available data shares the same amount of precision and therefore the same degree of importance. As you can expect, this equal importance assumption may not be always realistic. In many situations, the datasets provided by an insurance company store not individual data, but grouped data. In each row of the data, the value of the target variable may represent the claim counts or claim size averaged over a set of homogeneous policies sharing the same characteristics. It is reasonable that rows that are “exposed” to more policies should be subject to a smaller amount of variation due to the averaging. (If you have taken Exam STAM/C, this is exactly the idea behind the Buhlmann-Straub model.)

To take advantage of the fact that different observations in the data may have different exposures and thus different degrees of precision, we can attach a higher weight to observations with a larger exposure so that these more credible observations will carry more weight in the estimation of the model coefficients. The goal is to improve the reliability of the fitting procedure. (In Exam PA, we need not worry about the mathematics of weights. R will do the fitting taking the weights into account behind the scenes. The rationale underlying the use of weights is far more important.)
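As an illustration of how a weight is supplied in practice, here is a hedged R sketch for grouped severity data, assuming a hypothetical data frame `dat` in which `avg_severity` is the claim size averaged over the `n_policies` policies represented by each row.

```r
# Rows averaged over more policies are less variable, so they receive larger weights
fit_wt <- glm(avg_severity ~ age_band + region,
              family = Gamma(link = "log"),
              weights = n_policies,
              data = dat)
```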

 

Offsets

Another common way to incorporate a measure of exposure into a GLM is via offsets, which are usually used with (but not limited to) count data. Suppose that the value of the target variable in each row of the data represents the number of claims aggregated over a group of homogeneous policies. Different rows are based on different numbers of policies. It makes sense that other things equal, the larger the number of policies (larger exposure), the larger the total number of claims in a group. This motivates us to use the number of policies as an “offset” to better account for the number of claims across different rows in the data. In many cases, the inclusion of a suitable offset term can lead to a substantial improvement in model fit. If we make the reasonable assumption that the target mean is directly proportional to the exposure, then we can capture this phenomenon easily using a log link, with which an offset is most commonly used. If we let Ei be the exposure for the ith observation, then the model equation is:

\(\ln\mu_i=\boxed{\ln E_i}+\eta_i=\ln E_i+(\beta_0+\beta_1X_{i1}+…+\beta_pX_{ip})\)

where the logged exposure \(\ln E_i\) is called an offset. You can see that an offset is technically a special predictor whose regression coefficient is known a priori to be one; no estimation is needed.

The above equation can be written in two ways to show further insights:

  • Upon exponentiation, it is equivalent to \(\mu_i=E_i e^{\eta_i}\), which shows that µi and Ei are in direct proportion.
  • When including an offset in the equation of a GLM, the convention is to make it on the same scale as the linear predictor, hence the use of ln Ei rather than Ei in the above equation. If we move ln Ei to the left and combine the two logarithms, then we get:

\(\ln(\dfrac{\mu_i}{E_i})=\eta_i=\beta_0+\beta_1X_{i1}+…+\beta_pX_{ip}\)

In this form, we are modeling the occurrence rate of the event of interest, or the expected number of events per exposure.

Exposure units are not confined to physical entities like insurance policies. Time is another common exposure unit, as the following exercise shows.

 

Offsets Versus Weights

At first the two might appear to serve the same purpose, but they are actually very different. In both cases the goal is to predict the rate of something per something; for example, consider a model to predict deaths per covered life. The key difference is in how the observations are recorded. As an example, suppose we have observed 2,000 lives and 5 deaths.

  • Offset: The value of the target variable is the total number or amount observed, in this case 5. But when the features are run through the model, it predicts the number of deaths per covered life. Suppose the model produces a prediction of 0.0035. Then the actual prediction is 2,000(0.0035) = 7, as compared to the observed value of 5. Here 2,000 is an offset and is used to multiply the prediction.
  • Weight: The value of the target variable is the ratio, in this case 5/2,000 = 0.0025. There is no need for modification because the prediction is already on this scale. However, because the observation is being divided by something, the variance of the observation is inversely related to the size of that number (which here is the sample size for this one observation). Observations with lower variance should count for more when fitting the model, and using 2,000 as a weight (not an offset) will do that.

 

Exercise (Specifying an appropriate GLM)

You are given an insurance dataset with the following variables:

  • NumClaims: Number of claims of a policyholder over the term of the policy
  • Time: The term length of the policy (in years)
  • Age: Age of a policyholder
  • Gender: Gender of a policyholder
  • RiskCat: Risk category to which a policyholder belongs

Use these variables to set up a GLM for examining the effects of various factors on NumClaims. Specify the distribution of the target variable, the predictors, the link function, and the offset (if any).

 

Solution

Here the appropriate exposure variable is Time. It is reasonable to expect that NumClaims will vary in direct proportion to Time (the longer the term, the more the claims), which can serve as an offset term to adjust the different exposure to which different policyholders are exposed. The GLM can be specified as follows:

  • Target distribution: Poisson
  • Predictors: Age (continuous), Gender (binary), RiskCat (categorical)
  • Link: Log
  • Offset: ln(Time)
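As an illustration (not required by the task), the specification above could be fitted in R along the following lines, assuming the data are stored in a hypothetical data frame `dat` with Gender and RiskCat coded as factors.

```r
# Poisson GLM with log link; ln(Time) enters as an offset with a coefficient fixed at 1
fit_claims <- glm(NumClaims ~ Age + Gender + RiskCat,
                  family = poisson(link = "log"),
                  offset = log(Time),
                  data = dat)
```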

 

Exercise (Offset vs. an ordinary predictor)

Let E be an exposure variable to be used in a log-link GLM. Describe the advantage and disadvantage of using ln E as an offset relative to using E as an ordinary predictor in the GLM.

 

Solution

  • If we use ln E as an offset, we are assuming that the target mean varies in direct proportion to E (not ln E). The coefficient of ln E is known a priori to be 1, so no estimation is needed.
  • If we use E as an ordinary predictor, then due to the log link, we are assuming that the target mean has an exponential relationship with E. The coefficient of E is an additional parameter that needs to be estimated.

Whether the direct proportional relationship or the exponential relationship is a better assumption depends very much on the data at hand. There is no universal answer.

 

Offsets vs. Weights

We know that offsets and weights share the same purpose of taking exposure into account to improve the fitting. From a certain perspective, both involve a kind of average and, at first sight, they seem so similar that they might be mistaken for synonyms. They in fact impose vastly different structures on the mean and variance of the target variable. The key differences lie in the form of the target variable in the dataset (average vs. aggregate) and how the exposure affects the target variable (mean vs. variance). The two modeling approaches are generally not consistent with each other and should not be confused.

  • Weights: To use weights properly, the observations of the target variable should be averaged by exposure. Due to the averaging, the variance of each observation is inversely related to the size of the exposure, which serves as the weight for that observation. However, the weights do not affect the mean of the target variable.
  • Offsets: For offsets, the observations are values aggregated over the exposure units. The exposure, when serving as an offset, is in direct proportion to the mean of the target variable, but otherwise leaves its variance unaffected.
    (Remember that for a GLM, the mean and variance of the target variable are generally related, so an offset also affects the variance, but only as a by-product)

 

EXAM NOTE

In Task 8 of the June 21, 2021 PA exam, candidates were asked to:

“Explain the difference between weights and offsets when applied to a GLM, and then recommend which is more appropriate for implementing the assumption that the cost [the target variable] is directly proportional to the number of nights [a predictor]. Justify your recommendation.”

The model solution comments that:

“Many candidates incorrectly recommend using weights.”

This is rather disappointing given that the task statement explicitly mentions the key phrase “directly proportional to.” If the assumption were instead that the variance of cost is inversely related to the number of nights, then it would make sense to use the number of nights as a weight; that may well be the case on a future exam.

 

While offsets and weights are generally different modeling vehicles, it deserves mention that in the special but important case of Poisson regression with the log link, they produce identical results when used appropriately. Specifically, the following two generic models are equivalent in the sense that they share the same coefficient estimates, standard errors, p-values, and most importantly, predictions:

  • Model 1: A Poisson regression model fitted to Y (count) using ln E as the offset
  • Model 2: A Poisson regression model fitted to Y/E (frequency) using E as the weight

The equivalence between these two models can be justified mathematically using our knowledge from Exam SRM, but it is unthinkable that you will be asked to derive formulas or write proofs in Exam PA, so here we will just take the equivalence for granted.

An interesting feature of Model 2, the frequency model, is that the observations of the target variable may take decimal numbers whereas a Poisson random variable must be integer-valued. Because the two GLMs above are identical, on the exam it would be preferable to use Model 1, the count model, which is a genuine Poisson model and seems easier to understand.
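The following hedged R sketch shows the two equivalent specifications side by side, assuming a hypothetical data frame `dat` with claim counts `Y`, exposures `E`, and a predictor `age`. (R warns about non-integer responses when fitting the frequency version; the coefficient estimates nevertheless match those of the count version.)

```r
# Model 1: count model with ln(E) as an offset
m_offset <- glm(Y ~ age, family = poisson(link = "log"),
                offset = log(E), data = dat)

# Model 2: frequency model (Y/E) with E as the weight
m_weight <- glm(Y / E ~ age, family = poisson(link = "log"),
                weights = E, data = dat)

coef(m_offset)  # same estimates, standard errors, and predictions as m_weight
coef(m_weight)
```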

 

Model Quantities and Diagnostic Tools

Coefficient Estimates

As soon as the target distribution and the link function have been specified, a GLM can be constructed and fitted to the training data. Instead of the ordinary least squares method that is used to fit a linear model, the method of maximum likelihood estimation (MLE) is commonly employed to estimate the unknown coefficients \(\beta_0,…,\beta_p\) of a GLM (in fact, the maximum likelihood estimates for linear models coincide with the least squares estimates). If we denote the coefficient estimates by \(\hat{\beta}_0,…,\hat{\beta}_p\), then the estimated linear predictor and predicted target mean for the ith observation are, respectively:

\(\hat{\eta}_i=\hat{\beta}_0+\hat{\beta}_1X_{i1}+…+\hat{\beta}_pX_{ip}\) and

\(\hat{\mu}_i=g^{-1}(\hat{\eta}_i)=g^{-1}(\hat{\beta}_0+\hat{\beta}_1X_{i1}+…+\hat{\beta}_pX_{ip})\)
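In R, these two quantities correspond to the two prediction types of a fitted GLM; the sketch below assumes `fit` is any glm object.

```r
eta_hat <- predict(fit, type = "link")      # estimated linear predictor, eta_hat
mu_hat  <- predict(fit, type = "response")  # predicted target mean, mu_hat = g^{-1}(eta_hat)
```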

Statistical theory asserts that MLE produces estimates with desirable statistical properties, such as asymptotic unbiasedness, efficiency, and normality, which make MLE such a popular estimation method. A downside to MLE is that occasionally the optimization algorithm may be plagued by convergence issues, which may happen when a non-canonical link is used. No estimates may be produced and the GLM cannot be fitted or applied as a result.

 

Global Goodness-of-Fit Measure: Deviance

For GLMs, traditional goodness-of-fit measures that rely in part on the normal distribution, such as the \(R^2\), do not apply. The liberation from the normal distribution framework calls for more general (recall that the “G” stands for “generalized”!), likelihood-based goodness-of-fit measures that serve to evaluate the model fit on the training set appropriately.

One of the most important goodness-of-fit measures for GLMs is the deviance, which is part of the regular output when the summary of a GLM is returned in R. Unlike the \(R^2\), which measures how a linear model compares with the most primitive linear model (the intercept-only linear model), the deviance of a GLM measures the extent to which the GLM departs from the most elaborate GLM, which is known as the saturated model. The saturated model has the same target distribution as the fitted GLM, but with as many model parameters as the number of training observations, it fits each training observation perfectly and is a very flexible GLM (perhaps too flexible and overfitted!). Mathematically, if we denote the maximized loglikelihood of the saturated model by \(l_{SAT}\) and that of the fitted GLM by \(l\), then the deviance of the GLM is defined as:

\(D=2(l_{SAT}-l)\)

The lower the deviance, the closer the GLM is to the model with a perfect fit, and the better its goodness of fit on the training set.

Although there will be hardly any instances in Exam PA where you will manipulate the specific formula (therefore omitted here) for the deviance of a GLM, the following facts about this model statistic are important:

  • For a normal target variable with variance \(\sigma^2\), the deviance simplifies to a multiple of the residual sum of squares on the training set:

\(D=\dfrac{\sum_{i=1}^{n_{tr}}(y_i-\hat{y}_i)^2}{\sigma^2}\)

In this sense, the deviance can be viewed as a generalization of the residual sum of squares that works even for non-normal target variables.

  • While the deviance is a simple and useful goodness-of-fit measure for a GLM, it can only be used to compare GLMs having the same target distribution (so that they share the same maximized loglikelihood of the saturated model, \(l_{SAT}\)).
    It is invalid to assess two GLMs with different target distributions on the basis of their deviance. Even if we have a Poisson GLM with a deviance of 100 and a gamma GLM with a deviance of 200, we cannot say that the first model has a better fit to the training data just because of the lower deviance.
  • Deviance provides the foundation of an important model diagnostic tool for GLMs: The deviance residual.
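As a quick pointer, the deviance is readily extracted from a fitted GLM in R; the sketch below assumes `fit` is any glm object.

```r
deviance(fit)       # residual deviance of the fitted GLM
fit$null.deviance   # deviance of the intercept-only GLM, shown in summary(fit) for reference
```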

 

Local Goodness-of-Fit Measure: Deviance Residuals

In a GLM, the raw residuals \(e_i=y_i-\hat{\mu}_i\), where \(\hat{\mu}_i\) is the fitted target mean for the ith observation in the training set, are not as useful as they are in a linear model because they are no longer normally distributed (not even approximately, because the target distribution may not be normal), nor do they possess constant variance (because their variance varies with the target mean, which varies across different observations).

This makes it difficult to find a universal benchmark against which the raw residuals of all GLMs can be reliably compared. Performing diagnostics for GLMs requires more broadly defined residuals to capture the discrepancy between the observed target values and the fitted values. This is where deviance residuals play their role.

The deviance residual for the ith observation, denoted by di, is defined as the signed (positive if \(y_i>\hat{\mu}_i\) and negative if \(y_i<\hat{\mu}_i\)) square root of the contribution of the ith observation to the deviance. Mathematically, we have \(\sum_{i=1}^{n}d^2_i=D\).

Provided that the fitted GLM is adequate, the deviance residuals can be shown (mathematical details not required for Exam PA) to satisfy the following properties which parallel those of the raw residuals in a linear model:

  • They are approximately normally distributed for most target distributions in the linear exponential family (the binomial distribution is a notable exception).
  • They have no systematic patterns when considered on their own and with respect to the predictors.
  • They have approximately constant variance upon standardization (using the standard error implied by the GLM).

The first property is particularly important as it provides the basis for comparing the distribution of the deviance residuals with the normal distribution (e.g., by examining Q-Q plots of deviance residuals) and evaluating whether the fitted GLM is an accurate description of the given data. This evaluation is valid even if the target distribution is not normal.
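In R, deviance residuals are obtained with residuals(..., type = "deviance") and are commonly inspected graphically; the sketch below assumes `fit` is any fitted glm object.

```r
dev_res <- residuals(fit, type = "deviance")

qqnorm(dev_res); qqline(dev_res)   # roughly normal if the fitted GLM is adequate
plot(fitted(fit), dev_res)         # should show no systematic pattern
```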

It is true that the formula and calculations for deviance residuals are rarely tested in Exam PA (in fact, never, so far!). While you need not know the technical details behind the output you see in R, you have to understand, at a conceptual level, what the output is, and what we can learn by looking at deviance residuals. This explains why this introductory section on GLMs is useful!

 

Penalized Likelihood Measures: AIC and BIC

The deviance of a GLM parallels the residual sum of squares for a linear model in the sense that it is merely a goodness-of-fit measure on the training set and always decreases when new predictors are added, even if they are not predictive of the target variable. More meaningful measures of model quality balance the goodness of fit on the training set with model complexity.

Two common examples are the AIC and BIC, which are defined in the same way as for a linear model:

\(AIC=-2l+2p\) and \(BIC=-2l+\ln(n_{tr})\,p\),

where \(l\) denotes the maximized loglikelihood of the fitted GLM.

Using these information criteria, we can perform stepwise feature selection in the same way as a linear model and drop non-predictive features to prevent overfitting.
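In R, these criteria and an AIC- or BIC-based stepwise search are available through base functions; the sketch below assumes `fit` is a fitted glm object and `n_tr` is the number of training observations.

```r
AIC(fit)
BIC(fit)

# Backward stepwise selection: k = 2 gives AIC, k = log(n_tr) gives BIC
reduced_fit <- step(fit, direction = "backward", k = 2)
```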

Speaking of stepwise selection, an alternative method for reducing model complexity is regularization. For GLMs, a regularized model results from minimizing the penalized objective function given by:

deviance + regularization penalty
(goodness of fit) + (complexity)
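A common way to fit such a regularized GLM in R is via the glmnet package; this is a hedged sketch assuming a hypothetical data frame `dat` with a count target `y` and predictors `age` and `region` (alpha = 1 gives the lasso penalty, alpha = 0 the ridge penalty).

```r
library(glmnet)

# Build the design matrix (glmnet requires a numeric matrix; drop the intercept column)
X <- model.matrix(y ~ age + region, data = dat)[, -1]

# Cross-validated lasso-penalized Poisson GLM: deviance + lambda * penalty
reg_fit <- cv.glmnet(X, dat$y, family = "poisson", alpha = 1)
coef(reg_fit, s = "lambda.min")
```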

 

Performance Metrics for Classifiers

Recall that the RMSE is the most commonly used performance metric for numeric target variables. Since GLMs accommodate both numeric and categorical target variables, we need performance metrics for categorical target variables as well. We have already met the classification error rate as a measure of the performance of a classifier; it is just one of several commonly used measures in a classification setting.

In this subsection, we use a confusion matrix to perform a more comprehensive analysis of the performance of a binary classifier (i.e., a classifier for a binary target variable) and introduce some special considerations specific to classifiers.

 

Confusion Matrix

Recall that a binary classifier merely produces a prediction of the probability that the event of interest occurs for a given set of feature values. It does not directly say whether the event is predicted to happen or not, but in most situations it is the class predictions (“positive” or “negative”) that are our ultimate interest. If, for example, you go to a clinic to test for COVID-19, you definitely want to know whether you test positive or negative. To translate the predicted probabilities into the predicted classes, we need a pre-specified cutoff (or threshold).

  • If the predicted probability of the event of interest for an observation is higher than the cutoff, then the event is predicted to occur.
  • If the predicted probability of the event of interest for an observation is lower than the cutoff, then the event is predicted not to occur.

This simple classification rule results in four possibilities (very much like the Type I and Type II errors in hypothesis testing) that can be exhibited in a confusion matrix, which is a 2 x 2 tabular display of how the predictions of a binary classifier line up with the two observed classes:

                 Reference (= Actual)
Prediction       −          +
    −            TN         FN
    +            FP         TP
  • TP = true positive: The classifier predicts that the event occurs (“positive”) and indeed it does (“true”).
  • TN = true negative: The classifier predicts that the event does not occur (“negative”) and indeed it does not (“true”).
  • FP = false positive: The classifier predicts that the event occurs (“positive”), but it does not (“false”).
  • FN = false negative: The classifier predicts that the event does not occur (“negative”), but it does (“false”).

Given a confusion table corresponding to a certain cutoff, a lot of useful performance metrics of a classifier can be computed. The classification error rate we learned earlier is the proportion of misclassifications, which are the off-diagonal entries of the confusion matrix:

Classification Error Rate = (FN + FP) / n

where n is the total number of observations in the data (the training or test set where the confusion matrix is constructed).

The complement of the classification error rate is the accuracy, which measures the proportion of correctly classified observations:

Accuracy = (TN + TP) / n

The following two measures that originate from medical diagnosis are also frequently used to capture more specific aspects of the predictive performance of a classifier:

  • Sensitivity (also called recall or true positive rate): This is the relative frequency of correctly predicting an event of interest when the event does take place, or equivalently, the ratio of TP to the total positive events. In symbols,

Sensitivity = TP / (TP + FN)

It is a measure of how “sensitive” a classifier is in identifying positive cases.

  • Specificity: This is the relative frequency of correctly predicting a non-event when there is indeed no event, or the ratio of TN to the total negative events:

Specificity = TN / (TN + FP)

The larger the specificity, the better the classifier in confirming negative cases.

Other things equal, the higher the accuracy, sensitivity, and specificity, the more attractive a classifier. Because our main interest is in the positive event, sensitivity is usually more of interest than specificity.

Note that:

  • Accuracy is a weighted average of specificity and sensitivity, where the weights are the proportions of observations belonging to the two classes.
  • A confusion matrix can be computed on both the training and test sets, but it is the confusion matrix on the test set that we care more about. If desirable, we can compare the training confusion matrix and the test confusion matrix to detect overfitting.
    If the performance of the classifier on the test set deteriorates substantially when compared with the training set, the classifier has likely overfitted.
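A hedged R sketch of these computations, assuming `p_hat` is a vector of predicted probabilities, `actual` is the corresponding vector of observed 0/1 labels, and a cutoff has been chosen:

```r
cutoff <- 0.5
pred_class <- ifelse(p_hat > cutoff, 1, 0)

cm <- table(Prediction = pred_class, Reference = actual)

TN <- cm["0", "0"]; FP <- cm["1", "0"]; FN <- cm["0", "1"]; TP <- cm["1", "1"]
accuracy    <- (TN + TP) / sum(cm)
sensitivity <- TP / (TP + FN)
specificity <- TN / (TN + FP)
```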

 

Effects of Changing the Cutoff

Ideally, the classifier and the cutoff should be such that both the sensitivity and specificity are close to 1. Except for artificial situations, having such a perfect classifier is almost always impossible.

In general, sensitivity and specificity, as a function of the cutoff, are in conflict with each other, and the selection of the cutoff involves a trade-off between having high sensitivity and having high specificity. Here is how the conflict arises.

Extreme Case 1

If the cutoff is set to 0, then all predicted probabilities, which must be non-negative, will exceed the cutoff, meaning that everyone is predicted to be positive. The confusion matrix will look like this:

                 Reference (= Actual)
Prediction       −          +
    −            0          0
    +            n−         n+

where:

    • n− = TN + FP is the total number of negative observations.
    • n+ = FN + TP is the total number of positive observations.
    • n = n− + n+ = TN + FP + FN + TP is the total number of observations.

As a result, the sensitivity and specificity of the classifier are 1 and 0, respectively. Although the classifier succeeds in catching all the positive observations, it fails miserably with the negative observations (in fact, it catches none of them!).

 

Increasing the Cutoff

As we raise the cutoff, more and more observations will be classified as negative, and the counts in the confusion matrix migrate from the second row to the first row: false positives become true negatives and true positives become false negatives.

                 Reference (= Actual)
Prediction       −          +
    −            TN ↑       FN ↑
    +            FP ↓       TP ↓

Because we now have more true negatives and fewer false positives, at the cost of more false negatives and fewer true positives, the sensitivity of the classifier decreases while its specificity increases. Note that the two metrics move in opposite directions.

 

Extreme Case 2

If the cutoff is set to 1, then all predicted probabilities will be less than the cutoff, which means that everyone is predicted to be negative. Here is the confusion matrix:

                 Reference (= Actual)
Prediction       −          +
    −            n−         n+
    +            0          0

The sensitivity and specificity of the classifier are 0 and 1, respectively.

 

The determination of the precise cutoff requires weighing the benefits of correct classifications (TN and TP) and the costs of misclassifications (FN and FP). This cost-benefit analysis is often a business decision that requires subject matter expertise and practical considerations. If, for instance, you are using a classifier to identify fraudulent claims, which will be subject to further investigation, then the benefits and costs associated with the four entries in the confusion matrix can be interpreted as follows:

  • TP: The classifier successfully identifies fraudulent claims, thereby preventing unwarranted claim payments and saving money. (Benefit)
  • TN: The classifier correctly identifies genuine claims and saves the company from unnecessary investigative work. (Benefit)
  • FP: The company expends unnecessary time and effort on genuine claims, which are misclassified as fraudulent. (Cost)
  • FN: The company allows truly fraudulent claims to go unnoticed and makes unnecessary claim payments. (Cost)

(If you like, you can offset the cost of FN against the benefit of TP to form a netted benefit because they share the same source: Unnecessary claim payments arising from fraudulent claims. We can do the same for TN and FP to get a netted cost due to unnecessary investigative work.)

In most cases, the cost of undetected fraud far outweighs the cost of unnecessary investigative work, so we may choose a relatively low cutoff so that the classifier has a higher tendency to root out fraud (though more genuine claims will be misclassified as fraudulent!). The same idea applies to a classifier for detecting COVID-19 patients: Given that the coronavirus is extremely contagious (especially its many variants), letting an infected patient slip undetected can have grave repercussions: The patient could spread the virus to many more people and may lose his/her life because of the delayed diagnosis and treatment.

Subjecting a healthy person to unneeded treatment or compulsory quarantine, while undesirable, should be less serious. In some exam tasks, you are provided with the amounts of the profits and costs associated with different entries of a confusion matrix. An optimal cutoff can then be chosen by maximizing the overall profit or minimizing the overall cost as a function of the cutoff, as you will see in Task 10 of the December 2019 PA exam and Task 9 of the Hospital Readmissions sample project, which will be discussed in Chapter 7 of this study manual.
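A hedged R sketch of this cost-benefit search over cutoffs, assuming `p_hat` and `actual` as before and purely hypothetical per-observation profits and costs for the four confusion-matrix cells:

```r
# Hypothetical profit/cost per observation in each cell (not taken from any exam)
profit_tp <- 100; profit_tn <- 10; cost_fp <- -20; cost_fn <- -150

cutoffs <- seq(0.05, 0.95, by = 0.05)
total_profit <- sapply(cutoffs, function(c) {
  pred <- ifelse(p_hat > c, 1, 0)
  sum((pred == 1 & actual == 1) * profit_tp +
      (pred == 0 & actual == 0) * profit_tn +
      (pred == 1 & actual == 0) * cost_fp +
      (pred == 0 & actual == 1) * cost_fn)
})

cutoffs[which.max(total_profit)]   # cutoff maximizing the estimated overall profit
```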

 

Exercise: Give a confusion matrix

The following confusion matrix shows the result from a claim fraud model with a discrimination threshold of 25%:

                 Actual
Prediction       No         Yes
    No           1203       162
    Yes          63         72
  1. Identify a link function that can be used for a generalized linear model that has a binary target variable and briefly explain why this link function is appropriate.
  2. Calculate the sensitivity and specificity from the above data.
  3. Briefly describe how the severity of claims will impact the selection of the model threshold.

 

Solution

  1. The logit link \(g(\pi)=\ln \dfrac{\pi}{1-\pi}\) is the most appropriate for linking the mean of a binary target variable to the linear predictor as it ensures that the model predictions are always valued between 0 and 1. It also makes the model easily interpretable.
  2. The sensitivity is the proportion of fraudulent claims correctly predicted to be fraudulent:
    Sensitivity = 72 / (72 + 162) = 0.3077
    The specificity is the proportion of non-fraudulent claims correctly predicted to be non-fraudulent:
    Specificity = 1203 / (1203 + 63) = 0.9502
  3. With high-severity claims, it would be advisable to lower the threshold so that more claims are predicted to be fraudulent and fewer fraudulent claims go undetected. When claims are more severe, the benefit of having true positives (i.e., correctly identifying fraudulent claims and saving unnecessary claim payments) tends to exceed the cost of having false positives (i.e., incorrectly classifying non-fraudulent claims as fraudulent and wasting resources on unnecessary investigations).

 

Exercise

ABC Credit Card Company (ABC) wants to speed up its credit limit review by introducing a predictive model. Current practice is to manually review each customer’s credit, which takes one business day.

ABC selected a portion of historical data to train a binary predictive model to categorize customers as either “Premier” or “Standard”. If the model is accepted, customers will receive results instantaneously.

Below are the resulting confusion matrix (based on a certain cutoff) when the remaining historical data are run through the predictive model, as well as an estimated profit matrix per customer based on historical data.

Confusion Matrix

                 Reference
Prediction       Standard    Premier
Standard         20          5
Premier          16          120

 

Profit Matrix per Customer

                 Reference
Prediction       Standard    Premier
Standard         50          -100
Premier          -750        100

 

a. Calculate the following quantities for the binary predictive model:

    1. Accuracy
    2. Sensitivity
    3. Specificity
    4. Expected profit

Take Premier as the positive class if necessary.

b. ABC’s CEO reviews the results and states:

“The model results look great. Our main concern should be the accuracy of the Premier category because it is the most profitable and we have about four times as many customers in this category as in the Standard category.”

Critique the CEO’s statement.

 

Solution

a. (i) Accuracy = (120 + 20) / (120 + 16 + 5 + 20) = 140/161 = 86.96%.

(ii) Sensitivity (treating Premier as the positive class) = 120 / (120 + 5) = 96%.

(iii) Specificity = 20 / (20 + 16) = 55.56%.

(iv) Expected profit = 120(100) + 16(-750) + 5(-100) + 20(50) = 12,000 - 12,000 - 500 + 1,000 = 500.

b. While Premier customers are indeed roughly four times as numerous as Standard customers, accuracy within the Standard category is also important (arguably just as important). The cost of misclassifying a Standard customer as Premier (750 per customer) is enormous; based on the current cutoff, the total cost of these misclassifications (16 × 750 = 12,000) completely offsets the total profit from correctly classified Premier customers (120 × 100 = 12,000).
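The part a quantities can also be verified with a short base-R sketch, using the counts and per-customer profits from the two matrices above (rows = prediction, columns = reference, with Premier as the positive class); the object names are illustrative:

    counts <- matrix(c(20, 16, 5, 120), nrow = 2,
                     dimnames = list(Prediction = c("Standard", "Premier"),
                                     Reference  = c("Standard", "Premier")))
    profit <- matrix(c(50, -750, -100, 100), nrow = 2, dimnames = dimnames(counts))

    accuracy     <- sum(diag(counts)) / sum(counts)                            # 140/161 = 86.96%
    sensitivity  <- counts["Premier", "Premier"]   / sum(counts[, "Premier"])  # 120/125 = 96%
    specificity  <- counts["Standard", "Standard"] / sum(counts[, "Standard"]) # 20/36   = 55.56%
    total_profit <- sum(counts * profit)                                       # 500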

 

ROC Curves

Instead of relying on a manually chosen cutoff, the Receiver Operating Characteristic (ROC) curve (this mysterious term comes from signal detection theory) is a graphical tool plotting the sensitivity against the specificity of a given classifier for each cutoff ranging from 0 to 1.

Conventionally, the sensitivity is plotted on the vertical axis and the specificity is plotted on the horizontal axis in reverse order (i.e., from 1 to 0, left to right). Each point on the ROC curve corresponds to a certain cutoff.

All ROC curves begin at (1, 0) and end at (0, 1), where each point is written as a (specificity, sensitivity) pair; from the discussions above, these endpoints correspond to cutoffs of 1 and 0, respectively. A good classifier should have its sensitivity and specificity both close to 1 over a wide range of cutoffs, which means that its ROC curve should bend towards the upper left and approach the top-left corner point (1, 1).

The closer the curve is to the top-left corner, the better the predictive ability. With this line of reasoning, the predictive performance of a classifier can be summarized by computing the area under the curve (AUC). The exact value of the AUC may not mean much on its own; in real applications, it is often the relative value of the AUC across competing classifiers that matters: the higher, the better. For this purpose, it is desirable to identify two reference values:

  • AUC = 1: The highest possible value of the AUC is 1, which is attained by a classifier with perfect discriminatory power: one that assigns every positive observation a higher predicted probability than every negative observation. For any cutoff lying between the two groups of predicted probabilities, all observations are classified correctly, with the sensitivity and specificity both equal to 1. As soon as it leaves the bottom-left point (1, 0), the ROC curve rises instantaneously to “hug” the top-left corner point (1, 1), then heads right and hits the top-right point (0, 1).
  • AUC = 0.5: Another useful baseline is the naive classifier, which classifies the observations purely randomly without using the information contained in the predictors, analogous to the intercept-only GLM, which ignores the predictors. In other words, the classifier assigns each observation independently to “positive” with probability p and to “negative” with probability 1 - p, where p is an arbitrary number between 0 and 1 (it need not be 0.5). On average (it is only “on average” because the classifications involve randomness), its confusion matrix will look like the following, where \(n_{-}\) and \(n_{+}\) denote the numbers of observations in the negative and positive classes, respectively:
                  Reference (= Actual)
Prediction        -                    +
-                 \((1-p)n_{-}\)       \((1-p)n_{+}\)
+                 \(pn_{-}\)           \(pn_{+}\)

Then:

Sensitivity = \(\dfrac{pn_{+}}{pn_{+}+(1-p)n_{+}} = p\) and Specificity = \(\dfrac{(1-p)n_{-}}{(1-p)n_{-}+pn_{-}} = 1-p\)

If we allow p to vary from 0 to 1, the point (specificity, sensitivity) \(= (1-p,\, p)\) sweeps out the 45° diagonal line, which is the ROC curve of this purely random classifier; its AUC is the area of the triangle below the diagonal line, or 0.5.

The significance of the naive classifier is: If we ignore the predictors and make classifications purely randomly, we are able to achieve an AUC of 0.5, so any reasonable classifier should have its AUC well above 0.5 to merit consideration.

These two extreme values provide a benchmark for judging the absolute (is the AUC way higher than 0.5 and close to 1?) and relative (which model has the higher AUC?) performance of different classifiers.
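As a minimal sketch of how an ROC curve and its AUC might be produced in R, the code below uses the pROC package (one common choice, not prescribed by the text); the simulated data and object names are purely illustrative.

    library(pROC)

    set.seed(2023)
    actual <- rbinom(500, 1, 0.3)                    # observed classes (1 = positive)
    probs  <- plogis(-1 + 2 * actual + rnorm(500))   # toy predicted probabilities with some signal

    roc_obj <- roc(actual, probs)  # sensitivity and specificity over all possible cutoffs
    auc(roc_obj)                   # AUC: 0.5 for a purely random classifier, 1 for a perfect one
    plot(roc_obj)                  # sensitivity vs. specificity, with specificity running from 1 to 0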

 

Sidebar: Unbalanced data

Let’s end this subsection with a brief discussion of an issue that affects many insurance claims datasets: the presence of unbalanced (a.k.a. imbalanced) data, in which one class of the binary target variable is much more dominant than the other in terms of the proportion of observations. The issue of unbalanced data was heavily tested in the June 22, 2021 PA exam (although the coverage of this issue is quite light in the PA modules!).

The problem with unbalanced data is that a classifier implicitly places more weight on the majority class and tries to match the training observations in that class, without paying enough attention to the minority class. This can be especially problematic if the minority class is the positive class, that is, the class of interest. If, for example, only 1% of the policyholders in an insurance dataset have a claim (the positive class) and 99% have no claims (the negative class), then a simple classifier that predicts every observation to have no claims will have an impressive 99% accuracy, a 100% specificity, but a zero sensitivity. (Remember that accuracy is a weighted average of sensitivity and specificity. With 99% of the observations being claims-free, it comes as no surprise that the accuracy leans so heavily towards the specificity.) The excellent accuracy is deceptive because the classifier is useless when it comes to identifying policyholders who will submit claims, which is what a typical insurer truly cares about. To put it another way, accuracy does not align well with the interests of the insurer.
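To make the weighting explicit (a standard identity, where TP and TN denote the numbers of true positives and true negatives, and \(n_{+}\) and \(n_{-}\) the numbers of actual positive and negative observations, with \(n = n_{+} + n_{-}\)):

\(\text{Accuracy} = \dfrac{\text{TN}+\text{TP}}{n} = \dfrac{n_{-}}{n}\times\text{Specificity} + \dfrac{n_{+}}{n}\times\text{Sensitivity}\)

With \(n_{-}/n = 0.99\) in the example above, the accuracy is pulled almost entirely towards the specificity.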

There are two solutions to unbalanced training data discussed in the PA e-learning modules, both of which serve to tweak the original unbalanced data to produce a more balanced sample for training our classifier. This way, the classifier will be able to pick up the signal from the minority class more effectively. The results produced by this classifier will be adjusted and translated into results for the original unbalanced data (mathematical details not required for Exam PA).

  • Undersampling: Undersampling produces roughly balanced data by drawing fewer observations from the negative class (“undersampled”) and retaining all of the positive class observations. With relatively more data from the positive class (in proportion), this will improve the classifier’s ability to pick up the signal leading to the positive class. The drawback is that the classifier, now based on less data and therefore less information about the negative class, could be less robust and more prone to overfitting.
  • Oversampling: An alternative to undersampling, oversampling keeps all of the original data, but oversamples (with replacement) the positive class to reduce the imbalance between the two classes. Some of the positive class observations will appear more than once in the balanced data.

It is important to note that oversampling should be performed on the training data, following the training/test set split. If we do oversampling to the full data before splitting it into the training and test sets, then some of the positive class observations may appear in both sets and the test set will not be truly unseen to the trained classifier, defeating the purpose of the training/test set split in the first place.
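A minimal base-R sketch of oversampling applied to the training set only is given below; the data frame train and its binary column target are illustrative names rather than objects defined in the text.

    set.seed(1)
    train <- data.frame(x = rnorm(1000), target = rbinom(1000, 1, 0.05))  # toy unbalanced training set

    train_pos <- train[train$target == 1, ]  # minority (positive) class
    train_neg <- train[train$target == 0, ]  # majority (negative) class

    # Draw additional positive rows with replacement until the two classes are roughly balanced
    extra_idx <- sample(nrow(train_pos), size = nrow(train_neg) - nrow(train_pos), replace = TRUE)
    train_bal <- rbind(train_neg, train_pos, train_pos[extra_idx, ])

    table(train_bal$target)  # roughly 50/50 after oversampling

Undersampling would instead keep all of train_pos and draw fewer rows (without replacement) from train_neg.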

No matter which sampling method is used, the positive class becomes more prevalent in the balanced data as a result. With the positive class observations making up a larger proportion of the data and playing a more important role in the model fitting process, their predicted probabilities will increase. Given the same cutoff, they are predicted to be positive more frequently, which leads to a rise in the sensitivity. In the same vein, the specificity, which is usually of less interest, will drop.