[mathjax]
Basic Terminology
Classification of Variables
There are two ways to classify variables in a predictive analytic context: By their role in the study (intended use) or by their nature (characteristics).
By role
The variable that we are interested in predicting is called the target variable (or response variable, dependent variable, output variable). The variables that are used to predict the target variable go by different names, such as predictors, explanatory variables, input variables, or sometimes simply variables if no confusion arises. In an actuarial context, predictors are also known as risk factors or risk drivers.
By nature
Variables can also be classified as numeric variables or categorical variables. Such a classification has important implications for developing an effective predictive model that aligns with the character of the target variable and predictors to produce realistic output.
- Numeric (a.k.a. quantitative) variables: Numeric variables take the form of numbers with an associated range. They can be further classified as discrete / continuous variables.
- Categorical (a.k.a. qualitative, factor) variables: As their name implies, categorical variables take predefined values, called levels or classes, out of a countable collection of “categories”. When a categorical variable takes only two possible levels, it is called a binary variable and the two levels indicate whether an event has taken place or whether a characteristic is present. In R, categorical variables are treated as factor variables.
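As a minimal R sketch (using a small hypothetical character vector), a categorical variable can be converted into a factor with explicitly defined levels:

```r
# Hypothetical binary variable recorded as character strings
smoker <- c("yes", "no", "no", "yes", "no")

# Convert to a factor so R treats it as a categorical variable with two levels
smoker <- factor(smoker, levels = c("no", "yes"))

levels(smoker)   # "no" "yes"
summary(smoker)  # number of observations at each level
```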
Supervised vs. Unsupervised Problems
Predictive analytic problems can also be classified into supervised and unsupervised learning problems.
Supervised learning problems
They refer to those problems for which there is a target variable “supervising” or guiding our analysis, and our goal is to understand the relationship between the target variable and the predictors, and/or to make accurate predictions for the target based on the predictors. There are only two supervised learning methods covered in Exam PA: GLMs and decision trees.
Unsupervised learning problems
For unsupervised learning methods, a target variable is absent, and we are interested in extracting relationships and structures between different variables in the data. Again, only two unsupervised learning methods are in the PA exam syllabus, namely, principal components analysis and cluster analysis.
Note that supervised and unsupervised learning methods are sometimes used in conjunction with one another. Unsupervised learning methods can be used for the purposes of data exploration and producing potentially useful features for predicting the target variable more accurately.
Regression vs. Classification Problems
In predictive analytics, it is customary to refer to supervised learning problems with a numeric target variable as regression problems (an exception is logistic regression, for which the target variable is binary). In contrast, when the target variable is categorical in nature, we are dealing with classification problems. A predictive model for a categorical target variable, which classifies observations into one of the levels, is aptly called a classifier.
The Model Building Process
Stage 1: Define the Business Problem
Objectives
Most predictive analytic projects and almost all PA exams can be classified into one of the following categories:
- Prediction-focused: The primary objective is to make an accurate prediction of the target variable on the basis of other predictors available.
- Interpretation-focused: Use the model to understand the true relationship between the target variable and the predictors.
For example, how is the number of exams passed associated with the salary of actuaries? We know the association is positive, but a predictive model can quantify this association precisely and hopefully provide some insights.
If interpretation is the primary concern, then it is natural to select a relatively simple, interpretable model that clearly shows the relationship between the target variable and the predictors.
Prediction performance is of secondary importance.
Constraints
It is also important to evaluate the feasibility of solving the business problem and implementing the predictive analytic solution. The considerations and constraints we should keep in mind when evaluating and prioritizing business problems include:
- The availability of easily accessible and high-quality data (more on this in Stage 2)
- Implementation issues such as the presence of necessary IT infrastructure and technology to fit complex models efficiently, the timeline for completing the project, and the cost and effort required to maintain the selected model.
If the model is operationally prohibitive to execute, then it makes sense to trade some prediction performance for ease of implementation. After all, if we cannot actually implement and apply the model in practice, then it is basically useless.
Effective Problem Definitions:
- Clarity of the business issue
- Testable hypotheses to guide the project
- An ability to assess the outcome, gained through clear, measurable key performance indicators (KPIs)
Stage 2: Data Collection
Data Design
We should take note of a few things to maximize the usefulness of our data source, which is the entire set of observations that could enter the dataset we collect.
Relevance
Intuitively, the more data we have, the more information available and the better, provided that the data is unbiased in the sense that it is representative of the environment in which our predictive model will operate. To ensure that the data is relevant to our business problem, it is necessary to source the data from the right population and time frame.
- Population: As obvious as this may seem, it is important that the data source aligns with the true population we are interested in. If the data source is not representative, then taking a larger dataset does not help; we would only be repeating the same bias over and over again.
If, for example, you are interested in the mortality of Americans, but your data consists mostly of Asians, then there is a risk that the results generated by the predictive models you construct may not generalize well to Americans, the population of interest. In practice, the alignment between the true population and the data source will not be exact, but we should strive to minimize the mismatch.
- Time frame: When selecting an appropriate time frame for our data, it is advisable to choose the time period that best reflects the business environment in which we will be implementing our models. In general, recent history is more predictive of the near future than distant history, so data from one year ago should be more useful than data from 10 years ago, other things equal.
Sampling
Sampling is the process of taking a subset of observations from the data source to generate our dataset. The observations sampled should closely resemble the business environment in which our predictive models will be applied in the future.
- Random sampling: Randomly draw observations from the underlying population without replacement until we have the required number of observations, so that each record is equally likely to be sampled (see the R sketch below). One caveat: voluntary surveys, in which respondents choose whether to participate, are not random samples in this sense and are known to be vulnerable to respondent bias.
- Stratified sampling: Stratified sampling involves dividing the underlying population into a number of non-overlapping “strata” or groups in a non-random fashion, and then randomly sampling (with or without replacement) a set number of observations from each stratum. This sampling method has the notable advantage of ensuring that every stratum is represented in the collected data.
- Oversampling and undersampling: These are sampling methods designed specifically for unbalanced data, deliberately over-representing or under-representing certain classes of the target variable.
- Systematic sampling: In systematic sampling, we draw observations according to a set pattern; there is no random mechanism controlling which observations are sampled.
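The first two schemes can be sketched in base R as follows; `policies` and its `region` column are hypothetical names, and we assume each region has at least 50 records:

```r
set.seed(1)

# Random sampling: draw 100 policyholders without replacement
random_sample <- policies[sample(nrow(policies), size = 100), ]

# Stratified sampling: draw 50 policyholders from each region
strata <- split(policies, policies$region)
stratified_sample <- do.call(rbind, lapply(strata, function(s) {
  s[sample(nrow(s), size = 50), ]
}))
```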
Exercise 1: Sampling
Consider the following sampling schemes for a dataset of insurance policyholders in the US.
- Drawing a sample of size 100 at random from the full dataset
This is random sampling, for which all policyholders have the same probability of being selected, and they are sampled (without replacement) from the dataset until we have the required number of observations, which is 100 in this case.
- Dividing the dataset into two separate datasets by geographical location (e.g., west and east), then drawing a random sample of size 50 from each dataset
This is stratified sampling, because the full dataset is divided into two “strata,” west and east, from each of which 50 policyholders are randomly and independently drawn.
- Drawing a sample of all policyholders who are in Iowa
This is systematic sampling, because observations are drawn according to a set pattern (i.e., whether the state is Iowa or not) and there is no random mechanism controlling which observations are sampled. It can also be viewed as an extreme form of stratified sampling with two strata, Iowa and non-Iowa states, in which the former is fully sampled (oversampled) and the latter is not sampled at all (undersampled).
- Separating the dataset into policyholders for whom the state is Iowa and those for whom it is not, then drawing a random sample of size 50 from each of them
This is stratified sampling for the same reason as the second scheme.
Granularity
Granularity refers to how precisely a variable in a dataset is measured, or equivalently, how detailed the information contained by the variable is.
As an example, location, in ascending order of granularity, can be recorded:
- by country (very crude, least granular),
- by state (more granular),
- by ZIP code (even more granular), and
- by address (most granular, each address is largely unique).
From country to address, we have more and more specific information about the location of an individual.
| Granularity | # of Levels | # of Obs at Each Level | Susceptibility to Noise |
|---|---|---|---|
| High | High | Low | High |
| Low | Low | High | Low |
Data Quality Issues
Data validation is the process of ensuring the quality and appropriateness of the available data. If the collected data falls short of our standards, we may have to go back to the data collection step and search for an alternative dataset.
- Reasonableness: For a dataset to be useful, the data values should, at a minimum, be reasonable, which can be checked by exploratory data analysis (see Stage 3).
- Consistency: We need to ensure that the records in the data are inputted consistently, meaning that the same basis and rules have been applied to all values, so that they can be directly compared with one another.
  - Numeric variables: For numeric variables, the same unit should be used across all rows of the data.
  - Categorical variables: We should make sure that the factor levels of categorical variables are defined and recorded consistently over time, with no coding changes.
- Advantages of factorizing numeric variables (see the sketch following this list):
  - Treating a numeric variable as numeric, with a single regression coefficient attached, imposes the undesirable restriction that every unit increase in the variable is associated with the same change in the linear predictor, holding all other variables fixed.
  - This restriction is lifted if the variable is treated as a categorical predictor represented by dummy variables, each with its own coefficient, giving a GLM more flexibility to capture the effect of the variable on the target variable.
- Disadvantages of factorizing numeric variables:
  - Factorization inflates the dimension of the data, as each non-baseline factor level has to be represented by a dummy variable, and may result in overfitting.
  - Resolution: Stepwise selection with binarization, or regularization, can help reduce the large number of factor levels.
  - Downside: A longer time to complete the model fitting procedure.
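As a sketch of the flexibility gained by factorization, the following simulated example (hypothetical data and variable names) fits the same target once with a numeric predictor and once with that predictor converted to a factor, so that each non-baseline level receives its own coefficient:

```r
set.seed(42)
# Simulated data: number of exams passed (0-5) and salary with a non-linear effect
exams  <- sample(0:5, size = 200, replace = TRUE)
salary <- 50000 + 8000 * sqrt(exams) + rnorm(200, sd = 5000)
dat    <- data.frame(exams, salary)

# Numeric treatment: one slope, i.e., the same expected change per additional exam
fit_numeric <- lm(salary ~ exams, data = dat)

# Factor treatment: one dummy variable (and coefficient) per non-baseline level
fit_factor  <- lm(salary ~ factor(exams), data = dat)

summary(fit_numeric)$coefficients
summary(fit_factor)$coefficients
```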
If we want to combine multiple datasets to form new variables (columns) or new observations (rows), then the consistency checks above extend to all constituent datasets to make sure that the concatenation works properly.
If we need to join datasets by a certain variable (column), it is especially important to ensure that the format of that variable is consistent across both datasets to ease matching, e.g., name should be recorded either as [first name] [last name] or [last name], [first name] so that each person can be identified uniquely. The possible existence of a middle name adds further complication.
- Sufficient documentation
A good dataset should also be sufficiently documented so that other users can easily gain an accurate understanding of different aspects of the data.
The PA modules suggest that effective documentation should at least include the following information:
- A description of the dataset overall, including the data source (how and when the dataset was collected)
- A description of each variable in the data, including its name, definition, and format (range of values for numeric variables and the possible levels for categorical variables)
- Notes about any past updates or other irregularities of the dataset
- A statement of accountability for the correctness of the dataset
- A description of the governance processes used to manage the dataset
Other Data Issues
Even if a dataset is of sufficient quality, there are additional regulatory and professional issues related to the collection and use of data that we need to be aware of. Most of these issues are at the variable level, i.e., they pertain to specific variables in the data rather than the whole dataset.
- PII: It is not uncommon that the datasets we use in real life contain personally identifiable information (PII), or information that can be used to trace an individual’s identity. Examples of such information include name, Social Security number, address, photographs, and biometric records. In most places, there are laws and regulations governing the use of personal data, and it is important for us to comply with these regulations as well as relevant standards of practice (e.g., Actuarial Standard of Practice No. 23, or ASOP 23 for short, in the US) and consult the legal team if necessary. For example, we may have to:
  - Anonymization: Anonymize or de-identify the data to remove the PII. (Note that anonymization itself constitutes processing of personal data and will need to comply with the pertinent laws.)
  - Data security: Ensure that personal data receives sufficient protection, such as encryption and access/transfer restrictions. Any third parties that make use of the data must comply with our security policy. In short, we should preserve the integrity of the data.
  - Terms of use: Be well aware of the terms and conditions and the privacy policy related to the collection and use of data, including limitations on what we can do with the information that customers provide and on which third parties can use the data.
- Variables causing unfair discrimination: The aim of predictive modeling is, in effect, to differentiate the observations of the target variable on the basis of the predictors in the data. If some of these variables contain rather sensitive information, such as race, ethnicity, gender, or other prohibited classes, then differential treatment based on these variables in a predictive model may lead to unfair discrimination and some jurisdictions may have regulations prohibiting their use. Even if these variables may prove useful from a predictive point of view, using them as predictors may be deemed unethical and we should weigh the legal risks involved.
We also have to be careful with proxies of these variables. An example is occupation, which can be strongly correlated with gender in some cases, e.g., nurses are typically female and mine workers are typically male.
- Target leakage: In predictive modeling, target leakage refers to the phenomenon in which the predictors in a model include (“leak”) information that will not be available when the model is applied in practice. These predictors typically have a close, or even one-to-one, correspondence with the target variable and, if mistakenly included, would produce misleadingly superb prediction performance, with the model making near-perfect predictions simply because of the leaked predictors’ strong relationship with the target.
Stage 3: Exploratory Data Analysis
The goals of exploratory data analysis are the same as those presented earlier:
- With the use of descriptive statistics and graphical displays, clean the data for incorrect, unreasonable, and inconsistent entries, and understand the characteristics of and the key relationships among variables in the data.
The observations we make may suggest an appropriate type of predictive model to use and the best form for the predictors to enter the model.
Stage 4: Model Construction and Evaluation
Should we use all available data to train our models?
After spending so much time and effort on data collection and validation, you may be tempted to fit your predictive models directly to the whole set of collected data. But hold on! Remember that one of the main goals when doing predictive modeling is to construct models that make good predictions. The fundamental question is:
- How do we go about designing a predictive model that performs well when it is applied to new, previously unseen data? In other words, how do we make sure that the model calibrated on past data excels in predicting future target values on the basis of future predictor values?
Note that prediction, by definition, concerns future data, not past data. If we use all the collected data for model fitting, that will leave no independent data for assessing the prediction performance of our models. It does not help much to evaluate the models on the same set of data on which they were built. After all, the models have seen those data and are not making genuine predictions. To ensure that our predictive models do a good job of describing the past data they have seen and, more importantly, predicting new, future observations, we need an unconventional way of leveraging our available data.
Training/Test Set Split
To construct a model that thrives on future data, it is common practice in predictive analytics to partition our data into a few parts, each of which plays a different role in the model development and evaluation process. This is in stark contrast to classical statistics where you will use all of the available observations to fit your model and make inference about the parameters. The simplest way is to split the full data into two parts, the training set and test set.
- Training set (typically 70-80% of the full data): The larger part of the full data, the training set is where you “train” or develop your predictive model to estimate the signal function \(f\) and, if needed, the model parameters. The training is typically done by optimizing a certain objective function. The fitted model can be denoted by \(\hat{f}\).
- Test set (typically 20-30% of the full data): The test set is where you assess the prediction performance of your trained model according to certain performance metrics. Observations in the test set did not participate in the model training process and will provide a more objective ground for evaluating prediction performance when the model is applied to new, future data. In contrast, evaluating the model on the training set will give an overly optimistic and somewhat misleading picture of its true predictive performance: the model has already seen the training observations and is fitting, not predicting, those observations.
As simple as the training/test set split is, there are some subtleties we have to pay attention to:
1. How to do the split?
There are many different ways to do the training/test set split. It can be done either randomly according to pre-specified proportions or with special statistical techniques that ensure the distributions of the target variable in the two sets are comparable. The bottom line is that the observations in the test set must be truly unseen by the model developed on the training set, or else we are cheating! A short R sketch of a random split follows this discussion.
If one of the variables in the data is a time variable, such as calendar year or month, and the behavior of the target variable over time is of interest, then one more way to make the split is on the basis of time, allocating the older observations to the training set and the more recent observations to the test set. Such “out-of-time” validation is conducted frequently for time series data, also known as longitudinal data, and is useful for evaluating how well a model extrapolates time trends observed in the past to future, unseen years.
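A minimal sketch of a random 70/30 split in base R, assuming a data frame `dat` (a hypothetical name); stratified alternatives such as caret's `createDataPartition()` can be used when we want the target distribution to be comparable across the two sets:

```r
set.seed(123)

# Randomly select 70% of the row indices for the training set
train_idx <- sample(nrow(dat), size = round(0.7 * nrow(dat)))

train <- dat[train_idx, ]   # training set (~70%)
test  <- dat[-train_idx, ]  # test set (~30%)
```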
2. How many observations should we assign to the two sets?
The relative size of the training and test sets involves a trade-off.
- Having more training data will make for a more robust predictive model that is more capable of learning the patterns in the data and less susceptible to noise.
- If too little data is set aside for the test set, however, the assessment of the prediction performance of the trained model on new, unseen observations will be less reliable.
The exam will almost always specify the split percentages (e.g., 70%/30%) for you, but anything in the 70-80% range for the training set and 20-30% range for the test set is acceptable.
3. How to use the training/test set split to rank competing models?
If we have multiple candidate models (almost always the case in Exam PA), we can use the test set to differentiate the quality of these models and choose the one with the best test set performance according to a certain model selection criterion. To ensure a fair comparison, the same training and test sets should be used across all candidate models.
Common Performance Metrics
After making the training/test set split and fitting the predictive model to the training set, it is time for us to evaluate the predictive performance of the model on the test set with respect to an appropriate performance metric. The choice of the performance metric depends closely on the nature of the target variable (numeric or categorical), or equivalently, the nature of the prediction problem (regression or classification).
Regression problems
When the target variable is numeric, the predictive performance of a model can be assessed by looking at the size of the discrepancy between the observed value of the target variable on the test set and the corresponding predicted value. This discrepancy is referred to as the prediction error for that observation in the test set.
A commonly used metric that aggregates all of the prediction errors on the test set and provides an overall measure of prediction performance is the test root mean squared error (RMSE), defined as:
test RMSE = \(\sqrt{\dfrac{1}{n_\text{test}}\sum\limits_{i=1}^{n_\text{test}}{(y_i-\hat{y}_i)^2}}\)
where \(n_\text{test}\) is the size of the test set, and \(y_i\) and \(\hat{y}_i\) are respectively the observed and predicted values of the target variable for the \(i^{th}\) observation in the test set.
The individual prediction errors are squared (so that positive and negative errors will not cancel out), summed, and averaged, followed by taking a square root, to yield the test RMSE. As you can expect, the smaller the RMSE, the more predictive the model.
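As a sketch, the test RMSE can be computed directly from the definition; `fit`, `test`, and the target column `y` are hypothetical names:

```r
# Predictions of the fitted model on the test set
pred <- predict(fit, newdata = test)

# Test RMSE: square root of the average squared prediction error
rmse <- sqrt(mean((test$y - pred)^2))
rmse
```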
Classification problems
For categorical target variables, the observed values \(y_i\) and the predictions \(\hat{y}_i\) are simply labels (certain factor levels of the target) which cannot be handled algebraically, so the difference \(y_i-\hat{y}_i\) may not be well-defined or may not make sense even if it is defined. We may instead look at the test classification error rate (a.k.a. misclassification error rate), defined by
test misclassification error rate = \(\dfrac{1}{n_\text{test}}\sum\limits_{i=1}^{n_\text{test}}1_{y_i\ne\hat{y_i}}\)
where \(1_{y_i\ne\hat{y_i}}\) is the indicator function given by:
- \(1_{y_i\ne\hat{y}_i}=0\), if \(y_i=\hat{y}_i\), i.e., the \(i^{th}\) test observation is correctly classified
- \(1_{y_i\ne\hat{y}_i}=1\), if \(y_i\ne\hat{y}_i\), i.e., the \(i^{th}\) test observation is misclassified
The sum therefore computes the number of test observations incorrectly classified and the division by \(n_\text{test}\) returns the proportion of misclassified observations on the test set. The smaller the classification error rate, the more predictive the classifier.
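A sketch of the test classification error rate for a binary classifier, assuming a fitted logistic regression `fit` and a test set `test` with a 0/1 target `y` (hypothetical names), using a 0.5 cutoff:

```r
# Predicted probabilities converted to predicted labels at a 0.5 cutoff
pred_prob  <- predict(fit, newdata = test, type = "response")
pred_class <- ifelse(pred_prob > 0.5, 1, 0)

# Proportion of test observations that are misclassified
error_rate <- mean(test$y != pred_class)
error_rate
```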
Both the RMSE and classification error rate are general performance metrics that apply to general regression and classification problems, respectively. They can be computed on the training set as well as the test set, but as discussed above, our main interest is in these performance metrics on the test set. They are also distribution-free in the sense that their computations do not rely on a particular distribution assumed for the target variable. If the target variable is known a priori to follow a certain distribution (e.g., normal, Poisson, gamma), then additional likelihood-based performance metrics are available.
Cross-Validation
In predictive analytics, there is a remarkably useful and widely applicable technique, known as cross-validation (CV), that is a variation of the training/test set split described above. This technique provides a powerful means to assess the prediction performance of a model without using additional test data. In Exam PA, it also serves an important purpose:
- To select the values of hyperparameters, which are parameters that control some aspect of the fitting process itself.
Model fitting typically involves optimizing a certain objective function, and hyperparameters often play a role either in the mathematical expression of the objective function, or in the set of constraints defining the optimization problem. In fact, most predictive models contain one or more hyperparameters, speaking to their importance in predictive analytics.
The tricky thing about hyperparameters is that we have to supply their values in advance; they are not optimized as part of the model fitting algorithm, so it is impossible to select their values given the training set alone.
In theory, we can further divide the training data into two parts, one part for training the model for a given set of hyperparameters under consideration, and one part for evaluating the predictive performance of the resulting models. Then we select the combination of hyperparameters that gives rise to the best performance. In practice, this will reduce the size of the training set and undesirably undermine the reliability of the model development process.
Cross-validation is an ingenious method for tuning hyperparameters without having to further divide the training set. Given the training set, cross-validation (more precisely, k-fold cross-validation) works by performing a series of splits repeatedly across (hence the qualifier “cross”) the dataset. The procedure is as follows:
1. For a given positive integer \(k\), randomly split the training data into \(k\) folds of approximately equal size. The value of \(k\) can range from 2 to \(n_{tr}\), the number of training observations (in which case we have leave-one-out cross-validation). A common choice for \(k\) is 10, which is the default value in many model fitting functions in R.
2.1. One fold is left out and the predictive model is fitted to the remaining \(k-1\) folds. Then the fitted model is used to make a prediction for each observation in the left-out fold and a performance metric is computed on that fold. This is a valid estimate of the prediction performance of the model because observations in the excluded fold are not part of the model training process. In other words, they are something unseen by the model.
2.2. Then repeat this process with each fold left out in turn to get \(k\) performance values, \(V_1,…,V_k\) (e.g., RMSE for a numeric target and classification error rate for a categorical target).
2.3. The overall prediction performance of the model can be estimated as the average of the \(k\) performance values, known as the cross-validation error:
\(\text{CV Error}=\dfrac{V_1+V_2+\cdots+V_k}{k}\)
A schematic diagram of how cross-validation works when k = 3 is shown below:
The fold marked “Validation” is used for evaluating prediction performance in each round.
If we have a set of candidate hyperparameter values, then we can perform Step 2 above for each set of values under consideration and select the combination that produces the best model performance. Then the model is fitted to the entire training set (i.e., the k folds together) with this optimal combination of hyperparameter values and evaluated on the test set. In fact, cross-validation is such a useful technique that it is automatically built into some model fitting algorithms such as regularization and decision trees for tuning model parameters for best performance.
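A manual sketch of k-fold cross-validation for a linear model, assuming a training data frame `train` with a numeric target `y` (hypothetical names); in practice the same logic is built into packages such as caret:

```r
k <- 10
set.seed(1)

# Randomly assign each training observation to one of the k folds
folds <- sample(rep(1:k, length.out = nrow(train)))

cv_rmse <- sapply(1:k, function(j) {
  fit  <- lm(y ~ ., data = train[folds != j, ])        # fit on the other k - 1 folds
  pred <- predict(fit, newdata = train[folds == j, ])  # predict on the left-out fold
  sqrt(mean((train$y[folds == j] - pred)^2))           # fold-level RMSE
})

mean(cv_rmse)  # CV error: average of the k fold-level performance values
```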
Exercise
Question: A statistician has a dataset with \(n=50\) observations and \(p=22\) predictors. He is using 10-fold cross-validation to select from a variety of available models.
Calculate the approximate number of observations used for training models in each round of the cross-validation procedure.
Answer: 10-fold cross-validation involves (randomly) dividing the data into 10 folds of approximately the same size, or 50/10 = 5 observations each in this case. In each round, 1 fold is left out and the other 9 folds, with approximately \(9(5) = 45\) observations, are used to train a model.
Stage 5: Model Validation
There are different kinds of model validation techniques. They may be performed on either the training set or the test set, and they may be model-dependent in the sense that they are specific to particular types of predictive models.
- Training set: For GLMs, there is a set of model diagnostic tools designed to check the model assumptions based on the training set.
- Test set: A general model validation technique applicable to both GLMs and decision trees is to compare the predicted values and the observed values of the target variable on the test set. The comparison can be quantitative (summarized by a number) or graphical. If we use the graphical means and the model works well, then the plotted points of observed against predicted values should not only fall closely along the 45° straight line passing through the origin, but the deviations from that line should also show no systematic patterns (see the sketch after this list).
- For this validation technique to be effective, it is vitally important that we compare the predicted values and the observed values on the test set, not the training set. If we perform the comparison on the training set, we are only assessing the goodness of fit of the model to training data, not its prediction performance on unseen data.
- Another validation technique used in some past PA exams (e.g., the December 7, 2020, June 16, 2020, and June 2019 exams) is to compare the selected model to an existing, baseline model, again on the test set. The baseline is usually a primitive model, such as an ordinary least squares linear model (if the selected model is a GLM) or a model that has no predictors and simply uses the average of the target variable on the training set for prediction. The baseline model provides a benchmark that any selected model should beat as a minimum requirement.
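A sketch of the graphical comparison on the test set, assuming a fitted model `fit` and a test set `test` with numeric target `y` (hypothetical names):

```r
library(ggplot2)

results <- data.frame(observed  = test$y,
                      predicted = predict(fit, newdata = test))

ggplot(results, aes(x = predicted, y = observed)) +
  geom_point(alpha = 0.4) +
  # 45-degree reference line through the origin; points should scatter closely around it
  geom_abline(intercept = 0, slope = 1, linetype = "dashed") +
  labs(x = "Predicted values", y = "Observed values")
```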
Stage 6: Model Maintenance
If the client gladly accepts the selected model, then we can put it to actual use and maintain it over time so that it remains useful in the future. Here are some of the steps we can take:
- As new data becomes available, the training data should be enlarged with these new observations and the model periodically retrained to improve its robustness and prediction performance.
- If new variables are collected, they should be fed into the training set and, if useful, enter the predictive model.
- Solicit external subject matter expertise if there are modeling issues that cannot be easily resolved purely on predictive analytic grounds, e.g., high-dimensional categorical predictors with no obvious ways to combine factor levels.
Bias-Variance Trade-off
The preceding subsection provided an overview of the entire model building process, from problem definition all the way to model maintenance. This subsection and the next will delve into the model construction stage (Stage 4) more carefully and introduce, in a statistical framework, the considerations that go into designing a good predictive model.
Key idea: a complex model does not imply a predictive model.
Complexity does not guarantee prediction accuracy.
Goodness of fit vs. prediction accuracy
In general, the more complex the model, the lower the training error (measured by RMSE or misclassification error) because a more flexible model is capable of accommodating a wider variety of patterns in the data.
However, our prime interest when doing predictive analytics is in developing a model that makes accurate predictions on future data. Unfortunately, prediction accuracy is not quite the same as goodness of fit.
Goodness-of-fit measures, which quantify how well a model describes past data, are not necessarily good measurements of how well the model will perform on future data.
As we will see below, when it comes to making predictions for new observations, using an overly complex model may backfire. In fact, a central theme of predictive analytics is selecting the right level of complexity for constructing an effective prediction model.
Decomposition of Expected Test Error
To better understand how model complexity affects prediction performance on the test set, it is instructive to break down the test error into several components. Although such a decomposition is mainly of theoretical nature and you will rarely calculate the individual components of the formula in Exam PA, keeping the decomposition in mind will help you understand what it takes to produce good predictions. In particular, it reinforces the practically important idea that more complex models are not necessarily superior models.
We will consider accuracy from the perspective of measuring (and hopefully, minimizing) the expected loss associated with a model choice and breaking it down into its three underlying components. They incorporate a key trade-off between the following two components:
- The expected loss arising from the model not being complex/flexible enough to capture the underlying signal (bias); and
- The expected loss arising from the model being too complex and overfitting to a specific instance of the data (variance).
High bias means that our model won’t be accurate because it doesn’t have the capacity to capture the signal in the data, whereas high variance means that our model won’t be accurate because it overfit to the data it was trained on and, thus, won’t generalize well to new, unseen data.
For simplicity, we will consider a numeric target variable so that the (R)MSE is an appropriate performance metric (categorical target variables admit a similar decomposition). For a given test observation \((X_0, Y_0)\), the decomposition reads:
\(E\!\left[(Y_0-\hat{f}(X_0))^2\right]=\underbrace{\left[\text{Bias}_{T_r}(\hat{f}(X_0))\right]^2}_{\text{squared bias}}+\underbrace{\text{Var}_{T_r}(\hat{f}(X_0))}_{\text{variance}}+\underbrace{\text{Var}(\varepsilon)}_{\text{irreducible error}}\)
where \(\hat{f}\) is the estimated signal function based on the training set, denoted by \(T_r\), and the expectation on the left is taken over the new (random) target \(Y_0\) as well as the training data (remember, the target variable values on the training set are only realizations from the distribution of the target variable). This identity breaks down the expected test error (expected test MSE, to be precise) into three salient components and sheds light on how model complexity affects the quality of predictions:
Bias
The bias of \(\hat{f}(X_0)\) as a predictor of \(Y_0\) is the difference between the expected value of \(\hat{f}(X_0)\) and the true value of the signal function \(f\) at \(X_0\), i.e.,
\(\text{Bias}_{T_r}(\hat{f}(X_0))=E_{T_r}[\hat{f}(X_0)]-f(X_0)\)
Remember that \(\hat{f}\) is constructed from the training data and hence is random in the sense that different training data give rise to a different \(\hat{f}\). Thus it makes sense to talk about the bias and variance of \(\hat{f}(X_0)\), which is a random variable.
In general, the more complex a model, the lower the bias (in magnitude) due to its higher ability to capture the signal in the data. Stated differently, bias is the part of the test error caused by the model not being flexible enough to capture the underlying signal.
Variance
The variance of \(\hat{f}(X_0)\) quantifies the amount by which \(\hat{f}(X_0)\) would change if we estimated it using a different training set. Ideally, \(\hat{f}(X_0)\) as a predictor of \(Y_0\) should be relatively stable across different training sets.
In general, a more flexible model has a higher variance because it is more sensitive to the training data: a small perturbation in the training observations can cause massive changes in \(\hat{f}\). Consider, for example, the extreme case when \(\hat{f}\) is so sophisticated that it can perfectly fit all of the training observations. Then any change in the target observations will lead to the same change in the fitted values. We can say that variance is the part of the test error caused by the model being overly complex.
Irreducible Error
This is the variance of the noise, which is independent of the choice of the predictive model, but inherent in the target variable. As this component cannot be reduced no matter how good your predictive model is, it is called the irreducible error in the prediction error. In contrast, the bias and variance of \(\hat{f}(X_0)\) constitute the reducible error as they can potentially be reduced by using a more appropriate predictive model.
Bias vs. Variance
Bias and variance quantify the accuracy and precision of prediction, respectively. Although accuracy and precision are synonyms in English, in predictive analytics they are quite different concepts measuring different aspects of prediction performance.
Ideally, we want a model that produces low-bias (accurate) and low-variance (precise) predictions. In practice, there is usually an intrinsic conflict between having predictions with low bias and having predictions with low variance. As we saw above, a more flexible model has a lower bias but a higher variance than a less flexible model (in addition to being less interpretable and more prone to computational issues), all else equal. Such an inverse relationship between the model variance and the model bias is commonly referred to as the bias-variance trade-off, which is one of the most fundamental concepts in predictive analytics and a recurring theme in this manual.
Typically, the relative rate of change of these two competing quantities contributes to a U-shape behavior of the test error, which can be explained as follows:
Case 1: Underfitting
In the left tail, the model is too simple to capture the signal in the data. Without learning enough from the training data, the underfitted model has a relatively large training error, large bias, but small variance. The large bias also makes up much of the test error, which is relatively large as well. As the flexibility of the model increases, the bias initially tends to drop faster than the variance increases, so the test error decreases.
Case 2: Overfitting
When the model becomes overwhelmingly complex, the drop in the bias is outweighed by the rise in the variance, so the test error eventually rises, displaying a U-shape behavior. In the right tail, the model, which is unnecessarily complex in this range, is overfitting the data. It is trying too hard to capture the signal (which is systematic, applicable to both the training and test data) as well as the noise specific to the training data (but inapplicable to the test data) and mistreats the noise as if it were a signal. Although the training error is low, the model is marked by a high test error, which can be attributed to its high variance.
Regardless of the model type, the bias-variance trade-off has very practical implications for model construction:
The construction of a model that fares well on future, unseen data boils down to locating the right level of model complexity that corresponds, at least approximately, to the minimum point of the U-shape curve followed by the test error. Instead of using a model that is too simple or too complex, we should adopt a moderately flexible one that strikes a good balance between having a high bias and having a high variance.
Exercise: The U-Shape Curve
An actuary has split a dataset into training and test sets for a model. The chart below shows the relationship between model performance and model complexity. Model performance is represented by model error (on the training or test sets) and model complexity is represented by degrees of freedom.
(a) Briefly explain why it is desirable to split the data into training and test sets.
(b) Briefly describe whether each of the following models has an optimal balance of complexity and performance.
(i) Model 1: 10 degrees of freedom
(ii) Model 2: 60 degrees of freedom
(iii) Model 3: 100 degrees of freedom
Solution
(a) Recall that the more flexible the model, the better the fit to the training data, but not necessarily the fit to independent test data. With the training/test set split, we use the training set to fit the model but then test its predictive power on the test set (which is data “unseen” by the model) to avoid overfitting to the noise of the data and to ensure that the chosen model is predictive on future, unseen data.
(b)
- (i) This model does not have an optimal balance as the model error on both the training and test data is relatively high and the test error can be decreased by making the model more complex.
- (ii) This model has an optimal balance as the test error is approximately at its lowest point, meaning that adding or removing parameters will raise the test error.
- (iii) This model does not have an optimal balance. While adding more parameters has lowered the training error, the test error has increased, suggesting overfitting.
Exercise: Visualizing bias-variance trade-off
This simulation example uses polynomial regression to illustrate the bias- variance trade-off.
As a concrete pictorial illustration of the bias-variance trade-off, we consider fitting five linear models of different degrees of model complexity and investigate the predictive performance of these five models on independent test data by simulation.
Specifically, we suppose that the true relationship between the target variable \(Y\) and the single predictor \(X\) is \(Y=f(X)+\varepsilon\), where the signal function is the square function \(f(X)=X^2\) and the noise \(\varepsilon\) is normally distributed with mean 0 and standard deviation 0.5.
Without knowing the true \(f\) in advance, we fit five linear models with polynomial regression functions of varying degrees (0, 1, 2, 4, and 8) to the training data, of size 50, and make a prediction for the target variable at \(X_0=0.8\). In this context, we are using the degree of the polynomial regression function as a natural index for model complexity.
The higher the degree, the more flexible the linear model.
Since \(f(X)=X^2\), we know that the true value of the signal function at \(X_0\) is \(0.8^2 = 0.64\), which allows us to judge the predictive performance of the five linear models, knowing the true model beforehand.
This process is repeated 3,000 times to appreciate the distribution of the predictions for each model. In particular, we are interested in estimating the test MSE for each model from the results of the 3,000 rounds of simulation.
In CHUNK 1, we use a `for` loop to perform the 3,000 rounds of simulation and calculate the average test MSE for each linear model.
CHUNK 1
```r
n <- 50        # sample size of each training set
n_sim <- 3000  # number of rounds of simulation (an important parameter!)
x0 <- 0.8      # predictor value of interest
df.test <- data.frame(x = x0)  # test set of one observation
pred <- matrix(NA, nrow = n_sim, ncol = 5)
bias <- Var <- test.error <- rep(NA, 5)

set.seed(99)
for (k in 1:n_sim) {
  # simulate the training set of n observations for each round
  x <- runif(n)
  e <- rnorm(n, sd = 0.5)
  y <- x^2 + e  # true model is quadratic
  df.train <- data.frame(x, y)

  # fit the five linear models to the training set
  model.0 <- lm(y ~ 1, data = df.train)           # intercept-only model
  model.1 <- lm(y ~ x, data = df.train)           # simple linear regression
  model.2 <- lm(y ~ poly(x, 2), data = df.train)  # quadratic regression
  model.4 <- lm(y ~ poly(x, 4), data = df.train)  # degree 4
  model.8 <- lm(y ~ poly(x, 8), data = df.train)  # degree 8

  # calculate the predicted value for each linear model
  pred[k, 1] <- predict(model.0, newdata = df.test)
  pred[k, 2] <- predict(model.1, newdata = df.test)
  pred[k, 3] <- predict(model.2, newdata = df.test)
  pred[k, 4] <- predict(model.4, newdata = df.test)
  pred[k, 5] <- predict(model.8, newdata = df.test)
}

y0 <- x0^2 + rnorm(n_sim, sd = 0.5)  # random target of prediction
```
In CHUNK 2, we use separate boxplots to visualize the distribution of the 3,000 predictions produced by each model. Based on the boxplots, we can graphically evaluate the performance of the five models in terms of accuracy (measured by bias) and precision (measured by variance):
- Accuracy: The boxplots reveal that the model of degree 0 (i.e., no predictors are used) performs poorly in terms of bias: its predictions are far away from the true signal value \(f(X_0)=0.64\), which is indicated by a dashed horizontal line. This model appears to be underfitting the data and fails to capture the signal sufficiently. The models of degrees 1, 2, 4, and 8 all make accurate predictions in that their predictions, on average, are close to the true signal value.
- Precision: The boxplots also indicate that the higher the degree of the polynomial regression function, the greater the amount of fluctuation around the average prediction. It follows that the models of degrees 4 and 8 have a higher variance, and so a higher test MSE, than the models of degrees 1 and 2. These two higher-degree models seem to be overfitting the data and following the noise in the training data too closely.
Overall, the findings conform to the bias-variance trade-off and provide a good illustration of the interplay between model complexity and prediction performance.
CHUNK 2
```r
# Combine the predicted values of the five models into a single vector for plotting purposes
pred.all <- c(pred[, 1], pred[, 2], pred[, 3], pred[, 4], pred[, 5])
degree <- as.factor(rep(c(0, 1, 2, 4, 8), each = n_sim))
dat <- data.frame(degree, pred.all)

library(ggplot2)
ggplot(dat, aes(x = degree, y = pred.all)) +
  geom_boxplot() +
  geom_hline(yintercept = x0^2, linetype = "dashed") +
  labs(x = "Degree of Polynomial Regression Function", y = "Predictions")
```
In CHUNK 3, we take a closer look at the squared bias, variance, and test error of the five models. As the degree of the polynomial regression function increases, the squared bias drops whereas the variance increases consistently, as anticipated. The test MSE follows the U-shape behavior discussed above, decreasing from degree 0 to degree 2, then increasing as we further raise the polynomial degree from 2 to 4 to 8. As we expect, the lowest test MSE is achieved by the linear model of degree 2, which corresponds to the true level of model complexity. Beyond degree 2, the prediction performance deteriorates, so don’t think that using a flexible model is always a good idea. There is a price we have to pay for using an overly complex model!
CHUNK 3
```r
# Simulation estimate of the squared bias
print("Squared Bias")
for (j in 1:5) {
  bias[j] <- (mean(pred[, j]) - x0^2)^2
  print(bias[j])
}

# Simulation estimate of the variance
print("Variance")
for (j in 1:5) {
  Var[j] <- mean((pred[, j] - mean(pred[, j]))^2)
  print(Var[j])
}

# Simulation estimate of the test error
print("Test MSE")
for (j in 1:5) {
  test.error[j] <- mean((y0 - pred[, j])^2)
  print(test.error[j])
}
```

Output:

```
[1] "Squared Bias"
[1] 0.09395985
[1] 8.479837e-05
[1] 1.228075e-06
[1] 7.331803e-06
[1] 1.052667e-07
[1] "Variance"
[1] 0.006978376
[1] 0.01091701
[1] 0.01085023
[1] 0.02395384
[1] 0.04438122
[1] "Test MSE"
[1] 0.3410798
[1] 0.2543996
[1] 0.2531114
[1] 0.2665839
[1] 0.2937785
```
Feature Generation and Selection
A practical implication of the bias-variance trade-off is that it is crucially important to strike a balance between having an underfitted model that produces stable but inaccurate predictions, and having an overfitted model that makes accurate but highly variable predictions. This in turn boils down to an effective control of model complexity, which can be achieved by feature generation and selection.
Variables vs. Features
To many, the terms “variables” and “features” are synonyms meaning the predictors used in a predictive model. In a predictive analytic context (and in Exam PA, in particular), however, there is a subtle distinction between variables and features that we should keep in mind.
- Variables: In statistics literature, a variable is synonymous with a predictor in a model. In the predictive analytics arena, a variable more precisely means a raw measurement that is recorded and constitutes the original dataset before any transformations are applied.
- Features: In the predictive modeling community, features refer to derivations from the original variables and provide an alternative, more useful view of the information contained in the dataset. In the language of Exam IFM, features are “derivatives” of raw variables.
The Need for Feature Generation and Feature Selection
The need for feature generation and selection arises from certain limitations of modeling algorithms. Fundamentally, models have a functional or algorithmic form that adapts to the data (through optimization of parameters or other aspects of the model). The general trade-off is that the more parameters (or input variables) you allow your model to have, or the more flexibility you give its form, the more accurately it will be able to represent complex signals, but the less interpretable it will be, and the more likely it is that the model will suffer from other issues such as fitting to noise (overfitting) and computational complexity.
Feature generation and selection aims to strike a balance. Instead of letting the model do all the work and so become overly complex, we try to transform the data by extracting the important features, allowing us to build a much simpler model. Crucially, this requires human input. Using our intelligence (and statistical/algorithmic rules we have developed) to help the predictive analytics process, actuaries create more interpretable models and improve accuracy without the cost of increasing the complexity of the model.
Feature Generation
Feature generation is the process of “generating” new features based on existing variables in the data. Although these new derived variables do not add new information, they transform the information contained in the original variables into a more useful form or scale so that a predictive model can “absorb” the information more effectively and capture the relationship between the target variable and the predictors more powerfully.
Ultimately, the model is in a better position to pick up the signal in the data. This can be especially important for certain types of predictive models (GLMs, in particular). Within the context of the bias-variance trade-off, feature generation seeks to enhance the flexibility of the model and lower the squared bias of the predictions at the expense of an increase in variance.
In some cases, the new features will also make a model easier to interpret. There are many methods for generating new features, many of which are motivated from practical considerations.
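For illustration, a few common generated features, assuming a hypothetical data frame `dat` with raw variables `income` and `debt`:

```r
# Log transform to tame the right skew of a monetary variable
dat$log_income <- log(dat$income)

# Ratio of two raw variables as a new feature
dat$debt_ratio <- dat$debt / dat$income

# Polynomial term to let a GLM capture curvature in the effect of income
dat$income_sq <- dat$income^2
```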
Feature Selection and Dimension Reduction
Feature selection (or feature removal), the opposite of feature generation, is the procedure of dropping features with limited predictive power and therefore reducing the dimension of the data. In the framework of the bias-variance trade-off, it is an attempt to control model complexity and prevent overfitting by reducing variance at the cost of a slight rise in bias. It is an indispensable exercise in predictive modeling.
Feature selection is particularly relevant to high-dimensional categorical predictors, each level of which is technically treated as a separate feature by most predictive models. By the dimensionality of a categorical variable, we refer to the number of possible levels that the variable has. If a categorical predictor has a large number of categories (e.g., US states, with 50 levels), this can substantially inflate the dimension of the data, undermine prediction precision, and lead to low exposure in some levels.
Here are two commonly used strategies for reducing the dimensionality of a categorical predictor:
- Combining sparse categories with others: Categories with very few observations should be the first candidates to be combined with other categories. Due to the small number of observations, it is difficult to estimate the effects of these categories on the target variable reliably. To put it differently, our predictive model may have difficulty making robust predictions for future observations belonging to these sparse categories. To improve the robustness of the fitted model and reduce the likelihood of it overfitting to noise in the training data, it is advisable to fold, as far as practicable, sparse categories into more populous categories in which the target variable exhibits a rather similar behavior. The goal is to strike a balance between the following conflicting goals:
- To ensure that each level has a sufficient number of observations.
- To preserve the differences in the behavior of the target variable among different factor levels. Such differences may be useful for prediction.
- Combining similar categories: If the target variable behaves similarly (with respect to the mean, median, or other distributional measures) in two categories of a categorical predictor, then we can further reduce the dimension of the predictor by consolidating these two categories, even if they are not sparse, without losing much information.
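The first strategy can be sketched in base R as follows, assuming a data frame `dat` with a high-dimensional factor `state` (hypothetical names) and an arbitrary threshold of 30 observations:

```r
# Identify levels with fewer than 30 observations
tab    <- table(dat$state)
sparse <- names(tab[tab < 30])

# Fold the sparse levels into a single "Other" level
state_chr <- as.character(dat$state)
state_chr[state_chr %in% sparse] <- "Other"
dat$state <- factor(state_chr)

table(dat$state)  # check the new level counts
```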
Granularity Revisited
A concept closely related to the dimensionality of a categorical variable is granularity, which was first introduced in the context of data collection.
As we make a categorical variable more granular, the information it stores becomes finer. Its dimension increases and there are fewer observations at each distinct level, so one way to reduce the susceptibility of a predictive model to noise is to reduce the granularity of a categorical predictor, recording the information contained by the predictor at a less detailed level and making the number of factor levels more manageable. The optimal level of granularity is the one that optimizes the bias-variance trade-off.
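As a sketch of reducing granularity, the following hypothetical example bands an exact age into a small number of levels with `cut()`; `right = FALSE` makes each band closed on the left, so that, for example, age 30 falls in the 30-50 band:

```r
Age1 <- c(18, 25, 33, 47, 52, 68)  # exact ages (hypothetical data)

# Less granular version: three age bands
Age2 <- cut(Age1, breaks = c(0, 30, 50, Inf),
            labels = c("0-30", "30-50", "50+"), right = FALSE)

data.frame(Age1, Age2)
```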
Although granularity and dimensionality are closely related, they are not the same concept. There are two main differences:
- Applicability: Dimensionality is a concept specific to categorical variables, while granularity applies to both numeric and categorical variables.
- Comparability: Although we can always order two categorical variables by dimension (all we have to do is count their numbers of factor levels and see which variable has more), it is not always possible to order them by granularity. For variable 1 to be more granular (and more informative) than variable 2, each distinct level of variable 1 should be a subset of one of the levels of variable 2, so that if we know the level of variable 1, we can always deduce the level of variable 2 (but not the other way round). This is a pretty strong requirement.
Exercise: Age variables with different degrees of dimensionality and granularity
Consider the following three variables measuring age:
- `Age1`: The exact value of age
- `Age2`: A binned version of age defined as follows:
Age Band | Range of Values |
1 | 0-30 |
2 | 30-50 |
3 | 50+ |
Age3
: A binned version of age defined as follows:
Age Band | Range of Values |
1 | 0-30 |
2 | 30-40 |
3 | 40-65 |
4 | 65+ |
Rank, as far as possible, the three age variables in order of:
- Dimensionality
- Granularity
Solution
- Only Age2 and Age3 are categorical variables and can be compared in terms of dimensionality. As Age2 has three distinct levels but Age3 has four, Age2 has a lower dimension than Age3.
- Because Age1 stores the exact value of age while Age2 and Age3 only capture an approximate range of age, Age1 is the most granular. To put it another way, knowing the value of Age1, we automatically know the levels of Age2 and Age3.
  As for Age2 versus Age3, it seems that Age3 divides the age range more finely than Age2 (four bands versus three). However, the third age band of Age3, 40-65, spans two age bands of Age2, 30-50 and 50+. In other words, we cannot always deduce the level of Age2 from the level of Age3, and there is no global order between Age2 and Age3 in terms of granularity.
Remark
- Part (b) is harder and may not be immediately obvious.
- Another way to think of granularity is that we can band Age1 (more granular) to get Age2 and Age3 (less granular), but we cannot recover Age1 from Age2 or Age3. We can only go from Age2 or Age3 to Age1 using an approximation, e.g., taking the midpoint of each age band.
- If Age3 is instead defined as:

| Age Band | Range of Values |
| --- | --- |
| 1 | 0-30 |
| 2 | 30-40 |
| 3 | 40-50 |
| 4 | 50+ |

then Age2 is less granular than Age3. Each of the four age bands above is a subset of one of the three age bands of Age2, so the level of Age3 automatically implies the level of Age2.
Linear Models: Theoretical Foundations
Model Formulation
Model Equation
Linear models postulate that the target variable \(Y\), which is assumed to be continuous, is related to \(p\) predictors \(X_1,X_2,…,X_p\) via the approximately linear relationship:
\(Y=\beta_0+\beta_1X_1+\beta_2X_2+…+\beta_pX_p+\varepsilon\)
where:
- \(p\) is the number of predictors.
- \(\beta_0,\beta_1,…,\beta_p\) are unknown regression coefficients (or parameters).
- \(\varepsilon\) is the unobservable zero-mean random error term that accounts for the fluctuation of \(Y\) about its mean.
The expected value of \(Y\), captured by the signal function:
\(E[Y]=f(X)=\beta_0+\beta_1X_1+\beta_2X_2+…+\beta_pX_p\)
is linear in the regression coefficients \(\beta_0,\beta_1,…,\beta_p\). It is, however, not necessarily linear in the predictors.
In general, the training set takes the form:

| Training Observation | Target \(Y\) | \(X_1\) | \(X_2\) | … | \(X_p\) |
| --- | --- | --- | --- | --- | --- |
| 1 | \(Y_1\) | \(X_{11}\) | \(X_{12}\) | … | \(X_{1p}\) |
| 2 | \(Y_2\) | \(X_{21}\) | \(X_{22}\) | … | \(X_{2p}\) |
| … | … | … | … | … | … |
| \(n_{tr}\) | \(Y_{n_{tr}}\) | \(X_{n_{tr},1}\) | \(X_{n_{tr},2}\) | … | \(X_{n_{tr},p}\) |

We interpret:
- \(\beta_0\) as the intercept, which is the expected value of \(Y\) when all \(X_j\)’s are zero.
- \(\beta_j\) as the regression coefficient (or slope) attached to the jth predictor for j = 1, … , p. These coefficients quantify the expected effect of the predictors on \(Y\), holding all other predictors fixed.
Model Fitting by Ordinary Least Squares
To estimate the unknown coefficients \(\beta_0,\beta_1,…,\beta_p\), we assume that we are given a training set of \(n_{tr}\) independent observations \(\{(Y_i,X_{i1},…,X_{ip})\}_{i=1}^{n_{tr}}\) governed by:
\(Y_i=\beta_0+\beta_1X_{i1}+\beta_2X_{i2}+…+\beta_pX_{ip}+\varepsilon_i,i=1,…,n_{tr},\varepsilon_i\text{ i.i.d.}\sim N(0,\sigma^2) \)
Ordinary Least Squares Approach
\(\min_{\beta_0,\beta_1,…,\beta_p}{\sum_{i=1}^{n_{tr}}{[Y_i-(\beta_0+\beta_1X_{i1}+\beta_2X_{i2}+…+\beta_pX_{ip})]^2}}\)
The solutions, denoted by \(\hat{\beta}_0,\hat{\beta}_1,…,\hat{\beta}_p\), are called the ordinary least squares estimators.
When the random errors are normally distributed, the least squares estimators \(\hat{\beta_0},\hat{\beta_1},…,\hat{\beta_p}\) coincide with the maximum likelihood estimators.
Matrix Representation of Linear Models
\(\boldsymbol{Y=X\beta+\varepsilon}\)
or \(\begin{pmatrix} Y_1\\ Y_2\\ \vdots\\ Y_{n_{tr}} \end{pmatrix}=\begin{pmatrix} 1 & X_{11} & X_{12} & \cdots & X_{1p}\\ 1 & X_{21} & X_{22} & \cdots & X_{2p}\\ \vdots & \vdots & \vdots & \ddots & \vdots\\ 1 & X_{n_{tr},1} & X_{n_{tr},2} & \cdots & X_{n_{tr},p} \end{pmatrix}\begin{pmatrix} \beta_0\\ \beta_1\\ \vdots\\ \beta_p \end{pmatrix}+\begin{pmatrix} \varepsilon_1\\ \varepsilon_2\\ \vdots\\ \varepsilon_{n_{tr}} \end{pmatrix}\)
where:
- \(\boldsymbol{Y}\) is the \(n_{tr}\times 1\) vector containing the \(n_{tr}\) values of the target variable in the training set.
- \(\boldsymbol{X}\) is the design matrix.
- \(\boldsymbol{\beta}\) is the vector of \(p+1\) regression coefficients (including \(\beta_0\)).
- \(\boldsymbol{\varepsilon}\) is the \(n_{tr}\times 1\) vector of random errors.
The least squares estimate of \(\boldsymbol{\beta}\) is given in closed form by:

\(\boldsymbol{\hat{\beta}=(X^TX)^{-1}X^TY}\)
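As a quick illustration, the closed-form estimator above can be computed directly and compared against R's lm() function. This is a minimal sketch on simulated data; all variable names are hypothetical.

```r
set.seed(42)

# Simulate a small training set with two predictors (hypothetical data)
n_tr <- 100
X1 <- rnorm(n_tr)
X2 <- rnorm(n_tr)
Y  <- 1 + 2 * X1 - 0.5 * X2 + rnorm(n_tr, sd = 0.3)

# Design matrix with a column of 1's for the intercept
X <- cbind(Intercept = 1, X1, X2)

# Closed-form least squares estimates: (X'X)^{-1} X'Y
beta_hat <- solve(t(X) %*% X, t(X) %*% Y)

# The same estimates produced by lm()
fit <- lm(Y ~ X1 + X2)
cbind(matrix_formula = drop(beta_hat), lm = coef(fit))
```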
Model Quantities
Predicted/Fitted Value
\(\hat{Y}=\hat{\beta}_0+\hat{\beta}_1X_1+…+\hat{\beta}_pX_p\)
Residual
\(e_i=Y_i-\hat{Y}_i\)
The residual (also called the raw residual) for the ith observation in the training set represents the discrepancy between the observed target value \(Y_i\) and the corresponding fitted value \(\hat{Y}_i\), and measures how well the fitted model matches the ith training observation.
The (training) Residual Sum of Squares (RSS):
\(RSS=\sum_{i=1}^{n_{tr}}{e_i^2}=\sum_{i=1}^{n_{tr}}{(Y_i-\hat{Y}_i)^2}\)
The smaller the RSS, the better the fit of the linear model to the training set.
Coefficient of Determination: \(R^2\)
Unlike the RSS (which has no upper bound), the coefficient of determination \(R^2\) is a scaled measure of the goodness of fit of a linear model. It is defined as the proportion of the variation of the target variable that can be explained by the fitted linear model:
\(R^2=1-\dfrac{RSS}{TSS}\)
where:
\(TSS=\sum_{i=1}^{n_{tr}}{(Y_i-\bar{Y})^2}\), \(\bar{Y}=\dfrac{\sum_{i=1}^{n_{tr}}{Y_i}}{n_{tr}}\)
On the training set, \(R^2\) always runs on a scale from 0 to 1, independent of the magnitude of the target variable, which makes it easy to interpret.
The higher the value of \(R^2\), the better the fit of the model to the training set.
However, as a goodness-of-fit measure, its value always increases (more precisely, it never drops) when a predictor, however good or bad, is added to a linear model. This reinforces the idea that a more complex model generally matches the training observations better and has a lower training error. A large \(R^2\) is not necessarily an indication that the model makes good predictions.
Extreme Values of \(R^2\)

- \(R^2=0\): implies that \(RSS=TSS\), which happens when \(\hat{Y}_i=\bar{Y}\) for all \(i=1,…,n_{tr}\). In this case, the fitted linear model is essentially the intercept-only model; the predictors collectively bring no useful information for understanding the target variable.
- \(R^2=1\): implies that \(RSS=0\), which happens when \(\hat{Y}_i=Y_i\) for all \(i=1,…,n_{tr}\). The model fits every training observation perfectly. Although the goodness of fit to the training set is perfect, such a model may have overfitted the data and may not do well on future, unseen data.
t-statistic
\(t(\hat{\beta}_j)=\dfrac{\hat{\beta}_j}{\text{SE}(\hat{\beta}_j)}\)
The t-statistic for a particular regression coefficient is defined as the ratio of its least squares estimate (\(\hat{\beta}_j\)) to the corresponding estimated standard deviation, often called the standard error.
It is a measure of the partial effect of the corresponding predictor on the target variable, i.e., the effect of adding that predictor to the model after accounting for the effects of other variables in the model.
Mathematically, it can be used to test the hypothesis \(H_0:\beta_j=0\), which means that the jth predictor can be dropped out of the linear model in the presence of all other predictors.
p-value:
To quantify the amount of evidence contained in the t-statistic against the null hypothesis, we often appeal to the notion of the p-value, which is the probability that the test statistic takes a value which is as extreme as or more extreme than its observed value, given that the null hypothesis is true.
The smaller the p-value, the more evidence we have against the null hypothesis in favor of the alternative hypothesis and the more important the predictor that is being tested.
F-statistic
Unlike the t-statistic, which is for testing a single regression coefficient, the F-statistic is for testing \(H_0: \beta_1=\beta_2=…=\beta_p=0\), which assesses the joint significance of the entire set of predictors (excluding the intercept), against the alternative \(H_a: \text{at least one of }\beta_1,…,\beta_p\text{ is non-zero}\).
If \(H_0\) is rejected, then we have strong evidence that at least one of the p variables is an important predictor for the target variable. However, the test itself does not say which variables are really predictive – further analysis is required.
In Exam PA, traditional statistical inference tools such as t-statistics and F-statistics are not heavily used.
Model Evaluation and Validation
Performance Metrics
RMSE
The target variable of a linear model is numeric (in fact, continuous), so a commonly used measure of the predictive performance of the model on unseen data is the test RMSE.
Other performance measures include the loglikelihood and the \(R^2\) computed on the test set.
Among the three metrics, the RMSE is most often used because it is the most interpretable (due to it having the same unit as the target variable) and gives us a sense of the typical magnitude of the prediction error.
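For concreteness, here is a minimal sketch of computing the test RMSE in R, assuming a fitted lm object `fit` and a holdout data frame `test_data` whose target column is named Y (both names are placeholders):

```r
# Predictions on unseen (test) data
pred <- predict(fit, newdata = test_data)

# Test RMSE: typical magnitude of the prediction error, in the units of Y
test_rmse <- sqrt(mean((test_data$Y - pred)^2))
test_rmse
```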
For linear models and, more broadly, GLMs, the target variable is assumed to follow a particular distribution (e.g., normal, Poisson, gamma, binomial), so we can also employ performance metrics based on penalized likelihood. Here are two such metrics commonly used to rank competing models:
- Akaike Information Criterion (AIC): \(AIC=-2l+2p\), where \(l\) is the maximized loglikelihood of the model on the training set and \(p\) is the number of parameters (degrees of freedom). The smaller the AIC, the better the model, so when the AIC is used as the model selection criterion, the best model is the one with the smallest AIC.
- Bayesian Information Criterion (BIC): \(BIC=-2l+p\ln n_{tr}\)
Both criteria add a penalty per parameter to guard against overfitting: the penalty equals 2 for the AIC and the logarithm of the number of training observations, \(\ln(n_{tr})\), for the BIC.
For almost all reasonable values of \(n_{tr}\), the penalty per parameter is higher for the BIC than for the AIC, so the BIC is more stringent than the AIC for complex models and tends to result in a simpler final model. BIC represents a more conservative approach to feature selection.
Remark:
- In Exam STAM, the AIC and BIC are defined as \(AIC^{\text{STAM}}=l-p\) and \(BIC^{\text{STAM}}=l-\dfrac{p}{2}\ln n\); the Exam PA versions above equal \(-2\) times these quantities.
- Degrees of freedom: \(p\) = number of features + 1 for the intercept, or equivalently df_null - df_fitted + 1.
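In R, both criteria can be extracted from a fitted lm or glm object with AIC() and BIC(). A small sketch, assuming a hypothetical data frame train_data; note that R's parameter count for a Gaussian linear model also includes the estimated error variance, so it may differ by one from the \(p\) used above.

```r
fit1 <- lm(Y ~ X1, data = train_data)        # smaller model
fit2 <- lm(Y ~ X1 + X2, data = train_data)   # larger model

AIC(fit1, fit2)   # smaller AIC is preferred
BIC(fit1, fit2)   # BIC penalizes each parameter by log(n_tr) instead of 2
```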
Model Diagnostics
In general, model diagnostics refers to a set of quantitative and graphical tools that are used to identify evidence against the model assumptions. If any apparent deficiencies are present, we may refine the specification of the model taking these deficiencies into account. For this purpose, the residuals of the linear model serve as proxies for the unobservable random errors and will play a pivotal role.
Commonly used quantitative diagnostic tools for linear models include:
- Residuals (taken individually or as a group, to detect abnormality with respect to the target variable values), and their standardized counterparts
- Leverages (to detect abnormality with respect to predictor variable values)
- Cook’s distance (to detect influential observations, which are abnormal with respect to both the target variable values and predictor variable values)
In Exam PA, these quantitative tools are de-emphasized. In fact, there is hardly any mention of leverages and Cook’s distance in the PA modules, and only graphical or visual tools are discussed. Two of the most important plots are:
“Residuals vs Fitted” plot
This plot can be used to check the model specification as well as the homogeneity of the error variance. If the relationship between the target variable and the predictors is correctly described and the random errors possess the same variance, then the residuals, as a function of the fitted values, should display no discernible patterns and spread symmetrically in either direction (positive and negative).
Any systematic patterns (e.g., a U-shape) and significantly non-uniform spread in the residuals (e.g., a funnel shape) are symptomatic of an inadequate model equation and heteroscedasticity (i.e., the random errors have non-constant variance), respectively.
“Normal Q-Q” plot
This plot graphs the empirical quantiles of the standardized residuals (which are the residuals divided by their standard error to facilitate comparison) against the theoretical standard normal quantiles and can be used for checking the normality of the random errors.
If the random errors are indeed normally distributed, then the residuals, as proxies for the random errors, should also be normally distributed. In this case, the points in the Q-Q plot are expected to lie closely on the 45° straight line passing through the origin. Systematic departures from the 45° straight line, often in the two tails, suggest that the normality assumption is not entirely fulfilled and a distribution with a heavier tail for the target variable is warranted.
[Sample Q-Q plot 1] The points lie closely on the straight line, supporting the normality assumption.
[Sample Q-Q plot 2] The points deviate substantially from the straight line, especially in the two tails. It appears that the normal distribution is not enough to capture the skewness of the data; we may need to specify a more heavy-tailed distribution.
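Both plots are produced automatically by base R for a fitted lm object; a minimal sketch (fit is a placeholder for any fitted linear model):

```r
par(mfrow = c(1, 2))   # show the two diagnostic plots side by side
plot(fit, which = 1)   # "Residuals vs Fitted": look for patterns or a funnel shape
plot(fit, which = 2)   # "Normal Q-Q": points should hug the 45-degree line
```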
Feature Generation
This important subsection addresses the practical issue of how to specify a useful linear model in the first place, or equivalently, how to fill in the right-hand side of the model equation:
\(Y=\beta_0+ \boxed{\beta_1 X_1+…+\beta_p X_p}+\varepsilon\)
This issue in turn boils down to two inter-related questions:
1. How do we represent different types of predictors in a linear model?
To represent predictors properly in a linear model, we have to clearly distinguish between numeric and categorical predictors. These two types of predictors come with different considerations and we will treat them separately.
2. Having solved (1), how do we configure the model equation to accommodate different kinds of relationships between the target variable and predictors?
Even after we identify a list of potentially useful predictors (usually directly provided in the exam), we can fine-tune the form in which these predictors enter the model equation so that the model can "absorb" the information contained in the data more effectively and capture more complex relationships (e.g., non-linear, non-additive) between the target variable and the predictors. To this end, we generate new features that can be added to the model to enhance its flexibility and, hopefully, its prediction accuracy.
Numeric Predictors
Basic Form
For the model equation \(Y=\beta_0+\beta_1 X_1+…+\beta_p X_p+\varepsilon\), the regression coefficient \(\beta_1\) can be interpreted as follows:
A unit increase in \(X_1\) is associated with an increase of \(\beta_1\) (the coefficient of \(X_1\)) in \(Y\) on average, holding all other predictors fixed (we do not want other predictors to interfere).
There are a number of ways to tweak a linear model to handle more complex, non-linear relationships that arise commonly in practice. In what follows, we will learn three relatively simple methods, each with its merits and drawbacks.
Method 1: Polynomial Regression
In situations where the relationship between Y and X does not appear to be linear, it may be desirable to relax the linear form of the regression function and expand it to higher powers of X using polynomial regression:
\(Y=\beta_0+ \boxed{\beta_1 X+\beta_2 X^2+…+\beta_m X^m}+\varepsilon\)
- Pros: By means of polynomial regression, we are able to capture substantially more complex relationships between the target variable and predictors than linear ones. The more polynomial terms included (i.e., the larger the value of \(m\)), the more flexible the fit that can be achieved.
- Cons:
  - The regression coefficients in a polynomial regression model are much more difficult to interpret. It is not easy to give a precise meaning to, for example, \(\beta_1\): we cannot say \(\beta_1\) is the expected change in \(Y\) due to a unit change in \(X\), holding all other variables fixed, because the other polynomial terms \(X^2,…,X^m\) cannot be kept fixed when \(X\) varies.
  - There is no simple rule as to what value of \(m\) should be chosen. Exploratory data analysis may help, but if \(m=2\) works well, can \(m=4\) do an even better job? It is often an iterative exercise to decide on the optimal value of \(m\). In practice, it is unusual to use \(m\) larger than 3 or 4. For large values of \(m\), the polynomial function can become overly flexible and assume an erratic shape, especially near the boundary of the numeric variable. The estimated coefficients may become widely varying in magnitude (partly due to the polynomial terms \(X,…,X^m\) being highly correlated) and the polynomial function will twist and turn in an attempt to hit all the training observations, leading to overfitting.
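A minimal sketch of fitting a polynomial regression in R (a data frame dat with columns Y and X is hypothetical). By default, poly() uses orthogonal polynomial terms, which mitigates the correlation among the raw powers mentioned above; raw = TRUE would give the raw powers instead.

```r
# Cubic polynomial regression of Y on X
fit_poly <- lm(Y ~ poly(X, degree = 3), data = dat)

# Equivalent flexibility using raw powers X, X^2, X^3
fit_raw <- lm(Y ~ X + I(X^2) + I(X^3), data = dat)

summary(fit_poly)
```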
Exercise: Trade-offs coming from the choice of \(m\)
Two actuaries are analyzing the same set of training data with a quantitative target variable Y and a single quantitative predictor X.
Actuary A fits a quadratic regression model \(Y=\beta_0+\beta_1 X+\beta_2 X^2+\varepsilon\)
Actuary B fits a cubic regression model \(Y=\beta_0+\beta_1 X+\beta_2 X^2+\beta_3 X^3+\varepsilon\)
- Explain why the degree of a polynomial regression function is a hyperparameter.
- Compare the two polynomial regression models in terms of bias and variance.
- Suggest two ways to choose between the two polynomial regression models.
Solution
- The degree of a polynomial regression function is a hyperparameter because it is a parameter that we have to supply in advance and that is not optimized by the model fitting algorithm.
  Remark. Most hyperparameters can serve as indices for model complexity. For instance, the degree of a polynomial regression function is positively related to the complexity of a linear model. If we base our choice of such hyperparameters on goodness-of-fit measures, we will always end up choosing the parameter values that lead to the most complex model.
- The cubic regression model is more complex than the quadratic regression model and includes the latter as a special case when \(\beta_3=0\). As a result, the cubic model has a lower (squared) bias but a higher variance. The additional degree of freedom coming from the cubic term provides the model with greater flexibility to capture the relationship between \(Y\) and \(X\), but makes the model more vulnerable to overfitting.
- Here are a few ways to rank the two polynomial regression models:
  - Traditional: A traditional method is to test the hypothesis \(H_0:\beta_3=0\) using a t-test. If the p-value of the test is sufficiently small, then we will reject the null hypothesis (quadratic) in favor of the alternative hypothesis (cubic).
  - Modern 1: A more modern method is to use cross-validation and choose the model with the lower cross-validation error.
  - Modern 2: We can also compare the AIC of the two models and choose the one with the lower AIC.
Remark. The hypothesis test relies critically on the relationship between the two models (the quadratic model is a special case of the cubic model), while the two modern methods, especially cross-validation, are more general and widely applicable.
Method 2: Binning-Using Piecewise Constant Functions
An alternative to polynomial regression for incorporating non-linearity into a linear model is to not treat the numeric variable as numeric. Rather, we “bin” (or “band”) the numeric variable and convert it into an ordered categorical variable whose levels are defined as non-overlapping intervals (the “bins”) over the range of the original variable. This modeling technique is known as binning, which was featured in quite a few recent PA exams.
Each level is represented by a dummy variable, which in turn receives a separate regression coefficient in the model equation. Effectively, the regression function becomes piecewise constant-it is “constant” over different “pieces” in the range of the numeric variable.
- Pros: Binning liberates the regression function from assuming any particular shape. There is no definite order among the coefficients of the dummy variables corresponding to different bins, which allows the target mean to vary irregularly over the bins. The larger the number of bins used, the wider the variety of relationships between the target variable and the original numeric predictor we can fit, and the more flexible the model. As a result, its bias drops (but its variance increases).
- Cons:
  - Like polynomial regression, there is no simple rule as to how many bins and, more importantly, how the associated boundaries should be selected. In most cases, these choices are rather ad hoc and depend on the business problem at hand. To model the relationship between \(X\) and the target mean effectively, we often need a large number of bins. However, using an excessive number of bins may lead to sparse categories and unstable coefficient estimates.
  - Binning also results in a loss of information. With binning, we are turning the exact values of \(X\) into a set of bands containing those values and ignoring the variation of the target variable values within each of those bands. Unless we divide the range of \(X\) into a large number of bins, the loss of information can be substantial.
  - (Relatively minor) If \(X\) is replaced by its binned version, then the regression function becomes discontinuous in \(X\). A small change in the value of \(X\) may lead to an abrupt change in the target mean, which may not be desirable.
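In R, binning is typically done with cut(), which converts a numeric vector into a factor whose levels are the chosen bins. A sketch with illustrative break points (dat, X, and Y are placeholders):

```r
# Bin the numeric predictor X into a 4-level factor (break points are illustrative)
dat$X_band <- cut(dat$X,
                  breaks = c(-Inf, 30, 50, 65, Inf),
                  labels = c("<=30", "30-50", "50-65", "65+"))

# Each non-baseline band receives its own dummy variable and coefficient,
# so the fitted regression function is piecewise constant in X
fit_bin <- lm(Y ~ X_band, data = dat)
summary(fit_bin)
```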
Method 3: Using Piecewise Linear Functions
An extension of binning is to use piecewise linear (rather than piecewise constant) functions. The regression function is “linear” over different intervals (“pieces”) in the range of the numeric variable, with the straight lines typically connected continuously. To make this happen, we can add “call payoff” functions of the form:
\((X-c)_+:=\max(0,X-c)\)
These functions are precisely the payoff function of call options, if you think of \(X\) as the stock price at the time of exercise and \(c\) as the strike price. Consider, for example, the linear model defined by:
\(E[Y]=\beta_0+\beta_1X+\boxed{\beta_2(X-c)_+}\)
The regression function is linear over each of the two intervals, \((-\infty,c)\) and \((c,\infty)\), but the slope of the regression function changes abruptly from \(\beta_1\) to \(\beta_1+\beta_2\) beyond \(X=c\).
- Pros
- Simple but powerful: The use of piecewise linear regression functions is a simple but powerful way to handle non-linearity. As we add more and more call payoff functions with different strike prices, we are able to fit a wider range of non-linear relationships. Unlike binning, this method recognizes the variation of the target mean within each piece and retains the original values of \(X\).
- Interpretation: Unlike the slope coefficients of a polynomial regression model, the slope coefficients of a piecewise linear regression model can be easily interpreted as the change in slope of the regression function at the break points.
- Cons
- To construct a piecewise linear regression function, the break points (the strike prices) must be user-specified in advance, based on exploratory data analysis and/or the modeler's prior knowledge, before the model is fitted. This is the same problem suffered by polynomial regression and binning.
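A sketch of adding a call-payoff (hinge) feature at a user-chosen break point in R (the break point 50 and the names dat, X, Y are illustrative):

```r
c_break <- 50   # user-specified break point (illustrative)

# Hinge / call-payoff feature: (X - c)_+ = max(0, X - c)
dat$X_hinge <- pmax(dat$X - c_break, 0)

# Slope is beta_1 for X < c and beta_1 + beta_2 for X > c
fit_pw <- lm(Y ~ X + X_hinge, data = dat)
coef(fit_pw)
```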
Categorical Predictors
Binarization
To properly incorporate a categorical predictor into a linear model, we have to perform an extra operation known as binarization, which is an important feature generation method.
As its name suggests, binarization turns a given categorical predictor into a collection of artificial “binary” variables (also known as dummy or indicator variables), each of which serves as an indicator of one and only one level of the categorical predictor.
More precisely, for each level of the categorical predictor, the process creates a binary variable that equals 1 in every row of the dataset where the categorical predictor assumes that level, and 0 otherwise. This way, the binary variable highlights whether each observation in the dataset does or does not possess a certain characteristic.
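In R, binarization happens automatically when a factor enters a model formula; model.matrix() lets you inspect the dummy variables that are actually created. A small sketch with a made-up factor:

```r
region <- factor(c("East", "South", "West", "East", "South"))

# One dummy (indicator) column per non-baseline level;
# "East" is the baseline because it comes first alphabetically
model.matrix(~ region)
```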
Baseline Level and Interpretation of Coefficients
Now that we have seen how binarization works, let’s turn to a generic categorical predictor. Here’s the general rule:
To incorporate a categorical predictor with \(r(\ge2)\) levels into a linear model (and, more generally, GLMs), we need to introduce \(r-1\) (i.e., the total number of levels less one) dummy variables, each of which serves as a separate predictor in the model.
We call the left-out level the baseline level (a.k.a. the base or reference level). The other \(r-1\) levels are represented by \(r-1\) dummy variables, say \(D_1,…,D_{r-1}\), each of which indicates one and only one of the non-baseline levels. These dummy variables then enter the model equation, each with a single regression coefficient assigned, so the model equation takes the form below (the boxed part represents the \(r\)-level categorical predictor).
\(E[Y]=\beta_0+\boxed{\beta_1D_1+…+\beta_jD_j+…+\beta_{r-1}D_{r-1}}+\text{…{Other Predictors}…}\)
- Interpretation of \(\boldsymbol{\beta_0}\): The intercept \(\beta_0\) represents the value of \(E[Y]\) when the categorical predictor is at its baseline level (so \(D_1=….=D_{r-1}=0\)) and all other predictors are equal to zero or at their baseline levels.
- Interpretation of slopes: The coefficients of the dummy variables can be interpreted as:
\(\beta_j=E[Y^{\text{jth Level}}]-E[Y^{\text{Baseline Level}}],\text{ where }j=1,…,r-1\)
or, in words, the difference between the expected value of the target variable when the categorical predictor is in the jth level (a non-baseline level) and that when the categorical predictor is in the baseline level, holding all other predictors fixed.
You can see that the baseline level indeed serves as the “baseline” for comparing the target means in different levels. If \(\beta_j\) is positive (resp. negative), then the target variable in the jth level tends to be higher (resp. lower) than the target variable in the baseline level, so the jth level has a positive (resp. negative) effect on the target relative to the baseline level.
It is important to note that it is inaccurate, if not incorrect, to say that \(\beta_j\) represents the change in the target mean per unit change in \(D_j\), everything else equal (this interpretation is only for numeric predictors). The reasons are:
- \(D_j\), as a dummy variable, can only take the values of 0 and 1. It cannot undergo an arbitrary unit change, for example, from 3 to 4.
- When the value of \(D_j\) changes, the values of other dummy variables will also change, so it is impossible to keep all other variables fixed.
- Interpretation of the p-values: The p-value for a given dummy variable quantifies our confidence that the corresponding regression coefficient is non-zero, or equivalently, our confidence that the target mean in that level is significantly different from the target mean in the baseline level. The lower the p-value, the more confidence we have about the significant difference.
- Danger of a high-dimensional categorical predictor: Each non-baseline level receives a separate regression coefficient. The larger the value of \(r\), the more dummy variables are needed. A high dimensional categorical predictor (i.e., one with a large number of factor levels) will therefore consume a lot of degrees of freedom and lead to a bulky linear model. The high dimension may also mean that some categories are sparse and make the linear model prone to overfitting. If desirable, it is worth combining some of the sparse categories to ensure that each category contains a reasonable number of observations.
Exercise: Interpreting the coefficients of a linear model with dummy variables
Consider a multiple linear regression model for Balance on Region, which is a categorical predictor with three levels: “East” (baseline level), “South”, and “West”. Here is part of the R output for the fitted model:
| Coefficient | Estimate | Std. Error | t value | \(\Pr(>\vert t\vert)\) |
| --- | --- | --- | --- | --- |
| (Intercept) | 531.00 | 46.32 | 11.464 | <2e-16 *** |
| RegionSouth | -12.50 | 56.68 | -0.221 | 0.826 |
| RegionWest | -18.69 | 65.02 | -0.287 | 0.774 |
Interpret the coefficient estimates of the fitted model.
Solution
| Coefficient | Coefficient Estimate | Interpretation |
| --- | --- | --- |
| Intercept | 531.00 | The expected value of Balance is 531.00 when Region is in the "East" level (baseline). |
| RegionSouth | -12.50 | The expected value of Balance is 12.50 lower (note the negative sign) when Region is in the "South" level than in the "East" level. |
| RegionWest | -18.69 | The expected value of Balance is 18.69 lower when Region is in the "West" level than in the "East" level. |
Exercise: Comparing two non-baseline levels
For the r-level categorical predictor above, express the difference between the target mean in the jth level and the target mean in the kth level, both of which are non-baseline levels, in terms of appropriate regression coefficients.
Solution
The target mean when the categorical predictor is in the jth level is:
\(E[Y^{\text{@jth Level}}]=\beta_0+\beta_j(1)+\text{(terms for other predictors)}\)
Similarly, the target mean when the categorical predictor is in the kth level is:
\(E[Y^{\text{@kth Level}}]=\beta_0+\beta_k(1)+\text{(terms for other predictors)}\)
Taking the difference between the two target means and assuming that all other predictors are held fixed, we get:
\(\beta_j-\beta_k=E[Y^{\text{@jth Level}}]-E[Y^{\text{@kth Level}}]\)
Remark. This exercise shows that we are not limited to making comparisons relative to the baseline level; comparing two non-baseline levels is possible.
Choice of the Baseline Level
How do we go about choosing the baseline level? In R, the convention is to select the level that comes first alpha-numerically as the baseline level, although the default can be overridden to set the baseline level to any desired level, with respect to which comparisons of interest are made. Note that the choice of the baseline level has no effect on the predicted values generated by the linear model or the overall quality of the model (because the baseline level does not change the information contained by the underlying categorical predictor), but it does affect the interpretation of the coefficient estimates (because you are changing the reference level) and possibly the results of hypothesis tests for assessing the significance of individual factor levels.
If there is no obvious choice for the baseline level, then:
a common practice is to set the baseline level to the most common level, i.e., the level that carries the largest number of observations.
This will tend to make the model quantities more reliable, especially those related to variable significance. To understand why, recall that the coefficient estimate for a dummy variable of a factor level is the difference between the estimated target mean in that level and the estimated target mean in the baseline level.
The more observations are in the two levels, the more precise the two estimated target means. If we set the baseline to a populous level, then we can ensure that at least one of the estimated target means is relatively reliable.
Releveling a categorical predictor so that its baseline level equals the most common level is generally recommended, but doing so is not strictly necessary. That is why most, but not all past PA exams do the releveling. In some cases, there is a natural order among the levels of a categorical variable and this order aligns monotonically with the target variable (e.g., the occupation variable in the December 2019 exam; also see Exercise 3.1.10). Then it is fine to respect this order and take the first level or the last level as the baseline to make interpretations easy and meaningful, even if these levels are not the most populous.
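A sketch of overriding R's alphabetical default so that the most common level becomes the baseline, using relevel() (dat, Region, and Balance are placeholder names, and Region is assumed to already be a factor):

```r
# Inspect level frequencies to find the most populous level
table(dat$Region)

# Suppose "South" carries the most observations; make it the baseline level
dat$Region <- relevel(dat$Region, ref = "South")

# Coefficients of the refitted model are now interpreted relative to "South"
fit <- lm(Balance ~ Region, data = dat)
summary(fit)
```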
Should we treat a numeric variable as a factor?
In real datasets, it is not uncommon that some variables can function both as a numeric variable and a categorical (factor) variable. By default, these variables will be treated by R as numeric, but it may be more desirable to convert them to a factor variable. Typically, these numeric variables have the following characteristics:
- They are coded by a small number of numeric values (usually integers).
- The values of these variables are merely numeric labels without a sense of numeric order.
- These variables have a complex, non-linear relationship with the target variable.
An example of such a numeric variable is:

- weathersit = 1, if clear or partly cloudy
- weathersit = 2, if there is mist
- weathersit = 3, if there is rain or snow
which was featured in the December 7, 2020 exam. This variable uses the three integers, 1, 2, 3, as labels to identify the three possible weather situations. Although the variable is coded by numbers, the three integers have no natural order and the difference between these values has no numeric meaning. We cannot say, for example, “2” is less than “3” in a meaningful way.
What difference does it make if we take weathersit as a numeric variable or as a factor variable in a linear model?
Case 1
If we treat weathersit as a numeric variable and give it one regression coefficient, i.e.,
\(E[Y]=\beta_0+\beta_1\times \text{weathersit}+…\)
we are effectively assuming that the target mean will vary linearly with the three integers, 1, 2, 3, representing the three possible weather situations. In general, this linearity is a (potentially severe) restriction imposed on the target mean.
Case 2
If we treat weathersit as a factor variable, it will be binarized and represented by two dummy variables. If the first level (clear or partly cloudy) serves as the baseline level, then the model equation is:
\(E[Y]=\beta_0+\beta_1D_{\text{mist}}+\beta_2D_{\text{rain or snow}}+…\)
This will provide the linear model with more flexibility in capturing the effect of weathersit
on the target mean. Depending on the sign and magnitude of the regression coefficients \(\beta_1\) and \(\beta_2\), the relationship between weathersit
and the target mean can be non-linear or non-monotonic over the three factor levels. Of course, if weathersit
has a large number of distinct levels, then there will be a large number of dummy variables, inflating the dimension of the model and worsening the extent of overfitting.
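A sketch contrasting the two treatments in R, with weathersit coded 1/2/3 as above (the data frame name bike and target name cnt are placeholders):

```r
# Case 1: weathersit treated as numeric -- one coefficient, forcing a linear effect
fit_num <- lm(cnt ~ weathersit, data = bike)

# Case 2: weathersit treated as a factor -- one dummy per non-baseline level,
# allowing a non-linear / non-monotonic effect across the three levels
bike$weathersit <- factor(bike$weathersit,
                          labels = c("clear", "mist", "rain_snow"))
fit_fac <- lm(cnt ~ weathersit, data = bike)

summary(fit_fac)
```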
Exercise: Numeric vs. factor
Determine whether each of the following variables should be treated as factor variables in the context of a linear model. Justify your decisions. (Note: There is not a uniquely correct answer in most cases.)
- \(\text{Season}=\Biggl\{\begin{align*} 1 &, \text{ for Winter}\\ 2 &, \text{ for Spring}\\ 3 &, \text{ for Summer}\\ 4 &, \text{ for Fall} \end{align*}\)
- Age, an integer variable ranging from 10 to 90
- \(\text{Smoking}=\Biggl\{\begin{align*} 0 &, \text{ for a non-smoker}\\ 1 &, \text{ for a smoker} \end{align*}\)
- Calendar year, an integer variable ranging from 2011 to 2015
Solution
- Season: Although there is some sense of order from 1 to 4, representing the flow of time over a year, there is no guarantee that the target mean will vary linearly with the four numeric values. Similar to weathersit above, it may be a good idea to convert Season to a factor variable so as to capture seasonal (or cyclical) effects more effectively.
- Age: Treating Age as a factor variable will give rise to a total of 80 dummy variables. Such a large number of dummy variables means an extremely complex linear model prone to overfitting, so it is advisable to take Age as a numeric variable.
  Remark
  1. If Age behaves non-linearly with respect to the target variable, then one may attempt polynomial regression.
  2. A similar variable is hour of the day, ranging from 0 to 23, which appears in the December 7, 2020 and December 8, 2020 exams.
- Smoking: Because Smoking is a binary variable, it makes no difference whether we treat it as a numeric variable or a factor variable. If we convert it to a factor variable, the corresponding dummy variable is:
  \(\text{Smoking}=\Biggl\{\begin{align*} 0 &, \text{ for a non-smoker}\\ 1 &, \text{ for a smoker} \end{align*}\)
  which is numerically the same as the original Smoking variable.
  Remark. Whether Smoking is a factor variable may affect the ggplots you construct when it serves as the color or fill aesthetic.
- Calendar year: This variable is tricky. It has a rather small number of distinct values and converting it to a factor variable (with five levels) allows us to capture the time trend effect more flexibly. However, a predictive model (linear models included) using calendar year as a categorical predictor will have trouble making predictions in the future. Year 2022, for example, does not belong to any of the five levels, meaning that the model is unable to make a prediction for that year. If calendar year serves as a predictor, it is necessary to retain it as a numeric variable so that we can make predictions by extrapolating the time trend observed in the training data to future years.
Interactions
Definition of an Interaction
In predictive analytics, we say that an interaction arises if the expected effect of one predictor on the target variable depends on the value (or level) of another predictor.
Interaction vs. correlation: It is true that correlations and interactions sound similar in English, but they are radically different concepts in predictive analytics. While correlations entail a pair of numeric variables, interactions, by definition, are always in the context of how two (or more) predictors relate to a target variable and involve the relationship among three variables.
Even if \(X_1\) and \(X_2\) are highly correlated (and “interact” in the usual sense), this has nothing to do with whether the relationship between \(X_1\) (resp. \(X_2\)) and \(Y\) varies with the value of \(X_2\) (resp. \(X_1\)).
Interactions between different types of predictors
Case 1: Interactions between two numeric predictors
To illustrate how interaction between numeric predictors manifests mathematically, consider the linear model:
\(Y=\beta_0+\beta_1X_1+\beta_2X_2+\boxed{\beta_3X_1X_2}+\varepsilon\)
where \(X_1\) and \(X_2\) are numeric predictors. Here the product term \(X_3:=X_1X_2\) is called an interaction variable or interaction term, which is treated as an additional feature with a separate regression coefficient \(\beta_3\) attached. Writing the model equation as:
\(E[Y]=\beta_0+\boxed{(\beta_1+\beta_3X_2)}X_1+\beta_2X_2=\beta_0+\beta_1X_1+\boxed{(\beta_2+\beta_3X_1)}X_2\)
we can see that the interaction stems from the fact that:
- A unit increase in \(X_1\) will now increase \(E[Y]\) by \(\beta_1+\beta_3X_2\), which depends on the value of \(X_2\).
- Equivalently, a unit increase in \(X_2\) will increase \(E[Y]\) by \(\beta_2+\beta_3X_1\), which depends on the value of \(X_1\).
Therefore, the impact of each \(X_j\) on \(Y\) varies with the value assumed by the other predictor, and we say that \(X_1\) and \(X_2\) interact with each other to affect \(E[Y]\). We can interpret \(\beta_3\), the coefficient of \(X_1X_2\), as follows:
For every unit increase in \(X_2\) (resp. \(X_1\)), the expected effect of \(X_1\) (resp. \(X_2\)) on \(E[Y]\) increases by \(\beta_3\).
Terminology-wise, we say that the terms \(\beta_1X_1\) and \(\beta_2X_2\) in the model equation capture the main effects due to the two numeric predictors, and \(\beta_3X_1X_2\) produces the interaction effect.
Case 2: Interactions between numeric and categorical predictors
The interaction between a numeric predictor and a categorical predictor has a nice geometric representation that can be portrayed more prominently than the interaction between two numeric predictors. To see this, we consider a linear model with one numeric predictor \(X_1\) and a binary variable \(X_2\) (representing a certain binary predictor). Apart from the regression coefficients \(\beta_1\) and \(\beta_2\) assigned respectively to \(X_1\) and \(X_2\), an additional coefficient \(\beta_3\) is attached to the interaction term \(X_1X_2\). The model equation is:
\(E[Y]=\beta_0+\beta_1X_1+\beta_2X_2+\boxed{\beta_3X_1X_2}\)

which can be rewritten as:
\(E[Y]= \Biggl\{\begin{align*}
& \beta_0+\beta_1X_1, \text{ if }X_2=0\\
& (\beta_0+\beta_2)+(\beta_1+\beta_3)X_1,\text{ if }X_2=1
\end{align*}\)
which carries two separate straight lines with respect to \(X_1\), one for \(X_2=0\) and one for \(X_2=1\). Note that the two lines possess not only different intercepts, but also different slopes with respect to \(X_1\) due to the interaction term \(X_1X_2\).
To assess the extent of interaction graphically in this set-up, we can use the ggplot2 package to make a scatterplot of the target variable against \(X_1\), distinguish the data points according to the levels of \(X_2\) by color using the color aesthetic, and fit a separate regression line or curve to each group using the geom_smooth() function. If the two fitted lines or curves differ considerably in shape, then the interaction effect is remarkable and should be taken care of in the linear model.
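A sketch of this graphical check with ggplot2 (assuming a data frame dat with a numeric X1, a factor X2, and target Y):

```r
library(ggplot2)

# Fit a separate line per level of X2; clearly different slopes suggest an interaction
ggplot(dat, aes(x = X1, y = Y, color = X2)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = FALSE)

# The corresponding linear model with an interaction term
# (X1 * X2 expands to X1 + X2 + X1:X2)
fit_int <- lm(Y ~ X1 * X2, data = dat)
summary(fit_int)
```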
In general, to capture the interaction between a numeric predictor \(X_1\) and an r-level categorical predictor, we should multiply \(X_1\) by the dummy variable of each of the \(r-1\) non-baseline levels and use all of these \(r-1\) product terms as interaction variables.
Case 3: Interactions between two categorical predictors
Algebraically, the interaction between two categorical predictors can be dealt with in the same way as that in Cases 1 and 2. The interaction term is still a product of appropriate features. If the two categorical predictors are binary and represented by the binary variables \(X_1\) and \(X_2\), then the model equation takes the same form:
\(E[Y]=\beta_0+\beta_1X_1+\beta_2X_2+\boxed{\beta_3{X_1}X_2}\)
which can be rewritten as:
\(E[Y]= \Biggl\{\begin{align*}
& \beta_0, \text{ if }X_1=X_2=0\\
& \beta_0+\beta_1, \text{ if }X_1=1,X_2=0\\
& \beta_0+\beta_2, \text{ if }X_1=0,X_2=1\\
& \beta_0+\beta_1+\beta_2+\beta_3, \text{ if }X_1=X_2=1
\end{align*}\)
Graphically, instead of using scatterplots, we may use boxplots of the target variable split by the levels of one categorical predictor faceted by the levels of another categorical predictor to assess the extent of interactions between the two predictors.
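A sketch of the faceted boxplots with ggplot2 (dat, X1, X2, and Y are placeholders; X1 and X2 are factors):

```r
library(ggplot2)

# Boxplots of Y split by X1, faceted by the levels of X2; patterns that change
# noticeably from one facet to another suggest an interaction between X1 and X2
ggplot(dat, aes(x = X1, y = Y)) +
  geom_boxplot() +
  facet_wrap(~ X2)
```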
Sidebar: Collinearity
Many of the feature generation techniques introduced in this subsection serve to increase the flexibility of a linear model to better capture the signal in the data. As we learned earlier, a potential side effect of increasing the flexibility of a model is overfitting, characterized by predictions with low bias but high variance. For linear models, high variance can also be induced by a phenomenon known as collinearity (a.k.a. multicollinearity). Although there is hardly any mention of collinearity in the PA e-learning modules, this concept is one of the most common challenges when exploring and preprocessing data in practice and is the theme of some past exam tasks, so it deserves some discussion.
Definition
Collinearity exists in a linear model when two or more features are closely, if not exactly, linearly related.
An example of a linear model with perfect collinearity is one that includes the dummy variables of all levels of a categorical predictor; the dummy variables always sum to 1. Consider, for instance, the linear model:
\(Y=\beta_0+\beta_1\times\text{SmokingSmoker}+\beta_2\times\text{SmokingNon-Smoker}+\beta_3\times\text{SmokingUnknown}+\varepsilon\)
This model has perfect collinearity because SmokingSmoker + SmokingNon-Smoker + SmokingUnknown = 1 for every observation.
The concept of multicollinearity also sheds further light on why it is a good idea to choose a common factor level as the baseline level of a categorical predictor. If the baseline level is very sparse, then most observations in the data do not belong to this level, so its associated dummy variable \(D_r\) is mostly equal to 0. This in turn implies that:
\(D_1+…+D_{r-1}\approx D_1+…+D_{r-1}+D_r \text{ always }=1\)
raising the issue of collinearity. This will be much less a problem if the baseline level has a sufficient number of observations.
Problems with Collinearity
Intuitively, the presence of collinear variables means that some of the variables do not bring much additional information because their values can be largely deduced from the values of other variables, leading to redundancy. Technically, the result of an exact linear relationship among some of the features is a rank-deficient linear model, which means that the coefficient estimates of the model cannot be determined uniquely (although they still exist).
Remark: Rank is a concept in linear algebra. In regression analysis, “rank” refers to the rank of the design matrix \(\boldsymbol{X}\), which has \(p+1\) columns. For a rank-deficient model, the rank of \(\boldsymbol{X}\) is strictly smaller than \(p+1\). In this case, the matrix \(\boldsymbol{X^TX}\) is singular and there are no unique solutions for the least-squares coefficient estimates \(\boldsymbol{\hat{\beta}}=\boldsymbol{(X^TX)^{-1}X^TY}\)
In practice, a rank-deficient model is not much of a concern because R is smart enough to detect exact collinearity and drop some of the features to yield a linear model of full rank. What is much more worrying is when some features are very strongly, but not exactly linearly related. In this case, R will not alert you to the presence of collinear variables, but the coefficient estimates may exhibit high variance (this is the “variance inflation” phenomenon in Exam SRM), which means that there is a great deal of uncertainty or instability in the coefficient estimates.
You may see counter-intuitive, nonsensical model results such as a wildly large positive coefficient estimate for one feature accompanied by a similarly large negative coefficient estimate for its correlated counterpart. Because variables that are strongly correlated with a given feature tend to move together, it is misleading in this case to interpret a coefficient estimate as the expected impact of the corresponding feature on the target variable when other variables are held constant. The interpretation of coefficient estimates therefore becomes difficult, and it is hard to separate the individual effects of the features on the target variable.
Detecting collinearity
A simple and intuitive method to detect collinearity is to look at the correlation matrix of the numeric predictors. An element of this matrix which is close to 1 or -1 is an indication that there is a pair of highly correlated predictors. However, collinearity may exist among three or more variables. In this case, one variable may not be strongly correlated with the other variables on an individual basis, and merely looking at pairs of variables may not be an effective method for detecting the existence of collinearity. More advanced tools like the variance inflation factor can be used (details are beyond the scope of Exam PA).
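A minimal sketch of the correlation-matrix check in R (dat is a placeholder data frame):

```r
# Correlation matrix of the numeric predictors; entries close to +1 or -1
# flag potentially collinear pairs (collinearity among three or more variables
# can still hide behind modest pairwise correlations)
num_vars <- sapply(dat, is.numeric)
round(cor(dat[, num_vars]), 2)
```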
Possible solutions to collinearity
To cope with collinearity, one may:
- (A simple solution) Delete one of the problematic predictors that are causing collinearity. Due to their strong linear relationship, the deletion is going to cause a minimal impact on the fitted linear model. In some cases, which predictor to drop is not entirely clear and is a judgment call.
- (A more advanced solution) Pre-process the data using dimension reduction techniques such as the principal components analysis. These techniques combine the collinear predictors into a much smaller number of predictors which are far less correlated with each other and capture different kinds of information in the data.
Feature Selection
Best Subset Selection
For linear models, feature removal is equivalent to setting some of the regression coefficients to zero, so that the corresponding features have no effect on the target variable. There are many different ways to look for these features of little predictive power.
One may use hypothesis tests such as the t-test to assess the statistical significance of each feature in the presence of other features and remove features with a large p-value. In Exam PA, we will use more modern feature selection techniques.
One such technique is best subset selection, which involves fitting a separate linear model for each possible combination of the available features, examining all of the resulting models, and choosing the "best subset" of features to form the best model. If there are a total of \(p\) available features, we have to analyze a total of \(2^p\) models: each feature can either be included in or excluded from the model, so with \(p\) features there are \(2^p\) possible combinations.
The best model is the one which fares best according to a pre-specified criterion (e.g., AIC, BIC).
Stepwise Selection
Number of fits
While best subset selection is conceptually straightforward, it is often computationally prohibitive, if not impossible, to fit \(2^p\) models in practice.
Methodology
For stepwise selection, we go through a list of candidate models fitted by ordinary least squares and decide on a final model with respect to a certain selection criterion. The regression coefficients of the nonpredictive features are then set to (exactly) 0.
To actually do feature selection for linear models in Exam PA, you will implement some computationally efficient automatic feature selection techniques known as stepwise selection algorithms. These algorithms do not search through all possible combinations of features, but determine the best model from a carefully restricted list of candidate models by sequentially adding and/ or dropping variables, one at a time.
They come in two versions:
- Backward selection: With backward selection, we start with the model with all features and work “backward.” In each step of the algorithm, we drop the feature that causes, in its absence, the greatest improvement in the model according to a certain criterion such as the AIC or BIC. The process continues until no more features can be dropped to improve the model.
- Forward selection: Forward selection is the opposite of backward selection. This time we start with the simplest model, namely the model with only the intercept but no features (the model equation is \(Y=\beta_0+\varepsilon\)), and go “forward”. In each step, among all the features that are not already in the model, we add the feature that results in the greatest improvement in the model. When no more features can be added to improve the model, the algorithm terminates.
Although stepwise selection looks simple, it has some not-so-obvious properties:
- Only one feature at a time: In both versions of stepwise selection, we add or drop one and only one feature in every step of the algorithm and stop when no further improvement to the model can be achieved. Note that it is not a good idea to add or drop multiple insignificant features at a time because the significance of a feature can be strongly affected by the presence or absence of other features due to their correlations. A feature can be significant on its own, but become highly insignificant in the presence of another feature.
- Stepwise selection vs. best subset selection: Stepwise selection is a computationally efficient alternative to best subset selection. With stepwise selection, it can be shown that the maximum number of linear models we have to fit is only \(1+p(p+1)/2\), whereas for best subset selection we have to fit \(2^p\).
On the downside, stepwise selection does not consider all possible combinations of features, so there is no guarantee that the final model selected by stepwise selection is the best model among all models that can be constructed from existing features. In optimization parlance, we can say that stepwise selection produces a local optimum whereas best subset selection produces a global optimum. - Backward vs. forward selections: Comparing backward and forward selections, we can see that they have three main differences:
| Area | Backward Selection | Forward Selection |
| --- | --- | --- |
| Which model to start with? | Full model | Intercept-only model |
| Add or drop variables in each step? | Drop | Add |
| Which method tends to produce a simpler model? | | Forward selection is more likely to produce a simpler and more interpretable model because the starting model is simpler. |
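A sketch of both algorithms using base R's step() function, where k = 2 corresponds to AIC-based selection and k = log(n_tr) to BIC-based selection (train_data and the feature names are placeholders):

```r
# Hypothetical training data with target Y and features X1, X2, X3
full_model <- lm(Y ~ X1 + X2 + X3, data = train_data)   # model with all features
null_model <- lm(Y ~ 1, data = train_data)               # intercept-only model

# Backward selection starting from the full model, using the AIC (k = 2)
backward_fit <- step(full_model, direction = "backward", k = 2)

# Forward selection starting from the intercept-only model, using the BIC
forward_fit <- step(null_model,
                    scope = Y ~ X1 + X2 + X3,
                    direction = "forward",
                    k = log(nrow(train_data)))
```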
Regularization
Purpose
Regularization (a.k.a. penalization and shrinkage) is an alternative to stepwise selection for doing feature selection, and more generally, for reducing the complexity of a linear model.
Methodology
Considering only a single model hosting all of the potentially useful features, instead of using ordinary least squares, we fit the model using unconventional methods that have the effect of shrinking the coefficient estimates towards zero, especially those of features with limited predictive power. These non-predictive features will then have a smaller impact on the target variable. In some cases, the effect of regularization is so strong that the coefficient estimates of non-predictive features become exactly zero.
How does regularization work?
Recall that the ordinary least squares fitting procedure minimizes the (training) residual sum of squares:
\(RSS=\sum_{i=1}^{n_{tr}}{[Y_i-(\beta_0+\beta_1X_{i1}+…+\beta_pX_{ip})]^2}\)
to produce the estimates of the regression coefficients, the \(\hat{\beta}_j\)'s. Regularization generates coefficient estimates by minimizing a slightly different objective function. Using the RSS as the starting point, regularization adds a penalty term, called the regularization penalty, that reflects the complexity of the model, and minimizes:
\(\text{RSS}+\text{Penalty}=\sum_{i=1}^{n_{tr}}{[Y_i-(\beta_0+\beta_1X_{i1}+…+\beta_pX_{ip})]^2}+\boxed{\lambda f_R(\boldsymbol{\beta})}\), where
- \(\boldsymbol{\beta}=(\beta_0,\beta_1, …,\beta_p)\) is the vector of regression coefficients to be estimated by minimizing this objective function.
- \(\lambda\ge 0\) is the regularization parameter that controls the extent of regularization and quantifies our preference for simple models.
- \(f_R(\boldsymbol{\beta})\) is the penalty function that captures the size of the regression coefficients. The product of the regularization parameter and the penalty function, or \(\lambda f_R(\boldsymbol{\beta})\), defines the regularization penalty.
There are two common choices of the penalty function:
- When \(\boxed{f_R(\boldsymbol{\beta})=\sum_{j=1}^{p}{\beta^2_j}}\) (i.e., the sum of squares of the slope coefficients), we are performing ridge regression.
- When \(\boxed{f_R(\boldsymbol{\beta})=\sum_{j=1}^{p}{|\beta_j|}}\) (i.e., the sum of absolute values of the slope coefficients), we are performing lasso (which stands for “least absolute selection and shrinkage operator“).
A more general regularization method that combines both ridge regression and lasso into one objective function and serves as a compromise between them is elastic net, which corresponds to the regularization penalty given by:
\(f_R(\boldsymbol{\beta})=(1-\alpha)\boxed{\sum_{j=1}^{p}{\beta^2_j}}+\alpha\boxed{\sum_{j=1}^{p}{|\beta_j|}},\quad \alpha\in[0,1]\)
\(\alpha\) is the mixing coefficient determining the relative weight of the ridge and lasso regularization penalties. If \(\alpha=0\), we are effectively performing ridge regression; if \(\alpha=1\), we retrieve the lasso. (On the exam, you can run ?glmnet
and check the definition of \(\alpha\))
Regardless of the specification of the regularization penalty, the essence of regularization is to trade off two desirable characteristics of the coefficient estimates:
- Goodness of fit to training data (captured by the RSS): On the one hand, we desire coefficient estimates that match the training data well in the sense that the training RSS is reasonably small. Recall that the least squares method consists in minimizing the training RSS.
- Model complexity (captured by the size of the coefficient estimates): Besides a good fit to the training data, it is also desirable to have coefficient estimates that are small in absolute value so that the model has lower variance and is less prone to overfitting. This consideration is reflected by adding the regularization penalty to the RSS. The larger the estimates (in magnitude), the heavier the penalty.
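As a simple illustration of this trade-off (not from the source), consider ridge regression with a single predictor and no intercept. Minimizing \(\sum_{i=1}^{n_{tr}}{(Y_i-\beta_1X_{i1})^2}+\lambda\beta_1^2\) with respect to \(\beta_1\) gives
\(\hat{\beta}_{1,\lambda}=\dfrac{\sum_{i=1}^{n_{tr}}{X_{i1}Y_i}}{\sum_{i=1}^{n_{tr}}{X_{i1}^2}+\lambda}\),
which equals the least squares estimate when \(\lambda=0\) and is shrunk further towards zero as \(\lambda\) increases.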
Hyperparameter Tuning – How to Choose Alpha and Lambda
In practice, we use cross-validation to select the best values for alpha and lambda from a grid of candidate values (essentially a structured form of trial and error), as outlined in the steps below; an R sketch follows the list.
- Split the data into training and test sets (80%/20% or otherwise).
- Split the training data into n folds (for n-fold cross validation).
- Select a list of values you want to test, e.g., lambda = (0.01, 0.1, 0.5, 1) and alpha = (0, 0.25, 0.5, 0.75, 1). Note that if you are performing only ridge regression or the lasso, you only need to tune lambda.
- For each pair of values (or single value for lasso or ridge):
- Train the model on (n–1) folds of the training data using those values.
- Calculate the prediction error on the remaining fold of training data.
- Repeat n times for each fold in the training data (cross validation).
- Calculate the average prediction error across all n iterations (the cross-validation error).
- Take the pair of values that gives the lowest cross validation error.
- Train a model on all of the training data using the optimal hyperparameters.
- Test the model on the test data.
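A minimal R sketch of this workflow (not from the source; dat, its target variable target, and the 80/20 split are hypothetical placeholders) using cv.glmnet(), which automates the inner cross-validation loop over lambda:

```r
# Minimal sketch: tuning lambda for a lasso fit by 10-fold cross-validation.
library(glmnet)

set.seed(42)
train_idx <- sample(nrow(dat), size = 0.8 * nrow(dat))          # 80/20 train/test split
X_train <- model.matrix(target ~ ., data = dat[train_idx, ])[, -1]
y_train <- dat$target[train_idx]

cv_fit <- cv.glmnet(X_train, y_train, alpha = 1, nfolds = 10)   # lasso (alpha = 1)
cv_fit$lambda.min                                               # lambda with the lowest CV error

# Refit on all of the training data at the chosen lambda; the model would then be
# evaluated on the held-out 20% test set.
final_fit <- glmnet(X_train, y_train, alpha = 1, lambda = cv_fit$lambda.min)
```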
Limitations of Feature Selection Using Regularization Techniques
Regularization techniques are an appealing way to perform feature selection because the selection is embedded in the model optimization process and is therefore aligned with the overall objective of finding a model that predicts well. However, there are some important limitations to consider, including:
- Performing feature selection as part of the optimization of a model is an automatic method and may not always result in the most interpretable model.
- It is a model-dependent technique: the features selected by the lasso as part of a linear regression have been optimized for linear regression and may not necessarily translate to other model forms (although they can still be useful in other models). Ultimately, we can perform feature selection using regularization for different model forms, but due to the complexity of some objective functions, this can sometimes be suboptimal or intractable.
Remarks:
- Regularization penalty not applied to \(\beta_0\): Note that the intercept is not part of the regularization penalty. The intercept measures the expected value of the target variable when all features are zero; it is not associated with any feature, so a large value of \(\beta_0\) does not by itself make the model more complex or prone to overfitting.
- Standardization of predictors: Unlike ordinary least squares, regularization is affected by the scale of the predictors. For example, if one of the predictors is a monetary amount, whether it is expressed in dollars or in thousands of dollars makes a difference to the predicted values. Loosely speaking, this is because the predictors \(X_j\) appear in the \(RSS\) but not in the regularization penalty. Before performing regularization, it is therefore judicious to standardize the predictors by dividing each by its standard deviation, so that all predictors are on the same scale.
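In R, this is largely handled automatically: the glmnet() function standardizes the predictors internally by default and reports the coefficient estimates back on the original scale (a brief sketch, reusing the assumed X and y objects from earlier):

```r
# glmnet() standardizes predictors internally by default (standardize = TRUE)
# and returns coefficient estimates on the original scale of the predictors.
fit <- glmnet(X, y, alpha = 0, standardize = TRUE)  # TRUE is the default; shown for emphasis
```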
Effects of \(\boldsymbol{\lambda}\) and Bias-Variance Trade-Off
The coefficient estimates of a regularized linear model balance the two competing forces (fit vs. size) above and serve not only to fit the data sufficiently well, but also to produce a reasonably simple model.
Unlike the ordinary least squares method, regularization produces a family of coefficient estimates \(\hat{\boldsymbol{\beta}}_{\lambda}=(\hat{\beta}_{0,\lambda},\hat{\beta}_{1,\lambda},…,\hat{\beta}_{p,\lambda}),\ \lambda\ge 0\), indexed by the regularization parameter \(\lambda\), with each value of \(\lambda\) giving rise to a particular set of estimates easily determined in R.
The trade-off between model fit and model complexity is quantified by \(\lambda\), which plays its role as follows:
- When \(\lambda=0\), the regularization penalty vanishes, and the coefficient estimates are identical to the ordinary least squares estimates.
- As \(\lambda\) increases, the effect of regularization becomes more severe. There is increasing pressure for the coefficient estimates to be closer and closer to zero, and the flexibility of the model drops, resulting in a decreased variance but an increased bias.
In most practical applications, the initial increase in \(\lambda\) causes a substantial reduction in the variance at the cost of only a slight rise in the bias. By trading off a minuscule increase in squared bias for a large drop in variance (yes, it makes sense to tolerate some bias if the variance drops a lot!), we hope to improve the prediction accuracy of the model.
- In the limiting case as \(\lambda\to\infty\), the regularization penalty dominates and the estimates of the slope coefficients have no choice but to be all zero, i.e., \(\hat{\beta}_{j,\lambda}\to 0\) for all \(j=1,…,p\), and the linear model reduces to the intercept-only model (i.e., the model with no features).
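The shrinkage effect of \(\lambda\) can be visualized in R with a coefficient path plot (a brief sketch, again reusing the assumed X and y objects):

```r
# Sketch: each curve traces one coefficient estimate as lambda varies;
# all curves are pulled towards zero as lambda increases.
lasso_path <- glmnet(X, y, alpha = 1)
plot(lasso_path, xvar = "lambda", label = TRUE)
```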
Feature Selection Property of Elastic Net
While ridge regression and lasso are both regularization methods that serve to reduce model complexity and prevent overfitting, they differ in one important respect:
- The lasso has the effect of forcing the coefficient estimates to be exactly zero, effectively removing the corresponding features, when the regularization parameter \(\lambda\) is sufficiently large.
- In contrast, the coefficient estimates of a ridge regression model are reduced, but none are reduced to exactly zero (and so all features are retained) even by making \(\lambda\) arbitrarily large.
As a result, the lasso, and more generally the elastic net with \(\alpha\ne 0\), tend to produce simpler and hence more interpretable models carrying fewer features. These models are called sparse models, meaning that they only involve a small subset of the potentially useful features.
If you are interested in identifying only the key factors affecting the target variable, as is the case of some past PA exams (e.g., December 8, 2020, June 2020, and June 2019 PA exams), then the lasso is preferred due to its feature selection property, unless the fitted model is much inferior to that of ridge regression with respect to prediction accuracy.
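The feature selection property is easy to see in R: extracting the lasso coefficients at a moderately large \(\lambda\) shows some of them set exactly to zero (a sketch; the value s = 0.5 is arbitrary and the X and y objects are the assumed ones from earlier):

```r
# Sketch: at a sufficiently large lambda, several lasso coefficients are exactly zero,
# meaning the corresponding features are dropped from the model.
lasso_fit <- glmnet(X, y, alpha = 1)
coef(lasso_fit, s = 0.5)   # s = the lambda value at which coefficients are extracted
```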
Hyperparameter Tuning
In the context of regularization, \(\alpha\) and \(\lambda\) serve as hyperparameters, which are pre-specified inputs that go into the model fitting process and are not determined as part of the optimization procedure.
Hyperparameters can be tuned by cross-validation.
- Construct a fine grid of values of \((\lambda, \alpha)\) (we only have to consider \(\lambda\) if we are performing ridge regression or the lasso, or if the value of \(\alpha\) has already been fixed by the exam).
- Compute the cross-validation error for each pair of values of \((\lambda, \alpha)\), and choose the pair that gives the lowest cross-validation error.
- Given the optimal hyperparameters, the penalized regression model is fitted to all of the training data and its prediction accuracy is evaluated on the test data.
Remark:
- Tuning the hyperparameters (at least for lambda) of a regularized model can be readily implemented in R, so there is no need to worry about their computational aspects.
- However, you may be asked to describe how cross-validation selects the best hyperparameters in a conceptual exam item.
You can give further details, such as what cross-validation is (e.g., dividing the training data into k folds, training the model on all but one fold, and calculating the RMSE on the remaining fold; repeat the process … ).
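If both hyperparameters need tuning, one common approach (a sketch, not prescribed by the source; it reuses the hypothetical training objects from earlier) is to loop over a grid of \(\alpha\) values, fixing the cross-validation folds so the comparison is fair, and let cv.glmnet() handle the \(\lambda\) grid for each:

```r
# Sketch: joint tuning of alpha and lambda by cross-validation.
set.seed(42)
foldid <- sample(rep(1:5, length.out = nrow(X_train)))   # same 5 folds for every alpha

alphas  <- c(0, 0.25, 0.5, 0.75, 1)
cv_fits <- lapply(alphas, function(a)
  cv.glmnet(X_train, y_train, alpha = a, foldid = foldid))

cv_errors   <- sapply(cv_fits, function(fit) min(fit$cvm))   # lowest CV error per alpha
best_alpha  <- alphas[which.min(cv_errors)]
best_lambda <- cv_fits[[which.min(cv_errors)]]$lambda.min
```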
Pros and Cons of Regularization Techniques for Feature Selection
Pros
- Via the use of model matrices, the glmnet() function that implements penalized regression in R automatically binarizes categorical predictors, and each factor level is treated as a separate feature to be removed. This allows us to assess the significance of individual factor levels, not just the significance of the entire categorical predictor.
- A regularized model can be tuned by cross-validation using the same criterion (RMSE for numeric target variables) that will ultimately be used to judge the model against unseen test data. This is conducive to picking a model with good prediction performance.
- For the lasso and elastic nets, variable selection can be done by taking the regularization parameter sufficiently large.
Cons
- The glmnet() function is restricted in terms of model forms. It can accommodate some, but not all, of the distributions for GLMs; the gamma distribution, for instance, is not covered.
- Regularization techniques may not produce the most interpretable model, especially ridge regression, for which all features are retained.