
SOA ASA Exam: Predictive Analysis (PA)

[mathjax]

Linear Models

Classification of Variables

Intention (by their role in the study)

  1. target/response/dependent/output variable
  2. predictor/explanatory/independent/input variables (risk factors/drivers)

Characteristics (by their nature)

  1. numeric/quantitative variables
  2. categorical/qualitative/factor variables

The Model Building Process

Stage 1: Define the Business Problem

Objectives

prediction-focused (accurate predictions) vs. interpretation-focused (understanding the relationship between the target variable and the predictors)

 

Descriptive Analytics:

  • Focuses on insights from the past and answers the question, “What happened?”

Predictive Analytics:

  • Focuses on the future and addresses, “What might happen next?”

Prescriptive Analytics:

  • Suggests decision options; for example, “What would happen if I do this?” or “What is the best course of action?”

Constraints

  • Availability of data
  • Implementation issues

Stage 2: Data Collection

Data Design

Relevance

  • source the data from the right population and time frame.

Sampling

Random Sampling

Voluntary surveys may be vulnerable to respondent bias.

Stratified Sampling

Divide the underlying population into a number of non-overlapping strata.

  • Oversampling and undersampling: for unbalanced data.
  • Systematic sampling: Use a set pattern.

Random Sampling:

set.seed(1)  # any fixed integer seed makes the split reproducible
data.full$random <- runif(nrow(data.full))  # one Uniform(0, 1) draw per observation
data.train <- data.full[data.full$random < 0.7, ]
data.test <- data.full[data.full$random >= 0.7, ]
# Check the proportion of observations assigned to the training set
nrow(data.train) / nrow(data.full)
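For stratified sampling, a minimal sketch using the caret package's createDataPartition(), which samples within strata of the target so that its distribution is preserved in both sets (the column name target is a placeholder):

library(caret)
set.seed(1)
# sample 70% of rows within strata of the target
idx <- createDataPartition(data.full$target, p = 0.7, list = FALSE)
data.train <- data.full[idx, ]
data.test <- data.full[-idx, ]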

 

Granularity

How detailed the information contained by the variable is.

  • The more detail a variable contains, the more granular it is.

Data Quality Issues

  • Reasonableness: checked by exploratory data analysis.
  • Consistency: numeric & categorical variables; data records across data sets
  • Sufficient documentation
    Effective documentation should at least include the following information (usually provided in the “Business Problem” section):
    • A description of the data set overall, including the data source.
    • A description of each variable in the data, including its name, definition, and format.
    • Notes about any past updates or other irregularities of the data set.
    • A statement of accountability for the correctness of the dataset.
    • A description of the governance processes used to manage the dataset.

Other Data Issues

  • data contains personally identifiable information (PII).
  • variables are causing unfair discrimination.
  • target leakage: the predictors in a model include (“leak”) information that will not be available when the model is applied in practice.

Stage 3: Exploratory Data Analysis

Clean the data with the use of descriptive statistics and graphical displays.

Descriptive Statistics

Find out:

  • the distribution of the target variable
  • Skewness of the distribution
    • log transformation
      • not applicable if there are 0’s for the target variable.
    • square root transformation
      • not applicable if there are negative numbers for the target variable.
     
  • Median around the mean ⇒ distribution is not far from symmetric.
  • Median away from the mean ⇒ left/right-skewed distribution.
  • \(R^2\) close to 0 ⇒ the predictors explain little of the variability of the target.
  • \(R^2\) close to 1 ⇒ the fit to the training data is almost perfect (watch for overfitting).

Stage 4: Model Construction and Evaluation

Train the model

  • Need independent data for assessing the prediction performance of our models.

Data Set Split:

  • Training set (typically 70-80% of the full data) + Test set (typically 20-30% of the full data).

How to split:

  • either randomly according to pre-specified proportions
  • or with special statistical techniques

Size of data sets:

  • The relative size of the training and test sets involves a trade-off.

  • Larger training set / smaller test set: the model knows more about the pattern and is less affected by noise, but there is less data left to evaluate its prediction performance reliably.
  • Smaller training set / larger test set: poorer and less reliable prediction performance, since the model has less data from which to learn the signal.

How to rank models:

  • use the test set to differentiate the quality of these models and choose the one with the best test set performance according to a certain model selection criterion.

Common (General) Performance Metrics

Regression Problems (Numeric Target Variable)

(test) Root Mean Squared Error (RMSE):

  • aggregates all of the prediction errors on the test set and provides an overall measure of prediction performance.
  • test RMSE \(=\sqrt{\dfrac{1}{n_\text{test}}\sum_{i=1}^{n_\text{test}}(y_i-\hat{y}_i)^2}\)
  • The smaller the RMSE, the more predictive the model.

Classification Problems (Categorical Target Variable)

(test) Misclassification Error Rate:

  • counts the number of test observations incorrectly classified; dividing by \(n_\text{test}\) gives the proportion of misclassified observations on the test set.
  • test Misclassification Error Rate \(=\dfrac{1}{n_\text{test}}\sum_{i=1}^{n_\text{test}}1_{\{y_i\ne\hat{y}_i\}}\)
  • The smaller the classification error rate, the more predictive the classifier.

Remark:

  • test  RMSE and test Misclassification Error Rate are distribution-free general performance metrics.
  • Additional likelihood-based performance metrics are used if the target variable follows a certain distribution.
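As a quick illustration, both metrics are one-liners in R once test-set predictions are in hand (pred and pred.class are placeholder vectors of numeric and class predictions; target is a placeholder column name):

# test RMSE for a numeric target
sqrt(mean((data.test$target - pred)^2))
# test misclassification error rate for a categorical target
mean(data.test$target != pred.class)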

Cross-Validation (k-Fold Cross-Validation)

What is Cross-Validation and what is it for?

  • a variation of the training/test set split.
  • a powerful means to assess the prediction performance of a model without using additional test data.
  • select the values of hyperparameters, which are parameters that control some aspect of the fitting process itself.

What is the procedure of applying cross-validation?

Randomly split the training data into \(k\) folds of approximately equal size (commonly \(k=10\); \(k=3\) is used in the sketch below for illustration). For each fold, fit the model on the other \(k-1\) folds and measure its performance on the held-out fold; the average of the \(k\) performance values is the cross-validation estimate.
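A minimal manual sketch of this procedure, assuming a numeric target column named target and a linear model for concreteness:

k <- 3
set.seed(1)
fold <- sample(rep(1:k, length.out = nrow(data.train)))  # random fold labels
cv.rmse <- numeric(k)
for (j in 1:k) {
  fit <- lm(target ~ ., data = data.train[fold != j, ])    # train on k - 1 folds
  pred <- predict(fit, newdata = data.train[fold == j, ])  # predict on held-out fold
  cv.rmse[j] <- sqrt(mean((data.train$target[fold == j] - pred)^2))
}
mean(cv.rmse)  # cross-validation estimate of the test RMSE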

Stage 5: Model Validation

Model validation techniques may be performed on either the training set or the test set, and they may be model-dependent, i.e. specific to particular types of predictive models.

  • Training set:
    • GLMs: there is a set of model diagnostic tools designed to check the model assumptions based on the training set.
  • Test set:
    • GLMs and decision trees: Compare the predicted values and the observed values of the target variable on the test set (not the training set).
      • the plot of observed against predicted values should fall on the 45° straight line passing through the origin quite closely, and
      • the deviation from the line should also have no systematic patterns.
    • Compare the selected model to an existing, baseline model, again on the test set.
      • a primitive model, such as an ordinary least squares linear model (if the selected model is a GLM), or
      • a model that has no predictors and simply uses the average of the target variable on the training set for prediction.

Stage 6: Model Maintenance

  • The training data should be enlarged with new observations, when new data becomes available.
  • If new variables are collected, they should be fed into the training set and, if useful, enter the predictive model.
  • Solicit external subject matter expertise if there are modeling issues that cannot be easily resolved purely on predictive analytic grounds.
    • high-dimensional categorical predictors with no obvious ways to combine factor levels.

Bias-Variance Trade-off

When does it happen?

Stage 4: Model Construction and Evaluation

Idea

A complex model \(\not\Rightarrow\) a predictive model

Definition

Goodness of fit vs. prediction accuracy

  • Complexity improves the fit to training data:
    The more complex the model, the lower the training error (evaluated using RMSE / misclassification error rate), but not necessarily the lower the test error.
  • Goodness-of-fit measures, which quantify how well a model describes past data, are not necessarily good measurements of how well the model will perform on future data.
  • Select the right level of complexity for constructing an effective prediction model.

Decomposition of Expected Test Error

Expected Test Error = Reducible Error + Irreducible Error

Expected Test Error = \(\text{Bias}^2\) + Variance + Variance of the Noise

Bias

Definition:

  • Bias is the part of the test error caused by the model not being flexible enough to capture the underlying signal.
  • \(\text{Bias}_{Tr}(\hat{f}(X_0))=E_{Tr}[\hat{f}(X_0)]-f(X_0)\)

Remarks:

  • In general, the more complex a model, the lower the bias (in magnitude) due to its higher ability to capture the signal in the data.
  • Bias quantifies the accuracy of prediction.

Variance

Definition:

  • The variance of \(\hat{f}(X_0)\) quantifies the amount by which \(\hat{f}(X_0)\) would change if we estimated it using a different training set.
  • Variance is the part of the test error caused by the model being overly complex.

Remarks:

  • In general, a more flexible model has a higher variance because it is more sensitive to the training data.
  • Variance quantifies the precision of prediction.

Irreducible Error

Definition:

  • The variance of the noise, which is independent of the choice of the predictive model.

Bias vs. Variance

Bias-Variance Trade-Off

  • U-shape behavior of the test error
  • An inverse relationship between model bias and model variance: a more flexible model has a lower bias but a higher variance than a less flexible model.

As the model moves from simple to complex/flexible, the bias falls while the variance rises, and their sum traces the U-shaped test error.

Underfitting

  • The model is too simple.
  • Large training error because of large bias, but small variance
  • As the flexibility of the model increases, test error decreases since the bias tends to drop faster than the variance increases.

Overfitting

  • The model is too complex.
  • As the complexity of the model increases, test error increases since the bias tends to drop slower than the variance increases.
  • test error continuously increases because of high variance, caused by mistreating the noise as if it were a signal.

Training Error: continues to decrease as complexity grows, because the bias keeps dropping.

 

 

Feature Generation and Selection

Feature generation and selection aim to achieve effective control of model complexity (the bias-variance trade-off).

Variables vs. Features

  • Variables: A variable more precisely means a raw measurement that is recorded and constitutes the original dataset before any transformations are applied.
  • Features: Features refer to derivations from the original variables and provide an alternative, more useful view of the information contained in the dataset. In the language of Exam IFM, features are “derivatives” of raw variables.

Feature Generation

Feature generation is the process of “generating” new features based on existing variables in the data.

  • These new derived variables do not add new information.
  • They transform the information contained in the original variables into a more useful form or scale so that a predictive model can “absorb” the information more effectively and capture the relationship between the target variable and the predictors more powerfully.
  • Within the context of the bias-variance trade-off, feature generation seeks to enhance the flexibility of the model and lower the squared bias of the predictions at the expense of an increase in variance.
  • Makes a model easier to interpret.

Feature Selection and Dimension Reduction

What is feature selection?

Feature selection (or feature removal), the opposite of feature generation, is the procedure of dropping features with limited predictive power and therefore reducing the dimension of the data.

What is feature selection for?

In the framework of the bias-variance trade-off, it is an attempt to control model complexity and prevent overfitting by reducing variance at the cost of a slight rise in bias.

Dimension Reduction

Strategies for reducing the dimensionality of a categorical predictor:

  • Categories with very few observations: Combining sparse categories with others.
  • Similar categories: combining categories in which the target variable behaves similarly.

                       Bias                 Variance
  Feature Generation   ↓ decrease           ↑ increase
  Feature Selection    ↑ slight increase    ↓ decrease

Granularity

A categorical variable becoming more granular will lead to:

  • finer information being stored;
  • increased dimension;
  • fewer observations at each distinct level.

Why do we reduce the granularity of a categorical predictor?

Reducing the granularity of a categorical predictor can reduce the susceptibility of a predictive model to noise by:

  • recording the information contained by the predictor at a less detailed level;
  • making the number of factor levels more manageable.

The optimal level of granularity is the one that optimizes the bias-variance trade-off.

Granularity vs. Dimensionality

  • Applicability: Dimensionality is a concept specific to categorical variables, while granularity applies to both numeric and categorical variables.
  • Comparability: Categorical variables can always be ordered by dimension, but not always by granularity; for granularity we need to know whether each distinct level of one variable is a subset of a level of the other.
    In other words, if VAR_1 is more granular than VAR_2, then knowing the value of VAR_1 automatically determines the level of VAR_2.

Linear Models: Theoretical Foundations

Model Formulation

  • Model Equation: \(Y=\beta_0+\beta_1X_1+\beta_2X_2+\cdots+\beta_pX_p+\varepsilon\)
  • \(E[Y]=f(X)=\beta_0+\beta_1X_1+\beta_2X_2+\cdots+\beta_pX_p\)

Matrix Representation of Linear Models

\(\boldsymbol{Y}=\boldsymbol{X\beta}+\boldsymbol{\varepsilon}\)

or \(\begin{pmatrix} Y_1\\ Y_2\\ \vdots\\ Y_{n_{tr}} \end{pmatrix}=\begin{pmatrix} 1 & X_{11} & X_{12} & \cdots & X_{1p}\\ 1 & X_{21} & X_{22} & \cdots & X_{2p}\\ \vdots & \vdots & \vdots & & \vdots\\ 1 & X_{n_{tr},1} & X_{n_{tr},2} & \cdots & X_{n_{tr},p} \end{pmatrix}\begin{pmatrix} \beta_0\\ \beta_1\\ \vdots\\ \beta_p \end{pmatrix}+\begin{pmatrix} \varepsilon_1\\ \varepsilon_2\\ \vdots\\ \varepsilon_{n_{tr}} \end{pmatrix}\)

 

Ordinary Least Squares Approach

\(\min_{\boldsymbol{\beta}}\sum_{i=1}^{n_{tr}}[Y_i-(\beta_0+\beta_1X_{i1}+\beta_2X_{i2}+\cdots+\beta_pX_{ip})]^2\)

\(\boldsymbol{\hat{\beta}}=(\boldsymbol{X}^T\boldsymbol{X})^{-1}\boldsymbol{X}^T\boldsymbol{Y}\)

 

Model Quantities

The (training) Residual Sum of Squares (RSS)

\(RSS=\sum_{i=1}^{n_{tr}}e_i^2=\sum_{i=1}^{n_{tr}}(Y_i-\hat{Y}_i)^2\)

Coefficient of Determination

 

Formula

\(R^2=1-\dfrac{RSS}{TSS}\), where \(TSS=\sum_{i=1}^{n_{tr}}(Y_i-\bar{Y})^2\) and \(\bar{Y}=\dfrac{1}{n_{tr}}\sum_{i=1}^{n_{tr}}Y_i\)

Properties

    • The higher the value of \(R^2\), the better the fit of the model to the training set.
    • \(R^2\) always increases when a new predictor is added to the model.

Extreme Values

    • \(R^2=0\): implies \(RSS=TSS\Rightarrow\hat{Y}_i=\bar{Y}\), the same value for all \(i=1,\dots,n_{tr}\). In this case, the fitted linear model is essentially the intercept-only model. The predictors collectively bring no useful information for understanding the target variable.
    • \(R^2=1\): implies \(RSS=0\Rightarrow\hat{Y}_i=Y_i\) for all \(i=1,\dots,n_{tr}\). The model perfectly fits each training observation. Although the goodness of fit to the training set is perfect, this model may have overfitted the data and may not do well on future, unseen data.

t-statistic

\(t(\hat{\beta}_j)=\hat{\beta}_j/\text{SE}(\hat{\beta}_j)\), the coefficient estimate divided by its standard error.

Purpose

It is a measure of the partial effect of the corresponding predictor on the target variable, i.e., the effect of adding that predictor to the model after accounting for the effects of other variables in the model.

Usage

Test the hypothesis \(H_0:\beta_j=0\), which means that the jth predictor can be dropped from the linear model in the presence of all other predictors.

The smaller the p-value, the more evidence we have against the null hypothesis in favor of the alternative hypothesis and the more important the predictor that is being tested.

F-statistic

Purpose

The F-statistic assesses the joint significance of the entire set of predictors (excluding the intercept).

Usage

The F-statistic is for testing \(H_0:\beta_1=\beta_2=\cdots=\beta_p=0\) against the alternative \(H_a:\) at least one of \(\beta_1,\dots,\beta_p\) is non-zero.

If (H_0) is rejected, then we have strong evidence that at least one of the p variables is an important predictor for the target variable. However, the test itself does not say which variables are really predictive – further analysis is required.

 

              \(H_0\)                                    \(H_a\)                                                         Purpose
t-statistic   \(H_0:\beta_j=0\)                          \(H_a:\beta_j\ne0\)                                             test a single regression coefficient
F-statistic   \(H_0:\beta_1=\beta_2=\cdots=\beta_p=0\)   \(H_a:\) at least one of \(\beta_1,\dots,\beta_p\) is non-zero   test the joint significance of the entire set of predictors (excluding the intercept)
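In R, both tests are reported by summary() on a fitted lm object; a sketch with placeholder names:

fit <- lm(target ~ X1 + X2, data = data.train)
summary(fit)  # "t value"/"Pr(>|t|)" per coefficient; overall F-statistic at the bottom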

Model Evaluation and Validation

Performance Metrics

 

All five metrics below apply to a numeric (continuous) target variable.

  • RMSE: the most interpretable metric; it senses the typical magnitude of the prediction error.
  • \(R^2\): easy to interpret, but it keeps increasing when a new predictor is added.
  • loglikelihood \(l\): the basis of the two penalized-likelihood metrics below.
  • AIC \(=-2l+2p\): penalty amount of 2 per parameter.
  • BIC \(=-2l+p\ln n_{tr}\): penalty amount of \(\ln(n_{tr})\) per parameter; more conservative than AIC, since \(\ln n_{tr}>2\) whenever \(n_{tr}\ge 8\).
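For a fitted lm (or glm) object, base R computes the two penalized-likelihood metrics directly; a sketch (note that R's parameter count includes every estimated parameter, e.g., the error variance):

AIC(fit)  # -2l + 2p
BIC(fit)  # -2l + p * log(n_tr)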

Model Diagnostics

For graphical or visual tools, two of the most important plots are:

“Residuals vs Fitted” Plot

    • Purpose: to check the model specification as well as the homogeneity of the error variance.
    • Usage: Any systematic patterns and significantly non-uniform spread in the residuals are symptomatic of an inadequate model equation and heteroscedasticity (i.e., the random errors have non-constant variance).

“Normal Q-Q” Plot

    • Purpose: to check the normality of the random errors.
    • Usage: the points in the Q-Q plot are expected to lie closely on the 45° straight line passing through the origin if the random errors are normally distributed.
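Both plots come from calling plot() on a fitted lm object; a minimal sketch:

plot(fit, which = 1)  # Residuals vs Fitted
plot(fit, which = 2)  # Normal Q-Q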

Feature Generation

How to specify a useful linear model?

  1. Distinguish between numeric and categorical predictors.
  2. Generate new features to extract the information in the original variables and capture complex relationships.

Numeric Predictors

Basic Form: \(Y=\beta_0+\beta_1X_1+\cdots+\beta_pX_p+\varepsilon\)

Interpretation: A unit increase in \(X_j\) is associated with an increase of \(\beta_j\) (the coefficient of \(X_j\)) in \(Y\) on average, holding all other predictors fixed (so that they do not interfere).

Method 1: Polynomial Regression

Formula: \(Y=\beta_0+\boxed{\beta_1X+\beta_2X^2+\cdots+\beta_mX^m}+\cdots+\varepsilon\)

Pros:

    • The more polynomial terms are included, the more flexible the fit that can be achieved.

Cons:

    • difficult to interpret the coefficients.
    • difficult to choose the optimal value of \(m\).
      • \(m\) is typically 2; for larger \(m\) the model becomes overly flexible and assumes an erratic shape.
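A sketch of a quadratic fit (\(m=2\)) with placeholder names; raw = TRUE keeps ordinary powers of X rather than orthogonal polynomials:

fit.poly <- lm(target ~ poly(X, 2, raw = TRUE), data = data.train)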

Method 2: Binning-Using Piecewise Constant Functions

Methodology: bin a numeric predictor into a set of ordered bands ⇒ a categorical predictor

Effect: Each level is represented by a dummy variable, which in turn receives a separate regression coefficient in the model equation. Effectively, the regression function becomes piecewise constant: it is “constant” over different “pieces” of the range of the numeric variable.

Pros:

    • doesn’t have to assume any particular shape. The larger the number of bins used, the wider the variety of relationships that can be captured and the more flexible the model; bias drops but variance increases.

Cons:

    • the break points must be user-specified in advance.
    • results in a loss of information. Exact values are turned into a set of bands.
    • model becomes discontinuous.
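A sketch of binning with user-specified break points (the breaks and names are hypothetical):

data.train$X.band <- cut(data.train$X, breaks = c(0, 25, 50, 100))  # ordered bands
fit.bin <- lm(target ~ X.band, data = data.train)  # piecewise constant regression function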

Method 3: Using Piecewise Linear Functions

Methodology: add “call payoff” functions \((X-c)_+:=\max(0,X-c)\) to the formula.

For example: \(E[Y]=\beta_0+\beta_1X\) ⇒ \(E[Y]=\beta_0+\beta_1X+\boxed{\beta_2(X-c)_+}\)

Effect: The regression function is linear over each of the two intervals \((-\infty,c)\) and \((c,\infty)\), but its slope changes abruptly from \(\beta_1\) to \(\beta_1+\beta_2\) beyond \(X=c\).

Pros:

    • simple but powerful way to handle non-linearity.
    • recognizes the variation of the target mean within each piece and retains the original values of \(X\)
    • easy to interpret the change in slope of the regression function at the break points.

Cons:

    • the break points must be user-specified in advance.
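A sketch with a single hypothetical break point at c = 50, using pmax() for the call-payoff term:

fit.pw <- lm(target ~ X + pmax(X - 50, 0), data = data.train)  # slope changes at X = 50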

Categorical Predictors

Binarization

Purpose: Binarization is an important feature generation method for incorporating a categorical predictor into a linear model. It turns a given categorical predictor into a collection of artificial “binary” variables (also known as dummy or indicator variables), each of which serves as an indicator of one and only one level of the categorical predictor.

 

Formula

\(E[Y]=\beta_0+\boxed{\beta_1D_1+\cdots+\beta_jD_j+\cdots+\beta_{r-1}D_{r-1}}+\text{(other predictors)}\), where

    • An \(r\)-level categorical predictor is represented by \(r-1\) dummy variables \(D_1,\dots,D_{r-1}\).
    • Interpretation of \(\boldsymbol{\beta_0}\): The intercept \(\beta_0\) represents the value of \(E[Y]\) when the categorical predictor is at its baseline level.
    • Interpretation of slopes: \(\beta_j=E[Y^{\text{jth level}}]-E[Y^{\text{baseline level}}]\), for \(j=1,\dots,r-1\).
    • Interpretation of the p-values: The lower the p-value, the more confident we are that the target mean at that level differs significantly from the target mean at the baseline level.
    • Danger of a high-dimensional categorical predictor:
      • The larger the value of \(r\), the more dummy variables are needed, consuming many degrees of freedom and leading to a bulky linear model.
      • High dimension may also mean that some categories are sparse, making the linear model prone to overfitting ⇒ combine some of the sparse categories.
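In R, lm() binarizes factors automatically; model.matrix() shows the dummy variables explicitly (grade is a placeholder factor):

data.train$grade <- factor(data.train$grade)
head(model.matrix(~ grade, data = data.train))  # intercept plus r - 1 dummy columns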

Choice of the Baseline Level

What’s the effect of the choice of the baseline level?

The choice of the baseline level has no effect on the predicted values generated by the linear model or the overall quality of the model, but it does affect the interpretation of the coefficient estimates.

 

How to choose the baseline level:

    • The level that carries the largest number of observations.
    • It is ok to take the first level or the last level as the baseline to make interpretations easy and meaningful, even if these levels are not the most populous.

Why do we base it on the largest number of observations?

The more observations in the baseline level and in the level being compared, the more precise the estimates of the two target means (and hence of their difference).
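A sketch of resetting the baseline with relevel() (the level name is hypothetical):

data.train$grade <- relevel(data.train$grade, ref = "standard")  # most populous level as baseline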

 

Should we treat a numeric variable as a factor?

Numeric variables may be converted to a factor variable if:

    • They are coded by a small number of numeric values (usually integers).
    • The values of these variables are merely numeric labels without a sense of numeric order.
    • These variables have a complex, non-linear relationship with the target variable.

Interaction of Predictors

Definition

An interaction arises if the expected effect of one predictor on the target variable depends on the value (or level) of another predictor.

 

Correlation vs. Interaction

    • Correlations entail a pair of numeric variables.
    • Interactions are always in the context of how two (or more) predictors relate to a target variable and involve the relationship among three variables.

Interactions between Two Numeric Predictors

\(Y=\beta_0+\beta_1X_1+\beta_2X_2+\boxed{\beta_3X_1X_2}+\varepsilon\)

⇓

\(E[Y]=\beta_0+\boxed{(\beta_1+\beta_3X_2)}X_1+\beta_2X_2=\beta_0+\beta_1X_1+\boxed{(\beta_2+\beta_3X_1)}X_2\)

    • Interaction Term (Variable): the product term \(X_3:=X_1X_2\)
      Interpretation
      • A unit increase in \(X_1\) will now increase \(E[Y]\) by \(\beta_1+\beta_3X_2\), which depends on the value of \(X_2\).
      • Equivalently, a unit increase in \(X_2\) will increase \(E[Y]\) by \(\beta_2+\beta_3X_1\), which depends on the value of \(X_1\).

We say that \(X_1\) and \(X_2\) interact with each other to affect \(E[Y]\).

 

Interpretation of \(\boldsymbol{\beta_3}\)

For every unit increase in \(X_2\) (resp. \(X_1\)), the expected effect of \(X_1\) (resp. \(X_2\)) on \(E[Y]\) increases by \(\beta_3\).

 

Effects
      • Main Effects: \(\beta_1X_1\) and \(\beta_2X_2\) in the model equation capture the main effects due to the two numeric predictors.
      • Interaction Effect: \(\beta_3X_1X_2\) produces the interaction effect.
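In an R formula, X1 * X2 expands to the main effects plus the interaction term; a sketch:

fit.int <- lm(target ~ X1 * X2, data = data.train)  # equivalent to X1 + X2 + X1:X2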

Interactions between Numeric and Categorical Predictors

Consider a linear model with one numeric predictor \(X_1\) and a binary variable \(X_2\):

\(E[Y]=\beta_0+\beta_1X_1+\beta_2X_2+\boxed{\beta_3X_1X_2}\)

⇓

\(E[Y]=\begin{cases}\beta_0+\beta_1X_1, & \text{if }X_2=0\\(\beta_0+\beta_2)+(\beta_1+\beta_3)X_1, & \text{if }X_2=1\end{cases}\)

In general, to capture the interaction between a numeric predictor \(X_1\) and an \(r\)-level categorical predictor, we should multiply \(X_1\) by the dummy variable of each of the \(r-1\) non-baseline levels and use all of these \(r-1\) product terms as interaction variables.

 

Assess the Extent of Interaction

Scatterplot: Use the ggplot2 package to make a scatterplot of the target variable against \(X_1\), distinguish the data points according to the levels of \(X_2\) by color using the color aesthetic, and fit a separate regression line or curve to each group using the geom_smooth() function.

If the two fitted lines or curves differ considerably in shape, then the interaction effect is remarkable and should be taken care of in the linear model.
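A sketch of such a plot with placeholder column names (X2 is a factor):

library(ggplot2)
ggplot(data.train, aes(x = X1, y = target, color = X2)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)  # one fitted line per level of X2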

 

Case 3: Interactions between Two Categorical Predictors

\(E[Y]=\beta_0+\beta_1X_1+\beta_2X_2+\boxed{\beta_3X_1X_2}\)

⇓

\(E[Y]=\begin{cases}\beta_0, & \text{if }X_1=X_2=0\\ \beta_0+\beta_1, & \text{if }X_1=1,X_2=0\\ \beta_0+\beta_2, & \text{if }X_1=0,X_2=1\\ \beta_0+\beta_1+\beta_2+\beta_3, & \text{if }X_1=X_2=1\end{cases}\)

 

Assess the Extent of Interaction

Boxplot: the target variable split by the levels of one categorical predictor, faceted by the levels of the other categorical predictor.
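A sketch with placeholder column names (both predictors are factors):

library(ggplot2)
ggplot(data.train, aes(x = X1, y = target)) +
  geom_boxplot() +
  facet_wrap(~ X2)  # one panel per level of the second predictor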

 

Sidebar: Collinearity

Side effect of increasing the flexibility of a model: overfitting, characterized by predictions with low bias but high variance.

Definition

For linear models, high variance can also be induced by a phenomenon known as collinearity (a.k.a. multicollinearity).

Collinearity exists in a linear model when two or more features are closely, if not exactly, linearly related.

A linear model with perfect collinearity is one that includes the dummy variables of all levels of a categorical predictor; the dummy variables always sum to 1.

 

Problems with Collinearity

    • Some of the variables do not bring much additional information.
      • This can result in a rank-deficient linear model: an exact linear relationship among some of the features.
    • The coefficient estimates may exhibit high variance (the “variance inflation” phenomenon), meaning there is a great deal of uncertainty or instability in the estimates.

Detecting Collinearity

    1. Correlation matrix of numeric predictors
      • An entry close to -1 or 1 indicates a pair of highly correlated predictors.
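A sketch restricted to the numeric columns:

cor(data.train[, sapply(data.train, is.numeric)])  # pairwise correlation matrix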

Solutions to Collinearity

    • Simple: Delete one of the problematic predictors that are causing collinearity.
    • Advanced: Combine the collinear predictors into a much smaller number of predictors by pre-processing the data with dimension reduction techniques such as principal components analysis.

Feature Selection

Feature Removal (Feature Selection)

  • hypothesis tests, such as the t-test.
  • best subset selection
    • Number of fits: \(2^p\)
    • The best model is the one which fares best according to a pre-specified criterion (e.g., AIC, BIC).
    • Properties
      • best subset selection produces a global optimum.
  • stepwise selection
    • Number of fits: \(1+p(p+1)/2\)
    • Methodology:
      Go through a list of candidate models fitted by ordinary least squares and decide on a final model with respect to a certain selection criterion. The regression coefficients of the nonpredictive features are then set to (exactly) zero.
    • Backward vs. Forward Selections
      • With backward selection, we start with the model with all features and work “backward.”
      • With forward selection, we start with the simplest model, namely the intercept-only model \(Y=\beta_0+\varepsilon\), and go “forward”.
    • Properties
      • Only one feature at a time.
      • Stepwise selection produces a local optimum, because stepwise selection does not consider all possible combinations of features.
                    Best Subset Selection   Stepwise Selection
  Number of Fits    \(2^p\)                 \(1+p(p+1)/2\)
  Optimum           Global optimum          Local optimum

      • Backward vs. forward selections: Comparing backward and forward selections, we can see three main differences (see the sketch below):

  Area                                        Backward Selection   Forward Selection
  Which model to start with?                  Full model           Intercept-only model
  Add or drop variables in each step?         Drop                 Add
  Which tends to produce a simpler model?     —                    Forward selection is more likely to produce a simpler, more interpretable model because its starting model is simpler.
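A sketch of both directions using base R's step(), which selects by AIC (placeholder names; pass k = log(nrow(data.train)) to select by BIC instead):

null <- lm(target ~ 1, data = data.train)  # intercept-only model
full <- lm(target ~ ., data = data.train)  # model with all features
step(null, scope = formula(full), direction = "forward")
step(full, direction = "backward")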
  • Regularization
    • Definition
      Regularization (a.k.a. penalization and shrinkage) is an alternative to stepwise selection for doing feature selection, and more generally, for reducing the complexity of a linear model.
    • Methodology
      Considering only a single model hosting all of the potentially useful features, instead of using ordinary least squares, we fit the model using unconventional methods that have the effect of shrinking the coefficient estimates towards zero, especially those of features with limited predictive power.
      1. Starting from the RSS: \(\text{RSS}=\sum_{i=1}^{n_{tr}}[Y_i-(\beta_0+\beta_1X_{i1}+\cdots+\beta_pX_{ip})]^2\)
      2. Adding the regularization penalty \(f_R(\boldsymbol{\beta})\): \(\text{RSS}+\text{Penalty}=\sum_{i=1}^{n_{tr}}[Y_i-(\beta_0+\beta_1X_{i1}+\cdots+\beta_pX_{ip})]^2+\boxed{\lambda f_R(\boldsymbol{\beta})}\)
        1. Common choice 1 of the penalty function (ridge regression): When \(\boxed{f_R(\boldsymbol{\beta})=\sum_{j=1}^{p}\beta_j^2}\), the sum of squares of the slope coefficients, we are performing ridge regression.
        2. Common choice 2 of the penalty function (lasso): When \(\boxed{f_R(\boldsymbol{\beta})=\sum_{j=1}^{p}|\beta_j|}\), the sum of absolute values of the slope coefficients, we are performing lasso (which stands for “least absolute shrinkage and selection operator”).
        3. Combining both ridge regression and lasso (elastic net): \(f_R(\boldsymbol{\beta})=(1-\alpha)\boxed{\sum_{j=1}^{p}\beta_j^2}+\alpha\boxed{\sum_{j=1}^{p}|\beta_j|}\), \(\alpha\in[0,1]\)
    • Purpose
      To trade off two desirable characteristics of the coefficient estimates:
      • Goodness of fit to training data
        (Captured by the RSS): minimize the training RSS.
      • Model complexity
        (Captured by the size of the coefficient estimates): add the regularization penalty to the RSS. The penalty is not applied to \(\beta_0\).
    • Remark:
      Standardization of predictors: Before performing regularization, it is judicious to standardize the predictors by dividing each by its standard deviation. Doing so ensures that all predictors are on the same scale.
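A sketch with the glmnet package (placeholder names); alpha = 0 gives ridge regression, alpha = 1 the lasso, and values in between an elastic net. standardize = TRUE rescales the predictors automatically:

library(glmnet)
X <- model.matrix(target ~ ., data = data.train)[, -1]  # feature matrix, intercept column dropped
y <- data.train$target
fit.reg <- cv.glmnet(X, y, alpha = 0.5, standardize = TRUE)  # cross-validation picks lambda
coef(fit.reg, s = "lambda.min")  # shrunken coefficient estimates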

Regularization vs. Stepwise Selection

Similarities

    • both of them are for doing feature selection to reduce the complexity of a linear model.

Differences

    • Stepwise selection goes through a list of candidate models fitted by ordinary least squares and decides on a final model with respect to a certain selection criterion; the regression coefficients of the non-predictive features are then set to (exactly) zero.
      Regularization considers only a single model hosting all of the potentially useful features, but instead of using ordinary least squares, fits the model by minimizing an objective that shrinks the coefficient estimates towards zero, especially those of features with limited predictive power:
    • \(\text{RSS}+\text{Penalty}=\sum_{i=1}^{n_{tr}}[Y_i-(\beta_0+\beta_1X_{i1}+\cdots+\beta_pX_{ip})]^2+\boxed{\lambda f_R(\boldsymbol{\beta})}\)
      • Ridge Regression: \(f_R(\boldsymbol{\beta})=\sum_{j=1}^{p}\beta_j^2\)
      • Lasso: \(f_R(\boldsymbol{\beta})=\sum_{j=1}^{p}|\beta_j|\)
      • Elastic Net: \(f_R(\boldsymbol{\beta})=(1-\alpha)\sum_{j=1}^{p}\beta_j^2+\alpha\sum_{j=1}^{p}|\beta_j|\), \(\alpha\in[0,1]\)

Examples

Fitting Least Squares Estimates

Question

You are fitting a multiple linear regression model:

\(Y=\beta_0+\beta_1X_1+\beta_2X_2+\varepsilon\)

to the following dataset:

 i   \(Y_i\)   \(X_{i1}\)   \(X_{i2}\)
 1   14        2            4
 2   11        3            2
 3   14        0            6
 4   18        8            4
 5   12        4            3
 6   9         1            2

Perform the following two tasks in R.

    1. Set up the response vector \(Y\) and design matrix \(X\) corresponding to the dataset above.
    2. Use the matrix formula \(\hat{\beta}=(X^TX)^{-1}X^TY\) to calculate the least squares estimates of \(\beta_0,\beta_1,\beta_2\).

Solution

> Y <- c(14, 11, 14, 18, 12, 9)
> X0 <- rep(1, 6)  # column of ones for the intercept
> X1 <- c(2, 3, 0, 8, 4, 1)
> X2 <- c(4, 2, 6, 4, 3, 2)
> X <- matrix(c(X0, X1, X2), nrow = 6)
> X
> X
     [,1] [,2] [,3]
[1,]    1    2    4
[2,]    1    3    2
[3,]    1    0    6
[4,]    1    8    4
[5,]    1    4    3
[6,]    1    1    2
> B <- solve(t(X) %*% X) %*% t(X) %*% Y
> B
          [,1]
[1,] 5.2505543
[2,] 0.8137472
[3,] 1.5166297

Therefore, \(\hat{\beta}_0=5.2505543\), \(\hat{\beta}_1=0.8137472\), \(\hat{\beta}_2=1.5166297\).