
SOA ASA Exam: Predictive Analysis (PA) – 3.1 Linear Models Case Study 1: Fitting Linear Models in R


Context

Suppose that we are statistical consultants hired by a company that offers a product. The company is interested in boosting sales of the product, but cannot do so directly (sales are determined by market demand). Instead, it has the liberty to control the advertising expenditure in each of three advertising media: TV, radio, and newspaper.

If we can construct a linear model that accurately predicts sales (the target variable) on the basis of the budgets spent on the three advertising media (the predictors), then such a model can provide the basis for a profitable marketing plan that specifies how much should be spent on the three media to maximize sales, a business issue of great interest to the company.

Learning Objectives

After completing this case study, you should be able to:

  • Fit a multiple linear regression model using the lm() function and extract useful information from a fitted model using the summary() function.
  • Appreciate why variable significance may change as a result of correlations between variables.
  • Generate additional features such as interaction and polynomial terms in a linear model.
  • Partition the data into training and test sets using the createDataPartition() function from the caret package.
  • Generate predictions on the training set as well as on the test set using the predict() function.
  • Compare the predictive performance of different linear models.

 

Stage 1: Define the Business Problem

Objectives

Provide the basis for a profitable marketing plan that specifies how much should be spent on the three media to maximize sales.

 

Stage 2: Data Collection

Data Design

Relevance

Read in data and remove irrelevant variables.

ad <- read.csv("Advertising.csv")
head(ad)

  X    TV radio newspaper sales
1 1 230.1  37.8      69.2  22.1
2 2  44.5  39.3      45.1  10.4
3 3  17.2  45.9      69.3   9.3
4 4 151.5  41.3      58.5  18.5
5 5 180.8  10.8      58.4  12.9
6 6   8.7  48.9      75.0   7.2


# Drop column X as it has no valuable information
ad$X <- NULL
head(ad)

     TV radio newspaper sales
1 230.1  37.8      69.2  22.1
2  44.5  39.3      45.1  10.4
3  17.2  45.9      69.3   9.3
4 151.5  41.3      58.5  18.5
5 180.8  10.8      58.4  12.9
6   8.7  48.9      75.0   7.2

 

Stage 3: Exploratory Data Analysis

TASK 1: Explore the data

Examine the distribution of the four variables. For each of the four variables, perform the following:

  • Show key descriptive statistics.
  • Create visual representations.
  • State your observations from the descriptive statistics and visual representations. In particular, form preliminary conclusions regarding which variables are likely to have significant predictive power.

In any predictive analytic endeavors, the first step of our analysis is to gain a general understanding of the key characteristics of the variables in our data.

 

Descriptive Statistics

Apply the summary() function to the ad data.

summary(ad)

       TV             radio          newspaper          sales      
 Min.   :  0.70   Min.   : 0.000   Min.   :  0.30   Min.   : 1.60  
 1st Qu.: 74.38   1st Qu.: 9.975   1st Qu.: 12.75   1st Qu.:10.38  
 Median :149.75   Median :22.900   Median : 25.75   Median :12.90  
 Mean   :147.04   Mean   :23.264   Mean   : 30.55   Mean   :14.02  
 3rd Qu.:218.82   3rd Qu.:36.525   3rd Qu.: 45.10   3rd Qu.:17.40  
 Max.   :296.40   Max.   :49.600   Max.   :114.00   Max.   :27.00

Findings

  • Our target variable is sales, which takes on values ranging from 1.60 to 27.00.
  • The center of the distribution is close to 12.90 (the median) or 14.02 (the mean).
  • For TV, radio, and sales, the mean is quite close to the median, showing that their distributions are not far from symmetric. The mean of newspaper exceeds its median much more markedly, which indicates a fair amount of right skewness in its distribution.

 

Graphical Display

Use the ggplot2 package to produce histograms of the four variables.

library(ggplot2)
library(gridExtra)
p1 <- ggplot(ad, aes(x = sales)) + geom_histogram()
p2 <- ggplot(ad, aes(x = TV)) + geom_histogram()
p3 <- ggplot(ad, aes(x = radio)) + geom_histogram()
p4 <- ggplot(ad, aes(x = newspaper)) + geom_histogram()
grid.arrange(p1, p2, p3, p4, ncol = 2)

Findings

  • The distribution of the target variable sales is rather symmetric and bell-shaped, which supports the use of a linear model (with normal random errors).

 

To examine the pairwise relationship between the target variable and each predictor, we use:

  • correlations: cor()
cor(ad)

                  TV      radio  newspaper     sales
TV        1.00000000 0.05480866 0.05664787 0.7822244
radio     0.05480866 1.00000000 0.35410375 0.5762226
newspaper 0.05664787 0.35410375 1.00000000 0.2282990
sales     0.78222442 0.57622257 0.22829903 1.0000000

Findings:

    • TV and radio are quite strongly positively correlated with sales

 

  • scatterplots: geom_point()
p1 <- ggplot(ad, aes(x = TV, y = sales)) +
  geom_point() + geom_smooth(method = "lm", se = FALSE)
p2 <- ggplot(ad, aes(x = radio, y = sales)) +
  geom_point() + geom_smooth(method = "lm", se = FALSE)
p3 <- ggplot(ad, aes(x = newspaper, y = sales)) +
  geom_point() + geom_smooth(method = "lm", se = FALSE)
grid.arrange(p1, p2, p3, ncol = 2)

Findings:

    • sales has a positive linear relationship with both TV and radio whereas the relationship between sales and newspaper is not as clear.

 

  • scatter-plot matrix: pairs()
pairs(ad)

Combining all of the observations above, we can say that TV and radio are predictive of sales, but the predictive power of newspaper is questionable.

(Spoiler alert: We will show later that newspaper is not a real driver of sales.)

 

Stage 4: Model Construction and Evaluation

Model Construction

TASK 2: Construct simple linear regression models

Fit three separate simple linear regression models for sales on each of TV, radio, and newspaper. Run the summary function on each model and provide the output. Using the model for sales on TV as an illustration, perform the following:

  • Interpret the estimated coefficient for TV.
  • Comment on the statistical significance of TV.
  • Comment on the goodness of fit of this model and interpret the value of the coefficient of determination \(R^2\).
  • Predict the value of sales when TV equals 0, 100, 200, or 300. If one and only one of the three advertising media can be included as a predictor for sales, recommend which medium you would choose.

 

Construct a simple regression model: lm()

slr.TV <- lm(sales ~ TV, data = ad)
slr.TV

Call:
lm(formula = sales ~ TV, data = ad)

Coefficients:
(Intercept)           TV  
    7.03259      0.04754

The estimated coefficients are \(\hat{\beta}_0=7.03259\) and \(\hat{\beta}_1=0.04754\), and the fitted regression line is:

\(\hat{\text{sales}}=7.03259+0.04754\times \text{TV}\)

Interpret the results

  • When no amount is spent on TV advertising (i.e., TV equals 0), the expected amount of sales is estimated to be \(\hat{\beta}_0=7.03259\).
  • When TV increases by 1, (i.e., an additional $1,000 is spent on TV advertising; remember that the TV variable is in thousands of dollars), the expected increase in sales is estimated to be \(\hat{\beta}_1=0.04754\).
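
As a quick sanity check (a minimal sketch using standard least-squares identities, not part of the original analysis), the slope estimate equals the sample correlation scaled by the ratio of standard deviations, and the fitted line passes through the point of sample means:

# Slope: cor(x, y) * sd(y) / sd(x); should reproduce 0.04754
with(ad, cor(TV, sales) * sd(sales) / sd(TV))
# Intercept: mean(y) - slope * mean(x); should reproduce 7.03259
unname(mean(ad$sales) - coef(slr.TV)["TV"] * mean(ad$TV))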

 

Accessing Additional Information about a Linear Model

  • Use the $ operator to access the components of a fitted lm object, e.g., slr.TV$coefficients.
  • Supporting functions for lm():
    • summary(): Displays a detailed analysis of the fitted model.
    • coefficients() or coef(): Returns a vector of coefficient estimates.
    • confint(): Produces confidence intervals for the regression coefficients. The confidence level is 95% by default, but can be reset using the level option.
    • residuals() or resid(): Returns a vector of raw residuals \(e_i=Y_i-\hat{Y}_i\) for \(i=1,\dots,n_{tr}\).
    • anova(): Returns the ANOVA table for the fitted model, or the ANOVA table for comparing multiple nested linear models.
    • AIC(): Returns the AIC of the fitted model.
    • plot(): Produces four diagnostic plots for evaluating the appropriateness of the fitted model.
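
Several of these functions can be applied directly to slr.TV (a brief sketch; output omitted):

coef(slr.TV)                  # vector of coefficient estimates
confint(slr.TV)               # 95% confidence intervals by default
confint(slr.TV, level = 0.9)  # 90% confidence intervals
head(resid(slr.TV))           # first few raw residuals
AIC(slr.TV)                   # AIC of the fitted model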

Taking the summary() function as an example:

summary(slr.TV)


Call:
lm(formula = sales ~ TV, data = ad)

Residuals:
    Min      1Q  Median      3Q     Max 
-8.3860 -1.9545 -0.1913  2.0671  7.2124 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 7.032594   0.457843   15.36   <2e-16 ***
TV          0.047537   0.002691   17.67   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.259 on 198 degrees of freedom
Multiple R-squared:  0.6119, Adjusted R-squared:  0.6099 
F-statistic: 312.1 on 1 and 198 DF,  p-value: < 2.2e-16

Useful information in the model summary has two salient parts:

  • Coefficient estimates table
  • Overall model quality
    • residual standard error \(s=\sqrt{s^2}\)
    • coefficient of determination \(R^2\)
    • F-statistic for testing the significance of all of the predictor(s) taken together

We can say that the fitted model has explained 61.19% of the variation of sales (around its mean), indicating that a moderate amount of variability of sales is explained by regressing sales on TV.
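
Since \(R^2=1-\text{RSS}/\text{TSS}\), this figure can be reproduced directly from the fitted model (a minimal sketch):

rss <- sum(resid(slr.TV)^2)                # residual sum of squares
tss <- sum((ad$sales - mean(ad$sales))^2)  # total sum of squares
1 - rss / tss                              # should reproduce 0.6119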

 

Making Predictions

To learn more about the predict() function applied to lm objects, type ?predict.lm.

  • predict()

head(predict(slr.TV))

        1         2         3         4         5         6 
17.970775  9.147974  7.850224 14.234395 15.627218  7.446162

 

  • To predict the target variable at new values of the predictors, pass a data frame to the newdata argument: predict(model, newdata = data.frame(...))
dat <- data.frame(TV = seq(0, 300, by = 100))
predict(slr.TV, newdata = dat)

        1         2         3         4 
 7.032594 11.786258 16.539922 21.293586
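
Besides point predictions, predict() applied to an lm object can also return interval estimates via its interval argument (a brief sketch; output omitted):

# 95% confidence interval for the mean response at TV = 100
predict(slr.TV, newdata = data.frame(TV = 100), interval = "confidence")
# 95% prediction interval for an individual response at TV = 100
predict(slr.TV, newdata = data.frame(TV = 100), interval = "prediction")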

 

Two more simple linear regression models

Fit the simple linear regression models of sales on radio and newspaper:

slr.radio <- lm(sales ~ radio, data = ad) 
slr.newspaper <- lm(sales ~ newspaper, data = ad) 
summary(slr.radio)

Call:
lm(formula = sales ~ radio, data = ad)

Residuals:
     Min       1Q   Median       3Q      Max 
-15.7305  -2.1324   0.7707   2.7775   8.1810 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  9.31164    0.56290  16.542   <2e-16 ***
radio        0.20250    0.02041   9.921   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.275 on 198 degrees of freedom
Multiple R-squared:  0.332, Adjusted R-squared:  0.3287 
F-statistic: 98.42 on 1 and 198 DF,  p-value: < 2.2e-16

summary(slr.newspaper)

Call:
lm(formula = sales ~ newspaper, data = ad)

Residuals:
     Min       1Q   Median       3Q      Max 
-11.2272  -3.3873  -0.8392   3.5059  12.7751 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 12.35141    0.62142   19.88  < 2e-16 ***
newspaper    0.05469    0.01658    3.30  0.00115 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 5.092 on 198 degrees of freedom
Multiple R-squared:  0.05212, Adjusted R-squared:  0.04733 
F-statistic: 10.89 on 1 and 198 DF,  p-value: 0.001148

Findings:

  • Even though radio and newspaper are highly significant and contribute positively to sales on their own, the quality of these two simple linear regression models is not as favorable as that of the model of sales on TV.
    • \(R^2\): Their coefficients of determination are only 33.2% and 5.212%, much lower than the 61.19% achieved by the model of sales on TV.
    • If we are to choose only one of the three advertising media, TV should be chosen because its association with sales, as captured by \(R^2\) (equivalently, by the t-statistic), is the strongest.

Notes:

  • Do not base the choice on which simple linear regression model has the highest estimated coefficient for its predictor, because:
    • the estimated coefficient by itself does not take into account the variability (standard error) of the estimate, but the t-statistic does (see the sketch below).
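
To make this comparison concrete, we can extract the slope t-statistics from the three model summaries (a brief sketch using the coefficient matrix returned by summary()):

# Slope t-statistics: TV (17.67) > radio (9.921) > newspaper (3.30)
sapply(list(TV = slr.TV, radio = slr.radio, newspaper = slr.newspaper),
       function(m) summary(m)$coefficients[2, "t value"])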

 

TASK 3: Construct a basic multiple linear regression model

Fit a multiple linear regression model for sales on TV, radio, and newspaper. Then do the following:

  • Interpret the estimated coefficients.
  • Comment on the goodness of fit of this model.
  • The statistical significance of newspaper in this model differs from that in the simple linear regression model in Task 2. In view of your observations in Task 1, explain why this is the case.
  • Refit the model by dropping the insignificant variable(s).

 

Model 1: Multiple linear regression model with all three media

To regress sales on TV, radio, and newspaper:

\(\text{sales}=\beta_0+\beta_1\times \text{TV}+\beta_2\times \text{radio}+\beta_3\times \text{newspaper}+\varepsilon\)

model.1 <- lm(sales ~ TV + radio + newspaper, data = ad)
s <- summary(model.1) 
s


Call:
lm(formula = sales ~ TV + radio + newspaper, data = ad)

Residuals:
    Min      1Q  Median      3Q     Max 
-8.8277 -0.8908  0.2418  1.1893  2.8292 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  2.938889   0.311908   9.422   <2e-16 ***
TV           0.045765   0.001395  32.809   <2e-16 ***
radio        0.188530   0.008611  21.893   <2e-16 ***
newspaper   -0.001037   0.005871  -0.177     0.86    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.686 on 196 degrees of freedom
Multiple R-squared:  0.8972,	Adjusted R-squared:  0.8956 
F-statistic: 570.3 on 3 and 196 DF,  p-value: < 2.2e-16

Interpretation of Estimates:

  • When no money is spent on any of the three advertising media, the expected amount of sales is estimated to be \(\hat{\beta}_0=2.938889\).
  • In the multiple linear regression framework, the interpretations of the slope coefficients differ slightly due to the presence of other predictors.
    • For instance, when TV increases by 1 unit, the expected increase in sales is estimated to be \(\hat{\beta}_1=0.045765\) units, provided that the advertising budgets on radio and newspaper are held fixed.
    • This expected increase is comparable to that in the simple linear regression model of sales on TV, which is 0.04754.
    • Similar interpretations apply to \(\hat{\beta}_2=0.188530\) and \(\hat{\beta}_3=-0.001037\).

Quality of Model:

We can observe from the p-value column that the key advertising media affecting sales are TV and radio, with their p-values being practically zero. As their coefficient estimates are positive, we can conclude that TV and radio are significant advertising media contributing favorably to sales, as we would have expected.

The last three lines of the summary output contain information about the global quality of the fitted linear regression model.

We can see that the \(R^2\) has increased quite remarkably from 0.6119 to 0.8972 (i.e., 89.72% of the variation of sales is explained by the three advertising media together) when switching from the simple linear regression model of sales on TV to the current multiple linear regression model.

The F-statistic takes the extraordinarily large value of 570.3. The p-value of the F-test borders on zero, so we have extremely strong evidence that at least one of the three advertising media is an important predictor of sales. As we have seen above, however, only TV and radio appear to be truly related to sales.

 

Is newspaper significant or insignificant for sales after all?

There is no contradiction at all if we interpret the results of the two models properly. What the simple linear regression model tells us is that in the absence of other information such as TV and radio, newspaper is an important predictor of sales, while the multiple linear regression model says that newspaper contributes little additional information to understanding sales in the presence of TV and radio.

Loosely speaking, when the values of TV and radio are known and held fixed, changing newspaper is associated with a minimal change in sales. The fact that the multiple linear regression model has accounted for the effects of TV and radio is the key.

We can better understand the “contradiction” between the simple linear regression and multiple linear regression models in two ways.

Correlations

According to the correlation matrix for sales and the three advertising media, our focus is on the pairwise correlations among the three media. For example, the correlation between newspaper and radio is about 0.35, which is moderately high.

This means that firms that spend more on radio advertising also tend to spend more on newspaper advertising.

Now if radio is predictive of sales positively (implied by the multiple linear regression model and the high correlation between radio and sales), then with a higher expenditure on radio advertising, sales also tend to be higher.

At the same time, the positive correlation between newspaper and radio means that the expenditure on newspaper also tends to increase. Hence there is a tendency that higher values of sales go hand in hand with higher values of newspaper, as suggested by the simple linear regression model, which only analyzes sales in relation to newspaper and ignores other information such as radio.

In this case, we can say that newspaper serves as a surrogate for radio and inherits some of the predictive power of radio for sales.

Partial correlations

The partial correlation between a target variable \(y\) and a predictor \(X_j\) is the correlation between the two variables after controlling for the effects of other predictors. It can be calculated as follows:

    1. Regress \(y\) on all predictors other than \(X_j\) and denote the residuals by \(e_1\).
    2. Regress \(X_j\) on all predictors other than \(X_j\) and denote the residuals by \(e_2\).
    3. The partial correlation between \(y\) and \(X_j\) is defined as the correlation between the two sets of residuals, \(e_1\) and \(e_2\).

For instance, to calculate the partial correlation between sales and newspaper, we have to fit two models, one regressing sales on TV and radio and one regressing newspaper on TV and radio. Then we use the cor() function to compute the correlation between the two series of residuals.

\(\text{sales}=\beta_0+\beta_1\times \text{TV}+\beta_2\times \text{radio}+\varepsilon\)

m1 <- lm(sales ~ TV + radio, data = ad)
m2 <- lm(newspaper ~ TV + radio, data = ad)
cor(m1$residuals, m2$residuals)

[1] -0.01262147

The partial correlation between sales and newspaper is essentially zero, which suggests that newspaper is virtually uncorrelated with sales after accounting for the effects of TV and radio. In contrast, the ordinary correlation between sales and newspaper is 0.228299. Much of this correlation stems from the positive correlation between radio and newspaper, which is 0.3541.

 

Model 2: Model with only TV and radio

As Model 1 suggests that newspaper is not associated with sales in the presence of TV and radio, let’s refine the model by dropping newspaper and regressing sales only on TV and radio.

model.2 <- lm(sales ~ TV + radio, data = ad)
summary(model.2)

Call:
lm(formula = sales ~ TV + radio, data = ad)

Residuals:
    Min      1Q  Median      3Q     Max 
-8.7977 -0.8752  0.2422  1.1708  2.8328 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  2.92110    0.29449   9.919   <2e-16 ***
TV           0.04575    0.00139  32.909   <2e-16 ***
radio        0.18799    0.00804  23.382   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.681 on 197 degrees of freedom
Multiple R-squared:  0.8972,	Adjusted R-squared:  0.8962 
F-statistic: 859.6 on 2 and 197 DF,  p-value: < 2.2e-16

 

The deletion of newspaper has minimal effects on the model, speaking to the statistical insignificance of newspaper in the presence of TV and radio. Not only does the \(R^2\) stay approximately the same at 89.72% (we know that the \(R^2\) of Model 2 cannot be higher than that of Model 1, but the drop is negligible), the F-statistic and adjusted \(R^2\) of the model also increase when newspaper is omitted. These results suggest that regressing sales only on TV and radio improves the performance of the linear model and is likely to enhance its prediction accuracy.
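
The anova() function listed earlier formalizes this comparison as a partial F-test of Model 2 nested within Model 1 (a brief sketch; since the two models differ by a single term, the F-test p-value coincides with the 0.86 p-value of the t-test for newspaper):

anova(model.2, model.1)  # large p-value => newspaper adds no significant explanatory power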

In the rest of this case study, we will use Model 2 as the starting point and add new features to make the model more flexible and predictive.

 

TASK 4: Construct a multiple linear regression model with interaction

A three-dimensional scatterplot is shown for sales against TV and radio, and the fitted regression plane when sales is regressed on the two variables.

  • Explain why the scatterplot suggests that there may be an interaction between TV and radio.
  • Fit a multiple linear regression model for sales against TV and radio with interaction. Provide the summary output.
  • Interpret the estimated coefficient for the interaction term.
  • Provide evidence that the interaction is significant.

The regression plane tilts upward with respect to both TV and radio: the more we spend on TV and radio, the larger the value of sales on average.

Response observations that lie above the fitted plane have their residuals (which are positive) colored in cyan whereas those that lie below the plane have their (negative) residuals colored in magenta.

What can you observe from the 3D scatterplot?

There is a clear pattern of positive residuals on the 45-degree line on the radio x TV plane and negative residuals when most of the advertising expenditure is spent on either TV or radio. In other words, the fitted model systematically underestimates sales when the advertising budget is split between TV and radio, but overestimates sales when the company relies mainly on either TV or radio as the advertising medium.

This suggests the presence of a prominent synergy effect between TV and radio which is not taken into account by Model 2. In other words, the rate of increase in sales with respect to TV (resp. radio) appears to increase with the value of radio (resp. TV), which is at odds with Model 2’s implicit assumption that the average effect on sales of each medium is independent of the amount spent on the other medium. This phenomenon provides the motivation for us to refine our linear model with the goal of effectively capturing the systematic patterns in the observations.

 

Model 3: Model with TV and radio with interaction

To take care of the synergy or interaction effect between TV and radio, we employ the following model 3:

\(\text{sales}=\beta_0+\beta_1\times \text{TV}+\beta_2\times \text{radio}+\beta_3\times \boxed{\text{TV}\times \text{radio}}+\varepsilon\)

which has the product of TV and radio added as a new predictor. Since:

  • \(\dfrac{\partial}{\partial\text{TV}}E[\text{sales}]=\beta_1+\beta_3\times\text{radio}\)
  • \(\dfrac{\partial}{\partial\text{radio}}E[\text{sales}]=\beta_2+\beta_3\times\text{TV}\)

The average effect of TV (resp. radio) on sales now depends on the amount spent on radio (resp. TV). We expect the value of \(\beta_3\) to be positive.

Let’s fit the above model with interaction effects (called Model 3) and look at its summary output.

model.3 <- lm(sales ~ TV * radio, data = ad)
summary(model.3)

Call:
lm(formula = sales ~ TV * radio, data = ad)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.3366 -0.4028  0.1831  0.5948  1.5246 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 6.750e+00  2.479e-01  27.233   <2e-16 ***
TV          1.910e-02  1.504e-03  12.699   <2e-16 ***
radio       2.886e-02  8.905e-03   3.241   0.0014 ** 
TV:radio    1.086e-03  5.242e-05  20.727   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.9435 on 196 degrees of freedom
Multiple R-squared:  0.9678,	Adjusted R-squared:  0.9673 
F-statistic:  1963 on 3 and 196 DF,  p-value: < 2.2e-16

The estimated coefficient for the interaction term, as we expect, is positive and equals 0.001086, with an extremely low p-value, indicating that we have very strong evidence that positive interaction effects exist between TV and radio.

We can say that the effect of radio on expected sales is estimated to be \(0.02886+0.001086\times\text{TV}\), which increases by 0.001086 for every unit increase in TV.

Reversing the roles of TV and radio, we can also say that the effect of TV on expected sales increases by 0.001086 for every unit increase in radio.

Further evidence of the significance of the interaction effect is furnished by the notable jump of \(R^2\) from 89.72% for Model 2 to 96.78% for Model 3 and the sharp decline of the \(RSE\) from 1.681 for Model 2 to 0.9435 for Model 3. By any standard, Model 3 is superior to Model 2, which contains only main effects.

Incidentally, both the main effects and interaction effect in Model 3 are statistically significant as evidenced by their extremely small p-value. In other situations, it can happen that the interaction effect is significant, but not the main effects.

Whenever a higher-order term is retained in a model (due to statistical significance), so are all of its lower-order terms, even if they are insignificant. As a consequence, if the interaction X1:X2 is kept, then so are the main effects X1 and X2. This explains why the shorthand X1*X2, which automatically includes the main effects, is much more commonly used than X1:X2 alone.
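
To confirm the shorthand (a quick check, not required for the analysis), TV * radio expands to TV + radio + TV:radio, so the two formulas below fit identical models:

model.3a <- lm(sales ~ TV + radio + TV:radio, data = ad)
all.equal(coef(model.3), coef(model.3a))  # TRUE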

 

TASK 5: Evaluate three marketing plans

Suppose that as the manager of the company, you are given a total of $100,000 to spend on advertising. You are considering three ways to allocate the $100,000 budget between TV advertising and radio advertising:

  1. Spending the entire amount on radio advertising.
  2. Splitting the budget evenly between TV and radio advertising.
  3. Spending $70,000 on radio advertising and $30,000 on TV advertising.

Evaluate these three marketing plans using the interaction model constructed in Task 4. Which plan would you recommend?

(Note: Keep in mind that the TV and radio variables are measured in thousands of dollars.)

 

Before evaluating the three marketing plans, let’s build a data frame called dat.budget (any name will do) to host the three pairs of predictor values of interest.

dat.budget <- data.frame(TV = c(0, 50, 30), radio = c(100, 50, 70))

 

Let’s apply the predict() function to perform predictions at the three combinations of values contained in dat.budget.

predict(model.3, newdata = dat.budget)

        1         2         3 
 9.636254 11.864528 11.625115 

Plan 2, which splits the budget evenly between TV and radio (TV = 50, radio = 50), produces the highest predicted sales (11.8645) and is therefore the recommended marketing plan. Plan 1, which spends the entire budget on radio, fares worst because it forgoes both the TV main effect and the TV:radio synergy.

 

The coefficient estimates can be obtained from model.3$coefficients or coef(model.3).

coef(model.3)

(Intercept)          TV       radio    TV:radio 
6.750220203 0.019101074 0.028860340 0.001086495
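
As an arithmetic check (using only the printed estimates), the prediction for Plan 2 (TV = 50, radio = 50) can be reproduced by hand:

b <- coef(model.3)
# intercept + TV main effect + radio main effect + interaction
unname(b["(Intercept)"] + b["TV"] * 50 + b["radio"] * 50 + b["TV:radio"] * 50 * 50)
# 6.750220 + 0.955054 + 1.443017 + 2.716238 = 11.86453, matching the predict() output above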

 

TASK 6: Construct a multiple linear regression model with interaction and polynomial terms

Your supervisor has suggested that adding the square of TV as a feature may improve the multiple linear regression model.

  • Explain, without running any models, why this is or is not a reasonable suggestion.
  • Regardless of your conclusion, fit a multiple linear regression model for sales against TV and radio with interaction and the square of TV added. Provide the summary output.

 

Model 4: Model with TV and radio with interaction and polynomial terms

Besides the use of interaction terms, another way to accommodate complex relationships between the target variable and individual predictors is polynomial regression. Via the use of higher-power terms, we are able to capture a wide variety of relationships that are substantially more complex than linear ones (e.g., curved relationships).

We observed a positive relationship between sales and TV, but upon closer examination, the relationship appears to be curved rather than linear (the curved relationship between sales and radio is less pronounced).

The rate of increase of sales with respect to TV seems to be declining with TV, conforming to the law of diminishing marginal returns in economics – the marginal increase in sales due to an additional unit increase in TV eventually declines beyond a certain point of saturation.

This motivates the introduction of higher powers of TV, say the square of TV, as an additional predictor to be added to Model 3 to better capture the non-linear relationship between sales and TV. The equation of the resulting model, called Model 4, is:

\(\text{sales}=\beta_0+\beta_1\times \text{TV}+\beta_2\times \text{radio}+\beta_3\times \boxed{\text{TV}^2}+\beta_4\times \boxed{\text{TV}\times \text{radio}}+\varepsilon\)

When \(\text{TV}^2\) is inserted, the expected rate of change of sales with respect to TV is:

  • \(\dfrac{\partial}{\partial\text{TV}}E[\text{sales}]=\beta_1+2\beta_3\times\text{TV}+\beta_4\times\text{radio}\)

which depends not only on the value of radio (due to the interaction term), but also on the value of TV (due to the square term).

In R, higher powers of a predictor can be created before the lm() function is called, e.g., the square of a predictor X can be set up with the command X2 <- X^2. Alternatively, they can be created on the fly inside the model formula using the I() function, where “I” stands for “insulation.” The I() function is necessary because the ^ operator, when used in an R formula, denotes factor crossing rather than exponentiation; I() ensures that the expression within the parentheses is evaluated arithmetically.
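
A quick illustration of the difference (a brief sketch; with a single numeric predictor, formula crossing collapses TV^2 back to TV):

coef(lm(sales ~ TV^2, data = ad))     # ^ means crossing in a formula: same fit as sales ~ TV
coef(lm(sales ~ I(TV^2), data = ad))  # I() produces a genuine quadratic term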

model.4 <- lm(sales ~ TV * radio + I(TV ^ 2), data = ad)
summary(model.4) 


Call:
lm(formula = sales ~ TV * radio + I(TV^2), data = ad)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.9949 -0.2969 -0.0066  0.3798  1.1686 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  5.137e+00  1.927e-01  26.663  < 2e-16 ***
TV           5.092e-02  2.232e-03  22.810  < 2e-16 ***
radio        3.516e-02  5.901e-03   5.959 1.17e-08 ***
I(TV^2)     -1.097e-04  6.893e-06 -15.920  < 2e-16 ***
TV:radio     1.077e-03  3.466e-05  31.061  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.6238 on 195 degrees of freedom
Multiple R-squared:  0.986,	Adjusted R-squared:  0.9857 
F-statistic:  3432 on 4 and 195 DF,  p-value: < 2.2e-16

 

Just as we expect, the coefficient estimate of the square term is negative and the p-value is extremely small, showing that the square term is highly significant. Compared to Model 3, the adjusted \(R^2\) and RSE of Model 4 have experienced a noticeable amount of improvement, from 0.9673 to 0.9857 and from 0.9435 to 0.6238, respectively. These metrics are pointers that Model 4 outperforms even Model 3 and is by far the best model among the four linear models we have constructed.

 

Model Evaluation: Training/test set split

TASK 7: Choose a model

  • Split the data into training and test sets.
  • Evaluate the prediction performance of the models in Tasks 3, 4, and 6. Recommend which model should be used.

While the adjusted \(R^2\) and RSE are indirect estimates of the test error of a linear model, in Exam PA it is most desirable to evaluate the predictive performance of a linear model directly on an independent test set.

In this subsection, we split the ad dataset into two parts:

  • A training set, where we train our candidate linear models and get the coefficient estimates.
  • A test set, where we use the linear models fitted to the training set to make predictions. We will then evaluate the prediction accuracy of the models according to a certain criterion (e.g., test RMSE).

 

Creation of training and test sets: createDataPartition()

There are a number of ways to split the data into the training and test sets in R (e.g., using the sample() function is one of the easiest). In Exam PA, we will use the createDataPartition() function in the caret package, which is short for “classification and regression training” and is one of the most important packages in predictive modeling. Used in all of the past and sample PA projects, this data partition function is extremely important as each PA project involves model validation in one way or another.

Let’s perform the training/test set split for the ad dataset using this function.

createDataPartition(<target variable>, p = <proportion of data going to the training set>, list = <whether to return the indices as a list (TRUE) or as a matrix (FALSE)>)

library(caret)
set.seed(1)
partition <- createDataPartition(ad$sales, p = 0.7, list = FALSE)
data.train <- ad[partition, ]
data.test <- ad[-partition, ]

 

To validate the training/test set split we have just done (i.e., to make sure that it works well), we can verify the similarity of the values of sales in the training and test sets:

summary(data.train$sales)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1.60   10.43   12.90   13.96   17.30   26.20 

summary(data.test$sales)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   3.20   10.32   12.60   14.19   17.40   27.00

The two sets of summary statistics are indeed very close to each other, bearing testimony to the stratified sampling that createDataPartition() implements.

 

Note:

To highlight the importance of stratified sampling in improving the effectiveness of model validation when using the createDataPartition() function, be sure to do the following when you perform model evaluation:

  • Describe the design/results of the training/test set split, i.e., the proportion (or number) of observations that go to the training set and test set.
  • Point out that you are using stratified sampling to ensure that both the training and test sets contain similar and representative values of the response variable.
  • Explicitly check that the two sets of target variable values are comparable by looking at some summary statistics, e.g., the mean.

 

Model Evaluation: Performance Metrics

Fitting the linear models on the training set and testing them out on the test set

For the purposes of illustration, we are going to consider two more models alongside Models 2, 3, and 4: the basic model in Task 3 (Model 1) and a more complex model involving polynomial and interaction terms of TV and radio (Model 5).

Model 1: \(\text{sales}=\beta_0+\beta_1\times \text{TV}+\beta_2\times \text{radio}+\beta_3\times \text{newspaper}+\varepsilon\)
Model 2: \(\text{sales}=\beta_0+\beta_1\times \text{TV}+\beta_2\times \text{radio}+\varepsilon\)
Model 3: \(\text{sales}=\beta_0+\beta_1\times \text{TV}+\beta_2\times \text{radio}+\beta_3\times\text{TV}\times\text{radio}+\varepsilon\)
Model 4: \(\text{sales}=\beta_0+\beta_1\times \text{TV}+\beta_2\times \text{radio}+\beta_3\times\text{TV}^2+\beta_4\times\text{TV}\times\text{radio}+\varepsilon\)
Model 5: \(\text{sales}=\beta_0+\beta_1\times \text{TV}+\beta_2\times \text{radio}+\beta_3\times\text{TV}^2+\beta_4\times\text{radio}^2+\beta_5\times\text{TV}\times\text{radio}+\varepsilon\)

Let’s fit these five models to the training set (the .tr suffix stands for “training”). Do note that the data argument of the lm() function is set to data.train rather than ad.

model.1.tr <- lm(sales ~ TV + radio + newspaper, data = data.train)
model.2.tr <- lm(sales ~ TV + radio, data = data.train)
model.3.tr <- lm(sales ~ TV * radio, data = data.train)
model.4.tr <- lm(sales ~ TV * radio + I(TV^2), data = data.train)
model.5.tr <- lm(sales ~ TV * radio + I(TV^2) + I(radio^2), data = data.train)

Then for each trained linear model, we generate predictions on the test set and compute the test RMSE. As the same calculation will be performed for each of the five models, it is desirable to define a function or invoke an existing function that will help us automate the procedure and reduce repetition.

Here the RMSE() function in the caret package takes a vector of observed target variable values and a vector (of the same dimension) of predicted target variable values as inputs, and returns the training RMSE or test RMSE, depending on what is fed into the arguments of the function.
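
Equivalently (a minimal sketch), the RMSE can be computed from first principles; the helper below is a hypothetical name, not part of caret, but it reproduces what RMSE() returns:

rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))  # root mean squared error
rmse(data.train$sales, predict(model.1.tr))             # should match the RMSE() value below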

RMSE(data.train$sales, predict(model.1.tr))
[1] 1.701546

RMSE(data.train$sales, predict(model.2.tr))
[1] 1.704158

RMSE(data.train$sales, predict(model.3.tr))
[1] 0.9645623

RMSE(data.train$sales, predict(model.4.tr))
[1] 0.6433654

RMSE(data.train$sales, predict(model.5.tr))
[1] 0.6394058

From Models 2 to 5, the linear models have increasing levels of complexity (i.e., they are nested) and thus decreasing training RMSE. Note that we set the first argument of the RMSE() function to the sales variable in the training set and the second argument to the predicted (or fitted) values generated by the linear model in question. Remember that when the predict() function is applied to an lm object without additional arguments, the fitted values in the training set are formed by default.

We now apply the linear models fitted to the training set to generate the predicted values on the test set, which is passed to the newdata argument.

RMSE(data.test$sales, predict(model.1.tr, newdata = data.test))
[1] 1.602705

RMSE(data.test$sales, predict(model.2.tr, newdata = data.test))
[1] 1.581734

RMSE(data.test$sales, predict(model.3.tr, newdata = data.test))
[1] 0.8645022

RMSE(data.test$sales, predict(model.4.tr, newdata = data.test))
[1] 0.5473428

RMSE(data.test$sales, predict(model.5.tr, newdata = data.test))
[1] 0.5589234

Unlike the training RMSE, the test RMSE first decreases from Models 1 to 4, then eventually rises when going from Models 4 to 5. As we make our linear model more sophisticated, its predictive performance initially improves due to its increased ability to capture complicated patterns. When it becomes unnecessarily complicated, it appears to overfit the data and its predictive performance worsens. This creates the U-shape that characterizes the test error as a function of model complexity and is another illustration that having excessive flexibility or complexity will eventually backfire.

Among the five linear models, Model 4 appears to strike a good balance between bias and variance, and gives rise to the smallest test RMSE. On the basis of prediction accuracy, Model 4 is our recommended linear model.

 

Stage 5: Model Validation

 

Stage 6: Model Maintenance