Context
Suppose that we are statistical consultants hired by the company that offers the product. The company is interested in boosting sales of the product, but cannot directly do so (that is determined by market demand). Instead, it has the liberty to control the advertising expenditure in each of the three advertising media: TV, radio, and newspaper.
If we can construct a linear model that accurately predicts sales (the target variable) on the basis of the budgets spent on the three advertising media (the predictors), then such a model can provide the basis for a profitable marketing plan that specifies how much should be spent on the three media to maximize sales, a business issue of great interest to the company.
Learning Objectives
After completing this case study, you should be able to:
- Fit a multiple linear regression model using the `lm()` function and extract useful information from a fitted model using the `summary()` function.
- Appreciate why variable significance may change as a result of correlations between variables.
- Generate additional features such as interaction and polynomial terms in a linear model.
- Partition the data into training and test sets using the `createDataPartition()` function from the `caret` package.
- Generate predictions on the training set as well as on the test set using the `predict()` function.
- Compare the predictive performance of different linear models.
Stage 1: Define the Business Problem
Objectives
Provide the basis for a profitable marketing plan that specifies how much should be spent on the three media to maximize sales.
Stage 2: Data Collection
Data Design
Relevance
Read in data and remove irrelevant variables.
```r
ad <- read.csv("Advertising.csv")
head(ad)
  X    TV radio newspaper sales
1 1 230.1  37.8      69.2  22.1
2 2  44.5  39.3      45.1  10.4
3 3  17.2  45.9      69.3   9.3
4 4 151.5  41.3      58.5  18.5
5 5 180.8  10.8      58.4  12.9
6 6   8.7  48.9      75.0   7.2

# Drop column X as it has no valuable information
ad$X <- NULL
head(ad)
     TV radio newspaper sales
1 230.1  37.8      69.2  22.1
2  44.5  39.3      45.1  10.4
3  17.2  45.9      69.3   9.3
4 151.5  41.3      58.5  18.5
5 180.8  10.8      58.4  12.9
6   8.7  48.9      75.0   7.2
```
Stage 3: Exploratory Data Analysis
TASK 1: Explore the data
Examine the distribution of the four variables. For each of the four variables, perform the following:
- Show key descriptive statistics.
- Create visual representations.
- State your observations from the descriptive statistics and visual representations. In particular, form preliminary conclusions regarding which variables are likely to have significant predictive power.
In any predictive analytic endeavors, the first step of our analysis is to gain a general understanding of the key characteristics of the variables in our data.
Descriptive Statistics
Apply the `summary()` function to the `ad` data.

```r
summary(ad)
       TV             radio          newspaper          sales
 Min.   :  0.70   Min.   : 0.000   Min.   :  0.30   Min.   : 1.60
 1st Qu.: 74.38   1st Qu.: 9.975   1st Qu.: 12.75   1st Qu.:10.38
 Median :149.75   Median :22.900   Median : 25.75   Median :12.90
 Mean   :147.04   Mean   :23.264   Mean   : 30.55   Mean   :14.02
 3rd Qu.:218.82   3rd Qu.:36.525   3rd Qu.: 45.10   3rd Qu.:17.40
 Max.   :296.40   Max.   :49.600   Max.   :114.00   Max.   :27.00
```
Findings
- Our target variable is `sales`, which takes on values ranging from 1.60 to 27.00.
- The center of the distribution of `sales` is close to 14.02 or 12.90, depending on whether you take the mean or the median.
- For `TV`, `radio`, and `sales`, the mean is pretty close to the median, showing that their distributions are not far from symmetric. The mean of `newspaper` exceeds its median much more markedly, which indicates a fair amount of right skewness in its distribution.
Graphical Display
Use the `ggplot2` package to produce histograms of the four variables.

```r
library(ggplot2)
library(gridExtra)
p1 <- ggplot(ad, aes(x = sales)) + geom_histogram()
p2 <- ggplot(ad, aes(x = TV)) + geom_histogram()
p3 <- ggplot(ad, aes(x = radio)) + geom_histogram()
p4 <- ggplot(ad, aes(x = newspaper)) + geom_histogram()
grid.arrange(p1, p2, p3, p4, ncol = 2)
```
Findings
- The distribution of the target variable `sales` is rather symmetric and bell-shaped, which supports the use of a linear model (with normal random errors).
To examine the pairwise relationship between the target variable and each predictor, we use:
- correlations: `cor()`

```r
cor(ad)
                  TV      radio  newspaper     sales
TV        1.00000000 0.05480866 0.05664787 0.7822244
radio     0.05480866 1.00000000 0.35410375 0.5762226
newspaper 0.05664787 0.35410375 1.00000000 0.2282990
sales     0.78222442 0.57622257 0.22829903 1.0000000
```
Findings:

- `TV` and `radio` are quite strongly positively correlated with `sales`.

- scatterplots: `geom_point()`

```r
p1 <- ggplot(ad, aes(x = TV, y = sales)) + geom_point() + geom_smooth(method = "lm", se = FALSE)
p2 <- ggplot(ad, aes(x = radio, y = sales)) + geom_point() + geom_smooth(method = "lm", se = FALSE)
p3 <- ggplot(ad, aes(x = newspaper, y = sales)) + geom_point() + geom_smooth(method = "lm", se = FALSE)
grid.arrange(p1, p2, p3, ncol = 2)
```
Findings:

- `sales` has a positive linear relationship with both `TV` and `radio`, whereas the relationship between `sales` and `newspaper` is not as clear.

- scatterplot matrix: `pairs()`

```r
pairs(ad)
```
Combining all of the observations above, we can say that `TV` and `radio` are predictive of `sales`, but the predictive power of `newspaper` is questionable.

(Spoiler alert: We will show later that `newspaper` is not a real driver of `sales`.)
Stage 4: Model Construction and Evaluation
Model Construction
TASK 2: Construct simple linear regression models
Fit three separate simple linear regression models for sales on each of `TV`, `radio`, and `newspaper`. Run the summary function on each model and provide the output. Using the model for sales on `TV` as an illustration, perform the following:

- Interpret the estimated coefficient for `TV`.
- Comment on the statistical significance of `TV`.
- Comment on the goodness of fit of this model and interpret the value of the coefficient of determination \(R^2\).
- Predict the value of sales when `TV` equals 0, 100, 200, or 300.

If one and only one of the three advertising media can be included as a predictor for `sales`, recommend which medium you would choose.
Construct a simple linear regression model: `lm()`

```r
slr.TV <- lm(sales ~ TV, data = ad)
slr.TV

Call:
lm(formula = sales ~ TV, data = ad)

Coefficients:
(Intercept)           TV
    7.03259      0.04754
```
The estimated coefficients are \(\hat{\beta}_0=7.03259\) and \(\hat{\beta}_1=0.04754\), and the fitted regression line is:
\(\hat{\text{sales}}=7.03259+0.04754\times \text{TV}\)
Interpret the results
- When no amount is spent on TV advertising (i.e., TV equals 0), the expected amount of sales is estimated to be \(\hat{\beta}_0=7.03259\).
- When TV increases by 1 (i.e., an additional $1,000 is spent on TV advertising; remember that the `TV` variable is in thousands of dollars), the expected increase in sales is estimated to be \(\hat{\beta}_1=0.04754\).
Accessing Additional Information about a Linear Model
- Use the `$` operator to access the components of a fitted `lm` object, e.g., `slr.TV$coefficients`.
- Supporting functions for `lm()`:
| Function | Action |
| --- | --- |
| `summary()` | Displays a detailed analysis of the fitted model. |
| `coefficients()` or `coef()` | Returns a vector of coefficient estimates. |
| `confint()` | Produces confidence intervals for the regression coefficients. The probability level is 95% by default, but can be reset using the `level` option. |
| `residuals()` or `resid()` | Returns a vector of raw residuals \(e_i=Y_i-\hat{Y}_i\) for \(i=1,\dots,n_{tr}\). |
| `anova()` | Returns the ANOVA table for the fitted model, or the ANOVA table for comparing multiple linear models. |
| `AIC()` | Returns the AIC of the fitted model. |
| `plot()` | Produces four diagnostic plots for evaluating the appropriateness of the fitted model. |
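To make the table concrete, here is a minimal sketch applying a few of these supporting functions to the `slr.TV` object fitted above (calls only; outputs omitted, with comments describing what each call returns):

```r
coef(slr.TV)                  # vector of coefficient estimates
confint(slr.TV)               # 95% confidence intervals by default
confint(slr.TV, level = 0.9)  # 90% confidence intervals via the level option
head(resid(slr.TV))           # first few raw residuals
AIC(slr.TV)                   # AIC of the fitted model
plot(slr.TV)                  # four diagnostic plots
```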
Taking the `summary()` function as an example,

```r
summary(slr.TV)

Call:
lm(formula = sales ~ TV, data = ad)

Residuals:
    Min      1Q  Median      3Q     Max
-8.3860 -1.9545 -0.1913  2.0671  7.2124

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.032594   0.457843   15.36   <2e-16 ***
TV          0.047537   0.002691   17.67   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.259 on 198 degrees of freedom
Multiple R-squared:  0.6119,	Adjusted R-squared:  0.6099
F-statistic: 312.1 on 1 and 198 DF,  p-value: < 2.2e-16
```
Useful information in the model summary has two salient parts:
- Coefficient estimates table
- Overall model quality
- residual standard error \(s=\sqrt{s^2}\)
- coefficient of determination \(R^2\)
- F-statistic for testing the significance of all of the predictor(s) taken together
We can say that the fitted model has explained 61.19% of the variation of sales (around its mean), indicating that a moderate amount of variability of sales is explained by regressing sales on TV.
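As a quick sanity check, the 61.19% figure can be reproduced from the definition of \(R^2\); a minimal sketch:

```r
# R^2 = 1 - RSS/TSS: the proportion of the variation of sales
# (around its mean) explained by the fitted model
rss <- sum(resid(slr.TV)^2)
tss <- sum((ad$sales - mean(ad$sales))^2)
1 - rss / tss  # matches the Multiple R-squared of 0.6119
```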
Making Predictions
To learn more about the `predict()` function applied to `lm` objects, type `?predict.lm`.

- `predict()` applied to a fitted model without further arguments returns the fitted values on the training set:

```r
head(predict(slr.TV))
        1         2         3         4         5         6
17.970775  9.147974  7.850224 14.234395 15.627218  7.446162
```
- To predict the target variable at specific values of the predictors, pass a data frame of new data to the `newdata` argument: `predict(model, newdata = data.frame(x_i, ..., x_j))`.

```r
dat <- data.frame(TV = seq(0, 300, by = 100))
predict(slr.TV, newdata = dat)
        1         2         3         4
 7.032594 11.786258 16.539922 21.293586
```
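Beyond point predictions, `predict.lm()` can also return interval estimates; a brief sketch of this optional extension (not part of the original output):

```r
# Confidence intervals for the mean response at the new TV values
predict(slr.TV, newdata = dat, interval = "confidence", level = 0.95)
# Prediction intervals for individual responses (wider, as they also
# account for the variability of the random error)
predict(slr.TV, newdata = dat, interval = "prediction", level = 0.95)
```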
Two more simple linear regression models
Fit the simple linear regression models of `sales` on `radio` and `newspaper`:

```r
slr.radio <- lm(sales ~ radio, data = ad)
slr.newspaper <- lm(sales ~ newspaper, data = ad)
summary(slr.radio)

Call:
lm(formula = sales ~ radio, data = ad)

Residuals:
     Min       1Q   Median       3Q      Max
-15.7305  -2.1324   0.7707   2.7775   8.1810

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  9.31164    0.56290  16.542   <2e-16 ***
radio        0.20250    0.02041   9.921   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.275 on 198 degrees of freedom
Multiple R-squared:  0.332,	Adjusted R-squared:  0.3287
F-statistic: 98.42 on 1 and 198 DF,  p-value: < 2.2e-16

summary(slr.newspaper)

Call:
lm(formula = sales ~ newspaper, data = ad)

Residuals:
     Min       1Q   Median       3Q      Max
-11.2272  -3.3873  -0.8392   3.5059  12.7751

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 12.35141    0.62142   19.88  < 2e-16 ***
newspaper    0.05469    0.01658    3.30  0.00115 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 5.092 on 198 degrees of freedom
Multiple R-squared:  0.05212,	Adjusted R-squared:  0.04733
F-statistic: 10.89 on 1 and 198 DF,  p-value: 0.001148
```
Findings:

- Even though `radio` and `newspaper` are highly significant and contribute positively to `sales` on their own, the quality of these two simple linear regression models is not as favorable as that of the model of `sales` on `TV`.
  - \(\boldsymbol{R^2}\): Their coefficients of determination \(R^2\) are only 33.2% and 5.212%, much lower than the 61.19% achieved by the model of `sales` on `TV`.
- If we are to choose only one of the three advertising media, `TV` should be chosen because its association with `sales`, as captured by \(R^2\) (equivalently, by the t-statistic), is the largest/strongest.
Notes:

- Don't base the choice on which simple linear regression model has the highest estimated coefficient for the predictor: the estimated coefficient by itself does not take into account the variability (standard error) of the estimate, whereas the t-statistic does.
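To compare the strength of association directly, the t-statistics can be extracted from the coefficient tables of the three fitted models; a minimal sketch:

```r
# The "t value" column of the coefficient table returned by summary()
summary(slr.TV)$coefficients["TV", "t value"]                # 17.67
summary(slr.radio)$coefficients["radio", "t value"]          # 9.921
summary(slr.newspaper)$coefficients["newspaper", "t value"]  # 3.30
```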
TASK 3: Construct a basic multiple linear regression model
Fit a multiple linear regression model for `sales` on `TV`, `radio`, and `newspaper`. Then do the following:

- Interpret the estimated coefficients.
- Comment on the goodness of fit of this model.
- The statistical significance of `newspaper` in this model differs from that in the simple linear regression model in Task 2. In view of your observations in Task 1, explain why this is the case.
- Refit the model by dropping the insignificant variable(s).
Model 1: Multiple linear regression model with all three media
To regress `sales` on `TV`, `radio`, and `newspaper`:
\(\text{sales}=\beta_0+\beta_1\times \text{TV}+\beta_2\times \text{radio}+\beta_3\times \text{newspaper}+\varepsilon\)
```r
model.1 <- lm(sales ~ TV + radio + newspaper, data = ad)
s <- summary(model.1)
s

Call:
lm(formula = sales ~ TV + radio + newspaper, data = ad)

Residuals:
    Min      1Q  Median      3Q     Max
-8.8277 -0.8908  0.2418  1.1893  2.8292

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.938889   0.311908   9.422   <2e-16 ***
TV           0.045765   0.001395  32.809   <2e-16 ***
radio        0.188530   0.008611  21.893   <2e-16 ***
newspaper   -0.001037   0.005871  -0.177     0.86
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.686 on 196 degrees of freedom
Multiple R-squared:  0.8972,	Adjusted R-squared:  0.8956
F-statistic: 570.3 on 3 and 196 DF,  p-value: < 2.2e-16
```
Interpretation of Estimates:
- When no money is spent on any of the three advertising media, the expected amount of `sales` is estimated to be \(\hat{\beta}_0=2.938889\).
- In the multiple linear regression framework, the interpretations of the slope coefficients differ slightly due to the presence of other predictors.
  - For instance, when TV increases by 1 unit, the expected increase in `sales` is estimated to be \(\hat{\beta}_1=0.045765\) units, provided that the advertising budgets on `radio` and `newspaper` are held fixed.
  - This expected increase is comparable to that in the simple linear regression model of `sales` on `TV`, which is 0.04754.
  - Similar interpretations can be made for \(\hat{\beta}_2=0.188530\) and \(\hat{\beta}_3=-0.001037\).
Quality of Model:
We can observe from the p-value column that the key advertising media affecting `sales` are `TV` and `radio`, with their p-values being practically zero. As their coefficient estimates are positive, we can conclude that `TV` and `radio` are significant advertising media contributing favorably to `sales`, as we would have expected.
The last three lines of the summary output contain information about the global quality of the fitted linear regression model.
We can see that the \(R^2\) has increased quite remarkably from 0.6119 to 0.8972 (i.e., 89.72% of the variation of `sales` is explained by the three advertising media together) when switching from the simple linear regression model of `sales` on `TV` to the current multiple linear regression model.
The F-statistic takes the extraordinarily large value of 570.3. The p-value of the F-test borders on zero, so we have extremely strong evidence that at least one of the three advertising media is an important predictor of `sales`. As we have seen above, however, only `TV` and `radio` appear to be truly related to sales.
Is newspaper significant or insignificant for sales after all?
There is no contradiction at all if we interpret the results of the two models properly. What the simple linear regression model tells us is that in the absence of other information such as `TV` and `radio`, `newspaper` is an important predictor of `sales`, while the multiple linear regression model says that `newspaper` contributes little information to understanding sales in the presence of `TV` and `radio`.
Loosely speaking, when the values of `TV` and `radio` are known and held fixed, changing `newspaper` is associated with a minimal change in `sales`. The fact that the multiple linear regression model has accounted for the effects of `TV` and `radio` is the key.
We can better understand the “contradiction” between the simple linear regression and multiple linear regression models in two ways.
Correlations
Return to the correlation matrix for sales and the three advertising media; our focus now is on the pairwise correlations among the three media. For example, the correlation between `newspaper` and `radio` is about 0.35, which is moderately high.
This means that firms that spend more on radio advertising also tend to spend more on newspaper advertising.
Now if `radio` is positively predictive of `sales` (implied by the multiple linear regression model and the high correlation between `radio` and `sales`), then with a higher expenditure on radio advertising, sales also tend to be higher.
At the same time, the positive correlation between `newspaper` and `radio` means that the expenditure on `newspaper` also tends to increase. Hence there is a tendency that higher values of sales go hand in hand with higher values of `newspaper`, as suggested by the simple linear regression model, which only analyzes sales in relation to `newspaper` and ignores other information such as `radio`.
In this case, we can say that newspaper serves as a surrogate for radio and inherits some of the predictive power of radio for sales.
Partial correlations
The partial correlation between a target variable \(y\) and a predictor \(X_j\) is the correlation between the two variables after controlling for the effects of other predictors. It can be calculated as follows:
- Regress \(y\) on all predictors other than \(X_j\) and denote the residuals by \(e_1\).
- Regress \(X_j\) on all the other predictors and denote the residuals by \(e_2\).
- The partial correlation between \(y\) and \(X_j\) is defined as the correlation between the two sets of residuals, \(e_1\) and \(e_2\).
For instance, to calculate the partial correlation between `sales` and `newspaper`, we have to fit two models, one regressing `sales` on `TV` and `radio` and one regressing `newspaper` on `TV` and `radio`. Then we use the `cor()` function to compute the correlation between the two series of residuals.
\(\text{sales}=\beta_0+\beta_1\times \text{TV}+\beta_2\times \text{radio}+\varepsilon\)
```r
m1 <- lm(sales ~ TV + radio, data = ad)
m2 <- lm(newspaper ~ TV + radio, data = ad)
cor(m1$residuals, m2$residuals)
[1] -0.01262147
```
The partial correlation between `sales` and `newspaper` is essentially zero, which suggests that `newspaper` is virtually uncorrelated with `sales` after accounting for the effects of `TV` and `radio`. In contrast, the ordinary correlation between `sales` and `newspaper` is 0.228299. Much of this correlation stems from the positive correlation between `radio` and `newspaper`, which is 0.3541.
Model 2: Model with only TV and radio
As Model 1 suggests that `newspaper` is not associated with `sales` in the presence of `TV` and `radio`, let's refine the model by dropping `newspaper` and regressing `sales` only on `TV` and `radio`.
```r
model.2 <- lm(sales ~ TV + radio, data = ad)
summary(model.2)

Call:
lm(formula = sales ~ TV + radio, data = ad)

Residuals:
    Min      1Q  Median      3Q     Max
-8.7977 -0.8752  0.2422  1.1708  2.8328

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.92110    0.29449   9.919   <2e-16 ***
TV           0.04575    0.00139  32.909   <2e-16 ***
radio        0.18799    0.00804  23.382   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.681 on 197 degrees of freedom
Multiple R-squared:  0.8972,	Adjusted R-squared:  0.8962
F-statistic: 859.6 on 2 and 197 DF,  p-value: < 2.2e-16
```
The deletion of `newspaper` has minimal effect on the model, speaking to the statistical insignificance of `newspaper` in the presence of `TV` and `radio`. Not only does the \(R^2\) stay approximately the same at 89.72% (we know that the \(R^2\) of Model 2 cannot be higher than that of Model 1, but the drop is negligible), the F-statistic and adjusted \(R^2\) of the model also increase when `newspaper` is omitted. These results suggest that regressing sales only on `TV` and `radio` improves the performance of the linear model and is likely to enhance its prediction accuracy.
In the rest of this case study, we will use Model 2 as the starting point and add new features to make the model more flexible and predictive.
TASK 4: Construct a multiple linear regression model with interaction
A three-dimensional scatterplot of sales against TV and radio is shown, together with the fitted regression plane when sales is regressed on the two variables.
- Explain why the scatterplot suggests that there may be an interaction between TV and radio.
- Fit a multiple linear regression model for sales against TV and radio with interaction. Provide the summary output.
- Interpret the estimated coefficient for the interaction term.
- Provide evidence that the interaction is significant.
The regression plane is upward tilting with respect to both TV and radio: the more we spend on TV and radio, the larger the value of sales on average.
Response observations that lie above the fitted plane have their residuals (which are positive) colored in cyan whereas those that lie below the plane have their (negative) residuals colored in magenta.
What can you observe from the 3D scatterplot?
There is a clear pattern of positive residuals along the 45-degree line on the `radio` × `TV` plane and negative residuals when most of the advertising expenditure is spent on either `TV` or `radio`. In other words, the fitted model systematically underestimates sales when the advertising budget is split between `TV` and `radio`, but overestimates sales when the company relies mainly on either TV or radio as the advertising medium.
This suggests the presence of a prominent synergy effect between TV and radio which is not taken into account by Model 2. In other words, the rate of increase in sales with respect to TV (resp. radio) appears to increase with the value of radio (resp. TV), which is at odds with Model 2’s implicit assumption that the average effect on sales of each medium is independent of the amount spent on the other medium. This phenomenon provides the motivation for us to refine our linear model with the goal of effectively capturing the systematic patterns in the observations.
Model 3: Model with TV and radio with interaction
To take care of the synergy or interaction effect between TV and radio, we employ the following Model 3:
\(\text{sales}=\beta_0+\beta_1\times \text{TV}+\beta_2\times \text{radio}+\beta_3\times \boxed{\text{TV}\times \text{radio}}+\varepsilon\)
which has the product of TV and radio added as a new predictor. Since:
- \(\dfrac{\partial}{\partial\text{TV}}E[\text{sales}]=\beta_1+\beta_3\times\text{radio}\)
- \(\dfrac{\partial}{\partial\text{radio}}E[\text{sales}]=\beta_2+\beta_3\times\text{TV}\)
The average effect of TV (resp. radio) on sales now depends on the amount spent on radio (resp. TV). We expect the value of \(\beta_3\) to be positive.
Let’s fit the above model with interaction effects (called Model 3) and look at its summary output.
```r
model.3 <- lm(sales ~ TV * radio, data = ad)
summary(model.3)

Call:
lm(formula = sales ~ TV * radio, data = ad)

Residuals:
    Min      1Q  Median      3Q     Max
-6.3366 -0.4028  0.1831  0.5948  1.5246

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.750e+00  2.479e-01  27.233   <2e-16 ***
TV          1.910e-02  1.504e-03  12.699   <2e-16 ***
radio       2.886e-02  8.905e-03   3.241   0.0014 **
TV:radio    1.086e-03  5.242e-05  20.727   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.9435 on 196 degrees of freedom
Multiple R-squared:  0.9678,	Adjusted R-squared:  0.9673
F-statistic:  1963 on 3 and 196 DF,  p-value: < 2.2e-16
```
The estimated coefficient for the interaction term, as we expect, is positive and equals 0.001086, with an extremely low p-value, indicating that we have very strong evidence that positive interaction effects exist between TV and radio.
We can say that the effect of radio on expected sales is estimated to be 0.02886 + 0.001086 x TV, which increases by 0.001086 for every unit increase in TV.
Reversing the roles of TV and radio, we can also say that the effect of TV on expected sales increases by 0.001086 for every unit increase in radio.
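Using the coefficient estimates from the summary above, we can evaluate these estimated marginal effects at a few budget levels; a minimal sketch:

```r
b <- coef(model.3)
# Estimated effect on expected sales of one more unit of radio,
# evaluated at several TV budgets
TV.grid <- c(0, 100, 200, 300)
b["radio"] + b["TV:radio"] * TV.grid
# e.g., at TV = 300, an extra unit of radio adds about
# 0.02886 + 0.001086 * 300 = 0.3547 to expected sales
```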
Further evidence of the significance of the interaction effect is furnished by the notable jump of \(R^2\) from 89.72% for Model 2 to 96.78% for Model 3 and the sharp decline of the \(RSE\) from 1.681 for Model 2 to 0.9435 for Model 3. By any standard, Model 3 is superior to Model 2, which contains only main effects.
Incidentally, both the main effects and interaction effect in Model 3 are statistically significant as evidenced by their extremely small p-value. In other situations, it can happen that the interaction effect is significant, but not the main effects.
Whenever a higher-order variable is retained in a model (due to statistical significance), so are all lower-order variables, even if they are insignificant. As a consequence, if \(X_1:X_2\) is kept, then so are the main effects. This explains why the shorthand \(X_1*X_2\) is much more commonly used than \(X_1:X_2\) alone.
TASK 5: Evaluate three marketing plans
Suppose that as the manager of the company, you are given a total of $100,000 to spend on advertising. You are considering three ways to allocate the $100,000 budget between TV advertising and radio advertising:
- Spending the entire amount on radio advertising.
- Splitting the budget evenly between TV and radio advertising.
- Spending $70,000 on radio advertising and $30,000 on TV advertising.
Evaluate these three marketing plans using the interaction model constructed in Task 4. Which plan would you recommend?
(Note: Keep in mind that the TV and radio variables are measured in thousands of dollars.)
Before evaluating the three marketing plans, let's build a data frame called `dat.budget` (any name will do) to host the three pairs of predictor values of interest.

```r
dat.budget <- data.frame(TV = c(0, 50, 30), radio = c(100, 50, 70))
```
Let's apply the `predict()` function to perform predictions at the three combinations of values contained in `dat.budget`.

```r
predict(model.3, newdata = dat.budget)
        1         2         3
 9.636254 11.864528 11.625115
```

The second plan, which splits the budget evenly between TV and radio advertising, produces the highest predicted sales (11.864528) and is therefore the recommended plan.
The coefficient estimates can be obtained from `model.3$coefficients` or `coef(model.3)`.

```r
coef(model.3)
 (Intercept)           TV        radio     TV:radio
 6.750220203  0.019101074  0.028860340  0.001086495
```
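As a check on our understanding of the interaction model, the prediction for the second plan can be reproduced by hand from these estimates; a minimal sketch:

```r
b <- coef(model.3)
# Predicted sales for the second plan (TV = 50, radio = 50)
b["(Intercept)"] + b["TV"] * 50 + b["radio"] * 50 + b["TV:radio"] * 50 * 50
# reproduces the predicted value of 11.864528
```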
TASK 6: Construct a multiple linear regression model with interaction and polynomial terms
Your supervisor has suggested that adding the square of TV as a feature may improve the multiple linear regression model.
- Explain, without running any models, why this is or is not a reasonable suggestion.
- Regardless of your conclusion, fit a multiple linear regression model for sales against TV and radio with interaction and the square of TV added. Provide the summary output.
Model 4: Model with TV and radio with interaction and polynomial terms
Besides the use of interaction terms, another way to accommodate complex relationships between the target variable and individual predictors is polynomial regression. Via the use of higher-power terms, we are able to take care of a wide variety of relationships which are substantially more complex than linear ones (e.g., curved relationships).
We observed a positive relationship between `sales` and `TV`, but upon closer examination, the relationship appears to be curved rather than linear (the curved relationship between `sales` and `radio` is less pronounced).

The rate of increase of `sales` with respect to `TV` seems to be declining with TV, conforming to the law of diminishing marginal returns in economics: the marginal increase in sales due to an additional unit increase in TV eventually declines beyond a certain point of saturation.
This motivates the introduction of higher powers of TV, say the square of TV, as an additional predictor to be added to Model 3 to better capture the non-linear relationship between sales and TV. The equation of the resulting model, called Model 4, is:
\(\text{sales}=\beta_0+\beta_1\times \text{TV}+\beta_2\times \text{radio}+\beta_3\times \boxed{\text{TV}^2}+\beta_4\times \boxed{\text{TV}\times \text{radio}}+\varepsilon\)
When \(\text{TV}^2\) is inserted, the expected rate of change of sales with respect to `TV` is:
- \(\dfrac{\partial}{\partial\text{TV}}E[\text{sales}]=\beta_1+2\beta_3\times\text{TV}+\beta_4\times\text{radio}\)
which depends not only on the value of `radio` (due to the interaction term), but also on the value of `TV` (due to the squared term).
In R, higher powers of a predictor can be created before applying the `lm()` function (e.g., the square of a predictor `X` can be set up using the command `X2 <- X^2`), or they can be created on the fly and entered into the model formula using the `I()` function, where "I" stands for "insulation." The use of the `I()` function is necessary because the `^` operator, when used in an R formula, has a different meaning than exponentiation, and the `I()` function ensures that the elements within the parentheses are processed arithmetically.
```r
model.4 <- lm(sales ~ TV * radio + I(TV^2), data = ad)
summary(model.4)

Call:
lm(formula = sales ~ TV * radio + I(TV^2), data = ad)

Residuals:
    Min      1Q  Median      3Q     Max
-4.9949 -0.2969 -0.0066  0.3798  1.1686

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  5.137e+00  1.927e-01  26.663  < 2e-16 ***
TV           5.092e-02  2.232e-03  22.810  < 2e-16 ***
radio        3.516e-02  5.901e-03   5.959 1.17e-08 ***
I(TV^2)     -1.097e-04  6.893e-06 -15.920  < 2e-16 ***
TV:radio     1.077e-03  3.466e-05  31.061  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.6238 on 195 degrees of freedom
Multiple R-squared:  0.986,	Adjusted R-squared:  0.9857
F-statistic:  3432 on 4 and 195 DF,  p-value: < 2.2e-16
```
Just as we expected, the coefficient estimate of the squared term is negative and its p-value is extremely small, showing that the squared term is highly significant. Compared to Model 3, the adjusted \(R^2\) and RSE of Model 4 have improved noticeably, from 0.9673 to 0.9857 and from 0.9435 to 0.6238, respectively. These metrics indicate that Model 4 outperforms even Model 3 and is by far the best of the four linear models we have constructed.
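As an aside (our own note, not part of the original analysis), the same squared term can be generated with R's `poly()` function using raw (non-orthogonal) polynomials, which produces the `TV` and `TV^2` columns in one term:

```r
# Equivalent fit to model.4 using poly(); raw = TRUE keeps the
# untransformed powers of TV rather than orthogonal polynomials
model.4.alt <- lm(sales ~ poly(TV, 2, raw = TRUE) + radio + TV:radio, data = ad)
```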
Model Evaluation: Training/test set split
TASK 7: Choose a model
- Split the data into training and test sets.
- Evaluate the prediction performance of the models in Tasks 3, 4, and 6. Recommend which model should be used.
While the adjusted \(R^2\) and RSE are indirect estimates of the test error of a linear model, in Exam PA it is most desirable to evaluate the predictive performance of a linear model directly on an independent test set.
In this subsection, we split the ad dataset into two parts:
- A training set, where we train our candidate linear models and get the coefficient estimates.
- A test set, where we use the linear models fitted to the training set to make predictions. We will then evaluate the prediction accuracy of the models according to a certain criterion (e.g., test RMSE).
Creation of training and test sets: createDataPartition()
There are a number of ways to split the data into training and test sets in R (e.g., using the `sample()` function is one of the easiest). In Exam PA, we will use the `createDataPartition()` function in the `caret` package, which is short for "classification and regression training" and is one of the most important packages in predictive modeling. Used in all of the past and sample PA projects, this data partition function is extremely important, as each PA project involves model validation in one way or another.
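For contrast, here is a sketch of the simple non-stratified alternative mentioned above, based on the `sample()` function (the variable names are our own):

```r
# A 70/30 split by simple random sampling; unlike createDataPartition(),
# this does not stratify on the target variable
set.seed(1)
idx <- sample(seq_len(nrow(ad)), size = floor(0.7 * nrow(ad)))
train.alt <- ad[idx, ]
test.alt  <- ad[-idx, ]
```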
Let’s perform the training/test set split for the ad dataset using this function.
```r
createDataPartition(<target variable>, p = <proportion of training data>, list = <TRUE/FALSE>)
```

Setting `list = FALSE` returns the row indices of the training observations as a matrix rather than as a list, which is convenient for subsetting the data frame.
```r
library(caret)
set.seed(1)
partition <- createDataPartition(ad$sales, p = 0.7, list = FALSE)
data.train <- ad[partition, ]
data.test <- ad[-partition, ]
```
To validate the training/test set split we have just done (i.e., to make sure that it works well), we can verify the similarity of the values of sales in the training and test sets:
```r
summary(data.train$sales)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   1.60   10.43   12.90   13.96   17.30   26.20
summary(data.test$sales)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   3.20   10.32   12.60   14.19   17.40   27.00
```
The two sets of summary statistics are indeed very close to each other, bearing testimony to the stratified sampling that `createDataPartition()` implements.
Note:
To demonstrate the importance of stratified sampling in improving the effectiveness of model validation when using the `createDataPartition()` function, be sure to do the following when you perform model evaluation:
- Describe the design/results of the training/test set split, i.e., the proportion (or number) of observations that go to the training set and test set.
- Point out that you are using stratified sampling to ensure that both the training and test sets contain similar and representative values of the response variable.
- Explicitly check that the two sets of target variable values are comparable by looking at some summary statistics, e.g., the mean.
Model Evaluation: Performance Metrics
Fitting the linear models on the training set and testing them out on the test set
For the purposes of illustration, we are going to consider two more models: the basic model in Task 3 and a complex model involving polynomial and interaction terms of `TV` and `radio`.
| Model | Equation |
| --- | --- |
| 1 | \(\text{sales}=\beta_0+\beta_1\times \text{TV}+\beta_2\times \text{radio}+\beta_3\times \text{newspaper}+\varepsilon\) |
| 2 | \(\text{sales}=\beta_0+\beta_1\times \text{TV}+\beta_2\times \text{radio}+\varepsilon\) |
| 3 | \(\text{sales}=\beta_0+\beta_1\times \text{TV}+\beta_2\times \text{radio}+\beta_3\times\text{TV}\times\text{radio}+\varepsilon\) |
| 4 | \(\text{sales}=\beta_0+\beta_1\times \text{TV}+\beta_2\times \text{radio}+\beta_3\times\text{TV}^2+\beta_4\times\text{TV}\times\text{radio}+\varepsilon\) |
| 5 | \(\text{sales}=\beta_0+\beta_1\times \text{TV}+\beta_2\times \text{radio}+\beta_3\times\text{TV}^2+\beta_4\times\text{radio}^2+\beta_5\times\text{TV}\times\text{radio}+\varepsilon\) |
Let's fit these five models to the training set (the `.tr` suffix stands for "training"). Do note that the `data` argument of the `lm()` function is set to `data.train` rather than `ad`.
```r
model.1.tr <- lm(sales ~ TV + radio + newspaper, data = data.train)
model.2.tr <- lm(sales ~ TV + radio, data = data.train)
model.3.tr <- lm(sales ~ TV * radio, data = data.train)
model.4.tr <- lm(sales ~ TV * radio + I(TV^2), data = data.train)
model.5.tr <- lm(sales ~ TV * radio + I(TV^2) + I(radio^2), data = data.train)
```
Then for each trained linear model, we generate predictions on the test set and compute the test RMSE. As the same calculation will be performed for each of the five models, it is desirable to define a function or invoke an existing function that will help us automate the procedure and reduce repetition.
Here the `RMSE()` function in the `caret` package takes a vector of observed target variable values and a vector (of the same dimension) of predicted target variable values as inputs, and returns the training RMSE or test RMSE, depending on what is fed into the arguments of the function.
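For reference, the same quantity can be computed directly from the definition of the RMSE; a minimal sketch (the helper name `rmse` is our own):

```r
# RMSE: square root of the mean squared difference between
# observed and predicted target values
rmse <- function(observed, predicted) {
  sqrt(mean((observed - predicted)^2))
}
rmse(data.train$sales, predict(model.1.tr))  # should match caret's RMSE()
```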
```r
RMSE(data.train$sales, predict(model.1.tr))
[1] 1.701546
RMSE(data.train$sales, predict(model.2.tr))
[1] 1.704158
RMSE(data.train$sales, predict(model.3.tr))
[1] 0.9645623
RMSE(data.train$sales, predict(model.4.tr))
[1] 0.6433654
RMSE(data.train$sales, predict(model.5.tr))
[1] 0.6394058
```
From Models 2 to 5, the linear models have increasing levels of complexity (i.e., they are nested) and thus decreasing training RMSE. Note that we set the first argument of the `RMSE()` function to the sales variable in the training set and the second argument to the predicted (or fitted) values generated by the linear model in question. Remember that when the `predict()` function is applied to an `lm` object without additional arguments, the fitted values on the training set are returned by default.
We now apply the linear models fitted to the training set to generate the predicted values on the test set, which is passed to the `newdata` argument.
```r
RMSE(data.test$sales, predict(model.1.tr, newdata = data.test))
[1] 1.602705
RMSE(data.test$sales, predict(model.2.tr, newdata = data.test))
[1] 1.581734
RMSE(data.test$sales, predict(model.3.tr, newdata = data.test))
[1] 0.8645022
RMSE(data.test$sales, predict(model.4.tr, newdata = data.test))
[1] 0.5473428
RMSE(data.test$sales, predict(model.5.tr, newdata = data.test))
[1] 0.5589234
```
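In the spirit of the automation suggested earlier, the five test RMSEs can also be produced in a single pass; a minimal sketch:

```r
# Collect the trained models in a list and compute all test RMSEs at once
models <- list(model.1.tr, model.2.tr, model.3.tr, model.4.tr, model.5.tr)
sapply(models, function(m) RMSE(data.test$sales, predict(m, newdata = data.test)))
```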
Unlike the training RMSE, the test RMSE first decreases from Models 1 to 4, then eventually rises when going from Models 4 to 5. As we make our linear model more sophisticated, its predictive performance initially improves due to its increased ability to capture complicated patterns. When it becomes unnecessarily complicated, it appears to overfit the data and its predictive performance worsens. This creates the U-shape that characterizes the test error as a function of model complexity and is another illustration that having excessive flexibility or complexity will eventually backfire.
Among the five linear models, Model 4 appears to strike a good balance between bias and variance, and gives rise to the smallest test RMSE. On the basis of prediction accuracy, Model 4 is our recommended linear model.
Stage 5: Model Validation
Stage 6: Model Maintenance