
SOA ASA Exam: Predictive Analysis (PA) – 5.1. Conceptual Foundations of Decision Trees


LEARNING OBJECTIVES

The candidate will be able to construct decision trees for both regression and classification.

  • Understand the basic motivation behind decision trees.
  • Construct regression and classification trees.
  • Use bagging and random forests to improve accuracy.
  • Use boosting to improve accuracy.
  • Select appropriate hyperparameters for decision trees and related techniques.

 

EXAM NOTE

As pointed out in Subsection 3.1.1, there are only two supervised learning techniques in Exam PA, GLMs and decision trees. To assess your knowledge of the syllabus materials effectively, the SOA usually “marries” decision trees with GLMs to come up with a comprehensive exam project.

 

Chapter Overview

This chapter enriches our predictive analytics toolbox and introduces the second type of supervised learning technique in Exam PA, namely, decision trees. Just like GLMs, decision trees can be applied to tackle both regression and classification problems with both numeric and categorical predictors, but with a fundamentally different approach.

While GLMs provide a prediction equation based on a linear combination of the predictors, decision trees divide the feature space (i.e., the space of all combinations of feature values) into a finite set of non-overlapping and exhaustive regions of relatively homogeneous observations more amenable to analysis and prediction.

To predict a given observation, we simply locate the region to which it belongs and use the average value (mean) or the majority class (mode) of the target variable in that region as the prediction. Because the rules of prediction can be vividly represented in the form of a tree, the resulting predictive model is aptly called a decision “tree.”

In Section 5.1, we will look at the basic theory behind decision trees, set up the terminology that will be used throughout the chapter, and learn, at a conceptual level, how a decision tree can be fitted and evaluated. More complex tree models based on ensemble methods are also discussed. Sections 5.2 and 5.3 dive into the practical implementation of the tree-building process, with Section 5.2 presenting a case study that involves a small-scale dataset and aims to acquaint you with the output generated by R’s tree-fitting functions. Then Section 5.3 is a more sophisticated case study that demonstrates how base and ensemble classification trees can be constructed and evaluated in terms of prediction accuracy, and what considerations go into the tree selection process.

 


Base Decision Trees

As mentioned in the introduction, a decision tree develops a series of classification rules, or splits, based on the values or levels of the predictors. Observations placed in the same group by these classification rules are relatively homogeneous (i.e., most of them share similar values or classes with respect to the target variable) and thus much easier to analyze and predict.

We will discuss how these splits are constructed very shortly. For now, it is enough to know that to predict an observation, we drop it down the decision tree and use the classification rules to identify the region to which it belongs. What we use as the prediction depends on the nature of the target variable, or equivalently, whether we are dealing with a regression problem or a classification problem.

  • Quantitative target variables: Use the average (mean) of the target variable in that region as the predicted value.
  • Qualitative target variables: Use the most common class (mode) of the target variable in that region as the predicted class.

Terminology-wise, a decision tree for predicting a quantitative target is called a regression tree, whereas a decision tree for predicting a qualitative target is called a classification tree. Both types of trees are regularly tested in Exam PA and will be covered in this chapter.

 

Some Commonly Used Terms

The following terms are useful for describing a generic decision tree:

Node

A node is a point on a decision tree that corresponds to a subset of the training data. There are different types of tree nodes:

Root node

This is the node at the top of a decision tree representing the full dataset.

Terminal Nodes (or Leaves)

These are nodes at the bottom of a tree which are not split further and constitute an end point.

 

Binary Tree

This is a decision tree in which every split divides a parent node into exactly two child nodes. In Exam PA, we will only use binary trees.

 

Depth

The number of tree splits needed to go from the tree’s root node to the furthest terminal node. It is a measure of the complexity of a decision tree; the larger the depth, the more complex it is.

We also say that a tree is balanced (resp. unbalanced) if the left and right subtrees coming out of any node differ in depth by at most (resp. more than) one.

 

Construction of Decision Trees: Making the first split

To construct a decision tree, we have to think about how the series of classification rules, or equivalently, the series of tree splits should be designed. In theory, these splits can be constructed using arbitrary configurations of the predictors. In practice, for simplicity it is customary to use binary splits, separating the training observations into two subgroups according to the value or level of one and only one predictor at a time.

Every time we make a binary split, there are two inter-related decisions we have to make:

  • The predictor to split on
  • Given the split predictor, the corresponding cutoff value (for a numeric predictor).
    For categorical predictors, the split choices are all possible ways the predictor levels can be partitioned into two subsets.

Remember, our goal is to create terminal nodes that contain similar observations, which are easy to analyze and predict. To do this, in every split we partition the observations using whichever predictor and cutoff (or level grouping) results in the greatest reduction in impurity. Measures of node impurity differ depending on whether the decision tree is a regression tree or a classification tree.

 

Regression Trees

Analogous to linear models, the variability of a numeric target variable in a particular node of a decision tree is quantified by the residual sum of squares, or RSS. The lower the RSS, the more homogeneous the target observations. Mathematically, if Rm denotes a certain node (not necessarily a terminal node) of the decision tree, then the RSS of the target variable in Rm is:

\(RSS_m=\sum\limits_{i\in R_m}{(y_i-\hat{y}_{R_m})^2}\),

where \(\hat{y}_{R_m}=\dfrac{\sum\limits_{i\in R_m}{y_i}}{|R_m|}\) is the mean of the target variable in Rm (|Rm| is the number of training observations in Rm) and serves as our prediction for the target variable there.

In terms of the RSS, the first split coming from the root node should be such that the overall RSS after the split given by:

RSS of obs. in first branch + RSS of obs. in second branch = \(\sum\limits_{i:X_{ij}<s}{(y_i-\hat{y}_{R_1})^2}+\sum\limits_{i:X_{ij}\ge s}{(y_i-\hat{y}_{R_2})^2}\)

where \(\hat{y}_{R_1}\) and \(\hat{y}_{R_2}\) are the means of the target variable in the two resulting nodes, is the smallest among all possible choices of the predictor Xj and the cutoff level s.
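To make this criterion concrete, here is a minimal base-R sketch (the toy data and function names are my own, purely for illustration) that performs an exhaustive search over candidate cutoffs s of a single numeric predictor and picks the one minimizing the overall RSS:

```r
# Toy data: one numeric predictor x and a numeric target y (illustrative values)
x <- c(1, 2, 3, 5, 7, 8, 10, 12)
y <- c(2.1, 2.3, 2.2, 5.8, 6.1, 6.0, 9.5, 9.8)

# RSS of a node = sum of squared deviations from the node mean
node_rss <- function(v) sum((v - mean(v))^2)

# Candidate cutoffs: midpoints between consecutive distinct x values
xs <- sort(unique(x))
cutoffs <- (xs[-1] + xs[-length(xs)]) / 2

# Overall RSS after splitting into {x < s} and {x >= s} for each candidate s
split_rss <- sapply(cutoffs, function(s) node_rss(y[x < s]) + node_rss(y[x >= s]))

cutoffs[which.min(split_rss)]  # best cutoff s
min(split_rss)                 # minimized overall RSS
```

In a full tree-fitting routine, this search would additionally be carried out over every available predictor, with the best predictor/cutoff pair chosen for the split.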

 

Classification Trees

Classification trees are more difficult to deal with than regression trees because there are several measures of node impurity one can use. For all of these measures, the larger their values, the more impure or heterogeneous the node is.

Here are three common node impurity measures: (Let’s assume a given node Rm and that the target variable has K distinct classes)

Classification Error Rate

Remember that our predicted class label for a categorical target variable in a node is the most common class, i.e., the class with the highest proportion of observations in that node. In terms of the class proportions, the classification error rate in Rm can be expressed as:

\(E_m=1-\underset{1\le k\le K}{\mathop{Max}}\,\hat{p}_{mk}\)

where \(\hat{p}_{mk}\) is the proportion of the target observations in Rm that lie in the kth class, i.e.:

\(\hat{p}_{mk}\)=\(\dfrac{\text{no. of training observations of kth class in }R_m}{\text{total no. of training observations in }R_m}\)

To understand this seemingly complex formula for Em, note that \(\underset{1\le k\le K}{\mathop{Max}}\,\hat{p}_{mk}\) is, by definition, the proportion of training observations in Rm belonging to the most common level of the target variable. These observations are correctly classified to their own class. Taking the complement of this proportion yields the proportion of misclassified observations.

To see how the classification error rate serves as a measure of node purity, note that if \(\underset{1\le k\le K}{\mathop{Max}}\,\hat{p}_{mk}\) is close to 1, meaning that almost all of the training observations in Rm belong to the most common class, then Em will be very close to zero. The higher (resp. lower) the value of \(\underset{1\le k\le K}{\mathop{Max}}\,\hat{p}_{mk}\), the lower (resp. higher) the value of Em and the more concentrated (resp. more dispersed) the observations in Rm.

 

Gini Index

A common criticism against the classification error rate is that it is not sensitive enough to node purity in the sense that it depends on the class proportions only through the maximum. If two nodes have the same maximum proportion, then they will always share the same classification error rate, regardless of the distribution of the observations in other classes.

We now look at two measures that capture node impurity more effectively. Both of them are functions of all of the \(\hat{p}_{mk}\)’s (while the classification error rate depends only on the maximum). One such measure is the Gini index defined as:

\(G_m=\sum_{k=1}^{K}{\hat{p}_{mk}(1-\hat{p}_{mk})}\),

which is a measure of variance across the K classes of the categorical target variable. The higher the degree of purity of Rm, the lower the value of the Gini index. It can be shown that the Gini index achieves its minimum value of zero if \(\hat{p}_{mk}=1\) for some k (the observations are all concentrated in one class) and its maximum value if \(\hat{p}_{m1}=…=\hat{p}_{mK}=\dfrac{1}{K}\) (the observations are evenly spread among the K classes).

 

Entropy

In a similar spirit to the Gini index, the entropy is defined as:

\(D_m=-\sum_{k=1}^{K}{\hat{p}_{mk}\log_2(\hat{p}_{mk})}\), with the convention that \(0\log_2 0=0\).

Like the Gini index, the value of the entropy increases with the degree of impurity in the node.

 

In most cases, the choice of the node impurity measure does not have a dramatic impact on the performance of a classification tree. Figure 5.1.2 plots the three measures in the special case of K = 2, where they can be viewed as functions of \(\hat{p}_{m1}\) alone (since \(\hat{p}_{m2}=1-\hat{p}_{m1}\) is fully determined by \(\hat{p}_{m1}\)). All three measures behave quite similarly (especially the Gini index and the entropy): they attain their highest value when the node is most impure (when \(\hat{p}_{m1}=0.5\), meaning that the two classes are equally likely) and their lowest value when the node is most pure (when \(\hat{p}_{m1}=0\) or \(\hat{p}_{m1}=1\), meaning that the observations are concentrated in one class). However, the Gini index and entropy are differentiable with respect to the class proportions, and thus technically more amenable to numerical optimization. In fact, we will see later that the Gini index is taken as the default tree-growing criterion in R.
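The three measures are simple functions of the class proportions. Below is a minimal base-R sketch (function names are mine, for illustration only) that evaluates them for a two-class node over a grid of \(\hat{p}_{m1}\) values, reproducing the pattern just described: all three measures vanish for a pure node and peak at \(\hat{p}_{m1}=0.5\).

```r
# Node impurity measures as functions of the vector of class proportions p
class_error <- function(p) 1 - max(p)
gini        <- function(p) sum(p * (1 - p))
entropy     <- function(p) sum(ifelse(p > 0, -p * log2(p), 0))  # convention: 0 * log(0) = 0

# Two-class node: class proportions are (p1, 1 - p1)
p1 <- c(0, 0.1, 0.25, 0.5, 0.75, 0.9, 1)
impurity <- t(sapply(p1, function(q) {
  p <- c(q, 1 - q)
  c(error = class_error(p), Gini = gini(p), entropy = entropy(p))
}))
round(cbind(p1, impurity), 3)
# All measures equal 0 at p1 = 0 or 1 (pure node) and are largest at p1 = 0.5
```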

 

Construction of decision trees: Recursive binary splitting

As soon as we decide on an impurity measure, we can construct the first binary split (R will do the construction for us easily) with the goal of minimizing this impurity measure. After the first split, there will be two regions in the predictor space. Next, we seek the best predictor and the corresponding cutoff point for splitting one of the two regions (not the whole predictor space) in order to further minimize the impurity of the tree. Doing so splits the data into three regions in the predictor space.

Note that the predictor used in the second split may or may not be the same as that in the first split; there is no restriction that the two splits involve different predictors. In fact, if the same predictor is used, that is a reflection of the important role that predictor plays in explaining the response. The binary splitting process continues recursively (i.e., each split operating on the results of previous splits) until we reach a stopping criterion, e.g., the number of observations in all terminal nodes falls below some pre-specified threshold or there are no more splits that can contribute to a significant reduction in the node impurity. The whole tree-growing process is aptly called recursive binary splitting, which is a greedy algorithm (this term is glossed over in the PA e-learning modules, but is invoked in the model solutions of some past PA exams), meaning that in each step of the tree-building process, we make the split that is currently the best, instead of looking ahead and selecting a split that results in a better tree in a future step.

 

Controlling tree complexity by cost-complexity pruning

An oft-cited downside of decision trees is that although they can be fitted rather efficiently, they easily suffer from overfitting, meaning that they can be easily made too complex in an attempt to capture signals as well as noise in the training data. In the decision tree setting, the complexity (or size) of a tree is measured by the number of splits it has, or equivalently, the number of terminal nodes, analogous to the number of (non-zero) coefficients in a GLM.

With numerous terminal nodes, an overblown tree is difficult to interpret and liable to produce predictions with a high variance. One possible reason for the unstable predictions is that the final splits of a large tree are often based on a very small number of observations, which are vulnerable to noise. This calls for strategies in search of the right level of tree complexity that optimizes the bias-variance trade-off and maximizes predictive performance on independent test data.

So how do we go about “pruning” (meaning to cut the branches of) a large decision tree to get a more compact, interpretable, and predictive tree? One way to do the pruning is to pre-specify an impurity reduction threshold, then start from the root node and make a split only when the reduction in impurity exceeds this threshold. While this intuitive method is capable of reducing tree size, it overlooks the important fact that a “not-so-good split” (one that results in a modest impurity reduction) early on in the tree-building process may be followed by a “good split” (one that results in a more impressive impurity reduction).

Following this short-sighted method, we will never arrive at the good split that comes later. A much better strategy for controlling tree complexity is cost-complexity pruning, a term glossed over by the PA modules but quoted a few times in past PA exams (e.g., Task 6 of the December 7, 2020 exam and Task 2 of the December 2019 exam), and so a relatively in-depth treatment of this concept should be desirable.

Using this methodology, we grow a very large tree and prune it back to get a subtree. To this end, we trade off the goodness of fit of the tree to the training data with the size of the tree using the penalized objective function given by:

relative training error + cp × |T|

i.e., (measure of fit to the training data) + (penalty on tree complexity)

where:

The relative training error is defined as the training error of the tree scaled by (and therefore “relative” to) the training error of the simplest tree, i.e., the tree with no splits. Depending on the type of tree being fitted, it is given by:

For regression trees, relative training error = \(\dfrac{RSS}{TSS}=\dfrac{\sum_{i=1}^{n_{tr}}{(y_i-\hat{y}_{i})^2}}{\sum_{i=1}^{n_{tr}}{(y_i-\bar{y})^2}}\)

For classification trees, relative training error = \(\dfrac{\text{# of misclassifications made by tree}}{\text{# of misclassifications made by using majority class}}\)

with the sums and counts above taken over all terminal nodes of the tree.

cp ∈ [0,1] is the complexity parameter that we specify a priori.

|T| is the size of the tree determined by the number of terminal nodes.
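As a quick numerical illustration (the numbers are made up), consider a regression tree with |T| = 6 terminal nodes whose training RSS is 40 when the TSS is 100. With cp = 0.05, the penalized objective equals

\(\dfrac{40}{100}+0.05\times 6=0.4+0.3=0.7.\)

Collapsing one split would lower the penalty by 0.05 (one fewer terminal node), so a split is worth keeping only if it reduces the relative training error by more than 0.05.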

 

For Exam PA, there is no need to worry about how the penalized objective function is solved numerically for any particular value of cp. The fitting is done efficiently in R behind the scenes. The following conceptual issues are of more importance and may be tested:

  • Formulation: The penalized objective function is strongly reminiscent of the motivations behind regularization and involves the same “loss+penalty” formulation. Like the regularization penalty, the penalty term cp × |T|  is in place to prevent overfitting. The role played by the complexity parameter cp is analogous to the role played by the regularization parameter λ:
    • If cp = 0, then there is no price we have to pay for making the tree complex and the fitted tree is identical to the large, overblown tree without pruning.
      (Note: For classification trees, this large tree can be grown by minimizing either the Gini index (the default) or the entropy as the splitting criterion, which can be easily customized in R. However, the relative training error is defined in terms of the classification error rate, not the Gini index or entropy.)
    • As cp increases from 0 to 1, the penalty term cp × |T| becomes greater and the tree branches are snipped off successively from the bottom up (i.e., starting from the bottom of the large tree, where the splits are based on relatively few observations) to form new, larger terminal nodes. In the framework of the bias-variance trade-off, we can say that the squared bias of the tree predictions will increase, but the variance will drop.
    • If cp = 1, then the penalized objective function is minimized by the trivial tree with the root node only (analogous to the intercept-only linear model). To see this, note that if we make any tree split, the tree size |T| will increase by 1. As cp = 1, the penalty term cp × |T| = |T| will increase by 1 as well.
      Now the relative training error is bounded from above by 1, so no matter how we split, the drop in the relative training error can never compensate for the increase in the penalty. That’s why the fitted tree makes no splits in this case.

By pruning back branches of a complex tree that offer little predictive power, we bring the tree size to a desired level, prevent overfitting, and hopefully improve predictive accuracy.

  • Nested family: Solving the penalized objective function for each cp produces a nested family of fitted decision trees indexed by cp (i.e., each cp gives rise to a certain fitted tree). Here “nested” means that if c1 and c2 are values of the cp parameter such that c1 ≤ c2 , then the tree fitted with c2 (larger value) is the smaller tree and all of its nodes, terminal and non-terminal, form a subset of the nodes of the tree fitted with c1 (smaller value). By varying the cp parameter, we can trace the whole family of trees ranging from the largest, the most complex tree, to the tree with the root node only.
  • Other parameters in play: When minimizing the penalized objective function for a given cp, the search is performed subject to constraints imposed by other tree control parameters, such as the minimum number of observations in a node and the maximum depth. The role played by these external parameters was tested in the December 2019 PA exam.
  • Hyperparameter tuning: Much like the hyperparameters λ and α for regularization, the cp parameter can be selected by k-fold cross-validation. For a given cp value, we use the penalized objective function to fit a tree on all but one of the k folds, generate predictions on the held-out fold, and measure the prediction performance, reflected by the RSS for a regression tree and the classification error rate for a classification tree. We repeat this process with each of the k folds left out in turn and compute the overall performance. The cp value that gives rise to the best performance (i.e., the lowest RSS or classification error rate) is then chosen; the R sketch following this list illustrates the workflow.
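The following is a schematic R sketch of this workflow using the rpart package (the data frame dat and target y are placeholders, and the control settings are illustrative only): we grow a large tree with a small cp, inspect the cross-validation results, and prune back to the cp value with the lowest cross-validated error.

```r
library(rpart)

set.seed(2024)
# Grow a deliberately large tree (small cp) with 10-fold cross-validation
fit <- rpart(y ~ ., data = dat, method = "anova",   # use method = "class" for a classification tree
             control = rpart.control(cp = 0.001, minbucket = 5,
                                     maxdepth = 10, xval = 10))

printcp(fit)   # cp table: rel error (training) and xerror (cross-validated error)

# Select the cp with the lowest cross-validated error and prune back to it
cp_best <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned  <- prune(fit, cp = cp_best)
```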

 

EXAM NOTE

Don’t think that the description of cross-validation above is useless! Task 2 of the December 2019 exam requires that you “build a tree by using cross validation to set the cp value” and “explain how cross validation selects the cp parameter.”

 

Exercise

(SOA Exam SRM Sample Question 25 (Adapted): Facts about splitting and pruning) Determine which of the following statements concerning decision tree pruning is/are true.

I. A tree with more splits tends to have lower variance.
II. When using the cost-complexity pruning method, cp = 0 results in a very large tree.
III. Increasing the value of cp tends to increase the training error of the fitted tree.

(A) None

(B) I and II only

(C) I and III only

(D) II and III only

(E) The correct answer is not given by (A), (B), (C), or (D).

Solution

I. False. A tree with more splits is more complex and tends to have higher variance.

II. True. When cp = 0, only the relative training error is minimized. The tree obtained is the biggest tree with no pruning.

III. True. As cp increases, the tree becomes less flexible and captures the patterns in the training data less effectively, with a higher training error. (Answer: (D))

 

GLMs vs. Decision Trees

Because GLMs, which we studied in Chapters 3 and 4, and decision trees are the only two predictive models covered in Exam PA, it is beneficial, at least from an exam point of view, to compare and contrast these two modeling approaches. In what follows, we show that they are vastly different with respect to the nature of the output, the way they handle continuous predictors, categorical predictors, and interactions, and how they behave in the presence of collinearity.

 

Output

GLMs

A GLM produces a closed-form equation of the form: \(\hat{\mu}=g^{-1}(\hat{\beta}_0+\hat{\beta}_1X_1+…+\hat{\beta}_pX_p)\)

showing how the target mean depends on the feature values, with the coefficients indicating both the magnitude and direction of the effects of the features on the target variable.

Trees

The end product of a decision tree is a set of if-else splitting rules showing how the target mean varies with the feature values; there is no analytic model equation.

 

Numeric predictors

GLMs

In the standard form, a GLM assumes that the effect of a numeric predictor on the target variable is monotonic and assigns it a single regression coefficient, e.g., \(g(\mu)=\beta_0+\beta_1X+…\), where X is a numeric predictor. Non-linear effects can be captured by adding higher-power polynomial terms manually to the model equation, e.g.,

\(g(\mu)=\beta_0+\beta_1X+\beta_2X^2+…+\beta_mX^m+…\)

Trees

Tree splits are made, possibly repeatedly, separating the training observations into distinct groups according to the ordered values of a numeric predictor. The predicted means (for regression trees) or probabilities (for classification trees) in these groups are allowed to behave irregularly as a function of the numeric predictor, depending on the target observations in these groups. Without imposing a monotonic structure on the predicted means or probabilities, trees, by their very nature, excel in handling non-linear relationships.

 

Categorical predictors

GLMs

Before a categorical predictor enters the model equation, it has to be binarized and converted to dummy variables, which are numeric. A coefficient is assigned to the dummy variable for every non-baseline level.

 

Trees

Tree splits are made by directly separating the levels of a categorical predictor into two groups; binarization is not needed.

For instance, consider a categorical predictor X with four levels a, b, c, d. Perhaps to your surprise, there are a total of seven ways to make a tree split based on X:

{a} vs. {b, c, d}, {b} vs. {a, c, d}, {c} vs. {a, b, d}, {d} vs. {a, b, c}, {a, b} vs. {c, d}, {a, c} vs. {b, d}, {a, d} vs. {b, c}.

Note that it is unnecessary and indeed undesirable to binarize X in advance and fit a decision tree to the binarized data. When a tree split is made according to a dummy variable, we are effectively separating our observations into two groups according to one and only one level of the underlying categorical predictor, which is what the first four splits above involve.

The first split, for example, is based on splitting the dummy variable for level “a.” The last three splits, however, are impermissible under binarization: there is more than one level in both groups, so they cannot be accomplished by splitting a single dummy variable.

The binarization of categorical predictors actually imposes unwarranted restrictions on the way tree splits can be made. 
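To see where the count of seven comes from (it is \(2^{4-1}-1\) for four levels), the short base-R sketch below (purely illustrative) enumerates all binary partitions of the levels; the first four isolate a single level (the only splits reachable through dummy variables), while the last three place at least two levels on each side.

```r
levels_x <- c("a", "b", "c", "d")

# Splits that isolate a single level: these are the ones dummy variables can reproduce
single <- lapply(levels_x, function(l) list(left = l, right = setdiff(levels_x, l)))

# Splits with two levels on each side: fix "a" on the left to avoid double counting
double <- lapply(c("b", "c", "d"), function(l) {
  left <- c("a", l)
  list(left = left, right = setdiff(levels_x, left))
})

splits <- c(single, double)
length(splits)  # 7 = 2^(4 - 1) - 1 possible binary splits of a four-level factor
```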

 

Collinearity

GLMs

A GLM is vulnerable to the presence of highly correlated predictors. As we learned, these predictors may substantially inflate the variance of the coefficient estimates (which may vary widely in magnitude and even sign) and of the model predictions. We also lose the ability to interpret the coefficient estimates meaningfully.

 

Trees

Collinearity among the predictors is less of a problem for decision trees and will not derail the tree-fitting algorithm even if perfect collinearity exists. However, when there are two highly correlated predictors, which one to use in a split may be chosen rather randomly. This is analogous to the non-uniqueness of the coefficient estimates of a GLM and can adversely skew the variable importance scores.

 

Pros and Cons of Decision Trees

As you can see from the discussions above, decision trees and GLMs are radically different types of predictive models. Compared to GLMs, decision trees have a number of pros and cons, the most prominent of which are discussed below: (The list below is not exhaustive!)

Merits

Interpretability

So long as there are not too many buckets, trees are easy to interpret and explain to non-technical audiences (recall that the client in a PA project is almost always someone unfamiliar with predictive analytics!) because of the transparent nature of the if/then classification rules, which mirror human decision-making more closely than GLMs do; most people make decisions by a set of rules instead of using an equation!

Trees can also be displayed graphically, which makes them even easier to interpret. Looking at a decision tree, we can easily tell which predictors are the most important, i.e., those that appear in top splits. For some link functions (e.g., inverse link), the coefficients in a GLM can be difficult to interpret.

 

Non-linear relationships

From the way trees handle numeric predictors we discussed before, we know that trees excel in accommodating non-linear relationships by dividing the feature space into distinct regions where the numeric predictors are allowed to affect the target variable in a complex manner. No polynomial terms need to be supplied in advance. In fact, the introduction of higher-power terms will not alter a fitted tree. Suppose, for example, that we are fitting a decision tree using a positive numeric variable X and its square \(X^2\) as features, and one split is according to whether \(X^2<100\) or \(X^2\ge 100\). These two criteria are equivalent to \(X<10\) and \(X\ge 10\), respectively. In other words, splits based on \(X^2\) are equivalent to splits based on X; the presence of \(X^2\) plays no role.

In the same vein, decision trees can take care of skewed variables without the need for variable transformations (e.g., log). In fact, any monotone transformations of numeric features will produce the same tree because the split points are based on the ranks of the feature values, not their actual values. (Transformations on the target variable, however, do affect the resulting tree; see Task 1 of Section 5.3)
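A short simulation makes the invariance tangible. The sketch below (simulated data, illustrative only, assuming the rpart package) fits one tree to a positive predictor and another to its logarithm; the printed cutoffs differ, but the partitions of the observations, and hence the predictions, are identical.

```r
library(rpart)

set.seed(42)
x <- runif(200, min = 1, max = 100)     # positive numeric predictor
y <- ifelse(x < 40, 2, 7) + rnorm(200)  # step-like (non-linear) relationship

fit_raw <- rpart(y ~ x,     data = data.frame(y = y, x = x))
fit_log <- rpart(y ~ log_x, data = data.frame(y = y, log_x = log(x)))

# The two trees split the same observations into the same terminal nodes,
# so their fitted values agree (only the reported cutoff values differ)
all.equal(as.numeric(predict(fit_raw)), as.numeric(predict(fit_log)))
```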

 

Interactions

As we discussed before, trees automatically recognize interactions between variables. Unlike GLMs, there is no need to identify potential interactions before fitting the tree. This, together with the preceding point, shows that feature generation is less of an issue for decision trees.

 

Categorical Variables

Categorical predictors are automatically handled without the need for binarization or selecting a baseline level.

 

Variable Selection

Variables are automatically selected by trees. Variables that do not appear in the tree are filtered out and the most important variables show up at the top of the tree. For GLMs, all variables, including irrelevant ones, are retained, unless we perform stepwise selection or regularization.

 

Missing Data

Decision trees can be easily modified to deal with missing data (not much is said about this point in the PA e-learning modules).

 

Exercise: Merits of decision trees

Determine which of the following considerations may make decision trees preferable to other statistical learning methods.

I. Decision trees are easily interpretable.

II. Decision trees can be displayed graphically.

III. Decision trees tend to be easier to explain to non-technical audiences than generalized linear methods.

(A) None

(B) I and II only

(C) I and III only

(D) II and III only

(E) The correct answer is not given by (A), (B), (C), or (D).

 

Solution

All of the three statements are correct, as discussed above. (Answer: (E))

 

Demerits

Overfitting

Compared to GLMs, decision trees are more prone to overfitting and tend to produce unstable predictions with a high variance, even with pruning. A small change in the training data can lead to a large change in the fitted tree (the cutoff points, for example) and in the tree predictions.

One easy way to see why decision trees tend to have large variance is that the splits are made recursively, conditional on the results of preceding splits, so a bad initial split can screw up the rest of the tree!

 

Numeric Variables

To capture the effects of a numeric variable effectively, we need to make tree splits based on this variable repeatedly. This in turn gives rise to a complex, large-depth tree that is difficult to interpret.

In contrast, a GLM requires only a single coefficient to be estimated for a numeric variable (assuming a monotonic effect).

 

Categorical Variables

Decision trees tend to favor categorical predictors with a large number of levels over those with few levels. Due to the enormous number of ways to split its levels into two groups, it is easier to choose a good split based on a multi-level categorical predictor for the data at hand, even though that predictor is not necessarily a good one. Combining factor levels prior to fitting a decision tree can help remedy this.
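The bias toward many-level factors has a simple combinatorial root: an L-level categorical predictor admits \(2^{L-1}-1\) possible binary splits of its levels, for example

\(2^{4-1}-1=7 \quad\text{versus}\quad 2^{10-1}-1=511,\)

so a 10-level factor offers hundreds of candidate splits and therefore far more opportunities to find one that looks good on the training data by chance alone.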

 

Model Diagnosis

Unlike GLMs, there is a lack of model diagnostic tools for decision trees.

 

Ensemble Trees: Random Forests

Motivation behind ensemble methods

Despite its transparency and ease of interpretability, a decision tree, when used alone, is sensitive to noise and tends to overfit even with pruning. With small perturbations in the training data, the fitted decision tree can be vastly different and the resulting predictions can be highly unstable.

Ensemble methods, the subject of the next two subsections, allow us to hedge against the instability of decision trees and substantially improve their predictive performance in many cases. These general-purpose methods produce multiple base models, instead of relying on a single model, and combine the results of these base models to make predictions. Depending on how the base models are constructed and how their predictions are combined, ensemble models tend to be much more successful in:

  • (Bias) Capturing complex relationships in the data due to the use of multiple base models each working on different parts of the complex relationships.

  • (Variance) Reducing the variability of the predictions produced because the variance of an average is smaller than that of one component, usually by a lot.

Overall, the predictions are often far superior to those of a single base model in terms of bias (accuracy) and/or variance (precision).

The drawbacks of ensemble models, however, are that they are usually computationally prohibitive to implement and difficult to interpret.

 

In Exam PA, we will study two ensemble tree methods (i.e., each base model is a decision tree): random forests and boosting. (You may have studied three methods in Exam SRM, but bagging is a special case of random forests.) Their workings may not be easy to understand on first encounter, so to ease our expositions, we will use schematic diagrams with a small amount of notation.

 

Basic Ideas

Random forests entail generating multiple bootstrapped samples of the training set \(\{(\mathbf{x}_i,y_i)\}_{i=1}^{n_{tr}}\) (each \(\mathbf{x}_i\) is the vector of predictor values associated with \(y_i\)) and fitting base trees in parallel, independently on each of the bootstrapped training samples. By the term “bootstrapped,” we mean that these samples are generated by randomly sampling the original training observations with replacement.
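As a minimal sketch of the resampling step (train is a placeholder training data frame with target y, new_data is a placeholder for new observations, and B is an illustrative number of base trees), bootstrapped samples can be drawn with sample(..., replace = TRUE) and a base rpart tree fitted to each. Note that this illustrates the bagging special case; a random forest additionally samples a subset of predictors at each split.

```r
library(rpart)

set.seed(2024)
B    <- 100          # number of bootstrapped samples / base trees (illustrative)
n_tr <- nrow(train)

base_trees <- lapply(seq_len(B), function(b) {
  idx <- sample(seq_len(n_tr), size = n_tr, replace = TRUE)  # sample rows with replacement
  rpart(y ~ ., data = train[idx, ])                          # fit a base tree to this bootstrap sample
})

# For a regression target, the ensemble prediction averages the base trees' predictions:
# rowMeans(sapply(base_trees, predict, newdata = new_data))
```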

Exam PA does not require a very deep understanding of the workings of bootstrap, so we will content ourselves with a toy example as a simple illustration. Suppose that we have a training set with the following three observations: