Summary of Questions
Question | Answer |
What are the modeling improvements? | Adding an interaction term, factorizing a variable, and using a tree-based model to capture non-linear relationships. |
Describe / Explain … (how X is used). | |
Discuss … (Examples: …) | |
Propose questions for … that will help clarify the business objective. | |
What are the reasons why bias may not always decrease when a new predictor adds degrees of freedom? | |
Rank predictors. | |
Assess using the new target variable in the context of the business problem. | |
The Model Building Process
Stage | Key Element | Key Command | Comment |
Define the Business Problem | | – | Determine whether the business issue is prediction-based or interpretation/insights-based. |
Data Collection | Data Source | – | |
 | Sampling | – | Undersampling keeps all instances of the minority class and samples from the majority class; oversampling keeps all instances of the majority class and samples, with replacement, from the minority class. Both increase the prevalence of the minority class, so the predicted probabilities increase for the minority class and decrease for the majority class. Oversampling must be applied after the data split and only to the training data; otherwise duplicates of minority-class observations appear in both the train and test data (see the sketch after this table). |
 | Granularity | – | |
Exploratory Data Analysis | Bivariate / Univariate | – | – |
Model Construction and Evaluation | Data Split | – | Train/Test Data Split |
 | Weights vs. Offsets | – | |
 | Reducing Dimensionality / Feature Selection | – | Why do feature selection? |
 | k-Fold Cross-Validation | – | |
 | Best Subset Selection | – | – |
 | Stepwise Selection | – | AIC and BIC |
 | Regularization (Shrinkage Methods) | – | Regularization trades off a small increase in bias for a larger reduction in variance. Differences in reducing dimensionality: lasso can shrink coefficients exactly to zero (performing feature selection), while ridge regression only shrinks them toward zero. |
 | Unsupervised Learning | – | |
 | Feature Generation | – | – |
Model Validation | | – | Model Diagnostic Tools: On Test Data: |
Model Maintenance | – | – | – |
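A minimal sketch of the split-then-oversample order described above, using caret's createDataPartition() and upSample(); the data frame `dat` and target `y` are hypothetical names:
library(caret)
set.seed(153)
idx   <- createDataPartition(dat$y, p = 0.75, list = FALSE)  # stratified 75/25 split
train <- dat[idx, ]
test  <- dat[-idx, ]
# Oversample the minority class AFTER the split, and only on the training set,
# so minority-class duplicates cannot appear in both the train and test data.
train_up <- upSample(x = train[, names(train) != "y"], y = train$y, yname = "y")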
Comparison of GLMs
Distribution | Range of Y | Link: g(μ) = βX | Mean: μ | Link Name | Use | Comment |
Gaussian | (-∞, ∞) | \(\beta X = \mu\) | \(\mu = \beta X\) | Identity | Symmetric | |
Gamma / Exponential | (0, ∞) | \(\beta X = -\mu^{-1}\) | \(\mu = -(\beta X)^{-1}\) | Inverse (canonical) | Strictly positive, right skewed, severity | |
Inverse Gaussian | (0, ∞) | \(\beta X = \mu^{-2}\) | \(\mu = (\beta X)^{-1/2}\) | Inverse squared (canonical) | | |
Poisson / Negative Binomial | 0, 1, … | \(\beta X = \ln(\mu)\) | \(\mu = e^{\beta X}\) | Log | Count of occurrences in fixed time/space | |
Bernoulli | {0, 1} | \(\beta X = \ln\left(\frac{\mu}{1-\mu}\right)\) | \(\mu = \frac{e^{\beta X}}{1 + e^{\beta X}}\) | Logit | Outcome of a single yes/no | |
Binomial | {0, 1, …, N} | | | | Count of yes out of N yes/no trials | |
Tweedie | | | | | Premium, loss ratio, aggregate losses | |
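For reference, a minimal sketch of how these families and links map onto R's glm(); the data frame `dat`, target `y`, and predictors x1, x2 are hypothetical:
freq_fit  <- glm(y ~ x1 + x2, data = dat, family = poisson(link = "log"))     # counts
sev_fit   <- glm(y ~ x1 + x2, data = dat, family = Gamma(link = "log"))       # severity; log link often used instead of the canonical inverse
logit_fit <- glm(y ~ x1 + x2, data = dat, family = binomial(link = "logit"))  # single yes/no outcome
# A Tweedie GLM additionally requires a Tweedie family object (e.g., from the statmod/tweedie packages), not shown here.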
Comparison of Models
Supervised Learning
Supervised Learning | GLM | Base Decision Tree | Ensemble Model: Random Forest | Ensemble Model: Boosted Trees |
Focus | – | – | reduce model variance | reduce model bias |
Process | – | Cost Complexity Pruning (see the sketch after this table) | | |
Right-Skewed Data | Poor: strongly fits the right tail of the experience while fitting the majority of observations less well | Poor: splits emphasize the skewed area over the more frequent data, because minimizing squared error puts heavy emphasis on relatively few extreme points | | |
Nonlinear Interactions | Poor | Good | | |
Bias / Predictive Performance | Poor | Fair | Great | Great |
Variance / Robustness | Poor | Great | Fair | Fair |
Interpretability | Fair | Great | Poor | Poor |
Computational Requirements | Good | Fair | Poor | Poor |
Determining Predictions | | | | |
Issues | With high-dimensional categorical variables: | | | |
Advantage | Compared to OLS: | | | |
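A minimal sketch of cost-complexity pruning for a base decision tree with rpart; the data frame `train` and target `y` are hypothetical:
library(rpart)
tree <- rpart(y ~ ., data = train, method = "anova",
              control = rpart.control(cp = 0.001, xval = 10))  # grow a large tree with 10-fold CV
# Prune back to the cp value with the lowest cross-validated error
best_cp <- tree$cptable[which.min(tree$cptable[, "xerror"]), "CP"]
pruned  <- prune(tree, cp = best_cp)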
Unsupervised Learning
Unsupervised Learning | Principal Components Analysis (PCA) | k-Means Clustering | Hierarchical Clustering |
Definition | Summarize high-dimensional data into fewer variables while retaining sufficient information | Find k distinct clusters defined by their centers | Find a hierarchy of unspecified clusters forming a tree structure |
Dataset | high-dimensional | | |
Use | | | |
Standardization | Required | Required. If the features are not standardized, the clustering places almost all of its weight on the feature with the largest values and ignores the others, because that feature accounts for a large proportion of the Euclidean distance. Standardizing the features lets the algorithm place equal weight on each feature when determining where the clusters should be, so standardization is almost always done before running a clustering algorithm (see the sketch after this table). | |
Process | How to develop features; how to pick PCs | Partition the observations into k clusters and search for the optimal partition (elbow method), because it is impossible to examine every possible partition; the search will likely find a local, not a global, optimum | Agglomerative bottom-up (popular) vs. divisive top-down approach; use a dendrogram to visualize the result, based on the vertical-axis height at which branches fuse |
R Commands | prcomp() | kmeans() | hclust(dist()) |
Advantage | Reduce dimension; the components can be used as new variables or to find latent variables | | |
Disadvantage | | | |
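A minimal sketch of the standardize-then-cluster workflow described above; the numeric data frame `dat_num` and the choice of k = 3 are hypothetical:
scaled_data <- scale(dat_num)                        # mean 0, sd 1 for every feature
km <- kmeans(scaled_data, centers = 3, nstart = 20)  # k = 3 chosen only for illustration
table(km$cluster)                                    # cluster sizes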
From Interpretation Aspect
Decision Tree | |
Tuning Parameters
Model | Parameter | Tuning | Interpretation |
Decision Tree | rpart(y ~ ., data, method, control = rpart.control(), parms) | | |
 | method = "anova" | | For regression trees |
 | method = "class" | | For classification trees |
 | control = rpart.control(minsplit, minbucket, cp, xval, maxdepth) | | |
 | minsplit | lower minsplit -> higher complexity | Minimum # of obs that must exist in a node for a split to be attempted |
 | minbucket | lower minbucket -> higher complexity | Minimum # of obs in any terminal node |
 | cp | lower cp -> more splits -> higher complexity -> higher computing time | Complexity parameter: a split is kept only if it improves the fit by at least a factor of cp |
 | maxdepth | higher maxdepth -> higher complexity | Maximum depth of the final tree, with the root node at depth 0; directly controls the interaction depth (e.g., maxdepth = 2 does not capture three-way variable interactions) |
Random Forest | caret::train(x, y, method, ntree, importance, trControl, tuneGrid) | | |
 | ntree | higher ntree -> every obs is more likely to be selected at least once; theoretically a random forest should not overfit as ntree increases | # of trees to train |
 | importance = FALSE | | Whether importance scores of the predictors are computed; by default they are not |
 | trControl = trainControl(method, number, repeats, sampling) | | |
 | method = "cv" | | Cross-validation |
 | method = "repeatedcv" | | Repeated cross-validation |
 | method = "rf" | | Construct a random forest (this method argument belongs to train(), not trainControl()) |
 | number = k | | The # of folds used in the k-fold cross-validation |
 | repeats = c | higher c -> the k-fold cross-validation is repeated more times | Only for method = "repeatedcv": the number of times the cross-validation is repeated |
 | sampling = "up" | | Oversampling |
 | sampling = "down" | | Undersampling |
 | tuneGrid = expand.grid(mtry = r:c) | | |
 | mtry | higher mtry -> less variance reduction, as the base trees become more similar (more correlated) to one another; smaller mtry -> more variance reduction but weaker individual trees | The # of features randomly considered at each split; expand.grid(mtry = r:c) builds a data frame of candidate values r through c. Common defaults are \(\sqrt{p}\) for classification and p/3 for regression |
Boosted Trees | nrounds | higher nrounds -> more boosting iterations -> higher complexity and longer run time | The maximum # of trees to grow, i.e., iterations in the model-fitting process |
 | eta | lower eta -> each tree contributes less, so more rounds are typically needed; helps guard against overfitting | The learning rate or shrinkage parameter applied to the contribution of each tree |
 | maxdepth, min_child_weight, gamma | – | Control the complexity of the underlying trees |
k-Means Clustering | kmeans(data, centers, nstart = 1) | | |
 | centers | – | Specifies k, the number of clusters (or a set of initial cluster centers) |
 | nstart | higher nstart -> more random starts -> better chance of finding the best (lowest total within-cluster SS) solution | Controls the number of random selections of initial cluster centers; only the run with the best result is reported |
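Tying several of these tuning parameters together, a hedged sketch of a cross-validated random forest fit with caret; the data frame `dat`, binary factor target `y`, and the specific grid values are assumptions:
library(caret)
ctrl <- trainControl(
  method = "repeatedcv",  # repeated k-fold cross-validation
  number = 5,             # k = 5 folds
  repeats = 3,            # repeat the 5-fold CV 3 times
  sampling = "up"         # oversample the minority class within each resample
)
set.seed(153)
rf_fit <- train(
  y ~ ., data = dat,
  method = "rf",                       # random forest
  ntree = 200,                         # passed through to randomForest()
  importance = TRUE,                   # compute variable importance scores
  trControl = ctrl,
  tuneGrid = expand.grid(mtry = 2:6)   # candidate # of features per split
)
rf_fit$bestTune  # mtry chosen by cross-validation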
Exploratory Data Analysis
Structured Data vs. Unstructured Data
 | Structured Data | Unstructured Data |
Pros | | |
Cons | | |
Summary of Models
Principal Components Analysis (PCA)
 | Relative Sign | Possible Cause | Outcome |
Correlation | \(\sigma_{ij}\) -> 1 | – | |
 | \(\sigma_{ij}\) -> 0 | – | |
 | \(\sigma_{ij}\) -> -1 | – | |
PCm Loading | \(\phi_{jm}\) -> 1 | – | |
 | \(\phi_{jm}\) -> 0 | – | |
 | \(\phi_{jm}\) -> -1 | – | |
 | \(\phi_{im}\) > 0, \(\phi_{jm}\) < 0 | – | |
PCm PVE | High | | |
 | No sharp elbow | | The cumulative PVE rises only modestly quickly and the scree plot does not have a sharp elbow, suggesting that substituting these components for the underlying variables will not produce much dimension reduction, not enough to offset the considerable interpretation difficulties its use would create. |
Statistics
Category | R Code | Interpretation |
Correlation Matrix |
Distance Duration Age Others Cost Distance 1.0000 0.4854 0.0017 -0.0002 0.5751 Duration 0.4854 1.0000 0.0181 0.0279 0.5510 Age 0.0017 0.0181 1.0000 -0.1202 -0.0011 Others -0.0002 0.0279 -0.1202 1.0000 0.0990 Cost 0.5751 0.5510 -0.0011 0.0990 1.0000 |
Categorical Predictors: should be transformed into dummy variables; otherwise unknown interdependencies involving the categorical predictors can cause modeling issues. Collinearity: the near 0.50 correlation between Distance and Duration (0.4854) suggests potential collinearity between these predictors. |
PCA: PC Loadings |
> pca <- prcomp(data, center = TRUE, scale. = TRUE) > pca$rotation PC1 PC2 PC3 PC4 PC5 Distance 0.56963795 -0.06517987 -0.10468926 -0.63334867 0.5090912 Duration 0.56171590 -0.05285474 -0.02230802 0.76444146 0.3111482 Age 0.00201430 -0.70275792 0.71051797 -0.03403320 0.0115414 Others 0.06876669 0.70568844 0.69442066 -0.02990823 0.1189976 Cost 0.59603267 0.03306191 0.03855751 -0.11156140 -0.7935486 |
Collinearity: that Distance and Duration have the same signs in the first three principal components highlights their collinearity in the context of the other variables.
Exclude Target Variable: the target variable should be excluded from the PCA (a data quality issue: target leakage). |
PCA: PVE |
> summary(pca) Importance of components: PC1 PC2 PC3 PC4 PC5 Standard deviation 1.4423 1.0594 0.9390 0.7180 0.63259 Proportion of Variance 0.4161 0.2245 0.1763 0.1031 0.08003 Cumulative Proportion 0.4161 0.6405 0.8168 0.9200 1.00000 |
Proportion of Variance Explained (PVE): PC1 explains 41.61% of the variance.
Cumulative PVE: the first two PCs together explain only 64.05% of the variance. |
Confusion Matrix |
> confusionMatrix(as.factor(train_pred), as.factor(y), positive = "1") [1] "Train confusion matrix" Confusion Matrix and Statistics Reference Prediction 0 1 0 6246 424 1 104 219 Accuracy : 0.9245 95% CI : (0.9181, 0.9306) No Information Rate : 0.9081 P-Value [Acc > NIR] : 5.584e-07 Kappa : 0.4176 Mcnemar's Test P-Value : < 2.2e-16 Sensitivity : 0.34059 Specificity : 0.98362 Pos Pred Value : 0.67802 Neg Pred Value : 0.93643 Prevalence : 0.09195 Detection Rate : 0.03132 Detection Prevalence : 0.04619 Balanced Accuracy : 0.66211 'Positive' Class : 1 ------------------------------------------------------ > confusionMatrix(as.factor(test_pred), as.factor(y), positive = "1") [1] "Test confusion matrix" Confusion Matrix and Statistics Reference Prediction 0 1 0 2658 204 1 24 110 Accuracy : 0.9239 95% CI : (0.9138, 0.9331) No Information Rate : 0.8952 P-Value [Acc > NIR] : 5.069e-08 Kappa : 0.457 Mcnemar's Test P-Value : < 2.2e-16 Sensitivity : 0.35032 Specificity : 0.99105 Pos Pred Value : 0.82090 Neg Pred Value : 0.92872 Prevalence : 0.10481 Detection Rate : 0.03672 Detection Prevalence : 0.04473 Balanced Accuracy : 0.67068 'Positive' Class : 1 |
Sensitivity on Train Data: the proportion of correctly classified positives is only 34.059% => undesirable model.
Specificity on Train Data: the proportion of correctly classified negatives is 98.362%. Predictability: Sensitivity on Test Data: the proportion of correctly classified positives is only 35.032% => the model misses about 65% of the positive class. Specificity on Test Data: the proportion of correctly classified negatives is 99.105%. |
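A hedged sketch of how the class predictions fed into confusionMatrix() above might be produced from a fitted binary model; the object `model`, the 0.5 cutoff, and the train/test names are assumptions:
library(caret)
cutoff <- 0.5  # classification cutoff; lowering it trades specificity for sensitivity
train_pred <- ifelse(predict(model, newdata = train, type = "response") > cutoff, 1, 0)
test_pred  <- ifelse(predict(model, newdata = test,  type = "response") > cutoff, 1, 0)
confusionMatrix(as.factor(train_pred), as.factor(train$y), positive = "1")
confusionMatrix(as.factor(test_pred),  as.factor(test$y),  positive = "1")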
Plots
Residual Plot
Deviance Residual Plot
Definition
Deviance is the difference between our current model and the saturated model.
\(D = 2\log\left(\dfrac{L_{sat}(\hat{\beta})}{L_{model}(\hat{\beta})}\right) = 2(\ell_{sat}(\hat{\beta})-\ell_{model}(\hat{\beta}))\)
The deviance residual for an individual point is defined as: \(\text{dev}_i = \mathrm{sign}(y_i - \hat{\mu}_i)\sqrt{d_i}\)
- The null deviance is based on the intercept-only model: how different is that intercept-only model from the saturated model?
- The residual deviance is based on your fitted model: how different is your model from the saturated model?
R Command
residuals(object, type = c("deviance", "pearson", "working", "response", "partial"), ...)
- Pearson = (observed - predicted) / sqrt(V(predicted)), where V(predicted) is the variance function applied to the predicted value.
- Working = (observed - predicted) * g'(predicted), where g'() is the first derivative of the link function.
- Response = observed - predicted
- Deviance = s*sqrt(2*d), where s is 1 if observed > predicted and -1 if observed < predicted, and d is the absolute value of the difference between the log-likelihood of the observed value under the saturated model (which predicts the observation exactly) and its log-likelihood under the fitted model.
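A minimal sketch of extracting deviance residuals and building the residual plot described below; the fitted GLM `model` is a hypothetical name:
dev_res     <- residuals(model, type = "deviance")  # deviance residuals
fitted_vals <- predict(model, type = "response")    # predictions on the response scale
# Deviance residual plot: look for a mean near 0 and roughly constant spread
plot(fitted_vals, dev_res, xlab = "Predicted value", ylab = "Deviance residual")
abline(h = 0, lty = 2)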
Interpretation
y-axis is the deviance residual, and the x-axis is the predicted value.
- Deviance residuals should have a mean close to 0 (unbiased) and nearly constant variance as the predicted values increase (homoscedastic, not heteroscedastic), whether plotted against any predictor or against the fitted values, and they should be roughly normally distributed.
- If the dot sits above 0, the fitted model underestimates that observation:
- Deviance Residual > 0 => sign(y - μ) > 0 => Observed Value > Predicted Value => the prediction is too low (model underestimates).
- If the dot sits below 0, the fitted model overestimates that observation:
- Deviance Residual < 0 => sign(y - μ) < 0 => Observed Value < Predicted Value => the prediction is too high (model overestimates).
Reading: Interpreting Residual Plots to Improve Your Regression (qualtrics.com)
AUC – ROC Curve
Definition
A graphical tool plotting the sensitivity (true positive rate) against the specificity (equivalently, against 1 - specificity, the false positive rate) of a given classifier as the cutoff ranges from 0 to 1.
- ROC is a probability curve and;
- AUC represents the degree or measure of separability.
It tells how much the model is capable of distinguishing between classes.
Plotting
library(pROC)
- Create a roc object:
roc <- roc(response, predictor)
- Retrieve AUC value:
auc(roc)
- Plot the ROC curve:
plot(roc)
Interpretation
The closer the curve is to the top-left corner, the better the predictive ability.
The higher the AUC, the better the model is at predicting 0 classes as 0 and 1 classes as 1 (true positive). i.e., the higher the AUC, the better the model’s performance at distinguishing between the positive and negative classes.
- If AUC = 0.5, the model's predictive performance is no better than random guessing.
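Putting the pROC commands together, a minimal sketch; the objects `test$y` (observed classes) and `test_prob` (predicted probabilities) are hypothetical names:
library(pROC)
roc_obj <- roc(response = test$y, predictor = test_prob)
auc(roc_obj)   # area under the ROC curve
plot(roc_obj)  # a curve closer to the top-left corner means better discrimination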
Random Forest
Variable Importance Plot
Purpose: show which features contribute the most to the model.
Idea: ranks the predictors according to their importance scores.
imp <- varImp(model.rf)
plot(imp)
What are importance scores?
The importance score for a particular predictor is computed by totaling the reductions in error due to splits on that predictor (the drop in node impurity, or incremental node purity: RSS for regression trees and the Gini index for classification trees), averaged over all the trees in the ensemble.
Importance Score | Variable |
How to get | Classification Tree: determined by observing, in the particular model used, how much splitting observations into groups of X using a particular variable helped distinguish whether … |
Similar | equally useful |
Lower | the one with higher importance is much more useful than the one with lower importance |
Interpretation
- The importance scores are automatically scaled so that the most important predictor has a score of 100 and the predictors are sorted in descending order of variable importance. (0 <= importance score <= 100)
- The labels on the y-axis are the variables, and the x-axis shows the importance scores (or incremental node purity).
Partial (Average) Dependence Plot
Purpose: assess how the target is dependent on each feature, i.e., get an understanding of the relationship between features and our target.
Idea: calculate and show the “average” predicted value of the target variable by varying the value of one (or more) input features.
Advantage: show complex (directional) relationships between many variables beyond three dimensions.
Disadvantage: only captures the average dependence.
Interpretation
- By construction, PD(x1) simply equals the model predictions averaged over all the observed values of X2, … , Xp in the training set while keeping the value of X1 fixed at x1, a value of interest. We can then examine the behavior of PD(x1) as a function of x1 with the goal of understanding how X1 affects the target variable.
- For a categorical target variable, if the predictions are on the logit scale, what is shown on the vertical axis of a partial dependence plot is \(\ln\dfrac{\hat{p}}{1-\hat{p}}\), and the plot shows how the log-odds vary with the value of x, with a blue smoothed curve superimposed.
Is the smoothed line recommended? No. If the unsmoothed plot is bimodal, the smoothed plot may show only a unimodal shape, which is misleading.
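A hedged sketch of building a partial dependence plot with the pdp package; the model `model.rf`, the training frame `train`, and the feature name "age" are hypothetical:
library(pdp)
pd <- partial(model.rf, pred.var = "age", train = train)  # average prediction as "age" varies
plotPartial(pd)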
Principal Components Analysis (PCA)
Biplot
 | Description | Interpretation of Vector Position / Direction |
PC Loadings | | |
Correlations | | |
PC Scores / Observations | | |
Interpret the PCs | | |
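A minimal sketch of producing a biplot from the prcomp object fitted in the Statistics section (reusing the `pca` object assumed there):
biplot(pca, scale = 0)  # arrows show the PC loadings (vectors), points show the PC scores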
k-Means Clustering
Elbow plot
Definition
A technique that is used in determining the number of clusters in a data set. The method consists of plotting the explained variation as a function of the number of clusters and picking the elbow of the curve as the number of clusters to use. The same method can be used to choose the number of parameters in other data-driven models, such as the number of principal components to describe a data set.
The basic idea is to:
- specify K = 2 as the initial candidate for the optimal cluster number K, then
- keep increasing K in steps of 1 up to a specified maximum for the estimated optimal cluster number, and finally
- identify the candidate optimal cluster number K corresponding to the plateau.
The optimal cluster number K is identified by the fact that, before reaching K, the cost (total within-cluster variation) decreases rapidly, while beyond K further increases in the number of clusters change the cost very little, so the curve flattens out.
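The proportion of variance explained for each candidate K can be computed as in the sketch below; the resulting data frame, with columns K and bss_tss, is what the plotting code that follows assumes. The standardized data frame `scaled_data` and the range 2:10 are hypothetical:
set.seed(153)
ks <- 2:10
bss_tss <- sapply(ks, function(k) {
  km <- kmeans(scaled_data, centers = k, nstart = 20)
  km$betweenss / km$totss  # % of variance explained by the clustering
})
data <- data.frame(K = ks, bss_tss = bss_tss)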
Plotting
ggplot(data, aes(x = K, y = bss_tss)) +
geom_point() +
geom_line() +
ggtitle("Elbow plot")
Interpretation
- The y-axis is the % of variance explained (between-cluster SS divided by total SS), and the x-axis is the number of clusters K.
- Choose K = k at the elbow, beyond which the gains in the proportion of variation explained appear to be minimal.
Hierarchical Clustering
Dendrogram
 | Comment |
Purpose | |
Height | Clades with different heights are dissimilar; the greater the difference in the height at which branches fuse, the greater the dissimilarity. |
Content | |
Characteristics | Monotonic: all hierarchical clustering algorithms are monotonic, meaning the fusion heights either always increase or always decrease. A summary of the distance matrix: the dendrogram summarizes the distance matrix, so by itself it cannot tell you how many clusters you should have. In general, it is a mistake to use a dendrogram as a tool for determining the number of clusters in the data. |
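A minimal sketch of fitting and cutting a hierarchical clustering; the standardized data frame `scaled_data`, the choice of complete linkage, and k = 4 are hypothetical:
d  <- dist(scaled_data, method = "euclidean")  # pairwise distance matrix
hc <- hclust(d, method = "complete")           # agglomerative (bottom-up) clustering
plot(hc)                                       # dendrogram: branch heights show dissimilarity
clusters <- cutree(hc, k = 4)                  # cut the tree into k = 4 clusters (for illustration)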