
SOA ASA Exam: Predictive Analytics (PA) – Summary

Summary of Questions

Question Answer
What are the modeling improvements? Modeling Improvements: adding an interaction term, factorizing a variable, or using a tree-based model to capture non-linear relationships.
Describe / Explain … (how X is used).
  1. Definition of X
  2. Explain one way how X is used
Discuss …
  1. Definition / Effects
  2. Evaluate the influence on the subject

Examples:

  • discuss the plausibility of the outliers:
    1. What kind of outliers would be implausible? (range)
    2. Evaluate why some extreme values are plausible. (causes)
  • discuss the outliers wrt. the goal to reduce response time below 6 minutes for 90% of calls
    1. Definition of the goal
    2. What kind of outliers does not fit the goal? (those exceeding 6 minutes)
    3. Evaluate whether the outliers help or hinder achieving the goal.
  • discuss the outliers wrt. fitting a GLM that predicts response time.
    1. Effects of the outliers to model fitting
Propose questions for … that will help clarify the business objective.
  • insights: initial hypothesis or intuition that might explain variation in y
  • data preparation: consulting specialists before performing analysis
  • data collection: whether there were any unexpected changes in how the data were collected
What are the reasons why bias may not always decrease when a new predictor (additional degrees of freedom) is added?
  • the new predictor has no predictive power
  • substantial collinearity with existing predictors
  • train/test data not split randomly => good performance on the training data does not imply good performance on the test data.
Rank predictors
  • calculate the reduction in reducible error, i.e., the composite of variance and squared bias (variance + squared bias)
  • Ensembled Trees: Variable Importance (plots), Partial Dependence plots
Assess using the new target variable in the
context of the business problem.
  1. What is the difference between the new and the old target variable?
  2. Using the new target variable, do we need to readjust our goal due to the change of range?

The Model Building Process

Stage Key Element Key Command Comment
Define the Business Problem
  • clarify the business issue
  • testable hypothesis
    • challenges
    • constraints
  • assessable outcome
    • KPIs

1. Determine whether the business issue is prediction-based or interpretation/insights-based.

Data Collection Data Source
  • relevant: include only records of interest
  • unrepresentative data
  • respondent bias: respondents unrepresentative of the population of interest
  • giving additional weight to certain records overrepresents them
  • Reasonableness
  • Consistency
  • Personally Identifiable Information (PII)
  • Variables causing unfair discrimination
  • Target leakage
Sampling
  • Stratified sampling ensures every stratum is represented
  • Unbalanced data? Use oversampling and undersampling

Undersampling: keeps all instances of the minority class and samples from the majority class.

Oversampling: keeps all instances of the majority class and samples with replacement instances of the minority class.

They increase the prevalence of the minority class, so the predicted probabilities increase for the minority class and decrease for the majority class.

Oversampling must be applied after the data split and only to the training data; otherwise, duplicates of observations from the minority class would appear in both the training and test data.
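A minimal sketch of resampling the training split only, using caret's upSample()/downSample(); the data frame train_df and target y are hypothetical:

library(caret)
# oversample the minority class (sampling with replacement)
up_df   <- upSample(x = train_df[, setdiff(names(train_df), "y")],
                    y = train_df$y, yname = "y")
# undersample the majority class
down_df <- downSample(x = train_df[, setdiff(names(train_df), "y")],
                      y = train_df$y, yname = "y")
table(up_df$y); table(down_df$y)   # classes are now balanced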

Granularity
  • how precisely a variable is measured
  • how detailed the information contained
  • High granularity: many levels, few observations at each level, high susceptibility to noise
  • Low granularity: few levels, many observations at each level, low susceptibility to noise
Exploratory Data Analysis Bivariate / Univariate
  • PCA
  • correlation, plots
Model Construction and Evaluation Data Split
  • train/test data split
  • trade-off: allocating more data to training improves the fitted model but leaves less test data, making the assessment of prediction performance less reliable
Train/Test Data Split

  • either randomly according to pre-specified proportions or with special statistical techniques
  • on the basis of time, allocating the older observations to the training set and the more recent observations to the test set.
Weights vs. Offsets
  • weights: attach a higher weight to observations with a larger exposure
  • offsets: the exposure, when serving as an offset, is in direct proportion to the mean of the target variable
  • weights give unequal importance to the observations
  • offsets adjust the prediction for each observation but not its relative importance
  • offsets act as a known coefficient for a parameter, rather than a coefficient to be fitted.
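A minimal sketch of the contrast in R, assuming a hypothetical data frame df with claim counts n_claims, exposure expo, and predictors age and region:

# Offset: log(exposure) enters the linear predictor with a known coefficient of 1
m_offset <- glm(n_claims ~ age + region, data = df,
                family = poisson(link = "log"), offset = log(expo))

# Weight: model the claim frequency (count per unit exposure), weighting rows by exposure
# (quasipoisson avoids warnings about non-integer counts)
m_weight <- glm(n_claims / expo ~ age + region, data = df,
                family = quasipoisson(link = "log"), weights = expo)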
Reducing Dimensionality


Feature Selection

Why do feature selection?

  • reduce model complexity
  • avoid overfitting
  • Similarities between stepwise selection and shrinkage methods? => purpose of feature selection
k-Fold Cross-Validation

  1. divide the available data into k folds for a series of model fitting runs.
  2. in each run, fit the model to the data in all folds but one (the training folds).
  3. calculate a test metric (accuracy, error rate, RMSE, etc.) on the held-out fold.
  4. the average test metric across the k runs is the cross-validation result.
k-Fold Cross-Validation

  • assess how well a model predicts new data
  • applicable to GLMs, random forests, and tuning the hyperparameter eta for boosted trees
    • choosing eta (a sketch follows this list):
      1. prescribe a series of reasonable values for eta.
      2. for each eta, perform cross-validation across the model fitting runs.
      3. average the cross-validation errors for each eta.
      4. choose the eta with the best average test metric.
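A minimal sketch of the eta search using xgboost's built-in cross-validation; x_train, y_train, and the parameter values are hypothetical:

library(xgboost)
dtrain <- xgb.DMatrix(data = x_train, label = y_train)
etas   <- c(0.01, 0.05, 0.1, 0.3)
cv_err <- sapply(etas, function(e) {
  cv <- xgb.cv(params = list(eta = e, max_depth = 6, objective = "reg:squarederror"),
               data = dtrain, nrounds = 200, nfold = 5, verbose = 0)
  min(cv$evaluation_log$test_rmse_mean)   # best average test RMSE for this eta
})
etas[which.min(cv_err)]                   # eta with the best cross-validated metric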

Best Subset Selection:

  • fit a separate model for each possible combination of the available features
  • # of fits: \(2^p\)

Stepwise Selection:

  • decide on a final model based on a specific selection criterion
  • Start: from
    • saturated model (backward selection)
    • intercept-only model (forward selection)
  • Stop: until no improvement as measured by AIC or other performance metric
  • # of fits: \(1 + p(p+1)/2\)
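A minimal sketch of stepwise selection with R's step(), assuming a hypothetical data frame df with target y (step() uses AIC by default; setting k = log(n) switches the criterion to BIC):

full_mod <- glm(y ~ ., data = df, family = gaussian())
back_mod <- step(full_mod, direction = "backward")      # backward: start full, drop terms
null_mod <- glm(y ~ 1, data = df, family = gaussian())
fwd_mod  <- step(null_mod, direction = "forward",
                 scope = formula(full_mod))             # forward: start intercept-only, add terms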

AIC and BIC

  • use all available data
  • does not directly consider how well the models fit new data
  • does not give user direct insight into how well the model generalizes to unseen data
Backward vs. Forward Selection
  • start: backward starts from the full (saturated) model; forward starts from the intercept-only model
  • add/drop: backward drops one feature at a time; forward adds one feature at a time
  • conservative? forward selection tends to be more conservative, typically ending with a simpler model

Regularization (Shrinkage Methods):

  • add a penalty term, a regularization penalty, to shrink the coefficient estimates towards zero, reducing the complexity
    • how: optimize a loss function that includes a penalty parameter
    • should standardize variables; otherwise, the regularization disproportionately shrinks the coefficients of variables measured on a smaller scale relative to those on a larger scale.
  • elastic net: a compromise between ridge regression and lasso regression
  • hyperparameters: selected using cross-validation:
    1. Construct a fine grid of (λ, α).
    2. Compute the cross-validation error for each pair of values, and choose the pair that gives the lowest cross-validation error.
    3. Given the optimal hyperparameters, the penalized regression model is fitted to all of the training data and its prediction accuracy is evaluated on the test data.
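A minimal sketch of this grid search with glmnet's cv.glmnet(); the matrix x, vector y, and the alpha grid are hypothetical, and the folds are fixed so the alphas are comparable:

library(glmnet)
alphas  <- seq(0, 1, by = 0.25)
set.seed(42)
folds   <- sample(rep(1:10, length.out = nrow(x)))         # same folds for every alpha
cv_fits <- lapply(alphas, function(a)
  cv.glmnet(x, y, alpha = a, foldid = folds, standardize = TRUE))
cv_errs <- sapply(cv_fits, function(f) min(f$cvm))         # lowest CV error at each alpha
best    <- cv_fits[[which.min(cv_errs)]]
coef(best, s = "lambda.min")                               # coefficients at the optimal lambda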
Regularization trades off:

  • bias: increase
  • variance: minimize the size of coefficient estimates so that the model has lower variance and is less prone to overfitting.
  • regularization exploits the bias-variance tradeoff: it improves prediction accuracy if the reduction in variance exceeds the added bias.
Lasso vs. Ridge
  • shrink coefficients exactly to 0? lasso: yes (drops features); ridge: no (retains all features)
  • lasso yields a simpler, more interpretable model
  • lasso is more effective at reducing the number of features (dimension reduction)

Differences in reducing dimensionality:

  • LASSO, ridge regression and elastic net consider both the fit of the coefficients in predicting the target and the number of dimensions in the model. PCA reduces the dimensionality without reference to the target variable

Unsupervised Learning:

  • PCA: use PCs (PC scores) in place of the original numeric variables to reduce dimensionality.
  • PCs are difficult to describe in detail (interpret), whereas a lasso model retains the original variables and remains interpretable.
Feature Generation
Model Validation
  • on train set: model diagnostic tools
  • on test set: compare prediction with observation
  • compare the selected model to an existing, baseline model, on the test set.
Model Diagnostic Tools:

  • “Residuals vs Fitted” plot
  • “Normal Q-Q” plot

On Test Data:

  • predicted value vs. observed value
  • baseline model may be intercept-only model
Model Maintenance

Comparison of GLM’s

Distribution / Range of Y / g(μ): βX = … / μ = … / Link Name / Use & Comment

Gaussian: \(Y \in (-\infty, \infty)\); \(\beta X = \mu\); \(\mu = \beta X\); identity link; symmetric data.

Gamma / Exponential: \(Y \in (0, \infty)\); \(\beta X = -\mu^{-1}\); \(\mu = -(\beta X)^{-1}\); (negative) inverse link; strictly positive, right-skewed data, e.g., severity.

Inverse Gaussian: \(Y \in (0, \infty)\); \(\beta X = \mu^{-2}\); \(\mu = (\beta X)^{-1/2}\); inverse squared link
  • like gamma but with a sharper peak and fatter tail
  • hard to interpret

Poisson / Negative Binomial: \(Y \in \{0, 1, \dots\}\); \(\beta X = \ln\mu\); \(\mu = e^{\beta X}\); log link; count of occurrences in a fixed time/space
  • mean = variance (Poisson)
  • multiplicative structure is intuitive for insurance risks
  • guaranteed positive predictions
  • no need to log-transform the target

Bernoulli: \(Y \in \{0, 1\}\); \(\beta X = \ln\frac{\mu}{1-\mu}\); \(\mu = \frac{e^{\beta X}}{1+e^{\beta X}}\); logit link; outcome of a single yes/no trial.

Binomial: \(Y \in \{0, 1, \dots, N\}\); count of yeses out of N yes/no trials.

Tweedie: premium, loss ratio, aggregate losses
  • point mass at zero with the remainder right-skewed
  • equivalent to a Poisson-distributed sum of gammas
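Minimal sketches of fitting these families with glm(); the data frame df and its columns are hypothetical, and the gamma fit uses the common log link rather than the canonical inverse link:

sev  <- glm(loss ~ age + vehicle, data = df,
            family = Gamma(link = "log"))               # severity: positive, right-skewed
freq <- glm(claim_count ~ age + vehicle, data = df,
            family = poisson(link = "log"),
            offset = log(exposure))                     # frequency: counts with an exposure offset
flag <- glm(lapsed ~ age + vehicle, data = df,
            family = binomial(link = "logit"))          # yes/no outcome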

Comparison of Models

Supervised Learning

Supervised Learning GLM Base Decision Tree Ensemble Model
Random Forest Boosted Trees
Focus: Random Forest aims to reduce model variance; Boosted Trees aim to reduce model bias
Process Cost Complexity Pruning:

  1. grow a full tree to decide whether …
  2. reduce the size and the complexity of the tree, including setting the min # of obs per bucket and per node, max depth.
  3. The tradeoff between simplicity of the tree and how well it distinguishes whether …
Right-Skewed Data Poor

strongly fit the right tail of the experience while fitting the majority of observations less well

Poor

splits emphasize the skewed (extreme) region rather than where most of the data lie.

because it puts heavy emphasis on minimizing the squared errors of relatively few extreme points

Nonlinear
Interactions
Poor

  • limited ability to capture complex relationships
    • interaction terms must be specified manually, typically pairwise
Good

  • automatically handles continuous and categorical variables (no need to binarize or choose a base level), missing data, variable selection, and interactions
Bias/Predictive Poor

  • suboptimal initial split impacts subsequent fits; greedy locally optimal algorithm is unlikely to be globally optimal
  • easy to overfit
Fair

  • simulates many different starting points in parallel (one tree per bootstrap sample); unlike GBM, unbalanced data may reduce predictiveness
Great

  • unlike bagging, each iteration is trained sequentially on the previous iterations' errors, moving toward the negative gradient of the loss function
Variance/Robust Poor

  • overfitting even with pruning
  • small change in data causes large change in final estimate (unstable)
  • information gain favors many-level categorical variables
Great

  • less prone to overfitting than gradient boosting machines.
  • the variance of the overall prediction is less than that of each component tree; only one key parameter to tune (mtry ≈ \(\sqrt{p}\); mtry = p is just bagging)
Fair

  • More prone to overfitting than random forests
  • sensitive to hyperparameters
Interpretability Fair

  • some coefficients are easier to interpret than others
Great

  • intuitive, easy to interpret splits and visualize graphs, so long as complexity is low
Poor

  • not as straightforward
  •  relationship between y and x:
    • influential predictors: varImp()
    • directional effects: partial()
Poor

  • similar to random forest, but smaller trees are more interpretable
Computational
Requirements
Good

  • works in a spreadsheet
Fair

  • does not work well with small data set
Poor Poor
Determining Predictions    
  • Classification Tree: counting votes, category with highest votes
  • Regression Tree: average
 
Issues   With high-dimensional categorical variables:

  • overfitting
  • increase tree size
  • sparse data
  • imbalanced class distribution
  • model opacity issue
    • Resolution:
      1. Feature importance
      2. Partial dependence plots
Advantage

Compared to OLS:

  • the target variable is no longer confined to the class of normal random variables
    • OLS cannot limit the range of the target variable (OLS cannot limit to positive target variable for claim costs, or handle binary target variable, but logistic GLM can).
  • flexible relationship between predictors and the target mean
    • in many applications the variance of the target depends on its mean, violating the constant variance assumption of OLS; GLMs accommodate this
    • OLS cannot model non-linearity
 

Unsupervised Learning

Unsupervised Learning Principal Components Analysis (PCA) k-Means Clustering Hierarchical Clustering
Definition Summarize high-dimensional data into fewer variables while retaining sufficient information Find k distinct clusters defined by their centers Find a hierarchy of unspecified clusters forming a tree structure
Dataset high-dimensional
Use
  • Data exploration
  • Feature generation
  • Feature transformation
Standardization Required Required

If we don’t standardize the features, then the clustering places all of its weight on one feature and ignores the other feature because a large proportion of the Euclidean distance is attributed to the feature with larger values.

Standardizing the features lets the algorithm place equal weight on both features when determining where the clusters should be. Hence, standardization is almost always done prior to running a clustering algorithm.

Process How to develop features:

  • Create principal components as predictors in place of the original variables.
  • Each principal component is orthogonal to every other principal component to capture a high proportion of the variance of the original variables.
  • Few components are used in place of many original variables.
  • This substitution, specifically the score produced by the linear combination of original variables each component represents, can reduce dimensionality and improve the predictive power of  the resulting model.

How to pick PCs:

  • Choose PC1 with largest prop. of var, then linearly combine top (absolute valued) loadings with variables.
  • Don’t choose all loadings or else dimension isn’t reduced and interpretation isn’t intuitive.
Partition observations into k clusters and search for a good partition (e.g., using the elbow method), since it is impossible to examine every possible partition; the search will likely find a local, not a global, optimum.
Agglomerative bottom-up (popular) vs. divisive top-down approach:

  • Agglomerative clustering starts by considering each observation as its own cluster, then gradually grouping them with nearby clusters at each stage until you only have one cluster left (i.e., bottom-up)
  • Divisive clustering starts by considering all the observations as a single cluster and then progressively splitting into subclusters recursively (i.e., top-down).

Use dendrogram to visualize based on vertical axis where branches fuse.

R Commands
  • PC loadings: prcomp$rotation
  • PC scores: prcomp$x
  • kmeans(data, centers, nstart = 1)
  • hclust(dist(data))
Advantage Reduce dimension; can be used as new variables or to find latent variables
Disadvantage
  • only applies to numeric variables; categorical variables should be factorized (binarized) first.
  • only worthwhile on data with many variables; if the data set has few variables or variables of low dimensionality, the loss of information outweighs the dimension reduction.
  • Variables must be numeric, as with PCA
  • Curse of dimensionality:
    • Problem: as the number of dimensions increases, the data points become, on average, roughly equidistant from one another -> distances lose meaning and visualization becomes problematic.
    • Solution: Use PCA beforehand.
  • Use if hierarchical structure is expected, or else prefer k-means

From Interpretation Aspect

Decision Tree
  1. Right-skewed target variables: Identify which tree uses untransformed (original) target variable and which tree uses log-transformed target variable
  2. Describe the distribution of the leaves
    • the leaf covers % of the data
    • # leaves have assigned values x
  3. Identify the important predictors
    • in the order of splits
    • times in each split
  4. Relate to the business problem
    • Which tree better distinguishes among the observations?
    • Right-skewed target variables: more density on the right tail leaves -> more insight into the right tail

Tuning Parameters

Parameter Tuning Interpretation
Decision Tree rpart(y ~ ., data, method, control = rpart.control(), parms)
method method = "anova" For regression trees
method = "class" For classification trees
control = rpart.control(minsplit,minbucket, cp, xval, maxdepth)
minsplit lower minsplit -> higher complexity Minimum # obs that must exist in a node for a split to be attempted
minbucket lower minbucket -> higher complexity Minimum # obs in a terminal node
cp lower cp -> more splits -> higher complexity -> higher computing time
  • Complexity parameter; minimum amount of impurity reduction required for a split to be attempted; improvement in relative error.
  • 0 cp ensures most complex.
  • Extract optimal cp by taking one with the minimum xerror.
maxdepth higher maxdepth -> higher complexity Maximum depth of final tree, with root node being 0. Directly controls the interaction depth (ex. 2 does not capture three-way variable interactions).
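A minimal sketch of these controls with rpart, including pruning back to the cp with the smallest cross-validation error; df and y are hypothetical:

library(rpart)
fit <- rpart(y ~ ., data = df, method = "anova",
             control = rpart.control(minsplit = 20, minbucket = 7,
                                     cp = 0.001, maxdepth = 10, xval = 10))
# prune back to the cp with the lowest cross-validation error (xerror)
cp_opt <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned <- prune(fit, cp = cp_opt)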
Random Forest caret::train(x, y, method = "rf", ntree, importance, trControl, tuneGrid)
ntree higher ntree -> each obs selected at least once, theoretically shouldn’t overfit

  • the larger the ntree parameter, the more accurate the predictions
  • n <= 100
# trees to train
importance = FALSE By default, importance scores of the predictors are not computed; set importance = TRUE to compute them
trControl = trainControl(method, number, repeats, sampling)
method method = "cv" Cross-Validation
method = "repeatedcv" Repeated Cross-Validation
method = "rf" Construct a random forest
number = k The # folds used in the k-fold cross-validation
repeats = c higher c -> more times of cross-validation is performed

Only for method = "repeatedcv", controls how many times cross-validation is performed

sampling sampling = "up" Oversampling
sampling = "down" Undersampling
tuneGrid = expand.grid(mtry = r:c)
mtry = r:c higher mtry -> the less variance reduction we get as the base trees will be more similar to one another.

Smaller mtry -> more trees needed to ensure each parameter selected at least once

If mtry is too small, however, each split may have too little freedom in choosing the split variables.

The # of features considered at each split; expand.grid(mtry = r:c) builds a one-column data frame so that each value from r to c is tried.

The number (not proportion) of features used at each split: the default is \(\sqrt{p}\) for classification and p/3 for regression; mtry = p is just bagging.
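A minimal sketch tying these arguments together with caret; the data frame df and the grid values are hypothetical:

library(caret)
ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 3,
                     sampling = "down")               # undersample within each resample
grid <- expand.grid(mtry = 2:6)
rf_fit <- train(y ~ ., data = df, method = "rf",
                trControl = ctrl, tuneGrid = grid,
                ntree = 500, importance = TRUE)
rf_fit$bestTune                                       # mtry with the best cross-validated metric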

Boosting Tree nrounds

nrounds = 1000, large enough so that sufficient trees are grown to capture the signal in the data effectively but not excessively large to avoid overfitting.

Higher nrounds => more trees, possible overfitting

The maximum # of trees to grow or iterations in the model fitting process.

eta

0 < eta < 1, The higher the learning rate, the faster the model will reach optimality and the fewer the number of iterations required, though the resulting model will more likely overfit.

The learning rate or shrinkage parameter applied to the contribution of each tree.

maxdepth, min_child_weight, gamma  Controls the complexity of the underlying trees.
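A minimal sketch of these parameters with xgboost(); x_train, y_train, and the parameter values are hypothetical illustrations, not recommendations:

library(xgboost)
bst <- xgboost(data = x_train, label = y_train,
               nrounds = 1000,          # maximum # of trees / iterations
               eta = 0.01,              # learning rate: smaller -> slower learning, less overfitting
               max_depth = 3,           # limits interaction depth / tree complexity
               min_child_weight = 10,   # larger -> more conservative splits
               gamma = 0,               # minimum loss reduction required to split
               objective = "reg:squarederror", verbose = 0)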
k-means clustering kmeans(data, centers, nstart = 1)
centers

specifies k, the number of clusters used.

nstart

20 < nstart < 50 is recommended to improve the chance of identifying a global optimum.

controls the number of random selection of initial cluster centers and only the round with the best result will be reported.

Exploratory Data Analysis

Structured Data vs. Unstructured Data

Structured Data
  • Pros: easy to access / locate data; easy to compare data
  • Cons: less flexible; there is a limit to the scope of what can be expressed by structured data due to the tabular structure, omitting information that does not easily fit into the table structure

Unstructured Data
  • Pros: more flexible
  • Cons: harder to access

Summary of Models

Principal Components Analysis (PCA)

Relative Sign Possible Cause Outcome
Correlation
\(\sigma_{ij} \to 1\)
  • Variables i and j are strongly positively correlated with one another
  • The strong correlations suggest that PCA may be an effective technique to “compress” the dominated variables into a single measure while retaining most of the information -> the variables can be effectively summarized in 1 dimension
\(\sigma_{ij} \to 0\)
  • Variable i does not have a strong linear relationship with variable j (negligible / moderately weak correlation).
\(\sigma_{ij} \to -1\)
  • Variables i and j are strongly negatively correlated with one another
PCm Loading

\(\phi_{jm} \to 1\)
  • Relative Signs: PCm attaches a large positive weight to variable j -> PCm can be interpreted as a measure of it.
  • Magnitudes: the larger (more positive) the PCm score, the higher the value of variable j
\(\phi_{jm} \to 0\)
  • Relative Signs: PCm attaches little weight to variable j
  • Magnitudes: the PCm score says little about the value of variable j
\(\phi_{jm} \to -1\)
  • Relative Signs: PCm attaches a large negative weight to variable j -> PCm can be interpreted as an inverse measure of it.
  • Magnitudes: the smaller (more negative) the PCm score, the higher the value of variable j
\(\phi_{im} > 0\) and \(\phi_{jm} < 0\)
  • Relative Signs: PCm has a (large) positive loading on variable i and a (smaller) negative loading on variable j -> we may interpret PCm roughly as a contrast between variable i and variable j
  • Magnitudes: an observation with a large positive PCm score is expected to have a relatively large value of variable i and a relatively small value of variable j
PCm PVE
High
  • Large positive correlations among the dominated variables
  • This PC is able to explain most of the variability in the data, which is recommended
No sharp elbow The cumulative PVE rises only modestly and the scree plot lacks a sharp elbow, suggesting that substituting these components for the underlying variables will not produce much dimension reduction; the reduction is not enough to offset the considerable interpretation difficulties its use would create.
  • PCA should be abandoned for this business problem.

Statistics

Category R Code Interpretation
Correlation Matrix
         Distance Duration     Age  Others    Cost
Distance   1.0000   0.4854  0.0017 -0.0002  0.5751
Duration   0.4854   1.0000  0.0181  0.0279  0.5510
Age        0.0017   0.0181  1.0000 -0.1202 -0.0011
Others    -0.0002   0.0279 -0.1202  1.0000  0.0990
Cost       0.5751   0.5510 -0.0011  0.0990  1.0000

Categorical Predictors: should be transformed into dummy variables; otherwise, unknown interdependencies involving categorical predictors may cause modeling issues.

Collinearity: The near 0.50 correlation between Distance and Duration suggests that collinearity may be an issue when including these as predictors.

PCA: PC Loadings
> pca <- prcomp(data, center = TRUE, scale. = TRUE)
> pca$rotation

                PC1         PC2         PC3         PC4        PC5
Distance 0.56963795 -0.06517987 -0.10468926 -0.63334867  0.5090912
Duration 0.56171590 -0.05285474 -0.02230802  0.76444146  0.3111482
Age      0.00201430 -0.70275792  0.71051797 -0.03403320  0.0115414
Others   0.06876669  0.70568844  0.69442066 -0.02990823  0.1189976
Cost     0.59603267  0.03306191  0.03855751 -0.11156140 -0.7935486
Collinearity: That Distance and Duration have the same signs in the first three principal components highlights their collinearity in the context of other variables.

Exclude Target Variable: the target variable should be excluded from PCA (data quality issue: target leakage).

PCA: PVE
> summary(pca)

Importance of components:
                          PC1    PC2    PC3    PC4     PC5
Standard deviation     1.4423 1.0594 0.9390 0.7180 0.63259
Proportion of Variance 0.4161 0.2245 0.1763 0.1031 0.08003
Cumulative Proportion  0.4161 0.6405 0.8168 0.9200 1.00000
Proportion of Variance Explained (PVE): PC1 has explained 41.61% of variance.

Cumulative PVE: PC2 has only explained 64.05% of the variance.

Confusion Matrix
> confusionMatrix(as.factor(train_pred),
                  as.factor(y),
                  positive = "1")
[1] "Train confusion matrix"
Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0 6246  424
         1  104  219
                                          
               Accuracy : 0.9245          
                 95% CI : (0.9181, 0.9306)
    No Information Rate : 0.9081          
    P-Value [Acc > NIR] : 5.584e-07       
                                          
                  Kappa : 0.4176          
                                          
 Mcnemar's Test P-Value : < 2.2e-16       
                                          
            Sensitivity : 0.34059         
            Specificity : 0.98362         
         Pos Pred Value : 0.67802         
         Neg Pred Value : 0.93643         
             Prevalence : 0.09195         
         Detection Rate : 0.03132         
   Detection Prevalence : 0.04619         
      Balanced Accuracy : 0.66211         
                                          
       'Positive' Class : 1   
            
------------------------------------------------------

> confusionMatrix(as.factor(test_pred),
                  as.factor(y),
                  positive = "1")

[1] "Test confusion matrix"
Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0 2658  204
         1   24  110
                                          
               Accuracy : 0.9239          
                 95% CI : (0.9138, 0.9331)
    No Information Rate : 0.8952          
    P-Value [Acc > NIR] : 5.069e-08       
                                          
                  Kappa : 0.457           
                                          
 Mcnemar's Test P-Value : < 2.2e-16       
                                          
            Sensitivity : 0.35032         
            Specificity : 0.99105         
         Pos Pred Value : 0.82090         
         Neg Pred Value : 0.92872         
             Prevalence : 0.10481         
         Detection Rate : 0.03672         
   Detection Prevalence : 0.04473         
      Balanced Accuracy : 0.67068         
                                          
       'Positive' Class : 1
Sensitivity on Train Data: the proportion of correctly classified x is only 34.059%. => undesired model

Specificity on Train Data: the proportion of correctly classified non x is 98.362%.

Predictability:

Sensitivity on Test Data: the proportion of correctly classified x is only 35.032%. => Missing 65% of Y

Specificity on Test Data: the proportion of correctly classified non x is 99.105%.

Plots

Residual Plot

Deviance Residual Plot

Definition

Deviance is the difference between our current model and the saturated model.

\(D = 2\log\left(\dfrac{L_{sat}(\hat{\beta})}{L_{model}(\hat{\beta})}\right) = 2(\ell_{sat}(\hat{\beta})-\ell_{model}(\hat{\beta}))\)

The deviance residual for an individual point is defined as \(dev_i = \operatorname{sign}(y_i - \hat{\mu}_i)\sqrt{d_i}\), where \(d_i\) is observation i's contribution to the deviance.

  • The null deviance is based on the intercept-only model: how different is that intercept-only model from the saturated model?
  • The residual deviance is based on your fitted model: how different is your model from the saturated model?

R Command

residuals(object, type = c("deviance", "pearson", "working", "response", "partial"), ...)
  • Pearson = (observed – predicted) / sqrt(V(predicted)), where V(predicted) is the variance function applied to the predicted value.
  • Working = (observed – predicted) * g'(predicted), where g'() is the first derivative of the link function.
  • Response = observed – predicted
  • Deviance = s*sqrt(2*d), where s is 1 if observed > predicted and -1 if observed < predicted and d is the absolute value of the difference between the natural log of the probability of obtaining the observed value when the model predicts the observation exactly and the natural log of the probability of obtaining the observed value using the fitted model.
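A minimal sketch of a deviance residual plot, assuming a hypothetical fitted GLM object fit:

plot(fitted(fit), residuals(fit, type = "deviance"),
     xlab = "Fitted value", ylab = "Deviance residual")
abline(h = 0, lty = 2)   # residuals should scatter evenly around 0 with roughly constant spread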

Interpretation

y-axis is the deviance residual, and the x-axis is the predicted value.

  • Deviance residuals should have mean close to 0 (unbiased) and nearly constant variance as the predictive values increase (homoscedastic, or not heteroscedastic), when plotted against any predictor, or against fitted values. (follow normal distribution)
  • If the dot sits above 0, then the fitted model underestimates
    • Deviance Residual > 0 => sign > 0 => Observed Value > Predicted Value => prediction is too low, the model underestimates.
  • If the dot sits below 0, then the fitted model overestimates
    • Deviance Residual < 0 => sign < 0 => Observed Value < Predicted Value => prediction is too high, the model overestimates.

Reading: Interpreting Residual Plots to Improve Your Regression (qualtrics.com)

AUC – ROC Curve

Definition

A graphical tool plotting the sensitivity (true positive rate) against 1 − specificity (false positive rate) of a given classifier for each cutoff ranging from 0 to 1.

  • ROC is a probability curve and;
  • AUC represents the degree or measure of separability.

It tells how much the model is capable of distinguishing between classes.

Plotting

  • library(pROC)
  • Create a roc object: roc <- roc(response, predictor)
  • Retrieve AUC value: auc(roc)
  • Plot the ROC curve: plot(roc)
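A minimal runnable version of the steps above; y_test and p_hat are hypothetical:

library(pROC)
roc_obj <- roc(response = y_test, predictor = p_hat)
auc(roc_obj)        # area under the ROC curve
plot(roc_obj)       # curve hugging the top-left corner = better predictive ability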

Interpretation

The closer the curve is to the top-left corner, the better the predictive ability.

The higher the AUC, the better the model is at predicting 0 classes as 0 and 1 classes as 1 (true positive). i.e., the higher the AUC, the better the model’s performance at distinguishing between the positive and negative classes.

  • If AUC = 0.5, it means the predictive performance is just random guess.

Random Forest

Variable Importance Plot

Purpose: show which features contribute the most to the model.

Idea: ranks the predictors according to their importance scores.

imp <- varImp(model.rf)
plot(imp)

What are importance scores?

The importance score for a particular predictor is computed by the reductions in error (totaling the drop in node impurity, or incremental node purity) (RSS for regression trees and Gini index for classification trees) due to that predictor, averaged over all the trees in the ensemble tree.

Importance Score Variable
How to get Classification Tree:

determined by observing, in the particular model used, how much the splits made on a particular variable improved the ability to distinguish whether…

Similar scores: the variables are roughly equally useful
Much lower score: the variable with the higher importance score is much more useful than the one with the lower score

Interpretation

  • The importance scores are automatically scaled so that the most important predictor has a score of 100 and the predictors are sorted in descending order of variable importance. (0 <= importance score <= 100)
  • The labels on the y-axis are the variables, and the x-axis shows the importance scores (or incremental node purity)

Partial (Average) Dependence Plot

Purpose: assess how the target is dependent on each feature, i.e., get an understanding of the relationship between features and our target.

Idea: calculate and show the “average” predicted value of the target variable by varying the value of one (or more) input features.

Advantage: show complex (directional) relationships between many variables beyond three dimensions.

Disadvantage: only captures the average dependence.

Interpretation

  • By construction, PD(x1) simply equals the model predictions averaged over all the observed values of X2, … , Xp in the training set while keeping the value of X1 fixed at x1, a value of interest. We can then examine the behavior of PD(x1) as a function of x1 with the goal of understanding how X1 affects the target variable.
  • For a categorical target variable, if the predictions are on the logit scale, what is shown on the vertical axis of a partial dependence plot is \(\ln\dfrac{\hat{p}}{1-\hat{p}}\), and the plot shows how the log-odds vary with the value of x, with a blue smoothed curve superimposed.

Is the smoothed line recommended?

No. If the unsmoothed plot follows a bimodal distribution, the smoothed curve shows only a unimodal shape, which is misleading.
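A minimal sketch using the pdp package (the partial() call referenced earlier is assumed to come from pdp); rf_fit, df, and the predictor "age" are hypothetical:

library(pdp)
pd <- partial(rf_fit, pred.var = "age", train = df)   # average prediction as "age" varies
plotPartial(pd)                                        # or: partial(rf_fit, pred.var = "age", plot = TRUE)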

Principal Components Analysis (PCA)

Biplot

Description Interpretation of Vector Position / Direction
PC Loadings
  • Vi vector arrow ends far left end on the top axis
  • Vj vector arrow ends far left end on the bottom axis
  • Vk vector arrow ends far left end in the middle
  • Variable i has (very) negative PC1 loading, (very) positive PC2 loading
  • Variable j has (very) negative PC2 loading, (very) positive PC1 loading
  • Variable k has (very) negative PC1 loading, almost 0 PC2 loading
Correlations
  • Vi and Vj have similar coordinates, close to each other
  • Vk is far away from Vi and Vj
  • Variable i and Variable j are positively correlated
  • Variable k is less correlated to variables i and j
PC Scores / Observations
  • Oi sits far left end in the middle
    • PC1 score is very negative
    • PC2 score is almost 0
  • If the PC1 loadings are negative, observation i has:
    • a (very) high PC1 effect
    • an average PC2 effect
  • If the PC1 loadings are positive, observation i has:
    • a (very) low PC1 effect
    • an average PC2 effect
             
             
  • Oi sits toward the far left end, near the bottom
    • PC1 score is very negative
    • PC2 score is very negative
  • If the PC1 and PC2 loadings are negative, observation i has:
    • a very high PC1 effect
    • a very high PC2 effect
             
   
             
  • Oi sits near the center
    • PC1 score is close to 0
    • PC2 score is close to 0
  • Observation i has:
    • average PC1 effect
    • average PC2 effect
             
             
  • Oi sits far right end in the middle
    • PC1 score is very positive
    • PC2 score is almost 0
  • If the PC1 loadings are negative, observation i has:
    • a (very) low PC1 effect
    • an average PC2 effect
  • If the PC1 loadings are positive, observation i has:
    • a (very) high PC1 effect
    • an average PC2 effect
             
             
Interpret the PC’s
  • On the right axis, Vi is much closer to zero
  • PC1 is a measurement of variable i
             
             
  • On the top axis, Vi is much closer to zero
  • PC2 is a measurement of variable i
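A minimal sketch of producing a biplot from a fitted PCA; df_num is a hypothetical numeric data frame:

pca <- prcomp(df_num, center = TRUE, scale. = TRUE)
biplot(pca, scale = 0)   # arrows = PC loadings (top/right axes), points = PC scores (bottom/left axes)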

k-Means Clustering

Elbow plot

Definition

A technique that is used in determining the number of clusters in a data set. The method consists of plotting the explained variation as a function of the number of clusters and picking the elbow of the curve as the number of clusters to use. The same method can be used to choose the number of parameters in other data-driven models, such as the number of principal components to describe a data set.

The basic idea is to:

  1. specify K = 2 as the initial candidate for the optimal number of clusters, then
  2. keep increasing K by 1 up to the specified maximum for the potential optimal cluster number, and finally
  3. identify the potential optimal cluster number K corresponding to the plateau.

The optimal cluster number K is identified by the fact that before reaching K, the plotted measure improves rapidly (the cost drops, or the proportion of variation explained rises), while beyond K further improvement is marginal and the curve flattens out.

Plotting

  • ggplot(data, aes(x = K, y = bss_tss)) +
    geom_point() +
    geom_line() +
    ggtitle("Elbow plot")
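A minimal sketch of computing the bss_tss values plotted above; df_scaled is a hypothetical standardized numeric data frame:

bss_tss <- sapply(1:10, function(k) {
  km <- kmeans(df_scaled, centers = k, nstart = 25)
  km$betweenss / km$totss        # proportion of variance explained by the clustering
})
elbow_df <- data.frame(K = 1:10, bss_tss = bss_tss)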

Interpretation

  • the y-axis is the % of variance explained (here, between-cluster SS divided by total SS)
  • choose K = k, beyond which the gains in the proportion of variation explained appear to be minimal.

Hierarchical Clustering

Dendrogram

Comment
Purpose
  • Shows the relationships, or closeness, between similar sets of data.
Height
  • The height at which any two clusters are joined together measures their dissimilarity:
    • joined at a low height: the two clades are similar
    • joined at a greater height: the two clades are dissimilar
  • The greater the fusion height, the greater the dissimilarity.

Content
  • Each group (or “node”) links to two or more successor groups.
  • The groups are nested and organized as a tree.
Characteristics Monotonic:

All hierarchical clustering algorithms are monotonic: the heights at which clusters merge (or split) consistently increase (agglomerative) or decrease (divisive) as the algorithm proceeds.

A summary of the distance matrix:

cannot tell you how many clusters you should have

Cannot tell how many clusters:

In general, it is a mistake to use dendrograms as a tool for determining the number of clusters in data.
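A minimal sketch of fitting and cutting a hierarchical clustering; df_scaled is a hypothetical standardized numeric data frame:

hc <- hclust(dist(df_scaled), method = "complete")   # complete linkage on Euclidean distances
plot(hc)                                             # dendrogram
clusters <- cutree(hc, k = 3)                        # cut the tree into 3 clusters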