Summary of Questions
Question | Answer |
What are the modeling improvements? | Adding an interaction term, factorizing a variable, and using a tree-based model to capture non-linear relationships. |
Describe / Explain … (how X is used). | |
Discuss … (Examples: …) | |
Propose questions for … that will help clarify the business objective. | |
What are the reasons why bias may not always decrease when a new predictor adds degrees of freedom? | |
Rank predictors. | |
Assess using the new target variable in the context of the business problem. | |
The Model Building Process
Stage | Key Element | Key Command | Comment |
Define the Business Problem | | – | Determine whether the business issue is prediction-based or interpretation/insights-based. |
Data Collection | Data Source | – | |
 | Sampling | – | Undersampling keeps all instances of the minority class and samples from the majority class; oversampling keeps all instances of the majority class and samples, with replacement, from the minority class. Both increase the prevalence of the minority class, so the predicted probabilities increase for the minority class and decrease for the majority class. Oversampling must be applied after the data split and only to the training data; otherwise duplicates of minority-class observations appear in both the train and test data (see the sketch after this table). |
 | Granularity | – | |
Exploratory Data Analysis | Bivariate / Univariate | – | – |
Model Construction and Evaluation | Data Split | – | Train/Test Data Split |
 | Weights vs. Offsets | – | |
 | Reducing Dimensionality / Feature Selection | – | Why do feature selection? |
 | k-Fold Cross-Validation | – | |
 | Best Subset Selection | – | – |
 | Stepwise Selection | – | AIC and BIC |
 | Regularization (Shrinkage Methods) | – | Regularization trades off a small increase in bias for a larger reduction in variance. Differences in reducing dimensionality: lasso can shrink coefficients exactly to zero (performing feature selection), while ridge regression only shrinks them toward zero. |
 | Unsupervised Learning | – | |
 | Feature Generation | – | – |
Model Validation | | – | Model Diagnostic Tools: On Test Data: |
Model Maintenance | – | – | – |
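A minimal sketch of the split-then-oversample order described above, using caret's createDataPartition() and upSample(); the data frame `dat` and target `y` are hypothetical names:
library(caret)
set.seed(153)
idx   <- createDataPartition(dat$y, p = 0.75, list = FALSE)  # stratified 75/25 split
train <- dat[idx, ]
test  <- dat[-idx, ]
# Oversample the minority class AFTER the split, and only on the training set,
# so minority-class duplicates cannot appear in both the train and test data.
train_up <- upSample(x = train[, names(train) != "y"], y = train$y, yname = "y")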
Comparison of GLMs
Distribution | Range of Y | Link: g(μ) = βX | Mean: μ | Link Name | Use | Comment |
Gaussian | (-∞, ∞) | \(\beta X = \mu\) | \(\mu = \beta X\) | Identity | Symmetric | |
Gamma / Exponential | (0, ∞) | \(\beta X = -\mu^{-1}\) | \(\mu = -(\beta X)^{-1}\) | Inverse (canonical) | Strictly positive, right skewed, severity | |
Inverse Gaussian | (0, ∞) | \(\beta X = \mu^{-2}\) | \(\mu = (\beta X)^{-1/2}\) | Inverse squared (canonical) | | |
Poisson / Negative Binomial | 0, 1, … | \(\beta X = \ln(\mu)\) | \(\mu = e^{\beta X}\) | Log | Count of occurrences in fixed time/space | |
Bernoulli | {0, 1} | \(\beta X = \ln\left(\frac{\mu}{1-\mu}\right)\) | \(\mu = \frac{e^{\beta X}}{1 + e^{\beta X}}\) | Logit | Outcome of a single yes/no | |
Binomial | {0, 1, …, N} | | | | Count of yes out of N yes/no trials | |
Tweedie | | | | | Premium, loss ratio, aggregate losses | |
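For reference, a minimal sketch of how these families and links map onto R's glm(); the data frame `dat`, target `y`, and predictors x1, x2 are hypothetical:
freq_fit  <- glm(y ~ x1 + x2, data = dat, family = poisson(link = "log"))     # counts
sev_fit   <- glm(y ~ x1 + x2, data = dat, family = Gamma(link = "log"))       # severity; log link often used instead of the canonical inverse
logit_fit <- glm(y ~ x1 + x2, data = dat, family = binomial(link = "logit"))  # single yes/no outcome
# A Tweedie GLM additionally requires a Tweedie family object (e.g., from the statmod/tweedie packages), not shown here.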
Comparison of Models
Supervised Learning
Supervised Learning | GLM | Base Decision Tree | Ensemble Model: Random Forest | Ensemble Model: Boosted Trees |
Focus | – | – | reduce model variance | reduce model bias |
Process | – | Cost Complexity Pruning (see the sketch after this table) | | |
Right-Skewed Data | Poor: strongly fits the right tail of the experience while fitting the majority of observations less well | Poor: splits emphasize the skewed area over the more frequent data, because minimizing squared error puts heavy emphasis on relatively few extreme points | | |
Nonlinear Interactions | Poor | Good | | |
Bias / Predictive Performance | Poor | Fair | Great | Great |
Variance / Robustness | Poor | Great | Fair | Fair |
Interpretability | Fair | Great | Poor | Poor |
Computational Requirements | Good | Fair | Poor | Poor |
Determining Predictions | | | | |
Issues | With high-dimensional categorical variables: | | | |
Advantage | Compared to OLS: | | | |
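A minimal sketch of cost-complexity pruning for a base decision tree with rpart; the data frame `train` and target `y` are hypothetical:
library(rpart)
tree <- rpart(y ~ ., data = train, method = "anova",
              control = rpart.control(cp = 0.001, xval = 10))  # grow a large tree with 10-fold CV
# Prune back to the cp value with the lowest cross-validated error
best_cp <- tree$cptable[which.min(tree$cptable[, "xerror"]), "CP"]
pruned  <- prune(tree, cp = best_cp)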
Unsupervised Learning
Unsupervised Learning | Principal Components Analysis (PCA) | k-Means Clustering | Hierarchical Clustering |
Definition | Summarize high-dimensional data into fewer variables while retaining sufficient information | Find k distinct clusters defined by their centers | Find a hierarchy of unspecified clusters forming a tree structure |
Dataset | high-dimensional | | |
Use | | | |
Standardization | Required | Required. If the features are not standardized, the clustering places almost all of its weight on the feature with the largest values and ignores the others, because that feature accounts for a large proportion of the Euclidean distance. Standardizing the features lets the algorithm place equal weight on each feature when determining where the clusters should be, so standardization is almost always done before running a clustering algorithm (see the sketch after this table). | |
Process | How to develop features; how to pick PCs | Partition the observations into k clusters and search for the optimal partition (elbow method), because it is impossible to examine every possible partition; the search will likely find a local, not a global, optimum | Agglomerative bottom-up (popular) vs. divisive top-down approach; use a dendrogram to visualize the result, based on the vertical-axis height at which branches fuse |
R Commands | prcomp() | kmeans() | hclust(dist()) |
Advantage | Reduce dimension; the components can be used as new variables or to find latent variables | | |
Disadvantage | | | |
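A minimal sketch of the standardize-then-cluster workflow described above; the numeric data frame `dat_num` and the choice of k = 3 are hypothetical:
scaled_data <- scale(dat_num)                        # mean 0, sd 1 for every feature
km <- kmeans(scaled_data, centers = 3, nstart = 20)  # k = 3 chosen only for illustration
table(km$cluster)                                    # cluster sizes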
From Interpretation Aspect
Decision Tree | |
Tuning Parameters
Model | Parameter | Tuning | Interpretation |
Decision Tree | rpart(y ~ ., data, method, control = rpart.control(), parms) | | |
 | method = "anova" | | For regression trees |
 | method = "class" | | For classification trees |
 | control = rpart.control(minsplit, minbucket, cp, xval, maxdepth) | | |
 | minsplit | lower minsplit -> higher complexity | Minimum # of obs that must exist in a node for a split to be attempted |
 | minbucket | lower minbucket -> higher complexity | Minimum # of obs in any terminal node |
 | cp | lower cp -> more splits -> higher complexity -> higher computing time | Complexity parameter: a split is kept only if it improves the fit by at least a factor of cp |
 | maxdepth | higher maxdepth -> higher complexity | Maximum depth of the final tree, with the root node at depth 0; directly controls the interaction depth (e.g., maxdepth = 2 does not capture three-way variable interactions) |
Random Forest | caret::train(x, y, method, ntree, importance, trControl, tuneGrid) | | |
 | ntree | higher ntree -> every obs is more likely to be selected at least once; theoretically a random forest should not overfit as ntree increases | # of trees to train |
 | importance = FALSE | | Whether importance scores of the predictors are computed; by default they are not |
 | trControl = trainControl(method, number, repeats, sampling) | | |
 | method = "cv" | | Cross-validation |
 | method = "repeatedcv" | | Repeated cross-validation |
 | method = "rf" | | Construct a random forest (this method argument belongs to train(), not trainControl()) |
 | number = k | | The # of folds used in the k-fold cross-validation |
 | repeats = c | higher c -> the k-fold cross-validation is repeated more times | Only for method = "repeatedcv": the number of times the cross-validation is repeated |
 | sampling = "up" | | Oversampling |
 | sampling = "down" | | Undersampling |
 | tuneGrid = expand.grid(mtry = r:c) | | |
 | mtry | higher mtry -> less variance reduction, as the base trees become more similar (more correlated) to one another; smaller mtry -> more variance reduction but weaker individual trees | The # of features randomly considered at each split; expand.grid(mtry = r:c) builds a data frame of candidate values r through c. Common defaults are \(\sqrt{p}\) for classification and p/3 for regression |
Boosted Trees | nrounds | higher nrounds -> more boosting iterations -> higher complexity and longer run time | The maximum # of trees to grow, i.e., iterations in the model-fitting process |
 | eta | lower eta -> each tree contributes less, so more rounds are typically needed; helps guard against overfitting | The learning rate or shrinkage parameter applied to the contribution of each tree |
 | maxdepth, min_child_weight, gamma | – | Control the complexity of the underlying trees |
k-Means Clustering | kmeans(data, centers, nstart = 1) | | |
 | centers | – | Specifies k, the number of clusters (or a set of initial cluster centers) |
 | nstart | higher nstart -> more random starts -> better chance of finding the best (lowest total within-cluster SS) solution | Controls the number of random selections of initial cluster centers; only the run with the best result is reported |
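Tying several of these tuning parameters together, a hedged sketch of a cross-validated random forest fit with caret; the data frame `dat`, binary factor target `y`, and the specific grid values are assumptions:
library(caret)
ctrl <- trainControl(
  method = "repeatedcv",  # repeated k-fold cross-validation
  number = 5,             # k = 5 folds
  repeats = 3,            # repeat the 5-fold CV 3 times
  sampling = "up"         # oversample the minority class within each resample
)
set.seed(153)
rf_fit <- train(
  y ~ ., data = dat,
  method = "rf",                       # random forest
  ntree = 200,                         # passed through to randomForest()
  importance = TRUE,                   # compute variable importance scores
  trControl = ctrl,
  tuneGrid = expand.grid(mtry = 2:6)   # candidate # of features per split
)
rf_fit$bestTune  # mtry chosen by cross-validation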
Exploratory Data Analysis
Structured Data vs. Unstructured Data
 | Structured Data | Unstructured Data |
Pros | | |
Cons | | |
Summary of Models
Principal Components Analysis (PCA)
 | Relative Sign | Possible Cause | Outcome |
Correlation | \(\sigma_{ij}\) -> 1 | – | |
 | \(\sigma_{ij}\) -> 0 | – | |
 | \(\sigma_{ij}\) -> -1 | – | |
PCm Loading | \(\phi_{jm}\) -> 1 | – | |
 | \(\phi_{jm}\) -> 0 | – | |
 | \(\phi_{jm}\) -> -1 | – | |
 | \(\phi_{im}\) > 0, \(\phi_{jm}\) < 0 | – | |
PCm PVE | High | | |
 | No sharp elbow | | The cumulative PVE rises only modestly quickly and the scree plot does not have a sharp elbow, suggesting that substituting these components for the underlying variables will not produce much dimension reduction, not enough to offset the considerable interpretation difficulties its use would create. |
Statistics
Category | R Code | Interpretation |
Correlation Matrix |
Distance Duration Age Others Cost Distance 1.0000 0.4854 0.0017 -0.0002 0.5751 Duration 0.4854 1.0000 0.0181 0.0279 0.5510 Age 0.0017 0.0181 1.0000 -0.1202 -0.0011 Others -0.0002 0.0279 -0.1202 1.0000 0.0990 Cost 0.5751 0.5510 -0.0011 0.0990 1.0000 |
Categorical Predictors: should be transformed into dummy variables; otherwise unknown interdependencies involving the categorical predictors can cause modeling issues. Collinearity: the near 0.50 correlation between Distance and Duration (0.4854) suggests potential collinearity between these predictors. |
PCA: PC Loadings |
> pca <- prcomp(data, center = TRUE, scale. = TRUE) > pca$rotation PC1 PC2 PC3 PC4 PC5 Distance 0.56963795 -0.06517987 -0.10468926 -0.63334867 0.5090912 Duration 0.56171590 -0.05285474 -0.02230802 0.76444146 0.3111482 Age 0.00201430 -0.70275792 0.71051797 -0.03403320 0.0115414 Others 0.06876669 0.70568844 0.69442066 -0.02990823 0.1189976 Cost 0.59603267 0.03306191 0.03855751 -0.11156140 -0.7935486 |
Collinearity: that Distance and Duration have the same signs in the first three principal components highlights their collinearity in the context of the other variables.
Exclude Target Variable: the target variable should be excluded from the PCA (a data quality issue: target leakage). |
PCA: PVE |
> summary(pca) Importance of components: PC1 PC2 PC3 PC4 PC5 Standard deviation 1.4423 1.0594 0.9390 0.7180 0.63259 Proportion of Variance 0.4161 0.2245 0.1763 0.1031 0.08003 Cumulative Proportion 0.4161 0.6405 0.8168 0.9200 1.00000 |
Proportion of Variance Explained (PVE): PC1 explains 41.61% of the variance.
Cumulative PVE: the first two PCs together explain only 64.05% of the variance. |
Confusion Matrix |
> confusionMatrix(as.factor(train_pred), as.factor(y), positive = "1") [1] "Train confusion matrix" Confusion Matrix and Statistics Reference Prediction 0 1 0 6246 424 1 104 219 Accuracy : 0.9245 95% CI : (0.9181, 0.9306) No Information Rate : 0.9081 P-Value [Acc > NIR] : 5.584e-07 Kappa : 0.4176 Mcnemar's Test P-Value : < 2.2e-16 Sensitivity : 0.34059 Specificity : 0.98362 Pos Pred Value : 0.67802 Neg Pred Value : 0.93643 Prevalence : 0.09195 Detection Rate : 0.03132 Detection Prevalence : 0.04619 Balanced Accuracy : 0.66211 'Positive' Class : 1 ------------------------------------------------------ > confusionMatrix(as.factor(test_pred), as.factor(y), positive = "1") [1] "Test confusion matrix" Confusion Matrix and Statistics Reference Prediction 0 1 0 2658 204 1 24 110 Accuracy : 0.9239 95% CI : (0.9138, 0.9331) No Information Rate : 0.8952 P-Value [Acc > NIR] : 5.069e-08 Kappa : 0.457 Mcnemar's Test P-Value : < 2.2e-16 Sensitivity : 0.35032 Specificity : 0.99105 Pos Pred Value : 0.82090 Neg Pred Value : 0.92872 Prevalence : 0.10481 Detection Rate : 0.03672 Detection Prevalence : 0.04473 Balanced Accuracy : 0.67068 'Positive' Class : 1 |
Sensitivity on Train Data: the proportion of correctly classified positives is only 34.059% => undesirable model.
Specificity on Train Data: the proportion of correctly classified negatives is 98.362%. Predictability: Sensitivity on Test Data: the proportion of correctly classified positives is only 35.032% => the model misses about 65% of the positive class. Specificity on Test Data: the proportion of correctly classified negatives is 99.105%. |
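A hedged sketch of how the class predictions fed into confusionMatrix() above might be produced from a fitted binary model; the object `model`, the 0.5 cutoff, and the train/test names are assumptions:
library(caret)
cutoff <- 0.5  # classification cutoff; lowering it trades specificity for sensitivity
train_pred <- ifelse(predict(model, newdata = train, type = "response") > cutoff, 1, 0)
test_pred  <- ifelse(predict(model, newdata = test,  type = "response") > cutoff, 1, 0)
confusionMatrix(as.factor(train_pred), as.factor(train$y), positive = "1")
confusionMatrix(as.factor(test_pred),  as.factor(test$y),  positive = "1")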
Plots
Residual Plot
Deviance Residual Plot
Definition
Deviance is the difference between our current model and the saturated model.
\(D = 2\log\left(\dfrac{L_{sat}(\hat{\beta})}{L_{model}(\hat{\beta})}\right) = 2(\ell_{sat}(\hat{\beta})-\ell_{model}(\hat{\beta}))\)
The deviance residual for an individual point is defined as: \(\text{dev}_i = \mathrm{sign}(y_i - \hat{\mu}_i)\sqrt{d_i}\)
- The null deviance is based on the intercept-only model: how different is that intercept-only model from the saturated model?
- The residual deviance is based on your fitted model: how different is your model from the saturated model?
R Command
residuals(object, type = c("deviance", "pearson", "working", "response", "partial"), ...)
- Pearson = (observed - predicted) / sqrt(V(predicted)), where V(predicted) is the variance function applied to the predicted value.
- Working = (observed - predicted) * g'(predicted), where g'() is the first derivative of the link function.
- Response = observed - predicted
- Deviance = s*sqrt(2*d), where s is 1 if observed > predicted and -1 if observed < predicted, and d is the absolute value of the difference between the log-likelihood of the observed value under the saturated model (which predicts the observation exactly) and its log-likelihood under the fitted model.
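A minimal sketch of extracting deviance residuals and building the residual plot described below; the fitted GLM `model` is a hypothetical name:
dev_res     <- residuals(model, type = "deviance")  # deviance residuals
fitted_vals <- predict(model, type = "response")    # predictions on the response scale
# Deviance residual plot: look for a mean near 0 and roughly constant spread
plot(fitted_vals, dev_res, xlab = "Predicted value", ylab = "Deviance residual")
abline(h = 0, lty = 2)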
Interpretation
y-axis is the deviance residual, and the x-axis is the predicted value.
- Deviance residuals should have a mean close to 0 (unbiased) and nearly constant variance as the predicted values increase (homoscedastic, not heteroscedastic), whether plotted against any predictor or against the fitted values, and they should be roughly normally distributed.
- If the dot sits above 0, the fitted model underestimates that observation:
- Deviance Residual > 0 => sign(y - μ) > 0 => Observed Value > Predicted Value => the prediction is too low (model underestimates).
- If the dot sits below 0, the fitted model overestimates that observation:
- Deviance Residual < 0 => sign(y - μ) < 0 => Observed Value < Predicted Value => the prediction is too high (model overestimates).
Reading: Interpreting Residual Plots to Improve Your Regression (qualtrics.com)
AUC – ROC Curve
Definition
A graphical tool plotting the sensitivity (true positive rate) against the specificity (equivalently, against 1 - specificity, the false positive rate) of a given classifier as the cutoff ranges from 0 to 1.
- ROC is a probability curve and;
- AUC represents the degree or measure of separability.
It tells how much the model is capable of distinguishing between classes.
Plotting
library(pROC)
- Create a roc object:
roc <- roc(response, predictor)
- Retrieve AUC value:
auc(roc)
- Plot the ROC curve:
plot(roc)
Interpretation
The closer the curve is to the top-left corner, the better the predictive ability.
The higher the AUC, the better the model is at predicting 0 classes as 0 and 1 classes as 1 (true positive). i.e., the higher the AUC, the better the model’s performance at distinguishing between the positive and negative classes.
- If AUC = 0.5, the model's predictive performance is no better than random guessing.
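Putting the pROC commands together, a minimal sketch; the objects `test$y` (observed classes) and `test_prob` (predicted probabilities) are hypothetical names:
library(pROC)
roc_obj <- roc(response = test$y, predictor = test_prob)
auc(roc_obj)   # area under the ROC curve
plot(roc_obj)  # a curve closer to the top-left corner means better discrimination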
Random Forest
Variable Importance Plot
Purpose: show which features contribute the most to the model.
Idea: ranks the predictors according to their importance scores.
imp <- varImp(model.rf)
plot(imp)
What are importance scores?
The importance score for a particular predictor is computed by totaling the reductions in error due to splits on that predictor (the drop in node impurity, or incremental node purity: RSS for regression trees and the Gini index for classification trees), averaged over all the trees in the ensemble.
Importance Score | Variable |
How to get | Classification Tree: determined by observing, in the particular model used, how much splitting observations into groups of X using a particular variable helped distinguish whether … |
Similar | equally useful |
Lower | the one with higher importance is much more useful than the one with lower importance |
Interpretation
- The importance scores are automatically scaled so that the most important predictor has a score of 100 and the predictors are sorted in descending order of variable importance. (0 <= importance score <= 100)
- The labels on the y-axis are the variables, and the x-axis shows the importance scores (or incremental node purity).
Partial (Average) Dependence Plot
Purpose: assess how the target is dependent on each feature, i.e., get an understanding of the relationship between features and our target.
Idea: calculate and show the “average” predicted value of the target variable by varying the value of one (or more) input features.
Advantage: show complex (directional) relationships between many variables beyond three dimensions.
Disadvantage: only captures the average dependence.
Interpretation
- By construction, PD(x1) simply equals the model predictions averaged over all the observed values of X2, … , Xp in the training set while keeping the value of X1 fixed at x1, a value of interest. We can then examine the behavior of PD(x1) as a function of x1 with the goal of understanding how X1 affects the target variable.
- For a categorical target variable, if the predictions are on the logit scale, what is shown on the vertical axis of a partial dependence plot is \(\ln\dfrac{\hat{p}}{1-\hat{p}}\), and the plot shows how the log-odds vary with the value of x, with a blue smoothed curve superimposed.
Is the smoothed line recommended? No. If the unsmoothed plot is bimodal, the smoothed plot may show only a unimodal shape, which is misleading.
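A hedged sketch of building a partial dependence plot with the pdp package; the model `model.rf`, the training frame `train`, and the feature name "age" are hypothetical:
library(pdp)
pd <- partial(model.rf, pred.var = "age", train = train)  # average prediction as "age" varies
plotPartial(pd)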
Principal Components Analysis (PCA)
Biplot
 | Description | Interpretation of Vector Position / Direction |
PC Loadings | | |
Correlations | | |
PC Scores / Observations | | |
Interpret the PCs | | |
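A minimal sketch of producing a biplot from the prcomp object fitted in the Statistics section (reusing the `pca` object assumed there):
biplot(pca, scale = 0)  # arrows show the PC loadings (vectors), points show the PC scores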
k-Means Clustering
Elbow plot
Definition
A technique that is used in determining the number of clusters in a data set. The method consists of plotting the explained variation as a function of the number of clusters and picking the elbow of the curve as the number of clusters to use. The same method can be used to choose the number of parameters in other data-driven models, such as the number of principal components to describe a data set.
The basic idea is to:
- specify K = 2 as the initial candidate for the optimal cluster number K, then
- keep increasing K in steps of 1 up to a specified maximum for the estimated optimal cluster number, and finally
- identify the candidate optimal cluster number K corresponding to the plateau.
The optimal cluster number K is identified by the fact that, before reaching K, the cost (total within-cluster variation) decreases rapidly, while beyond K further increases in the number of clusters change the cost very little, so the curve flattens out.
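The proportion of variance explained for each candidate K can be computed as in the sketch below; the resulting data frame, with columns K and bss_tss, is what the plotting code that follows assumes. The standardized data frame `scaled_data` and the range 2:10 are hypothetical:
set.seed(153)
ks <- 2:10
bss_tss <- sapply(ks, function(k) {
  km <- kmeans(scaled_data, centers = k, nstart = 20)
  km$betweenss / km$totss  # % of variance explained by the clustering
})
data <- data.frame(K = ks, bss_tss = bss_tss)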
Plotting
ggplot(data, aes(x = K, y = bss_tss)) +
geom_point() +
geom_line() +
ggtitle("Elbow plot")
Interpretation
- The y-axis is the % of variance explained (between-cluster SS divided by total SS), and the x-axis is the number of clusters K.
- Choose K = k at the elbow, beyond which the gains in the proportion of variation explained appear to be minimal.
Hierarchical Clustering
Dendrogram
 | Comment |
Purpose | |
Height | Clades with different heights are dissimilar; the greater the difference in the height at which branches fuse, the greater the dissimilarity. |
Content | |
Characteristics | Monotonic: all hierarchical clustering algorithms are monotonic, meaning the fusion heights either always increase or always decrease. A summary of the distance matrix: the dendrogram summarizes the distance matrix, so by itself it cannot tell you how many clusters you should have. In general, it is a mistake to use a dendrogram as a tool for determining the number of clusters in the data. |
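A minimal sketch of fitting and cutting a hierarchical clustering; the standardized data frame `scaled_data`, the choice of complete linkage, and k = 4 are hypothetical:
d  <- dist(scaled_data, method = "euclidean")  # pairwise distance matrix
hc <- hclust(d, method = "complete")           # agglomerative (bottom-up) clustering
plot(hc)                                       # dendrogram: branch heights show dissimilarity
clusters <- cutree(hc, k = 4)                  # cut the tree into k = 4 clusters (for illustration)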