Machine Learning Cheatsheet
My github: https://github.com/robinyUArizona
Model Evaluation

Learning curves (training vs. validation loss plots):
- If the validation curve moves noisily around the training curve, it could be that the validation data is scarce and not very representative of the training data, so the model struggles to model these examples.
- If the validation loss is much better than the training loss, the validation dataset is easier to predict than the training dataset. An explanation could be that the validation data is scarce but well represented by the training dataset, so the model performs extremely well on these few examples.

Classification Problems

Confusion Matrix
• The data gives us outcomes ("truth") (y|Y)
• The model makes decisions (d|D) (saving ŷ scores)
Then, we compare decisions (d) to outcomes (y).
Type I error: The null hypothesis H0 is rejected when it is true.
Type II error: The null hypothesis H0 is not rejected when it is false.
→ False positive (Type I error) — incorrectly decide yes
→ False negative (Type II error) — incorrectly decide no

(1) Accuracy: (TP + TN) / (TP + TN + FP + FN) → Ratio of correct predictions over total predictions. Estimate of P[D = Y], the probability that the decision equals the outcome.

(2) Recall or Sensitivity or True Positive Rate: TP / (TP + FN). Completeness of the model. → Out of the total actual positive (1) values, how often the classifier is correct. Probability: P[D = 1 | Y = 1]
Example: "Fraudulent transaction detector" or "Person has cancer" → +ve (1) is "fraud": Optimize for sensitivity, because false positives (FP, normal transactions flagged as possible fraud) are more acceptable than false negatives (FN, fraudulent transactions that are not detected).

(3) Precision: TP / (TP + FP). Exactness of the model. → Out of the total predicted positive (1) values, how often the classifier is correct. Probability: P[Y = 1 | D = 1] — if our model says positive, how likely it is correct in that judgement.
Example: "Spam filter", +ve (1) class is spam → Optimize for precision or specificity, because false negatives (FN, spam goes to the inbox) are more acceptable than false positives (FP, non-spam caught by the spam filter).
Example: "Hotel booking cancelled", +ve (1) class is isCancelled → Optimize for precision or specificity, because false negatives (FN, isCancelled labeled as "not cancelled" 0) are more acceptable than false positives (FP, isNotCancelled labeled as "cancelled" 1).

(4) F1-Score = 2 × (Precision × Recall) / (Precision + Recall) → False positives (FP) and false negatives (FN) are equally important.

(6) False Negative Rate: FN / (TP + FN) = 1 − Recall — Fraction of positives wrongly classified negative. Probability: P[D = 0 | Y = 1]

ROC curve: We can think of the plot as the fraction of correct predictions for the positive class (y-axis) versus the fraction of errors for the negative class (x-axis).

(10) AUC: Area Under the ROC Curve. The points on an ROC curve can be computed with an efficient, sorting-based algorithm; the resulting area is the AUC. AUC ranges in value from 0 to 1 and measures how well the model differentiates positives from negatives (perfect AUC = 1, baseline = 0.5).
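A minimal sketch tying the metric definitions above together; the TP/FP/TN/FN counts are made-up placeholders, not values from the text.

```python
# Hypothetical confusion-matrix counts (placeholders).
TP, FP, TN, FN = 90, 10, 880, 20

accuracy  = (TP + TN) / (TP + TN + FP + FN)      # estimate of P[D = Y]
recall    = TP / (TP + FN)                       # sensitivity / TPR: P[D=1 | Y=1]
precision = TP / (TP + FP)                       # P[Y=1 | D=1]
f1        = 2 * precision * recall / (precision + recall)
fnr       = FN / (TP + FN)                       # 1 - recall: P[D=0 | Y=1]

print(f"acc={accuracy:.3f}  recall={recall:.3f}  precision={precision:.3f}  "
      f"f1={f1:.3f}  fnr={fnr:.3f}")
```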
How to choose the threshold for logistic regression? The choice of a threshold depends on the importance of TPR and FPR for the classification problem. For example: suppose you are building a model to predict customer churn. False negatives (not identifying customers who will churn) might lead to loss of revenue, making TPR crucial. In contrast, falsely predicting churn (false positives) could lead to unnecessary retention efforts, making FPR important. If there is no external concern about low TPR or high FPR, one option is to weight them equally by choosing the threshold that maximizes TPR − FPR.

Precision-Recall curve: Focuses on the correct prediction of the minority class, useful when data is imbalanced. Plot precision against recall at different thresholds.

Variance, R2 and the Sum of Squares
The total sum of squares: SStotal = Σi (yi − ȳ)²
This scales with variance: var(Y) = (1/n) Σi (yi − ȳ)²
The regression sum of squares: SSreg = Σi (ŷi − ȳ)² → n·Var(predictions)
The residual sum of squares (squared error): SSresid = Σi (yi − ŷi)² → n·Var(ϵ)
Note: ϵ̄ = 0, E[ŷ] = ȳ
SStotal = SSreg + SSresid
R² = 1 − SSresid/SStotal = SSreg/SStotal = n·Var(preds) / n·Var(Y) = Var(preds)/Var(Y)
The fraction of variance explained!
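A small NumPy sketch of the decomposition above on made-up data; the identity SStotal = SSreg + SSresid holds here because the line is fit by least squares with an intercept (np.polyfit is NumPy's least-squares polynomial fit).

```python
import numpy as np

# Toy data (made up); fit a simple least-squares line.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

b1, b0 = np.polyfit(x, y, deg=1)          # slope, intercept
y_hat = b0 + b1 * x

ss_total = np.sum((y - y.mean()) ** 2)    # SS_total, scales with var(Y)
ss_reg   = np.sum((y_hat - y.mean()) ** 2)
ss_resid = np.sum((y - y_hat) ** 2)

r2 = 1 - ss_resid / ss_total              # fraction of variance explained
print(ss_total, ss_reg + ss_resid, r2)    # first two agree for an OLS fit
```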
Regression Problems

1. Mean Squared Error: MSE = (1/N) Σi (ŷi − yi)²

2. Root Mean Squared Error: RMSE = sqrt( (1/N) Σ_{i=1}^{N} (ŷi − yi)² )

3. Mean Absolute Error: MAE = (1/n) Σi |yi − ŷi|

4. Sum of Squared Error: SSE = Σi (yi − ŷi)²

5. Total Sum of Squares: SST = Σi (yi − ȳ)²

6. R² Error:
R² = 1 − MSE(model)/MSE(baseline) = 1 − SSE/SST
The proportion of explained y-variability. Negative R² means the model is worse than just predicting the mean. R² is not valid for nonlinear models, as SSregression + SSresidual ≠ SST there.

7. Adjusted R²:
Ra² = 1 − (1 − R²) (n − 1)/(n − k − 1)
which changes only when predictors (features) affect R² above what would be expected by chance.

Optimization
Almost every machine learning method has an optimization algorithm at its core.
→ Hypothesis: The hypothesis is noted hθ and is the model that we choose. For a given input data x(i) the model prediction output is hθ(x(i)).
→ Loss function: L : (z, y) ∈ R × Y ↦ L(z, y) ∈ R takes as inputs the predicted value z corresponding to the real data value y and outputs how different they are. The loss function computes the distance or difference between the current output z of the algorithm and the expected output y.
The common loss functions are summed up in the table below:

Least squared error    Logistic loss          Hinge loss
½ (y − z)²             log(1 + exp(−yz))      max(0, 1 − yz)
Linear Regression      Logistic Regression    SVM

→ Cost function: The cost function J is commonly used to assess the performance of a model, and is defined with the loss function L as follows:
J(θ) = Σ_{i=1}^{m} L(hθ(x(i)), y(i))

Convex & Non-convex
A convex function is one where a line drawn between any two points on the graph lies on or above the graph. It has one minimum. A non-convex function is one where a line drawn between any two points on the graph may intersect other points on the graph. It is characterized as "wavy".
→ When a cost function is non-convex, there is a likelihood that the optimizer finds a local minimum instead of the global minimum, which is typically undesired in machine learning models from an optimization perspective.

Gradient Descent
Gradient Descent is used to find the coefficients of f that minimize a cost function (for example MSE, SSR).
→ Time Complexity: O(kn²) → n is the no. of data points.
Procedure:
1. Initialization: θ = 0 (coefficients set to 0 or random)
2. Calculate the cost: J(θ) = evaluate f(coefficients)
3. Gradient of the cost: ∂J(θ)/∂θj gives the uphill direction
4. Update the coefficients: θj = θj − α ∂J(θ)/∂θj, i.e. we go downhill

Tips:
• Change the learning rate α ("size of jump" at each iteration)
• Plot Cost vs. Time to assess learning rate performance
• Rescale the input variables
• Reduce passes through the training set with SGD
• Average over 10 or more updates to observe the learning trend while using SGD

Batch Gradient Descent sums/averages the cost over all the observations before each update.
Stochastic Gradient Descent applies the parameter update for each observation.
→ Time Complexity: O(km²) → m is the size of the sample selected randomly from the entire data of size n.
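A minimal NumPy sketch of the batch procedure above, fitting a linear model by gradient descent on a made-up dataset; the learning rate and iteration count are arbitrary choices for this toy problem.

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.1, n_iters=2000):
    """Minimize the MSE cost J(theta) = (1/2n) * ||X @ theta - y||^2."""
    n, p = X.shape
    theta = np.zeros(p)                      # 1. initialize coefficients to 0
    for _ in range(n_iters):
        residual = X @ theta - y             # 2. current predictions vs. outcomes
        grad = X.T @ residual / n            # 3. gradient of J (uphill direction)
        theta -= alpha * grad                # 4. step downhill
        # SGD variant: do this update per observation instead of over all n.
    return theta

# Toy data (made up): y ≈ 1 + 2x plus a little noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 5, size=100)
y = 1.0 + 2.0 * x + rng.normal(scale=0.1, size=100)
X = np.column_stack([np.ones_like(x), x])    # column of 1s for the intercept
print(batch_gradient_descent(X, y))          # approximately [1.0, 2.0]
```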
Ordinary Least Squares

Least Squares Regression
We fit linear models:
ŷ = β0 + Σj βj xj
Here, βj is the j-th coefficient and xj is the j-th feature.
Ordinary Least Squares — find the coefficient vector β that minimizes the squared error:
arg min_β Σi (yi − ŷi)²
Goal: the least-squares solution to Xβ = y
Solution: solve the normal equations:
XᵀXβ = Xᵀy → β = (XᵀX)⁻¹ Xᵀy
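A NumPy sketch of the normal-equations solution on synthetic data; the design matrix and true coefficients are made up, and in practice np.linalg.lstsq or a library routine is preferred for numerical stability.

```python
import numpy as np

# Synthetic data: intercept column plus 2 features, made-up true coefficients.
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(50), rng.normal(size=(50, 2))])
beta_true = np.array([0.5, 2.0, -1.0])
y = X @ beta_true + rng.normal(scale=0.1, size=50)

# Solve (X^T X) beta = X^T y, i.e. beta = (X^T X)^{-1} X^T y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)                               # close to [0.5, 2.0, -1.0]
```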
General Optimization:
1. Understand data (features and outcome variables)
2. Define a loss (or gain/utility) function
3. Define a predictive model
4. Search for the parameters that minimize the loss function

Augmented Loss
We can add more things to the loss function:
• Penalize model complexity
• Penalize "strong" beliefs
  – Requires predictive utility to overcome them
Ordinary Least Squares is the case where we find the β that minimizes squared error: arg min_β Σi (yi − ŷi)²
→ Least squares generalizes into minimizing loss functions.
→ This is the heart of machine learning, particularly supervised learning.

Maximum Likelihood Estimation
MLE is used to find the estimators that maximize the likelihood function: L(θ|x) = fθ(x), the density function of the data distribution.
Log Likelihood
Logistic Regression:
P(Y = 1 | X = x) = ŷ = logistic(β0 + Σj βj xj)
The model computes the probability of "yes".

Probability of Observed
What if we want P(Y = yi), regardless of whether yi is 1 or 0?
P(Y = yi | X = xi) = ŷi^yi (1 − ŷi)^(1−yi)
• ŷi is the model's estimate of P(Y = 1 | X = xi)
• yi ∈ {0, 1} is the outcome
• ŷi^yi is ŷi if yi = 1, and 1 if yi = 0 — a multiplicative "if"
log P(Y = yi | X = xi) = yi log ŷi + (1 − yi) log(1 − ŷi)
The model log likelihood is the sum over the training data. Applicable to any model where ŷ = P(Y = 1|x).

Conditioning on Parameters
Fuller definition — condition on the parameters β and write the function:
P(Y = 1 | x, β) = ŷ = m(x, β) = logistic(...)

Likelihood Function
Given data y = ⟨y1, ..., yn⟩, x = ⟨x1, ..., xn⟩ and parameters β̂:
Likelihood(y, x, β) = P(y, x | β) ∝ P(y | x, β) = Πi P(yi | xi, β)
This is weird:
P(y, x | β) ∝ P(y | x, β)
P(y, x | β) = P(y | x, β) P(x | β)
But x is independent of the parameters, so P(x | β) = P(x). And x is fixed, so P(x) is an (unknown) constant.
LogLik(y, x, β) = log Likelihood(y, x, β)
  = log [ P(x) Πi P(yi | xi, β) ]
  = log P(x) + Σi log P(yi | xi, β)

Maximum Likelihood Estimator
L(β | X, y):
arg max_β Σi log P(yi | xi, β)

Likelihood and Posterior
P(θ | y) = P(y | θ) P(θ) / P(y)
→ The logistic model is trained by maximizing the log likelihood of the training data given the model.
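A small NumPy sketch of the Bernoulli log likelihood above for a logistic model; the data and the coefficient vector are made up, and maximizing this quantity over β is exactly what the maximum likelihood estimator does.

```python
import numpy as np

def log_likelihood(beta, X, y):
    """Sum over observations of log P(y_i | x_i, beta) for a logistic model."""
    y_hat = 1.0 / (1.0 + np.exp(-(X @ beta)))            # P(Y=1 | x, beta)
    return np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# Made-up data; the first column of X is the intercept.
X = np.array([[1, 0.5], [1, -1.2], [1, 2.0], [1, 0.1]])
y = np.array([1, 0, 1, 0])
print(log_likelihood(np.array([0.0, 1.0]), X, y))
```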
Linear Algorithms

Regression
→ Regression predicts (or estimates) a continuous variable: dependent variable Y, independent variable(s) X.
→ Compute an estimate ŷ ≈ y:
ŷi = β0 + β1 xi
yi = ŷi + ϵi
Here, β0 is the intercept, β1 is the slope and ϵ are the residuals. The goal is to learn β0, β1 to minimize Σi ϵi² (least squares).
Linearity: A linear equation of k + 1 variables is of the form:
ŷ = β0 + β1 x1 + · · · + βk xk
It is the sum of scalar multiples of the individual variables — a line!
→ Linear models are remarkably capable of transforming many non-linear problems into linear ones.

Linear Regression
ŷi = β0 + β1 xi1 + β2 xi2 + · · · + βp xip + ϵi
ŷi = β0 + Σ_{j=1}^{p} βj xij,  i = 1, ..., n
Here, n is the total no. of observations, yi is the dependent variable, and xij is the explanatory variable for the j-th feature of the i-th observation. β0 is the intercept, usually called the bias coefficient.
→ The dimension of the hyperplane of the regression is its complexity.
Assumptions:
→ Linear models make four key assumptions necessary for inferential validity:
• Linearity — outcome y and predictor X have a linear relationship.
• Independence — observations are independent of each other. Independent variables (features) are not highly correlated with each other → low multicollinearity.
• Normal errors — residuals are normally distributed; check with Q-Q plots. Under violation the line (in the Q-Q plot) may still fit, but p-values and CIs are unreliable.
• Equal variance — residuals have constant variance (called homoskedasticity; the violation is heteroskedasticity); check a scatterplot or regplot of residuals vs. fitted values. A violation means the model is failing to capture a systematic effect.
→ These violations are a problem only for inference, not for prediction.
Variance Inflation Factor: measures the severity of multicollinearity → 1/(1 − Ri²), where Ri² is found by regressing Xi against all other variables (a common VIF cutoff is 10).
Learning: Estimate the coefficients β from the training data using the optimization algorithm Gradient Descent or Ordinary Least Squares.
→ Multicollinearity → correlated predictors. Problem: which coefficient gets the common effect? To solve: loss functions and regularization.
Variations: There are extensions of Linear Regression training called regularization methods that aim to reduce the complexity of the model or to address over-fitting in ML. The regularizer is not dependent on the data. → In relation to the bias-variance trade-off, regularization aims to decrease complexity in a way that significantly reduces variance while only slightly increasing bias.
→ Standardize numeric variables when using regularization to ensure that 0 is a neutral value, so a low coefficient means "little effect when deviating from average". Values, and therefore coefficients, are then on the same scale (# of standard deviations), which properly distributes weight between them.
• Ridge Regression (L2 regularization): OLS is modified to also minimize the squared sum of the coefficients:
Σ_{i=1}^{n} (yi − β0 − Σ_{j=1}^{p} βj xij)² + λ Σ_{j=1}^{p} βj² = RSS + λ Σ_{j=1}^{p} βj²
→ Prevents the weights from getting too large (L2 norm). If lambda is very large it will add too much weight and lead to under-fitting.
λ ∝ 1 / (model variance)
• Lasso Regression (L1 regularization): OLS is modified to also minimize the sum of the absolute values of the coefficients:
Σ_{i=1}^{n} (yi − β0 − Σ_{j=1}^{p} βj xij)² + λ Σ_{j=1}^{p} |βj| = RSS + λ Σ_{j=1}^{p} |βj|
where p is the no. of features (or dimensions) and λ ≥ 0 is a tuning parameter to be determined.
→ Lasso shrinks the less important features' coefficients to zero, thus removing some features altogether. If lambda is a very large value it will make coefficients zero and hence under-fit.
→ L1 is more likely than L2 to shrink coefficients all the way to 0; therefore L1 regularization leads to sparser models.
Data preparation:
- Transform data for a linear relationship (ex: log transform for an exponential relationship)
- Remove noise such as outliers
- Rescale inputs using standardization or normalization
Advantages:
+ Good regression baseline considering simplicity
+ Lasso/Ridge can be used to avoid overfitting
+ Lasso/Ridge permit feature selection in case of collinearity
Usecase examples:
- Product sales prediction according to prices or promotions
- Call-center waiting-time prediction according to the number of complaints and the number of working agents
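A scikit-learn sketch contrasting OLS, Ridge (L2) and Lasso (L1) on synthetic data; the library's `alpha` parameter plays the role of λ above, and the standardization step follows the note about putting coefficients on the same scale. Dataset and alpha values are arbitrary.

```python
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.datasets import make_regression

# Synthetic regression data stands in for e.g. sales vs. prices/promotions.
X, y = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)

models = {
    "ols":   make_pipeline(StandardScaler(), LinearRegression()),
    "ridge": make_pipeline(StandardScaler(), Ridge(alpha=1.0)),   # L2: shrinks coefficients
    "lasso": make_pipeline(StandardScaler(), Lasso(alpha=1.0)),   # L1: can zero them out
}
for name, model in models.items():
    model.fit(X, y)
    coefs = model[-1].coef_
    print(name, "R^2:", round(model.score(X, y), 3), "zero coefs:", int((coefs == 0).sum()))
```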
Logistic Regression
→ In logistic regression, a linear output is converted into a probability between 0 and 1 using the sigmoid or logistic function. It is the go-to method for binary classification.

Log-Odds and Logistic
Odds
The probability of success P(S): 0 ≤ p ≤ 1
→ The odds of success are defined as the ratio of the probability of success over the probability of failure.
The odds of success: Odds(S) = P(S)/P(S^c) = P(S)/(1 − P(S))
Ex: Odds(failure) = x → means x:1 against success
Log Odds or logit:
log Odds(A) = log[ P(A)/(1 − P(A)) ] = log P(A) − log(1 − P(A))
Logistic: the inverse of the logit (logit⁻¹):
logistic(x) = 1/(1 + e⁻ˣ) = eˣ/(eˣ + 1)
(the sigmoid or logistic curve)
→ Odds are another way of representing probabilities.
→ The logistic and logit functions convert between probabilities and log-odds.

Representation: The representation below is an equation with binary output, which actually models the probability of the default class:
p(X) = e^(β0 + β1x1 + ··· + βpxp) / (1 + e^(β0 + β1x1 + ··· + βpxp)) = p(y = 1 | X)
→ Predict a value close to 1 for the default class and close to 0 for the other class.
Equivalently: P(yi = 1 | X) = ŷi = logistic(β0 + Σj βj xij)
→ Assumptions:
- Linear relationship between X and the log-odds of Y
- Observations must be independent of each other
- Low multicollinearity
Learning: Learning the logistic regression coefficients is done by:
→ Minimizing the logistic loss function
arg min_β Σi log(1 + exp(−yi β·xi))
→ Maximizing the log likelihood of the training data given the model
arg max_β Σi log P(yi | xi, β)
P(Y = yi | X = xi) = ŷi^yi (1 − ŷi)^(1−yi)
log P(Y = yi | X = xi) = yi log ŷi + (1 − yi) log(1 − ŷi)
The model log likelihood is the sum over the training data. Applicable to any model where ŷ = P(Y = 1|x).
Note: Coefficients are linearly related to the log-odds, such that a one-unit increase in x1 multiplies the odds by e^β1.

Generalized Linear Models (GLMs):
ŷi = g⁻¹(β0 + β1 xi1 + β2 xi2 + ··· + βp xip)
ŷi = g⁻¹(β0 + Σ_{j=1}^{p} βj xij)
Here, g is a link function:
• Counts: Poisson regression, log link function
• Binary: Logistic regression, logit link function, and g⁻¹ is the logistic function

How to choose the threshold for logistic regression? The choice of a threshold depends on the importance of TPR and FPR for the classification problem. For example, if your classifier will decide which criminal suspects receive a death sentence, false positives are very bad (innocents will be killed!). Thus you would choose a threshold that yields a low FPR while keeping a reasonable TPR (so you actually catch some true criminals). If there is no external concern about low TPR or high FPR, one option is to weight them equally by choosing the threshold that maximizes TPR − FPR.

Data preparation:
- Probability transformation to binary for classification
- Remove noise such as outliers
Advantages:
+ Good classification baseline considering simplicity
+ Possibility to change the cutoff for the precision/recall tradeoff
+ Robust to noise/overfitting with L1/L2 regularization
+ Probability output can be used for ranking
Usecase examples:
- Customer scoring with the probability of purchase
- Classification of loan defaults according to profile
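A hedged scikit-learn sketch of the points above (probability output for ranking, e^β1 as an odds multiplier, and a custom threshold); the synthetic dataset and the 0.7 threshold are arbitrary stand-ins.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Synthetic binary data stands in for e.g. churn or spam labels.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = LogisticRegression(penalty="l2", C=1.0, max_iter=1000).fit(X_tr, y_tr)

proba = clf.predict_proba(X_te)[:, 1]        # P(y=1 | X), usable for ranking
print("AUC:", round(roc_auc_score(y_te, proba), 3))
print("odds multiplier per unit of x1:", np.exp(clf.coef_[0][0]))   # e^{beta_1}

# Custom threshold instead of the default 0.5 (precision/recall trade-off):
y_pred = (proba >= 0.7).astype(int)
```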
Linear Discriminant Analysis
For multiclass classification, LDA is the preferred linear technique.
Representation: The LDA representation consists of statistical properties calculated for each class: the means and the covariance matrix:
µk = (1/nk) Σ_{i ∈ class k} xi        σ² = (1/(n − K)) Σ_{i=1}^{n} (xi − µk)²
LDA assumes Gaussian data and attributes with the same σ². Predictions are made using Bayes' Theorem:
P(y = k | X = x) = P(k) P(x|k) / Σ_{l=1}^{K} P(l) P(x|l)
to obtain a discriminant function (latent variable) for each class k, estimating P(x|k) with a Gaussian distribution:
Dk(x) = x · µk/σ² − µk²/(2σ²) + ln(P(k))
The class with the largest discriminant value is the output class.
Variations:
1. Quadratic DA: each class uses its own variance estimate
2. Regularized DA: regularization is introduced into the variance estimate
Data preparation:
- Review and modify univariate distributions to be Gaussian
- Standardize data to µ = 0, σ = 1 to have the same variance
- Remove noise such as outliers
Advantages:
+ Can be used for dimensionality reduction by keeping the latent variables as new variables
Usecase example:
- Prediction of customer churn

Likelihood and Posterior
P(θ | y) = P(y | θ) P(θ) / P(y)
• P(θ) is the prior
• P(y | θ) is the likelihood — how likely the data is given params θ
• P(y) = ∫ P(y | θ) P(θ) dθ is a scaling factor (constant for fixed y)
• P(θ | y) is the posterior
• Maximizing the likelihood gives the ML estimator
• We can also maximize the posterior (MAP estimator)
• When the prior is constant, they are the same
• With lots of data, they are almost the same

Nonlinear Algorithms
All nonlinear algorithms are non-parametric and more flexible. They are not sensitive to outliers and do not require the data to have any particular shape of distribution.

Naive Bayes Classifier
Naive Bayes is a classification algorithm interested in selecting the best hypothesis h given data d, assuming that the features of each data point are all independent.
Representation: The representation is based on Bayes' Theorem:
P(Y | d) = P(d | Y) × P(Y) / P(d)
max P(Y | d) = max P(d | Y) × P(Y)
Here, the denominator is dropped as it is only for normalization.
Learning: Training is fast because only probabilities need to be calculated:
P(Y) = instances_Y / all instances        P(x | Y) = count(x ∧ Y) / instances_Y
Variations: Gaussian Naive Bayes extends to numerical attributes by assuming a Gaussian distribution. P(x|h) and P(h) are calculated during learning, and the MAP prediction is computed using the Gaussian PDF:
f(x | µ, σ) = (1/√(2πσ²)) exp(−(x − µ)² / (2σ²))
µ(x) = (1/n) Σ_{i=1}^{n} xi        σ = √( (1/n) Σ_{i=1}^{n} (xi − µ(x))² )
Data preparation:
- Change numerical inputs to categorical (binning) or near-Gaussian inputs (remove outliers, log & Box-Cox transforms)
- Other distributions can be used instead of the Gaussian
- A log-transform of the probabilities can avoid floating-point underflow
- Probabilities can be updated as data becomes available
Advantages:
+ Fast, because only simple counts/calculations are needed
+ If the naive assumption holds, it can converge quicker than other models and can be used on smaller training data
+ Good for variables with few categories
Usecase examples:
- Article classification using binary word presence
- Email spam detection using a similar technique
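A from-scratch sketch of Gaussian Naive Bayes following the formulas above, on a tiny made-up dataset; a library implementation (e.g. scikit-learn's GaussianNB) would normally be used instead.

```python
import numpy as np

def fit_gaussian_nb(X, y):
    """Learning = counting and averaging: class prior P(Y) plus per-class mean/std."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = (len(Xc) / len(X), Xc.mean(axis=0), Xc.std(axis=0) + 1e-9)
    return params

def predict(params, x):
    """MAP prediction: argmax_c  log P(c) + sum_j log N(x_j | mu_cj, sigma_cj)."""
    scores = {}
    for c, (prior, mu, sigma) in params.items():
        log_pdf = -0.5 * np.log(2 * np.pi * sigma**2) - (x - mu) ** 2 / (2 * sigma**2)
        scores[c] = np.log(prior) + log_pdf.sum()     # log-probs avoid underflow
    return max(scores, key=scores.get)

# Tiny made-up dataset with two numeric features.
X = np.array([[1.0, 2.1], [1.2, 1.9], [3.8, 6.0], [4.1, 5.7]])
y = np.array([0, 0, 1, 1])
params = fit_gaussian_nb(X, y)
print(predict(params, np.array([3.9, 5.9])))          # -> 1
```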
Support Vector Machines
SVM is a go-to method for high performance with little tuning. It compares the extreme values in your dataset.
In SVM, a hyperplane (or decision boundary: wᵀx − b = 0) is selected to separate the points in the input-variable space by their class, with the largest margin. The closest data points (defining the margin) are called the support vectors.
→ The goal of a support vector machine is to find the optimal separating hyperplane which maximizes the margin of the training data.
The soft-margin objective has the form λ∥w∥² + (1/n) Σi max(0, 1 − yi(wᵀxi − b)). The first term is the regularization term, a technique to avoid overfitting by penalizing large coefficients in the solution vector. The second term, the hinge loss, penalizes misclassifications: it measures the error due to misclassification (or data points being closer to the classification boundary than the margin). λ is the regularization coefficient, and its major role is to determine the trade-off between increasing the margin size and ensuring that each xi lies on the correct side of the margin.
→ Kernel: A kernel is a way of computing the dot product of two vectors x and y in some (possibly very high dimensional) feature space, which is why kernel functions are sometimes called a "generalized dot product". The kernel trick is a method of using a linear classifier to solve a non-linear problem by transforming linearly inseparable data into linearly separable data in a higher dimension.
Given a feature mapping ϕ, we define the kernel K as follows:
K(x, z) = ϕ(x)ᵀ ϕ(z)
In practice, the kernel defined by K(x, z) = exp(−∥x − z∥² / (2σ²)) is called the Gaussian (RBF) kernel and is commonly used.
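A scikit-learn sketch of an SVM with the Gaussian (RBF) kernel on a linearly inseparable toy dataset; the C and gamma values are arbitrary illustrative choices, with gamma corresponding to 1/(2σ²) in the kernel written above.

```python
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.datasets import make_moons

# Linearly inseparable toy data; the RBF (Gaussian) kernel handles it.
X, y = make_moons(n_samples=300, noise=0.2, random_state=0)

# C controls the margin / misclassification trade-off; gamma = 1/(2*sigma^2).
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma=0.5))
clf.fit(X, y)

print("support vectors:", clf[-1].support_vectors_.shape[0])
print("training accuracy:", round(clf.score(X, y), 3))
```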
K-Means Clustering
Procedure:
1. Divide the data into K clusters or groups.
2. Randomly select a centroid for each of these K clusters.
3. Assign data points to their closest cluster centroid according to a distance metric (Euclidean / squared Euclidean / Manhattan / cosine).
4. Calculate the centroids of the newly formed clusters.
5. Repeat steps 3 and 4 until the centroids no longer change (convergence).

DBSCAN
→ Two parameters: ε (distance) and minimum points.
→ Three classifications of points:
• Core: has at least the minimum number of points within ε-distance, including itself
• Border: has fewer than the minimum number of points within ε-distance, but can be reached by a cluster
• Outlier: a point that cannot be reached by a cluster
Procedure:
1. Pick a random point that has not been assigned to a cluster or designated as an Outlier. Determine if it is a Core Point. If not, label the point as an Outlier.
2. Once a Core Point has been found, add all directly reachable points to its cluster. Then do neighbor jumps to each reachable point and add them to the cluster. If an Outlier has been added, relabel it as a Border Point.
3. Repeat these steps until all points are assigned to a cluster or labeled as Outliers.
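A scikit-learn sketch running both procedures above on synthetic blobs; K, eps and min_samples are arbitrary values that would need tuning on real data.

```python
import numpy as np
from sklearn.cluster import KMeans, DBSCAN
from sklearn.datasets import make_blobs

# Toy blobs stand in for real data.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=0)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)   # steps 1-5 above
print("K-Means centroids:\n", km.cluster_centers_)

db = DBSCAN(eps=0.5, min_samples=5).fit(X)                    # eps = epsilon-distance
labels = db.labels_                                           # -1 marks outliers/noise
print("DBSCAN clusters:", len(set(labels)) - (1 if -1 in labels else 0),
      "outliers:", int(np.sum(labels == -1)))
```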