
Machine Learning Cheatsheet

© 2024 Robins Yadav - magic starts here
My github: https://github.com/robinyUArizona

Machine Learning General

Definition
We want to learn a target function f that maps input variables X to output variable y, with an error e:

y = f(X) + e

Linear, Non-linear
Different algorithms make different assumptions about the shape and structure of f. Any algorithm can be either:
• Parametric (or Linear): simplify the mapping to a known linear combination form and learn its coefficients.
• Non-parametric (or Non-linear): free to learn any functional form from the training data, while maintaining some ability to generalize.
Note: Linear algorithms are usually simpler, faster and require less data, while non-linear algorithms can be more flexible, more powerful and more performant.

Supervised, Unsupervised
• Supervised learning methods learn to predict outcomes y (y(1), ..., y(m)) from data points X (x(1), ..., x(m)) given that the data is labeled.
• Unsupervised learning methods learn to find the inherent structure or hidden patterns from unlabeled data X (x(1), ..., x(m)).

Type of prediction

            Regression             Classification
Outcome     Continuous             Class
Examples    Linear Regression      Logistic Regression, SVM, Naive Bayes

→ Conditional estimates:
Regression → conditional expectation: E[y|X = x]
Classification → conditional probability: P(Y = y|X = x)

Type of models
→ Discriminative Model: focuses on predicting the labels of the data. A discriminative model is trained by learning parameters that maximize the conditional probability P(Y|X).
→ Generative Model: focuses on explaining how the data was generated. A generative model learns parameters by maximizing the joint probability P(X, Y).

Discriminative model                            Generative model
Learns the decision boundary between classes    Learns the input distribution
Directly estimates P(y|x)                       Estimates P(x|y) to deduce P(y|x) using Bayes' rule
Cannot generate new data                        Can be used to generate new data
Specifically meant for classification tasks     Typically, their purpose is not classification
Logistic Regression, Random Forests, SVM,       Hidden Markov Models, Naive Bayes, Gaussian Mixture
Neural Networks, Decision Tree, kNN             Models, Gaussian Discriminant Analysis, LDA, Bayesian Networks

Bias-Variance trade-off, Underfitting, Overfitting
→ Underfitting or high bias means that the model is not able to capture or learn the trend or pattern in the data.
→ Overfitting or high variance means that the model fits the available data but does not generalize well to predict on new data.
→ Expected test error is the error we expect from predicting new, unobserved data points.
In supervised learning, the expected prediction error decomposes into the bias, the variance and the irreducible part:

Error(x) = (E[f̂(x)] − f(x))² + E[(f̂(x) − E[f̂(x)])²] + σe²

→ Bias refers to erroneous assumptions made by the model about the data to make the target function easier to learn. Mathematically: how much do predicted values differ from true values?
→ Variance is the amount that the prediction (the estimate of the target function) will change if different training data sets were used. It measures how scattered (inconsistent) the predicted values are from the correct value due to different training sets (or possibly different random seeds). It is also known as Variance Error or Error due to Variance.
Note: As the complexity of the model rises, the variance will increase and the bias will decrease.
• The goal of parameterization is to achieve a low bias and low variance trade-off through methods such as:
→ Cross-validation, used to tune models so as to optimize the trade-off
→ Dimension reduction and feature selection
→ Mixture models (probabilistic models) and ensemble learning

• How would you identify if your model is overfitting? By analyzing the learning curves, you should be able to spot whether the model is underfitting or overfitting. The y-axis is some metric of learning (e.g., classification accuracy) and the x-axis is experience (time or number of iterations).

• Training loss vs. Validation loss:
→ The training loss goes down over time, achieving low error values.
→ The validation loss goes down until a turning point is found, and then it starts going up again. That point represents the beginning of overfitting, so the training process should be stopped when the validation error trend changes from descending to ascending.

→ Epochs: one epoch is when the ENTIRE dataset is passed forward and backward through the neural network exactly ONCE.
→ Batch: you can't pass the entire dataset into the neural net at once, so you divide the dataset into a number of batches (sets, parts).
→ Iterations: the number of batches needed to complete one epoch.

• Regularization - Dropout: during training, randomly set some activations to 0. This forces the network to not rely on any one node.
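As a concrete illustration of the bias-variance trade-off above, here is a minimal sketch (assuming NumPy; the polynomial degrees and noise level are illustrative choices, not from the original) that fits models of increasing complexity and compares training vs. validation error:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: noisy sine wave, split into train and validation sets
x = rng.uniform(0, 1, 200)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, size=x.shape)
x_tr, y_tr, x_va, y_va = x[:150], y[:150], x[150:], y[150:]

for degree in [1, 3, 15]:
    # Fit polynomial coefficients by least squares
    coefs = np.polyfit(x_tr, y_tr, deg=degree)
    mse_tr = np.mean((np.polyval(coefs, x_tr) - y_tr) ** 2)
    mse_va = np.mean((np.polyval(coefs, x_va) - y_va) ** 2)
    # Low degree: both errors high (underfitting / high bias).
    # High degree: train error low, validation error high (overfitting / high variance).
    print(f"degree={degree:2d}  train MSE={mse_tr:.3f}  val MSE={mse_va:.3f}")
```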
Model Evaluation

Classification Problems

Confusion Matrix
• The data gives us outcomes ("truth") (y|Y)
• The model makes decisions (d|D) (saving ŷ scores)
Then, we compare decisions (d) to outcomes (y).
Type I error: The null hypothesis H0 is rejected when it is true.
Type II error: The null hypothesis H0 is not rejected when it is false.
→ False positive (Type I error) — incorrectly decide yes
→ False negative (Type II error) — incorrectly decide no
Ex: We assume the null hypothesis H0 is true.
→ H0: Person is not guilty
→ H1: Person is guilty

(1) Accuracy: (TP + TN) / (TP + TN + FP + FN) → ratio of correct predictions over total predictions. Estimate of P[D = Y], the probability that the decision equals the outcome.

(2) Recall or Sensitivity or True Positive Rate: TP / (TP + FN) — completeness of the model. → Out of the total actual positive (1) values, how often the classifier is correct. Probability: P[D = 1|Y = 1]
Example: "Fraudulent transaction detector" or "Person Cancer" → +ve (1) is "fraud": optimize for sensitivity because false positives (FP: normal transactions that are flagged as possible fraud) are more acceptable than false negatives (FN: fraudulent transactions that are not detected).

(3) Precision: TP / (TP + FP) — exactness of the model. → Out of the total predicted positive (1) values, how often the classifier is correct. Probability: P[Y = 1|D = 1] — if our model says positive, how likely it is correct in that judgement.
Example: "Spam filter" — +ve (1) class is spam → optimize for precision or specificity because false negatives (FN: spam goes to the inbox) are more acceptable than false positives (FP: non-spam is caught by the spam filter).
Example: "Hotel booking cancelled" — +ve (1) class is isCancelled → optimize for precision or specificity because false negatives (FN: isCancelled labeled as "not cancelled" 0) are more acceptable than false positives (FP: isNotCancelled labeled as "cancelled" 1).

(4) F1-Score: 2 × (Precision × Recall) / (Precision + Recall) → use when false positives (FP) and false negatives (FN) are equally important.

(5) False Positive Rate: FP / (TN + FP) — fraction of negatives wrongly classified positive. Probability: P[D = 1|Y = 0]

(6) False Negative Rate: FN / (TP + FN) = 1 − Recall — fraction of positives wrongly classified negative. Probability: P[D = 0|Y = 1]

(7) Specificity: TN / (TN + FP) = 1 − FPR — fraction of negatives rightly classified negative. Probability: P[D = 0|Y = 0]

(8) "Fraudulent transaction detector": FPR = FP / (FP + TN) → probability of falsely rejecting the null hypothesis H0.

(9) ROC curve: what FPR must you tolerate for a certain TPR? An ROC curve plots TPR vs. FPR at different classification thresholds α.
→ Lowering the classification threshold classifies more items as positive, thus increasing both False Positives and True Positives.
Note: We can think of the plot as the fraction of correct predictions for the positive class (y-axis) versus the fraction of errors for the negative class (x-axis).

(10) AUC (Area Under the ROC Curve): the points of an ROC curve can be computed with an efficient, sorting-based algorithm. AUC ranges in value from 0 to 1 and measures how likely the model is to differentiate positives from negatives (perfect AUC = 1, baseline = 0.5).

Unrepresentative Training Dataset
Occurs when the data available during training is not enough to capture the model, relative to the validation dataset.
[Learning-curve plot: train and validation loss vs. experience]
The train and validation curves are improving, but there's a big gap between them, which means they operate like datasets from different distributions.

Unrepresentative Validation Dataset
[Learning-curve plot: noisy validation loss around the training loss]
Here the training curve looks ok, but the validation curve moves noisily around the training curve. It could be the case that validation data is scarce and not very representative of the training data, so the model struggles to model these examples.
[Learning-curve plot: validation loss below the training loss]
Here, the validation loss is much better than the training one, which reflects that the validation dataset is easier to predict than the training dataset. An explanation could be that the validation data is scarce but widely represented by the training dataset, so the model performs extremely well on these few examples.
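A minimal sketch of the metrics above, computed directly from confusion-matrix counts (pure Python; the example counts are made up for illustration):

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute the standard confusion-matrix metrics (1)-(7) above."""
    accuracy    = (tp + tn) / (tp + tn + fp + fn)
    recall      = tp / (tp + fn)          # sensitivity, TPR
    precision   = tp / (tp + fp)
    f1          = 2 * precision * recall / (precision + recall)
    fpr         = fp / (tn + fp)
    fnr         = fn / (tp + fn)          # = 1 - recall
    specificity = tn / (tn + fp)          # = 1 - FPR
    return dict(accuracy=accuracy, recall=recall, precision=precision,
                f1=f1, fpr=fpr, fnr=fnr, specificity=specificity)

# Hypothetical counts for a fraud detector
print(classification_metrics(tp=80, tn=900, fp=15, fn=5))
```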
How to choose the threshold for logistic regression? The choice of a threshold depends on the importance of TPR and FPR for the classification problem. For example: suppose you are building a model to predict customer churn. False negatives (not identifying customers who will churn) might lead to loss of revenue, making TPR crucial. In contrast, falsely predicting churn (false positives) could lead to unnecessary retention efforts, making FPR important. If there is no external concern about low TPR or high FPR, one option is to weight them equally by choosing the threshold that maximizes TPR − FPR.

Precision-Recall curve: focuses on the correct prediction of the minority class; useful when data is imbalanced. Plots precision vs. recall at different thresholds.

Regression Problems

1. Mean Squared Error: MSE = (1/n) Σi (yi − ŷi)²

2. Root Mean Squared Error: RMSE = sqrt( Σi (ŷi − yi)² / N )

3. Mean Absolute Error: MAE = (1/n) Σi |yi − ŷi|

4. Sum of Squared Errors: SSE = Σi (yi − ŷi)²

5. Total Sum of Squares: SST = Σi (yi − ȳ)²

6. R² Error:
R² = 1 − MSE(model)/MSE(baseline) = 1 − SSE/SST
The proportion of explained y-variability. Negative R² means the model is worse than just predicting the mean. R² is not valid for nonlinear models, as SS_resid + SS_error ≠ SST.

7. Adjusted R²:
Ra² = 1 − (1 − R²) · (n − 1)/(n − k − 1)
which changes only when predictors (features) affect R² above what would be expected by chance.

Variance, R² and the Sum of Squares
The total sum of squares: SS_total = Σi (yi − ȳ)²
This scales with variance: var(Y) = (1/n) Σi (yi − ȳ)²
The regression sum of squares: SS_reg = Σi (ŷi − ȳ)² → n·Var(predictions)
The residual sum of squares (squared error): SS_resid = Σi (yi − ŷi)² → n·Var(ϵ)
Note: ϵ̄ = 0, E[ŷ] = ȳ

SS_total = SS_reg + SS_resid

R² = 1 − SS_resid/SS_total = SS_reg/SS_total = n·Var(preds)/n·Var(Y) = Var(preds)/Var(Y)

The fraction of variance explained!
→ The variance in the outcome variable decomposes into regression variance and residual variance.
→ R² measures the fraction of total variance explained by the regression.
→ R² is the percentage of dependent-variable variance explained using the independent variable(s).

Convex & Non-convex
A convex function is one where a line drawn between any two points on the graph lies on or above the graph. It has one minimum. A non-convex function is one where a line drawn between any two points on the graph may intersect other points on the graph. It is characterized as "wavy".
→ When a cost function is non-convex, there is a likelihood that optimization may find a local minimum instead of the global minimum, which is typically undesired in machine learning models from an optimization perspective.

Gradient Descent
Gradient Descent is used to find the coefficients of f that minimize a cost function (for example MSE, SSR).
→ Time complexity: O(kn²) → n is the number of data points.
Procedure:
1. Initialization: θ = 0 (coefficients set to 0 or random)
2. Calculate cost: J(θ) = evaluate f(coefficients)
3. Gradient of cost: ∂J(θ)/∂θj gives the uphill direction
4. Update coefficients: θj = θj − α ∂J(θ)/∂θj, i.e., go downhill
The cost-updating process is repeated until convergence (minimum found).
Tips:
• Change the learning rate α ("size of jump" at each iteration)
• Plot Cost vs. Time to assess learning-rate performance
• Rescale the input variables
• Reduce passes through the training set with SGD
• Average over 10 or more updates to observe the learning trend while using SGD
Batch Gradient Descent sums/averages the cost over all the observations.
Stochastic Gradient Descent applies the parameter-updating procedure for each observation.
→ Time complexity: O(km²) → m is the size of the sample selected randomly from the entire data of size n.

Optimization
Almost every machine learning method has an optimization algorithm at its core.
→ Hypothesis: the hypothesis is noted hθ and is the model that we choose. For a given input data x(i), the model prediction output is hθ(x(i)).
→ Loss function: L : (z, y) ∈ R × Y ↦ L(z, y) ∈ R takes as inputs the predicted value z corresponding to the real data value y, and outputs how different they are. The loss function computes the distance or difference between the current output z of the algorithm and the expected output y.
The common loss functions are summed up in the table below:

Least squared error      Logistic loss            Hinge loss
½(y − z)²                log(1 + exp(−yz))        max(0, 1 − yz)
Linear Regression        Logistic Regression      SVM

→ Cost function: the cost function J is commonly used to assess the performance of a model, and is defined with the loss function L as follows:
J(θ) = Σ(i=1..m) L(hθ(x(i)), y(i))
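A minimal NumPy sketch of the batch gradient-descent procedure above, fitting a simple linear model to toy data (the learning rate and iteration count are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data generated from y = 2x + 1 plus noise
X = np.c_[np.ones(100), rng.uniform(-1, 1, 100)]   # column of 1s for the intercept
y = X @ np.array([1.0, 2.0]) + rng.normal(0, 0.1, 100)

theta, alpha = np.zeros(2), 0.1                     # step 1: initialize coefficients
for _ in range(500):
    residuals = X @ theta - y                       # step 2: evaluate predictions
    grad = 2 / len(y) * X.T @ residuals             # step 3: gradient of the MSE cost
    theta -= alpha * grad                           # step 4: go downhill
print(theta)  # approx. [1.0, 2.0]
```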
Ordinary Least Squares

Least Squares Regression
We fit linear models:
ŷ = β0 + Σj βj xj
Here, βj is the j-th coefficient and xj is the j-th feature.
Ordinary Least Squares — find the β⃗ that minimizes the squared error:
argmin_β Σi (yi − ŷi)²
Goal: least-squares solution to X β⃗ = ŷ
Solution: solve the normal equations:
XᵀX β⃗ = Xᵀŷ  →  β⃗ = (XᵀX)⁻¹ Xᵀŷ

General Optimization:
1. Understand data (features and outcome variables)
2. Define loss (or gain/utility) function
3. Define predictive model
4. Search for parameters that minimize the loss function

Augmented Loss
We can add more things to the loss function:
• Penalize model complexity
• Penalize "strong" beliefs — requires predictive utility to overcome them
→ Least squares generalizes into minimizing loss functions.
→ This is the heart of machine learning, particularly supervised learning.

Maximum Likelihood Estimation
MLE is used to find the estimators that maximize the likelihood function L(θ|x) = fθ(x), the density function of the data distribution.

Log Likelihood
Logistic Regression:
P(Y = 1|X = x) = ŷ = logistic(β0 + Σj βj xj)
The model computes the probability of yes.

Probability of Observed
What if we want P(Y = yi), regardless of whether yi is 1 or 0?
P(Y = yi|X = xi) = ŷi^yi (1 − ŷi)^(1−yi)
• ŷi is the model's estimate of P(Y = 1|X = xi)
• yi ∈ {0, 1} is the outcome
• ŷi^yi is ŷi if yi = 1, and 1 if yi = 0 — a multiplicative "if"

Conditioning on Parameters
Fuller definition — condition on parameters β⃗ and write the function:
P(Y = 1|x, β⃗) = ŷ = m(x, β⃗) = logistic(...)

Likelihood Function
Given data y = ⟨y1, ..., yn⟩, x = ⟨x1, ..., xn⟩ and parameters β̂:
Likelihood(y, x, β⃗) = P(y, x|β⃗) ∝ P(y|x, β⃗) = Πi P(yi|xi, β⃗)
This is weird:
P(y, x|β⃗) ∝ P(y|x, β⃗)
P(y, x|β⃗) = P(y|x, β⃗) P(x|β⃗)
But x is independent of the parameters, so P(x|β⃗) = P(x). And x is fixed, so P(x) is an (unknown) constant.

LogLik(y, x, β⃗) = log Likelihood(y, x, β⃗)
  = log( P(x) Πi P(yi|xi, β⃗) )
  = log P(x) + Σi log P(yi|xi, β⃗)

Maximum Likelihood Estimator
argmax_β Σi log P(yi|xi, β⃗)

P(Y = yi|X = xi) = ŷi^yi (1 − ŷi)^(1−yi)
log P(Y = yi|X = xi) = yi log ŷi + (1 − yi) log(1 − ŷi)
The model log likelihood is the sum over the training data. Applicable to any model where ŷ = P(Y = 1|x).

Likelihood and Posterior
P(θ|y) = P(y|θ) P(θ) / P(y)
→ The logistic function is trained by maximizing the log likelihood of the training data given the model.

Linear Algorithms

Regression
→ Regression predicts (or estimates) a continuous variable: dependent variable Y, independent variable(s) X.
→ Compute an estimate ŷ ≈ y:
ŷi = β0 + β1 xi
yi = ŷi + ϵi
Here, β0 is the intercept, β1 is the slope and ϵ is the residual. The goal is to learn β0, β1 to minimize Σ ϵi² (least squares).
Linearity: a linear equation of k + 1 variables is of the form:
ŷ = β0 + β1 x1 + · · · + βk xk
It is the sum of scalar multiples of the individual variables — a line!
→ Linear models are remarkably capable of transforming many non-linear problems into linear ones.

Linear Regression
ŷi = β0 + β1 xi1 + β2 xi2 + · · · + βp xip + ϵ
ŷi = β0 + Σ(j=1..p) βj xij,  i = 1, ..., n
Here, n is the total number of observations, yi is the dependent variable, and xij is the explanatory variable for the j-th feature of the i-th observation. β0 is the intercept, usually called the bias coefficient.
Assumptions:
→ Linear models make four key assumptions necessary for inferential validity.
• Linearity — outcome y and predictor X have a linear relationship.
• Independence — observations are independent of each other. Independent variables (features) are not highly correlated with each other → low multicollinearity.
• Normal errors — residuals are normally distributed; check with Q-Q plots. Under violation the line (in the Q-Q plot) still fits, but p-values and CIs are unreliable.
• Equal variance — residuals have constant variance (called homoskedasticity; its violation is heteroskedasticity); check a scatterplot or regplot of residuals vs. fitted values. A violation means the model is failing to capture a systematic effect. → These violations are a problem only for inference, not for prediction.
Variance Inflation Factor: measures the severity of multicollinearity → 1/(1 − Ri²), where Ri² is found by regressing Xi against all other variables (a common VIF cutoff is 10).
→ Multicollinearity → correlated predictors. Problem: which coefficient gets the common effect? To solve this: loss functions and regularization (below).
Learning: estimating the coefficients β from the training data using the optimization algorithm Gradient Descent, or Ordinary Least Squares — where we find the β⃗ that minimizes the squared error:
argmin_β Σi (yi − ŷi)²
Data preparation:
- Transform data to obtain a linear relationship (e.g., log transform for an exponential relationship)
- Remove noise such as outliers
- Rescale inputs using standardization or normalization
Advantages:
+ Good regression baseline considering its simplicity
+ Lasso/Ridge can be used to avoid overfitting
+ Lasso/Ridge permit feature selection in case of collinearity
Usecase examples:
- Product sales prediction according to prices or promotions
- Call-center waiting-time prediction according to the number of complaints and the number of working agents
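The normal-equations solution above in a few lines of NumPy (a sketch; in practice np.linalg.lstsq is preferred over forming (XᵀX)⁻¹ explicitly, for numerical stability):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.c_[np.ones(50), rng.normal(size=(50, 2))]   # intercept + 2 features
beta_true = np.array([0.5, -1.0, 2.0])
y = X @ beta_true + rng.normal(0, 0.1, 50)

# Solve the normal equations X^T X beta = X^T y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)  # approx. [0.5, -1.0, 2.0]
```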

Variations: There are extensions of Linear Regression training called regularization methods, which aim to reduce the complexity of the model or to address over-fitting in ML. The regularizer is not dependent on the data. → In relation to the bias-variance trade-off, regularization aims to decrease complexity in a way that significantly reduces variance while only slightly increasing bias.
→ The dimension of the hyperplane of the regression is its complexity.
→ Standardize numeric variables when using regularization, to ensure that 0 is a neutral value — a low coefficient then means "little effect when deviating from average". Values, and therefore coefficients, are on the same scale (number of standard deviations), which properly distributes weight between them.

• Ridge Regression (L2 regularization): OLS is modified to also minimize the squared sum of the coefficients:

Σ(i=1..n) (yi − β0 − Σ(j=1..p) βj xij)² + λ Σ(j=1..p) βj² = RSS + λ Σ(j=1..p) βj²

→ Prevents the weights from getting too large (L2 norm). If lambda is very large then it will add too much weight, and it will lead to under-fitting.
λ ∝ 1 / (model variance)

• Lasso Regression (L1 regularization): OLS is modified to also minimize the sum of the absolute coefficients:

Σ(i=1..n) (yi − β0 − Σ(j=1..p) βj xij)² + λ Σ(j=1..p) |βj| = RSS + λ Σ(j=1..p) |βj|

where p is the number of features (or dimensions) and λ ≥ 0 is a tuning parameter to be determined.
→ Lasso shrinks the less important features' coefficients to zero, thus removing some features altogether. If lambda is a very large value, it will make coefficients zero and hence under-fit.
→ L2 is less likely than L1 to shrink coefficients to 0. Therefore L1 regularization leads to sparser models.

Logistic Regression

Log-Odds and Logistic

Odds
The probability of success P(S): 0 ≤ p ≤ 1
→ The odds of success are defined as the ratio of the probability of success over the probability of failure.
The odds of success: Odds(S) = P(S)/P(Sᶜ) = P(S)/(1 − P(S))
Ex: Odds(failure) = x → means x:1 against success

Log-Odds or logit
log Odds(A) = log( P(A)/(1 − P(A)) ) = log P(A) − log(1 − P(A))

Logistic: the inverse of the logit (logit⁻¹):
logistic(x) = 1/(1 + e⁻ˣ) = eˣ/(eˣ + 1)
This is the sigmoid or logistic curve.
→ Odds are another way of representing probabilities.
→ The logistic and logit functions convert between probabilities and log-odds.

Generalized Linear Models (GLMs):
ŷi = g⁻¹(β0 + β1 xi1 + β2 xi2 + · · · + βp xip)
ŷi = g⁻¹(β0 + Σ(j=1..p) βj xij)
Here, g is a link function.
• Counts: Poisson regression, log link function
• Binary: Logistic regression, logit link function, where g⁻¹ is the logistic function

→ In logistic regression, a linear output is converted into a probability between 0 and 1 using the sigmoid or logistic function. It is the go-to method for binary classification.

P(yi = 1|X) = ŷi = logistic(β0 + Σj βj xij)

→ logistic(x) = 1/(1 + e⁻ˣ) = eˣ/(eˣ + 1)

The representation below is an equation with binary output, which actually models the probability of the default class:

p(X) = e^(β0 + β1 x1 + · · · + βi xi) / (1 + e^(β0 + β1 x1 + · · · + βi xi)) = p(y = 1 | X)

→ Predict a value close to 1 for the default class and close to 0 for the other class.
→ Assumptions:
- Linear relationship between X and the log-odds of Y
- Observations must be independent of each other
- Low multicollinearity
Learning: learning the logistic regression coefficients is done by:
→ Minimizing the logistic loss function:
argmin_β Σi log(1 + exp(−yi β⃗xi))
→ Maximizing the log likelihood of the training data given the model:
argmax_β Σi log P(yi|xi, β⃗)
P(Y = yi|X = xi) = ŷi^yi (1 − ŷi)^(1−yi)
log P(Y = yi|X = xi) = yi log ŷi + (1 − yi) log(1 − ŷi)
The model log likelihood is the sum over the training data. Applicable to any model where ŷ = P(Y = 1|x).
Note: Coefficients are linearly related to the odds, such that a one-unit increase in x1 affects the odds by e^β1.
Data preparation:
- Probability transformation to binary for classification
- Remove noise such as outliers
Advantages:
+ Good classification baseline considering its simplicity
+ Possibility to change the cutoff for the precision/recall trade-off
+ Robust to noise/overfitting with L1/L2 regularization
+ Probability output can be used for ranking
Usecase examples:
- Customer scoring with probability of purchase
- Classification of loan defaults according to profile

How to choose the threshold for logistic regression? The choice of a threshold depends on the importance of TPR and FPR for the classification problem. For example, if your classifier will decide which criminal suspects will receive a death sentence, false positives are very bad (innocents will be killed!). Thus you would choose a threshold that yields a low FPR while keeping a reasonable TPR (so you actually catch some true criminals). If there is no external concern about low TPR or high FPR, one option is to weight them equally by choosing the threshold that maximizes TPR − FPR.
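A minimal sketch of picking the threshold that maximizes TPR − FPR (Youden's J statistic), assuming scikit-learn is available and y_score holds predicted probabilities (the toy arrays are illustrative):

```python
import numpy as np
from sklearn.metrics import roc_curve

def best_threshold(y_true, y_score):
    """Return the threshold maximizing TPR - FPR on the ROC curve."""
    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    return thresholds[np.argmax(tpr - fpr)]

# Hypothetical labels and predicted probabilities
y_true  = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.5, 0.7])
print(best_threshold(y_true, y_score))
```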
Linear Discriminant Analysis
For multiclass classification, LDA is the preferred linear technique.
Representation: the LDA representation consists of statistical properties calculated for each class — the means and the covariance matrix:

μk = (1/nk) Σ(i=1..n) xi        σ² = (1/(n − k)) Σ(i=1..n) (xi − μk)²

LDA assumes Gaussian data and attributes of the same variance σ². Predictions are made using Bayes' Theorem:

P(y = k | X = x) = P(k) × P(x|k) / Σ(l=1..k) P(l) × P(x|l)

to obtain a discriminant function (latent variable) for each class k, estimating P(x|k) with a Gaussian distribution:

Dk(x) = x × μk/σ² − μk²/(2σ²) + ln(P(k))

The class with the largest discriminant value is the output class.
Variations:
1. Quadratic DA: each class uses its own variance estimate.
2. Regularized DA: introduces regularization into the variance estimate.
Data preparation:
- Review and modify univariate distributions to be Gaussian
- Standardize data to μ = 0, σ = 1 to have the same variance
- Remove noise such as outliers
Advantages:
+ Can be used for dimensionality reduction by keeping the latent variables as new variables
Usecase example:
- Prediction of customer churn

Likelihood and Posterior
P(θ|y) = P(y|θ) P(θ) / P(y)
• P(θ) is the prior
• P(y|θ) is the likelihood — how likely is the data given params θ
• P(y) = ∫ P(y|θ)P(θ)dθ is a scaling factor (constant for fixed y)
• P(θ|y) is the posterior
• We're maximizing likelihood (ML estimator)
• Can also maximize the posterior (MAP estimator)
• When the prior is constant, they're the same
• With lots of data, they're almost the same

Nonlinear Algorithms
All nonlinear algorithms are non-parametric and more flexible. They are not sensitive to outliers and do not require the data to have any particular shape of distribution.

Naive Bayes Classifier
Naive Bayes is a classification algorithm interested in selecting the best hypothesis h given data d, assuming that the features of each data point are all independent.
Representation: the representation is based on Bayes' Theorem:

P(Y|d) = P(d|Y) × P(Y) / P(d)

With the naive hypothesis,

P(d|Y) = P(x1, x2, · · · , xi | Y) = P(x1|Y) × P(x2|Y) × · · · × P(xi|Y) = Π(i=1..n) P(xi|Y)

The prediction is the maximum a posteriori hypothesis:

max P(Y|d) = max (P(d|Y) × P(Y))

Here, the denominator is dropped, as it serves only for normalization.
Learning: training is fast because only probabilities need to be calculated:

P(Y) = instances_Y / all instances        P(x|Y) = count(x ∧ Y) / instances_Y

Variations: Gaussian Naive Bayes extends to numerical attributes by assuming a Gaussian distribution: the per-class mean and standard deviation are calculated during learning, and the MAP prediction is calculated using the Gaussian PDF:

f(x | μ(x), σ) = (1/√(2πσ²)) e^(−(x − μ)²/(2σ²))

μ(x) = (1/n) Σ(i=1..n) xi        σ = sqrt( (1/n) Σ(i=1..n) (xi − μ(x))² )

Data preparation:
- Change numerical inputs to categorical (binning) or near-Gaussian inputs (remove outliers, log & Box-Cox transforms)
- Other distributions can be used instead of Gaussian
- Log-transform of the probabilities can avoid numerical underflow
- Probabilities can be updated as data becomes available
Advantages:
+ Fast because of the simple calculations
+ If the naive assumption works, it can converge quicker than other models; can be used on smaller training data
+ Good for variables with few categories
Usecase examples:
- Article classification using binary word presence
- Email spam detection using a similar technique
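A minimal sketch of the Gaussian Naive Bayes learning step described above, assuming scikit-learn is available (the toy blobs are illustrative):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
# Two Gaussian blobs: class 0 centered at -2, class 1 centered at +2
X = np.r_[rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))]
y = np.r_[np.zeros(50), np.ones(50)]

model = GaussianNB().fit(X, y)           # learns per-class means and variances
print(model.predict([[1.5, 2.5]]))       # -> [1.]
print(model.predict_proba([[0.0, 0.0]])) # posterior P(Y|x) for each class
```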
Support Vector Machines
SVM is a go-to for high performance with little tuning. It compares extreme values in your dataset.
In SVM, a hyperplane (or decision boundary: wᵀx − b = 0) is selected to separate the points in the input-variable space by their class, with the largest margin. The closest data points (defining the margin) are called the support vectors.
→ The goal of a support vector machine is to find the optimal separating hyperplane which maximizes the margin of the training data.
The prediction function is the signed distance of the new input x to the separating hyperplane w, with bias b:
f(x) = ⟨w, x⟩ + b = wᵀx + b
→ Optimal margin classifier: the optimal margin classifier h is such that:
h(x) = sign(wᵀx − b)
where (w, b) ∈ Rⁿ × R is the solution of the following optimization problem:
min ½∥w∥²  such that  y(i)(wᵀx(i) − b) ≥ 1
→ Hinge loss: the hinge loss is used in the setting of SVMs and is defined as follows:
L(z, y) = [1 − yz]₊ = max(0, 1 − yz)
→ Lagrangian: we define the Lagrangian L(w, b) as follows:
L(w, b) = f(w) + Σ(i=1..l) βi hi(w)
The Lagrange method is required to convert a constrained optimization problem into an unconstrained one. The goal of the equation below is to obtain the optimal values for w and b:

λ∥w⃗∥² + (1/n) Σ(i=1..n) max(0, 1 − yi(w⃗ · x⃗i − b))

The first term is the regularization term, a technique to avoid overfitting by penalizing large coefficients in the solution vector. The second term, the hinge loss, penalizes misclassifications. It measures the error due to misclassification (or data points being closer to the classification boundary than the margin). λ is the regularization coefficient, and its major role is to determine the trade-off between increasing the margin size and ensuring that xi lies on the correct side of the margin.
→ Kernel: a kernel is a way of computing the dot product of two vectors x and y in some (possibly very high dimensional) feature space, which is why kernel functions are sometimes called "generalized dot products". The kernel trick is a method of using a linear classifier to solve a non-linear problem, by transforming linearly inseparable data into linearly separable data in a higher dimension.
Given a feature mapping ϕ, we define the kernel K as follows:
K(x, z) = ϕ(x)ᵀϕ(z)
In practice, the kernel K defined by K(x, z) = e^(−∥x − z∥²/(2σ²)) is called the Gaussian kernel and is commonly used.
Note: we say that we use the "kernel trick" to compute the cost function using the kernel, because we actually don't need to know the explicit mapping ϕ, which is often very complicated. Instead, only the values K(x, z) are needed.
Variations:
SVM is implemented using various kernels, which define the measure between new data and the support vectors:
1. Linear (dot-product): K(x, xi) = x · xi
2. Polynomial: K(x, xi) = (1 + x · xi)ᵈ
3. Radial: K(x, xi) = e^(−γ Σ(x − xi)²)
Data preparation:
- SVM assumes numeric inputs; categorical features may require a dummy transformation
Advantages:
+ Allows nonlinear separation with nonlinear kernels
+ Works well in high-dimensional spaces
+ Robust to multicollinearity and overfitting
Usecase examples:
- Face detection from images
- Target audience classification from tweets

K-Nearest Neighbors
If you are similar to your neighbors, you are one of them. KNN uses the entire training data; no training is required.
Note: higher k → higher bias; lower k → higher variance.
• The choice of k is very critical. → A small value of k means that noise will have a higher influence on the result. → A large value of k makes everything classified as the most probable class, and is also computationally expensive.
→ A simple approach is to set k = √n, or to cross-validate on a small subset of the training data (validation data), varying the value of k and observing the training-validation error.
→ Minkowski distance = (Σ |ai − bi|ᵖ)^(1/p)
- p = 1 gives the Manhattan distance Σ |ai − bi|
- p = 2 gives the Euclidean distance sqrt(Σ (ai − bi)²)
→ Hamming distance — count of the differences between two vectors, often used to compare categorical variables.
Time complexity: the distance-calculation step over all pairs requires quadratic time, and sorting the calculated distances adds an O(N log N) factor, so the total process is roughly O(N² log N).
Space complexity: since KNN stores all the pairwise distances, sorted in memory on a machine, memory is also a problem. Usually, local machines will crash with very large datasets.
Data preparation:
- Rescale inputs using standardization or normalization
- Address missing data for the distance calculations
- Dimensionality reduction or feature selection to counter the curse of dimensionality
Advantages:
+ Effective if the training data is large
+ No learning phase
+ Robust to noisy data; no need to filter outliers
Usecase examples:
- Recommending products based on similar customers
- Anomaly detection in customer behavior
Classification and Regression Trees (CART)
A Decision Tree is a supervised learning technique that can be used for both classification and regression problems.
• CART for regression minimizes SSE by splitting data into sub-regions and predicting the average value at leaf nodes. The complexity parameter cp only keeps splits that reduce loss by at least cp (small cp → deep tree).
• CART for classification minimizes the sum of region impurity, where pi is the probability of a sample being in category i. Possible measures, each with a maximum impurity of 0.5:
- Gini Impurity / Gini Index / Gini Coefficient = 1 − Σ (pi)²
- Cross Entropy = −Σ (pi) log2(pi)
At each leaf node, CART predicts the most frequent category, assuming false-negative and false-positive costs are the same.
→ The splitting process handles multicollinearity and outliers.
→ Trees are prone to high variance, so tune them through CV.
A decision tree is made up of nodes. Each node represents a question about the data, and the branches from each node represent the possible answers.
Root Node: the very first node (a parent node). It denotes the whole population and gets split into two or more decision nodes based on the feature value.
Decision Node: decision nodes are used to make decisions and have multiple branches.
Leaf Node: leaf nodes are the outputs of those decisions.
Sub-Tree: a branch is a subdivision of a complete tree.
Note: In decision trees, the depth of the tree determines the variance. Decision trees are commonly pruned to control variance.
Procedure (ID3):
1. Calculate the entropy of the outcome classes (c):
E(T) = −Σ(i=1..c) pi log2 pi
2. Split the dataset on the different attributes. The entropy of each branch is calculated, then added proportionally to get the total entropy for the split. The resulting entropy is subtracted from the entropy before the split:
Gain(T, X) = Entropy(T) − Entropy(T, X)
3. Choose the attribute with the largest information gain as the decision node, divide the dataset by its branches, and repeat the same process on every branch.
4. A branch with entropy of 0 is a leaf node.
5. A branch with entropy more than 0 needs further splitting.
6. The ID3 algorithm is run recursively on the non-leaf branches, until all data is classified.
Note: The most common stopping criterion for splitting is a minimum number of training observations per node.
Advantages:
→ Can take any type of variables and does not require any data preparation
→ Simple to understand, interpret, visualize
→ Non-linear relationships between parameters don't affect its performance
Disadvantages:
→ Overfitting (high variance) occurs when there is noisy data
→ DTs can be unstable because of small variations in the data (use bagging or boosting)

Ensemble Algorithms
Ensemble methods combine multiple, simpler algorithms (weak learners) to obtain a better-performing algorithm.

Bagging              Boosting
Random Forest        AdaBoost
                     Gradient Boosting
                     XGBoost

• Bootstrapping is drawing random sub-samples (sampling with replacement) from a large sample (the available data) to estimate quantities (parameters) of an unknown population by averaging the estimates from these sub-samples.
• Bagging: uses the bootstrap technique, and can reduce the variance of high-variance models.
How bagging works: 1. Bootstrapping → 2. Parallel training → 3. Aggregation. Finally, depending on the type of task — regression or classification, for example — the average or majority of those predictions yields a more accurate estimate.

Random Forest
→ Bagged decision trees: each DT may contain a different number of rows and a different number of features.
→ Individual DTs may face overfitting, i.e., have low bias (complex model) but high variance; by ensembling a lot of DTs we reduce the variance without increasing the bias.

• Boosting: the idea of boosting methods is to train weak learners sequentially, each trying to correct its predecessor.

AdaBoost
• Uses the same training samples at each stage
• "Weakness" = misclassified data points
• Increases the weight of misclassified data points
Algorithm:
1. Initialize weights: assign equal weight to each of the training data points.
2. Train a weak model and evaluate: provide the data as input to the weak model and identify the wrongly classified data points.
3. Adjust weights: increase the weight of the wrongly classified data points.
4. Combine models: combine the weak models using a weighted sum, where the weights are based on the accuracy of each learner.
5. Repeat steps 2-4 for a predefined number of iterations or until the error is minimized.

Gradient Boosting
• Uses different training samples at each stage
• "Weakness" = residuals or errors
• A loss function to be optimized; an additive model to add weak learners
Algorithm:
1. Initialize the model: start with an initial model (e.g., a constant value, say the average).
2. Compute residuals: calculate the residuals (errors) of the current model.
3. Train a weak learner on the residuals.
4. Update the model: add the weak learner to the model with a certain learning rate.
5. Repeat steps 2-4 for a fixed number of iterations or until the model converges.

XGBoost
• An optimized and scalable implementation of gradient boosting.
→ Execution speed: parallelization (uses all cores of the CPU), cache optimization, out-of-core computation (data size bigger than memory)
→ Model performance: adds regularization (prevents overfitting), controlled tree growth (uses "max depth" to limit the growth of trees), missing-values treatment, efficient handling of sparse data
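A minimal NumPy sketch of steps 1-2 of the ID3 procedure above — entropy and information gain for one candidate split (the toy labels are illustrative):

```python
import numpy as np

def entropy(labels):
    """E(T) = -sum_i p_i log2 p_i over the outcome classes."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(labels, groups):
    """Gain(T, X) = Entropy(T) - weighted entropy after the split."""
    n = len(labels)
    split_entropy = sum(len(g) / n * entropy(g) for g in groups)
    return entropy(labels) - split_entropy

y = np.array([1, 1, 1, 0, 0, 1, 0, 0])
# Candidate split puts the first half in one branch, the rest in the other
print(information_gain(y, [y[:4], y[4:]]))  # approx. 0.189
```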
Unsupervised Machine Learning
1. Clustering
2. Dimension Reduction
3. Association Rule Mining
4. Graphical Modelling and Network Analysis

Clustering
Grouping objects into meaningful subsets, or clusters. → Objects within each cluster are similar.
Clustering Algorithms:
1. Partition-based methods
(a) K-means clustering
(b) Fuzzy C-Means
2. Hierarchical methods
(a) Agglomerative Clustering
(b) Divisive Clustering
3. Density-based methods
(a) DBSCAN

K-means clustering
The objective of K-means clustering is to minimize the total intra-cluster variance, or the squared error function:

Objective function → J = Σ(j=1..K) Σ(i=1..n) ∥Xi(j) − Cj∥²

Here, K is the number of clusters, n is the number of cases, and Cj is the centroid of cluster j.
1. Divide the data into K clusters or groups.
2. Randomly select a centroid for each of these K clusters.
3. Assign data points to their closest cluster centroid according to the Euclidean / squared Euclidean / Manhattan / cosine distance.
4. Calculate the centroids of the newly formed clusters.
5. Repeat steps 3 and 4 until the same centroids are assigned to each cluster (convergence).
→ K-means always converges (mostly to a local minimum, not the global minimum).
• How to choose the number of clusters K in the K-means algorithm?
→ The maximum possible number of clusters is equal to the number of observations in the dataset.

Hierarchical Clustering
Agglomerative method: "bottom-up"
1. Compute the distance, or proximity, matrix
2. Initialization: each observation is a cluster
3. Iteration: merge the two clusters which are most similar, until all observations are merged into a single cluster
Divisive method: "top-down"
1. Compute the distance, or proximity, matrix
2. Initialization: all objects stay in one cluster
3. Iteration: select a cluster and split it into two sub-clusters, until each leaf cluster contains only one observation
Proximity (distance) matrix:
→ Single or Ward linkage: minimize the within-cluster distance
L(C1, C2) = min D(Xi^C1, Xj^C2)
→ Complete linkage: longest distance between two points in each cluster; minimize the maximum distance between cluster pairs
L(C1, C2) = max D(Xi^C1, Xj^C2)
→ Average linkage: minimize the average distance between cluster pairs
L(C1, C2) = (1/(n_C1 n_C2)) Σ(i=1..n_C1) Σ(j=1..n_C2) D(Xi^C1, Xj^C2)

DBSCAN
→ Two parameters: ε-distance and minimum points
→ Three classifications of points:
• Core: has at least the minimum number of points within its ε-distance, including itself
• Border: has fewer than the minimum number of points within its ε-distance, but can be reached by a cluster
• Outlier: a point that cannot be reached by a cluster
Procedure:
1. Pick a random point that has not been assigned to a cluster or designated as an outlier. Determine if it is a core point. If not, label the point as an outlier.
2. Once a core point has been found, add all directly reachable points to its cluster. Then do neighbor jumps to each reachable point and add them to the cluster. If an outlier is added, label it as a border point.
3. Repeat these steps until all points are assigned to a cluster or labeled as outliers.

Dimensionality Reduction Methods
Reduce the number of input variables (attributes or features) in a dataset.

Principal Component Analysis (PCA)
PCA combines highly correlated variables into a new, smaller set of constructs called principal components, which capture most of the variance present in the data.
• Dimensionality reduction
• Feature extraction
• Data visualization
Procedure:
1. Standardize the data: Z = (X − mean)/SD
2. Calculate the covariance matrix of the standardized data: V = cov(Zᵀ)
3. Find the eigenvalues and eigenvectors of the covariance matrix: values, vectors = eig(V)
4. Feature vector: the matrix whose columns are the eigenvectors of the components we decide to keep.
5. Project the data → Z_new = vectorsᵀ · Zᵀ

Association Rule Mining
"Market Basket Analysis" → uses machine learning models to analyze data for patterns, or co-occurrence, in a database.

Graphical Modelling and Network Analysis
"Bayesian Networks"
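A minimal NumPy sketch following PCA steps 1-5 above (the number of kept components is an illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X[:, 2] = 2 * X[:, 0] + 0.1 * rng.normal(size=100)  # correlated feature

# 1. Standardize
Z = (X - X.mean(axis=0)) / X.std(axis=0)
# 2. Covariance matrix of the standardized data
V = np.cov(Z.T)
# 3. Eigen-decomposition (eigh, since V is symmetric)
values, vectors = np.linalg.eigh(V)
# 4. Keep the eigenvectors with the largest eigenvalues (here: top 2)
order = np.argsort(values)[::-1]
components = vectors[:, order[:2]]
# 5. Project the data onto the principal components
Z_new = Z @ components
print(values[order] / values.sum())  # fraction of variance per component
print(Z_new.shape)                   # (100, 2)
```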
Neural Network
Deep Learning Tutorial by CampusX
→ Feeds inputs through different hidden layers and relies on weights and nonlinear functions (activation functions, convolution, or pooling) to reach an output.
• Neural Network - a multi-layer perceptron.
• Perceptron - the foundation of a neural network; a single-layer neural network that multiplies inputs by weights, adds bias, and feeds the result to an activation function.
Note: A perceptron is usually used to classify data into two parts, and is therefore also known as a Linear Binary Classifier.
An Artificial Neuron is the basic building block of a neural network.
→ Weights: the real values attached to each input/feature; they convey the importance of the corresponding feature in predicting the final output.
→ Bias: used for shifting the activation function towards the left or right.
→ Summation Function: binds the weights and inputs together and calculates their sum.
→ Activation Function: decides whether a neuron should be activated or not. The activation function introduces non-linearities into the network, which makes it capable of learning and performing more complex tasks.

Sigmoid           ReLU          Tanh
1/(1 + e⁻ᶻ)       max(0, z)     (eᶻ − e⁻ᶻ)/(eᶻ + e⁻ᶻ)

→ Softmax - used as the last activation function of a neural network to normalize the output of the network to a probability distribution over the predicted output classes. These probabilities sum to 1 → e^zi / Σ e^zj
→ If there is more than one 'correct' label, the sigmoid function provides probabilities for all, some, or none of the labels.
→ Loss Function - the function that computes the distance or difference between the predicted output ŷ of the algorithm and the expected output y. → The loss function is used to optimize the model by minimizing the loss.
- Regression losses: Mean Squared Error / squared loss / L2 loss, Mean Absolute Error / L1 loss, Huber loss
- Classification losses: Binary Cross Entropy / log loss, Categorical Cross Entropy

Gradient Descent - minimizes the average loss by moving iteratively in the direction of steepest descent, controlled by the learning rate γ (step size). Note: γ can be updated adaptively for better performance.
For neural networks, finding the best set of weights involves:
1. Initialize weights W randomly with near-zero values
2. Loop until convergence:
- Calculate the average network loss J(W)
- Backpropagation - iterate backwards from the last layer, computing the gradient ∂J(W)/∂W and updating the weights:
W ← W − γ ∂J(W)/∂W
3. Return the minimum-loss weight matrix W

Stochastic Gradient Descent - uses only a single point to compute gradients, leading to faster compute speeds, while mini-batch gradient descent trains on small subsets of the data, striking a balance between the two approaches.

• To prevent overfitting, regularization can be applied by:
- Stopping training when validation performance drops
- Dropout - randomly drop some nodes during training to prevent over-reliance on a single node
- Embedding weight penalties into the objective function
- Batch Normalization - stabilizes learning by normalizing inputs to a layer

Backpropagation:
Artificial neural networks (ANNs) and deep neural networks use backpropagation as a learning algorithm to compute the gradients for gradient descent, an optimization algorithm that guides the search to the minimum (or maximum) of a function.
The principle of the backpropagation approach is to model a given function by modifying the internal weightings of input signals to produce an expected output signal. The system is trained using a supervised learning method, where the error between the system's output and a known expected output is presented to the system and used to modify its internal state.

Convolutional Neural Network
CNNs are a type of deep learning neural network architecture that is particularly well suited to image classification and object recognition tasks.
A convolutional neural network starts by taking an input image, which is then transformed into a feature map through a series of convolutional and pooling layers.
The convolutional layer applies a set of filters to the input image (by applying weights, bias, and an activation function), each filter producing a feature map that highlights a specific aspect of the input image. Different weights lead to different feature maps.
The pooling layer then downsamples the feature map to reduce its size, while retaining the most important information even if it has shifted slightly.
The pooled feature maps are then passed through multiple additional convolutional and pooling layers, each layer learning increasingly complex features of the input image.
Finally, the resulting output is fed into a fully connected layer for classification, object detection, or other structural analyses. The final output of the network is a predicted class label or probability score for each class, depending on the task.
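Tying together the pieces above (activations, loss, gradient descent, backpropagation), here is a minimal NumPy sketch of a one-hidden-layer network trained with full-batch gradient descent on XOR (the layer sizes, learning rate, and iteration count are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)  # XOR

sigmoid = lambda z: 1 / (1 + np.exp(-z))

# 1. Initialize weights randomly with near-zero values
W1, b1 = rng.normal(0, 0.5, (2, 4)), np.zeros(4)
W2, b2 = rng.normal(0, 0.5, (4, 1)), np.zeros(1)
gamma = 1.0  # learning rate

for _ in range(5000):
    # Forward pass
    h = sigmoid(X @ W1 + b1)
    y_hat = sigmoid(h @ W2 + b2)
    # Backpropagation: gradients of binary cross-entropy w.r.t. each layer
    d_out = (y_hat - y) / len(X)
    dW2, db2 = h.T @ d_out, d_out.sum(axis=0)
    d_h = (d_out @ W2.T) * h * (1 - h)
    dW1, db1 = X.T @ d_h, d_h.sum(axis=0)
    # Update: W <- W - gamma * dJ/dW
    W1 -= gamma * dW1; b1 -= gamma * db1
    W2 -= gamma * dW2; b2 -= gamma * db2

print(y_hat.round(2))  # should approach [[0], [1], [1], [0]]
```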
Recurrent Neural Network
Recurrent Neural Networks (RNNs) are designed to process sequences of data. They work well for jobs requiring sequences, such as time series data, voice, natural language, and other activities.
• An RNN works on the principle of saving the output of a particular layer using hidden states and feeding this back to the input in order to predict the output of the layer.
→ RNNs memorize information from previous data with feedback loops inside, which helps to keep data information over time.
→ The diagram has an arrow pointing to itself, indicating that the data inside block "A" will be recursively used. Once expanded, its structure is equivalent to a chain-like structure.
→ Learning to store information or data over long time intervals via recurrent backpropagation takes a very long time: the gradient gradually vanishes as it propagates to earlier time steps. These downstream gradients rely on parameter (weight) sharing for efficiency, and repeatedly multiplying values greater than or less than 1 leads to:
- Exploding gradients - model instability and overflows
- Vanishing gradients - loss of learning ability
→ This can be solved using:
- Gradient clipping - cap the maximum value of gradients
- ReLU - its derivative prevents gradient shrinkage for x > 0
- Gated cells - regulate the flow of information
Also, for the non-convex problem, RNN training can confuse local minima with the global minimum. To overcome these problems, LSTM was introduced as an RNN language-modelling learning algorithm based on the feedforward architecture.
[Figure: vanishing gradient problem for RNNs — the shading of the nodes indicates their sensitivity to early inputs, which decays as the network backpropagates through time; the darker the shade, the greater the sensitivity.]
[Figure: preservation of gradient information by the LSTM — the sensitivity of the output layer can be switched on and off.]
LSTM is based on the RNN, so the basic structure of the RNN cell carries over, but the LSTM memorizes information over long periods of time, which is important in many applications such as time prediction of the high-frequency (HF) spectrum.
→ The difference between an RNN and an LSTM: the RNN cell has only one tanh layer, while the LSTM cell has four layers — the forget gate layer, the store (input) gate layer, the new cell state layer, and the output layer — plus the previous cell state.
→ The forget layer is responsible for deciding what information to retain from the previous cell state, and what information is to be forgotten or removed:
f_t = σ(W_f · [h_{t−1}, x_t] + b_f)
→ The store gate layer has an input gate with which we calculate another variable called the new candidate values — the information that seems relevant and is added to the cell state:
i_t = σ(W_i · [h_{t−1}, x_t] + b_i)
C̃_t = tanh(W_c · [h_{t−1}, x_t] + b_c)
→ The new cell state layer calculates the new cell state by updating the information from the last cell, using the information acquired from the previous two layers:
C_t = f_t ∗ C_{t−1} + i_t ∗ C̃_t
→ The output layer makes use of all the information gathered over the last three layers to produce an output:
o_t = σ(W_o · [h_{t−1}, x_t] + b_o)
h_t = o_t ∗ tanh(C_t)
→ Also, the cell state at the top of the diagram, starting with C_{t−1}, runs horizontally, as it keeps the information intact over long periods of time with only minor linear interactions.
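A minimal sketch of an LSTM for one-step-ahead sequence prediction, assuming PyTorch is available (the layer sizes, window length, and toy data are illustrative, not from the original):

```python
import torch
import torch.nn as nn

class LSTMForecaster(nn.Module):
    def __init__(self, hidden_size=32):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, 1)

    def forward(self, x):                  # x: (batch, time, 1)
        out, _ = self.lstm(x)              # gates f_t, i_t, o_t handled internally
        return self.head(out[:, -1, :])    # predict the next value from the last h_t

# Toy sequences: predict the next point of a sine wave from a 20-step window
t = torch.linspace(0, 8 * torch.pi, 400)
series = torch.sin(t)
X = torch.stack([series[i:i + 20] for i in range(300)]).unsqueeze(-1)
y = torch.stack([series[i + 20] for i in range(300)]).unsqueeze(-1)

model = LSTMForecaster()
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
for _ in range(200):
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(X), y)
    loss.backward()
    opt.step()
print(loss.item())
```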
Time Series
A time series is a random sequence {Xt} of real values recorded at successive, equally spaced points in time.
→ Not every dataset collected with respect to time represents a time series.
Methods of prediction and forecasting on time-based data constitute Time Series Modeling.
• Examples of time series: stock market prices, airline passenger counts, temperature over time, monthly sales data, quarterly/annual revenue, hourly weather data/wind speed, IoT sensors in industries and smart devices, energy forecasting.

Difference between Time Series and Regression
• A time series is time dependent. However, the basic assumption of a linear regression model is that the observations are independent.
• Along with an increasing or decreasing trend, most time series have some form of seasonality.
Note:
→ Predicting a time series using regression techniques is not a good approach.
→ Time series forecasting is the use of a model to predict future values based on previously observed values.
→ A stochastic process is defined as a collection of random variables X = {Xt : t ∈ T} defined on a common probability space, taking values in a common set S (the state space), and indexed by a set T, often either N or [0, ∞), thought of as time (discrete or continuous respectively) (Oliver, 2009).

Time Series Statistical Models
A time series model specifies the joint distribution of the sequence {Xt} of random variables; e.g.,
P(X1 ≤ x1, . . . , Xt ≤ xt) for all t and x1, . . . , xt
Typically, a time series model can be described as
Xt = mt + st + Yt
where mt is the trend component, st the seasonal component, and Yt a zero-mean error.
Note: The following are some zero-mean models.
→ iid noise: the simplest time series model is the one with no trend or seasonal component, where the observations Xt are simply independent and identically distributed random variables with zero mean. Such a sequence of random variables {Xt} is referred to as iid noise.
P(X1 ≤ x1, . . . , Xt ≤ xt) = Πt P(Xt ≤ xt) = Πt F(xt)
where F(·) is the cdf of each Xt. Further, E(Xt) = 0 for all t. We denote such a sequence as Xt ∼ IID(0, σ²). IID noise is not interesting for forecasting, since Xt|X1, . . . , Xt−1 = Xt.
→ iid noise example: a binary (discrete) process {Xt} is a sequence of iid random variables with
P(Xt = 1) = 0.5, P(Xt = −1) = 0.5
→ Gaussian noise example (a continuous process): Gaussian noise {Xt} is a sequence of iid normal random variables with zero mean and σ² variance; i.e., Xt ∼ N(0, σ²)
→ Random walk: the random walk {St, t = 0, 1, 2, . . .} (starting at zero, S0 = 0) is obtained by cumulatively summing (or "integrating") random variables; i.e., S0 = 0 and
St = X1 + · · · + Xt, for t = 1, 2, . . .
where {Xt} is iid noise with zero mean and σ² variance. Note that by differencing, we can recover Xt:
∇St = St − St−1 = Xt
Further, we have
E(St) = E(Σt Xt) = Σt E(Xt) = 0
Var(St) = Var(Σt Xt) = Σt Var(Xt) = tσ²
→ White noise: we say {Xt} is white noise, i.e., Xt ∼ WN(0, σ²), if {Xt} is uncorrelated, i.e., Cov(Xt1, Xt2) = 0 for any t1 and t2, with E[Xt] = 0 and Var(Xt) = σ².
Note: Every IID(0, σ²) sequence is WN(0, σ²), but not conversely.
• Moving Average Smoother: an essentially non-parametric method for trend estimation. It takes averages of observations around t, i.e., it smooths the series. For example, let
Xt = (1/3)(Wt−1 + Wt + Wt+1)
which is a three-point moving average of the white noise series Wt.
→ AR(1) model (autoregression of order 1): let
Xt = 0.6 Xt−1 + Wt
where Wt is a white noise series. It represents a regression or prediction of the current value Xt of the time series as a function of the past value of the series.

Stationary Process
Time series analysis extracts characteristics from time-sequenced data, which may exhibit the following characteristics:
- Stationarity - statistical properties such as mean, variance and auto-correlation are constant over time; an autocovariance that does not depend on time; no trend or seasonality
- Non-stationarity - there are two major reasons behind the non-stationarity of a time series:
- Trend - varying mean over time (the mean is not constant). A trend is a general direction in which something is developing or changing.
- Seasonality - variations at specific time-frames (the standard deviation is not constant): any predictable change or pattern in a time series that recurs or repeats over a specific time period (calendar times), occurring at regular intervals of less than a year
- Cyclicality - variations without a fixed time length, occurring in periods of greater or less than one year
- Autocorrelation - degree of linear similarity between current and lagged values
• CV must account for the time aspect, such that for each fold Fx:
- Sliding Window - train F1, test F2; then train F2, test F3
- Forward Chain - train F1, test F2; then train F1, F2, test F3
• Exponential Smoothing - uses an exponentially decreasing weight on observations over time, and takes a moving average. The time-t output is st = αxt + (1 − α)st−1, where 0 < α < 1.
• Double Exponential Smoothing - applies a recursive exponential filter to capture trends within a time series:
st = αxt + (1 − α)(st−1 + bt−1)
bt = β(st − st−1) + (1 − β)bt−1
Triple exponential smoothing adds a third variable γ that accounts for seasonality.
• ARIMA - models a time series using three parameters (p, d, q):
- Autoregressive - the past p values affect the next value
- Integrated - values are replaced with the difference between current and previous values, using the difference degree d (0 for stationary data, and 1 for non-stationary)
- Moving Average - the number of lagged forecast errors and the size of the moving average window q
• SARIMA - models seasonality through four additional seasonality-specific parameters: P, D, Q, and the season length s
• Prophet - additive model that uses non-linear trends to account for multiple seasonalities such as yearly, weekly, and daily.
→ Robust to missing data and handles outliers well.
→ Can be represented as y(t) = g(t) + s(t) + h(t) + ϵ(t), with four distinct components for the growth over time, seasonality, holiday effects, and error. This specification is similar to a generalized additive model.
• Generalized Additive Model - combines predictive methods while preserving additivity across variables, in a form such as y = β0 + f1(x1) + · · · + fm(xm), where the functions can be non-linear.
→ GAMs also provide regularized and interpretable solutions for regression and classification problems.

Tutorial: Complete Guide on Time Series Analysis in Python
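A minimal NumPy sketch of single and double exponential smoothing as defined above (the α and β values are illustrative):

```python
import numpy as np

def single_exp_smoothing(x, alpha=0.3):
    """s_t = alpha * x_t + (1 - alpha) * s_{t-1}"""
    s = np.zeros_like(x, dtype=float)
    s[0] = x[0]
    for t in range(1, len(x)):
        s[t] = alpha * x[t] + (1 - alpha) * s[t - 1]
    return s

def double_exp_smoothing(x, alpha=0.3, beta=0.1):
    """Adds a trend term b_t: s_t = alpha*x_t + (1-alpha)*(s_{t-1} + b_{t-1})."""
    s = np.zeros_like(x, dtype=float)
    b = np.zeros_like(x, dtype=float)
    s[0], b[0] = x[0], x[1] - x[0]
    for t in range(1, len(x)):
        s[t] = alpha * x[t] + (1 - alpha) * (s[t - 1] + b[t - 1])
        b[t] = beta * (s[t] - s[t - 1]) + (1 - beta) * b[t - 1]
    return s, b

rng = np.random.default_rng(0)
x = np.arange(100) * 0.5 + rng.normal(0, 2, 100)  # noisy upward trend
s, b = double_exp_smoothing(x)
print(s[-1] + b[-1])  # one-step-ahead forecast
```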

Last Updated July 11, 2024