Final ML
Concept Overview
Machine Learning (ML) is a subset of artificial intelligence (AI) that allows
computer systems to learn from data without explicit programming. This
learning involves identifying patterns, building models, and making predictions
or decisions based on the data.
Linear Regression
Concept Overview Linear regression is a fundamental supervised learning
algorithm that models the relationship between a dependent variable and one
or more independent variables. It assumes a linear relationship, meaning
that changes in the independent variables are proportionally reflected in the
dependent variable.
Purpose and Relevance
• Predicting continuous values: House prices, stock market trends, sales
figures
• Understanding relationships: Quantifying the impact of independent
variables on the dependent variable.
Mathematical Foundation
• Bivariate linear regression (one independent variable, X, and one
dependent variable, Y): ŷ = b + wx where:
– ŷ is the predicted value of Y
– b is the intercept (value of Y when X is zero)
– w is the regression coefficient (slope of the line)
– x is the value of the independent variable
• Multivariate linear regression (multiple independent variables): ŷ =
b + w1x1 + w2x2 + … + wdxd where:
– d is the number of independent variables
The method of least squares estimates the optimal values of b and w by
minimising the sum of squared errors (SSE), which is the sum of the squared
differences between the actual values (yi) and the predicted values (ŷi):
SSE = Σ_{i=1}^{n} (yi − ŷi)²
Algorithms
1. Calculate the means: Calculate the means of X (µX) and Y (µY).
2. Calculate the regression coefficient (w):
• For bivariate linear regression: w = Σ((xi − µX)(yi − µY)) / Σ(xi − µX)²
• For multivariate linear regression: Refer to Algorithm 23.1 in source,
which utilises QR-decomposition for efficient computation.
3. Calculate the intercept (b): b = µY - wµX
4. Predict the dependent variable: Use the calculated b and w to predict
Y for any given X.
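As a concrete illustration, here is a minimal Python sketch of the bivariate least-squares procedure above (the function name and sample data are illustrative, not from the source):

import numpy as np

def fit_bivariate(x, y):
    # Closed form: w = Σ(xi − µX)(yi − µY) / Σ(xi − µX)², then b = µY − w·µX
    mu_x, mu_y = x.mean(), y.mean()
    w = np.sum((x - mu_x) * (y - mu_y)) / np.sum((x - mu_x) ** 2)
    b = mu_y - w * mu_x
    return b, w

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])
b, w = fit_bivariate(x, y)
print(b + w * 5.0)   # predict Y for a new point X = 5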
Polynomial Regression Polynomial regression extends linear regression
by allowing for non-linear relationships between the independent and dependent
variables. Instead of fitting a straight line, polynomial regression fits a curve to
the data. It involves transforming the independent variables by raising them to
different powers (e.g., X2, X3). The model becomes:
ŷ = b + w1·x + w2·x² + … + wd·x^d
where:
• d is the degree of the polynomial
By choosing the appropriate degree, polynomial regression can model complex
relationships between variables. However, higher-degree polynomials can lead to
overfitting, where the model captures noise in the data instead of the underlying
trend.
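To make the degree/overfitting point concrete, here is a small sketch (synthetic data and parameters are illustrative) that fits polynomials of increasing degree with NumPy; a high-degree fit drives the training SSE down but chases the noise:

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)  # noisy target

for degree in (1, 3, 9):
    coeffs = np.polyfit(x, y, degree)        # least-squares polynomial fit
    y_hat = np.polyval(coeffs, x)
    print(degree, np.sum((y - y_hat) ** 2))  # training SSE shrinks with degree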
Applications
• Economics: Modelling the relationship between economic factors and
growth.
• Engineering: Fitting curves to experimental data.
• Environmental science: Analysing the impact of pollution on environ-
mental indicators.
Logistic Regression
Concept Overview Logistic regression is a supervised learning algorithm
for binary classification, predicting one of two possible outcomes. It predicts
the probability of an instance belonging to a particular class and then classifies
it based on that probability.
Purpose and Relevance
• Predict probabilities: The likelihood of a customer purchasing a prod-
uct, a patient developing a disease, or an email being spam.
• Classify instances: Based on the predicted probability, assign an in-
stance to a class.
The logistic regression model predicts the probability of an instance belonging
to the positive class (class 1) as:
P(Y=1 | X=x) = σ(z), where z = b + wᵀx and σ(z) = 1 / (1 + exp(−z)) is the sigmoid (logistic) function.
The model parameters (b and w) are learned using maximum likelihood
estimation (MLE), which aims to find the parameters that maximise the
likelihood of observing the training data.
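A minimal sketch of the prediction side of the model (names are illustrative); the parameters b and w would come from MLE training:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(x, w, b):
    # P(Y=1 | X=x) = sigmoid(b + w^T x)
    return sigmoid(b + x @ w)

def classify(x, w, b, threshold=0.5):
    # Assign the positive class when the predicted probability exceeds the threshold
    return int(predict_proba(x, w, b) >= threshold)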
Gradient Descent
Concept Overview
Gradient descent is a fundamental iterative optimisation algorithm for finding
the minimum of a differentiable function. It starts with an initial guess for the
solution and repeatedly updates it by moving in the direction of the steepest
descent, which is the negative of the function’s gradient at that point.
Purpose and Relevance:
• Finding optimal solutions: In machine learning, gradient descent is
widely used to find the best-fit parameters of models, minimising a chosen
loss function.
• Wide applicability: Gradient descent applies to various fields beyond
machine learning, including engineering, physics, and economics.
Interrelation with Other Concepts:
• Basis for many algorithms: Many machine learning algorithms, such
as linear regression, logistic regression, and neural networks, use gradient
descent as their core optimisation technique.
• Building block for more advanced methods: Stochastic Gradient
Descent (SGD) and its variants are extensions of gradient descent.
Mathematical Foundation
The gradient of a differentiable function 𝑓(𝑤) ∶ ℝ𝑑 → ℝ at a point 𝑤, denoted
as ∇𝑓(𝑤), is the vector of its partial derivatives:
∇f(w) = ( ∂f(w)/∂w[1], …, ∂f(w)/∂w[d] ).
Algorithms
Gradient Descent
Purpose: Minimise a differentiable function 𝑓(𝑤).
Steps:
1. Initialise: Set an initial solution 𝑤(1) (often set to 0). Choose a learning
rate 𝜂.
2. Iterate: Repeat the following steps until convergence:
• Calculate the gradient: ∇𝑓(𝑤(𝑡) )
• Update the solution: 𝑤(𝑡+1) = 𝑤(𝑡) − 𝜂∇𝑓(𝑤(𝑡) )
3. Output: Return the final solution 𝑤(𝑡) .
Convergence Criteria:
• Maximum number of iterations: Stop after a predefined number of
iterations.
• Change in solution: Stop when the change in the solution between
iterations is smaller than a threshold.
• Gradient magnitude: Stop when the magnitude of the gradient is
smaller than a threshold.
Examples
Example: Finding the Minimum of a Quadratic Function
Consider the function: f(w) = w² + 2w + 1
1. Calculate the gradient: ∇𝑓(𝑤) = 2𝑤 + 2
2. Initialise: Set 𝑤(1) = 0 and 𝜂 = 0.1.
3. Iterate:
• Iteration 1: ∇𝑓(0) = 2, 𝑤(2) = 0 − 0.1 ∗ 2 = −0.2
• Iteration 2: ∇𝑓(−0.2) = 1.6, 𝑤(3) = −0.2 − 0.1 ∗ 1.6 = −0.36
• … Continue iterating until convergence
The algorithm will converge towards the minimum of the function, which is
𝑤 = −1.
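The iteration above is easy to reproduce in code; a minimal sketch of the loop (the tolerance value is illustrative):

def grad(w):
    return 2 * w + 2              # gradient of f(w) = w² + 2w + 1

w, eta = 0.0, 0.1                 # initial solution and learning rate
for t in range(1000):
    w_next = w - eta * grad(w)    # gradient-descent update
    if abs(w_next - w) < 1e-8:    # stop when the change in solution is tiny
        break
    w = w_next
print(w)                          # approaches the minimiser w = -1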
Applications
• Linear Regression: Finding the coefficients that minimise the sum of
squared errors between the predicted and actual values.
• Logistic Regression: Finding the parameters that maximise the likeli-
hood of observing the training data.
• Neural Networks: Training deep learning models by adjusting the net-
work weights to minimise the loss function.
Stochastic Gradient Descent (SGD)
Concept Overview
Stochastic gradient descent (SGD) is a variant of gradient descent that, at each
iteration, updates the solution using a gradient estimate computed from a single
randomly selected data point rather than from the full dataset.
Purpose and Relevance:
• Efficiency: Significantly faster than standard gradient descent, especially
for large datasets.
• Handling noisy data: SGD is more robust to noise and outliers in the
data.
• Online learning: Suitable for online learning scenarios where data ar-
rives sequentially.
Mathematical Foundation
SGD uses a stochastic approximation of the true gradient. At each iteration,
it randomly selects a data point (𝑥𝑖 , 𝑦𝑖 ) and calculates the gradient of the loss
function for that point. This stochastic gradient is then used to update the
solution:
𝑤(𝑡+1) = 𝑤(𝑡) − 𝜂𝑣𝑡
where:
• 𝑣𝑡 is a random vector such that 𝐸[𝑣𝑡 |𝑤(𝑡) ] ∈ 𝜕𝑓(𝑤(𝑡) ), i.e., the expected
value of 𝑣𝑡 given the current solution is a subgradient of the function at
that point.
Algorithms
Stochastic Gradient Descent (SGD)
Purpose: Minimise a function 𝑓(𝑤), often a loss function in machine learning.
Steps:
1. Initialise: Set an initial solution 𝑤(1) . Choose a learning rate 𝜂 and the
number of iterations 𝑇 .
2. Iterate: For 𝑡 = 1, 2, … , 𝑇 :
• Randomly select a data point (𝑥𝑖 , 𝑦𝑖 )
• Calculate the gradient for the selected point: v_t ∈ ∂ℓ(w^(t), (x_i, y_i)), where ℓ is the loss function.
• Update the solution: 𝑤(𝑡+1) = 𝑤(𝑡) − 𝜂𝑣𝑡
3. Output: Return the averaged solution w̄ = (1/T) Σ_{t=1}^{T} w^(t).
Examples
Example: SGD for Linear Regression
Consider a set of data points (𝑥𝑖 , 𝑦𝑖 ), and the squared loss function:
ℓ(w, (x_i, y_i)) = ½ (wᵀx_i − y_i)².
1. Randomly select a data point (𝑥𝑖 , 𝑦𝑖 )
2. Calculate the gradient for the selected point: 𝑣𝑡 = (𝑤𝑇 𝑥𝑖 − 𝑦𝑖 )𝑥𝑖 .
3. Update the solution: 𝑤(𝑡+1) = 𝑤(𝑡) − 𝜂(𝑤𝑇 𝑥𝑖 − 𝑦𝑖 )𝑥𝑖 .
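A minimal NumPy sketch of this procedure, returning the averaged iterate as in the algorithm above (hyperparameters illustrative):

import numpy as np

def sgd_linear_regression(X, y, eta=0.01, T=10000, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    w_sum = np.zeros_like(w)
    for t in range(T):
        i = rng.integers(len(y))          # randomly select a data point
        v = (w @ X[i] - y[i]) * X[i]      # gradient of 0.5*(w^T x_i - y_i)^2
        w = w - eta * v                   # update the solution
        w_sum += w
    return w_sum / T                      # averaged solution w̄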
Applications
• Large-scale machine learning: Training models on massive datasets
where standard gradient descent is computationally expensive.
• Online advertising: Continuously updating models for ad targeting as
new user data arrives.
• Natural language processing: Training language models for tasks like
machine translation and text generation.
Subgradients
Concept Overview
Subgradients generalise the concept of gradients to non-differentiable func-
tions. A subgradient at a point is any vector defining a linear lower bound that
touches the function at that point and lies below the function's graph everywhere.
Purpose and Relevance:
• Handling non-differentiable functions: Many loss functions used in
machine learning, like the hinge loss in Support Vector Machines, are non-
differentiable.
• Extending optimisation algorithms: Subgradients allow the use of
gradient-based optimisation algorithms for non-differentiable functions.
Mathematical Foundation
For a convex function 𝑓(𝑤), a vector 𝑣 is a subgradient at 𝑤 if for all 𝑢 in the
domain of 𝑓:
𝑓(𝑢) ≥ 𝑓(𝑤) + ⟨𝑢 − 𝑤, 𝑣⟩.
The subdifferential 𝜕𝑓(𝑤) is the set of all subgradients at 𝑤. If 𝑓 is differen-
tiable at 𝑤, then 𝜕𝑓(𝑤) = {∇𝑓(𝑤)}.
Examples
Example: Subgradient of the Absolute Value Function
Consider the absolute value function 𝑓(𝑤) = |𝑤|.
• For 𝑤 > 0, the subdifferential is 𝜕𝑓(𝑤) = {1}.
• For 𝑤 < 0, the subdifferential is 𝜕𝑓(𝑤) = {−1}.
• For 𝑤 = 0, the subdifferential is 𝜕𝑓(𝑤) = [−1, 1] because any line with a
slope between -1 and 1 lies below the graph of the function at 𝑤 = 0.
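A small sketch showing that a subgradient step works even at the kink of |w| (choosing 0 at w = 0 is one valid element of the subdifferential [−1, 1]):

def subgradient_abs(w):
    # Any value in [-1, 1] is a valid subgradient at w = 0; we pick 0
    return 1.0 if w > 0 else (-1.0 if w < 0 else 0.0)

w, eta = 2.0, 0.1
for t in range(50):
    w = w - eta * subgradient_abs(w)   # subgradient-descent update
print(w)                               # converges to the minimiser w = 0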
Stochastic Gradient Descent for Risk Minimisation
Concept Overview
Stochastic Gradient Descent (SGD) for risk minimisation directly ap-
plies the SGD algorithm to minimise the risk function, which measures the
expected loss of a model over the data distribution.
Purpose and Relevance:
• Direct risk minimisation: Instead of relying on the empirical risk (loss
on the training data), SGD can directly optimise the risk function.
• Suitable for online learning: Efficiently update the model as new data
becomes available.
Mathematical Foundation
In machine learning, the risk function 𝐿𝐷 (𝑤) is the expected loss of a model
with parameters 𝑤 over the data distribution 𝐷:
𝐿𝐷 (𝑤) = 𝐸𝑧∼𝐷 [𝑙(𝑤, 𝑧)]
where:
• 𝑙(𝑤, 𝑧) is the loss function.
SGD for risk minimisation uses the loss gradient on a single randomly sam-
pled data point as an unbiased estimate of the risk gradient.
Algorithms
SGD for Risk Minimisation
Purpose: Minimise the risk function 𝐿𝐷 (𝑤).
Steps:
1. Initialise: Set an initial solution 𝑤(1) . Choose a learning rate 𝜂 and the
number of iterations 𝑇 .
2. Iterate: For 𝑡 = 1, 2, … , 𝑇 :
• Sample a data point: 𝑧 ∼ 𝐷.
• Calculate a subgradient for the sampled point: 𝑣𝑡 ∈ 𝜕𝑙(𝑤(𝑡) , 𝑧).
• Update the solution: 𝑤(𝑡+1) = 𝑤(𝑡) − 𝜂𝑣𝑡 .
3. Output: Return the averaged solution w̄ = (1/T) Σ_{t=1}^{T} w^(t).
Applications
• Online Learning: Update models in real-time as new data points become
available.
• Large-scale machine learning: Efficiently train models when the
dataset is too large to fit in memory.
• Recommendation systems: Update user preferences based on their
interactions.
Bias-Variance Tradeoff
Mathematical Foundation
The expected squared error for a given test point 𝑥 and model 𝑀 (𝑥, 𝐷)
trained on dataset 𝐷 can be decomposed into three terms:
E_{x,D,y}[(y − M(x, D))²] = E_{x,y}[(y − E_y[y|x])²] + E_{x,D}[(M(x, D) − E_D[M(x, D)])²] + E_x[(E_D[M(x, D)] − E_y[y|x])²]
where:
• Noise: The first term, E_{x,y}[(y − E_y[y|x])²], represents the inherent noise
in the data. It is independent of the model and serves as a baseline error.
• Average Variance: The second term, E_{x,D}[(M(x, D) − E_D[M(x, D)])²],
measures the variance of the model's predictions over different training
sets.
• Average Bias: The third term, E_x[(E_D[M(x, D)] − E_y[y|x])²], quantifies
the (squared) bias, reflecting the difference between the average prediction of the
model and the true expected value.
The sources provide information on the bias and variance of a classifier in the
context of the bias-variance decomposition of the expected squared loss.
Bias reflects the systematic deviation of a classifier’s predicted decision bound-
ary from the true decision boundary. It is calculated as:
E_x[(E_D[M(x, D)] − E_y[y|x])²]
where:
• M(x, D) represents the predicted value of the model M for a test point
x, trained on dataset D.
• ED[M(x, D)] is the expected prediction of the model over all possible
training sets D.
• Ey[y|x] is the expected value of the true response y given the test point
x.
This term measures the squared difference between the average prediction of
the model and the true expected value of the response.
Variance refers to the variability of the learned decision boundaries over differ-
ent training sets. It is calculated as:
E_{x,D}[(M(x, D) − E_D[M(x, D)])²]
where:
• M(x, D) represents the predicted value of the model M for a test point
x, trained on dataset D.
• ED[M(x, D)] is the expected prediction of the model over all possible
training sets D.
This term measures the average squared difference between the model’s pre-
diction for a specific training set and its average prediction over all training
sets.
Problems Solved by the Formula:
• Decomposition of error: Allows the analysis of the expected squared
error in terms of noise, bias, and variance, providing insights into the
source of model errors.
• Understanding the tradeoff: Helps visualise and quantify the tradeoff
between bias and variance, guiding the selection of appropriate model
complexity.
Relation with Model Fitting
• High Bias (Underfitting): A high-bias model will have a large average
bias term, indicating that its predictions are consistently far from the true
values, regardless of the training data used.
• High Variance (Overfitting): A high-variance model will have a large
average variance term, signifying its sensitivity to the specific training set.
Its predictions will fluctuate significantly for different training datasets.
Applications
• Model Selection: Choosing the right model complexity by finding the
sweet spot that minimises both bias and variance. For example, in polyno-
mial regression, selecting the appropriate degree of the polynomial based
on the bias-variance tradeoff.
• Regularisation: Techniques like L1 and L2 regularisation in linear
regression help control model complexity by penalising large weights, thus
reducing variance.
• Ensemble Methods: Techniques like bagging and boosting combine
multiple models to reduce variance (bagging) or bias (boosting).
• Hyperparameter Tuning: Adjusting hyperparameters to find the op-
timal balance between bias and variance. For instance, in k-nearest
neighbours, choosing the optimal value of k involves considering its im-
pact on the bias-variance tradeoff.
• Neural Network Architecture: The choice of the number of layers and
neurons in a neural network affects its complexity and consequently the
bias-variance tradeoff.
Examples
Example: Polynomial Regression
Consider fitting a polynomial to a dataset.
• Low-degree polynomial (e.g., linear): May have high bias as it can’t
capture complex non-linear relationships but will have low variance.
• High-degree polynomial: Can perfectly fit the training data, resulting
in low bias. However, it will likely overfit, leading to high variance and
poor generalisation to new data.
Example: k-Nearest Neighbours
In k-NN classification, the value of k affects bias and variance:
• Small k: Results in low bias as the prediction relies heavily on nearby
points but has high variance as it’s sensitive to noise in individual data
points.
• Large k: Averages over more neighbours, reducing variance but increasing
bias as it smooths out the decision boundaries.
By understanding the bias-variance tradeoff, one can choose the appropriate de-
gree of the polynomial or the value of k to achieve a balance between underfitting
and overfitting, leading to better generalisation performance.
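The decomposition can also be estimated numerically. Below is a sketch assuming a sin(2πx) ground truth with Gaussian noise (an illustrative setup, not from the source) that averages polynomial fits over many random training sets:

import numpy as np

rng = np.random.default_rng(0)
x_test = np.linspace(0, 1, 50)
f_true = np.sin(2 * np.pi * x_test)

for degree in (1, 9):
    preds = []
    for _ in range(200):                           # 200 random training sets D
        x = rng.uniform(0, 1, 15)
        y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.size)
        preds.append(np.polyval(np.polyfit(x, y, degree), x_test))
    preds = np.array(preds)
    bias2 = np.mean((preds.mean(axis=0) - f_true) ** 2)   # average squared bias
    variance = np.mean(preds.var(axis=0))                 # average variance
    print(degree, round(bias2, 3), round(variance, 3))    # degree 1: high bias; degree 9: high variance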
Methods
Validation for Model Selection
• The simplest validation approach involves splitting the dataset into two
parts: a training set and a validation set.
• The model is trained on the training set, and its performance is evaluated
on the validation set.
• The model with the best performance on the validation set is selected.
K-Fold Cross-Validation
• A more robust method to reduce the variance of the validation estimate.
• The data is divided into 𝐾 equal-sized folds.
• The model is trained 𝐾 times, each time using 𝐾 − 1 folds for training
and the remaining fold for validation.
• The final performance estimate is the average performance over all 𝐾 folds.
Algorithm:
k-Fold Cross Validation for Model Selection
input:
  training set S = ((x1, y1), …, (xm, ym))
  set of parameter values Θ
  learning algorithm A
  integer k
partition S into S1, S2, …, Sk
foreach θ ∈ Θ:
  for i = 1 … k:
    h(i, θ) = A(S \ Si; θ)
  error(θ) = (1/k) Σ_{i=1}^{k} L_{Si}(h(i, θ))
output:
  θ* = argmin_θ error(θ)
  h_{θ*} = A(S; θ*)
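A minimal Python sketch of the same procedure (the interface names are illustrative: A trains a hypothesis, loss evaluates it on held-out data):

import numpy as np

def k_fold_cv(X, y, thetas, A, loss, k):
    folds = np.array_split(np.arange(len(y)), k)    # partition S into S1..Sk
    error = {}
    for theta in thetas:
        errs = []
        for i in range(k):
            val = folds[i]
            train = np.concatenate([folds[j] for j in range(k) if j != i])
            h = A(X[train], y[train], theta)        # h(i, theta) = A(S \ Si; theta)
            errs.append(loss(h, X[val], y[val]))
        error[theta] = np.mean(errs)                # error(theta)
    best = min(error, key=error.get)                # theta* = argmin error(theta)
    return A(X, y, best), best                      # retrain on all of S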
Training-Validation-Test Split
• The data is divided into three sets: training, validation, and test.
• The training set is used for model training, the validation set is used for
model selection and hyperparameter tuning, and the test set is used for
final evaluation of the chosen model.
Regularization
Concept Overview
Regularized loss minimisation chooses the model parameters that minimise a
combination of the empirical loss on the training data and a penalty term that
encourages simpler models.
Lasso Regression (L1 Regularization):
The objective function for lasso regression is given as:
• min_w J(w) = (1/2) · ||Y − Dw||² + α · ||w||₁
Where:
• 𝐷 is the data matrix containing the independent variables.
• 𝑌 is the response vector.
• 𝑤 represents the vector of model parameters (weights).
• 𝛼 ≥ 0 is the regularization constant controlling the strength of the penalty.
• ||w||₁ = Σ_{i=1}^{d} |w_i| is the L1-norm of the weight vector, representing the
sum of the absolute values of the weights.
Ridge Regression (L2 Regularization):
The objective function for ridge regression is:
• min_{w̃} J(w̃) = ||Y − D̃w̃||² + α · ||w̃||²
Where:
• 𝐷̃ is the augmented data matrix (including a column of 1s for the bias
term).
• 𝑤̃ is the augmented weight vector (including the bias term).
• 𝛼 ≥ 0 is the regularization constant.
• ||w̃||² = Σ_{i=1}^{d} w_i² is the squared L2-norm of the weight vector, which is the sum of
the squared values of the weights.
Key Differences:
• Lasso regression (L1) uses the absolute values of the weights in the
penalty term, encouraging sparsity in the solution (some weights become
exactly zero). This can be beneficial for feature selection, as unimportant
features are effectively removed from the model.
• Ridge regression (L2) uses the squared values of the weights, lead-
ing to shrinkage of the weights towards zero (but not exactly zero). This
helps to prevent any single feature from having an overly dominant influ-
ence on the model.
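For ridge regression the penalised objective has a closed-form solution, w̃ = (D̃ᵀD̃ + αI)⁻¹D̃ᵀY. A minimal sketch (note: this simplified version also penalises the bias term):

import numpy as np

def ridge_fit(D, Y, alpha):
    D_aug = np.hstack([np.ones((D.shape[0], 1)), D])   # augment with a column of 1s
    d = D_aug.shape[1]
    # Solve (D~^T D~ + alpha*I) w~ = D~^T Y for the augmented weight vector
    return np.linalg.solve(D_aug.T @ D_aug + alpha * np.eye(d), D_aug.T @ Y)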
• Regularization techniques are used during model training to prevent
overfitting.
• Validation techniques are applied to assess the impact of different regular-
ization parameters and select the best setting.
By carefully combining these techniques, we can train models that perform well
on unseen data and avoid the pitfalls of overfitting.
Examples
Example: Regularized Linear Regression
In linear regression, L1 (LASSO) or L2 (Ridge) regularization can be added
to the loss function to prevent overfitting. K-fold cross-validation can then be
used to select the best regularization parameter 𝜆 by comparing the average
performance on the validation folds.
Example: Choosing the Depth of a Decision Tree
When training a decision tree, the depth of the tree is a hyperparameter that
needs to be tuned. A deeper tree can capture more complex patterns but risks
overfitting. Cross-validation can be used to evaluate trees of different depths
and select the one that generalises best to the validation data.
Example: Regularization in Neural Networks
Neural networks often employ techniques like dropout or weight decay (L2 regu-
larization) to prevent overfitting. A validation set is crucial for determining the
optimal dropout rate or weight decay parameter, ensuring the network doesn’t
memorise the training data but learns generalizable features.
Applications
• Image Classification: In image recognition tasks, models like convolu-
tional neural networks (CNNs) are trained using regularization techniques
and cross-validation to achieve high accuracy on unseen images.
• Natural Language Processing: Models for tasks like machine transla-
tion or sentiment analysis are trained using regularization and validation
to ensure they generalise well to different language styles and domains.
• Medical Diagnosis: Machine learning models used for disease prediction
or diagnosis are carefully validated and regularized to ensure reliable and
accurate performance on new patients.
• Financial Modelling: Predictive models for stock prices or credit risk
are validated and regularized to avoid overfitting to historical data and to
ensure robustness to market fluctuations.
Support Vector Machines (SVM)
Concept Overview
Support Vector Machines (SVMs) are supervised learning models used for clas-
sification and regression tasks. They are particularly well-suited for handling
high-dimensional data and are known for their ability to find complex, non-
linear decision boundaries. The fundamental idea behind SVMs is to find the
optimal hyperplane that maximises the separation between different
classes in the data.
Key Concepts:
• Hyperplane: In a d-dimensional space, a hyperplane is a (d-1)-
dimensional subspace that separates the space into two halves. In SVMs,
a hyperplane is used as the decision boundary between classes.
• Margin: The margin of a separating hyperplane is the distance between
the hyperplane and the closest data point from either class. SVMs aim to
find the hyperplane that maximises this margin.
• Support Vectors: The data points closest to the separating hyperplane
are called support vectors. These points play a crucial role in defining the
hyperplane and are the only ones that influence the classification decision.
• Kernel Trick: For non-linearly separable data, SVMs use the kernel trick
to implicitly map the data into a higher-dimensional space where it might
become linearly separable.
SVM Types:
• Hard SVM: This type of SVM works well for linearly separable data. It
seeks a hyperplane that perfectly classifies all training examples, maximis-
ing the margin.
• Soft SVM: When the data is not linearly separable, soft SVM allows
for some misclassification during training. It introduces a penalty term
for misclassified points, balancing the margin maximisation with error
minimisation.
Relation to Other Concepts:
• Regularized Loss Minimisation: SVMs can be viewed as a form of
regularized loss minimisation. The margin maximisation corresponds to
minimizing the norm of the weight vector, which acts as a regulariser,
while the loss function penalises misclassifications.
• Convex Optimisation: Training an SVM involves solving a convex opti-
misation problem, which guarantees finding the globally optimal solution.
This makes SVMs more robust compared to other models that might get
stuck in local optima.
Mathematical Foundation
Hard SVM The objective of Hard SVM is to find the optimal hyperplane,
defined by weight vector w and bias b, that maximises the margin while correctly
classifying all training data. This can be formulated as an optimisation problem:
Objective Function: maximize_{(w,b): ||w||=1} min_{i∈[m]} |⟨w, x_i⟩ + b| subject to ∀i,
y_i(⟨w, x_i⟩ + b) > 0
where:
• xi is the i-th data point.
• yi is the corresponding class label (either +1 or -1).
• ⟨w, x_i⟩ represents the dot product of the weight vector and the data point.
This formulation aims to find the hyperplane (w, b) that has a unit norm (||w||
= 1) and maximises the distance to the closest data point, ensuring all points
are correctly classified.
An equivalent formulation of the Hard SVM rule as a quadratic optimisation
problem is:
Objective Function: minimize_{(w,b)} ||w||² subject to ∀i, y_i(⟨w, x_i⟩ + b) ≥ 1
This formulation minimises the squared norm of the weight vector (||w||²) under
the constraint that all data points are correctly classified with a margin of at
least 1.
Soft SVM When the data is not linearly separable, the Hard SVM constraints
become infeasible. Soft SVM addresses this by introducing slack variables (ξ_i)
and a penalty parameter (C) to allow for misclassifications.
Objective Function: minimize_{(w,b,ξ)} ( (1/2)||w||² + C Σ_{i=1}^{n} ξ_i ) subject to
y_i(wᵀx_i + b) ≥ 1 − ξ_i and ξ_i ≥ 0, ∀x_i ∈ D
The objective function now balances margin maximisation (minimising ||w||²)
with error minimisation (minimising the sum of slack variables). The penalty
parameter C controls the trade-off between these two objectives.
Concept Overview
• Support Vector Machines (SVMs): Powerful supervised learning
models used for classification and regression tasks. SVMs aim to find
a hyperplane that best separates data points into different classes.
• Optimality in SVMs: The concept of optimality in SVMs refers to
finding the best possible hyperplane that maximises the margin between
classes, either in the original input space (hard-SVM) or a transformed
feature space (soft-SVM).
• Fritz John Optimality Theorem: A generalisation of the Lagrange
multiplier theorem that provides necessary conditions for optimality in
constrained optimisation problems. It’s crucial for understanding and
analysing the solutions found by SVMs.
Mathematical Foundation
Hard-SVM and Margin Maximisation The goal of hard-SVM is to find
a hyperplane (defined by w and b) that perfectly separates the data while
maximising the margin between classes. This can be formulated as:
maximise_{(w,b): ||w||=1} min_{i∈[m]} y_i(⟨w, x_i⟩ + b)
where:
• xi is a data point.
• yi is the class label (+1 or -1).
• w is the weight vector.
• b is the bias term.
This formulation seeks the hyperplane with the largest perpendicular distance
(margin) to the closest data points from each class.
Support Vectors
Support vectors are the data points that lie closest to the decision boundary and
have a direct influence on its position. They are crucial because they determine
the margin and play a key role in the classification process. The Fritz John
Optimality Theorem is used to identify these support vectors.
Algorithms
The sources don’t contain specific algorithms related to the Fritz John Optimal-
ity Theorem and its application in SVMs. However, it’s worth mentioning that
quadratic programming solvers are often used to find the optimal solution for
the SVM primal or dual problem.
Examples
The sources don’t provide solved examples related to the Fritz John Optimality
Theorem applied to SVMs. However, you can find numerous examples in SVM
literature, like the textbook “Understanding Machine Learning” by Shai Shalev-
Shwartz and Shai Ben-David.
Applications
Identifying Support Vectors: The Fritz John Optimality Theorem allows
us to identify support vectors by analysing the Lagrange multipliers (α_i) corre-
sponding to the constraints in the SVM optimisation problem.
• Support vectors have non-zero α_i values.
• Data points with α_i = 0 are not support vectors and do not affect the
decision boundary.
This knowledge is crucial for:
• Understanding model behaviour: Knowing which data points influ-
ence the decision boundary can provide insights into how the model makes
predictions.
• Data visualisation: Highlighting support vectors can help visualise the
decision boundary and the margin in SVM models.
• Model efficiency: Support vectors allow for a more compact representa-
tion of the SVM model, as only these points are required for prediction.
• Kernel methods: In kernel-based SVMs, support vectors become even
more crucial as they determine the shape of the decision boundary in the
implicit high-dimensional feature space.
Beyond SVMs: The Fritz John Optimality Theorem is a powerful tool with
applications extending far beyond SVMs:
• Convex optimisation: It forms the basis for solving a wide range of
constrained optimisation problems.
• Economics: Used in problems related to resource allocation, production
planning, and market equilibrium.
• Engineering: Applied in areas like control theory, signal processing, and
structural design.
Understanding this theorem allows for a deeper appreciation of the mathemati-
cal foundations of optimisation and its applications across various domains.
Algorithms
Several algorithms can be used to train SVMs, with the most common ones
being:
• Quadratic Programming: Hard SVM and Soft SVM formulations can
be solved directly using quadratic programming solvers. However, this
approach can be computationally expensive for large datasets.
• Stochastic Gradient Descent (SGD): SGD is a more scalable ap-
proach for training SVMs, especially for large datasets. It iteratively up-
dates the model parameters by computing the gradient of the loss function
on a small batch of data points.
SGD for Soft SVM with Kernels Goal: Solve the Soft SVM optimization
problem using kernels.
Parameter: T (number of iterations)
Initialize: β^(1) = 0 (coefficients vector)
For t = 1, …, T:
1. Let α^(t) = (1/(λt)) β^(t) (update coefficients)
2. Choose i uniformly at random from [m] (select a data point)
3. If y_i Σ_{j=1}^{m} α_j^(t) K(x_j, x_i) < 1 (check classification condition):
• β_i^(t+1) = β_i^(t) + y_i
• β_j^(t+1) = β_j^(t) for j ≠ i
4. Else:
• β^(t+1) = β^(t)
Output: α^(T) (final coefficients)
This algorithm uses the kernel trick to implicitly work in the higher-dimensional
feature space, avoiding the explicit computation of the feature mapping. The
kernel function K(x, x’) calculates the inner product between the feature repre-
sentations of two data points.
Applications
SVMs have wide applications in various domains, including:
• Image Classification: SVMs are used for recognizing objects, scenes,
and faces in images. They can handle high-dimensional image data effec-
tively.
• Text Classification: SVMs are applied for sentiment analysis, spam
filtering, and topic categorization, where the high dimensionality of text
data is a challenge.
• Bioinformatics: SVMs are used for protein classification, gene expres-
sion analysis, and drug discovery, where the data is often complex and
high-dimensional.
• Handwriting Recognition: SVMs are employed in recognizing hand-
written characters, digits, and signatures, effectively dealing with the vari-
ability in handwriting styles.
Mathematical Foundation
Duality Consider the primal form of the Soft SVM optimisation problem:
minimize_{(w,b,ξ)} ( (1/2)||w||² + C Σ_{i=1}^{n} ξ_i ) subject to y_i(wᵀx_i + b) ≥ 1 − ξ_i and ξ_i ≥ 0, ∀x_i ∈ D
To derive the dual form, we introduce Lagrange multipliers (α_i and µ_i) for the
constraints and form the Lagrangian:
L = (1/2)||w||² + C Σ_{i=1}^{n} ξ_i − Σ_{i=1}^{n} α_i [y_i(wᵀx_i + b) − 1 + ξ_i] − Σ_{i=1}^{n} µ_i ξ_i
The dual problem is then obtained by minimising L with respect to w, b, and
� and then maximising with respect to � and �. This leads to the dual form:
maximize_α Σ_{i=1}^{n} α_i − (1/2) Σ_{i=1}^{n} Σ_{j=1}^{n} α_i α_j y_i y_j x_iᵀx_j
subject to: 0 ≤ α_i ≤ C for all i, and Σ_{i=1}^{n} α_i y_i = 0
Kernel Trick The kernel trick exploits the fact that the dual form of the SVM
optimisation problem only involves inner products between data points
(x_iᵀx_j). By replacing these inner products with a kernel function K(x_i, x_j),
we can implicitly compute inner products in a higher-dimensional feature space
defined by a mapping φ(x) without explicitly calculating φ(x).
The kernel function should satisfy Mercer’s condition, which guarantees that it
corresponds to a valid inner product in some feature space:
K(x_i, x_j) = φ(x_i)ᵀφ(x_j)
Common kernel functions include:
• Polynomial kernel: K(x, x′) = (1 + ⟨x, x′⟩)^k
• Gaussian kernel: K(x, x′) = exp(−||x − x′||² / (2σ²))
Algorithms
SGD for Solving Soft-SVM with Kernels Goal: Solve Equation (16.5):
min_w ( (λ/2) ||w||² + (1/m) Σ_{i=1}^{m} max{0, 1 − y_i⟨w, φ(x_i)⟩} )
Parameter: T (number of iterations)
Initialize: β^(1) = 0 (coefficients vector)
For t = 1, …, T:
1. Let α^(t) = (1/(λt)) β^(t) (update coefficients)
2. Choose i uniformly at random from [m] (select a data point)
3. If y_i Σ_{j=1}^{m} α_j^(t) K(x_j, x_i) < 1 (check classification condition):
• β_i^(t+1) = β_i^(t) + y_i
• β_j^(t+1) = β_j^(t) for j ≠ i
4. Else:
• β^(t+1) = β^(t)
Output: α^(T) (final coefficients)
This algorithm utilises the kernel trick, thus avoiding the explicit calculation
of the feature mapping. The complexity of this algorithm is dominated by the
computation of the kernel matrix, which is O(n²), where n is the number of
training points.
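A compact NumPy sketch of the kernelised SGD procedure above (hyperparameters illustrative; predictions for a new point x use sign(Σ_j α_j K(x_j, x))):

import numpy as np

def gaussian_kernel(x, xp, sigma=1.0):
    return np.exp(-np.linalg.norm(x - xp) ** 2 / (2 * sigma ** 2))

def sgd_kernel_svm(X, y, K, lam=0.1, T=1000, seed=0):
    rng = np.random.default_rng(seed)
    m = len(y)
    beta = np.zeros(m)
    for t in range(1, T + 1):
        alpha = beta / (lam * t)                    # alpha^(t) = beta^(t) / (lam t)
        i = rng.integers(m)                         # random data point
        margin = y[i] * sum(alpha[j] * K(X[j], X[i]) for j in range(m))
        if margin < 1:                              # violated margin: update beta_i
            beta[i] += y[i]
    return beta / (lam * T)                         # final coefficients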
Examples
Example: Handwritten Digit Recognition Consider the task of classify-
ing handwritten digits. Each digit image can be represented as a vector of pixel
values, resulting in a high-dimensional input space.
1. Feature Mapping: A suitable kernel function, such as a Gaussian kernel,
can be chosen to implicitly map the digit images into a higher-dimensional
space where they might become linearly separable.
2. Training: Using the SGD algorithm with the chosen kernel, a soft SVM
model is trained on a labelled dataset of handwritten digits.
3. Classification: A new digit image is classified by computing its kernel
values with the support vectors from the training set and applying the
learned SVM decision function.
Applications
The combination of duality, the kernel trick, and soft SVM has led to widespread
applications of SVMs:
• Image Recognition and Object Detection: SVMs are used in various
computer vision tasks, including image classification, object detection, and
image segmentation, where their ability to handle high-dimensional data
and learn complex decision boundaries is beneficial.
• Natural Language Processing: Applications include text classification,
sentiment analysis, and machine translation, where kernels can be de-
signed to capture semantic similarities between text documents or words.
• Bioinformatics: SVMs are used in various bioinformatics applications
like protein structure prediction, gene expression analysis, and drug discov-
ery, where they can handle complex biological data and build predictive
models.
The kernel trick enables SVMs to model non-linear relationships effectively,
making them suitable for various real-world applications.
Classification Assessment Notes
Performance Measures
• Accuracy: The proportion of correctly classified instances out of all in-
stances. It provides a general idea of the classifier’s performance but can
be misleading when dealing with imbalanced datasets.
– Formula: Accuracy = (TP + TN) / (TP + TN + FP + FN)
• Precision: The proportion of correctly classified positive instances out of
all instances predicted as positive. It focuses on the accuracy of positive
predictions.
– Formula: Precision = TP / (TP + FP)
• Recall (Sensitivity or True Positive Rate): The proportion of cor-
rectly classified positive instances out of all actual positive instances. It
focuses on the classifier’s ability to identify all positive instances.
– Formula: Recall = TP / (TP + FN)
• F1-Score: The harmonic mean of precision and recall. It provides a
balanced measure that considers both the accuracy of positive predictions
and the ability to identify all positive instances.
– Formula: F1-Score = 2 * (Precision * Recall) / (Precision
+ Recall)
• ROC-AUC: A graphical and numerical metric used for binary classifi-
cation problems. The ROC curve plots the True Positive Rate (TPR)
against the False Positive Rate (FPR) at various threshold settings. The
AUC (Area Under the Curve) summarises the ROC curve into a single
value, representing the classifier’s ability to distinguish between positive
and negative classes. A higher AUC indicates better performance.
Classifier Evaluation
• Confusion Matrix: A table that summarises the performance of a clas-
sification model by showing the counts of true positives, true negatives,
false positives, and false negatives. It provides a detailed breakdown of
the classifier’s predictions for each class, allowing for a deeper analysis of
its strengths and weaknesses.
• Cross-Validation: A technique used to evaluate the performance of a
model on unseen data by splitting the dataset into multiple folds and
using each fold as a test set while training the model on the remaining
folds. Common types include k-fold cross-validation and leave-one-out
cross-validation. It helps to mitigate the impact of data randomness on
the evaluation results and obtain a more robust estimate of the model’s
generalisation performance.
• Bias-Variance Tradeoff: The tradeoff between a model’s ability to fit
the training data (low bias) and its ability to generalise to new data (low
variance). A model with high bias underfits the data and has poor predic-
tive power, while a model with high variance overfits the data and may
not generalise well. The goal is to find a balance that minimises both bias
and variance.
Worked Example (ad-click prediction): Suppose a classifier predicting whether
users click on an ad produced the following confusion-matrix counts on a test set.
Calculations:
• True Positives (TP) = 80
• False Positives (FP) = 10
• False Negatives (FN) = 20
• True Negatives (TN) = 90
• Precision = TP / (TP + FP) = 80 / (80 + 10) = 0.89
• Recall = TP / (TP + FN) = 80 / (80 + 20) = 0.80
• F1-Score = 2 * (Precision * Recall) / (Precision + Recall) = 2 * (0.89 *
0.80) / (0.89 + 0.80) = 0.84
Interpretation:
• The precision of 0.89 indicates that when the model predicts a click, it
is correct 89% of the time.
• The recall of 0.80 suggests that the model correctly identifies 80% of the
actual clicks.
• The F1-score of 0.84 represents a balanced measure of the model’s ability
to make accurate positive predictions and identify all actual clicks.
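These calculations are mechanical enough to script; a small sketch reproducing the numbers above:

def classification_metrics(tp, fp, fn, tn):
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

print(classification_metrics(tp=80, fp=10, fn=20, tn=90))
# (0.85, 0.888..., 0.8, 0.842...) — matching the worked example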
Here are some notes on classification assessment, expanding on the concepts of
TPR, FPR, TNR, FNR, sensitivity, and specificity:
• Recall (Sensitivity or True Positive Rate (TPR)): The proportion
of correctly classified positive instances out of all actual positive instances.
It focuses on the classifier’s ability to identify all positive instances.
– Formula: Recall = Sensitivity = TPR = TP / (TP + FN)
• Specificity (True Negative Rate (TNR)): The proportion of correctly
classified negative instances out of all actual negative instances. It mea-
sures the classifier’s ability to correctly identify negative instances.
– Formula: Specificity = TNR = TN / (FP + TN)
• False Positive Rate (FPR): The proportion of incorrectly classified neg-
ative instances (predicted as positive) out of all actual negative instances.
– Formula: FPR = FP / (FP + TN) = 1 - Specificity
• False Negative Rate (FNR): The proportion of incorrectly classified
positive instances (predicted as negative) out of all actual positive in-
stances.
– Formula: FNR = FN / (TP + FN) = 1 - Sensitivity
• TPR (also called sensitivity or recall) is the proportion of correctly
classified positive instances out of all actual positive instances.
• FPR (equivalent to 1 - specificity) is the proportion of incorrectly clas-
sified negative instances (predicted as positive) out of all actual negative
instances.
• The Area Under the ROC Curve (AUC) is a single numerical value
that summarizes the overall performance of the classifier. A higher AUC
value indicates better discrimination capability, with 1 representing a per-
fect classifier and 0.5 representing a random classifier.
Purpose and Relevance: ROC analysis and AUC are particularly useful when:
• Dealing with imbalanced datasets, where accuracy alone can be misleading.
• Comparing different classification models to choose the one with the best
discriminatory power.
• Choosing an appropriate threshold for a given application, depending on
the relative costs of false positives and false negatives.
Algorithm (ROC Curve and AUC Computation): Given a testing dataset D
with n1 positive and n2 negative instances, and a classifier score function S.
Steps:
1. Predict Scores: For each test point xi in D, predict the score S(xi) for
the positive class.
2. Sort Scores: Sort the (S(xi), yi) pairs (score and true class) in decreasing
order of score.
3. Initialise Variables: Set FP = 0, TP = 0, AUC = 0, and θ = score of
the first pair.
4. Iterate through Sorted Pairs: For each sorted pair (S(xi), yi):
• If S(xi) < θ:
– Update AUC: AUC = AUC + (TP / n1) · (FP / n2)
– Update θ: θ = S(xi)
• If yi = positive class: Increment TP by 1.
• Else (negative class): Increment FP by 1.
5. Final AUC Update: AUC = AUC + (TP / n1) · (FP / n2)
6. Plot ROC Curve: Plot the points (FPR, TPR) calculated during the
iteration.
7. Return: The plotted ROC curve and the calculated AUC value.
Use Cases: The ROC curve and AUC are widely used in machine learning for
model evaluation and comparison, particularly in scenarios where a threshold-
independent assessment of classifier performance is required.
Efficiency: The algorithm has a time complexity of O(n log n) due to the
sorting step, where n is the number of instances in the testing dataset.
Example:
Given a testing dataset with the following sorted scores and true class labels:
(0.9, c1), (0.8, c2), (0.8, c1), (0.8, c1), (0.1, c2)
where c1 represents the positive class.
Applying Algorithm 22.1 with n1 = 3 positive and n2 = 2 negative instances,
the sorted pairs yield the ROC points (FPR, TPR) = (0, 1/3), (0.5, 1), and (1, 1).
The resulting ROC plot has these (FPR, TPR) points connected by lines,
and the AUC is the area under this curve, which is 0.833 in this example.
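The same AUC can be obtained from the equivalent rank interpretation: the probability that a randomly drawn positive scores above a randomly drawn negative, with ties counted as 1/2. A short sketch checking the 0.833 figure:

def auc_pairwise(scores, labels, positive="c1"):
    pos = [s for s, c in zip(scores, labels) if c == positive]
    neg = [s for s, c in zip(scores, labels) if c != positive]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

scores = [0.9, 0.8, 0.8, 0.8, 0.1]
labels = ["c1", "c2", "c1", "c1", "c2"]
print(auc_pairwise(scores, labels))   # 0.8333...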
• Bootstrap Resampling is a statistical technique used to estimate the
sampling distribution of a statistic (e.g., mean, median, accuracy) by re-
peatedly resampling with replacement from the original dataset.
• It involves creating multiple bootstrap samples, each of which is gen-
erated by randomly drawing data points from the original dataset with
replacement, ensuring that each sample has the same size as the original
dataset.
• Confidence Intervals are ranges around a point estimate (e.g., sample
mean) that are likely to contain the true population parameter with a
certain level of confidence (e.g., 95%). Bootstrap resampling can be used
to construct confidence intervals for various statistics.
Purpose and Relevance:
• Estimate Sampling Distribution: Bootstrap resampling allows us to
approximate the sampling distribution of a statistic without making strong
assumptions about the underlying population distribution.
• Construct Confidence Intervals: Bootstrap confidence intervals pro-
vide a measure of uncertainty around a point estimate, giving a range of
values that are likely to include the true population parameter.
• Evaluate Model Performance: In the context of classification, boot-
strap resampling can be used to estimate the variability of performance
metrics like accuracy or AUC, helping to assess the model’s robustness.
Input:
• D: Dataset with n data points
• B: Number of bootstrap samples
• M: Classifier model
• θ: Performance measure (e.g., accuracy)
Output:
• Confidence interval for θ
Steps:
1. Create Bootstrap Samples: For i = 1 to B:
• Generate a bootstrap sample Di by randomly drawing n data points
from D with replacement.
2. Evaluate Classifier on Bootstrap Samples: For i = 1 to B:
• Train the classifier M on Di.
• Evaluate the performance measure θi on the full dataset D using the
model trained on Di.
3. Calculate Confidence Interval: Construct the confidence interval for
θ using the distribution of θi (e.g., using the percentile method).
Efficiency: The time complexity of the bootstrap resampling algorithm is O(B
* T), where B is the number of bootstrap samples and T is the time complexity
of training and evaluating the classifier model.
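A minimal sketch of the percentile-method procedure (the train_eval callback, which trains on the bootstrap sample and returns θi, is an assumed interface):

import numpy as np

def bootstrap_ci(X, y, train_eval, B=1000, level=0.95, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    thetas = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)                 # draw n points with replacement
        thetas.append(train_eval(X[idx], y[idx], X, y))  # theta_i evaluated on full D
    lo = (1 - level) / 2 * 100
    return np.percentile(thetas, [lo, 100 - lo])         # percentile interval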
Study Material: Neural Networks
1. Feedforward Neural Networks
1.1 Concept Overview
• Artificial Neural Networks (ANNs) are computational models in-
spired by the structure and function of biological neural networks. They
consist of interconnected nodes called neurons, organized in layers, which
process and transmit information.
• Feedforward Neural Networks are a type of ANN where the informa-
tion flows in one direction, from the input layer through hidden layers to
the output layer, without any loops or cycles. Each connection between
neurons has an associated weight, which determines the strength of the
signal transmission.
Purpose and Relevance:
• Universal Function Approximators: Feedforward neural networks,
with sufficient complexity (number of layers and neurons), can approxi-
mate any continuous function to arbitrary accuracy.
• Pattern Recognition and Classification: They excel in tasks like
image recognition, natural language processing, and speech recognition
due to their ability to learn complex patterns and representations from
data.
Connection to Other Fields/Concepts:
• Biology: Inspired by the structure of biological neural networks in the
brain.
• Linear Algebra: Heavily relies on matrix operations for efficient compu-
tation.
• Calculus: Uses derivatives for gradient-based optimization algorithms
like backpropagation.
Efficiency: The time complexity of backpropagation is O(|E|) per training
example, where |E| is the number of edges (connections) in the network.
2.3 Implications
• Theoretical Guarantee: The universal approximation theorem provides
a theoretical foundation for the ability of neural networks to solve a wide
range of problems.
• Practical Considerations: The theorem does not specify the optimal
network architecture or the number of neurons needed. These aspects
are determined empirically through experimentation and model selection
techniques.
• Update Weights: w^(i+1) = w^(i) − η_i(v_i + λw^(i)), where v_i is the
gradient, η_i is the learning rate at iteration i, and λ is the regulariza-
tion parameter.
3. Output: Return the weight vector w that achieved the best performance
on a validation set.
3.3.2 Backpropagation Algorithm
Purpose: To compute the gradients of the loss function with respect to the
weights and biases in a neural network.
Algorithm:
1. Forward Pass: Compute the network’s output for a given input (as in
the feedforward algorithm).
2. Backward Pass:
• Output Layer: Calculate the error signal (δ_T) at the output layer.
• Hidden Layers: Propagate the error signal back through the hidden
layers, calculating the error signal for each layer (δ_t) using the chain
rule and the activation function's derivative.
3. Gradient Calculation: Calculate the gradients of the loss function with
respect to weights and biases using the error signals and the outputs of
the neurons.
Efficiency: Both SGD and backpropagation have efficient implementations
using matrix operations, enabling their use in large-scale neural networks.
Mathematical Foundation
Activation Functions Activation functions introduce non-linearity to the
network, enabling it to learn complex relationships between inputs and outputs.
Common activation functions include:
• Sigmoid: Maps the input to a value between 0 and 1, often used in the
output layer for binary classification.
σ(a) = 1 / (1 + exp(−a))
• Softmax: Generalises the sigmoid to K output classes, typically used in
the output layer for multi-class classification:
π_i(x) = exp(b_i + w_iᵀx) / Σ_{j=1}^{K} exp(b_j + w_jᵀx), for all i = 1, 2, …, K
Error Functions Error functions measure the difference between the net-
work’s predicted output and the true target values. Common error functions
include:
• Squared Error: Measures the average squared difference between the
predicted and target values.
SSE = Σ_{i=1}^{n} ε_i² = Σ_{i=1}^{n} (y_i − ŷ_i)² = Σ_{i=1}^{n} (y_i − b − wᵀx_i)²
Algorithms
Multilayer Perceptron (MLP) with One Hidden Layer Purpose:
Learn complex relationships between inputs and outputs for both regression
and classification tasks.
Algorithm:
1. Initialization: Initialize the weight matrices and bias vectors to small
random values.
2. Feed-forward phase: For each training example, propagate the input
through the network to compute the predicted output.
3. Backpropagation phase: Calculate the net gradients for the output and
hidden layers using the chosen error function and activation functions.
4. Gradient Descent: Update the weight matrices and bias vectors using
the computed gradients and a chosen learning rate.
5. Iteration: Repeat steps 2-4 for a specified number of iterations or until
convergence.
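A self-contained NumPy sketch of these five steps for a one-hidden-layer MLP, trained here on XOR with sigmoid activations and squared error (the architecture and hyperparameters are illustrative):

import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)       # XOR targets

# 1. Initialization: small random weights, zero biases
W1, b1 = rng.normal(scale=0.5, size=(2, 4)), np.zeros(4)
W2, b2 = rng.normal(scale=0.5, size=(4, 1)), np.zeros(1)
eta = 0.5

for epoch in range(5000):
    Z1 = sigmoid(X @ W1 + b1)                 # 2. feed-forward: hidden layer
    Y_hat = sigmoid(Z1 @ W2 + b2)             #    and output layer
    d2 = (Y_hat - y) * Y_hat * (1 - Y_hat)    # 3. backprop: output net gradient
    d1 = (d2 @ W2.T) * Z1 * (1 - Z1)          #    hidden net gradient (chain rule)
    W2 -= eta * Z1.T @ d2; b2 -= eta * d2.sum(axis=0)   # 4. gradient descent
    W1 -= eta * X.T @ d1;  b1 -= eta * d1.sum(axis=0)

print(Y_hat.round(2))                         # approaches [[0], [1], [1], [0]]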
Algorithm:
The algorithm for Deep MLPs is similar to that of MLPs with one hidden layer,
but with additional steps for backpropagating gradients through multiple hidden
layers.
1. Initialization: Initialize weight matrices and bias vectors for all layers.
2. Feed-forward phase: Propagate the input through all layers to compute
the output.
3. Backpropagation phase: Compute net gradients for the output layer
and backpropagate them recursively through all hidden layers.
4. Gradient Descent: Update weights and biases for all layers using the
computed gradients.
5. Iteration: Repeat steps 2-4 until convergence or for a specified number
of iterations.
Neural Networks: Activation Functions
Concept Overview
Activation functions are a crucial component of neural networks. Their pri-
mary purpose is to introduce non-linearity into the network’s computations.
Without activation functions, a neural network would simply be a linear combi-
nation of its inputs, limiting its capacity to model complex relationships in data.
Activation functions allow neural networks to approximate arbitrarily complex
functions, making them suitable for a wide range of tasks.
Activation functions are applied to the net input of a neuron, which is the
weighted sum of its inputs plus a bias term. The output of the activation func-
tion then becomes the neuron’s output, which can be passed to other neurons
in subsequent layers.
Mathematical Foundation
Common Activation Functions and Their Derivatives
• Linear Function: The simplest activation function, outputting its input
directly. While straightforward, it limits the network to learning only
linear relationships.
f(z) = z
∂f(z)/∂z = 1
• Hyperbolic Tangent (tanh): Maps the input to a value between −1 and 1.
f(z) = tanh(z) = (e^z − e^{−z}) / (e^z + e^{−z})
∂f(z)/∂z = 1 − f(z)²
• Rectified Linear Unit (ReLU): A widely used activation function, valued
for its computational efficiency and effectiveness in preventing vanishing
gradients.
f(z) = max(0, z)
∂f(z)/∂z = 1 if z > 0, 0 otherwise.
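The three functions and their derivatives, written out as a short sketch:

import numpy as np

def linear(z):   return z
def d_linear(z): return np.ones_like(z)

def tanh(z):     return np.tanh(z)
def d_tanh(z):   return 1.0 - np.tanh(z) ** 2     # 1 - f(z)^2

def relu(z):     return np.maximum(0.0, z)
def d_relu(z):   return (z > 0).astype(float)     # 1 if z > 0, else 0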
Concepts Involved
• Net Input: The weighted sum of inputs to a neuron, calculated as:
net_k = b_k + Σ_{i=1}^{d} w_ik · x_i = b_k + wᵀx
where:
– 𝑛𝑒𝑡𝑘 is the net input of neuron 𝑘
– 𝑏𝑘 is the bias term for neuron 𝑘
– 𝑤𝑖𝑘 is the weight connecting input neuron 𝑥𝑖 to neuron 𝑘
– 𝑥𝑖 is the output of the 𝑖th input neuron
– 𝑤 and 𝑥 are the weight and input vectors, respectively
• Neuron Output: The result of applying the activation function to the
net input:
𝑧𝑘 = 𝑓(𝑛𝑒𝑡𝑘 )
where:
– 𝑧𝑘 is the output of neuron 𝑘
– 𝑓 is the activation function
• Backpropagation: An algorithm for efficiently calculating the gradients
of the error function with respect to the network’s weights, utilizing the
chain rule of calculus and the derivatives of activation functions.
Understanding activation functions and their role in neural networks is funda-
mental to comprehending how these models learn and make predictions. Se-
lecting the appropriate activation function can significantly impact a network’s
performance and training efficiency.
Classification Techniques
This study material summarises key theoretical concepts related to Decision
Trees, Random Forests, and Ensemble Techniques, tailored for a Stanford Mas-
ters level understanding.
Decision Trees
Concept Overview A decision tree is a supervised learning model that
predicts the target value (class label) of an instance by learning simple decision
rules inferred from the data features. The tree structure comprises internal
nodes representing tests on attributes, branches representing outcomes of the
tests, and leaf nodes representing class labels or predictions.
Key advantages of decision trees:
• Easy to understand and interpret. The decision rules can be readily visu-
alized and explained, even to non-technical audiences.
• Handle both numerical and categorical data.
• Non-parametric method: They don’t make strong assumptions about the
underlying data distribution, making them robust to outliers and diverse
data patterns.
Split selection measures:
• Entropy: Measures the impurity of a dataset D over k classes:
H(D) = −Σ_{i=1}^{k} P(c_i) · log₂(P(c_i))
• Information Gain: The reduction in entropy achieved by a candidate split:
Gain(D, D_Y, D_N) = H(D) − (|D_Y|/|D|) · H(D_Y) − (|D_N|/|D|) · H(D_N)
where D_Y and D_N are the subsets of D resulting from the split. The
algorithm chooses the attribute test that yields the highest information
gain.
• Gini Index: Another impurity measure, calculated as:
Gini(D) = 1 − Σ_{i=1}^{k} (P(c_i))²
A split is chosen to minimize the weighted average Gini index of the re-
sulting subsets.
• CART Measure: Specifically used in the CART (Classification and
Regression Trees) algorithm. For a binary split:
CART(D_Y, D_N) = 2 · (|D_Y|/|D|) · (|D_N|/|D|) · Σ_{i=1}^{k} |P(c_i | D_Y) − P(c_i | D_N)|
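The impurity measures are straightforward to compute from class counts; a small sketch:

import numpy as np

def entropy(labels):
    # H(D) = -sum_i P(c_i) * log2(P(c_i))
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gini(labels):
    # Gini(D) = 1 - sum_i P(c_i)^2
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

D = np.array(["a", "a", "a", "b", "b"])
print(entropy(D), gini(D))   # ≈ 0.971 bits, 0.48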
Algorithms
function DECISIONTREE(D, evaluation_metric, stopping_criteria):
    if stopping_criteria satisfied (e.g., D is pure or too small):
        return leaf_node(majority_class(D))
    else:
        best_split = find_best_split(D, evaluation_metric)
        create_child_nodes(best_split)
        for each child_node:
            child_node = DECISIONTREE(child_node.data, evaluation_metric, stopping_criteria)
        return root_node(best_split, child_nodes)
Computational Complexity:
• Evaluating all split points for a numeric attribute takes O(n log n) time,
where n is the number of instances.
• Evaluating categorical splits depends on the number of possible partitions,
and can be O(n log n) if the size of partitions is bounded.
• Overall complexity can be O(n d log2 n) in the worst case, where d is the
number of attributes.
Applications:
• Credit risk assessment: Predict the likelihood of loan default based on
applicant features.
• Medical diagnosis: Classify patients into disease categories based on
symptoms and test results.
• Customer churn prediction: Identify customers likely to stop using a
service.
Random Forest
Concept Overview A random forest is an ensemble learning method that
constructs a multitude of decision trees during training and outputs the predic-
tion that is the mode of the classes (classification) or mean/average prediction
(regression) of the individual trees. It addresses the overfitting issue that can
occur with single decision trees by introducing randomness in two ways:
1. Bootstrap aggregating (Bagging): Each tree is trained on a different
random subset of the training data, sampled with replacement.
2. Random Subspace: At each node, a random subset of features is con-
sidered for splitting, further increasing diversity among trees.
Algorithms
Training:
1. For t = 1, …, K:
• Draw a bootstrap sample Dt from the original dataset D.
• Grow a decision tree Mt on Dt. At each node, randomly select
a subset of p features and choose the best split among them. Grow
the tree to full depth or until a stopping criterion is met.
2. Output the ensemble of trees {M1, M2, …, MK}.
Prediction:
• For a new instance, predict the class label by taking a majority vote among
the predictions of all K trees.
Pseudocode:
function RANDOMFOREST(D, K, p, stopping_criteria):
trees = []
for t = 1 to K:
Dt = bootstrap_sample(D)
Mt = DECISIONTREE(Dt, p, stopping_criteria)
trees.append(Mt)
return trees
Ensemble Techniques
Concept Overview Ensemble techniques combine multiple individual
models (base learners) to produce a more powerful and robust predictor. The
idea is that the weaknesses of individual models can be offset by the strengths
of others, leading to improved accuracy, generalization, and stability.
Key Benefits of Ensembles:
• Improved Accuracy: Ensembles often outperform single models, espe-
cially when base learners are diverse and make different types of errors.
• Robustness: Less sensitive to noise and outliers in the data, as errors
from individual models are averaged out.
• Generalization: Can better capture complex relationships in the data,
reducing overfitting.
Algorithms (Boosting)
AdaBoost
1. Initialize instance weights: Set wi = 1/m for all instances i = 1, 2, …,
m.
2. For t = 1 to T (number of iterations):
• Train a weak learner Mt on the data, weighted by w.
• Calculate the weighted error rate of Mt:
ε_t = Σ_{i=1}^{m} w_i · I(M_t(x_i) ≠ y_i)
• Compute the model weight: α_t = (1/2) · ln((1 − ε_t) / ε_t)
• Update instance weights (the weights are then renormalised to sum to 1):
w_i = w_i · exp(−α_t · y_i · M_t(x_i))
Pseudocode:
function ADABOOST(D, T, weak_learner):
m = len(D)
weights = [1/m] * m
ensemble = []
for t = 1 to T:
Mt = weak_learner(D, weights)
error_t = calculate_weighted_error(D, Mt, weights)
alpha_t = 0.5 * ln((1 - error_t) / error_t)
update_weights(D, Mt, weights, alpha_t)
ensemble.append((Mt, alpha_t))
return ensemble
Computational Complexity:
• Depends on the complexity of the weak learner and the number of boosting
iterations.
Applications:
• Face detection: AdaBoost was widely used in early face detection sys-
tems.
• Spam filtering: Classify emails as spam or not spam.
• Text categorization: Assign documents to predefined categories.
Choosing the Right Technique: The choice between decision trees, random
forests, and specific ensemble techniques depends on factors like dataset size,
dimensionality, noise levels, and desired interpretability. For example:
• For highly interpretable models, single decision trees can be suitable.
• For improved accuracy and generalization, random forests are often a good
choice.
• For complex datasets with potential for high bias, boosting methods might
be preferred.
Remember that understanding the underlying concepts and the strengths and
weaknesses of each method is essential for making informed choices and achiev-
ing optimal classification performance.
Mathematical Foundation
Scatter Matrices
• Within-class scatter matrix (S): Represents the scatter of data points
within each class.
• Between-class scatter matrix (B): Represents the scatter between the
means of different classes.
LDA seeks the projection direction w that maximises the ratio of between-class
to within-class scatter, (w^T B w) / (w^T S w). The derivation proceeds in the
following steps:
1. Projection: Each data point x is projected onto the direction w as the
scalar w^T x.
2. Means: Let µ1 and µ2 represent the means of the two classes in the
original feature space. Their projections onto w are w^T µ1 and w^T µ2.
3. Scatter: The scatter of the projected points for each class ci is calculated
as:
si² = Σ_{x∈ci} (w^T x − w^T µi)²
4. Objective: The goal is to find the vector w that maximizes the difference
between projected means while minimizing the scatter within each class:
max_w (w^T µ1 − w^T µ2)² / (s1² + s2²)
5. Rewriting with Scatter Matrices: The above expression can be rewrit-
ten using the between-class and within-class scatter matrices, leading to
the ratio:
(w^T B w) / (w^T S w)
6. Eigenvalue Problem: The optimal w that maximizes this ratio is the
eigenvector corresponding to the largest eigenvalue of the matrix S⁻¹B.
Algorithms
Linear Discriminant Analysis (LDA) Algorithm Input: Dataset D with
n points xi ∈ R^d and corresponding class labels yi ∈ {c1, c2}.
Output: Optimal linear discriminant vector w.
1. Class-Specific Subsets: Partition D into two subsets, D1 and D2, based
on class labels.
2. Class Means: Calculate the mean vectors µ1 and µ2 for D1 and D2.
3. Between-Class Scatter Matrix: Compute the between-class scatter
matrix B = (µ1 − µ2)(µ1 − µ2)^T.
4. Centred Data Matrices: For each class, create a centred data matrix
Di by subtracting the corresponding mean vector µi from each data point.
5. Class Scatter Matrices: Calculate the within-class scatter matrices S1
and S2 as Si = Di^T Di.
6. Within-Class Scatter Matrix: Compute the total within-class scatter
matrix S = S1 + S2.
7. Eigenvalue Decomposition: Find the dominant eigenvector w of S⁻¹B.
This w represents the optimal linear discriminant direction.
Time Complexity: O(d³ + nd²). The most computationally expensive steps
are calculating the within-class scatter matrix S (O(nd²)) and solving the
eigenvalue problem (O(d³) in the worst case).
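A compact NumPy sketch of these steps follows (the function name is
illustrative). Because B = (µ1 − µ2)(µ1 − µ2)^T has rank one, the dominant
eigenvector of S⁻¹B is proportional to S⁻¹(µ1 − µ2), so a single linear solve
can replace the explicit eigendecomposition.

import numpy as np

def lda_direction(X, y):
    # Two-class LDA: returns the optimal discriminant direction w.
    c1, c2 = np.unique(y)
    X1, X2 = X[y == c1], X[y == c2]
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    D1, D2 = X1 - mu1, X2 - mu2          # centred data matrices
    S = D1.T @ D1 + D2.T @ D2            # total within-class scatter
    w = np.linalg.solve(S, mu1 - mu2)    # w proportional to S^{-1}(mu1 - mu2)
    return w / np.linalg.norm(w)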
Probabilistic Classification
Mathematical Foundation
Bayes’ Theorem At the heart of probabilistic classification lies Bayes’ the-
orem, a fundamental concept in probability theory that allows us to update our
beliefs about an event based on new evidence. In the context of classification,
Bayes’ theorem calculates the posterior probability P(ci|x) of an instance x
belonging to class ci given the following:
• Likelihood P(x|ci): The probability of observing data point x given that
it belongs to class ci.
• Prior Probability P(ci): The probability of class ci occurring in the
dataset.
• Evidence P(x): The probability of observing data point x regardless of
the class.
Bayes’ theorem is formulated as follows:
P(ci|x) = P(x|ci) · P(ci) / P(x)
Derivation:
Bayes’ theorem stems from the definition of conditional probability:
P(A|B) = P(A ∩ B) / P(B)
where:
• P(A|B) is the probability of event A occurring given that event B has
occurred.
• P(A ∩ B) is the probability of both events A and B occurring.
• P(B) is the probability of event B occurring.
Applying this to our case, we get:
P(ci|x) = P(ci ∩ x) / P(x)
and, by the same definition,
P(x|ci) = P(x ∩ ci) / P(ci)
Rearranging the second equation gives P(x ∩ ci) = P(x|ci) · P(ci). Since
P(ci ∩ x) = P(x ∩ ci), substituting this into the first equation yields
Bayes' theorem.
Algorithms
Bayes Classifier The Bayes classifier directly applies Bayes’ theorem to pre-
dict the class label. Here’s a step-by-step explanation:
1. Estimate prior probabilities P(ci) for each class from the training
data. This can be done by calculating the relative frequency of each class.
2. Estimate the likelihood function P(x|ci) for each class. This in-
volves choosing a suitable probability distribution that represents the data
and estimating its parameters from the training data. For example, the
likelihood can be modelled using a multivariate normal distribution
where the mean (µi) and covariance matrix (Σi) are estimated for
each class ci.
3. For a new data point x, calculate the posterior probability
P(ci|x) for each class using Bayes’ theorem.
4. Predict the class label ŷ as the class with the maximum posterior
probability using MAP estimation.
Pseudocode:
Algorithm: Bayes Classifier
// Training phase
for each class c_i:
    Estimate prior probability P(c_i)
    Estimate likelihood function P(x|c_i)
// Testing phase
for each class c_i:
    Calculate posterior probability P(c_i|x) using Bayes' theorem
ŷ = argmax_ci P(c_i|x)
return ŷ
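A minimal sketch of these four steps, assuming a multivariate normal
likelihood per class as in step 2 (the function names are illustrative):

import numpy as np
from scipy.stats import multivariate_normal

def fit_bayes(X, y):
    # Per class: prior, mean vector, and covariance matrix.
    model = {}
    for c in np.unique(y):
        Xc = X[y == c]
        model[c] = (len(Xc) / len(X),            # prior P(c_i)
                    Xc.mean(axis=0),             # mu_i
                    np.cov(Xc, rowvar=False))    # Sigma_i
    return model

def predict_bayes(model, x):
    # MAP estimate: P(x) is constant across classes, so it can be dropped.
    return max(model, key=lambda c: model[c][0] *
               multivariate_normal.pdf(x, mean=model[c][1], cov=model[c][2]))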
Naive Bayes Classifier The Naive Bayes classifier simplifies the Bayes clas-
sifier by assuming conditional independence among attributes, given the
class label. While this assumption might be naive in many real-world scenarios,
it significantly reduces computational complexity and often leads to surprisingly
good performance, particularly for high-dimensional data.
Here’s how Naive Bayes works:
1. Estimate prior probabilities P(ci) for each class from the training
data.
2. For each attribute Xj, estimate the class-conditional probabili-
ties P(Xj|ci) for each class ci.
3. For a new data point x, calculate the posterior probability for
each class using the naive Bayes formula:
P(ci|x) ∝ P(ci) · ∏_{j=1}^{d} P(xj|ci)
4. Predict the class label ŷ as the class with the maximum posterior
probability.
Pseudocode:
Algorithm: Naive Bayes Classifier
// Training phase
for each class c_i:
    Estimate prior probability P(c_i)
    for each attribute X_j:
        Estimate class-conditional probability P(X_j|c_i)
// Testing phase
for each class c_i:
    Calculate posterior probability P(c_i|x) using the naive Bayes formula
ŷ = argmax_ci P(c_i|x)
return ŷ
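Under the common Gaussian assumption, each attribute gets its own
one-dimensional normal per class, and the posterior factorises exactly as in
the formula above. A minimal sketch (names are illustrative; a real
implementation would work in log space to avoid numerical underflow):

import numpy as np
from scipy.stats import norm

def fit_gnb(X, y):
    # Per class: prior, per-attribute means, per-attribute standard deviations.
    return {c: ((y == c).mean(),
                X[y == c].mean(axis=0),
                X[y == c].std(axis=0))
            for c in np.unique(y)}

def predict_gnb(model, x):
    # Posterior proportional to prior times the product of 1-D likelihoods.
    return max(model, key=lambda c: model[c][0] *
               np.prod(norm.pdf(x, loc=model[c][1], scale=model[c][2])))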
K Nearest Neighbours (KNN) Classifier
Mathematical Formulation:
The posterior probability of a new point x belonging to class ci is estimated as
follows:
P(ci|x) ≈ Ki / K
where:
• Ki represents the number of points among the K nearest neighbours of x
that belong to class ci.
• K is the total number of nearest neighbours.
This formula essentially reflects the proportion of neighbours belonging to a
specific class.
Use Cases and Advantages:
KNN is particularly useful when the decision boundary is complex or non-
linear. It’s a lazy learning algorithm as it doesn’t build an explicit model
during training but rather relies on the training data directly during prediction.
However, KNN can be computationally expensive for large datasets and is
sensitive to the choice of distance metric and the value of K; a minimal
implementation sketch follows.
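This sketch applies the majority-vote rule with Euclidean distance (the
function name is illustrative; for large datasets one would use spatial
indexes such as k-d trees rather than this brute-force scan):

import numpy as np

def knn_predict(X_train, y_train, x, K):
    # Brute force: distance from x to every training point, O(n d) per query.
    dists = np.linalg.norm(X_train - x, axis=1)
    # Labels of the K nearest neighbours.
    nearest = y_train[np.argsort(dists)[:K]]
    # Majority vote: the class with the largest K_i among the neighbours.
    labels, counts = np.unique(nearest, return_counts=True)
    return labels[np.argmax(counts)]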
This study material provides a foundational understanding of probabilistic clas-
sification techniques. It equips Stanford master's students with the necessary
knowledge to explore more advanced machine learning concepts.