Final ML

The document provides an introduction to Machine Learning (ML), explaining its motivation, types of learning (supervised, unsupervised, and reinforcement), and specific algorithms like linear and logistic regression. It details the mathematical foundations, applications, and optimization techniques such as gradient descent and its variant, stochastic gradient descent (SGD). The document emphasizes the importance of these methods in automating tasks, adapting to new data, and personalizing user experiences.

Introduction to Machine Learning

Concept Overview
Machine Learning (ML) is a subset of artificial intelligence (AI) that allows
computer systems to learn from data without explicit programming. This learn-
ing involves identifying patterns, building models, and making predictions or
decisions based on the data.

Motivation for Machine Learning
Machine learning is motivated by the desire to:
• Automate complex tasks: Image recognition, spam filtering
• Adapt to new data: ML models can continuously learn and improve.
• Personalise experiences: Recommendations based on individual pref-
erences

Different Types of Learning


• Supervised Learning: Training a model on labelled data, where each
data point has a corresponding output or target value. The model learns to
map inputs to outputs based on labelled examples (e.g., linear regression,
logistic regression).
• Unsupervised Learning: The model learns from unlabelled data, dis-
covering patterns, structures, or relationships (e.g., clustering, dimension-
ality reduction).
• Reinforcement Learning: An agent interacts with an environment and
learns through trial and error, aiming to maximise its cumulative reward
over time.

Linear Regression
Concept Overview Linear regression is a fundamental supervised learning
algorithm that models the relationship between a dependent variable and one
or more independent variables. It assumes a linear relationship, meaning
that changes in the independent variables are proportionally reflected in the
dependent variable.
Purpose and Relevance
• Predicting continuous values: House prices, stock market trends, sales
figures
• Understanding relationships: Quantifying the impact of independent
variables on the dependent variable.

Mathematical Foundation
Linear regression aims to find the best-fit line that represents the relationship between variables.

• Bivariate linear regression (one independent variable, X, and one de-
pendent variable, Y ): ŷ = b + wx where:
– ŷ is the predicted value of Y
– b is the intercept (value of Y when X is zero)
– w is the regression coefficient (slope of the line)
– x is the value of the independent variable
• Multivariate linear regression (multiple independent variables): ŷ =
b + w1x1 + w2x2 + … + wdxd where:
– d is the number of independent variables
The method of least squares estimates the optimal values of b and w by
minimising the sum of squared errors (SSE), which is the sum of the squared
differences between the actual values (yi) and the predicted values (ŷi):
SSE = Σᵢ (yᵢ − ŷᵢ)²

Algorithms
1. Calculate the means: Calculate the means of X (µX) and Y (µY).
2. Calculate the regression coefficient (w):
• For bivariate linear regression: w = Σᵢ((xᵢ − µX)(yᵢ − µY)) / Σᵢ(xᵢ − µX)²
• For multivariate linear regression: Refer to Algorithm 23.1 in source,
which utilises QR-decomposition for efficient computation.
3. Calculate the intercept (b): b = µY - wµX
4. Predict the dependent variable: Use the calculated b and w to predict
Y for any given X.
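As a quick illustration, the four steps above can be written directly in Python with NumPy (the data here is made up for demonstration):

import numpy as np

# Illustrative data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 6.2, 8.1, 9.9])

mu_x, mu_y = x.mean(), y.mean()                                # step 1: means
w = np.sum((x - mu_x) * (y - mu_y)) / np.sum((x - mu_x) ** 2)  # step 2: slope
b = mu_y - w * mu_x                                            # step 3: intercept
y_hat = b + w * x                                              # step 4: predictions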

Examples
Predicting House Prices
• Independent variables: Size, location, number of bedrooms
• Dependent variable: Price
• A linear regression model can be trained on historical data to predict the
price of a new house based on its features.
Forecasting Sales
• Independent variables: Advertising spend, promotions, seasonality
• Dependent variable: Sales
• Linear regression can be used to forecast future sales based on historical
data and planned marketing activities.

Extensions of Linear Regression


Multilinear Regression In multilinear regression, the dependent variable
is modelled as a linear combination of multiple independent variables. This is
the most common form of linear regression, and its mathematical foundation
is described in the “Mathematical Foundation” section of “Linear Regression”
above.

Polynomial Regression Polynomial regression extends linear regression
by allowing for non-linear relationships between the independent and dependent
variables. Instead of fitting a straight line, polynomial regression fits a curve to
the data. It involves transforming the independent variables by raising them to
different powers (e.g., X², X³). The model becomes:
ŷ = b + w₁x + w₂x² + … + w_d x^d
where:
• d is the degree of the polynomial
By choosing the appropriate degree, polynomial regression can model complex
relationships between variables. However, higher-degree polynomials can lead to
overfitting, where the model captures noise in the data instead of the underlying
trend.
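As a minimal sketch of this idea, the independent variable can be expanded into the powers 1, x, …, x^d and fitted by ordinary least squares; the data and degree below are illustrative choices:

import numpy as np

x = np.linspace(0, 1, 20)                               # illustrative inputs
y = np.sin(2 * np.pi * x) + 0.1 * np.random.randn(20)   # noisy non-linear targets

d = 3                                            # degree of the polynomial
X = np.vander(x, d + 1, increasing=True)         # columns: 1, x, x^2, ..., x^d
coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)   # least-squares fit: [b, w1, ..., wd]
y_hat = X @ coeffs                               # fitted curve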

Applications
• Economics: Modelling the relationship between economic factors and
growth.
• Engineering: Fitting curves to experimental data.
• Environmental science: Analysing the impact of pollution on environ-
mental indicators.

Logistic Regression
Concept Overview Logistic regression is a supervised learning algorithm
for binary classification, predicting one of two possible outcomes. It predicts
the probability of an instance belonging to a particular class and then classifies
it based on that probability.
Purpose and Relevance
• Predict probabilities: The likelihood of a customer purchasing a prod-
uct, a patient developing a disease, or an email being spam.
• Classify instances: Based on the predicted probability, assign an in-
stance to a class.

Mathematical Foundation
Logistic regression uses the sigmoid function to transform a linear combination of independent variables into a probability value between 0 and 1:
σ(z) = 1 / (1 + exp(−z))
where:
• z is a linear combination of independent variables: z = b + w1x1 +
w2x2 + … + wdxd

The logistic regression model predicts the probability of an instance belonging
to the positive class (class 1) as:
P(Y = 1 | X = x) = σ(z)
The model parameters (b and w) are learned using maximum likelihood
estimation (MLE), which aims to find the parameters that maximise the
likelihood of observing the training data.

Algorithms
Iterative optimization algorithms, such as gradient descent or stochastic gradient descent (SGD), are used to learn the parameters of a logistic regression model. The "Gradient Descent" section below describes gradient descent in detail. Algorithm 24.2 in the source outlines the SGD algorithm for multiclass logistic regression.
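A minimal sketch of batch gradient descent on the average negative log-likelihood, assuming labels in {0, 1}; the learning rate and iteration count are illustrative choices:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, eta=0.1, n_iters=1000):
    # X: (m, d) feature matrix, y: labels in {0, 1}
    m, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(n_iters):
        p = sigmoid(X @ w + b)           # predicted P(Y = 1 | x)
        w -= eta * (X.T @ (p - y)) / m   # gradient of the average NLL w.r.t. w
        b -= eta * np.mean(p - y)        # gradient w.r.t. b
    return w, b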

Examples
Email Spam Classification:
• Features: Email content, sender information, subject line
• Classes: Spam or not spam
• A logistic regression model can be trained to predict the probability of an
email being spam based on its features.
Credit Risk Assessment:
• Features: Income, credit history, debt-to-income ratio
• Classes: High risk or low risk
• Logistic regression can be used to predict the probability of a borrower
defaulting on a loan.

Gradient Descent
Concept Overview
Gradient descent is a fundamental iterative optimisation algorithm for finding
the minimum of a differentiable function. It starts with an initial guess for the
solution and repeatedly updates it by moving in the direction of the steepest
descent, which is the negative of the function’s gradient at that point.
Purpose and Relevance:
• Finding optimal solutions: In machine learning, gradient descent is
widely used to find the best-fit parameters of models, minimising a chosen
loss function.
• Wide applicability: Gradient descent applies to various fields beyond
machine learning, including engineering, physics, and economics.
Interrelation with Other Concepts:
• Basis for many algorithms: Many machine learning algorithms, such
as linear regression, logistic regression, and neural networks, use gradient descent as their core optimisation technique.
• Building block for more advanced methods: Stochastic Gradient
Descent (SGD) and its variants are extensions of gradient descent.

Mathematical Foundation
The gradient of a differentiable function 𝑓(𝑤) ∶ ℝ𝑑 → ℝ at a point 𝑤, denoted
as ∇𝑓(𝑤), is the vector of its partial derivatives:
∇𝑓(𝑤) = (∂𝑓(𝑤)/∂𝑤[1], …, ∂𝑓(𝑤)/∂𝑤[𝑑]).

Gradient descent iteratively updates the solution as follows:


𝑤(𝑡+1) = 𝑤(𝑡) − 𝜂∇𝑓(𝑤(𝑡) )
where:
• 𝑤(𝑡) is the solution at iteration 𝑡
• 𝜂 is the learning rate, a scalar that controls the step size
Problems Solved by the Formula:
• Finding the direction of steepest descent: The negative gradient
−∇𝑓(𝑤(𝑡)) points towards the direction of the greatest rate of decrease of
the function at the current point.
• Iteratively approaching the minimum: By repeatedly taking steps in
the direction of the negative gradient, the algorithm gradually converges
towards a local minimum of the function.
Derivation or Justification:
The update rule aims to find a new point 𝑤(𝑡+1) that reduces the function value
compared to the current point 𝑤(𝑡) . A first-order Taylor approximation of the
function around 𝑤(𝑡) is given by:
𝑓(𝑤(𝑡+1) ) ≈ 𝑓(𝑤(𝑡) ) + ⟨𝑤(𝑡+1) − 𝑤(𝑡) , ∇𝑓(𝑤(𝑡) )⟩
To ensure a decrease in the function value, we want:
𝑓(𝑤(𝑡+1) ) < 𝑓(𝑤(𝑡) )
Substituting the Taylor approximation and the update rule, we get:
𝑓(𝑤(𝑡) ) − 𝜂⟨∇𝑓(𝑤(𝑡) ), ∇𝑓(𝑤(𝑡) )⟩ < 𝑓(𝑤(𝑡) ).
This inequality holds when 𝜂 > 0, meaning that moving in the direction of the
negative gradient with a positive learning rate will likely lead to a decrease in
the function value.

Algorithms
Gradient Descent
Purpose: Minimise a differentiable function 𝑓(𝑤).

Steps:
1. Initialise: Set an initial solution 𝑤(1) (often set to 0). Choose a learning
rate 𝜂.
2. Iterate: Repeat the following steps until convergence:
• Calculate the gradient: ∇𝑓(𝑤(𝑡) )
• Update the solution: 𝑤(𝑡+1) = 𝑤(𝑡) − 𝜂∇𝑓(𝑤(𝑡) )
3. Output: Return the final solution 𝑤(𝑡) .
Convergence Criteria:
• Maximum number of iterations: Stop after a predefined number of
iterations.
• Change in solution: Stop when the change in the solution between
iterations is smaller than a threshold.
• Gradient magnitude: Stop when the magnitude of the gradient is
smaller than a threshold.

Examples
Example: Finding the Minimum of a Quadratic Function
Consider the function: 𝑓(𝑤) = 𝑤2 + 2𝑤 + 1
1. Calculate the gradient: ∇𝑓(𝑤) = 2𝑤 + 2
2. Initialise: Set 𝑤(1) = 0 and 𝜂 = 0.1.
3. Iterate:
• Iteration 1: ∇𝑓(0) = 2, 𝑤(2) = 0 − 0.1 ∗ 2 = −0.2
• Iteration 2: ∇𝑓(−0.2) = 1.6, 𝑤(3) = −0.2 − 0.1 ∗ 1.6 = −0.36
• … Continue iterating until convergence
The algorithm will converge towards the minimum of the function, which is
𝑤 = −1.
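A short sketch of this worked example in Python, stopping when the change in the solution falls below a threshold:

w, eta = 0.0, 0.1             # initial solution and learning rate
for t in range(1000):
    grad = 2 * w + 2          # gradient of f(w) = w^2 + 2w + 1
    w_new = w - eta * grad    # update rule
    if abs(w_new - w) < 1e-8: # convergence: change in solution below threshold
        break
    w = w_new
print(w)                      # approaches the minimiser w = -1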

Applications
• Linear Regression: Finding the coefficients that minimise the sum of
squared errors between the predicted and actual values.
• Logistic Regression: Finding the parameters that maximise the likeli-
hood of observing the training data.
• Neural Networks: Training deep learning models by adjusting the net-
work weights to minimise the loss function.

Stochastic Gradient Descent (SGD)


Concept Overview
Stochastic Gradient Descent (SGD) is a variation of gradient descent that
uses randomly selected data points to update the solution at each iteration
instead of calculating the gradient based on the entire dataset.

Purpose and Relevance:
• Efficiency: Significantly faster than standard gradient descent, especially
for large datasets.
• Handling noisy data: SGD is more robust to noise and outliers in the
data.
• Online learning: Suitable for online learning scenarios where data ar-
rives sequentially.

Mathematical Foundation
SGD uses a stochastic approximation of the true gradient. At each iteration,
it randomly selects a data point (𝑥𝑖 , 𝑦𝑖 ) and calculates the gradient of the loss
function for that point. This stochastic gradient is then used to update the
solution:
𝑤(𝑡+1) = 𝑤(𝑡) − 𝜂𝑣𝑡
where:
• 𝑣𝑡 is a random vector such that 𝐸[𝑣𝑡 |𝑤(𝑡) ] ∈ 𝜕𝑓(𝑤(𝑡) ), i.e., the expected
value of 𝑣𝑡 given the current solution is a subgradient of the function at
that point.

Algorithms
Stochastic Gradient Descent (SGD)
Purpose: Minimise a function 𝑓(𝑤), often a loss function in machine learning.
Steps:
1. Initialise: Set an initial solution 𝑤(1) . Choose a learning rate 𝜂 and the
number of iterations 𝑇 .
2. Iterate: For 𝑡 = 1, 2, … , 𝑇 :
• Randomly select a data point (𝑥𝑖 , 𝑦𝑖 )
• Calculate the gradient for the selected point: 𝑣𝑡 ∈
𝜕𝑙(𝑤(𝑡) , (𝑥𝑖 , 𝑦𝑖 )) where 𝑙 is the loss function.
• Update the solution: 𝑤(𝑡+1) = 𝑤(𝑡) − 𝜂𝑣𝑡
3. Output: Return the averaged solution 𝑤̄ = (1/𝑇) ∑_{𝑡=1}^{𝑇} 𝑤(𝑡).

Examples
Example: SGD for Linear Regression
Consider a set of data points (𝑥𝑖 , 𝑦𝑖 ), and the squared loss function:
𝑙(𝑤, (𝑥𝑖, 𝑦𝑖)) = ½(𝑤ᵀ𝑥𝑖 − 𝑦𝑖)².
1. Randomly select a data point (𝑥𝑖 , 𝑦𝑖 )
2. Calculate the gradient for the selected point: 𝑣𝑡 = (𝑤ᵀ𝑥𝑖 − 𝑦𝑖)𝑥𝑖.
3. Update the solution: 𝑤(𝑡+1) = 𝑤(𝑡) − 𝜂(𝑤ᵀ𝑥𝑖 − 𝑦𝑖)𝑥𝑖.
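A minimal sketch of these steps in Python, returning the averaged solution; the learning rate and iteration count are illustrative:

import numpy as np

def sgd_linear_regression(X, y, eta=0.01, T=1000):
    # X: (m, d) data matrix, y: (m,) targets
    m, d = X.shape
    w = np.zeros(d)
    w_sum = np.zeros(d)
    for t in range(T):
        i = np.random.randint(m)        # randomly select a data point
        v = (w @ X[i] - y[i]) * X[i]    # stochastic gradient of the squared loss
        w = w - eta * v                 # update the solution
        w_sum += w
    return w_sum / T                    # averaged solution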

Applications
• Large-scale machine learning: Training models on massive datasets
where standard gradient descent is computationally expensive.
• Online advertising: Continuously updating models for ad targeting as
new user data arrives.
• Natural language processing: Training language models for tasks like
machine translation and text generation.

Subgradients
Concept Overview
Subgradients generalise the concept of gradients to non-differentiable functions. A subgradient at a point is any vector defining a linear approximation that touches the function at that point and lies below its graph everywhere.
Purpose and Relevance:
• Handling non-differentiable functions: Many loss functions used in
machine learning, like the hinge loss in Support Vector Machines, are non-
differentiable.
• Extending optimisation algorithms: Subgradients allow the use of
gradient-based optimisation algorithms for non-differentiable functions.

Mathematical Foundation
For a convex function 𝑓(𝑤), a vector 𝑣 is a subgradient at 𝑤 if for all 𝑢 in the
domain of 𝑓:
𝑓(𝑢) ≥ 𝑓(𝑤) + ⟨𝑢 − 𝑤, 𝑣⟩.
The subdifferential 𝜕𝑓(𝑤) is the set of all subgradients at 𝑤. If 𝑓 is differen-
tiable at 𝑤, then 𝜕𝑓(𝑤) = {∇𝑓(𝑤)}.

Examples
Example: Subgradient of the Absolute Value Function
Consider the absolute value function 𝑓(𝑤) = |𝑤|.
• For 𝑤 > 0, the subdifferential is 𝜕𝑓(𝑤) = {1}.
• For 𝑤 < 0, the subdifferential is 𝜕𝑓(𝑤) = {−1}.
• For 𝑤 = 0, the subdifferential is 𝜕𝑓(𝑤) = [−1, 1] because any line with a
slope between -1 and 1 lies below the graph of the function at 𝑤 = 0.
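A tiny sketch returning one valid subgradient of f(w) = |w| at each point (at w = 0 it returns 0, though any value in [-1, 1] would be valid):

def abs_subgradient(w):
    if w > 0:
        return 1.0
    if w < 0:
        return -1.0
    return 0.0  # at w = 0, any value in [-1, 1] is a valid subgradient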

Stochastic Gradient Descent for Risk Minimisation
Concept Overview
Stochastic Gradient Descent (SGD) for risk minimisation directly ap-
plies the SGD algorithm to minimise the risk function, which measures the
expected loss of a model over the data distribution.
Purpose and Relevance:
• Direct risk minimisation: Instead of relying on the empirical risk (loss
on the training data), SGD can directly optimise the risk function.
• Suitable for online learning: Efficiently update the model as new data
becomes available.

Mathematical Foundation
In machine learning, the risk function 𝐿𝐷 (𝑤) is the expected loss of a model
with parameters 𝑤 over the data distribution 𝐷:
𝐿𝐷 (𝑤) = 𝐸𝑧∼𝐷 [𝑙(𝑤, 𝑧)]
where:
• 𝑙(𝑤, 𝑧) is the loss function.
SGD for risk minimisation uses the loss gradient on a single randomly sam-
pled data point as an unbiased estimate of the risk gradient.

Algorithms
SGD for Risk Minimisation
Purpose: Minimise the risk function 𝐿𝐷 (𝑤).
Steps:
1. Initialise: Set an initial solution 𝑤(1) . Choose a learning rate 𝜂 and the
number of iterations 𝑇 .
2. Iterate: For 𝑡 = 1, 2, … , 𝑇 :
• Sample a data point: 𝑧 ∼ 𝐷.
• Calculate a subgradient for the sampled point: 𝑣𝑡 ∈ 𝜕𝑙(𝑤(𝑡) , 𝑧).
• Update the solution: 𝑤(𝑡+1) = 𝑤(𝑡) − 𝜂𝑣𝑡 .
3. Output: Return the averaged solution 𝑤̄ = (1/𝑇) ∑_{𝑡=1}^{𝑇} 𝑤(𝑡).

Applications
• Online Learning: Update models in real-time as new data points become
available.
• Large-scale machine learning: Efficiently train models when the
dataset is too large to fit in memory.

• Recommendation systems: Update user preferences based on their
interactions.

Bias & Variance: Understanding the Tradeoff and Relation with Model Fitting
Concept Overview
• Bias: Bias refers to the systematic error introduced by a model due to
its simplifying assumptions about the relationship between the input
features and the target variable. A high-bias model tends to underfit
the training data, meaning it fails to capture the underlying patterns and
produces predictions far from the true values.
• Variance: Variance, on the other hand, measures the model’s sensitiv-
ity to fluctuations in the training data. A high-variance model tends
to overfit the training data, capturing noise and random variations as if
they were true patterns. Consequently, it performs well on the training
data but poorly on unseen data.
Purpose and Relevance:
• Understanding model performance: Bias and variance provide a
framework for analysing why a model might be performing poorly.
• Guiding model selection: By understanding the bias-variance trade-
off, we can choose models and adjust their complexity to achieve optimal
performance on unseen data.
Interrelation with Model Fitting:
• Underfitting (High Bias): Occurs when the model is too simple to
capture the underlying data patterns. This leads to high bias and poor
performance on both training and unseen data.
• Overfitting (High Variance): Happens when the model is too complex,
fitting the training data perfectly, including noise. This results in high
variance and poor generalisation to unseen data.

Mathematical Foundation
The expected squared error for a given test point 𝑥 and model 𝑀 (𝑥, 𝐷)
trained on dataset 𝐷 can be decomposed into three terms:
E_{x,D,y}[(y − M(x,D))²] = E_{x,y}[(y − E_y[y|x])²] + E_{x,D}[(M(x,D) − E_D[M(x,D)])²] + E_x[(E_D[M(x,D)] − E_y[y|x])²]
where:
• Noise: The first term, E_{x,y}[(y − E_y[y|x])²], represents the inherent noise in the data. It's independent of the model and serves as a baseline error.
• Average Variance: The second term, E_{x,D}[(M(x,D) − E_D[M(x,D)])²], measures the variance of the model's predictions over different training sets.
• Average Bias: The third term, E_x[(E_D[M(x,D)] − E_y[y|x])²], quantifies the (squared) bias, reflecting the difference between the average prediction of the model and the true expected value.
The sources provide information on the bias and variance of a classifier in the
context of the bias-variance decomposition of the expected squared loss.
Bias reflects the systematic deviation of a classifier’s predicted decision bound-
ary from the true decision boundary. It is calculated as:
E_x[(E_D[M(x,D)] − E_y[y|x])²]
where:
• M(x, D) represents the predicted value of the model M for a test point
x, trained on dataset D.
• ED[M(x, D)] is the expected prediction of the model over all possible
training sets D.
• Ey[y|x] is the expected value of the true response y given the test point
x.
This term measures the squared difference between the average prediction of
the model and the true expected value of the response.
Variance refers to the variability of the learned decision boundaries over differ-
ent training sets. It is calculated as:
E_{x,D}[(M(x,D) − E_D[M(x,D)])²]
where:
• M(x, D) represents the predicted value of the model M for a test point
x, trained on dataset D.
• ED[M(x, D)] is the expected prediction of the model over all possible
training sets D.
This term measures the average squared difference between the model’s pre-
diction for a specific training set and its average prediction over all training
sets.
Problems Solved by the Formula:
• Decomposition of error: Allows the analysis of the expected squared
error in terms of noise, bias, and variance, providing insights into the
source of model errors.
• Understanding the tradeoff: Helps visualise and quantify the tradeoff
between bias and variance, guiding the selection of appropriate model
complexity.
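The decomposition can also be estimated empirically. The following simulation sketch fits a degree-3 polynomial to many training sets drawn from a known target function and estimates the average squared bias and average variance; the target function, noise level, and sample sizes are illustrative assumptions:

import numpy as np

rng = np.random.default_rng(0)
f_true = lambda x: np.sin(2 * np.pi * x)      # assumed known target function
x_test = np.linspace(0, 1, 50)
d, n_sets, n_train, noise = 3, 200, 30, 0.2   # illustrative choices

preds = np.empty((n_sets, x_test.size))
for s in range(n_sets):
    x = rng.uniform(0, 1, n_train)                        # draw a training set D
    y = f_true(x) + noise * rng.standard_normal(n_train)
    coeffs = np.polyfit(x, y, d)                          # fit the model M(x, D)
    preds[s] = np.polyval(coeffs, x_test)

avg_pred = preds.mean(axis=0)                         # E_D[M(x, D)]
bias_sq = np.mean((avg_pred - f_true(x_test)) ** 2)   # average squared bias
variance = np.mean(preds.var(axis=0))                 # average variance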

Relation with Model Fitting
• High Bias (Underfitting): A high-bias model will have a large average
bias term, indicating that its predictions are consistently far from the true
values, regardless of the training data used.
• High Variance (Overfitting): A high-variance model will have a large
average variance term, signifying its sensitivity to the specific training set.
Its predictions will fluctuate significantly for different training datasets.

Applications
• Model Selection: Choosing the right model complexity by finding the
sweet spot that minimises both bias and variance. For example, in polyno-
mial regression, selecting the appropriate degree of the polynomial based
on the bias-variance tradeoff.
• Regularisation: Techniques like L1 and L2 regularisation in linear
regression help control model complexity by penalising large weights, thus
reducing variance.
• Ensemble Methods: Techniques like bagging and boosting combine
multiple models to reduce variance (bagging) or bias (boosting).
• Hyperparameter Tuning: Adjusting hyperparameters to find the op-
timal balance between bias and variance. For instance, in k-nearest
neighbours, choosing the optimal value of k involves considering its im-
pact on the bias-variance tradeoff.
• Neural Network Architecture: The choice of the number of layers and
neurons in a neural network affects its complexity and consequently the
bias-variance tradeoff.

Examples
Example: Polynomial Regression
Consider fitting a polynomial to a dataset.
• Low-degree polynomial (e.g., linear): May have high bias as it can’t
capture complex non-linear relationships but will have low variance.
• High-degree polynomial: Can perfectly fit the training data, resulting
in low bias. However, it will likely overfit, leading to high variance and
poor generalisation to new data.
Example: k-Nearest Neighbours
In k-NN classification, the value of k affects bias and variance:
• Small k: Results in low bias as the prediction relies heavily on nearby
points but has high variance as it’s sensitive to noise in individual data
points.
• Large k: Averages over more neighbours, reducing variance but increasing
bias as it smooths out the decision boundaries.

By understanding the bias-variance tradeoff, one can choose the appropriate de-
gree of the polynomial or the value of k to achieve a balance between underfitting
and overfitting, leading to better generalisation performance.

Model Selection and Validation


Concept Overview
Model selection is the process of choosing the best model from a set of candi-
date models for a given machine learning task. This involves evaluating different
models and selecting the one that performs best based on a chosen metric. Val-
idation is a crucial aspect of model selection, as it provides a way to estimate
how well a model will generalise to unseen data.
Purpose and Relevance:
• Avoiding Overfitting: Validation helps prevent overfitting by evaluating
models on data unseen during training, ensuring they can generalise well.
• Optimising Hyperparameters: Many models have hyperparameters
that need to be tuned for optimal performance. Validation sets are used
to evaluate the model performance with different hyperparameter settings.
• Comparing Different Models: Validation allows a fair comparison of
different models (e.g. linear regression vs. decision tree) on the same task.

Methods
Validation for Model Selection
• The simplest validation approach involves splitting the dataset into two
parts: a training set and a validation set.
• The model is trained on the training set, and its performance is evaluated
on the validation set.
• The model with the best performance on the validation set is selected.

K-Fold Cross-Validation
• A more robust method to reduce the variance of the validation estimate.
• The data is divided into 𝐾 equal-sized folds.
• The model is trained 𝐾 times, each time using 𝐾 − 1 folds for training
and the remaining fold for validation.
• The final performance estimate is the average performance over all 𝐾 folds.
Algorithm:
k-Fold Cross Validation for Model Selection
input:
training set S = (x1, y1), . . . , (xm, ym)
set of parameter values Θ
learning algorithm A
integer k
partition S into S1, S2, . . . , Sk
foreach θ ∈ Θ
  for i = 1 . . . k
    h_{i,θ} = A(S \ Si; θ)
  error(θ) = (1/k) Σ_{i=1}^{k} L_{Si}(h_{i,θ})
output
θ* = argmin_θ error(θ)
h_{θ*} = A(S; θ*)
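A minimal Python sketch of this procedure; train and loss are hypothetical stand-ins for the learning algorithm A and the loss L:

import numpy as np

def k_fold_select(X, y, thetas, train, loss, k=5):
    # train(X, y, theta) -> hypothesis h; loss(h, X, y) -> validation error
    folds = np.array_split(np.random.permutation(len(y)), k)
    best_theta, best_err = None, np.inf
    for theta in thetas:
        err = 0.0
        for i in range(k):
            val = folds[i]
            trn = np.concatenate([folds[j] for j in range(k) if j != i])
            h = train(X[trn], y[trn], theta)    # h_{i,theta} = A(S \ S_i; theta)
            err += loss(h, X[val], y[val]) / k  # average validation error
        if err < best_err:
            best_theta, best_err = theta, err
    return train(X, y, best_theta)              # retrain on all of S with theta*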

Training-Validation-Test Split
• The data is divided into three sets: training, validation, and test.
• The training set is used for model training, the validation set is used for
model selection and hyperparameter tuning, and the test set is used for
final evaluation of the chosen model.

Regularized Loss Minimization


Regularization is a technique used to prevent overfitting by adding a penalty
term to the loss function. This penalty term discourages the model from learning
overly complex functions that fit the training data too closely.
Concept:
The goal is to find the model parameters that minimize the regularized loss,
which is the sum of the empirical loss and the regularization term.
Mathematical Foundation:
𝐴(𝑆) = argmin𝑤 (𝐿𝑆 (𝑤) + 𝜆𝑅(𝑤))
where:
• 𝐴(𝑆) is the learning algorithm.
• 𝐿𝑆 (𝑤) is the empirical loss on the training set 𝑆.
• 𝑅(𝑤) is the regularization term.
• 𝜆 > 0 is a regularization parameter that controls the strength of the
regularization.
Types of Regularization:
• L1 Regularization: Adds a penalty proportional to the sum of the abso-
lute values of the model parameters. It encourages sparsity, driving some
parameters to zero, effectively performing feature selection.
• L2 Regularization: Adds a penalty proportional to the sum of the
squared values of the parameters. It prevents any single parameter from
becoming too large, promoting smoother decision boundaries.
The formulas for lasso and ridge regression are closely related to the concept
of regularized loss minimization. Both methods aim to find the model
parameters that minimise a combination of the empirical loss on the training
data and a penalty term that encourages simpler models.
Lasso Regression (L1 Regularization):
The objective function for lasso regression is given as:
• min_w J(w) = ½‖Y − Dw‖² + α·‖w‖₁
Where:
• D is the data matrix containing the independent variables.
• Y is the response vector.
• w represents the vector of model parameters (weights).
• α ≥ 0 is the regularization constant controlling the strength of the penalty.
• ‖w‖₁ = Σ_{i=1}^{d} |wᵢ| is the L1-norm of the weight vector, representing the sum of the absolute values of the weights.
Ridge Regression (L2 Regularization):
The objective function for ridge regression is:
• min_w̃ J(w̃) = ‖Y − D̃w̃‖² + α·‖w̃‖²
Where:
• D̃ is the augmented data matrix (including a column of 1s for the bias term).
• w̃ is the augmented weight vector (including the bias term).
• α ≥ 0 is the regularization constant.
• ‖w̃‖² = Σ_{i=1}^{d} w̃ᵢ² is the squared L2-norm of the weight vector, which is the sum of the squared values of the weights.
Key Differences:
• Lasso regression (L1) uses the absolute values of the weights in the
penalty term, encouraging sparsity in the solution (some weights become
exactly zero). This can be beneficial for feature selection, as unimportant
features are effectively removed from the model.
• Ridge regression (L2) uses the squared values of the weights, lead-
ing to shrinkage of the weights towards zero (but not exactly zero). This
helps to prevent any single feature from having an overly dominant influ-
ence on the model.
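As an illustration, the ridge objective above has the closed-form solution w̃ = (D̃ᵀD̃ + αI)⁻¹D̃ᵀY. A minimal sketch (note that, exactly as in the augmented formulation above, the bias term is penalised along with the other weights):

import numpy as np

def ridge_fit(D, Y, alpha=1.0):
    # Augment D with a column of ones for the bias term
    D_aug = np.column_stack([np.ones(len(D)), D])
    d = D_aug.shape[1]
    # Solve (D~^T D~ + alpha I) w~ = D~^T Y
    return np.linalg.solve(D_aug.T @ D_aug + alpha * np.eye(d), D_aug.T @ Y)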

Relationship between Model Selection, Validation and Regularization


Model selection, validation, and regularization are interrelated concepts that
work together to improve model generalisation.
• Model selection uses validation techniques to choose the best model and
its hyperparameters.

• Regularization techniques are used during model training to prevent
overfitting.
• Validation techniques are applied to assess the impact of different regular-
ization parameters and select the best setting.
By carefully combining these techniques, we can train models that perform well
on unseen data and avoid the pitfalls of overfitting.

Examples
Example: Regularized Linear Regression
In linear regression, L1 (LASSO) or L2 (Ridge) regularization can be added
to the loss function to prevent overfitting. K-fold cross-validation can then be
used to select the best regularization parameter 𝜆 by comparing the average
performance on the validation folds.
Example: Choosing the Depth of a Decision Tree
When training a decision tree, the depth of the tree is a hyperparameter that
needs to be tuned. A deeper tree can capture more complex patterns but risks
overfitting. Cross-validation can be used to evaluate trees of different depths
and select the one that generalises best to the validation data.
Example: Regularization in Neural Networks
Neural networks often employ techniques like dropout or weight decay (L2 regu-
larization) to prevent overfitting. A validation set is crucial for determining the
optimal dropout rate or weight decay parameter, ensuring the network doesn’t
memorise the training data but learns generalizable features.

Applications
• Image Classification: In image recognition tasks, models like convolu-
tional neural networks (CNNs) are trained using regularization techniques
and cross-validation to achieve high accuracy on unseen images.
• Natural Language Processing: Models for tasks like machine transla-
tion or sentiment analysis are trained using regularization and validation
to ensure they generalise well to different language styles and domains.
• Medical Diagnosis: Machine learning models used for disease prediction
or diagnosis are carefully validated and regularized to ensure reliable and
accurate performance on new patients.
• Financial Modelling: Predictive models for stock prices or credit risk
are validated and regularized to avoid overfitting to historical data and to
ensure robustness to market fluctuations.

Support Vector Machines (SVM)
Concept Overview
Support Vector Machines (SVMs) are supervised learning models used for clas-
sification and regression tasks. They are particularly well-suited for handling
high-dimensional data and are known for their ability to find complex, non-
linear decision boundaries. The fundamental idea behind SVMs is to find the
optimal hyperplane that maximises the separation between different
classes in the data.
Key Concepts:
• Hyperplane: In a d-dimensional space, a hyperplane is a (d-1)-
dimensional subspace that separates the space into two halves. In SVMs,
a hyperplane is used as the decision boundary between classes.
• Margin: The margin of a separating hyperplane is the distance between
the hyperplane and the closest data point from either class. SVMs aim to
find the hyperplane that maximises this margin.
• Support Vectors: The data points closest to the separating hyperplane
are called support vectors. These points play a crucial role in defining the
hyperplane and are the only ones that influence the classification decision.
• Kernel Trick: For non-linearly separable data, SVMs use the kernel trick
to implicitly map the data into a higher-dimensional space where it might
become linearly separable.
SVM Types:
• Hard SVM: This type of SVM works well for linearly separable data. It
seeks a hyperplane that perfectly classifies all training examples, maximis-
ing the margin.
• Soft SVM: When the data is not linearly separable, soft SVM allows
for some misclassification during training. It introduces a penalty term
for misclassified points, balancing the margin maximisation with error
minimisation.
Relation to Other Concepts:
• Regularized Loss Minimisation: SVMs can be viewed as a form of
regularized loss minimisation. The margin maximisation corresponds to
minimizing the norm of the weight vector, which acts as a regulariser,
while the loss function penalises misclassifications.
• Convex Optimisation: Training an SVM involves solving a convex opti-
misation problem, which guarantees finding the globally optimal solution.
This makes SVMs more robust compared to other models that might get
stuck in local optima.

Mathematical Foundation
Hard SVM The objective of Hard SVM is to find the optimal hyperplane,
defined by weight vector w and bias b, that maximises the margin while correctly
classifying all training data. This can be formulated as an optimisation problem:
Objective Function: maximize_{(w,b): ‖w‖=1} min_{i∈[m]} |⟨w, xᵢ⟩ + b| subject to ∀i, yᵢ(⟨w, xᵢ⟩ + b) > 0
where:
• xi is the i-th data point.
• yi is the corresponding class label (either +1 or -1).
• ⟨w, xᵢ⟩ represents the dot product of the weight vector and the data point.
This formulation aims to find the hyperplane (w, b) that has a unit norm (||w||
= 1) and maximises the distance to the closest data point, ensuring all points
are correctly classified.
An equivalent formulation of the Hard SVM rule as a quadratic optimisation
problem is:
Objective Function: minimize_{(w,b)} ‖w‖² subject to ∀i, yᵢ(⟨w, xᵢ⟩ + b) ≥ 1
This formulation minimises the squared norm of the weight vector (‖w‖²) under
the constraint that all data points are correctly classified with a margin of at
least 1.

Soft SVM When the data is not linearly separable, the Hard SVM constraints become infeasible. Soft SVM addresses this by introducing slack variables (ξᵢ) and a penalty parameter (C) to allow for misclassifications.
Objective Function: minimize_{(w,b,ξ)} (½‖w‖² + C Σ_{i=1}^{n} ξᵢ) subject to yᵢ(wᵀxᵢ + b) ≥ 1 − ξᵢ and ξᵢ ≥ 0, ∀xᵢ ∈ D
The objective function now balances margin maximisation (minimising ‖w‖²) with error minimisation (minimising the sum of slack variables). The penalty parameter C controls the trade-off between these two objectives.
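A minimal sketch of solving this problem by subgradient descent on the equivalent unconstrained hinge-loss objective ½‖w‖² + C Σᵢ max{0, 1 − yᵢ(wᵀxᵢ + b)}; labels are assumed to be in {−1, +1}, and the step size and iteration count are illustrative:

import numpy as np

def soft_svm_subgradient(X, y, C=1.0, eta=0.001, T=1000):
    # X: (m, d) data, y: labels in {-1, +1}
    m, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(T):
        viol = y * (X @ w + b) < 1  # points violating the margin (nonzero hinge loss)
        grad_w = w - C * (y[viol, None] * X[viol]).sum(axis=0)
        grad_b = -C * y[viol].sum()
        w -= eta * grad_w
        b -= eta * grad_b
    return w, b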

Optimality and the Fritz John Optimality Theorem

Concept Overview
• Support Vector Machines (SVMs): Powerful supervised learning
models used for classification and regression tasks. SVMs aim to find
a hyperplane that best separates data points into different classes.
• Optimality in SVMs: The concept of optimality in SVMs refers to finding the best possible hyperplane that maximises the margin between classes, either under perfect separation (hard-SVM) or while allowing some margin violations (soft-SVM).
• Fritz John Optimality Theorem: A generalisation of the Lagrange
multiplier theorem that provides necessary conditions for optimality in
constrained optimisation problems. It’s crucial for understanding and
analysing the solutions found by SVMs.

Mathematical Foundation
Hard-SVM and Margin Maximisation The goal of hard-SVM is to find
a hyperplane (defined by w and b) that perfectly separates the data while
maximising the margin between classes. This can be formulated as:
maximise_{(w,b): ‖w‖=1} min_{i∈[m]} yᵢ(⟨w, xᵢ⟩ + b)
where:
• xi is a data point.
• yi is the class label (+1 or -1).
• w is the weight vector.
• b is the bias term.
This formulation seeks the hyperplane with the largest perpendicular distance
(margin) to the closest data points from each class.

Soft-SVM and Regularised Loss Minimisation
In real-world scenarios, data is often not perfectly separable. Soft-SVM addresses this by allowing misclassifications while penalising them. This is achieved by introducing slack variables (ξᵢ) and a penalty parameter (C):
minimise_{(w,b,ξ)} (½‖w‖² + C Σ_{i=1}^{n} ξᵢ) subject to yᵢ(wᵀxᵢ + b) ≥ 1 − ξᵢ and ξᵢ ≥ 0, ∀xᵢ ∈ D
This formulation balances finding a hyperplane with a large margin and min-
imising the misclassification penalty.

Fritz John Optimality Theorem


The Fritz John Optimality Theorem provides conditions for optimality in con-
strained optimisation problems. For a problem like:
minimise_w f(w) subject to gᵢ(w) ≤ 0, ∀i ∈ [m]
where f and gi are differentiable functions, the theorem states:
If w₀ is an optimal solution, then there exist constants (λ₀, λ₁, …, λ_m), not all equal to zero, such that:
• λ₀∇f(w₀) + Σ_{i=1}^{m} λᵢ∇gᵢ(w₀) = 0
• λᵢgᵢ(w₀) = 0 for all i ∈ [m]
• λᵢ ≥ 0 for all i ∈ [0, m]

Support Vectors
Support vectors are the data points that lie closest to the decision boundary and
have a direct influence on its position. They are crucial because they determine
the margin and play a key role in the classification process. The Fritz John
Optimality Theorem is used to identify these support vectors.

Algorithms
The sources don’t contain specific algorithms related to the Fritz John Optimal-
ity Theorem and its application in SVMs. However, it’s worth mentioning that
quadratic programming solvers are often used to find the optimal solution for
the SVM primal or dual problem.

Examples
The sources don’t provide solved examples related to the Fritz John Optimality
Theorem applied to SVMs. However, you can find numerous examples in SVM
literature, like the textbook “Understanding Machine Learning” by Shai Shalev-
Shwartz and Shai Ben-David.

Applications
Identifying Support Vectors: The Fritz John Optimality Theorem allows us to identify support vectors by analysing the Lagrange multipliers (λᵢ) corresponding to the constraints in the SVM optimisation problem.
• Support vectors have non-zero λᵢ values.
• Data points with λᵢ = 0 are not support vectors and do not affect the decision boundary.
This knowledge is crucial for:
• Understanding model behaviour: Knowing which data points influ-
ence the decision boundary can provide insights into how the model makes
predictions.
• Data visualisation: Highlighting support vectors can help visualise the
decision boundary and the margin in SVM models.
• Model efficiency: Support vectors allow for a more compact representa-
tion of the SVM model, as only these points are required for prediction.
• Kernel methods: In kernel-based SVMs, support vectors become even
more crucial as they determine the shape of the decision boundary in the
implicit high-dimensional feature space.

Analysing SVM Solutions
The Fritz John Optimality Theorem serves as a theoretical foundation for analysing and justifying the solutions found by
SVM algorithms. It ensures that the obtained solution satisfies the necessary
conditions for optimality, validating the learning process.

Beyond SVMs: The Fritz John Optimality Theorem is a powerful tool with
applications extending far beyond SVMs:
• Convex optimisation: It forms the basis for solving a wide range of
constrained optimisation problems.
• Economics: Used in problems related to resource allocation, production
planning, and market equilibrium.
• Engineering: Applied in areas like control theory, signal processing, and
structural design.
Understanding this theorem allows for a deeper appreciation of the mathemati-
cal foundations of optimisation and its applications across various domains.

Algorithms
Several algorithms can be used to train SVMs, with the most common ones
being:
• Quadratic Programming: Hard SVM and Soft SVM formulations can
be solved directly using quadratic programming solvers. However, this
approach can be computationally expensive for large datasets.
• Stochastic Gradient Descent (SGD): SGD is a more scalable ap-
proach for training SVMs, especially for large datasets. It iteratively up-
dates the model parameters by computing the gradient of the loss function
on a small batch of data points.

SGD for Soft SVM with Kernels Goal: Solve the Soft SVM optimization problem using kernels.
Parameter: T (number of iterations)
Initialize: β(1) = 0 (coefficients vector)
For t = 1, …, T:
1. Let α(t) = (1 / (λt)) β(t) (update coefficients)
2. Choose i randomly from [m] (select a data point)
3. If yᵢ Σ_{j=1}^{m} α(t)ⱼ K(xⱼ, xᵢ) < 1 (check classification condition):
• β(t+1)ᵢ = β(t)ᵢ + yᵢ
• β(t+1)ⱼ = β(t)ⱼ for j ≠ i
4. Else:
• β(t+1) = β(t)
Output: α(T) (final coefficients)
This algorithm uses the kernel trick to implicitly work in the higher-dimensional feature space, avoiding the explicit computation of the feature mapping. The kernel function K(x, x′) calculates the inner product between the feature representations of two data points.

Applications
SVMs have wide applications in various domains, including:
• Image Classification: SVMs are used for recognizing objects, scenes,
and faces in images. They can handle high-dimensional image data effec-
tively.
• Text Classification: SVMs are applied for sentiment analysis, spam
filtering, and topic categorization, where the high dimensionality of text
data is a challenge.
• Bioinformatics: SVMs are used for protein classification, gene expres-
sion analysis, and drug discovery, where the data is often complex and
high-dimensional.
• Handwriting Recognition: SVMs are employed in recognizing hand-
written characters, digits, and signatures, effectively dealing with the vari-
ability in handwriting styles.

Support Vector Machines: Duality, Kernel Trick, Implementing Soft SVM with Kernels
Concept Overview
• Duality: In the context of optimisation problems like SVMs, duality pro-
vides an alternative perspective and formulation that can often be easier
to solve or offer valuable insights into the original problem. The dual prob-
lem leverages Lagrange multipliers to incorporate the constraints into the
objective function.
• Kernel Trick: The kernel trick is a powerful technique used with SVMs
to implicitly map data into a higher-dimensional space where it
might become linearly separable without actually performing the compu-
tationally expensive mapping explicitly. This is achieved by replacing
dot products in the feature space with a kernel function evaluated in the
original input space.
• Implementing Soft SVM with Kernels: Combining soft SVM with
the kernel trick provides a flexible and efficient approach for handling
non-linearly separable data. By leveraging kernels, soft SVM can learn
complex decision boundaries while maintaining computational efficiency.

Mathematical Foundation
Duality Consider the primal form of the Soft SVM optimisation problem:
minimize_{(w,b,ξ)} (½‖w‖² + C Σ_{i=1}^{n} ξᵢ) subject to yᵢ(wᵀxᵢ + b) ≥ 1 − ξᵢ and ξᵢ ≥ 0, ∀xᵢ ∈ D
To derive the dual form, we introduce Lagrange multipliers (αᵢ and µᵢ) for the constraints and form the Lagrangian:
L = (½‖w‖² + C Σ_{i=1}^{n} ξᵢ) − Σ_{i=1}^{n} αᵢ[yᵢ(wᵀxᵢ + b) − 1 + ξᵢ] − Σ_{i=1}^{n} µᵢξᵢ

The dual problem is then obtained by minimising L with respect to w, b, and ξ and then maximising with respect to α and µ. This leads to the dual form:
maximize_α Σ_{i=1}^{n} αᵢ − ½ Σ_{i=1}^{n} Σ_{j=1}^{n} αᵢαⱼyᵢyⱼxᵢᵀxⱼ
subject to: 0 ≤ αᵢ ≤ C, ∀i and Σ_{i=1}^{n} αᵢyᵢ = 0

Kernel Trick The kernel trick exploits the fact that the dual form of the SVM optimisation problem only involves inner products between data points (xᵢᵀxⱼ). By replacing these inner products with a kernel function K(xᵢ, xⱼ), we can implicitly compute inner products in a higher-dimensional feature space defined by a mapping φ(x) without explicitly calculating φ(x).
The kernel function should satisfy Mercer's condition, which guarantees that it corresponds to a valid inner product in some feature space:
K(xᵢ, xⱼ) = φ(xᵢ)ᵀφ(xⱼ)
Common kernel functions include:
• Polynomial kernel: K(x, x′) = (1 + ⟨x, x′⟩)^k
• Gaussian kernel: K(x, x′) = exp(−‖x − x′‖² / (2σ²))

Implementing Soft SVM with Kernels The Representer Theorem states that the optimal weight vector w can be expressed as a linear combination of the training data points mapped into the feature space:
w = Σ_{i=1}^{m} αᵢφ(xᵢ)
Therefore, instead of directly optimising w, we can work with the coefficients α. The dual problem can then be expressed solely in terms of the kernel function:
maximize_α Σ_{i=1}^{n} αᵢ − ½ Σ_{i=1}^{n} Σ_{j=1}^{n} αᵢαⱼyᵢyⱼK(xᵢ, xⱼ)
subject to: 0 ≤ αᵢ ≤ C, ∀i and Σ_{i=1}^{n} αᵢyᵢ = 0

Algorithms
SGD for Solving Soft-SVM with Kernels Goal: Solve Equation (16.5):
min_w (λ/2 ‖w‖² + (1/m) Σ_{i=1}^{m} max{0, 1 − yᵢ⟨w, φ(xᵢ)⟩})
Parameter: T (number of iterations)
Initialize: β(1) = 0 (coefficients vector)
For t = 1, …, T:
1. Let α(t) = (1 / (λt)) β(t) (update coefficients)
2. Choose i randomly from [m] (select a data point)
3. If yᵢ Σ_{j=1}^{m} α(t)ⱼ K(xⱼ, xᵢ) < 1 (check classification condition):
• β(t+1)ᵢ = β(t)ᵢ + yᵢ
• β(t+1)ⱼ = β(t)ⱼ for j ≠ i
4. Else:
• β(t+1) = β(t)
Output: α(T) (final coefficients)
This algorithm utilises the kernel trick, thus avoiding the explicit calculation of the feature mapping. The complexity of this algorithm is dominated by the computation of the kernel matrix, which is O(n²), where n is the number of training points.
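A minimal Python sketch of the kernelized SGD above with a Gaussian kernel; λ, σ, and T are illustrative choices. A new point x is classified by the sign of Σⱼ αⱼ K(xⱼ, x):

import numpy as np

def gaussian_kernel(x1, x2, sigma=1.0):
    return np.exp(-np.sum((x1 - x2) ** 2) / (2 * sigma ** 2))

def kernel_sgd_soft_svm(X, y, lam=0.1, T=1000, kernel=gaussian_kernel):
    # X: (m, d) data, y: labels in {-1, +1}
    m = len(y)
    K = np.array([[kernel(X[i], X[j]) for j in range(m)] for i in range(m)])
    beta = np.zeros(m)
    for t in range(1, T + 1):
        alpha = beta / (lam * t)          # alpha(t) = beta(t) / (lambda * t)
        i = np.random.randint(m)          # choose a data point at random
        if y[i] * (alpha @ K[:, i]) < 1:  # classification condition
            beta[i] += y[i]
    return beta / (lam * T)               # alpha(T), the final coefficients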

Examples
Example: Handwritten Digit Recognition Consider the task of classify-
ing handwritten digits. Each digit image can be represented as a vector of pixel
values, resulting in a high-dimensional input space.
1. Feature Mapping: A suitable kernel function, such as a Gaussian kernel,
can be chosen to implicitly map the digit images into a higher-dimensional
space where they might become linearly separable.
2. Training: Using the SGD algorithm with the chosen kernel, a soft SVM
model is trained on a labelled dataset of handwritten digits.
3. Classification: A new digit image is classified by computing its kernel
values with the support vectors from the training set and applying the
learned SVM decision function.

Applications
The combination of duality, the kernel trick, and soft SVM has led to widespread
applications of SVMs:
• Image Recognition and Object Detection: SVMs are used in various
computer vision tasks, including image classification, object detection, and
image segmentation, where their ability to handle high-dimensional data
and learn complex decision boundaries is beneficial.
• Natural Language Processing: Applications include text classification,
sentiment analysis, and machine translation, where kernels can be de-
signed to capture semantic similarities between text documents or words.
• Bioinformatics: SVMs are used in various bioinformatics applications
like protein structure prediction, gene expression analysis, and drug discov-
ery, where they can handle complex biological data and build predictive
models.
The kernel trick enables SVMs to model non-linear relationships effectively,
making them suitable for various real-world applications.

Classification Assessment Notes
Performance Measures
• Accuracy: The proportion of correctly classified instances out of all in-
stances. It provides a general idea of the classifier’s performance but can
be misleading when dealing with imbalanced datasets.
– Formula: Accuracy = (TP + TN) / (TP + TN + FP + FN)
• Precision: The proportion of correctly classified positive instances out of
all instances predicted as positive. It focuses on the accuracy of positive
predictions.
– Formula: Precision = TP / (TP + FP)
• Recall (Sensitivity or True Positive Rate): The proportion of cor-
rectly classified positive instances out of all actual positive instances. It
focuses on the classifier’s ability to identify all positive instances.
– Formula: Recall = TP / (TP + FN)
• F1-Score: The harmonic mean of precision and recall. It provides a
balanced measure that considers both the accuracy of positive predictions
and the ability to identify all positive instances.
– Formula: F1-Score = 2 * (Precision * Recall) / (Precision
+ Recall)
• ROC-AUC: A graphical and numerical metric used for binary classifi-
cation problems. The ROC curve plots the True Positive Rate (TPR)
against the False Positive Rate (FPR) at various threshold settings. The
AUC (Area Under the Curve) summarises the ROC curve into a single
value, representing the classifier’s ability to distinguish between positive
and negative classes. A higher AUC indicates better performance.

Classifier Evaluation
• Confusion Matrix: A table that summarises the performance of a clas-
sification model by showing the counts of true positives, true negatives,
false positives, and false negatives. It provides a detailed breakdown of
the classifier’s predictions for each class, allowing for a deeper analysis of
its strengths and weaknesses.
• Cross-Validation: A technique used to evaluate the performance of a
model on unseen data by splitting the dataset into multiple folds and
using each fold as a test set while training the model on the remaining
folds. Common types include k-fold cross-validation and leave-one-out
cross-validation. It helps to mitigate the impact of data randomness on
the evaluation results and obtain a more robust estimate of the model’s
generalisation performance.
• Bias-Variance Tradeoff: The tradeoff between a model’s ability to fit
the training data (low bias) and its ability to generalise to new data (low
variance). A model with high bias underfits the data and has poor predic-
tive power, while a model with high variance overfits the data and may not generalise well. The goal is to find a balance that minimises both bias
and variance.

Example: Calculating Precision, Recall, and F1-Score from a Confusion Matrix
Scenario: Suppose we have a binary classification problem where we are trying
to predict whether a customer will click on an ad.
Confusion Matrix:

                   Predicted Click   Predicted No Click
Actual Click       80                20
Actual No Click    10                90

Calculations:
• True Positives (TP) = 80
• False Positives (FP) = 10
• False Negatives (FN) = 20
• True Negatives (TN) = 90
• Precision = TP / (TP + FP) = 80 / (80 + 10) = 0.89
• Recall = TP / (TP + FN) = 80 / (80 + 20) = 0.80
• F1-Score = 2 * (Precision * Recall) / (Precision + Recall) = 2 * (0.89 *
0.80) / (0.89 + 0.80) = 0.84
Interpretation:
• The precision of 0.89 indicates that when the model predicts a click, it
is correct 89% of the time.
• The recall of 0.80 suggests that the model correctly identifies 80% of the
actual clicks.
• The F1-score of 0.84 represents a balanced measure of the model’s ability
to make accurate positive predictions and identify all actual clicks.
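The same calculations in a few lines of Python:

TP, FP, FN, TN = 80, 10, 20, 90

accuracy = (TP + TN) / (TP + TN + FP + FN)           # 0.85
precision = TP / (TP + FP)                           # ~0.89
recall = TP / (TP + FN)                              # 0.80
f1 = 2 * precision * recall / (precision + recall)   # ~0.84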
Here are some notes on classification assessment, expanding on the concepts of
TPR, FPR, TNR, FNR, sensitivity, and specificity:

Performance Measures (Revision)


• Accuracy: The proportion of correctly classified instances out of all in-
stances. It provides a general idea of the classifier’s performance, but can
be misleading when dealing with imbalanced datasets.
– Formula: Accuracy = (TP + TN) / (TP + TN + FP + FN)

• Precision: The proportion of correctly classified positive instances out of
all instances predicted as positive. It focuses on the accuracy of positive
predictions.
– Formula: Precision = TP / (TP + FP)
• Recall (Sensitivity or True Positive Rate (TPR)): The proportion
of correctly classified positive instances out of all actual positive instances.
It focuses on the classifier’s ability to identify all positive instances.
– Formula: Recall = Sensitivity = TPR = TP / (TP + FN)
• F1-Score: The harmonic mean of precision and recall. It provides a
balanced measure that considers both the accuracy of positive predictions
and the ability to identify all positive instances.
– Formula: F1-Score = 2 * (Precision * Recall) / (Precision
+ Recall)
• Specificity (True Negative Rate (TNR)): The proportion of correctly
classified negative instances out of all actual negative instances. It mea-
sures the classifier’s ability to correctly identify negative instances.
– Formula: Specificity = TNR = TN / (FP + TN)
• False Positive Rate (FPR): The proportion of incorrectly classified neg-
ative instances (predicted as positive) out of all actual negative instances.
– Formula: FPR = FP / (FP + TN) = 1 - Specificity
• False Negative Rate (FNR): The proportion of incorrectly classified
positive instances (predicted as negative) out of all actual positive in-
stances.
– Formula: FNR = FN / (TP + FN) = 1 - Sensitivity
• ROC-AUC: A graphical and numerical metric used for binary classifi-
cation problems. The ROC curve plots the True Positive Rate (TPR)
against the False Positive Rate (FPR) at various threshold settings. The
AUC (Area Under the Curve) summarizes the ROC curve into a single
value, representing the classifier’s ability to distinguish between positive
and negative classes. A higher AUC indicates better performance.

Study Material: ROC Analysis, ROC/AUC, Bootstrap Resampling, and Confidence Intervals
1. ROC Analysis and ROC/AUC
1.1 Concept Overview
• ROC (Receiver Operating Characteristic) analysis is a technique
used to evaluate the performance of binary classification models. It pro-
vides a visual and numerical representation of a classifier’s ability to dis-
criminate between positive and negative classes across various decision
thresholds.
• The ROC curve is a graphical plot that illustrates the trade-off between
the True Positive Rate (TPR) and the False Positive Rate (FPR)
at different classification thresholds.

• TPR (also called sensitivity or recall) is the proportion of correctly
classified positive instances out of all actual positive instances.
• FPR (equivalent to 1 - specificity) is the proportion of incorrectly clas-
sified negative instances (predicted as positive) out of all actual negative
instances.
• The Area Under the ROC Curve (AUC) is a single numerical value
that summarizes the overall performance of the classifier. A higher AUC
value indicates better discrimination capability, with 1 representing a per-
fect classifier and 0.5 representing a random classifier.
Purpose and Relevance: ROC analysis and AUC are particularly useful when:
• Dealing with imbalanced datasets, where accuracy alone can be misleading.
• Comparing different classification models to choose the one with the best discriminatory power.
• Choosing an appropriate threshold for a given application, depending on the relative costs of false positives and false negatives.

1.2 Mathematical Foundation
Formulas:
• TPR (Sensitivity) = TP / (TP + FN)
• FPR (1 - Specificity) = FP / (FP + TN)
• AUC: Calculated as the area under the ROC curve, which is a plot of
TPR against FPR.
Derivation of AUC: The AUC can be calculated using various numerical
integration methods, such as the trapezoidal rule. Conceptually, it represents
the probability that the classifier will rank a randomly chosen positive instance
higher than a randomly chosen negative instance.
Problem Solved: The ROC curve and AUC address the problem of evaluating
and comparing classifier performance in a threshold-independent manner, pro-
viding a more comprehensive view of the model’s ability to discriminate between
classes.
Real-world Applications:
• Medical Diagnosis: Evaluating the performance of diagnostic tests for
diseases.
• Credit Scoring: Assessing the risk of loan defaults.
• Spam Detection: Identifying spam emails.

1.3 Algorithm: ROC Curve and AUC Calculation
Purpose: To plot the ROC curve and calculate the area under the curve (AUC) for a given classifier and dataset.
Algorithm 22.1: ROC Curve and Area under the Curve
Input:
• D: Testing dataset
• M: Classifier
Output:
• ROC curve plot
• AUC value
Steps:
1. Predict Scores: For each test point xi in D, predict the score S(xi) for
the positive class.
2. Sort Scores: Sort the (S(xi), yi) pairs (score and true class) in decreasing
order of score.
3. Initialise Variables: Set FP = TP = 0, FPprev = TPprev = 0, AUC = 0, and θ = ∞. Let n1 and n2 denote the numbers of actual positive and negative points in D.
4. Iterate through Sorted Pairs: For each sorted pair (S(xi), yi):
• If S(xi) < θ:
– Update AUC (trapezoidal rule): AUC = AUC + ((TP + TPprev) / (2 · n1)) · ((FP − FPprev) / n2)
– Set TPprev = TP, FPprev = FP, and θ = S(xi)
• If yi = positive class: Increment TP by 1.
• Else (negative class): Increment FP by 1.
5. Final AUC Update: AUC = AUC + ((TP + TPprev) / (2 · n1)) · ((FP − FPprev) / n2)
6. Plot ROC Curve: Plot the points (FPR, TPR) calculated during the
iteration.
7. Return: The plotted ROC curve and the calculated AUC value.
Use Cases: The ROC curve and AUC are widely used in machine learning for
model evaluation and comparison, particularly in scenarios where a threshold-
independent assessment of classifier performance is required.
Efficiency: The algorithm has a time complexity of O(n log n) due to the
sorting step, where n is the number of instances in the testing dataset.
Example:
Given a testing dataset with the following sorted scores and true class labels:
(0.9, c1), (0.8, c2), (0.8, c1), (0.8, c1), (0.1, c2)
where c1 represents the positive class.
Applying Algorithm 22.1, we obtain the following points for the ROC plot and
running AUC calculation:

θ     FP   TP   (FPR, TPR)   AUC
∞     0    0    (0, 0)       0
0.9   0    1    (0, 0.333)   0
0.8   1    3    (0.5, 1)     0.333
0.1   2    3    (1, 1)       0.833

The resulting ROC plot will have these (FPR, TPR) points connected by lines,
and the AUC will be the area under this curve, which is 0.833 in this example.
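The algorithm translates almost line for line into code. Below is a minimal
Python sketch (the function name roc_auc and the use of plain lists are
illustrative choices, not from the source) that reproduces the worked example:

def roc_auc(scored_points, positive_class):
    """Return the ROC points [(FPR, TPR), ...] and the AUC (Algorithm 22.1)."""
    pairs = sorted(scored_points, key=lambda p: p[0], reverse=True)
    n1 = sum(1 for _, y in pairs if y == positive_class)  # actual positives
    n2 = len(pairs) - n1                                  # actual negatives
    tp = fp = tp_prev = fp_prev = 0
    auc, theta, roc_points = 0.0, float("inf"), []
    for score, y in pairs:
        if score < theta:  # threshold change: add a trapezoid, record a point
            auc += (fp - fp_prev) / n2 * (tp + tp_prev) / (2 * n1)
            theta, fp_prev, tp_prev = score, fp, tp
            roc_points.append((fp / n2, tp / n1))
        if y == positive_class:
            tp += 1
        else:
            fp += 1
    auc += (fp - fp_prev) / n2 * (tp + tp_prev) / (2 * n1)  # final trapezoid
    roc_points.append((1.0, 1.0))
    return roc_points, auc

# Reproduces the worked example above:
data = [(0.9, "c1"), (0.8, "c2"), (0.8, "c1"), (0.8, "c1"), (0.1, "c2")]
points, auc = roc_auc(data, positive_class="c1")
print(points)         # [(0.0, 0.0), (0.0, 0.333...), (0.5, 1.0), (1.0, 1.0)]
print(round(auc, 3))  # 0.833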

2. Bootstrap Resampling and Confidence Intervals


2.1 Concept Overview

• Bootstrap Resampling is a statistical technique used to estimate the
sampling distribution of a statistic (e.g., mean, median, accuracy) by re-
peatedly resampling with replacement from the original dataset.
• It involves creating multiple bootstrap samples, each of which is gen-
erated by randomly drawing data points from the original dataset with
replacement, ensuring that each sample has the same size as the original
dataset.
• Confidence Intervals are ranges around a point estimate (e.g., sample
mean) that are likely to contain the true population parameter with a
certain level of confidence (e.g., 95%). Bootstrap resampling can be used
to construct confidence intervals for various statistics.
Purpose and Relevance:
• Estimate Sampling Distribution: Bootstrap resampling allows us to
approximate the sampling distribution of a statistic without making strong
assumptions about the underlying population distribution.
• Construct Confidence Intervals: Bootstrap confidence intervals pro-
vide a measure of uncertainty around a point estimate, giving a range of
values that are likely to include the true population parameter.
• Evaluate Model Performance: In the context of classification, boot-
strap resampling can be used to estimate the variability of performance
metrics like accuracy or AUC, helping to assess the model’s robustness.

2.2 Mathematical Foundation
Formula for Confidence Interval:
A confidence interval for a statistic θ can be constructed using the bootstrap
distribution of θ (obtained from multiple bootstrap samples):
• Percentile Method: A (1 − α) · 100% confidence interval is given by the
α/2 and (1 − α/2) percentiles of the bootstrap distribution of θ.
Problem Solved: Bootstrap resampling and confidence intervals address the
problem of quantifying the uncertainty associated with sample estimates, pro-
viding a range of plausible values for the true population parameter.
Real-world Applications:
• Survey Analysis: Estimating confidence intervals for survey results.
• Clinical Trials: Assessing the effectiveness of a new drug with a confi-
dence interval.
• Machine Learning: Evaluating the performance of a machine learning
model with confidence intervals on metrics like accuracy.

2.3 Algorithm: Bootstrap Resampling for Confidence Intervals
Purpose: To estimate the confidence interval of a statistic using bootstrap
resampling.
Algorithm 22.3: Bootstrap for Classifier Evaluation

Input:
• D: Dataset with n data points
• B: Number of bootstrap samples
• M: Classifier model
• θ: Performance measure (e.g., accuracy)
Output:
• Confidence interval for θ
Steps:
1. Create Bootstrap Samples: For i = 1 to B:
• Generate a bootstrap sample Di by randomly drawing n data points
from D with replacement.
2. Evaluate Classifier on Bootstrap Samples: For i = 1 to B:
• Train the classifier M on Di.
• Evaluate the performance measure θi on the full dataset D using the
model trained on Di.
3. Calculate Confidence Interval: Construct the confidence interval for
θ using the distribution of θi (e.g., using the percentile method).
Efficiency: The time complexity of the bootstrap resampling algorithm is O(B
* T), where B is the number of bootstrap samples and T is the time complexity
of training and evaluating the classifier model.
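As a concrete illustration, here is a minimal Python sketch of the
percentile-method bootstrap; the names train and accuracy are hypothetical
stand-ins for your own training and evaluation routines:

import random

def bootstrap_ci(D, B, train, accuracy, alpha=0.05):
    """Percentile-method (1 − alpha) confidence interval for a performance measure."""
    n = len(D)
    thetas = []
    for _ in range(B):
        Di = [random.choice(D) for _ in range(n)]  # resample with replacement
        model = train(Di)                          # fit the classifier on Di
        thetas.append(accuracy(model, D))          # evaluate on the full dataset
    thetas.sort()
    lo = thetas[int((alpha / 2) * B)]
    hi = thetas[min(int((1 - alpha / 2) * B), B - 1)]
    return lo, hi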

3. Interrelation between ROC Analysis and Bootstrap Resampling


• Bootstrap resampling can enhance ROC analysis by providing confidence
intervals for the AUC, offering a more robust assessment of the classifier’s
performance. By generating multiple bootstrap samples and calculating
the AUC for each sample, one can estimate the variability of the AUC
and construct a confidence interval. This helps to determine whether the
observed AUC is statistically significant or simply due to chance variation
in the data.

4. Connection to Other Fields/Concepts


• Statistical Inference: ROC analysis, ROC/AUC, bootstrap resampling,
and confidence intervals are all fundamental concepts in statistical infer-
ence, used to make inferences about a population based on a sample of
data.
• Machine Learning: These concepts are widely applied in machine learn-
ing for model evaluation, comparison, and selection.
• Data Mining: They are essential tools for analyzing and interpreting
patterns in large datasets.
This study material provides a concise overview of ROC Analysis, ROC/AUC,
Bootstrap Resampling, and Confidence Intervals, covering the key theoretical
aspects, mathematical foundations, and algorithms. These concepts are fun-
damental for understanding and evaluating the performance of classification
models in various applications.

Study Material: Neural Networks
1. Feedforward Neural Networks
1.1 Concept Overview
• Artificial Neural Networks (ANNs) are computational models in-
spired by the structure and function of biological neural networks. They
consist of interconnected nodes called neurons, organized in layers, which
process and transmit information.
• Feedforward Neural Networks are a type of ANN where the informa-
tion flows in one direction, from the input layer through hidden layers to
the output layer, without any loops or cycles. Each connection between
neurons has an associated weight, which determines the strength of the
signal transmission.
Purpose and Relevance:
• Universal Function Approximators: Feedforward neural networks,
with sufficient complexity (number of layers and neurons), can approxi-
mate any continuous function to arbitrary accuracy.
• Pattern Recognition and Classification: They excel in tasks like
image recognition, natural language processing, and speech recognition
due to their ability to learn complex patterns and representations from
data.
Connection to Other Fields/Concepts:
• Biology: Inspired by the structure of biological neural networks in the
brain.
• Linear Algebra: Heavily relies on matrix operations for efficient compu-
tation.
• Calculus: Uses derivatives for gradient-based optimization algorithms
like backpropagation.

1.2 Mathematical Foundation


• Neuron Model: A neuron zk receives inputs x1, x2, …, xd and computes
its output using:
– Net Input: A weighted sum of inputs: net_k = b_k + ∑_{i=1}^{d} w_ik · x_i,
where w_ik are weights and b_k is the bias.
– Activation Function: A non-linear function applied to the net
input: zk = f (netk). Common activation functions include sigmoid,
ReLU, and tanh.
• Network Architecture: A layered structure with input, hidden, and
output layers. The number of layers and neurons in each layer defines the
network’s complexity.
• Weights and Biases: The parameters of the network that are learned
during training to adjust the network’s behaviour and improve its performance.

1.3 Algorithms
1.3.1 Feedforward Algorithm
Purpose: To compute the output of a feedforward neural network for a given
input.
Algorithm:
1. Input Layer: Set the values of the input neurons to the input vector x.
2. Hidden Layers: For each hidden layer:
• Compute the net input for each neuron using the weighted sum of
outputs from the previous layer and the bias term.
• Apply the activation function to the net input to obtain the output
of each neuron.
3. Output Layer: Compute the net input and output of the output neurons
using the same process as for hidden layers.
4. Output: The output of the network is the vector of values from the
output neurons.
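A minimal NumPy sketch of this forward pass, assuming sigmoid activations in
every layer and one weight matrix and bias vector per layer (all names are
illustrative):

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def feedforward(x, weights, biases):
    """Propagate input x through the layers defined by (weights, biases)."""
    z = x
    for W, b in zip(weights, biases):
        net = W.T @ z + b  # net input: weighted sum of previous layer plus bias
        z = sigmoid(net)   # activation applied element-wise
    return z               # output of the final (output) layer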
1.3.2 Backpropagation Algorithm
Purpose: To train a feedforward neural network by adjusting weights and
biases using gradient descent to minimize a loss function (e.g., mean squared
error).
Algorithm:
1. Initialization: Randomly initialize weights w and biases b.
2. Forward Pass: Compute the network’s output for a given input using
the feedforward algorithm.
3. Loss Calculation: Calculate the error (loss) between the predicted out-
put and the true target value.
4. Backward Pass: Propagate the error back through the network, calcu-
lating gradients of the loss function with respect to weights and biases
using the chain rule.
5. Update Parameters: Update weights and biases using gradient descent:
• w = w − η · ∇w, where ∇w is the gradient of the loss with respect to w
and η is the learning rate.
• b = b − η · ∇b.
6. Repeat: Iterate steps 2-5 for multiple epochs (passes through the training
data) until the loss function converges to a minimum.
Use Cases: Backpropagation is the core algorithm for training feedforward
neural networks in a wide range of applications, including:
• Image Classification
• Natural Language Processing
• Regression Analysis

Efficiency: The time complexity of backpropagation is O(|E|) per training
example, where |E| is the number of edges (connections) in the network.

2. Expressive Power of Neural Networks


2.1 Concept Overview
• Expressiveness refers to the ability of a neural network to approximate
a wide range of functions. The more expressive a network is, the more
complex the functions it can represent and learn.
• Universal Approximation Theorem: States that a feedforward neu-
ral network with a single hidden layer containing a sufficient number of
neurons can approximate any continuous function on a compact subset of
Rn to arbitrary accuracy.
Factors Influencing Expressiveness:
• Number of Layers: Deeper networks (with more hidden layers) can
learn more complex hierarchical representations, increasing expressiveness.
• Number of Neurons: More neurons in hidden layers provide more flex-
ibility to model complex functions.
• Activation Function: The choice of activation function (e.g., sigmoid,
ReLU) impacts the non-linearity and expressiveness of the network.
Limitations:
• While theoretically a single hidden layer can suffice, in practice, deep
networks often perform better.
• The size of the network required to represent some functions can be expo-
nentially large, making it impractical.

2.2 Mathematical Foundation
The universal approximation theorem relies on mathematical concepts like:
• Stone-Weierstrass Theorem: A fundamental theorem in analysis that
guarantees the ability to approximate continuous functions using polyno-
mials.
• Fourier Series: A method for representing periodic functions as a sum
of sines and cosines, which can be related to the weighted sums computed
by neurons in neural networks.

2.3 Implications
• Theoretical Guarantee: The universal approximation theorem provides
a theoretical foundation for the ability of neural networks to solve a wide
range of problems.
• Practical Considerations: The theorem does not specify the optimal
network architecture or the number of neurons needed. These aspects

are determined empirically through experimentation and model selection
techniques.

3. SGD and Backpropagation


3.1 Concept Overview
• Stochastic Gradient Descent (SGD) is an optimization algorithm
used to train neural networks. It iteratively updates the network’s weights
and biases in the direction that reduces the loss function.
• Backpropagation is an efficient algorithm used to compute the gradients
of the loss function with respect to the weights and biases in the network.
Purpose and Relevance:
• Efficient Parameter Update: SGD allows for efficient learning by up-
dating parameters based on a small batch of data (or even a single data
point) rather than the entire dataset.
• Gradient Calculation: Backpropagation enables the efficient compu-
tation of gradients in deep networks, which would be computationally
expensive using traditional methods.
Interrelation:
SGD relies on backpropagation to calculate the gradients used for parameter
updates.

3.2 Mathematical Foundation


• Gradient Descent: The core idea is to move parameters in the direction
of the negative gradient of the loss function to minimize the error.
• Chain Rule: Backpropagation utilizes the chain rule from calculus to
efficiently compute gradients in a layered network by propagating the error
backwards.

3.3 Algorithms
3.3.1 Stochastic Gradient Descent Algorithm
Purpose: To train a neural network by iteratively updating parameters
(weights and biases) using gradients computed by backpropagation.
Algorithm:
1. Initialization: Randomly initialize weight vector w.
2. Iterations: Repeat for a specified number of iterations:
• Sample Data: Randomly sample a mini-batch of data from the
training set.
• Forward Pass: Compute the network’s output for the mini-batch.
• Backpropagation: Calculate the gradients of the loss function with
respect to w using the backpropagation algorithm.

• Update Weights: w(i+1) = w(i) − η_i (v_i + λ · w(i)), where v_i is the
gradient, η_i is the learning rate at iteration i, and λ is the regulariza-
tion parameter.
3. Output: Return the weight vector w that achieved the best performance
on a validation set.
3.3.2 Backpropagation Algorithm
Purpose: To compute the gradients of the loss function with respect to the
weights and biases in a neural network.
Algorithm:
1. Forward Pass: Compute the network’s output for a given input (as in
the feedforward algorithm).
2. Backward Pass:
• Output Layer: Calculate the error signal (δ_T) at the output layer.
• Hidden Layers: Propagate the error signal back through the hidden
layers, calculating the error signal for each layer (δ_t) using the chain
rule and the activation function’s derivative.
3. Gradient Calculation: Calculate the gradients of the loss function with
respect to weights and biases using the error signals and the outputs of
the neurons.
Efficiency: Both SGD and backpropagation have efficient implementations
using matrix operations, enabling their use in large-scale neural networks.
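To make the interplay concrete, here is a minimal NumPy sketch of SGD with
backpropagation for a one-hidden-layer network on a toy XOR task (sigmoid
activations, squared-error loss; all names and sizes are illustrative choices):

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Toy data: the XOR function (4 points, 2 features)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

W1 = rng.normal(scale=0.5, size=(2, 4)); b1 = np.zeros(4)  # hidden layer
W2 = rng.normal(scale=0.5, size=(4, 1)); b2 = np.zeros(1)  # output layer
eta = 0.5                                                  # learning rate

for step in range(20000):
    i = rng.integers(len(X))                  # SGD: sample one training example
    x, t = X[i], y[i]
    z = sigmoid(x @ W1 + b1)                  # forward pass: hidden outputs
    o = sigmoid(z @ W2 + b2)                  # forward pass: network output
    delta_o = (o - t) * o * (1 - o)           # error signal at the output layer
    delta_h = (delta_o @ W2.T) * z * (1 - z)  # backpropagated hidden error signal
    W2 -= eta * np.outer(z, delta_o); b2 -= eta * delta_o
    W1 -= eta * np.outer(x, delta_h); b1 -= eta * delta_h

# Outputs approach [0, 1, 1, 0] as training converges
print(sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2).round(2))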

Neural Networks: Theoretical Foundations and Algorithms


Concept Overview
Neural networks are a powerful class of machine learning models inspired by the
structure and function of biological brains. They are composed of interconnected
nodes called neurons, organised into layers: an input layer, one or more
hidden layers, and an output layer. The connections between neurons have
associated weights, which determine the strength of the signal passed between
them.
• Multilayer Perceptron (MLP): A feedforward neural network with
one or more hidden layers. Information flows in one direction, from input
to output, through the network.
• Deep Multilayer Perceptron: An MLP with multiple hidden layers.
Deep networks have shown exceptional performance in various tasks due
to their ability to learn complex hierarchical representations of data.

Mathematical Foundation
Activation Functions Activation functions introduce non-linearity to the
network, enabling it to learn complex relationships between inputs and outputs.
Common activation functions include:

• Sigmoid: Maps the input to a value between 0 and 1, often used in the
output layer for binary classification.

σ(a) = 1 / (1 + exp(−a))

• Threshold: Outputs 1 if the input exceeds a threshold and 0 otherwise.

𝜎(𝑎) = 1[𝑎 > 0]

• ReLU (Rectified Linear Unit): Outputs the input if it is positive
and 0 otherwise. Widely used in hidden layers due to its computational
efficiency.

σ(a) = max(0, a)
• Softmax: Generalization of the sigmoid function for multi-class classifi-
cation, producing a probability distribution over the output classes.

π_i(x) = exp(b_i + w_i^T x) / ∑_{j=1}^{K} exp(b_j + w_j^T x), for all i = 1, 2, ..., K

Error Functions
Error functions measure the difference between the network’s predicted output
and the true target values. Common error functions include:
• Squared Error: Measures the average squared difference between the
predicted and target values.

SSE = ∑_{i=1}^{n} ε_i² = ∑_{i=1}^{n} (y_i − ŷ_i)² = ∑_{i=1}^{n} (y_i − b − w^T x_i)²

• Cross-entropy: Measures the dissimilarity between the predicted and
true probability distributions for multi-class classification.

E_x = − ∑_{i=1}^{K} y_i · ln(o_i)

• Binary Cross-entropy: Special case of cross-entropy for binary classifi-
cation.

E_x = −(y · ln(o) + (1 − y) · ln(1 − o))
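These three error functions are one-liners in code; a minimal NumPy sketch
(vector inputs assumed, function names illustrative):

import numpy as np

def squared_error(y, y_hat):
    return np.sum((y - y_hat) ** 2)                   # SSE over all examples

def cross_entropy(y, o):
    return -np.sum(y * np.log(o))                     # y one-hot, o softmax output

def binary_cross_entropy(y, o):
    return -(y * np.log(o) + (1 - y) * np.log(1 - o))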

Feed-forward Phase
During the feed-forward phase, an input vector is propagated through the
network, layer by layer, to produce an output. At each layer, the input to a
neuron is a weighted sum of outputs from the previous layer, passed through
an activation function.
Mathematical Representation:

For an MLP with one hidden layer:

𝑜 = 𝑓(𝑏𝑜 + 𝑊𝑜𝑇 𝑧) = 𝑓(𝑏𝑜 + 𝑊𝑜𝑇 ⋅ 𝑓(𝑏ℎ + 𝑊ℎ𝑇 𝑥))

For a deep MLP with h hidden layers:

o = f_{h+1}(b_h + W_h^T · f_h(b_{h−1} + W_{h−1}^T · f_{h−1}(... f_2(b_1 + W_1^T · f_1(b_0 + W_0^T · x)))))

Backpropagation Algorithm
Backpropagation is an efficient algorithm for computing the gradients of the
error function with respect to the network’s weights. These gradients are then
used to update the weights via gradient descent, minimizing the error and
improving the network’s performance.
Key Steps:
1. Compute the net gradient at the output layer: Calculate the partial
derivative of the error function with respect to the net input of each output
neuron. This depends on the chosen error function and activation function.
2. Backpropagate the gradients to hidden layers: Recursively compute
the net gradients for each hidden layer, starting from the output layer and
moving backward. The net gradient at a hidden neuron is a weighted sum
of the gradients from the next layer, multiplied by the derivative of the
activation function.
3. Update weights and biases: Use the computed gradients to update
the weights and biases for each layer using gradient descent.

Algorithms
Multilayer Perceptron (MLP) with One Hidden Layer
Purpose: Learn complex relationships between inputs and outputs for both
regression and classification tasks.
Algorithm:
1. Initialization: Initialize the weight matrices and bias vectors to small
random values.
2. Feed-forward phase: For each training example, propagate the input
through the network to compute the predicted output.
3. Backpropagation phase: Calculate the net gradients for the output and
hidden layers using the chosen error function and activation functions.
4. Gradient Descent: Update the weight matrices and bias vectors using
the computed gradients and a chosen learning rate.
5. Iteration: Repeat steps 2-4 for a specified number of iterations or until
convergence.

Deep Multilayer Perceptron (Deep MLP)
Purpose: Learn highly complex and hierarchical representations of data by
utilizing multiple hidden layers.
Algorithm:
The algorithm for Deep MLPs is similar to that of MLPs with one hidden layer,
but with additional steps for backpropagating gradients through multiple hidden
layers.
1. Initialization: Initialize weight matrices and bias vectors for all layers.
2. Feed-forward phase: Propagate the input through all layers to compute
the output.
3. Backpropagation phase: Compute net gradients for the output layer
and backpropagate them recursively through all hidden layers.
4. Gradient Descent: Update weights and biases for all layers using the
computed gradients.
5. Iteration: Repeat steps 2-4 until convergence or for a specified number
of iterations.

Stochastic Gradient Descent (SGD)
Purpose: Efficiently train neural networks by updating weights based on the
gradient computed from a single training example or a small batch of examples.
Algorithm:
1. Initialization: Initialize weights randomly.
2. Iteration:
• Randomly select a training example (or a mini-batch of examples).
• Perform the feed-forward phase to compute the predicted output.
• Use the backpropagation algorithm to calculate the gradients of the
error function with respect to the weights.
• Update the weights using the computed gradients and a chosen learn-
ing rate.
3. Repeat: Iterate for a specified number of epochs (passes over the entire
training dataset) or until convergence.
Efficiency:
SGD is computationally more efficient than traditional gradient descent, espe-
cially for large datasets. However, the convergence can be noisy due to the
randomness in example selection.
Variants:
Several variants of SGD exist, including mini-batch SGD (using a small batch of
examples for gradient computation) and techniques like momentum and adap-
tive learning rates to improve convergence.
By understanding the fundamental concepts of neural networks, their mathemat-
ical underpinnings, and the algorithms involved in their training, you can gain
a strong foundation for applying these powerful models to various real-world
problems.

Neural Networks: Activation Functions
Concept Overview
Activation functions are a crucial component of neural networks. Their pri-
mary purpose is to introduce non-linearity into the network’s computations.
Without activation functions, a neural network would simply be a linear combi-
nation of its inputs, limiting its capacity to model complex relationships in data.
Activation functions allow neural networks to approximate arbitrarily complex
functions, making them suitable for a wide range of tasks.
Activation functions are applied to the net input of a neuron, which is the
weighted sum of its inputs plus a bias term. The output of the activation func-
tion then becomes the neuron’s output, which can be passed to other neurons
in subsequent layers.

Mathematical Foundation
Common Activation Functions and Their Derivatives
• Linear Function: The simplest activation function, outputting its input
directly. While straightforward, it limits the network to learning only
linear relationships.

f(z) = z
∂f(z)/∂z = 1

• Sigmoid Function: A smooth, S-shaped function that squashes the input
to a range between 0 and 1. Commonly used in output layers for binary
classification tasks.

f(z) = 1 / (1 + exp(−z))
∂f(z)/∂z = f(z) · (1 − f(z))

• Hyperbolic Tangent (tanh) Function: Similar to the sigmoid function
but outputs values between -1 and 1. Often used in hidden layers due to
its zero-centered property.

f(z) = (exp(z) − exp(−z)) / (exp(z) + exp(−z)) = (exp(2z) − 1) / (exp(2z) + 1)
∂f(z)/∂z = 1 − f(z)²

• Rectified Linear Unit (ReLU) Function: Outputs the input if positive,
otherwise outputs 0. A popular choice for hidden layers due to its
computational efficiency and effectiveness in preventing vanishing gradients.

f(z) = max(0, z)
∂f(z)/∂z = 1 if z > 0, 0 otherwise.

• Softmax Function: A generalization of the sigmoid function for multi-
class classification, producing a probability distribution over all classes.
Primarily used in the output layer.

f(z_i|z) = exp(z_i) / ∑_{j=1}^{p} exp(z_j)
∂f(z_i|z)/∂z_j = f(z_i) · (1 − f(z_i)) if j = i, and −f(z_i) · f(z_j) if j ≠ i.

Derivatives of Activation Functions
The derivatives of activation functions are essential for the backpropagation
algorithm, which calculates the
gradients of the error function with respect to the network’s weights. These
gradients indicate the direction and magnitude of weight adjustments needed to
minimize the error. The choice of activation function influences the effectiveness
of training, affecting issues like vanishing gradients.
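The functions and derivatives above map directly to code; a minimal NumPy
sketch (function names illustrative):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def d_sigmoid(z):
    s = sigmoid(z)
    return s * (1 - s)            # f(z) · (1 − f(z))

def d_tanh(z):
    return 1 - np.tanh(z) ** 2    # 1 − f(z)²

def relu(z):
    return np.maximum(0, z)

def d_relu(z):
    return (z > 0).astype(float)  # 1 if z > 0, else 0

def softmax(z):
    e = np.exp(z - z.max())       # subtract max for numerical stability
    return e / e.sum()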

Concepts Involved
• Net Input: The weighted sum of inputs to a neuron, calculated as:
net_k = b_k + ∑_{i=1}^{d} w_ik · x_i = b_k + w^T x

where:
– 𝑛𝑒𝑡𝑘 is the net input of neuron 𝑘
– 𝑏𝑘 is the bias term for neuron 𝑘
– 𝑤𝑖𝑘 is the weight connecting input neuron 𝑥𝑖 to neuron 𝑘
– 𝑥𝑖 is the output of the 𝑖th input neuron
– 𝑤 and 𝑥 are the weight and input vectors, respectively
• Neuron Output: The result of applying the activation function to the
net input:
𝑧𝑘 = 𝑓(𝑛𝑒𝑡𝑘 )
where:
– 𝑧𝑘 is the output of neuron 𝑘
– 𝑓 is the activation function
• Backpropagation: An algorithm for efficiently calculating the gradients
of the error function with respect to the network’s weights, utilizing the
chain rule of calculus and the derivatives of activation functions.

Understanding activation functions and their role in neural networks is funda-
mental to comprehending how these models learn and make predictions. Se-
lecting the appropriate activation function can significantly impact a network’s
performance and training efficiency.

Classification Techniques
This study material summarises key theoretical concepts related to Decision
Trees, Random Forests, and Ensemble Techniques, tailored for a Stanford Mas-
ters level understanding.

Decision Trees
Concept Overview A decision tree is a supervised learning model that
predicts the target value (class label) of an instance by learning simple decision
rules inferred from the data features. The tree structure comprises internal
nodes representing tests on attributes, branches representing outcomes of the
tests, and leaf nodes representing class labels or predictions.
Key advantages of decision trees:
• Easy to understand and interpret. The decision rules can be readily visu-
alized and explained, even to non-technical audiences.
• Handle both numerical and categorical data.
• Non-parametric method: They don’t make strong assumptions about the
underlying data distribution, making them robust to outliers and diverse
data patterns.

Mathematical Foundation
The core of decision tree learning involves recursively partitioning the data
space based on attribute tests that maximize the “purity” of resulting subsets
with respect to class labels. Several metrics quantify this purity:
• Entropy: Measures the impurity or randomness in a set of instances. For
a set D with k classes:

H(D) = − ∑_{i=1}^{k} P(c_i) · log₂(P(c_i))

where P(ci) is the proportion of instances belonging to class ci in D.


• Information Gain: Quantifies the reduction in entropy achieved by split-
ting a dataset D based on an attribute test.

𝐺𝑎𝑖𝑛(𝐷, 𝐷𝑌 , 𝐷𝑁 ) = 𝐻(𝐷) − 𝐻(𝐷𝑌 , 𝐷𝑁 )

where D_Y and D_N are the subsets of D resulting from the split, and
H(D_Y, D_N) = (|D_Y|/|D|) · H(D_Y) + (|D_N|/|D|) · H(D_N) is their weighted
average entropy. The algorithm chooses the attribute test that yields the
highest information gain.
• Gini Index: Another impurity measure, calculated as:

Gini(D) = 1 − ∑_{i=1}^{k} (P(c_i))²

A split is chosen to minimize the weighted average Gini index of the re-
sulting subsets.
• CART Measure: Specifically used in the CART (Classification and
Regression Trees) algorithm. For a binary split:

CART(D_Y, D_N) = 2 · (|D_Y|/|D|) · (|D_N|/|D|) · ∑_{i=1}^{k} |P(c_i|D_Y) − P(c_i|D_N)|

It favors splits that lead to a large difference in class distributions between
the two subsets.
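A minimal Python sketch of these purity measures (names illustrative; labels
is any sequence of class labels, and a binary split is given as two such
sequences):

import numpy as np
from collections import Counter

def entropy(labels):
    probs = np.array(list(Counter(labels).values())) / len(labels)
    return -np.sum(probs * np.log2(probs))

def gini(labels):
    probs = np.array(list(Counter(labels).values())) / len(labels)
    return 1 - np.sum(probs ** 2)

def information_gain(labels, left, right):
    n = len(labels)
    split_entropy = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - split_entropy

# A pure split of a balanced binary set yields the maximum gain of 1 bit:
print(information_gain(list("YYNN"), list("YY"), list("NN")))  # 1.0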

Algorithms

Decision Tree Learning (ID3, C4.5, CART)
These algorithms follow a greedy, recursive approach to construct the decision
tree:
1. Start with the entire dataset D at the root node.
2. For each attribute, evaluate all possible split points. This involves
calculating the chosen impurity metric (entropy, Gini index, etc.) for every
potential split.
3. Select the attribute and split point that results in the best im-
purity reduction. For example, the split with the highest information
gain.
4. Create child nodes representing the subsets of data resulting
from the split.
5. Recursively repeat steps 2-4 for each child node, until a stop-
ping criterion is met. This could be reaching a maximum tree depth,
minimum number of instances in a node, or a purity threshold.
6. Assign a class label to each leaf node. Typically, this is the majority
class among the instances in that node.
Pseudocode (Generic Decision Tree Algorithm):
function DECISIONTREE(D, evaluation_metric, stopping_criteria):
    if stopping_criteria(D):
        return leaf_node(majority_class(D))
    else:
        best_split = find_best_split(D, evaluation_metric)
        create_child_nodes(best_split)
        for each child_node:
            child_node = DECISIONTREE(child_node.data, evaluation_metric, stopping_criteria)
        return root_node(best_split, child_nodes)
Computational Complexity:
• Evaluating all split points for a numeric attribute takes O(n log n) time,
where n is the number of instances.
• Evaluating categorical splits depends on the number of possible partitions,
and can be O(n log n) if the size of partitions is bounded.
• Overall complexity can be O(n d log2 n) in the worst case, where d is the
number of attributes.
Applications:
• Credit risk assessment: Predict the likelihood of loan default based on
applicant features.
• Medical diagnosis: Classify patients into disease categories based on
symptoms and test results.
• Customer churn prediction: Identify customers likely to stop using a
service.

Random Forest
Concept Overview A random forest is an ensemble learning method that
constructs a multitude of decision trees during training and outputs the predic-
tion that is the mode of the classes (classification) or mean/average prediction
(regression) of the individual trees. It addresses the overfitting issue that can
occur with single decision trees by introducing randomness in two ways:
1. Bootstrap aggregating (Bagging): Each tree is trained on a different
random subset of the training data, sampled with replacement.
2. Random Subspace: At each node, a random subset of features is con-
sidered for splitting, further increasing diversity among trees.

Mathematical Foundation The key mathematical idea behind random


forests is variance reduction through averaging. By aggregating predic-
tions from multiple, decorrelated trees, the variance of the final prediction is
reduced, leading to improved generalization performance.

Algorithms

Random Forest Algorithm


1. For t = 1 to K (number of trees):

• Draw a bootstrap sample Dt from the original dataset D.
• Grow a decision tree Mt on Dt. At each node, randomly select
a subset of p features and choose the best split among them. Grow
the tree to full depth or until a stopping criterion is met.
2. Output the ensemble of trees {M1, M2, …, MK}.
Prediction:
• For a new instance, predict the class label by taking a majority vote among
the predictions of all K trees.
Pseudocode:
function RANDOMFOREST(D, K, p, stopping_criteria):
    trees = []
    for t = 1 to K:
        Dt = bootstrap_sample(D)
        Mt = DECISIONTREE(Dt, p, stopping_criteria)
        trees.append(Mt)
    return trees

function predict(trees, instance):
    votes = []
    for tree in trees:
        votes.append(tree.predict(instance))
    return majority_vote(votes)
Computational Complexity:
• Training time is roughly K times that of a single decision tree, but can be
parallelized as trees are built independently.
• Prediction time is O(K · tree depth).
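For practical use, libraries implement this algorithm directly; a brief sketch
assuming scikit-learn is available (K trees and p features per split map to
n_estimators and max_features; the toy data is illustrative):

import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))            # toy data: 200 points, 5 features
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # toy labels

# K = 100 trees, p = sqrt(d) random features considered at each split
model = RandomForestClassifier(n_estimators=100, max_features="sqrt").fit(X, y)
print(model.predict(X[:5]))              # majority vote across the trees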
Applications:
• Object detection in computer vision: Classify image regions as con-
taining specific objects.
• Fraud detection: Identify fraudulent transactions based on transaction
patterns.
• Recommendation systems: Predict user preferences for products or
content.

Ensemble Techniques
Concept Overview Ensemble techniques combine multiple individual
models (base learners) to produce a more powerful and robust predictor. The
idea is that the weaknesses of individual models can be offset by the strengths
of others, leading to improved accuracy, generalization, and stability.
Key Benefits of Ensembles:

• Improved Accuracy: Ensembles often outperform single models, espe-
cially when base learners are diverse and make different types of errors.
• Robustness: Less sensitive to noise and outliers in the data, as errors
from individual models are averaged out.
• Generalization: Can better capture complex relationships in the data,
reducing overfitting.

Types of Ensemble Techniques:


• Bagging (Bootstrap Aggregating): Create multiple base learners by
training them on different bootstrap samples of the data. Predictions are
combined by averaging (regression) or voting (classification). Example:
Random Forest.
• Boosting: Sequentially train base learners, where each learner focuses
on correcting the mistakes of previous ones. Weights are assigned to in-
stances, with higher weights given to misclassified examples. Predictions
are combined by weighted voting. Examples: AdaBoost, Gradient Boost-
ing.
• Stacking: Combine multiple base learners using a meta-learner that
learns how to best weight their predictions. The base learners are trained
on the original data, and their predictions are used as input features to
train the meta-learner. Example: Using logistic regression to combine
predictions from SVM, KNN, and Decision Tree.

Mathematical Foundation
The effectiveness of ensemble methods often relies on the concept of diversity.
Diverse base learners make different kinds of errors, allowing their combination
to better approximate the true underlying function. Metrics like diversity
measures quantify the degree of disagreement among base learners.

Algorithms (Boosting)

AdaBoost
1. Initialize instance weights: Set wi = 1/m for all instances i = 1, 2, …,
m.
2. For t = 1 to T (number of iterations):
• Train a weak learner Mt on the data, weighted by w.
• Calculate the weighted error rate of Mt:
ε_t = ∑_{i=1}^{m} w_i · I(M_t(x_i) ≠ y_i)

• Compute the learner weight:

α_t = (1/2) · ln((1 − ε_t) / ε_t)

• Update instance weights:

𝑤𝑖 = 𝑤𝑖 ⋅ 𝑒𝑥𝑝(−𝛼𝑡 ⋅ 𝑦𝑖 ⋅ 𝑀𝑡 (𝑥𝑖 ))

• Normalize instance weights to sum to 1.


3. Output the ensemble of weighted weak learners {(M_1, α_1), (M_2, α_2), …,
(M_T, α_T)}.
Prediction:
• For a new instance, predict the class label by weighted majority voting:
M_T(x) = sign(∑_{t=1}^{T} α_t · M_t(x))

Pseudocode:
function ADABOOST(D, T, weak_learner):
    m = len(D)
    weights = [1/m] * m
    ensemble = []
    for t = 1 to T:
        Mt = weak_learner(D, weights)
        error_t = calculate_weighted_error(D, Mt, weights)
        alpha_t = 0.5 * ln((1 - error_t) / error_t)
        update_weights(D, Mt, weights, alpha_t)
        ensemble.append((Mt, alpha_t))
    return ensemble
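To make the pseudocode concrete, here is a self-contained NumPy sketch using
one-feature decision stumps as the weak learner (labels assumed in {−1, +1};
all names are illustrative):

import numpy as np

def best_stump(X, y, w):
    """Weighted-error-minimizing one-feature threshold stump."""
    best = None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for sign in (1, -1):
                pred = sign * np.where(X[:, j] > thr, 1, -1)
                err = np.sum(w[pred != y])
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    return best

def adaboost(X, y, T):
    m = len(y)
    w = np.full(m, 1.0 / m)             # uniform initial instance weights
    ensemble = []
    for _ in range(T):
        err, j, thr, sign = best_stump(X, y, w)
        alpha = 0.5 * np.log((1 - err) / max(err, 1e-10))
        pred = sign * np.where(X[:, j] > thr, 1, -1)
        w *= np.exp(-alpha * y * pred)  # up-weight misclassified instances
        w /= w.sum()                    # normalize weights to sum to 1
        ensemble.append((alpha, j, thr, sign))
    return ensemble

def predict(ensemble, X):
    agg = sum(a * s * np.where(X[:, j] > t, 1, -1) for a, j, t, s in ensemble)
    return np.sign(agg)                 # weighted majority vote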
Computational Complexity:
• Depends on the complexity of the weak learner and the number of boosting
iterations.
Applications:
• Face detection: AdaBoost was widely used in early face detection sys-
tems.
• Spam filtering: Classify emails as spam or not spam.
• Text categorization: Assign documents to predefined categories.
Choosing the Right Technique: The choice between decision trees, random
forests, and specific ensemble techniques depends on factors like dataset size,
dimensionality, noise levels, and desired interpretability. For example:
• For highly interpretable models, single decision trees can be suitable.
• For improved accuracy and generalization, random forests are often a good
choice.
• For complex datasets with potential for high bias, boosting methods might
be preferred.

Remember that understanding the underlying concepts and the strengths and
weaknesses of each method is essential for making informed choices and achiev-
ing optimal classification performance.

Classification Techniques Study Material: Linear Discriminant Analysis
Concept Overview
Linear Discriminant Analysis (LDA) is a supervised dimensionality reduc-
tion and classification technique. It seeks to find linear combinations of
features that best separate different classes within a dataset. This optimal
separation is achieved by maximising the ratio of between-class scatter
to within-class scatter. LDA is widely employed in pattern recognition, ma-
chine learning, and computer vision for dimensionality reduction and improving
classification accuracy.

Mathematical Foundation
Scatter Matrices
• Within-class scatter matrix (S): Represents the scatter of data points
within each class.
• Between-class scatter matrix (B): Represents the scatter between the
means of different classes.

Objective
The primary goal of LDA is to determine a projection vector (w) that
maximizes the following ratio:

(w^T B w) / (w^T S w)

This ratio represents the separation between classes (numerator) relative
to the scatter within classes (denominator). Maximising this ratio ensures
that classes are well-separated after projection onto w.

Fisher’s Linear Discriminant
The optimal projection vector w, also known as Fisher’s Linear Discriminant,
is the eigenvector corresponding to the largest eigenvalue of the matrix S⁻¹B.

Derivation of the Optimal Projection Vector
1. Projection: Projecting a data point x onto the vector w results in a
scalar value:

w^T x

2. Means: Let µ1 and µ2 represent the means of the two classes in the
original feature space. Their projections onto w are:

w^T µ1 and w^T µ2
3. Scatter: The scatter of the projected points for each class ci is calculated
as:

s_i² = ∑_{x∈c_i} (w^T x − w^T µ_i)²

4. Objective: The goal is to find the vector w that maximizes the difference
between projected means while minimizing the scatter within each class:

max_w (w^T µ1 − w^T µ2)² / (s_1² + s_2²)
5. Rewriting with Scatter Matrices: The above expression can be rewrit-
ten using the between-class and within-class scatter matrices, leading to
the ratio:

(w^T B w) / (w^T S w)
6. Eigenvalue Problem: The optimal w that maximizes this ratio is the
eigenvector corresponding to the largest eigenvalue of the matrix S⁻¹B.

Algorithms
Linear Discriminant Analysis (LDA) Algorithm
Input: Dataset D with n points x_i ∈ R^d and corresponding class labels
y_i ∈ {c1, c2}.
Output: Optimal linear discriminant vector w.
1. Class-Specific Subsets: Partition D into two subsets, D1 and D2, based
on class labels.
2. Class Means: Calculate the mean vectors µ1 and µ2 for D1 and D2.
3. Between-Class Scatter Matrix: Compute the between-class scatter
matrix B = (µ1 − µ2)(µ1 − µ2)^T.
4. Centred Data Matrices: For each class, create a centred data matrix
Di by subtracting the corresponding mean vector µi from each data point.
5. Class Scatter Matrices: Calculate the within-class scatter matrices S1
and S2 as S_i = D_i^T D_i.
6. Within-Class Scatter Matrix: Compute the total within-class scatter
matrix S = S1 + S2.

7. Eigenvalue Decomposition: Find the dominant eigenvector w of S⁻¹B.
This w represents the optimal linear discriminant direction.
Time Complexity: O(d³ + nd²). The most computationally expensive steps
are calculating the within-class scatter matrix S (O(nd²)) and solving the
eigenvalue problem (O(d³) in the worst case).
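For the two-class case, the dominant eigenvector of S⁻¹B is proportional to
S⁻¹(µ1 − µ2), so the whole algorithm reduces to a few lines; a minimal NumPy
sketch (names illustrative):

import numpy as np

def lda_direction(X1, X2):
    """Fisher's linear discriminant for two classes (rows of X1, X2 are points)."""
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    D1, D2 = X1 - mu1, X2 - mu2        # centred data matrices
    S = D1.T @ D1 + D2.T @ D2          # within-class scatter S = S1 + S2
    w = np.linalg.solve(S, mu1 - mu2)  # proportional to top eigenvector of S⁻¹B
    return w / np.linalg.norm(w)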

Classification Techniques Study Material: Probabilistic Classification
Concept Overview
Probabilistic classification predicts the class label of a given input point
using probability theory. The core idea is to estimate the probability of an
instance belonging to each class and then select the class with the highest
posterior probability. This approach not only provides a prediction but also
a measure of confidence, which can be valuable in decision-making processes.
Probabilistic classification is a fundamental classification approach with
diverse methods like Bayes Classifier, Naive Bayes, and k-Nearest Neigh-
bors derived from this framework. It finds applications in various fields, includ-
ing:
• Medical diagnosis: Predicting the likelihood of a disease based on symp-
toms and patient history.
• Spam filtering: Determining the probability of an email being spam.
• Credit scoring: Assessing the risk of loan default based on financial
history.
• Image recognition: Assigning probabilities to different object categories
in an image.

Mathematical Foundation
Bayes’ Theorem At the heart of probabilistic classification lies Bayes’ the-
orem, a fundamental concept in probability theory that allows us to update our
beliefs about an event based on new evidence. In the context of classification,
Bayes’ theorem calculates the posterior probability P(ci|x) of an instance x
belonging to class ci given the following:
• Likelihood P(x|ci): The probability of observing data point x given that
it belongs to class ci.
• Prior Probability P(ci): The probability of class ci occurring in the
dataset.
• Evidence P(x): The probability of observing data point x regardless of
the class.
Bayes’ theorem is formulated as follows:

P(c_i|x) = P(x|c_i) · P(c_i) / P(x)

Derivation:
Bayes’ theorem stems from the definition of conditional probability:

P(A|B) = P(A ∩ B) / P(B)

where:
• P(A|B) is the probability of event A occurring given that event B has
occurred.
• P(A ∩ B) is the probability of both events A and B occurring.
• P(B) is the probability of event B occurring.
Applying this to our case, we get:

P(c_i|x) = P(c_i ∩ x) / P(x)

P(x|c_i) = P(x ∩ c_i) / P(c_i)

From the second equation, we have P(x ∩ c_i) = P(x|c_i)P(c_i). Substituting
this into the first equation yields Bayes’ theorem.

Maximum A Posteriori (MAP) Estimation
The predicted class label ŷ is determined using Maximum A Posteriori
(MAP) estimation, which chooses the class with the highest posterior
probability:

ŷ = argmax_{c_i} P(c_i|x)

Estimating Likelihood and Prior Probabilities
To apply Bayes’ theorem, we need to estimate the likelihood and prior
probabilities from the training data.
• Prior probabilities are usually estimated as the relative frequency of
each class in the training set.
• Likelihood estimation depends on the nature of the data and the as-
sumptions made about its distribution. For instance, if we assume a mul-
tivariate normal distribution, we can estimate the mean and covari-
ance matrix for each class from the training data.

Algorithms
Bayes Classifier The Bayes classifier directly applies Bayes’ theorem to pre-
dict the class label. Here’s a step-by-step explanation:
1. Estimate prior probabilities P(ci) for each class from the training
data. This can be done by calculating the relative frequency of each class.
2. Estimate the likelihood function P(x|ci) for each class. This in-
volves choosing a suitable probability distribution that represents the data
and estimating its parameters from the training data. For example, the
likelihood can be modelled using a multivariate normal distribution
where the mean (µi) and covariance matrix (Σi) are estimated for
each class ci.
3. For a new data point x, calculate the posterior probability
P(ci|x) for each class using Bayes’ theorem.
4. Predict the class label ŷ as the class with the maximum posterior
probability using MAP estimation.
Pseudocode:
Algorithm: Bayes Classifier

Input: Training dataset D, test point x

Output: Predicted class label ŷ

// Training phase
for each class c_i:
    Estimate prior probability P(c_i)
    Estimate likelihood function P(x|c_i)

// Testing phase
for each class c_i:
    Calculate posterior probability P(c_i|x) using Bayes' theorem

ŷ = argmax_ci P(c_i|x)

return ŷ
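As a concrete instance with Gaussian class-conditional densities (SciPy
assumed available; function names illustrative), the two phases become:

import numpy as np
from scipy.stats import multivariate_normal

def fit_bayes(X, y):
    """Training phase: priors plus Gaussian likelihood parameters per class."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = (len(Xc) / len(X),           # prior P(c_i)
                     Xc.mean(axis=0),            # mean µ_i
                     np.cov(Xc, rowvar=False))   # covariance Σ_i
    return params

def predict_bayes(params, x):
    """Testing phase: MAP estimate; P(x) is constant across classes."""
    return max(params, key=lambda c: params[c][0] *
               multivariate_normal.pdf(x, mean=params[c][1], cov=params[c][2]))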

Naive Bayes Classifier The Naive Bayes classifier simplifies the Bayes clas-
sifier by assuming conditional independence among attributes, given the
class label. While this assumption might be naive in many real-world scenarios,
it significantly reduces computational complexity and often leads to surprisingly
good performance, particularly for high-dimensional data.
Here’s how Naive Bayes works:
1. Estimate prior probabilities P(ci) for each class from the training

data.
2. For each attribute Xj, estimate the class-conditional probabili-
ties P(Xj|ci) for each class ci.
3. For a new data point x, calculate the posterior probability for
each class using the naive Bayes formula:

P(c_i|x) ∝ P(c_i) · ∏_{j=1}^{d} P(x_j|c_i)

4. Predict the class label ŷ as the class with the maximum posterior
probability.
Pseudocode:
Algorithm: Naive Bayes Classifier

Input: Training dataset D, test point x

Output: Predicted class label ŷ

// Training phase
for each class c_i:
    Estimate prior probability P(c_i)
    for each attribute X_j:
        Estimate class-conditional probability P(X_j|c_i)

// Testing phase
for each class c_i:
    Calculate posterior probability P(c_i|x) using the naive Bayes formula

ŷ = argmax_ci P(c_i|x)

return ŷ

k-Nearest Neighbors (KNN) Classifier
The KNN classifier is a non-parametric probabilistic classifier. It directly
uses the data sample to estimate
the density, without making any assumptions about the underlying joint
probability density function. It classifies a new data point based on the
majority class among its k nearest neighbors in the training data. Here’s a
simplified explanation of its process:
1. Calculate the distance between the new data point and all data
points in the training set.
2. Select the k data points closest to the new data point as its k
nearest neighbors.
3. Assign the class label to the new data point based on the ma-
jority class among its k nearest neighbors.

Mathematical Formulation:
The posterior probability of a new point x belonging to class ci is estimated as
follows:

P(c_i|x) ≈ K_i / K
where:
• Ki represents the number of points among the K nearest neighbours of x
that belong to class ci.
• K is the total number of nearest neighbours.
This formula essentially reflects the proportion of neighbours belonging to a
specific class.
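A minimal NumPy sketch of this estimate (Euclidean distance assumed; names
illustrative; y_train is an array of class labels):

import numpy as np
from collections import Counter

def knn_posteriors(X_train, y_train, x, k):
    """Estimate P(c_i|x) as K_i / K among the k nearest neighbours of x."""
    dists = np.linalg.norm(X_train - x, axis=1)  # distances to all training points
    nearest = y_train[np.argsort(dists)[:k]]     # labels of the k closest points
    counts = Counter(nearest.tolist())
    return {c: counts[c] / k for c in counts}    # class -> estimated posterior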
Use Cases and Advantages:
KNN is particularly useful when the decision boundary is complex or non-
linear. It’s a lazy learning algorithm as it doesn’t build an explicit model
during training but rather relies on the training data directly during prediction.
However, KNN can be computationally expensive for large datasets and is sen-
sitive to the choice of distance metric and the value of k.
This study material provides a foundational understanding of probabilistic clas-
sification techniques. It equips Stanford masters students with the necessary
knowledge to explore more advanced machine learning concepts.

