
MACHINE LEARNING – PART 1 | MIAS M1
Idriss JAIRI
[email protected]
COURSE PLAN
Session 1: Classical Machine Learning

Session 2: Deep Learning

Session 3: Deep Learning for Computer Vision

Session 4: YOLO for Object Detection

Session 5: U-NET for Biomedical Image Segmentation

Session 6: Reinforcement Learning

20XX 2
SESSION 1: MACHINE
LEARNING
Introduction to AI, Supervised Learning, Unsupervised
Learning

20XX
Introduction: Why has Artificial Intelligence (AI)
become so popular?

• Big tech companies invest a lot of effort and money in AI
• Scientific research and impressive results
20XX 4
Introduction: Why has Artificial Intelligence (AI) become
so popular? <Increased volumes of data>

Note: 1 Zettabyte = 10^12 Gigabytes

Source: explodingtopics.com

20XX 5
Introduction: Why has Artificial Intelligence (AI) become
so popular? <Increased volumes of data>

• Data Availability: Vast amounts of structured


and unstructured data are available for training
and testing AI models, which is crucial for their
learning and decision-making processes.
• Data Collection and Storage Technologies:
Innovations in data collection methods, sensors,
and storage technologies have made it possible
to capture and store vast quantities of data.
This is a critical component for training and
refining AI algorithms.

20XX 6
Introduction: Why has Artificial Intelligence (AI) become
so popular? <Advanced algorithms>

1. Machine Learning Algorithms: Breakthroughs in machine learning algorithms,


particularly in areas like deep learning, have significantly improved the performance
of AI systems. Deep learning, in particular, has proven to be highly effective in tasks
like image recognition, natural language processing, and more.
2. Transfer Learning and Pretrained Models: Techniques like transfer learning, where a
model trained on a large dataset for one task is adapted for another related task,
have accelerated progress in various domains. Pretrained models, which are models
that have been trained on large datasets and then fine-tuned for specific tasks, have
become a powerful tool for many applications.
3. Reinforcement Learning and Generative Models: Advances in reinforcement learning
have enabled AI systems to learn through interaction with their environment, making
them well-suited for tasks like autonomous control. Generative models, such as
Generative Adversarial Networks (GANs), have enabled the creation of realistic
synthetic data, which has numerous applications.

20XX 7
Introduction: Why has Artificial Intelligence (AI) become
so popular? <Advancements in computing power>

• CPUs (Central Processing Units), GPUs (Graphics Processing


Units), and TPUs (Tensor Processing Units) are different types
of processors designed for specific types of computational
tasks.
• Specialized Hardware: The development of specialized
hardware like Graphics Processing Units (GPUs) and Tensor
Processing Units (TPUs) has further accelerated the training
and deployment of AI models, especially in deep learning
applications.

20XX 8
Introduction: A Brief History of AI

Source: A quick history of AI, ML, and DL


20XX 9
Introduction: A Brief History of AI
<From 1940s to Now>

• 1943: Warren McCulloch and Walter Pitts create a mathematical model of a neural network.
• 1949: Donald Hebb proposes a learning rule for neural networks, known as Hebbian learning.
• 1950: Alan Turing introduces the "Turing Test" as a way to evaluate a machine's ability to exhibit intelligent
behavior.
• 1951: Marvin Minsky and Dean Edmonds build the first neural network computer, SNARC.
• 1956: John McCarthy organizes the Dartmouth Conference, which is considered the birth of AI as a field of study.
"The term artificial intelligence was first coined by John McCarthy in 1956."
• 1957: Frank Rosenblatt develops the perceptron, a simplified model of a biological neuron, which becomes one of
the earliest machine learning algorithms. It lays the groundwork for later developments in neural networks.
• 1960: John McCarthy develops the programming language LISP, which becomes widely used in AI research.
• 1966: Shakey the robot, developed at Stanford Research Institute, demonstrates basic problem-solving abilities.
• 1966: ELIZA is an early natural language processing computer program, the first program that allowed some kind of
plausible conversation between humans and machines.

Note: There were two major AI winters (Dark Age of AI) approximately 1974–1980 and 1987–2000

20XX 10
Introduction: A Brief History of AI
<From 1940s to Now>

• 1968: Terry Winograd develops SHRDLU, a natural language processing system.


• 1970: Expert systems, which are rule-based AI systems, gain popularity. "Expert systems were
among the first truly successful forms of artificial intelligence (AI) software"
• 1972: The MYCIN system, developed at Stanford, becomes one of the first expert systems used for
medical diagnosis.
• 1980: The first commercial expert system, XCON, is deployed at Digital Equipment Corporation.
• 1986: The concept of backpropagation, a key algorithm for training artificial neural networks, is
rediscovered.
• 1997: IBM's Deep Blue defeats world chess champion Garry Kasparov in a six-game match.
• 2002: Roomba, a domestic robot developed by iRobot, is introduced, showcasing advancements in
robotics and AI.
• 2010: Deep learning techniques, particularly convolutional neural networks (CNNs) and recurrent
neural networks (RNNs), gain prominence.
• 2011 : IBM's Watson wins the Jeopardy! game show, demonstrating natural language processing
and question-answering capabilities.

20XX 11
Introduction: A Brief History of AI
<From 1940s to Now>

• 2013: Google's DeepMind develops a deep learning algorithm (Q-Learning) that learns to play Atari 2600
video games at a human-level performance. "Playing Atari with Deep Reinforcement Learning".
• 2016: AlphaGo, a program developed by DeepMind, defeats world Go champion Lee Sedol in a five-game
match.
• 2018: GPT-1 (Generative Pre-trained Transformer ) by OpenAI showcases powerful language generation
capabilities.
• 2019: OpenAI introduces GPT-2, a large-scale language model
• 2020: OpenAI introduces GPT-3, a large-scale language model.
• 2021: AlphaFold is an artificial intelligence program developed by DeepMind, which performs predictions of
protein structure.
• 2022: Generative Pre-trained Transformer 3.5 (GPT-3.5), a subclass of the GPT-3 models, is created by OpenAI.
• 2023: Generative Pre-trained Transformer 4 (GPT-4) is a multimodal large language model created
by OpenAI, and the fourth in its series of GPT foundation models

20XX 12
Introduction: Deep Learning State of the Art

Deep Learning State of the Art (2020)

20XX 13
Introduction: A Brief History of AI

1997: IBM's Deep Blue chess computer beats Garry Kasparov (Chess Grandmaster).
2011: IBM's Watson defeats humans in "Jeopardy!".
2016: Google's AlphaGo (developed by DeepMind) beats Go master Lee Se-dol.
Full Documentary: AlphaGo – The Movie

20XX 14
Artificial Intelligence: New Trends
<Large Language Models (LLMs)>

Large Language Models


(LLMs)

Fig. The Transformer - Model Architecture.


Paper Link: Attention is All you Need
20XX 15
Artificial Intelligence: New Trends
<Text to Image Models>

A text-to-image model is a machine learning model which takes an input natural language description and produces an image matching that description.
Such models began to be developed in the mid-2010s, during the beginnings of the AI spring, as a result of advances in deep neural networks. In 2022, the output of state-of-the-art text-to-image models, such as OpenAI's DALL-E 2/3, Google Brain's Imagen, StabilityAI's Stable Diffusion, and Midjourney, began to approach the quality of real photographs and human-drawn art.

Fig. sources: Midjourney.com, DALL-E 2
Source: wikipedia.org

20XX 16
Artificial Intelligence: New Trends
<Text to Image Models>

Text-to-image models generally combine a language model, which transforms the input text into a latent representation, and a generative image model, which produces an image conditioned on that representation. The most effective models have generally been trained on massive amounts of image and text data scraped from the web.

Fig. sources: Midjourney.com, DALL-E 2
Source: wikipedia.org

20XX 17
Artificial Intelligence: New Trends
<Text to Image Models>

Text-to-image models have been built using a variety of


architectures. The text encoding step may be performed
with a recurrent neural network such as a long short-term
memory (LSTM) network, though transformer models have
since become a more popular option. For the image
generation step, conditional generative adversarial
networks have been commonly used, with diffusion
models also becoming a popular option in recent years.
Rather than directly training a model to output a high-resolution image conditioned on a text embedding, a popular technique is to train a model to generate low-resolution images, and use one or more auxiliary deep learning models to upscale it, filling in finer details.

Fig. Overview of Generative Adversarial Networks (GANs)

20XX 18
Artificial Intelligence, Machine Learning, and Deep Learning.
<What is the difference?>

Source: blogs.nvidia.com

20XX 19
Introduction: The Importance of Mathematics in ML

Several branches of mathematics are essential for understanding and working with machine learning algorithms. These include:
1. Linear Algebra:
• Vectors and Matrices: Vectors and matrices are fundamental
for representing and manipulating data. They are used
extensively in tasks like transformations, regression, and
neural networks.
• Matrix Operations: Operations like addition, multiplication,
and inversion of matrices are essential for various machine
learning algorithms.
2. Calculus:
• Derivatives and Gradients: Calculus is crucial for optimization
algorithms. Understanding derivatives allows for finding the
optimal parameters of a model.
• Integration: Useful in probability theory, which is essential for
many machine learning models.

20XX 20
Introduction: The Importance of Mathematics in ML

1. Probability and Statistics:


• Probability Distributions: Understanding different probability
distributions is crucial for modeling uncertainty and making
predictions.
• Bayesian Inference: It's used in Bayesian methods, which are
important in areas like Bayesian networks and probabilistic
programming.
• Hypothesis Testing and Confidence Intervals: Used for evaluating the
significance of results and estimating uncertainties.
2. Optimization:
• Gradient Descent: A widely used optimization algorithm for training
machine learning models.
• Convex Optimization: Important for problems where the objective
function is convex, which is common in machine learning.
3. Information Theory:
• Entropy, Mutual Information: These concepts are used in feature
selection, dimensionality reduction, and understanding the
information content of data.

20XX 21
Machine Learning: Types of Machine Learning

20XX 22
Machine Learning: Types of Machine Learning

Supervised Learning: Input Data is Labeled Data (e.g., images with labels: Dog, Cat, Dog)

Unsupervised Learning: Input Data is Unlabeled Data

20XX 23
Machine Learning: Types of Machine Learning

Reinforcement Learning

The typical framing of a Reinforcement Learning (RL) scenario: an agent


takes actions in an environment, which is interpreted into a reward and a
representation of the state, which are fed back into the agent.

20XX 24
Machine Learning: Types of Supervised Learning

Source: enjoyalgorithms.com

25
Machine Learning: Supervised Learning Algorithms

Supervised Learning Algorithms
• Classification: Logistic Regression, Decision Tree Classification, Support Vector Classifier, K-NN
• Regression: Linear Regression, Decision Tree Regression, Support Vector Regression, Lasso Regression, Ridge Regression

Fig. Some supervised learning algorithms

26
Supervised Learning Algorithms: Linear Regression

• Linear regression is a fundamental statistical and machine learning technique used to


model the relationship between a dependent variable (target) and one or more
independent variables (features).
• Depending on whether there are one or more independent variables, a distinction is
made between simple and multiple linear regression analysis.
• The output y can be calculated from a linear combination of the input variables X

Source: datatab.net 27
Supervised Learning: Linear Regression

Mathematical model for simple linear regression: y = a·x + b + ε (one independent variable x)

From simple linear regression to multiple linear regression: y = b0 + b1·x1 + b2·x2 + ... + bp·xp + ε (several independent variables x1, ..., xp)

Source: datatab.net 28
Supervised Learning: Linear Regression

Question: How can we find the


optimal values for a and b that fit the
data?

Source: datatab.net 29
Supervised Learning: Linear Regression
<Ordinary Least Squares and Normal Equations>

Using a mathematical approach called Least Squares (Ordinary Least Squares), we can find the values of a and b that minimize the error epsilon ε.

Example of simple linear regression

Source: datatab.net 30
Supervised Learning: Linear Regression
<Ordinary Least Squares and Normal Equations>

Normal equations are equations obtained by setting equal to zero the partial
derivatives of the sum of squared errors (least squares); normal equations allow
one to estimate the parameters of a multiple linear regression.

If you cannot solve it by yourself, you can check this useful link: http://seismo.berkeley.edu/~kirchner/eps_120/Toolkits/Toolkit_10.pdf
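To make the normal equations concrete, here is a minimal NumPy sketch (not part of the original slides; the data values are made up) that solves (XᵀX)θ = Xᵀy for a simple linear regression with slope a and intercept b:

```python
import numpy as np

# Toy data (illustrative only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 6.2, 8.1, 9.9])

# Design matrix: one column for x, one column of ones for the intercept
X = np.column_stack([x, np.ones_like(x)])

# Normal equations: (X^T X) theta = X^T y
theta = np.linalg.solve(X.T @ X, X.T @ y)
a, b = theta
print(f"slope a = {a:.3f}, intercept b = {b:.3f}")
```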

31
Supervised Learning: Linear Regression
<Gradient Descent>

There is another alternative to the normal equations, using


an optimization-based approach called "Gradient Descent"
• Definition: Gradient descent is an iterative optimization
algorithm used for finding the minimum of a function. It's
widely used in machine learning and deep learning for
training models. The basic idea behind gradient descent
is to take steps in the direction of steepest decrease in
the function, i.e., in the direction of the negative gradient.
• Objective: Given a function f(x) that we want to minimize
(Loss function/Cost function), gradient descent aims to
find the value of x that minimizes f(x). Gradient Descent: Iterative optimization algorithm

32
Supervised Learning: Linear Regression
<Gradient Descent>

Gradient: The gradient of a function f(x) is


a vector that points in the direction of the
steepest increase in the function. The
negative gradient points in the direction of
the steepest decrease.

Gradient Descent: Iterative optimization algorithm

33
Supervised Learning: Linear Regression
<Gradient Descent>

• Step 1: Start with a random initialization of the coefficients (α and β) (for example, set both values to 0).
• Step 2: Repeat this "descending" update K times using the learning rate (usually called alpha, but to avoid confusion with the coefficient α we will name it gamma (γ)), as in the sketch below.
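As an illustration of these two steps, here is a short Python sketch (an assumed implementation, not taken from the slides) of gradient descent for simple linear regression with a mean-squared-error cost; the toy data and the learning rate γ = 0.01 are arbitrary choices:

```python
import numpy as np

def gradient_descent(x, y, gamma=0.01, k=1000):
    alpha, beta = 0.0, 0.0              # Step 1: initialize the coefficients to 0
    n = len(x)
    for _ in range(k):                  # Step 2: repeat the update K times
        error = (alpha + beta * x) - y  # prediction minus target
        grad_alpha = (2.0 / n) * error.sum()        # d(MSE)/d(alpha)
        grad_beta = (2.0 / n) * (error * x).sum()   # d(MSE)/d(beta)
        alpha -= gamma * grad_alpha     # move against the gradient
        beta -= gamma * grad_beta
    return alpha, beta

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 6.2, 8.1, 9.9])
print(gradient_descent(x, y))
```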

34
Supervised Learning: Linear Regression
<Gradient Descent>

Source: alykhantejani.github.io
35
Supervised Learning: Linear Regression
<Gradient Descent>

Gradient: The gradient of a function f(x) is a vector that points in the


direction of the steepest increase in the function. The negative gradient
points in the direction of the steepest decrease.

36
Supervised Learning: Advanced Types
of Regression

• Polynomial Regression: we describe the relationship between the independent variable x and the dependent variable y using an nth-degree polynomial in x.
• Lasso and Ridge Regression: techniques used in regression analysis to handle the problem of overfitting and to improve the generalization of the model. They do this by adding a penalty term to the standard linear regression cost function.

37
Supervised Learning: Linear Regression
<DEMO>
DEMO: Session 1 – Linear Regression

38
Supervised Learning Algorithms: Logistic Regression

• Logistic regression is a statistical model used for analyzing datasets where


there are one or more independent variables (features) that can be used to
predict the outcome of a categorical dependent variable (target).
• Binary outcome: It's particularly useful when the dependent variable is binary,
meaning it has only two possible outcomes (e.g., 0 or 1, yes or no, true or
false).
• Note: For multi-class classification tasks (i.e., more than two classes),
logistic regression can be extended using techniques like one-vs-all
(also known as one-vs-rest) or softmax regression.
• Why is there the word "regression" in the name of a classification algorithm? Despite its
name, logistic regression is primarily used for classification rather than
regression. Logistic regression gives a continuous value of P(Y=1) for a given
input X, which is later converted to Y=0 or Y=1 based on a threshold value.
• Sigmoid function: It is the core of logistic regression (also known as the
logistic function) and models the relationship between the independent variables
(features) and the probability of the outcome. The sigmoid function
"squashes" the output to be between 0 and 1, which makes it suitable for
representing probabilities. It is defined as σ(x) = 1 / (1 + e^(−x)),
where x is the linear combination of the input features and their corresponding coefficients.
39
Supervised Learning Algorithms: Logistic Regression

Instead of fitting a line to the data (as in Linear Regression),


Logistic Regression fits an “S” shaped logistic function
called the Sigmoid Function.

40
Supervised Learning Algorithms: Logistic Regression

The difference between logistic regression and linear regression


• One big difference between linear regression and logistic
regression is how the line is fit to the data.
• In linear regression, we fit the line using least squares; in other words, we find the line that minimizes the sum of the squares of the residuals (errors).
• Logistic regression does not have the same concept of residuals, so it cannot use least squares; there is no closed-form solution for logistic regression.
• Logistic regression instead uses "Maximum likelihood estimation" or "gradient descent" to estimate its coefficients.

Note: For logistic regression, there is no longer a closed-form solution, due to the
nonlinearity of the logistic sigmoid function.
41
Supervised Learning Algorithms: Logistic Regression
<How to train a logistic regression algorithm?>

Note: As usual, before training any machine learning algorithm we need to define
a Loss/Cost Function.

As we have seen before, linear regression uses the least squared error as its
loss function, which is convex, so the optimization can be completed by
finding its global minimum. For logistic regression, however, the hypothesis
changes: because the sigmoid function is applied to the model's output, using
the least squared error would result in a non-convex loss function with local
minima.

42
Supervised Learning Algorithms: Logistic Regression
<How to train a logistic regression algorithm?>
Note: As usual, before training any machine learning algorithm we need to define
a Loss/Cost Function.

If we applied the same loss function used for linear regression, the resulting
problem would be non-convex; therefore, the cost function for logistic
regression is defined as follows:

Cost(h(x), y) = −log(h(x)) if y = 1
Cost(h(x), y) = −log(1 − h(x)) if y = 0

where h(x) is the sigmoid function applied to the model's output.

43
Supervised Learning Algorithms: Logistic Regression
<Gradient Descent>

As you can see, if y = 1 and h(x) = 1, the cost = 0. But if y = 1 and h(x) = 0, we penalize the learning algorithm with a very large cost (cost = +inf).
The two cases of the cost function can be compressed into a single function as follows:

J(θ) = −(1/m) Σ_i [ y(i) log(h(x(i))) + (1 − y(i)) log(1 − h(x(i))) ]

To minimize our cost function J(θ), we are going to use the


gradient descent algorithm.
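Putting the pieces together, here is a hedged NumPy sketch (illustrative names and data, not the course demo) of logistic regression trained by gradient descent on the cross-entropy cost J(θ):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, gamma=0.1, iters=2000):
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iters):
        h = sigmoid(X @ theta)          # h(x) = P(y = 1 | x)
        grad = (X.T @ (h - y)) / m      # gradient of the cross-entropy cost J(theta)
        theta -= gamma * grad
    return theta

# Toy data: a bias column of ones plus a single feature
X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 3.0], [1.0, 4.5]])
y = np.array([0.0, 0.0, 1.0, 1.0])
theta = fit_logistic(X, y)
preds = (sigmoid(X @ theta) >= 0.5).astype(int)   # threshold P(y=1) at 0.5
print(theta, preds)
```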

44
Supervised Learning: Logistic Regression
<DEMO>

DEMO: Session 1 – Logistic Regression

45
Supervised Learning Algorithms: Tree-Based
Algorithms and Ensemble Methods

Decision Tree · Bagging (Bootstrap Aggregating) · Random Forest
Boosting · Gradient Boosting · XGBoost

46
Supervised Learning Algorithms: Tree-Based Algorithms and
Ensemble Methods

These methods can be used for both regression and classification problems.
• CART: Classification and Regression Trees (CART), commonly known as
decision trees, can be represented as binary/non-binary trees. They have the
advantage of being very interpretable.
• Bagging: Bootstrap + Aggregating is the ensemble technique used by
random forest. Bagging draws random subsets (bootstrap samples) from the
entire dataset with replacement, known as row sampling, and a separate model
is trained on each sample.
• Random Forest: It is a tree-based technique that uses a large number of
decision trees built out of randomly selected sets of features. Contrary to the
simple decision tree, it is much less interpretable, but its generally good
performance makes it a popular algorithm.
• Boosting: The idea of boosting methods is to combine several weak learners
to form a stronger one. The main ones are gradient boosting (e.g., XGBoost)
and adaptive boosting (AdaBoost).

47
Supervised Learning Algorithms: Decision Trees

Decision trees (also called Classification And Regression Trees "CART") are a
type of machine learning algorithm that makes decisions by splitting data
into subsets based on certain features. They are used for both classification
(assigning a label to an item) and regression (predicting a numerical value).
1. Structure: Imagine a flowchart where each node represents a decision
based on a feature, and the branches represent the possible outcomes
or further decisions.
2. How They Work:
1. Root Node: This is the starting point where the entire dataset is.
It's the feature that is most effective at splitting the data.
2. Internal Nodes: These nodes split the data based on certain
conditions (e.g., if a value is greater than a certain threshold).
3. Leaf Nodes: These are the final nodes that do not split further.
They provide the final prediction or classification.

48
Supervised Learning Algorithms: Decision Trees
<Splitting Criteria>

Question: How can we choose the best decision


nodes?
Answer: Using splitting criteria, the model chooses
the best features to split the data based on some
criteria (e.g., Gini impurity for classification, Mean
Squared Error for regression).
Splitting Criteria:
• For classification tasks: Gini Impurity,
Information Gain, Entropy, Gini Gain. (and
more)
• For regression tasks: Mean squared error,
mean absolute error, and variance
reduction.

49
Supervised Learning Algorithms: Decision Trees
for Classification

• Gini Impurity:
• Gini impurity is a measure of how mixed the classes are in a
given dataset or subset of data.
• For a two-class problem it ranges from 0 to 0.5, where 0 indicates that a node contains
only samples of a single class, and 0.5 indicates that the samples are evenly
distributed among the classes (in general, the maximum is 1 − 1/C for C classes).
• The decision tree algorithm aims to minimize the Gini impurity
when making splits.
• Entropy:
• Entropy measures the disorder or uncertainty in a set of
data.
• For a two-class problem it ranges from 0 to 1, with 0
indicating perfect order and 1 indicating maximum disorder.
• Decision trees aim to minimize entropy when making splits.

50
Supervised Learning Algorithms: Decision Trees
for Classification
Training a decision tree consists of iteratively splitting the current data into two branches.
Say we had the following datapoints: How can we quantify the best split?

Fig. 1. The dataset Fig. 2. A perfect split (x=2) Fig. 3. An imperfect split (x=1.5)

Source: gini-impurity 51
Supervised Learning Algorithms: Decision Trees
for Classification <Gini Impurity>

Suppose we randomly pick a data point in


our dataset, and then randomly classify it
according to the class distribution in the
dataset. For our dataset, we’d classify it as
blue 5/10 of the time and as green 5/10 of
the time, since we have 5 data points of each
color.
Question: What’s the probability we classify
the datapoint incorrectly?
Answer: Gini Impurity
Fig. 1. The dataset

Source: gini-impurity 52
Supervised Learning Algorithms: Decision Trees
for Classification <Gini Impurity>
Example 1: The whole dataset

Let’s calculate the Gini Impurity of our entire


dataset. If we randomly pick a data point,
it’s either blue (50%) or green (50%). Now, we
randomly classify our data point according to
the class distribution. Since we have 5 of
each color, we classify it as blue 50% of the
time and as green 50% of the time.
What’s the probability we classify our
datapoint incorrectly?

Fig. 1. The dataset

Source: gini-impurity 53
Supervised Learning Algorithms: Decision Trees
for Classification <Gini Impurity>
Example 1: The whole dataset

Event / Probability
• Pick blue, classify blue: 25%
• Pick blue, classify green: 25%
• Pick green, classify blue: 25%
• Pick green, classify green: 25%

We only classify it incorrectly in 2 of the events above. Thus, our total probability is 25% + 25% = 50%, so the Gini Impurity is 0.5.

Source: gini-impurity 54
Supervised Learning Algorithms: Decision Trees
for Classification <Gini Impurity>
FORMULA

If we have C total classes and p(i) is the probability of picking a data
point with class i, then the Gini Impurity is calculated as:

G = Σ_{i=1}^{C} p(i) · (1 − p(i)) = 1 − Σ_{i=1}^{C} p(i)²

Source: gini-impurity 55
Supervised Learning Algorithms: Decision Trees
for Classification <Gini Impurity>
Example 2: A Perfect Split
Left Branch has only blues, so its Gini Impurity is: G_left = 1 − 1² = 0

Right Branch has only greens, so its Gini Impurity is: G_right = 1 − 1² = 0

Note: A Gini Impurity of 0 is the lowest and best possible impurity. It can
only be achieved when everything is the same class (e.g. only blues or
only greens).
Fig. A Perfect Split
Source: gini-impurity
56
Supervised Learning Algorithms: Decision Trees
for Classification <Gini Impurity>
Example 3: An Imperfect Split

Left Branch has only blues, so we know that G_left = 0.

Right Branch has 1 blue and 5 greens, so: G_right = 1 − (1/6)² − (5/6)² = 10/36 ≈ 0.278

Fig. An Imperfect Split

Source: gini-impurity 57
Supervised Learning Algorithms: Decision Trees
for Classification <Gini Impurity>
Picking The Best Split
We’ve already calculated the Gini Impurities for:
o Before the split (the entire dataset): 0.5
o Left Branch: 0
o Right Branch: 0.278
We’ll determine the quality of the split by weighting the impurity of each branch by
how many elements it has. Since Left Branch has 4 elements and Right Branch has 6,
we get:

(0.4 × 0) + (0.6 × 0.278) = 0.167

Thus, the amount of impurity we’ve “removed” with this split is: 0.5 − 0.167 = 0.333
This value is called Gini Gain. This is what’s used to pick the best split in a decision
tree! Higher Gini Gain = Better Split. For example, it’s easy to verify that the Gini Gain of
the perfect split on our dataset is 0.5 > 0.333.
Fig. An Imperfect Split

Source: gini-impurity 58
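A small Python sketch (assumed helper functions, not from the source) that reproduces the numbers above: impurity 0.5 for the whole dataset, about 0.278 for the imperfect right branch, and a Gini Gain of about 0.333:

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class probabilities."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def gini_gain(parent, branches):
    """Parent impurity minus the size-weighted impurity of the branches."""
    n = len(parent)
    weighted = sum(len(b) / n * gini(b) for b in branches)
    return gini(parent) - weighted

data = ["blue"] * 5 + ["green"] * 5
left = ["blue"] * 4                       # the perfect left branch
right = ["blue"] * 1 + ["green"] * 5      # the imperfect right branch

print(gini(data))                                # 0.5
print(round(gini(right), 3))                     # 0.278
print(round(gini_gain(data, [left, right]), 3))  # 0.333
```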
Supervised Learning Algorithms: Decision Trees
for Classification

• Information Gain:
• Information Gain is used to determine the best
feature to split the data at each node in a decision
tree.
• It quantifies how much information a feature gives
us about the classes.
• Gini Gain:
• Gini Gain is similar to Information Gain, but it uses
Gini impurity as the measure instead of entropy.
• It's calculated similarly to Information Gain but using
Gini impurity instead of entropy.

59
Supervised Learning Algorithms: Decision Trees
for Regression

1. Mean Squared Error (MSE):


• The MSE measures the average squared difference
between predicted and actual values. It quantifies the
accuracy of a regression model.
• In the context of decision trees, the MSE is used to
evaluate the quality of a split. The split that
minimizes the MSE is chosen.
2. Mean Absolute Error (MAE):
• The MAE is similar to MSE but measures the average
absolute difference between predicted and actual
values, rather than squared differences.
• MAE is less sensitive to outliers compared to MSE.

60
Supervised Learning Algorithms: Decision Trees for Regression

Variance Reduction:
• In regression tasks, the goal is often to minimize the variance of the target variable within each node.
• The variance reduction is used as the splitting criterion, and it measures how much the variance of
the target variable is reduced after the split.
• The formula for variance reduction is specific to regression tasks and involves the calculation of
variances.
• The formula for variance reduction in the context of regression trees is as follows:
• Given a node with a dataset D containing n samples, where yi represents the target variable
values for each sample, the variance reduction is calculated as:

Variance Reduction = Var(D) − Σ_{i=1}^{m} (|Di| / n) · Var(Di),  where Var(D) = (1/n) Σ_{i=1}^{n} (yi − ȳ)²
• m is the number of child nodes resulting from the split.


• Di represents the dataset in the i-th child node after the
split.
• ∣Di ∣ is the number of samples in the i-th child node.
• Var(D) is the variance of the target variable in node D
• y-bar is the mean of the target variable values in node D
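A brief NumPy sketch (illustrative values, not from the slides) of this variance-reduction criterion applied to one candidate split of a node:

```python
import numpy as np

def variance_reduction(parent, children):
    """Var(D) minus the size-weighted variances of the child nodes."""
    n = len(parent)
    weighted_child_var = sum(len(c) / n * np.var(c) for c in children)
    return np.var(parent) - weighted_child_var

y = np.array([1.0, 1.2, 0.9, 5.0, 5.3, 4.8])    # target values in the node
left, right = y[:3], y[3:]                       # a candidate split
print(variance_reduction(y, [left, right]))      # large reduction -> good split
```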

61
Supervised Learning Algorithms: Decision Trees
<DEMO>

DEMO: Session 1 – Decision Trees

62
Supervised Learning Algorithms: Ensemble
Bagging Methods

Ensemble bagging methods are a type of


machine learning technique that involves
training multiple models independently and
combining their predictions to improve overall
performance. Bagging stands for Bootstrap
Aggregating, and the key idea is to create
diverse models by training each one on a
different subset of the training data.

Source: towardsdatascience.com
63
Supervised Learning Algorithms: Ensemble
Bagging Methods

The most popular ensemble bagging method is Random Forest, but there are others as
well. Here are some common ensemble bagging methods:
Random Forest: Random Forest builds multiple decision trees during training. Each tree
is trained on a random subset of the training data, and the final prediction is obtained
by averaging or taking a vote over the predictions of all the trees.
Bagged Decision Trees: This is a more generic term for ensemble methods that use
bagging with decision trees as the base model. Random Forest is a specific
implementation of bagged decision trees.
Bagged Support Vector Machines (SVM): Bagging can be applied to SVMs, where each
model is trained on a different subset of the training data. The final prediction is often
determined by averaging the predictions of the individual SVMs.
Bagged Neural Networks: Similar to other models, neural networks can also benefit
from bagging. Multiple neural networks are trained on different subsets of the training
data, and their predictions are combined to form the final output.
Source: towardsdatascience.com
Bootstrap Aggregating (Bagging) in General: Bagging can be applied to various base
models, including decision trees, support vector machines, and neural networks, among
others. The idea is to create an ensemble of diverse models to reduce overfitting and
improve generalization.
64
Supervised Learning Algorithms: Ensemble
Bagging Methods <Random Forest>
Bootstrap: The term "bootstrap" in statistics refers to a resampling
technique where subsets of the dataset are randomly sampled with
replacement. In other words, each data point has an equal chance of
being selected for a subset, and it can be selected more than once.
This process mimics the idea of repeatedly drawing samples from
the original dataset.
Aggregating: "Aggregating" refers to the process of combining or
averaging the predictions of multiple models to obtain a final
prediction or decision.
Therefore, "Bootstrap Aggregating," or "Bagging" for short, involves
the following steps:
Bootstrap: Randomly sample subsets of the training data with
replacement, creating multiple subsets that may overlap.
Aggregating: Train a separate model on each subset of the data,
and then combine their predictions through averaging or voting to
make a final prediction.
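As a rough sketch of these two steps (assuming scikit-learn is available; this is bagged decision trees, i.e., the Random Forest idea without random feature subsets):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, n_models=10, random_state=0):
    rng = np.random.default_rng(random_state)
    models, n = [], len(X)
    for _ in range(n_models):
        idx = rng.integers(0, n, size=n)   # Bootstrap: sample rows with replacement
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    votes = np.array([m.predict(X) for m in models])   # shape (n_models, n_samples)
    # Aggregating: majority vote per sample (assumes integer class labels)
    return np.array([np.bincount(votes[:, i]).argmax() for i in range(votes.shape[1])])
```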

Source: spotfire.com

65
Supervised Learning Algorithms: Ensemble
Boosting Methods

Boosting is an ensemble learning method that combines multiple


weak learners to create a strong learner. There are several boosting
algorithms, and some of the popular ones include:

AdaBoost (Adaptive Boosting) · XGBoost (Extreme Gradient Boosting) · LightGBM · CatBoost

66
Supervised Learning Algorithms: K-Nearest Neighbors

k-Nearest Neighbors (k-NN) is a


supervised machine learning algorithm
used for both classification and
regression tasks. It's a simple and
intuitive algorithm that relies on the idea
of finding the "nearest" data points in
the training set to a given test point and
using those neighbors to make a
prediction.
Fig. K-NN Algorithm Steps

67
Supervised Learning Algorithms: K-Nearest Neighbors

1. Training Phase:
• Store the entire dataset in memory.
• No explicit training is done, as k-NN is a lazy learner.
2. Prediction Phase:
• When given a new, unseen data point, the algorithm finds the k
data points in the training set that are closest (most similar) to the
new point.
• The "closeness" is typically determined by a distance metric like
Euclidean distance, Manhattan distance, etc.
3. Classification:
• In the classification task, the algorithm assigns a class label to the
new point based on the majority class among its k nearest
neighbors. Fig. K-NN Algorithm Steps
4. Regression:
• In regression, the algorithm predicts a continuous value for the new
point based on the average (or some other measure) of the target
values of its k nearest neighbors.
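A minimal k-NN classification sketch following these steps (Euclidean distance plus a majority vote); the data and names are purely illustrative:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    dists = np.linalg.norm(X_train - x_new, axis=1)   # Euclidean distance to every stored point
    nearest = np.argsort(dists)[:k]                   # indices of the k closest training points
    return Counter(y_train[nearest]).most_common(1)[0][0]   # majority class among them

X_train = np.array([[1.0, 1.0], [1.5, 2.0], [3.0, 4.0], [5.0, 7.0]])
y_train = np.array(["cat", "cat", "dog", "dog"])
print(knn_predict(X_train, y_train, np.array([1.2, 1.5]), k=3))   # -> "cat"
```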

68
Supervised Learning Algorithms: K-Nearest Neighbors

1. Key Parameters:
• k: This is the number of neighbors to consider. It's a hyperparameter that you need to choose. A small k
might lead to noise in the prediction, while a large k might smooth out the decision boundaries.
• Distance Metric: This defines how the "closeness" of data points is calculated. Common options include
Euclidean distance, Manhattan distance, Minkowski distance, etc.
2. Advantages:
• Simple to understand and implement.
• No explicit training phase, making it computationally efficient during training.
• Can be used for both classification and regression tasks.
3. Disadvantages:
• Can be computationally expensive during prediction, especially with large datasets.
• Sensitive to the choice of distance metric and value of k.
• Doesn't learn underlying patterns in the data, which can lead to suboptimal performance in some cases.
4. Considerations:
• Scaling of features is crucial as k-NN is sensitive to the scale of the variables.
• It's important to choose an appropriate value of k through techniques like cross-validation.

69
Supervised Learning Algorithms: K-Nearest Neighbors
<Example>

Example of k-NN classification. The test sample


(green dot) should be classified either to blue squares
or to red triangles. If k = 3 (solid line circle) it is
assigned to the red triangles because there are 2
triangles and only 1 square inside the inner circle. If k
= 5 (dashed line circle) it is assigned to the blue
squares (3 squares vs. 2 triangles inside the outer
circle).

70
Supervised Learning Algorithms: K-Nearest Neighbors
<DEMO>

DEMO: Session 1 - k-Nearest Neighbors (k-NN)

71
Supervised Learning Algorithms: Commonly Used
Regression Algorithms

Regression Algorithms:
1. Linear Regression: Predicts a continuous output variable based on the
input features by fitting a linear equation to the observed data.
2. Ridge Regression (L2 regularization): Similar to linear regression but
adds a penalty term to the coefficients to prevent overfitting.
3. Lasso Regression (L1 regularization): Similar to ridge regression, but
uses the absolute values of coefficients, which can lead to sparsity in the
model.
4. ElasticNet: A combination of L1 and L2 regularization that balances
between Ridge and Lasso regression.
5. Decision Tree Regression: Uses a decision tree to predict continuous
values.
6. Random Forest Regression: Ensemble method that uses multiple
decision trees to improve accuracy and control overfitting.
7. Gradient Boosting Regression: Builds an additive model in a forward
stage-wise manner, optimizing for the residual errors.
8. Support Vector Regression (SVR): Uses support vector machines to
perform regression.
72
Supervised Learning Algorithms: Commonly Used
Classification Algorithms

Classification Algorithms:
1. Logistic Regression: Used for binary classification problems, models the probability that a given instance belongs
to a particular category.
2. K-Nearest Neighbors (KNN): Classifies data points based on the majority class of their k-nearest neighbors.
3. Support Vector Machines (SVM): Finds the hyperplane that best separates classes in a high-dimensional space.
4. Naive Bayes: Applies Bayes' theorem with the "naive" assumption that features are independent, commonly used
for text classification.
5. Decision Tree Classification: Divides the data into subsets based on the value of features to make categorical
predictions.
6. Random Forest Classification: Ensemble method that uses multiple decision trees for classification.
7. Gradient Boosting Classification: Builds an ensemble of weak learners (usually decision trees) in a forward
stage-wise manner.
8. Neural Networks (Deep Learning): Multi-layered networks of interconnected nodes used for complex
classification tasks.
9. XGBoost, LightGBM, CatBoost (Gradient Boosting variations): Highly optimized implementations of gradient
boosting algorithms.
10.Adaboost: Combines multiple weak learners to create a strong learner.

73
Supervised Learning Algorithms: Classification
and Regression Algorithms

Source: www.oreilly.com 74
Supervised Learning Algorithms: Parametric vs.
Nonparametric

There are two main types of machine learning algorithms: parametric and
nonparametric.

Parametric Algorithms: Parametric algorithms are based on a mathematical model that defines the relationship between inputs and outputs. This makes them more restrictive than nonparametric algorithms, but it also makes them faster and easier to train. Parametric algorithms are most appropriate for problems where the input data is well-defined and predictable. Examples of parametric algorithms: Linear regression, Logistic regression, and Neural networks.

Nonparametric Algorithms: Nonparametric algorithms are not based on a mathematical model; instead, they learn from the data itself. This makes them more flexible than parametric algorithms but also more computationally expensive. Nonparametric algorithms are most appropriate for problems where the input data is not well-defined or is too complex to be modeled using a parametric algorithm. Examples of nonparametric algorithms: Decision trees, K-NN.

For more information: machinelearningmastery.com 75


Supervised Learning: Evaluation Metrics
<Classification Metrics>

Question: When dealing with a classification


problem, how can we evaluate its effectiveness and
thus conduct a comparative study with other
classifiers?
Answer: We count the correctly classified data points and divide by the total number of data points in the dataset; this metric is called accuracy.

Another Question: Is the accuracy metric enough?


Are there potential problems with it?

Short Answer: No, it is not enough, and yes, there are potential problems.
76
Supervised Learning: Evaluation Metrics
<Classification Metrics>

77
Supervised Learning: Evaluation Metrics
<Classification Metrics>

Some potential problems with relying only on accuracy

Problem: Imbalanced Datasets
Description: In datasets where one class significantly outweighs the others (class imbalance), a model can achieve high accuracy by simply predicting the majority class most of the time.
Example: In a medical dataset where only 5% of patients have a rare disease, a model that always predicts "no disease" would still achieve 95% accuracy.

Problem: Misleading in Rare Events
Description: In applications where detecting rare events is crucial (e.g., fraud detection, disease diagnosis), accuracy can be misleading. A high accuracy score may not reflect the model's ability to correctly identify these critical cases, i.e., rare but important occurrences.
Example: In fraud detection, a model that misses rare instances of fraud (false negatives) could have high accuracy but would be ineffective.

78
Supervised Learning: Evaluation Metrics
<Classification Metrics>

Question: What is the solution?


Are there other alternatives to
Accuracy?

Answer: There are other metrics that can


be used: Recall/Sensitivity, Precision,
Specificity, F-Score, ROC-AUC Curve...etc.
WAIT!!!

Before explaining those metrics, let's first


understand the confusing "Confusion
Matrix"

79
Supervised Learning: Evaluation Metrics
<Classification Metrics>

Fig. Confusion Matrix Layout Tab. Basic classification metrics

80
Supervised Learning: Evaluation Metrics
<Classification Metrics>

ROC: The receiver operating characteristic curve, also noted ROC, is the plot of TPR versus FPR obtained by varying the threshold.
AUC: The area under the receiver operating characteristic curve, also noted AUC or AUROC, is the area below the ROC curve, as shown in the following figure.
Source: stanford.edu
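For reference, a brief scikit-learn sketch (toy labels and scores, an assumed setup) that computes the ROC points and the AUC from predicted probabilities:

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

y_true = np.array([0, 0, 1, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9])   # the model's predicted P(y = 1)

fpr, tpr, thresholds = roc_curve(y_true, y_score)      # TPR vs. FPR over all thresholds
print("AUC =", roc_auc_score(y_true, y_score))
```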
81
Supervised Learning: Evaluation Metrics
<Classification Metrics>

82
Supervised Learning: Evaluation Metrics
<Classification Metrics>

Example Scenario:

Let's say we have a binary classification problem for a medical test that
identifies a disease:
Out of 100 actual cases of the disease:
The test correctly identifies 80 as positive (True Positives).
The test incorrectly identifies 20 as negative (False Negatives).
Out of 100 non-cases (healthy individuals):
The test correctly identifies 90 as negative (True Negatives).
The test incorrectly identifies 10 as positive (False Positives).

Find the confusion matrix and compute the accuracy, recall, precision, and
specificity!!
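If you want to check your answer in code, here is a small sketch (not part of the slides) that plugs the counts from the scenario above into the metric definitions:

```python
TP, FN, TN, FP = 80, 20, 90, 10    # counts from the scenario above

accuracy    = (TP + TN) / (TP + TN + FP + FN)   # 0.85
recall      = TP / (TP + FN)                    # sensitivity: 0.80
precision   = TP / (TP + FP)                    # about 0.889
specificity = TN / (TN + FP)                    # 0.90

print(accuracy, recall, precision, specificity)
```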
83
Supervised Learning: Evaluation Metrics
<Regression Metrics>

These regression metrics serve different


purposes in evaluating the performance of
regression models. MAE, MSE, and RMSE
focus on the accuracy of predictions, while
R2 (R-squared) assesses the overall
goodness of fit of the model. Depending on
the specific characteristics of the problem,
different metrics may be more appropriate
for evaluation.

Fig. Regression evaluation metrics

84
Unsupervised Learning Algorithms

• This is called unsupervised learning because


unlike supervised learning, there are no given
labels/outputs and the machine itself finds the
answers.
• Motivation: The goal of unsupervised learning
is to find and discover hidden patterns in
unlabeled data {x(1),...,x(m)}.
• Example: Clustering similar documents based
on content/text. Fig. A simple illustration of the clustering technique

85
Unsupervised Learning Algorithms

• Types of unsupervised learning algorithms:
• Clustering: Clustering aims to group similar data points together based on their features or attributes. The goal is to identify natural clusters in the data. Examples: Customer Segmentation: grouping customers based on purchasing behavior. Image Segmentation: dividing an image into regions with similar characteristics.
• Dimensionality Reduction: Dimensionality reduction techniques aim to reduce the number of features or variables in the data while preserving important information.
• Association: Association aims to discover relationships or patterns in data. It identifies sets of items that frequently occur together. Examples: Market Basket Analysis: identifying which products tend to be bought together in a shopping basket. Recommender Systems: suggesting products or content based on user behavior.

Fig. Some unsupervised learning algorithms: Clustering (e.g., K-means), Dimensionality reduction (e.g., Principal Component Analysis, PCA), Association (e.g., Apriori Algorithm)


tSNE Visualization: https://nicola17.github.io/tfjs-tsne-demo/

86
Unsupervised Learning Algorithms: Clustering

• Clustering in machine learning is a technique used to group similar data points


together based on their features or attributes. The goal is to identify natural
clusters in the data, where data points within a cluster are more similar to
each other compared to points in different clusters.
• Clustering is an unsupervised learning technique, which means that the
algorithm does not rely on labeled data. It learns from the inherent structure
in the input data without any specific guidance.
• As the examples are unlabeled, clustering relies on unsupervised machine
learning. If the examples are labeled, then clustering becomes classification.
• Clustering Use Cases:
• market segmentation
• social network analysis
• search result grouping Fig. Unlabeled examples grouped into
• medical imaging three clusters.
• image segmentation
• anomaly detection

87
Unsupervised Learning Algorithms: Clustering
using K-Means Algorithm

The K-Means algorithm is a popular clustering algorithm (centroid-based) used in


machine learning. It aims to partition a dataset into K distinct, non-overlapping
subsets (or clusters) based on the similarity of data points. The "K" in K-Means
refers to the number of clusters that the algorithm aims to find.

88
Unsupervised Learning Algorithms: Clustering
using K-Means Algorithm

How does the k-means algorithm work?

We note c(i) the cluster of data point i and μj the center of cluster j.

Algorithm: After randomly initializing the cluster centroids μ1, μ2, ..., μk, the k-means algorithm repeats the following steps until convergence:
• Assignment step: c(i) = arg min_j ‖x(i) − μj‖²
• Update step: μj = ( Σ_i 1{c(i) = j} x(i) ) / ( Σ_i 1{c(i) = j} )

89
Unsupervised Learning Algorithms: Clustering
using K-Means Algorithm

How does the k-means algorithm work?

In order to see if the algorithm converges, we look at the distortion function defined as follows:

J(c, μ) = Σ_{i=1}^{m} ‖x(i) − μ_{c(i)}‖²
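A compact NumPy sketch (an assumed implementation, separate from the course demo) of the two alternating k-means steps and the distortion J:

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # random initialization
    for _ in range(iters):
        # Assignment step: c(i) = argmin_j ||x(i) - mu_j||^2
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        c = dists.argmin(axis=1)
        # Update step: mu_j = mean of the points assigned to cluster j
        # (assumes every cluster keeps at least one point)
        centroids = np.array([X[c == j].mean(axis=0) for j in range(k)])
    distortion = ((X - centroids[c]) ** 2).sum()                # J(c, mu)
    return c, centroids, distortion
```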

90
Unsupervised Learning Algorithms: Clustering
using K-Means Algorithm <Visualization>

Visualizing K-Means Clustering


91
Unsupervised Learning Algorithms:
Clustering Algorithms

Centroid-based Clustering Density-based Clustering

For more details: A Comprehensive Survey of Clustering Algorithms


92
Unsupervised Learning Algorithms:
Clustering Algorithms

Distribution-based Clustering Hierarchical Clustering

For more details: A Comprehensive Survey of Clustering Algorithms


93
Unsupervised Learning Algorithms: Clustering
using K-Means <DEMO>

DEMO: Session 1 - Unsupervised Learning Clustering


using K-Means Algorithm

Source: Wikipedia 94
Machine Learning: Important Concepts
<Overfitting and Underfitting>

Two important concepts:


• Overfitting and underfitting
• Bias and variance trade-off

95
Machine Learning: Important Concepts
<Overfitting and Underfitting>

Underfitting and overfitting are two common problems in


machine learning that occur when a model is not able to
generalize well to new, unseen data.
• Underfitting: Underfitting occurs when a model is too
simple to capture the underlying trend of the data. It
performs poorly on both the training data and unseen
data.
• Causes of underfitting:
• Using a very simple model (e.g., linear
regression for a non-linear problem).
• Not having enough features in the model.
• Using too few training samples.

96
Machine Learning: Important Concepts
<Overfitting and Underfitting>

• Overfitting: Overfitting happens when a model learns the


training data too well, capturing noise or random
fluctuations that are not representative of the true
underlying pattern. As a result, it performs well on the
training data but poorly on unseen data.
• Causes of overfitting:
• Using a very complex model (e.g., a high-
degree polynomial for a simple problem).
• Having too many features relative to the
number of training samples.

97
Machine Learning: Important Concepts
<Bias-Variance Tradeoff>

• What is Bias:
• Error between average model prediction
and ground truth
• The bias of the estimated function tells us
the capacity of the underlying model to predict
the values
• High Bias: Overly-simplified model, Underfitting, and
High error on both train and test sets
• What is Variance?
• Average variability in the model prediction for the
given dataset
• The variance of the estimated function tells you
how much the function can adjust to the change in
the dataset
• High Variance: Overly-complex model, Overfitting, Low
error on train data and high on test, Starts modelling the
noise in the input.
98
Machine Learning: Important Concepts
<K-Fold Cross Validation>

K-fold cross-validation: One of the techniques to evaluate machine learning


models on limited data samples and one of the employed techniques to detect
overfitting and how well a model will generalize to new, unseen data.
• Idea: The k in k-fold cross-validation (CV) refers to the number of folds, i.e., the
subsets into which the dataset is divided. The model is then trained and evaluated
k times, using each fold as the testing set once and the remaining k−1 folds as the
training set. It is a technique used in machine learning to evaluate the performance
of models on the entire dataset: it ensures that every data point has the chance of
appearing in both the training and testing sets, which makes it one of the best
approaches to evaluate a model's performance.
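A minimal scikit-learn sketch of 5-fold cross-validation (an assumed setup with a toy dataset and a logistic-regression model):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

scores = cross_val_score(model, X, y, cv=5)   # each of the 5 folds serves as the test set once
print(scores, scores.mean())
```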

99
Machine Learning: Important Concepts
<Hyper-Parameters Optimization/Tuning>

Problem: We have seen so far different machine


learning algorithms; linear regression, logistic
regression, decision tree, ...etc. In linear regression
for example, we talked about something called
"learning rate", in decision trees, we talked about
the criterion (to measure the quality of a split) like :
gini impurity and entropy.
Question: How do we choose the best values for these hyperparameters?

100
Machine Learning: Important Concepts
<Hyper-Parameters Optimization/Tuning>

Definition: A hyperparameter is a configuration setting for a


machine-learning model that is not learned from the data.
Instead, it is set prior to training and remains constant during
the training process. These parameters are essential for
controlling the learning process, but they are not learned from
the data itself.
In contrast, the parameters of a machine learning model are
the variables that the model learns from the training data.
For example, in a linear regression model, the coefficients for
each feature are learned parameters.

101
Machine Learning: Important Concepts
<Hyper-Parameters Optimization/Tuning>

Hyperparameters of some machine learning algorithms:


• Linear regression: Linear regression is not a complex algorithm; it has essentially
no hyperparameters except the learning rate (alpha) when it is trained with gradient
descent. However, there are some advanced regression algorithms that have more
hyperparameters.
• Logistic Regression: Solver, penalty, and 'C' regularization
strength.
• Decision Trees: Criterion, Maximum depth.
• K-Nearest Neighbour: Number of Neighbours 'K'.

102
Machine Learning: Important Concepts
<Hyper-Parameters Optimization/Tuning>

• Hyperparameter optimization is the process of finding the


optimal hyperparameters for a machine-learning model in
order to obtain the optimal performance of the model.
Hyperparameters are parameters not learned from the data
but the user sets them before training the
model. Hyperparameter optimization is a crucial task in the
process of building machine learning models because the
performance of a machine learning model heavily depends on
the chosen values for its hyperparameters. Through a
systematic exploration of various hyperparameter
combinations, we can identify the exact values that yield
optimal performance for a given task or problem.
• Several approaches to hyperparameter optimization/tuning,
including manual tuning, grid search, random search, and
some other advanced techniques like Bayesian
Source: https://maelfabien.github.io/machinelearning/Explorium4/#w
optimization and genetic algorithms. hat-is-hyperparameter-optimization

103
Machine Learning: Important Concepts
<Hyper-Parameters Optimization/Tuning>

• Grid Search: In grid search, a set of possible values for


each hyperparameter is specified by the user, and the
algorithm comprehensively evaluates all possible
combinations of hyperparameters.
• Random Search: Random search randomly selects
hyperparameter values from predefined ranges. It
doesn’t cover the entire search space like grid search
but can be more efficient in practice, especially if
some hyperparameters are less influential.
• Note: There are some libraries and tools available that can help with
hyperparameter optimization/tuning, such as GridSearchCV and RandomizedSearchCV
in the scikit-learn library, Optuna, Hyperopt, and Ray Tune.
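As a short illustration (an assumed setup with a toy dataset), grid search and random search over two decision-tree hyperparameters mentioned earlier, the criterion and the maximum depth:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
params = {"criterion": ["gini", "entropy"], "max_depth": [2, 3, 5, 10]}

grid = GridSearchCV(DecisionTreeClassifier(), params, cv=5)                  # tries every combination
grid.fit(X, y)

rand = RandomizedSearchCV(DecisionTreeClassifier(), params, n_iter=5, cv=5)  # samples a few combinations
rand.fit(X, y)

print(grid.best_params_, rand.best_params_)
```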

104
Automated Machine learning (AutoML)

Automated machine learning or simply AutoML, refers to the process of


automating the end-to-end process of applying machine learning to
real-world problems. It aims to make machine learning more accessible
to individuals and organizations with limited expertise in data science
and machine learning.

Source: learn.microsoft.com 105


Automated Machine learning (AutoML)
< Key Aspects >

Data preparation · Feature engineering · Model selection · Hyperparameter optimization · Model evaluation and selection · Deployment and monitoring

106
Automated Machine learning (AutoML)
< Some Frameworks >

107
Automated Machine learning (AutoML)
<DEMO>
DEMO: Session 1 – Automated Machine
Learning AutoML

108
Automated Machine learning (AutoML)
< AutoML vs Data scientists >

Will AutoML replace data scientists and machine


learning engineers?

Short Answer: No, but...


109
Automated Machine learning (AutoML)
< AutoML vs Data scientists >

• AutoML will not replace data scientists.


• Data scientists must be specialized and
educated much more than before.
• AutoML will boost productivity and make
the work of a data scientist more
interesting and more challenging.
• Domain knowledge will become more
important.
• Scripting will become less important.

110
Bonus: Advice For Machine Learning Beginners

Advice for machine learning beginners | Andrej Karpathy and Lex Fridman

111
