Session 01 - Classical Machine Learning
PART 1 | MIAS M1
Idriss JAIRI
[email protected]
COURSE PLAN
Session 1: Classical Machine Learning
SESSION 1: MACHINE LEARNING
Introduction to AI, Supervised Learning, Unsupervised Learning
Introduction: Why has Artificial Intelligence (AI) become so popular?
• Big tech companies invest a lot of effort and money in AI
• Scientific research and impressive results
Introduction: Why has Artificial Intelligence (AI) become so popular? <Increased volumes of data>
Source: explodingtopics.com
Introduction: Why has Artificial Intelligence (AI) become so popular? <Advanced algorithms>
Introduction: Why has Artificial Intelligence (AI) become so popular? <Advancements in computing power>
Introduction: A Brief History of AI
• 1943: Warren McCulloch and Walter Pitts create a mathematical model of a neural network.
• 1949: Donald Hebb proposes a learning rule for neural networks, known as Hebbian learning.
• 1950: Alan Turing introduces the "Turing Test" as a way to evaluate a machine's ability to exhibit intelligent
behavior.
• 1951: Marvin Minsky and Dean Edmonds build the first neural network computer, SNARC.
• 1956: John McCarthy organizes the Dartmouth Conference, which is considered the birth of AI as a field of study. "The term artificial intelligence was first coined by John McCarthy in 1956."
• 1957: Frank Rosenblatt develops the perceptron, a simplified model of a biological neuron, which becomes one of
the earliest machine learning algorithms. It lays the groundwork for later developments in neural networks.
• 1960: John McCarthy develops the programming language LISP, which becomes widely used in AI research.
• 1966: Shakey the robot, developed at Stanford Research Institute, demonstrates basic problem-solving abilities.
• 1966: ELIZA is an early natural language processing computer program, the first program that allowed some kind of
plausible conversation between humans and machines.
Note: There were two major AI winters (Dark Age of AI) approximately 1974–1980 and 1987–2000
Introduction: A Brief History of AI
<From 1940s to Now>
Introduction: A Brief History of AI
<From 1940s to Now>
• 2013: Google's DeepMind develops a deep reinforcement learning algorithm (deep Q-learning) that learns to play Atari 2600 video games at human-level performance ("Playing Atari with Deep Reinforcement Learning").
• 2016: AlphaGo, a program developed by DeepMind, defeats world Go champion Lee Sedol in a five-game match.
• 2018: GPT-1 (Generative Pre-trained Transformer) by OpenAI showcases powerful language generation capabilities.
• 2019: OpenAI introduces GPT-2, a large-scale language model.
• 2020: OpenAI introduces GPT-3, a large-scale language model.
• 2021: AlphaFold, an artificial intelligence program developed by DeepMind, performs predictions of protein structure.
• 2022: Generative Pre-trained Transformer 3.5 (GPT-3.5) is a subclass of GPT-3 models created by OpenAI in 2022.
• 2023: Generative Pre-trained Transformer 4 (GPT-4) is a multimodal large language model created by OpenAI, and the fourth in its series of GPT foundation models.
Introduction: Deep Learning State of the Art
Introduction: A Brief History of AI
1997: IBM's Deep Blue chess computer beats Garry Kasparov (chess grandmaster).
2011: IBM Watson defeats humans in "Jeopardy!".
2016: Google's AlphaGo (developed by DeepMind) beats Go master Lee Se-dol.
Full Documentary: AlphaGo – The Movie
Artificial Intelligence: New Trends
<Large Language Models (LLMs)>
Artificial Intelligence: New Trends
<Text to Image Models>
Source: wikipedia.org
Artificial Intelligence: New Trends
<Text to Image Models>
Artificial Intelligence, Machine Learning, and Deep Learning.
<What is the difference?>
Source: blogs.nvidia.com
Introduction: The Importance of Mathematics in ML
Machine Learning: Types of Machine Learning
[Figure: supervised learning example — images with labels "Dog", "Cat", "Dog"]
Machine Learning: Types of Machine Learning
Reinforcement Learning
Machine Learning: Types of Supervised Learning
Source: enjoyalgorithms.com
Machine Learning: Supervised Learning Algorithms
• Classification: Logistic Regression, Support Vector Classifier, Decision Tree Classification, K-NN
• Regression: Linear Regression, Support Vector Regression, Decision Tree Regression, Lasso Regression, Ridge Regression
Supervised Learning Algorithms: Linear Regression
Source: datatab.net
Supervised Learning: Linear Regression
<Ordinary Least Squares and Normal Equations>
Normal equations are obtained by setting the partial derivatives of the sum of squared errors (least squares) to zero; solving them gives the parameter estimates of a multiple linear regression: θ = (XᵀX)⁻¹ Xᵀ y.
If you cannot derive this by yourself, you can check this useful link: http://seismo.berkeley.edu/~kirchner/eps_120/Toolkits/Toolkit_10.pdf
Source: datatab.net
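As a quick illustration, here is a minimal normal-equation sketch in Python; the synthetic data and NumPy-based solution are assumptions for illustration, not the course demo:

```python
import numpy as np

# Minimal illustration of the normal equations: theta = (X^T X)^(-1) X^T y
# Synthetic data: y = 3 + 2x plus Gaussian noise (assumed example, not course data).
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 3 + 2 * x + rng.normal(0, 1, size=50)

X = np.column_stack([np.ones_like(x), x])      # design matrix with an intercept column
theta = np.linalg.solve(X.T @ X, X.T @ y)      # solve (X^T X) theta = X^T y
print("intercept, slope:", theta)
```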
Supervised Learning: Linear Regression
<Gradient Descent>
Gradient descent minimizes the cost function iteratively by updating the parameters in the direction opposite to the gradient: θ ← θ − α ∇J(θ), where α is the learning rate.
Source: alykhantejani.github.io
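Since the slide figures do not survive in text form, here is a minimal gradient-descent sketch for the same linear model; the synthetic data, learning rate, and iteration count are assumed:

```python
import numpy as np

# Gradient descent for linear regression on the mean squared error:
#   J(theta) = (1/2n) * ||X theta - y||^2,  gradient = (1/n) * X^T (X theta - y)
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 3 + 2 * x + rng.normal(0, 1, size=50)
X = np.column_stack([np.ones_like(x), x])

theta = np.zeros(2)
lr = 0.01                      # assumed learning rate
for _ in range(5000):          # assumed number of iterations
    grad = X.T @ (X @ theta - y) / len(y)
    theta -= lr * grad
print("intercept, slope:", theta)
```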
Supervised Learning: Advanced Types of Regression
• Polynomial Regression: describes the relationship between the independent variable x and the dependent variable y using an nth-degree polynomial in x.
• Lasso and Ridge Regression: both are techniques used in regression analysis to handle the problem of overfitting and improve the generalization of the model. They do this by adding a penalty term to the standard linear regression cost function.
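A small scikit-learn sketch contrasting ordinary least squares with Ridge (L2) and Lasso (L1) on polynomial features; the data, polynomial degree, and penalty strengths are assumptions chosen for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Toy 1-D data (an assumption); degree-5 polynomial features make the effect of
# the L2 (Ridge) and L1 (Lasso) penalties on the coefficients visible.
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=(40, 1))
y = 0.5 * x[:, 0] ** 3 - x[:, 0] + rng.normal(0, 2, size=40)

for name, reg in [("OLS", LinearRegression()),
                  ("Ridge (L2)", Ridge(alpha=10.0)),   # penalty strengths are assumed
                  ("Lasso (L1)", Lasso(alpha=0.1))]:
    model = make_pipeline(PolynomialFeatures(degree=5, include_bias=False),
                          StandardScaler(), reg).fit(x, y)
    print(name, np.round(model[-1].coef_, 2))          # Lasso drives some coefficients to 0
```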
Supervised Learning: Linear Regression
<DEMO>
DEMO: Session 1 – Linear Regression
Supervised Learning Algorithms: Logistic Regression
Supervised Learning Algorithms: Logistic Regression
Note: For logistic regression, there is no longer a closed-form solution, due to the
nonlinearity of the logistic sigmoid function.
Supervised Learning Algorithms: Logistic Regression
<How to train a logistic regression algorithm?>
Note: As usual, before training any machine learning algorithm we need to define a Loss/Cost Function. For logistic regression, the standard choice is the binary cross-entropy (log loss):
J(θ) = −(1/n) Σᵢ [ yᵢ log(ŷᵢ) + (1 − yᵢ) log(1 − ŷᵢ) ]
where ŷᵢ = σ(θᵀxᵢ) and σ(z) = 1 / (1 + e⁻ᶻ) is the logistic sigmoid function.
Supervised Learning Algorithms: Logistic Regression
<Gradient Descent>
The parameters are updated with gradient descent; the gradient of the cross-entropy loss is ∇J(θ) = (1/n) Σᵢ (ŷᵢ − yᵢ) xᵢ.
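A minimal from-scratch sketch of logistic regression trained with batch gradient descent on the cross-entropy loss; the synthetic two-class data and learning rate are assumed:

```python
import numpy as np

# Logistic regression trained by batch gradient descent (toy two-class data, assumed settings).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, size=(50, 2)), rng.normal(+1, 1, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)
Xb = np.column_stack([np.ones(len(X)), X])     # add an intercept column

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

theta = np.zeros(Xb.shape[1])
lr = 0.1                                       # assumed learning rate
for _ in range(2000):
    p = sigmoid(Xb @ theta)                    # predicted probabilities
    theta -= lr * Xb.T @ (p - y) / len(y)      # gradient of the cross-entropy loss

accuracy = ((sigmoid(Xb @ theta) > 0.5) == y).mean()
print("training accuracy:", accuracy)
```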
Supervised Learning: Logistic Regression
<DEMO>
Supervised Learning Algorithms: Tree-Based Algorithms and Ensemble Methods
• Decision Tree
• Bagging (Bootstrap Aggregating)
• Random Forest
Supervised Learning Algorithms: Tree-Based Algorithms and Ensemble Methods
These methods can be used for both regression and classification problems.
• CART: Classification and Regression Trees (CART), commonly known as decision trees, can be represented as binary or non-binary trees. They have the advantage of being very interpretable.
• Bagging: Bootstrap + Aggregating is the ensemble technique used by random forest. Bagging draws random samples (bootstrap samples) from the original dataset with replacement (row sampling), and a separate model is built from each sample.
• Random Forest: A tree-based technique that uses a large number of decision trees built on randomly selected subsets of features. Contrary to a single decision tree, it is hard to interpret, but its generally good performance makes it a popular algorithm.
• Boosting: The idea of boosting methods is to combine several weak learners to form a stronger one. The main ones are gradient boosting (e.g., XGBoost) and adaptive boosting (AdaBoost). A short comparison sketch follows this list.
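As a rough illustration (not the course demo), a scikit-learn sketch comparing a single decision tree with a bagging-based ensemble (Random Forest) and a boosting ensemble (AdaBoost) on assumed synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Single tree vs. bagging (Random Forest) vs. boosting (AdaBoost),
# compared with 5-fold cross-validation on toy data (assumed settings).
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
models = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "AdaBoost": AdaBoostClassifier(n_estimators=200, random_state=0),
}
for name, model in models.items():
    print(name, round(cross_val_score(model, X, y, cv=5).mean(), 3))
```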
Supervised Learning Algorithms: Decision Trees
Decision trees (also called Classification And Regression Trees "CART") are a
type of machine learning algorithm that makes decisions by splitting data
into subsets based on certain features. They are used for both classification
(assigning a label to an item) and regression (predicting a numerical value).
1. Structure: Imagine a flowchart where each node represents a decision
based on a feature, and the branches represent the possible outcomes
or further decisions.
2. How They Work:
1. Root Node: This is the starting point that contains the entire dataset; it splits on the feature that is most effective at separating the data.
2. Internal Nodes: These nodes split the data based on certain
conditions (e.g., if a value is greater than a certain threshold).
3. Leaf Nodes: These are the final nodes that do not split further.
They provide the final prediction or classification.
Supervised Learning Algorithms: Decision Trees
<Splitting Criteria>
Supervised Learning Algorithms: Decision Trees for Classification
• Gini Impurity:
• Gini impurity is a measure of how mixed the classes are in a given dataset or subset of data.
• For two classes it ranges from 0 to 0.5, where 0 indicates that a node contains only samples of a single class, and 0.5 indicates that the samples are evenly distributed among the classes.
• The decision tree algorithm aims to minimize the Gini impurity when making splits.
• Entropy:
• Entropy measures the disorder or uncertainty in a set of data.
• Like Gini impurity, it ranges from 0 to 1 (for two classes), with 0 indicating perfect order and 1 indicating maximum disorder.
• Decision trees aim to minimize entropy when making splits.
Supervised Learning Algorithms: Decision Trees for Classification
Training a decision tree consists of iteratively splitting the current data into two branches.
Say we had the following datapoints: how can we quantify the best split?
[Fig. 1. The dataset | Fig. 2. A perfect split (x=2) | Fig. 3. An imperfect split (x=1.5)]
Source: gini-impurity
Supervised Learning Algorithms: Decision Trees for Classification <Gini Impurity>
Example 1: The whole dataset (10 datapoints: 5 blues and 5 greens). Picking a datapoint at random, each class occurs with probability 0.5.
Gini impurity formula: for C classes with class probabilities p₁, …, p_C,
G = Σₖ pₖ (1 − pₖ) = 1 − Σₖ pₖ²
For the whole dataset: G = 1 − (0.5² + 0.5²) = 0.5.
Source: gini-impurity
Supervised Learning Algorithms: Decision Trees for Classification <Gini Impurity>
Example 2: A Perfect Split (x = 2)
The Left Branch has only blues, so its Gini Impurity is G = 1 − 1² = 0; the Right Branch has only greens, so its impurity is also 0.
Note: A Gini Impurity of 0 is the lowest and best possible impurity. It can only be achieved when everything is the same class (e.g., only blues or only greens).
Fig. A Perfect Split
Source: gini-impurity
Supervised Learning Algorithms: Decision Trees for Classification <Gini Impurity>
Example 3: An Imperfect Split (x = 1.5)
The Left Branch has 4 blues (Gini Impurity 0); the Right Branch has 1 blue and 5 greens, so its Gini Impurity is G = 1 − (1/6)² − (5/6)² ≈ 0.278.
Source: gini-impurity
Supervised Learning Algorithms: Decision Trees for Classification <Gini Impurity>
Picking The Best Split
We've already calculated the Gini Impurities for:
o Before the split (the entire dataset): 0.5
o Left Branch: 0
o Right Branch: 0.278
We'll determine the quality of the split by weighting the impurity of each branch by how many elements it has. Since the Left Branch has 4 elements and the Right Branch has 6, we get: (4/10) × 0 + (6/10) × 0.278 ≈ 0.167.
Thus, the amount of impurity we've "removed" with this split is: 0.5 − 0.167 = 0.333.
This value is called the Gini Gain, and it is what's used to pick the best split in a decision tree: higher Gini Gain = better split. For example, it's easy to verify that the Gini Gain of the perfect split on our dataset is 0.5 > 0.333.
Fig. An Imperfect Split
Source: gini-impurity
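The worked example above can be reproduced in a few lines; the helper function below is a sketch, with the blue/green counts taken from the slides:

```python
from collections import Counter

# Gini impurity and Gini gain for the toy dataset from the slides:
# 10 points (5 blues, 5 greens); the imperfect split x = 1.5 sends
# 4 blues to the left branch and 1 blue + 5 greens to the right branch.
def gini(labels):
    """Gini impurity: 1 - sum_k p_k^2."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

whole = ["blue"] * 5 + ["green"] * 5
left, right = ["blue"] * 4, ["blue"] + ["green"] * 5

weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(whole)
gini_gain = gini(whole) - weighted
print(round(gini(whole), 3), round(gini(right), 3), round(gini_gain, 3))
# -> 0.5, 0.278, 0.333
```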
Supervised Learning Algorithms: Decision Trees for Classification
• Information Gain:
• Information Gain is used to determine the best feature to split the data at each node in a decision tree.
• It quantifies how much information a feature gives us about the classes.
• Gini Gain:
• Gini Gain is similar to Information Gain, but it uses Gini impurity as the measure instead of entropy.
Supervised Learning Algorithms: Decision Trees for Regression
Supervised Learning Algorithms: Decision Trees for Regression
Variance Reduction:
• In regression tasks, the goal is often to minimize the variance of the target variable within each node.
• Variance reduction is used as the splitting criterion; it measures how much the variance of the target variable is reduced after the split.
• Given a node with a dataset D containing n samples, where yᵢ represents the target variable values, the variance reduction of a split of D into children D₁, …, D_m is calculated as:
VarReduction = Var(D) − Σⱼ (|Dⱼ| / n) · Var(Dⱼ),  with  Var(D) = (1/n) Σᵢ (yᵢ − ȳ)²
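A minimal sketch of variance reduction for one candidate split; the toy target values are assumed for illustration:

```python
import numpy as np

# Variance reduction for one candidate split of a regression node
# (toy target values, assumed for illustration).
def variance_reduction(y_parent, y_left, y_right):
    n = len(y_parent)
    weighted_child_var = (len(y_left) * np.var(y_left) + len(y_right) * np.var(y_right)) / n
    return np.var(y_parent) - weighted_child_var

y = np.array([1.0, 1.2, 0.9, 5.0, 5.2, 4.8])      # parent node targets
print(variance_reduction(y, y[:3], y[3:]))         # large reduction: the split separates the two groups
```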
Supervised Learning Algorithms: Decision Trees
<DEMO>
Supervised Learning Algorithms: Ensemble Bagging Methods
Source: towardsdatascience.com
Supervised Learning Algorithms: Ensemble Bagging Methods
The most popular ensemble bagging method is Random Forest, but there are others as
well. Here are some common ensemble bagging methods:
Random Forest: Random Forest builds multiple decision trees during training. Each tree
is trained on a random subset of the training data, and the final prediction is obtained
by averaging or taking a vote over the predictions of all the trees.
Bagged Decision Trees: This is a more generic term for ensemble methods that use
bagging with decision trees as the base model. Random Forest is a specific
implementation of bagged decision trees.
Bagged Support Vector Machines (SVM): Bagging can be applied to SVMs, where each
model is trained on a different subset of the training data. The final prediction is often
determined by averaging the predictions of the individual SVMs.
Bagged Neural Networks: Similar to other models, neural networks can also benefit from bagging. Multiple neural networks are trained on different subsets of the training data, and their predictions are combined to form the final output.
Bootstrap Aggregating (Bagging) in General: Bagging can be applied to various base models, including decision trees, support vector machines, and neural networks, among others. The idea is to create an ensemble of diverse models to reduce overfitting and improve generalization.
Source: towardsdatascience.com
Supervised Learning Algorithms: Ensemble Bagging Methods <Random Forest>
Bootstrap: The term "bootstrap" in statistics refers to a resampling
technique where subsets of the dataset are randomly sampled with
replacement. In other words, each data point has an equal chance of
being selected for a subset, and it can be selected more than once.
This process mimics the idea of repeatedly drawing samples from
the original dataset.
Aggregating: "Aggregating" refers to the process of combining or
averaging the predictions of multiple models to obtain a final
prediction or decision.
Therefore, "Bootstrap Aggregating," or "Bagging" for short, involves
the following steps:
Bootstrap: Randomly sample subsets of the training data with
replacement, creating multiple subsets that may overlap.
Aggregating: Train a separate model on each subset of the data,
and then combine their predictions through averaging or voting to
make a final prediction.
Source: spotfire.com
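To make the two steps concrete, here is a hand-rolled bagging sketch (bootstrap resampling plus majority-vote aggregation); in practice scikit-learn's BaggingClassifier or RandomForestClassifier would be used, and the toy data and number of trees here are assumptions:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Hand-rolled bagging to make the two steps explicit (assumed toy data and 25 trees).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
rng = np.random.default_rng(0)

trees = []
for _ in range(25):
    # Bootstrap: sample rows with replacement.
    idx = rng.integers(0, len(X), size=len(X))
    trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

# Aggregating: majority vote over the individual tree predictions.
votes = np.stack([t.predict(X) for t in trees])
bagged_pred = (votes.mean(axis=0) > 0.5).astype(int)
print("training accuracy of the bagged ensemble:", (bagged_pred == y).mean())
```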
Supervised Learning Algorithms: Ensemble Boosting Methods
• AdaBoost (Adaptive Boosting)
• XGBoost (Extreme Gradient Boosting)
• LightGBM
• CatBoost
Supervised Learning Algorithms: K-Nearest Neighbors
Supervised Learning Algorithms: K-Nearest Neighbors
1. Training Phase:
• Store the entire dataset in memory.
• No explicit training is done, as k-NN is a lazy learner.
2. Prediction Phase:
• When given a new, unseen data point, the algorithm finds the k
data points in the training set that are closest (most similar) to the
new point.
• The "closeness" is typically determined by a distance metric like
Euclidean distance, Manhattan distance, etc.
3. Classification:
• In the classification task, the algorithm assigns a class label to the new point based on the majority class among its k nearest neighbors (a from-scratch sketch follows this list).
[Fig. K-NN Algorithm Steps]
4. Regression:
• In regression, the algorithm predicts a continuous value for the new
point based on the average (or some other measure) of the target
values of its k nearest neighbors.
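A from-scratch sketch of the k-NN prediction phase described above (Euclidean distance plus majority vote); the toy 2-D data and k = 3 are assumptions:

```python
import numpy as np
from collections import Counter

# A from-scratch k-NN classifier following the steps above
# (Euclidean distance, majority vote; toy 2-D data, assumed k = 3).
def knn_predict(X_train, y_train, x_new, k=3):
    distances = np.linalg.norm(X_train - x_new, axis=1)   # "closeness" to every stored point
    nearest = np.argsort(distances)[:k]                   # indices of the k closest points
    return Counter(y_train[nearest]).most_common(1)[0][0] # majority class among them

X_train = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]])
y_train = np.array(["A", "A", "A", "B", "B", "B"])
print(knn_predict(X_train, y_train, np.array([2, 2])))    # -> "A"
print(knn_predict(X_train, y_train, np.array([7, 8])))    # -> "B"
```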
Supervised Learning Algorithms: K-Nearest Neighbors
1. Key Parameters:
• k: This is the number of neighbors to consider. It's a hyperparameter that you need to choose. A small k
might lead to noise in the prediction, while a large k might smooth out the decision boundaries.
• Distance Metric: This defines how the "closeness" of data points is calculated. Common options include
Euclidean distance, Manhattan distance, Minkowski distance, etc.
2. Advantages:
• Simple to understand and implement.
• No explicit training phase, making it computationally efficient during training.
• Can be used for both classification and regression tasks.
3. Disadvantages:
• Can be computationally expensive during prediction, especially with large datasets.
• Sensitive to the choice of distance metric and value of k.
• Doesn't learn underlying patterns in the data, which can lead to suboptimal performance in some cases.
4. Considerations:
• Scaling of features is crucial as k-NN is sensitive to the scale of the variables.
• It's important to choose an appropriate value of k through techniques like cross-validation.
Supervised Learning Algorithms: K-Nearest Neighbors
<Example>
Supervised Learning Algorithms: K-Nearest Neighbors
<DEMO>
Supervised Learning Algorithms: Commonly Used Regression Algorithms
Regression Algorithms:
1. Linear Regression: Predicts a continuous output variable based on the
input features by fitting a linear equation to the observed data.
2. Ridge Regression (L2 regularization): Similar to linear regression but
adds a penalty term to the coefficients to prevent overfitting.
3. Lasso Regression (L1 regularization): Similar to ridge regression, but
uses the absolute values of coefficients, which can lead to sparsity in the
model.
4. ElasticNet: A combination of L1 and L2 regularization that balances
between Ridge and Lasso regression.
5. Decision Tree Regression: Uses a decision tree to predict continuous
values.
6. Random Forest Regression: Ensemble method that uses multiple
decision trees to improve accuracy and control overfitting.
7. Gradient Boosting Regression: Builds an additive model in a forward
stage-wise manner, optimizing for the residual errors.
8. Support Vector Regression (SVR): Uses support vector machines to
perform regression.
Supervised Learning Algorithms: Commonly Used Classification Algorithms
Classification Algorithms:
1. Logistic Regression: Used for binary classification problems, models the probability that a given instance belongs
to a particular category.
2. K-Nearest Neighbors (KNN): Classifies data points based on the majority class of their k-nearest neighbors.
3. Support Vector Machines (SVM): Finds the hyperplane that best separates classes in a high-dimensional space.
4. Naive Bayes: Applies Bayes' theorem with the "naive" assumption that features are independent, commonly used
for text classification.
5. Decision Tree Classification: Divides the data into subsets based on the value of features to make categorical
predictions.
6. Random Forest Classification: Ensemble method that uses multiple decision trees for classification.
7. Gradient Boosting Classification: Builds an ensemble of weak learners (usually decision trees) in a forward
stage-wise manner.
8. Neural Networks (Deep Learning): Multi-layered networks of interconnected nodes used for complex
classification tasks.
9. XGBoost, LightGBM, CatBoost (Gradient Boosting variations): Highly optimized implementations of gradient
boosting algorithms.
10. AdaBoost: Combines multiple weak learners to create a strong learner.
Supervised Learning Algorithms: Classification and Regression Algorithms
Source: www.oreilly.com
Supervised Learning Algorithms: Parametric vs. Nonparametric
There are two main types of machine learning algorithms: parametric and nonparametric. Parametric algorithms summarize the data with a fixed number of parameters (e.g., linear and logistic regression), while nonparametric algorithms make fewer assumptions and let their complexity grow with the amount of data (e.g., k-NN, decision trees).
Supervised Learning: Evaluation Metrics
<Classification Metrics>
Confusion matrix: counts of True Positives (TP), False Positives (FP), True Negatives (TN), and False Negatives (FN).
• Accuracy = (TP + TN) / (TP + TN + FP + FN)
• Precision = TP / (TP + FP)
• Recall (Sensitivity, TPR) = TP / (TP + FN)
• Specificity = TN / (TN + FP)
• F1 Score = 2 · Precision · Recall / (Precision + Recall)
Supervised Learning: Evaluation Metrics
<Classification Metrics>
ROC: The receiver operating characteristic curve, also noted ROC, is the plot of TPR versus FPR obtained by varying the decision threshold.
AUC: The area under the receiver operating characteristic curve, also noted AUC or AUROC, is the area below the ROC curve, as shown in the figure.
Source: stanford.edu
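A short scikit-learn sketch computing the ROC curve and AUC for a classifier's predicted probabilities; the data and model are assumptions for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

# ROC curve and AUC for a logistic regression classifier (toy data, assumed settings).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

scores = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
fpr, tpr, thresholds = roc_curve(y_te, scores)   # TPR vs. FPR for each threshold
print("AUC:", roc_auc_score(y_te, scores))
```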
Supervised Learning: Evaluation Metrics
<Classification Metrics>
Example Scenario:
Let's say we have a binary classification problem for a medical test that
identifies a disease:
Out of 100 actual cases of the disease:
The test correctly identifies 80 as positive (True Positives).
The test incorrectly identifies 20 as negative (False Negatives).
Out of 100 non-cases (healthy individuals):
The test correctly identifies 90 as negative (True Negatives).
The test incorrectly identifies 10 as positive (False Positives).
Find the confusion matrix and compute the accuracy, recall, precision, and specificity! (A quick check in Python follows.)
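If you want to verify your answer afterwards, a minimal sketch with the counts from the scenario above:

```python
# Quick check of the exercise numbers (TP = 80, FN = 20, TN = 90, FP = 10).
TP, FN, TN, FP = 80, 20, 90, 10

accuracy    = (TP + TN) / (TP + TN + FP + FN)   # 0.85
recall      = TP / (TP + FN)                    # sensitivity / TPR = 0.80
precision   = TP / (TP + FP)                    # ~0.889
specificity = TN / (TN + FP)                    # 0.90
print(accuracy, recall, precision, specificity)
```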
Supervised Learning: Evaluation Metrics
<Regression Metrics>
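As a quick reference, a minimal scikit-learn sketch of common regression metrics (MAE, MSE, RMSE, R²) on assumed example predictions:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Common regression metrics on assumed example predictions.
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])
print("MAE:", mean_absolute_error(y_true, y_pred))
print("MSE:", mean_squared_error(y_true, y_pred))
print("RMSE:", np.sqrt(mean_squared_error(y_true, y_pred)))
print("R^2:", r2_score(y_true, y_pred))
```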
Unsupervised Learning Algorithms
Unsupervised Learning Algorithms: Clustering
Unsupervised Learning Algorithms: Clustering using K-Means Algorithm
K-Means alternates between two steps until the assignments stop changing: (1) assign each point to its nearest centroid, and (2) recompute each centroid as the mean of the points assigned to it.
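A minimal scikit-learn sketch of K-Means on toy blob data; the number of clusters and the synthetic data are assumptions:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# K-Means on toy blob data (assumed 3 clusters); the algorithm alternates between
# assigning each point to its nearest centroid and recomputing the centroids.
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("centroids:\n", kmeans.cluster_centers_)
print("inertia (within-cluster sum of squares):", kmeans.inertia_)
```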
Unsupervised Learning Algorithms: Clustering using K-Means Algorithm <Visualization>
Source: Wikipedia
Machine Learning: Important Concepts
<Overfitting and Underfitting>
Machine Learning: Important Concepts
<Bias-Variance Tradeoff>
• What is Bias:
• Error between average model prediction
and ground truth
• The bias of the estimated function tells us
the capacity of the underlying model to predict
the values
• High Bias: Overly-simplified model, Underfitting, and
High error on both train and test sets
• What is Variance?
• Average variability in the model prediction for the
given dataset
• The variance of the estimated function tells you
how much the function can adjust to the change in
the dataset
• High Variance: Overly-complex model, Overfitting, Low
error on train data and high on test, Starts modelling the
noise in the input.
Machine Learning: Important Concepts
<K-Fold Cross Validation>
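A minimal 5-fold cross-validation sketch with scikit-learn; the model and synthetic data are assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# 5-fold cross-validation: the data is split into 5 folds, each fold serves
# once as the validation set while the other 4 are used for training.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print("per-fold accuracy:", scores.round(3), "mean:", scores.mean().round(3))
```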
Machine Learning: Important Concepts
<Hyper-Parameters Optimization/Tuning>
Hyper-parameters (e.g., k in k-NN, tree depth, the learning rate) are set before training rather than learned from the data; they are typically tuned by searching over candidate values (e.g., grid search or random search) and scoring each candidate with a validation procedure such as cross-validation.
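A minimal grid-search sketch with scikit-learn's GridSearchCV; the model, parameter grid, and synthetic data are assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Grid search over a small hyper-parameter grid, scored by 5-fold cross-validation
# (assumed grid; in practice the grid and model come from the problem at hand).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
grid = {"n_estimators": [50, 200], "max_depth": [3, None]}
search = GridSearchCV(RandomForestClassifier(random_state=0), grid, cv=5).fit(X, y)
print("best params:", search.best_params_, "best CV accuracy:", round(search.best_score_, 3))
```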
Automated Machine Learning (AutoML)
Automated Machine Learning (AutoML)
<Some Frameworks>
Automated Machine Learning (AutoML)
<DEMO>
DEMO: Session 1 – Automated Machine Learning (AutoML)
Automated Machine Learning (AutoML)
<AutoML vs. Data Scientists>
Bonus: Advice For Machine Learning Beginners
Advice for machine learning beginners | Andrej Karpathy and Lex Fridman