Supervised
Machine Learning
2 of 3 modules
Supervised Learning
Data includes both the input and the desired results.
Training and Test
Sets
Resampling
Imbalanced
Datasets
Resampling + Synthesis of artificial data
SMOTE – Synthetic Minority Oversampling Technique
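As a rough illustration of the SMOTE idea (not the full algorithm), new minority-class samples can be synthesized by interpolating between a minority point and its nearest minority neighbour; `smote_sketch` is a hypothetical helper name used here for illustration:

```python
import numpy as np

def smote_sketch(X_minority, n_new, rng=None):
    """Simplified SMOTE-style oversampling: interpolate between a
    minority sample and its nearest minority neighbour."""
    rng = np.random.default_rng(rng)
    X_minority = np.asarray(X_minority, dtype=float)
    new_points = []
    for _ in range(n_new):
        i = rng.integers(len(X_minority))
        x = X_minority[i]
        # nearest neighbour among the other minority points
        others = np.delete(X_minority, i, axis=0)
        j = np.argmin(np.linalg.norm(others - x, axis=1))
        neighbour = others[j]
        # place a synthetic point a random fraction of the way to the neighbour
        new_points.append(x + rng.random() * (neighbour - x))
    return np.array(new_points)
```

In practice one would use a library implementation (e.g. imbalanced-learn's `SMOTE`), which also considers k nearest neighbours rather than just one.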
Ensemble (combined)
Models
Linear Regression
Getting our line straight!
Introduction to Regression Analysis
Regression analysis is used to:
Predict the value of a dependent variable based on the value of at least one
independent variable
Explain the impact of changes in an independent variable on the dependent
variable
• Dependent variable:
The variable we wish to predict or explain
• Independent variable:
The variable used to explain the dependent variable
Simple Linear Regression Model
• Only one independent variable, X.
• The relationship between X and Y is described by a linear function.
• Changes in Y are assumed to be caused by changes in X.
Multiple Linear Regression Model
• More than one independent variable, X.
• The relationship between X and Y is described by a linear function.
• Changes in Y are assumed to be caused by changes in X.
Types of Relationships
Linear relationships vs. curvilinear relationships
[Scatter plots of Y against X illustrating linear and curvilinear patterns]
Types of Relationships
Strong relationships vs. weak relationships
[Scatter plots of Y against X illustrating strong and weak patterns]
Types of Relationships
No relationship
[Scatter plot of Y against X with no visible pattern]
Simple Linear Regression Model
Yi = b + M·Xi + εi
• Yi – dependent variable
• b – population Y-intercept
• M – population slope coefficient
• Xi – independent variable
• εi – random error term
b + M·Xi is the linear component; εi is the random error component.
Simple Linear Regression Model – Errors
[Plot of the line Yi = b + M·Xi + εi, with intercept b and slope M: for a given Xi, the random error εi is the gap between the observed value of Y and the predicted value of Y on the line]
Interpretation of the Slope and the Intercept
• b is the estimated average value of Y when the value of X is zero
• M is the estimated change in the average value of Y as a result of a
one-unit change in X
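A minimal sketch of fitting such a line with NumPy least squares; the data points here are made up purely for illustration:

```python
import numpy as np

# Fit Y = b + M*X by ordinary least squares on a small made-up dataset.
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([3.1, 4.9, 7.2, 8.8, 11.0])

M, b = np.polyfit(X, Y, deg=1)   # returns slope first, then intercept
print(f"intercept b = {b:.2f}, slope M = {M:.2f}")
# -> intercept b = 1.09, slope M = 1.97
# b: estimated average Y when X is zero
# M: estimated change in average Y per one-unit change in X
```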
How do we determine if our
Regression model is doing well or not?
Performance Metrics (Regression)
• Mean Absolute Error – the average of the absolute
differences between predictions and actual values.
• Mean Squared Error – the average of the squares of the
errors, that is, the average squared difference between the
estimated values and the actual values.
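Both metrics are one-liners in NumPy; the prediction and actual values below are hypothetical numbers for illustration:

```python
import numpy as np

# Hypothetical predictions vs. actual values.
actual = np.array([3.0, 5.0, 7.0, 9.0])
predicted = np.array([2.5, 5.0, 8.0, 8.5])

mae = np.mean(np.abs(predicted - actual))  # Mean Absolute Error
mse = np.mean((predicted - actual) ** 2)   # Mean Squared Error
print(mae, mse)  # 0.5 0.375
```

Note that MSE penalizes large errors more heavily than MAE because the errors are squared.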
Let’s dive straight to the Hands-on
using Jupyter notebooks
Feature
Engineering
Improving the model!
Feature engineering
• The first thing we need to do when creating a machine
learning model is to decide what to use as features.
• Features are key to a model: the pieces of information that we take
from the data and give to the algorithm so it can work its magic,
such as a person’s name or favorite color.
• E.g., if we do classification on health, some features could
be a person’s height, weight, gender, and so on.
• We would exclude things that may be known but aren’t useful.
Benefits of Feature Engineering
• Reduces Overfitting : Less redundant data means less
opportunity to make decisions based on noise.
• Improves Accuracy : Less misleading data means modeling
accuracy improves.
• Reduces Training Time : Fewer data points reduce
algorithm complexity and algorithms train faster.
Techniques of Feature
Engineering
• Introducing interaction terms
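An interaction term is simply a new feature built from the product of two existing features; the health-style numbers below are hypothetical:

```python
import numpy as np

# Two hypothetical features per row: height (m) and weight (kg).
X = np.array([[1.70, 65.0],
              [1.80, 80.0]])

# Interaction term: height * weight, added as a third column.
interaction = (X[:, 0] * X[:, 1]).reshape(-1, 1)
X_new = np.hstack([X, interaction])   # original features + interaction
print(X_new.shape)  # (2, 3)
```

Libraries such as scikit-learn can generate these automatically (e.g. `PolynomialFeatures` with `interaction_only=True`).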
Let’s dive straight to the Hands-on
using Jupyter notebooks
Logistic Regression
What is it and what is the algorithm?
What is the difference
between Linear Regression
& Logistic Regression?
Recap: What is linear
regression?
• Linear regression quantifies the relationship
between one or more predictor variables and
one outcome variable.
• For example, linear regression can be used to
quantify the relative impacts of age, gender, and
diet (the predictor variables) on height (the
outcome variable).
Recap: Example
Sales = 168 + 23 × Advertising
Example – Logistic Regression – Scoring Goals!
• Suppose we are kicking a soccer ball from a variety of distances.
• The result is going to be only Goal or No Goal.
• Our standard linear regression will not work in this scenario!
Nominal and Ordinal scales (good to know!)
Nominal
• Nominal scales are used for labeling variables, without
any quantitative value. “Nominal” scales could simply be called
“labels.”
• E.g., Male/Female, Red/Green/Yellow
Ordinal
• With ordinal scales, the order of the values is what’s important
and significant, but the differences between each one are not really
known.
• E.g., Good, Very good, Excellent, Fantastic – #1, #2, #3, #4
What is logistic regression?
• Logistic regression is the appropriate regression
analysis to conduct when the dependent
variable is binary.
• Like all regression analyses, the logistic
regression is a predictive analysis.
• Logistic regression is used to describe data and
to explain the relationship between one
dependent binary variable and one or more
nominal, ordinal, interval or ratio-level
independent variables.
The Sigmoid function
• We apply sigmoid function on the linear regression equation.
• By doing so, we push our straight line into an S shape, or Sigmoid
Curve.
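The sigmoid function squashes any real-valued output z of the linear part into the interval (0, 1), so it can be read as a probability:

```python
import numpy as np

def sigmoid(z):
    """Sigmoid: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0.0))    # 0.5 (the decision midpoint)
print(sigmoid(-10.0))  # close to 0
print(sigmoid(10.0))   # close to 1
```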
Model Evaluation
Model evaluation is an integral part
of the model development process.
It helps to find the best model that
represents our data and how well the
chosen model will work in the future.
Performance Metrics (Classification)
Confusion Matrix Accuracy Precision and
Recall
How do you evaluate classifiers?
Accuracy!
Confusion Matrix
It is a performance measurement for machine learning classification
problems where the output can be two or more classes.
It is a table with 4 different combinations of predicted and actual
values (in the binary case).
So how can we use the metrics?
Say we have 2 confusion matrices from 2 models:

Logistic Regression                  SVM
                 Actual Class                         Actual Class
                  +     -                              +     -
Predicted  +      8     1            Predicted  +     10    10
Class      -      2    89            Class      -      0    80

We can compare them!

                                     Logistic Regression    SVM
Accuracy  (TP+TN)/(TP+TN+FP+FN)             97%             90%
Precision TP/(TP+FP)                        89%             50%
Recall    TP/(TP+FN)                        80%            100%
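The three formulas can be checked directly against the counts on this slide (Logistic Regression: TP=8, FP=1, FN=2, TN=89; SVM: TP=10, FP=10, FN=0, TN=80):

```python
def metrics(tp, fp, fn, tn):
    """Accuracy, precision and recall from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return accuracy, precision, recall

print(metrics(tp=8, fp=1, fn=2, tn=89))    # Logistic Regression: ~97%, ~89%, 80%
print(metrics(tp=10, fp=10, fn=0, tn=80))  # SVM: 90%, 50%, 100%
```

This makes the trade-off concrete: the SVM catches every positive (recall 100%) but half of its positive predictions are wrong (precision 50%).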
Precision and Recall
Precision attempts to answer the following question:
What proportion of positive identifications was correct?
Recall attempts to answer the following question:
What proportion of actual positives was identified correctly?
Decision Trees
Decision tree learning
is one of the most
widely used techniques
for classification.
Introduction
The classification
model is a tree, called
a decision tree.
A decision tree can be converted to a set of rules
How do we do our tree split?
• Build the tree split by split.
• Find the best split you can at each step.
• This best-split strategy is also known as Greedy Search.
• We can put a number to our splitting step with the Gini Index.
Gini Index
Gini = 1 − Σi pi²
• Where pi is the probability of an object being classified to a
particular class.
• While building the decision tree, we would prefer choosing the
attribute/feature with the least Gini index as the root node.
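The Gini index of a node follows directly from the class counts in that node, as this small sketch shows:

```python
def gini(counts):
    """Gini index 1 - sum(p_i^2) from per-class counts in a node."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

print(gini([5, 5]))   # 0.5 - maximally mixed two-class node
print(gini([10, 0]))  # 0.0 - pure node, nothing left to split
```

A pure node scores 0, so a split that produces purer children lowers the Gini index, which is exactly what the greedy search prefers.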
Each inner node is a decision based on a feature
Each leaf node is a class label
Predicting Titanic Survivors
Is sex male?
├─ No → Survived (0.73, 36%)
└─ Yes → Is age > 9.5?
    ├─ Yes → Died (0.17, 61%)
    └─ No → Is sibsp > 2.5?
        ├─ Yes → Died (0.05, 2%)
        └─ No → Survived (0.89, 2%)
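The Titanic tree is small enough to hand-code, which makes the structure explicit: inner nodes test a feature, leaves return a class label (the thresholds come from the slide):

```python
def predict_titanic(sex, age, sibsp):
    """Hand-coded version of the Titanic decision tree above."""
    if sex != "male":
        return "survived"          # leaf: females mostly survived
    if age > 9.5:
        return "died"              # leaf: adult males mostly died
    if sibsp > 2.5:
        return "died"              # leaf: young boys with many siblings
    return "survived"              # leaf: young boys with few siblings

print(predict_titanic("female", 30, 0))  # survived
print(predict_titanic("male", 30, 0))    # died
print(predict_titanic("male", 5, 0))     # survived
```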
Strengths of decision tree methods
• Generate understandable rules.
• Perform classification without requiring much computation.
• Able to handle both continuous and categorical variables.
• Provide a clear indication of which fields are most important
for prediction or classification.
• Natural multiclass classifier.
Weaknesses of decision tree methods
• Less appropriate for estimation tasks where the goal is to predict
the value of a continuous attribute.
• Prone to errors in classification problems with many classes and a
relatively small number of training examples.
• Computationally expensive to train: at each node, each candidate
splitting field must be sorted before its best split can be found,
which makes growing a decision tree expensive.
• Small changes in input data can result in totally different trees.
• Can make mistakes with unbalanced classes.
Support Vector
Machines
What are SVMs?
• SVMs are linear or non-linear classifiers that
find a hyperplane to separate two classes of data,
positive and negative.
• SVM not only has a rigorous theoretical
foundation, but also performs classification
more accurately than most other methods in
applications, especially for high-dimensional
data.
Support Vector Machine (SVM)
[Plots: patient status after 5 yr. (0 or 1) against number of positive nodes, with candidate decision boundaries]
• Find the best boundary that separates the two classes.
• A bad boundary: 3 misclassifications, accuracy 67%.
• A better boundary: one misclassification, accuracy 89%.
• Another boundary: accuracy 78%.
• Several different boundaries all reach accuracy 100%, so which one is best?
• The margin: no man’s land between the two classes.
2 features: number of positive nodes, age
2 labels: Survived / Lost
[Scatter plots of age against number of positive nodes, with candidate separating lines]
Find the line that separates the classes best.
3 features: find the best boundary plane
(more features: a hyperplane)
What is a hyperplane?
• The hyperplane that separates positive and negative training
data is
〈w ⋅ x〉 + b = 0
• It is also called the decision boundary (surface).
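The decision rule that follows from 〈w ⋅ x〉 + b = 0 is simple: points with a positive value lie on one side of the hyperplane, points with a negative value on the other. The weights and bias below are hypothetical, chosen only to illustrate the rule:

```python
import numpy as np

w = np.array([1.0, -1.0])  # hypothetical learned weight vector
b = 0.5                    # hypothetical learned bias

def classify(x):
    """Sign of <w, x> + b decides the predicted class."""
    return 1 if np.dot(w, x) + b > 0 else -1

print(classify(np.array([2.0, 0.0])))  # 1  (2 - 0 + 0.5 > 0)
print(classify(np.array([0.0, 2.0])))  # -1 (0 - 2 + 0.5 < 0)
```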
How to choose the
best hyperplane?
• SVM looks for the
separating hyperplane with
the largest margin.
• Machine learning theory
says this hyperplane
minimizes the error bound.
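A minimal sketch with scikit-learn (assuming it is available in the hands-on environment, as in the Jupyter sessions); the toy data points are made up and clearly separable:

```python
import numpy as np
from sklearn.svm import SVC

# Toy, clearly separable two-class data (hypothetical numbers).
X = np.array([[0, 0], [0, 1], [1, 0],
              [3, 3], [3, 4], [4, 3]])
y = np.array([0, 0, 0, 1, 1, 1])

# A linear SVM finds the separating hyperplane with the largest margin.
clf = SVC(kernel="linear")
clf.fit(X, y)

print(clf.score(X, y))       # training accuracy on this separable data
print(clf.support_vectors_)  # the points that define the margin
```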
Pros
• Accuracy.
• Works well on smaller, cleaner datasets.
• Can be more efficient because it uses a subset of training points.
Cons
• Not suited to larger datasets, as the training time with SVMs can be high.
• Less effective on noisier datasets with overlapping classes.
What have you learned?
4/5/2023 78
Thank you!
I welcome your questions.