Important concepts needed for implementing an ML model
Chapter 7
Machine Learning Steps
The task of imparting intelligence to machines seems daunting, even impossible. But it becomes much more manageable once it is broken down into 7 major steps:
1. Collecting the Data:
• As you know, machines initially learn from the data that you give them.
• It is of the utmost importance to collect reliable data so that your
machine learning model can find the correct patterns.
• Good data is relevant, contains very few missing and repeated values, and
has a good representation of the various subcategories/classes present.
2. Preparing the Data
• After you have your data, you have to prepare it. You can do this by:
• Cleaning the data to remove unwanted data points, missing values, unneeded rows and columns, duplicate values, etc.
• Visualizing the data to understand how it is structured and the relationships between the various variables and classes present.
• Splitting the cleaned data into two sets: a training set and a testing set. The training set is the set your model learns from; the testing set is used to
check the accuracy of your model after training.
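A minimal sketch of this split using scikit-learn (assuming the cleaned data is already loaded into a feature table X and a label column y; the names and the 80/20 split are illustrative):

# Assumes X (features) and y (labels) hold the cleaned data.
from sklearn.model_selection import train_test_split

# Hold out 20% of the rows as the testing set; the remaining 80% is the training set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)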
3. Choosing a Model:
• A machine learning model determines the output you get after running a machine learning algorithm on the
collected data.
• It is important to choose a model which is relevant to the task at hand.

4. Training the Model:
• Training is the most important step in machine learning.
• In training, you pass the prepared data to your machine learning model so that it can find patterns and make predictions.
• The result is a model that has learned from the data and can accomplish the task it was set.
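As an illustration, choosing and training a model with scikit-learn might look like the sketch below (logistic regression is just one possible choice of model; X_train and y_train come from the earlier split):

from sklearn.linear_model import LogisticRegression

# Choose a model relevant to the task (here, a simple binary classifier).
model = LogisticRegression(max_iter=1000)

# Training: the model searches for patterns in the prepared training data.
model.fit(X_train, y_train)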
5. Evaluating the Model:
• After training your model, you have to check to see how it’s
performing.
• This is done by testing the performance of the model on previously
unseen data.
• The unseen data used is the testing set that you split your data into earlier.
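Continuing the same hypothetical sketch, evaluation means scoring the trained model on the held-out testing set:

from sklearn.metrics import accuracy_score

# Predict on data the model never saw during training.
y_pred = model.predict(X_test)

# Compare the predictions with the true labels of the testing set.
print("Test accuracy:", accuracy_score(y_test, y_pred))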
6. Parameter Tuning:

• Once you have created and evaluated your model, see if its accuracy can be improved in any way.
• This is done by tuning the parameters present in your model. These are the variables, often called hyperparameters, whose values the programmer generally decides rather than learns from the data.
• At particular values of these parameters, the accuracy will be at its maximum. Parameter tuning refers to finding those values.
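One common way to search for such values is a grid search over candidate settings, as in this sketch (the grid of C values for logistic regression is purely illustrative):

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Candidate values for the regularization strength C (illustrative choices).
param_grid = {"C": [0.01, 0.1, 1, 10]}

# Each candidate is evaluated with 5-fold cross-validation; the best one is kept.
search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5)
search.fit(X_train, y_train)
print("Best parameters:", search.best_params_)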
7. Making Predictions

In the end, you can use your model on unseen data to make accurate predictions.
Overfitting vs. Underfitting
• Let’s say we want to predict if a student will land a job interview based
on her resume.
• Now, assume we train a model from a dataset of 10,000 resumes and
their outcomes.
• Next, we try the model out on the original dataset, and it predicts
outcomes with 99% accuracy… wow!
• But now comes the bad news.
• When we run the model on a new (“unseen”) dataset of resumes, we only get 50% accuracy… uh-
oh!
• Our model doesn’t generalize well from our training data to unseen data.
• This is known as overfitting, and it’s a common problem in machine learning and data science.
• We can understand overfitting better by looking at the opposite
problem, underfitting.
• Underfitting occurs when a model is too simple – informed by too few
features or regularized too much – which makes it inflexible in
learning from the dataset.
How to Prevent Overfitting in Machine Learning

• Detecting overfitting is useful, but it doesn’t solve the problem. Fortunately, you have several options to try.
• Here are a few of the most popular solutions for overfitting:
Cross-validation

• Cross-validation is a powerful preventative measure against overfitting.
• The idea is clever: use your initial training data to generate multiple mini train-test splits, and use these splits to tune your model.
• In standard k-fold cross-validation, we partition the data into k
subsets, called folds. Then, we iteratively train the algorithm on k-1
folds while using the remaining fold as the test set (called the
“holdout fold”).
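A sketch of k-fold cross-validation with scikit-learn, with k = 5 (model, X_train and y_train are assumed from the earlier sketches):

from sklearn.model_selection import cross_val_score

# Train on 4 folds and score on the held-out fold, repeated 5 times.
scores = cross_val_score(model, X_train, y_train, cv=5)
print("Fold accuracies:", scores)
print("Mean accuracy:", scores.mean())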
Train with more data

• It won’t work every time, but training with more data can help
algorithms detect the signal better.

Remove features
Some algorithms have built-in feature selection.
For those that don’t, you can manually improve their generalizability by
removing irrelevant input features.
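As a sketch, one simple way to drop weakly related features with scikit-learn is univariate selection; keeping the 10 best features is an arbitrary illustrative choice:

from sklearn.feature_selection import SelectKBest, f_classif

# Keep only the 10 features with the strongest statistical relationship to the labels.
selector = SelectKBest(score_func=f_classif, k=10)
X_train_reduced = selector.fit_transform(X_train, y_train)
X_test_reduced = selector.transform(X_test)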
Evaluate a classification model

• After doing the usual feature engineering and selection, and of course implementing a model and getting some output in the form of a probability or a class, the next step is to find out how effective the model is, based on some metric computed on the test dataset.
Different metrics:

• Confusion Matrix
• Accuracy
• Precision
• Recall or Sensitivity
• Specificity
• F1 Score
Confusion Matrix
Just opposite to what the name suggests, the confusion matrix is one of the most intuitive and easiest metrics used for finding the correctness and accuracy of a model.

It is used for classification problems where the output can be of two or more classes.

The confusion matrix is not a performance measure as such, but a lot of the performance metrics are based on the confusion matrix and the numbers inside it.
• Let’s say we are solving a classification problem
where we are predicting whether a person is
having cancer or not.
• Let’s assign labels to our target variable:
• 1: when a person has cancer
• 0: when a person does NOT have cancer.
• Alright! Now that we have identified the problem, the confusion matrix is a table with two dimensions (“Actual” and “Predicted”) and a set of “classes” in each dimension.
• Our Actual classifications are columns and Predicted ones are rows.
True Positives (TP) - True positives are the cases when the actual class of the data point was 1 (True) and the predicted class is also 1 (True).
Ex: The case where a person actually has cancer (1) and the model classifies his case as cancer (1) comes under True Positives.
True Negatives (TN) - True negatives are the cases when the actual class of the data point was 0 (False) and the predicted class is also 0 (False).
Ex: The case where a person does NOT have cancer and the model classifies his case as not cancer comes under True Negatives.
• False Positives (FP) - False positives are the cases when the actual class of the data point was 0 (False) and the predicted class is 1 (True). False because the model has predicted incorrectly, and positive because the class predicted was the positive one (1).
• Ex: A person who does NOT have cancer being classified by the model as having cancer comes under False Positives.

• False Negatives (FN) - False negatives are the cases when the actual class of the data point was 1 (True) and the predicted class is 0 (False). False because the model has predicted incorrectly, and negative because the class predicted was the negative one (0).
• Ex: A person who has cancer being classified by the model as not having cancer comes under False Negatives.

• The ideal scenario that we all want is that the model gives 0 False Positives and 0 False Negatives. But that is rarely the case in real life, as no model is 100% accurate.
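These four counts can be read off a confusion matrix computed with scikit-learn, as in the sketch below (y_test and y_pred are assumed from the earlier sketches; note that scikit-learn puts actual classes on the rows and predicted classes on the columns, the transpose of the layout described above):

from sklearn.metrics import confusion_matrix

# For a binary 0/1 problem, ravel() returns the counts in the order TN, FP, FN, TP.
cm = confusion_matrix(y_test, y_pred, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()
print("TP:", tp, "TN:", tn, "FP:", fp, "FN:", fn)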
When to minimize what?
• We know that there will be some error associated with every model that we use
for predicting the true class of the target variable. This will result in False Positives
and False Negatives

• There’s no hard rule that says what should be minimized in all the situations. It
purely depends on the business needs and the context of the problem you are
trying to solve. Based on that, we might want to minimize either False Positives or
False negatives.
Minimizing False Negatives
• In the cancer example, a person who does NOT have cancer might end up being classified as cancerous. This might be okay, as it is less dangerous than NOT identifying/capturing a cancerous patient, since we will anyway send the suspected cancer cases for further examination and reports. But missing a cancer patient would be a huge mistake, as no further examination would be done on them.
Minimizing False Positives
• For a better understanding of False Positives, let’s use a different example, where the model classifies whether an email is spam or not.
• Let’s say that you are expecting an important email, like hearing back from a recruiter or awaiting an admit letter from a university. Let’s assign labels to the target variable and say, 1: “Email is spam” and 0: “Email is not spam”.
• Suppose the model classifies that important email you are desperately waiting for as spam (a case of a False Positive). So in the case of spam email classification, minimising False Positives is more important than minimising False Negatives.
Accuracy
• Accuracy in classification problems is the number of correct predictions made by the model divided by the total number of predictions made.
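In terms of the confusion-matrix counts defined earlier:

Accuracy = (TP + TN) / (TP + TN + FP + FN)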
Precision
• Precision is a measure that tells us what proportion of patients that
we diagnosed as having cancer, actually had cancer.
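In terms of the confusion-matrix counts:

Precision = TP / (TP + FP)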
Recall or Sensitivity
• Recall is a measure that tells us what proportion of patients
that actually had cancer was diagnosed by the algorithm as
having cancer.
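In terms of the confusion-matrix counts:

Recall = TP / (TP + FN)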
• So basically, if we want to focus more on minimising False Negatives, we want our Recall to be as close to 100% as possible without Precision being too bad; and if we want to focus on minimising False Positives, then our focus should be on making Precision as close to 100% as possible.
F-1 Score
• We don’t really want to carry both Precision and Recall in our pockets
every time we make a model for solving a classification problem. So
it’s best if we can get a single score that kind of represents both
Precision(P) and Recall(R).

F1 Score = Harmonic Mean(Precision, Recall)

F1 Score = 2 * Precision * Recall / (Precision + Recall)
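For example, a model with Precision = 1.0 but Recall = 0.2 gets F1 = 2 * 1.0 * 0.2 / (1.0 + 0.2) ≈ 0.33, so the harmonic mean penalises the weaker of the two scores instead of letting the stronger one hide it.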


Specificity
• Specificity is a measure that tells us what proportion of patients that
did NOT have cancer, were predicted by the model as non-cancerous.
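In terms of the confusion-matrix counts, Specificity = TN / (TN + FP). As a sketch, all of the metrics above can be computed directly from the four counts; the numbers below are made up purely for illustration and are not the values of the exercise that follows:

# Hypothetical counts, for illustration only.
tp, tn, fp, fn = 40, 120, 25, 15

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)               # also called sensitivity
specificity = tn / (tn + fp)
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, specificity, f1)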
Assume that our test set for the Corona rapid tests includes 200 individuals, broken down
by cell as follows:

Calculate the values of the following metrics based on the confusion matrix given.
[1.25*5=6.25]
a) Accuracy
b) Precision
c) Recall
d) F-1 Score
e) Specificity
Suggest, as a data scientist, the value in the confusion matrix that you would like to reduce in order to create a better classifier.
