Important concepts needed for implementing an ML model
Chapter 7
Machine Learning Steps
The task of imparting intelligence to machines seems daunting, even impossible. But it becomes much more manageable once it is broken down into 7 major steps:
1. Collecting the Data:
• As you know, machines initially learn from the data that you give them.
• It is of the utmost importance to collect reliable data so that your
machine learning model can find the correct patterns.
• Good data is relevant, contains very few missing and repeated values, and
has a good representation of the various subcategories/classes present.
2. Preparing the Data
• After you have your data, you have to prepare it. You can do this by:
• Cleaning the data to remove unwanted data points, missing values, unneeded rows and columns, duplicate values, etc.
• Visualizing the data to understand how it is structured and the relationships between the various variables and classes present.
• Splitting the cleaned data into two sets: a training set and a testing set. The training set is the set your model learns from; the testing set is used to
check the accuracy of your model after training.
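A minimal sketch of this split using scikit-learn (assuming the cleaned data is already loaded into a feature table X and a label column y; the names and the 80/20 split are illustrative):

# Assumes X (features) and y (labels) hold the cleaned data.
from sklearn.model_selection import train_test_split

# Hold out 20% of the rows as the testing set; the remaining 80% is the training set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)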
3. Choosing a Model:
• A machine learning model determines the output you get after running a machine learning algorithm on the
collected data.
• It is important to choose a model which is relevant to the task at hand.

4. Training the Model:
• Training is the most important step in machine learning.
• In training, you pass the prepared data to your machine learning model so that it can find patterns and make predictions.
• The result is a model that has learned from the data and can accomplish the task it was set.
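As an illustration, choosing and training a model with scikit-learn might look like the sketch below (logistic regression is just one possible choice of model; X_train and y_train come from the earlier split):

from sklearn.linear_model import LogisticRegression

# Choose a model relevant to the task (here, a simple binary classifier).
model = LogisticRegression(max_iter=1000)

# Training: the model searches for patterns in the prepared training data.
model.fit(X_train, y_train)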
5. Evaluating the Model:
• After training your model, you have to check to see how it’s
performing.
• This is done by testing the performance of the model on previously
unseen data.
• The unseen data used is the testing set that you split your data into earlier.
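Continuing the same hypothetical sketch, evaluation means scoring the trained model on the held-out testing set:

from sklearn.metrics import accuracy_score

# Predict on data the model never saw during training.
y_pred = model.predict(X_test)

# Compare the predictions with the true labels of the testing set.
print("Test accuracy:", accuracy_score(y_test, y_pred))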
6. Parameter Tuning:

• Once you have created and evaluated your model, see if its accuracy can be improved in any way.
• This is done by tuning the parameters present in your model. These are the variables, often called hyperparameters, whose values the programmer generally decides rather than learns from the data.
• At particular values of these parameters, the accuracy will be at its maximum. Parameter tuning refers to finding those values.
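One common way to search for such values is a grid search over candidate settings, as in this sketch (the grid of C values for logistic regression is purely illustrative):

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Candidate values for the regularization strength C (illustrative choices).
param_grid = {"C": [0.01, 0.1, 1, 10]}

# Each candidate is evaluated with 5-fold cross-validation; the best one is kept.
search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5)
search.fit(X_train, y_train)
print("Best parameters:", search.best_params_)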
7. Making Predictions

In the end, you can use your model on unseen data to make accurate predictions.
Overfitting vs. Underfitting
• Let’s say we want to predict if a student will land a job interview based
on her resume.
• Now, assume we train a model from a dataset of 10,000 resumes and
their outcomes.
• Next, we try the model out on the original dataset, and it predicts
outcomes with 99% accuracy… wow!
• But now comes the bad news.
• When we run the model on a new (“unseen”) dataset of resumes, we only get 50% accuracy… uh-
oh!
• Our model doesn’t generalize well from our training data to unseen data.
• This is known as overfitting, and it’s a common problem in machine learning and data science.
• We can understand overfitting better by looking at the opposite
problem, underfitting.
• Underfitting occurs when a model is too simple – informed by too few
features or regularized too much – which makes it inflexible in
learning from the dataset.
How to Prevent Overfitting in Machine Learning

• Detecting overfitting is useful, but it doesn’t solve the problem. Fortunately, you have several options to try.
• Here are a few of the most popular solutions for overfitting:
Cross-validation

• Cross-validation is a powerful preventative measure against overfitting.
• The idea is clever: use your initial training data to generate multiple mini train-test splits, and use these splits to tune your model.
• In standard k-fold cross-validation, we partition the data into k
subsets, called folds. Then, we iteratively train the algorithm on k-1
folds while using the remaining fold as the test set (called the
“holdout fold”).
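A sketch of k-fold cross-validation with scikit-learn, with k = 5 (model, X_train and y_train are assumed from the earlier sketches):

from sklearn.model_selection import cross_val_score

# Train on 4 folds and score on the held-out fold, repeated 5 times.
scores = cross_val_score(model, X_train, y_train, cv=5)
print("Fold accuracies:", scores)
print("Mean accuracy:", scores.mean())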
Train with more data

• It won’t work every time, but training with more data can help
algorithms detect the signal better.

Remove features
Some algorithms have built-in feature selection.
For those that don’t, you can manually improve their generalizability by
removing irrelevant input features.
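As a sketch, one simple way to drop weakly related features with scikit-learn is univariate selection; keeping the 10 best features is an arbitrary illustrative choice:

from sklearn.feature_selection import SelectKBest, f_classif

# Keep only the 10 features with the strongest statistical relationship to the labels.
selector = SelectKBest(score_func=f_classif, k=10)
X_train_reduced = selector.fit_transform(X_train, y_train)
X_test_reduced = selector.transform(X_test)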
Evaluate a classification model

• After doing the usual feature engineering and selection, and of course implementing a model and getting some output in the form of a probability or a class, the next step is to find out how effective the model is, based on some metric computed on the test dataset.
Different metrics:

• Confusion Matrix
• Accuracy
• Precision
• Recall or Sensitivity
• Specificity
• F1 Score
Confusion Matrix
Just opposite to what the name suggests, the confusion matrix is one of the most intuitive and easiest metrics used for finding the correctness and accuracy of a model.

It is used for classification problems where the output can be of two or more classes.

The confusion matrix is not a performance measure as such, but a lot of the performance metrics are based on the confusion matrix and the numbers inside it.
• Let’s say we are solving a classification problem
where we are predicting whether a person is
having cancer or not.
• Let’s assign labels to our target variable:
• 1: when a person has cancer
• 0: when a person does NOT have cancer.
• Alright! Now that we have identified the problem, the confusion matrix is a table with two dimensions (“Actual” and “Predicted”) and a set of “classes” in each dimension.
• Our Actual classifications are columns and Predicted ones are rows.
True Positives (TP) - True positives are the cases when the actual class of the data point was 1 (True) and the predicted class is also 1 (True).
Ex: The case where a person actually has cancer (1) and the model classifies his case as cancer (1) comes under True Positives.
True Negatives (TN) - True negatives are the cases when the actual class of the data point was 0 (False) and the predicted class is also 0 (False).
Ex: The case where a person does NOT have cancer and the model classifies his case as not cancer comes under True Negatives.
• False Positives (FP) - False positives are the cases when the actual class of the data point was 0 (False) and the predicted class is 1 (True). False because the model has predicted incorrectly, and positive because the class predicted was the positive one (1).
• Ex: A person who does NOT have cancer being classified by the model as having cancer comes under False Positives.

• False Negatives (FN) - False negatives are the cases when the actual class of the data point was 1 (True) and the predicted class is 0 (False). False because the model has predicted incorrectly, and negative because the class predicted was the negative one (0).
• Ex: A person who has cancer being classified by the model as not having cancer comes under False Negatives.

• The ideal scenario that we all want is that the model gives 0 False Positives and 0 False Negatives. But that is rarely the case in real life, as no model is 100% accurate.
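These four counts can be read off a confusion matrix computed with scikit-learn, as in the sketch below (y_test and y_pred are assumed from the earlier sketches; note that scikit-learn puts actual classes on the rows and predicted classes on the columns, the transpose of the layout described above):

from sklearn.metrics import confusion_matrix

# For a binary 0/1 problem, ravel() returns the counts in the order TN, FP, FN, TP.
cm = confusion_matrix(y_test, y_pred, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()
print("TP:", tp, "TN:", tn, "FP:", fp, "FN:", fn)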
When to minimize what?
• We know that there will be some error associated with every model that we use
for predicting the true class of the target variable. This will result in False Positives
and False Negatives

• There’s no hard rule that says what should be minimized in all the situations. It
purely depends on the business needs and the context of the problem you are
trying to solve. Based on that, we might want to minimize either False Positives or
False negatives.
Minimizing False Negatives
• In the cancer example, a person who does NOT have cancer might end up being classified as cancerous. This might be okay, as it is less dangerous than NOT identifying/capturing a cancerous patient, since we will anyway send the suspected cancer cases for further examination and reports. But missing a cancer patient would be a huge mistake, as no further examination would be done on them.
Minimizing False Positives
• For a better understanding of False Positives, let’s use a different example, where the model classifies whether an email is spam or not.
• Let’s say that you are expecting an important email, like hearing back from a recruiter or awaiting an admit letter from a university. Let’s assign labels to the target variable and say, 1: “Email is spam” and 0: “Email is not spam”.
• Suppose the model classifies that important email you are desperately waiting for as spam (a case of a False Positive). So in the case of spam email classification, minimising False Positives is more important than minimising False Negatives.
Accuracy
• Accuracy in classification problems is the number of correct predictions made by the model divided by the total number of predictions made.
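In terms of the confusion-matrix counts defined earlier:

Accuracy = (TP + TN) / (TP + TN + FP + FN)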
Precision
• Precision is a measure that tells us what proportion of patients that
we diagnosed as having cancer, actually had cancer.
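In terms of the confusion-matrix counts:

Precision = TP / (TP + FP)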
Recall or Sensitivity
• Recall is a measure that tells us what proportion of patients
that actually had cancer was diagnosed by the algorithm as
having cancer.
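In terms of the confusion-matrix counts:

Recall = TP / (TP + FN)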
• So basically, if we want to focus more on minimising False Negatives, we want our Recall to be as close to 100% as possible without Precision being too bad; and if we want to focus on minimising False Positives, then our focus should be on making Precision as close to 100% as possible.
F-1 Score
• We don’t really want to carry both Precision and Recall in our pockets
every time we make a model for solving a classification problem. So
it’s best if we can get a single score that kind of represents both
Precision(P) and Recall(R).

F1 Score = Harmonic Mean(Precision, Recall)

F1 Score = 2 * Precision * Recall / (Precision + Recall)
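For example, a model with Precision = 1.0 but Recall = 0.2 gets F1 = 2 * 1.0 * 0.2 / (1.0 + 0.2) ≈ 0.33, so the harmonic mean penalises the weaker of the two scores instead of letting the stronger one hide it.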


Specificity
• Specificity is a measure that tells us what proportion of patients that
did NOT have cancer, were predicted by the model as non-cancerous.
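In terms of the confusion-matrix counts, Specificity = TN / (TN + FP). As a sketch, all of the metrics above can be computed directly from the four counts; the numbers below are made up purely for illustration and are not the values of the exercise that follows:

# Hypothetical counts, for illustration only.
tp, tn, fp, fn = 40, 120, 25, 15

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)               # also called sensitivity
specificity = tn / (tn + fp)
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, specificity, f1)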
Assume that our test set for the Corona rapid tests includes 200 individuals, broken down
by cell as follows:

Calculate the values of the following metrics based on the confusion matrix given.
[1.25*5=6.25]
a) Accuracy
b) Precision
c) Recall
d) F-1 Score
e) Specificity
Suggest, as a data scientist, the value in the confusion matrix that you would like to reduce in order to create a better classifier.
