Chapter 7 - LAST
• Once you have created and evaluated your model, see whether its accuracy can be improved in any way.
• This is done by tuning the parameters of your model. These tunable parameters (often called hyperparameters) are the variables whose values the programmer, rather than the training process, decides.
• At a particular value of a parameter, the accuracy will be at its maximum. Parameter tuning refers to searching for these values; a sketch of one common search strategy follows below.
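As an illustration (not from the original slides), here is a minimal sketch of one common tuning strategy, grid search with cross-validation; it assumes scikit-learn is available, and the dataset, model, and grid values are arbitrary choices:

# A minimal sketch of hyperparameter tuning via grid search,
# assuming scikit-learn; the grid values are arbitrary.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Try several values of each hyperparameter and keep the best combination,
# scored by cross-validated accuracy on the training set.
grid = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid={"max_depth": [2, 4, 8, None], "min_samples_leaf": [1, 5, 20]},
    scoring="accuracy",
    cv=5,
)
grid.fit(X_train, y_train)
print("Best hyperparameters:", grid.best_params_)
print("Best cross-validated accuracy:", grid.best_score_)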
7. Making Predictions
In the end, you can use your model to make accurate predictions on unseen data, as sketched below.
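Continuing the hypothetical grid-search sketch above, the refitted best model can then be applied to held-out data it has never seen:

# GridSearchCV refits the best model on the whole training set, so
# best_estimator_ can be used directly to predict on unseen data.
y_pred = grid.best_estimator_.predict(X_test)
print("Accuracy on unseen test data:", grid.best_estimator_.score(X_test, y_test))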
Overfitting vs. Underfitting
• Let’s say we want to predict if a student will land a job interview based
on her resume.
• Now, assume we train a model from a dataset of 10,000 resumes and
their outcomes.
• Next, we try the model out on the original dataset, and it predicts
outcomes with 99% accuracy… wow!
• But now comes the bad news.
• When we run the model on a new (“unseen”) dataset of resumes, we only get 50% accuracy… uh-oh!
• Our model doesn’t generalize well from our training data to unseen data.
• This is known as overfitting, and it’s a common problem in machine learning and data science.
• We can understand overfitting better by looking at the opposite problem, underfitting.
• Underfitting occurs when a model is too simple – informed by too few features or regularized too much – which makes it inflexible in learning from the dataset.
• The sketch below illustrates the train/test accuracy gap that signals overfitting.
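A minimal sketch (my own illustration, not from the slides) of how overfitting shows up in practice: an unconstrained decision tree scores almost perfectly on the data it was trained on but noticeably worse on held-out data. The dataset here is synthetic, with label noise added via flip_y:

# Sketch: an unconstrained decision tree typically overfits, scoring
# near-perfectly on training data but worse on held-out data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_informative=5, flip_y=0.2,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("Train accuracy:", tree.score(X_train, y_train))  # close to 1.0
print("Test accuracy:", tree.score(X_test, y_test))     # noticeably lower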
How to Prevent Overfitting in Machine Learning
Train with more data
• It won’t work every time, but training with more data can help algorithms detect the signal better.
Remove features
• Some algorithms have built-in feature selection.
• For those that don’t, you can manually improve their generalizability by removing irrelevant input features, as sketched below.
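As an illustrative sketch (assuming scikit-learn; the choice of k is arbitrary), irrelevant features can be filtered out with a univariate selector before training:

# Sketch: keep only the k features most associated with the target,
# discarding the rest; k=5 here is an arbitrary choice.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           random_state=0)
selector = SelectKBest(score_func=f_classif, k=5)
X_reduced = selector.fit_transform(X, y)
print("Original shape:", X.shape)         # (500, 20)
print("Reduced shape:", X_reduced.shape)  # (500, 5)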
Evaluate a classification model
• Confusion Matrix
• Accuracy
• Precision
• Recall or Sensitivity
• Specificity
• F1 Score
Confusion Matrix
Contrary to what the name suggests, the confusion matrix is one of the most intuitive and straightforward tools for assessing the correctness and accuracy of a model. It tabulates predictions into four counts: True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN), as sketched below.
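A small sketch (the labels are toy data of my own) showing how the four counts are obtained from true and predicted labels with scikit-learn:

# Sketch: computing a confusion matrix from true vs. predicted labels.
# With binary labels [0, 1], sklearn orders the cells as [[TN, FP], [FN, TP]].
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]  # actual classes (toy data)
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]  # model's predictions (toy data)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TP:", tp, "TN:", tn, "FP:", fp, "FN:", fn)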
• False Negatives (FN) - False negatives are the cases where the actual class of the data point was 1 (True) but the predicted class was 0 (False). “False” because the model predicted incorrectly, and “negative” because the class predicted was the negative one (0).
• Ex: A person having cancer whose case the model classifies as No-cancer comes under False Negatives.
• The ideal scenario we all want is for the model to give 0 False Positives and 0 False Negatives. But that is not the case in real life, as no model is 100% accurate.
When to minimize what?
• We know that there will be some error associated with every model that we use
for predicting the true class of the target variable. This will result in False Positives
and False Negatives
• There’s no hard rule that says what should be minimized in all situations. It depends purely on the business needs and the context of the problem you are trying to solve. Based on that, we might want to minimize either False Positives or False Negatives.
Minimizing False Negatives
• We might end up with misclassifications in which a person NOT having cancer is classified as cancerous. This might be okay, as it is less dangerous than NOT identifying/capturing a cancerous patient: we will send the suspected cancer cases for further examination and reports anyway, but missing a cancer patient would be a huge mistake, as no further examination would be done on them.
Minimizing False Positives
• For a better understanding of False Positives, let’s use a different example, where the model classifies whether an email is spam or not.
• Let’s say that you are expecting an important email, such as a reply from a recruiter or an admission letter from a university. Let’s assign labels to the target variable: 1: “Email is spam” and 0: “Email is not spam”.
• Suppose the model classifies that important email you are desperately waiting for as spam (a case of a False Positive). So in spam email classification, minimizing False Positives is more important than minimizing False Negatives. The sketch below shows how the decision threshold trades one error type against the other.
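Which error to prioritize is often controlled by the classifier’s decision threshold. A hedged sketch (synthetic data; the threshold values are arbitrary): lowering the threshold catches more positives and so reduces False Negatives (the cancer case), while raising it makes positive calls more conservative and so reduces False Positives (the spam case):

# Sketch: trading False Negatives against False Positives by moving the
# decision threshold on predicted probabilities; thresholds are arbitrary.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, flip_y=0.15, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba = model.predict_proba(X_test)[:, 1]  # P(class == 1)

for threshold in (0.3, 0.5, 0.7):
    y_pred = (proba >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
    # Lower threshold -> fewer FN (cancer case); higher -> fewer FP (spam case)
    print(f"threshold={threshold}: FP={fp}, FN={fn}")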
Accuracy
• Accuracy in classification problems is the number of correct predictions made by the model divided by the total number of predictions made.
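• In terms of the confusion-matrix counts, the standard formula is: Accuracy = (TP + TN) / (TP + TN + FP + FN).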
Precision
• Precision is a measure that tells us what proportion of patients that
we diagnosed as having cancer, actually had cancer.
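• In terms of the confusion-matrix counts, the standard formula is: Precision = TP / (TP + FP).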
Recall or Sensitivity
• Recall is a measure that tells us what proportion of patients
that actually had cancer was diagnosed by the algorithm as
having cancer.
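• In terms of the confusion-matrix counts, the standard formula is: Recall = TP / (TP + FN).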
• So basically, if we want to focus more on minimizing False Negatives, we would want our Recall to be as close to 100% as possible without Precision being too bad; and if we want to focus on minimizing False Positives, then our focus should be on making Precision as close to 100% as possible.
F-1 Score
• We don’t really want to carry both Precision and Recall in our pockets every time we make a model for solving a classification problem. So it’s best if we can get a single score that represents both Precision (P) and Recall (R); the sketch below shows one way to compute it alongside the other metrics.
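That single score is the harmonic mean of the two: F1 = 2 × (P × R) / (P + R). As a worked sketch (the counts below are toy numbers of my own, not from the slides), here is every metric listed earlier computed from raw confusion-matrix counts, including Specificity = TN / (TN + FP):

# Sketch: computing every metric from raw confusion-matrix counts.
# The counts below are toy numbers chosen purely for illustration.
tp, tn, fp, fn = 40, 45, 5, 10

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)            # a.k.a. sensitivity
specificity = tn / (tn + fp)
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of P and R

print(f"Accuracy:    {accuracy:.3f}")     # 0.850
print(f"Precision:   {precision:.3f}")    # 0.889
print(f"Recall:      {recall:.3f}")       # 0.800
print(f"Specificity: {specificity:.3f}")  # 0.900
print(f"F1 score:    {f1:.3f}")           # 0.842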
Calculate the values of the following metrics based on the confusion matrix given. [1.25*5=6.25]
a) Accuracy
b) Precision
c) Recall
d) F-1 Score
e) Specificity
Suggest, as a data scientist, the value in the confusion matrix that you would like to reduce, in order to create a better classifier.