Notes - Machine Learning

This document discusses overfitting and underfitting in machine learning models. Overfitting occurs when a model performs well on the training data but poorly on new data, due to high variance. Underfitting is when a model is too simple and cannot identify patterns in the data, due to high bias. The goal is to balance bias and variance to minimize error.


1. What are overfitting and underfitting?

Overfitting: The model performs well only on the sample training data. When new data is given as input, the model fails to generalize and produces poor predictions. This occurs due to low bias and high variance in the model. Decision trees are more prone to overfitting.

Underfitting: Here, the model is so simple that it is not able to identify the correct relationships in the data, and hence it performs poorly on both the training and the test data. This can happen due to high bias and low variance. Linear regression is more prone to underfitting.
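
A quick way to see both failure modes is to fit one very flexible model and one very rigid model to the same noisy data and compare training vs. test scores. A minimal sketch with scikit-learn (the synthetic sine-wave dataset is just for illustration):

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression

# Noisy non-linear data
rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 5, 200)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=200)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Unpruned decision tree: near-perfect train score, worse test score (overfitting)
tree = DecisionTreeRegressor().fit(X_train, y_train)
print("tree train/test R^2:", tree.score(X_train, y_train), tree.score(X_test, y_test))

# Straight line through a sine wave: poor on both sets (underfitting)
line = LinearRegression().fit(X_train, y_train)
print("line train/test R^2:", line.score(X_train, y_train), line.score(X_test, y_test))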

Bias
Imagine you're trying to throw darts at a bullseye. If most of your darts land to the left of
the bullseye, you're consistently off-target in a particular direction. That's like bias in
machine learning. It happens when your model makes assumptions about the data that lead
it to miss the actual trends. So, instead of hitting the bullseye (the real answers), your
model's predictions are consistently off in the same direction. This often happens because
the model is too simple to capture the complexity of the data.

Variance
Now, suppose your darts are all over the dartboard, sometimes to the left, sometimes to the
right, up, and down. There's a lot of spread in where your darts land. That's like variance in
machine learning. It occurs when your model pays too much attention to the training data,
including the noise or random fluctuations. As a result, the model performs well on the
training data but poorly on new, unseen data because it's too busy chasing the specifics of
the training data rather than focusing on the overall pattern. Imagine it as trying to
memorize the answers to a test without understanding the questions.

Finding the Balance
The goal in machine learning is to find a sweet spot between bias and variance, much like
adjusting your aim and throw strength to hit the bullseye more consistently. You want a
model that's complex enough to capture the true patterns in the data (low bias) without
getting distracted by the random noise in the dataset (low variance). This balance is crucial
for creating models that make accurate predictions on new, unseen data.

The trade-off between bias and variance is a central issue in machine learning. The goal is to
find a balance between bias and variance to minimize the overall error. A model with high
bias and low variance is said to underfit the data, while a model with low bias and high
variance is said to overfit the data. A model with low bias and low variance is said to be well-
fit to the data.

The bias-variance trade-off can be visualized using a learning curve, which is a plot of the
training and validation errors of a model as a function of the number of training examples or
the complexity of the model. A learning curve can show how the bias-variance trade-off
changes with different levels of training or complexity.
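
As an illustration, scikit-learn's learning_curve helper can generate the points for such a plot; a minimal sketch (the dataset and model are placeholder choices):

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = load_breast_cancer(return_X_y=True)

# Training and validation scores at increasing training-set sizes
sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=5000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5)

# A large, persistent gap between the curves signals high variance;
# two low curves that converge signal high bias.
print(sizes)
print(train_scores.mean(axis=1))
print(val_scores.mean(axis=1))
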
To reduce high bias, one can use more complex models, add more features, or reduce the
regularization parameter. To reduce high variance, one can use simpler models, reduce the
number of features, or increase the regularization parameter.
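
As a concrete example of the regularization dial, here is a minimal sketch using ridge regression, where alpha is the regularization parameter (dataset and values chosen only for illustration):

from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

X, y = load_diabetes(return_X_y=True)

# Small alpha -> weak regularization (lower bias, higher variance);
# large alpha -> strong regularization (higher bias, lower variance).
for alpha in [0.01, 1.0, 100.0]:
    score = cross_val_score(Ridge(alpha=alpha), X, y, cv=5).mean()
    print(f"alpha={alpha:>6}: mean CV R^2 = {score:.3f}")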

In summary, bias and variance are two important concepts in machine learning that are
used to evaluate the performance of a model. Bias refers to the simplifying assumptions
made by the model, while variance refers to the inconsistency of different predictions using
different training sets. The goal is to find a balance between bias and variance to minimize
the overall error and achieve accurate and consistent predictions.

2. What does it mean when the p-value is high or low?

A p-value is a tool used in statistics to help you understand if your results are likely to have
happened by chance or if there’s something more interesting going on.

What is a p-value?
 P-value: This is a number between 0 and 1 that tells you whether the results of your
experiment or study could happen just by chance. The lower the p-value, the less
likely the results are due to random chance.
 The p-value is often used to decide whether to reject the null hypothesis (strictly speaking, you either reject it or fail to reject it). A common threshold for statistical significance is a p-value less than or equal to 0.05.

High P-value
 When it’s high (usually more than 0.05 or 5%): This suggests that your findings
could easily happen by chance. There isn’t enough evidence to suggest something
unusual is going on. It means there's a high probability that any effect you see (like a
treatment making a plant grow faster) could just be due to randomness.
 If the p-value is greater than 0.05, the null hypothesis is not rejected, and the
alternative hypothesis is not accepted.
 A high p-value suggests that the observed data is likely under the null hypothesis,
indicating that there is weak evidence against the null hypothesis.

Low P-value
 When it’s low (usually less than 0.05 or 5%): This indicates that your findings are
unlikely to have occurred by chance. There’s enough evidence to suggest a real
effect or difference. This doesn't mean your hypothesis is definitely true, but it's an
indication that it's worth further investigation.
 If the p-value is less than or equal to 0.05, the null hypothesis is rejected in favor of the alternative hypothesis.
 A low p-value suggests that the observed data is unlikely under the null hypothesis,
indicating that there is strong evidence against the null hypothesis.

When to Use It
You use a p-value when you want to understand whether the differences or effects you
observe in your data (like test scores improving after studying more, or people feeling
better after taking a medicine) are meaningful or just random variations.

Example
Imagine you're teaching two groups of students for a test. Group A studies with a new
method you've developed, while Group B studies with the traditional method. After the test,
you compare the scores.
 High p-value: If the p-value is high when you compare the test scores, it means
there's a strong chance the difference in scores between the two groups could just
be due to random variation. The new study method might not be better than the
traditional one.
 Low p-value: If the p-value is low, it suggests that the difference in scores is unlikely
to be due to chance, indicating your new study method may genuinely improve
scores.
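
A sketch of how that comparison might be run in Python (the scores below are made-up illustration data, not real results):

from scipy import stats

# Hypothetical test scores for the two groups
group_a = [78, 85, 90, 72, 88, 81, 94, 79]  # new study method
group_b = [70, 75, 80, 68, 74, 77, 82, 71]  # traditional method

# Two-sample t-test: could the difference in means be due to chance?
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")

if p_value <= 0.05:
    print("Low p-value: the difference is unlikely to be chance alone.")
else:
    print("High p-value: the difference could easily be random variation.")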

3. What is cross-validation?

Cross-validation is a statistical technique used to estimate how well a model will perform on unknown data. The model is trained and tested in rotation using different samples of the training dataset: the training data is split into several groups, and the model is run and validated against these groups in turn.

The most commonly used techniques are:
 K-fold method
 Leave-p-out method
 Leave-one-out method
 Holdout method
This process helps ensure your model isn’t just memorizing the answers (overfitting) or too
confused by the data (underfitting). Instead, you’re aiming for a model that genuinely learns
and can apply that knowledge to new, unseen data, much like a well-prepared student
facing a variety of questions on a test.
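
A minimal k-fold sketch with scikit-learn (the dataset and model are placeholder choices):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)

# 5-fold CV: train on 4 folds, validate on the held-out fold, rotate 5 times
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)

print("fold scores:", scores)
print("mean accuracy:", scores.mean())
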
4. Covariance vs Correlation

 Both covariance and correlation measure the relationship and the dependency
between two variables.
 Covariance indicates the direction of the linear relationship between variables.
 Correlation measures both the strength and direction of the linear relationship
between two variables.
 Correlation values are standardized: they always lie between −1 and +1.
 Covariance values are not standardized; their magnitude depends on the units of the variables.
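
The distinction is easy to see in code: rescaling a variable changes the covariance but not the correlation. A small numpy sketch (the numbers are arbitrary):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

cov_xy = np.cov(x, y)[0, 1]        # unit-dependent
corr_xy = np.corrcoef(x, y)[0, 1]  # always in [-1, 1]
print(cov_xy, corr_xy)

# Correlation is covariance standardized by the standard deviations
print(cov_xy / (np.std(x, ddof=1) * np.std(y, ddof=1)))

# Rescaling x (say, metres -> centimetres) changes covariance, not correlation
print(np.cov(100 * x, y)[0, 1], np.corrcoef(100 * x, y)[0, 1])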

5. How do you approach solving any data analytics-based project?

 The first step is to thoroughly understand the business requirement/problem.
 Next, explore the given data and analyze it carefully. If you find any data missing, get
the requirements clarified from the business.
 Next, perform data cleanup and preparation, which produces the dataset used for modeling. Here, missing values are handled and variables are transformed.
 Run your model against the data, build meaningful visualization and analyze the
results to get meaningful insights.
 Release the model implementation, then track the results and performance over a specified period to assess its usefulness.
 Perform cross-validation of the model.

6. What does the ROC curve represent, and how do you create it?

Evaluation / Performance Metrics

1. Accuracy
 What It Is: The percentage of your predictions that are correct.
 Analogy: Imagine you're answering true or false questions on a quiz. Accuracy is just
the number of questions you got right over the total number of questions.

2. Precision and Recall
 Precision (Positive Predictive Value):
 What It Is: Of all the positive (true) predictions you made, how many were actually true?
 Simple Pointer: Precision is about being precise with your positive predictions.

 Recall (Sensitivity, True Positive Rate):
 What It Is: Of all the actual positive cases, how many did you correctly
predict?
 Simple Pointer: Recall is recalling or catching all the actual positives.

3. F1 Score
 What It Is: The harmonic mean of precision and recall. It balances the two—useful
when you need a single metric to compare models directly.
 Analogy: If precision and recall are the two wheels of a bicycle, the F1 score is how
well the bicycle rides. It needs both wheels to be balanced to work best.

4. ROC-AUC
 ROC (Receiver Operating Characteristic) Curve: Shows the trade-off between the
true positive rate (recall) and false positive rate at different thresholds.

 AUC (Area Under the ROC Curve):
 What It Is: A single number summarizing the ROC curve, representing the
likelihood that a randomly chosen positive instance is ranked higher than a
randomly chosen negative one.
 Simple Pointer: Think of AUC as the score of a game where you want to
maximize the area you cover.
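
To actually create the curve: score each example with a predicted probability, sweep a threshold over those scores, and record the (false positive rate, true positive rate) pair at each threshold. A minimal sketch with scikit-learn (dataset and model are placeholder choices):

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=5000).fit(X_train, y_train)
probs = model.predict_proba(X_test)[:, 1]  # probability of the positive class

# Each threshold gives one (FPR, TPR) point; plotting them traces the ROC curve
fpr, tpr, thresholds = roc_curve(y_test, probs)
print("AUC:", roc_auc_score(y_test, probs))
# e.g. plt.plot(fpr, tpr) with matplotlib to draw the curve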

5. Mean Absolute Error (MAE) and Mean Squared Error (MSE)
 Used For: Regression problems where you predict a continuous quantity.
 MAE:
 What It Is: The average of the absolute errors between the predicted and
actual values.
 Analogy: Like measuring how far off the dart is from the bullseye, regardless
of the direction.
 MSE:
 What It Is: The average of the squared differences between the predicted and
actual values.
 Simple Pointer: Punishes larger errors more than smaller ones, like adding a
penalty for being way off target.
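
A small sketch with made-up values to show the difference:

from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = [3.0, 5.0, 2.5, 7.0]  # actual values
y_pred = [2.5, 5.0, 4.0, 8.0]  # model predictions

print("MAE:", mean_absolute_error(y_true, y_pred))  # average |error|
print("MSE:", mean_squared_error(y_true, y_pred))   # average squared error; large misses dominate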

Tricks to Remember
 Use Analogies: Relate metrics to everyday situations or familiar tasks.
 Focus on Key Aspects: Precision is about your positive predictions' correctness,
recall is about catching all actual positives, and so on.
 Practice with Examples: Apply these metrics to simple, real-world scenarios to see
how they work in action.
 Create Visuals: Sketch out what each metric is measuring. Visual aids can help
solidify concepts in your memory.

The metrics above can be read off a confusion matrix. Using a doctor diagnosing patients as the example, with predicted labels as rows and actual labels as columns:

 TP (True Positives): the sick patients the doctor correctly diagnosed as sick (predicted 1, actual 1).
 FP (False Positives): the healthy patients the doctor incorrectly diagnosed as sick (predicted 1, actual 0).
 FN (False Negatives): the sick patients the doctor incorrectly diagnosed as healthy (predicted 0, actual 1).
 TN (True Negatives): the healthy patients the doctor correctly diagnosed as healthy (predicted 0, actual 0).

 Precision: Precision is like the doctor's accuracy in making a correct diagnosis when
they say a patient is sick. It's calculated by the formula TP / (TP + FP), which in
practice would be the number of correctly diagnosed sick patients divided by all the
patients diagnosed as sick.

 Recall (Sensitivity): This is how good the doctor is at identifying all the sick patients.
It's TP / (TP + FN), so it's the number of correctly diagnosed sick patients divided by
the actual number of sick patients.

 F1 Score: This is a balance between precision and recall. If the doctor wants to be
sure they're both accurate and don’t miss any sick patients, they'll look at the F1 score.
It's the harmonic mean of precision and recall, calculated by 2 * (Precision *
Recall) / (Precision + Recall).
 Specificity: This is the doctor's skill at identifying healthy patients. It's TN / (TN +
FP), the number of correctly identified healthy patients divided by the actual number
of healthy patients.

 Accuracy: This is the doctor's overall correctness across all diagnoses. It's (TP + TN)
/ (TP + TN + FP + FN), which means all the correct diagnoses (both sick and
healthy) divided by all diagnoses made.
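
Putting the formulas above into code, with hypothetical counts for the doctor example:

# Hypothetical confusion-matrix counts (not from any real dataset)
TP, FP, FN, TN = 80, 10, 20, 90

precision   = TP / (TP + FP)                   # correct among "sick" calls
recall      = TP / (TP + FN)                   # sick patients actually caught
specificity = TN / (TN + FP)                   # healthy patients correctly cleared
accuracy    = (TP + TN) / (TP + TN + FP + FN)  # overall correctness
f1          = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.2f} recall={recall:.2f} "
      f"specificity={specificity:.2f} accuracy={accuracy:.2f} f1={f1:.2f}")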
