Notes - Machine Learning

This document discusses overfitting and underfitting in machine learning models. Overfitting occurs when a model performs well on the training data but poorly on new data, due to high variance. Underfitting is when a model is too simple and cannot identify patterns in the data, due to high bias. The goal is to balance bias and variance to minimize error.


1. What are overfitting and underfitting?

Overfitting: The model performs well only on the sample training data. When new data is given as input, the model fails to generalize and produces poor predictions. This occurs due to low bias and high variance in the model. Decision trees are more prone to overfitting.

Underfitting: Here, the model is so simple that it is not able to identify the correct relationships in the data, and hence it performs poorly on both the training and the test data. This can happen due to high bias and low variance. Linear regression is more prone to underfitting.
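
A quick way to see both failure modes is to fit one very flexible model and one very rigid model to the same noisy data and compare training vs. test scores. A minimal sketch with scikit-learn (the synthetic sine-wave dataset is just for illustration):

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression

# Noisy non-linear data
rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 5, 200)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=200)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Unpruned decision tree: near-perfect train score, worse test score (overfitting)
tree = DecisionTreeRegressor().fit(X_train, y_train)
print("tree train/test R^2:", tree.score(X_train, y_train), tree.score(X_test, y_test))

# Straight line through a sine wave: poor on both sets (underfitting)
line = LinearRegression().fit(X_train, y_train)
print("line train/test R^2:", line.score(X_train, y_train), line.score(X_test, y_test))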

Bias
Imagine you're trying to throw darts at a bullseye. If most of your darts land to the left of
the bullseye, you're consistently off-target in a particular direction. That's like bias in
machine learning. It happens when your model makes assumptions about the data that lead
it to miss the actual trends. So, instead of hitting the bullseye (the real answers), your
model's predictions are consistently off in the same direction. This often happens because
the model is too simple to capture the complexity of the data.

Variance
Now, suppose your darts are all over the dartboard, sometimes to the left, sometimes to the
right, up, and down. There's a lot of spread in where your darts land. That's like variance in
machine learning. It occurs when your model pays too much attention to the training data,
including the noise or random fluctuations. As a result, the model performs well on the
training data but poorly on new, unseen data because it's too busy chasing the specifics of
the training data rather than focusing on the overall pattern. Imagine it as trying to
memorize the answers to a test without understanding the questions.

Finding the Balance
The goal in machine learning is to find a sweet spot between bias and variance, much like
adjusting your aim and throw strength to hit the bullseye more consistently. You want a
model that's complex enough to capture the true patterns in the data (low bias) without
getting distracted by the random noise in the dataset (low variance). This balance is crucial
for creating models that make accurate predictions on new, unseen data.

The trade-off between bias and variance is a central issue in machine learning. The goal is to
find a balance between bias and variance to minimize the overall error. A model with high
bias and low variance is said to underfit the data, while a model with low bias and high
variance is said to overfit the data. A model with low bias and low variance is said to be well-
fit to the data.

The bias-variance trade-off can be visualized using a learning curve, which is a plot of the
training and validation errors of a model as a function of the number of training examples or
the complexity of the model. A learning curve can show how the bias-variance trade-off
changes with different levels of training or complexity.
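
As an illustration, scikit-learn's learning_curve helper can generate the points for such a plot; a minimal sketch (the dataset and model are placeholder choices):

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = load_breast_cancer(return_X_y=True)

# Training and validation scores at increasing training-set sizes
sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=5000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5)

# A large, persistent gap between the curves signals high variance;
# two low curves that converge signal high bias.
print(sizes)
print(train_scores.mean(axis=1))
print(val_scores.mean(axis=1))
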
To reduce high bias, one can use more complex models, add more features, or reduce the
regularization parameter. To reduce high variance, one can use simpler models, reduce the
number of features, or increase the regularization parameter.
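
As a concrete example of the regularization dial, here is a minimal sketch using ridge regression, where alpha is the regularization parameter (dataset and values chosen only for illustration):

from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

X, y = load_diabetes(return_X_y=True)

# Small alpha -> weak regularization (lower bias, higher variance);
# large alpha -> strong regularization (higher bias, lower variance).
for alpha in [0.01, 1.0, 100.0]:
    score = cross_val_score(Ridge(alpha=alpha), X, y, cv=5).mean()
    print(f"alpha={alpha:>6}: mean CV R^2 = {score:.3f}")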

In summary, bias and variance are two important concepts in machine learning that are
used to evaluate the performance of a model. Bias refers to the simplifying assumptions
made by the model, while variance refers to the inconsistency of different predictions using
different training sets. The goal is to find a balance between bias and variance to minimize
the overall error and achieve accurate and consistent predictions.

2. What does it mean when the p-value is high or low?

A p-value is a tool used in statistics to help you understand if your results are likely to have
happened by chance or if there’s something more interesting going on.

What is a p-value?
 P-value: This is a number between 0 and 1 that tells you whether the results of your
experiment or study could happen just by chance. The lower the p-value, the less
likely the results are due to random chance.
 The p-value is often used to decide whether to reject the null hypothesis (strictly speaking, you either reject it or fail to reject it). A common threshold for statistical significance is a p-value less than or equal to 0.05.

High P-value
 When it’s high (usually more than 0.05 or 5%): This suggests that your findings
could easily happen by chance. There isn’t enough evidence to suggest something
unusual is going on. It means there's a high probability that any effect you see (like a
treatment making a plant grow faster) could just be due to randomness.
 If the p-value is greater than 0.05, the null hypothesis is not rejected, and the
alternative hypothesis is not accepted.
 A high p-value suggests that the observed data is likely under the null hypothesis,
indicating that there is weak evidence against the null hypothesis.

Low P-value
 When it’s low (usually less than 0.05 or 5%): This indicates that your findings are
unlikely to have occurred by chance. There’s enough evidence to suggest a real
effect or difference. This doesn't mean your hypothesis is definitely true, but it's an
indication that it's worth further investigation.
 If the p-value is less than or equal to 0.05, the null hypothesis is rejected in favor of the alternative hypothesis.
 A low p-value suggests that the observed data is unlikely under the null hypothesis,
indicating that there is strong evidence against the null hypothesis.

When to Use It
You use a p-value when you want to understand whether the differences or effects you
observe in your data (like test scores improving after studying more, or people feeling
better after taking a medicine) are meaningful or just random variations.

Example
Imagine you're teaching two groups of students for a test. Group A studies with a new
method you've developed, while Group B studies with the traditional method. After the test,
you compare the scores.
 High p-value: If the p-value is high when you compare the test scores, it means
there's a strong chance the difference in scores between the two groups could just
be due to random variation. The new study method might not be better than the
traditional one.
 Low p-value: If the p-value is low, it suggests that the difference in scores is unlikely
to be due to chance, indicating your new study method may genuinely improve
scores.
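
A sketch of how that comparison might be run in Python (the scores below are made-up illustration data, not real results):

from scipy import stats

# Hypothetical test scores for the two groups
group_a = [78, 85, 90, 72, 88, 81, 94, 79]  # new study method
group_b = [70, 75, 80, 68, 74, 77, 82, 71]  # traditional method

# Two-sample t-test: could the difference in means be due to chance?
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")

if p_value <= 0.05:
    print("Low p-value: the difference is unlikely to be chance alone.")
else:
    print("High p-value: the difference could easily be random variation.")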

3. What is cross-validation?

Cross-validation is a statistical technique used to estimate how well a model will perform on unknown data. The model is trained and tested in rotation using different samples of the training dataset: the training data is split into several groups, and the model is run and validated against these groups in turn.

The most commonly used techniques are:
 K-fold method
 Leave-p-out method
 Leave-one-out method
 Holdout method
This process helps ensure your model isn’t just memorizing the answers (overfitting) or too
confused by the data (underfitting). Instead, you’re aiming for a model that genuinely learns
and can apply that knowledge to new, unseen data, much like a well-prepared student
facing a variety of questions on a test.
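
A minimal k-fold sketch with scikit-learn (the dataset and model are placeholder choices):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)

# 5-fold CV: train on 4 folds, validate on the held-out fold, rotate 5 times
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)

print("fold scores:", scores)
print("mean accuracy:", scores.mean())
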
4. Covariance vs Correlation

 Both covariance and correlation measure the relationship and the dependency
between two variables.
 Covariance indicates the direction of the linear relationship between variables.
 Correlation measures both the strength and direction of the linear relationship
between two variables.
 Correlation values are standardized: they always lie between −1 and +1.
 Covariance values are not standardized; their magnitude depends on the units of the variables.
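
The distinction is easy to see in code: rescaling a variable changes the covariance but not the correlation. A small numpy sketch (the numbers are arbitrary):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

cov_xy = np.cov(x, y)[0, 1]        # unit-dependent
corr_xy = np.corrcoef(x, y)[0, 1]  # always in [-1, 1]
print(cov_xy, corr_xy)

# Correlation is covariance standardized by the standard deviations
print(cov_xy / (np.std(x, ddof=1) * np.std(y, ddof=1)))

# Rescaling x (say, metres -> centimetres) changes covariance, not correlation
print(np.cov(100 * x, y)[0, 1], np.corrcoef(100 * x, y)[0, 1])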

5. How do you approach solving any data analytics-based project?

 The first step is to thoroughly understand the business requirement/problem.
 Next, explore the given data and analyze it carefully. If you find any data missing, get
the requirements clarified from the business.
 Next, perform data cleanup and preparation, which produces the dataset used for modeling. Here, missing values are handled and variables are transformed.
 Run your model against the data, build meaningful visualization and analyze the
results to get meaningful insights.
 Release the model implementation, then track the results and performance over a specified period to assess its usefulness.
 Perform cross-validation of the model.

6. What does the ROC curve represent, and how do you create it?

Evaluation / Performance Metrics

1. Accuracy
 What It Is: The percentage of your predictions that are correct.
 Analogy: Imagine you're answering true or false questions on a quiz. Accuracy is just
the number of questions you got right over the total number of questions.

2. Precision and Recall
 Precision (Positive Predictive Value):
 What It Is: Of all the positive (true) predictions you made, how many were actually true?
 Simple Pointer: Precision is about being precise with your positive predictions.

 Recall (Sensitivity, True Positive Rate):
 What It Is: Of all the actual positive cases, how many did you correctly
predict?
 Simple Pointer: Recall is recalling or catching all the actual positives.

3. F1 Score
 What It Is: The harmonic mean of precision and recall. It balances the two—useful
when you need a single metric to compare models directly.
 Analogy: If precision and recall are the two wheels of a bicycle, the F1 score is how
well the bicycle rides. It needs both wheels to be balanced to work best.

4. ROC-AUC
 ROC (Receiver Operating Characteristic) Curve: Shows the trade-off between the
true positive rate (recall) and false positive rate at different thresholds.

 AUC (Area Under the ROC Curve):
 What It Is: A single number summarizing the ROC curve, representing the
likelihood that a randomly chosen positive instance is ranked higher than a
randomly chosen negative one.
 Simple Pointer: Think of AUC as the score of a game where you want to
maximize the area you cover.
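
To actually create the curve: score each example with a predicted probability, sweep a threshold over those scores, and record the (false positive rate, true positive rate) pair at each threshold. A minimal sketch with scikit-learn (dataset and model are placeholder choices):

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=5000).fit(X_train, y_train)
probs = model.predict_proba(X_test)[:, 1]  # probability of the positive class

# Each threshold gives one (FPR, TPR) point; plotting them traces the ROC curve
fpr, tpr, thresholds = roc_curve(y_test, probs)
print("AUC:", roc_auc_score(y_test, probs))
# e.g. plt.plot(fpr, tpr) with matplotlib to draw the curve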

5. Mean Absolute Error (MAE) and Mean Squared Error (MSE)
 Used For: Regression problems where you predict a continuous quantity.
 MAE:
 What It Is: The average of the absolute errors between the predicted and
actual values.
 Analogy: Like measuring how far off the dart is from the bullseye, regardless
of the direction.
 MSE:
 What It Is: The average of the squared differences between the predicted and
actual values.
 Simple Pointer: Punishes larger errors more than smaller ones, like adding a
penalty for being way off target.
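
A small sketch with made-up values to show the difference:

from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = [3.0, 5.0, 2.5, 7.0]  # actual values
y_pred = [2.5, 5.0, 4.0, 8.0]  # model predictions

print("MAE:", mean_absolute_error(y_true, y_pred))  # average |error|
print("MSE:", mean_squared_error(y_true, y_pred))   # average squared error; large misses dominate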

Tricks to Remember
 Use Analogies: Relate metrics to everyday situations or familiar tasks.
 Focus on Key Aspects: Precision is about your positive predictions' correctness,
recall is about catching all actual positives, and so on.
 Practice with Examples: Apply these metrics to simple, real-world scenarios to see
how they work in action.
 Create Visuals: Sketch out what each metric is measuring. Visual aids can help
solidify concepts in your memory.

The metrics above can be read off a confusion matrix. Using a doctor diagnosing patients as the example, with predicted labels as rows and actual labels as columns:

 TP (True Positives): the sick patients the doctor correctly diagnosed as sick (predicted 1, actual 1).
 FP (False Positives): the healthy patients the doctor incorrectly diagnosed as sick (predicted 1, actual 0).
 FN (False Negatives): the sick patients the doctor incorrectly diagnosed as healthy (predicted 0, actual 1).
 TN (True Negatives): the healthy patients the doctor correctly diagnosed as healthy (predicted 0, actual 0).

 Precision: Precision is like the doctor's accuracy in making a correct diagnosis when
they say a patient is sick. It's calculated by the formula TP / (TP + FP), which in
practice would be the number of correctly diagnosed sick patients divided by all the
patients diagnosed as sick.

 Recall (Sensitivity): This is how good the doctor is at identifying all the sick patients.
It's TP / (TP + FN), so it's the number of correctly diagnosed sick patients divided by
the actual number of sick patients.

 F1 Score: This is a balance between precision and recall. If the doctor wants to be
sure they're both accurate and don’t miss any sick patients, they'll look at the F1 score.
It's the harmonic mean of precision and recall, calculated by 2 * (Precision *
Recall) / (Precision + Recall).
 Specificity: This is the doctor's skill at identifying healthy patients. It's TN / (TN +
FP), the number of correctly identified healthy patients divided by the actual number
of healthy patients.

 Accuracy: This is the doctor's overall correctness across all diagnoses. It's (TP + TN)
/ (TP + TN + FP + FN), which means all the correct diagnoses (both sick and
healthy) divided by all diagnoses made.
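
Putting the formulas above into code, with hypothetical counts for the doctor example:

# Hypothetical confusion-matrix counts (not from any real dataset)
TP, FP, FN, TN = 80, 10, 20, 90

precision   = TP / (TP + FP)                   # correct among "sick" calls
recall      = TP / (TP + FN)                   # sick patients actually caught
specificity = TN / (TN + FP)                   # healthy patients correctly cleared
accuracy    = (TP + TN) / (TP + TN + FP + FN)  # overall correctness
f1          = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.2f} recall={recall:.2f} "
      f"specificity={specificity:.2f} accuracy={accuracy:.2f} f1={f1:.2f}")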
