Notes - Machine Learning
Bias
Imagine you're trying to throw darts at a bullseye. If most of your darts land to the left of
the bullseye, you're consistently off-target in a particular direction. That's like bias in
machine learning. It happens when your model makes assumptions about the data that lead
it to miss the actual trends. So, instead of hitting the bullseye (the real answers), your
model's predictions are consistently off in the same direction. This often happens because
the model is too simple to capture the complexity of the data.
Variance
Now, suppose your darts land all over the dartboard: some left, some right, some high, some low. There's a lot of spread in where your darts land. That's like variance in
machine learning. It occurs when your model pays too much attention to the training data,
including the noise or random fluctuations. As a result, the model performs well on the
training data but poorly on new, unseen data because it's too busy chasing the specifics of
the training data rather than focusing on the overall pattern. Imagine it as trying to
memorize the answers to a test without understanding the questions.
The trade-off between bias and variance is a central issue in machine learning. The goal is to
find a balance between bias and variance to minimize the overall error. A model with high
bias and low variance is said to underfit the data, while a model with low bias and high
variance is said to overfit the data. A model with low bias and low variance is said to be well-
fit to the data.
The bias-variance trade-off can be visualized using a learning curve, which is a plot of the
training and validation errors of a model as a function of the number of training examples or
the complexity of the model. A learning curve can show how the bias-variance trade-off
changes with different levels of training or complexity.
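As a concrete illustration, here is a minimal sketch of such a learning curve using scikit-learn on a synthetic dataset (the data, the model, and the scoring choice are all illustrative assumptions, not something fixed by these notes):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import learning_curve

# Synthetic regression data: y = 3x + noise (illustrative only).
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(500, 1))
y = 3 * X.ravel() + rng.normal(scale=1.0, size=500)

# Training and validation error as a function of training-set size.
sizes, train_scores, val_scores = learning_curve(
    LinearRegression(), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5),
    cv=5, scoring="neg_mean_squared_error",
)

for n, tr, va in zip(sizes, -train_scores.mean(axis=1), -val_scores.mean(axis=1)):
    print(f"n={n:4d}  train MSE={tr:.3f}  validation MSE={va:.3f}")
```

If the two curves converge to a high error, the model suffers from high bias; a persistent gap between a low training error and a higher validation error signals high variance.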
To reduce high bias, one can use more complex models, add more features, or reduce the
regularization parameter. To reduce high variance, one can use simpler models, reduce the
number of features, or increase the regularization parameter.
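Both knobs can be seen in a small scikit-learn sketch (ridge regression on synthetic data is an illustrative choice of mine; the notes do not prescribe a specific model): the polynomial degree sets model complexity, and alpha is the regularization parameter.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic nonlinear data: y = sin(x) + noise (illustrative only).
rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=200)

# degree controls model complexity (raising it attacks bias);
# alpha controls regularization strength (raising it attacks variance).
for degree in (1, 5, 15):
    for alpha in (0.001, 1.0, 100.0):
        model = make_pipeline(PolynomialFeatures(degree), Ridge(alpha=alpha))
        mse = -cross_val_score(model, X, y, cv=5,
                               scoring="neg_mean_squared_error").mean()
        print(f"degree={degree:2d}  alpha={alpha:7.3f}  CV MSE={mse:.3f}")
```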
In summary, bias and variance are two key concepts used to evaluate the performance of a machine learning model. Bias reflects the simplifying assumptions the model makes, while variance reflects how much the model's predictions change when it is trained on different training sets. The goal is to balance the two so as to minimize the overall error and achieve accurate, consistent predictions.
A p-value is a tool used in statistics to help you understand if your results are likely to have
happened by chance or if there’s something more interesting going on.
What is a p-value?
P-value: This is a number between 0 and 1 that tells you how likely it would be to see results at least as extreme as yours if only random chance were at work. The lower the p-value, the less likely the results are due to random chance.
The p-value is often used to determine whether to reject the null hypothesis (strictly speaking, you either reject it or fail to reject it; you never "accept" it). A common threshold for statistical significance is a p-value less than or equal to 0.05.
High P-value
When it’s high (usually more than 0.05 or 5%): This suggests that your findings
could easily happen by chance. There isn’t enough evidence to suggest something
unusual is going on. It means there's a high probability that any effect you see (like a
treatment making a plant grow faster) could just be due to randomness.
If the p-value is greater than 0.05, you fail to reject the null hypothesis, and the alternative hypothesis is not supported.
A high p-value suggests that the observed data are plausible under the null hypothesis, indicating that there is only weak evidence against it.
Low P-value
When it’s low (usually less than 0.05 or 5%): This indicates that your findings are
unlikely to have occurred by chance. There’s enough evidence to suggest a real
effect or difference. This doesn't mean your hypothesis is definitely true, but it's an
indication that it's worth further investigation.
If the p-value is less than or equal to 0.05, the null hypothesis is rejected in favor of the alternative hypothesis.
A low p-value suggests that the observed data would be unlikely if the null hypothesis were true, indicating strong evidence against it.
When to Use It
You use a p-value when you want to understand whether the differences or effects you
observe in your data (like test scores improving after studying more, or people feeling
better after taking a medicine) are meaningful or just random variations.
Example
Imagine you're preparing two groups of students for a test. Group A studies with a new
method you've developed, while Group B studies with the traditional method. After the test,
you compare the scores.
High p-value: If the p-value is high when you compare the test scores, it means
there's a strong chance the difference in scores between the two groups could just
be due to random variation. The new study method might not be better than the
traditional one.
Low p-value: If the p-value is low, it suggests that the difference in scores is unlikely
to be due to chance, indicating your new study method may genuinely improve
scores.
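A minimal sketch of how this comparison could be run in practice, assuming a two-sample Welch t-test from SciPy and made-up scores (both assumptions of mine, not part of the notes):

```python
from scipy import stats

# Hypothetical test scores (illustrative numbers, not real data).
group_a = [78, 85, 90, 72, 88, 95, 81, 84]  # new study method
group_b = [70, 75, 80, 68, 74, 79, 72, 77]  # traditional method

# Welch's t-test: does not assume equal variances in the two groups.
t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=False)

if p_value <= 0.05:
    print(f"p = {p_value:.4f}: reject the null hypothesis; "
          "the difference is unlikely to be due to chance.")
else:
    print(f"p = {p_value:.4f}: fail to reject the null hypothesis.")
```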
What is the difference between covariance and correlation?
Both covariance and correlation measure the relationship and dependency between
two variables.
Covariance indicates only the direction of the linear relationship between variables,
and its values are not standardized: they depend on the units of the variables.
Correlation measures both the strength and the direction of the linear relationship,
and its values are standardized to the range [-1, 1].
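A quick NumPy sketch of the difference (the hours/score data below are made up for illustration):

```python
import numpy as np

# Two related variables (made-up data): hours studied vs. test score.
hours = np.array([1, 2, 3, 4, 5, 6], dtype=float)
score = np.array([52, 57, 61, 70, 72, 80], dtype=float)

cov = np.cov(hours, score)[0, 1]        # unstandardized; units: hours * points
corr = np.corrcoef(hours, score)[0, 1]  # standardized to [-1, 1]

print(f"covariance  = {cov:.2f}")
print(f"correlation = {corr:.2f}")

# Correlation is covariance divided by the product of standard deviations.
assert np.isclose(corr, cov / (hours.std(ddof=1) * score.std(ddof=1)))
```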
Classification Metrics and the ROC Curve
F1 Score
What It Is: The harmonic mean of precision and recall. It balances the two, which is
useful when you need a single metric to compare models directly.
Analogy: If precision and recall are the two wheels of a bicycle, the F1 score is how
well the bicycle rides. It needs both wheels to be balanced to work best.
ROC-AUC
ROC (Receiver Operating Characteristic) Curve: Shows the trade-off between the
true positive rate (recall) and the false positive rate at different classification
thresholds. To create it, sweep the decision threshold over the model's predicted
scores, compute the (false positive rate, true positive rate) pair at each threshold,
and plot those points. The area under the curve (AUC) summarizes performance
across all thresholds: 1.0 is a perfect classifier, 0.5 is random guessing.
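A minimal sketch of creating the curve with scikit-learn's roc_curve (the labels and scores below are made-up illustrative values):

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Hypothetical true labels and predicted probabilities (illustrative only).
y_true  = np.array([0, 0, 1, 1, 0, 1, 0, 1, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.5, 0.7, 0.6, 0.3])

# roc_curve sweeps the threshold and returns one (FPR, TPR) pair per setting.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
for f, t, th in zip(fpr, tpr, thresholds):
    print(f"threshold={th:.2f}  FPR={f:.2f}  TPR={t:.2f}")

print("AUC =", roc_auc_score(y_true, y_score))
```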
Confusion matrix (doctor-diagnosis example):
TP (True Positives), predicted 1 / actual 1: the sick patients the doctor correctly
diagnosed as sick.
FP (False Positives), predicted 1 / actual 0: the healthy patients the doctor
incorrectly diagnosed as sick.
FN (False Negatives), predicted 0 / actual 1: the sick patients the doctor incorrectly
diagnosed as healthy.
TN (True Negatives), predicted 0 / actual 0: the healthy patients the doctor correctly
diagnosed as healthy.
Precision: Precision is like the doctor's accuracy in making a correct diagnosis when
they say a patient is sick. It's calculated by the formula TP / (TP + FP), which in
practice would be the number of correctly diagnosed sick patients divided by all the
patients diagnosed as sick.
Recall (Sensitivity): This is how good the doctor is at identifying all the sick patients.
It's TP / (TP + FN), so it's the number of correctly diagnosed sick patients divided by
the actual number of sick patients.
F1 Score: This is a balance between precision and recall. If the doctor wants to be
sure they're both accurate and don’t miss any sick patients, they'll look at the F1 score.
It's the harmonic mean of precision and recall, calculated by 2 * (Precision *
Recall) / (Precision + Recall).
Specificity: This is the doctor's skill at identifying healthy patients. It's TN / (TN +
FP), the number of correctly identified healthy patients divided by the actual number
of healthy patients.
Accuracy: This is the doctor's overall correctness across all diagnoses. It's (TP + TN)
/ (TP + TN + FP + FN), which means all the correct diagnoses (both sick and
healthy) divided by all diagnoses made.
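Putting the formulas together, here is a small sketch that computes all five metrics from hypothetical confusion-matrix counts (the counts are invented for illustration):

```python
# Hypothetical confusion-matrix counts for the doctor example.
TP, FP, FN, TN = 80, 10, 20, 90

precision   = TP / (TP + FP)                   # 80 / 90
recall      = TP / (TP + FN)                   # 80 / 100
f1          = 2 * precision * recall / (precision + recall)
specificity = TN / (TN + FP)                   # 90 / 100
accuracy    = (TP + TN) / (TP + TN + FP + FN)  # 170 / 200

print(f"precision   = {precision:.3f}")
print(f"recall      = {recall:.3f}")
print(f"F1 score    = {f1:.3f}")
print(f"specificity = {specificity:.3f}")
print(f"accuracy    = {accuracy:.3f}")
```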