Pre Class Read Study Guide
Quiz
1. Define bias in the context of machine learning models and provide a real-world analogy.
2. Explain how high bias affects a model's performance and name this effect.
3. Define variance in the context of machine learning models and provide a real-world analogy.
4. Explain how high variance affects a model's performance and name this effect.
5. Define underfitting and provide an example of a modeling choice that causes it.
6. Define overfitting and explain why an overfit model fails on new data.
7. Explain how regularization helps to reduce overfitting.
8. Explain how cross-validation can be used to detect overfitting.
9. Describe how a high-bias (underfitting) outcome can be corrected.
10. Explain how ensemble methods reduce prediction errors.
Answer Key
1. Bias represents the error due to overly simplistic assumptions in the learning model, which fail to capture the underlying data patterns. An analogy is trying to judge the complexity of a book by reading only its first page.
2. High bias leads to underfitting, where the model cannot learn well from the training data and
fails to make accurate predictions on both training and unseen data. The model is too simple to
capture the data's complexity.
3. Variance measures the model's sensitivity to small changes in the training data; a high-variance model ends up fitting the noise rather than the signal. An analogy is memorizing every answer in a textbook without understanding the concepts.
4. High variance leads to overfitting, where the model performs excellently on the training data but
poorly on unseen test data. It has memorized the training data, including the noise, and cannot
generalize to new data.
5. Underfitting occurs when a model has high bias. For instance, predicting house prices with a simple average price instead of using features like size and location is underfitting due to high bias, because key information is being ignored (a sketch of this example follows the answer key).
6. Overfitting occurs when a model has high variance. For example, a model that learns the specific details and noise of its training data, making it unreliable on other datasets, is overfitting because it is overly sensitive to the training data.
7. Regularization adds constraints to the model to reduce overfitting, which occurs when the model is too complex and has high variance. Techniques such as Lasso and Ridge regression penalize large coefficients, effectively simplifying the model (see the regularization sketch after the answer key).
8. Cross-validation tests the model on multiple subsets of the data. If the model performs well on the training subsets but poorly on the validation subsets, it is overfitting to the training data (see the cross-validation sketch after the answer key).
9. To fix a high-bias outcome, which indicates underfitting, use a more complex model. This can mean adding layers to a neural network or replacing a linear model with a non-linear one (see the polynomial-features sketch after the answer key).
10. Ensemble methods combine multiple models and aggregate their predictions. Averaging over diverse models cancels out individual errors, so ensembles can reduce both bias and variance, leading to more robust and accurate predictions (see the Random Forest sketch after the answer key).
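To make answer 5 concrete, below is a minimal sketch, assuming scikit-learn and a synthetic house-price dataset (the feature names, units, and coefficients are all illustrative), comparing a mean-only baseline against a model that actually uses the features.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
size = rng.uniform(50, 250, 300)    # floor area (illustrative units)
location = rng.integers(0, 5, 300)  # coded neighbourhood quality
price = 2000 * size + 30000 * location + rng.normal(scale=20000, size=300)
X = np.column_stack([size, location])

baseline = DummyRegressor(strategy="mean").fit(X, price)  # ignores every feature
informed = LinearRegression().fit(X, price)               # uses size and location

print("mean-only R^2:", baseline.score(X, price))  # 0.0: high bias, underfits
print("linear R^2:   ", informed.score(X, price))  # close to 1: features help
```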
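For answer 7, here is a minimal regularization sketch, assuming scikit-learn; the synthetic data, with only two informative features out of twenty, is purely illustrative. Ridge (L2) shrinks all coefficients, while Lasso (L1) can set some to exactly zero.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))  # 20 features; only the first two matter
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=100)

ols = LinearRegression().fit(X, y)   # unpenalized baseline
ridge = Ridge(alpha=1.0).fit(X, y)   # L2: shrinks all coefficients toward zero
lasso = Lasso(alpha=0.1).fit(X, y)   # L1: drives some coefficients to exactly zero

print("largest |coef|, OLS:  ", np.abs(ols.coef_).max())
print("largest |coef|, Ridge:", np.abs(ridge.coef_).max())
print("coefficients zeroed by Lasso:", int(np.sum(lasso.coef_ == 0.0)))
```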
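For answer 8, a minimal cross-validation sketch, again assuming scikit-learn and synthetic data: an unconstrained decision tree scores near-perfectly on the training folds but much lower on the held-out folds, which is the overfitting signature described above.

```python
import numpy as np
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X[:, 0] ** 2 + rng.normal(scale=0.3, size=200)

# A fully grown tree can memorize its training folds.
model = DecisionTreeRegressor(random_state=0)
scores = cross_validate(model, X, y, cv=5, return_train_score=True)

print("mean train R^2:", scores["train_score"].mean())  # ~1.0: memorized
print("mean valid R^2:", scores["test_score"].mean())   # much lower: overfitting
```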
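For answer 9, a minimal polynomial-features sketch of reducing bias by swapping a linear model for a more flexible one; the degree and the synthetic sine-shaped data are illustrative choices, not a recipe.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

linear = LinearRegression().fit(X, y)  # too simple for a sine curve: underfits
cubic = make_pipeline(PolynomialFeatures(degree=3),
                      LinearRegression()).fit(X, y)  # more flexible model

print("linear R^2:", linear.score(X, y))  # lower: high bias
print("cubic R^2: ", cubic.score(X, y))   # higher: bias reduced
```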
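For answer 10, a minimal Random Forest sketch contrasting a single high-variance decision tree with an ensemble that averages many trees; the data is synthetic and illustrative, and scikit-learn is assumed.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = X[:, 0] * X[:, 1] + rng.normal(scale=0.3, size=300)

tree = DecisionTreeRegressor(random_state=0)                      # high variance alone
forest = RandomForestRegressor(n_estimators=100, random_state=0)  # averaged trees

print("single tree, mean CV R^2:  ", cross_val_score(tree, X, y, cv=5).mean())
print("random forest, mean CV R^2:", cross_val_score(forest, X, y, cv=5).mean())
```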
Essay Questions
1. Explain the bias-variance trade-off in detail, using examples to illustrate how adjusting a model's
complexity impacts its bias and variance.
2. Compare and contrast underfitting and overfitting. Discuss their causes, effects, and methods for
mitigating each.
3. Discuss the impact of data quality and quantity on the performance of machine learning models,
and how these factors relate to bias and variance.
4. Evaluate and contrast different techniques for addressing overfitting, such as regularization and
cross-validation. Discuss their advantages and disadvantages.
5. Explain how ensemble methods, such as Random Forest, help to reduce both bias and variance.
Provide a real-world scenario where this approach would be particularly useful.
Glossary of Key Terms
Bias: The error introduced by approximating a real-life problem, which is often complex, by a
simplified model.
Variance: The sensitivity of the model to small changes in the training data; a high variance
indicates the model fits to noise.
Underfitting: A situation where the model is too simple to capture the underlying structure of
the data, resulting in high bias and poor performance.
Overfitting: A situation where the model learns the training data too well, including noise,
leading to high variance and poor generalization to new data.
Data Augmentation: A technique to artificially increase the size of a training dataset by creating
modified versions of images or other data, reducing the risk of overfitting (a minimal sketch
follows this glossary).
Ensemble Methods: Machine learning techniques that combine multiple base models to
produce one optimal predictive model. A common example is Random Forest.
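As a minimal sketch of the Data Augmentation entry above, the snippet below doubles a hypothetical image batch by adding horizontally flipped copies; the shapes and data are illustrative, and only NumPy is used.

```python
import numpy as np

def augment_with_flips(images):
    """Return the batch plus a horizontally flipped copy of every image."""
    flipped = images[:, :, ::-1]  # reverse the width axis of each (H, W) image
    return np.concatenate([images, flipped], axis=0)

batch = np.random.rand(8, 28, 28)  # hypothetical batch of 8 grayscale images
augmented = augment_with_flips(batch)
print(augmented.shape)  # (16, 28, 28): the training set has doubled
```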