Machine Learning - Till Chapter 5
Key Points and Review of Lecture Notes: Supervised Machine Learning &
Optimization (Lecture 2)
1. Supervised Machine Learning (ML) Overview
• Supervised Learning: Training a model using labeled data (X, Y) pairs.
• Key Goals:
o Prediction: Estimating the response Y given inputs X.
o Inference: Understanding the relationship between X and Y (see the short sketch after this section).
• Common Supervised ML Methods:
o Linear Regression
o Logistic Regression
o Decision Trees (CART)
o Random Forests
o Boosting
o Regularization (Lasso, Ridge)
o Feature Engineering
• Applications:
o Predicting wine quality, loan defaults, click-through rates, sales volume, etc.
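As a minimal illustration (assuming scikit-learn; the data here is synthetic and purely illustrative), a supervised model is fit on labeled (X, Y) pairs, used for prediction on held-out data, and inspected for inference:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Toy labeled data: X (features), y (target) -- synthetic, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                      # three numeric predictors
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Hold out a test set so performance reflects generalization, not memorization.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LinearRegression().fit(X_train, y_train)   # prediction: estimate Y given X
print(model.coef_)                                  # inference: estimated relationship between X and Y
print(model.score(X_test, y_test))                  # R^2 on held-out data
```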
4. Optimization in Machine Learning
• Optimization is fundamental to machine learning models.
• Key Optimization Problems:
o Ordinary Least Squares (OLS): Minimize residual sum of squares (RSS).
o Regularized Regression: Minimize RSS with a penalty term (Lasso, Ridge).
o Gradient-Based Optimization: Used in deep learning models.
• Convex Optimization:
o If the loss function is convex, every local minimum is a global minimum, so gradient-based methods can reach the global optimum.
o Example: Least Squares Regression (the RSS is convex in the coefficients; see the sketch below).
o Non-convex problems (e.g., deep learning) lack this guarantee and rely on heuristics such as stochastic gradient descent (SGD).
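A brief sketch of these optimization problems (assuming NumPy, with synthetic data): the OLS coefficients solve the normal equations exactly because the RSS is convex, and ridge regression adds an L2 penalty to the same objective:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

# OLS: minimize RSS = ||y - X beta||^2; convexity means the normal equations give the global minimum.
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)

# Ridge: minimize RSS + lambda * ||beta||^2; the penalty shrinks coefficients toward zero.
lam = 1.0
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

print(beta_ols)
print(beta_ridge)
```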
8. Cross-Validation & Best Practices
• Cross-Validation:
o Split data into training & test sets; keep the test set untouched until the final evaluation.
o K-Fold CV: splits the training data into K folds, holds out each fold once for validation while the model trains on the remaining folds, and averages the K scores (a short sketch follows this section).
o Gives a more reliable estimate of generalization error and helps detect overfitting.
• Key Model Selection Criteria:
o Bias-Variance Tradeoff:
▪ High Bias → Model is too simple (underfitting).
▪ High Variance → Model memorizes noise (overfitting).
o Hyperparameter Tuning:
▪ Regularization (L1, L2), Learning Rate, Tree Depth, etc.
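A short K-fold cross-validation sketch (assuming scikit-learn; the data and hyperparameter values are illustrative) comparing two regularization strengths:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 5))
y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.2, size=150)

# Compare two hyperparameter settings with 5-fold cross-validation.
for alpha in (0.1, 10.0):
    cv = KFold(n_splits=5, shuffle=True, random_state=0)
    scores = cross_val_score(Ridge(alpha=alpha), X, y, cv=cv)
    print(f"alpha={alpha}: mean R^2 = {scores.mean():.3f}")
```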
9. Exam Prep Strategy
Conceptual Understanding
✔ Know the difference between Supervised vs. Unsupervised Learning.
✔ Understand Linear Regression, its assumptions, and optimization techniques.
✔ Be familiar with Gradient Descent and why it's useful.
✔ Recognize parametric vs. non-parametric models and their tradeoffs.
✔ Understand the Least Squares Solution in matrix form.
✔ Learn the importance of convexity in optimization.
Practice Problems
✔ Work through regression problems (OLS, Ridge, Lasso).
✔ Perform cross-validation and error analysis.
✔ Solve classification tasks (logistic regression, decision trees, Bayes classifier).
✔ Be comfortable with linear algebra concepts (matrices, gradients, convexity).
Key Points and Review of Lecture Notes: Supervised Machine Learning &
Optimization (Lectures 3 & 4)
• The lecture introduces deep learning for tabular data using feedforward
networks.
• Topics covered:
o Feature engineering in Ames Housing Data.
o Introduction to neural networks and their structure.
o Single hidden layer feedforward networks.
o Backpropagation algorithm for training neural networks.
• Goal: Learn the best weights and biases to minimize the loss function.
• Steps (a minimal NumPy sketch follows this list):
1. Initialize network weights randomly.
2. Compute forward pass to get predictions.
3. Compute loss function (e.g., MSE for regression, cross-entropy for
classification).
4. Use backpropagation to compute gradients.
5. Update weights using Stochastic Gradient Descent (SGD).
6. Repeat until convergence.
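A minimal NumPy sketch of these six steps for a one-hidden-layer regression network (the layer size, learning rate, and data are illustrative assumptions, not values from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data (illustrative only).
X = rng.normal(size=(256, 4))
y = (np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2).reshape(-1, 1)

# 1. Initialize weights randomly (small values break symmetry).
n_hidden = 16
W1 = rng.normal(scale=0.1, size=(4, n_hidden)); b1 = np.zeros(n_hidden)
W2 = rng.normal(scale=0.1, size=(n_hidden, 1)); b2 = np.zeros(1)

lr = 0.05
for epoch in range(500):
    # Mini-batch for stochastic gradient descent.
    idx = rng.choice(len(X), size=32, replace=False)
    xb, yb = X[idx], y[idx]

    # 2. Forward pass.
    z1 = xb @ W1 + b1
    h = np.maximum(z1, 0.0)              # ReLU hidden layer
    pred = h @ W2 + b2

    # 3. Loss (MSE for regression).
    err = pred - yb
    loss = np.mean(err ** 2)

    # 4. Backpropagation: chain rule from output layer back to input layer.
    grad_pred = 2 * err / len(xb)
    grad_W2 = h.T @ grad_pred
    grad_b2 = grad_pred.sum(axis=0)
    grad_h = grad_pred @ W2.T
    grad_z1 = grad_h * (z1 > 0)          # ReLU derivative
    grad_W1 = xb.T @ grad_z1
    grad_b1 = grad_z1.sum(axis=0)

    # 5. SGD update.
    W2 -= lr * grad_W2; b2 -= lr * grad_b2
    W1 -= lr * grad_W1; b1 -= lr * grad_b1

# 6. Repeat until convergence (a fixed number of epochs here).
print("final mini-batch MSE:", loss)
```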
2. Optimization & Regularization
• Gradient Descent:
o Updates weights based on loss gradients.
• Stochastic Gradient Descent (SGD):
o Uses mini-batches instead of full dataset for efficiency.
• Regularization:
o L2 Regularization (Weight Decay): Prevents overfitting by penalizing large
weights.
o Dropout: Randomly zeroes out neurons during training to improve generalization (see the sketch below).
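The same training ideas with mini-batch SGD, L2 weight decay, and dropout, sketched with PyTorch (assumed here for brevity; the architecture and hyperparameters are illustrative):

```python
import torch
from torch import nn

torch.manual_seed(0)
X = torch.randn(256, 4)
y = torch.sin(X[:, :1])                      # illustrative target

model = nn.Sequential(
    nn.Linear(4, 32),
    nn.ReLU(),
    nn.Dropout(p=0.2),                       # dropout: randomly zeroes hidden units during training
    nn.Linear(32, 1),
)
# weight_decay adds an L2 penalty on the weights to the SGD update.
opt = torch.optim.SGD(model.parameters(), lr=0.05, weight_decay=1e-4)
loss_fn = nn.MSELoss()

model.train()                                # dropout active during training
for step in range(200):
    idx = torch.randint(0, len(X), (32,))    # mini-batch for stochastic gradient descent
    loss = loss_fn(model(X[idx]), y[idx])
    opt.zero_grad()
    loss.backward()                          # backpropagation computes the gradients
    opt.step()

model.eval()                                 # dropout disabled at test time
print(loss_fn(model(X), y).item())
```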
Activation functions introduce non-linearity into neural networks, allowing them to model
complex relationships.
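For reference, three common activation functions written out as a small NumPy sketch:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))    # squashes to (0, 1)

def tanh(z):
    return np.tanh(z)                   # squashes to (-1, 1), zero-centered

def relu(z):
    return np.maximum(z, 0.0)           # keeps positive inputs, zeroes the rest

z = np.linspace(-4, 4, 9)
print(sigmoid(z), tanh(z), relu(z), sep="\n")
```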
3. Backpropagation Algorithm
Steps in Backpropagation
1. Forward Pass
o Compute predictions from the inputs using the current weights, layer by layer.
2. Compute Loss Function
o Measures how far predictions are from actual values.
o Examples: MSE (for regression), cross-entropy (for classification).
3. Backward Pass (Gradient Computation)
o Compute derivatives of the loss function w.r.t. each weight using the chain
rule.
o Compute gradients layer-by-layer, from output to input.
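As a worked example of the chain rule (the notation is assumed here, not taken verbatim from the lecture): for a one-hidden-layer network with hidden activation h = sigma(W1 x + b1), output y_hat = W2 h + b2, and squared-error loss L = (1/2)(y_hat - y)^2, the layer-by-layer gradients are:

```latex
\begin{aligned}
\delta_2 &= \frac{\partial L}{\partial \hat{y}} = \hat{y} - y \\
\frac{\partial L}{\partial W_2} &= \delta_2\, h^{\top}, \qquad
\frac{\partial L}{\partial b_2} = \delta_2 \\
\delta_1 &= \bigl(W_2^{\top} \delta_2\bigr) \odot \sigma'(W_1 x + b_1) \\
\frac{\partial L}{\partial W_1} &= \delta_1\, x^{\top}, \qquad
\frac{\partial L}{\partial b_1} = \delta_1
\end{aligned}
```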
4. Feature Engineering
Feature engineering is the process of transforming raw data into a format that improves model
performance.
• Deep learning can learn features automatically in domains like image and text
processing.
• However, for tabular data, feature engineering is still often necessary.
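A small pandas sketch of typical tabular feature engineering (the column names are hypothetical, not the actual Ames Housing schema):

```python
import numpy as np
import pandas as pd

# Hypothetical housing-style columns, for illustration only.
df = pd.DataFrame({
    "living_area": [1500, 2400, 900, 3100],
    "year_built": [1995, 2005, 1940, 2010],
    "neighborhood": ["A", "B", "A", "C"],
    "sale_price": [200_000, 350_000, 120_000, 500_000],
})

# Typical engineered features for a linear model or neural network:
df["log_price"] = np.log(df["sale_price"])           # tame a right-skewed target
df["age_at_sale"] = 2025 - df["year_built"]           # derived numeric feature (reference year assumed)
df = pd.get_dummies(df, columns=["neighborhood"])     # one-hot encode a categorical feature
print(df.head())
```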
5. Loss Functions
Loss functions measure how well the neural network’s predictions match the actual values.
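Both losses are easy to write down directly; a NumPy sketch with illustrative inputs:

```python
import numpy as np

def mse(y_true, y_pred):
    # Mean squared error: average squared distance between prediction and target (regression).
    return np.mean((y_true - y_pred) ** 2)

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    # Cross-entropy for binary labels; p_pred are predicted probabilities in (0, 1).
    p = np.clip(p_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

print(mse(np.array([1.0, 2.0]), np.array([1.5, 1.5])))
print(binary_cross_entropy(np.array([1, 0, 1]), np.array([0.9, 0.2, 0.7])))
```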
Practice Question: Activation Functions for Binary Classification
You are training a binary classification model and need to choose an activation function.
(a) Which activation function should be used in the final output layer?
(b) What are two potential problems of using a sigmoid activation in hidden layers?
Solution:
• (a) For binary classification, the sigmoid activation is used in the output layer
because it maps predictions to the range (0,1), making it interpretable as a
probability.
• (b) Two problems of using sigmoid in hidden layers:
1. Vanishing Gradient Problem – Gradients become very small for extreme
values, slowing down training.
2. Outputs are not zero-centered – Sigmoid outputs are always positive,
making gradient updates less efficient.
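A quick NumPy check of point (b): the sigmoid derivative sigma'(z) = sigma(z)(1 - sigma(z)) peaks at 0.25 and collapses toward zero for large |z|, and the outputs are never zero-centered:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-10.0, -4.0, 0.0, 4.0, 10.0])
s = sigmoid(z)
grad = s * (1 - s)                  # derivative of the sigmoid

print(s)      # all outputs lie in (0, 1): never zero-centered
print(grad)   # about 4.5e-05 at |z| = 10: the vanishing gradient problem
```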
5. Conceptual Questions
Question 5: Deep Learning for Tabular Data
(a) Why do tree-based models (Random Forest, XGBoost) often outperform neural
networks on tabular data?
(b) When would deep learning be preferable for tabular data?
Answer:
1. Network Architecture
Question 1: Why Do We Need Hidden Layers?
(a) What can a single-layer perceptron learn on its own?
(b) Why do additional hidden layers help?
Answer:
• (a) A single-layer perceptron can only learn linear decision boundaries. If the
data is not linearly separable, it will fail to learn meaningful patterns (e.g., XOR
problem).
• (b) Additional hidden layers allow the network to:
1. Capture non-linear relationships.
2. Learn hierarchical features (e.g., in images, edges → shapes → objects).
3. Model complex interactions in high-dimensional data.
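To make this concrete, a tiny NumPy sketch (with weights chosen by hand rather than learned) shows that a single hidden layer suffices to compute XOR, which no single-layer perceptron can represent:

```python
import numpy as np

step = lambda z: (z > 0).astype(int)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

# Hidden layer: h1 fires if at least one input is on, h2 fires only if both are on.
h1 = step(X @ np.array([1, 1]) - 0.5)
h2 = step(X @ np.array([1, 1]) - 1.5)

# Output: "at least one, but not both" = XOR, a boundary no single linear threshold unit can draw.
out = step(h1 - h2 - 0.5)
print(out)   # [0 1 1 0]
```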
2. Activation Functions
Question 2: Why Not Always Use ReLU?
(a) What are the advantages of ReLU over sigmoid and tanh?
(b) What are the potential problems with ReLU, and how can they be addressed?
Answer:
3. Loss Functions
Question 3: Why Use Cross-Entropy Loss for Classification?
(a) Why is Mean Squared Error (MSE) a poor choice for classification?
(b) How does cross-entropy loss work in binary and multiclass classification?
Answer:
4. Backpropagation
Question 4: Why Do We Need Backpropagation?
(a) Why can’t we compute weight updates directly like in linear regression?
(b) How does backpropagation efficiently compute gradients?
Answer:
7. Regularization Techniques
Question 7: Why Do We Use Dropout?
(a) What problem does dropout solve?
(b) How does dropout work during training and testing?
Answer:
9. Weight Initialization
Question 9: Why Not Initialize All Weights to Zero?
(a) What happens if all weights start at zero?
(b) What is a better initialization strategy?
Answer: