Machine Learning - Till Chapter5

The document provides an overview of data analytics and machine learning, emphasizing the distinction between the two fields and their applications in optimization and online advertising. It details various predictive models, optimization techniques, and the structure and training of neural networks, particularly in relation to tabular data. Additionally, it outlines exam expectations, study tips, and key concepts necessary for understanding supervised learning and optimization in machine learning.


1. Introduction to Data Analytics and Machine Learning


• Data Analytics: Using data to build models for better decision-making.
o Descriptive: Summarizing patterns (e.g., visualization, clustering).
o Predictive: Forecasting outcomes (e.g., regression, classification).
o Generative: Modeling distributions (e.g., text/image generation).
o Prescriptive: Data-driven optimization and decision-making.
• Machine Learning vs. Data Analytics:
o ML focuses on patterns and predictions.
o Data analytics emphasizes decisions and value creation.

2. Optimization in Machine Learning


• Optimization plays a foundational role in analytics.
• Used in:
o Training models (gradient-based optimization).
o Decision-focused learning (end-to-end optimization).
o Reinforcement learning (dynamic decision-making).

3. Online Advertising and Ad Auctions


• Google AdWords:
o Revenue model based on Pay-Per-Click (PPC).
o Ads placed using a Generalized Second-Price Auction.
o Quality Score (QS) = Bid × Click-Through Rate (CTR) × 1000.
• CTR (Click-Through Rate):
o Measures probability of a user clicking an ad.
o Varies based on position, keyword, user profile, and device.
• Google's Optimization Problem:
o Maximize revenue while considering:
▪ Advertisers' budgets.
▪ CTR prediction models.
▪ Quality Score for ad ranking.
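A toy sketch of the ranking rule above (advertiser names, bids, and CTRs are made up, and the pricing step is one common form of the generalized second-price rule, not necessarily the exact one used in the lecture):

# Rank ads by score = bid * CTR * 1000 using hypothetical values.
ads = [
    {"advertiser": "A", "bid": 2.00, "ctr": 0.04},
    {"advertiser": "B", "bid": 3.50, "ctr": 0.02},
    {"advertiser": "C", "bid": 1.50, "ctr": 0.05},
]
for ad in ads:
    ad["score"] = ad["bid"] * ad["ctr"] * 1000      # ranking score from the lecture

ranking = sorted(ads, key=lambda a: a["score"], reverse=True)

# Generalized second price: each winner pays just enough (per click) to keep its slot;
# the lowest-ranked ad would pay a reserve price, omitted in this toy example.
for slot, (ad, nxt) in enumerate(zip(ranking, ranking[1:]), start=1):
    price_per_click = nxt["score"] / (ad["ctr"] * 1000)
    print(slot, ad["advertiser"], ad["score"], round(price_per_click, 2))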

4. Predictive Models in Online Advertising


• Supervised Learning Approaches:
o Linear Regression: Baseline approach.
o CART Decision Trees: Simple interpretable models.
o Random Forests & Boosting: High predictive accuracy.
• CTR Prediction Models:
o Factors include ad position, user demographics, past CTRs.
o Used to optimize ad placement and maximize clicks.

5. Key Course Topics for Exam


• Gradient-Based Optimization: Used in deep learning models.
• Deep Learning Architectures:
o Feedforward, CNNs, RNNs, Transformers.
• Generative AI:
o Language models (GPT), diffusion models.
• Survival Analysis:
o Demand prediction using hazard functions.
• Reinforcement Learning:
o Decision-making under uncertainty.

6. Exam Format and Expectations


• Midterm Date: March 20, 2025 (In-class).
• Exam Focus:
o Conceptual Questions (based on lectures and homework).
o Data Analysis Questions (interpreting models, optimization).
• Grading Breakdown:
o Homework (30%)
o Midterm (30%)
o Final Project (30%)
o Discussion Lab (10%).

7. Additional Study Tips


• Review Lecture Notes: Focus on key models and optimization techniques.
• Practice Homework Problems: Many exam questions are similar.
• Understand Google Ad Optimization: Expect questions on auctions and CTR.
• Work with Python & PyTorch: Coding fluency may be tested in data analysis.

Key Points and Review of Lecture Notes: Supervised Machine Learning &
Optimization (Lecture 2)
1. Supervised Machine Learning (ML) Overview
• Supervised Learning: Training a model using labeled data (X, Y) pairs.
• Key Goals:
o Prediction: Estimating Y given X.
o Inference: Understanding relationships between X and Y.
• Common Supervised ML Methods:
o Linear Regression
o Logistic Regression
o Decision Trees (CART)
o Random Forests
o Boosting
o Regularization (Lasso, Ridge)
o Feature Engineering
• Applications:
o Predicting wine quality, loan defaults, click-through rates, sales volume, etc.
4. Optimization in Machine Learning
• Optimization is fundamental to machine learning models.
• Key Optimization Problems:
o Ordinary Least Squares (OLS): Minimize residual sum of squares (RSS).
o Regularized Regression: Minimize RSS with a penalty term (Lasso, Ridge).
o Gradient-Based Optimization: Used in deep learning models.
• Convex Optimization:
o If the loss function is convex, global minimization is guaranteed.
o Example: Least Squares Regression.
o Non-convex problems (e.g., deep learning) require heuristics (e.g., SGD).
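A minimal scikit-learn sketch of the three objectives above, fit on synthetic data (the data and penalty strengths are illustrative only):

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                              # 200 samples, 5 features (synthetic)
y = X @ np.array([1.0, 0.5, 0.0, -2.0, 0.0]) + rng.normal(scale=0.1, size=200)

ols = LinearRegression().fit(X, y)     # OLS: minimize RSS
ridge = Ridge(alpha=1.0).fit(X, y)     # RSS + alpha * (sum of squared coefficients)
lasso = Lasso(alpha=0.1).fit(X, y)     # RSS/(2n) + alpha * (sum of absolute coefficients)

print(ols.coef_, ridge.coef_, lasso.coef_, sep="\n")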
8. Cross-Validation & Best Practices
• Cross-Validation:
o Split Data into Training & Test sets.
o K-Fold CV: Uses multiple subsets to ensure robustness.
o Helps to prevent overfitting.
• Key Model Selection Criteria:
o Bias-Variance Tradeoff:
▪ High Bias → Model is too simple (underfitting).
▪ High Variance → Model memorizes noise (overfitting).
o Hyperparameter Tuning:
▪ Regularization (L1, L2), Learning Rate, Tree Depth, etc.
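A short K-fold cross-validation sketch in scikit-learn (5 folds; the model and synthetic data are placeholders):

import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import Ridge

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 10))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=300)

cv = KFold(n_splits=5, shuffle=True, random_state=0)       # 5 train/validation splits
scores = cross_val_score(Ridge(alpha=1.0), X, y, cv=cv,
                         scoring="neg_mean_squared_error")
print("mean CV MSE:", -scores.mean())                      # average held-out error across folds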
9. Exam Prep Strategy
• Conceptual Understanding:
✔ Know the difference between Supervised vs. Unsupervised Learning.
✔ Understand Linear Regression, its assumptions, and optimization techniques.
✔ Be familiar with Gradient Descent and why it's useful.
✔ Recognize parametric vs. non-parametric models and their tradeoffs.
✔ Understand the Least Squares Solution in matrix form.
✔ Learn the importance of convexity in optimization.
• Practice Problems:
✔ Work through regression problems (OLS, Ridge, Lasso).
✔ Perform cross-validation and error analysis.
✔ Solve classification tasks (logistic regression, decision trees, Bayes classifier).
✔ Be comfortable with linear algebra concepts (matrices, gradients, convexity).

Key Points and Review of Lecture Notes: Supervised Machine Learning &
Optimization (Lectures 3 & 4)

1. Review of Optimization Concepts & Unconstrained Problems
• Optimization is a fundamental aspect of machine learning, used to minimize loss
functions.
• Unconstrained Optimization: Involves minimizing a function without explicit
constraints.
• First-Order Condition: The optimal solution occurs when the gradient of the function
is zero.
• Convexity: If the function is convex, then any local minimum is also a global minimum.

2. Regularized Loss Function Minimization


• Many ML models are trained by minimizing a loss function that measures prediction
error.
• Regularization helps prevent overfitting by adding a penalty term.

Application to Ames Housing Dataset

• Linear regression used to predict housing prices.


• Training data: 2006-2008 (1936 samples).
• Test data: 2009-2010 (989 samples).
• Loss function: Mean Squared Error (MSE), which is equivalent to Residual Sum of
Squares (RSS).

3. Machine Learning as Loss Function Minimization


• The goal of ML algorithms is to minimize the loss function.
• A loss function measures how well the model’s predictions match actual values.
• Formulation:
o Regression: Least Squares Loss (RSS).
o Classification: Different loss functions used.

4. Gradient Descent for Optimization


• Gradient Descent: Iterative method for minimizing functions.
• Key Idea: Move in the direction of steepest descent (negative gradient).
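A minimal NumPy gradient descent loop applied to least squares (the step size and iteration count are arbitrary illustrative choices):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

beta = np.zeros(3)
lr = 0.1                                           # learning rate (step size)
for _ in range(500):
    grad = 2 * X.T @ (X @ beta - y) / len(y)       # gradient of the mean squared error
    beta -= lr * grad                              # step in the negative gradient direction
print(beta)                                        # approaches the least squares solution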
Key Points for Exam Preparation - Feedforward Neural Networks (Lecture 5)
1. Overview of the Lecture

• The lecture introduces deep learning for tabular data using feedforward
networks.
• Topics covered:
o Feature engineering in Ames Housing Data.
o Introduction to neural networks and their structure.
o Single hidden layer feedforward networks.
o Backpropagation algorithm for training neural networks.

2. Ames Housing Data & Feature Engineering

• Ames Housing Dataset is used as a real-world dataset for regression tasks.


• Dependent Variable: Log of sale price.
• Independent Variables (~80 features):
o Zoning classification, dwelling type, year built, quality rating, living area, etc.
• Training/Test Split:
o Training Data: 2006-2008 sales (66% of data).
o Test Data: 2009-2010 sales (34% of data).
• Feature Engineering:
o Adding polynomial terms (e.g., 10-degree polynomials).
o Adding time-based trend features.
o Final dataset contains ~500 features.
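A hedged sketch of this kind of feature engineering (the column names follow the Ames dataset's usual naming, but the exact features and degrees used in the lecture are assumptions):

import pandas as pd
from sklearn.preprocessing import PolynomialFeatures

# Assume `df` is the Ames housing dataframe; this tiny frame is only a stand-in.
df = pd.DataFrame({"GrLivArea": [1500.0, 2100.0, 900.0], "YrSold": [2006, 2007, 2008]})

poly = PolynomialFeatures(degree=10, include_bias=False)   # degree-10 polynomial terms
area_poly = poly.fit_transform(df[["GrLivArea"]])          # (in practice the column is scaled first)

df["trend"] = df["YrSold"] - df["YrSold"].min()            # simple time-based trend feature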
3. Introduction to Neural Networks

• Definition: A neural network is a nonlinear statistical model that learns relationships between inputs and outputs.
• Applications:
o Natural Language Processing (NLP)
o Image Recognition
o Speech Recognition
o Autonomous Driving
• Timeline of Deep Learning:
o 1940s: Early neural networks (McCulloch-Pitts model).
o 1980s: Neural networks gain popularity.
o 1990-2010: Decline in favor of tree-based models (Random Forest,
Boosting).
o 2010-present: Resurgence with deep learning.
• Why the resurgence?
o Availability of large datasets.
o Advances in GPU computation.
o Efficient software frameworks.

4. Feedforward Networks & Tabular Data

• Definition: A feedforward neural network (FNN) is a multilayer nonlinear function.


• Structure:
o Input layer: Takes feature values as input.
o Hidden layer(s): Applies weighted transformations and activation functions.
o Output layer: Produces final predictions.
• Tabular Data:
o Conventional ML models (Random Forest, Boosting) often outperform deep
learning.
o Deep learning does not necessarily outperform classical models on
tabular data.
o However, deep learning automatically performs feature engineering.

5. Neural Network Structure

• Single Hidden Layer Network:


o Input Layer: Takes in feature values.
o Hidden Layer: Transforms inputs using weighted sums + activation
functions.
o Output Layer: Final predictions.
• Key Components:
o Nodes: Perform computations.
o Edges: Represent connections with weights.
o Bias Terms: Offsets for flexibility.
o Activation Functions: Introduce nonlinearity.
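A minimal NumPy sketch of one forward pass through such a single-hidden-layer network (the layer sizes and the ReLU activation are illustrative choices):

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=5)                            # one input with 5 features

W1, b1 = rng.normal(size=(8, 5)), np.zeros(8)     # hidden layer: 8 units (weights + biases)
W2, b2 = rng.normal(size=(1, 8)), np.zeros(1)     # output layer: 1 unit (regression)

h = np.maximum(0, W1 @ x + b1)                    # weighted sum followed by ReLU activation
y_hat = W2 @ h + b2                               # identity activation for a regression output
print(y_hat)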

6. Training a Feedforward Neural Network

• Goal: Learn the best weights and biases to minimize the loss function.
• Steps:
1. Initialize network weights randomly.
2. Compute forward pass to get predictions.
3. Compute loss function (e.g., MSE for regression, cross-entropy for
classification).
4. Use backpropagation to compute gradients.
5. Update weights using Stochastic Gradient Descent (SGD).
6. Repeat until convergence.
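A compact PyTorch sketch of these six steps (the architecture, learning rate, and synthetic data are placeholder choices; the full batch is used here for brevity):

import torch
from torch import nn

torch.manual_seed(0)
X = torch.randn(256, 10)                                    # synthetic features
y = 2 * X[:, :1] - X[:, 1:2] + 0.1 * torch.randn(256, 1)    # synthetic target

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))  # step 1: random init
loss_fn = nn.MSELoss()                                      # regression loss
opt = torch.optim.SGD(model.parameters(), lr=0.01)

for epoch in range(100):
    opt.zero_grad()
    loss = loss_fn(model(X), y)        # steps 2-3: forward pass and loss
    loss.backward()                    # step 4: backpropagation computes gradients
    opt.step()                         # step 5: SGD weight update (repeat = step 6)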

7. Backpropagation Algorithm

• Backpropagation = Gradient Descent + Chain Rule.


• Steps:
1. Forward Pass: Compute output layer activations.
2. Compute Loss: Compare predictions to ground truth.
3. Backward Pass: Compute partial derivatives of loss with respect to each
weight.
4. Update Weights: Apply gradient descent to minimize loss.
• Key Concepts in Backpropagation:
o Chain Rule: Computes gradients layer by layer.
o Gradient Descent: Updates weights based on gradient.
o Optimization: Uses learning rate to control step size.
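A tiny chain-rule check on a single sigmoid neuron, comparing a hand-derived gradient with PyTorch autograd (all numbers are arbitrary):

import torch

x, t = torch.tensor(1.5), torch.tensor(1.0)       # input and target
w = torch.tensor(0.3, requires_grad=True)
b = torch.tensor(-0.1, requires_grad=True)

y = torch.sigmoid(w * x + b)                       # forward pass
loss = (y - t) ** 2                                # squared-error loss
loss.backward()                                    # backward pass via autograd

manual = 2 * (y - t) * y * (1 - y) * x             # chain rule by hand: dL/dw
print(w.grad.item(), manual.item())                # the two values should agree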
9. Optimization in Neural Networks

• Gradient Descent:
o Updates weights based on loss gradients.
• Stochastic Gradient Descent (SGD):
o Uses mini-batches instead of full dataset for efficiency.
• Regularization:
o L2 Regularization (Weight Decay): Prevents overfitting by penalizing large
weights.
o Dropout: Randomly removes neurons during training to enhance
generalization.
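A short PyTorch sketch of the two regularizers named above: a Dropout layer inside the model and L2 weight decay on the optimizer (the rates are illustrative):

import torch
from torch import nn

model = nn.Sequential(
    nn.Linear(10, 64), nn.ReLU(),
    nn.Dropout(p=0.5),                 # randomly zeroes hidden activations during training
    nn.Linear(64, 1),
)

# weight_decay adds an L2 penalty on the weights to every SGD update
opt = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)

model.train()   # dropout active while training
model.eval()    # dropout disabled at evaluation (PyTorch rescales during training instead)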

10. Summary of Key Takeaways

• Neural networks are nonlinear models useful in various applications.


• Feedforward networks consist of input, hidden, and output layers.
• Activation functions introduce nonlinearity (e.g., ReLU, Sigmoid).
• Training involves: Forward pass, loss computation, backpropagation, weight
updates.
• Backpropagation efficiently computes gradients using the chain rule.
• Loss functions vary based on the problem type (MSE, cross-entropy, etc.).
• Optimization uses gradient descent (SGD) with regularization techniques.
How to Prepare for the Exam

1. Understand the structure of neural networks (input, hidden, output layers).


2. Know the different activation functions and their roles.
3. Be comfortable with backpropagation (forward pass, loss function, gradient
computation).
4. Review feature engineering concepts (why it's needed in neural networks).
5. Understand loss functions (MSE, logistic loss, cross-entropy).
6. Learn optimization techniques (SGD, weight decay, dropout).
7. Be prepared for conceptual questions on why deep learning does/does not work
well for tabular data.
8. Solve numerical problems related to forward and backward propagation.

(c) Output Layer

• Produces the final prediction for the model.


• Output depends on the type of problem:
o Regression: Single output, uses identity activation function.
o Binary classification: Single output with sigmoid activation.
o Multiclass classification: Multiple outputs, uses softmax activation.
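A brief sketch of how these three output choices are typically paired with PyTorch losses (the layer sizes are placeholders):

from torch import nn

regression_head = nn.Linear(32, 1)     # identity output, paired with nn.MSELoss()
binary_head = nn.Linear(32, 1)         # logit output; nn.BCEWithLogitsLoss() applies the sigmoid
multiclass_head = nn.Linear(32, 10)    # logit outputs; nn.CrossEntropyLoss() applies softmax internally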
2. Activation Functions and Their Roles

Activation functions introduce non-linearity into neural networks, allowing them to model
complex relationships.
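For reference, the activation functions most often mentioned in these notes are:

ReLU(x) = max(0, x)
sigmoid(x) = 1 / (1 + e^(−x)), with outputs in (0, 1)
tanh(x) = (e^x − e^(−x)) / (e^x + e^(−x)), with outputs in (−1, 1)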
3. Backpropagation Algorithm

Backpropagation is used to compute gradients efficiently when training a neural network.

Steps in Backpropagation

1. Forward Pass
o Compute predictions using current weights.
o Calculate loss based on predictions and actual values.
2. Compute Loss Function
o Measures how far predictions are from actual values.
o Examples: MSE (for regression), cross-entropy (for classification).
3. Backward Pass (Gradient Computation)
o Compute derivatives of the loss function w.r.t. each weight using the chain
rule.
o Compute gradients layer-by-layer, from output to input.
4. Update Weights
o Apply gradient descent to adjust each weight and reduce the loss, repeating until convergence.

4. Feature Engineering in Neural Networks

Feature engineering is the process of transforming raw data into a format that improves model
performance.

Why is Feature Engineering Needed?

• Neural networks do not always automatically capture complex patterns in tabular data.
• Feature engineering helps in:
o Handling categorical variables (e.g., one-hot encoding).
o Scaling numerical features (e.g., standardization).
o Creating polynomial features for better non-linear relationships.
o Capturing temporal trends (e.g., time-based features in Ames Housing
Data).
Can Feature Engineering Be Automated?

• Deep learning can learn features automatically in domains like image and text
processing.
• However, for tabular data, feature engineering is still often necessary.
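A minimal scikit-learn sketch of these steps (the column names are hypothetical, not the actual Ames columns used in the lecture):

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({"zoning": ["RL", "RM", "RL"],             # categorical (hypothetical)
                   "living_area": [1500.0, 900.0, 2100.0],   # numerical
                   "year_sold": [2006, 2007, 2008]})

pre = ColumnTransformer([
    ("onehot", OneHotEncoder(handle_unknown="ignore"), ["zoning"]),   # one-hot encoding
    ("scale", StandardScaler(), ["living_area", "year_sold"]),        # standardization
])
X = pre.fit_transform(df)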

5. Loss Functions

Loss functions measure how well the neural network’s predictions match the actual values.

(a) Mean Squared Error (MSE)
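The average squared difference between predictions and actual values:

MSE = (1/n) × Σ_{i=1..n} (y_i − ŷ_i)²

Minimizing MSE over the training data is equivalent to minimizing the residual sum of squares (RSS).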


7. Why Deep Learning Does/Does Not Work Well for Tabular Data
(a) Why Deep Learning May NOT Work Well

• Tree-based models (Random Forest, XGBoost) often outperform deep learning for tabular data.
• Tabular data does not have hierarchical structure like images or text.
• Difficult to tune hyperparameters in neural networks.

(b) When Deep Learning Might Work

• If dataset is very large and contains complex interactions.


• If automated feature extraction is required.
Final Review Checklist

✔ Understand neural network structure (input, hidden, output layers).
✔ Know activation functions and their use cases.
✔ Be comfortable with forward and backward propagation.
✔ Understand feature engineering and when it's necessary.
✔ Learn loss functions for regression and classification.
✔ Understand SGD, weight decay, dropout for optimization.
✔ Be able to explain why deep learning works/does not work for tabular data.
✔ Solve numerical problems involving forward and backpropagation.
2. Activation Functions
Question 2: Activation Function Choice

You are training a binary classification model and need to choose an activation function.
(a) Which activation function should be used in the final output layer?
(b) What are two potential problems of using a sigmoid activation in hidden layers?

Solution:

• (a) For binary classification, the sigmoid activation is used in the output layer
because it maps predictions to the range (0,1), making it interpretable as a
probability.
• (b) Two problems of using sigmoid in hidden layers:
1. Vanishing Gradient Problem – Gradients become very small for extreme
values, slowing down training.
2. Outputs are not zero-centered – Sigmoid outputs are always positive,
making gradient updates less efficient.
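The vanishing-gradient point in (b) follows from the sigmoid's derivative,

sigmoid′(x) = sigmoid(x) × (1 − sigmoid(x)) ≤ 0.25,

which is near zero for large |x| and shrinks further as it is multiplied through successive layers by the chain rule.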
5. Conceptual Questions
Question 5: Deep Learning for Tabular Data
(a) Why do tree-based models (Random Forest, XGBoost) often outperform neural
networks on tabular data?
(b) When would deep learning be preferable for tabular data?

Solution:

• (a) Tree-based models perform better because:


1. They naturally handle missing values and categorical variables.
2. They do not require extensive feature scaling.
3. They work well on small-to-medium datasets without requiring large
amounts of data.
4. Feature interactions are automatically captured.
• (b) Deep learning is preferable when:
1. The dataset is very large (millions of records).
2. Feature interactions are too complex for tree-based models.
3. The data contains high-dimensional, continuous variables.

Conceptual Questions for Exam Preparation - Feedforward Neural Networks

1. Understanding Neural Network Structure


Question 1: Why Use Multiple Hidden Layers?
(a) Why might a single-layer perceptron be insufficient for complex problems?
(b) How do additional hidden layers improve a neural network’s performance?

Answer:

• (a) A single-layer perceptron can only learn linear decision boundaries. If the
data is not linearly separable, it will fail to learn meaningful patterns (e.g., XOR
problem).
• (b) Additional hidden layers allow the network to:
1. Capture non-linear relationships.
2. Learn hierarchical features (e.g., in images, edges → shapes → objects).
3. Model complex interactions in high-dimensional data.

2. Activation Functions
Question 2: Why Not Always Use ReLU?
(a) What are the advantages of ReLU over sigmoid and tanh?
(b) What are the potential problems with ReLU, and how can they be addressed?

Answer:

• (a) Advantages of ReLU:


1. Avoids vanishing gradient problem (gradient does not saturate for positive
inputs).
2. Computationally efficient (simple function: max(0, x)).
3. Sparse activations (some neurons output 0, leading to better
generalization).
• (b) Problems with ReLU and solutions:
o Dying ReLU Problem: Neurons can get stuck at zero output if gradients
become too small.
▪ Solution: Use Leaky ReLU (allows small gradients for negative
values) or ELU.
o Exploding activations: Can cause numerical instability.
▪ Solution: Use batch normalization.

3. Loss Functions
Question 3: Why Use Cross-Entropy Loss for Classification?
(a) Why is Mean Squared Error (MSE) a poor choice for classification?
(b) How does cross-entropy loss work in binary and multiclass classification?

Answer:

• (a) Problems with MSE for classification:


1. Slower convergence: It does not push predictions towards extreme values
(0 or 1).
2. Gradient vanishing issue: Sigmoid activation + MSE leads to small
gradients, making learning slow.
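For part (b), the standard forms of the loss are:

Binary: L = −[ y log(ŷ) + (1 − y) log(1 − ŷ) ], with ŷ the sigmoid output
Multiclass: L = −Σ_k y_k log(ŷ_k), with ŷ the softmax output

Cross-entropy heavily penalizes confident wrong predictions and, paired with sigmoid/softmax outputs, yields large gradients when the model is wrong, avoiding the slow-convergence issues noted in (a).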

4. Backpropagation
Question 4: Why Do We Need Backpropagation?
(a) Why can’t we compute weight updates directly like in linear regression?
(b) How does backpropagation efficiently compute gradients?

Answer:

• (a) Direct weight updates don’t work because:


1. The loss function is non-linear due to activation functions.
2. There is no closed-form solution like in linear regression.
3. We need gradient-based optimization to adjust weights iteratively.
• (b) Backpropagation computes gradients efficiently using:
1. Forward pass: Compute activations layer by layer.
2. Backward pass: Compute derivatives layer-by-layer using the chain rule.
3. Weight updates: Use gradient descent to adjust weights.

5. Optimization in Neural Networks


Question 5: Why Is Stochastic Gradient Descent (SGD) Used Instead of
Standard Gradient Descent?
(a) What is the problem with computing gradients on the full dataset?
(b) What are the advantages and trade-offs of SGD?

Answer:

• (a) Problems with full-batch gradient descent:


1. Slow computation on large datasets.
2. Can get stuck in local minima without exploration.
3. Memory inefficient, especially for deep networks.
• (b) Advantages of SGD:
1. Faster updates (computes gradient on small mini-batches).
2. Stochastic nature helps escape local minima.
3. Works well in online learning settings.
• Trade-off: SGD has higher variance in updates, requiring techniques like momentum
or Adam optimizer.
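A short PyTorch sketch of mini-batch SGD with momentum using a DataLoader (the batch size, momentum value, and synthetic data are arbitrary):

import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

X, y = torch.randn(1000, 10), torch.randn(1000, 1)            # placeholder data
loader = DataLoader(TensorDataset(X, y), batch_size=32, shuffle=True)

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)  # momentum damps noisy updates

for xb, yb in loader:                       # one gradient step per mini-batch
    opt.zero_grad()
    nn.functional.mse_loss(model(xb), yb).backward()
    opt.step()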

6. Why Deep Learning May Not Work Well on Tabular Data


Question 6: Why Are Tree-Based Models Often Better for Tabular Data?
(a) What challenges do neural networks face when working with tabular data?
(b) When might deep learning still be useful?

Answer:

• (a) Challenges of Deep Learning for Tabular Data:


1. Feature interactions – Neural networks struggle with capturing interactions
between categorical and numerical features.
2. Need for large datasets – Tree-based models work well with smaller
datasets.
3. Harder to interpret – Decision trees offer explainability, whereas deep
networks are black-box models.
• (b) When Deep Learning Might Work Well:
1. Very large datasets with millions of observations.
2. Complex, high-dimensional data (e.g., genomic data).
3. Automated feature extraction needed (e.g., learned embeddings for
categorical data).

7. Regularization Techniques
Question 7: Why Do We Use Dropout?
(a) What problem does dropout solve?
(b) How does dropout work during training and testing?

Answer:

• (a) Problem: Overfitting – deep networks memorize training data instead of generalizing.
• (b) How Dropout Works:
o Training: Randomly drop (zero out) neurons with probability p.
o Testing: Use all neurons, scaling activations by the keep probability (1 − p) so expected activations match those seen during training.
8. Understanding Softmax and Probability Outputs
Question 8: Why Is Softmax Used in the Output Layer for Multiclass
Classification?
(a) What does the softmax function do?
(b) How does it differ from sigmoid?
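Answer (standard definitions):

• (a) softmax(z)_k = e^(z_k) / Σ_j e^(z_j) converts a vector of class scores into probabilities that are non-negative and sum to 1 across all classes.
• (b) Sigmoid maps each output to (0, 1) independently, which suits a single binary output, whereas softmax couples the outputs into one probability distribution over mutually exclusive classes.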

9. Weight Initialization
Question 9: Why Not Initialize All Weights to Zero?
(a) What happens if all weights start at zero?
(b) What is a better initialization strategy?

Answer:

• (a) If all weights start at zero:


1. Every neuron in a layer will have the same gradient and learn identically.
2. Symmetry is never broken, so neurons cannot learn distinct, useful features.
• (b) Better initialization strategies:
1. Xavier Initialization (Glorot) – Keeps variance balanced across layers.
2. He Initialization – Preferred for ReLU-based networks.
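A short PyTorch sketch of these two schemes (the layer sizes are placeholders):

import torch
from torch import nn

tanh_layer = nn.Linear(128, 64)
relu_layer = nn.Linear(128, 64)

nn.init.xavier_uniform_(tanh_layer.weight)                       # Xavier/Glorot: keeps variance balanced across layers
nn.init.kaiming_normal_(relu_layer.weight, nonlinearity="relu")  # He initialization: preferred with ReLU
nn.init.zeros_(relu_layer.bias)                                  # biases can start at zero; weights must not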
10. Bias-Variance Tradeoff
Question 10: Why Might a Deep Network Have High Bias or High Variance?
(a) When does a neural network have high bias?
(b) When does it have high variance?
(c) How do we address each issue?

Answer:

• (a) High Bias (Underfitting):


o Network too simple, not enough capacity to learn patterns.
o Fix: Increase hidden layers, units, or use complex activation functions.
• (b) High Variance (Overfitting):
o Network memorizes training data, does not generalize.
o Fix: Use dropout, weight decay (L2 regularization), or early stopping.
