
AL3451 – MACHINE LEARNING – IA2 – ANSWER KEY

PART – A
1 Ensemble learning is a machine learning technique where multiple models (such as
decision trees, SVMs, or neural networks) are trained and their predictions combined to
solve the same task. The goal is to achieve better accuracy and robustness than
individual models by leveraging their collective strengths.
2 Boosting aims to improve the performance of weak learners by training them
sequentially. Each new model is focused on correcting the errors made by the previous
models. This results in a strong ensemble where the combined output has higher
accuracy and reduced bias.
3 1. Improved Accuracy: Ensemble methods often outperform individual models
by reducing variance and bias.
2. Better Generalization: They are less likely to overfit the training data,
improving performance on unseen data.

4 Stacking combines the outputs of multiple base models using a meta-model, which
learns how to best integrate the predictions. This layered approach captures diverse
patterns in the data and compensates for individual model weaknesses, thus enhancing
overall predictive accuracy.
5 Gradient descent is an optimization algorithm used to minimize the loss function in
neural networks. It works by computing the gradient of the loss with respect to the
weights and updating the weights iteratively in the opposite direction of the gradient to
reach a minimum loss.
6 The vanishing gradient problem occurs when the gradients used during
backpropagation become very small, especially in deep networks. This leads to
minimal updates in the weights of the earlier layers, causing the network to learn
slowly or not at all in those layers.
7 L1 regularization adds the sum of the absolute values of the weights to the loss
function, encouraging sparsity by driving some weights to zero.
L2 regularization adds the sum of the squares of the weights, penalizing large weights
and promoting smoother models with better generalization.
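
A minimal scikit-learn sketch of the two penalties (illustrative; X and y are assumed to be an already prepared feature matrix and target):

from sklearn.linear_model import Lasso, Ridge

# L1 penalty (Lasso): sum of |w| added to the loss; drives some weights exactly to zero
lasso = Lasso(alpha=0.1)      # alpha controls the penalty strength
# L2 penalty (Ridge): sum of w^2 added to the loss; shrinks large weights smoothly
ridge = Ridge(alpha=0.1)

# lasso.fit(X, y); ridge.fit(X, y)    # X, y assumed prepared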
8 Bootstrapping is a resampling technique where multiple datasets are generated by
sampling with replacement from the original dataset. This allows estimation of model
accuracy and variability, and is commonly used in ensemble methods like bagging.
9 Resampling techniques involve drawing repeated samples from a dataset to evaluate
model performance or stability. Examples include k-fold cross-validation and
bootstrapping. These methods are useful when limited data is available for training and
testing.
10 Statistical significance testing helps determine if the observed performance difference
between two classifiers is due to a true difference or just random chance. It ensures that
the comparison is meaningful and not influenced by sample variability.
PART – B
11A Ensemble learning Definition:
a) Ensemble learning is a machine learning approach where multiple individual models
(learners) are trained and combined to solve the same problem. It aims to improve
prediction accuracy, reduce overfitting, and provide better generalization.

Key Points:

 Each learner is usually a weak learner.


 Aggregation improves the final model's performance.
 Works on the principle of "wisdom of the crowd."

Benefits:

 Reduces variance (e.g., Bagging)


 Reduces bias (e.g., Boosting)
 Improves stability and robustness
b)
Ensemble Method | Explanation | Working Principle
Bagging (Bootstrap Aggregating) | Multiple learners trained on random subsets of data with replacement. | Aggregation via voting or averaging.
Boosting | Learners are trained sequentially, each correcting previous errors. | Focuses on misclassified samples.
Stacking | Combines predictions of multiple models using a meta-learner. | Final prediction made by meta-model.
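
A hedged scikit-learn sketch of the three methods (estimator choices are illustrative; scikit-learn ≥ 1.2 is assumed for the estimator parameter name):

from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier, StackingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

# Bagging: independent trees on bootstrap samples, combined by voting
bagging = BaggingClassifier(estimator=DecisionTreeClassifier(), n_estimators=25)

# Boosting: weak learners trained sequentially, weighted by their accuracy
boosting = AdaBoostClassifier(n_estimators=50)

# Stacking: base models combined by a logistic-regression meta-learner
stacking = StackingClassifier(
    estimators=[("tree", DecisionTreeClassifier()), ("lr", LogisticRegression())],
    final_estimator=LogisticRegression(),
)

# Each model is then trained the same way, e.g. bagging.fit(X_train, y_train)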
11B) Definition:
a) Boosting is an ensemble learning technique that builds models sequentially. Each
model focuses on the errors made by the previous model, gradually improving the
accuracy.

Key Features:

 Converts weak learners to a strong learner.


 Adjusts the weights of misclassified samples.
 Final prediction: weighted vote or sum of all models.

Steps:

1. Train initial model.


2. Identify misclassified data points.
3. Increase their weights.
4. Train next model on updated data.
5. Repeat and combine outputs.
b)
Algorithm Name: AdaBoost (Adaptive Boosting)

Working:

 Uses multiple weak classifiers (like decision stumps).


 Misclassified samples are given higher weights.
 Final prediction is a weighted vote of all classifiers.

Other Popular Algorithms:

Algorithm | Description
Gradient Boosting | Uses gradient descent to minimize error.
XGBoost | Extreme Gradient Boosting – optimized for speed and performance.
LightGBM | Faster and more efficient; handles large datasets.
CatBoost | Optimized for categorical features.
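
An illustrative AdaBoost sketch with decision stumps as weak learners (scikit-learn ≥ 1.2 assumed; training data assumed prepared):

from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# Weak learner: a decision stump (tree of depth 1)
stump = DecisionTreeClassifier(max_depth=1)

# 50 stumps trained sequentially; misclassified samples receive higher weights,
# and the final prediction is a weighted vote of all stumps
ada = AdaBoostClassifier(estimator=stump, n_estimators=50, learning_rate=1.0)
# ada.fit(X_train, y_train); ada.predict(X_test)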
12A
a)
Feature | Bagging | Boosting
Training | Parallel | Sequential
Data Sampling | Random with replacement | Same data, but weights updated
Objective | Reduces variance | Reduces bias
Sensitivity | Less sensitive to outliers | More sensitive to outliers
Example | Random Forest | AdaBoost, Gradient Boosting
Aggregation | Voting or averaging | Weighted vote
Overfitting | Handles overfitting well | Risk of overfitting if not tuned

b)
Boosting is preferred when:
1. High Bias Problems: When models underfit, boosting helps reduce bias.
2. Data Imbalance: Boosting focuses on hard-to-classify (often minority class)
points.
3. Higher Accuracy Required: Provides better performance on complex
datasets.
4. When Weak Models Are Reliable: If base learners are slightly better than
chance.
5. Applications: Fraud detection, medical diagnosis, competition data (Kaggle).

12B Definition:
a) GMM assumes data is generated from a mixture of several Gaussian distributions.
Each Gaussian is a cluster, characterized by:
 Mean (μ)
 Covariance (Σ)
 Mixing coefficient (π)

Features:

 Uses soft clustering (assigns probabilities)


 Solved using Expectation-Maximization (EM) algorithm

Applications:

 Image segmentation
 Anomaly detection
 Speech recognition
 Customer segmentation

b) Scenario:

Assume 3 Gaussian components $G_1, G_2, G_3$ with:

 Means: $\mu_1, \mu_2, \mu_3$
 Variances: $\sigma_1^2, \sigma_2^2, \sigma_3^2$
 Weights: $\pi_1, \pi_2, \pi_3$

How GMM Assigns Probabilities (Using EM Algorithm)

1. E-Step (Expectation):
For each data point $x$, compute the responsibility $r_i$:

$$r_i = \frac{\pi_i \cdot \mathcal{N}(x \mid \mu_i, \sigma_i^2)}{\sum_{j=1}^{3} \pi_j \cdot \mathcal{N}(x \mid \mu_j, \sigma_j^2)}$$

 $r_i$: probability that point $x$ belongs to cluster $i$.
 $\mathcal{N}$: Gaussian probability density function.

2. M-Step (Maximization):
Update the parameters $\mu_i, \sigma_i, \pi_i$ using the responsibilities.

3. Iterate until convergence.

[Figure 5: GMM Soft Clustering Responsibilities]

Output: Each data point gets a probability vector like [0.7, 0.2, 0.1], indicating its
association with each Gaussian component.
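
A short scikit-learn sketch of this soft assignment (the toy data below is illustrative only):

import numpy as np
from sklearn.mixture import GaussianMixture

# Toy data drawn from three well-separated Gaussians
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0.0, 1.0, (100, 2)),
    rng.normal(5.0, 1.0, (100, 2)),
    rng.normal(10.0, 1.0, (100, 2)),
])

# Fit a 3-component GMM using the EM algorithm
gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=0)
gmm.fit(X)

# Responsibilities: probability of each point belonging to each component
resp = gmm.predict_proba(X)
print(resp[0])    # e.g. a vector like [0.98, 0.01, 0.01]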

13A Apply Gradient Descent to a Simple Neural Network


Given:

 A neural network with two weights $w_1$ and $w_2$
 Loss function: Mean Squared Error (MSE)
 Learning rate: $\eta = 0.1$
 Input: $x = [x_1, x_2] = [1, 2]$
 True output: $y = 1$
 Initial weights: $w_1 = 0.5$, $w_2 = -0.5$

Step 1: Forward Pass (3M)


$$\hat{y} = w_1 x_1 + w_2 x_2 = (0.5)(1) + (-0.5)(2) = 0.5 - 1 = -0.5$$
$$\text{Loss} = \frac{1}{2}(y - \hat{y})^2 = \frac{1}{2}(1 - (-0.5))^2 = \frac{1}{2}(1.5)^2 = 1.125$$

Step 2: Compute Gradients (4M)


$$\frac{\partial \text{Loss}}{\partial w_1} = (\hat{y} - y) \cdot x_1 = (-0.5 - 1) \cdot 1 = -1.5$$
$$\frac{\partial \text{Loss}}{\partial w_2} = (\hat{y} - y) \cdot x_2 = (-0.5 - 1) \cdot 2 = -3.0$$

Step 3: Weight Update Using Gradient Descent (4M)


$$w_1' = w_1 - \eta \cdot \frac{\partial \text{Loss}}{\partial w_1} = 0.5 - 0.1 \cdot (-1.5) = 0.65$$
$$w_2' = w_2 - 0.1 \cdot (-3.0) = -0.5 + 0.3 = -0.2$$

Updated Weights: $w_1 = 0.65$, $w_2 = -0.2$
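
The same single update can be cross-checked with a few lines of NumPy (a sketch that mirrors the hand calculation above):

import numpy as np

x = np.array([1.0, 2.0])         # inputs x1, x2
y = 1.0                          # true output
w = np.array([0.5, -0.5])        # initial weights w1, w2
eta = 0.1                        # learning rate

y_hat = w @ x                    # forward pass: -0.5
loss = 0.5 * (y - y_hat) ** 2    # 1.125
grad = (y_hat - y) * x           # gradients: [-1.5, -3.0]
w = w - eta * grad               # updated weights: [0.65, -0.2]
print(y_hat, loss, grad, w)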
13B Design a Multilayer Perceptron (MLP) for Digit Classification
Goal: Classify handwritten digits (e.g., MNIST: 28x28 images)
MLP Architecture
Layer | Description
Input Layer | 784 neurons (28×28 pixels)
Hidden Layer 1 | 128 neurons, ReLU
Hidden Layer 2 | 64 neurons, ReLU
Output Layer | 10 neurons (for digits 0–9), Softmax

Steps Involved in MLP Design

1. Preprocessing: Normalize pixel values (0–255 → 0–1)


2. Weight Initialization: Random initialization (Xavier/He)
3. Forward Propagation: Compute activations layer-wise
4. Loss Calculation: Use Cross-Entropy Loss
5. Backward Propagation: Compute gradients using chain rule
6. Optimization: Use Gradient Descent/Adam
7. Evaluation: Accuracy, confusion matrix
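
One possible Keras sketch of this design (illustrative; assumes TensorFlow/Keras is available and follows the layer sizes in the table above):

from tensorflow import keras
from tensorflow.keras import layers

# MLP: 784 -> 128 (ReLU) -> 64 (ReLU) -> 10 (Softmax)
model = keras.Sequential([
    layers.Input(shape=(784,)),
    layers.Dense(128, activation="relu"),
    layers.Dense(64, activation="relu"),
    layers.Dense(10, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# model.fit(x_train, y_train, epochs=10, batch_size=32)   # pixel values scaled to [0, 1]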

14A Apply Batch Normalization & Analyze Impact


Batch Normalization Formula

For a mini-batch of inputs $x_1, x_2, \ldots, x_m$:

$$\mu = \frac{1}{m}\sum_{i=1}^{m} x_i, \qquad \sigma^2 = \frac{1}{m}\sum_{i=1}^{m}(x_i - \mu)^2$$
$$\hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}}, \qquad y_i = \gamma \hat{x}_i + \beta$$

Purpose:

 Normalize activations within layers.


 Maintain zero mean and unit variance.

Impact on Training (4M)


Aspect | Effect
Speed | Faster convergence
Stability | Reduces vanishing/exploding gradients
Generalization | Acts like regularization
Tuning | Less sensitive to learning rate
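
As a sketch, batch normalization can be inserted between the dense layers of the MLP from 13B (Keras assumed; normalizing after the linear transform and before the activation is one common placement):

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape=(784,)),
    layers.Dense(128),
    layers.BatchNormalization(),    # normalize, then scale/shift with learnable gamma, beta
    layers.Activation("relu"),
    layers.Dense(64),
    layers.BatchNormalization(),
    layers.Activation("relu"),
    layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])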

14B Hyperparameter Tuning to Improve Underfitting


Underfitting Symptoms

 High bias
 Poor training and test accuracy
 Flat learning curves

Hyperparameter Tuning Strategies (6M)


Hyperparameter | Effect
Learning Rate ↑ | Faster learning
Hidden Units ↑ | Increased capacity
Layers ↑ | More abstraction
Batch Size ↓ | Noisy gradients improve generalization
Epochs ↑ | More training cycles
Activation Function | Try ReLU or LeakyReLU
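
A hedged grid-search sketch over a few capacity-related hyperparameters (scikit-learn's MLPClassifier is used only for illustration; X, y assumed prepared):

from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

param_grid = {
    "hidden_layer_sizes": [(64,), (128,), (128, 64)],   # more units/layers -> more capacity
    "learning_rate_init": [0.001, 0.01, 0.1],           # higher rate -> faster learning
    "max_iter": [200, 500],                             # more training iterations
}

search = GridSearchCV(MLPClassifier(), param_grid, cv=5, scoring="accuracy")
# search.fit(X_train, y_train)
# print(search.best_params_, search.best_score_)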

15A Solve Using Bootstrapping to Estimate Mean Accuracy


Given:
 Accuracy scores from 5 models: [0.81, 0.83, 0.80, 0.84, 0.82]

Step 1: Perform Bootstrap Sampling

Create resampled sets (e.g., 1000 times), each with 5 values drawn with replacement.

Example Resample #1: [0.81, 0.81, 0.84, 0.82, 0.80]


Mean #1: $\frac{0.81+0.81+0.84+0.82+0.80}{5} = 0.816$

Repeat 1000 times → collect 1000 means

Step 2: Estimate Statistics

 Mean of Means ≈ 0.82


 Standard Error (SE) = std of 1000 means
 95% Confidence Interval:

$$CI = \bar{x} \pm 1.96 \cdot SE$$
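
A NumPy sketch of the procedure (1000 resamples of the five scores; exact numbers vary slightly from run to run):

import numpy as np

scores = np.array([0.81, 0.83, 0.80, 0.84, 0.82])
rng = np.random.default_rng(42)

# 1000 bootstrap resamples of size 5, drawn with replacement
boot_means = np.array([
    rng.choice(scores, size=len(scores), replace=True).mean()
    for _ in range(1000)
])

mean_of_means = boot_means.mean()            # ≈ 0.82
se = boot_means.std(ddof=1)                  # standard error of the mean
ci = (mean_of_means - 1.96 * se, mean_of_means + 1.96 * se)
print(mean_of_means, se, ci)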


15B Impact of K in K-Fold Cross-Validation
How K Affects Performance

K Value | Impact
Small K (e.g., 3) | Faster, but high bias
Medium K (e.g., 5 or 10) | Balance of speed and accuracy
Large K (e.g., N-fold / LOOCV) | Low bias, high variance, slow

Trade-offs
 Higher K → more training data per fold, lower bias but slower
 Lower K → faster, but less stable estimates
 General rule: Use K = 5 or 10 for stable results
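
A short scikit-learn sketch comparing different K values (the classifier is an arbitrary example; X, y assumed prepared):

from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier()

for k in (3, 5, 10):
    scores = cross_val_score(clf, X, y, cv=k)    # K-fold cross-validation
    print(f"K={k}: mean={scores.mean():.3f}, std={scores.std():.3f}")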

PART – C
16A Paired t-Test to Evaluate Classifier Performance
Problem Statement:
Two classifiers (A and B) are evaluated using 10-fold cross-validation. The accuracy
scores (%) from each fold are as follows:
Fold Classifier A Classifier B
1 85 82
2 88 84
3 84 83
4 90 86
5 87 85
6 89 84
7 91 88
8 86 83
9 88 86
10 87 84
We are to test whether the difference in accuracy is statistically significant using a
paired t-test at the 95% confidence level ($\alpha = 0.05$).

Step-by-Step Answer:

1. State Hypotheses (2M)


 Null Hypothesis $H_0$: There is no significant difference between the two classifiers ($\mu_d = 0$).
 Alternative Hypothesis $H_1$: There is a significant difference ($\mu_d \ne 0$).

2. Calculate Difference per Fold (2M)


$d_i = A_i - B_i$

Fold A B d_i
1 85 82 3
2 88 84 4
3 84 83 1
4 90 86 4
5 87 85 2
6 89 84 5
7 91 88 3
8 86 83 3
9 88 86 2
10 87 84 3

3. Compute Mean and Standard Deviation of Differences (4M)


$$\bar{d} = \frac{\sum d_i}{n} = \frac{3+4+1+4+2+5+3+3+2+3}{10} = \frac{30}{10} = 3.0$$
$$s_d = \sqrt{\frac{1}{n-1}\sum (d_i - \bar{d})^2}$$
$$\sum (d_i - 3)^2 = 0^2 + 1^2 + (-2)^2 + 1^2 + (-1)^2 + 2^2 + 0^2 + 0^2 + (-1)^2 + 0^2 = 12$$
$$s_d = \sqrt{\frac{12}{9}} = \sqrt{1.33} \approx 1.15$$

4. Compute t-Statistic (3M)


$$t = \frac{\bar{d}}{s_d / \sqrt{n}} = \frac{3.0}{1.15 / \sqrt{10}} = \frac{3.0}{0.364} \approx 8.24$$

5. Determine Critical t-value (2M)


 Degrees of Freedom: $df = n - 1 = 9$
 From the t-distribution table at $\alpha = 0.05$, two-tailed: $t_{\text{critical}} = 2.262$

6. Compare and Conclude (2M)

 Computed $t = 8.24 > t_{\text{critical}} = 2.262$
 ⇒ Reject $H_0$

✅ Conclusion:
There is a statistically significant difference between the two classifiers' performances. Classifier A performs significantly better than Classifier B at the 95% confidence level.
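
The hand calculation can be cross-checked with SciPy's paired t-test (a sketch; stats.ttest_rel performs exactly this test on the fold scores):

from scipy import stats

a = [85, 88, 84, 90, 87, 89, 91, 86, 88, 87]   # Classifier A fold accuracies
b = [82, 84, 83, 86, 85, 84, 88, 83, 86, 84]   # Classifier B fold accuracies

t_stat, p_value = stats.ttest_rel(a, b)
print(t_stat, p_value)    # t ≈ 8.2, p << 0.05 -> reject H0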

16B Create a Bagging Ensemble using Decision Trees to Classify Customer Churn and
Discuss the Impact of Increasing the Number of Base Learners

A. Introduction to Customer Churn Classification (2M)

 Customer churn refers to the phenomenon where customers stop using a service or product.
 Goal: Predict whether a customer will churn (1) or not churn (0) based on
features like:
o Tenure
o Monthly charges
o Internet service
o Customer support usage

B. What is Bagging? (Bootstrap Aggregating) (3M)

 Bagging is an ensemble technique that:


o Trains multiple independent models (base learners) on random
bootstrap samples.
o Aggregates their predictions by majority voting (for classification).
 Helps reduce variance and overfitting, especially when base learners are
unstable (e.g., decision trees).

C. Bagging with Decision Trees for Churn Prediction (5M)

Steps to Implement:

1. Data Preparation:
o Load churn dataset (e.g., from telecom company).
o Preprocess: One-hot encode categorical variables, normalize numerical
features.
2. Bootstrap Sampling:
o Generate k random datasets from training data with replacement.
3. Train Base Learners:
o Train k Decision Trees, each on one bootstrap sample.
4. Aggregate Predictions:
o For classification: Use majority voting to determine final class label.

Python Code Snippet (if asked for practical explanation)

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load churn data (assume X, y already prepared)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Bagging ensemble: 50 decision trees, each trained on a bootstrap sample
bag_model = BaggingClassifier(
    estimator=DecisionTreeClassifier(),   # named base_estimator in scikit-learn < 1.2
    n_estimators=50,
    bootstrap=True
)

bag_model.fit(X_train, y_train)
y_pred = bag_model.predict(X_test)        # majority vote across the 50 trees

accuracy = accuracy_score(y_test, y_pred)
print("Bagging Accuracy:", accuracy)
D. Diagram of Bagging Process (2M)
Original Dataset ──┬─> Bootstrap Sample 1 ──> Tree 1
                   ├─> Bootstrap Sample 2 ──> Tree 2
                   ├─> ...                 ──> ...
                   └─> Bootstrap Sample k ──> Tree k

Majority Voting ──> Final Prediction

E. Impact of Increasing Number of Base Learners (3M)


Number of Trees | Effect
Few (e.g., 5–10) | Faster training, higher variance
Moderate (e.g., 30–50) | Better accuracy, reduced overfitting
Many (e.g., 100–200) | Slight gain, but diminishing returns


Explanation:

 Adding more trees improves stability and generalization, especially on noisy data.
 However, after a certain number of trees, improvements plateau.
 Very large ensembles may increase computation time and memory usage.
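
A sketch of how this effect could be measured empirically, by retraining the bagging model with an increasing number of trees (data split assumed prepared as in the snippet above):

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Accuracy typically rises quickly and then plateaus as trees are added
for n in (5, 10, 30, 50, 100, 200):
    model = BaggingClassifier(estimator=DecisionTreeClassifier(), n_estimators=n)
    model.fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    print(f"{n} trees: accuracy = {acc:.3f}")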

✅ Conclusion (1M)

Bagging with decision trees is effective for churn prediction. Increasing the number of
base learners improves performance initially, but has diminishing returns. A balance
between accuracy and computational cost must be maintained.
