
Week 09 Assignment Solutions

IIT Madras Instructors

Question 1: Decision Trees and Bagging


Solution: (C) High variance
Bagging (Bootstrap Aggregating) addresses the high variance problem of decision trees. Individual
decision trees tend to overfit to their training data, but by combining predictions from multiple trees
trained on different bootstrap samples, bagging reduces variance and improves generalization.
Decision trees inherently suffer from high variance because of their tendency to grow deep structures
that perfectly fit the training data. Since trees make hard splitting decisions at each node, small changes
in the training data can result in completely different tree structures, leading to significantly different
predictions. This sensitivity to the specific training examples is the essence of their high variance nature.
Furthermore, decision trees have no built-in regularization mechanism to prevent overfitting, causing
them to capture noise in the training data rather than just the underlying pattern.
Bagging reduces variance by averaging predictions from multiple models trained on different subsets
of the data. Since each decision tree is trained on a different bootstrap sample, they make different
errors that tend to cancel out when averaged together. This ensemble approach effectively smooths out
the individual trees’ high variance, resulting in more stable and reliable predictions without significantly
increasing bias. The key insight is that while individual trees may overfit to their specific training
samples, their collective prediction is more robust and generalizable.
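As an illustrative aside, a minimal scikit-learn sketch comparing a single decision tree with a bagged ensemble of trees; the synthetic dataset and hyperparameters are assumptions made purely for demonstration:

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier

# Synthetic classification data, used only to illustrate the comparison
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# A single fully grown tree: low bias but high variance
single_tree = DecisionTreeClassifier(random_state=0)

# Bagging: many trees, each fit on a different bootstrap sample of the data
# (the default base estimator in scikit-learn is a decision tree)
bagged_trees = BaggingClassifier(n_estimators=100, random_state=0)

print("Single tree CV accuracy :", cross_val_score(single_tree, X, y, cv=5).mean())
print("Bagged trees CV accuracy:", cross_val_score(bagged_trees, X, y, cv=5).mean())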

Question 2: K-Nearest Neighbors Parameter


Solution: (B) The number of nearest neighbors to consider for classification
In KNN, the parameter K determines how many neighboring data points are considered when making
a prediction. The algorithm assigns the majority class (for classification) or average value (for regression)
from these K neighbors.
For a test point x, the KNN prediction is given by:
\hat{y}(x) = \frac{1}{K} \sum_{i \in N_K(x)} y_i    (1)

where N_K(x) represents the K nearest neighbors of x according to some distance metric (typically Euclidean distance: d(x, x_i) = \|x - x_i\|_2). For classification, this becomes a majority vote:

\hat{y}(x) = \arg\max_{c} \sum_{i \in N_K(x)} I(y_i = c)    (2)

The choice of K controls the bias-variance tradeoff: smaller K leads to lower bias but higher variance,
while larger K provides smoother decision boundaries but may increase bias.
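A minimal NumPy sketch of this prediction rule for classification (Euclidean distance; the function name and arguments are illustrative, not part of the assignment):

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, K):
    # Euclidean distance d(x, x_i) = ||x - x_i||_2 to every training point
    dists = np.linalg.norm(X_train - x, axis=1)
    # Indices of the K nearest neighbors, N_K(x)
    nearest = np.argsort(dists)[:K]
    # Majority vote over the neighbors' labels, as in equation (2)
    return Counter(y_train[nearest]).most_common(1)[0][0]

Here X_train and y_train are assumed to be NumPy arrays; for regression, the majority vote would simply be replaced by the mean of y_train[nearest], as in equation (1).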

Question 3: Gradient Boosting Fitting


Solution: (C) The gradient of the loss function with respect to predictions
In gradient boosting, each new model is trained to approximate the (negative) gradient of the loss function with respect to the current predictions. This allows the algorithm to iteratively reduce the error by moving in the direction of steepest descent.
Mathematically, for a loss function L(y, F (x)), where F (x) is the current ensemble prediction, the
next model hm (x) in the ensemble is trained to approximate:
 
h_m(x) \approx -\left[ \frac{\partial L(y, F(x))}{\partial F(x)} \right]_{F(x) = F_{m-1}(x)}    (3)

The ensemble is then updated as:

F_m(x) = F_{m-1}(x) + \alpha \cdot h_m(x)    (4)

where α is a learning rate. For example, with squared-error loss L(y, F(x)) = (1/2)(y − F(x))^2, the negative gradient is simply the residual (y − F(x)), meaning each new model is trained to predict the errors of the previous models.
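A minimal sketch of this procedure for squared-error loss, using shallow regression trees as the weak learners (scikit-learn; all names and hyperparameters are illustrative assumptions):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_fit(X, y, n_rounds=100, lr=0.1):
    F0 = y.mean()                      # initial prediction F_0(x)
    F = np.full(len(y), F0)
    trees = []
    for _ in range(n_rounds):
        residuals = y - F              # negative gradient of (1/2)(y - F)^2
        tree = DecisionTreeRegressor(max_depth=3)
        tree.fit(X, residuals)         # h_m approximates the negative gradient
        F = F + lr * tree.predict(X)   # F_m = F_{m-1} + alpha * h_m
        trees.append(tree)
    return F0, trees

def gradient_boost_predict(F0, trees, X, lr=0.1):
    return F0 + lr * sum(tree.predict(X) for tree in trees)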

Question 4: Random Forest Feature Selection


Solution: (A) Square root of the total number of features

Random forests typically consider √p features at each split (where p is the total number of features). This randomness increases diversity among the trees in the ensemble, further helping to reduce variance.
At each node of a tree in a random forest, only a random subset of m_try features is considered for splitting. For classification problems the default is m_try = √p, while for regression problems it is often m_try = p/3. Mathematically, if we denote the set of all features as F with |F| = p, then at each node we select a subset F_sub ⊂ F such that |F_sub| = m_try = √p. The optimal split is then chosen only from this subset:

\text{Best Split} = \arg\max_{j \in F_{\text{sub}},\; s} \text{Information Gain}(j, s)    (5)

where j is a feature and s is a splitting threshold.
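For reference, a minimal scikit-learn sketch in which max_features="sqrt" makes each split consider only √p randomly chosen candidate features (synthetic data and hyperparameters are illustrative assumptions):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=25, random_state=0)

# max_features="sqrt": at every node, only sqrt(p) randomly chosen features
# are evaluated as split candidates (the usual classification default)
rf = RandomForestClassifier(n_estimators=200, max_features="sqrt", random_state=0)
rf.fit(X, y)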

Question 5: AdaBoost Misclassification


Solution: (D) They are given higher weights in the next iteration
AdaBoost increases the weights of misclassified examples in each iteration, forcing subsequent weak
learners to focus more on the difficult examples that previous models couldn’t classify correctly.
The weight update rule for AdaBoost is:
w_i^{(t+1)} = w_i^{(t)} \times \begin{cases} e^{-\alpha_t}, & \text{if instance } i \text{ is correctly classified} \\ e^{\alpha_t}, & \text{if instance } i \text{ is misclassified} \end{cases}    (6)

where α_t = ½ ln((1 − ε_t)/ε_t) and ε_t is the weighted error rate of the t-th weak classifier. After updating, the weights are normalized to form a probability distribution. This exponential weighting scheme ensures that misclassified examples receive exponentially higher weights, making them more influential in training subsequent classifiers.
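A minimal NumPy sketch of one round of this reweighting (variable names are illustrative assumptions):

import numpy as np

def adaboost_reweight(w, y_true, y_pred):
    miss = (y_true != y_pred)                      # misclassified instances
    eps = np.sum(w[miss]) / np.sum(w)              # weighted error rate epsilon_t
    alpha = 0.5 * np.log((1 - eps) / eps)          # alpha_t = 1/2 * ln((1-eps)/eps)
    # e^{+alpha} for misclassified points, e^{-alpha} for correct ones
    w_new = w * np.exp(alpha * np.where(miss, 1.0, -1.0))
    return w_new / w_new.sum(), alpha              # normalize to a distribution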

Question 6: AdaBoost Loss Function


Solution: (D) Exponential loss
AdaBoost minimizes the exponential loss function, which heavily penalizes misclassifications. This is
mathematically consistent with the algorithm’s weight update mechanism.
The exponential loss function for AdaBoost is defined as:

L(y, F(x)) = e^{-y F(x)}    (7)


where y ∈ {−1, +1} is the true label and F(x) = \sum_{t=1}^{T} \alpha_t h_t(x) is the ensemble prediction. This loss function can be shown to be equivalent to the weight update rule used in AdaBoost.
Taking the negative gradient of this loss with respect to F (x) gives:

-\frac{\partial L(y, F(x))}{\partial F(x)} = y \cdot e^{-y F(x)}    (8)

which is precisely what each weak learner approximates in each iteration, showing that AdaBoost can be viewed as a gradient boosting algorithm minimizing the exponential loss.
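A small NumPy sketch of the exponential loss and its negative gradient for labels in {−1, +1} (the values used are illustrative only):

import numpy as np

def exponential_loss(y, F):
    # L(y, F(x)) = exp(-y * F(x)), with y in {-1, +1}
    return np.exp(-y * F)

def negative_gradient(y, F):
    # -dL/dF = y * exp(-y * F(x)): the target each new weak learner fits
    return y * np.exp(-y * F)

y = np.array([1, -1, 1])                 # true labels (illustrative)
F = np.array([0.5, 0.2, -1.0])           # current ensemble scores (illustrative)
print(exponential_loss(y, F))
print(negative_gradient(y, F))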

Question 7: Decision Tree Split Metrics
Solution: (D) Mean squared error
For binary decision trees used in classification, the Gini index, entropy, and classification error are common splitting criteria. Mean squared error is typically used for regression trees, not for binary classification problems.
The Gini index measures impurity and is defined as:
\text{Gini}(t) = 1 - \sum_{i=1}^{c} p(i|t)^2    (9)

where p(i|t) is the proportion of samples belonging to class i at node t, and c is the number of classes.
Entropy is calculated as:
\text{Entropy}(t) = -\sum_{i=1}^{c} p(i|t) \log_2 p(i|t)    (10)

For regression trees, mean squared error is appropriate:


\text{MSE}(t) = \frac{1}{N_t} \sum_{i \in t} (y_i - \bar{y}_t)^2    (11)

where Nt is the number of samples at node t, yi is the target value of sample i, and ȳt is the mean target
value at node t.
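A minimal NumPy sketch of these three node criteria (the function names are illustrative):

import numpy as np

def gini(labels):
    # Gini(t) = 1 - sum_i p(i|t)^2
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    # Entropy(t) = -sum_i p(i|t) * log2 p(i|t)
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def mse(targets):
    # MSE(t) = (1/N_t) * sum_i (y_i - mean)^2, the regression criterion
    targets = np.asarray(targets, dtype=float)
    return np.mean((targets - targets.mean()) ** 2)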

Question 8: K-means Initialization


Solution: (D) Randomly within the range of the data
The standard K-means algorithm initializes cluster centers randomly within the data range. This
random initialization selects points from the feature space to serve as the starting positions for the cluster
centroids.
In standard K-means, if the data range for feature j is [a_j, b_j], then the initial centroids \mu_k^{(0)} have their j-th component initialized as:

\mu_{kj}^{(0)} \sim \text{Uniform}(a_j, b_j)    (12)

This uniform random selection across the data range ensures that the initial centroids cover the
feature space where data points exist.
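A minimal NumPy sketch of this initialization scheme (the rest of the K-means loop is omitted; names are illustrative):

import numpy as np

def init_centroids_uniform(X, K, seed=0):
    rng = np.random.default_rng(seed)
    a, b = X.min(axis=0), X.max(axis=0)           # per-feature range [a_j, b_j]
    # Each centroid component is drawn uniformly within the data range
    return rng.uniform(a, b, size=(K, X.shape[1]))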

Question 9: Dendrograms
Solution: (C) A tree-like diagram showing the hierarchy of clusters
A dendrogram is a tree-like visualization that shows how clusters are merged (agglomerative) or split
(divisive) at each step of hierarchical clustering, revealing the nested structure of the clustering.
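For illustration, a short SciPy sketch that builds and plots a dendrogram from Ward-linkage agglomerative clustering (the toy data and linkage choice are arbitrary assumptions):

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

X = np.random.default_rng(0).normal(size=(30, 2))   # toy data
Z = linkage(X, method="ward")    # agglomerative merges, recorded bottom-up
dendrogram(Z)                    # tree-like diagram of the merge hierarchy
plt.show()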

Question 10: Hierarchical Clustering Advantage


Solution: (A) It doesn’t require specifying the number of clusters beforehand
A key advantage of hierarchical clustering is that it produces a complete hierarchy of clusters. Users
can choose the appropriate number of clusters after examining the dendrogram, unlike K-means which
requires specifying K in advance.
Hierarchical clustering creates a nested tree of partitions, which can be cut at any level to yield a
specific number of clusters. This flexibility is particularly useful when the optimal number of clusters is
not known a priori. The hierarchy can be represented by a sequence of clusterings C = {C_0, C_1, ..., C_n}, where C_0 is the finest clustering (each point is its own cluster) and C_n is the coarsest clustering (all points in one cluster).

The appropriate number of clusters can be determined by examining the dendrogram and identifying
where the largest change in dissimilarity occurs, which is often visualized as a large vertical gap in the
dendrogram. This can be formally expressed as finding i that maximizes:

\Delta(i) = d(C_{i+1}) - d(C_i)    (13)

where d(Ci ) is the dissimilarity level at which clustering Ci is formed.
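A minimal SciPy sketch of this largest-gap heuristic: the merge heights d(C_i) recorded in the linkage matrix are scanned for the biggest jump, and the dendrogram is cut inside that gap (toy data; this is only one heuristic among several):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.default_rng(0).normal(size=(40, 2))   # toy data
Z = linkage(X, method="ward")

heights = Z[:, 2]                        # merge dissimilarities d(C_i), non-decreasing
i = np.argmax(np.diff(heights))          # index of the largest gap Delta(i)
cut = (heights[i] + heights[i + 1]) / 2  # threshold inside that gap

labels = fcluster(Z, t=cut, criterion="distance")
print("Number of clusters chosen by the largest-gap cut:", labels.max())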
