Quiz - 2 Set A
Instructions –
1. Consider the Perceptron algorithm applied to a binary classification task. Which of the following statements are correct?
(Select all that apply) (1 mark)
(A) The Perceptron algorithm is guaranteed to converge for any dataset, as long as the learning rate is appropriately
chosen.
(B) The weights are updated in the Perceptron algorithm based on the dot product of the input vector and the error
term, even when the sample is correctly classified.
(C) The bias in the Perceptron is updated when a sample is misclassified, but the magnitude of the update is not
dependent on the input values.
(D) If the Perceptron converges, it will find a decision boundary that minimizes the number of misclassifications on the
training data.
A. False
Reason: The Perceptron algorithm is guaranteed to converge only if the data is linearly separable. The basic algorithm does not require a separate learning rate, and no choice of learning rate can make it converge on data that is not linearly separable.
B. False
Reason: The weights are updated only when there is a misclassification. If the sample is classified correctly, no update
to the weights occurs. The update rule is:
w = w + error_i · x_i
C. True
Reason: The bias is updated when a sample is misclassified, using the rule:
b = b + error_i
The magnitude of the bias update is independent of the input values; it depends only on the error term. (A short Python sketch of these update rules follows this question's answers.)
D. True
Reason: If the Perceptron converges (for linearly separable data), it finds a decision boundary that minimizes the number
of misclassifications, reducing them to zero on the training data. However, it does not necessarily produce the most
optimal decision boundary, as there may be multiple valid solutions.
1 mark for correct option and correct reason
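For illustration, a minimal Python sketch of the update rules from (B) and (C) above, assuming binary labels in {0, 1}, a step activation, and error_i = y_i - y_hat (the function and variable names are illustrative, not part of the quiz):

import numpy as np

def perceptron_train(X, y, epochs=10):
    # Minimal Perceptron: weights and bias change only when a sample is misclassified.
    w = np.zeros(X.shape[1])  # weight vector
    b = 0.0                   # bias
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            y_hat = 1 if np.dot(w, x_i) + b > 0 else 0  # step activation
            error_i = y_i - y_hat                       # 0 when correctly classified
            if error_i != 0:
                w = w + error_i * x_i  # weight update: w = w + error_i * x_i
                b = b + error_i        # bias update is independent of the input values
    return w, b

# Hypothetical usage on a tiny linearly separable AND-like dataset:
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])
print(perceptron_train(X, y))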
2. Which of the following statements regarding Random Forests and ensemble methods are not TRUE? (1 mark)
(A) Random Forests can handle both categorical and continuous variables, allowing for greater flexibility in modeling
various types of data.
(B) In a Random Forest, each tree is built in a sequential manner, where each tree depends on the results of the previous
tree.
(C) The primary advantage of using Random Forests over a single Decision Tree is the significant increase in bias while
keeping variance the same.
(D) Random Forests utilize an averaging scheme for classification tasks, where the final prediction is the average of the
predictions from all individual trees.
A. False: Random Forests can handle both categorical and continuous variables, making them versatile for different
types of datasets.
B. True: In a Random Forest, each tree is built independently and in parallel; there is no dependency between the trees,
which distinguishes them from boosting methods, where models are built sequentially.
C. True: The primary advantage of using Random Forests is a reduction in variance without significantly increasing
bias. They typically improve generalization compared to a single Decision Tree.
D. True: For classification tasks, Random Forests utilize a majority voting scheme, where the final prediction is based
on the mode of predictions from all individual trees, not an average.
1 mark for correct option
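To make the voting scheme in (D) concrete, here is a minimal Python sketch (the per-tree predictions below are made-up placeholders, not results from a real forest) of combining individual tree predictions by majority vote for a classification task:

from collections import Counter

def majority_vote(tree_predictions):
    # Final prediction is the mode (most common class) of the individual tree predictions.
    return Counter(tree_predictions).most_common(1)[0][0]

# Hypothetical class labels predicted by five independently built trees for one sample:
print(majority_vote(["Yes", "No", "Yes", "Yes", "No"]))  # -> Yes

For regression tasks, by contrast, the individual tree outputs would typically be averaged.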
3. When growing a decision tree using the ID3 algorithm, which of the following is TRUE about the role of information
gain? (1 mark)
(a) Information gain measures how well a given attribute separates the training examples according to their target
classification.
(b) The attribute with the highest information gain is always selected at each step while growing the tree.
(c) Information gain is based on the reduction in entropy after the data is split on an attribute.
(d) Information gain ensures that the tree will never overfit the training data.
A. True: Information gain measures how well a given attribute separates the training examples according to their target
classification. It evaluates the effectiveness of an attribute in classifying the training data.
B. True: The attribute with the highest information gain is always selected at each step while growing the tree in
the ID3 algorithm. This ensures that the attribute that best reduces uncertainty is chosen.
C. True: Information gain is calculated based on the reduction in entropy after the data is split on an attribute.
It measures how much information a feature contributes towards classifying the data.
D. False: Information gain does not prevent overfitting; a tree grown greedily with ID3 can still overfit the training data, which is why pruning is often applied afterwards.
A: False (because the attribute may become relevant further down the tree when the records are restricted to some
value of another attribute) (e.g. XOR)
B: False for same reason
C: True because the attributes are categorical and can each be split only once
D: False because the tree may be unbalanced
1 mark for correct option and correct reason
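Because entropy and information gain reappear in Question 7, here is a minimal Python sketch of both quantities; the function names and the toy split are illustrative only, not part of the quiz:

import math
from collections import Counter

def entropy(labels):
    # Entropy of a list of class labels: -sum over classes of p * log2(p).
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(labels, groups):
    # Reduction in entropy after splitting `labels` into `groups` on some attribute.
    total = len(labels)
    weighted = sum(len(g) / total * entropy(g) for g in groups)
    return entropy(labels) - weighted

# Toy split of a 9-Yes / 5-No dataset into two groups by a hypothetical attribute:
labels = ["Yes"] * 9 + ["No"] * 5
groups = [["Yes"] * 6 + ["No"] * 1, ["Yes"] * 3 + ["No"] * 4]
print(round(entropy(labels), 3), round(information_gain(labels, groups), 3))  # roughly 0.94 and 0.152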
6. Consider a Naïve Bayes classifier with 3 boolean input variables, X1, X2, and X3, and one boolean output, Y. (5 marks)
1. How many parameters must be estimated to train such a Naïve Bayes classifier? (2.5 marks)
2. How many parameters would have to be estimated to learn the above classifier if we do not make the Naïve Bayes
conditional independence assumption? (2.5 marks)
Solutions:
a. For a naive Bayes classifier, we need to estimate parameters:
P (Y = 1),
P (X1 = 1 | Y = 0),
P (X2 = 1 | Y = 0),
P (X3 = 1 | Y = 0),
P (X1 = 1 | Y = 1),
P (X2 = 1 | Y = 1),
P (X3 = 1 | Y = 1).
The remaining probabilities can be obtained from the constraint that probabilities sum to 1 (for example, P (X1 = 0 | Y = 0) =
1 − P (X1 = 1 | Y = 0)). So we need to estimate 7 parameters.
1 mark for correct parameters and 1.5 for correct parameter number
b. Without the conditional independence assumption, we still need to estimate P (Y = 1). (0.5 mark)
For Y = 1, we need the probability of every combination of (X1, X2, X3), i.e., 2^3 = 8 possible assignments. (1 mark)
Considering the constraint that the probabilities sum up to 1, we must estimate 2^3 − 1 = 7 parameters for Y = 1, and likewise 7 for Y = 0.
Therefore, the total number of parameters is 1 + 2(2^3 − 1) = 15. (1 mark)
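As a quick arithmetic check, a minimal Python sketch of the two parameter counts derived above (variable names are illustrative):

n_features = 3      # boolean inputs X1, X2, X3
n_class_values = 2  # boolean output Y takes two values

# Naive Bayes: P(Y = 1) plus P(Xi = 1 | Y = y) for each feature and each value of Y.
naive_params = 1 + n_class_values * n_features            # = 7

# Without conditional independence: P(Y = 1) plus, for each value of Y,
# a full joint table over (X1, X2, X3) minus one entry (probabilities sum to 1).
full_params = 1 + n_class_values * (2 ** n_features - 1)  # = 15

print(naive_params, full_params)  # 7 15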
7. Using the dataset provided below, construct a decision tree to predict whether a person will play tennis or not. The
attributes available are: Outlook, Temperature, Humidity, and Wind.
The target variable is whether the person will Play Tennis (Yes or No). The dataset is as follows:
1. Calculate the initial entropy for the target variable Play Tennis. (2 marks)
2. Calculate the information gain for the attributes: Outlook, Temperature, Humidity, and Wind. Which attribute
would be chosen as the root of the decision tree based on the ID3 algorithm? (3 marks)
Day Outlook Temperature Humidity Wind Play Tennis
1 Sunny Hot High Weak No
2 Sunny Hot High Strong No
3 Overcast Hot High Weak Yes
4 Rain Mild High Weak Yes
5 Rain Cool Normal Weak Yes
6 Rain Cool Normal Strong No
7 Overcast Cool Normal Strong Yes
8 Sunny Mild High Strong No
9 Sunny Cool Normal Weak Yes
10 Rain Mild Normal Weak Yes
11 Sunny Mild Normal Strong Yes
12 Overcast Mild High Strong Yes
13 Overcast Hot Normal Weak Yes
14 Rain Mild High Strong No
1. Calculate the initial entropy for the target variable Play Tennis.
Initial Entropy of Play Tennis:
Entropy(S) = −(9/14) log2(9/14) − (5/14) log2(5/14)
Entropy(S) = −(9/14 × −0.6374) − (5/14 × −1.4854) ≈ 0.940
Initial Entropy = 0.940
2 marks for the correct answer or a correct log expression
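A quick check of this value in Python, using the 9 Yes / 5 No counts from the table:

import math

p_yes, p_no = 9 / 14, 5 / 14
initial_entropy = -p_yes * math.log2(p_yes) - p_no * math.log2(p_no)
print(round(initial_entropy, 3))  # 0.94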
2.
Information Gain for Outlook:
Entropy(Sunny) = −(2/5) log2(2/5) − (3/5) log2(3/5) ≈ 0.971
Entropy(Overcast) = 0
Entropy(Rain) = −(3/5) log2(3/5) − (2/5) log2(2/5) ≈ 0.971
Entropy(Outlook) = (5/14) × 0.971 + (4/14) × 0 + (5/14) × 0.971 ≈ 0.693
Entropy(Temperature) = (4/14) × 1 + (6/14) × 0.918 + (4/14) × 0.811 ≈ 0.911
Entropy(Humidity) = (7/14) × 0.985 + (7/14) × 0.592 ≈ 0.789