0% found this document useful (0 votes)

96 views21 pages

12f-601-Midterm Machine Learning

This midterm exam for the 10-601 Machine Learning course has 9 questions worth a total of 100 points. It is an open book exam but no electronic devices are allowed. The exam is challenging but will be graded on a curve. The first question involves short answer questions about probability, Bayes rule, Gaussian naive Bayes, and principal component analysis. The second question involves gamma mixture models including deriving equations for the E-step. The third question involves decision trees and hierarchical clustering.

Uploaded by

ale meke

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

96 views21 pages

12f-601-Midterm Machine Learning

Uploaded by

ale meke

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 21

10-601 Machine Learning, Midterm Exam

Instructors: Tom Mitchell, Ziv Bar-Joseph

Wednesday 12th December, 2012

There are 9 questions, for a total of 100 points.

This exam has 20 pages, make sure you have all pages before you begin.
This exam is open book, open notes, but no computers or other electronic devices.

This exam is challenging, but don’t worry because we will grade on a curve. Work efficiently.

Good luck!

Name:

Andrew ID:

Question Points Score

Short Answers 11
GMM - Gamma Mixture Model 10
Decision trees and Hierarchical clustering 8
D-separation 9
HMM 12
Markov Decision Process 12
SVM 12
Boosting 14
Model Selection 12
Total: 100

1
10-601 Matchine Learning Final Exam December 10, 2012

Question 1. Short Answers

(a) [3 points] For data D and hypothesis H, say whether or not the following equations must always
be true.
P
• h P (H = h|D = d) = 1 ... is this always true?

Solution:
yes

P
• h P (D = d|H = h) = 1 ... is this always true?

Solution:
no

P
• h P (D = d|H = h)P (H = h) = 1 ... is this always true?

Solution:
no

(b) [2 points] For the following equations, describe the relationship between them. Write one of four
answers:
(1) “=” (2) “≤” (3) “≥” (4) “(depends)”
Choose the most specific relation that always holds; “(depends)” is the least specific. Assume all
probabilities are non-zero.

P (H = h|D = d) P (H = h)

P (H = h|D = d) P (D = d|H = h)P (H = h)

Solution:
P (H|D) (DEPENDS) P (H)
P (H|D) ≥ P (D|H)P (H) .. this is the numerator in Bayes Rule, have to divide by the normal-
izer P (D), which is less than 1. Tricky... P (H|D) = P (D|H)P (H)/P (D) > P (D|H)P (H).

(c) [2 points] Suppose you are training Gaussian Naive Bayes (GNB) on the training set shown below.
The dataset satisfies Gaussian Naive Bayes assumptions. Assume that the variance is independent
of instances but dependent on classes, i.e. σik = σk where i indexes instances X (i) and k ∈ 1, 2
indexes classes. Draw the decision boundaries when you train GNB

a. using the same variance for both classes, σ1 = σ2

b. using separate variance for each class σ1 6= σ2

Page 1 of 20
10-601 Matchine Learning Final Exam December 10, 2012

12 12

10 10

8 8

6 6

4 4

2 2

0 0

−2 −2
−2 0 2 4 6 8 10 −2 0 2 4 6 8 10

Solution:
The decision boundary for part a will be linear, and part b will be quadratic.

Page 2 of 20
10-601 Matchine Learning Final Exam December 10, 2012

(d) [2 points] Assume that we have two possible conditional distributions (P (y = 1|x, w)) obtained
by training a logistic regression on the dataset shown in the figure below:

In the first case, the value of P (y = 1|x, w) is equal to 1/3 for all the data points. In the second
case, P (y = 1|x, w) is equal to zero for x = 1 and is equal to 1 for all other data points. One of
these conditional distributions is obtained by finding the maximum likelihood of the parameter w.
Which one is the MLE solution? Justify your answer in at most three sentences.

Solution:
The MLE solution is the first case where the value of P (y = 1|x, w) is equal to 1/3 for all the
data points.

(e) [2 points] Principal component analysis is a dimensionality reduction method that projects a dataset
into its most variable components. You are given the following 2D datasets, draw the first and sec-
ond principle components on each plot.

0.8

0.6

0.4

0.2
Dimension 2

−0.2

−0.4

−0.6

−0.8
−0.3 −0.2 −0.1 0 0.1 0.2 0.3 0.4
Dimension 1

Page 3 of 20
10-601 Matchine Learning Final Exam December 10, 2012

Solution:

Page 4 of 20
10-601 Matchine Learning Final Exam December 10, 2012

Question 2. GMM - Gamma Mixture Model

A Assume each data point Xi ∈ R+ (i = 1...n) is drawn from the following process:

Zi ∼ M ultinomial(π1 , π2 , ..., πK )
Xi ∼ Gamma(2, βZi )
The probability density function of Gamma(2, β) is P (X = x) = β 2 xe−βx .
(a) [3 points] Assume K = 3 and β1 = 1, β2 = 2, β3 = 4. What’s P (Z = 1|X = 1)?

Solution:
P (Z = 1|X = 1) ∝ P (X = 1|Z = 1)P (Z = 1) = π1 e−1
P (Z = 2|X = 1) ∝ P (X = 1|Z = 2)P (Z = 2) = π2 4e−2
P (Z = 3|X = 1) ∝ P (X = 1|Z = 3)P (Z = 3) = π3 16e−4

π1 e−1
P (Z = 1|X = 1) = (π1 e−1 +π2 4e−2 +π3 16e−4 )

(b) [3 points] Describe the E-step. Write an equation for each value being computed.

Solution:
For each X = x,

P (X = x|Z = k)P (Z = k) βk2 xe−βk x πk

P (Z = k|X = x) = P 0 0
= P 2 −βk0 x π 0
k0 P (X = x|Z = k )P (Z = k ) k0 βk0 xe k

(c) [2 points] Here’s the Bayes net representation of the Gamma mixture model for k = n = 2. Note
that we are treating π’s and β’s as variables – we have priors for them.

π1 Z1 X1 β1

π2 Z2 X2 β2

Would you say π’s are independent given the observations X? Why?

Solution:
No. π1 → Z1 → π2 is an active trail since X is given.

(d) For the following parts, choose true or false with an explanation in one sentence
i. [1 point] Gamma mixture model can capture overlapping clusters, like Gaussian mixture model.

Solution:
(All or none. 1 pt iff you get the answer and the explanation correct) true. in the e-step it
does soft assignment

Page 5 of 20
10-601 Matchine Learning Final Exam December 10, 2012

ii. [1 point] As you increase K, you will always get better likelihood of the data.

Solution:
(All or none. 1 pt iff you get the answer and the explanation correct) false. Won’t improve
after K > N

Page 6 of 20
10-601 Matchine Learning Final Exam December 10, 2012

Question 3. Decision trees and Hierarchical clustering

Assume we are trying to learn a decision tree. Our input data consists of N samples, each with k
attributes (N k). We define the depth of a tree as the maximum number of nodes between the root
and any of the leaf nodes (including the leaf, not the root).
(a) [2 points] If all attributes are binary, what is the maximal number of leaf (decision) nodes that we
can have in a decision tree for this data? What is the maximal possible depth of a decision tree for
this data?

Solution:
2(k−1) . Each feature can only be used once in each path from root to leaf. The maximum depth
is O(k).

(b) [2 points] If all attributes are continuous, what is the maximum number of leaf nodes that we can
have in a decision tree for this data? What is the maximal possible depth for a decision tree for this
data?

Solution:
Continuous values can be used multiple times, so the maximum number of leaf nodes can be
the same as the number of samples, N and the maximal depth can also be N.

(c) [2 points] When using single link what is the maximal possible depth of a hierarchical clustering
tree for the data in 1? What is the maximal possible depth of such a hierarchical clustering tree for
the data in 2?

Solution:
When using single link with binary data, we can obtain cases where we are always growing
the cluster by 1 node at a time leading to a tree of depth N. This is also clearly the case for
continuous values.

(d) [2 points] Would your answers to (3) change if we were using complete link instead of single
link? If so, would it change for both types of data? Briefly explain.

Solution:
While the answer for continuous values remain the same (its easy to design a dataset where
each new sample is farther from any of the previous samples) for binary data, if k is small
compared to N we will not be able to continue to add one node at a time to the initial cluster
and so the depth will change to be lower than N.

Page 7 of 20
10-601 Matchine Learning Final Exam December 10, 2012

Question 4. D-separation
Consider the following Bayesian network of 6 variables.

(a) [3 points] Set X = {B} and Y = {E}. Specify two distinct (not-overlapping) sets Z such that:
X⊥ ⊥Y |Z (in other words, X is independent of Y given Z).

Solution:
Z = {A} and Z = {D}

(b) [2 points] Can you find another distinct set for Z (i.e. a set that does not intersect with any of the
sets listed in 1)?

Solution:
The empty set Z = {}

(c) [2 points] How many distinct Z sets can you find if we replace B with A while Y stays the same
(in other words, now X = {A} and Y = {E})? What are they?

Solution:
Z = {}, Z = {B} and Z = {D}

(d) [2 points] If W ⊥
⊥X|Z and X⊥ ⊥Y |Z for some distinct variables W, X, Y, Z, can you say W ⊥
⊥Y |Z? If
so, show why. If not, find a counterexample from the graph above.

Solution:
No. A⊥⊥F |B and D⊥
⊥A|B but D and F are not independent given B.

Page 8 of 20
10-601 Matchine Learning Final Exam December 10, 2012

Question 5. HMM

The figure above presents two HMMs. States are represented by circles and transitions by edges. In
both, emissions are deterministic and listed inside the states.

Transition probabilities and starting probabilities are listed next to the relevant edges. For example,
in HMM 1 we have a probability of 0.5 to start with the state that emits A and a probability of 0.5 to
transition to the state that emits B if we are now in the state that emits A.

In the questions below, O100 =A means that the 100th symbol emitted by the HMM is A.
(a) [3 points] What is P (O100 = A, O101 = A, O102 = A) for HMM1?

Solution:
Note that P(O100=A, O101=A, O102=A) = P(O100=A, O101=A, O102=A,S100=A, S101=A, S102=A)
since if we are not always in state A we will not be able to emit A. Given the Markov property
this can be written as:

P(O100=A, O101=A, O102=A,S100=A, S101=A, S102=A) = P(O100=A—S100=A) P(S100=A)

P(O101=A—S101=A) P(S101=A—S100=A) P(O102=A—S102=A) P(S102=A—S101=A)

The emission probabilities in the above equation are all 1. The transitions are all 0.5. So the
only question is: What is P(S100=A)? Since the model is fully symmetric, the answer to this is
0.5 and so the total equation evaluates to: 0.53

(b) [3 points] What is P (O100 = A, O101 = A, O102 = A) for HMM2?

Solution:
0.5 ∗ 0.82

(c) [3 points] Let P1 be: P1 = P (O100 = A, O101 = B, O102 = A, O103 = B) for HMM1 and let P2 be:
P2 = P (O100 = A, O101 = B, O102 = A, O103 = B) for HMM2. Choose the correct answer from the
choices below and briefly explain.
1. P1 > P2
2. P2 > P1
3. P1 = P2
4. Impossible to tell the relationship between the two probabilities

Page 9 of 20
10-601 Matchine Learning Final Exam December 10, 2012

Solution:
(a). P1 evaluates to 0.54 while P2 is 0.5 ∗ 0.24 so clearly P1¿P2.

(d) [3 points] Assume you are told that a casino has been using one of the two HMMs to generate
streams of letters. You are also told that among the first 1000 letters emitted, 500 are As and 500 are
Bs. Which is of the following answers is the most likely (briefly explain):
1. The casino has been using HMM 1
2. The casino has been using HMM 2
3. Impossible to tell

Solution:
(c). While we saw in the previous question that it is much more less likely to switch between A
and B in HMM2, this is only true if we switch at every step. However, when aggregating over
1000 steps, since the two HMMs are both symmetric, both are likely to generate the *same*
number of As and Bs.

Page 10 of 20
10-601 Matchine Learning Final Exam December 10, 2012

Question 6. Markov Decision Process

Consider a robot that is moving in an environment. The goal of the robot is to move from an initial
point to a destination point as fast as possible. However, the robot has the limitation that if it moves
fast, its engine can overheat and stop the robot from moving. The robot can move with two different
speeds: slow and fast. If it moves fast, it gets a reward of 10; if it moves slowly, it gets a reward of 4.
We can model this problem as an MDP by having three states: cool, warm, and off. The transitions are
shown in below. Assume that the discount factor is 0.9 and also assume that when we reach the state
off, we remain there without getting any reward.

s a s0 P (s0 |a, s)
cool slow cool 1
cool fast cool 1/4
cool fast warm 3/4
warm slow cool 1/2
warm slow warm 1/2
warm fast warm 7/8
warm fast off 1/8

(a) [2 points] Consider the conservative policy when the robot always moves slowly. What is the
value of J ∗ (cool) under the conservative policy? Remember that J ∗ (s) is the expected discounted
sum of rewards when starting at state s

Solution:
J ∗ (cool) = 4 + 0.9J ∗ (cool)
J ∗ (cool) = 40

(b) [3 points] What is the optimal policy for each state?

Solution:
If in state cool then move fast. If in state warm then move slow.

(c) [2 points] Is it possible to change the discount factor to get a different optimal policy? If yes, give
such a change so that it results to a minimum changes in the optimal policy and if no justify your
answer in at most two sentences.

Solution:
Yes, by decreasing the discount factor. For example by choosing the discount factor equal to
zero the robot always chooses an action that gives the highest immediate reward.

(d) [2 points] Is it possible to change the immediate reward function so that J ∗ changes but the opti-
mal policy remains unchanged? If yes, give such a change and if no justify your answer in at most
two sentences.

Page 11 of 20
10-601 Matchine Learning Final Exam December 10, 2012

Solution:
Yes, for example by multiplying all the rewards by two.

(e) [3 points] One of the important problems in MDPs is to decide what should be the value of the
discount factor. For now assume that we don’t know the value of discount factor but an expert
person tells us that action sequence {fast, slow, slow} is preferred to the action sequence { slow, fast,
fast} if we start from either of states cool or warm. What does it tell us about the discount factor?
What ranges of discount factor is consistent with this preference?

Solution:
The discounted sum of future rewards using discount factor λ is calculated by: r + r(λ) +
r(λ2 ) + . . . .
So by solving the below equation, we would be able to find a range for discount factor λ:
10 + 4λ + 4λ2 > 4 + 10λ + 10λ2

Page 12 of 20
10-601 Matchine Learning Final Exam December 10, 2012

Question 7. SVM
(a) Kernels
i. [4 points] In class we learnt that SVM can be used to classify linearly inseparable data by
transforming it to a higher dimensional space with a kernel K(x, z) = φ(x)T φ(z), where φ(x)
is a feature mapping. Let K1 and K2 be Rn × Rn kernels, K3 be a Rd × Rd kernel and c ∈ R+
be a positive constant. φ1 : Rn → Rd , φ2 : Rn → Rd , and φ3 : Rd → Rd are feature mappings
of K1 , K2 and K3 respectively. Explain how to use φ1 and φ2 to obtain the following kernels.

a. K(x, z) = cK1 (x, z)

b. K(x, z) = K1 (x, z)K2 (x, z)

Solution: p
a. φ(x) = (c)φ1 (x)
b. φ(x) = φ1 (x)φ2 (x)

ii. [2 points]
One of the most commonly used kernels in SVM is the Gaussian RBF kernel: k(xi , xj ) =
kxi −xj k2
exp − 2σ . Suppose we have three points, z1 , z2 , and x. z1 is geometrically very close to
x, and z2 is geometrically far away from x. What is the value of k(z1 , x) and k(z2 , x)?. Choose
one of the following:
a. k(z1 , x) will be close to 1 and k(z2 , x) will be close to 0.
b. k(z1 , x) will be close to 0 and k(z2 , x) will be close to 1.
c. k(z1 , x) will be close to c1 , c1 1 and k(z2 , x) will be close to c2 , c2 0, where c1 , c2 ∈ R
d. k(z1 , x) will be close to c1 , c1 0 and k(z2 , x) will be close to c2 , c2 1, where c1 , c2 ∈ R

Solution:
Correct answer is a, RBF kernel generates a ”bump” around the center x. For points z1
close to the center of the bump, K(z1 , x) will be close to 1, for points away from the center
of the bump K(z2 , x) will be close to 0.

iii. [3 points] You are given the following 3 plots, which illustrates a dataset with two classes.
Draw the decision boundary when you train an SVM classifier with linear, polynomial (order
2) and RBF kernels respectively. Classes have equal number of instances.

Solution:

Page 13 of 20
10-601 Matchine Learning Final Exam December 10, 2012

3 3 3

2 2 2

1 1 1

X2
0 0 0

−1 −1 −1

−2 −2 −2

−3 −3 −3
−3 −2 −1 0 1 2 3 −3 −2 −1 0 1 2 3 −3 −2 −1 0 1 2 3
X1 X1 X1

Page 14 of 20
10-601 Matchine Learning Final Exam December 10, 2012

(b) [3 points] Hard Margin SVM

7
Class −1
Class +1

4
X2

0
0 1 2 3 4 5 6 7
X1

Support vector machines learn a decision boundary leading to the largest margin from both classes.
You are training SVM on a tiny dataset with 4 points shown in Figure 2. This dataset consists of two
examples with class label -1 (denoted with plus), and two examples with class label +1 (denoted
with triangles).
i. Find the weight vector w and bias b. What’s the equation corresponding to the decision bound-
ary?

Solution:
SVM tries to maximize the margin between two classes. Therefore, the optimal decision
boundary is diagonal and it crosses the point (3,4). It is perpendicular to the line between
support vectors (4,5) and (2,3), hence it is slope is m = -1. Thus the line equation is (x2 −
4) = −1(x1 − 3) = x1 + x2 = 7. From this equation, we can deduce that the weight
vector has to be of the form (w1 , w2 ), where w1 = w2 . It also has to satisfy the following
equations:
2w1 + 3w2 + b = 1 and
4w1 + 5w2 + b = −1

Page 15 of 20
10-601 Matchine Learning Final Exam December 10, 2012

Hence w1 = w2 = −1/2 and b = 7/2

ii. Circle the support vectors and draw the decision boundary.

Solution:
See the solution above

Page 16 of 20
10-601 Matchine Learning Final Exam December 10, 2012

Question 8. Boosting

Solution:

Figure 1: Sample training data for boosting algorithm.

In this problem, we study how boosting algorithm performs on a very simple classification problem
shown in Figure 1. We use decision stump for each weak hypothesis hi . Decision stump classifier
chooses a constant value c and classifies all points where x > c as one class and other points where
x ≤ c as the other class.
(a) [2 points] What is the initial weight that is assigned to each data point?

Solution:
1
3

(b) [2 points] Show the decision boundary for the first decision stump (indicate the positive and neg-
ative side of the decision boundary).

Solution:
One possible solution is shown in the figure.

(d) [3 points] Write down the weight that is assigned to each data point after the first iteration of
boosting algorithm.

Page 17 of 20
10-601 Matchine Learning Final Exam December 10, 2012

Solution:
t = 31
αt = 12 ln(2) = 0.3465
1/3∗exp(−0.3465)
For data points that are classified correctly D2 (i) = Z2 ≈ 0.25 and for the data
1/3∗exp(0.3465)
point that is classified incorrectly D2 (i) = Z2 ≈ 0.5 where Z2 is the normalization
factor.

(e) [3 points] Can boosting algorithm perfectly classify all the training examples? If no, briefly explain
why. If yes, what is the minimum number of iteration?

Solution:
No, since the data is not linearly separable.

(f) [1 point] True/False The training error of boosting classifier (combination of all the weak classifier)
monotonically decreases as the number of iterations in the boosting algorithm increases. Justify
your answer in at most two sentences.

Solution:
Pm
False, boosting is minimizing loss function: i=1 exp(−yi f (xi )) which doesn’t necessary mean
that the training error monotonically decrease. Please look at slides 14-18 http://www.cs.
cmu.edu/˜tom/10601_fall2012/slides/boosting.pdf.

Page 18 of 20
10-601 Matchine Learning Final Exam December 10, 2012

Question 9. Model Selection

(a) [2 points] Consider learning a classifier in a situation with 1000 features total. 50 of them are truly
informative about class. Another 50 features are direct copies of the first 50 features. The final 900
features are not informative.
Assume there is enough data to reliably assess how useful features are, and the feature selection
methods are using good thresholds.
• How many features will be selected by mutual information filtering?

Solution:
about 100

• How many features will be selected by a wrapper method?

Solution:
about 50

(b) Consider k-fold cross-validation. Let’s consider the tradeoffs of larger or smaller k (the number of
folds). For each, please select one of the multiple choice options.
i. [2 points] With a higher number of folds, the estimated error will be, on average,
• (a) Higher.
• (b) Lower.
• (c) Same.
• (d) Can’t tell.

Solution:
Lower (because more training data)

Page 19 of 20
10-601 Matchine Learning Final Exam December 10, 2012

(c) [8 points] Nearly all the algorithms we have learned about in this course have a tuning param-
eter for regularization that adjusts the bias/variance tradeoff, and can be used to protect against
overfitting. More regularization tends to cause less overfitting.
For each of the following algorithms, we point out one such tuning parameter. If you increase the
parameter, does it lead to MORE or LESS regularization? (In other words, MORE bias (and less
variance), or LESS bias (and more variance)?) For every blank, please write MORE or LESS.
Naive Bayes: MAP estimation of binary features’ p(X|Y ), Higher α means. . . regularization
using a Beta(α, α) prior.

Logistic regression, linear regression, or a neural network Higher λ means. . . regularization

with a λ j wj2 penalty in the objective
P

Bayesian learning for real-valued parameter θ, given a Higher width of the regularization
prior p(θ), which might a wide or narrow shape. (For ex- prior distribution
ample, a high vs. low variance gaussian prior.) means. . .

Neural Network: number of hidden units, n Higher n means. . . regularization

Feature selection with mutual information scoring: In- Higher t means. . . regularization
clude a feature in the model only if its MI(feat, class) is
higher than a threshold t.

Decision tree: n, an upper limit on number of nodes in the Higher n means. . . regularization
tree.

Boosting: number of iterations, n Higher n means. . . regularization

Dimension reduction as preprocessing: Instead of using Higher k means. . . regularization

all features, reduce the training data down to k dimen-
sions with PCA, and use the PCA projections as the only
features.

Solution:

• NB α: more
• λ L2 penalty: more
• Bayesian prior width: less

• Num. hidden units: less

• MI threshold: more
• Num. dtree nodes: less
• Num. boosting iter: less

• Num. PC’s: less

Page 20 of 20

SMAI Question Papers
No ratings yet
SMAI Question Papers
13 pages
Practice Midterm
No ratings yet
Practice Midterm
4 pages
Machine Learning in Business - Chapter 1
No ratings yet
Machine Learning in Business - Chapter 1
22 pages
Endsem ML Regular AK
No ratings yet
Endsem ML Regular AK
7 pages
Midterm Solutions
No ratings yet
Midterm Solutions
8 pages
Solutions: 10-601 Machine Learning, Midterm Exam: Spring 2008 Solutions
No ratings yet
Solutions: 10-601 Machine Learning, Midterm Exam: Spring 2008 Solutions
8 pages
Final 2006
No ratings yet
Final 2006
15 pages
601 sp09 Midterm Solutions
No ratings yet
601 sp09 Midterm Solutions
14 pages
Final 2006
No ratings yet
Final 2006
15 pages
Midterm Solutions Machine
100% (1)
Midterm Solutions Machine
17 pages
Midterm Solutions PDF
No ratings yet
Midterm Solutions PDF
17 pages
Final 2001f
No ratings yet
Final 2001f
18 pages
Midterm Sp16 Solutions
100% (1)
Midterm Sp16 Solutions
17 pages
CS 7641 CSE/ISYE 6740 Mid-Term Exam 2 (Fall 2016) Solutions: 1 Probability and Bayes' Rule (14 PTS)
No ratings yet
CS 7641 CSE/ISYE 6740 Mid-Term Exam 2 (Fall 2016) Solutions: 1 Probability and Bayes' Rule (14 PTS)
12 pages
2022 CS244 End Sem Soln
No ratings yet
2022 CS244 End Sem Soln
6 pages
Solution of Final Exam: 10-701/15-781 Machine Learning: Fall 2004 Dec. 12th 2004
No ratings yet
Solution of Final Exam: 10-701/15-781 Machine Learning: Fall 2004 Dec. 12th 2004
27 pages
Final F02soln
No ratings yet
Final F02soln
11 pages
Practice Final CS61c
No ratings yet
Practice Final CS61c
19 pages
Midterm - EE511 - Part B: K K K K
No ratings yet
Midterm - EE511 - Part B: K K K K
8 pages
10f 601 Midterm
No ratings yet
10f 601 Midterm
17 pages
Final Exam, 10701 Machine Learning, Spring 2009: Max. Score Score 1 2 3 4 5 6 7 8 9 10
No ratings yet
Final Exam, 10701 Machine Learning, Spring 2009: Max. Score Score 1 2 3 4 5 6 7 8 9 10
25 pages
Assignment 10 Solution
No ratings yet
Assignment 10 Solution
8 pages
12s 701 Final
No ratings yet
12s 701 Final
17 pages
MIT6 - 041SCF13 - Finl - s09 - Sol - Probabilistics Systems and Statistics
No ratings yet
MIT6 - 041SCF13 - Finl - s09 - Sol - Probabilistics Systems and Statistics
17 pages
Sample Quiz1 Questions
No ratings yet
Sample Quiz1 Questions
8 pages
Massachusetts Institute of Technology: (Final Exam - Spring 2009)
No ratings yet
Massachusetts Institute of Technology: (Final Exam - Spring 2009)
14 pages
Endsem ML Makeup AK - 1
No ratings yet
Endsem ML Makeup AK - 1
7 pages
Final f02
No ratings yet
Final f02
12 pages
cs675 SS2022 Midterm Solution PDF
No ratings yet
cs675 SS2022 Midterm Solution PDF
10 pages
Exam 21
No ratings yet
Exam 21
17 pages
CS725 2020 Quiz1
No ratings yet
CS725 2020 Quiz1
3 pages
Gate 2025
No ratings yet
Gate 2025
33 pages
Final s09 Sol
No ratings yet
Final s09 Sol
18 pages
CSE 569 Homework #1: Notes
No ratings yet
CSE 569 Homework #1: Notes
3 pages
ML 20230316 1
No ratings yet
ML 20230316 1
9 pages
AtiB Week 7 Ga
No ratings yet
AtiB Week 7 Ga
8 pages
Concordia University Machine Learning Assaignment With Solutions
No ratings yet
Concordia University Machine Learning Assaignment With Solutions
8 pages
10-601 Machine Learning Midterm Exam Fall 2011: Tom Mitchell, Aarti Singh Carnegie Mellon University
No ratings yet
10-601 Machine Learning Midterm Exam Fall 2011: Tom Mitchell, Aarti Singh Carnegie Mellon University
16 pages
CS725 2021 Midsem
No ratings yet
CS725 2021 Midsem
6 pages
Sping 2009 Final
No ratings yet
Sping 2009 Final
15 pages
ANSWERS TO 15-381 Final, Spring 2004: Friday May 7, 2004
No ratings yet
ANSWERS TO 15-381 Final, Spring 2004: Friday May 7, 2004
20 pages
2023 Summer Final
No ratings yet
2023 Summer Final
21 pages
Assignment 1
No ratings yet
Assignment 1
6 pages
Quiz3 2023
No ratings yet
Quiz3 2023
2 pages
hw3 Solution
No ratings yet
hw3 Solution
7 pages
MFDS - Test 1 Problems
No ratings yet
MFDS - Test 1 Problems
9 pages
MLT 2021-22
No ratings yet
MLT 2021-22
14 pages
Spring 2006 Test 1 Solution
No ratings yet
Spring 2006 Test 1 Solution
9 pages
MLESA v2024 Week10 Assignment Solution
No ratings yet
MLESA v2024 Week10 Assignment Solution
7 pages
2017-18-I MS Key
No ratings yet
2017-18-I MS Key
6 pages
Final f03
No ratings yet
Final f03
8 pages
Gate Statistics Practice Question
No ratings yet
Gate Statistics Practice Question
30 pages
EE2211 Past Paper
No ratings yet
EE2211 Past Paper
14 pages
Assignment Week 4
No ratings yet
Assignment Week 4
6 pages
Final Exam Epfl 2020 Machine Leaning
No ratings yet
Final Exam Epfl 2020 Machine Leaning
16 pages
Cs 419 Endsemsols
No ratings yet
Cs 419 Endsemsols
6 pages
ML Question CMU
No ratings yet
ML Question CMU
12 pages
EE2211 Past Paper Ans
No ratings yet
EE2211 Past Paper Ans
19 pages
Midterm 2010 Solutions
No ratings yet
Midterm 2010 Solutions
8 pages
Rec5 Solns
No ratings yet
Rec5 Solns
14 pages
10+2 Level Mathematics For All Exams GMAT, GRE, CAT, SAT, ACT, IIT JEE, WBJEE, ISI, CMI, RMO, INMO, KVPY Etc.
From Everand
10+2 Level Mathematics For All Exams GMAT, GRE, CAT, SAT, ACT, IIT JEE, WBJEE, ISI, CMI, RMO, INMO, KVPY Etc.
Shubhankar Paul
No ratings yet
Marketing Thesis
No ratings yet
Marketing Thesis
18 pages
Random Forest Algorithm Updated
No ratings yet
Random Forest Algorithm Updated
11 pages
On Catastrophic Inheritance of Large Foundation Models
No ratings yet
On Catastrophic Inheritance of Large Foundation Models
33 pages
Week01 Lecture BB
No ratings yet
Week01 Lecture BB
70 pages
Introduction To Machine Learning With Python A Guide For Data Scientists 1st Edition by Andreas MÃ Ller, Sarah Guido 1449369413 9781449369415 Download
100% (2)
Introduction To Machine Learning With Python A Guide For Data Scientists 1st Edition by Andreas MÃ Ller, Sarah Guido 1449369413 9781449369415 Download
53 pages
Fundamentals of Predictive Analytics A Business Analytics Course
No ratings yet
Fundamentals of Predictive Analytics A Business Analytics Course
36 pages
Stock Price Prediction Thesis
100% (3)
Stock Price Prediction Thesis
4 pages
Machine Learning Algorithms, Real-World Applications and Research Directions
No ratings yet
Machine Learning Algorithms, Real-World Applications and Research Directions
73 pages
Week 6 Lecture Notes
No ratings yet
Week 6 Lecture Notes
9 pages
Internship Report ML'
No ratings yet
Internship Report ML'
36 pages
Dealing With Missing Data in Python Pandas
100% (1)
Dealing With Missing Data in Python Pandas
14 pages
Unit1 ML
No ratings yet
Unit1 ML
36 pages
4 ML
No ratings yet
4 ML
41 pages
COMP3314 A2 G17 Report
No ratings yet
COMP3314 A2 G17 Report
4 pages
M.tech ML Unit-3
No ratings yet
M.tech ML Unit-3
17 pages
Unit 4 NNDL
No ratings yet
Unit 4 NNDL
37 pages
01-DL-Introduction To Deep Learning 01
No ratings yet
01-DL-Introduction To Deep Learning 01
18 pages
HOMEWORK 3 Report
No ratings yet
HOMEWORK 3 Report
3 pages
Unit-2 Machine Learning
No ratings yet
Unit-2 Machine Learning
110 pages
ML 01 Introduction
No ratings yet
ML 01 Introduction
34 pages
Forecasting Municipal Solid Waste Generation Using Artificial Intelligence Modelling Approaches
No ratings yet
Forecasting Municipal Solid Waste Generation Using Artificial Intelligence Modelling Approaches
10 pages
ML Assignment 2
No ratings yet
ML Assignment 2
3 pages
21el3203-Advanced Machine Learning-Lab Workbook Final
No ratings yet
21el3203-Advanced Machine Learning-Lab Workbook Final
150 pages
AI - Based Road Safety Audit System
No ratings yet
AI - Based Road Safety Audit System
13 pages
15 MCQs ML (DT Classification)
100% (1)
15 MCQs ML (DT Classification)
6 pages
Computer Vision: MR Hew Ka Kian Hew - Ka - Kian@rp - Edu.sg
No ratings yet
Computer Vision: MR Hew Ka Kian Hew - Ka - Kian@rp - Edu.sg
22 pages
Lecture - 5 - Validation
No ratings yet
Lecture - 5 - Validation
30 pages
Multimedia Technology and Enhanced Learning Second EAI International Conference ICMTEL 2020 Leicester UK April 10 11 2020 Proceedings Part I Yu-Dong Zhang - Quickly access the ebook and start reading today
No ratings yet
Multimedia Technology and Enhanced Learning Second EAI International Conference ICMTEL 2020 Leicester UK April 10 11 2020 Proceedings Part I Yu-Dong Zhang - Quickly access the ebook and start reading today
65 pages
Decision Trees Classification: Mustafa Jarrar
No ratings yet
Decision Trees Classification: Mustafa Jarrar
46 pages

12f-601-Midterm Machine Learning

Uploaded by

12f-601-Midterm Machine Learning

Uploaded by

10-601 Machine Learning, Midterm Exam

Instructors: Tom Mitchell, Ziv Bar-Joseph

Wednesday 12th December, 2012

There are 9 questions, for a total of 100 points.

Question Points Score

Question 1. Short Answers

P (H = h|D = d) P (D = d|H = h)P (H = h)

a. using the same variance for both classes, σ1 = σ2

Question 2. GMM - Gamma Mixture Model

P (X = x|Z = k)P (Z = k) βk2 xe−βk x πk

Question 3. Decision trees and Hierarchical clustering

P(O100=A, O101=A, O102=A,S100=A, S101=A, S102=A) = P(O100=A—S100=A) P(S100=A)

(b) [3 points] What is P (O100 = A, O101 = A, O102 = A) for HMM2?

Question 6. Markov Decision Process

(b) [3 points] What is the optimal policy for each state?

a. K(x, z) = cK1 (x, z)

b. K(x, z) = K1 (x, z)K2 (x, z)

(b) [3 points] Hard Margin SVM

Hence w1 = w2 = −1/2 and b = 7/2

Figure 1: Sample training data for boosting algorithm.

Question 9. Model Selection

• How many features will be selected by a wrapper method?

Logistic regression, linear regression, or a neural network Higher λ means. . . regularization

Neural Network: number of hidden units, n Higher n means. . . regularization

Boosting: number of iterations, n Higher n means. . . regularization

Dimension reduction as preprocessing: Instead of using Higher k means. . . regularization

• Num. hidden units: less

• Num. PC’s: less

You might also like