
10-601 Machine Learning, Fall 2012

Homework 3
Instructors: Tom Mitchell, Ziv Bar-Joseph
TA in charge: Mehdi Samadi
email: [email protected]

Due: Monday October 15, 2012 by 4pm


Instructions: There are 4 questions on this assignment and no programming. Please hand in a hard copy
of your completed homework to Sharon Cavlovich (GHC 8215) by 4 PM on Monday, October 15th, 2012.
Don't forget to include your name and email address on your homework.

1 Neural Networks

1.1 Expressiveness of Neural Networks [10 points]

As discussed in class, neural networks are built out of units with real-valued inputs X1 . . . Xn, where the
unit output Y is given by

Y = \frac{1}{1 + \exp\left(-\left(w_0 + \sum_i w_i X_i\right)\right)}

Here we will explore the expressiveness of neural nets by examining their ability to represent boolean
functions. Here the inputs Xi will be 0 or 1. Of course the output Y will be real-valued, ranging anywhere
between 0 and 1. We will interpret Y as a boolean value by interpreting it to be a boolean 1 if Y > 0.5, and
interpreting it to be 0 otherwise.
1. Give 3 weights for a single unit with two inputs X1 and X2 that implements the logical OR function
Y = X1 ∨ X2.

Figure 1: The value of y = 1/(1 + e^{-x}) for different values of x.

F SOLUTION: Figure 1 shows the value of y = 1/(1 + e^{-x}) for different values of x. Note that y ≥ 0.5
if x ≥ 0, and y < 0.5 if x < 0. Given this, we need to choose the w_i so that w_0 + w_1 x_1 + w_2 x_2 will be
greater than 0 when x_1 ∨ x_2 is equal to 1. One candidate solution is [w_0 = -0.5, w_1 = 1, w_2 = 1].

2. Can you implement the logical AND function Y = X1 ∧ X2 in a single unit? If so, give weights that
achieve this. If not, explain the problem.
F SOLUTION: Similar to the previous part, we can use [w_0 = -1.5, w_1 = 1, w_2 = 1].
3. It is impossible to implement the EXCLUSIVE-OR function Y = X1 ⊕ X2 in a single unit. However,
you can do it using a multiple unit neural network. Please do. Use the smallest number of units you
can. Draw your network, and show all weights of each unit.
F SOLUTION: It can be represented by a neural network with two nodes in the hidden layer. Input
weights for node 1 in the hidden layer (computing X1 ∧ ¬X2) would be [w_0 = -0.5, w_1 = 1, w_2 = -1], input
weights for node 2 in the hidden layer (computing ¬X1 ∧ X2) would be [w_0 = -0.5, w_1 = -1, w_2 = 1], and
input weights for the output node (an OR of the two hidden units) would be [w_0 = -0.8, w_1 = 1, w_2 = 1].
(These weights are checked numerically in the sketch after part 4.)
4. Create a neural network with only one hidden layer (of any number of units) that implements
(A ∨ B) ⊕ (C ∨ D). Draw your network, and show all weights of each unit.

Figure 2: An example of a neural network for problem 1.4

F SOLUTION: Note that the XOR operation can be written in terms of AND and OR operations: p ⊕ q =
(p ∧ ¬q) ∨ (¬p ∧ q). Given this, we can rewrite the formula as (A ∧ ¬C ∧ ¬D) ∨ (B ∧ ¬C ∧ ¬D) ∨ (¬A ∧ ¬B
∧ C) ∨ (¬A ∧ ¬B ∧ D). This formula can be represented by a neural network with one hidden layer and
four nodes in the hidden layer (one unit for each parenthesis). An example is shown in Figure 2.
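
The following short Python sketch (not part of the original handout) plugs the weights from parts 1-3 into the sigmoid unit defined in Section 1.1 and checks the Y > 0.5 boolean interpretation on all four input combinations. The sign placement of the XOR hidden-unit weights follows the reconstruction given above.

# Sketch: numerically check the proposed unit weights for OR, AND, and XOR.
import math

def unit(weights, inputs):
    """Sigmoid unit: Y = 1 / (1 + exp(-(w0 + sum_i wi * Xi)))."""
    net = weights[0] + sum(w * x for w, x in zip(weights[1:], inputs))
    return 1.0 / (1.0 + math.exp(-net))

def as_bool(y):
    return 1 if y > 0.5 else 0

OR_W  = [-0.5, 1, 1]    # part 1
AND_W = [-1.5, 1, 1]    # part 2
# Part 3 (reconstructed signs): hidden units for X1 AND NOT X2 and NOT X1 AND X2,
# followed by an OR-like output unit with a slightly larger bias.
H1_W, H2_W, OUT_W = [-0.5, 1, -1], [-0.5, -1, 1], [-0.8, 1, 1]

for x1 in (0, 1):
    for x2 in (0, 1):
        h = (unit(H1_W, (x1, x2)), unit(H2_W, (x1, x2)))
        print(x1, x2,
              as_bool(unit(OR_W, (x1, x2))),   # should equal x1 or x2
              as_bool(unit(AND_W, (x1, x2))),  # should equal x1 and x2
              as_bool(unit(OUT_W, h)))         # should equal x1 xor x2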

1.2 MCLE, MAP, Gradient descent [15 points]

In class we showed the derivation of the gradient descent rule to train a single logistic (sigmoid) unit to
obtain a Maximum Conditional Likelihood Estimate for the unit weights w_0 . . . w_n. (See the slides from the
lecture on neural networks: http://www.cs.cmu.edu/~tom/10601_fall2012/slides/NNets-9_27_2012.pdf, especially the slides on pages 4 and 5).

1. The slide at the top of page 5 claims that if we want to place a Gaussian prior on the weights, to
obtain a MAP estimate instead of a Maximum likelihood estimate, then we must choose weights that
minimize the expression E:

E = c \sum_i w_i^2 + \sum_l (y^l - f(x^l))^2

where w_i is the i-th weight for our logistic unit, y^l is the target output for the l-th training example, x^l
is the vector of inputs for the l-th training example, f(x^l) is the unit output for input x^l, and c is some
constant.

Show that this claim is correct, by showing that minimizing E is equivalent to maximizing the expression
\ln P(W) \prod_l P(Y^l | X^l; W). Here W is the weight vector <w_0 . . . w_n>. In particular, assume each
weight w_i in the single unit follows a zero-mean Gaussian prior, of the form:

p(w_i) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{1}{2} \left( \frac{w_i - 0}{\sigma} \right)^2 \right)

so that P(W) = P(w_0, . . . , w_n) = \prod_{i=0}^{n} P(w_i).
F SOLUTION:

W_{MAP} = \arg\max_W \ln\left( P(W) \prod_l P(Y^l | X^l; W) \right)

        = \arg\max_W \left[ \ln P(W) + \ln \prod_l P(Y^l | X^l; W) \right]   (1)

        = \arg\max_W \left[ \ln P(W) + \ln \prod_l \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{1}{2} \left( \frac{Y^l - f(X^l)}{\sigma} \right)^2 \right) \right]

From the definition of P(W) we can write:

P(W) = \prod_{i=0}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{1}{2} \left( \frac{w_i - 0}{\sigma} \right)^2 \right)

\ln P(W) = -\left( \frac{\sum_{i=0}^{n} w_i^2}{2\sigma^2} + \frac{n \ln(2\pi\sigma^2)}{2} \right)   (2)

By removing constant values from the above two formulas (since they don't change the maximization) and
removing the negative sign, we would have:

W_{MAP} = \arg\min_W \left[ c \sum_{i=0}^{n} w_i^2 + \sum_{l=1}^{M} (Y^l - f(X^l))^2 \right]   (3)

        = \arg\min_W E

2. Derive the gradient you would use to obtain the MAP estimate, for a single unit with two inputs X1
and X2. In other words, give formulas for each of the three partial derivatives

\frac{\partial E}{\partial w_0}, \quad \frac{\partial E}{\partial w_1}, \quad \frac{\partial E}{\partial w_2}

Hint: the slide at the bottom of page 5 of the handout does part of your task. If you get stuck, the slides
on linear regression might also be helpful.

F SOLUTION:

E = c \sum_{i=0}^{n} w_i^2 + \sum_{l=1}^{M} (Y^l - f(X^l))^2 = c w_0^2 + c w_1^2 + c w_2^2 + \sum_{l=1}^{M} (Y^l - f(X^l))^2   (4)

\frac{\partial E}{\partial w_0} = 2 c w_0 - \sum_l 2 (Y^l - f(X^l)) \frac{\partial}{\partial w_0} f(X^l)

In class we showed that \frac{\partial}{\partial w_0} f(X^l) = f(X^l)(1 - f(X^l)). So we can write:

\frac{\partial E}{\partial w_0} = 2 c w_0 - \sum_l 2 (Y^l - f(X^l)) f(X^l)(1 - f(X^l))   (5)

Similarly, we can derive the formulas for w_1 and w_2; there the chain rule contributes an extra factor of the
corresponding input, since \frac{\partial}{\partial w_i} f(X^l) = f(X^l)(1 - f(X^l)) X_i^l:

\frac{\partial E}{\partial w_i} = 2 c w_i - \sum_l 2 (Y^l - f(X^l)) f(X^l)(1 - f(X^l)) X_i^l, \quad i = 1, 2.
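
As an illustration (not part of the original solution), the following Python sketch runs batch gradient descent on the MAP objective E for a single sigmoid unit with two inputs, using the gradients above. The training data, the constant c, and the step size eta are made-up values chosen only for the example.

# Sketch: gradient descent on E = c * sum_i wi^2 + sum_l (y_l - f(x_l))^2.
import math

def f(w, x):
    # Sigmoid unit output for weights w = [w0, w1, w2] and inputs x = (x1, x2).
    return 1.0 / (1.0 + math.exp(-(w[0] + w[1] * x[0] + w[2] * x[1])))

def map_gradient(w, data, c):
    grad = [2 * c * wi for wi in w]             # derivative of the c * sum wi^2 term
    for x, y in data:
        fx = f(w, x)
        common = -2 * (y - fx) * fx * (1 - fx)  # d/dw0 of the squared-error term
        grad[0] += common
        grad[1] += common * x[0]                # extra X1 factor for w1
        grad[2] += common * x[1]                # extra X2 factor for w2
    return grad

data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]  # e.g. OR-labelled points
w, c, eta = [0.0, 0.0, 0.0], 0.01, 0.5
for _ in range(2000):
    g = map_gradient(w, data, c)
    w = [wi - eta * gi for wi, gi in zip(w, g)]
print(w, [round(f(w, x), 2) for x, _ in data])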

2 Bayesian Networks [20 points]

2.1 Representation and Inference [12 points]

Consider the following Bayes net:

A → B → C ← D
1. Write the joint probability P(A, B, C, D) for this network as the product of four conditional probabilities.

F SOLUTION:

P(A, B, C, D) = P(A) P(B|A) P(D) P(C|B, D)   (6)

2. How many independent parameters are needed to fully define this Bayesian Network?

F SOLUTION: We need 8 independent parameters: 1 for P(A), 2 for P(B|A), 1 for P(D), and 4 for P(C|B, D).


3. How many independent parameters would we need to define the joint distribution P(A, B, C, D) if we
made no assumptions about independence or conditional independence?

F SOLUTION: 2^4 - 1 = 15.
4. [6 pts] Consider the even simpler 3-node Bayes Net A → B → C.
Give an expression for P(B = 1|C = 0) in terms of the parameters of this network. Use notation like
P(C = 1|B = 0) to represent individual Bayes net parameters.

F SOLUTION: Using Bayes Rule, we have:

P(B = 1 | C = 0) = \frac{P(C = 0 | B = 1) P(B = 1)}{P(C = 0)} = \frac{(1 - P(C = 1 | B = 1)) P(B = 1)}{P(C = 0)}   (7)

Equations for P(B = 1) and P(C = 0) can be derived by:

P(B = 1) = P(B = 1 | A = 0) P(A = 0) + P(B = 1 | A = 1) P(A = 1)
P(C = 0) = 1 - P(C = 1) = 1 - (P(C = 1 | B = 0) P(B = 0) + P(C = 1 | B = 1) P(B = 1))   (8)

Note that the 5 independent parameters for this Bayesian Network are: P(A = 1), P(B = 1|A = 0), P(B =
1|A = 1), P(C = 1|B = 0), P(C = 1|B = 1). The above equations can be written in terms of these parameters
using the complement rule of probability.
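
As a sanity check (not part of the original solution), the Python sketch below evaluates the expression for P(B = 1 | C = 0) derived above and compares it against brute-force summation of the joint P(A)P(B|A)P(C|B). The five parameter values are arbitrary illustration numbers.

# Sketch: evaluate P(B=1 | C=0) for the A -> B -> C network.
p_A1       = 0.3                    # P(A=1), arbitrary illustration value
p_B1_given = {0: 0.2, 1: 0.9}       # P(B=1 | A=a)
p_C1_given = {0: 0.1, 1: 0.7}       # P(C=1 | B=b)

# Closed-form expression from equations (7) and (8).
p_B1 = p_B1_given[0] * (1 - p_A1) + p_B1_given[1] * p_A1
p_C0 = 1 - (p_C1_given[0] * (1 - p_B1) + p_C1_given[1] * p_B1)
p_B1_given_C0 = (1 - p_C1_given[1]) * p_B1 / p_C0

# Brute-force check: sum the joint P(A)P(B|A)P(C=0|B) over A and B.
num = den = 0.0
for a in (0, 1):
    p_a = p_A1 if a else 1 - p_A1
    for b in (0, 1):
        p_b = p_B1_given[a] if b else 1 - p_B1_given[a]
        p_c0 = 1 - p_C1_given[b]
        den += p_a * p_b * p_c0
        if b == 1:
            num += p_a * p_b * p_c0
assert abs(p_B1_given_C0 - num / den) < 1e-12
print(p_B1_given_C0)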

2.2 Learning Bayes Nets [8 points]

Suppose you want to learn a Bayes net over two binary variables X1 and X2. You have N training pairs of
X1 and X2, given as {(x_1^1, x_2^1), (x_1^2, x_2^2), (x_1^3, x_2^3), . . . , (x_1^N, x_2^N)}. Given two datasets A and B, we know that
the data in B is generated by x_2^j = F(x_1^j, θ) + ε for all training instances j, where θ and ε are two unknown
parameters. We don't have any information on how dataset A is generated. Let BN denote the Bayes Net
with no edges, and BN' denote the BN with an edge from X1 to X2. For both of these Bayes nets, we learn
the parameters using maximum likelihood estimation.
1. Which Bayes net is better to model the dataset A? Explain your answer.

F SOLUTION: BN'. We shouldn't make any independence assumption between the variables, since we don't
know how the data has been generated.

2. Which Bayes net is better to model the dataset B? Explain your answer.

F SOLUTION: BN', since we know that X1 and X2 are not independent.
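
To illustrate the point (this is not part of the original solution), the sketch below fits both structures by maximum likelihood on a toy binary dataset in which X2 is a noisy copy of X1 — a made-up stand-in for the unspecified F of the problem — and compares their training log-likelihoods; BN', which models P(X2 | X1), attains the higher value.

# Sketch: compare MLE fits of BN (no edges) and BN' (edge X1 -> X2).
import math
import random

random.seed(0)
data = []
for _ in range(1000):
    x1 = 1 if random.random() < 0.5 else 0
    x2 = x1 if random.random() < 0.9 else 1 - x1   # made-up dependence of x2 on x1
    data.append((x1, x2))

def log_lik_bn(data):
    # BN: P(x1, x2) = P(x1) P(x2), parameters are empirical frequencies (MLE).
    n = len(data)
    p1 = sum(x1 for x1, _ in data) / n
    p2 = sum(x2 for _, x2 in data) / n
    return sum(math.log((p1 if x1 else 1 - p1) * (p2 if x2 else 1 - p2))
               for x1, x2 in data)

def log_lik_bn_prime(data):
    # BN': P(x1, x2) = P(x1) P(x2 | x1), conditional table fit by MLE.
    n = len(data)
    p1 = sum(x1 for x1, _ in data) / n
    p2_given = {}
    for a in (0, 1):
        rows = [x2 for x1, x2 in data if x1 == a]
        p2_given[a] = sum(rows) / len(rows)
    return sum(math.log((p1 if x1 else 1 - p1) *
                        (p2_given[x1] if x2 else 1 - p2_given[x1]))
               for x1, x2 in data)

print(log_lik_bn(data), log_lik_bn_prime(data))  # BN' gives the higher log-likelihood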

3 Expectation Maximization (EM) [15 points]

Consider again the simple Bayes Network from question 2: A → B → C. You must train this network from
partly observed data, using EM and the following training examples:

example 1: A=1, B=1, C=0
example 2: A=1, B=?, C=0
example 3: A=0, B=0, C=1
example 4: A=0, B=1, C=1

Assume that we begin with each independent parameter of this network initialized to 0.6 (recall that you
enumerated these in question 2).
1. As we execute the EM algorithm, what gets calculated during the first E step?

F SOLUTION: For each training example k:

E[B_k] = P(B_k = 1 | A_k, C_k, \theta) = \frac{P(B_k = 1, A_k, C_k | \theta)}{P(B_k = 1, A_k, C_k | \theta) + P(B_k = 0, A_k, C_k | \theta)}   (9)

2. Give the value for this quantity, as calculated by the first E step.

F SOLUTION: For k = 2:

E[B_2] = \frac{\theta_{C=0|B=1} \, \theta_{B=1|A=1} \, \theta_{A=1}}{\theta_{C=0|B=1} \, \theta_{B=1|A=1} \, \theta_{A=1} + \theta_{C=0|B=0} \, \theta_{B=0|A=1} \, \theta_{A=1}} = \frac{0.4 \times 0.6 \times 0.6}{0.4 \times 0.6 \times 0.6 + 0.4 \times 0.4 \times 0.6} = 0.6   (10)

3. What gets calculated during the first M step?

F SOLUTION: The M step re-estimates the five independent parameters from the expected counts, where
\delta(\cdot) is 1 when its argument is true and 0 otherwise, and E[B_k] = B_k for the examples in which B is observed:

\theta_{A=1} = \frac{\sum_{k=1}^{N} \delta(A_k = 1)}{N}

\theta_{B=1|A=0} = \frac{\sum_{k=1}^{N} \delta(A_k = 0) E[B_k]}{\sum_{k=1}^{N} \delta(A_k = 0)}

\theta_{B=1|A=1} = \frac{\sum_{k=1}^{N} \delta(A_k = 1) E[B_k]}{\sum_{k=1}^{N} \delta(A_k = 1)}

\theta_{C=1|B=1} = \frac{\sum_{k=1}^{N} \delta(C_k = 1) E[B_k]}{\sum_{k=1}^{N} E[B_k]}

\theta_{C=1|B=0} = \frac{\sum_{k=1}^{N} \delta(C_k = 1) (1 - E[B_k])}{\sum_{k=1}^{N} (1 - E[B_k])}   (11)

4. Give the value for this set of quantities, as calculated by the first M step.

F SOLUTION:

\theta_{A=1} = 0.5
\theta_{B=1|A=0} = 0.5
\theta_{B=1|A=1} = 0.8
\theta_{C=1|B=0} = 0.71
\theta_{C=1|B=1} = 0.38   (12)
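
The following Python sketch (not part of the original handout) runs the first E step and M step above on the four training examples for the A → B → C network, with every independent parameter initialised to 0.6, and reproduces the numbers in equations (10) and (12).

# Sketch: one EM iteration for the A -> B -> C network on the four examples.
t_A1 = 0.6
t_B1_given_A = {0: 0.6, 1: 0.6}   # theta_{B=1|A=a}
t_C1_given_B = {0: 0.6, 1: 0.6}   # theta_{C=1|B=b}

examples = [(1, 1, 0), (1, None, 0), (0, 0, 1), (0, 1, 1)]  # (A, B, C); None = unobserved

def joint(a, b, c):
    """P(A=a, B=b, C=c) under the current parameters."""
    pa = t_A1 if a else 1 - t_A1
    pb = t_B1_given_A[a] if b else 1 - t_B1_given_A[a]
    pc = t_C1_given_B[b] if c else 1 - t_C1_given_B[b]
    return pa * pb * pc

# E step: expected value of B for each example (equals B itself when observed).
EB = []
for a, b, c in examples:
    if b is None:
        p1, p0 = joint(a, 1, c), joint(a, 0, c)
        EB.append(p1 / (p1 + p0))
    else:
        EB.append(float(b))
print(EB[1])  # 0.6, as in equation (10)

# M step: re-estimate the five independent parameters from the expected counts.
N = len(examples)
t_A1 = sum(a for a, _, _ in examples) / N
for av in (0, 1):
    idx = [k for k, (a, _, _) in enumerate(examples) if a == av]
    t_B1_given_A[av] = sum(EB[k] for k in idx) / len(idx)
t_C1_given_B[1] = sum(EB[k] for k, (_, _, c) in enumerate(examples) if c == 1) / sum(EB)
t_C1_given_B[0] = (sum(1 - EB[k] for k, (_, _, c) in enumerate(examples) if c == 1)
                   / sum(1 - e for e in EB))
print(t_A1, t_B1_given_A, t_C1_given_B)  # 0.5, {0: 0.5, 1: 0.8}, {0: ~0.71, 1: ~0.38}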

4 Midterm Review Questions [15 points]

Here are short questions (some from previous midterm exams) intended to help you review for our midterm
on October 18.

4.1 True or False Questions [9 points]

If true, give a 1-2 sentence explanation. If false, give a counterexample.


1. As the number of training examples grows toward infinity, the MLE and MAP estimates for Naive Bayes
parameters converge to the same value in the limit.
F SOLUTION: False, since we have not made any assumption about the prior. A simple counterexample is
a prior which assigns probability 1 to a single choice of the parameters.
2. As the number of training examples grows toward infinity, the probability that logistic regression will
overfit the training data goes to zero.
F SOLUTION: True.
3. In decision tree learning with noise-free data, starting with the wrong attribute at the root can make it
impossible to find a tree that fits the data exactly.

F SOLUTION: False. With noise-free (consistent) data, we can always keep splitting on the remaining
attributes until every leaf is pure, so a tree that fits the data exactly can be found no matter which
attribute is placed at the root.

4.2 Short Questions [6 points]

1. The Naive Bayes algorithm selects the class c for an example x that maximizes P (c|x). When is this
equivalent to selecting the c that maximizes P (x|c)?

F SOLUTION: P(c|x) = \frac{P(x|c) P(c)}{P(x)}, so finding the c that maximizes P(c|x) is equivalent to finding the c
that maximizes P(x|c) if the prior P(c) is uniform.

2. Imagine you have a learning problem with an instance space of points on the plane. Assume that the
target function takes the form of a line on the plane where all points on one side of the line are positive
and all those on the other are negative. If you are asked to choose between using a decision tree or a neural
network with no hidden layer, which would you choose? Why?
F SOLUTION: Neural network. A neural network with no hidden layer is a single sigmoid unit, which
represents exactly this kind of linear decision boundary, whereas a decision tree (with axis-parallel splits)
cannot represent some of the linear functions on a plane exactly (e.g., y = x).
