AI&ML-Q With Answer
P(D): prior probability of D, i.e., the probability that the training data D will be observed.
P(D|h): probability of observing D given a world in which hypothesis h holds.
In order to specify a learning problem for the BRUTE-FORCE MAP LEARNING algorithm, we must specify what values are to be used for P(h) and for P(D|h).
Let us choose P(h) and P(D|h) to be consistent with the following assumptions:
1. The training data D is noise free (i.e., di = c(xi)).
2. The target concept c is contained in the hypothesis space H.
3. We have no a priori reason to believe that any hypothesis is more probable than any other.
Given these assumptions, it is reasonable to assign the same prior probability to every hypothesis: P(h) = 1/|H| for all h in H.
P(D|h) is the probability of observing the target values D = (d1, ..., dm) for the fixed set of instances (x1, ..., xm), given a world in which hypothesis h holds.
Since we assume noise-free training data, the probability of observing classification di given h is just 1 if di = h(xi) and 0 if di ≠ h(xi). Therefore,
P(D|h) = 1 if di = h(xi) for all di in D, and P(D|h) = 0 otherwise.
Given these choices for P(h) and for P(D|h) we now have a fully-defined problem for the above BRUTE-FORCE MAP LEARNING algorithm. Recalling Bayes theorem, we have
P(h|D) = P(D|h) P(h) / P(D)
If h is inconsistent with D then P(D|h) = 0, so P(h|D) = 0. If h is consistent with D, then
P(h|D) = (1 · 1/|H|) / (|VSH,D| / |H|) = 1 / |VSH,D|
where VSH,D is the subset of hypotheses from H that are consistent with D (the version space of H with respect to D).
To summarize, Bayes theorem implies that the posterior probability P(h|D) under our assumed P(h) and P(D|h) is
P(h|D) = 1/|VSH,D| if h is consistent with D, and P(h|D) = 0 otherwise.
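As a minimal illustration of this result, the sketch below (in Python) computes the posterior by enumerating a small hypothesis space and spreading probability uniformly over the version space; the hypothesis space, data, and function names are illustrative assumptions, not part of the original question.

# Brute-force MAP learning sketch: posterior mass is spread uniformly
# over the version space (the hypotheses consistent with the data).
hypotheses = {
    'x > 0':     lambda x: x > 0,
    'x > 2':     lambda x: x > 2,
    'x is even': lambda x: x % 2 == 0,
}

def posterior(hypotheses, data):
    """data is a list of noise-free (x, c(x)) pairs."""
    consistent = [name for name, h in hypotheses.items()
                  if all(h(x) == d for x, d in data)]
    vs = len(consistent)
    # P(h|D) = 1/|VS_H,D| for hypotheses consistent with D, 0 otherwise
    return {name: (1.0 / vs if name in consistent else 0.0) for name in hypotheses}

data = [(3, True), (1, True)]
print(posterior(hypotheses, data))   # {'x > 0': 1.0, 'x > 2': 0.0, 'x is even': 0.0}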
Maximum Likelihood and Least-Squared Error Hypotheses:
Learner L considers an instance space X and a hypothesis space H consisting of some class of real-valued functions defined over X, i.e., (∀ h ∈ H)[ h : X → R ], and training examples of the form <xi, di>.
The problem faced by L is to learn an unknown target function f : X → R
A set of m training examples is provided, where the target value of each example is corrupted by
random noise drawn according to a Normal probability distribution with zero mean (di = f(xi) + ei)
Each training example is a pair of the form (xi, di), where di = f(xi) + ei.
– Here f(xi) is the noise-free value of the target function and ei is a random variable
representing the noise.
– It is assumed that the values of the ei are drawn independently and that they are distributed
according to a Normal distribution with zero mean.
The task of the learner is to output a maximum likelihood hypothesis or a MAP hypothesis
assuming all hypotheses are equally probable a priori.
Assuming the training examples are mutually independent given h, we can write P(D|h) as the product of the individual p(di|h):
P(D|h) = ∏ i=1..m p(di | h)
Given that the noise ei obeys a Normal distribution with zero mean and unknown variance σ², each di must also obey a Normal distribution with variance σ² centred around the true target value f(xi). Because we are writing the expression for P(D|h), we assume h is the correct description of f. Hence, µ = f(xi) = h(xi), and
hML = argmax h∈H ∏ i=1..m (1/√(2πσ²)) exp( −(di − h(xi))² / (2σ²) )
We now maximize the less complicated logarithm of this expression, which is justified because ln p is a monotonic function of p:
hML = argmax h∈H Σ i=1..m [ ln(1/√(2πσ²)) − (di − h(xi))² / (2σ²) ]
The first term in this expression is a constant independent of h and can therefore be discarded, yielding
hML = argmax h∈H Σ i=1..m −(di − h(xi))² / (2σ²)
Maximizing this negative quantity is equivalent to minimizing the corresponding positive quantity, and the constant factor 1/(2σ²) can also be discarded:
hML = argmin h∈H Σ i=1..m (di − h(xi))²
Thus, the above equation shows that the maximum likelihood hypothesis hML is the one that minimizes the sum of the squared errors between the observed training values di and the hypothesis predictions h(xi).
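A small numerical sketch of this equivalence: under Gaussian noise with fixed variance, scanning a grid of candidate hypotheses shows that the hypothesis maximizing the log-likelihood is exactly the one minimizing the sum of squared errors. The linear hypothesis class h(x) = w*x, the noise level, and the grid used here are illustrative assumptions.

import numpy as np

# With Gaussian noise of fixed variance, maximizing log-likelihood over h
# picks the same hypothesis as minimizing the sum of squared errors.
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
d = 3.0 * x + rng.normal(0.0, 0.1, size=x.shape)    # d_i = f(x_i) + e_i
sigma2 = 0.1 ** 2
candidates = np.linspace(0.0, 5.0, 501)             # candidate hypotheses (slopes w)

def log_likelihood(w):
    resid = d - w * x
    return np.sum(-0.5 * np.log(2 * np.pi * sigma2) - resid ** 2 / (2 * sigma2))

def sse(w):
    return np.sum((d - w * x) ** 2)

w_ml  = max(candidates, key=log_likelihood)
w_sse = min(candidates, key=sse)
print(w_ml, w_sse)    # identical: both criteria select the same hypothesis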
The Naive Bayes Classifier:
The Bayesian approach to classifying a new instance is to assign the most probable target value, vMAP, given the attribute values (a1, a2, ..., an) that describe the instance:
vMAP = argmax vj∈V P(vj | a1, a2, ..., an)
Using Bayes theorem, this can be rewritten as
vMAP = argmax vj∈V P(a1, a2, ..., an | vj) P(vj)      ... (1)
The naive Bayes classifier is based on the assumption that the attribute values are conditionally independent given the target value. That is, given the target value of the instance, the probability of observing the conjunction (a1, a2, ..., an) is just the product of the probabilities for the individual attributes:
P(a1, a2, ..., an | vj) = ∏ i P(ai | vj)
Substituting this into Equation (1) gives the approach used by the naive Bayes classifier,
vNB = argmax vj∈V P(vj) ∏ i P(ai | vj)
where vNB denotes the target value output by the naive Bayes classifier.
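A minimal sketch of a naive Bayes classifier for discrete attributes, estimating P(vj) and P(ai|vj) by simple frequency counts; the function names and data format are illustrative assumptions, and no smoothing of zero counts is applied.

from collections import Counter, defaultdict

def train(examples):
    """examples: list of (attribute_tuple, target_value)."""
    class_counts = Counter(v for _, v in examples)
    attr_counts = defaultdict(Counter)            # counts of attribute value per (index, class)
    for attrs, v in examples:
        for i, a in enumerate(attrs):
            attr_counts[(i, v)][a] += 1
    return class_counts, attr_counts, len(examples)

def classify(model, attrs):
    class_counts, attr_counts, m = model
    best_v, best_score = None, -1.0
    for v, n_v in class_counts.items():
        score = n_v / m                           # P(v_j)
        for i, a in enumerate(attrs):
            score *= attr_counts[(i, v)][a] / n_v # P(a_i | v_j)
        if score > best_score:
            best_v, best_score = v, score
    return best_v, best_score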
6. The following table gives a dataset about stolen vehicles. Using the naive Bayes classifier, classify the new instance (Red, SUV, Domestic).*****
Vj = Yes:
P(Yes) P(Color=Red|Stolen=Yes) P(Type=SUV|Stolen=Yes) P(Origin=Domestic|Stolen=Yes)
= 0.5 * 0.6 * 0.2 * 0.4 = 0.024
Vj = No:
P(No) P(Color=Red|Stolen=No) P(Type=SUV|Stolen=No) P(Origin=Domestic|Stolen=No)
= 0.5 * 0.4 * 0.6 * 0.6 = 0.072
The naive Bayes classifier assigns the target value with the maximum score, here 0.072 for Stolen = No.
So, for the new instance (Red, SUV, Domestic), the classification is No (not stolen).
Normalizing these scores, the conditional probability that the vehicle is not stolen is 0.072 / (0.072 + 0.024) = 0.75.
7. Explain EM algorithm.*****
Estimating Means of k Gaussians:
Consider data D = {x1, ..., xm} generated by a mixture of k Normal distributions with unknown means µ1, ..., µk (and, for simplicity, equal known variance σ²). If we knew which Gaussian generated each instance, estimating each mean would reduce to the single-Gaussian maximum likelihood problem; in that case, the sum of squared errors is minimized by the sample mean,
µML = (1/m) Σ i=1..m xi
In the mixture setting the identity of the generating Gaussian for each instance is a hidden variable, and the EM algorithm is used.
EM algorithm:
1. E step: Calculate the expected value E[zij] of each hidden variable zij (the probability that instance xi was generated by the jth Gaussian), assuming the current hypothesis h = <µ1, ..., µk> holds.
2. M step: Calculate a new maximum likelihood hypothesis h' = <µ1', ..., µk'>, re-estimating each µj as the weighted sample mean Σi E[zij] xi / Σi E[zij]. Replace h by h' and repeat until the estimates converge.
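A minimal sketch of this two-step EM procedure for the means of k Gaussians, assuming equal known variance; the quantile-based initialization, variance value, and iteration count are illustrative assumptions.

import numpy as np

def em_means(x, k, sigma2=1.0, iters=50):
    """EM for the means of k Gaussians with known, equal variance sigma2."""
    mu = np.quantile(x, np.linspace(0.1, 0.9, k))      # spread initial mean estimates over the data
    for _ in range(iters):
        # E step: E[z_ij] proportional to exp(-(x_i - mu_j)^2 / (2 sigma2))
        resp = np.exp(-(x[:, None] - mu[None, :]) ** 2 / (2 * sigma2))
        resp /= resp.sum(axis=1, keepdims=True)
        # M step: each mean becomes the responsibility-weighted sample mean
        mu = (resp * x[:, None]).sum(axis=0) / resp.sum(axis=0)
    return mu

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2, 1, 200), rng.normal(5, 1, 200)])
print(em_means(x, k=2))    # means close to -2 and 5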
8. Explain Bayesian Belief Networks.*****
A Bayesian belief network describes the joint probability distribution for a set of variables.
Conditional Independence:
Let X, Y, and Z be three discrete-valued random variables. X is conditionally independent of Y given Z if the probability distribution governing X is independent of the value of Y given a value for Z; that is,
(∀ xi, yj, zk) P(X = xi | Y = yj, Z = zk) = P(X = xi | Z = zk)
where xi, yj, and zk range over the possible values of X, Y, and Z. This is commonly written more compactly as P(X | Y, Z) = P(X | Z).
The naive Bayes classifier assumes that the instance attribute A1 is conditionally independent of instance attribute A2 given the target value V. This allows the naive Bayes classifier to calculate P(A1, A2 | V) as follows:
P(A1, A2 | V) = P(A1 | A2, V) P(A2 | V) = P(A1 | V) P(A2 | V)
Representation:
A Bayesian belief network represents the joint probability distribution for a set of variables. Bayesian networks (BN) are represented by directed acyclic graphs.
The example network considered here represents the joint probability distribution over the boolean variables Storm, Lightning, Thunder, ForestFire, Campfire, and BusTourGroup.
A Bayesian network (BN) represents the joint probability distribution by specifying a set of conditional independence assumptions, represented by a directed acyclic graph, together with sets of local conditional probabilities.
Each variable in the joint space is represented by a node in the Bayesian network
The network arcs represent the assertion that each variable is conditionally independent of its non-descendants in the network given its immediate predecessors in the network.
A conditional probability table (CPT) is given for each variable, describing the probability
distribution for that variable given the values of its immediate predecessors
The joint probability for any desired assignment of values (y1, ..., yn) to the tuple of network variables (Y1, ..., Yn) can be computed by the formula
P(y1, ..., yn) = ∏ i=1..n P(yi | Parents(Yi))
where Parents(Yi) denotes the set of immediate predecessors of Yi in the network.
Example:
Consider the node Campfire. The network nodes and arcs represent the assertion that Campfire is
conditionally independent of its non-descendants Lightning and Thunder, given its immediate
parents Storm and BusTourGroup.
This means that once we know the value of the variables Storm and BusTourGroup, the variables
Lightning and Thunder provide no additional information about Campfire
The conditional probability table associated with the variable Campfire describes the probability distribution for Campfire given each combination of values of its parents Storm and BusTourGroup; one entry of this table, for example, is the assertion P(Campfire = True | Storm = True, BusTourGroup = True).
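A minimal sketch of how the joint probability factorizes over such a network, using a tiny Storm/BusTourGroup/Campfire fragment; the prior and CPT numbers below are made-up illustrative values, not the entries of the figure's actual table.

# Joint probability in a Bayesian network: P(y1,...,yn) = product over i of P(yi | Parents(Yi)).
# Tiny illustrative fragment: Storm -> Campfire <- BusTourGroup (all numbers made up).
p_storm = {True: 0.3, False: 0.7}
p_bus   = {True: 0.5, False: 0.5}
p_campfire = {   # P(Campfire=True | Storm, BusTourGroup)
    (True, True): 0.4, (True, False): 0.1,
    (False, True): 0.8, (False, False): 0.2,
}

def joint(storm, bus, campfire):
    p_c = p_campfire[(storm, bus)]
    return p_storm[storm] * p_bus[bus] * (p_c if campfire else 1.0 - p_c)

print(joint(storm=True, bus=True, campfire=True))   # 0.3 * 0.5 * 0.4 = 0.06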
The most basic instance-based method is k-Nearest Neighbor learning. This algorithm assumes all instances correspond to points in the n-dimensional space Rn.
The nearest neighbors of an instance are defined in terms of the standard Euclidean distance.
Let an arbitrary instance x be described by the feature vector
(a1(x), a2(x), ..., an(x))
where ar(x) denotes the value of the rth attribute of instance x.
Then the distance between two instances xi and xj is defined to be
d(xi, xj) = sqrt( Σ r=1..n ( ar(xi) − ar(xj) )² )
In nearest-neighbor learning the target function may be either discrete-valued or real-valued.
Let us first consider learning discrete-valued target functions of the form f : Rn → V, where V is the finite set {v1, ..., vs}.
The k-Nearest Neighbor algorithm for approximating a discrete-valued target function is sketched below.
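A minimal sketch of the classification step, using the Euclidean distance defined above; the helper names and the toy training set are illustrative assumptions.

import math
from collections import Counter

def euclidean(xi, xj):
    # d(xi, xj) = sqrt(sum_r (a_r(xi) - a_r(xj))^2)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(xi, xj)))

def knn_classify(training, xq, k=5):
    """training: list of (feature_tuple, label). Returns the majority label
    among the k training examples nearest to the query point xq."""
    nearest = sorted(training, key=lambda ex: euclidean(ex[0], xq))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

training = [((1, 1), '+'), ((2, 1), '+'), ((6, 6), '-'), ((7, 5), '-'), ((6, 5), '-')]
print(knn_classify(training, (2, 2), k=3))   # '+'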
Below figure illustrates the operation of the k-Nearest Neighbor algorithm for the case where the instances are
points in a two-dimensional space and where the target function is Boolean valued.
The positive and negative training examples are shown by “+” and “-” respectively. A
query point xq is shown as well.
The 1-Nearest Neighbor algorithm classifies xq as a positive example in this figure, whereas the 5-
Nearest Neighbor algorithm classifies it as a negative example.
Below figure shows the shape of this decision surface induced by 1- Nearest Neighbor over the entire instance
space. The decision surface is a combination of convex polyhedra surrounding each of the training examples.
For every training example, the polyhedron indicates the set of query points whose classification will be
completely determined by that training example. Query points outside the polyhedron are closer to
some other training example. This kind of diagram is often called the Voronoi diagram of the set of
training examples.
The k-Nearest Neighbor algorithm for approximating a real-valued target function f : Rn → R is obtained by replacing the final line of the above algorithm with
f̂(xq) ← ( Σ i=1..k f(xi) ) / k
i.e., the prediction is the mean target value of the k nearest training examples.
Distance-Weighted Nearest Neighbor Algorithm
One refinement to the k-Nearest Neighbor algorithm is to weight the contribution of each of the k neighbors according to its distance to the query point xq, giving greater weight to closer neighbors.
For example, in the k-Nearest Neighbor algorithm for discrete-valued target functions, we might weight the vote of each neighbor according to the inverse square of its distance from xq:
wi = 1 / d(xq, xi)²
Distance-Weighted Nearest Neighbor Algorithm for approximating a discrete-valued target function:
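A minimal sketch of this distance-weighted variant with inverse-square weights; when the query point coincides exactly with a training instance, that instance's label is returned directly to avoid division by zero. The names used here are illustrative assumptions.

import math
from collections import defaultdict

def euclidean(xi, xj):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(xi, xj)))

def weighted_knn_classify(training, xq, k=5):
    """Distance-weighted k-NN for a discrete-valued target: each of the k
    nearest neighbors votes with weight w_i = 1 / d(xq, xi)^2."""
    nearest = sorted(training, key=lambda ex: euclidean(ex[0], xq))[:k]
    votes = defaultdict(float)
    for xi, label in nearest:
        d = euclidean(xi, xq)
        if d == 0.0:                      # query coincides with a training point
            return label
        votes[label] += 1.0 / d ** 2      # closer neighbors get larger votes
    return max(votes, key=votes.get)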
Locally Weighted Regression:
The phrase "locally weighted regression" is used because the method is local (the function is approximated based only on data near the query point), weighted (the contribution of each training example is weighted by its distance from the query point), and a regression (this is the term used widely in the statistical learning community for the problem of approximating real-valued functions).
Given a new query instance xq, the general approach in locally weighted regression is to construct an
approximation 𝑓̂ that fits the training examples in the neighborhood surrounding xq. This approximation is
then used to calculate the value 𝑓̂(xq), which is output as the estimated target value for the query
instance.
Consider locally weighted regression in which the target function f is approximated near xq using a linear function of the form
f̂(x) = w0 + w1 a1(x) + ... + wn an(x)
where ai(x) denotes the value of the ith attribute of the instance x.
Gradient descent can be used (as for training linear units) to choose weights that minimize the squared error summed over the set D of training examples,
E ≡ ½ Σ x∈D ( f(x) − f̂(x) )²
which yields the global training rule
Δwj = η Σ x∈D ( f(x) − f̂(x) ) aj(x)      ... (3)
where η is a small constant learning rate.
We need to modify this procedure to derive a local approximation rather than a global one. The simple way is to redefine the error criterion E to emphasize fitting the local training examples. Three possible criteria are given below.
1. Minimize the squared error over just the k training examples nearest to xq:
E1(xq) ≡ ½ Σ x ∈ k nearest nbrs of xq ( f(x) − f̂(x) )²
2. Minimize the squared error over the entire set D of training examples, while weighting the error of each training example by some decreasing function K of its distance from xq:
E2(xq) ≡ ½ Σ x∈D ( f(x) − f̂(x) )² K(d(xq, x))
3. Combine 1 and 2:
E3(xq) ≡ ½ Σ x ∈ k nearest nbrs of xq ( f(x) − f̂(x) )² K(d(xq, x))
If we choose criterion three and re-derive the gradient descent rule, we obtain the following training rule:
Δwj = η Σ x ∈ k nearest nbrs of xq K(d(xq, x)) ( f(x) − f̂(x) ) aj(x)
The differences between this new rule and the rule given by Equation (3) are that the contribution of instance x to the weight update is now multiplied by the distance penalty K(d(xq, x)), and that the error is summed over only the k nearest training examples.
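A minimal sketch of this local training rule, assuming a Gaussian kernel for K and a linear hypothesis with a bias term; the kernel width, learning rate, and number of gradient steps are illustrative assumptions.

import numpy as np

def lwr_predict(X, y, xq, k=8, tau=1.0, eta=0.01, steps=2000):
    """Locally weighted linear regression at query point xq: fit w by gradient
    descent on the kernel-weighted squared error over the k nearest examples."""
    Xb = np.hstack([np.ones((len(X), 1)), X])           # prepend bias term a0(x) = 1
    xqb = np.concatenate([[1.0], xq])
    dists = np.linalg.norm(X - xq, axis=1)
    idx = np.argsort(dists)[:k]                         # k nearest neighbors of xq
    K = np.exp(-dists[idx] ** 2 / (2 * tau ** 2))       # distance penalty K(d(xq, x))
    w = np.zeros(Xb.shape[1])
    for _ in range(steps):
        err = y[idx] - Xb[idx] @ w                      # f(x) - f_hat(x)
        w += eta * (K * err) @ Xb[idx]                  # delta w_j = eta * sum K * err * a_j(x)
    return xqb @ w

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(60, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.05, 60)
print(lwr_predict(X, y, np.array([1.0])))               # roughly sin(1), about 0.84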
11. Discuss the learning tasks and Q learning in the context of reinforcement learning.
(OR) Explain Q learning. *
The Q Function
The value of the evaluation function Q(s, a) is the reward received immediately upon executing action a from state s, plus the value (discounted by γ) of following the optimal policy thereafter:
Q(s, a) ≡ r(s, a) + γ V*(δ(s, a))
where δ(s, a) denotes the state that results from applying action a in state s. Because V*(s') = max a' Q(s', a'), this can be rewritten recursively as Q(s, a) = r(s, a) + γ max a' Q(δ(s, a), a'), and the Q learning algorithm repeatedly updates its table entry for each observed transition using
Q̂(s, a) ← r + γ max a' Q̂(s', a')
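A minimal sketch of the tabular Q learning update on a toy deterministic corridor environment; the environment, reward values, and purely random exploration used here are illustrative assumptions.

import random
from collections import defaultdict

# Tabular Q learning on a toy 1-D corridor: states 0..4, actions -1/+1,
# reward 100 only on entering the goal state 4 (deterministic world).
GOAL, GAMMA = 4, 0.9
Q = defaultdict(float)                       # Q-hat table, initialized to 0

def step(s, a):
    s2 = min(max(s + a, 0), GOAL)
    return s2, (100.0 if s2 == GOAL else 0.0)

for episode in range(200):
    s = 0
    while s != GOAL:
        a = random.choice([-1, 1])           # purely random exploration
        s2, r = step(s, a)
        # Q learning rule: Q(s,a) <- r + gamma * max_a' Q(s', a')
        Q[(s, a)] = r + GAMMA * max(Q[(s2, -1)], Q[(s2, 1)])
        s = s2

print([round(max(Q[(s, -1)], Q[(s, 1)]), 1) for s in range(5)])
# converges to [72.9, 81.0, 90.0, 100.0, 0.0]: the gamma-discounted goal reward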
Minimum Description Length (MDL) Principle:
−log2 P(h) is the description length of h under the optimal encoding for the hypothesis space H: LCH(h) = −log2 P(h), where CH is the optimal code for hypothesis space H.
−log2 P(D|h) is the description length of the training data D given hypothesis h, under its optimal encoding: LCD|h(D|h) = −log2 P(D|h), where CD|h is the optimal code for describing data D assuming that both the sender and receiver know the hypothesis h.
Rewriting Equation (1) shows that hMAP is the hypothesis h that minimizes the sum given by the description length of the hypothesis plus the description length of the data given the hypothesis:
hMAP = argmax h∈H P(D|h) P(h)
     = argmax h∈H [ log2 P(D|h) + log2 P(h) ]
     = argmin h∈H [ −log2 P(D|h) − log2 P(h) ]
     = argmin h∈H [ LCD|h(D|h) + LCH(h) ]
where CH and CD|h are the optimal encodings for H and for D given h, respectively.
The Minimum Description Length (MDL) principle recommends choosing the hypothesis that minimizes the sum of these two description lengths:
hMDL = argmin h∈H [ LC1(h) + LC2(D|h) ]
where codes C1 and C2 are used to represent the hypothesis and the data given the hypothesis, respectively.
The above analysis shows that if we choose C1 to be the optimal encoding of hypotheses, CH, and if we choose C2 to be the optimal encoding CD|h, then hMDL = hMAP.
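A minimal sketch of this equivalence: given made-up prior and likelihood values for a few candidate hypotheses, the hypothesis with the smallest total description length is the same one that maximizes P(D|h) P(h).

import math

# Illustrative (made-up) priors P(h) and likelihoods P(D|h) for three hypotheses.
candidates = {
    'h1': (0.5, 0.01),
    'h2': (0.3, 0.10),
    'h3': (0.2, 0.08),
}

def description_length(p_h, p_d_given_h):
    return -math.log2(p_h) - math.log2(p_d_given_h)   # L_CH(h) + L_CD|h(D|h)

h_mdl = min(candidates, key=lambda h: description_length(*candidates[h]))
h_map = max(candidates, key=lambda h: candidates[h][0] * candidates[h][1])
print(h_mdl, h_map)    # the same hypothesis ('h2') under both criteria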
The learning tasks in reinforcement learning differ from other function approximation problems in several respects:
1. Delayed reward: The task of the agent is to learn a target function π that maps from the current state s to the optimal action a = π(s). In reinforcement learning, training examples of the form ⟨s, π(s)⟩ are not available. Instead, the trainer provides only a sequence of immediate reward values as the agent executes its sequence of actions. The agent, therefore, faces the problem of temporal credit assignment: determining which of the actions in its sequence are to be credited with producing the eventual rewards.
2. Exploration: In reinforcement learning, the agent influences the distribution of training examples by the action sequence it chooses. This raises the question of which experimentation strategy produces the most effective learning. The learner faces a trade-off in choosing whether to favor exploration of unknown states and actions, or exploitation of states and actions that it has already learned will yield high reward.
3. Partially observable states: Although the agent's sensors may be able to perceive the entire state of the environment at each time step, in many practical situations they provide only partial information. In such cases, the agent needs to consider its previous observations together with its current sensor data when choosing actions, and the best policy may be one that chooses actions specifically to improve the observability of the environment.
4. Life-long learning: The robot may have to learn several related tasks, often within the same environment and using the same sensors. For example, a mobile robot may need to learn how to dock on its battery charger, how to navigate through narrow corridors, and how to pick up output from laser printers.
14. Explain case-based reasoning with an example.
Case-based reasoning:
Case-based reasoning (CBR) is a learning paradigm based on lazy learning methods: it classifies new query instances by analyzing similar stored instances while ignoring instances that are very different from the query.
In CBR, instances are not represented as real-valued points over Rn; instead, they are described using a rich symbolic representation.
CBR has been applied to problems such as conceptual design of mechanical devices based on a stored
library of previous designs, reasoning about new legal cases based on previous rulings, and solving
planning and scheduling problems by reusing and combining portions of previous solutions to similar
problems
The CADET system employs case-based reasoning to assist in the conceptual design of simple
mechanical devices such as water faucets.
It uses a library containing approximately 75 previous designs and design fragments to suggest
conceptual designs to meet the specifications of new design problems.
Each instance stored in memory (e.g., a water pipe) is represented by describing both its structure and its
qualitative function.
New design problems are then presented by specifying the desired function and requesting the
corresponding structure.
The problem setting is illustrated in the figure below.
The function is represented in terms of the qualitative relationships among the water- flow levels and
temperatures at its inputs and outputs.
In the functional description, an arrow with a "+" label indicates that the variable at the arrowhead
increases with the variable at its tail. A "-" label indicates that the variable at the head decreases with the
variable at the tail.
Here Qc refers to the flow of cold water into the faucet, Qh to the input flow of hot water, and Qm to the
single mixed flow out of the faucet.
Tc, Th, and Tm refer to the temperatures of the cold water, hot water, and mixed water respectively.
The variable Ct denotes the control signal for temperature that is input to the faucet, and Cf denotes the control signal for water flow.
The controls Ct and Cf influence the water flows Qc and Qh, thereby indirectly influencing the faucet output flow Qm and temperature Tm.
15. Write short notes on radial basis functions.*****
One approach to function approximation that is closely related to distance-weighted regression and also to artificial neural networks is learning with radial basis functions.
In this approach, the learned hypothesis is a function of the form
f̂(x) = w0 + Σ u=1..k wu Ku(d(xu, x))
where each xu is an instance from X and the kernel function Ku(d(xu, x)) is defined so that it decreases as the distance d(xu, x) increases.
Here k is a user-provided constant that specifies the number of kernel functions to be included.
Even though f̂(x) is a global approximation to f(x), the contribution from each of the Ku(d(xu, x)) terms is localized to a region near the point xu.
It is common to choose each kernel function Ku(d(xu, x)) to be a Gaussian function centred at the point xu with some variance σu²:
Ku(d(xu, x)) = exp( −d²(xu, x) / (2σu²) )
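A minimal sketch of such an RBF approximator, assuming Gaussian kernels centred on k randomly chosen training instances and output weights fit by linear least squares; the centre-selection scheme, kernel width, and data are illustrative assumptions (centres are often chosen by clustering instead).

import numpy as np

def rbf_fit(X, y, k=10, sigma=0.5, seed=0):
    """Fit f_hat(x) = w0 + sum_u w_u * exp(-d(x_u, x)^2 / (2 sigma^2)).
    Centres x_u are k randomly chosen training instances; the weights w
    are obtained by linear least squares on the kernel activations."""
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), size=k, replace=False)]
    def design(Xq):
        d2 = ((Xq[:, None, :] - centres[None, :, :]) ** 2).sum(-1)
        phi = np.exp(-d2 / (2 * sigma ** 2))
        return np.hstack([np.ones((len(Xq), 1)), phi])    # bias column for w0
    w, *_ = np.linalg.lstsq(design(X), y, rcond=None)
    return lambda Xq: design(Xq) @ w

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(80, 1))
y = np.sin(X[:, 0])
f_hat = rbf_fit(X, y)
print(f_hat(np.array([[1.0]])))   # close to sin(1), about 0.84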
16. Use the naive Bayes classifier and the training data from this table to classify the
following novel instance:
< Outlook = sunny, Temperature = cool, Humidity = high, Wind = strong > *****
Day Outlook Temperature Humidity Wind PlayTennis
D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rain Mild High Weak Yes
D5 Rain Cool Normal Weak Yes
D6 Rain Cool Normal Strong No
D7 Overcast Cool Normal Strong Yes
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes
D10 Rain Mild Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
D14 Rain Mild High Strong No
vNB = argmax vj∈{yes, no} P(vj) P(Outlook=sunny|vj) P(Temperature=cool|vj) P(Humidity=high|vj) P(Wind=strong|vj)
1) P(PlayTennis=yes)=9/14=0.64
P(playTennis=No)=5/14=0.36
2) P(outlook=sunny| PlayTennis=yes)=2/9=0.22
P(outlook=sunny| PlayTennis=No)=3/5=0.60
3) P(Temperature=cool| PlayTennis=yes)=3/9=0.33
P(Temperature=cool | PlayTennis=No)=1/5=0.2
4) P(Humidity=High| PlayTennis=yes)=3/9=0.33
P(Humidity=High | PlayTennis=No)=4/5=0.8
5) P(Wind=strong| PlayTennis=yes)=3/9=0.33
P(Wind=strong | PlayTennis=No)=3/5=0.60
For vj = yes: P(yes) P(sunny|yes) P(cool|yes) P(high|yes) P(strong|yes) = 0.64 * 0.22 * 0.33 * 0.33 * 0.33 = 0.0051
For vj = no: P(no) P(sunny|no) P(cool|no) P(high|no) P(strong|no) = 0.36 * 0.60 * 0.20 * 0.80 * 0.60 = 0.0207
Since 0.0207 > 0.0051, the naive Bayes classifier outputs vNB = no, i.e., PlayTennis = No for the new instance.
Normalizing, the conditional probability that PlayTennis = No given the observed attribute values is 0.0207 / (0.0207 + 0.0051) ≈ 0.80 (≈ 0.795 if the individual probabilities are not rounded).
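The sketch below recomputes these scores directly from the table above; the list-of-tuples encoding of the table is an illustrative choice.

# Recompute the naive Bayes scores for the novel instance from the table above.
data = [
    ('Sunny','Hot','High','Weak','No'),       ('Sunny','Hot','High','Strong','No'),
    ('Overcast','Hot','High','Weak','Yes'),   ('Rain','Mild','High','Weak','Yes'),
    ('Rain','Cool','Normal','Weak','Yes'),    ('Rain','Cool','Normal','Strong','No'),
    ('Overcast','Cool','Normal','Strong','Yes'), ('Sunny','Mild','High','Weak','No'),
    ('Sunny','Cool','Normal','Weak','Yes'),   ('Rain','Mild','Normal','Weak','Yes'),
    ('Sunny','Mild','Normal','Strong','Yes'), ('Overcast','Mild','High','Strong','Yes'),
    ('Overcast','Hot','Normal','Weak','Yes'), ('Rain','Mild','High','Strong','No'),
]
query = ('Sunny', 'Cool', 'High', 'Strong')

scores = {}
for v in ('Yes', 'No'):
    rows = [r for r in data if r[4] == v]
    score = len(rows) / len(data)                      # P(v_j)
    for i, a in enumerate(query):                      # product of P(a_i | v_j)
        score *= sum(r[i] == a for r in rows) / len(rows)
    scores[v] = score

print(scores)                                          # {'Yes': ~0.0053, 'No': ~0.0206}
print(max(scores, key=scores.get))                     # 'No'
print(scores['No'] / sum(scores.values()))             # ~0.795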