AI&ML Question Bank with Answers-3rd IA

1. List and explain features of Bayesian learning methods.*****


- Each observed training example can incrementally decrease or increase the estimated probability that a hypothesis is correct. This provides a more flexible approach to learning than algorithms that completely eliminate a hypothesis if it is found to be inconsistent with any single example.
- Prior knowledge can be combined with observed data to determine the final probability of a hypothesis. In Bayesian learning, prior knowledge is provided by asserting (1) a prior probability for each candidate hypothesis, and (2) a probability distribution over the observed data for each possible hypothesis.
- Bayesian methods can accommodate hypotheses that make probabilistic predictions.
- New instances can be classified by combining the predictions of multiple hypotheses, weighted by their probabilities.
- Even in cases where Bayesian methods prove computationally intractable, they can provide a standard of optimal decision making against which other practical methods can be measured.
2. Explain Bayes theorem and the Maximum a Posteriori (MAP) hypothesis.*****

Bayes theorem is the cornerstone of Bayesian learning methods because it provides a way to calculate the posterior probability P(h|D) from the prior probability P(h), together with P(D) and P(D|h):

P(h|D) = P(D|h) P(h) / P(D)

- P(h): prior probability of hypothesis h, before the data are observed
- P(D): prior probability of D, the probability that D will be observed
- P(D|h): probability of observing D given a world in which h holds
- P(h|D): posterior probability of h, the probability that h holds given the observed data D

Maximum a Posteriori (MAP) Hypothesis:

- The learner considers some set of candidate hypotheses H and is interested in finding the most probable hypothesis h ∈ H given the observed data D. Any such maximally probable hypothesis is called a maximum a posteriori (MAP) hypothesis.
- Using Bayes theorem to calculate the posterior probability of each candidate hypothesis, hMAP is a MAP hypothesis provided

hMAP = argmax_{h ∈ H} P(h|D) = argmax_{h ∈ H} P(D|h) P(h) / P(D) = argmax_{h ∈ H} P(D|h) P(h)

where P(D) is dropped in the final step because it is a constant independent of h.
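As a quick illustration (a minimal Python sketch; the hypotheses, priors, and likelihoods are invented for this example), the MAP hypothesis is found by maximizing P(D|h) P(h) over the candidates:

# Minimal MAP sketch with hypothetical priors P(h) and likelihoods P(D|h)
# for three candidate hypotheses (all values are illustrative only).
priors = {"h1": 0.3, "h2": 0.5, "h3": 0.2}
likelihoods = {"h1": 0.10, "h2": 0.05, "h3": 0.40}  # P(D|h)

# h_MAP = argmax_h P(D|h) P(h); P(D) is omitted since it does not depend on h
h_map = max(priors, key=lambda h: likelihoods[h] * priors[h])
print(h_map)  # "h3": 0.40 * 0.2 = 0.08 beats 0.05 * 0.5 = 0.025 and 0.10 * 0.3 = 0.03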


3. Describe Brute-Force MAP learning algorithm.*****
Brute-Force Bayes Concept Learning:

Consider the following concept learning problem:

- Assume the learner considers some finite hypothesis space H defined over the instance space X, in which the task is to learn some target concept c : X → {0, 1}.
- The learner is given some sequence of training examples ((x1, d1) . . . (xm, dm)), where xi is some instance from X and di is the target value of xi (i.e., di = c(xi)).
- The sequence of target values is written as D = (d1 . . . dm).

Brute-Force MAP Learning algorithm:

1. For each hypothesis h in H, calculate the posterior probability

   P(h|D) = P(D|h) P(h) / P(D)

2. Output the hypothesis hMAP with the highest posterior probability

   hMAP = argmax_{h ∈ H} P(h|D)

In order to specify a learning problem for the Brute-Force MAP Learning algorithm, we must specify what values are to be used for P(h) and for P(D|h).

Let us choose P(h) and P(D|h) to be consistent with the following assumptions:
- The training data D is noise free (i.e., di = c(xi)).
- The target concept c is contained in the hypothesis space H.
- We have no a priori reason to believe that any hypothesis is more probable than any other.

Given these assumptions, it is reasonable to assign the same prior probability to every hypothesis:

P(h) = 1/|H| for all h ∈ H

- P(D|h) is the probability of observing the target values D = (d1 . . . dm) for the fixed set of instances (x1 . . . xm), given a world in which hypothesis h holds.
- Since we assume noise-free training data, the probability of observing classification di given h is just 1 if di = h(xi) and 0 if di ≠ h(xi). Therefore,

P(D|h) = 1 if di = h(xi) for all di in D, and P(D|h) = 0 otherwise

Given these choices for P(h) and for P(D|h), we now have a fully defined problem for the above Brute-Force MAP Learning algorithm. Recalling Bayes theorem, we have

P(h|D) = P(D|h) P(h) / P(D)

Consider first the case where h is inconsistent with the training data D. Since P(D|h) = 0,

P(h|D) = 0 · P(h) / P(D) = 0

The posterior probability of a hypothesis inconsistent with D is zero.

Now consider the case where h is consistent with D. Since P(D|h) = 1 and P(D) = |VS_H,D| / |H|,

P(h|D) = (1 · 1/|H|) / (|VS_H,D| / |H|) = 1 / |VS_H,D|

where VS_H,D is the version space: the subset of hypotheses from H that are consistent with D.

To summarize, Bayes theorem implies that the posterior probability P(h|D) under our assumed P(h) and P(D|h) is

P(h|D) = 1/|VS_H,D| if h is consistent with D, and P(h|D) = 0 otherwise
4. Explain maximum likelihood and least-squared error hypothesis.

- Learner L considers an instance space X and a hypothesis space H consisting of some class of real-valued functions defined over X, i.e., (∀h ∈ H)[h : X → R].
- The problem faced by L is to learn an unknown target function f : X → R.
- A set of m training examples is provided, each a pair of the form (xi, di), where the target value is corrupted by random noise: di = f(xi) + ei.
  - Here f(xi) is the noise-free value of the target function and ei is a random variable representing the noise.
  - It is assumed that the values of the ei are drawn independently and that they are distributed according to a Normal distribution with zero mean.
- The task of the learner is to output a maximum likelihood hypothesis, or equivalently a MAP hypothesis assuming all hypotheses are equally probable a priori.

Using the definition of hML we have

hML = argmax_{h ∈ H} p(D|h)

Assuming the training examples are mutually independent given h, we can write P(D|h) as the product of the various p(di|h):

hML = argmax_{h ∈ H} ∏_{i=1..m} p(di|h)

Given that the noise ei obeys a Normal distribution with zero mean and unknown variance σ², each di must also obey a Normal distribution with variance σ² centred around the true target value f(xi). Because we are writing the expression for P(D|h), we assume h is the correct description of f. Hence, µ = f(xi) = h(xi), and

hML = argmax_{h ∈ H} ∏_{i=1..m} (1/√(2πσ²)) e^(−(di − h(xi))² / (2σ²))

We maximize the less complicated logarithm instead, which is justified because ln p is a monotonic function of p:

hML = argmax_{h ∈ H} Σ_{i=1..m} [ ln(1/√(2πσ²)) − (di − h(xi))² / (2σ²) ]

The first term in this expression is a constant independent of h and can therefore be discarded, yielding

hML = argmax_{h ∈ H} Σ_{i=1..m} −(di − h(xi))² / (2σ²)

Maximizing this negative quantity is equivalent to minimizing the corresponding positive quantity:

hML = argmin_{h ∈ H} Σ_{i=1..m} (di − h(xi))² / (2σ²)

Finally, we can discard constants that are independent of h:

hML = argmin_{h ∈ H} Σ_{i=1..m} (di − h(xi))²

Thus, the above equation shows that the maximum likelihood hypothesis hML is the one that minimizes the sum of the squared errors between the observed training values di and the hypothesis predictions h(xi).
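As a small numerical illustration (a sketch on synthetic data, assuming the linear setting above): fitting a line by minimizing the sum of squared errors yields the maximum likelihood hypothesis under zero-mean Gaussian noise.

import random

# Synthetic data for illustration: f(x) = 2x + 1 corrupted by
# zero-mean Gaussian noise, matching the setting above.
random.seed(0)
xs = [i / 10 for i in range(50)]
ds = [2 * x + 1 + random.gauss(0, 0.3) for x in xs]

# Closed-form least-squares fit of h(x) = w0 + w1*x; under Gaussian
# noise this is exactly the maximum likelihood hypothesis h_ML.
n = len(xs)
mean_x = sum(xs) / n
mean_d = sum(ds) / n
w1 = sum((x - mean_x) * (d - mean_d) for x, d in zip(xs, ds)) / \
     sum((x - mean_x) ** 2 for x in xs)
w0 = mean_d - w1 * mean_x
print(w0, w1)  # close to the true parameters 1 and 2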

5. Discuss the Naïve bayes classifier.


The naive Bayes classifier applies to learning tasks where each instance x is described by a conjunction of attribute values and where the target function f(x) can take on any value from some finite set V.

A set of training examples of the target function is provided, and a new instance is presented, described by the tuple of attribute values (a1, a2 . . . am). The learner is asked to predict the target value, or classification, for this new instance.

The Bayesian approach to classifying the new instance is to assign the most probable target value, vMAP, given the attribute values (a1, a2 . . . am) that describe the instance:

vMAP = argmax_{vj ∈ V} P(vj | a1, a2 . . . am)

Using Bayes theorem, this expression can be rewritten as

vMAP = argmax_{vj ∈ V} P(a1, a2 . . . am | vj) P(vj) / P(a1, a2 . . . am)
     = argmax_{vj ∈ V} P(a1, a2 . . . am | vj) P(vj)    . . . (1)

The naive Bayes classifier is based on the assumption that the attribute values are conditionally independent given the target value. That is, given the target value of the instance, the probability of observing the conjunction (a1, a2 . . . am) is just the product of the probabilities for the individual attributes:

P(a1, a2 . . . am | vj) = ∏_i P(ai | vj)

Substituting this into Equation (1) gives the naive Bayes classifier:

vNB = argmax_{vj ∈ V} P(vj) ∏_i P(ai | vj)

where vNB denotes the target value output by the naive Bayes classifier.

6. The following table gives a dataset about stolen vehicles. Using the naïve Bayes classifier, classify the
new instance (Red, SUV, Domestic).*****

Carno Color Type Origin Stolen


1 Red Sports Domestic Yes
2 Red Sports Domestic No
3 Red Sports Domestic Yes
4 Yellow Sports Domestic No
5 Yellow Sports Imported Yes
6 Yellow SUV Imported No
7 Yellow SUV Imported Yes
8 Yellow SUV Domestic No
9 Red SUV Imported No
10 Red Sports Imported Yes
Solution:
1) Target class: Stolen (Yes, No)
   P(Stolen=Yes) = (count of Yes) / (total examples) = 5/10 = 0.5
   P(Stolen=No) = (count of No) / (total examples) = 5/10 = 0.5
2) Probabilities of Color=Red given each target class:
   P(Color=Red | Stolen=Yes) = 3/5 = 0.6
   P(Color=Red | Stolen=No) = 2/5 = 0.4
3) Probabilities of Type=SUV given each target class:
   P(Type=SUV | Stolen=Yes) = 1/5 = 0.2
   P(Type=SUV | Stolen=No) = 3/5 = 0.6
4) Probabilities of Origin=Domestic given each target class:
   P(Origin=Domestic | Stolen=Yes) = 2/5 = 0.4
   P(Origin=Domestic | Stolen=No) = 3/5 = 0.6

Applying the naïve Bayes classifier:

VNB = argmax_{vj ∈ {Yes, No}} P(vj) P(Color=Red|vj) P(Type=SUV|vj) P(Origin=Domestic|vj)

vj = Yes:
P(Yes) P(Color=Red|Stolen=Yes) P(Type=SUV|Stolen=Yes) P(Origin=Domestic|Stolen=Yes)
= 0.5 * 0.6 * 0.2 * 0.4 = 0.024

vj = No:
P(No) P(Color=Red|Stolen=No) P(Type=SUV|Stolen=No) P(Origin=Domestic|Stolen=No)
= 0.5 * 0.4 * 0.6 * 0.6 = 0.072

The naïve Bayes classifier assigns the target value with the maximum score, Stolen = No (0.072 > 0.024), so the new instance (Red, SUV, Domestic) is classified as No (not stolen).

Normalizing, the conditional probability of this target value is P(No) / (P(No) + P(Yes)) = 0.072 / (0.072 + 0.024) = 0.75.
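The same computation can be scripted directly from the table; the following minimal sketch (not part of the original answer) recovers the numbers above by counting:

# Naive Bayes by direct counting over the stolen-vehicle table above.
data = [  # (Color, Type, Origin, Stolen)
    ("Red", "Sports", "Domestic", "Yes"), ("Red", "Sports", "Domestic", "No"),
    ("Red", "Sports", "Domestic", "Yes"), ("Yellow", "Sports", "Domestic", "No"),
    ("Yellow", "Sports", "Imported", "Yes"), ("Yellow", "SUV", "Imported", "No"),
    ("Yellow", "SUV", "Imported", "Yes"), ("Yellow", "SUV", "Domestic", "No"),
    ("Red", "SUV", "Imported", "No"), ("Red", "Sports", "Imported", "Yes"),
]
query = ("Red", "SUV", "Domestic")

scores = {}
for label in ("Yes", "No"):
    rows = [r for r in data if r[3] == label]
    score = len(rows) / len(data)            # P(v_j)
    for i, value in enumerate(query):        # times P(a_i | v_j)
        score *= sum(r[i] == value for r in rows) / len(rows)
    scores[label] = score

print(scores)                                # ≈ {'Yes': 0.024, 'No': 0.072}
print(max(scores, key=scores.get))           # 'No' -> not stolen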

7. Explain EM algorithm.*****
Estimating Means of k Gaussians

- Each instance is generated using a two-step process. First, one of the k Normal distributions is selected at random. Second, a single random instance xi is generated according to this selected distribution. This process is repeated to generate a set of data points as shown in the figure.
- We would like to find a maximum likelihood hypothesis for these means; that is, a hypothesis h = (µ1 . . . µk) that maximizes p(D|h).
- When only a single distribution is involved, the maximum likelihood hypothesis is the one that minimizes the sum of squared errors, and this sum is minimized by the sample mean:

µML = argmin_µ Σ_{i=1..m} (xi − µ)² = (1/m) Σ_{i=1..m} xi

- With k > 1 distributions, we cannot observe which distribution generated each instance; this membership is described by hidden variables zij. The EM algorithm searches for a maximum likelihood hypothesis by repeatedly iterating the following two steps until it converges.

EM algorithm:

Step 1 (Estimation): Calculate the expected value E[zij] of each hidden variable zij, assuming the current hypothesis h = (µ1 . . . µk) holds:

E[zij] = e^(−(xi − µj)² / 2σ²) / Σ_{n=1..k} e^(−(xi − µn)² / 2σ²)

Step 2 (Maximization): Calculate a new maximum likelihood hypothesis h' = (µ1' . . . µk'), assuming each zij takes its expected value E[zij] from Step 1. Replace each µj by

µj ← Σ_{i=1..m} E[zij] xi / Σ_{i=1..m} E[zij]

Then replace the hypothesis h by h' and iterate.
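A minimal sketch of this EM procedure for k = 2 Gaussians with known, equal variance (the data are synthetic and the iteration count is an arbitrary choice):

import math, random

# EM sketch for estimating the means of k = 2 Gaussians with known,
# equal variance sigma, matching the setting above.
random.seed(1)
sigma = 1.0
data = [random.gauss(0, sigma) for _ in range(100)] + \
       [random.gauss(5, sigma) for _ in range(100)]

mu = [min(data), max(data)]                  # initial hypothesis (mu1, mu2)
for _ in range(50):
    # Step 1 (Estimation): E[z_ij] for each instance and each Gaussian
    resp = []
    for x in data:
        w = [math.exp(-(x - m) ** 2 / (2 * sigma ** 2)) for m in mu]
        resp.append([wi / sum(w) for wi in w])
    # Step 2 (Maximization): each mean becomes the E[z_ij]-weighted average
    mu = [sum(r[j] * x for r, x in zip(resp, data)) /
          sum(r[j] for r in resp)
          for j in range(2)]

print(mu)  # approximately [0, 5]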
8. Explain Bayesian Belief Networks.*****

A Bayesian belief network describes the joint probability distribution for a set of variables.
Conditional Independence:

Let X, Y, and Z be three discrete-valued random variables. X is conditionally independent of Y given Z if the probability distribution governing X is independent of the value of Y given a value for Z; that is, if

(∀xi, yj, zk) P(X = xi | Y = yj, Z = zk) = P(X = xi | Z = zk)

where xi, yj, and zk range over the possible values of X, Y, and Z. The above expression is written in abbreviated form as

P(X | Y, Z) = P(X | Z)

Conditional independence can be extended to sets of variables. The set of variables X1 . . . Xl is conditionally independent of the set of variables Y1 . . . Ym given the set of variables Z1 . . . Zn if

P(X1 . . . Xl | Y1 . . . Ym, Z1 . . . Zn) = P(X1 . . . Xl | Z1 . . . Zn)

The naive Bayes classifier assumes that the instance attribute A1 is conditionally independent of instance attribute A2 given the target value V. This allows the naive Bayes classifier to calculate P(A1, A2 | V) as follows:

P(A1, A2 | V) = P(A1 | A2, V) P(A2 | V) = P(A1 | V) P(A2 | V)
Representation:

A Bayesian belief network represents the joint probability distribution for a set of variables. Bayesian networks (BN) are represented by directed acyclic graphs.

The Bayesian network in the figure represents the joint probability distribution over the boolean variables Storm, Lightning, Thunder, ForestFire, Campfire, and BusTourGroup.

A Bayesian network represents the joint probability distribution by specifying a set of conditional independence assumptions:

- A BN is represented by a directed acyclic graph, together with sets of local conditional probabilities.
- Each variable in the joint space is represented by a node in the Bayesian network.
- The network arcs represent the assertion that each variable is conditionally independent of its non-descendants in the network given its immediate predecessors in the network.
- A conditional probability table (CPT) is given for each variable, describing the probability distribution for that variable given the values of its immediate predecessors.
- The joint probability for any desired assignment of values (y1, . . . , yn) to the tuple of network variables (Y1 . . . Yn) can be computed by the formula

  P(y1, . . . , yn) = ∏_{i=1..n} P(yi | Parents(Yi))

  where Parents(Yi) denotes the set of immediate predecessors of Yi in the network.


Example:

- Consider the node Campfire. The network nodes and arcs represent the assertion that Campfire is conditionally independent of its non-descendants Lightning and Thunder, given its immediate parents Storm and BusTourGroup.
- This means that once we know the values of the variables Storm and BusTourGroup, the variables Lightning and Thunder provide no additional information about Campfire.
- One assertion from the conditional probability table associated with the variable Campfire is

P(Campfire = True | Storm = True, BusTourGroup = True) = 0.4
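A small sketch of the joint-probability formula on the Storm → Campfire ← BusTourGroup fragment of this network. Only the CPT entry P(Campfire=True | Storm=True, BusTourGroup=True) = 0.4 comes from the text; the priors and remaining CPT entries are invented for illustration:

# Joint-probability sketch for a fragment of the network above:
# Storm -> Campfire <- BusTourGroup.
p_storm = {True: 0.2, False: 0.8}                  # assumed prior P(Storm)
p_bus = {True: 0.5, False: 0.5}                    # assumed prior P(BusTourGroup)
p_campfire = {                                     # P(Campfire=True | S, B)
    (True, True): 0.4,                             # from the CPT in the text
    (True, False): 0.1, (False, True): 0.8, (False, False): 0.2,  # assumed
}

def joint(s, b, c):
    # P(S=s, B=b, C=c) = P(s) * P(b) * P(c | s, b), per the BN formula above
    pc = p_campfire[(s, b)]
    return p_storm[s] * p_bus[b] * (pc if c else 1 - pc)

print(joint(True, True, True))   # 0.2 * 0.5 * 0.4 = 0.04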

9. Describe the K-Nearest neighbor algorithm.*****

- The most basic instance-based method is k-Nearest Neighbor learning. This algorithm assumes all instances correspond to points in the n-dimensional space R^n.
- The nearest neighbors of an instance are defined in terms of the standard Euclidean distance.
- Let an arbitrary instance x be described by the feature vector

  (a1(x), a2(x), . . . , an(x))

  where ar(x) denotes the value of the rth attribute of instance x.
- Then the distance between two instances xi and xj is defined to be

  d(xi, xj) = √( Σ_{r=1..n} (ar(xi) − ar(xj))² )

- In nearest-neighbor learning the target function may be either discrete-valued or real-valued.

Let us first consider learning discrete-valued target functions of the form f : R^n → V, where V is the finite set {v1, . . . , vs}.

The k-Nearest Neighbor algorithm for approximating a discrete-valued target function is given below:

Training algorithm:
- For each training example (x, f(x)), add the example to the list training_examples.

Classification algorithm:
- Given a query instance xq to be classified:
  - Let x1 . . . xk denote the k instances from training_examples that are nearest to xq.
  - Return f̂(xq) ← argmax_{v ∈ V} Σ_{i=1..k} δ(v, f(xi)), where δ(a, b) = 1 if a = b and 0 otherwise.
- The figure below illustrates the operation of the k-Nearest Neighbor algorithm for the case where the instances are points in a two-dimensional space and where the target function is Boolean valued.
- The positive and negative training examples are shown by "+" and "-" respectively. A query point xq is shown as well.
- The 1-Nearest Neighbor algorithm classifies xq as a positive example in this figure, whereas the 5-Nearest Neighbor algorithm classifies it as a negative example.
- The next figure shows the shape of the decision surface induced by 1-Nearest Neighbor over the entire instance space. The decision surface is a combination of convex polyhedra surrounding each of the training examples.
- For every training example, the polyhedron indicates the set of query points whose classification will be completely determined by that training example. Query points outside the polyhedron are closer to some other training example. This kind of diagram is often called the Voronoi diagram of the set of training examples.
The k-Nearest Neighbor algorithm for approximating a real-valued target function f : R^n → R is given below; the estimate is the mean of the k nearest training examples:

f̂(xq) ← Σ_{i=1..k} f(xi) / k

Distance-Weighted Nearest Neighbor Algorithm

- One refinement to the k-Nearest Neighbor algorithm is to weight the contribution of each of the k neighbors according to its distance to the query point xq, giving greater weight to closer neighbors.
- For example, in the k-Nearest Neighbor algorithm, which approximates discrete-valued target functions, we might weight the vote of each neighbor according to the inverse square of its distance from xq.

The distance-weighted Nearest Neighbor algorithm for approximating a discrete-valued target function is

f̂(xq) ← argmax_{v ∈ V} Σ_{i=1..k} wi δ(v, f(xi)), where wi = 1 / d(xq, xi)²

(if xq exactly matches some training instance xi so that d(xq, xi) = 0, f̂(xq) is assigned f(xi)).
10. Discuss locally weighted regression.*****

Locally weighted regression:

- Locally weighted regression is called local because the function is approximated based only on data near the query point, weighted because the contribution of each training example is weighted by its distance from the query point, and regression because this is the term used widely in the statistical learning community for the problem of approximating real-valued functions.
- Given a new query instance xq, the general approach in locally weighted regression is to construct an approximation f̂ that fits the training examples in the neighborhood surrounding xq. This approximation is then used to calculate the value f̂(xq), which is output as the estimated target value for the query instance.

Locally Weighted Linear Regression

- Consider locally weighted regression in which the target function f is approximated near xq using a linear function of the form

  f̂(x) = w0 + w1 a1(x) + · · · + wn an(x)

  where ai(x) denotes the value of the ith attribute of the instance x.
- Gradient descent can be used to choose weights that minimize the squared error summed over the set D of training examples,

  E = ½ Σ_{x ∈ D} (f(x) − f̂(x))²

  which leads to the gradient descent training rule

  Δwj = η Σ_{x ∈ D} (f(x) − f̂(x)) aj(x)    . . . (3)

  where η is a constant learning rate.

- We need to modify this procedure to derive a local approximation rather than a global one. The simple way is to redefine the error criterion E to emphasize fitting the local training examples. Three possible criteria are given below.

1. Minimize the squared error over just the k nearest neighbors:

   E1(xq) = ½ Σ_{x ∈ k nearest nbrs of xq} (f(x) − f̂(x))²

2. Minimize the squared error over the entire set D of training examples, while weighting the error of each training example by some decreasing function K of its distance from xq:

   E2(xq) = ½ Σ_{x ∈ D} (f(x) − f̂(x))² K(d(xq, x))

3. Combine 1 and 2:

   E3(xq) = ½ Σ_{x ∈ k nearest nbrs of xq} (f(x) − f̂(x))² K(d(xq, x))

If we choose criterion three and re-derive the gradient descent rule, we obtain the following training rule:

Δwj = η Σ_{x ∈ k nearest nbrs of xq} K(d(xq, x)) (f(x) − f̂(x)) aj(x)

The differences between this new rule and the rule given by Equation (3) are that the contribution of instance x to the weight update is now multiplied by the distance penalty K(d(xq, x)), and that the error is summed over only the k nearest training examples.
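A minimal sketch of locally weighted linear regression using criterion 2 with a Gaussian kernel on synthetic data (the kernel width, learning rate, and iteration count are arbitrary choices):

import math

# Locally weighted linear regression sketch: fit a line near the query
# point by gradient descent, weighting each example's error by K(d(xq, x)).
data = [(x / 10, math.sin(x / 10)) for x in range(-30, 31)]  # (x, f(x)) pairs

def kernel(dist, width=0.3):
    return math.exp(-dist ** 2 / (2 * width ** 2))

def lwr_predict(xq, eta=0.05, steps=2000):
    # Fit f̂(x) = w0 + w1 * (x - xq); the prediction at xq is then just w0.
    w0 = w1 = 0.0
    for _ in range(steps):
        g0 = g1 = 0.0
        for x, fx in data:
            k = kernel(abs(x - xq))
            err = fx - (w0 + w1 * (x - xq))
            g0 += k * err                  # gradient term for w0 (a0 = 1)
            g1 += k * err * (x - xq)       # gradient term for w1
        w0 += eta * g0 / len(data)
        w1 += eta * g1 / len(data)
    return w0

print(lwr_predict(1.0))  # ≈ 0.80; the local linear fit smooths sin(1.0) ≈ 0.84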

11. Discuss the learning tasks and Q learning in the context of reinforcement learning.
(OR) Explain Q learning. *
The Q Function

The value of the evaluation function Q(s, a) is the reward received immediately upon executing action a from state s, plus the value (discounted by γ) of following the optimal policy thereafter:

Q(s, a) ≡ r(s, a) + γ V*(δ(s, a))

where δ(s, a) denotes the state that results from applying action a in state s. We can rewrite the definition of the optimal policy in terms of Q(s, a) as

π*(s) = argmax_a Q(s, a)    . . . (5)

As Equation (5) makes clear, the agent need only consider each available action a in its current state s and choose the action that maximizes Q(s, a).

An Algorithm for Learning Q

- Learning the Q function corresponds to learning the optimal policy.
- The key problem is finding a reliable way to estimate training values for Q, given only a sequence of immediate rewards r spread out over time. This can be accomplished through iterative approximation. Noting that V*(s) = max_{a'} Q(s, a'), we can rewrite the definition of Q as

Q(s, a) = r(s, a) + γ max_{a'} Q(δ(s, a), a')

This recursive definition is the basis for the iterative Q learning rule: after each transition from state s to state s' under action a with immediate reward r, the learner updates its estimate Q̂ by

Q̂(s, a) ← r + γ max_{a'} Q̂(s', a')
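A minimal sketch of this deterministic Q learning rule on a tiny invented chain world (states 0 to 4, reward 100 on reaching the goal state):

import random

# Tabular Q-learning sketch on a deterministic chain world.
GOAL, GAMMA = 4, 0.9
actions = {"left": -1, "right": +1}

def step(s, a):
    s2 = min(max(s + actions[a], 0), GOAL)
    return s2, (100 if s2 == GOAL else 0)

Q = {(s, a): 0.0 for s in range(GOAL + 1) for a in actions}
random.seed(0)
for _ in range(200):                        # episodes
    s = 0
    while s != GOAL:
        a = random.choice(list(actions))    # explore by acting randomly
        s2, r = step(s, a)
        # the deterministic Q learning update from the text:
        Q[(s, a)] = r + GAMMA * max(Q[(s2, a2)] for a2 in actions)
        s = s2

print(Q[(3, "right")])  # 100.0: one step from the (absorbing) goal
print(Q[(0, "right")])  # ≈ 72.9 (= 100 * 0.9**3) after convergence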

12. Discuss Minimum Description Length Principle in brief.*****

Minimum Description Length principle:

- A Bayesian perspective on Occam's razor.
- Motivated by interpreting the definition of hMAP in the light of basic concepts from information theory:

hMAP = argmax_{h ∈ H} P(D|h) P(h)

which can be equivalently expressed in terms of maximizing the log2,

hMAP = argmax_{h ∈ H} [ log2 P(D|h) + log2 P(h) ]

or alternatively, minimizing the negative of this quantity:

hMAP = argmin_{h ∈ H} [ −log2 P(D|h) − log2 P(h) ]    . . . (1)

Equation (1) can be interpreted as a statement that short hypotheses are preferred, assuming a particular representation scheme for encoding hypotheses and data:

- −log2 P(h) is the description length of h under the optimal encoding for the hypothesis space H: L_CH(h) = −log2 P(h), where CH is the optimal code for hypothesis space H.
- −log2 P(D|h) is the description length of the training data D given hypothesis h under its optimal encoding: L_CD|h(D|h) = −log2 P(D|h), where CD|h is the optimal code for describing data D assuming that both the sender and receiver know the hypothesis h.
- Rewriting Equation (1) shows that hMAP is the hypothesis h that minimizes the sum of the description length of the hypothesis plus the description length of the data given the hypothesis:

hMAP = argmin_{h ∈ H} [ L_CH(h) + L_CD|h(D|h) ]

where CH and CD|h are the optimal encodings for H and for D given h, respectively.

The Minimum Description Length (MDL) principle recommends choosing the hypothesis that minimizes the sum of these two description lengths:

Minimum Description Length principle: choose hMDL where

hMDL = argmin_{h ∈ H} [ L_C1(h) + L_C2(D|h) ]

where codes C1 and C2 are used to represent the hypothesis and the data given the hypothesis, respectively.

The above analysis shows that if we choose C1 to be the optimal encoding of hypotheses, CH, and if we choose C2 to be the optimal encoding CD|h, then hMDL = hMAP.
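A tiny numerical sketch (all probabilities invented) showing that minimizing the two description lengths selects the same hypothesis as MAP when the codes are optimal:

import math

# MDL-vs-MAP sketch with invented probabilities for three hypotheses.
priors = {"short_h": 0.50, "medium_h": 0.35, "long_h": 0.15}       # P(h)
likelihoods = {"short_h": 0.02, "medium_h": 0.10, "long_h": 0.30}  # P(D|h)

def description_length(h):
    # L_CH(h) + L_CD|h(D|h) = -log2 P(h) - log2 P(D|h), in bits
    return -math.log2(priors[h]) - math.log2(likelihoods[h])

h_mdl = min(priors, key=description_length)
h_map = max(priors, key=lambda h: priors[h] * likelihoods[h])
print(h_mdl, h_map)  # both "long_h": the same hypothesis under optimal codes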

13. What is reinforcement learning? Explain Reinforcement Learning Problem characteristics.*****


- Consider an agent interacting with its environment. The agent exists in an environment described by some set of possible states S.
- The agent can perform any of a set of possible actions A. Each time it performs an action at in some state st, the agent receives a real-valued reward rt that indicates the immediate value of this state-action transition. This produces a sequence of states si, actions ai, and immediate rewards ri, as shown in the figure.
- The agent's task is to learn a control policy, π : S → A, that maximizes the expected sum of these rewards, with future rewards discounted exponentially by their delay.
Reinforcement learning problem characteristics:

1. Delayed reward: The task of the agent is to learn a target function π that maps from the current state s to the optimal action a = π(s). In reinforcement learning, training information is not available in the form (s, π(s)). Instead, the trainer provides only a sequence of immediate reward values as the agent executes its sequence of actions. The agent therefore faces the problem of temporal credit assignment: determining which of the actions in its sequence are to be credited with producing the eventual rewards.

2. Exploration: In reinforcement learning, the agent influences the distribution of training examples by the action sequence it chooses. This raises the question of which experimentation strategy produces the most effective learning. The learner faces a trade-off in choosing whether to favor exploration of unknown states and actions, or exploitation of states and actions that it has already learned will yield high reward.

3. Partially observable states: Although in principle the agent's sensors could perceive the entire state of the environment at each time step, in many practical situations they provide only partial information. In such cases, the agent needs to consider its previous observations together with its current sensor data when choosing actions, and the best policy may be one that chooses actions specifically to improve the observability of the environment.

4. Life-long learning: The robot may have to learn several related tasks, often within the same environment and with the same sensors. For example, a mobile robot may need to learn how to dock on its battery charger, how to navigate through narrow passages, and how to pick up output from laser printers.
14. Explain case-based reasoning with an example.

case-based reasoning:

- Case-based reasoning (CBR) is a learning paradigm based on lazy learning methods; it classifies new query instances by analyzing similar instances while ignoring instances that are very different from the query.
- In CBR, instances are not represented as real-valued points; instead, they are described using a rich symbolic representation.
- CBR has been applied to problems such as conceptual design of mechanical devices based on a stored library of previous designs, reasoning about new legal cases based on previous rulings, and solving planning and scheduling problems by reusing and combining portions of previous solutions to similar problems.

A prototypical example of case-based reasoning:

- The CADET system employs case-based reasoning to assist in the conceptual design of simple mechanical devices such as water faucets.
- It uses a library containing approximately 75 previous designs and design fragments to suggest conceptual designs to meet the specifications of new design problems.
- Each instance stored in memory (e.g., a water pipe) is represented by describing both its structure and its qualitative function.
- New design problems are then presented by specifying the desired function and requesting the corresponding structure.

The problem setting is illustrated in the figure below.


- The function is represented in terms of the qualitative relationships among the water-flow levels and temperatures at its inputs and outputs.
- In the functional description, an arrow with a "+" label indicates that the variable at the arrowhead increases with the variable at its tail. A "-" label indicates that the variable at the head decreases as the variable at the tail increases.
- Here Qc refers to the flow of cold water into the faucet, Qh to the input flow of hot water, and Qm to the single mixed flow out of the faucet.
- Tc, Th, and Tm refer to the temperatures of the cold water, hot water, and mixed water, respectively.
- The variable Ct denotes the control signal for temperature that is input to the faucet, and Cf denotes the control signal for water flow.
- The controls Ct and Cf are intended to influence the water flows Qc and Qh, thereby indirectly influencing the faucet output flow Qm and temperature Tm.
15. Write short notes on radial basis functions.*****

- One approach to function approximation that is closely related to distance-weighted regression and also to artificial neural networks is learning with radial basis functions.
- In this approach, the learned hypothesis is a function of the form

  f̂(x) = w0 + Σ_{u=1..k} wu Ku(d(xu, x))    . . . (1)

- where each xu is an instance from X and the kernel function Ku(d(xu, x)) is defined so that it decreases as the distance d(xu, x) increases.
- Here k is a user-provided constant that specifies the number of kernel functions to be included.
- Although f̂ is a global approximation to f(x), the contribution from each of the Ku(d(xu, x)) terms is localized to a region near the point xu.

It is common to choose each function Ku(d(xu, x)) to be a Gaussian function centred at the point xu with some variance σu²:

Ku(d(xu, x)) = e^( −d²(xu, x) / (2σu²) )

- The functional form of Equation (1) can approximate any function with arbitrarily small error, provided a sufficiently large number k of such Gaussian kernels and provided the width σu² of each kernel can be separately specified.
- The function given by Equation (1) can be viewed as describing a two-layer network, where the first layer of units computes the values of the various Ku(d(xu, x)) and the second layer computes a linear combination of these first-layer unit values.
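A minimal sketch of such a two-layer RBF network on synthetic data: Gaussian kernels at hand-picked centres form the first layer, and the output-layer weights are trained by gradient descent (centres, width, learning rate, and epoch count are arbitrary choices):

import math

# RBF network sketch: fixed Gaussian kernels plus a trained linear output layer.
data = [(x / 10, math.sin(x / 10)) for x in range(-30, 31)]
centres = [-3, -1.5, 0, 1.5, 3]            # the x_u above, chosen by hand
sigma = 1.0

def features(x):
    # first layer: K_u(d(x_u, x)) for each kernel, plus the bias term for w0
    return [1.0] + [math.exp(-(x - c) ** 2 / (2 * sigma ** 2)) for c in centres]

w = [0.0] * (len(centres) + 1)
eta = 0.05
for _ in range(1000):                      # second layer: LMS-style linear fit
    for x, fx in data:
        phi = features(x)
        err = fx - sum(wi * p for wi, p in zip(w, phi))
        w = [wi + eta * err * p for wi, p in zip(w, phi)]

x = 1.0
print(sum(wi * p for wi, p in zip(w, features(x))))  # close to sin(1.0) ≈ 0.84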

16. Use the naive Bayes classifier and the training data from this table to classify the
following novel instance:
< Outlook = sunny, Temperature = cool, Humidity = high, Wind = strong > *****
Day Outlook Temperature Humidity Wind PlayTennis
D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rain Mild High Weak Yes
D5 Rain Cool Normal Weak Yes
D6 Rain Cool Normal Strong No
D7 Overcast Cool Normal Strong Yes
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes
D10 Rain Mild Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
D14 Rain Mild High Strong No
Solution:
The naïve Bayes classifier computes

VNB = argmax_{vj ∈ {yes, no}} P(vj) P(Outlook=sunny|vj) P(Temperature=cool|vj) P(Humidity=high|vj) P(Wind=strong|vj)

using the probability estimates below.

1) P(PlayTennis=yes) = 9/14 = 0.64
   P(PlayTennis=no) = 5/14 = 0.36
2) P(Outlook=sunny | PlayTennis=yes) = 2/9 = 0.22
   P(Outlook=sunny | PlayTennis=no) = 3/5 = 0.60
3) P(Temperature=cool | PlayTennis=yes) = 3/9 = 0.33
   P(Temperature=cool | PlayTennis=no) = 1/5 = 0.2
4) P(Humidity=high | PlayTennis=yes) = 3/9 = 0.33
   P(Humidity=high | PlayTennis=no) = 4/5 = 0.8
5) P(Wind=strong | PlayTennis=yes) = 3/9 = 0.33
   P(Wind=strong | PlayTennis=no) = 3/5 = 0.60

Substituting these estimates into the equation for VNB:

6) vj = yes:
P(yes) P(sunny|yes) P(cool|yes) P(high|yes) P(strong|yes)
= 0.64 * 0.22 * 0.33 * 0.33 * 0.33 = 0.00506

7) vj = no:
P(no) P(sunny|no) P(cool|no) P(high|no) P(strong|no)
= 0.36 * 0.60 * 0.2 * 0.8 * 0.60 = 0.0206

The naïve Bayes classifier assigns the target value with the maximum score, PlayTennis = no (0.0206 > 0.00506), so the new instance [Outlook = sunny, Temperature = cool, Humidity = high, Wind = strong] is classified as no.

Normalizing, the conditional probability of this target value is P(no) / (P(no) + P(yes)) = 0.0206 / (0.0206 + 0.00506) ≈ 0.80 (0.795 if unrounded probabilities are used).
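A short sketch (not part of the original answer) that verifies these numbers by counting directly over the table:

# Verify the PlayTennis naive Bayes computation by counting over the table.
rows = [  # (Outlook, Temperature, Humidity, Wind, PlayTennis)
    ("Sunny", "Hot", "High", "Weak", "No"), ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"), ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"), ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"), ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"), ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"), ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"), ("Rain", "Mild", "High", "Strong", "No"),
]
query = ("Sunny", "Cool", "High", "Strong")

scores = {}
for label in ("Yes", "No"):
    subset = [r for r in rows if r[4] == label]
    score = len(subset) / len(rows)                 # P(v_j)
    for i, value in enumerate(query):               # times P(a_i | v_j)
        score *= sum(r[i] == value for r in subset) / len(subset)
    scores[label] = score

print(scores)                              # ≈ {'Yes': 0.0053, 'No': 0.0206} -> no
print(scores["No"] / sum(scores.values()))  # ≈ 0.795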
