@vtudeveloper - in ML Mod 4

Bayesian Learning is a method that utilizes Bayes' theorem to reason about uncertain knowledge and infer unknown parameters in various applications, including game theory and medicine. The document outlines the fundamentals of probability-based learning, introduces Naive Bayes classification models, and explains the process of calculating prior, likelihood, and posterior probabilities. It also details the Naive Bayes algorithm, its applications, and provides a practical example of its implementation in classifying job offers based on student performance data.

Chapter 8
Bayesian Learning

"In science, progress is possible. In fact, if one believes in Bayes' theorem, scientific progress is inevitable as predictions are made and as beliefs are tested and refined." — Nate Silver

Bayesian Learning is a learning method that describes and represents knowledge in an uncertain domain and provides a way to reason about this knowledge using probability measures. It uses Bayes theorem to infer the unknown parameters of a model. Bayesian inference is useful in many applications that involve reasoning and diagnosis, such as game theory and medicine. Bayesian inference is also powerful for handling missing data and for estimating the uncertainty in predictions.

Objectives
* Understand the basics of probability-based learning and probability theory
* Learn the fundamentals of Bayes theorem
* Introduce Bayes classification models such as the Brute Force Bayes learning algorithm, the Bayes Optimal classifier, and the Gibbs algorithm
* Introduce Naive Bayes classification models that work on the principle of Bayes theorem
* Explore the Naive Bayes classification algorithm
* Study the Naive Bayes algorithm for continuous attributes using the Gaussian distribution
* Introduce other popular types of Naive Bayes classifiers such as the Bernoulli Naive Bayes classifier, the Multinomial Naive Bayes classifier, and the Multi-class Naive Bayes classifier

8.1 INTRODUCTION TO PROBABILITY-BASED LEARNING

Probability-based learning is one of the most important practical learning methods; it combines prior knowledge, or prior probabilities, with observed data. Probabilistic learning uses probability theory, which describes how to model randomness, uncertainty, and noise in order to predict future events. It is a tool for modelling large datasets and uses Bayes rule to infer unknown quantities, make predictions, and learn from data. In a probabilistic model, randomness plays a major role and the solution takes the form of a probability distribution, whereas a deterministic model has no randomness: given the same initial conditions it behaves the same way every time it is run and yields a single possible outcome as the solution. Bayesian learning differs from general probabilistic learning in that it uses subjective probabilities (i.e., probabilities based on an individual's belief or interpretation of the outcome of an event, which can change over time) to infer the parameters of a model. Two practical learning algorithms, Naive Bayes learning and the Bayesian Belief Network (BBN), form the major part of Bayesian learning. These algorithms use prior probabilities and apply Bayes rule to infer useful information. Bayesian Belief Networks (BBN) are explained in detail in Chapter 9.

8.2 FUNDAMENTALS OF BAYES THEOREM

The Naive Bayes model relies on Bayes theorem, which works with three kinds of probabilities: prior probability, likelihood probability, and posterior probability.

Prior Probability
It is the general probability of an uncertain event before an observation is seen or any evidence is collected. It is the initial probability that is believed before any new information is gathered.

Likelihood Probability
Likelihood probability is the relative probability of the observation occurring for each class, or the sampling density of the evidence given the hypothesis. It is written as P(Evidence | Hypothesis), which denotes how likely the evidence is to occur given the parameters of the hypothesis.
Posterior Probability
It is the updated or revised probability of an event obtained by taking into account the observations from the training data. P(Hypothesis | Evidence) is the posterior distribution representing the belief about the hypothesis given the evidence from the training data. Informally, the posterior probability is the prior probability revised in the light of new evidence.

8.3 CLASSIFICATION USING BAYES MODEL

Naive Bayes classification models work on the principle of Bayes theorem. Bayes' rule is a mathematical formula used to determine the posterior probability, given the prior probabilities of events. Generally, Bayes theorem is used to select the most probable hypothesis from the data, considering both prior knowledge and posterior distributions. It is based on the calculation of the posterior probability P(Hypothesis h | Evidence E), where Hypothesis h is the target class to be classified and Evidence E is the given test instance.

P(Hypothesis h | Evidence E) is calculated from the prior probability P(Hypothesis h), the likelihood probability P(Evidence E | Hypothesis h), and the marginal probability P(Evidence E). It can be written as:

P(Hypothesis h | Evidence E) = P(Evidence E | Hypothesis h) P(Hypothesis h) / P(Evidence E)     (8.1)

where,
P(Hypothesis h) is the prior probability of the hypothesis h without observing the training data or considering any evidence. It denotes the prior belief, or the initial probability, that the hypothesis h is correct.
P(Evidence E) is the prior probability of the evidence E from the training dataset without any knowledge of which hypothesis holds. It is also called the marginal probability.
P(Evidence E | Hypothesis h) is the probability of the evidence E given that the hypothesis h is correct. It is the likelihood of observing the evidence E under hypothesis h.
P(Hypothesis h | Evidence E) is the posterior probability of hypothesis h given evidence E. It is the probability of the hypothesis h after observing the evidence E in the training data.

In other words, from Eq. (8.1) one can observe that:
Posterior Probability ∝ Prior Probability × Likelihood Probability

Bayes theorem helps in calculating the posterior probability for a number of hypotheses, from which the hypothesis with the highest probability can be selected. This selection of the most probable hypothesis from a set of hypotheses is formally defined as the Maximum A Posteriori (MAP) hypothesis.

Maximum A Posteriori (MAP) Hypothesis, h_MAP
Given a set of candidate hypotheses H, the hypothesis with the maximum posterior probability is considered the most probable hypothesis. This most probable hypothesis is called the Maximum A Posteriori hypothesis h_MAP. Bayes theorem Eq. (8.1) can be used to find it:

h_MAP = argmax_{h in H} P(Hypothesis h | Evidence E)
      = argmax_{h in H} P(Evidence E | Hypothesis h) P(Hypothesis h) / P(Evidence E)
      = argmax_{h in H} P(Evidence E | Hypothesis h) P(Hypothesis h)     (8.2)

The denominator P(Evidence E) is dropped in the last step because it is the same for every hypothesis.

Maximum Likelihood (ML) Hypothesis, h_ML
Given a set of candidate hypotheses, if every hypothesis is equally probable a priori, only P(E | h) is needed to find the most probable hypothesis. The hypothesis that maximizes the likelihood P(E | h) is called the Maximum Likelihood hypothesis h_ML:

h_ML = argmax_{h in H} P(Evidence E | Hypothesis h)     (8.3)
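As a quick illustration of Eqs. (8.1)-(8.3), the following Python sketch computes posteriors for two candidate hypotheses and picks the MAP and ML hypotheses. The priors and likelihoods used here are assumed values chosen only for illustration; they are not taken from the text.

```python
# A minimal sketch of Bayes rule (Eq. 8.1) and the MAP/ML rules (Eqs. 8.2, 8.3).
# The priors and likelihoods below are made-up illustrative numbers.

def posterior(prior, likelihood, evidence_prob):
    """Bayes rule: P(h | E) = P(E | h) * P(h) / P(E)."""
    return likelihood * prior / evidence_prob

# Two candidate hypotheses with assumed priors and likelihoods P(E | h).
priors = {"h1": 0.7, "h2": 0.3}
likelihoods = {"h1": 0.2, "h2": 0.6}

# Marginal probability of the evidence: P(E) = sum over h of P(E | h) * P(h).
p_evidence = sum(likelihoods[h] * priors[h] for h in priors)

posteriors = {h: posterior(priors[h], likelihoods[h], p_evidence) for h in priors}

# MAP hypothesis: argmax over h of P(E | h) * P(h); P(E) is ignored since it
# is common to all hypotheses.
h_map = max(priors, key=lambda h: likelihoods[h] * priors[h])

# ML hypothesis: argmax over h of P(E | h), used when all priors are equal.
h_ml = max(likelihoods, key=likelihoods.get)

print(posteriors, h_map, h_ml)
```

With these assumed numbers the posteriors are 0.4375 for h1 and 0.5625 for h2, so h2 is both the MAP and the ML hypothesis; the prior alone would have favoured h1.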
Correctness of Bayes Theorem
Consider two events A and B in a sample space S with eight equally likely outcomes:

A:  T  F  T  T  F  T  T  F
B:  F  T  T  F  T  F  T  F

From this table, P(A) = 5/8, P(B) = 4/8, P(A | B) = 2/4 and P(B | A) = 2/5. Applying Bayes theorem in both directions confirms these values:
P(A | B) = P(B | A) P(A) / P(B) = (2/5)(5/8)/(4/8) = 2/4
P(B | A) = P(A | B) P(B) / P(A) = (2/4)(4/8)/(5/8) = 2/5

Let us now consider a numerical example to illustrate the use of Bayes theorem.

Consider a boy who has a volleyball tournament the next day, but today he feels sick. It is unusual for him to fall sick: since he is a healthy boy, there is only a 40% chance that he is sick. He is very much interested in volleyball, so there is a 90% probability that he participates in tournaments, and a 20% probability that he is sick given that he participates in a tournament. Find the probability of the boy participating in the tournament given that he is sick.

Solution:
P(Boy participating in the tournament) = 90%
P(He is sick | Boy participating in the tournament) = 20%
P(He is sick) = 40%

The probability of the boy participating in the tournament given that he is sick is:
P(Boy participating in the tournament | He is sick)
  = P(Boy participating in the tournament) × P(He is sick | Boy participating in the tournament) / P(He is sick)
  = (0.9 × 0.2) / 0.4 = 0.45

Hence, there is a 45% probability that the boy will participate in the tournament given that he is sick.

Minimum Description Length (MDL)
One concept related to Bayes theorem is the principle of Minimum Description Length (MDL). The MDL principle is another powerful method, like Occam's razor, for performing inductive inference. It states that the best and most probable hypothesis for a set of observed data is the one with the minimum description length. Recall from Eq. (8.2) the Maximum A Posteriori (MAP) hypothesis h_MAP, which says that given a set of candidate hypotheses, the hypothesis with the maximum posterior probability is considered the most probable hypothesis. The Naive Bayes algorithm uses Bayes theorem and applies this principle to find the best hypothesis for a given problem. Section 8.3.1 explains how the algorithm works.

8.3.1 NAIVE BAYES ALGORITHM

Naive Bayes is a supervised binary-class or multi-class classification algorithm that works on the principle of Bayes theorem. There is a family of Naive Bayes classifiers based on this common principle. These algorithms assume that the features of the dataset are independent of one another and that each feature is given equal weight. The algorithm works particularly well for large datasets and is very fast; it is one of the most effective and simple classification algorithms. It treats all features as independent of each other even though they may individually be related to the class of the object. Each feature contributes a probability value independently during classification, and hence the algorithm is called "naive". Important applications of these algorithms include text classification, recommendation systems, and face recognition.

The Naive Bayes algorithm proceeds in the following steps (a minimal Python sketch of these steps is given after the list):
1. Compute the prior probability for the target class.
2. Compute the frequency matrix and likelihood probability for each feature.
3. Use Bayes theorem Eq. (8.1) to calculate the posterior probability of all hypotheses.
4. Use the Maximum A Posteriori (MAP) hypothesis h_MAP, Eq. (8.2), to assign the test object to the hypothesis with the highest probability.
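The following Python sketch implements the four steps above for categorical features. It is a generic illustration under the assumption that the training data is given as a list of dictionaries, each mapping feature names to values along with a class label; the function and variable names are chosen here for illustration and do not come from the text.

```python
from collections import Counter, defaultdict

def train_naive_bayes(records, target):
    """Steps 1 and 2: compute priors and per-feature likelihood tables."""
    class_counts = Counter(r[target] for r in records)
    total = len(records)
    priors = {c: n / total for c, n in class_counts.items()}

    # likelihoods[feature][(value, cls)] = P(feature = value | cls)
    likelihoods = defaultdict(dict)
    features = [f for f in records[0] if f != target]
    for f in features:
        pair_counts = Counter((r[f], r[target]) for r in records)
        for (value, cls), n in pair_counts.items():
            likelihoods[f][(value, cls)] = n / class_counts[cls]
    return priors, likelihoods

def classify(test, priors, likelihoods):
    """Steps 3 and 4: score each class and return the MAP class."""
    scores = {}
    for cls, prior in priors.items():
        score = prior
        for f, value in test.items():
            # Unseen (value, class) pairs get probability 0 here; see the
            # zero-probability discussion and Laplace correction later.
            score *= likelihoods[f].get((value, cls), 0.0)
        scores[cls] = score
    return max(scores, key=scores.get), scores
```

With the records of Table 8.1 encoded as dictionaries such as {'CGPA': '>=9', 'Interactiveness': 'Yes', 'Practical Knowledge': 'Very good', 'Communication Skills': 'Good', 'Job Offer': 'Yes'}, classify() reproduces the hand computation carried out in the solution of Example 8.1 below.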
Example 8.1
Assess a student's performance using the Naive Bayes algorithm with the dataset provided in Table 8.1. Predict whether a student gets a job offer or not in the final year of the course.

Table 8.1: Training Dataset

S.No.  CGPA   Interactiveness  Practical Knowledge  Communication Skills  Job Offer
1      ≥ 9    Yes              Very good            Good                  Yes
2      ≥ 8    No               Good                 Moderate              Yes
3      ≥ 9    No               Average              Poor                  No
4      < 8    No               Average              Good                  No
5      ≥ 8    Yes              Good                 Moderate              Yes
6      ≥ 9    Yes              Good                 Moderate              Yes
7      < 8    Yes              Good                 Poor                  No
8      ≥ 9    No               Very good            Good                  Yes
9      ≥ 8    Yes              Good                 Good                  Yes
10     ≥ 8    Yes              Average              Good                  Yes

Solution: The training dataset T consists of 10 data instances with the attributes 'CGPA', 'Interactiveness', 'Practical Knowledge' and 'Communication Skills', as shown in Table 8.1. The target variable is Job Offer, which is classified as Yes or No for a candidate student.

Step 1: Compute the prior probability for the target feature 'Job Offer'.
The target feature 'Job Offer' has two classes, 'Yes' and 'No', so this is a binary classification problem. Given a student instance, we need to classify whether 'Job Offer = Yes' or 'Job Offer = No'. From the training dataset, we observe that the number of instances with 'Job Offer = Yes' is 7 and with 'Job Offer = No' is 3. The prior probability for each target class is calculated by dividing the number of instances belonging to that class by the total number of instances. Hence, the prior probability for 'Job Offer = Yes' is 7/10 and for 'Job Offer = No' is 3/10, as shown in Table 8.2.

Table 8.2: Frequency Matrix and Prior Probability of Job Offer

Job Offer  Frequency  Prior Probability
Yes        7          P(Job Offer = Yes) = 7/10
No         3          P(Job Offer = No) = 3/10

Step 2: Compute the frequency matrix and likelihood probability for each feature.

Step 2(a): Feature - CGPA
Table 8.3 shows the frequency matrix for the feature CGPA.

Table 8.3: Frequency Matrix of CGPA

CGPA   Job Offer = Yes  Job Offer = No
≥ 9    3                1
≥ 8    4                0
< 8    0                2
Total  7                3

Table 8.4 shows how the likelihood probability is calculated for CGPA using conditional probability.

Table 8.4: Likelihood Probability of CGPA

CGPA   Likelihood given Job Offer = Yes             Likelihood given Job Offer = No
≥ 9    P(CGPA ≥ 9 | Job Offer = Yes) = 3/7          P(CGPA ≥ 9 | Job Offer = No) = 1/3
≥ 8    P(CGPA ≥ 8 | Job Offer = Yes) = 4/7          P(CGPA ≥ 8 | Job Offer = No) = 0/3
< 8    P(CGPA < 8 | Job Offer = Yes) = 0/7          P(CGPA < 8 | Job Offer = No) = 2/3

As explained earlier, the likelihood probability is the sampling density of the evidence given the hypothesis. It is denoted as P(Evidence | Hypothesis) and says how likely the occurrence of the evidence is given the parameters. It is calculated as the number of instances with a given attribute value and a given class value, divided by the number of instances with that class value. For example, P(CGPA ≥ 9 | Job Offer = Yes) is the number of instances with 'CGPA ≥ 9' and 'Job Offer = Yes' divided by the total number of instances with 'Job Offer = Yes'. From the frequency matrix of CGPA in Table 8.3, the number of instances with 'CGPA ≥ 9' and 'Job Offer = Yes' is 3, and the total number of instances with 'Job Offer = Yes' is 7. Hence, P(CGPA ≥ 9 | Job Offer = Yes) = 3/7. The likelihood probability is calculated similarly for all attribute values of the feature CGPA; a short code check of these tables follows.
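The frequency and likelihood tables above can be cross-checked mechanically. The sketch below assumes Table 8.1 is encoded as a list of tuples and reproduces Tables 8.2-8.4 for the CGPA feature; the variable names are illustrative only.

```python
from collections import Counter

# Table 8.1 encoded as (CGPA, Interactiveness, Practical Knowledge,
# Communication Skills, Job Offer) tuples.
data = [
    (">=9", "Yes", "Very good", "Good",     "Yes"),
    (">=8", "No",  "Good",      "Moderate", "Yes"),
    (">=9", "No",  "Average",   "Poor",     "No"),
    ("<8",  "No",  "Average",   "Good",     "No"),
    (">=8", "Yes", "Good",      "Moderate", "Yes"),
    (">=9", "Yes", "Good",      "Moderate", "Yes"),
    ("<8",  "Yes", "Good",      "Poor",     "No"),
    (">=9", "No",  "Very good", "Good",     "Yes"),
    (">=8", "Yes", "Good",      "Good",     "Yes"),
    (">=8", "Yes", "Average",   "Good",     "Yes"),
]

# Table 8.2: class frequencies and prior probabilities.
class_counts = Counter(row[-1] for row in data)          # {'Yes': 7, 'No': 3}
priors = {c: n / len(data) for c, n in class_counts.items()}

# Tables 8.3 and 8.4: frequency matrix and likelihoods for CGPA (column 0).
cgpa_counts = Counter((row[0], row[-1]) for row in data)
cgpa_likelihood = {
    (value, c): cgpa_counts.get((value, c), 0) / class_counts[c]
    for value in (">=9", ">=8", "<8") for c in ("Yes", "No")
}

print(priors)            # {'Yes': 0.7, 'No': 0.3}
print(cgpa_likelihood)   # e.g. ('>=9', 'Yes'): 3/7, ('>=8', 'No'): 0.0
```

The same two-line pattern (a pair count divided by a class count) yields the likelihood tables for the remaining features computed in Steps 2(b)-2(d) below.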
Step 2(b): Feature - Interactiveness
Table 8.5 shows the frequency matrix for the feature Interactiveness.

Table 8.5: Frequency Matrix of Interactiveness

Interactiveness  Job Offer = Yes  Job Offer = No
Yes              5                1
No               2                2
Total            7                3

Table 8.6 shows how the likelihood probability is calculated for Interactiveness using conditional probability.

Table 8.6: Likelihood Probability of Interactiveness

Interactiveness  Likelihood given Job Offer = Yes                    Likelihood given Job Offer = No
Yes              P(Interactiveness = Yes | Job Offer = Yes) = 5/7    P(Interactiveness = Yes | Job Offer = No) = 1/3
No               P(Interactiveness = No | Job Offer = Yes) = 2/7     P(Interactiveness = No | Job Offer = No) = 2/3

Step 2(c): Feature - Practical Knowledge
Table 8.7 shows the frequency matrix for the feature Practical Knowledge.

Table 8.7: Frequency Matrix of Practical Knowledge

Practical Knowledge  Job Offer = Yes  Job Offer = No
Very good            2                0
Average              1                2
Good                 4                1
Total                7                3

Table 8.8 shows how the likelihood probability is calculated for Practical Knowledge using conditional probability.

Table 8.8: Likelihood Probability of Practical Knowledge

Practical Knowledge  Likelihood given Job Offer = Yes                             Likelihood given Job Offer = No
Very good            P(Practical Knowledge = Very good | Job Offer = Yes) = 2/7   P(Practical Knowledge = Very good | Job Offer = No) = 0/3
Average              P(Practical Knowledge = Average | Job Offer = Yes) = 1/7     P(Practical Knowledge = Average | Job Offer = No) = 2/3
Good                 P(Practical Knowledge = Good | Job Offer = Yes) = 4/7        P(Practical Knowledge = Good | Job Offer = No) = 1/3

Step 2(d): Feature - Communication Skills
Table 8.9 shows the frequency matrix for the feature Communication Skills.

Table 8.9: Frequency Matrix of Communication Skills

Communication Skills  Job Offer = Yes  Job Offer = No
Good                  4                1
Moderate              3                0
Poor                  0                2
Total                 7                3

Table 8.10 shows how the likelihood probability is calculated for Communication Skills using conditional probability.

Table 8.10: Likelihood Probability of Communication Skills

Communication Skills  Likelihood given Job Offer = Yes                              Likelihood given Job Offer = No
Good                  P(Communication Skills = Good | Job Offer = Yes) = 4/7        P(Communication Skills = Good | Job Offer = No) = 1/3
Moderate              P(Communication Skills = Moderate | Job Offer = Yes) = 3/7    P(Communication Skills = Moderate | Job Offer = No) = 0/3
Poor                  P(Communication Skills = Poor | Job Offer = Yes) = 0/7        P(Communication Skills = Poor | Job Offer = No) = 2/3

Step 3: Use Bayes theorem Eq. (8.1) to calculate the posterior probability of all hypotheses.
Given the test data (CGPA ≥ 9, Interactiveness = Yes, Practical Knowledge = Average, Communication Skills = Good), apply Bayes theorem to classify whether the given student gets a job offer or not.

P(Job Offer = Yes | Test data)
  = P(CGPA ≥ 9 | Job Offer = Yes) P(Interactiveness = Yes | Job Offer = Yes) P(Practical Knowledge = Average | Job Offer = Yes) P(Communication Skills = Good | Job Offer = Yes) P(Job Offer = Yes) / P(Test data)

We can ignore P(Test data) in the denominator since it is common to all the classes being compared. Hence,

P(Job Offer = Yes | Test data)
  ∝ P(CGPA ≥ 9 | Job Offer = Yes) P(Interactiveness = Yes | Job Offer = Yes) P(Practical Knowledge = Average | Job Offer = Yes) P(Communication Skills = Good | Job Offer = Yes) P(Job Offer = Yes)
  = 3/7 × 5/7 × 1/7 × 4/7 × 7/10 ≈ 0.0175

Similarly, for the other class 'Job Offer = No', we compute:

P(Job Offer = No | Test data)
  ∝ P(CGPA ≥ 9 | Job Offer = No) P(Interactiveness = Yes | Job Offer = No) P(Practical Knowledge = Average | Job Offer = No) P(Communication Skills = Good | Job Offer = No) P(Job Offer = No)
  = 1/3 × 1/3 × 2/3 × 1/3 × 3/10 ≈ 0.0074
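The two products above can be verified with a few lines of Python. This is only a check of the hand computation; the fractions are the priors and likelihoods read off Tables 8.2, 8.4, 8.6, 8.8 and 8.10.

```python
from fractions import Fraction as F

# Scores for the test instance (CGPA >= 9, Interactiveness = Yes,
# Practical Knowledge = Average, Communication Skills = Good),
# with P(Test data) ignored since it is common to both classes.
score_yes = F(3, 7) * F(5, 7) * F(1, 7) * F(4, 7) * F(7, 10)
score_no  = F(1, 3) * F(1, 3) * F(2, 3) * F(1, 3) * F(3, 10)

print(float(score_yes))   # ~0.0175
print(float(score_no))    # ~0.0074
print("Job Offer =", "Yes" if score_yes > score_no else "No")
```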
Step 4: Use the Maximum A Posteriori (MAP) hypothesis h_MAP, Eq. (8.2), to classify the test object to the hypothesis with the highest probability.
Since P(Job Offer = Yes | Test data) has the higher value, the test data is classified as 'Job Offer = Yes'.

Zero-Probability Error
In Example 8.1, consider the test data to be (CGPA ≥ 8, Interactiveness = Yes, Practical Knowledge = Average, Communication Skills = Good).

When computing the posterior probability for 'Job Offer = Yes':
P(Job Offer = Yes | Test data)
  ∝ P(CGPA ≥ 8 | Job Offer = Yes) P(Interactiveness = Yes | Job Offer = Yes) P(Practical Knowledge = Average | Job Offer = Yes) P(Communication Skills = Good | Job Offer = Yes) P(Job Offer = Yes)
  = 4/7 × 5/7 × 1/7 × 4/7 × 7/10 ≈ 0.0233

Similarly, for the other class 'Job Offer = No':
P(Job Offer = No | Test data)
  ∝ P(CGPA ≥ 8 | Job Offer = No) P(Interactiveness = Yes | Job Offer = No) P(Practical Knowledge = Average | Job Offer = No) P(Communication Skills = Good | Job Offer = No) P(Job Offer = No)
  = 0/3 × 1/3 × 2/3 × 1/3 × 3/10 = 0

Since this probability value is zero, the model fails to give a meaningful score for the class; this is called the zero-probability error. The problem arises because there are no instances in Table 8.1 with the attribute value CGPA ≥ 8 and Job Offer = No, so the conditional probability for this case is zero. The zero-probability error can be solved by applying a smoothing technique called Laplace correction: given, say, 1000 data instances in the training dataset, if there are zero instances for a particular value of a feature, we can add one instance for each attribute-value pair of that feature. This makes little difference for 1000 data instances, and the overall probability no longer becomes zero. A small code sketch of this correction follows Table 8.11.

Now, let us scale the values given in Table 8.1 to 1000 data instances. The scaled values without Laplace correction are shown in Table 8.11.

Table 8.11: Scaled Values to 1000 without Laplace Correction

CGPA   Likelihood given Job Offer = Yes                   Likelihood given Job Offer = No
≥ 9    P(CGPA ≥ 9 | Job Offer = Yes) = 300/700            P(CGPA ≥ 9 | Job Offer = No) = 100/300
≥ 8    P(CGPA ≥ 8 | Job Offer = Yes) = 400/700            P(CGPA ≥ 8 | Job Offer = No) = 0/300
< 8    P(CGPA < 8 | Job Offer = Yes) = 0/700              P(CGPA < 8 | Job Offer = No) = 200/300
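The Laplace correction described above can be sketched as a generic add-one (more generally, add-alpha) smoothing of the likelihood estimate. The chapter describes the same idea as adding one instance for each attribute-value pair; the formulation and names below are illustrative and do not come from the text.

```python
def smoothed_likelihood(pair_count, class_count, n_values, alpha=1):
    """Laplace-corrected estimate of P(feature = value | class).

    pair_count:  number of training instances with this (value, class) pair
    class_count: number of training instances with this class
    n_values:    number of distinct values the feature can take
    alpha:       smoothing strength (alpha = 1 is the classic Laplace correction)
    """
    return (pair_count + alpha) / (class_count + alpha * n_values)

# Without smoothing, P(CGPA >= 8 | Job Offer = No) = 0/3 = 0, which wipes out
# the whole product in Step 3. With Laplace correction (CGPA has 3 values):
print(smoothed_likelihood(0, 3, 3))   # (0 + 1) / (3 + 3) ~ 0.167 instead of 0
print(smoothed_likelihood(4, 7, 3))   # (4 + 1) / (7 + 3) = 0.5 for the 'Yes' class
```

With the corrected estimates, the score for 'Job Offer = No' is small but non-zero, so the comparison between the two classes remains meaningful.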
