Unit 6 Neural Network Part 2
Machine Learning
INTRODUCTION
• The technique was derived from the work of the 18th century
mathematician Thomas Bayes.
• He developed the foundational mathematical principles, known as
Bayesian methods, which describe the probability of events, and
more importantly, how probabilities should be revised when there is
additional information available.
• Bayesian learning algorithms, like the naive Bayes classifier, are highly
practical approaches to certain types of learning problems as they can
calculate explicit probabilities for hypotheses.
APPLICATIONS
• Text-based classification, such as spam or junk mail filtering, author
identification, or topic categorization (a small illustrative sketch follows
this list).
• Medical diagnosis, such as identifying the probability that a new patient
has a disease, given the set of symptoms observed during that disease.
• Network security, such as detecting illegal intrusions or anomalies in
computer networks.
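As a flavour of the spam-filtering application mentioned in the first item, here is a minimal naive-Bayes style sketch in Python. The word counts, class priors, vocabulary, and test message are all invented purely for illustration; this is only a sketch of the idea, not a complete or tuned classifier.

```python
# A minimal sketch of a naive-Bayes style spam score.
# All counts, priors, and the test message are made up for illustration.

from math import log

# Assumed word frequencies observed in spam and non-spam (ham) messages.
word_counts = {
    "spam": {"offer": 30, "win": 25, "meeting": 2, "report": 1},
    "ham":  {"offer": 3,  "win": 2,  "meeting": 40, "report": 35},
}
class_prior = {"spam": 0.4, "ham": 0.6}   # assumed prior class probabilities

def log_score(message_words, label):
    """log P(label) + sum of log P(word|label), with add-one smoothing."""
    counts = word_counts[label]
    total = sum(counts.values())
    vocab = {w for c in word_counts.values() for w in c}
    score = log(class_prior[label])
    for w in message_words:
        score += log((counts.get(w, 0) + 1) / (total + len(vocab)))
    return score

message = ["win", "offer"]
label = max(("spam", "ham"), key=lambda c: log_score(message, c))
print(label)   # 'spam' for this made-up example
```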
BAYES’ THEOREM
• What is concept learning?
• Let us take an example of how a child starts to learn the meaning of a new
word, e.g. ‘ball’.
• Positive examples: objects that are instances of the concept (things that are
called a ball).
• Negative examples: objects that are not instances of the concept.
• Let us define a concept set C and a corresponding indicator function f(k). We
define f(k) = 1 when k is within the set C, and f(k) = 0 otherwise. Our aim is
to learn the indicator function f that defines which elements are within the
set C (a short sketch of f follows this list).
• Through Bayes’ theorem, we will see how standard probability calculus can be
used to express the uncertainty about the function f, and how the
classification can be refined as positive examples are observed.
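As a concrete illustration of the indicator function f, here is a minimal Python sketch. The concept set C and the example objects are hypothetical choices made purely for this example.

```python
# A minimal sketch of the indicator function f for a concept set C.
# The concept and the example objects are hypothetical, for illustration only.

C = {"football", "tennis ball", "basketball"}   # the concept set C

def f(k):
    """Indicator function: 1 if k belongs to the concept C, 0 otherwise."""
    return 1 if k in C else 0

print(f("tennis ball"))  # 1  (positive example)
print(f("banana"))       # 0  (negative example)
```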
• Bayes’ probability rule is given as:
P(A|B) = P(B|A) P(A) / P(B)      ... (Eq. 1)
where A and B are conditionally related events and P(A|B) denotes the
probability of event A occurring when event B has already occurred.
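A tiny numeric sketch of this rule in Python; the probability values below are arbitrary, made-up numbers chosen only to show the arithmetic.

```python
# Bayes' rule: P(A|B) = P(B|A) * P(A) / P(B)
# The numbers here are arbitrary, chosen only to illustrate the formula.

def bayes_rule(p_b_given_a, p_a, p_b):
    """Return P(A|B) computed from P(B|A), P(A) and P(B)."""
    return p_b_given_a * p_a / p_b

p_a = 0.01          # P(A): prior probability of event A
p_b_given_a = 0.9   # P(B|A): probability of observing B when A holds
p_b = 0.05          # P(B): overall probability of observing B

print(bayes_rule(p_b_given_a, p_a, p_b))  # P(A|B) = 0.9 * 0.01 / 0.05 = 0.18
```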
• Let us assume that we have a training data set T in which we have
recorded some observed data. Our task is to determine the best
hypothesis in the hypothesis space H by using the knowledge of T.
PRIOR (the probability before the evidence is considered)
• The prior knowledge or belief about the probabilities of various
hypotheses in H is called the Prior in the context of Bayes’ theorem.
• For example, if we have to determine whether a particular type of
tumour is malignant for a patient, the prior knowledge of how often such
tumours turn out to be malignant can be used to inform our current
hypothesis; this is a prior probability, or simply the Prior.
• We will assume that P(h) is the initial probability of the hypothesis ‘h’
that the patient has a malignant tumour, held before observing the
so-called training data, i.e. without considering the outcome of the
malignancy test or the correctness of the test process.
POSTERIOR (updated probability after the evidence is considered)
• The probability that a particular hypothesis holds for a data set, obtained by
updating the Prior with that data, is called the posterior probability, or
simply the Posterior.
• In the above example, the probability that the patient has a malignant
tumour, computed by combining the Prior with the observed result of the
malignancy test, is a posterior probability.
• In our notation, we will say that we are interested in finding out P(h|T),
i.e. the probability that the hypothesis h holds true given the observed
training data T.
• So, the prior probability P(h), which represents the probability of the
hypothesis independent of the training data (the Prior), now gets refined
into P(h|T) once the influence of the training data T is introduced.
• According to Bayes’ theorem,
P(h|T) = P(T|h) P(h) / P(T)
where P(T|h) is the probability of observing the training data T given that
hypothesis h holds, and P(T) is the probability of observing T irrespective of
any particular hypothesis.
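To make the refinement of the Prior into the Posterior concrete, here is a small worked sketch of the malignant-tumour example in Python. All of the numbers (the prior rate of malignancy and the test accuracies) are hypothetical values assumed only for illustration.

```python
# A worked sketch of the malignant-tumour example using Bayes' theorem.
# All numbers below are hypothetical, chosen only to show how the prior
# P(h) is refined into the posterior P(h|T) by a positive test result T.

p_h = 0.008            # Prior: assumed rate of malignant tumours
p_t_given_h = 0.98     # Assumed P(positive test | tumour is malignant)
p_t_given_not_h = 0.03 # Assumed P(positive test | tumour is not malignant)

# P(T): total probability of observing a positive test result
p_t = p_t_given_h * p_h + p_t_given_not_h * (1 - p_h)

# Posterior P(h|T) by Bayes' theorem
p_h_given_t = p_t_given_h * p_h / p_t
print(round(p_h_given_t, 3))  # approx 0.209 under these assumed numbers
```

Note how, under these assumed numbers, a positive test raises the probability of malignancy from 0.008 (the Prior) to roughly 0.21 (the Posterior), but not to certainty.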
• Let us try to connect the concept learning problem with the problem
of identifying h_map, the maximum a posteriori hypothesis, i.e. the
hypothesis in H that maximizes the posterior probability P(h|T).
• We can encode the prior knowledge of the learning task by specifying
the probability distributions P(h) and P(T|h).
• There are a few important assumptions to be made, as follows:
• The training data or target sequence T is noise free, which means that each
target value is a direct function of the corresponding instance only
(i.e. t_i = c(x_i)).
• The concept c lies within the hypothesis space H.
• Each hypothesis is equally probable a priori and independent of the others.
• On the basis of assumption 3, we can say that each hypothesis h
within the space H has equal prior probability, and also because of
assumption 2, we can say that these prior probabilities sum up to 1.
So, we can write
P(h) = 1 / |H|   for every h in H
where |H| is the total number of hypotheses in the space H.
• For the cases when h is inconsistent with the training data T, we have
P(T|h) = 0, so using Eq. 1 we get
P(h|T) = (0 · P(h)) / P(T) = 0
and when h is consistent with T, we have P(T|h) = 1, so
P(h|T) = (1 · (1/|H|)) / P(T) = 1 / |VS_H,T|
where VS_H,T, called the version space, is the subset of hypotheses in H that
are consistent with T, and P(T) = |VS_H,T| / |H|.
• So, with our set of assumptions about P(h) and P(T|h), we get the
posterior probability P(h|T) as
P(h|T) = 1 / |VS_H,T|   if h is consistent with T, and
P(h|T) = 0              otherwise.
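To tie the pieces together, here is a minimal Python sketch of brute-force Bayesian concept learning under the three assumptions above. The hypothesis space (simple threshold rules), the training pairs, and the instances are all invented for illustration; the point is only that every consistent hypothesis receives posterior 1/|VS_H,T| and every inconsistent one receives 0.

```python
# A minimal sketch of brute-force Bayesian concept learning over a toy,
# finite hypothesis space. The hypotheses (threshold rules on a number)
# and the training pairs below are invented purely for illustration.

# Hypothesis space H: "x is in the concept iff x >= k", for k = 0..5.
H = [lambda x, k=k: int(x >= k) for k in range(0, 6)]   # |H| = 6 hypotheses

# Training data T: pairs (x_i, t_i), assumed noise free (t_i = c(x_i)).
T = [(1, 0), (5, 1)]            # consistent with any threshold k in {2,3,4,5}

prior = 1.0 / len(H)            # P(h) = 1/|H| for every h (assumption 3)

def consistent(h, data):
    """P(T|h) under the noise-free assumption: 1 if h matches every example."""
    return all(h(x) == t for x, t in data)

likelihood = [1.0 if consistent(h, T) else 0.0 for h in H]   # P(T|h)
p_T = sum(lik * prior for lik in likelihood)                 # P(T) = |VS|/|H|

# Posterior P(h|T): 1/|VS_H,T| for consistent hypotheses, 0 otherwise.
posterior = [lik * prior / p_T for lik in likelihood]
for k, p in zip(range(0, 6), posterior):
    print(f"h: x >= {k}   P(h|T) = {p:.3f}")
```

Running this prints a posterior of 0.250 for each of the four consistent threshold hypotheses and 0.000 for the two inconsistent ones, matching the 1/|VS_H,T| result derived above.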