
Natural Language Processing

Unit – 1
Elementary Probability Theory
Mathematical Foundations
• Elementary Probability Theory
• Probability spaces
• Conditional probability and independence
• Bayes' theorem
• Random variables
• Expectation and variance
• Notation
• Joint and conditional distributions
• Determining P
• Standard distributions
• Bayesian statistics
Elementary Probability Theory
Probability space
A probability space models random events and is made up of
three parts:
1. Sample space: the set of all possible outcomes. For example,
if you toss a coin twice, the sample space is {HH, HT, TH, TT}.
The sample space is sometimes denoted by the Greek letter
omega (Ω).
2. Event space: the set of all events (an event can contain zero or more outcomes).
3. Probability function: the assignment of a probability to each
event. For example, the probability of tossing a coin and getting
heads is 50%. Probabilities are nonnegative and sum to 1.
The Sample Space, S
The sample space, S, for a random phenomenon is the set of all
possible outcomes.
Examples
1. Tossing a coin – outcomes
S = {Head, Tail}
2. Rolling a die – outcomes
S = {1, 2, 3, 4, 5, 6}
Event E
• An event in probability is a set of outcomes of a
random experiment.
• The sample space indicates all possible outcomes of an
experiment.
• Thus, events in probability can also be described as subsets of
the sample space.
Independent events: Examples
Example 1: A coin is flipped and a die is rolled. Find the probability of
getting a head on the coin and a 4 on the die.
The sample space is {H1, H2, H3, H4, H5, H6, T1, T2, T3, T4, T5, T6}.
Let A = "head on the coin" and B = "4 on the die":
P(A) = 1/2
P(B) = 1/6
P(A ∩ B) = 1/12 = P(A) · P(B)
Example 2: A card is drawn from a deck and replaced; then a 2nd card is
drawn. Find the probability of getting a queen and then an ace.
P(Q) = 4/52
P(A) = 4/52
P(Q ∩ A) = P(Q) · P(A) = (4/52) · (4/52) = 1/169
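The multiplication rule in Example 1 can be checked by brute-force enumeration of the 12 equally likely outcomes; a minimal Python sketch (the helper name prob and the outcome encoding are ours, not from the slides):

```python
from fractions import Fraction
from itertools import product

# Enumerate the joint sample space for one coin flip and one die roll.
# Each of the 12 outcomes is equally likely.
outcomes = list(product(["H", "T"], [1, 2, 3, 4, 5, 6]))

def prob(event):
    """Probability of an event given as a predicate over outcomes."""
    favorable = [o for o in outcomes if event(o)]
    return Fraction(len(favorable), len(outcomes))

p_head = prob(lambda o: o[0] == "H")                 # P(A) = 1/2
p_four = prob(lambda o: o[1] == 4)                   # P(B) = 1/6
p_both = prob(lambda o: o[0] == "H" and o[1] == 4)   # P(A ∩ B)

print(p_head, p_four, p_both)      # 1/2 1/6 1/12
assert p_both == p_head * p_four   # independence: P(A ∩ B) = P(A)·P(B)
```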
Independent events: Examples
Example 3: A box contains 3 red balls, 2 blue balls, and 5 white balls. A ball is
selected and its color noted. Then it is replaced. A 2nd ball is selected and its
color noted. Find the probability of:
1. Selecting 2 blue balls: (2/10) · (2/10) = 4/100 = 1/25
2. Selecting a blue ball and then a white ball: (2/10) · (5/10) = 10/100 = 1/10
3. Selecting a red ball and then a blue ball: (3/10) · (2/10) = 6/100 = 3/50
Dependent events
• Two events are dependent if the outcome of the first event affects the outcome
of the second event, so that the probability is changed.
Example:
Suppose we have 5 blue marbles and 5 red marbles in a bag. We pull out one
marble, which may be blue or red. Now there are 9 marbles left in the bag. What
is the probability that the second marble will be red? It depends.
• If the first marble was red, then the bag is left with 4 red marbles out of 9 so
the probability of drawing a red marble on the second draw is 4/9 .
• But if the first marble we pull out of the bag is blue, then there are still 5 red
marbles in the bag and the probability of pulling a red marble out of the bag is
5/9 .
• The second draw is a dependent event.
• It depends upon what happened in the first draw.
Conditional probability: Example
Two dice are thrown simultaneously and the sum of the numbers obtained is
found to be 7. What is the probability that the number 3 has appeared at least
once?
Solution: The sample space S consists of all outcomes possible from the
combination of the two dice. Therefore S consists of 6 × 6 = 36 outcomes.
• Event A indicates the combination of the numbers obtained is found to be 7.
A = {(1, 6)(2, 5)(3, 4)(4, 3)(5, 2)(6, 1)}
P(A) = 6/36
• Event B indicates the combination in which 3 has appeared at least once.
B = {(3, 1)(3, 2)(3, 3)(3, 4)(3, 5)(3, 6)(1, 3)(2, 3)(4, 3)(5, 3)(6, 3)}
P(B) = 11/36
n(A ∩ B) = 2
P(A ∩ B) = 2/36
• Applying the conditional probability formula we get,
P(B|A) = P(A ∩ B) / P(A) = (2/36) / (6/36) = 2/6 = 1/3
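These counts can be verified by enumerating the 36 outcomes; a minimal Python sketch (variable names are ours):

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes of throwing two dice.
rolls = list(product(range(1, 7), repeat=2))

A = [r for r in rolls if sum(r) == 7]   # event A: the sum is 7
A_and_B = [r for r in A if 3 in r]      # sum is 7 AND a 3 appeared

# For equally likely outcomes, P(B|A) = n(A ∩ B) / n(A).
p_B_given_A = Fraction(len(A_and_B), len(A))
print(p_B_given_A)                      # 1/3
```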
Bayes' Theorem
Let E1, E2, …, En be a set of events associated with a sample space S, where
all the events E1, E2, …, En have nonzero probability of occurrence and
they form a partition of S. Let A be any event associated with S. Then,
according to Bayes' theorem,
$$P(E_i \mid A) = \frac{P(E_i)\, P(A \mid E_i)}{\sum_{k=1}^{n} P(E_k)\, P(A \mid E_k)}, \qquad i = 1, 2, \ldots, n$$
If A and B are two events, then the formula for Bayes' theorem is given by:
$$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}$$
Bayes' Theorem: Example 1
One of two boxes contains 4 red balls and 2 green balls and the second box contains 4 green and two
red balls. By design, the probabilities of selecting box 1 or box 2 at random are 1/3 for box 1 and 2/3
for box 2. A box is selected at random and a ball is selected at random from it.
1. Given that the ball selected is red, what is the probability it was selected from the first box?
2. Given that the ball selected is red, what is the probability it was selected from the second box?
3. Compare the results in parts (1) and (2) and explain the answer.
Let us call the first box B1 and the second box B2. Let event E1 be "select box 1" and event E2 "select box
2". Let event R be "select a red ball".
Bayes' Theorem: Example 1 Solution
1. P(B1|R) = P(B1) · P(R|B1) / [P(B1) · P(R|B1) + P(B2) · P(R|B2)]
= (1/3)(4/6) / [(1/3)(4/6) + (2/3)(2/6)] = (2/9) / (4/9) = 1/2
2. P(B2|R) = P(B2) · P(R|B2) / [P(B1) · P(R|B1) + P(B2) · P(R|B2)]
= (2/3)(2/6) / (4/9) = (2/9) / (4/9) = 1/2
3. The two probabilities calculated in parts (1) and (2) are equal.
Although there are more red balls in box 1 than in box 2 (twice as many), the
probabilities calculated above are equal because the probability of selecting box 2 is
higher (twice as high) than the probability of selecting box 1. Bayes' theorem takes all
the information into consideration.
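The same computation as a minimal Python sketch (the dictionary names are illustrative):

```python
from fractions import Fraction

# Priors and likelihoods from Example 1.
p_box = {"B1": Fraction(1, 3), "B2": Fraction(2, 3)}
p_red_given_box = {"B1": Fraction(4, 6), "B2": Fraction(2, 6)}

# Total probability of drawing a red ball.
p_red = sum(p_box[b] * p_red_given_box[b] for b in p_box)

# Bayes' theorem: posterior probability of each box given a red ball.
posterior = {b: p_box[b] * p_red_given_box[b] / p_red for b in p_box}
print(posterior)   # {'B1': Fraction(1, 2), 'B2': Fraction(1, 2)}
```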
Bayes' Theorem: Example 2
1% of a population have a certain disease and the remaining 99% are free from this disease. A test is
used to detect this disease. This test is positive in 95% of the people with the disease and is also
(falsely) positive in 2% of the people free from the disease.
If a person, selected at random from this population, has tested positive, what is the probability that
she/he has the disease?
Let D be the event "have the disease" and FD be the event "free from the disease"
Let the event TP be the event that the "test is positive".
Explanation:
P(D|TP) = P(D) · P(TP|D) / [P(D) · P(TP|D) + P(FD) · P(TP|FD)]
= (0.01 × 0.95) / (0.01 × 0.95 + 0.99 × 0.02) = 0.0095 / 0.0293 ≈ 0.324
Even with a positive test, the probability of having the disease is only about 32%,
because the disease itself is rare.
Bayes' Theorem: Example 4
A radar system is designed such that the probability of detecting the presence of an aircraft in its
range is 98%. However, if no aircraft is present in its range, it still reports (falsely) that an aircraft is
present with a probability of 5%. At any time, the probability that an aircraft is present within the
range of the radar is 7%.
a) What is the probability that no aircraft is present in the range of the radar given that an aircraft is detected?
b) What is the probability that an aircraft is present in the range of the radar given that an aircraft is detected?
c) What is the probability that an aircraft is present in the range of the radar given that no aircraft is detected?
d) What is the probability that no aircraft is present in the range of the radar given that no aircraft is detected?
Bayes' Theorem: Example 4 Solution
Let A be the event "an aircraft is present" (P(A) = 0.07) and D the event "an aircraft is
detected" (P(D|A) = 0.98, P(D|A′) = 0.05).
a) P(no aircraft is present | an aircraft is detected):
P(A′|D) = (0.93 × 0.05) / (0.07 × 0.98 + 0.93 × 0.05) = 0.0465 / 0.1151 ≈ 0.404
b) P(an aircraft is present | an aircraft is detected):
P(A|D) = (0.07 × 0.98) / 0.1151 = 0.0686 / 0.1151 ≈ 0.596
c) P(an aircraft is present | no aircraft is detected):
P(A|D′) = (0.07 × 0.02) / (0.07 × 0.02 + 0.93 × 0.95) = 0.0014 / 0.8849 ≈ 0.0016
d) P(no aircraft is present | no aircraft is detected):
P(A′|D′) = (0.93 × 0.95) / 0.8849 = 0.8835 / 0.8849 ≈ 0.998
Tree Diagram (figure: probability tree with branches for aircraft present/absent, each split by detected/not detected)
Bayes' Theorem: Example 5
A factory has two machines I and II. Machine I and II produce 30%
and 70% of items respectively. Further 3% of items produced by
Machine I are defective and 4% of items produced by Machine II are
defective. An item is drawn at random. If the drawn item is defective,
find the probability that it was produced by Machine II.
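The deck does not include the solution for this example; applying Bayes' theorem to the stated figures gives:
$$P(\text{II} \mid D) = \frac{P(\text{II})\, P(D \mid \text{II})}{P(\text{I})\, P(D \mid \text{I}) + P(\text{II})\, P(D \mid \text{II})} = \frac{0.70 \times 0.04}{0.30 \times 0.03 + 0.70 \times 0.04} = \frac{0.028}{0.037} = \frac{28}{37} \approx 0.757$$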
Bayes' Theorem: Example 6
Amy has two bags. Bag I has 7 red and 4 blue balls and bag II has 5 red and 9 blue
balls. Amy draws a ball at random and it turns out to be red. Determine the
probability that the ball was from bag I using Bayes' theorem.
Let X and Y be the events that the ball is from
bag I and bag II, respectively. Assume A to
be the event of drawing a red ball. We know that
the probability of choosing a bag for drawing a
ball is 1/2, that is,
P(X) = P(Y) = 1/2
Since there are 7 red balls out of a total of 11
balls in bag I, P(drawing a red ball
from bag I) = P(A|X) = 7/11.
Similarly, P(drawing a red ball from bag II) =
P(A|Y) = 5/14.
We need to determine P(the ball drawn is from bag I
given that it is a red ball), that is, P(X|A):
P(X|A) = P(X) · P(A|X) / [P(X) · P(A|X) + P(Y) · P(A|Y)]
= (1/2)(7/11) / [(1/2)(7/11) + (1/2)(5/14)] = 98/153 ≈ 0.64
Bayes' Theorem: Example 7
Assume that the chance of a person having a skin disease is 40%. Using skin creams
and drinking enough water reduces the risk of skin disease by 30%, and a prescription
of a certain drug reduces its chance by 20%. A patient can choose any one of the two
options with equal probability. It is given that, after picking one of the options, the
patient selected at random has the skin disease. Find the probability that the patient
picked the option of skin creams and drinking enough water, using Bayes' theorem.
Assume E1: the patient uses skin creams and
drinks enough water;
E2: the patient uses the drug;
A: the selected patient has the skin disease.
P(E1) = P(E2) = 1/2
Using the probabilities known to us, we have
P(A|E1) = 0.4 × (1 − 0.3) = 0.28
P(A|E2) = 0.4 × (1 − 0.2) = 0.32
By Bayes' theorem,
P(E1|A) = P(E1) · P(A|E1) / [P(E1) · P(A|E1) + P(E2) · P(A|E2)]
= (0.5 × 0.28) / (0.5 × 0.28 + 0.5 × 0.32) = 0.28 / 0.60 = 7/15 ≈ 0.467
Bayes' Theorem: Example 9
Three persons A, B and C have applied for a job in a private company. The chance of
their selections is in the ratio 1 : 2 : 4. The probabilities that A, B and C can
introduce changes to improve the profits of the company are 0.8, 0.5 and 0.3,
respectively. If the change does not take place, find the probability that it is due to
the appointment of C.
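The deck omits the solution; with selection probabilities 1/7, 2/7, 4/7 and "no change" probabilities 0.2, 0.5, 0.7, Bayes' theorem gives:
$$P(C \mid \text{no change}) = \frac{\frac{4}{7} \times 0.7}{\frac{1}{7} \times 0.2 + \frac{2}{7} \times 0.5 + \frac{4}{7} \times 0.7} = \frac{2.8}{0.2 + 1.0 + 2.8} = \frac{2.8}{4.0} = 0.7$$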
Random Variables
A random variable is a variable whose value depends on the outcome of a
random event; formally, it is a function from the sample space Ω to the set of
real numbers.
It can be thought of as a stochastic process that generates numbers with a
certain probability distribution.
The word "stochastic" means 'probabilistic' or 'randomly generated'.
A sequence of results is assumed to be generated by some underlying
probability distribution.
Example: Suppose the events are those that result from tossing two dice. Then
we could define a discrete random variable X that is the sum of their faces,
taking values in S = {2, . . . , 12}.
Probability Mass Function
The probability mass function (pmf) for a random variable X gives the
probability that the random variable has different numeric values:
$$p(x) = P(X = x)$$
If a random variable X is distributed according to the pmf p(x), then we will
write X ~ p(x).
For a discrete random variable, $\sum_{x} p(x) = 1$.
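A minimal Python sketch of this pmf, built by counting the 36 equally likely outcomes (variable names are ours):

```python
from fractions import Fraction
from itertools import product
from collections import Counter

# Build the pmf of X = sum of two dice by counting outcomes.
counts = Counter(a + b for a, b in product(range(1, 7), repeat=2))
pmf = {x: Fraction(c, 36) for x, c in sorted(counts.items())}

print(pmf[2], pmf[7], pmf[12])   # 1/36 1/6 1/36
assert sum(pmf.values()) == 1    # a pmf sums to 1
```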
Expectation
The expectation is the mean or average of a random variable X:
$$E[X] = \sum_{x} x\, p(x)$$
Example: If X is the number shown when rolling one fair die,
$$E[X] = \sum_{x=1}^{6} x \cdot \frac{1}{6} = \frac{21}{6} = 3.5$$
 Notice that the expected value is not one of the possible outcomes: you can't roll a 3.5.
 However, if you average the outcomes of a large number of rolls, the result approaches
3.5.
This is the expected average found by totaling up a large number of throws
of the die and dividing by the number of throws.
If we define X as a random variable denoting the sum of two dice, then
E[X] = E[X₁] + E[X₂] = 3.5 + 3.5 = 7.
Variance
The variance measures how much a random variable deviates, on average, from its expectation:
$$\mathrm{Var}(X) = E[(X - E[X])^2] = E[X^2] - (E[X])^2$$
Standard Deviation
The standard deviation is the square root of the variance: $\sigma = \sqrt{\mathrm{Var}(X)}$.
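The two-dice pmf from above can be used to check these quantities numerically; a small sketch (names are ours):

```python
from fractions import Fraction
from itertools import product
from collections import Counter
import math

# pmf of X = sum of two dice, as before.
counts = Counter(a + b for a, b in product(range(1, 7), repeat=2))
pmf = {x: Fraction(c, 36) for x, c in counts.items()}

# E[X] = sum_x x p(x);  Var(X) = E[X^2] - (E[X])^2
E = sum(x * p for x, p in pmf.items())
E2 = sum(x * x * p for x, p in pmf.items())
var = E2 - E ** 2
sd = math.sqrt(var)

print(E, var, round(sd, 3))   # 7 35/6 2.415
```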
Joint and conditional distributions
• Let X and Y be two random variables. The probability distribution that
defines their simultaneous behavior is referred to as a joint probability
distribution.
• The joint probability mass function for two discrete random variables X,
Y is:
$$p(x, y) = P(X = x, Y = y)$$
Marginal Distribution
• Related to a joint pmf are marginal pmfs, which total up the probability
masses for the values of each variable separately:
$$p_X(x) = \sum_{y} p(x, y), \qquad p_Y(y) = \sum_{x} p(x, y)$$
• Conditional pmf in terms of the joint distribution:
$$p_{X \mid Y}(x \mid y) = \frac{p(x, y)}{p_Y(y)} \quad \text{for } p_Y(y) > 0$$
• A chain rule in terms of random variables, for instance:
$$p(w, x, y, z) = p(w)\, p(x \mid w)\, p(y \mid w, x)\, p(z \mid w, x, y)$$
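A small sketch illustrating joint, marginal, and conditional pmfs for X = value of the first die and Y = sum of the two dice (this choice of variables is illustrative, not from the slides):

```python
from fractions import Fraction
from itertools import product

# Build the joint pmf p(x, y) by counting the 36 equally likely rolls.
joint = {}
for a, b in product(range(1, 7), repeat=2):
    key = (a, a + b)                  # (first die, sum)
    joint[key] = joint.get(key, Fraction(0)) + Fraction(1, 36)

# Marginal pmf of Y: total up the joint masses over x.
pY = {}
for (x, y), p in joint.items():
    pY[y] = pY.get(y, Fraction(0)) + p

# Conditional pmf p(x | Y = 7) = p(x, 7) / pY(7).
cond = {x: joint[(x, 7)] / pY[7] for x in range(1, 7)}
print(pY[7], cond[1])   # 1/6 1/6: given sum 7, each first-die value is equally likely
```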
Determining P
Estimation:
• So far we have been assuming a probability function P and giving it the
obvious definition for simple examples with coins and dice.
• But what do we do when dealing with language? What do we say about
the probability of a sentence like The cow chewed its cud?
• In general, for language events, unlike dice, P is unknown. This means
we have to estimate P.
Relative Frequency
• The proportion of times a certain outcome occurs is called the relative
frequency of the outcome.
• If C(u) is the number of times an outcome u occurs in N trials, then
f(u) = C(u)/N is the relative frequency of u.
• The relative frequency is often denoted f(u) and used as an estimate of P(u).
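A minimal sketch of relative-frequency estimation over a toy corpus (the corpus and variable names are illustrative):

```python
from collections import Counter

# A toy corpus; in practice N would be very large.
corpus = "the cow chewed its cud the cow slept".split()
N = len(corpus)
counts = Counter(corpus)

# Relative frequency f(u) = C(u)/N as an estimate of P(u).
rel_freq = {u: c / N for u, c in counts.items()}
print(rel_freq["the"], rel_freq["cow"])   # 0.25 0.25
```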
Nonparametric (or Distribution-Free) Approach
For example, if we assume our data is binomially distributed, but in fact the
data looks nothing like a binomial distribution, then our probability estimates
might be wildly wrong.
For such cases, one can use methods that make no assumptions about the
underlying distribution of the data, or will work reasonably well for a
wide variety of different distributions.
This is referred to as a nonparametric or distribution-free approach.
If we simply empirically estimate P by counting a large number of
random events (giving us a discrete distribution, though we might
produce a continuous distribution from such data by interpolation,
assuming only that the estimated probability density function should
be a fairly smooth curve), then this is a nonparametric method.
 Non-parametric methods are used in automatic classification when the
underlying distribution of the data is unknown
Binomial Distribution
• Binomial distribution is a statistical probability distribution that summarizes the
likelihood that a value will take one of two independent values under a given set of
parameters or assumptions.
• The underlying assumptions of binomial distribution are that there is only one
outcome for each trial, that each trial has the same probability of success, and that
each trial is mutually exclusive or independent of one another.
• Binomial distribution is a common discrete distribution used in statistics, as opposed
to a continuous distribution, such as the normal distribution.
A binomial probability distribution results from a procedure that meets all the
following requirements:
1. The procedure has a fixed number of trials.
2. The trials must be independent. (The outcome of any individual trial doesn’t affect the
probabilities in the other trials.)
3. Each trial must have all outcomes classified into two categories (commonly referred to
as success and failure).
4. The probability of a success remains the same in all trials.
Example
A tossed coin shows a 'head' or 'tail', a manufactured item can
be 'defective' or 'non-defective', the response
to a question might be 'yes' or 'no', an egg has 'hatched' or 'not
hatched', the decision is 'yes' or 'no', etc.
It is customary to call one of the outcomes a ‘success’ and the
other ‘not success’ or ‘failure’.
For example, in tossing a coin, if the occurrence of the head is
considered a success, then occurrence of tail is a failure.
Example
Each time we toss a coin or roll a die or perform any other experiment, we call it a trial.
If a coin is tossed, say, 4 times, the number of trials is 4, each having exactly two outcomes,
namely, success or failure.
 The outcome of any trial is independent of the outcome of any other trial.
 In each of such trials, the probability of success or failure remains constant. Such independent
trials which have only two outcomes usually referred as ‘success’ or ‘failure’ are called Bernoulli
trials.
For example, throwing a die 50 times is a case of 50 Bernoulli trials, in which each trial results
in success (say, an even number) or failure (an odd number), and the probability of success (p) is
the same for all 50 throws. Obviously, the successive throws of the die are independent
experiments. If the die is fair and has the six numbers 1 to 6 written on its faces, then p = 1/2.
Example:
Six balls are drawn successively from an urn containing 7 red and 9 black balls.
Tell whether or not the trials of drawing balls are Bernoulli trials when after each
draw the ball drawn is
(i) replaced (ii) not replaced in the urn.
Solution
(i) The number of trials is finite. When the drawing is done with replacement, the
probability of success (say, red ball) is p = 7/16, which is the same for all six trials
(draws). Hence, the drawing of balls with replacement constitutes Bernoulli trials.
(ii) When the drawing is done without replacement, the probability of success (i.e.,
red ball) in the first trial is 7/16; in the 2nd trial it is 6/15 if the first ball drawn is red,
or 7/15 if the first ball drawn is black; and so on. Clearly, the probability of success
is not the same for all trials, hence the trials are not Bernoulli trials.
The Binomial Probability Formula
$$P(x) = \frac{n!}{(n-x)!\, x!}\; p^x\, q^{n-x}, \qquad x = 0, 1, 2, \ldots, n$$
where
n = number of trials
x = number of successes among n trials
p = probability of success in any one trial
q = probability of failure in any one trial (q = 1 – p)
Example
A die is thrown 6 times. If ‘getting an odd number’ is a success, what is the
probability of
(i) 5 successes? (ii) At least 5 successes? (iii) At most 5 successes?
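The deck leaves this exercise unsolved; a small sketch applying the binomial formula with n = 6 and p = 1/2 (success = odd number); the helper name binom_pmf is ours:

```python
from math import comb
from fractions import Fraction

def binom_pmf(x, n, p):
    """Binomial probability P(x) = C(n, x) p^x q^(n-x)."""
    q = 1 - p
    return comb(n, x) * p**x * q**(n - x)

n, p = 6, Fraction(1, 2)   # six throws; success = odd number, p = 1/2

p5 = binom_pmf(5, n, p)                  # (i) exactly 5 successes
p_at_least_5 = p5 + binom_pmf(6, n, p)   # (ii) 5 or 6 successes
p_at_most_5 = 1 - binom_pmf(6, n, p)     # (iii) complement of "all 6 succeed"

print(p5, p_at_least_5, p_at_most_5)     # 3/32 7/64 63/64
```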
Multinomial Distribution
The generalization of a binomial trial to the case where each of the trials has
more than two basic outcomes is called a multinomial experiment, and is
modeled by the multinomial distribution.
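For reference, the multinomial pmf takes the following standard form (not shown in the deck):
$$P(X_1 = x_1, \ldots, X_k = x_k) = \frac{n!}{x_1!\, x_2! \cdots x_k!}\; p_1^{x_1} p_2^{x_2} \cdots p_k^{x_k}, \qquad \sum_{i=1}^{k} x_i = n, \quad \sum_{i=1}^{k} p_i = 1$$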
Normal distribution
Data can be "distributed" (spread out) in different ways.
In many cases the data tends to be around a
central value with no bias left or right, and it
gets close to a "Normal Distribution"

The normal distribution has the probability density function
$$n(x; \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\; e^{-(x-\mu)^2 / (2\sigma^2)}$$
The curve where μ = 0 and σ = 1 is referred to as the standard normal distribution.
Gaussian
• While it is much better to refer to such a curve as a ‘normal distribution’ than as a
‘bell curve,’ if you really want to fit into the Statistical NLP or pattern
recognition communities, you should instead learn to refer to these functions as
Gaussians
• In much of statistics, the discrete binomial distribution is approximated by the
continuous normal distribution - one can see the basic similarity in the
shapes of the curves.
• Such an approximation is acceptable when both basic outcomes have a reasonable
probability of occurring or the amount of data is very large
• But, in natural language, events like occurrences of the phrase shade tree
mechanics are so rare, that even if you have a huge amount of text, there will be
a significant difference between the appropriate binomial curve and the
approximating normal curve, and so use of normal approximations can be
unwise
• Gaussians are often used in clustering
Bayesian Statistics
Maximum Likelihood Estimate
• Suppose one takes a coin and tosses it 10 times, and gets 8 heads. Then from a
frequentist point of view, the result is that this coin comes down heads 8
times out of 10. This is what is called the maximum likelihood estimate
• If one has looked the coin over, and there doesn't seem to be anything wrong with it,
one would be very reluctant to accept this estimate.
• Rather, one would tend to think that the coin would come down equally head
and tails over the long run, and getting 8 heads out of 10 is just the kind of thing
that happens sometimes given a small sample
Prior Belief
• A prior belief is the degree of belief one holds before seeing the evidence; one may
maintain it even in the face of some apparent evidence against it.
• Bayesian statistics measure degrees of belief, and are calculated by starting with
prior beliefs and updating them in the face of evidence, by use of Bayes' theorem.
Bayesian Updating
• In the general case where data come in sequentially and we can reasonably assume
independence between them, we start off with an a priori probability
distribution,
• when a new datum comes in, we can update our beliefs by calculating the
maximum of the a posteriori distribution, what is sometimes referred to as
the MAP probability.
• This then becomes the new prior, and the process repeats on each new datum. This
process is referred to as Bayesian updating
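A minimal Python sketch of this updating loop for the coin example above, assuming, as our own illustrative choice, a uniform prior over a discrete grid of candidate weightings:

```python
# Sequential Bayesian updating of a coin's heads-probability.
# The data (8 heads, 2 tails) matches the coin example earlier.
weights = [i / 100 for i in range(1, 100)]        # candidate values of P(heads)
prior = {m: 1 / len(weights) for m in weights}    # uniform a priori distribution

data = ["H"] * 8 + ["T"] * 2
for outcome in data:
    # Multiply each prior by the likelihood of the new datum, then renormalize.
    post = {m: p * (m if outcome == "H" else 1 - m) for m, p in prior.items()}
    z = sum(post.values())
    prior = {m: p / z for m, p in post.items()}   # posterior becomes the new prior

map_m = max(prior, key=prior.get)                 # MAP estimate
print(map_m)   # 0.8 -- equal to the maximum likelihood estimate, since the prior was uniform
```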
Bayesian Decision Theory
• Bayes' theorem can also be used to evaluate which model or family of models better
explains some data.
• One hypothesis is a family of models, with a parameter representing the weighting of the coin.
• An alternative theory is that at each step someone is tossing two fair coins,
and calling out "tails" if both of them come down tails, and "heads" otherwise
(so that the probability of heads is 3/4).
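A small sketch comparing how well the two theories explain 8 heads in 10 tosses by their likelihoods; taking m = 0.8 as the best-fitting weighted-coin model is our assumption:

```python
from math import comb

def binom(n, x, p):
    """Binomial probability of x successes in n trials with success probability p."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

n, heads = 10, 8

# Family of weighted-coin models: the best member has m = 8/10.
lik_weighted = binom(n, heads, 0.8)

# Alternative theory: two fair coins, "tails" only when both land tails,
# so P(heads) = 3/4 with no free parameter.
lik_two_coins = binom(n, heads, 0.75)

print(round(lik_weighted, 4), round(lik_two_coins, 4))  # ~0.302 vs ~0.2816
# The likelihood ratio alone slightly favors the weighted coin, but a prior
# favoring the simpler two-fair-coins story could reverse the decision.
```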
