NLP Chapter 2
Language Processing
• Efficiency
– time to construct the model
– time to use the model
• Robustness: handling noise and missing values
• Scalability: efficiency in disk-resident databases
• Interpretability:
– understandability of, and insight provided by, the model
• Compactness of the model: e.g., the number of rules.
Evaluation Method…
• Holdout set: The available data set D is divided into
two disjoint subsets,
– the training set Dtrain (for learning a model)
– the test set Dtest (for testing the model)
• Important: training set should not be used in testing
and the test set should not be used in learning.
– An unseen test set provides an unbiased estimate of accuracy.
• The test set is also called the holdout set. (the
examples in the original data set D are all labeled with
classes.)
• This method is mainly used when the data set D is
large.
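A minimal sketch of a holdout split, assuming a hypothetical labeled data set of (example, class) pairs and an illustrative 70/30 split ratio:

```python
import random

# Hypothetical labeled data set D: (example, class) pairs.
data = [(x, x % 2) for x in range(100)]
random.seed(0)                      # reproducible shuffle
random.shuffle(data)

split = int(0.7 * len(data))        # e.g. a 70/30 split
d_train, d_test = data[:split], data[split:]

# The two subsets are disjoint: test examples are unseen during training.
assert not set(d_train) & set(d_test)
```

Shuffling before splitting matters: if D is sorted by class, an unshuffled split would give train and test sets with very different class distributions.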
Evaluation Method…
• n-fold cross-validation:
– The available data is partitioned into n equal-size
disjoint subsets.
– Use each subset as the test set and combine the remaining n-1
subsets as the training set to learn a classifier.
– The procedure is run n times, giving n accuracies.
– The final estimated accuracy of learning is the average
of the n accuracies.
– 10-fold and 5-fold cross-validations are commonly
used.
– This method is used when the available data is not
large.
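The procedure above can be sketched in a few lines; `train` and `evaluate` below are hypothetical placeholders for a real learning algorithm and accuracy function:

```python
# A sketch of n-fold cross-validation.
def n_fold_cv(data, n, train, evaluate):
    folds = [data[i::n] for i in range(n)]   # n roughly equal-size disjoint subsets
    accuracies = []
    for i in range(n):
        test_set = folds[i]                  # one fold held out for testing
        train_set = [x for j, fold in enumerate(folds) if j != i for x in fold]
        model = train(train_set)
        accuracies.append(evaluate(model, test_set))
    return sum(accuracies) / n               # average of the n accuracies

# Exercising the loop with trivial placeholder functions:
acc = n_fold_cv(list(range(20)), 5, lambda d: None, lambda m, t: 1.0)
```

Each example appears in exactly one test fold, so every example is used for testing exactly once and for training n-1 times.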
Evaluation Method…
• Leave-one-out cross-validation: This
method is used when the data set is very
small.
• It is a special case of cross-validation.
• Each fold of the cross-validation has only a
single test example, and all the rest of the
data is used in training.
• If the original data has m examples, this is
m-fold cross-validation.
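As a sketch, leave-one-out is just the m-fold loop written out directly (again with hypothetical `train` and `evaluate` placeholders):

```python
# Leave-one-out CV: m-fold CV where each fold holds a single example.
def loocv(data, train, evaluate):
    scores = []
    for i, example in enumerate(data):
        train_set = data[:i] + data[i + 1:]   # all but the held-out example
        model = train(train_set)
        scores.append(evaluate(model, example))
    return sum(scores) / len(data)
```

Note the cost: the learner is retrained m times, which is why this method is reserved for very small data sets.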
Evaluation Method…
• Validation set: the available data is divided into three
subsets,
– a training set,
– a validation set and
– a test set.
• A validation set is used frequently for estimating
parameters in learning algorithms.
• In such cases, the values that give the best accuracy on
the validation set are used as the final parameter
values.
• Cross-validation can be used for parameter estimation
as well.
Evaluation Method…
• Precision and recall measures
– Used in information retrieval and text classification.
– We use a confusion matrix to introduce them.
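The two measures follow directly from the confusion matrix counts; the counts below are hypothetical:

```python
# Precision and recall from a 2x2 confusion matrix (hypothetical counts).
tp, fp = 40, 10    # predicted positive: truly positive / truly negative
fn, tn = 20, 30    # predicted negative: truly positive / truly negative

precision = tp / (tp + fp)   # fraction of predicted positives that are correct
recall    = tp / (tp + fn)   # fraction of true positives that are found
# precision = 0.8, recall ≈ 0.667
```

Precision asks "of what I retrieved, how much is relevant?"; recall asks "of what is relevant, how much did I retrieve?"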
Evaluation Method…
Introduction
You would like to determine how likely it is that
the patient is infected with inhalational
anthrax given that the patient has a
cough, a fever, and difficulty breathing.
Bayesian Networks
[Figure: a Bayesian network for the anthrax example, including a
HasAnthrax node]
Probability Primer: Random Variables
• A random variable is the basic element of
probability
• It refers to an event, and there is some degree
of uncertainty as to the outcome of the event
• For example, the random variable A could
be the event of getting a heads on a coin flip
Boolean Random Variables
• A Boolean random variable takes one of the two
values true or false
Probabilities
We will write P(A = true) to mean the probability that A = true.
What is probability? It is the relative frequency with which an outcome
would be obtained if the process were repeated a large number of times
under similar conditions.*

* Ahem…there’s also the Bayesian definition, which says probability is
your degree of belief in an outcome.
Conditional Probability
• P(A = true | B = true) = Out of all the outcomes in which B
is true, how many also have A equal to true
• Read this as: “Probability of A conditioned on B” or
“Probability of A given B”
H = “Have a headache”
F = “Coming down with Flu”

P(H = true) = 1/10
P(F = true) = 1/40
P(H = true | F = true) = 1/2

In general, P(X|Y) = P(X,Y) / P(Y)
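The definition can be checked numerically with the headache/flu numbers above:

```python
# Checking P(X | Y) = P(X, Y) / P(Y) with the headache/flu example.
p_f = 1 / 40          # P(F = true)
p_h_given_f = 1 / 2   # P(H = true | F = true)

# Rearranging the definition gives the joint: P(H, F) = P(H | F) * P(F).
p_h_and_f = p_h_given_f * p_f   # = 1/80
assert p_h_and_f / p_f == p_h_given_f
```

Note how much larger P(H | F) = 1/2 is than P(H) = 1/10: knowing the patient has the flu greatly raises the probability of a headache.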
The Joint Probability Distribution
• Joint probabilities can be between any number of variables,
  e.g. P(A = true, B = true, C = true)
• For each combination of variables, we need to say how probable that
  combination is
• The probabilities of these combinations need to sum to 1

A     B     C     P(A,B,C)
false false false 0.1
false false true  0.2
false true  false 0.05
false true  true  0.05
true  false false 0.3
true  false true  0.1
true  true  false 0.05
true  true  true  0.15
                  (sums to 1)
The Joint Probability Distribution
• Once you have the joint probability distribution, you can calculate
  any probability involving A, B, and C

A     B     C     P(A,B,C)
false false false 0.1
false false true  0.2
false true  false 0.05
false true  true  0.05
true  false false 0.3
true  false true  0.1
true  true  false 0.05
true  true  true  0.15

Examples of things you can compute:
• P(A=true) = sum of P(A,B,C) in rows with A=true
• P(A=true, B=true | C=true) =
  P(A=true, B=true, C=true) / P(C=true)
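Both example computations can be carried out directly on the table; the dictionary below simply mirrors the table above:

```python
# Computing probabilities from the joint distribution table.
joint = {  # (A, B, C) -> P(A, B, C)
    (False, False, False): 0.1,  (False, False, True): 0.2,
    (False, True,  False): 0.05, (False, True,  True): 0.05,
    (True,  False, False): 0.3,  (True,  False, True): 0.1,
    (True,  True,  False): 0.05, (True,  True,  True): 0.15,
}

p_a = sum(p for (a, b, c), p in joint.items() if a)   # P(A=true) = 0.6
p_c = sum(p for (a, b, c), p in joint.items() if c)   # P(C=true) = 0.5
p_ab_given_c = joint[(True, True, True)] / p_c        # P(A=true, B=true | C=true)
```

Marginalizing (summing out) variables and conditioning (dividing by a marginal) are the only two operations needed to answer any query from the full joint.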
The Problem with the Joint Distribution
• Lots of entries in the table to fill up!
• For k Boolean random variables, you need a table of size 2^k
• How do we use fewer numbers? Need the concept of independence
Independence
Variables A and B are independent if P(A, B) = P(A) * P(B);
equivalently, P(A | B) = P(A), i.e., knowing B tells you nothing
about A.

Conditional Independence
Variables A and B are conditionally independent given C if
P(A, B | C) = P(A | C) * P(B | C); equivalently, P(A | B, C) = P(A | C),
i.e., once C is known, B tells you nothing more about A.
A Bayesian Network
A Bayesian network is made up of:
1. A Directed Acyclic Graph
2. A set of conditional probability tables, one per node

[Figure: DAG with nodes A, B, C, D and edges A → B, B → C, B → D]
A Set of Tables for Each Node
Each node Xi has a conditional probability distribution P(Xi | Parents(Xi))
that quantifies the effect of the parents on the node.

The parameters are the probabilities in these conditional probability
tables (CPTs):

A     P(A)
false 0.6
true  0.4

A     B     P(B|A)
false false 0.01
false true  0.99
true  false 0.7
true  true  0.3

B     C     P(C|B)
false false 0.4
false true  0.6
true  false 0.9
true  true  0.1

B     D     P(D|B)
…

For a given combination of values of the parents (B in this example),
the entries for P(C=true | B) and P(C=false | B) must add up to 1,
e.g. P(C=true | B=false) + P(C=false | B=false) = 1.

If you have a Boolean variable with k Boolean parents, this table has 2^k
probabilities.
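The normalization constraint can be checked mechanically; this sketch hard-codes the P(C | B) table from the slide:

```python
# Each CPT row must sum to 1; checking with the P(C | B) table.
p_c_given_b = {  # (B, C) -> P(C | B)
    (False, False): 0.4, (False, True): 0.6,
    (True,  False): 0.9, (True,  True): 0.1,
}
for b in (False, True):
    row_sum = p_c_given_b[(b, False)] + p_c_given_b[(b, True)]
    assert abs(row_sum - 1.0) < 1e-9   # P(C=true|b) + P(C=false|b) = 1
```

This is also why 2^k probabilities suffice for a Boolean node with k Boolean parents: the P(false | parents) entries are determined by the P(true | parents) entries.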
Bayesian Networks
Two important properties:
1. Encodes the conditional independence
relationships between the variables in the
graph structure
2. Is a compact representation of the joint
probability distribution over the variables
Conditional Independence
The Markov condition: given its parents (P1, P2),
a node (X) is conditionally independent of its
non-descendants (ND1, ND2).

[Figure: parents P1 and P2 point to X; X points to children C1 and C2;
ND1 and ND2 are non-descendants of X]
The Joint Probability Distribution
P(X1, …, Xn) = P(X1 | Parents(X1)) * … * P(Xn | Parents(Xn))

where Parents(Xi) means the values of the parents of the node Xi with
respect to the graph.
Using a Bayesian Network Example
Using the network in the example, suppose you want to
calculate:
P(A = true, B = true, C = true, D = true)
= P(A = true) * P(B = true | A = true) *
P(C = true | B = true) P( D = true | B = true)
= (0.4)*(0.3)*(0.1)*(0.95)
[Figure: the example network with edges A → B, B → C, B → D]
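The product above can be evaluated numerically (the four values come from the network's CPTs):

```python
# CPT entries from the example network (A -> B, B -> C, B -> D).
p_a = 0.4           # P(A = true)
p_b_given_a = 0.3   # P(B = true | A = true)
p_c_given_b = 0.1   # P(C = true | B = true)
p_d_given_b = 0.95  # P(D = true | B = true)

# Chain-rule factorization implied by the graph structure.
p_joint = p_a * p_b_given_a * p_c_given_b * p_d_given_b
# P(A=true, B=true, C=true, D=true) = 0.0114
```

Four numbers, one per node, are enough for this entry; the full joint table over four Boolean variables would need 2^4 = 16 entries.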
Using a Bayesian Network Example
In the calculation above,
P(A = true, B = true, C = true, D = true)
= P(A = true) * P(B = true | A = true) *
  P(C = true | B = true) * P(D = true | B = true),
the form of the factorization comes from the graph structure, while the
numbers (0.4, 0.3, 0.1, 0.95) come from the conditional probability
tables.
Inference
[Figure: inference in the anthrax network, computing the probability of
the HasAnthrax node given the observed symptoms]
The Bad News
• Exact inference is feasible in small to
medium-sized networks
• Exact inference in large networks takes a
very long time
• We resort to approximate inference
techniques which are much faster and give
pretty good results
One last unresolved issue…
We still haven’t said where we get the
Bayesian network from. There are two
options:
• Get an expert to design it
• Learn it from data
Assignment-1
1. Write a Python program that reads two
paragraphs of a document and then:
– segments them into a list of sentences and writes
the sentences to a secondary storage device
– segments these sentences into words and writes
the words to a secondary storage device
• Note:
– the list of words should be converted to
lowercase and be free of any numbers and
punctuation marks
Assignment-2
2.
– Take some paragraphs from any source.
– Write a Python program that displays the collocations (words
that frequently occur together) in them.
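A minimal sketch for Assignment 2 using raw bigram frequency as the notion of collocation (the sample text is hypothetical; NLTK's `BigramCollocationFinder` with an association measure such as PMI is a stronger alternative):

```python
from collections import Counter

# Hypothetical paragraphs; any text source would do.
text = ("machine learning is fun . machine learning is hard . "
        "deep learning uses machine learning .")
tokens = [w for w in text.lower().split() if w.isalpha()]

# Count adjacent word pairs; note this simple version crosses
# sentence boundaries.
bigrams = Counter(zip(tokens, tokens[1:]))
for pair, count in bigrams.most_common(3):
    print(" ".join(pair), count)
```

Raw frequency favors pairs of individually common words; association measures like pointwise mutual information correct for that by comparing the observed pair count with what independence would predict.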