
Machine learning for Natural Language Processing

Department of Computing Science
Institute of Technology
Jimma University
Contents
• Introduction
• Supervised and Unsupervised Machine Learning
• Bayesian Networks
Introduction
• Machine learning is like human learning from past experiences.
• A computer does not have “experiences”.
• A computer system learns from data, which represent some “past experiences” of an application domain.
• Machine learning is programming computers to
  – optimize a performance criterion using example data or past experience.
• Learning is the execution of a computer program to optimize the parameters of a model using the training data or past experience.
Introduction…
• The model may be
  – predictive: to make predictions in the future, or
  – descriptive: to gain knowledge from data, or both.
Introduction…
• Machine learning uses the theory of statistics in building mathematical models, because the core task is making inference from a sample.
• The role of computer science is twofold:
  – In training: efficient algorithms to solve the optimization problem and to store and process the massive amount of data.
  – After training: an efficient representation of, and algorithmic solution for, inference.
Introduction…
• Based on the type of data available, machine learning can be
  – Supervised, or
  – Unsupervised.
Supervised Machine Learning
• Supervised machine learning learns from examples.
  – Supervision: the data (observations, measurements, etc.) are labeled with pre-defined classes.
  – It is as if a “teacher” gives the classes (supervision).
  – Test data are classified into these classes too.
• The task is commonly called:
  – Supervised learning,
  – Classification, or
  – Inductive learning.
Supervised…
• Data: a set of data records (also called examples, instances or cases) described by
  – k attributes: A1, A2, …, Ak, and
  – a class: each example is labelled with a pre-defined class.
• Goal: to learn a classification model from the data that can be used to predict the classes of new (future, or test) cases/instances.
Example
• A credit card company receives thousands of applications for new cards.
• Each application contains information about an applicant:
  – age
  – marital status
  – annual salary
  – outstanding debts
  – credit rating
  – etc.
• Problem: to decide whether an application should be approved, i.e., to classify applications into two categories, approved and not approved.
Example…
(figure: a table of past loan application records, each labelled with its class: Yes (approved) or No (not approved))
Example…
• Learn a classification model from the data.
• Use the model to classify future loan applications into
  – Yes (approved) and
  – No (not approved).
• What is the class for the following case/instance?
Supervised learning process
• Two steps:
  – Learning (training): learn a model using the training data.
  – Testing: test the model using unseen test data to assess the model accuracy.
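
As a concrete illustration, here is a minimal sketch of the two steps using scikit-learn; the records, attributes and labels are hypothetical stand-ins for the loan application data, not values from the slides:

```python
# A minimal sketch of the two-step supervised learning process.
# Hypothetical records with attributes (age, has_job, own_house);
# class 1 = approved, 0 = not approved.
from sklearn.tree import DecisionTreeClassifier

train_X = [[25, 0, 1], [40, 1, 0], [35, 1, 1], [50, 0, 0], [45, 1, 1]]
train_y = [1, 0, 1, 0, 1]

# Step 1: learning (training) - fit a model to the labelled training data
model = DecisionTreeClassifier().fit(train_X, train_y)

# Step 2: testing - classify unseen cases with the learned model
test_X = [[30, 1, 1], [55, 0, 0]]
print(model.predict(test_X))   # predicted classes for the new cases
```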
What do we mean by learning?
• Given
– a data set D,
– a task T, and
– a performance measure M,
• A computer system is said to learn from D
to perform the task T if after learning the
system’s performance on T improves as
measured by M.
• In other words, the learned model helps the
system to perform T better as compared to
no learning.
Example
• Data: loan application data.
• Task: predict whether a loan should be approved or not.
• Performance measure: accuracy.
• No learning: classify all future applications (test data) to the majority class (i.e., Yes):
      Accuracy = 9/15 = 60%.
• We can do better than 60% with learning.
Fundamental assumption of learning
• Assumption: the distribution of training examples is identical to the distribution of test examples (including future unseen examples).
• In practice, this assumption is often violated to some degree.
• Strong violations will clearly result in poor classification accuracy.
• To achieve good accuracy on the test data, training examples must be sufficiently representative of the test data.
Evaluation Method
• Predictive accuracy:
      Accuracy = number of correctly classified test cases / total number of test cases
• Efficiency
  – time to construct the model
  – time to use the model
• Robustness: handling noise and missing values
• Scalability: efficiency for disk-resident databases
• Interpretability: the understandability of, and insight provided by, the model
• Compactness of the model: e.g., the number of rules.
Evaluation Method…
• Holdout set: the available data set D is divided into two disjoint subsets,
  – the training set Dtrain (for learning a model), and
  – the test set Dtest (for testing the model).
• Important: the training set should not be used in testing and the test set should not be used in learning.
  – An unseen test set provides an unbiased estimate of accuracy.
• The test set is also called the holdout set. (The examples in the original data set D are all labeled with classes.)
• This method is mainly used when the data set D is large.
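
A sketch of the holdout method with scikit-learn; the iris data set below is only a placeholder for a real data set D:

```python
# Holdout evaluation: split D into disjoint Dtrain and Dtest, learn on
# Dtrain only, and estimate accuracy on the unseen Dtest.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)            # placeholder for D
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)    # 70% train, 30% holdout

model = DecisionTreeClassifier().fit(X_train, y_train)
print("Holdout accuracy:", accuracy_score(y_test, model.predict(X_test)))
```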
Evaluation Method…
• n-fold cross-validation:
  – The available data is partitioned into n equal-size disjoint subsets.
  – Each subset in turn is used as the test set, and the remaining n-1 subsets are combined as the training set to learn a classifier.
  – The procedure is run n times, which gives n accuracies.
  – The final estimated accuracy of learning is the average of the n accuracies.
  – 10-fold and 5-fold cross-validation are commonly used.
  – This method is used when the available data is not large.
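
A sketch of 10-fold cross-validation with scikit-learn; the iris data set is again only a placeholder:

```python
# n-fold cross-validation (n = 10): each fold serves once as the test
# set while the other 9 folds train the classifier.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)            # placeholder data set
scores = cross_val_score(DecisionTreeClassifier(), X, y, cv=10)
print("Per-fold accuracies:", scores)
print("Estimated accuracy:", scores.mean())  # average of the n accuracies
```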
Evaluation Method…
• Leave-one-out cross-validation: this method is used when the data set is very small.
• It is a special case of cross-validation.
• Each fold of the cross-validation has only a single test example, and all the rest of the data is used in training.
• If the original data has m examples, this is m-fold cross-validation.
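
The same idea expressed with scikit-learn's LeaveOneOut splitter (feasible here only because the placeholder data set is small):

```python
# Leave-one-out cross-validation: m folds for m examples, each fold
# testing on a single held-out case.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, LeaveOneOut
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)            # placeholder data set
scores = cross_val_score(DecisionTreeClassifier(), X, y, cv=LeaveOneOut())
print("LOOCV accuracy:", scores.mean())
```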
Evaluation Method…
• Validation set: the available data is divided into three subsets,
  – a training set,
  – a validation set, and
  – a test set.
• A validation set is often used for estimating parameters in learning algorithms.
• In such cases, the parameter values that give the best accuracy on the validation set are used as the final parameter values.
• Cross-validation can be used for parameter estimation as well.
Evaluation Method…
• Precision and recall measures
  – Used in information retrieval and text classification.
  – We use a confusion matrix to introduce them:

                         Classified positive    Classified negative
        Actual positive         TP                     FN
        Actual negative         FP                     TN

  where TP, FN, FP and TN are the numbers of true positives, false negatives, false positives and true negatives, respectively.
Evaluation Method…

• Precision p is the number of correctly classified positive examples divided by the total number of examples that are classified as positive:
      p = TP / (TP + FP)
• Recall r is the number of correctly classified positive examples divided by the total number of actual positive examples in the test set:
      r = TP / (TP + FN)
Evaluation Method…
• It is hard to compare two classifiers using two measures.
  – The F1 score combines precision and recall into one measure:
        F1 = 2pr / (p + r)
  – F1 is the harmonic mean of p and r, and the harmonic mean of two numbers tends to be closer to the smaller of the two.
  – For the F1 value to be large, both p and r must be large.
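
A small sketch computing the three measures from hypothetical predictions, both directly from the confusion-matrix counts and with scikit-learn:

```python
# Precision, recall and F1 for a made-up set of test predictions.
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 1, 1, 0, 0, 0, 1, 0]   # actual classes (1 = positive)
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]   # classifier output

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

p = tp / (tp + fp)           # precision
r = tp / (tp + fn)           # recall
f1 = 2 * p * r / (p + r)     # harmonic mean of p and r
print(p, r, f1)              # 0.75 0.75 0.75

# The same values via scikit-learn:
print(precision_score(y_true, y_pred),
      recall_score(y_true, y_pred),
      f1_score(y_true, y_pred))
```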
Unsupervised Machine Learning
• The data have no target attribute.
  – We want to explore the data to find some intrinsic structure in them.
• Example: clustering.
Clustering
• Clustering is a technique for finding similarity groups in data, called clusters, i.e.,
  – it groups data instances that are similar to (near) each other into one cluster, and puts data instances that are very different from (far away from) each other into different clusters.
• Clustering is often called an unsupervised learning task because no class values denoting an a priori grouping of the data instances are given, as is the case in supervised learning.
Clustering…
• (figure: a 2-D scatter plot in which the data set has three natural groups of data points, i.e., 3 natural clusters)
What is clustering for?
• Let us see some real-life examples.
• Example 1: group people of similar sizes together to make “small”, “medium” and “large” T-shirts.
  – Tailor-made for each person: too expensive.
  – One-size-fits-all: does not fit all.
• Example 2: in marketing, segment customers according to their similarities
  – to do targeted marketing.
Aspects of clustering
• A clustering algorithm
  – Partitional clustering
  – Hierarchical clustering
• A distance (similarity, or dissimilarity) function
• Clustering quality
  – Inter-cluster distance ⇒ maximized
  – Intra-cluster distance ⇒ minimized
• The quality of a clustering result depends on the algorithm, the distance function, and the application.
K-means clustering
• K-means is a partitional clustering algorithm.
• Let the set of data points (or instances) D be {x1, x2, …, xn}, where xi = (xi1, xi2, …, xir) is a vector in a real-valued space X ⊆ R^r, and r is the number of attributes (dimensions) in the data.
• The k-means algorithm partitions the given data into k clusters.
  – Each cluster has a cluster center, called the centroid.
  – k is specified by the user.
K-means algorithm
• Given k, the k-means algorithm works as follows:
  1) Randomly choose k data points (seeds) to be the initial centroids (cluster centers).
  2) Assign each data point to the closest centroid.
  3) Re-compute the centroids using the current cluster memberships.
  4) If a convergence criterion is not met, go to 2).
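
A minimal numpy sketch of this loop, using Euclidean distance and assuming no cluster ever becomes empty; sklearn.cluster.KMeans is a production-quality alternative:

```python
# A bare-bones k-means, following the four steps above.
import numpy as np

def kmeans(X, k, max_iter=100):
    rng = np.random.default_rng(0)
    # 1) randomly choose k data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # 2) assign each point to its closest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3) re-compute each centroid as the mean of its current members
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # 4) convergence criterion: the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# three blobs of made-up 2-D points
X = np.vstack([np.random.randn(20, 2) + c for c in ([0, 0], [5, 5], [0, 5])])
labels, centroids = kmeans(X, k=3)
print(centroids)
```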
K-means algorithm…
• (figures: a step-by-step illustration of the k-means iterations on an example data set)
Distance Function
• K-means commonly uses the Euclidean distance between a data point xi and a centroid cj:
      dist(xi, cj) = sqrt((xi1 - cj1)^2 + (xi2 - cj2)^2 + … + (xir - cjr)^2)
Strengths of k-means
• Strengths:
  – Simple: easy to understand and to implement.
  – Efficient: the time complexity is O(tkn), where n is the number of data points, k is the number of clusters, and t is the number of iterations.
  – Since both k and t are small, k-means is considered a linear algorithm.
• K-means is the most popular clustering algorithm.
Weakness of k-means
• The algorithm is only applicable if the mean is defined.
  – For categorical data, k-modes is used instead; the centroid is represented by the most frequent values.
• The user needs to specify k.
• The algorithm is sensitive to outliers.
  – Outliers are data points that are very far away from other data points.
  – Outliers could be errors in the data recording or some special data points with very different values.
Hierarchical Clustering
• Produces a nested sequence of clusters, a tree, also called a dendrogram.
• (figure: an example dendrogram)
Types of hierarchical clustering
• Agglomerative (bottom up) clustering: It builds the
dendrogram (tree) from the bottom level, and
– merges the most similar (or nearest) pair of clusters
– stops when all the data points are merged into a single
cluster (i.e., the root cluster).
• Divisive (top down) clustering: It starts with all data
points in one cluster, the root.
– Splits the root into a set of child clusters. Each child cluster
is recursively divided further
– stops when only singleton clusters of individual data points
remain, i.e., each cluster with only a single point
Agglomerative clustering
• It is more popular than divisive methods.
• At the beginning, each data point forms its own cluster (also called a node).
• Merge the nodes/clusters that have the least distance between them.
• Continue merging.
• Eventually all nodes belong to one cluster.
Agglomerative clustering algorithm
• (figures: pseudo-code of the algorithm and a step-by-step example of its working)
Measuring the distance of two clusters
• There are a few ways to measure the distance between two clusters, resulting in different variations of the algorithm:
  – Single link
  – Complete link
  – Average link
Single link method
• The distance between two clusters is the distance between the two closest data points in the two clusters, one data point from each cluster.
• It can find arbitrarily shaped clusters, but
  – it may cause the undesirable “chain effect” due to noisy points.
Complete link method
• The distance between two clusters is the distance between the two furthest data points in the two clusters.
• It is sensitive to outliers because they are far away.
Example
• Let’s examine how these linkage methods work, using a small, one-dimensional data set:
  – (figure: single-linkage agglomerative clustering on the sample data set)
  – (figure: complete-linkage agglomerative clustering on the sample data set)
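
A sketch of both linkage methods with scipy; the one-dimensional points are made up, since the slide's sample data set is not reproduced here:

```python
# Single vs. complete linkage on a hypothetical 1-D data set.
import numpy as np
from scipy.cluster.hierarchy import linkage

points = np.array([[1.0], [2.0], [5.0], [6.0], [12.0]])

# Each output row records one merge: (cluster i, cluster j,
# distance at which they merge, size of the new cluster).
print("single:\n", linkage(points, method="single"))
print("complete:\n", linkage(points, method="complete"))

# scipy.cluster.hierarchy.dendrogram(...) can draw the resulting tree.
```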


Reading Assignment
• Distance functions
– Euclidean distance
– Manhattan (city block) distance
– Minkowski distance
– Chebychev distance
– Cosine similarity
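
As a starting point for the reading, here is a short sketch of these distance functions on two example vectors (the values are arbitrary):

```python
# The distance functions listed above, written with only the stdlib.
import math

x = [1.0, 2.0, 3.0]
y = [4.0, 0.0, 3.0]

euclidean = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
manhattan = sum(abs(a - b) for a, b in zip(x, y))

def minkowski(p):
    # generalizes Euclidean (p = 2) and Manhattan (p = 1)
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

chebychev = max(abs(a - b) for a, b in zip(x, y))

norm = lambda v: math.sqrt(sum(c * c for c in v))
cosine = sum(a * b for a, b in zip(x, y)) / (norm(x) * norm(y))

print(euclidean, manhattan, minkowski(3), chebychev, cosine)
```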
Bayesian Network
Introduction
Suppose you are trying to determine if a patient has inhalational anthrax. You observe the following symptoms:
• The patient has a cough
• The patient has a fever
• The patient has difficulty breathing
Introduction
You would like to determine how likely it is that the patient is infected with inhalational anthrax, given that the patient has a cough, a fever, and difficulty breathing.
• We are not 100% certain that the patient has anthrax because of these symptoms.
• We are dealing with uncertainty!
Introduction

• Now suppose you order an x-ray and observe that the patient has a wide mediastinum.
• Your belief that the patient is infected with inhalational anthrax is now much higher.
Introduction

• In the previous slides, what you observed affected your belief that the patient is infected with anthrax.
• This is called reasoning with uncertainty.
• Wouldn’t it be nice if we had some methodology for reasoning with uncertainty? Why, in fact, we do…
Bayesian Networks
(figure: a Bayesian network in which the node HasAnthrax is the parent of the nodes HasCough, HasFever, HasDifficultyBreathing and HasWideMediastinum)

• In the opinion of many AI researchers, Bayesian networks are the most significant contribution in AI in the last 10 years.
• They are used in many applications, e.g., spam filtering, speech recognition, robotics, diagnostic systems and even syndromic surveillance.
Probability Primer: Random Variables
• A random variable is the basic element of probability.
• It refers to an event, and there is some degree of uncertainty as to the outcome of the event.
• For example, the random variable A could be the event of getting heads on a coin flip.
Boolean Random Variables

• We will start with the simplest type of random variables: Boolean ones.
• They take the values true or false.
• Think of the event as occurring or not occurring.
• Examples (let A be a Boolean random variable):
  – A = Getting heads on a coin flip
  – A = It will rain today
  – A = The Cubs win the World Series in 2007
Probabilities
We will write P(A = true) to mean the probability that A = true.
What is probability? It is the relative frequency with which an outcome would be obtained if the process were repeated a large number of times under similar conditions.*

(figure: a region split into two areas, P(A = true) and P(A = false); the sum of the two areas is 1)

* Ahem… there’s also the Bayesian definition, which says probability is your degree of belief in an outcome.
Conditional Probability
• P(A = true | B = true) = out of all the outcomes in which B is true, how many also have A equal to true.
• Read this as: “Probability of A conditioned on B” or “Probability of A given B”.
• Example:
  – H = “Have a headache”
  – F = “Coming down with flu”
  – P(H = true) = 1/10
  – P(F = true) = 1/40
  – P(H = true | F = true) = 1/2
• “Headaches are rare and flu is rarer, but if you’re coming down with flu there’s a 50-50 chance you’ll have a headache.”

(figure: a Venn diagram of the events H = true and F = true)
The Joint Probability Distribution

• We will write P(A = true, B = true) to mean “the probability of A = true and B = true”.
• Notice that:
      P(H = true | F = true) = P(H = true, F = true) / P(F = true)
• In general, P(X | Y) = P(X, Y) / P(Y).
The Joint Probability Distribution
• Joint probabilities can be defined over any number of variables, e.g., P(A = true, B = true, C = true).
• For each combination of variables, we need to say how probable that combination is.
• The probabilities of these combinations need to sum to 1:

      A      B      C      P(A,B,C)
      false  false  false  0.1
      false  false  true   0.2
      false  true   false  0.05
      false  true   true   0.05
      true   false  false  0.3
      true   false  true   0.1
      true   true   false  0.05
      true   true   true   0.15
                           (sums to 1)
The Joint Probability Distribution
• Once you have the joint probability distribution, you can calculate any probability involving A, B, and C (using the same table as on the previous slide).
• Examples of things you can compute:
  – P(A = true) = sum of P(A,B,C) over the rows with A = true
  – P(A = true, B = true | C = true) = P(A = true, B = true, C = true) / P(C = true)
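
A small sketch of these computations, with the joint table from the slide stored as a Python dict:

```python
# The joint distribution from the slide, keyed by the (A, B, C) values.
joint = {
    (False, False, False): 0.10, (False, False, True): 0.20,
    (False, True,  False): 0.05, (False, True,  True): 0.05,
    (True,  False, False): 0.30, (True,  False, True): 0.10,
    (True,  True,  False): 0.05, (True,  True,  True): 0.15,
}

# P(A = true): sum the rows with A = true
p_a = sum(p for (a, b, c), p in joint.items() if a)
print("P(A=true) =", p_a)                                            # 0.6

# P(A = true, B = true | C = true) = P(A,B,C all true) / P(C = true)
p_c = sum(p for (a, b, c), p in joint.items() if c)
print("P(A=true,B=true|C=true) =", joint[(True, True, True)] / p_c)  # 0.3
```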
The Problem with the Joint Distribution
• Lots of entries in the table to fill up!
• For k Boolean random variables, you need a table of size 2^k.
• How do we use fewer numbers? We need the concept of independence.
Independence

Variables A and B are independent if any of the following hold:
• P(A, B) = P(A) P(B)
• P(A | B) = P(A)
• P(B | A) = P(B)
This says that knowing the outcome of A does not tell me anything new about the outcome of B.
Independence

How is independence useful?
• Suppose you have n coin flips and you want to calculate the joint distribution P(C1, …, Cn).
• If the coin flips are not independent, you need 2^n values in the table.
• If the coin flips are independent, then
      P(C1, …, Cn) = P(C1) P(C2) … P(Cn)
  Each P(Ci) table has 2 entries, and there are n of them, for a total of only 2n values.
Conditional Independence

Variables A and B are conditionally independent given C if any of the following hold:
• P(A, B | C) = P(A | C) P(B | C)
• P(A | B, C) = P(A | C)
• P(B | A, C) = P(B | C)
Knowing C tells me everything about B. I don’t gain anything by knowing A (either because A doesn’t influence B or because knowing C provides all the information knowing A would give).
A Bayesian Network
A Bayesian network is made up of:
1. A directed acyclic graph

   (figure: the DAG A → B, with B → C and B → D)

2. A set of tables, one for each node in the graph:

      A      P(A)         A      B      P(B|A)
      false  0.6          false  false  0.01
      true   0.4          false  true   0.99
                          true   false  0.7
                          true   true   0.3

      B      C      P(C|B)       B      D      P(D|B)
      false  false  0.4          false  false  0.02
      false  true   0.6          false  true   0.98
      true   false  0.9          true   false  0.05
      true   true   0.1          true   true   0.95
A Directed Acyclic Graph

• Each node in the graph is a random variable.
• A node X is a parent of another node Y if there is an arrow from node X to node Y, e.g., A is a parent of B.

   (figure: the DAG A → B, with B → C and B → D)

• Informally, an arrow from node X to node Y means X has a direct influence on Y.
A Set of Tables for Each Node
• Each node Xi has a conditional probability distribution P(Xi | Parents(Xi)) that quantifies the effect of the parents on the node.
• The parameters are the probabilities in these conditional probability tables (CPTs).

   (figure: the DAG A → B, B → C, B → D, annotated with the four CPTs shown on the previous slide)
A Set of Tables for Each Node
Conditional probability distribution for C given B:

      B      C      P(C|B)
      false  false  0.4
      false  true   0.6
      true   false  0.9
      true   true   0.1

• For a given combination of values of the parents (B in this example), the entries for P(C = true | B) and P(C = false | B) must add up to 1, e.g., P(C = true | B = false) + P(C = false | B = false) = 1.
• If you have a Boolean variable with k Boolean parents, this table has 2^(k+1) entries, of which 2^k are independent (each row pair must sum to 1).
Bayesian Networks
Two important properties:
1. Encodes the conditional independence relationships between the variables in the graph structure.
2. Is a compact representation of the joint probability distribution over the variables.
Conditional Independence
The Markov condition: given its parents (P1, P2), a node (X) is conditionally independent of its non-descendants (ND1, ND2).

(figure: a DAG in which P1 and P2 are parents of X, C1 and C2 are children of X, and ND1 and ND2 are non-descendants of X)
The Joint Probability Distribution

Due to the Markov condition, we can compute the joint probability distribution over all the variables X1, …, Xn in the Bayesian net using the formula:

      P(X1 = x1, …, Xn = xn) = Π i=1..n P(Xi = xi | Parents(Xi))

where Parents(Xi) means the values of the parents of the node Xi with respect to the graph.
Using a Bayesian Network Example
Using the network in the example, suppose you want to calculate:

      P(A = true, B = true, C = true, D = true)
      = P(A = true) * P(B = true | A = true) *
        P(C = true | B = true) * P(D = true | B = true)
      = (0.4) * (0.3) * (0.1) * (0.95)
      = 0.0114

• The factorization comes from the graph structure; the numbers come from the conditional probability tables.

(figure: the DAG A → B, with B → C and B → D)
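
A sketch of this chain-rule computation, with the example network's CPTs stored as plain dicts:

```python
# The CPTs from the example network (keys give the conditioning order).
p_A = {True: 0.4, False: 0.6}
p_B_given_A = {(True, True): 0.3,  (True, False): 0.7,
               (False, True): 0.99, (False, False): 0.01}  # key: (A, B)
p_C_given_B = {(True, True): 0.1,  (True, False): 0.9,
               (False, True): 0.6,  (False, False): 0.4}   # key: (B, C)
p_D_given_B = {(True, True): 0.95, (True, False): 0.05,
               (False, True): 0.98, (False, False): 0.02}  # key: (B, D)

def joint(a, b, c, d):
    """P(A=a, B=b, C=c, D=d) = P(a) P(b|a) P(c|b) P(d|b)."""
    return p_A[a] * p_B_given_A[(a, b)] * p_C_given_B[(b, c)] * p_D_given_B[(b, d)]

print(joint(True, True, True, True))   # 0.4 * 0.3 * 0.1 * 0.95 = 0.0114
```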
Inference

• Using a Bayesian network to compute probabilities is called inference.
• In general, inference involves queries of the form:
      P( X | E )
  where X is the query variable(s) and E is the evidence variable(s).
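
A sketch of exact inference by enumeration on the example A, B, C, D network (the CPTs repeat the earlier tables); variables that are neither query nor evidence are summed out:

```python
# Exact inference by enumeration over the full joint distribution.
from itertools import product

P_A = {True: 0.4, False: 0.6}
P_B = {(True, True): 0.3, (True, False): 0.7, (False, True): 0.99, (False, False): 0.01}
P_C = {(True, True): 0.1, (True, False): 0.9, (False, True): 0.6, (False, False): 0.4}
P_D = {(True, True): 0.95, (True, False): 0.05, (False, True): 0.98, (False, False): 0.02}

def joint(a, b, c, d):
    return P_A[a] * P_B[(a, b)] * P_C[(b, c)] * P_D[(b, d)]

def query(X, E):
    """P(X | E); X and E are dicts such as {'A': True}."""
    num = den = 0.0
    for vals in product([True, False], repeat=4):
        assign = dict(zip('ABCD', vals))
        if all(assign[v] == t for v, t in E.items()):
            p = joint(*vals)
            den += p                                    # consistent with E
            if all(assign[v] == t for v, t in X.items()):
                num += p                                # ... and with X too
    return num / den

# e.g. P(A = true | C = true); B and D are unobserved and summed out
print(query({'A': True}, {'C': True}))
```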
Inference
(figure: the anthrax Bayesian network, with HasAnthrax as the parent of HasCough, HasFever, HasDifficultyBreathing and HasWideMediastinum)

• An example of a query would be:
      P( HasAnthrax = true | HasFever = true, HasCough = true )
• Note: even though HasDifficultyBreathing and HasWideMediastinum are in the Bayesian network, they are not given values in the query (i.e., they do not appear either as query variables or evidence variables).
• They are treated as unobserved variables.
The Bad News
• Exact inference is feasible in small to medium-sized networks.
• Exact inference in large networks takes a very long time.
• We resort to approximate inference techniques, which are much faster and give pretty good results.
One last unresolved issue…
We still haven’t said where we get the Bayesian network from. There are two options:
• Get an expert to design it.
• Learn it from data.
Assignment-1
1. Write a Python program that reads two paragraphs of a document and then:
   – segments them into a list of sentences and writes the sentences to a secondary storage device, and
   – segments these sentences into words and writes the words to a secondary storage device.
• Note:
   – The list of words should be converted to lower case and freed from any numbers and punctuation marks. (One possible starting point is sketched below.)
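
A minimal sketch using NLTK's tokenizers; "input.txt" and the output file names are placeholders, and the punkt tokenizer models must be downloaded once with nltk.download('punkt'):

```python
# Segment a text into sentences and cleaned, lower-cased words.
import re
import nltk

text = open("input.txt", encoding="utf-8").read()

sentences = nltk.sent_tokenize(text)
with open("sentences.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(sentences))

words = []
for s in sentences:
    for w in nltk.word_tokenize(s.lower()):
        w = re.sub(r"[^a-z]", "", w)   # drop numbers and punctuation
        if w:
            words.append(w)

with open("words.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(words))
```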
Assignment -2
2.
   – Take some paragraphs from any source.
   – Write a Python program that displays collocations, i.e., words that frequently occur together. (A possible starting point is sketched below.)
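
A minimal sketch using NLTK's collocation finder; "input.txt" is a placeholder:

```python
# Find the top bigram collocations in a text, ranked by PMI.
import nltk
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

text = open("input.txt", encoding="utf-8").read()
words = [w.lower() for w in nltk.word_tokenize(text) if w.isalpha()]

finder = BigramCollocationFinder.from_words(words)
finder.apply_freq_filter(2)               # ignore pairs seen only once

for pair in finder.nbest(BigramAssocMeasures.pmi, 10):
    print(" ".join(pair))
```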

Submission Date: April 8, 2014
