Unit 3
Part-of-speech (POS) tagging is a linguistic task in Natural Language Processing (NLP) in which each
word in a document is assigned a particular part of speech (adverb, adjective, verb, etc.) or
grammatical category. By adding a layer of syntactic and semantic information to the words, this
procedure makes it easier to understand the sentence’s structure and meaning.
In NLP applications, POS tagging is useful for machine translation, named entity recognition, and
information extraction, among other things. It also helps resolve ambiguity in words with multiple
meanings and reveals a sentence’s grammatical structure.
Default tagging is a basic first step for part-of-speech tagging. It is performed using NLTK’s
DefaultTagger class, which takes the tag to assign as its single argument. NN is the tag for a
singular noun. DefaultTagger is most useful as a fallback that assigns the most common part-of-
speech tag to every token, which is why a noun tag is recommended.
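A minimal sketch of default tagging with NLTK’s DefaultTagger (assuming NLTK is installed):

from nltk.tag import DefaultTagger

# Assign the fallback tag 'NN' (singular noun) to every token
tagger = DefaultTagger('NN')
tokens = ['Everything', 'here', 'becomes', 'a', 'noun']
print(tagger.tag(tokens))
# [('Everything', 'NN'), ('here', 'NN'), ('becomes', 'NN'), ('a', 'NN'), ('noun', 'NN')]

In practice, a DefaultTagger is usually placed at the end of a backoff chain so that words no other tagger can handle still receive some tag.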
The following are the steps in a typical natural language processing (NLP) part-of-speech (POS)
tagging example:
• Tokenization: Divide the input text into discrete tokens, which are usually units of words or
subwords. The first stage in NLP tasks is tokenization.
• Loading Language Models: To utilize a library such as NLTK or SpaCy, be sure to load the
relevant language model. These models offer a foundation for comprehending a language’s
grammatical structure since they have been trained on a vast amount of linguistic data.
• Text Processing: If required, preprocess the text to handle special characters, convert it to
lowercase, or eliminate superfluous information. Correct PoS labeling is aided by clear text.
• Linguistic Analysis: To determine the text’s grammatical structure, use linguistic analysis.
This entails understanding each word’s purpose inside the sentence, including whether it is an
adjective, verb, noun, or other.
• Part-of-Speech Tagging: Assign a part-of-speech tag to each token, based on the word itself
and its surrounding context, using the loaded model or tagger.
• Results Analysis: Verify the accuracy and consistency of the PoS tagging findings with the
source text. Determine and correct any possible problems or mistagging.
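As a rough sketch of the steps above using NLTK (assuming the tokenizer and tagger resources have been fetched with nltk.download()):

import nltk

# Tokenization: split the raw text into word tokens
text = "The quick brown fox jumps over the lazy dog."
tokens = nltk.word_tokenize(text)

# Part-of-speech tagging: assign a Penn Treebank tag to each token
tagged = nltk.pos_tag(tokens)
print(tagged)
# e.g. [('The', 'DT'), ('quick', 'JJ'), ...]

The tags in the output follow the Penn Treebank tagset described later in this unit.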
Types of POS Tagging in NLP
• Hidden Markov Model (HMM) tagging: Hidden Markov Models (HMMs) serve as a statistical framework for part-of-speech (POS)
tagging in natural language processing (NLP). In HMM-based POS tagging, the model
undergoes training on a sizable annotated text corpus to discern patterns in various parts of
speech. Leveraging this training, the model predicts the POS tag for a given word based on
the probabilities associated with different tags within its context.
Comprising states for potential POS tags and transitions between them, the HMM-based POS
tagger learns transition probabilities and word-emission probabilities during training. To tag
new text, the model, employing the Viterbi algorithm, calculates the most probable sequence
of POS tags based on the learned probabilities.
Widely applied in NLP, HMMs excel at modeling intricate sequential data, yet their
performance may hinge on the quality and quantity of annotated training data.
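As a hedged sketch, NLTK also ships a trainable HMM tagger (this assumes the treebank sample corpus has been downloaded via nltk.download('treebank')):

from nltk.tag import hmm
from nltk.corpus import treebank

# Supervised training on annotated sentences: lists of (word, tag) pairs
train_sents = treebank.tagged_sents()[:3000]
trainer = hmm.HiddenMarkovModelTrainer()
tagger = trainer.train_supervised(train_sents)   # learns transition and emission probabilities

# Tagging new text decodes the most likely tag sequence with the Viterbi algorithm
print(tagger.tag(['The', 'company', 'said', 'it', 'will', 'buy', 'the', 'unit', '.']))

With the default maximum-likelihood estimates, accuracy degrades sharply on out-of-vocabulary words, which is one of the challenges listed below.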
• Text Simplification: Breaking complex sentences down into their constituent parts makes
the material easier to understand and easier to simplify.
• Information Retrieval: Information retrieval systems are enhanced by part-of-speech (POS)
tagging, which allows for more precise indexing and search based on grammatical categories.
• Named Entity Recognition: POS tagging helps to identify entities such as names, locations,
and organizations inside text and is a precondition for named entity identification.
• Syntactic Parsing: It facilitates syntactic parsing, which helps with phrase structure analysis
and with identifying the relationships between words.
• Ambiguity: The inherent ambiguity of language makes POS tagging difficult since words
can signify different things depending on the context, which can result in misunderstandings.
• Idiomatic Expressions: Slang, colloquialisms, and idiomatic phrases can be problematic for
POS tagging systems since they don’t always follow formal grammar standards.
• Out-of-Vocabulary Words: Out-of-vocabulary words (words not included in the training
corpus) can be difficult to handle since the model might have trouble assigning the correct
POS tags.
• Domain Dependence: POS tagging models trained on a single domain may not generalize well
to other domains; for best results on a new domain, they typically need additional
domain-specific training data.
Conditional Random Fields
A Conditional Random Field (CRF) is a type of probabilistic graphical model often used in
Natural Language Processing (NLP) and computer vision tasks. It is a variant of a Markov
Random Field (MRF), which is a type of undirected graphical model.
• CRFs are used for structured prediction tasks, where the goal is to predict a structured output
based on a set of input features. For example, in NLP, a common structured prediction task
is Part-of-Speech (POS) tagging, where the goal is to assign a part-of-speech tag to each
word in a sentence. CRFs can also be used for Named Entity Recognition (NER), chunking,
and other tasks where the output is a structured sequence.
• CRFs are trained using maximum likelihood estimation, which involves optimizing the
parameters of the model to maximize the probability of the correct output sequence given the
input features. This optimization problem is typically solved using iterative algorithms like
gradient descent or L-BFGS.
• The formula for a Conditional Random Field (CRF) is similar to that of a Markov Random
Field (MRF) but with the addition of input features that condition the probability distribution
over output sequences.
Let X be the input features and Y be the output sequence. The conditional probability distribution of a
CRF is given by:
P(Y | X) = (1 / Z(X)) * exp( Σi Σk λk fk(yi – 1, yi, xi) )
where:
• Z(X) is the normalization factor that ensures the distribution sums to 1 over all possible
output sequences.
• λk are the learned model parameters.
• fk(yi – 1, yi, xi) are the feature functions that take as input the current output state yi, the
previous output state yi – 1, and the input features xi.
• These functions can be binary or real-valued, and capture dependencies between the input
features and the output sequence.
Given a sentence, we can use the Viterbi algorithm to compute the most likely sequence of parts of
speech tags.
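As an illustrative sketch of CRF training for POS tagging (assuming the third-party sklearn-crfsuite package is installed; the word2features function and the toy data below are made up for illustration):

import sklearn_crfsuite

def word2features(sent, i):
    # Hypothetical hand-crafted features for the word at position i
    word = sent[i]
    return {
        'word.lower': word.lower(),
        'word.istitle': word.istitle(),
        'word.isdigit': word.isdigit(),
        'suffix3': word[-3:],
        'prev_word': sent[i - 1].lower() if i > 0 else '<START>',
    }

def sent2features(sent):
    return [word2features(sent, i) for i in range(len(sent))]

# Toy training data: one tokenized sentence and its POS tags
X_train = [sent2features(['The', 'dog', 'barks'])]
y_train = [['DT', 'NN', 'VBZ']]

# L-BFGS optimizes the regularized conditional log-likelihood, as described above
crf = sklearn_crfsuite.CRF(algorithm='lbfgs', c1=0.1, c2=0.1, max_iterations=100)
crf.fit(X_train, y_train)
print(crf.predict([sent2features(['The', 'cat', 'sleeps'])]))

In practice, the model is trained on many annotated sentences with a much richer feature set.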
Viterbi Algorithm Overview
Given an input sequence with a leading start token, you want to find the sequence of hidden states or
parts of speech tags that has the highest probability for this sequence.
The Viterbi algorithm computes all the possible paths for a given sentence in order to find the
most likely sequence of hidden states. It uses the matrix representation of the hidden Markov
models. The algorithm can be split into 3 steps:
• Initialization step
• Forward pass
• Backward pass
It uses the transition probabilities and emission probabilities from the hidden Markov models to
calculate two matrices. The matrix C (best_probs) holds the intermediate optimal probabilities
and matrix D (best_paths), the indices of the visited states.
• These two matrices have n rows, where n is the number of parts of speech tags or hidden
states in the model.
• and k columns, where k is the number of words in the given sequence.
Viterbi Initialization
In the initialization step, the first column of the C and D matrices is populated.
First column in C:
The first column of C represents the probability of transitioning from the start state to tag ti
and emitting the first word w1. In other words, we go from the start token to tag i and then to the word w1.
Formula:
C(i,1) = a_(1,i) * b_(i, cindex(w1))
where a_(1,i) is the transition probability from the start state to tag i and b_(i, cindex(w1)) is the
emission probability from tag i to word w1.
First column in D matrix:
• In the D matrix, you store the labels that represent the different states you’re traversing
when finding the most likely sequence of parts of speech tags for the given sequence of
words, W1 all the way to Wk.
• In the first column, you simply set all entries to zero, as there are no preceding parts of
speech tags we have traversed yet.
C matrix formula:
C(i,j) = max_k [ C(k, j–1) * a_(k,i) * b_(i, cindex(wj)) ]
where the first element is the probability of the preceding path you’ve traversed, the second
element is the transition probability from tag k to tag i, and the last element is the emission
probability from tag i to word wj. We then choose the k which maximizes the entire formula.
D matrix formula:
D(i,j) = argmax_k [ C(k, j–1) * a_(k,i) * b_(i, cindex(wj)) ]
which simply saves the k that maximized the entry in each C(i,j).
Example:
Let’s say that in the last column of matrix C, the highest probability is in the row for tag t1.
• Then we go to matrix D and follow the best path backward from that entry until we
arrive at the start token. The path we recover from the backward pass is the
sequence of parts of speech tags with the highest probability.
Some notes:
• Be careful with the indices into the matrices.
• Use log probabilities instead of multiplying probabilities directly: multiplying many
very small numbers leads to numerical underflow. With log probabilities the update
becomes a sum, which is numerically stable:
C(i,j) = max_k [ C(k, j–1) + log a_(k,i) + log b_(i, cindex(wj)) ]
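A self-contained sketch of the Viterbi algorithm in Python with NumPy, using log probabilities and the C (best_probs) and D (best_paths) matrices described above (the tiny transition and emission tables are made up for illustration):

import numpy as np

def viterbi(A, B, pi, obs):
    # A: (n x n) transition probabilities, B: (n x V) emission probabilities,
    # pi: (n,) start probabilities, obs: list of word indices into the vocabulary.
    n, T = A.shape[0], len(obs)
    C = np.full((n, T), -np.inf)      # best_probs, in log space
    D = np.zeros((n, T), dtype=int)   # best_paths (backpointers)

    # Initialization: start state -> tag i -> first word
    C[:, 0] = np.log(pi) + np.log(B[:, obs[0]])

    # Forward pass: fill each column from the previous one
    for j in range(1, T):
        for i in range(n):
            scores = C[:, j - 1] + np.log(A[:, i]) + np.log(B[i, obs[j]])
            D[i, j] = np.argmax(scores)
            C[i, j] = np.max(scores)

    # Backward pass: follow the backpointers from the best final state
    path = [int(np.argmax(C[:, T - 1]))]
    for j in range(T - 1, 0, -1):
        path.append(D[path[-1], j])
    return path[::-1]

# Toy example with 2 tags and a 3-word vocabulary
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
pi = np.array([0.6, 0.4])
print(viterbi(A, B, pi, [0, 1, 2]))   # [0, 0, 1]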
Penn Treebank tagset
The English Penn Treebank tagset is used with English corpora annotated by the TreeTagger tool,
developed by Helmut Schmid in the TC project at the Institute for Computational Linguistics of
the University of Stuttgart. This version of the tagset contains modifications developed by Sketch
Engine (earlier version).
POS Tag Description Example
CC coordinating conjunction and
CD cardinal number 1, third
DT determiner the
EX existential there there is
FW foreign word les
IN preposition, subordinating conjunction in, of, like
IN/that that as subordinator that
JJ adjective green
JJR adjective, comparative greener
JJS adjective, superlative greenest
LS list marker 1)
MD modal could, will
NN noun, singular or mass table
NNS noun plural tables
NP proper noun, singular John
NPS proper noun, plural Vikings
PDT predeterminer both the boys
POS possessive ending friend’s
PP personal pronoun I, he, it
PPZ possessive pronoun my, his
RB adverb however, usually, naturally, here, good
RBR adverb, comparative better
RBS adverb, superlative best
RP particle give up
SENT Sentence-break punctuation .!?
SYM Symbol /[=*
TO infinitive ‘to’ To go
UH interjection uhhuhhuhh
VB verb be, base form be
VBD verb be, past tense was, were
VBG verb be, gerund/present participle being
VBN verb be, past participle been
VBP verb be, sing. present, non-3d am, are
VBZ verb be, 3rd person sing. present is
VH verb have, base form have
VHD verb have, past tense had
VHG verb have, gerund/present participle having
VHN verb have, past participle had
VHP verb have, sing. present, non-3d have
VHZ verb have, 3rd person sing. present has
VV verb, base form take
VVD verb, past tense took
VVG verb, gerund/present participle taking
VVN verb, past participle taken
VVP verb, sing. present, non-3d take
VVZ verb, 3rd person sing. present takes
WDT wh-determiner which
WP wh-pronoun who, what
WP$ possessive wh-pronoun whose
WRB wh-adverb where, when
# # #
$ $ $
'' Quotation marks ’ ”
`` Opening quotation marks ‘ “
( Opening brackets ({
) Closing brackets )}
, Comma ,
: Punctuation –;:—…
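NLTK’s tagger uses a closely related version of the Penn Treebank tagset (with NNP/NNPS and PRP where the TreeTagger variant above uses NP/NPS and PP), and the tag definitions can be looked up programmatically (assuming the 'tagsets' data has been downloaded):

import nltk
# nltk.download('tagsets')        # one-time download of the tag descriptions
nltk.help.upenn_tagset('JJ')      # prints the definition and examples for JJ
nltk.help.upenn_tagset('NN.*')    # a regular expression matches a family of tags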
Maximum Entropy Model
We define f_i as a feature function and w_i as its weight. The summation from i = 1 to m
sums over all m feature functions. The denominator Z(x) normalizes the probability:
p(y|x) = exp( Σ_(i=1..m) w_i f_i(x, y) ) / Z(x)
where Z(x) = Σ_y' exp( Σ_(i=1..m) w_i f_i(x, y') ), summed over all possible labels y'.
The MaxEnt model makes use of the log-linear model approach with feature functions, but it
does not take the sequential nature of the data into account.
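A tiny numerical sketch of the log-linear (MaxEnt) probability, with made-up weights and binary feature functions, showing how the formula above is evaluated:

import math

# Hypothetical binary feature functions f_i(x, y) and weights w_i
def f1(x, y): return 1.0 if x.istitle() and y == 'NNP' else 0.0
def f2(x, y): return 1.0 if x.endswith('ing') and y == 'VBG' else 0.0
features = [f1, f2]
weights = [1.5, 2.0]

def score(x, y):
    # Sum_i w_i * f_i(x, y)
    return sum(w * f(x, y) for w, f in zip(weights, features))

def maxent_prob(x, y, labels):
    # p(y | x) = exp(score(x, y)) / Z(x)
    Z = sum(math.exp(score(x, y_prime)) for y_prime in labels)
    return math.exp(score(x, y)) / Z

labels = ['NN', 'NNP', 'VBG']
print(maxent_prob('London', 'NNP', labels))   # highest of the three, because f1 fires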
Maximum Entropy Markov Model (MEMM)
From the Maximum Entropy model, we can extend into the Maximum Entropy
Markov Model (MEMM). This approach combines the HMM’s handling of
sequential data with the Maximum Entropy model’s feature functions and
normalization.
The Maximum Entropy Markov Model (MEMM) has dependencies between
each state and the full observation sequence explicitly. This is more expressive
than HMMs.
In the HMM, we saw that the model uses two probability matrices (state
transition and emission probabilities). We need to predict a tag given an
observation, but the HMM models the probability of a tag producing a certain
observation; this is due to its generative approach. Instead of the transition and
observation matrices of the HMM, the MEMM has only one transition probability
matrix. This matrix maps each combination of previous state y_i−1 and
current observation x_i seen in the training data to the current state y_i.
Our goal is to find p(y_1, y_2, …, y_n | x_1, x_2, …, x_n). By the chain rule, this is:
p(y_1, …, y_n | x_1, …, x_n) = Π_(i=1..n) p(y_i | y_1, …, y_(i−1), x_1, …, x_n)
Since, as in the HMM, each state only depends on the previous state, we can limit the
condition on y_i to y_(i−1) and the current observation x_i. This is the Markov
independence assumption:
p(y_1, …, y_n | x_1, …, x_n) = Π_(i=1..n) p(y_i | y_(i−1), x_i)
The MEMM can incorporate richer features through its feature functions, while the
HMM requires the likelihood of each feature to be computed, since it is
likelihood-based. The feature functions of the MEMM also depend on the
previous tag y_i−1. As an example:
Example feature function for the letter ‘e’ in ‘test’, where the current tag is M and the previous tag is B:
f(y_(i−1), y_i, x_i) = 1 if x_i = ‘e’, y_i = M and y_(i−1) = B, and 0 otherwise.
The MEMM has a richer set of observation features that can describe
observations in terms of many overlapping features. For example, in our word-segmentation
example, we could have features like capitalization, whether the character is a vowel or a
consonant, or the type of the character.
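A hypothetical sketch of such MEMM observation features for character-level word segmentation (the tag names B and M and the helper char_features are made up for illustration):

def char_features(prev_tag, curr_char):
    # Overlapping observation features over (previous tag, current character)
    return {
        'char=' + curr_char: 1.0,
        'prev_tag=' + prev_tag: 1.0,
        'is_upper': 1.0 if curr_char.isupper() else 0.0,
        'is_vowel': 1.0 if curr_char.lower() in 'aeiou' else 0.0,
        'is_digit': 1.0 if curr_char.isdigit() else 0.0,
    }

# The indicator feature from the example above: fires for the letter 'e' with tags B -> M
def f_e_B_M(prev_tag, curr_tag, curr_char):
    return 1.0 if curr_char == 'e' and prev_tag == 'B' and curr_tag == 'M' else 0.0

print(char_features('B', 'e'))
print(f_e_B_M('B', 'M', 'e'))   # 1.0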