
What is POS (Parts-of-Speech) Tagging?

Parts of Speech tagging is a linguistic activity in Natural Language Processing (NLP) wherein each
word in a document is given a particular part of speech (adverb, adjective, verb, etc.) or
grammatical category. Through the addition of a layer of syntactic and semantic information to the
words, this procedure makes it easier to comprehend the sentence’s structure and meaning.
In NLP applications, POS tagging is useful for machine translation, named entity recognition, and
information extraction, among other things. It also helps resolve ambiguity in words with multiple
meanings and reveals a sentence’s grammatical structure.

Default tagging is a basic first step in part-of-speech tagging. It is performed using the
DefaultTagger class, which takes a single argument: the tag to assign. NN is the tag for a
singular noun. DefaultTagger is most useful as a baseline that assigns the most common
part-of-speech tag to every token, which is why a noun tag is recommended.
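For illustration, here is a minimal sketch of default tagging with NLTK's DefaultTagger class, assuming the nltk package is installed:

from nltk.tag import DefaultTagger

# Every token receives the same fallback tag, here 'NN' (singular noun).
default_tagger = DefaultTagger('NN')

tokens = ['The', 'quick', 'brown', 'fox']
print(default_tagger.tag(tokens))
# [('The', 'NN'), ('quick', 'NN'), ('brown', 'NN'), ('fox', 'NN')]

In practice, a DefaultTagger is usually placed at the end of a back-off chain, so that words no other tagger recognizes still receive the most common tag.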

Example of POS Tagging


Consider the sentence: “The quick brown fox jumps over the lazy dog.”
After performing POS Tagging:

• “The” is tagged as determiner (DT)


• “quick” is tagged as adjective (JJ)
• “brown” is tagged as adjective (JJ)
• “fox” is tagged as noun (NN)
• “jumps” is tagged as verb (VBZ)
• “over” is tagged as preposition (IN)
• “the” is tagged as determiner (DT)
• “lazy” is tagged as adjective (JJ)
• “dog” is tagged as noun (NN)
By offering insights into the grammatical structure, this tagging aids machines in comprehending
not just individual words but also the connections between them inside a phrase. For many NLP
applications, like text summarization, sentiment analysis, and machine translation, this kind of data
is essential.
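The example above can be reproduced with NLTK's built-in tagger; a minimal sketch, assuming nltk is installed and its tokenizer and tagger resources have been downloaded:

import nltk

# One-time downloads of the tokenizer and tagger models (standard NLTK resource names).
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

sentence = "The quick brown fox jumps over the lazy dog."
tokens = nltk.word_tokenize(sentence)
print(nltk.pos_tag(tokens))
# Produces (word, tag) pairs close to the list above; exact tags may vary with the tagger version.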

Workflow of POS Tagging in NLP

The following are the steps in a typical part-of-speech (POS) tagging workflow in natural
language processing (NLP):

• Tokenization: Divide the input text into discrete tokens, which are usually units of words or
subwords. The first stage in NLP tasks is tokenization.
• Loading Language Models: To utilize a library such as NLTK or SpaCy, be sure to load the
relevant language model. These models offer a foundation for comprehending a language’s
grammatical structure since they have been trained on a vast amount of linguistic data.
• Text Processing: If required, preprocess the text to handle special characters, convert it to
lowercase, or eliminate superfluous information. Correct PoS labeling is aided by clear text.
• Linguistic Analysis: To determine the text’s grammatical structure, use linguistic analysis.
This entails understanding each word’s purpose inside the sentence, including whether it is an
adjective, verb, noun, or other.
• Part-of-Speech Tagging: Assign a part-of-speech tag to each token, using the loaded model
or tagger together with the results of the linguistic analysis to resolve ambiguous words.
• Results Analysis: Verify the accuracy and consistency of the PoS tagging findings with the
source text. Determine and correct any possible problems or mistagging.
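As an illustration of this workflow, here is a minimal sketch using spaCy, assuming spaCy and its small English model en_core_web_sm are installed:

import spacy

# Load the pretrained English language model (tokenizer + tagger in one pipeline).
nlp = spacy.load("en_core_web_sm")

# Tokenization, linguistic analysis and POS tagging happen in a single call.
doc = nlp("The quick brown fox jumps over the lazy dog.")

for token in doc:
    # token.pos_ is the coarse universal POS tag, token.tag_ the fine-grained Penn Treebank tag.
    print(token.text, token.pos_, token.tag_)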
Types of POS Tagging in NLP

Assigning grammatical categories to words in a text is known as Part-of-Speech (PoS) tagging,
and it is an essential aspect of Natural Language Processing (NLP). Different PoS tagging
approaches exist, each with a unique methodology. Here are a few typical kinds:
1. Rule-Based Tagging
Rule-based part-of-speech (POS) tagging involves assigning words their respective parts of
speech using predetermined rules, contrasting with machine learning-based POS tagging that
requires training on annotated text corpora. In a rule-based system, POS tags are assigned based
on specific word characteristics and contextual cues.
For instance, a rule-based POS tagger could designate the “noun” tag to words ending in “‑tion”
or “‑ment,” recognizing common noun-forming suffixes. This approach offers transparency and
interpretability, as it doesn’t rely on training data.
Let’s consider an example of how a rule-based part-of-speech (POS) tagger might operate:
Rule: Assign the POS tag “noun” to words ending in “-tion” or “-ment.”
Text: “The presentation highlighted the key achievements of the project’s development.”
Rule based Tags:

• “The” – Determiner (DET)


• “presentation” – Noun (N)
• “highlighted” – Verb (V)
• “the” – Determiner (DET)
• “key” – Adjective (ADJ)
• “achievements” – Noun (N)
• “of” – Preposition (PREP)
• “the” – Determiner (DET)
• “project’s” – Noun (N)
• “development” – Noun (N)
In this instance, the rule-based POS tagger follows the predetermined rule to label words:
“noun” tags are applied to words like “presentation,” “achievements,” and “development”
because of the aforementioned suffix rule. Despite the simplicity of this example, rule-based
taggers can handle a broad variety of linguistic patterns by incorporating additional rules, which
keeps the tagging process transparent and comprehensible.
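A rule-based tagger of this kind can be sketched with NLTK's RegexpTagger; the suffix patterns below are illustrative and simply extend the "-tion"/"-ment" rule above:

from nltk.tag import RegexpTagger

# Each (regular expression, tag) pair is one rule; rules are tried in order.
patterns = [
    (r'.*tion$', 'N'),      # words ending in -tion -> noun
    (r'.*ment$', 'N'),      # words ending in -ment -> noun
    (r'.*ed$', 'V'),        # simple past verbs
    (r'(?i)^the$', 'DET'),  # the determiner "the"
    (r'.*', 'N'),           # fallback: tag everything else as noun
]

rule_tagger = RegexpTagger(patterns)
tokens = ['The', 'presentation', 'highlighted', 'the', 'development']
print(rule_tagger.tag(tokens))
# [('The', 'DET'), ('presentation', 'N'), ('highlighted', 'V'),
#  ('the', 'DET'), ('development', 'N')]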
2. Transformation-Based Tagging
Transformation-based tagging (TBT) is a part-of-speech (POS) tagging method that uses a set of
rules to change the tags that are applied to words inside a text. In contrast, statistical POS tagging
uses trained algorithms to predict tags probabilistically, while rule-based POS tagging assigns
tags directly based on predefined rules.
To change word tags in TBT, a set of rules is created depending on contextual information. A
rule could, for example, change a verb’s tag to a noun if it comes after a determiner like “the.”
The text is systematically subjected to these criteria, and after each transformation, the tags are
updated.
When compared to rule-based tagging, TBT can provide higher accuracy, especially when
dealing with complex grammatical structures. Nevertheless, to attain ideal performance it might
require a large rule set and additional computing power.
Consider the transformation rule: Change the tag of a verb to a noun if it follows a determiner
like “the.”
Text: “The cat chased the mouse”.
Initial Tags:

• “The” – Determiner (DET)


• “cat” – Noun (N)
• “chased” – Verb (V)
• “the” – Determiner (DET)
• “mouse” – Noun (N)
Transformation rule applied:
Change the tag of “chased” from Verb (V) to Noun (N) because it follows the determiner “the.”
Updated tags:

• “The” – Determiner (DET)


• “cat” – Noun (N)
• “chased” – Noun (N)
• “the” – Determiner (DET)
• “mouse” – Noun (N)
In this instance, the TBT system changed the tag of “chased” from a verb to a noun using a
transformation rule based on the contextual pattern. The rules are applied sequentially and the
tagging is updated iteratively. Although this example is simple, given a well-defined set of
transformation rules, TBT systems can handle more complex grammatical patterns.
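A hand-rolled sketch of applying one such transformation rule is shown below. Real transformation-based taggers, such as the Brill tagger, learn many rules of this form from data; the example phrase here is chosen so that the rule's contextual pattern, a verb-tagged word immediately after the determiner "the", actually occurs:

def apply_transformation(tagged_tokens):
    # Rule: change a Verb (V) tag to a Noun (N) tag when the previous token
    # is the determiner "the".
    updated = list(tagged_tokens)
    for i in range(1, len(updated)):
        word, tag = updated[i]
        prev_word, prev_tag = updated[i - 1]
        if tag == 'V' and prev_tag == 'DET' and prev_word.lower() == 'the':
            updated[i] = (word, 'N')
    return updated

# Initial (possibly wrong) tags produced by a simple baseline tagger.
initial = [('The', 'DET'), ('run', 'V'), ('was', 'V'), ('long', 'ADJ')]
print(apply_transformation(initial))
# [('The', 'DET'), ('run', 'N'), ('was', 'V'), ('long', 'ADJ')]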
3. Statistical POS Tagging
Utilizing probabilistic models, statistical part-of-speech (POS) tagging is a computational-linguistics
technique that assigns grammatical categories to words in a text. Unlike rule-based tagging, which
relies on handwritten rules, statistical tagging trains machine-learning algorithms on massive
annotated corpora.
In order to capture the statistical relationships present in language, these algorithms learn the
probability distribution of word-tag sequences. Conditional Random Fields (CRFs) and Hidden
Markov Models (HMMs) are popular models for statistical part-of-speech tagging. During training,
the algorithm learns from labeled samples to estimate the probability of observing a specific tag
given the current word and its context.
The trained model is then used to predict the most likely tags for unseen text. Statistical POS
tagging works especially well for languages with complicated grammatical structures because it is
exceptionally good at handling linguistic ambiguity and capturing subtle language patterns.

• Hidden Markov Models (HMMs) serve as a statistical framework for part-of-speech (POS)
tagging in natural language processing (NLP). In HMM-based POS tagging, the model
undergoes training on a sizable annotated text corpus to discern patterns in various parts of
speech. Leveraging this training, the model predicts the POS tag for a given word based on
the probabilities associated with different tags within its context.
Comprising states for potential POS tags and transitions between them, the HMM-based POS
tagger learns transition probabilities and word-emission probabilities during training. To tag
new text, the model, employing the Viterbi algorithm, calculates the most probable sequence
of POS tags based on the learned probabilities.
Widely applied in NLP, HMMs excel at modeling intricate sequential data, yet their
performance may hinge on the quality and quantity of annotated training data.
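A minimal sketch of supervised HMM tagging with NLTK, assuming the nltk package and its small treebank sample corpus are available:

import nltk
from nltk.corpus import treebank
from nltk.tag import hmm

nltk.download('treebank')  # small annotated corpus shipped with NLTK

tagged_sents = treebank.tagged_sents()
train_data, test_data = tagged_sents[:3000], tagged_sents[3000:]

# Learn transition and emission probabilities from the annotated sentences.
trainer = hmm.HiddenMarkovModelTrainer()
hmm_tagger = trainer.train_supervised(train_data)

print(hmm_tagger.tag('The quick brown fox jumps over the lazy dog .'.split()))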

Advantages of POS Tagging

There are several advantages of Parts-Of-Speech (POS) Tagging including:

• Text Simplification: Breaking complex sentences down into their constituent parts makes
the material easier to understand and easier to simplify.
• Information Retrieval: Information retrieval systems are enhanced by part-of-speech (POS)
tagging, which allows for more precise indexing and search based on grammatical categories.
• Named Entity Recognition: POS tagging helps to identify entities such as names, locations,
and organizations inside text and is a precondition for named entity identification.
• Syntactic Parsing: It facilitates syntactic parsing, which helps with phrase-structure analysis
and with identifying the relationships between words.

Disadvantages of POS Tagging

Some common disadvantages in part-of-speech (POS) tagging include:

• Ambiguity: The inherent ambiguity of language makes POS tagging difficult since words
can signify different things depending on the context, which can result in misunderstandings.
• Idiomatic Expressions: Slang, colloquialisms, and idiomatic phrases can be problematic for
POS tagging systems since they don’t always follow formal grammar standards.
• Out-of-Vocabulary Words: Out-of-vocabulary words (words not included in the training
corpus) can be difficult to handle since the model might have trouble assigning the correct
POS tags.
• Domain Dependence: For best results, POS tagging models trained on a single domain
should have a lot of domain-specific training data because they might not generalize well to
other domains.
Conditional Random Fields
A Conditional Random Field (CRF) is a type of probabilistic graphical model often used in
Natural Language Processing (NLP) and computer vision tasks. It is a variant of a Markov
Random Field (MRF), which is a type of undirected graphical model.

• CRFs are used for structured prediction tasks, where the goal is to predict a structured output
based on a set of input features. For example, in NLP, a commonly structured prediction task
is Part-of-Speech (POS) tagging, where the goal is to assign a part-of-speech tag to each
word in a sentence. CRFs can also be used for Named Entity Recognition (NER), chunking,
and other tasks where the output is a structured sequence.
• CRFs are trained using maximum likelihood estimation, which involves optimizing the
parameters of the model to maximize the probability of the correct output sequence given the
input features. This optimization problem is typically solved using iterative algorithms like
gradient descent or L-BFGS.
• The formula for a Conditional Random Field (CRF) is similar to that of a Markov Random
Field (MRF) but with the addition of input features that condition the probability distribution
over output sequences.
Let X be the input features and Y be the output sequence. The conditional probability distribution
of a CRF is given by:

P(Y | X) = (1/Z(X)) * exp( Σi Σk λk · fk(yi−1, yi, xi) )

where:

• Z(X) is the normalization factor that ensures the distribution sums to 1 over all possible
output sequences.
• λk are the learned model parameters.
• fk(yi – 1, yi, xi) are the feature functions that take as input the current output state yi, the
previous output state yi – 1, and the input features xi.
• These functions can be binary or real-valued, and capture dependencies between the input
features and the output sequence.
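A sketch of how a CRF tagger might be set up with the sklearn-crfsuite library; the feature template and the tiny training set are illustrative assumptions, and a real model would be trained on an annotated corpus:

import sklearn_crfsuite

def word_features(sent, i):
    # Simple per-token features; CRFs can use arbitrary overlapping features like these.
    word = sent[i][0]
    return {
        'lower': word.lower(),
        'suffix3': word[-3:],
        'is_capitalized': word[0].isupper(),
        'prev_lower': sent[i - 1][0].lower() if i > 0 else '<START>',
    }

def sent_to_features(sent):
    return [word_features(sent, i) for i in range(len(sent))]

def sent_to_labels(sent):
    return [tag for _, tag in sent]

# A toy training set: each sentence is a list of (word, tag) pairs.
train_sents = [
    [('The', 'DT'), ('dog', 'NN'), ('barks', 'VBZ')],
    [('The', 'DT'), ('cat', 'NN'), ('sleeps', 'VBZ')],
]
X_train = [sent_to_features(s) for s in train_sents]
y_train = [sent_to_labels(s) for s in train_sents]

# L-BFGS training with L1/L2 regularization, as described above.
crf = sklearn_crfsuite.CRF(algorithm='lbfgs', c1=0.1, c2=0.1, max_iterations=100)
crf.fit(X_train, y_train)
print(crf.predict(X_train))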

The Viterbi Algorithm

Given a sentence, we can use the Viterbi algorithm to compute the most likely sequence of
parts-of-speech tags.
Viterbi Algorithm Overview
Starting from a leading start token, we want to find the sequence of hidden states, or parts-of-speech
tags, that has the highest probability for the given sequence of words.


The Viterbi algorithm computes all the possible paths for a given sentence in order to find the
most likely sequence of hidden states. It uses the matrix representation of the hidden Markov
models. The algorithm can be split into 3 steps:

• Initialization step
• Forward pass
• Backward pass

It uses the transition probabilities and emission probabilities from the hidden Markov model to
calculate two matrices. The matrix C (best_probs) holds the intermediate optimal probabilities,
and the matrix D (best_paths) holds the indices of the visited states.

• These two matrices have n rows, where n is the number of parts-of-speech tags or hidden
states in the model,
• and K columns, where K is the number of words in the given sequence.

Viterbi Initialization
In the initialization step, the first column of each of the C and D matrices is populated.
First column in C:
The first column of C represents the probability of transitioning from the start state to the first
tag ti while emitting the word w1; in other words, of going from tag i to the word w1.

Formula:

C[i,1] = a_(1,i) * b_(i, cindex(w1))

where a_(1,i) is the transition probability from the start state to tag i and b_(i, cindex(w1)) is the
emission probability from tag i to the word w1.

First column in D:
• In the D matrix, you store the labels that represent the different states you are traversing
when finding the most likely sequence of parts-of-speech tags for the given sequence of
words, w1 all the way to wK.
• In the first column, you simply set all entries to zero, as there are no preceding parts-of-speech
tags we have traversed.


Viterbi Forward Pass


After initializing the matrices C and D, all the remaining entries in the two matrices are
populated column by column during the forward pass.

C matrix formula:

C[i,j] = max_k C[k, j−1] * a_(k,i) * b_(i, cindex(wj))

where the first element is the probability of the preceding path you’ve traversed, the second
element is the transition probability from tag k to tag i, and the last element is the emission
probability from tag i to word wj. We then choose the k which maximizes the entire expression.

D matrix formula:

D[i,j] = argmax_k C[k, j−1] * a_(k,i) * b_(i, cindex(wj))

which simply saves the k that maximized each entry C[i,j].

Viterbi Backward Pass


Use the C and D matrices from the forward pass to create a path, so that we can assign a
part-of-speech tag to every word.
The D matrix represents the sequence of hidden states that most likely generated our sequence,
from word 1 all the way to word K. The backward pass retrieves the most likely sequence of
parts-of-speech tags for the given sequence of words.
Steps:
• Find the index of the entry C[i,K] with the highest probability in the last column of C.
The probability at this index is the probability of the most likely sequence of hidden
states generating the given sequence of words.
• Then use this index s to traverse backwards through the matrix D to reconstruct the
sequence of parts-of-speech tags.

Example:

Let’s say that the highest probability in the last column of matrix C corresponds to tag t1.

• Then we go to matrix D and follow the stored best-path indices backwards until we
arrive at the start token. The path recovered from the backward pass is the sequence of
parts-of-speech tags with the highest probability.


Some notes:
• Be careful with the indices in the matrices.
• Use log probabilities instead of multiplying raw probabilities: multiplying many very
small numbers such as probabilities leads to numerical underflow, so summing log
probabilities yields better results. A minimal implementation sketch follows.
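Here is a minimal Viterbi sketch in Python, using log probabilities and following the initialization / forward pass / backward pass structure above. The tag set and the A (transition), B (emission) and start probabilities in the usage example are illustrative assumptions:

import math

def viterbi(words, tags, A, B, start):
    # A[k][i]: P(tag i | tag k); B[i][w]: P(word w | tag i); start[i]: P(tag i | start token).
    n, K = len(tags), len(words)
    C = [[-math.inf] * K for _ in range(n)]   # best_probs: best log-probability per (tag, word)
    D = [[0] * K for _ in range(n)]           # best_paths: back-pointer to the previous tag

    # Initialization: first column of C (transition from the start token * emission of word 1).
    for i in range(n):
        C[i][0] = math.log(start[i]) + math.log(B[i].get(words[0], 1e-12))

    # Forward pass: fill the remaining columns, column by column.
    for j in range(1, K):
        for i in range(n):
            for k in range(n):
                p = C[k][j - 1] + math.log(A[k][i]) + math.log(B[i].get(words[j], 1e-12))
                if p > C[i][j]:
                    C[i][j], D[i][j] = p, k

    # Backward pass: start at the best entry of the last column and follow the back-pointers.
    best = max(range(n), key=lambda i: C[i][K - 1])
    path = [best]
    for j in range(K - 1, 0, -1):
        path.append(D[path[-1]][j])
    return [tags[i] for i in reversed(path)]

# Illustrative toy model with three tags.
tags = ['DT', 'NN', 'VB']
start = [0.8, 0.1, 0.1]
A = [[0.1, 0.8, 0.1],   # P(next tag | DT)
     [0.2, 0.2, 0.6],   # P(next tag | NN)
     [0.5, 0.4, 0.1]]   # P(next tag | VB)
B = [{'the': 0.9}, {'dog': 0.5, 'bark': 0.1}, {'barks': 0.6}]
print(viterbi(['the', 'dog', 'barks'], tags, A, B, start))   # ['DT', 'NN', 'VB']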
Penn Treebank tagset
The English Penn Treebank tagset is used with English corpora annotated by the TreeTagger tool,
developed by Helmut Schmid in the TC project at the Institute for Computational Linguistics of
the University of Stuttgart. This version of the tagset contains modifications developed by Sketch
Engine (earlier version).
POS Tag | Description | Example
CC | coordinating conjunction | and
CD | cardinal number | 1, third
DT | determiner | the
EX | existential there | there is
FW | foreign word | les
IN | preposition, subordinating conjunction | in, of, like
IN/that | that as subordinator | that
JJ | adjective | green
JJR | adjective, comparative | greener
JJS | adjective, superlative | greenest
LS | list marker | 1)
MD | modal | could, will
NN | noun, singular or mass | table
NNS | noun, plural | tables
NP | proper noun, singular | John
NPS | proper noun, plural | Vikings
PDT | predeterminer | both the boys
POS | possessive ending | friend’s
PP | personal pronoun | I, he, it
PPZ | possessive pronoun | my, his
RB | adverb | however, usually, naturally, here, good
RBR | adverb, comparative | better
RBS | adverb, superlative | best
RP | particle | give up
SENT | sentence-break punctuation | . ! ?
SYM | symbol | / [ = *
TO | infinitive ‘to’ | to go
UH | interjection | uhhuhhuhh
VB | verb be, base form | be
VBD | verb be, past tense | was, were
VBG | verb be, gerund/present participle | being
VBN | verb be, past participle | been
VBP | verb be, sing. present, non-3d | am, are
VBZ | verb be, 3rd person sing. present | is
VH | verb have, base form | have
VHD | verb have, past tense | had
VHG | verb have, gerund/present participle | having
VHN | verb have, past participle | had
VHP | verb have, sing. present, non-3d | have
VHZ | verb have, 3rd person sing. present | has
VV | verb, base form | take
VVD | verb, past tense | took
VVG | verb, gerund/present participle | taking
VVN | verb, past participle | taken
VVP | verb, sing. present, non-3d | take
VVZ | verb, 3rd person sing. present | takes
WDT | wh-determiner | which
WP | wh-pronoun | who, what
WP$ | possessive wh-pronoun | whose
WRB | wh-adverb | where, when
# | # | #
$ | $ | $
'' | quotation marks | ’ ”
`` | opening quotation marks | ‘ “
( | opening brackets | ( {
) | closing brackets | ) }
, | comma | ,
: | punctuation | – ; : — …


Maximum Entropy Model


Similar to logistic regression, the maximum entropy (MaxEnt) model is a type of
log-linear model. The MaxEnt model is more general than logistic regression: it
handles multinomial distributions, whereas logistic regression handles binary
classification.
The maximum entropy principle says that we should model a given set of data with
the distribution of highest entropy among those that satisfy the constraints imposed
by our prior knowledge. The feature functions of a MaxEnt model can be multi-class;
for example, given (x, y), a feature function may return 0, 1, or 2.
The maximum entropy model is a conditional probability model p(y|x) that allows us
to predict a class label given a set of features for a data point. At inference time it
takes the trained weights, forms a linear combination of the feature values, and
chooses the tag with the highest score, and hence the highest probability.
To find the probability of each tag/class, MaxEnt is defined as:

p(y|x) = (1/Z(x)) * exp( Σ_{i=1..m} w_i · f_i(x, y) )

We define f_i as a feature function and w_i as the corresponding weight. The
summation over i = 1 to m runs over all feature functions, where m is the number of
feature functions. The denominator Z(x) normalizes the probability:

Z(x) = Σ_{y'} exp( Σ_{i=1..m} w_i · f_i(x, y') )

The MaxEnt model makes use of the log-linear approach with feature functions but
does not take sequential structure into account.
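A minimal sketch of MaxEnt scoring in Python: given illustrative weights and feature functions (both assumptions, not trained values), it computes p(y|x) for each candidate tag with a softmax over the linear scores:

import math

TAGS = ['NN', 'NP', 'VB']

def features(x, y):
    # Two toy binary feature functions over a (word, tag) pair.
    return [
        1.0 if x.endswith('tion') and y == 'NN' else 0.0,
        1.0 if x[0].isupper() and y == 'NP' else 0.0,
    ]

weights = [2.0, 1.5]   # w_i, one weight per feature function

def maxent_probs(x):
    # Linear combination of weights and feature values for every candidate tag ...
    scores = {y: sum(w * f for w, f in zip(weights, features(x, y))) for y in TAGS}
    # ... normalized by Z(x) so the values form a probability distribution.
    Z = sum(math.exp(s) for s in scores.values())
    return {y: math.exp(s) / Z for y, s in scores.items()}

print(maxent_probs('presentation'))   # 'NN' receives the highest probability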
Maximum Entropy Markov Model (MEMM)
From the maximum entropy model, we can extend to the Maximum Entropy Markov
Model (MEMM). This approach lets us keep the HMM’s treatment of sequential data
and combine it with the maximum entropy model’s feature functions and
normalization.
The Maximum Entropy Markov Model (MEMM) has explicit dependencies between
each state and the full observation sequence. This makes it more expressive than an
HMM.
In the HMM we saw two probability matrices (state transition and emission
probabilities). We need to predict a tag given an observation, but an HMM models the
probability of a tag producing a certain observation; this is due to its generative
approach. Instead of the transition and observation matrices of the HMM, a MEMM
has only one transition probability matrix. This matrix maps every combination of
previous state y_i−1 and current observation x_i seen in the training data to the
current state y_i.
Our goal is to find p(y_1, y_2, …, y_n | x_1, x_2, …, x_n). This factorizes as:

p(y_1, …, y_n | x_1, …, x_n) = Π_{i=1..n} p(y_i | y_1, …, y_i−1, x_1, …, x_n)

Since, as in the HMM, each state depends only on the previous state, we can limit the
condition on y_i to y_i−1. This is the Markov independence assumption:

p(y_1, …, y_n | x) = Π_{i=1..n} p(y_i | y_i−1, x)
So the Maximum Entropy Markov Model (MEMM) defines each local distribution with
a log-linear model:

p(y_i | y_i−1, x) = (1/Z(y_i−1, x)) * exp( Σ_k w_k · f_k(y_i−1, y_i, x, i) )

where x is the full sequence of inputs x_1 to x_n and y is the corresponding sequence
of labels or tags (0 and 1 in our case). The variable i is the position to be tagged and n
is the length of the sentence. The denominator Z(y_i−1, x) is the normalizer, defined
as:

Z(y_i−1, x) = Σ_{y'} exp( Σ_k w_k · f_k(y_i−1, y', x, i) )
A MEMM can incorporate many more features through its feature functions, whereas
an HMM requires the likelihood of each feature to be computed, since it is
likelihood-based. The feature functions of a MEMM also depend on the previous tag
y_i−1. As an example, for the letter ‘e’ in ‘test’, where the current tag is M and the
previous tag is B, a feature function might be:

f_k(y_i−1, y_i, x, i) = 1 if x_i = ‘e’, y_i = M and y_i−1 = B, and 0 otherwise.

The MEMM has a richer set of observation features that can describe observations in
terms of many overlapping features. For example, in our word-segmentation task we
could have features such as capitalization, vowel or consonant, or the type of the
character. A small sketch of such a feature function follows.
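For concreteness, here is a small sketch of such a feature function in Python; the tag names ('B' for begin, 'M' for middle) follow the word-segmentation example above, and the helper itself is an illustrative assumption:

def f_e_after_B(y_prev, y_curr, x, i):
    # Fires when the character at position i is 'e', the current tag is 'M'
    # and the previous tag is 'B', mirroring the example for 'e' in 'test'.
    return 1.0 if x[i] == 'e' and y_curr == 'M' and y_prev == 'B' else 0.0

print(f_e_after_B('B', 'M', 'test', 1))   # 1.0: 'e' with tags B -> M
print(f_e_after_B('B', 'M', 'test', 2))   # 0.0: the character at position 2 is 's', not 'e'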
