Assignment 3
AIM :
Survey various techniques for POS tagging and implement any one of
them.
THEORY :
Part-of-speech (POS) tagging is a common Natural Language Processing
task in which each word in a text (corpus) is assigned a particular part
of speech, based on both the word's definition and its context.
In the figure, each word has its lexical term written underneath it.
However, constantly writing out these full terms during text analysis
quickly becomes cumbersome, especially as the corpus grows. We therefore
use short representations, referred to as “tags”, to denote the categories.
As mentioned earlier, the process of assigning a specific tag to each word
in our corpus is referred to as part-of-speech tagging (POS tagging for
short), since the POS tags describe the lexical terms within our text.
Most POS tagging techniques fall into three categories: rule-based POS
tagging, stochastic POS tagging, and transformation-based tagging.
Markov Model :
Take the example sentence used earlier, “Why not tell someone?”, and
imagine it is truncated to “Why not tell …”. We want to determine whether
the next word in the sentence is a noun, verb, adverb, or some other
part of speech.
If you are familiar with English, you will instantly identify “tell” as a
verb and expect that it is more likely to be followed by a noun than by
another verb. The idea illustrated by this example is that the POS tag
assigned to the next word depends on the POS tag of the previous word.
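This is the Markov assumption: the tag of the current word is taken to
depend only on the tag of the word immediately before it,
P(t_i | t_1, ..., t_(i-1)) ≈ P(t_i | t_(i-1)).
These conditional probabilities between consecutive tags are what we
store in the transition matrix.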
How do we populate the transition matrix? Let’s use three sentences for
our corpus: “<s> in a station of the metro”, “<s> the apparition of these
faces in the crowd”, and “<s> petals on a wet, black bough.” (Note that
these are the same sentences used in the course.) Next, we break the
process of populating the matrix into steps:
1. Count occurrences of tag pairs in the training dataset
At the end of step one, our table of counts would look something like this…
2. Calculate the probabilities from the counts
Each count is divided by the total count of its row, i.e.
P(t_i | t_(i-1)) = C(t_(i-1), t_i) / C(t_(i-1)). Applying this formula to
the counts in the previous table, our new table would look as follows…
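As a rough illustration of both steps, here is a short Python sketch. The
tags assigned to the three example sentences are assumptions made for this
sketch, not tags taken from the course material.

from collections import defaultdict

# Illustrative tagged corpus: the (word, tag) pairs are assumed for this
# sketch; each sentence begins with the <s> start-of-sentence marker.
tagged_sentences = [
    [("<s>", "<s>"), ("in", "IN"), ("a", "DT"), ("station", "NN"),
     ("of", "IN"), ("the", "DT"), ("metro", "NN")],
    [("<s>", "<s>"), ("the", "DT"), ("apparition", "NN"), ("of", "IN"),
     ("these", "DT"), ("faces", "NNS"), ("in", "IN"), ("the", "DT"),
     ("crowd", "NN")],
    [("<s>", "<s>"), ("petals", "NNS"), ("on", "IN"), ("a", "DT"),
     ("wet", "JJ"), ("black", "JJ"), ("bough", "NN")],
]

# Step 1: count how often each (previous tag, next tag) pair occurs
transition_counts = defaultdict(int)
prev_totals = defaultdict(int)
for sentence in tagged_sentences:
    tags = [tag for _, tag in sentence]
    for prev_tag, next_tag in zip(tags, tags[1:]):
        transition_counts[(prev_tag, next_tag)] += 1
        prev_totals[prev_tag] += 1

# Step 2: divide each count by its row total to get P(next tag | previous tag)
transition_probs = {
    pair: count / prev_totals[pair[0]]
    for pair, count in transition_counts.items()
}

print(transition_probs[("DT", "NN")])  # probability that a noun follows a determiner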
You may notice that there are many 0s in our transition matrix, which
would make our model incapable of generalizing to other text that may
contain verbs. To overcome this problem, we add smoothing.
Smoothing slightly adjusts the formula above by adding a small value,
epsilon, to each count in the numerator and N * epsilon to the
denominator (where N is the number of tags), so that each row still sums
to 1.
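Continuing the sketch above, the smoothed probability is
P(t_i | t_(i-1)) = (C(t_(i-1), t_i) + epsilon) / (C(t_(i-1)) + N * epsilon),
where N is the number of tags. The epsilon value below is an assumed
illustrative choice.

# Add-epsilon smoothing: every possible transition gets a small pseudo-count
# so that no entry of the transition matrix is exactly 0.
epsilon = 0.001
all_tags = sorted({tag for pair in transition_counts for tag in pair})
N = len(all_tags)

smoothed_probs = {}
for prev_tag in all_tags:
    row_total = prev_totals[prev_tag] + N * epsilon
    for next_tag in all_tags:
        count = transition_counts.get((prev_tag, next_tag), 0)
        smoothed_probs[(prev_tag, next_tag)] = (count + epsilon) / row_total

# Each row still sums to 1, but transitions never seen in the corpus
# now receive a small non-zero probability.
print(smoothed_probs[("NN", "DT")])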
Hidden Markov Model
Hidden Markov Model (HMM) is a statistical Markov model in which
the system being modeled is assumed to be a Markov process with
unobservable (“hidden”) states. In our case, the unobservable states are
the POS tags of the words.
If we go back to our Markov Model, we see that the model has states for
parts of speech, such as VB for a verb and NN for a noun. We may now
think of these as hidden states, since they are not directly observable
from the corpus. Though a human may be capable of deciphering which POS
applies to a specific word, a machine only sees the text (which is
therefore observable) and is unaware of whether a word’s POS tag is noun,
verb, or something else, which in turn makes the tags unobservable.
Both the Markov Model and the Hidden Markov Model have transition
probabilities that describe the transition from one hidden state to the
next; however, the Hidden Markov Model also has something known as
emission probabilities.
The emission probabilities describe the transitions from the hidden states
in the model — remember the hidden states are the POS tags — to the
observable states — remember the observable states are the words.
In the figure, we see that the hidden VB state has several observable
states. The emission probability from the hidden state VB to the
observable word “eat” is 0.5, hence there is a 50% chance that the model
would output this word when the current hidden state is VB.
We can also represent the emission probabilities as a table…
Similar to the transition probability matrix, the row values must sum to
1. Also, all of our emission probabilities are greater than 0, since a
word can take a different POS tag depending on its context.
To populate the emission matrix, we follow a procedure very similar to
the one used for the transition matrix: first count how often each word
is tagged with a specific tag, then divide each count by the total count
for that tag.
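A sketch of that procedure, reusing the illustrative tagged corpus from
the transition-matrix example above:

from collections import defaultdict

# Count how often each tag emits (is assigned to) each word, then normalize
# per tag so that each row of the emission matrix sums to 1.
emission_counts = defaultdict(int)
tag_totals = defaultdict(int)
for sentence in tagged_sentences:
    for word, tag in sentence:
        emission_counts[(tag, word)] += 1
        tag_totals[tag] += 1

emission_probs = {
    (tag, word): count / tag_totals[tag]
    for (tag, word), count in emission_counts.items()
}

print(emission_probs[("DT", "the")])  # P(word "the" | tag DT)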
CODE :
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('treebank')
nltk.download('maxent_ne_chunker')
nltk.download('words')

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize

stop_words = set(stopwords.words('english'))

# Input text; here we reuse the example sentences from the theory section
txt = "Why not tell someone? The apparition of these faces in the crowd."

# Split the text into sentences, then tag each sentence separately
tokenized = sent_tokenize(txt)
for i in tokenized:
    # Tokenize the sentence into words and drop English stopwords
    wordsList = nltk.word_tokenize(i)
    wordsList = [w for w in wordsList if w not in stop_words]
    # Assign a POS tag to each remaining word
    tagged = nltk.pos_tag(wordsList)
    print(tagged)
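Running the code prints, for each sentence, a list of (word, tag) tuples
in which the tags are Penn Treebank labels such as NN for a noun or VB
for a verb.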