

SNLP

UNIT-1

Introduction and Word Level Analysis

1.1 Origins and Challenges of NLP

1.1.1 Origins of NLP

Natural Language Processing (NLP) is a branch of Artificial Intelligence that allows machines and computers to understand and interpret human language. It is essential for letting users communicate with a computer much as they would with another human being. The field is vast: people have been working on it for several decades and still see the potential to apply it to tasks that are not possible even today.

NLP is essentially the way humans communicate with a machine in natural language so that it can perform the tasks required of it. This means the earliest traces of NLP appear at the time when computers themselves were starting to be developed, although language processing is not limited to computers and can be applied to other machines as well. During the Second World War, Alan Turing built machines that used logic to help decipher the encrypted signals being sent by the Germans, and around 1950 he proposed the famous Turing Test, which asks whether a machine can communicate indistinguishably from a human. These early attempts at communicating with machines are often regarded as the starting point of NLP.

Right after this, researchers started using machine intelligence to perform tasks that could make life a lot easier for people. To ensure that the machine performed the tasks it was given, people needed to feed it appropriate data. The 1960s saw a great deal of work, led by Chomsky and other renowned researchers, on formal grammars and a universal syntax that people around the world could use to communicate with machines.

Development in this field was not rapid, but there was growing interest in taking it further in order to automate many manual activities. Progress was slow, and it was only around 1990 that interest picked up sharply. The 1990s saw a surge of models built using artificial intelligence; to build them, researchers relied heavily on language processing, which received a boost around that time.

All of this gave people an incentive to use NLP, together with machine learning and artificial intelligence, to develop probabilistic models that rely heavily on data. NLP became the key to letting machines read and interpret that data, and it has remained so up to the present time. Much of this built on speech recognition, which allowed audio to be transcribed and stored as data on the internet. In the Internet Age, language processing is used in almost every area around you; social media platforms, which form a major part of recent generations' lives, rely heavily on it. In short, without NLP we would not have achieved many of the technological advancements of recent times.

1.1.2 Challenges of NLP

NLP is one of the major ways in which a person can communicate with machines to
solve problems does come with a plethora of advantages. However, as advanced as the
technology may be there are always a few challenges that lie ahead of the following
which needs to be conquered to solve them. A little about each of the challenges are
given below.

1. Text Summarization

This challenge concerns readers who need to extract information from a large amount of text in a short time and would prefer a summary of it. AI uses NLP to interpret the text and structure a summary from its important points; however, many points still get missed. Solving this challenge well would help many applications that depend on inspecting large amounts of data.

2. Chatbots and Other AI Answer Machines

These are frequently used by people while browsing a website, to navigate the site and find what they need. Replies are easy to generate for questions that are asked frequently. However, at a higher level, with more complex reasoning, not every chatbot built with AI can provide reliable answers. The aim is to let everyone use this feature to solve their daily problems.

Key Takeaways

NLP has made great advances, but it still faces many challenges that could limit its use.


1.2 Language Modelling: Grammar-based LM, Statistical LM

Language models (LMs) are used to estimate the likelihood of different words or phrases relative to other words or phrases, in order to improve other NLP applications. Most language models use probabilistic inference to calculate the probability of the next word the user intends to write. They can also be used to find the probability of a certain word or phrase in existing data.

P(w) = P(w1, w2, w3, ..., wn) = ∏ P(wi | w1, w2, w3, ..., wi−1)

where n is the number of words in the sequence and wi is the i-th word, whose probability is conditioned on all of the words that precede it.
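For illustration only (the probability values below are made up, not taken from a trained model), here is a small Python sketch of how the chain-rule factors multiply out into a sentence probability:

```python
from functools import reduce

def sentence_probability(cond_probs):
    """Multiply the chain-rule factors P(wi | w1 ... wi-1) to get P(w1 ... wn)."""
    return reduce(lambda acc, p: acc * p, cond_probs, 1.0)

# Hypothetical conditional probabilities for "This glass is transparent."
factors = [
    0.20,   # P(This)
    0.05,   # P(glass | This)
    0.40,   # P(is | This glass)
    0.10,   # P(transparent | This glass is)
]
print(sentence_probability(factors))   # ~0.0004
```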

1.2.1 Grammar Based Language Modelling

As the name rightly suggests, this model is based on the grammatical rules that have been given to the machine, which then makes a prediction based on the probabilistic inference that the sentence has occurred in a similar form before. In other words, how we expect the user's sentence to continue is based on what has happened previously in the data. Let us understand this model further with the help of an example.

E.g. Sentence: “This glass is transparent.”

P("This glass is transparent.") = P(w1, w2, w3, w4)

To construct the above sentence or predict it we need to see all the possible
combinations based on the rule that is given below.

1. P(This ) = P(w1)
2. P(glass | This ) = P(w2)
3. P( is | This glass ) =P(w3)
4. P(transparent | This glass is ) = P(w4)

If the model needs to calculate the probability of the above sentence it would have to
follow the same formula as given above.


Therefore, P(w) = ∏ P(wi | w1, w2, ..., wi−1)

This would give us the probability of the grammatically correct statement which would
then be predicted and shown to the user. Based on this one could build a sentence or
phrase completion model using NLP.

1.2.2 Statistical Based Language Modelling

As the name suggests, this model is based on statistical inferences made from the data that has been fed to the machine that is supposed to make the final inferences. This language model relies heavily on data being entered in a correct and appropriate form. A probability distribution is defined over the required word sequences and is then applied to the textual data to draw inferences from the values assigned to them.

E.g. 1) p("Today is Monday") = 0.001

2) p("Today Monday is") = 0.00000001

The values are approximate and simply illustrate the relative likelihoods the model works with when making inferences.

This model is also known as a 'generative model', because it has a probabilistic mechanism for generating text. All of the values are estimated from the data that has been entered, so the model gains appropriate context about the events underlying the probability distribution. A statistical LM also happens to be context-dependent: the way a certain piece of data is entered and interpreted affects the probabilistic outcome.

This model is extremely useful for quantifying the uncertainty that might be lurking in data entered in the form of speech or plain text. This can be seen clearly in the example below.

E.g. 1) John feels _________. (happy, habit)

This is an example where the language model needs to predict the next word the person is going to say or type, to save time. 99 percent of people would agree that "happy" is the correct option. However, the language model does not think the way a normal English-speaking human does: an LM simply uses the probability of each candidate word in this context to predict the next word.


Hence the final answer would be “John feels happy.”

Key Takeaways

Grammar-based modelling builds up the probability of a sentence word by word, conditioning each word on the ones that come before it.

1.3 Regular Expressions

Regular Expression is a language that has been developed to help the user specify a
search string to be found in the data. The pattern needs to be specified to get the
desired results.

E.g. if a person wishes to find 10 numbers in the data but does not know the exact values, he/she can simply specify the pattern those numbers follow in order to find them.

A regular-expression search involves two things: a corpus and a pattern.

Corpus: The corpus is the body of text over which the entire search is executed. It holds the bulk of data that can be searched.

Pattern: This is the pattern entered in the search to obtain the required results. The basic expressions used to build patterns are given below.

There are a few basic properties of a regular expression that need to be


understood to use ‘RegEx’ (Regular Expression) efficiently.
1. Brackets ("[ ]")

Brackets specify a disjunction of characters: the pattern matches any single character listed inside them. For instance, [sS] matches either a capital or a lower-case 's'.

E.g. /[sS]mall/ = "Small" or "small" (either combination)


/[abc]/ = 'a', 'b' or 'c' (any one of the letters in the brackets)

/[0-5]/ = any one of the digits 0 to 5

Dash (“-”)

The dash specifies a range for a particular search. Putting a dash between two characters inside brackets matches any single character in the range between them, inclusive.

E.g. /[A-Z]/ = any upper-case letter from A to Z would match the search.

/[a-z]/ = any lower-case letter from a to z would match the search.

Caret ("^")

When the caret is the first symbol inside brackets it negates the character class; elsewhere it is simply an ordinary caret character.

E.g. /[^A-Z]mall/ = 'mall' preceded by anything that is not an upper-case letter

/[^Dd]/ = any character that is neither 'd' nor 'D'

/[e^]/ = either 'e' or a literal caret, since the caret is not in the first position

Question Mark ("?")

The question mark marks the preceding character as optional: the pattern matches the string both with and without that character.

E.g. /malls?/ = matches either "mall" or "malls"

/colou?r/ = matches either "color" or "colour"

Period (".") and Kleene Star ("*")

The period is a wildcard that matches any single character, which is useful when one position in an expression can be filled by any letter. The Kleene star "*" matches zero or more repetitions of the previous character or expression.

E.g. /beg.n/ = matches any such word, e.g. 'begin' or 'begun'

/ba*d/ = matches 'bd', 'bad', 'baad', and so on


Anchors

Anchors can be used to match a certain assertion in the text that has been searched.

E.g. Consider “^” and “$” as specified anchors. Then


^The = Matches a string that starts with ‘The’
bye$ = Matches a string that ends with ‘bye’

Character Classes

“\d”: Matches a single character which is a digit.


"\w": Matches a word character, i.e. an alphanumeric character or an underscore.
“\s”: Matches a whitespace character. This includes line breaks as well as
tabs.
“.”: Matches any character.
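The operators above can be tried out directly; the following sketch uses Python's re module on a made-up sample string (the example text and patterns are illustrative, not from the notes):

```python
import re

text = "Small malls sell colour and color; begin, begun, bad, baad in 2024."

print(re.findall(r"[sS]mall", text))        # ['Small'] - disjunction of 's'/'S'
print(re.findall(r"colou?r", text))         # ['colour', 'color'] - optional 'u'
print(re.findall(r"beg.n", text))           # ['begin', 'begun'] - '.' matches any character
print(re.findall(r"ba*d", text))            # ['bad', 'baad'] - '*' = zero or more 'a'
print(re.findall(r"[0-9]+", text))          # ['2024'] - a range of digits
print(re.findall(r"[^a-zA-Z0-9\s]", text))  # [';', ',', ',', ',', '.'] - negation with '^'
```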

Key Takeaways
Regular Expression is a language that has been developed to mainly match
different patterns in the data.

1.4 Finite State Automata

A Finite State Automaton (FSA) is an abstract computing device designed as a mathematical model of a system. The system has discrete inputs and outputs, a set of states, and transitions between states that occur on input symbols. It can be represented in 3 forms, namely:
1. Graphical forms (Transition Diagram)
2. Tabular forms (Transition Table)
3. Mathematical forms (Mapping functions or Transition functions)

A finite automaton is defined by a 5-tuple:

M = ( Q, Σ, δ, q0, F )

Q: a finite set of states

Σ: a finite set of input symbols (the alphabet)
δ: Q × Σ → Q, the transition function
q0: the initial state, a member of Q
F: the set of final (accepting) states, a subset of Q
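As a rough illustration of the 5-tuple (the states, alphabet, and transition table below are invented for the example, not taken from the notes), a deterministic FSA can be simulated with a dictionary acting as the transition function δ:

```python
# The automaton below accepts strings over {'a', 'b'} that end in 'ab'.
def accepts(string, delta, q0, finals):
    """Run the FSA: follow delta from q0 over the input and check the final state."""
    state = q0
    for symbol in string:
        if (state, symbol) not in delta:
            return False                  # no transition defined -> reject
        state = delta[(state, symbol)]
    return state in finals

delta = {                                 # transition table: (state, symbol) -> next state
    ("q0", "a"): "q1", ("q0", "b"): "q0",
    ("q1", "a"): "q1", ("q1", "b"): "q2",
    ("q2", "a"): "q1", ("q2", "b"): "q0",
}
print(accepts("aab", delta, "q0", {"q2"}))   # True
print(accepts("abb", delta, "q0", {"q2"}))   # False
```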


Fig. 1: The three steps in the formation of an FSA.

Key Takeaways

FSA can be represented in 3 forms namely:


Graphical forms (Transition Diagram)
Tabular forms (Transition Table)
Mathematical forms (Mapping functions or Transition functions)

1.5 English Morphology

English morphology is essentially the process of structuring the language so that new words are formed from existing base words, which can then be used further in NLP. This type of word formation is known as 'derivational morphology'. We can achieve it in several ways: a stem (root word) can be paired with different prefixes and suffixes (affixes) to make a new word that is relevant to our language processing. A word built from a pre-existing one in this way is called a derived word, and a great many English words were formed through such derivation. English derivation, however, is complex. The reasons are:

1. Different meanings: Many words share the same root word but differ when it comes to the true meaning of the word. For example, take the word "conform": adding two different suffixes gives the words "conformity" and "conformation", which have quite different meanings despite stemming from the same root word.
2. Non-uniform effects: An affix does not behave the same way across the whole language. For example, adding the nominalizing suffix "-ation" to the verb "summarize" gives "summarization", a word with a closely related meaning, but the same pattern cannot be applied uniformly across the entire plethora of the language.

E.g.

Category    Root       Affix   Derived word   New category
Noun        character  -ize    characterize   Verb
Verb        give       -er     giver          Noun
Adjective   real       -ize    realize        Verb
Noun        colour     -ful    colourful      Adjective
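A small sketch of derivation written as rules (the RULES table and the derive function are invented for illustration): naively gluing stems and affixes together also shows why the orthographic rules discussed in the next section are needed:

```python
RULES = {
    "Noun":      [("-ize", "Verb"), ("-ful", "Adjective")],
    "Verb":      [("-er", "Noun")],
    "Adjective": [("-ize", "Verb")],
}

def derive(stem, category):
    """Return the derived words obtainable from a stem of a given category."""
    derived = []
    for suffix, new_cat in RULES.get(category, []):
        derived.append((stem + suffix.lstrip("-"), new_cat))
    return derived

print(derive("character", "Noun"))   # [('characterize', 'Verb'), ('characterful', 'Adjective')]
print(derive("give", "Verb"))        # [('giveer', 'Noun')] - naive concatenation; real
                                     # morphology needs spelling (orthographic) rules
```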

Key Takeaways

English Morphology is used by the system in order to structure the entire language and
make some sense of it.

1.6 Transducers for Lexicon and Rules

A finite-state transducer (FST) is essentially a representation of a set of string pairs, used to map inputs to outputs and so analyse language data. The pairs can be any kind of input/output pairs over the data.
E.g. an FST can represent {(run, run +Verb +PL), (run, run +Noun +SG), (ran, run +Verb +Past), ...}
Viewed as a function, a transducer maps an input to a set of zero or more outputs; because a single input may map to several outputs, the regular relation an FST defines is not, in general, a function.
E.g. transduce(run) = {run +Verb +PL, run +Noun +SG}
Finite-state transducers essentially have 2 tapes, an input tape and an output tape, and the pairs they represent are made up of input and output labels. There is also the important concept of transducer inversion, which switches the pairs, i.e. the input and output labels. The inversion of a transducer T, written T⁻¹, simply swaps the input and output labels: T maps an input I to an output O, whereas T⁻¹ maps O back to I.

T = {(a, a1), (a, a2), (b, b1), (c, c1)}

T⁻¹ = {(a1, a), (a2, a), (b1, b), (c1, c)}
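The relation view of a transducer, and its inversion, can be sketched with plain Python sets of pairs (this is an illustrative toy, not a real FST library):

```python
T = {("a", "a1"), ("a", "a2"), ("b", "b1"), ("c", "c1")}

def transduce(pairs, symbol):
    """Return the set of outputs the transducer relates to a given input."""
    return {out for inp, out in pairs if inp == symbol}

def invert(pairs):
    """Swap input and output labels: T^-1."""
    return {(out, inp) for inp, out in pairs}

print(transduce(T, "a"))           # {'a1', 'a2'} - one input, several outputs
print(transduce(invert(T), "a1"))  # {'a'}
```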


We can set the two tapes side by side as an example:

Lexical tape (output alphabet): (dog +Noun +Plural)

Surface tape (input alphabet Σ): (dogs)
Formally, a finite-state transducer T defines a relation between two languages, an input language and an output language, by pairing input strings with output strings. FSTs have a few applications, which are given below.
1. Recognizer

The transducer takes a pair of strings and decides whether to accept or reject the pair based on certain criteria.

2. Generator

The transducer can also produce a pair of strings (from L(input) and L(output)) as output for a required language.

3. Translator

The transducer reads a string placed as input by the user and then uses morphological parsing to translate that input into morphemes.

4. Relater

FSTs are key as a medium that can relate different types of sets to one another.
Lexicon
Many words in a language are built from existing meaningful parts: a base word combined with an affix yields a word formed through morphology. Which combinations are allowed is largely governed by rules known as "orthographic rules", and these form the main basis on which morphological parsing takes place. Morphological parsing breaks a complex word form down into 'morphemes', and these morphemes in turn make up words that have meaning and sense.
A "lexicon" is essentially a list of the stems of a language and the affixes that can suit them to form new words. A lexicon should contain the list of base words from which different types of words can be built.

Key Takeaways

A “Lexicon” is essentially a list of words that can be used as the stem and affixes that would
suit them to form a new word.

1.7 Tokenization


Before processing language, input text in any form needs to undergo normalization, which ensures the text is in a consistent, easy-to-work-with form before any further processing. The first step of normalization is known as tokenization, or 'segmentation' of words: the process of breaking a given string into smaller parts (tokens). For text that is solely in English, simple word tokenization can even be achieved from the UNIX command line. Some of the basic commands used for this are as follows.
1. "tr": makes systematic character-by-character changes to the input (for instance, turning every non-letter into a newline so that each word sits on its own line).
2. "sort": sorts the lines of the input into alphabetical order, so that identical tokens appear together for easier inference.
3. "uniq": collapses adjacent identical lines into one, which makes it easy to see and count the distinct tokens.
A simple way to start tokenizing is to split the text on whitespace. In Python this can be done with the string method raw.split(), as used in the NLTK book. Splitting on the space character alone is not enough, though: a regular expression such as \s+ lets us also match tabs and newlines while splitting.
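A small sketch of these splitting options (the sample string is made up; re is Python's standard regular-expression module):

```python
import re

raw = "NLP lets computers\tread text.\nIt breaks text into tokens."

print(raw.split())                      # split on any whitespace run (spaces, tabs, newlines)
print(raw.split(" "))                   # splitting on the space character alone keeps '\t' and '\n'
print(re.split(r"\s+", raw))            # regular expression: one or more whitespace characters
print(re.findall(r"\w+|[^\w\s]", raw))  # words OR punctuation marks as separate tokens
```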
Problems with tokenization:
Tokenization is a great way to analyse and break down complex language structures, but it can be difficult to execute well, and a single solution rarely serves every purpose as efficiently as one would expect.
Deciding what should be grouped as a token, and what should not, is key to enabling users to work easily with tokenized text.
Comparing the tokenizer's output against raw, manually tokenized data helps to inspect the flaws and finally settle on an appropriate tokenization.
Contractions pose a particular problem, e.g. "couldn't": when analysing sentence meaning it should potentially be broken down into 2 separate tokens ("could" and "n't").

Key Takeaways

Tokenization is generally done using finite automata or regular expressions.

1.8 Detecting and Correcting Spelling Errors

When text is typed into the computer, the words should ideally be entered correctly, and most of the time the corpus is entered appropriately. However, if at some point the person does not catch the word being dictated to them, an incorrect spelling ends up attached to it. The computer then has to deal with this before the corpus can be examined and analysed further, which is one of the main reasons why detecting spelling errors is essential for any corpus.

To get an idea of what can go wrong, we can list the kinds of errors that occur in a corpus:
Ignoring grammar rules


Semantically similar components


Phonetically similar components

There are also different error types that are important to know about while typing. The machine has to recognise these before the data goes into any correction algorithm.
Deletion – a letter is missing from the string.
Insertion – an extra letter (or letters) has been typed into the string.
Substitution – a letter (or letters) in the string needs to be replaced by another letter (or letters).
Transposition – all the letters are present but in a different order; swapping letters in the string yields the intended word.

To detect spelling issues there are 2 models that can be employed, depending on the corpus in question. They are:

1. Direct Detection

After working through many corrections, the system keeps a list of valid words and of words it has corrected before. Running the corpus against this list catches many commonly misspelled words; words that check out are considered correct.
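A minimal sketch of direct detection against a word list (the VALID_WORDS set and the misspelling below are invented for illustration):

```python
VALID_WORDS = {"the", "glass", "is", "transparent", "drink", "milk"}

def find_misspellings(tokens):
    """Return the tokens that do not check out against the word list."""
    return [t for t in tokens if t.lower() not in VALID_WORDS]

print(find_misspellings("the glass is transparant".split()))  # ['transparant']
```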

2. Language Model Detection

This model is a little different from the previous one, because a simple list may flag words as wrong even when they are correct. The user therefore builds a model trained specifically for the purpose of correction: it takes a good look at the context and at the different words that could be suggested as alternatives to the misspelled word.

Key Takeaways

Spelling errors arise through deletion, insertion, substitution, and transposition, and can be detected either directly against a word list or with a language model.

1.9 Minimum Edit Distance

As the name rightly suggests, this topic is about comparing two strings. Measuring the similarity between two or more phrases (strings) makes up a major part of NLP. An everyday example is the autocorrect feature you notice while typing a message on your mobile: if the user types, say, 'kangroo' instead of 'kangaroo', the system tries to work out what exactly the user meant. It concludes that the user has made a mistake and intended a similar word; based on simple NLP it identifies 'kangaroo' as the appropriate spelling and consciously makes the change to save the user time and effort.
At times the spelling is not the only thing that differs between strings. Often two strings differ by only a single word. E.g.
1. "Andrew is a football player."
2. "Andrew is a good football player."
On closer observation we notice that the two sentences are very similar and differ by only a single word, which suggests that they may refer to the same thing. Now that we have an idea of the similarities that can occur between two strings, we can move on to the definition of "Edit Distance", or simply "Minimum Edit Distance".
Edit distance measures the distance between two strings in terms of their similarities and differences. It is the minimum number of operations the user would have to execute to make two strings alike (to transform S1 into S2); the operations include deletion, insertion, and substitution of characters (or, in some settings, words).
We can look at the pair of words 'intention' and 'execution' to understand this. If the two words are aligned character by character, many letters turn out to sit in very similar positions in the word we intend to transform into. The transformation can be carried out with the help of 3 distinct operations, viz. deletion, insertion, and substitution, and such an alignment shows visually how knowledge of the minimum edit distance helps in text transformation and editing.

We now have a clear idea of what exactly the minimum edit distance is in NLP. However, to calculate it we need an algorithm known as the "Minimum Edit Distance Algorithm", which finds the cheapest sequence of the simple operations mentioned above that turns one word into the other.
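A sketch of the standard dynamic-programming algorithm with unit costs for insertion, deletion, and substitution (the cost scheme is an assumption; some textbooks use a substitution cost of 2):

```python
def min_edit_distance(source, target):
    """Dynamic-programming minimum edit distance (Levenshtein distance)."""
    n, m = len(source), len(target)
    # dist[i][j] = edit distance between source[:i] and target[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i                          # delete all of source[:i]
    for j in range(1, m + 1):
        dist[0][j] = j                          # insert all of target[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if source[i - 1] == target[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,        # deletion
                             dist[i][j - 1] + 1,        # insertion
                             dist[i - 1][j - 1] + sub)  # substitution / match
    return dist[n][m]

print(min_edit_distance("intention", "execution"))  # 5 with unit substitution cost
print(min_edit_distance("kangroo", "kangaroo"))     # 1 - a single insertion
```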

Key Takeaways

The minimum Edit distance refers to the least number of actions that it takes to convert
string 1 into string 2. We can achieve this through the minimum edit distance algorithm.


1.10 N-Gram Models

Word Level Analysis


An N-gram is simply a contiguous sequence of N units (usually words) taken from a sentence; it is the number of tokens taken together while reading the data that has to be processed. For example:
New Delhi – (a 2-gram)
Phir Hera Pheri – (a 3-gram)
He stood up quickly – (a 4-gram)
It is obvious that we have heard the first two examples much more frequently than the last one, "He stood up quickly", an N-gram that is observed far less often. Assigning a probability to the occurrence of the N-grams in the model, and using it to predict the next word in a string, benefits us for 3 main reasons:
1. Many frequent 2-grams, such as "New Delhi", can be treated as a single unit, which in turn makes the text easier to read and process.

2. Predictions can be made very easily. The next word used to complete a statement would be the one that has appeared most often in that context, because of the probability assigned to it. E.g. given the sentence "Please submit your ______.", the next word is far more likely to be 'assignment' than 'house'.

3. Spelling correction can also be based on which word tends to occur after a given one. E.g. if a user types "drink mlik" because of a typing error, the machine will correct it to 'milk', since 'milk' has a much higher chance of coming after the word 'drink'.

Now that we understand the use of an N-gram model, let us move forward and look at it a little more formally. An N-gram model predicts the next word from the previous (N−1) words. A few terms help us understand the type of model we will be working with, and how we pick out the required data to figure out the next word:
Bi-gram model – N = 2, so we look back at (N−1) = (2−1) = 1 previous word to predict the next word accurately.

Tri-gram model – N = 3, so we look back at (N−1) = (3−1) = 2 previous words to predict the next word accurately.

To see how the model works further we need to work with assigning a certain probabilistic
density to each of the following words in a model. Depending on the sentence the user can
choose the model which could be bigram, trigram, or even further for more complex strings. To
illustrate we need a ‘corpus’ or a large set of small sentences. Let us look at a few small sentences
to illustrate this property.


1. He went to New Delhi.


2. New Delhi has nice weather.
3. It is raining in the new city.
Let us assume one of the models mentioned above, say the bigram model. The bigram probability is

P(wn | wn−1) = Count(wn−1 wn) / Count(wn−1),

i.e. the number of times the previous word and the required word appear together, divided by the number of times the previous word appears in the entire set of lines.
Let us find the probability that the word 'Delhi' comes after the word 'New'. For this we take the number of times 'New Delhi' appears divided by the number of times 'New' (counted case-insensitively) appears in the 3 lines above.
= (number of times 'New Delhi' appears) / (number of times 'New' appears)
= 2 / 3
= 0.67
Bigram models work well with many different types of corpora, and a bigram model may even work better than a trigram model for more complex strings. However, when a lot of data is available for training it becomes feasible to work with 4-gram and 5-gram models. One also needs to analyse both the left and right context of a sentence before making any adjustments or assumptions about it. Now we will look at the different unsmoothed N-gram models and their evaluation.
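A small sketch estimating bigram probabilities from the three-sentence corpus above (the sentences are lower-cased so that 'New' and 'new' are counted together, an assumption made for the illustration):

```python
from collections import Counter

corpus = [
    "he went to new delhi",
    "new delhi has nice weather",
    "it is raining in the new city",
]

unigrams, bigrams = Counter(), Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))

def bigram_prob(prev, word):
    """P(word | prev) = Count(prev word) / Count(prev)."""
    return bigrams[(prev, word)] / unigrams[prev]

print(round(bigram_prob("new", "delhi"), 2))  # 0.67 -> 2 counts of 'new delhi' / 3 counts of 'new'
```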

Key Takeaways

The N-gram model uses probability to make assumptions based on the corpus that is used
to train the function.

1.11 Evaluating N-Grams

When we have a model like the N-gram model, we must know how effective it is in practice as well as in hypothetical cases; if an application built on the model improves as it is given more time and data, the model's performance is clearly improving. One way to analyse a model is 'extrinsic evaluation': a method that works end to end, from the model's input through to the application's output. We simply run the application twice, with a slight tweak to the model, and then compare the two runs. The worst part about a test like this is the cost: running an entire NLP system twice just for testing can be very expensive.
Because of the cost, and because the previous form of evaluation, while accurate, is not very practical in the long run, another method was devised. It fixes a metric that is the same for all models, so that similar statistics can be compared directly without building a full application around the model. Such testing of a model is known as 'intrinsic evaluation'.
To carry out intrinsic evaluation we need to train on our corpus, and for that we need a 'test set'. As with many other tests in statistics, evaluating an N-gram model relies heavily on held-out data, so we keep a 'training set' and a 'test set' whose data the system has not seen. This lets us see how the model handles features that are not prevalent in the current dataset but might appear in future datasets.

Once we train our model on these sets, the algorithm tends to become tuned explicitly to the features present in them, which makes it perform poorly on new features that could appear in future datasets. This is why people use a 'development test set', or 'devset', for tuning. A common distribution of the data is 80% going to the training set, 10% to the test set, and the remaining 10% to the development set.
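A tiny sketch of the 80/10/10 split described above (the corpus here is just a list of placeholder strings):

```python
import random

sentences = [f"sentence {i}" for i in range(100)]
random.shuffle(sentences)

n = len(sentences)
train = sentences[: int(0.8 * n)]               # 80% training set
dev   = sentences[int(0.8 * n): int(0.9 * n)]   # 10% development (tuning) set
test  = sentences[int(0.9 * n):]                # 10% held-out test set
print(len(train), len(dev), len(test))          # 80 10 10
```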

Key Takeaways

An N-gram model is trained on a training set and evaluated either extrinsically (inside an application) or intrinsically (with a fixed metric on a held-out test set).

1.12 Smoothing

While testing our model we often come across words that are unknown to us, or that are meaningful but occurred in the dataset only once or not at all. Because of their rarity, the algorithm might assign them zero probability, which can make the whole result null and void due to the negligible importance given to those words. The process of moving some probability mass onto such unseen or rare events is known as smoothing or discounting. A few ways of smoothing are given below.

Laplace Smoothing
Before normalizing the counts into probabilities, we add 1 to every count (for a bigram model, to every bigram count). Every count goes up by one: e.g. a count of 2 becomes 3, a count of 3 becomes 4, and a count of 0 becomes 1. This is the simplest form of smoothing; it is not of great use in modern N-gram models, but it does a great job of getting one familiar with the concepts behind other types of smoothing.
Let us get an idea of this smoothing with the help of a unigram model, where:
c = the count of word w in the corpus
N = the total number of word tokens (after tokenization)

Therefore, the unsmoothed estimate is P(w) = c / N

Now, as explained above, Laplace smoothing increases each word count by 1. After making this increment, we must also adjust the denominator to make up for the W word types in the vocabulary, since each of them received an extra count:
c = the count of word w in the corpus
N = the total number of word tokens (after tokenization)
W = the number of distinct word types in the vocabulary

Therefore, P_Laplace(w) = (c + 1) / (N + W)
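A minimal sketch of add-one smoothing for a unigram model (the sample sentence is made up, and W here counts only the word types seen in it; in practice the full vocabulary size is used):

```python
from collections import Counter

tokens = "the glass is transparent and the glass is clean".split()
counts = Counter(tokens)
N = len(tokens)                 # total number of tokens (9)
W = len(counts)                 # number of distinct word types seen (6)

def p_mle(word):
    """Unsmoothed estimate: c / N (zero for unseen words)."""
    return counts[word] / N

def p_laplace(word):
    """Add-one estimate: (c + 1) / (N + W)."""
    return (counts[word] + 1) / (N + W)

print(p_mle("glass"), p_laplace("glass"))   # 2/9 ~ 0.222  and  3/15 = 0.2
print(p_mle("water"), p_laplace("water"))   # 0.0          and  1/15 ~ 0.067 - no longer zero
```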

Key Takeaways

P_Laplace(w) = (c + 1) / (N + W) is the formula for Laplace smoothing.

1.13 Interpolation and Back off – Word Classes, Part-of-Speech Tagging, Rule-
based, Stochastic, and Transformation-based Tagging

We have learned how smoothing deals with the problem of not assigning any probability to new words. However, there are other problems to take care of, shown below: some problems that would involve a trigram can instead be solved using a bigram, and some bigram problems can be solved with unigram models. Using less context in this way is a good way to let the model generalize. There are 2 different ways in which we can do this.

1.13.1 Interpolation (Jelinek-Mercer Smoothing)

Interpolation, which is also a type of smoothing, is sometimes used in place of backoff. The process of interpolation mixes the probability estimates of all the available n-gram estimators, by consciously weighting the counts from the trigram, bigram, and unigram models.

In simple interpolation, the different n-gram orders mentioned above are combined: mixing all the models gives us the probability P(wn | wn−2 wn−1) with a weight assigned to each estimate. If λ denotes the weight coefficients, we get the following expression.

P̂(wn | wn−2 wn−1) = λ1 P(wn | wn−2 wn−1) + λ2 P(wn | wn−1) + λ3 P(wn),  with λ1 + λ2 + λ3 = 1

There is also a slightly more complex version in which each λ is computed as a function of the context in which it is used, on the assumption that the trigram counts, where available, are more reliable than the bigram and unigram counts. The weights then become functions of the preceding words, giving the modified equation

P̂(wn | wn−2 wn−1) = λ1(wn−2 wn−1) P(wn | wn−2 wn−1) + λ2(wn−2 wn−1) P(wn | wn−1) + λ3(wn−2 wn−1) P(wn)
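A sketch of simple linear interpolation with fixed weights (the λ values and the probability tables are made-up placeholders; in practice the λ's are learned from held-out data):

```python
def interpolated_prob(word, prev2, prev1, unigram, bigram, trigram,
                      lambdas=(0.1, 0.3, 0.6)):
    """P_hat(w | prev2 prev1) = l3*P_trigram + l2*P_bigram + l1*P_unigram."""
    l1, l2, l3 = lambdas
    p_uni = unigram.get(word, 0.0)
    p_bi  = bigram.get((prev1, word), 0.0)
    p_tri = trigram.get((prev2, prev1, word), 0.0)
    return l3 * p_tri + l2 * p_bi + l1 * p_uni

# Made-up probability tables for illustration.
unigram = {"delhi": 0.01}
bigram  = {("new", "delhi"): 0.6}
trigram = {("to", "new", "delhi"): 0.9}
print(interpolated_prob("delhi", "to", "new", unigram, bigram, trigram))  # 0.721
```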


1.13.2 Back off (Katz Smoothing)


1. Backoff is the process where the user first uses the trigram as the source of evidence for the NLP algorithm. If that evidence is not available, the user can switch to a bigram model and then to a unigram model. In short, the user "backs off" from the higher-order gram model whenever there is zero evidence for it during discounting, allowing the lower-order n-gram models to come into the picture.
2. In the algorithm, if the count of the desired N-gram in the corpus is zero, the model is reduced to one over (N−1)-grams. This process simply continues until the model has enough evidence to be used as an appropriate model. The higher-order grams save some probability mass, through discounting, to pass down to the lower-order models.
3. Note that the total probability distributed over the discounted counts and the backed-off estimates must not exceed 1. Katz backoff relies on a discounted probability P* (P-star), together with a function α (alpha) that distributes the reserved probability mass fairly among the lower-order estimates. It is mainly used with the Good-Turing algorithm for the smoothing step; collectively this is known as the Good-Turing/Katz backoff method, which handles the N-grams with zero counts in the corpus.
4. For a bigram model, the backoff probability P(BO) is given as:
P_BO(wn | wn−1) = P*(wn | wn−1), if Count(wn−1 wn) > 0
                = α(wn−1) · P(wn), otherwise

Comparison between back off and interpolation

1. Both models (interpolation and backoff) involve higher- and lower-order gram models in computing the final result during smoothing.

2. When an N-gram has a non-zero count, interpolation still uses the information given by the lower-order n-gram models.

3. When an N-gram has a non-zero count, backoff does not use the lower-order n-gram models at all.

4. One can create an interpolated version of a backoff algorithm, and likewise a backoff version of an interpolated algorithm.
1.13.3 Word Classes
When we look at the topic of word classes, we find a very simple layer of structure in the English language. A good knowledge of the different word classes helps us understand morphology and the other lexical analyses happening in language processing. We can start with the simple examples of word classes given below and then move forward.


1. Basic Word Classes

Noun – N
Verb – V
Adjective – A
Adverb – a

Closed Classes

Determiners – the, an, a


Pronouns – he, she, it, they, etc.
Prepositions - over, from, with, under, around, etc.

1.13.4 Part of Speech (POS) Tagging


When we talk about the different word classes above, we get a clear idea about the English
language that we are dealing with here. Below is a table that is made up of the tags given to the
different parts of speech along with examples of the same to get you familiar with the following.
In POS, each word class is tagged and then used as a token during the analysis process.

WORD CLASS TAG EXAMPLE

Singular Noun NN House, school

Plural Noun NNS Boys, girls

Singular Proper Noun NNP APPLE, OPPO

Plural Proper Noun NNPS Anands, Geetas

Base Verb VB Work, play

Verb past tense VBD Worked, played

Verb Gerund VBG Working, Playing

Verb Past Participle VBN Driven

Adjective JJ Beautiful, long

Comparative Adjective JJR Bigger, Longer

Superlative Adjective JJS Deadliest

Adverb RB Easily, beautifully

Preposition IN Under, around, near


Conjunction CC And, or, neither, nor

Determiner DT The, a, an

Foreign Word FW Mea Culpa

Existential EX (is) there

Personal pronoun PRP You, He

Possessive pronoun PRP$ Yours, Theirs

List / Item Marker LS 3,4, three, four

Modal MD Must, might

Cardinal Number CD Nine, ten

Interjection UH Ouch!, Alas!

Dollar Sign $ $

Pound Sign # #

Left Quote “ ‘, “

Right Quote ” ’, ”

Left Parentheses ( (, [, {, <

Right Parentheses ) ), ], }, >

Comma , ,
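For a quick illustration, assuming the NLTK library and its tagger data are installed (this example is not part of the original notes), tags like the ones above can be produced automatically:

```python
import nltk
# One-time downloads (resource names may vary by NLTK version):
# nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")

tokens = nltk.word_tokenize("The boys played near the big mall")
print(nltk.pos_tag(tokens))
# e.g. [('The', 'DT'), ('boys', 'NNS'), ('played', 'VBD'),
#       ('near', 'IN'), ('the', 'DT'), ('big', 'JJ'), ('mall', 'NN')]
```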

1.13.5 Rule-Based Tagging


This happens to be one of the earliest known methods of POS tagging. It simply uses the tags for the word classes present in the lexicon or the corpus; since a word can sometimes take more than one possible tag, a set of hand-written rules is used to deal with this ambiguity.
Rule-based tagging goes through 2 stages:
1. Stage One – the tagger uses the dictionary that has been included to give each word its possible POS tags, which are used in the tagging process later.

2. Stage Two – a list of rules is applied to narrow these down and finally assign each word a single POS tag.

Rules for Rule-Based tagging

The taggers used are knowledge-driven.
The rules used to finalize the POS tags are written manually.
The linguistic information is coded directly in these rules.
There is only a limited number of rules.
Rule-based tagging was one of the earliest tagging approaches used in NLP.

1.13.6 Stochastic Tagging


There is another family of models used for tagging, known as stochastic POS tagging. Any model that brings in frequency or probability, i.e. statistics, to choose the tags can be classified as a stochastic tagger; the name comes from the model's heavy use of statistics to get the desired results.
Word Frequency Approach
In the word frequency approach, the tagger disambiguates words mainly on the basis of the probability that a word occurs with a particular tag: the tag assigned most often to a word in the training set is the one used in the main tagging process. Many taggers take different approaches, and this is one of the most common.
The N-gram approach is very similar, because it also assigns probabilities to the model; as explained above, it depends on the tags assigned to the previous words in the sequence.

Properties
This type of tagging is based on statistical inferences made in order to tag the words of the corpus.
A training set needs to be given to the model so it can learn from instances that have occurred in the past.
A probability is also assigned to words that do not appear in the training corpus.
The training and testing corpora are different from each other.
It simply uses the tag that appears most frequently for a word during training on the training corpus.
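A minimal sketch of the word-frequency approach (the tiny training set below is invented for illustration):

```python
from collections import Counter, defaultdict

training = [("the", "DT"), ("dog", "NN"), ("runs", "VBZ"),
            ("the", "DT"), ("run", "NN"), ("run", "VB")]

tag_counts = defaultdict(Counter)
for word, tag in training:
    tag_counts[word][tag] += 1

def most_frequent_tag(word, default="NN"):
    """Pick the tag seen most often with this word in the training set."""
    if word in tag_counts:
        return tag_counts[word].most_common(1)[0][0]
    return default                       # unseen words get a default tag

print([(w, most_frequent_tag(w)) for w in "the dog runs".split()])
# [('the', 'DT'), ('dog', 'NN'), ('runs', 'VBZ')]
```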

1.13.7 Transformation Based Tagging


This type of tagging is also known as 'Brill tagging'. It does not rely on probabilities or statistical inference alone to make tags; instead, a set of transformation rules is used by the algorithm to apply POS tags to the corpus. It essentially transforms the text from one tagging state into another through the intervention of these pre-defined rules.
The approach draws on the previous two models (rules and statistics) to arrive at the rules used in the tagging process. A fixed protocol is followed:
1. Start with a solution – the program begins with a familiar baseline solution based on past problems, and works in cycles.
2. Choose the most useful transformation – the most beneficial transformation is automatically adopted.
3. Apply it – the chosen transformation is then applied to re-tag the data, and the cycle repeats.

Key Takeaways

POS tagging labels each of the tokens formed after tokenization with its word class. It can be done with hand-written rules (rule-based tagging), with statistics (stochastic tagging), or with learned transformations (Brill tagging).

1.14 Issues in PoS Tagging – Hidden Markov and Maximum Entropy Models

1.14.1 Hidden Markov Model


A Hidden Markov Model (HMM) is essentially a stochastic tagging model in which the underlying process is hidden. The results are just as accurate, if not more so, but the process that produces them cannot be seen directly: only the output of each step of the sequence can be seen and comprehended.
To understand this a little better, we need the term Markov chain. This is a model that draws inferences about sequences from the probabilities of moving from one state to the next across the corpus. To make a prediction, a Markov chain relies only on the current position (state) of the tag or word in the string. The Hidden Markov Model lets the user work with both visible events (the input words) and hidden events (the POS tags).
Let us take an example to understand the HMM process better. Say we have to build an HMM for a problem where coins are tossed in a certain sequence. To get a better understanding we can break this down into 2 kinds of events:
1. Visible Events

These are the events that we know are happening and can physically or virtually picture them. In
this case, it is the tossing of coins. We are sure that coins are being tossed with an option of either
showing ‘heads’ or ‘tails’ as the outcome.

2. Hidden Events

These are the factors that are completely hidden from the user. In this case, they could be the number of coins used in the problem, or the order in which the coins are tossed if they are different or weighted. Many different models are consistent with the same observations; one form of the model is given below.
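A minimal sketch of how the coin-tossing HMM could be written down (all the probability values are made up): the hidden states are which coin is being tossed, and only the heads/tails outcomes are visible.

```python
import random

# Hidden states: which (possibly biased) coin is being tossed.
states = ["coin1", "coin2"]
transition = {"coin1": {"coin1": 0.7, "coin2": 0.3},   # P(next coin | current coin)
              "coin2": {"coin1": 0.4, "coin2": 0.6}}
emission = {"coin1": {"H": 0.5, "T": 0.5},             # P(observation | coin)
            "coin2": {"H": 0.9, "T": 0.1}}

def sample_sequence(length, start="coin1"):
    """Generate (hidden_state, observation) pairs; only the observations are 'visible'."""
    state, seq = start, []
    for _ in range(length):
        obs = random.choices(list(emission[state]), weights=emission[state].values())[0]
        seq.append((state, obs))
        state = random.choices(states, weights=[transition[state][s] for s in states])[0]
    return seq

tosses = sample_sequence(5)
print([obs for _, obs in tosses])   # the visible events (H/T outcomes)
print([st for st, _ in tosses])     # the hidden events (which coin was used)
```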


Fig. 2: The hidden events (states) of the model described above and the transitions between them.
In POS tagging, the hidden states are the tags: the tag sequence generates the outputs, and each hidden state (tag) emits an output (a word) that is observable to the user.
1.14.2 Maximum Entropy Markov Model (MEMM)

This model is used as an alternative to the Hidden Markov Model above. In a MEMM, the separate transition and emission probabilities are replaced by a single function that we can define directly: it gives the probability of the next state s given the previous state s′ and the current observation o, written P(s | s′, o).
The main difference to keep in mind is that an HMM's transitions depend only on the current state, whereas a MEMM conditions its inferences on the previous state and the current observation together. This can give better accuracy during the tagging process. The difference can also be seen in the diagram below.


Fig. Description: Comparison between the Hidden Markov Model and the Maximum Entropy
Markov Model
Disadvantages of MEMMs
1. MEMM models suffer from the label bias problem: once a state has been reached, the probability mass for the next observation is distributed only among the transitions that leave that state.

For example, words such as "rib" and "rob", presented to the model in a partly observed form like "r_b", follow slightly different paths from other words: one path goes through states 0-1 and another through 0-4 to get the probability mass. Because each state passes on all of its probability mass, both paths end up equally likely, and the model cannot settle on one fixed word to predict correctly.

HMM and MEMM models are both used for tagging; each has its pros and cons, and the choice depends on the model that needs to be adopted.

Perplexity
One of the methods used to make sure a model is evaluated correctly is 'perplexity'. Raw probabilities cannot be used as a metric for evaluating language models, so a variant of them, known as perplexity (PP), is used instead: the perplexity of an NLP model on a test set is the inverse probability the model assigns to that set, normalized by the number of words.
Let the test set be W = w1 w2 w3 ... wN. The probability is expanded with the chain rule, followed by the assumption that the model is a 2-gram (bigram) model as given above. The result is the perplexity, which gives us a clear basis for evaluating a model:

PP(W) = P(w1 w2 ... wN)^(−1/N) = ( ∏ P(wi | wi−1) )^(−1/N)


The final conclusion is that because perplexity is an inverse measure of probability, the higher the conditional probability of W, the lower the perplexity (PP).
We can also understand perplexity through a factor called the 'weighted average branching factor': the number of words that may follow any given word. Take as a simple example a small language consisting of the numbers 1 to 20, so the set is S = {1, 2, 3, 4, ..., 20} and each number occurs with probability P = 1/20. If we treat this as a language, the perplexity, being an inverse measure, works out as shown below.
We know, Perplexity = PP(S) = P(s1 s2 s3 ... sN)^(−1/N)
= [ (1/20)^N ]^(−1/N)
= (1/20)^(−1)
= 20
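A small sketch reproducing this calculation in code (computed in log space for numerical stability; the test sequence is any sequence of words from the uniform 20-word language):

```python
import math

def perplexity(word_probs):
    """PP(W) = (product of P(w_i))^(-1/N), computed via logs."""
    n = len(word_probs)
    log_prob = sum(math.log(p) for p in word_probs)
    return math.exp(-log_prob / n)

# Fifty words drawn from the uniform language, each with probability 1/20:
test_sequence = [1 / 20] * 50
print(perplexity(test_sequence))  # ~20, the weighted average branching factor
```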

Key Takeaways

A Hidden Markov Model (HMM) is a stochastic tagging model in which the tag sequence that produces the output is hidden.
It is based on 2 main factors, viz. visible events and hidden events.
