NLP Study Materials Updated

Introduction to Natural Language Processing

Natural Language Processing (NLP) is a subfield of linguistics (the scientific study of language) and Artificial Intelligence. It enables machines to understand and interpret human language.

NLP in our daily life:

You may have noticed that Google Search is integrated with a "Search by Voice" button, which enables users to search for anything just by speaking.

At the back end, Google receives your voice recording, processes your words (natural language), converts them into text, and then displays relevant results to you through text-matching algorithms.

Companies use sentiment analysis, an application of natural language processing (NLP), to identify the opinions and sentiments of their customers online. It helps companies understand what their customers think about their products and services.

Faster typing with NLP

When you type out a message or search query on Google, NLP allows you to type faster.

1. Predictive text – Provides suggestions for the next word in a sentence

2. Autocomplete – Provides suggestions for the remaining part of the word

Text messengers, search engines, websites, forms, etc. use NLP technology to speed up access to relevant information.

Precise writing with NLP

While writing emails or Word documents, composing blog posts, or using Google Docs, NLP allows users to write more precisely.

3. Grammar checkers – help users with punctuation, voice, articles, prepositions and other grammatical elements by providing suggestions in your language of choice.

4. Spell checkers – help users remove spelling errors, stylistically incorrect language, typos, etc., based on the language chosen.
For example, Grammarly utilizes both spell checking and grammar checking to help you make corrections for a more accurate output.

This is just an example; our daily life is filled with numerous applications of Natural Language Processing.

Why are NLP-based problems not really easy?

NLP-based problems usually involve unstructured data, and when the data is in unstructured form, data processing becomes difficult.

Unstructured data: Data which is not in a proper structure and which cannot be stored directly in the form of rows and columns.

Video, image, text and audio are examples of unstructured data.

Applications of Natural Language Processing:

 Auto-complete feature in emails and in search engines
 Voice recognition (machines are able to convert voice into text)
 Text to speech
 Chatbots
 Voice assistants (for e.g. Alexa, Siri)
 Sentiment analysis (for e.g. Positive, Negative, Neutral)
 Text summarization (for e.g. summarising the news into 50 words)
 Email spam detection (for e.g. Spam / Not Spam)
 Advertisements
 Extracting information from resumes (for e.g. Named Entity Recognition)
 And a lot more. . .

This list just goes on. As technologies advance, we are becoming more and more connected with the Internet, and we produce lots of data over it daily. Be it your social media, your product buying history, or your search history, companies are using this information to personalize your experience on their platforms.
Hence there is a growing demand in the field of NLP, where engineers and scientists work with unstructured data so as to make relevant business decisions.

Components of NLP

There are two components of NLP: one works on understanding the right meaning/semantics of the text, and the other helps in generating text as a response to the user.

a) Natural Language Understanding (NLU)

b) Natural Language Generation (NLG)

1) Natural Language Understanding: It is where the syntax and semantics are learnt by the machine. It is the step where machines understand the actual meaning and context of the sentence. But as language comes with ambiguity, there are a few problems, mentioned below, which occur while understanding the text:

a) Lexical Ambiguity: When a word has more than one meaning.

Example: I saw bats (the mammal bat or a wooden cricket bat).

b) Syntactic Ambiguity: Presence of two or more possible meanings within a single sentence.

Example: The chicken is ready to eat. (The chicken dish is now ready for you to eat, or the chicken itself is ready to eat something.)

c) Referential Ambiguity: When it is not clear which object you are referring to.

Example: John called Jay. Later, he laughed. (Here, who is "he" referring to, John or Jay?)

There are defined techniques which are used to remove these ambiguities from the text so that the right meaning is understood by the machine.

2) Natural Language Generation: It is where machines generate text from their knowledge base.

Example: Automatic essay writing, news writing, etc.

Natural Language Generation works in three phases:

a) Text planning: In this phase, useful content is fetched from the machine's knowledge base.

b) Sentence planning: In this phase, selection of words, forming meaningful phrases and setting the tone of the sentence takes place.

c) Text realization: This is the final phase, where the sentence plan is executed into the final sentence for delivery.

Text Cleaning Basics

So far we have learnt what NLP is, what its components are, and what challenges are faced during text processing. Now it is time to do a bit of coding; let us start cleaning the text corpus.

When the text corpus is given to us, it may have following issues:

 HTML tags
 Upper / Lower Case inconsistency
 Punctuations
 Stop words
 Words not in their root form
 And so on. . .

Before using the data for predictions, we need to clean it. Let us start working on these issues one by
one:

1) HTML Tags removal:

While scraping the text data from a website, you may get HTML tags included, so it is recommended that we
remove them.

Example:

“The market extended gains for the seventh consecutive session, climbing 1 percent to end at <b> record
</b> closing high on May 31. Reliance Industries <h2> continued to be a leader </h2> in the rally, followed
by <br> private banks & financials and FMCG stocks.”

To clean the above text, let us remove the content which is present between the angle brackets '<' and '>'. We need to write a regex (regular expression) for it.

import re
text_data = '''The market extended gains for the seventh consecutive session, climbing 1
percent to end at <b> record </b> closing high on May 31. Reliance Industries <h2>
continued to be a leader </h2> in the rally, followed by <br> private banks & financials
and FMCG stocks.'''
html_pattern = re.compile('<.*?>')
text_data = re.sub(html_pattern, '', text_data)
text_data

Output:

“The market extended gains for the seventh consecutive session, climbing 1 percent to end at record closing
high on May 31. Reliance Industries continued to be a leader in the rally, followed by private banks &
financials and FMCG stocks.”

Now you can see that the HTML tags have been replaced with empty strings.
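As a side note (not part of the original material), for heavily nested HTML it is often more robust to use an HTML parser instead of a regex. A minimal sketch using BeautifulSoup, assuming the beautifulsoup4 package is installed:

from bs4 import BeautifulSoup

# Parse the HTML and keep only the visible text
html_text = 'Reliance Industries <h2> continued to be a leader </h2> in the rally.'
clean_text = BeautifulSoup(html_text, 'html.parser').get_text()
print(clean_text)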
2) Upper and lower case inconsistency:

Let us remove this inconsistency and convert everything into lower case.

text_data = text_data.lower()
text_data

Output:

“the market extended gains for the seventh consecutive session, climbing 1 percent to end at record closing
high on may 31. reliance industries continued to be a leader in the rally, followed by private banks &
financials and fmcg stocks.”

3) Remove Punctuations:

Punctuations in the text do not make much sense hence we can remove them.

Example: % ^ &* , ) } etc

text_data = re.sub(r'[^\w\s]', '', text_data)
text_data

Output:

“the market extended gains for the seventh consecutive session climbing 1 percent to end at record
closing high on may 31 reliance industries continued to be a leader in the rally followed by private banks
financials and fmcg stocks”

4) Remove words having length less than or equal to 2:

Words that provide meaningful information often have word length more than 2.

text_data = ' '.join(word for word in text_data.split() if len(word) > 2)
text_data

Output:

'the market extended gains for the seventh consecutive session climbing percent end record closing high
may reliance industries continued leader the rally followed private banks financials and fmcg stocks'
Phases in Natural Language Processing

There are phases in NLP which need to be performed in order to extract meaningful information from the text corpus. Once these phases are completed, you are ready with your refined text, and then you can apply some machine learning model to predict something.

1) Lexical Analysis: In this phase, the text is broken down into paragraphs, sentences and words. Analysis is done for identification and description of the structure of words. It includes techniques such as:

 Stop word removal (removing 'and', 'of', 'the', etc. from text)
 Tokenization (breaking the text into sentences or words)
o Word tokenizer
o Sentence tokenizer
o Tweet tokenizer
 Stemming (removing 'ing', 'es', 's' from the tail of the words)
 Lemmatization (converting the words to their base forms)

2) Syntactic Analysis:

Parsing, syntax analysis, or syntactic analysis is the process of analyzing a string of symbols, either in natural language, computer languages, or data structures, conforming to the rules of a formal grammar.

Syntactic analysis is used to check grammar, arrangement of words, and the relationship between the words.

Grammatical rules are applied to categories and groups of words, not individual words. The syntactic analysis basically assigns a structure to the text.

Syntactic analysis is a very important part of NLP that helps in understanding the grammatical meaning of any sentence.

Example: This sentence does not make sense: "Truck is eating Oranges".
Hence there is a need to analyze the intent of the words in a sentence. Some of the techniques used in this phase are (see the sketch after this list):

 Dependency Parsing
 Parts of Speech (POS) tagging
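As an illustration of one of these techniques (an addition to these notes, not part of the original), NLTK ships an off-the-shelf POS tagger. A minimal sketch, assuming the required NLTK resources have been downloaded:

import nltk
from nltk.tokenize import word_tokenize

# One-time downloads (uncomment if the resources are not present; names may vary slightly by NLTK version):
# nltk.download('punkt')
# nltk.download('averaged_perceptron_tagger')

tokens = word_tokenize("Truck is eating Oranges")
print(nltk.pos_tag(tokens))
# e.g. [('Truck', 'NNP'), ('is', 'VBZ'), ('eating', 'VBG'), ('Oranges', 'NNS')]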

3) Semantic Analysis: Once the tagging and word dependencies are analyzed, semantic analysis extracts only meaningful information from the text and rejects/ignores the sentences that do not make sense.

Semantic Analysis attempts to understand the meaning of Natural Language.

Semantic Analysis of Natural Language captures the meaning of the given text while considering context, logical structuring of sentences, and grammar roles.

Example: "Truck is eating Oranges" will be ignored from the information summary.

4) Discourse Integration: Its scope is not only limited to a word or sentence; rather, discourse integration helps in studying the whole text.

Example: "John got ready at 9 AM. Later he took the train to California."

Here, the machine is able to understand that the word "he" in the second sentence is referring to "John".

5) Pragmatic Analysis: It is a complex phase where machines should have knowledge not only about the provided text but also about the real world. There can be multiple scenarios where the intent of a sentence can be misunderstood if the machine doesn't have real-world knowledge.

Example: "Thank you for coming so late, we have wrapped up the meeting" (contains mockery)

"Can you share your screen?" (here the context is about a computer's screen shared during a remote meeting)

Tokenization

The process of splitting text, phrases or sentences into smaller units is called tokenization.

Example:

 Splitting of a text into sentences (a sentence is considered a token)

 Splitting of a sentence into words (a word is considered a token)

We can import different types of tokenizers from the nltk library accordingly.

1) Sentence Tokenizer:

Text data will be split into sentences.

Code:

from nltk.tokenize import sent_tokenize
text_data = 'The market extended gains for the seventh consecutive session, climbing 1 percent to end at record closing high on May 31. Reliance Industries continued to be a leader in the rally, followed by private banks & financials and FMCG stocks.'
sent_tokenize(text_data)

Output:

['The market extended gains for the seventh consecutive session, climbing 1 percent to end at record closing high on May 31.',

 'Reliance Industries continued to be a leader in the rally, followed by private banks & financials and FMCG stocks.']

2) Word Tokenizer:

Text data will be split into words.

Code:

from nltk.tokenize import word_tokenize
text_data = 'The market extended gains for the seventh consecutive session, climbing 1 percent to end at record closing high on May 31. Reliance Industries continued to be a leader in the rally, followed by private banks & financials and FMCG stocks.'
word_tokenize(text_data)

Output:

['The', 'market', 'extended', 'gains', 'for', 'the', 'seventh', 'consecutive', 'session', ',', 'climbing', '1', 'percent', 'to', 'end', 'at', 'record', 'closing', 'high', 'on', 'May', '31', '.', 'Reliance', 'Industries', 'continued', 'to', 'be', 'a', 'leader', 'in', 'the', 'rally', ',', 'followed', 'by', 'private', 'banks', '&', 'financials', 'and', 'FMCG', 'stocks', '.']

3) Whitespace Tokenizer: Words are split based on whitespace. In the previous example, ',' and '.' are not part of a word, as they have their own usage and meaning, hence they are split off and considered as separate tokens on their own. But with the whitespace tokenizer, characters which occur together will remain together.

Code:

from nltk.tokenize import WhitespaceTokenizer
text_data = 'The market extended gains for the seventh consecutive session, climbing 1 percent to end at record closing high on May 31. Reliance Industries continued to be a leader in the rally, followed by private banks & financials and FMCG stocks.'
print(WhitespaceTokenizer().tokenize(text_data))

Output:

['The', 'market', 'extended', 'gains', 'for', 'the', 'seventh', 'consecutive', 'session,', 'climbing', '1', 'percent', 'to', 'end', 'at', 'record', 'closing', 'high', 'on', 'May', '31.', 'Reliance', 'Industries', 'continued', 'to', 'be', 'a', 'leader', 'in', 'the', 'rally,', 'followed', 'by', 'private', 'banks', '&', 'financials', 'and', 'FMCG', 'stocks.']

Stopword Removal

There are words in our sentences which do not provide any relevant information and hence they can be removed from the text.

Example: and, of, at, it, the, etc.

There are multiple NLP libraries which operate on text and provide functionality to remove stop words. Some of the famous libraries that provide support for stop word removal:

 NLTK
 Spacy
 Gensim

Install NLTK via the command below: pip install nltk

Code:

import nltk
# nltk.download('stopwords')  # one-time download, if the corpus is not already present
stopwords = nltk.corpus.stopwords.words('english')

Now you can check the stop word list by simply printing it:

print(stopwords)

This will print a list of all the stop words.

len(stopwords)  => 179

Now, let us import the word tokenizer, which will split our text corpus into words. Later on, these words will be checked against the stop word list; if a word is part of the list, we ignore it.

from nltk.tokenize import word_tokenize
tokenized_text = word_tokenize(text_data)
print(tokenized_text)

Output:
['the', 'market', 'extended', 'gains', 'for', 'the', 'seventh', 'consecutive', 'session', 'climbing', 'percent', 'end', 'record', 'closing', 'high', 'may', 'reliance', 'industries', 'continued', 'leader', 'the', 'rally', 'followed', 'private', 'banks', 'financials', 'and', 'fmcg', 'stocks']

At this point, let us check each word and remove the stop words.

removed_stop_words_list = [word for word in tokenized_text if word not in stopwords]
print(removed_stop_words_list)

Output:
['market', 'extended', 'gains', 'seventh', 'consecutive', 'session', 'climbing', 'percent', 'end', 'record', 'closing', 'high', 'may', 'reliance', 'industries', 'continued', 'leader', 'rally', 'followed', 'private', 'banks', 'financials', 'fmcg', 'stocks']

It is clearly visible that 'the', 'for' and 'and' have been removed from the text.

Important point: Let's say there is some word which does not make sense in your domain and you want to remove it too. You can enhance your stop word list by adding this word to it, and later you can apply the same removal step.

Example: 'fmcg' is a very common word in your domain, so you want to remove it.

stopwords.append('fmcg')
len(stopwords)  => 180
removed_stop_words_list = [word for word in tokenized_text if word not in stopwords]
print(removed_stop_words_list)

Output:

['market', 'extended', 'gains', 'seventh', 'consecutive', 'session', 'climbing', 'percent', 'end', 'record', 'closing', 'high', 'may', 'reliance', 'industries', 'continued', 'leader', 'rally', 'followed', 'private', 'banks', 'financials', 'stocks']

At this point, we have cleaned data; however, there are some words which are not in their root form, and this problem can affect the model's accuracy. Hence it is recommended to convert the words to their base forms. We are going to learn these techniques going forward.
Stay tuned!!

Stemming & Lemmatization

When we work on a text document, removal of punctuation and stop words is just not enough; there is still something more which needs our attention.

The words that we use in sentences can take any form. Words can be used in present tense, or past, or maybe future tense, and accordingly the word will change.

For e.g. - The word 'Go' is 'Go' / 'Goes' in present tense and 'Went' in past tense

- The word 'See' is 'See' / 'Sees' in present tense, whereas it is 'Saw' in past tense

These inconsistencies in data can affect the model training and predictions; hence, we need to make sure that the words exist in their root forms.

To handle this, there are two methods:
1) Stemming:

Stemming is the process of converting/reducing inflected words to their root form. In this method, the suffixes are removed from the inflected word so that it becomes the root.

For e.g. from the word "Going", the "ing" suffix will get removed and the inflected word "Going" will become "Go", which is the root form.

Few more examples:

Developing -> Develop
Developed -> Develop
Development -> Develop
Develops -> Develop

All these inflected words take their root form when their suffixes are removed. Internally the stemming process uses some rules for trimming the suffix part.

We can implement stemming in Python using the famous library "nltk". If you don't have nltk installed on your machine, you can simply type:

pip install nltk

This will install nltk on your system and you should be able to import it.

Python Code:

from nltk.stem import PorterStemmer

porter = PorterStemmer()
print(porter.stem("developing"))
print(porter.stem("develops"))
print(porter.stem("development"))
print(porter.stem("developed"))

Output:

develop
develop
develop
develop
However, there are some words which do not get properly handled by the stemming process.

For e.g. "went", "flew", "saw" - these words can't be converted properly to their base forms if stemming is applied.

Code:

print(porter.stem("went"))
print(porter.stem("flew"))
print(porter.stem("saw"))

Output:
went
flew
saw

Surprisingly, there is no change in the output, because the stemming process is not smart enough. It just knows how to trim the suffix part, but it does not know how to change the form of the word. To solve this issue, there should be some algorithm which understands the linguistic meaning of the sentence and converts each word to its base form accordingly.

Fortunately we have Lemmatization for this work.

A good part about stemming is that it is not limited to English; it is useful for other languages also.
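As a small illustration of the multi-language point (an addition to these notes), NLTK also provides SnowballStemmer, which supports several languages; a minimal sketch:

from nltk.stem import SnowballStemmer

# Languages supported by the Snowball stemmer shipped with NLTK
print(SnowballStemmer.languages)

# Same idea as PorterStemmer, but with an explicit language choice
stemmer = SnowballStemmer("english")
print(stemmer.stem("developing"))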

Pros: Computationally fast, as it simply trims the suffix without worrying about the context of the word.

Cons: It is not useful enough if you are concerned about getting valid words. A stemmer can give you some words which do not have any meaning.

"Goes" -> "goe"

2) Lemmatization:

It is where the words are converted to their root forms by understanding the context of the word in the sentence.

In Stemming, the root word which we get after conversion is called a stem, whereas it is called a lemma in Lemmatization.

Pros: The root word which we get after conversion holds some meaning and the word belongs to the dictionary.

Cons: It is computationally expensive.

NLTK provides a class called WordNetLemmatizer for this purpose.

Code:

from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()

As mentioned in this tutorial, you might have installed the nltk library on your system, but to work with WordNetLemmatizer, you need to download the WordNet package explicitly.

Code:

import nltk
nltk.download()

This will launch a window; you need to scroll down, select "wordnet" from the list and click on Download. (Alternatively, nltk.download('wordnet') downloads it directly.)
Once downloaded successfully, you should be able to use WordNetLemmatizer.

Code:

from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()
print(wordnet_lemmatizer.lemmatize("going"))
print(wordnet_lemmatizer.lemmatize("goes"))  # Lemmatizer is able to convert it to "go"
print(wordnet_lemmatizer.lemmatize("went"))
print(porter.stem("goes"))  # Stemming is unable to normalize the word "goes" properly

Output:

going
go
went
goe

But you might be wondering why the Lemmatizer is unable to normalize the words "going" and "went" into their root forms.
It is because we have not passed the context to it. Part of speech ("pos") is the parameter which we need to specify. By default it is NOUN. If a word we want to normalize is a verb, then we need to specify the value "v".

Code:

from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()
print(wordnet_lemmatizer.lemmatize("going", pos="v"))
print(wordnet_lemmatizer.lemmatize("goes", pos="v"))
print(wordnet_lemmatizer.lemmatize("went", pos="v"))
print(wordnet_lemmatizer.lemmatize("go", pos="v"))
print(wordnet_lemmatizer.lemmatize("studies", pos="v"))
print(wordnet_lemmatizer.lemmatize("studying", pos="v"))
print(wordnet_lemmatizer.lemmatize("studied", pos="v"))
print(wordnet_lemmatizer.lemmatize("dogs"))            # by default, pos is noun
print(wordnet_lemmatizer.lemmatize("dogs", pos="n"))

Output:

go
go
go
go
study
study
study
dog
dog
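In practice you rarely know the part of speech by hand. A common pattern (an addition to these notes, not part of the original) is to run nltk.pos_tag first and map its Penn Treebank tags to WordNet POS codes before lemmatizing; a minimal sketch, assuming the required NLTK resources are downloaded:

import nltk
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

def wordnet_pos(treebank_tag):
    # Map Penn Treebank tags (from nltk.pos_tag) to WordNet POS codes
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    return wordnet.NOUN  # default, same as the lemmatizer's default

lemmatizer = WordNetLemmatizer()
tokens = nltk.word_tokenize("he went home and studied his notes")
tagged = nltk.pos_tag(tokens)
print([lemmatizer.lemmatize(word, wordnet_pos(tag)) for word, tag in tagged])
# e.g. ['he', 'go', 'home', 'and', 'study', 'his', 'note']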

Basic Extractions

From the text corpus, we can extract useful information and create variables out of it.

Example: "The quick brown fox jumps over the lazy dog"

From the above sentence, we can extract a few meaningful pieces of information like:

 How many words are present?
 How many characters are present in the sentence?
 What is the average length of each word?
 How many lowercase words are present?
 How many uppercase words are present?
 What is the length of the longest/shortest word?
 How many stop words are present?

For stop words, we will import nltk; for the other set of variables, there is no need.

Code:

import pandas as pd
import nltk

stopwords = nltk.corpus.stopwords.words('english')  # importing stop words list

string_lst = ['THE quick BROWN fox jumps over the lazy dog',
              'I am TOO LAZY to do ANYTHING',
              'Padhai Time is there to help you out']
df = pd.DataFrame(string_lst, columns=['msg'])

def derive_features(message):
    words_lst = message.split()

    num_characters = len(message)
    num_words = len(words_lst)

    words_length = []
    lower_words_lst = []
    upper_words_lst = []
    is_stop_word_lst = []
    for word in words_lst:
        words_length.append(len(word))
        lower_words_lst.append(word.islower())
        upper_words_lst.append(word.isupper())
        is_stop_word_lst.append(word.lower() in stopwords)

    stop_words_count = sum(is_stop_word_lst)
    avg_word_length = round(sum(words_length) / len(words_length), 1)
    max_length_word = max(words_length)
    min_length_word = min(words_length)
    total_lower_words = sum(lower_words_lst)
    total_upper_words = sum(upper_words_lst)

    return num_characters, num_words, avg_word_length, total_lower_words, total_upper_words, max_length_word, min_length_word, stop_words_count

df['num_chars'], df['num_words'], df['avg_word_len'], df['num_lower_words'], df['num_upper_words'], df['max_length_word'], df['min_length_word'], df['num_stop_words'] = zip(*df['msg'].apply(lambda r: derive_features(r)))

Bag of Words

Bag of Words is a technique to extract features from the provided text data. This technique counts the occurrence of words in the sentence. If the word is found in the sentence, then its occurrence value increases by 1, else its occurrence value remains 0.

This technique is simple and easy to implement, but it comes with its own limitations as well. We will discuss them at the end of this article.

How to use the Bag of Words technique?

When we have the text information available with us, it is a prerequisite that we clean our text first.

All the steps discussed in the previous chapter will be applied and the text is cleaned. Once done, we will use the Bag of Words technique to extract features out of it.

Code:

import pandas as pd
import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer

stopwords = nltk.corpus.stopwords.words('english')  # importing stop words list

string_lst = ['<THE> quick BROWN fox jumping over the lazy dog.',
              'I am TOO LAZY, to do ANYTHING. Please help',
              'Padhai Time is there to help you out in anything quick']

df = pd.DataFrame(string_lst, columns=['msg'])

def clean_data(text_data):
    print("Original: {}\n".format(text_data))

    # Cleaning html tags
    html_pattern = re.compile('<.*?>')
    text_data = re.sub(html_pattern, '', text_data)
    print("Step1: {}\n".format(text_data))

    # Handling case inconsistencies
    text_data = text_data.lower()
    print("Step2: {}\n".format(text_data))

    # Removing punctuations
    text_data = re.sub(r'[^\w\s]', '', text_data)
    print("Step3: {}\n".format(text_data))

    # Removing <= 2 letter words
    text_data = ' '.join(word for word in text_data.split() if len(word) > 2)
    print("Step4: {}\n".format(text_data))

    # Removing stop words
    tokenized_text = word_tokenize(text_data)
    text_data = " ".join(word for word in tokenized_text if word not in stopwords)
    print("Step5: {}\n".format(text_data))

    # Stemming
    porter = PorterStemmer()
    text_data = " ".join(porter.stem(word) for word in word_tokenize(text_data))
    print("Step6: {}\n".format(text_data))

    return text_data

clean_data(df['msg'].iloc[0])

Output:
Original: <THE> quick BROWN fox jumping over the lazy dog.

Step1: quick BROWN fox jumping over the lazy dog.

Step2: quick brown fox jumping over the lazy dog.

Step3: quick brown fox jumping over the lazy dog

Step4: quick brown fox jumping over the lazy dog

Step5: quick brown fox jumping lazy dog

Step6: quick brown fox jump lazi dog

Now, let us use the clean_data() function to clean all the rows and add one new column to the dataframe itself with the name "cleaned_msg".

df['cleaned_msg'] = df['msg'].apply(clean_data)

We have raw messages and cleaned messages available with us. Let us first use the 'msg' column and get the count for each word.

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(df['msg'])
result = pd.DataFrame(X.toarray())
result.columns = vectorizer.get_feature_names()  # use get_feature_names_out() in newer scikit-learn versions
result

Do you see any problem in this?
If we do not use cleaned data for feature engineering, then we end up with too many columns in the final dataframe. You can see that, in just 3 sentences, there are 22 distinct words. If we have a dataset of 10,000 rows, then this word count will be huge and the CountVectorizer will return a very large number of columns, which is not good practice.
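As a side note (not in the original notes), CountVectorizer itself also exposes parameters that cap the vocabulary size, which can be combined with text cleaning. A hedged sketch, reusing the df and 'cleaned_msg' column created above; the values are illustrative, not recommendations:

from sklearn.feature_extraction.text import CountVectorizer

# Keep only the 1000 most frequent terms and ignore terms that
# appear in fewer than 2 documents.
vectorizer = CountVectorizer(max_features=1000, min_df=2)
X = vectorizer.fit_transform(df['cleaned_msg'])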

So this time, we will use the cleaned data for feature creation.

vectorizer_clean = CountVectorizer()
X_clean = vectorizer_clean.fit_transform(df['cleaned_msg'])
print(vectorizer_clean.vocabulary_)
Output:

{'quick': 9, 'brown': 1, 'fox': 3, 'jump': 5, 'lazi': 6, 'dog': 2, 'anyth': 0, 'pleas': 8, 'help': 4, 'padhai': 7, 'time': 10}

Now we have got only 11 words, which will become our features in the final dataframe.

result_clean = pd.DataFrame(X_clean.toarray())
result_clean.columns = vectorizer_clean.get_feature_names()
result_clean
Limitations of the Bag of Words approach:

1) CountVectorizer does not understand the meaning of the word.

Take the example: - I go to sleep at 10 PM and go to walk at 7 AM

- I go to walk at 10 PM and go to sleep at 7 AM

After feature engineering, both of these sentences will result in the same column values.

2) Through the BOW approach, there exist many zeros in the data, and the resulting matrix is called a sparse matrix.

Term Frequency - Inverse Document Frequency

We have read so far about Bag of Words, which only focuses on the frequency of a word in a sentence. Consider the scenarios below where Bag of Words will not be a good approach to use:

1) Suppose we don't want to remove stop words from the text corpus. In this case, the frequency of "is", "the", "a" will be very high. But actually these words do not add any meaning to the sentence for a model to learn anything.

2) Suppose we are processing product reviews on Amazon / Flipkart; then terms like "product" and "item" are domain dependent and are used too often in each review. Hence these words will not help the model in learning anything.

3) The "mobile" keyword in a phone data set is not giving any value. Keywords like "5GB", "Splash Proof", "Android OS" will make more sense.

Hence there is a technique called TF-IDF, where the most frequent words are suppressed (given lower importance) and unique (less frequent) words are given a higher weightage in the sentence.

Let us look at the two sentences below:

- I do not like Vanilla Cake

- I do not like Vanilla Icecream

In both of the above sentences, the unique, or I would say the important, keywords to notice are "Cake" and "Icecream". The remaining terms like "I", "do", "not", "like", "Vanilla" are repeated in both sentences, hence they do not provide any useful info to the model.

Terminology:

Corpus: The entire text data given to us. A corpus can have many documents.

Document: A single sentence inside a text.

Term: A single word inside a document/sentence.

TF: Term Frequency

IDF: Inverse Document Frequency

Formulas (the exact variant differs slightly between libraries):

TF(t, d) = (number of times term t appears in document d) / (total number of terms in document d)

IDF(t) = log(total number of documents / number of documents containing term t)

Let us take the same example and calculate Term Frequency and IDF:

Document A: I do not like Vanilla Cake
Document B: I do not like Vanilla Icecream

No. of words in Document A: 6
No. of words in Document B: 6
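To make the calculation concrete, here is a small worked sketch (an addition to the original notes) that applies the plain TF and IDF definitions above to these two documents; note that scikit-learn's TfidfVectorizer uses a slightly different, smoothed formula:

import math

docs = [
    "i do not like vanilla cake".split(),
    "i do not like vanilla icecream".split(),
]
n_docs = len(docs)

# Term frequency of a word within one document
def tf(word, doc):
    return doc.count(word) / len(doc)

# Inverse document frequency across the corpus
def idf(word):
    doc_freq = sum(1 for doc in docs if word in doc)
    return math.log(n_docs / doc_freq)

for word in ["like", "cake", "icecream"]:
    print(word, "tf(doc A):", round(tf(word, docs[0]), 3), "idf:", round(idf(word), 3))

# like      tf(doc A): 0.167  idf: 0.0    -> appears in both documents, weight suppressed
# cake      tf(doc A): 0.167  idf: 0.693  -> appears in only one document, higher weight
# icecream  tf(doc A): 0.0    idf: 0.693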


Itisclearfromtheaboveapproachthatlessfrequentwordslike„cake‟and„icecream‟getmoreweightth
an morefrequent words.

We can achieve the same task by importing TfidfVectorizer from the sklearn library.

A thing to note is that every library which calculates Tf-idf may have a different formula for it. Also, there are certain parameters which you can set for smoothing the results. So, when you see a different Tf-idf value from sklearn, do not get confused; at least you got the basic idea behind this approach.

Some of the techniques add 1 in the denominator while calculating the IDF values, etc.

Code:

import pandas as pd
import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

stopwords = nltk.corpus.stopwords.words('english')  # importing stop words list

string_lst = ['I do not like Vanilla Cake',
              'I do not like Vanilla Icecream']

df = pd.DataFrame(string_lst, columns=['msg'])

def clean_data(text_data):
    print("Original: {}\n".format(text_data))

    # Cleaning html tags
    html_pattern = re.compile('<.*?>')
    text_data = re.sub(html_pattern, '', text_data)
    print("Step1: {}\n".format(text_data))

    # Handling case inconsistencies
    text_data = text_data.lower()
    print("Step2: {}\n".format(text_data))

    # Removing punctuations
    text_data = re.sub(r'[^\w\s]', '', text_data)
    print("Step3: {}\n".format(text_data))

    # Removing <= 2 letter words
    text_data = ' '.join(word for word in text_data.split() if len(word) > 2)
    print("Step4: {}\n".format(text_data))

    # Removing stop words
    tokenized_text = word_tokenize(text_data)
    text_data = " ".join(word for word in tokenized_text if word not in stopwords)
    print("Step5: {}\n".format(text_data))

    # Stemming
    porter = PorterStemmer()
    text_data = " ".join(porter.stem(word) for word in word_tokenize(text_data))
    print("Step6: {}\n".format(text_data))

    return text_data

df['cleaned_msg'] = df['msg'].apply(clean_data)

vectorizer_clean = TfidfVectorizer(smooth_idf=True)
X_clean = vectorizer_clean.fit_transform(df['cleaned_msg'])
print(vectorizer_clean.vocabulary_)
Output:

{'like': 2, 'vanilla': 3, 'cake': 0, 'icecream': 1}

terms = vectorizer_clean.get_feature_names()
terms

Output:

['cake', 'icecream', 'like', 'vanilla']

idf_values = vectorizer_clean.idf_
print("IDF Values: \n", {terms[i]: idf_values[i] for i in range(len(terms))})

Output:

IDF Values:
{'cake': 1.4054651081081644, 'icecream': 1.4054651081081644, 'like': 1.0, 'vanilla': 1.0}
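As a brief check (added here, not in the original): with smooth_idf=True, scikit-learn computes idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing t. With n = 2, 'cake' appears in 1 document, so idf = ln(3/2) + 1 ≈ 1.405, while 'like' appears in both documents, so idf = ln(3/3) + 1 = 1.0, matching the values above.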

result_clean = pd.DataFrame(X_clean.toarray())
result_clean.columns = vectorizer_clean.get_feature_names()
result_clean

Now you would have got some sense of why Tf-idf vectorization is better than Bag of Words.

Natural Language Processing – Overview

 Natural language processing (NLP) deals with the interactions between computers and human language, and how to program computers to process and analyse large amounts of natural language data.
 The technology can accurately extract information and insights contained in documents, as well as categorize and organize the documents themselves.
 NLP makes computers capable of "understanding" the contents of documents, including the contextual nuances of the language within them.
 Most higher-level NLP applications involve aspects that emulate intelligent behavior and apparent comprehension of natural language.
 Many different classes of machine-learning algorithms have been applied to natural-language-processing tasks.
 These algorithms take as input a large set of "features" that are generated from the input data.

Natural Language Processing – Segmentation Analysis

Natural Language Processing – Advantages

1. Better data analysis
2. Streamlined processes
3. Cost-effective
4. Empowered employees
5. Enhanced customer experience

Natural Language Processing – 5 Phases

Phase 1 – Lexical Analysis

 Lexical analysis is the process of converting a sequence of characters into a sequence of tokens.
 A lexer is generally combined with a parser, which together analyze the syntax of programming languages, web pages, and so forth.
 Lexers and parsers are most often used for compilers but can be used for other computer language tools, such as pretty printers or linters.
 Lexical analysis is also an important analysis during the early stage of natural language processing, where text or sound waves are segmented into words and other units.

Phase 2 – Syntactic Analysis

 Parsing, syntax analysis, or syntactic analysis is the process of analyzing a string of symbols, either in natural language, computer languages, or data structures, conforming to the rules of a formal grammar.
 It is used in the analysis of computer languages, referring to the syntactic analysis of the input code into its component parts to facilitate the writing of compilers and interpreters.
 Grammatical rules are applied to categories and groups of words, not individual words. The syntactic analysis basically assigns a semantic structure to text.
 Syntactic analysis is a very important part of NLP that helps in understanding the grammatical meaning of any sentence.

Natural Language Processing – Market Size

The Natural Language Processing market was valued at USD 11.02 Billion in 2020 and is projected to reach USD 45.79 Billion by 2028, growing at a CAGR of 19.49% from 2021 to 2028.
Phase 3 – Semantic Analysis

 Semantic Analysis attempts to understand the meaning of Natural Language.
 Semantic Analysis of Natural Language captures the meaning of the given text while considering context, logical structuring of sentences, and grammar roles.
 The 2 parts of Semantic Analysis are (a) Lexical Semantic Analysis and (b) Compositional Semantics Analysis.
 Semantic analysis can begin with the relationship between individual words.

Phase 4 – Discourse Analysis

 Researchers use discourse analysis to uncover the motivation behind a text.
 It is useful for studying the underlying meaning of a spoken or written text as it considers the social and historical contexts.
 Discourse analysis is a process of performing text or language analysis, involving text interpretation and understanding the social interactions.

Phase 5 – Pragmatic Analysis

 Pragmatic Analysis is part of the process of extracting information from text.
 It focuses on taking a structured set of text and figuring out the actual meaning of the text.
 It also focuses on the meaning of the words at the time and in context.
 Effects on interpretation can be measured using pragmatic analysis by understanding the communicative and social content.
NLP - Word Sense Disambiguation
We understand that words have different meanings based on the context of their usage in the sentence. If we talk about human languages, then they are ambiguous too, because many words can be interpreted in multiple ways depending upon the context of their occurrence.
Word sense disambiguation, in natural language processing (NLP), may be defined as the ability to determine which meaning of a word is activated by the use of the word in a particular context. Lexical ambiguity, syntactic or semantic, is one of the very first problems that any NLP system faces. Part-of-speech (POS) taggers with a high level of accuracy can solve a word's syntactic ambiguity. On the other hand, the problem of resolving semantic ambiguity is called WSD (word sense disambiguation). Resolving semantic ambiguity is harder than resolving syntactic ambiguity.
For example, consider the two examples of the distinct senses that exist for the word "bass" −
 I can hear bass sound.
 He likes to eat grilled bass.
The occurrence of the word bass clearly denotes the distinct meaning. In the first sentence, it means frequency and in the second, it means fish. Hence, if it were disambiguated by WSD, then the correct meaning of the above sentences can be assigned as follows −
 I can hear bass/frequency sound.
 He likes to eat grilled bass/fish.

Evaluation of WSD
The evaluation of WSD requires the following two inputs −
A Dictionary
The very first input for evaluation of WSD is a dictionary, which is used to specify the senses to be disambiguated.
Test Corpus
Another input required by WSD is an annotated test corpus that has the target or correct senses. The test corpora can be of two types −
 Lexical sample − This kind of corpus is used in systems where it is required to disambiguate a small sample of words.
 All-words − This kind of corpus is used in systems where it is expected to disambiguate all the words in a piece of running text.
Approaches and Methods to Word Sense Disambiguation (WSD)
Approaches and methods to WSD are classified according to the source of knowledge used in word disambiguation.
Let us now see the four conventional methods for WSD −
Dictionary-based or Knowledge-based Methods
As the name suggests, for disambiguation, these methods primarily rely on dictionaries, thesauruses and lexical knowledge bases. They do not use corpus evidence for disambiguation. The Lesk method is the seminal dictionary-based method introduced by Michael Lesk in 1986. The Lesk definition, on which the Lesk algorithm is based, is "measure overlap between sense definitions for all words in context". However, in 2000, Kilgarriff and Rosensweig gave the simplified Lesk definition as "measure overlap between sense definitions of word and current context", which further means identifying the correct sense for one word at a time. Here the current context is the set of words in the surrounding sentence or paragraph.
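NLTK ships a simplified Lesk implementation, which can illustrate this method on the "bass" example from earlier (an added sketch, not part of the original text, assuming the 'wordnet' and 'punkt' resources have been downloaded):

from nltk.tokenize import word_tokenize
from nltk.wsd import lesk

sent1 = word_tokenize("I can hear bass sound")
sent2 = word_tokenize("He likes to eat grilled bass")

# lesk() returns the WordNet Synset whose definition overlaps most with the context
print(lesk(sent1, 'bass'), lesk(sent1, 'bass').definition())
print(lesk(sent2, 'bass'), lesk(sent2, 'bass').definition())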
Supervised Methods
For disambiguation, machine learning methods make use of sense-annotated corpora for training. These methods assume that the context can provide enough evidence on its own to disambiguate the sense. In these methods, knowledge and reasoning are deemed unnecessary. The context is represented as a set of "features" of the words, including information about the surrounding words. Support vector machines and memory-based learning are the most successful supervised learning approaches to WSD. These methods rely on a substantial amount of manually sense-tagged corpora, which is very expensive to create.
Semi-supervised Methods
Due to the lack of training corpora, most word sense disambiguation algorithms use semi-supervised learning methods, because semi-supervised methods can use both labelled and unlabelled data. These methods require a very small amount of annotated text and a large amount of plain unannotated text. The technique used by semi-supervised methods is bootstrapping from seed data.
Unsupervised Methods
These methods assume that similar senses occur in similar contexts. That is why the senses can be induced from text by clustering word occurrences using some measure of similarity of the context. This task is called word sense induction or discrimination. Unsupervised methods have great potential to overcome the knowledge acquisition bottleneck due to their non-dependency on manual efforts.
Applications of Word Sense Disambiguation (WSD)
Word sense disambiguation (WSD) is applied in almost every application of language technology. Let us now see the scope of WSD −
Machine Translation
Machine translation or MT is the most obvious application of WSD. In MT, lexical choice for the words that have distinct translations for different senses is done by WSD. The senses in MT are represented as words in the target language. Most machine translation systems do not use an explicit WSD module.
Information Retrieval (IR)
Information retrieval (IR) may be defined as a software program that deals with the organization, storage, retrieval and evaluation of information from document repositories, particularly textual information. The system basically assists users in finding the information they require, but it does not explicitly return the answers to their questions. WSD is used to resolve the ambiguities of the queries provided to an IR system. As with MT, current IR systems do not explicitly use a WSD module; they rely on the concept that the user will type enough context in the query to retrieve only relevant documents.
Text Mining and Information Extraction (IE)
In most applications, WSD is necessary to do accurate analysis of text. For example, WSD helps an intelligent gathering system to flag the correct words; a medical intelligence system might need flagging of "illegal drugs" rather than "medical drugs".
Lexicography
WSD and lexicography can work together in a loop, because modern lexicography is corpus-based. With lexicography, WSD provides rough empirical sense groupings as well as statistically significant contextual indicators of sense.
Difficulties in Word Sense Disambiguation (WSD)
Following are some difficulties faced by word sense disambiguation (WSD) −
Differences between dictionaries
The major problem of WSD is deciding the sense of the word, because different senses can be very closely related. Even different dictionaries and thesauruses can provide different divisions of words into senses.
Different algorithms for different applications
Another problem of WSD is that a completely different algorithm might be needed for different applications. For example, in machine translation it takes the form of target word selection, and in information retrieval a sense inventory is not required.
Inter-judge variance
Another problem of WSD is that WSD systems are generally tested by having their results on a task compared against the results of human beings. This is called the problem of inter-judge variance.
Word-sense discreteness
Another difficulty in WSD is that words cannot easily be divided into discrete submeanings.
Part-of-speech (POS) tagging is a popular Natural Language Processing process which refers to categorizing words in a text (corpus) in correspondence with a particular part of speech, depending on the definition of the word and its context.

Figure 1: Example of POS tagging (Image by Author)

In Figure 1, we can see each word has its own lexical term written underneath; however, having to constantly write out these full terms when we perform text analysis can very quickly become cumbersome, especially as the size of the corpus grows. Hence, we use a short representation referred to as "tags" to represent the categories.

As earlier mentioned, the process of assigning a specific tag to a word in our corpus is referred to as part-of-speech tagging (POS tagging for short), since the POS tags are used to describe the lexical terms that we have within our text.

Figure 2: Grid displaying different types of lexical terms, their tags, and random examples (Image By Author)
Part-of-speech tags describe the characteristic structure of lexical terms within a sentence or text; therefore, we can use them for making assumptions about semantics. Other applications of POS tagging include:

 Named Entity Recognition

 Co-reference Resolution

 Speech Recognition

When we perform POS tagging, it's often the case that our tagger will encounter words that were not within the vocabulary that was used. Consequently, augmenting your dataset to include unknown word tokens will aid the tagger in selecting appropriate tags for those words.

Taking the example text we used in Figure 1, "Why not tell someone?", imagine the sentence is truncated to "Why not tell ..." and we want to determine whether the following word in the sentence is a noun, verb, adverb, or some other part of speech.

Now, if you are familiar with English, you'd instantly identify the verb and assume that it is more likely the word is followed by a noun rather than another verb. Therefore, the idea, as shown in this example, is that the POS tag that is assigned to the next word is dependent on the POS tag of the previous word.

Figure 3: Representing likelihoods visually (Image by Author)

By associating numbers with each arrow direction, which imply the likelihood of the next word given the current word, we can say that, if we are currently on a verb, there is a higher likelihood the next word in our sentence would be a noun than another verb. The image in Figure 3 is a great example of how a Markov Model works on a very small scale.

Given this example, we may now describe Markov models as "a stochastic model used to model randomly changing systems. It is assumed that future states depend only on the current state, not on the events that occurred before it (that is, it assumes the Markov property)."

We can depict a Markov chain as a directed graph:

Figure 4: Depiction of Markov Model as Graph (Image By Author) - a replica of the image used in the NLP Specialization Coursera Course 2, Week 2.

The lines with arrows are an indication of the direction, hence the name "directed graph", and the circles may be regarded as the states of the model; a state is simply the condition of the present moment.

We could use this Markov model to perform POS tagging. Considering we view a sentence as a sequence of words, we can represent the sequence as a graph where we use the POS tags as the events that occur, which would be illustrated by the states of our model graph.

For example, q1 in Figure 4 would become NN indicating a noun, q2 would be VB which is short for verb, and q3 would be O signifying all other tags that are not NN or VB. Like in Figure 3, the directed lines would be given a transition probability that defines the probability of going from one state to the next.
Figure 5: Example of Markov Model to perform POS tagging. (Image by Author)

A more compact way to store the transition and state probabilities is using a table, better known as a "transition matrix".

Figure 6: Transition Matrix (Image by Author)

Notice this model only tells us the transition probability of one state to the next when we know the previous word. Hence, this model does not show us what to do when there is no previous word. To handle this case, we add what is known as the "initial state".

Figure 7: Adding an initial state to deal with the beginning of the matrix (Image by Author)

You may now be wondering, how did we populate the transition matrix? Great question. I will use 3 sentences for our corpus: "<s> in a station of the metro", "<s> the apparition of these faces in the crowd", and "<s> petals on a wet, black bough." (Note these are the same sentences used in the course.) Next, we will break down how to populate the matrix into steps:

1. Count occurrences of tag pairs in the training dataset

Figure 8: Counting the occurrences of the tag pairs (Image by Author)

At the end of step one, our table would look something like this:

Figure 9: Applying step one with our corpus. (Image by Author)

2. Calculate the probabilities using the counts

Figure 10: Calculate probabilities using the counts (Image by Author)

Applying the formula in Figure 10 to the table in Figure 9, our new table would look as follows:

Figure 11: Probabilities populating the transition matrix. (Image by Author)

You may notice that there are many 0's in our transition matrix, which would result in our model being incapable of generalizing to other text that may contain verbs. To overcome this problem, we add smoothing.

Adding smoothing requires that we slightly adjust the formula from Figure 10 by adding a small value, epsilon, to each of the counts in the numerator, and adding N * epsilon to the denominator, such that each row still sums to 1.

Figure 12: Calculating the probabilities with smoothing (Image by Author)

Figure 13: New probabilities with smoothing added. N is the length of the corpus and epsilon is some very small number. (Image by Author)

Note: In a real-world example, you would avoid applying smoothing to the initial probabilities (the first row), as this would allow a sentence to possibly start with any POS tag.
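The counting and smoothing steps above translate fairly directly into code. The following is a small illustrative sketch (not from the original article); the tiny tagged corpus, tag set and epsilon value are made up for demonstration:

from collections import defaultdict

# A toy tagged corpus: each sentence is a list of (word, tag) pairs,
# with '<s>' treated as the initial state.
tagged_sentences = [
    [('in', 'O'), ('a', 'O'), ('station', 'NN'), ('of', 'O'), ('the', 'O'), ('metro', 'NN')],
    [('petals', 'NN'), ('on', 'O'), ('a', 'O'), ('wet', 'O'), ('black', 'O'), ('bough', 'NN')],
]

tags = ['NN', 'VB', 'O']
epsilon = 0.001

# Step 1: count occurrences of tag pairs (previous tag -> current tag)
transition_counts = defaultdict(int)
for sentence in tagged_sentences:
    prev = '<s>'
    for _, tag in sentence:
        transition_counts[(prev, tag)] += 1
        prev = tag

# Step 2: turn counts into smoothed probabilities, row by row,
# so that each row still sums to 1
transition_probs = {}
for prev in ['<s>'] + tags:
    row_total = sum(transition_counts[(prev, t)] for t in tags)
    for t in tags:
        transition_probs[(prev, t)] = (transition_counts[(prev, t)] + epsilon) / (row_total + len(tags) * epsilon)

print(transition_probs[('<s>', 'NN')], transition_probs[('O', 'NN')])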

Hidden Markov Model

A Hidden Markov Model (HMM) is a statistical Markov model in which the system being modeled is assumed to be a Markov process with unobservable ("hidden") states (source: Wikipedia). In our case, the unobservable states are the POS tags of a word.

If we rewind back to our Markov Model in Figure 5, we see that the model has states for parts of speech, such as VB for verb and NN for noun. We may now think of these as hidden states, since they are not directly observable from the corpus. Though a human may be capable of deciphering what POS applies to a specific word, a machine only sees the text, hence making the text observable, and is unaware of whether that word's POS tag is noun, verb, or something else, which in turn means the tags are unobservable.
Both the Markov Model and the Hidden Markov Model have transition probabilities that describe the transition from one hidden state to the next; however, the Hidden Markov Model also has something known as emission probabilities.

The emission probabilities describe the transitions from the hidden states in the model (remember, the hidden states are the POS tags) to the observable states (remember, the observable states are the words).

Figure 14: Example of Hidden Markov model. (Image by Author)

In Figure 14 we see that for the hidden VB state we have observable states. The emission probability from the hidden state VB to the observable word "eat" is 0.5, hence there is a 50% chance that the model would output this word when the current hidden state is VB.

We can also represent the emission probabilities as a table:

Figure 15: Emission matrix expressed as a table. The numbers are not accurate representations, they are just random. (Image by Author)

Similar to the transition probability matrix, the row values must sum to 1. Also, the reason all of our POS tags' emission probabilities are more than 0 is that words can have a different POS tag depending on the context.

To populate the emission matrix, we'd follow a procedure very similar to the way we populated the transition matrix. We'd first count how often a word is tagged with a specific tag.

Figure 16: Calculating the counts of a word and how often it is tagged with a specific tag. (Image by Author)

Since the process is so similar to calculating the transition matrix, I will instead provide you with the formula with smoothing applied, to see how it would be calculated.

Figure 17: Formula for calculating emission probabilities, where N is the number of tags and epsilon is a very small number (Image by Author).
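For completeness, here is a matching sketch for the emission counts (again an illustrative addition, continuing the transition-matrix sketch above and reusing its toy corpus, tag list, epsilon and defaultdict import):

# Step 1: count how often each word appears with each tag
emission_counts = defaultdict(int)
vocab = set()
for sentence in tagged_sentences:
    for word, tag in sentence:
        emission_counts[(tag, word)] += 1
        vocab.add(word)

# Step 2: smoothed emission probabilities P(word | tag); the vocabulary
# size is used in the denominator so that each tag's row sums to 1
emission_probs = {}
for tag in tags:
    row_total = sum(emission_counts[(tag, w)] for w in vocab)
    for w in vocab:
        emission_probs[(tag, w)] = (emission_counts[(tag, w)] + epsilon) / (row_total + len(vocab) * epsilon)

print(emission_probs[('NN', 'metro')])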

Wrap Up

You now know what a POS tag is and its different applications, as well as Markov Models, Hidden Markov Models, and transition and emission matrices and how to populate them with smoothing applied.
