NLP Study Materials Updated
Natural Language Processing (NLP) is a subfield of linguistics (the scientific study of language) and Artificial Intelligence. It enables machines to understand and interpret human language.
NLP in our daily life:
You may have noticed that Google Search is integrated with a "Search by Voice" button, which enables users to search for anything just by speaking.
At the back end, Google receives your voice recording, processes your words (natural language), converts them into text, and then displays relevant results to you through text-matching algorithms.
Companies are using sentiment analysis, an application of natural language processing (NLP), to identify the opinions and sentiments of their customers online. It helps companies understand what their customers think about their products and services.
1. Autocomplete – when you type out a message or search query on Google, NLP allows you to type faster. Text messengers, search engines, websites, forms, etc., utilize NLP technology to speed up access to relevant information.
2. Writing assistance – while writing an email, a Word document, a blog post, or using Google Docs, NLP allows users to write more precisely.
3. Grammar checkers – help users with punctuation, voice, articles, prepositions and other grammatical elements by providing suggestions in your language of choice.
4. Spell checkers – help users remove spelling errors, stylistically incorrect language, typos, etc., based on the language chosen.
For example, Grammarly utilizes both spell checkers and grammar checkers to help you make corrections for
a more accurate output.
This is just an example; similarly, our daily life is filled with numerous applications of Natural Language Processing.
NLP-based problems are not really easy. Why?
NLP-based problems usually involve unstructured data, and when the data is in unstructured form, data processing becomes difficult.
Unstructured data: data which is not in a proper structure and which cannot be stored directly in the form of rows and columns.
Applications of Natural Language Processing:
Auto-complete feature in emails and in search engines
Voice recognition (machines are able to convert voice into text)
Text to speech
Chatbots
Voice assistants (for e.g. Alexa, Siri)
Sentiment analysis (for e.g. Positive, Negative, Neutral)
Text summarization (for e.g. summarising the news into 50 words)
Email spam detection (for e.g. Spam / Not Spam)
Advertisements
Extracting information from resumes (for e.g. Named Entity Recognition)
And a lot more...
This list just goes on. As technologies advance, we are becoming more and more connected with the Internet, and we produce lots of data daily over the internet. Be it your social media, your product buying history, or your search history, companies are using this information to personalize your experience on their platforms.
Hence there is a growing demand in the field of NLP, where engineers and scientists work with unstructured data so as to make relevant business decisions.
Components of NLP
There are two components of NLP: one works on understanding the right meaning/semantics of the text, and the other helps in generating text as a response to the user.
a) Natural Language Understanding (NLU)
b) Natural Language Generation (NLG)
1) Natural Language Understanding (NLU): It is where the syntax and semantics are learnt by the machine. It is the step where machines understand the actual meaning and context of the sentence. But as language comes with ambiguity, there are a few problems, mentioned below, which occur while understanding the text:
Example: I saw bats. (the mammal bat, or a wooden cricket bat?)
Example: The chicken is ready to eat. (the chicken dish is now ready for you to eat, or the chicken itself is ready to eat something?)
Example: John called Jay. Later, he laughed. (here, who is "he" referring to, John or Jay?)
There are defined techniques which are used to remove these ambiguities from the text so that the right meaning is understood by the machine.
2) Natural Language Generation (NLG): It is where the machine generates natural-language text as a response to the user.
Example: Automatic Essay Writing, News Writing, etc.
Natural Language Generation works in three phases:
a) Text planning: In this phase, the relevant content to be communicated is retrieved and organized.
b) Sentence planning: In this phase, selection of words, forming meaningful phrases, and setting the tone of the sentence take place.
c) Text realization: This is the final phase, where the sentence plan is executed into the final sentence for delivery.
So far we have learnt what NLP is, what its components are, and what challenges are faced during text processing. Now it is time to do a bit of coding; let us start cleaning the text corpus.
When the text corpus is given to us, it may have following issues:
HTML tags
Upper / Lower Case inconsistency
Punctuations
Stop words
Words not in their root form
And so on. . .
Before using the data for predictions, we need to clean it. Let us start working on these issues one by
one:
1) Remove HTML tags: While scraping text data from a website, you may get HTML tags included, so it is recommended that we remove them.
Example:
“The market extended gains for the seventh consecutive session, climbing 1 percent to end at <b> record
</b> closing high on May 31. Reliance Industries <h2> continued to be a leader </h2> in the rally, followed
by <br> private banks & financials and FMCG stocks.”
To clean the above text, let us remove the tags enclosed between the angle brackets '<' and '>'. We need to write a regex (regular expression) for it.
import re
text_data = '''The market extended gains for the seventh consecutive session, climbing 1
percent to end at <b> record </b> closing high on May 31. Reliance Industries <h2>
continued to be a leader </h2> in the rally, followed by <br> private banks & financials
and FMCG stocks.'''
html_pattern = re.compile('<.*?>')
text_data = re.sub(html_pattern, '', text_data)
text_data
Output:
“The market extended gains for the seventh consecutive session, climbing 1 percent to end at record closing
high on May 31. Reliance Industries continued to be a leader in the rally, followed by private banks &
financials and FMCG stocks.”
Now you can notice that html tags have been replaced with empty strings.
2) Upper and lower case inconsistency:
Let us remove this inconsistency and convert everything into lower case.
text_data = text_data.lower()
text_data
Output:
“the market extended gains for the seventh consecutive session, climbing 1 percent to end at record closing
high on may 31. reliance industries continued to be a leader in the rally, followed by private banks &
financials and fmcg stocks.”
3) Remove Punctuations:
Punctuations in the text do not make much sense hence we can remove them.
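The code for this step is not included in the source; a minimal sketch that reproduces the output below uses a regular expression (the same pattern reused in the clean_data function later in this material):
Code:
import re

# Keep only word characters and whitespace; everything else (punctuation, '&') is dropped
text_data = re.sub(r'[^\w\s]', '', text_data)
text_data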
Output:
“the market extended gains for the seventh consecutive session climbing 1 percent to end at record
closing high on may 31 reliance industries continued to be a leader in the rally followed by private banks
financials and fmcg stocks”
4) Remove short words: Words that provide meaningful information often have a word length of more than 2 characters.
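The filtering code is not shown in the source; one way to reproduce the output below is to keep only the words longer than two characters:
Code:
# Drop very short tokens such as '1', 'to', 'at', 'on', '31'
text_data = " ".join(word for word in text_data.split() if len(word) > 2)
text_data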
Output:
'the market extended gains for the seventh consecutive session climbing percent end record closing high
may reliance industries continued leader the rally followed private banks financials and fmcg stocks'
Phases in Natural Language Processing
There are phases in NLP which need to be performed in order to extract meaningful information from the text corpus. Once these phases are completed, you are ready with your refined text, and then you can apply some machine learning model to predict something.
1) Lexical Analysis: In this phase, the text is broken down into paragraphs, sentences and words. Analysis is done for identification and description of the structure of words. It includes techniques such as:
Stop word removal (removing 'and', 'of', 'the', etc. from text)
Tokenization (breaking the text into sentences or words)
o Word tokenizer
o Sentence tokenizer
o Tweet tokenizer
Stemming (removing 'ing', 'es', 's' from the tail of words)
Lemmatization (converting words to their base forms)
2) Syntactic Analysis:
Parsing, syntax analysis, or syntactic analysis is the process of analyzing a string of symbols, either in natural language, computer languages, or data structures, conforming to the rules of a formal grammar. Grammatical rules are applied to categories and groups of words, not individual words. Syntactic analysis basically assigns a syntactic structure to the text.
Syntactic analysis is a very important part of NLP that helps in understanding the grammatical meaning of any sentence.
Example: This sentence does not make sense: "Truck is eating Oranges"
Hence there is a need to analyze the intent of the words in a sentence. Some of the techniques used in this phase are (a short illustration follows the list):
Dependency Parsing
Parts of Speech (POS) tagging
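As a quick illustration of the second technique, NLTK ships a pre-trained POS tagger. This is only a sketch; it assumes the 'averaged_perceptron_tagger' resource has already been downloaded via nltk.download():
Code:
from nltk import pos_tag
from nltk.tokenize import word_tokenize

# Tags each token with its part of speech, e.g. ('eating', 'VBG')
print(pos_tag(word_tokenize("Truck is eating Oranges")))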
3) Semantic Analysis: Once the tagging and word dependencies are analyzed, semantic analysis extracts only meaningful information from the text and rejects/ignores the sentences that do not make sense.
Semantic analysis of natural language captures the meaning of the given text while considering context, logical structuring of sentences, and grammar roles.
4) Discourse Integration: Its scope is not limited to a single word or sentence; rather, discourse integration helps in studying the whole text.
5) Pragmatic Analysis: It is a complex phase where machines should have knowledge not only about the provided text but also about the real world. There can be multiple scenarios where the intent of a sentence can be misunderstood if the machine doesn't have real-world knowledge.
Example: "Thank you for coming so late, we have wrapped up the meeting." (contains mockery)
Example: "Can you share your screen?" (here the context is about a computer's screen shared during a remote meeting)
Tokenization
The process of splitting text, phrases, or sentences into smaller units is called tokenization.
Example:
1) Sentence Tokenizer:
Text data will be split into sentences.
Code:
from nltk.tokenize import sent_tokenize

text_data = 'The market extended gains for the seventh consecutive session, climbing 1 percent to end at record closing high on May 31. Reliance Industries continued to be a leader in the rally, followed by private banks & financials and FMCG stocks.'
sent_tokenize(text_data)
Output:
['The market extended gains for the seventh consecutive session, climbing 1 percent to end at record closing high on May 31.',
'Reliance Industries continued to be a leader in the rally, followed by private banks & financials and FMCG stocks.']
2) Word Tokenizer:
Text data will be split into words.
Code:
from nltk.tokenize import word_tokenize

text_data = 'The market extended gains for the seventh consecutive session, climbing 1 percent to end at record closing high on May 31. Reliance Industries continued to be a leader in the rally, followed by private banks & financials and FMCG stocks.'
word_tokenize(text_data)
Output:
['The', 'market', 'extended', 'gains', 'for', 'the', 'seventh', 'consecutive', 'session', ',', 'climbing', '1', 'percent', 'to', 'end', 'at', 'record', 'closing', 'high', 'on', 'May', '31', '.', 'Reliance', 'Industries', 'continued', 'to', 'be', 'a', 'leader', 'in', 'the', 'rally', ',', 'followed', 'by', 'private', 'banks', '&', 'financials', 'and', 'FMCG', 'stocks', '.']
3) Whitespace Tokenizer:
Text data will be split on whitespace, so punctuation stays attached to the words.
Code:
from nltk.tokenize import WhitespaceTokenizer

text_data = 'The market extended gains for the seventh consecutive session, climbing 1 percent to end at record closing high on May 31. Reliance Industries continued to be a leader in the rally, followed by private banks & financials and FMCG stocks.'
print(WhitespaceTokenizer().tokenize(text_data))
Output:
['The', 'market', 'extended', 'gains', 'for', 'the', 'seventh', 'consecutive', 'session,', 'climbing', '1', 'percent', 'to', 'end', 'at', 'record', 'closing', 'high', 'on', 'May', '31.', 'Reliance', 'Industries', 'continued', 'to', 'be', 'a', 'leader', 'in', 'the', 'rally,', 'followed', 'by', 'private', 'banks', '&', 'financials', 'and', 'FMCG', 'stocks.']
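The lexical analysis section above also lists a tweet tokenizer; NLTK provides one of those as well. A small sketch (the example tweet below is made up for illustration):
Code:
from nltk.tokenize import TweetTokenizer

tweet = "Markets hit a record high today! @analyst #stocks :-)"
print(TweetTokenizer().tokenize(tweet))
# Mentions ('@analyst'), hashtags ('#stocks') and emoticons (':-)') stay intact as single tokens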
Stop Word Removal
There are words in our sentences which do not provide any relevant information, and hence they can be removed from the text.
Example: and, of, at, it, the, etc.
There are multiple NLP libraries which operate on text and provide functionality to remove stop words. Some of the famous libraries that provide support for stop word removal:
NLTK
Spacy
Gensim
Code:
import nltk

stopwords = nltk.corpus.stopwords.words('english')
len(stopwords)   # => 179
Now, let us import the word tokenizer, which will split our text corpus into words. Later on, each of these words will be checked against the stop word list; if a word is part of the list, we need to ignore it.
Code:
from nltk.tokenize import word_tokenize

tokenized_text = word_tokenize(text_data)
print(tokenized_text)
Output:
['the', 'market', 'extended', 'gains', 'for', 'the', 'seventh', 'consecutive', 'session', 'climbing', 'percent', 'end', 'record', 'closing', 'high', 'may', 'reliance', 'industries', 'continued', 'leader', 'the', 'rally', 'followed', 'private', 'banks', 'financials', 'and', 'fmcg', 'stocks']
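The filtering step itself is not shown in the source; a minimal list comprehension (the same pattern used again further below) produces the output that follows:
Code:
# Keep only the tokens that are not in the stop word list
removed_stop_words_list = [word for word in tokenized_text if word not in stopwords]
print(removed_stop_words_list)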
Output:
['market', 'extended', 'gains', 'seventh', 'consecutive', 'session', 'climbing', 'percent', 'record', 'closing', 'high', 'may', 'reliance', 'industries', 'continued', 'leader', 'rally', 'followed', 'private', 'banks', 'financials', 'fmcg', 'stocks']
It is clearly visible that 'the', 'for' and 'and' have been removed from the text.
Important point: Let's say there is some word which does not make sense in your domain and you want to remove it too. You can enhance your stop word list by adding this word to it, and later you can apply the same removal step.
Example: 'fmcg' is a common word in your domain, so you want to remove it.
Code:
stopwords.append('fmcg')
len(stopwords)   # => 180

removed_stop_words_list = [word for word in tokenized_text if word not in stopwords]
print(removed_stop_words_list)
Output:
['market', 'extended', 'gains', 'seventh', 'consecutive', 'session', 'climbing', 'percent', 'end', 'record', 'closing', 'high', 'may', 'reliance', 'industries', 'continued', 'leader', 'rally', 'followed', 'private', 'banks', 'financials', 'stocks']
At this point, we have cleaned data; however, there are some words which are not in their root form, and this problem can affect the model's accuracy. Hence it is recommended to convert the words to their base forms. We are going to learn these techniques going forward.
Stay tuned!!
Stemming & Lemmatization
When we work on some text document, removal of punctuation and stop words is just not enough; there is still something more which needs our attention.
The words that we use in sentences can take any form. Words can be used in the present tense, past tense, or maybe future tense, and the word changes accordingly.
For e.g. - The word 'Go' is 'Go'/'Goes' in the present tense and 'Went' in the past tense.
- The word 'See' is 'See'/'Sees' in the present tense, whereas it is 'Saw' in the past tense.
These inconsistencies in data can affect model training and predictions; hence, we need to make sure that the words exist in their root forms.
To handle this, there are two methods:
1) Stemming:
Stemming is the process of converting/reducing inflected words to their root form. In this method, the suffixes are removed from the inflected word so that it becomes the root.
For e.g., from the word 'Going', the 'ing' suffix will get removed and the inflected word 'Going' will become 'Go', which is the root form.
Developing -> Develop
Developed -> Develop
Development -> Develop
Develops -> Develop
All these inflected words take their root form when their suffixes are removed. Internally, the stemming process uses some rules for trimming the suffix part.
Python Code:
from nltk.stem import PorterStemmer

porter = PorterStemmer()
print(porter.stem("developing"))
print(porter.stem("develops"))
print(porter.stem("development"))
print(porter.stem("developed"))
Output:
develop
develop
develop
develop
However, there are some words which do not get properly handled by the stemming process.
For e.g. 'went', 'flew', 'saw' - these words can't be converted properly to their base forms if stemming is applied.
Code:
print(porter.stem("went"))
print(porter.stem("flew"))
print(porter.stem("saw"))
Output:
went
flew
saw
Fortunately, we have lemmatization for this work.
Pros: Computationally fast, as it simply trims the suffix without worrying about the context of the word.
Cons: It is not useful enough if you are concerned about valid words. A stemmer can give you some words which do not have any meaning.
"Goes" -> "goe"
2) Lemmatization:
It is where the words are converted to their root forms by understanding the context of the word in the sentence.
In stemming, the root form obtained is called a stem, whereas in lemmatization it is called a lemma.
Pros: The root word which we get after conversion holds some meaning, and the word belongs to the dictionary.
Cons: It is computationally expensive.
NLTK provides a class called WordNetLemmatizer for this purpose.
Code:
from nltk.stem import WordNetLemmatizer

wordnet_lemmatizer = WordNetLemmatizer()
As mentioned in this tutorial, you might have installed the nltk library on your system, but to work with WordNetLemmatizer, you need to download the WordNet package explicitly.
Code:
import nltk

nltk.download()
This will launch a downloader window; you need to scroll down, select "wordnet" from the list, and click on Download.
Once it is downloaded successfully, you should be able to use WordNetLemmatizer.
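Alternatively (a small sketch, not part of the original walkthrough), the resource can be fetched without the GUI by passing its name directly:
Code:
import nltk

# Downloads only the WordNet corpus, without opening the downloader window
nltk.download('wordnet')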
Code:
from nltk.stem import WordNetLemmatizer

wordnet_lemmatizer = WordNetLemmatizer()
print(wordnet_lemmatizer.lemmatize("going"))
print(wordnet_lemmatizer.lemmatize("goes"))   # Lemmatizer is able to convert it to "go"
print(wordnet_lemmatizer.lemmatize("went"))
print(porter.stem("goes"))   # Stemming is unable to normalize the word "goes" properly
Output:
going
go
went
goe
But you might be wondering why the lemmatizer is unable to normalize the words "going" and "went" into their root forms.
It is because we have not passed the context to it. Part of speech, "pos", is the parameter which we need to specify; by default it is NOUN. If a word is a verb which we want to normalize, then we need to specify the parameter with the value "v".
Code:
from nltk.stem import WordNetLemmatizer

wordnet_lemmatizer = WordNetLemmatizer()
print(wordnet_lemmatizer.lemmatize("going", pos="v"))
print(wordnet_lemmatizer.lemmatize("goes", pos="v"))
print(wordnet_lemmatizer.lemmatize("went", pos="v"))
print(wordnet_lemmatizer.lemmatize("go", pos="v"))
print(wordnet_lemmatizer.lemmatize("studies", pos="v"))
print(wordnet_lemmatizer.lemmatize("studying", pos="v"))
print(wordnet_lemmatizer.lemmatize("studied", pos="v"))
print(wordnet_lemmatizer.lemmatize("dogs"))            # by default, pos is noun
print(wordnet_lemmatizer.lemmatize("dogs", pos="n"))
Output:
go
go
go
go
study
study
study
dog
dog
Basic Extractions
From the text corpus, we can extract useful information and create variables out of it.
From a given sentence, we can extract some meaningful information like:
How many words are present?
How many characters are present in the sentence?
What is the average length of each word?
How many lowercase words are present?
How many uppercase words are present?
What is the length of the longest/smallest word?
How many stop words are present?
Code:
import pandas as pd
import nltk

stopwords = nltk.corpus.stopwords.words('english')   # importing stop words list

# string_lst: the list of raw example messages (its definition is truncated in the source)
df = pd.DataFrame(string_lst, columns=['msg'])

def derive_features(message):
    words_lst = message.split()
    num_characters = len(message)
    num_words = len(words_lst)

    words_length = []
    lower_words_lst = []
    upper_words_lst = []
    is_stop_word_lst = []
    for word in words_lst:
        words_length.append(len(word))
        lower_words_lst.append(word.islower())
        upper_words_lst.append(word.isupper())
        is_stop_word_lst.append(word.lower() in stopwords)

    stop_words_count = sum(is_stop_word_lst)
    avg_word_length = round(sum(words_length) / len(words_length), 1)
    max_length_word = max(words_length)
    min_length_word = min(words_length)
    total_lower_words = sum(lower_words_lst)
    total_upper_words = sum(upper_words_lst)

    return (num_characters, num_words, avg_word_length, total_lower_words,
            total_upper_words, max_length_word, min_length_word, stop_words_count)

df['num_chars'], df['num_words'], df['avg_word_len'], df['num_lower_words'], \
df['num_upper_words'], df['max_length_word'], df['min_length_word'], \
df['num_stop_words'] = zip(*df['msg'].apply(lambda r: derive_features(r)))
Bag of Words
Bag of words is a technique to extract features from the provided text data. This technique counts the occurrence of words in the sentence: if a word is found in the sentence, its occurrence value increases by 1; otherwise its occurrence value remains 0.
This technique is simple and easy to implement, but it comes with its own limitations as well. We will discuss those at the end of this article.
When we have the text information available with us, it is a prerequisite that we clean our text first.
All the steps discussed in the previous chapter will be applied and the text cleaned. Once that is done, we will use the Bag of Words technique to extract features out of it.
Code:
import pandas as pd
import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer

stopwords = nltk.corpus.stopwords.words('english')   # importing stop words list

# string_lst: the list of raw example messages (its definition is truncated in the source)
df = pd.DataFrame(string_lst, columns=['msg'])

def clean_data(text_data):
    print("Original: {}\n".format(text_data))

    # Cleaning html tags
    html_pattern = re.compile('<.*?>')
    text_data = re.sub(html_pattern, '', text_data)
    print("Step1: {}\n".format(text_data))

    # Handling case inconsistencies
    text_data = text_data.lower()
    print("Step2: {}\n".format(text_data))

    # Removing punctuations
    text_data = re.sub(r'[^\w\s]', '', text_data)
    print("Step3: {}\n".format(text_data))

    # Removing stop words
    tokenized_text = word_tokenize(text_data)
    text_data = " ".join(word for word in tokenized_text if word not in stopwords)
    print("Step5: {}\n".format(text_data))

    # Stemming
    porter = PorterStemmer()
    text_data = " ".join(porter.stem(word) for word in word_tokenize(text_data))
    print("Step6: {}\n".format(text_data))

    return text_data

clean_data(df['msg'].iloc[0])
Output:
Original: <THE> quick BROWN fox ... over the lazy dog.
...
quick brown fox jumping over the lazy dog
...
Now, let us use the clean_data() function to clean all the rows and add one new column to the dataframe itself with the name "cleaned_msg".
df['cleaned_msg'] = df['msg'].apply(clean_data)
We have raw messages and cleaned messages available with us. Let us first use the 'msg' column and get the count for each word.
Code:
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(df['msg'])
result = pd.DataFrame(X.toarray())
result.columns = vectorizer.get_feature_names()
result
Do you see any problem in this?
If we do not use cleaned data for feature engineering, then we end up with a great many columns in the final dataframe. You can see that, in just 3 sentences, there are 22 words present. If we have a dataset of 10,000 rows, the word count will be huge and the CountVectorizer will return an enormous number of columns, which is not good practice.
So this time, we will use the cleaned data for feature creation.
Code:
vectorizer_clean = CountVectorizer()
X_clean = vectorizer_clean.fit_transform(df['cleaned_msg'])
print(vectorizer_clean.vocabulary_)
Output:
{'quick':9,
'brown':1,
'fox':3,
'jump':5,
'lazi':6,
'dog':2,
'anyth':0,
'pleas':8,
'help':4,
'padhai':7,
'time':10}
result_clean = pd.DataFrame(X_clean.toarray())
result_clean.columns = vectorizer_clean.get_feature_names()
result_clean
Limitations of the Bag of Words Approach:
2) With the BOW approach, there are many zeros in the data; such a matrix is called a sparse matrix.
Term Frequency - Inverse Document Frequency
1) Suppose we don't want to remove stop words from the text corpus. In this case, the frequency of 'is', 'the', 'a' will be very high, but these words do not actually help a model learn anything from the sentence.
2) Suppose we are processing product reviews on Amazon / Flipkart; then terms like 'product' and 'item' are domain dependent and are used too often in each review. Hence these words will not help the model learn anything.
3) The 'mobile' keyword in a phone data set is not giving any value. Keywords like '5GB', 'Splash Proof', 'Android OS' will make more sense.
Hence there is a technique called TF-IDF where the most frequent words are suppressed (given lower importance) and unique (less frequent) words are given a higher weightage in the sentence.
Let us look at the below two sentences:
- I do not like Vanilla Cake
- I do not like Vanilla Icecream
In both of the above sentences, the unique, or I would say the important, keywords to notice are "Cake" and "Icecream". The remaining terms like "I", "do", "not", "like", "Vanilla" are repeated in both sentences, hence they do not provide any useful information to the model.
Terminology:
Corpus: the collection of documents/sentences
Term: a single word inside a document/sentence
TF: Term Frequency
IDF: Inverse Document Frequency
Formulas:
TF(term, document) = (number of times the term appears in the document) / (total number of terms in the document)
IDF(term) = log(total number of documents / number of documents containing the term)
TF-IDF(term, document) = TF(term, document) * IDF(term)
Let us take the same example and calculate the Term Frequency and IDF:
Document A: I do not like Vanilla Cake
Document B: I do not like Vanilla Icecream
No. of words in Document A: 6
We can achieve the same task by importing TfidfVectorizer from the sklearn library. Note that implementations differ slightly; some of the techniques add 1 in the denominator while calculating the IDF values, etc.
Code:
import pandas as pd
import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

stopwords = nltk.corpus.stopwords.words('english')   # importing stop words list

# string_lst: the list of raw example messages (its definition is truncated in the source)
df = pd.DataFrame(string_lst, columns=['msg'])

def clean_data(text_data):
    print("Original: {}\n".format(text_data))

    # Cleaning html tags
    html_pattern = re.compile('<.*?>')
    text_data = re.sub(html_pattern, '', text_data)
    print("Step1: {}\n".format(text_data))

    # Handling case inconsistencies
    text_data = text_data.lower()
    print("Step2: {}\n".format(text_data))

    # Removing punctuations
    text_data = re.sub(r'[^\w\s]', '', text_data)
    print("Step3: {}\n".format(text_data))

    # Removing stop words
    tokenized_text = word_tokenize(text_data)
    text_data = " ".join(word for word in tokenized_text if word not in stopwords)
    print("Step5: {}\n".format(text_data))

    # Stemming
    porter = PorterStemmer()
    text_data = " ".join(porter.stem(word) for word in word_tokenize(text_data))
    print("Step6: {}\n".format(text_data))

    return text_data

df['cleaned_msg'] = df['msg'].apply(clean_data)
Code:
vectorizer_clean = TfidfVectorizer(smooth_idf=True)
X_clean = vectorizer_clean.fit_transform(df['cleaned_msg'])
print(vectorizer_clean.vocabulary_)

terms = vectorizer_clean.get_feature_names()
print(terms)
Output:
['cake', 'icecream', 'like', 'vanilla']
idf_values = vectorizer_clean.idf_
print("IDF Values: \n", {terms[i]: idf_values[i] for i in range(len(terms))})
Output:
IDF Values:
{'cake': 1.4054651081081644, 'icecream': 1.4054651081081644, 'like': 1.0, 'vanilla': 1.0}
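These numbers can be verified by hand. With smooth_idf=True, sklearn computes idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing the term. The small check below is a sketch confirming the values above:
Code:
import math

n_docs = 2     # two cleaned documents
df_cake = 1    # 'cake' appears in only one document
df_like = 2    # 'like' appears in both documents

idf_cake = math.log((1 + n_docs) / (1 + df_cake)) + 1   # ln(3/2) + 1 = 1.4054...
idf_like = math.log((1 + n_docs) / (1 + df_like)) + 1   # ln(3/3) + 1 = 1.0
print(idf_cake, idf_like)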
result_clean = pd.DataFrame(X_clean.toarray())
result_clean.columns = vectorizer_clean.get_feature_names()
result_clean
Natural Language Processing – Advantages
1. Better data analysis
2. Streamlined processes
3. Cost-effective
4. Empowered employees
5. Enhanced customer experience
Natural Language Processing – 5 Phases
Phase 1 – Lexical Analysis
Lexical analysis is the process of converting a sequence of characters into a sequence of tokens.
A lexer is generally combined with a parser, which together analyze the syntax of programming languages, web pages, and so forth.
Lexers and parsers are most often used for compilers but can be used for other computer language tools, such as pretty printers or linters.
Lexical analysis is also an important analysis during the early stage of natural language processing, where text or sound waves are segmented into words and other units.
Phase 2 – Syntactic Analysis
Syntactic analysis applies grammatical rules to groups of words (not individual words) and derives the grammatical structure of the sentence.
Phase 3 – Semantic Analysis
Semantic analysis attempts to understand the meaning of natural language.
Semantic analysis of natural language captures the meaning of the given text while considering context, logical structuring of sentences, and grammar roles.
The 2 parts of semantic analysis are (a) Lexical Semantic Analysis and (b) Compositional Semantics Analysis.
Semantic analysis can begin with the relationship between individual words.
Phase 4 – Discourse Analysis
Discourse analysis is not limited to a single word or sentence; it studies the text as a whole, interpreting each sentence in the context of the sentences around it.
Phase 5 – Pragmatic Analysis
Pragmatic analysis is part of the process of extracting information from text.
It focuses on taking a structured set of text and figuring out the actual meaning of the text.
It also focuses on the meaning of the words in the given time and context.
Effects on interpretation can be measured using pragmatic analysis by understanding the communicative and social content.
NLP - Word Sense Disambiguation
We understand that words have different meanings based on the context of their usage in a sentence. Human languages are ambiguous because many words can be interpreted in multiple ways depending upon the context of their occurrence.
Word sense disambiguation, in natural language processing (NLP), may be defined as the ability to determine which meaning of a word is activated by its use in a particular context. Lexical ambiguity, syntactic or semantic, is one of the very first problems that any NLP system faces. Part-of-speech (POS) taggers with a high level of accuracy can resolve a word's syntactic ambiguity. On the other hand, the problem of resolving semantic ambiguity is called WSD (word sense disambiguation). Resolving semantic ambiguity is harder than resolving syntactic ambiguity.
For example, consider these two examples of the distinct senses that exist for the word "bass":
I can hear bass sound.
He likes to eat grilled bass.
The occurrence of the word bass clearly denotes a distinct meaning in each case. In the first sentence, it means frequency, and in the second, it means fish. Hence, if it were disambiguated by WSD, the correct meaning of the above sentences could be assigned as follows:
I can hear bass/frequency sound.
He likes to eat grilled bass/fish.
Evaluation of WSD
The evaluation of WSD requires the following two inputs:
A Dictionary
The very first input for the evaluation of WSD is a dictionary, which is used to specify the senses to be disambiguated.
Test Corpus
Another input required by WSD is a hand-annotated test corpus that has the target or correct senses. The test corpora can be of two types:
Lexical sample − This kind of corpus is used in systems where it is required to disambiguate a small sample of words.
All-words − This kind of corpus is used in systems where it is expected to disambiguate all the words in a piece of running text.
Approaches and Methods to Word Sense Disambiguation (WSD)
Approaches and methods to WSD are classified according to the source of knowledge used in word disambiguation.
Let us now see the four conventional methods for WSD:
Dictionary-based or Knowledge-based Methods
As the name suggests, for disambiguation these methods primarily rely on dictionaries, thesauri and lexical knowledge bases. They do not use corpus evidence for disambiguation. The Lesk method is the seminal dictionary-based method, introduced by Michael Lesk in 1986. The Lesk definition, on which the Lesk algorithm is based, is "measure overlap between sense definitions for all words in context". However, in 2000, Kilgarriff and Rosensweig gave the simplified Lesk definition as "measure overlap between sense definitions of word and current context", which further means identifying the correct sense for one word at a time. Here the current context is the set of words in the surrounding sentence or paragraph.
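NLTK ships a simplified Lesk implementation. The snippet below is only a sketch (it assumes the wordnet corpus has been downloaded) applying it to the "bass" example from earlier:
Code:
from nltk.tokenize import word_tokenize
from nltk.wsd import lesk

# Picks the WordNet sense of 'bass' whose definition overlaps most with the context words
sense = lesk(word_tokenize("He likes to eat grilled bass"), "bass")
print(sense, "-", sense.definition())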
Supervised Methods
For disambiguation, machine learning methods make use of sense-annotated corpora for training. These methods assume that the context can provide enough evidence on its own to disambiguate the sense; world knowledge and reasoning are deemed unnecessary. The context is represented as a set of "features" of the words, and it also includes information about the surrounding words. Support vector machines and memory-based learning are the most successful supervised learning approaches to WSD. These methods rely on a substantial amount of manually sense-tagged corpora, which are very expensive to create.
Semi-supervised Methods
Due to the lack of training corpora, most word sense disambiguation algorithms use semi-supervised learning methods, because semi-supervised methods use both labelled and unlabelled data. These methods require a very small amount of annotated text and a large amount of plain unannotated text. The technique used by semi-supervised methods is bootstrapping from seed data.
Unsupervised Methods
These methods assume that similar senses occur in similar contexts. That is why senses can be induced from text by clustering word occurrences using some measure of similarity of the context. This task is called word sense induction or discrimination. Unsupervised methods have great potential to overcome the knowledge-acquisition bottleneck due to their non-dependency on manual effort.
Applications of Word Sense Disambiguation (WSD)
Word sense disambiguation (WSD) is applied in almost every application of language technology. Let us now see the scope of WSD:
Machine Translation
Machine translation, or MT, is the most obvious application of WSD. In MT, the lexical choice for words that have distinct translations for different senses is done by WSD. The senses in MT are represented as words in the target language. Most machine translation systems do not use an explicit WSD module.
Information Retrieval (IR)
Information retrieval (IR) may be defined as software that deals with the organization, storage, retrieval and evaluation of information from document repositories, particularly textual information. The system basically assists users in finding the information they require, but it does not explicitly return the answers to their questions. WSD is used to resolve the ambiguities of the queries provided to an IR system. As with MT, current IR systems do not explicitly use a WSD module; they rely on the concept that the user will type enough context in the query to retrieve only relevant documents.
Text Mining and Information Extraction (IE)
In most applications, WSD is necessary to do accurate analysis of text. For example, WSD helps an intelligent gathering system flag the correct words: a medical intelligence system might need to flag "illegal drugs" rather than "medical drugs".
Lexicography
WSD and lexicography can work together in a loop because modern lexicography is corpus-based. With lexicography, WSD provides rough empirical sense groupings as well as statistically significant contextual indicators of sense.
Difficulties in Word Sense Disambiguation (WSD)
Following are some difficulties faced by word sense disambiguation (WSD):
Differences between dictionaries
The major problem of WSD is deciding the sense of a word, because different senses can be very closely related. Even different dictionaries and thesauri can provide different divisions of words into senses.
Different algorithms for different applications
Another problem of WSD is that a completely different algorithm might be needed for different applications. For example, in machine translation it takes the form of target word selection, while in information retrieval a sense inventory is not required.
Inter-judge variance
Another problem of WSD is that WSD systems are generally tested by having their results on a task compared against those of human beings. This is called the problem of inter-judge variance.
Word-sense discreteness
Another difficulty in WSD is that words cannot easily be divided into discrete sub-meanings.
Part-of-speech (POS) tagging is a popular Natural Language Processing process which refers to categorizing words in a text (corpus) in correspondence with a particular part of speech, depending on the definition of the word and its context.
In Figure 1, we can see each word has its own lexical term written underneath; however, having to constantly write out these full terms when we perform text analysis can very quickly become cumbersome, especially as the size of the corpus grows. Hence, we use a short representation referred to as "tags" to represent the categories.
As mentioned earlier, the process of assigning a specific tag to a word in our corpus is referred to as part-of-speech tagging (POS tagging for short), since the POS tags are used to describe the lexical terms that we have within our text.
Applications of POS tagging include:
Named Entity Recognition
Co-reference Resolution
Speech Recognition
When we perform POS tagging, it's often the case that our tagger will encounter words that were not within the vocabulary that was used. Consequently, augmenting your dataset to include unknown word tokens will aid the tagger in selecting appropriate tags for those words.
Taking the example text we used in Figure 1, "Why not tell someone?", imagine the sentence is truncated to "Why not tell …" and we want to determine whether the following word in the sentence is a noun, verb, adverb, or some other part of speech.
Now, if you are familiar with English, you'd instantly identify the verb and assume that it is more likely the word is followed by a noun rather than another verb. Therefore, the idea, as shown in this example, is that the POS tag that is assigned to the next word is dependent on the POS tag of the previous word.
By associating numbers with each arrow direction, which imply the likelihood of the next word given the current word, we can say there is a higher likelihood that the next word in our sentence would be a noun, since, if we are currently on a verb, a noun has a higher likelihood than another verb. The image in Figure 3 is a great example of how a Markov Model works on a very small scale.
Given this example, we may now describe Markov models as "a stochastic model used to model randomly changing systems. It is assumed that future states depend only on the current state, not on the events that occurred before it (that is, it assumes the Markov property)."
We can depict a Markov chain as a directed graph:
Figure 4: Depiction of a Markov Model as a graph (Image by Author), a replica of the image used in the NLP Specialization Coursera Course 2, Week 2.
The lines with arrows are an indication of the direction, hence the name "directed graph", and the circles may be regarded as the states of the model; a state is simply the condition of the present moment.
We could use this Markov model to perform POS tagging. Considering we view a sentence as a sequence of words, we can represent the sequence as a graph where we use the POS tags as the events that occur, which would be illustrated by the states of our model graph.
For example, q1 in Figure 4 would become NN, indicating a noun; q2 would be VB, which is short for verb; and q3 would be O, signifying all other tags that are not NN or VB. Like in Figure 3, the directed lines would be given a transition probability that defines the probability of going from one state to the next.
Figure 5: Example of a Markov Model used to perform POS tagging. (Image by Author)
A more compact way to store the transition and state probabilities is using a table, better known as a "transition matrix".
Notice this model only tells us the transition probability of one state to the next when we know the previous word. Hence, this model does not show us what to do when there is no previous word. To handle this case, we add what is known as the "initial state".
Figure 7: Adding an initial state to deal with the beginning of the word matrix. (Image by Author)
You may now be wondering, how did we populate the transition matrix? Great question. I will use 3 sentences for our corpus: "<s> in a station of the metro", "<s> the apparition of these faces in the crowd", "<s> petals on a wet, black bough." (Note these are the same sentences used in the course.) Next, we will break down how to populate the matrix into steps:
1. Count occurrences of tag pairs in the training dataset
At the end of step one, our table would look something like this:
…
Figure 9: Applying step one with our corpus. (Image by Author)
2. Calculate the probabilities using the counts
Applying the formula in Figure 10 to the table in Figure 9, our new table would look as follows…
You may notice that there are many 0's in our transition matrix, which would result in our model being incapable of generalizing to other text that may contain verbs. To overcome this problem, we add smoothing.
Adding smoothing requires that we slightly adjust the formula from Figure 10 by adding a small value, epsilon, to each of the counts in the numerator, and adding N * epsilon to the denominator.
Figure 13: New probabilities with smoothing added. N is the length of the corpus and epsilon is some very small number. (Image by Author)
Note: In a real-world example, we would avoid applying smoothing to the initial probabilities (the first row), as this would allow a sentence to possibly start with any POS tag.
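The populate-and-smooth procedure can be sketched in a few lines of Python. This is only an illustration of the idea: the tagged_sentences list is hypothetical, and N is taken here as the number of distinct tags.
Code:
from collections import Counter, defaultdict

# Hypothetical training data: each sentence is a list of (word, tag) pairs,
# with '<s>' used as the start-of-sentence state.
tagged_sentences = [
    [("in", "O"), ("a", "O"), ("station", "NN"), ("of", "O"), ("the", "O"), ("metro", "NN")],
    [("petals", "NN"), ("on", "O"), ("a", "O"), ("wet", "O"), ("black", "O"), ("bough", "NN")],
]

epsilon = 0.001
tags = sorted({tag for sent in tagged_sentences for _, tag in sent} | {"<s>"})
N = len(tags)   # number of distinct tags

# Step 1: count occurrences of tag pairs (previous tag -> current tag)
pair_counts = defaultdict(Counter)
for sent in tagged_sentences:
    prev = "<s>"
    for _, tag in sent:
        pair_counts[prev][tag] += 1
        prev = tag

# Step 2: turn the counts into smoothed probabilities
transition = {}
for prev in tags:
    total = sum(pair_counts[prev].values())
    transition[prev] = {t: (pair_counts[prev][t] + epsilon) / (total + N * epsilon) for t in tags}

print(transition["<s>"])   # probability of each tag starting a sentence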
Hidden Markov Model
A Hidden Markov Model (HMM) is a statistical Markov model in which the system being modeled is assumed to be a Markov process with unobservable ("hidden") states.
If we rewind back to our Markov Model in Figure 5, we see that the model has states for parts of speech, such as VB for verb and NN for noun. We may now think of these as hidden states, since they are not directly observable from the corpus. Though a human may be capable of deciphering what POS applies to a specific word, a machine only sees the text (so the text is observable) and is unaware of whether that word's POS tag is noun, verb, or something else, which in turn means the tags are unobservable.
Both the Markov Model and the Hidden Markov Model have transition probabilities that describe the transition from one hidden state to the next; however, the Hidden Markov Model also has something known as emission probabilities.
The emission probabilities describe the transitions from the hidden states in the model (remember, the hidden states are the POS tags) to the observable states (remember, the observable states are the words).
In Figure 14 we see that the hidden VB state has observable states. The emission probability from the hidden state VB to the observable word "eat" is 0.5, hence there is a 50% chance that the model would output this word when the current hidden state is VB.
We can also represent the emission probabilities as a table…
Figure 15: Emission matrix expressed as a table. The numbers are not accurate representations; they are just random. (Image by Author)
Similar to the transition probability matrix, the row values must sum to 1. Also, all of our POS tags' emission probabilities are greater than 0, since a word can have a different POS tag depending on the context.
To populate the emission matrix, we'd follow a procedure very similar to the way we populated the transition matrix. We'd first count how often a word is tagged with a specific tag.
Since the process is so similar to calculating the transition matrix, I will instead provide you with the formula with smoothing applied so you can see how it would be calculated.
Figure 17: Formula for calculating the probabilities with smoothing, where N is the number of tags and epsilon is a very small number. (Image by Author)
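The counting step for the emission matrix can be sketched the same way, reusing the hypothetical tagged_sentences list from the transition-matrix sketch above:
Code:
from collections import Counter, defaultdict

# Count how often each word appears under each tag
emission_counts = defaultdict(Counter)
for sent in tagged_sentences:
    for word, tag in sent:
        emission_counts[tag][word.lower()] += 1

print(emission_counts["NN"])   # e.g. Counter({'station': 1, 'metro': 1, 'petals': 1, 'bough': 1})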
Wrap Up
You now know what a POS tag is and its different applications, as well as Markov Models, Hidden Markov Models, and transition and emission matrices and how to populate them with smoothing applied.