Unit V
Natural language processing (NLP) is the intersection of computer science, linguistics and machine
learning. The field focuses on communication between computers and humans in natural language;
NLP is all about making computers understand and generate human language.
Advantages of NLP
o NLP helps users to ask questions about any subject and get a direct response within seconds.
o NLP offers exact answers to a question; it does not provide unnecessary or unwanted
information.
o NLP helps computers to communicate with humans in their languages.
o It is very time efficient.
o Most companies use NLP to improve the efficiency of documentation processes and the
accuracy of documentation, and to identify information in large databases.
Disadvantages of NLP
o NLP may not show context.
o NLP is unpredictable
o NLP may require more keystrokes.
o NLP systems struggle to adapt to new domains and have limited functionality, which is why an
NLP system is typically built for a single, specific task.
Components of NLP
NLP has the following two components –
1. Natural Language Understanding (NLU)
Natural Language Understanding (NLU) helps the machine to understand and analyse human
language by extracting the metadata from content such as concepts, entities, keywords, emotion,
relations, and semantic roles.
NLU is mainly used in business applications to understand the customer's problem in both spoken and
written language.
NLU involves the following tasks -
o It is used to map the given input into useful representation.
o It is used to analyse different aspects of the language.
2. Natural Language Generation (NLG)
Natural Language Generation (NLG) acts as a translator that converts computerized data into a natural
language representation, producing meaningful phrases and sentences from non-linguistic input.
NLU vs NLG
o NLU is the process of reading and interpreting language; it uses syntactic analysis and semantic
analysis. NLG is the process of writing or generating language.
o NLU produces non-linguistic outputs (internal representations) from natural language inputs.
NLG constructs natural language outputs from non-linguistic inputs.
o NLU reads and makes sense of natural language. NLG creates and outputs natural language.
Examples of NLU
• Automatic Ticket Routing – an example of customer service automation.
• Machine Translation
• Automated Reasoning – a subfield of cognitive science, e.g. inference about a medical diagnosis.
• Question Answering – e.g. Google Assistant
Examples of NLG
• GPT-3 – a language model developed by OpenAI.
• Long Short-Term Memory (LSTM) – mimics how human brains work, remembering previous
inputs and the context of sentences.
PHASES IN NLP
Natural language analysis is typically described in five phases: lexical analysis (breaking the text into
paragraphs, sentences and words), syntactic analysis or parsing (checking grammar and the
arrangement of words), semantic analysis (extracting the literal meaning of the text), discourse
integration (interpreting a sentence in the context of the sentences around it), and pragmatic analysis
(deriving the intended meaning in its real-world context).
TEXT PREPROCESSING
• Since text is the most unstructured form of all available data, various types of noise are
present in it, and the data is not readily analysable without pre-processing. The entire
process of cleaning and standardizing text, making it noise-free and ready for analysis, is
known as text pre-processing.
• Text processing refers to only the analysis, manipulation, and generation of text.
• For example, a simple sentiment analysis would require a machine learning model to look
for instances of positive or negative sentiment words, which could be provided to the model
beforehand. This would be text processing, since the model isn’t understanding the words,
it’s just looking for words that it was programmed to look for.
• Importance of Text Pre-processing - Since our interactions with brands have become
increasingly online and text-based, text data is one of the most important ways for
companies to derive business insights. Text data can show a business how their customers
search, buy, and interact with their brand, products, and competitors online. Text processing
with machine learning allows enterprises to handle these large amounts of text data.
• It is predominantly comprised of three steps:
o Noise Removal
o Lexicon Normalization
o Object Standardization
Noise Removal
• Any piece of text which is not relevant to the context of the data and the end-output can be
specified as the noise.
• For example – language stop words (commonly used words of a language – is, am, the, of, in
etc), URLs or links, social media entities (mentions, hash tags), punctuations and industry
specific words. This step deals with removal of all types of noisy entities present in the text.
• A general approach for noise removal is to prepare a dictionary of noisy entities, and iterate
the text object by tokens (or by words), eliminating those tokens which are present in the
noise dictionary.
A Python sketch for this purpose is shown below.
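This is a minimal illustration; the noise list and the sample sentence are made up for the example.
noise_list = ["is", "a", "this", "the", "of"]   # illustrative noise/stop words

def remove_noise(input_text):
    # keep only the tokens that are not in the noise dictionary
    words = input_text.split()
    noise_free_words = [word for word in words if word.lower() not in noise_list]
    return " ".join(noise_free_words)

print(remove_noise("this is a sample text"))
>>> sample text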
• Another approach is to use the regular expressions while dealing with special patterns of
noise.
• Regular expression is a sequence of character(s) mainly used to find and replace patterns in
a string or file.
• Regular expressions use two types of characters:
a) Meta characters: As the name suggests, these characters have a special meaning,
similar to * in a wild card.
b) Literals: ordinary characters (such as letters and digits) that match exactly themselves.
The search() method is able to find a pattern at any position in the string, but it returns only the
first occurrence of the search pattern. For example:
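A minimal sketch, using the same pattern and sentence as the findall() example that follows:
import re
result = re.search('to', 'Welcome to Python Lab with ML To clear our Doubts', re.IGNORECASE)
print(result.group())
>>> to
The findall() method, in contrast, returns every non-overlapping occurrence of the pattern: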
result = re.findall('to', 'Welcome to Python Lab with ML To clear our Doubts', re.IGNORECASE)
print (result)
>>> ['to', 'To']
Lexicon Normalization
• Another type of textual noise comes from the multiple representations exhibited by a single word.
• For example – "play", "player", "played", "plays" and "playing" are different variations of
the word "play". Though they look different, contextually they are all similar. This step
converts all the variations of a word into their normalized form (also known as the lemma).
• Normalization is a pivotal step for feature engineering with text as it converts the high
dimensional features (N different features) to the low dimensional space (1 feature), which
is an ideal ask for any ML model.
• The most common lexicon normalization practices are:
o Stemming: Stemming is a rudimentary rule-based process of removing the suffixes
(“ing”, “ly”, “es”, “s”, ”ious” etc) from a word.
o Lemmatization: Lemmatization, on the other hand, is an organized, step-by-step
procedure for obtaining the root form of a word; it makes use of vocabulary (the dictionary
importance of words) and morphological analysis (word structure and grammar
relations).
Below is sample code that performs stemming using Python's popular library – NLTK
(a lemmatization sketch follows the output).
from nltk.stem import PorterStemmer

e_words = ["wait", "waiting", "waited", "waits"]
# e_words = ["multiples", "multiplying", "multiplication", "multiply"]
ps = PorterStemmer()
for w in e_words:
    rootWord = ps.stem(w)
    print(rootWord)

Output for the first word list:
wait
wait
wait
wait
Output for the commented-out word list:
multipl
multipli
multipl
multipli
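For comparison, a minimal lemmatization sketch using NLTK's WordNetLemmatizer (the WordNet
corpus may need to be downloaded first; the pos="v" hint tells the lemmatizer to treat the word as a verb):
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("multiplying", pos="v"))
>>> multiply
print(lemmatizer.lemmatize("waited", pos="v"))
>>> wait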
Object Standardization
• Text data often contains words or phrases which are not present in any standard lexical
dictionaries. These pieces are not recognized by search engines and models.
• Some examples are – acronyms, hashtags with attached words, and colloquial
slang. With the help of regular expressions and manually prepared data dictionaries,
this type of noise can be fixed.
• The code below uses a dictionary lookup method to replace social media slang in a
text.
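A minimal sketch; the slang dictionary and the sample message are made up for the example.
lookup_dict = {"rt": "retweet", "dm": "direct message", "awsm": "awesome", "luv": "love"}   # illustrative slang dictionary

def standardize_words(input_text):
    new_words = []
    for word in input_text.split():
        # replace the word if a standard form exists in the dictionary
        if word.lower() in lookup_dict:
            word = lookup_dict[word.lower()]
        new_words.append(word)
    return " ".join(new_words)

print(standardize_words("RT this tweet is awsm"))
>>> retweet this tweet is awesome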
1.1 Dependency Trees – Sentences are composed of some words sewed together. The relationship
among the words in a sentence is determined by the basic dependency grammar. Dependency
grammar is a class of syntactic text analysis that deals with (labeled) asymmetrical binary relations
between two lexical items (words). Every relation can be represented in the form of a triplet
(relation, governor, dependent). For example: consider the sentence – “Bills on ports and
immigration were submitted by Senator Brownback, Republican of Kansas.” The relationship among
the words can be observed in the form of a tree representation as shown:
The tree shows that "submitted" is the root word of this sentence, and it is linked by two sub-trees
(the subject and object subtrees). Each subtree is itself a dependency tree, with relations such as
("Bills" <-> "ports" by a "preposition" relation) and ("ports" <-> "immigration" by a "conjunction"
relation).
This type of tree, when parsed recursively in a top-down manner, gives grammar relation triplets as
output, which can be used as features for many NLP problems such as entity-wise sentiment analysis,
actor and entity identification, and text classification. The Python wrapper StanfordCoreNLP and NLTK's
dependency grammars can be used to generate dependency trees (a sketch follows).
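A minimal sketch using NLTK's CoreNLP wrapper; it assumes a Stanford CoreNLP server is already
running locally on port 9000 (an external setup step not shown here):
from nltk.parse.corenlp import CoreNLPDependencyParser

# assumes a Stanford CoreNLP server is reachable at this address
parser = CoreNLPDependencyParser(url='http://localhost:9000')
parse, = parser.raw_parse('Bills on ports and immigration were submitted by Senator Brownback.')
for governor, relation, dependent in parse.triples():
    # each triple is ((governor word, tag), relation, (dependent word, tag))
    print(governor, relation, dependent)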
1.2 Part of speech tagging – Apart from the grammar relations, every word in a sentence is also
associated with a part-of-speech (POS) tag (noun, verb, adjective, adverb etc.). The POS tags
define the usage and function of a word in the sentence. The following code uses NLTK to perform
POS-tagging annotation on input text. (NLTK provides several implementations; the default one is
the perceptron tagger.)
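A minimal sketch with an illustrative sentence (the punkt and averaged_perceptron_tagger resources
may need to be downloaded first, and the exact tags can vary slightly with the tagger version):
from nltk import word_tokenize, pos_tag

text = "I am learning Natural Language Processing"
tokens = word_tokenize(text)
print(pos_tag(tokens))
>>> [('I', 'PRP'), ('am', 'VBP'), ('learning', 'VBG'), ('Natural', 'NNP'), ('Language', 'NNP'), ('Processing', 'NNP')]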
B. Improving word-based features: A learning model could learn different contexts of a word when
words themselves are used as features; however, if the part-of-speech tag is linked with them, the
context is preserved, making stronger features. For example:
Sentence -“book my flight, I will read this book”
Tokens – (“book”, 2), (“my”, 1), (“flight”, 1), (“I”, 1), (“will”, 1), (“read”, 1), (“this”, 1)
Tokens with POS – (“book_VB”, 1), (“my_PRP$”, 1), (“flight_NN”, 1), (“I_PRP”, 1), (“will_MD”, 1),
(“read_VB”, 1), (“this_DT”, 1), (“book_NN”, 1)
C. Normalization and Lemmatization: POS tags are the basis of the lemmatization process for converting
a word to its base form (lemma).
D. Efficient stopword removal: POS tags are also useful for efficient removal of stopwords.
For example, there are some tags which always define the low-frequency / less important words of a
language, such as (IN – "within", "upon", "except"), (CD – "one", "two", "hundred"), (MD –
"may", "must" etc.). A small sketch of tag-based filtering follows.
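A minimal sketch, assuming the tag set above and an illustrative sentence (the printed result may
vary slightly with the tagger version):
from nltk import word_tokenize, pos_tag

low_info_tags = {"IN", "CD", "MD"}   # prepositions, cardinal numbers, modals
tokens = pos_tag(word_tokenize("He may walk within one hundred steps"))
print([word for word, tag in tokens if tag not in low_info_tags])
>>> ['He', 'walk', 'steps']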
2. Entity Extraction (Named Entity Recognition) – Named Entity Recognition (NER) detects and
classifies the named entities present in a sentence, such as person names, organizations and
locations. For example:
Sentence – Sergey Brin, the manager of Google Inc. is walking in the streets of New York.
Named Entities – ("person" : "Sergey Brin"), ("org" : "Google Inc."), ("location" : "New York")
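A minimal sketch using NLTK's built-in chunker (the maxent_ne_chunker and words resources may
need to be downloaded first; the chunker's own labels differ slightly from the illustration above, e.g.
GPE for locations):
from nltk import word_tokenize, pos_tag, ne_chunk

sentence = "Sergey Brin, the manager of Google Inc. is walking in the streets of New York."
print(ne_chunk(pos_tag(word_tokenize(sentence))))
# entities appear as labelled subtrees, e.g. (PERSON Sergey/NNP Brin/NNP),
# (ORGANIZATION Google/NNP Inc./NNP) and (GPE New/NNP York/NNP)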
3.Statistical Features
Text data can also be quantified directly into numbers using several techniques described below:
3.1. Term Frequency – Inverse Document Frequency (TF – IDF)
TF-IDF is a weighted model commonly used for information retrieval problems. It aims to convert text
documents into vector models on the basis of the occurrence of words in the documents, without
considering their exact ordering. For example, say there is a dataset of N text documents. For any
document "D", TF and IDF are defined as –
Term Frequency (TF) – TF for a term "t" is defined as the count of the term "t" in the document "D".
Inverse Document Frequency (IDF) – IDF for a term is defined as the logarithm of the ratio of the total
number of documents available in the corpus to the number of documents containing the term "t":
IDF(t) = log(N / count of documents containing t)
TF-IDF – The TF-IDF formula gives the relative importance of a term in a corpus (list of documents):
TF-IDF(t, D) = TF(t, D) × IDF(t)
The following code uses Python's scikit-learn package to convert text into TF-IDF vectors (see the
sketch below).
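A minimal sketch with an illustrative three-document corpus:
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["This is a sample document.",
          "another random document.",
          "third sample document text"]   # illustrative corpus

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
print(vectorizer.vocabulary_)   # word -> column index mapping
print(X)                        # rows of the form (document index, word index)  tf-idf value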
The model creates a vocabulary dictionary and assigns an index to each word. Each row in the
output contains a tuple (i,j) and a tf-idf value of word at index j in document i.
3.2 Frequency / Density / Readability Features
Frequency- or density-based features can also be used in models and analysis. These features might
seem trivial but can have a great impact on learning models. Some of these features are: word count,
sentence count, punctuation counts and industry-specific word counts. Other types of measures
include readability measures such as syllable counts, the SMOG index and Flesch reading ease.
The textstat library can be used to create such features (see the sketch below).
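A minimal sketch using the third-party textstat package (the sample text is illustrative and the scores
depend entirely on the input, so none are shown):
import textstat

sample = "Text analytics turns raw customer text into structured, analysable features."
print(textstat.syllable_count(sample))       # syllable count
print(textstat.smog_index(sample))           # SMOG readability index
print(textstat.flesch_reading_ease(sample))  # Flesch reading ease score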
4. Word Embedding (Text Vectors)
Word2Vec and GloVe are two popular models for creating word embeddings of text. These models
take a text corpus as input and produce word vectors as output.
The Word2Vec model is composed of a preprocessing module, a shallow neural network model called
Continuous Bag of Words (CBOW) and another shallow neural network model called skip-gram. These
models are widely used for many other NLP problems. Word2Vec first constructs a vocabulary from the
training corpus and then learns word embedding representations. The following code uses the gensim
package to prepare word embeddings as vectors (see the sketch below).
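A minimal sketch with a tiny illustrative corpus (real embeddings need far more text, and the vector
values printed below will differ from run to run):
from gensim.models import Word2Vec

sentences = [['data', 'science'], ['machine', 'learning'],
             ['deep', 'learning'], ['natural', 'language', 'processing']]   # illustrative corpus

model = Word2Vec(sentences, min_count=1)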
print(model.wv['learning'])
>>> [ 0.00459356  0.00303564 -0.00467622  0.00209638 ...]
Word vectors can be used as feature vectors for ML models, to measure text similarity using
cosine-similarity techniques, and for word clustering and text classification.
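For instance, the cosine similarity between two word vectors can be read directly from the trained
model (the value is not shown because it depends on the training data):
print(model.wv.similarity('machine', 'learning'))
# a float between -1 and 1; higher values mean the words occur in more similar contexts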
5.1 Text Classification
Text classification, in simple words, is a technique to systematically classify a text object
(a document or sentence) into one of a fixed set of categories. It is really helpful when the amount of
data is too large, especially for organizing, information filtering, and storage purposes.
A typical natural language classifier consists of two parts: (a) Training and (b) Prediction. First, the
text input is processed and features are created. The machine learning model then learns these
features and is used for prediction against new text.
Here is a sketch that uses the Naive Bayes classifier from the TextBlob library (built on top of NLTK).
The training and test corpora below are illustrative, so the actual outputs and the accuracy figure
depend on the data used.
from textblob.classifiers import NaiveBayesClassifier as NBC

# illustrative labelled corpora; real applications need far more data
training_corpus = [("Their customer support is amazing.", "Class_A"), ("I love their new interface.", "Class_A"),
                   ("I don't like their slow computer.", "Class_B"), ("Their service was a disappointment.", "Class_B")]
test_corpus = [("The product quality is amazing.", "Class_A"), ("Their delivery was terrible.", "Class_B")]

model = NBC(training_corpus)
print(model.classify("Their codes are amazing."))
>>> "Class_A"
print(model.classify("I don't like their computer."))
>>> "Class_B"
print(model.accuracy(test_corpus))
>>> 0.83
Text classification models are heavily dependent upon the quality and quantity of features; when
applying any machine learning model, it is always a good practice to include more and more training
data.
5.2 Text Matching / Similarity
One of the important areas of NLP is the matching of text objects to find similarities. Important
applications of text matching include automatic spelling correction, data de-duplication and
genome analysis.
A number of text matching techniques are available depending upon the requirement.
A. Levenshtein Distance – The Levenshtein distance between two strings is defined as the minimum
number of edits needed to transform one string into the other, with the allowable edit operations
being insertion, deletion, or substitution of a single character. Following is a memory-efficient
implementation (it keeps only one row of the edit-distance matrix at a time).
def levenshtein(s1, s2):
    if len(s1) > len(s2):
        s1, s2 = s2, s1
    distances = range(len(s1) + 1)
    for index2, char2 in enumerate(s2):
        newDistances = [index2 + 1]
        for index1, char1 in enumerate(s1):
            if char1 == char2:
                newDistances.append(distances[index1])
            else:
                newDistances.append(1 + min((distances[index1],
                                             distances[index1 + 1],
                                             newDistances[-1])))
        distances = newDistances
    return distances[-1]

print(levenshtein("analyze", "analysed"))
>>> 2
B. Phonetic Matching – A phonetic matching algorithm takes a keyword as input (a person's name, a
location name etc.) and produces a character string that identifies a set of words that are (roughly)
phonetically similar. It is very useful for searching large text corpora, correcting spelling errors and
matching relevant names. Soundex and Metaphone are two main phonetic algorithms used for this
purpose (a small sketch follows).
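A minimal sketch using the third-party jellyfish package, which provides Soundex and Metaphone
implementations (the choice of package is an assumption of this sketch; other phonetic libraries exist):
import jellyfish

# two spellings of the same name map to the same Soundex code
print(jellyfish.soundex("Robert"))
>>> R163
print(jellyfish.soundex("Rupert"))
>>> R163
print(jellyfish.metaphone("Robert"))   # Metaphone code for comparison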
C. Flexible String Matching – A complete text matching system includes different algorithms
pipelined together to compute a variety of text variations. Regular expressions are really helpful for
this purpose as well. Other common techniques include exact string matching, lemmatized
matching, and compact matching (which takes care of spaces, punctuation, slang etc.).
D. Cosine Similarity – When the text is represented in vector notation, a general cosine similarity
can be applied to measure vectorized similarity. The following code converts texts to
vectors (using term frequency) and applies cosine similarity to measure the closeness between two texts.
import math
from collections import Counter

def get_cosine(vec1, vec2):
    # dot product over the words common to both vectors
    common = set(vec1.keys()) & set(vec2.keys())
    numerator = sum([vec1[x] * vec2[x] for x in common])
    # product of the two vector magnitudes
    sum1 = sum([vec1[x] ** 2 for x in vec1.keys()])
    sum2 = sum([vec2[x] ** 2 for x in vec2.keys()])
    denominator = math.sqrt(sum1) * math.sqrt(sum2)
    if not denominator:
        return 0.0
    else:
        return float(numerator) / denominator

def text_to_vector(text):
    words = text.split()
    return Counter(words)

# illustrative input texts (assumed here, since the originals were not shown)
text1 = 'This is an article on analytics vidhya'
text2 = 'article on analytics vidhya is about natural language processing'

vector1 = text_to_vector(text1)
vector2 = text_to_vector(text2)
cosine = get_cosine(vector1, vector2)
print(cosine)
>>> 0.629940788348712