
Unit V NLP

Natural language processing (NLP) is the intersection of computer science, linguistics and machine
learning. The field focuses on communication between computers and humans in natural language
and NLP is all about making computers understand and generate human language.

Human language is special for several reasons.


• It is specifically constructed to convey the speaker/writer's meaning. It is a complex system.
• Human language is a symbolic, categorical signalling system. This means we can
convey the same meaning in different ways (e.g., speech, gesture, signs, etc.). The encoding
by the human brain is a continuous pattern of activation, and the symbols are
transmitted via continuous signals of sound and vision.
• Understanding human language is considered a difficult task due to its complexity.
o For example, there is an infinite number of different ways to arrange words in a
sentence.
o Also, words can have several meanings and contextual information is necessary to
correctly interpret sentences.
• Every language is more or less unique and ambiguous.
• Applications
o Sentiment Analysis
o Language Translation
o Dialog Systems / Chatbots
o Topic Modelling
o Text Generation
o Speech Recognition
o Autocorrect

SYNTACTIC AND SEMANTIC ANALYSIS


• Syntactic analysis and semantic analysis are the two primary techniques that lead to the
understanding of natural language.
• Language is a set of valid sentences.
• Syntax is the grammatical structure of the text, whereas semantics is the meaning being
conveyed.
• Syntactic analysis, also referred to as syntax analysis or parsing, is the process of analysing
natural language with the rules of a formal grammar. Grammatical rules are applied to
categories and groups of words, not individual words.
• Semantic analysis is the process of understanding the meaning and interpretation of words,
signs and sentence structure. This lets computers partly understand natural language the
way humans do.
• For example, speech recognition works almost flawlessly, but we still lack this kind of
proficiency in natural language understanding. Our phone understands what we have said,
but it doesn't understand the meaning behind it. For example – "The boy radiated fire-like
vibes." Did the boy have a very motivating personality, or did he actually radiate fire?

Advantages of NLP
o NLP helps users to ask questions about any subject and get a direct response within seconds.
o NLP offers exact answers to a question; it does not return unnecessary or unwanted
information.
o NLP helps computers to communicate with humans in their languages.
o It is very time efficient.
o Many companies use NLP to improve the efficiency and accuracy of documentation
processes and to identify information in large databases.

Disadvantages of NLP
o NLP may not capture context.
o NLP output can be unpredictable.
o NLP may require more keystrokes.
o NLP systems adapt poorly to new domains; they are typically built for a single, specific
task.

Components of NLP
There are the following two components of NLP -
1. Natural Language Understanding (NLU)
Natural Language Understanding (NLU) helps the machine to understand and analyse human
language by extracting the metadata from content such as concepts, entities, keywords, emotion,
relations, and semantic roles.
NLU is mainly used in business applications to understand the customer's problem in both spoken
and written language.
NLU involves the following tasks -
o Mapping the given input into a useful representation.
o Analysing different aspects of the language.

2. Natural Language Generation (NLG)


Natural Language Generation (NLG) acts as a translator that converts the computerized data into
natural language representation. It mainly involves Text planning, Sentence planning, and Text
Realization.
• Text planning − It includes retrieving the relevant content from the knowledge base.
• Sentence planning − It includes choosing the required words, forming meaningful phrases,
and setting the tone of the sentence.
• Text Realization − It is the mapping of the sentence plan into sentence structure.

NLU vs NLG

NLU:
• NLU is the process of reading and interpreting language. It uses syntactic analysis and semantic analysis.
• It produces non-linguistic outputs from natural language inputs.
• NLU reads and makes sense of natural language.
• Examples – Automatic Ticket Routing (an example of customer service automation), Machine Translation,
Automated Reasoning (a subfield of cognitive science, e.g. inference about a medical diagnosis),
Question Answering (e.g. Google Assistant).

NLG:
• NLG is the process of writing or generating language.
• It constructs natural language outputs from non-linguistic inputs.
• NLG creates and outputs structured language.
• Examples – GPT-3 (a language model developed by OpenAI), Long Short-Term Memory networks
(which mimic how human brains work, remembering previous inputs and the contexts of sentences).

PHASES IN NLP

• Lexical Analysis/Morphological Analysis − It involves identifying and analysing the structure
of words. The lexicon (also called the vocabulary) of a language is the collection of words,
phrases or symbols in that language.
Lexical analysis divides the whole chunk of text into paragraphs, sentences, and words.
o It looks for morphemes, the smallest unit of a word.
o For example, irrationally can be broken into ir (prefix), rational (root) and -ly (suffix).
Lexical Analysis finds the relation between these morphemes and converts the word
into its root form. A lexical analyser also assigns the possible Part-Of-Speech (POS) to
the word. It takes into consideration the dictionary of the language.
• Syntactic Analysis (Parsing) − It involves analysing the words in a sentence for grammar and
arranging them in a manner that shows the relationships among the words. A sentence
such as "Rise in sun the east." is rejected by an English syntactic analyser.
• Semantic Analysis − It draws the exact meaning or the dictionary meaning from the text.
The text is checked for meaningfulness. It is done by mapping syntactic structures and
objects in the task domain. The semantic analyser disregards sentences such as "hot ice-cream".
• Discourse Integration − The meaning of any sentence depends upon the meaning of the
sentence just before it. It also influences the meaning of the immediately succeeding
sentence.
o In the text, “Jack is a bright student. He spends most of the time in the library.”
Here, discourse assigns “he” to refer to “Jack”.
• Pragmatic Analysis –
o It is the study of contextual meaning.
o Pragmatic analysis is crucial in NLP because it can help computers understand the
meaning of a text beyond the surface level.
o It can help computers identify sarcasm, irony, metaphors, and other figurative
language devices that are commonly used in natural language.
o It helps users to discover the intended effect of a text by applying a set of rules that
characterize cooperative dialogues
o It understands context to know what the text aims to achieve.
o Eg. A teacher says “What time do you call this?” to a student who is coming late.

WHY NLP IS DIFFICULT?


1. Ambiguities
• Lexical Ambiguity
• Lexical ambiguity exists when a single word has two or more possible meanings
within a sentence.
• Eg. Give me the bat.
In the above example, the word bat may refer to a cricket bat, a baseball
bat, or the flying nocturnal animal.
• Syntactic Ambiguity occurs when the grammatical structure of a sentence is unclear
or open to multiple interpretations. It arises from the way words are arranged and
combined to form a sentence, rather than from the meanings of the individual
words themselves.
• Eg. The cow was found by a stream by a farmer.
• Semantic Ambiguity Semantic ambiguity occurs when a word, phrase, or sentence
has more than one possible meaning. It arises from the meaning of the words
themselves, rather than from their grammatical structure.
• Eg. Mallika gave a cake to the children.
• Anaphora Ambiguity occurs when a pronoun or referring expression can point to more
than one antecedent.
• Eg. The horse ran up the hill. It was very steep. It soon got tired. (Here it is
unclear whether each "It" refers to the horse or the hill.)
2. Contextual words, phrases, homophones, homonyms, etc.
• I ran to the store because we ran out of milk.
• Can I run something past you real quick?
• Homophones – Eg. Carrot, caret
• Homonyms – Eg. Bark, current
3. Synonyms
4. Irony and Sarcasm
5. Errors in Text and Speech
6. Colloquialisms and Slang

TEXT PREPROCESSING
• Since text is the most unstructured form of all available data, various types of noise are
present in it, and it is not readily analysable without pre-processing. The entire
process of cleaning and standardizing text, making it noise-free and ready for analysis, is
known as text pre-processing.
• Text processing refers to only the analysis, manipulation, and generation of text.
• For example, a simple sentiment analysis would require a machine learning model to look
for instances of positive or negative sentiment words, which could be provided to the model
beforehand. This would be text processing, since the model isn’t understanding the words,
it’s just looking for words that it was programmed to look for.
• Importance of Text Pre-processing - Since our interactions with brands have become
increasingly online and text-based, text data is one of the most important ways for
companies to derive business insights. Text data can show a business how their customers
search, buy, and interact with their brand, products, and competitors online. Text processing
with machine learning allows enterprises to handle these large amounts of text data.
• It predominantly comprises three steps:
o Noise Removal
o Lexicon Normalization
o Object Standardization

Noise Removal
• Any piece of text which is not relevant to the context of the data and the end-output can be
specified as the noise.
• For example – language stop words (commonly used words of a language – is, am, the, of, in
etc), URLs or links, social media entities (mentions, hash tags), punctuations and industry
specific words. This step deals with removal of all types of noisy entities present in the text.
• A general approach for noise removal is to prepare a dictionary of noisy entities, and iterate
the text object by tokens (or by words), eliminating those tokens which are present in the
noise dictionary.
Following is the python code for the same purpose.

noise_list = ["is", "a", "this", "the", "..."]

def _remove_noise(input_text):
    words = input_text.split()
    noise_free_words = [word for word in words if word not in noise_list]
    noise_free_text = " ".join(noise_free_words)
    return noise_free_text

print(_remove_noise("this is a sample text"))
>>> sample text

• Another approach is to use the regular expressions while dealing with special patterns of
noise.
• Regular expression is a sequence of character(s) mainly used to find and replace patterns in
a string or file.
• Regular expressions use two types of characters:
a) Meta characters: As the name suggests, these characters have a special meaning,
similar to * in wild card.

META CHARACTER DESCRIPTION EXAMPLE


[] A set of characters “[a-m]”
. Any character “he..o”
^ Starts with “^python”
$ Ends with “programming$”

b) Literals (like a, b, 1, 2…)


In Python, we have module “re” that helps with regular expressions.
The most common uses of regular expressions are:
re.search(pattern, string):
It is similar to match() but it doesn’t restrict us to find matches at the beginning of the string
only.
import re
result = re.search('to', 'Welcome to Python Lab with ML')
print (result)
>>> <re.Match object; span=(8, 10), match='to'>

Here the search() method can find the pattern at any position in the string, but it returns only
the first occurrence of the search pattern.

re.findall (pattern, string):


It helps to get a list of all matching patterns. It has no constraints of searching from start or end.
result = re.findall('to', 'Welcome to Python Lab with ML To clear our Doubts')
print (result)
>>>['to']

result = re.findall('to', 'Welcome to Python Lab with ML To clear our Doubts', re.IGNORECASE)
print (result)
>>> ['to', 'To']

result = re.findall('[a-m]', 'Welcome to Python Lab with ML to clear our Doubts')


print (result)
>>> ['e', 'l', 'c', 'm', 'e', 'h', 'a', 'b', 'i', 'h', 'c', 'l', 'e', 'a', 'b']

result = re.findall('[a-m]+', 'Welcome to Python Lab with ML to clear our Doubts')


print (result)
>>> ['elc', 'me', 'h', 'ab', 'i', 'h', 'clea', 'b']

result = re.findall("Py..o", 'Welcome to Python Lab with ML to clear our Doubts')


print(result)
>>> ['Pytho']

result = re.findall("py..o", 'Welcome to Python Lab with ML to clear our Doubts')


print(result)
>>>[]

result = re.findall("py..o", 'Welcome to Python Lab with ML to clear our Doubts',


re.IGNORECASE)
print(result)
>>>['Pytho']

result = re.findall("^Welcome", 'Welcome to Python Lab with ML to clear our Doubts')


print(result)
>>>['Welcome']

result = re.findall("Doubts$", 'Welcome to Python Lab with ML to clear our Doubts')


print(result)
>>>['Doubts']
re.split(pattern, string):
This method helps to split string by the occurrences of given pattern.
result=re.split(' ', 'Welcome to Python Lab with ML')
print (result)
>>>['Welcome', 'to', 'Python', 'Lab', 'with', 'ML']

re.sub(pattern, repl, string):


It helps to search a pattern and replace with a new sub string. If the pattern is
not found, string is returned unchanged.
result =re.sub('Lab', 'Data Science Lab', 'Welcome to Python Lab with ML to clear our Doubts')
print(result)
>>>Welcome to Python Data Science Lab with ML to clear our Doubts

Lexicon Normalization
• Another type of textual noise is about the multiple representations exhibited by single word.
• For example – "play", "player", "played", "plays" and "playing" are different variations of
the word "play". Though their surface forms differ, contextually they are all similar. This step
converts all the variants of a word into their normalized form (also known as the lemma).
• Normalization is a pivotal step for feature engineering with text as it converts the high
dimensional features (N different features) to the low dimensional space (1 feature), which
is an ideal ask for any ML model.
• The most common lexicon normalization practices are:
o Stemming: Stemming is a rudimentary rule-based process of removing the suffixes
(“ing”, “ly”, “es”, “s”, ”ious” etc) from a word.
o Lemmatization: Lemmatization, on the other hand, is an organized & step by step
procedure of obtaining the root form of the word, it makes use of vocabulary (dictionary
importance of words) and morphological analysis (word structure and grammar
relations).
Below is sample code that performs stemming using Python's popular library – NLTK;
a lemmatization sketch follows the output.
from nltk.stem import PorterStemmer

e_words = ["wait", "waiting", "waited", "waits"]
#e_words = ["multiples", "multiplying", "multiplication", "multiply"]
ps = PorterStemmer()
for w in e_words:
    rootWord = ps.stem(w)
    print(rootWord)

wait
wait
wait
wait

(output if the commented list of "multiply" variants is used instead:)
multipl
multipli
multipl
multipli
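
For lemmatization, a minimal sketch using NLTK's WordNetLemmatizer is given below (it assumes the
WordNet corpus has been downloaded, e.g. via nltk.download('wordnet')); unlike the stemmer, it returns
valid dictionary words.

from nltk.stem import WordNetLemmatizer

lem = WordNetLemmatizer()
print(lem.lemmatize("waited", pos="v"))       # wait
print(lem.lemmatize("multiplying", pos="v"))  # multiply
print(lem.lemmatize("multiples", pos="n"))    # multiple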

Object Standardization
• Text data often contains words or phrases which are not present in any standard lexical
dictionaries. These pieces are not recognized by search engines and models.
• Some of the examples are – acronyms, hashtags with attached words, and colloquial
slangs. With the help of regular expressions and manually prepared data dictionaries,
this type of noise can be fixed.
• The code below uses a dictionary lookup method to replace social media slangs from a
text.

lookup_dict = {'rt':'Retweet', 'dm':'direct message', "awsm": "awesome"}

def lookup_words(input_text):
    words = input_text.split()
    new_words = []
    for word in words:
        if word.lower() in lookup_dict:
            word = lookup_dict[word.lower()]
        new_words.append(word)
    new_text = " ".join(new_words)
    return new_text

print(lookup_words("RT this is a retweeted tweet by Shivam Bansal"))


>>>Retweet
print(lookup_words("dm is the message of Princy Goyal"))
>>>direct message

TEXT TO FEATURES (FEATURE ENGINEERING ON TEXT DATA)


To analyse pre-processed data, it needs to be converted into features. Depending upon the usage,
text features can be constructed using assorted techniques – syntactical parsing, entities / N-grams /
word-based features, statistical features, and word embeddings.
1. Syntactical Parsing
Syntactical parsing involves the analysis of words in the sentence for grammar and their
arrangement in a manner that shows the relationships among the words. Dependency Grammar and
Part of Speech tags are the important attributes of text syntactics.

1.1 Dependency Trees – Sentences are composed of words sewn together. The relationships
among the words in a sentence are determined by the basic dependency grammar. Dependency
grammar is a class of syntactic text analysis that deals with (labelled) asymmetrical binary relations
between two lexical items (words). Every relation can be represented in the form of a triplet
(relation, governor, dependent). For example, consider the sentence – "Bills on ports and
immigration were submitted by Senator Brownback, Republican of Kansas." The relationships among
the words can be represented as a dependency tree.

The tree shows that "submitted" is the root word of this sentence, linked by two sub-trees
(the subject and object subtrees). Each subtree is itself a dependency tree, with relations such as
("Bills" <-> "ports" by a "preposition" relation) and ("ports" <-> "immigration" by a "conjunction"
relation).
This type of tree, when parsed recursively in a top-down manner, gives grammar-relation triplets as
output, which can be used as features for many NLP problems such as entity-wise sentiment analysis,
actor and entity identification, and text classification. The Python wrapper StanfordCoreNLP and NLTK
dependency grammars can be used to generate dependency trees.
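
As an illustration (not part of the original notes), a rough sketch of extracting such (relation,
governor, dependent) triplets with spaCy is shown below; it assumes the en_core_web_sm model has been
installed (python -m spacy download en_core_web_sm).

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Bills on ports and immigration were submitted by Senator Brownback.")
for token in doc:
    # token.dep_ is the relation label, token.head is the governor, the token itself is the dependent
    print((token.dep_, token.head.text, token.text))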

1.2 Part of speech tagging – Apart from grammar relations, every word in a sentence is also
associated with a part-of-speech (POS) tag (noun, verb, adjective, adverb, etc.). The POS tags
define the usage and function of a word in the sentence. The following code using NLTK performs
POS-tagging annotation on input text. (NLTK provides several implementations; the default one is
the perceptron tagger.)

from nltk import word_tokenize, pos_tag

text = "I am learning Natural Language Processing"
tokens = word_tokenize(text)
print(pos_tag(tokens))
>>> [('I', 'PRP'), ('am', 'VBP'), ('learning', 'VBG'), ('Natural', 'NNP'), ('Language', 'NNP'),
('Processing', 'NNP')]

Part of Speech tagging is used for many important purposes in NLP:


A. Word sense disambiguation: Some language words have multiple meanings according to their
usage. For example, in the two sentences below:
I. “Please book my flight for Delhi”
II. “I am going to read this book in the flight”
"Book" is used in different contexts, and its part-of-speech tag differs in the two cases: in
sentence I, the word "book" is used as a verb, while in II it is used as a noun.

B. Improving word-based features: A word used as a feature can appear in different contexts; if the
part-of-speech tag is linked with it, the context is preserved, making stronger features. For
example:
Sentence -“book my flight, I will read this book”
Tokens – (“book”, 2), (“my”, 1), (“flight”, 1), (“I”, 1), (“will”, 1), (“read”, 1), (“this”, 1)
Tokens with POS – (“book_VB”, 1), (“my_PRP$”, 1), (“flight_NN”, 1), (“I_PRP”, 1), (“will_MD”, 1),
(“read_VB”, 1), (“this_DT”, 1), (“book_NN”, 1)

C. Normalization and Lemmatization: POS tags are the basis of lemmatization process for converting
a word to its base form (lemma).

D. Efficient stopword removal: POS tags are also useful in efficient removal of stopwords.
For example, some tags always mark the low-frequency / less important words of a
language, e.g. (IN – "within", "upon", "except"), (CD – "one", "two", "hundred"), (MD –
"may", "must", etc.). A small sketch of such tag-based filtering is given below.
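
This is only an illustrative sketch (the sentence and tag set are made up from the examples above);
it drops tokens whose NLTK tags are IN, CD or MD, and assumes the punkt tokenizer and the averaged
perceptron tagger data have been downloaded via nltk.download().

from nltk import word_tokenize, pos_tag

drop_tags = {"IN", "CD", "MD"}   # prepositions, cardinal numbers, modals
tagged = pos_tag(word_tokenize("I must book one flight within a week"))
filtered = [word for word, tag in tagged if tag not in drop_tags]
print(filtered)   # e.g. ['I', 'book', 'flight', 'a', 'week']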

2.Entity Parsing (Extraction)


Entities are defined as the most important chunks of a sentence – noun phrases, verb phrases or
both. Entity Detection algorithms are generally ensemble models of rule based parsing, dictionary
lookups, pos tagging and dependency parsing.
Topic Modelling & Named Entity Recognition are the two key entity detection methods in NLP.
2.1. Named Entity Recognition (NER)
The process of detecting named entities such as person names, location names, company names
etc. from the text is called NER. For example:

Sentence – Sergey Brin, the manager of Google Inc. is walking in the streets of New York.

Named Entities – (“person” : “Sergey Brin” ), (“org” : “Google Inc.”), (“location” : “New York”)

A typical NER model consists of three blocks:


Noun phrase identification: This step deals with extracting all the noun phrases from a text using
dependency parsing and part of speech tagging.
Phrase classification: This is the classification step, in which all the extracted noun phrases are
classified into their respective categories (locations, names, etc.). The Google Maps API provides a
good way to disambiguate locations; open databases such as DBpedia and Wikipedia can then be used to
identify person names or company names.
Entity disambiguation: Sometimes entities are misclassified, so creating a validation layer on top of
the results is useful. Knowledge graphs can be exploited for this purpose. Popular knowledge graphs
are Google Knowledge Graph, IBM Watson and Wikipedia.
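
As a quick illustration (an assumption, not part of the original notes), the example sentence above
can be run through spaCy's pre-trained NER model (requires the en_core_web_sm model):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Sergey Brin, the manager of Google Inc. is walking in the streets of New York.")
for ent in doc.ents:
    print(ent.text, "->", ent.label_)
# roughly: Sergey Brin -> PERSON, Google Inc. -> ORG, New York -> GPE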

2.2 Topic Modeling


Topic modeling is the process of automatically identifying the topics present in a text corpus; it
derives the hidden patterns among the words in the corpus in an unsupervised manner. Topics are
defined as "a repeating pattern of co-occurring terms in a corpus". A good topic model results in
"health", "doctor", "patient", "hospital" for a topic Healthcare, and "farm", "crops", "wheat" for a
topic Farming.
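
A tiny sketch using gensim's LDA model is shown below; the four-document corpus is made up purely to
illustrate the API, and is far too small to yield meaningful topics.

from gensim import corpora, models

docs = [["doctor", "patient", "hospital", "health"],
        ["farm", "crops", "wheat", "harvest"],
        ["patient", "doctor", "health", "hospital"],
        ["wheat", "farm", "harvest", "crops"]]

dictionary = corpora.Dictionary(docs)                   # word <-> id mapping
bow_corpus = [dictionary.doc2bow(doc) for doc in docs]  # bag-of-words vectors
lda = models.LdaModel(bow_corpus, num_topics=2, id2word=dictionary, passes=10)
for topic in lda.print_topics():
    print(topic)
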
2.3 N-Grams as Features
A combination of N words together is called an N-gram. N-grams (N > 1) are generally more
informative than single words (unigrams) as features, and bigrams (N = 2) are often considered
the most important of all. The following code generates the bigrams of a text.
def generate_ngrams(text, n):
    words = text.split()
    output = []
    for i in range(len(words)-n+1):
        output.append(words[i:i+n])
    return output

>>> generate_ngrams('this is a sample text', 2)

# [['this', 'is'], ['is', 'a'], ['a', 'sample'], ['sample', 'text']]
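
The same bigrams can also be produced with NLTK's ngrams helper (returned as tuples rather than lists):

from nltk.util import ngrams

print(list(ngrams('this is a sample text'.split(), 2)))
# [('this', 'is'), ('is', 'a'), ('a', 'sample'), ('sample', 'text')]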

3.Statistical Features
Text data can also be quantified directly into numbers using several techniques described below:
3.1. Term Frequency – Inverse Document Frequency (TF – IDF)
TF-IDF is a weighted model commonly used for information retrieval problems. It aims to convert text
documents into vector models on the basis of the occurrence of words in the documents, without
considering their exact ordering. For example, say there is a dataset of N text documents. For any
document "D", TF and IDF are defined as follows:
Term Frequency (TF) – TF for a term "t" is defined as the count of the term "t" in the document "D".
Inverse Document Frequency (IDF) – IDF for a term is defined as the logarithm of the ratio of the
total number of documents in the corpus to the number of documents containing the term t:
IDF(t) = log(N / n_t).
TF-IDF – The TF-IDF score gives the relative importance of a term in a corpus (list of documents) and
is the product of the two: TF-IDF(t, D) = TF(t, D) * IDF(t). Following is the code using Python's
scikit-learn package to convert a text into TF-IDF vectors:

from sklearn.feature_extraction.text import TfidfVectorizer

obj = TfidfVectorizer()
corpus = ['This is sample document.', 'another random document.', 'third sample document text']
X = obj.fit_transform(corpus)
print(X)
>>>
(0, 1) 0.345205016865
(0, 4) ... 0.444514311537
(2, 1) 0.345205016865
(2, 4) 0.444514311537

The model creates a vocabulary dictionary and assigns an index to each word. Each row in the
output contains a tuple (i,j) and a tf-idf value of word at index j in document i.
3.2 Frequency / Density / Readability Features
Frequency or density based features can also be used in models and analysis. These features might
seem trivial but can have a great impact on learning models. Some of the features are: word count,
sentence count, punctuation counts and industry-specific word counts. Other types of measures
include readability measures such as syllable counts, the SMOG index and Flesch reading ease.
The textstat library can be used to create such features, as sketched below.
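
A quick sketch using textstat (assuming the library is installed, e.g. pip install textstat):

import textstat

text = "Natural language processing helps computers understand human language."
print(textstat.syllable_count(text))
print(textstat.smog_index(text))
print(textstat.flesch_reading_ease(text))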

4. Word Embedding (text vectors)


Word embedding is the modern way of representing words as vectors. The aim of word embedding
is to redefine the high dimensional word features into low dimensional feature vectors by preserving
the contextual similarity in the corpus. They are widely used in deep learning models such as
Convolutional Neural Networks and Recurrent Neural Networks.

Word2Vec and GloVe are the two popular models used to create word embeddings of a text. These
models take a text corpus as input and produce the word vectors as output.

The Word2Vec model is composed of a preprocessing module, a shallow neural network model called
Continuous Bag of Words (CBOW) and another shallow neural network model called skip-gram. These
models are widely used for many other NLP problems. Word2Vec first constructs a vocabulary from the
training corpus and then learns the word embedding representations. The following code using the
gensim package prepares word embeddings as vectors.

from gensim.models import Word2Vec

sentences = [['data', 'science'], ['vidhya', 'science', 'data', 'analytics'],
             ['machine', 'learning'], ['deep', 'learning']]

# train the model on your corpus
model = Word2Vec(sentences, min_count=1)

# in recent gensim versions, vectors and similarities are accessed via model.wv
print(model.wv.similarity('data', 'science'))
>>> 0.11222489293

print(model.wv['learning'])
>>> array([ 0.00459356  0.00303564 -0.00467622  0.00209638, ...])

(The exact values vary between runs, since the vectors are randomly initialised.)

They can be used as feature vectors for ML models, to measure text similarity with cosine
similarity, and for word clustering and text classification.

5.Important Tasks of NLP


5.1 Text Classification
Text classification is one of the classical problems of NLP. Well-known examples include email spam
identification, topic classification of news, sentiment classification and the organization of web
pages by search engines.

Text classification, in simple words, is a technique to systematically classify a text object
(document or sentence) into one of a fixed set of categories. It is really helpful when the amount of
data is too large, especially for organizing, information filtering, and storage purposes.
A typical natural language classifier consists of two parts: (a) training and (b) prediction. First
the text input is processed and features are created; the machine learning model then learns these
features and is used for predicting against new text.

Here is code that trains a Naive Bayes classifier using the TextBlob library (built on top of NLTK).

from textblob.classifiers import NaiveBayesClassifier as NBC


from textblob import TextBlob
training_corpus = [
('I am exhausted of this work.', 'Class_B'),
("I can't cooperate with this", 'Class_B'),
('He is my badest enemy!', 'Class_B'),
('My management is poor.', 'Class_B'),
('I love this burger.', 'Class_A'),
('This is an brilliant place!', 'Class_A'),
('I feel very good about these dates.', 'Class_A'),
('This is my best work.', 'Class_A'),
("What an awesome view", 'Class_A'),
('I do not like this dish', 'Class_B')]
test_corpus = [
("I am not feeling well today.", 'Class_B'),
("I feel brilliant!", 'Class_A'),
('Gary is a friend of mine.', 'Class_A'),
("I can't believe I'm doing this.", 'Class_B'),
('The date was good.', 'Class_A'), ('I do not enjoy my job', 'Class_B')]

model = NBC(training_corpus)
print(model.classify("Their codes are amazing."))
>>> "Class_A"
print(model.classify("I don't like their computer."))
>>> "Class_B"
print(model.accuracy(test_corpus))
>>> 0.83

Text classification models depend heavily on the quality and quantity of features; when applying any
machine learning model, it is always good practice to include more and more training data.
5.2 Text Matching / Similarity
One of the important areas of NLP is the matching of text objects to find similarities. Important
applications of text matching include automatic spelling correction, data de-duplication and
genome analysis.
A number of text matching techniques are available depending upon the requirement.
A. Levenshtein Distance – The Levenshtein distance between two strings is defined as the minimum
number of edits needed to transform one string into the other, with the allowable edit operations
being insertion, deletion, or substitution of a single character. Following is the implementation for
efficient memory computations.

def levenshtein(s1, s2):
    if len(s1) > len(s2):
        s1, s2 = s2, s1
    distances = range(len(s1) + 1)
    for index2, char2 in enumerate(s2):
        newDistances = [index2 + 1]
        for index1, char1 in enumerate(s1):
            if char1 == char2:
                newDistances.append(distances[index1])
            else:
                newDistances.append(1 + min((distances[index1],
                                             distances[index1 + 1],
                                             newDistances[-1])))
        distances = newDistances
    return distances[-1]

print(levenshtein("analyze","analysed"))
>>>2

B. Phonetic Matching – A phonetic matching algorithm takes a keyword as input (a person's name, a
location name, etc.) and produces a character string that identifies a set of words that are (roughly)
phonetically similar. It is very useful for searching large text corpora, correcting spelling errors
and matching relevant names. Soundex and Metaphone are two main phonetic algorithms used for this
purpose.
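
As an illustration (assuming the third-party jellyfish library is installed), both algorithms are
available off the shelf:

import jellyfish

print(jellyfish.soundex("Robert"))   # R163
print(jellyfish.soundex("Rupert"))   # R163; same code, so the two names match phonetically
print(jellyfish.metaphone("Thompson"))
print(jellyfish.metaphone("Thomson"))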

C. Flexible String Matching – A complete text matching system includes different algorithms
pipelined together to compute a variety of text variations. Regular expressions are really helpful
for this purpose as well. Other common techniques include exact string matching, lemmatized
matching, and compact matching (which takes care of spaces, punctuation, slang, etc.). A simple
illustration of approximate matching with Python's built-in difflib follows.
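
This is only one simple way to score approximate matches, not the specific pipeline described above:

from difflib import SequenceMatcher

print(SequenceMatcher(None, "analyze", "analysed").ratio())   # 0.8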

D. Cosine Similarity – When the text is represented in vector notation, a general cosine similarity
can be applied to measure vectorized similarity. The following code converts texts to
vectors (using term frequency) and applies cosine similarity to measure the closeness of two texts.

import math
from collections import Counter

def get_cosine(vec1, vec2):
    common = set(vec1.keys()) & set(vec2.keys())
    numerator = sum([vec1[x] * vec2[x] for x in common])

    sum1 = sum([vec1[x]**2 for x in vec1.keys()])
    sum2 = sum([vec2[x]**2 for x in vec2.keys()])
    denominator = math.sqrt(sum1) * math.sqrt(sum2)

    if not denominator:
        return 0.0
    else:
        return float(numerator) / denominator

def text_to_vector(text):
    words = text.split()
    return Counter(words)

text1 = 'This is an article on analytics vidhya'


text2 = 'article on analytics vidhya is about natural language processing'

vector1 = text_to_vector(text1)
vector2 = text_to_vector(text2)
cosine = get_cosine(vector1, vector2)
print(cosine)
>>> 0.629940788348712

5.3 Coreference Resolution


Coreference resolution is the process of finding relational links among the words (or phrases)
within sentences. Consider an example sentence: "Donald went to John's office to see the new table.
He looked at it for an hour."
Humans can quickly figure out that “he” denotes Donald (and not John), and that “it” denotes the
table (and not John’s office). Coreference Resolution is the component of NLP that does this job
automatically. It is used in document summarization, question answering, and information
extraction.

5.4 Other NLP problems / tasks


• Text Summarization – Given a text article or paragraph, summarize it automatically to produce most
important and relevant sentences in order.
• Machine Translation – Automatically translate text from one human language to another by taking
care of grammar, semantics and information about the real world, etc.
• Natural Language Generation and Understanding – Converting information from computer databases
or semantic intents into readable human language is called language generation. Converting chunks
of text into more logical structures that are easier for computer programs to manipulate is called
language understanding.
• Optical Character Recognition – Given an image representing printed text, determine the
corresponding text.
• Document to Information – This involves parsing of textual data present in documents (websites,
files, pdfs and images) into an analysable and clean format.

Important Libraries for NLP (python)

• Scikit-learn: Machine learning in Python


• Natural Language Toolkit (NLTK): The complete toolkit for all NLP techniques.
• Pattern – A web mining module with tools for NLP and machine learning.
• TextBlob – Easy-to-use NLP tools API, built on top of NLTK and Pattern.
• spaCy – Industrial-strength NLP with Python and Cython.
• Gensim – Topic Modelling
• Stanford Core NLP – NLP services and packages by Stanford NLP Group.
