
Natural Language Processing

Introduction
According to industry estimates, only 21% of the available data is present in structured form. Data is being generated as we speak, as we tweet, and as we send messages on WhatsApp, among various other activities. The majority of this data exists in textual form, which is highly unstructured in nature.

A few notable examples include tweets and posts on social media, user-to-user chat conversations, news, blogs and articles, product or service reviews, and patient records in the healthcare sector. More recent sources include chatbots and other voice-driven bots.

Despite its high dimensionality, the information present in text data is not directly accessible unless it is processed (read and understood) manually or analyzed by an automated system.

In order to produce significant and actionable insights from text data, it is important to get acquainted with the techniques and principles of Natural Language Processing (NLP).

So, if you plan to create chatbots this year, or you want to use the power of unstructured text, this guide is the right starting point. This guide unearths the concepts of natural language processing, its techniques and implementation. The aim of the article is to teach the concepts of natural language processing and apply them to a real data set. Moreover, we also have a video-based course on NLP with 3 real-life projects.

Overview

 Complete guide on natural language processing (NLP) in Python

 Learn various techniques for implementing NLP including parsing & text processing

 Understand how to use NLP for text feature engineering


Table of Contents

1. Introduction to NLP

2. Text Preprocessing

o Noise Removal

o Lexicon Normalization

 Lemmatization

 Stemming

o Object Standardization

3. Text to Features (Feature Engineering on text data)

o Syntactical Parsing

 Dependency Grammar

 Part of Speech Tagging

o Entity Parsing

 Phrase Detection

 Named Entity Recognition

 Topic Modelling

 N-Grams

o Statistical features

 TF – IDF

 Frequency / Density Features

 Readability Features

o Word Embeddings

4. Important tasks of NLP


o Text Classification

o Text Matching

 Levenshtein Distance

 Phonetic Matching

 Flexible String Matching

o Coreference Resolution

o Other Problems

5. Important NLP libraries


1. Introduction to Natural Language Processing

NLP is a branch of data science that consists of systematic processes for analyzing, understanding, and deriving information from text data in a smart and efficient manner. By utilizing NLP and its components, one can organize massive chunks of text data, perform numerous automated tasks and solve a wide range of problems such as automatic summarization, machine translation, named entity recognition, relationship extraction, sentiment analysis, speech recognition, and topic segmentation.

Before moving further, I would like to explain some terms that are used in the article:

 Tokenization – process of converting a text into tokens

 Tokens – words or entities present in the text

 Text object – a sentence or a phrase or a word or an article

Steps to install NLTK and its data:

Install Pip: run in terminal:

sudo easy_install pip

Install NLTK: run in terminal :

sudo pip install -U nltk

Download NLTK data: run python shell (in terminal) and write the following code:

import nltk

nltk.download()

Follow the instructions on screen and download the desired package or collection. Other libraries

can be directly installed using pip.


2. Text Preprocessing
Since text is the most unstructured form of all the available data, various types of noise are present in it, and the data is not readily analyzable without any pre-processing. The entire process of cleaning and standardizing text, making it noise-free and ready for analysis, is known as text preprocessing.

It predominantly consists of three steps:

 Noise Removal

 Lexicon Normalization

 Object Standardization

The following image shows the architecture of the text preprocessing pipeline.

2.1 Noise Removal


Any piece of text which is not relevant to the context of the data and the end output can be treated as noise.

For example: language stopwords (commonly used words of a language such as is, am, the, of, in), URLs or links, social media entities (mentions, hashtags), punctuation and industry-specific words. This step deals with the removal of all types of noisy entities present in the text.

A general approach for noise removal is to prepare a dictionary of noisy entities and iterate over the text object token by token (or word by word), eliminating the tokens which are present in the noise dictionary.

Following is the Python code for the same purpose.

Python Code:
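The article's own snippet for this dictionary-lookup approach is not reproduced here, so the following is a minimal sketch of the idea; the noise list and sample sentence are illustrative placeholders.

```
# A minimal sketch of dictionary-based noise removal (illustrative noise list).
noise_list = ["is", "a", "this", "the"]

def _remove_noise(input_text):
    words = input_text.split()
    # keep only the tokens that are absent from the noise dictionary
    noise_free_words = [word for word in words if word.lower() not in noise_list]
    return " ".join(noise_free_words)

print(_remove_noise("this is a sample text"))
# >>> "sample text"
```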
Another approach is to use regular expressions when dealing with special patterns of noise. We have explained regular expressions in detail in one of our previous articles. The following Python code removes a regex pattern from the input text:

```
# Sample code to remove a regex pattern
import re

def _remove_regex(input_text, regex_pattern):
    urls = re.finditer(regex_pattern, input_text)
    for i in urls:
        input_text = re.sub(i.group().strip(), '', input_text)
    return input_text

regex_pattern = r"#[\w]*"

print(_remove_regex("remove this #hashtag from analytics vidhya", regex_pattern))
>>> "remove this from analytics vidhya"
```

2.2 Lexicon Normalization


Another type of textual noise is the multiple representations exhibited by a single word.

For example, “play”, “player”, “played”, “plays” and “playing” are different variations of the word “play”. Though they look different, contextually they are all similar. This step converts all the disparities of a word into their normalized form (also known as the lemma). Normalization is a pivotal step for feature engineering with text as it converts high-dimensional features (N different features) into a low-dimensional space (1 feature), which is ideal for any ML model.

The most common lexicon normalization practices are:


 Stemming: Stemming is a rudimentary rule-based process of stripping suffixes (“ing”, “ly”, “es”, “s” etc) from a word.

 Lemmatization: Lemmatization, on the other hand, is an organized, step-by-step procedure for obtaining the root form of a word; it makes use of vocabulary (the dictionary meaning of words) and morphological analysis (word structure and grammar relations).

Below is sample code that performs lemmatization and stemming using Python's popular NLP library, NLTK.

```
from nltk.stem.wordnet import WordNetLemmatizer
lem = WordNetLemmatizer()

from nltk.stem.porter import PorterStemmer
stem = PorterStemmer()

word = "multiplying"
lem.lemmatize(word, "v")
>> "multiply"
stem.stem(word)
>> "multipli"
```
2.3 Object Standardization

Text data often contains words or phrases which are not present in any standard lexical dictionary. These pieces are not recognized by search engines and models.

Some examples are acronyms, hashtags with attached words, and colloquial slang. With the help of regular expressions and manually prepared data dictionaries, this type of noise can be fixed. The code below uses a dictionary lookup method to replace social media slang in a text.

```
lookup_dict = {'rt': 'Retweet', 'dm': 'direct message', "awsm": "awesome", "luv": "love"}  # ... extend with more slang mappings

def _lookup_words(input_text):
    words = input_text.split()
    new_words = []
    for word in words:
        if word.lower() in lookup_dict:
            word = lookup_dict[word.lower()]
        new_words.append(word)
    new_text = " ".join(new_words)
    return new_text

_lookup_words("RT this is a retweeted tweet by Shivam Bansal")
>> "Retweet this is a retweeted tweet by Shivam Bansal"
```

Apart from the three steps discussed so far, other types of text preprocessing include handling encoding/decoding noise, grammar checking, and spelling correction (a small spelling-correction sketch follows below). A detailed article about preprocessing and its methods is given in one of my previous articles.
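As a quick, hedged illustration of spelling correction (not code from the original article), TextBlob exposes a correct() method:

```
from textblob import TextBlob

# A minimal spelling-correction sketch; the input sentence is made up for demonstration.
text = TextBlob("I havv goood speling")
print(text.correct())
# expected output: "I have good spelling" (correction quality may vary)
```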

3.Text to Features (Feature Engineering on text data)


To analyse preprocessed data, it needs to be converted into features. Depending upon the usage, text features can be constructed using assorted techniques: syntactical parsing, entities / N-grams / word-based features, statistical features, and word embeddings. Read on to understand these techniques in detail.

3.1 Syntactic Parsing

Syntactical parsing involves the analysis of words in the sentence for grammar and for their arrangement in a manner that shows the relationships among the words. Dependency grammar and part-of-speech tags are the important attributes of text syntactics.

Dependency Trees – Sentences are composed of words sewed together. The relationship among the words in a sentence is determined by the basic dependency grammar. Dependency grammar is a class of syntactic text analysis that deals with (labeled) asymmetrical binary relations between two lexical items (words). Every relation can be represented in the form of a triplet (relation, governor, dependent). For example, consider the sentence: “Bills on ports and immigration were submitted by Senator Brownback, Republican of Kansas.” The relationship among the words can be observed in the form of a tree representation as shown:

The tree shows that “submitted” is the root word of this sentence, and is linked by two sub-trees (subject and object subtrees). Each subtree is itself a dependency tree with relations such as (“Bills” <-> “ports” <by> “preposition” relation), (“ports” <-> “immigration” <by> “conjunction” relation).

This type of tree, when parsed recursively in a top-down manner, gives grammar relation triplets as output which can be used as features for many NLP problems like entity-wise sentiment analysis, actor and entity identification, and text classification. The Python wrapper StanfordCoreNLP (by the Stanford NLP Group) and NLTK dependency grammars can be used to generate dependency trees.
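If you prefer spaCy, the sketch below shows one way to extract (relation, governor, dependent) triplets from its dependency parse; this is an illustrative alternative, not the parser used above, and it assumes the en_core_web_sm model is installed.

```
import spacy

nlp = spacy.load("en_core_web_sm")  # install with: python -m spacy download en_core_web_sm
doc = nlp("Bills on ports and immigration were submitted by Senator Brownback.")

# Each token carries a dependency label and a governor (head),
# which together give the (relation, governor, dependent) triplet.
for token in doc:
    print((token.dep_, token.head.text, token.text))
```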

Part-of-speech tagging – Apart from the grammar relations, every word in a sentence is also associated with a part-of-speech (POS) tag (nouns, verbs, adjectives, adverbs etc). The POS tags define the usage and function of a word in the sentence. Here is a list of all possible POS tags defined by the University of Pennsylvania (the Penn Treebank tag set). The following code uses NLTK to perform POS tagging annotation on input text. (NLTK provides several implementations; the default one is the perceptron tagger.)

```
from nltk import word_tokenize, pos_tag
text = "I am learning Natural Language Processing on Analytics Vidhya"
tokens = word_tokenize(text)
print(pos_tag(tokens))
>>> [('I', 'PRP'), ('am', 'VBP'), ('learning', 'VBG'), ('Natural', 'NNP'), ('Language', 'NNP'), ('Processing', 'NNP'), ('on', 'IN'), ('Analytics', 'NNP'), ('Vidhya', 'NNP')]
```

Part of Speech tagging is used for many important purposes in NLP:

A. Word sense disambiguation: Some words have multiple meanings according to their usage. For example, in the two sentences below:

I. “Please book my flight for Delhi”

II. “I am going to read this book in the flight”

“Book” is used in different contexts, and the part-of-speech tags for the two cases are different. In sentence I, the word “book” is used as a verb, while in sentence II it is used as a noun. (The Lesk algorithm is also used for similar purposes.)

B. Improving word-based features: A learning model could learn different contexts of a word when words are used as features; however, if the part-of-speech tag is linked with them, the context is preserved, thus making stronger features. For example:

Sentence – “book my flight, I will read this book”

Tokens – (“book”, 2), (“my”, 1), (“flight”, 1), (“I”, 1), (“will”, 1), (“read”, 1), (“this”, 1)

Tokens with POS – (“book_VB”, 1), (“my_PRP$”, 1), (“flight_NN”, 1), (“I_PRP”, 1), (“will_MD”, 1), (“read_VB”, 1), (“this_DT”, 1), (“book_NN”, 1)

C. Normalization and lemmatization: POS tags are the basis of the lemmatization process for converting a word to its base form (lemma).

D. Efficient stopword removal: POS tags are also useful in the efficient removal of stopwords. For example, there are some tags which always define the low-frequency / less important words of a language, such as (IN – “within”, “upon”, “except”), (CD – “one”, “two”, “hundred”), (MD – “may”, “must” etc). A minimal sketch of POS-based filtering is shown below.
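The following is a minimal sketch, assuming NLTK and its tagger data are installed, of dropping tokens whose Penn Treebank tags fall in a chosen low-information set; the tag set used here is an illustrative choice, not a standard list.

```
from nltk import word_tokenize, pos_tag

# Tags treated as low-information here (prepositions, numbers, modals, determiners).
LOW_INFO_TAGS = {"IN", "CD", "MD", "DT"}

def pos_filter(text):
    tagged = pos_tag(word_tokenize(text))
    return [word for word, tag in tagged if tag not in LOW_INFO_TAGS]

print(pos_filter("I may book a flight within two days"))
# e.g. ['I', 'book', 'flight', 'days']
```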

3.2 Entity Extraction (Entities as features)

Entities are defined as the most important chunks of a sentence: noun phrases, verb phrases or both. Entity detection algorithms are generally ensemble models of rule-based parsing, dictionary lookups, POS tagging and dependency parsing. The applicability of entity detection can be seen in automated chatbots, content analyzers and consumer insights.

Topic modelling and Named Entity Recognition are the two key entity detection methods in NLP.

A. Named Entity Recognition (NER)

The process of detecting named entities such as person names, location names, company names etc. from text is called NER. For example:

Sentence – Sergey Brin, the manager of Google Inc. is walking in the streets of New York.

Named Entities – (“person”: “Sergey Brin”), (“org”: “Google Inc.”), (“location”: “New York”)

A typical NER model consists of three blocks:

Noun phrase identification: This step deals with extracting all the noun phrases from a text using dependency parsing and part-of-speech tagging.

Phrase classification: This is the classification step in which all the extracted noun phrases are classified into their respective categories (locations, names etc). The Google Maps API provides a good path to disambiguate locations; then, open databases such as DBpedia and Wikipedia can be used to identify person names or company names. Apart from this, one can curate lookup tables and dictionaries by combining information from different sources.

Entity disambiguation: Sometimes entities are misclassified, hence creating a validation layer on top of the results is useful. Knowledge graphs can be exploited for this purpose; popular ones include the Google Knowledge Graph, IBM Watson and Wikipedia.

A minimal NER example is sketched below.
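As a quick illustration (the original article gives no NER code), the sketch below uses spaCy's pretrained pipeline to extract named entities from the example sentence above; label names such as PERSON, ORG and GPE are spaCy's conventions.

```
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed
doc = nlp("Sergey Brin, the manager of Google Inc. is walking in the streets of New York.")

# Each entity span exposes its text and a label such as PERSON, ORG or GPE.
for ent in doc.ents:
    print(ent.text, ent.label_)
# e.g. Sergey Brin PERSON / Google Inc. ORG / New York GPE
```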

B. Topic Modeling

Topic modeling is the process of automatically identifying the topics present in a text corpus; it derives the hidden patterns among the words in the corpus in an unsupervised manner. Topics are defined as “a repeating pattern of co-occurring terms in a corpus”. A good topic model results in “health”, “doctor”, “patient”, “hospital” for a topic such as Healthcare, and “farm”, “crops”, “wheat” for a topic such as Farming.

Latent Dirichlet Allocation (LDA) is the most popular topic modelling technique. Following is the code to implement topic modeling using LDA in Python. For a detailed explanation of its working and implementation, check the complete article here.

```
doc1 = "Sugar is bad to consume. My sister likes to have sugar, but not my father."
doc2 = "My father spends a lot of time driving my sister around to dance practice."
doc3 = "Doctors suggest that driving may cause increased stress and blood pressure."
doc_complete = [doc1, doc2, doc3]
doc_clean = [doc.split() for doc in doc_complete]

import gensim
from gensim import corpora

# Creating the term dictionary of our corpus, where every unique term is assigned an index.
dictionary = corpora.Dictionary(doc_clean)

# Converting list of documents (corpus) into Document Term Matrix using dictionary prepared
above.
doc_term_matrix = [dictionary.doc2bow(doc) for doc in doc_clean]

# Creating the object for LDA model using gensim library


Lda = gensim.models.ldamodel.LdaModel

# Running and Training LDA model on the document term matrix


ldamodel = Lda(doc_term_matrix, num_topics=3, id2word = dictionary, passes=50)

# Results
print(ldamodel.print_topics())

```
C. N-Grams as Features

A combination of N words together is called an N-gram. N-grams (N > 1) are generally more informative than single words (unigrams) as features. Bigrams (N = 2) are often considered the most important features of all. The following code generates the bigrams of a text.

```
def generate_ngrams(text, n):
    words = text.split()
    output = []
    for i in range(len(words) - n + 1):
        output.append(words[i:i + n])
    return output

>>> generate_ngrams('this is a sample text', 2)
# [['this', 'is'], ['is', 'a'], ['a', 'sample'], ['sample', 'text']]
```
3.3 Statistical Features

Text data can also be quantified directly into numbers using several techniques described in this

section:

A. Term Frequency – Inverse Document Frequency (TF – IDF)

TF-IDF is a weighted model commonly used for information retrieval problems. It aims to convert text documents into vector models on the basis of the occurrence of words in the documents, without considering the exact ordering. For example, say there is a dataset of N text documents. For any document “D”, TF and IDF are defined as:

Term Frequency (TF) – TF for a term “t” is defined as the count of the term “t” in the document “D”.

Inverse Document Frequency (IDF) – IDF for a term is defined as the logarithm of the ratio of the total number of documents available in the corpus to the number of documents containing the term “t”.

TF-IDF – The TF-IDF formula gives the relative importance of a term in a corpus (list of documents): TF-IDF(t, D) = TF(t, D) × log(N / n_t), where n_t is the number of documents containing “t”. Following is the code using Python's scikit-learn package to convert a text into TF-IDF vectors:

```
from sklearn.feature_extraction.text import TfidfVectorizer
obj = TfidfVectorizer()
corpus = ['This is sample document.', 'another random document.', 'third sample document text']
X = obj.fit_transform(corpus)
print(X)
>>>
(0, 1) 0.345205016865
(0, 4) ... 0.444514311537
(2, 1) 0.345205016865
(2, 4) 0.444514311537
```

The model creates a vocabulary dictionary and assigns an index to each word. Each row in the output contains a tuple (i, j) and the tf-idf value of the word at index j in document i. The learned vocabulary can be inspected as shown below.
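A small usage note (not part of the original article): the fitted vectorizer exposes its word-to-index mapping and IDF weights, which helps interpret the (i, j) tuples above.

```
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ['This is sample document.', 'another random document.', 'third sample document text']
obj = TfidfVectorizer()
obj.fit(corpus)

# The word-to-column mapping explains the j indices in the sparse output above.
print(obj.vocabulary_)
# One IDF weight per vocabulary index.
print(obj.idf_)
```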

B. Count / Density / Readability Features

Count or density based features can also be used in models and analysis. These features might seem trivial but show a great impact in learning models. Some of the features are: word count, sentence count, punctuation counts and industry-specific word counts. Other types of measures include readability measures such as syllable counts, the SMOG index and Flesch reading ease. Refer to the Textstat library to create such features.
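A minimal sketch of such features using the textstat package follows; the function names are from textstat's documented API, but verify them against your installed version.

```
import textstat

text = "Text data can also be quantified directly into numbers using several techniques."

# Simple count and readability features for the sample sentence.
print(textstat.lexicon_count(text))        # word count
print(textstat.sentence_count(text))       # sentence count
print(textstat.syllable_count(text))       # syllable count
print(textstat.smog_index(text))           # SMOG readability index
print(textstat.flesch_reading_ease(text))  # Flesch reading ease score
```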


3.4 Word Embedding (text vectors)

Word embedding is the modern way of representing words as vectors. The aim of word embedding is to redefine high-dimensional word features as low-dimensional feature vectors while preserving the contextual similarity in the corpus. Word embeddings are widely used in deep learning models such as convolutional neural networks and recurrent neural networks.

Word2Vec and GloVe are the two popular models for creating word embeddings of a text. These models take a text corpus as input and produce the word vectors as output.

The Word2Vec model is composed of a preprocessing module, a shallow neural network model called Continuous Bag of Words (CBOW) and another shallow neural network model called skip-gram. These models are widely used for other NLP problems as well. Word2Vec first constructs a vocabulary from the training corpus and then learns word embedding representations. The following code uses the gensim package to prepare the word embeddings as vectors.

```
from gensim.models import Word2Vec
sentences = [['data', 'science'], ['vidhya', 'science', 'data', 'analytics'], ['machine', 'learning'], ['deep', 'learning']]

# train the model on your corpus
model = Word2Vec(sentences, min_count = 1)

# note: in gensim >= 4.0 these lookups live on model.wv (model.wv.similarity, model.wv['learning'])
print(model.similarity('data', 'science'))
>>> 0.11222489293

print(model['learning'])
>>> array([ 0.00459356 0.00303564 -0.00467622 0.00209638, ...])
```

These vectors can be used as features for ML models, to measure text similarity using cosine similarity techniques, for word clustering and for text classification. A common trick, sketched below, is to average a document's word vectors into a single document vector.
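The snippet below is a minimal sketch, assuming a gensim (version 4 or later, hence the model.wv lookups) Word2Vec model like the one above, of averaging word vectors into a single document vector that can then be compared with cosine similarity.

```
import numpy as np
from gensim.models import Word2Vec

sentences = [['data', 'science'], ['vidhya', 'science', 'data', 'analytics'], ['machine', 'learning'], ['deep', 'learning']]
model = Word2Vec(sentences, min_count=1)

def doc_vector(tokens, model):
    # Average the vectors of the in-vocabulary tokens into one document vector.
    vectors = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vectors, axis=0)

v1 = doc_vector(['data', 'science'], model)
v2 = doc_vector(['machine', 'learning'], model)
cosine = float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))
print(cosine)
```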


4. Important tasks of NLP

This section talks about different use cases and problems in the field of natural language

processing.

4.1 Text Classification

Text classification is one of the classical problems of NLP. Well-known examples include email spam identification, topic classification of news, sentiment classification and the organization of web pages by search engines.

Text classification, in common words, is defined as a technique to systematically classify a text object (document or sentence) into one of a fixed set of categories. It is really helpful when the amount of data is too large, especially for organizing, information filtering, and storage purposes.

A typical natural language classifier consists of two parts: (a) training and (b) prediction, as shown in the image below. First the text input is processed and features are created. The machine learning model then learns these features and is used for predicting on new text.

Here is code for a naive Bayes classifier using the TextBlob library (built on top of NLTK).

```
from textblob.classifiers import NaiveBayesClassifier as NBC
from textblob import TextBlob
training_corpus = [
('I am exhausted of this work.', 'Class_B'),
("I can't cooperate with this", 'Class_B'),
('He is my badest enemy!', 'Class_B'),
('My management is poor.', 'Class_B'),
('I love this burger.', 'Class_A'),
('This is an brilliant place!', 'Class_A'),
('I feel very good about these dates.', 'Class_A'),
('This is my best work.', 'Class_A'),
("What an awesome view", 'Class_A'),
('I do not like this dish', 'Class_B')]
test_corpus = [
("I am not feeling well today.", 'Class_B'),
("I feel brilliant!", 'Class_A'),
('Gary is a friend of mine.', 'Class_A'),
("I can't believe I'm doing this.", 'Class_B'),
('The date was good.', 'Class_A'), ('I do not enjoy my job', 'Class_B')]

model = NBC(training_corpus)
print(model.classify("Their codes are amazing."))
>>> "Class_A"
print(model.classify("I don't like their computer."))
>>> "Class_B"
print(model.accuracy(test_corpus))
>>> 0.83
```

Scikit.Learn also provides a pipeline framework for text classification:

```
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn import svm

# preparing data for SVM model (using the same training_corpus, test_corpus from the naive bayes example)
train_data = []
train_labels = []
for row in training_corpus:
    train_data.append(row[0])
    train_labels.append(row[1])

test_data = []
test_labels = []
for row in test_corpus:
    test_data.append(row[0])
    test_labels.append(row[1])

# Create feature vectors


vectorizer = TfidfVectorizer(min_df=4, max_df=0.9)
# Train the feature vectors
train_vectors = vectorizer.fit_transform(train_data)
# Apply model on test data
test_vectors = vectorizer.transform(test_data)

# Perform classification with SVM, kernel=linear


model = svm.SVC(kernel='linear')
model.fit(train_vectors, train_labels)
prediction = model.predict(test_vectors)
>>> ['Class_A' 'Class_A' 'Class_B' 'Class_B' 'Class_A' 'Class_A']

print (classification_report(test_labels, prediction))


```

Text classification models are heavily dependent upon the quality and quantity of features. While applying any machine learning model, it is always good practice to include more training data. Here are some tips that I wrote about improving text classification accuracy in one of my previous articles.

4.2 Text Matching / Similarity

One of the important areas of NLP is the matching of text objects to find similarities. Important applications of text matching include automatic spelling correction, data de-duplication and genome analysis.

A number of text matching techniques are available depending upon the requirement. This section describes the important techniques in detail.

A. Levenshtein Distance – The Levenshtein distance between two strings is defined as the minimum number of edits needed to transform one string into the other, with the allowable edit operations being insertion, deletion, or substitution of a single character. Following is a memory-efficient implementation.


```
def levenshtein(s1, s2):
    if len(s1) > len(s2):
        s1, s2 = s2, s1
    distances = range(len(s1) + 1)
    for index2, char2 in enumerate(s2):
        newDistances = [index2 + 1]
        for index1, char1 in enumerate(s1):
            if char1 == char2:
                newDistances.append(distances[index1])
            else:
                newDistances.append(1 + min((distances[index1], distances[index1 + 1], newDistances[-1])))
        distances = newDistances
    return distances[-1]

print(levenshtein("analyze", "analyse"))
```

B. Phonetic Matching – A phonetic matching algorithm takes a keyword as input (a person's name, a location name etc) and produces a character string that identifies a set of words that are (roughly) phonetically similar. It is very useful for searching large text corpora, correcting spelling errors and matching relevant names. Soundex and Metaphone are the two main phonetic algorithms used for this purpose. Python's Fuzzy module can be used to compute Soundex strings for different words, for example:

```
import fuzzy
soundex = fuzzy.Soundex(4)
print(soundex('ankit'))
>>> "A523"
print(soundex('aunkit'))
>>> "A523"
```

C. Flexible String Matching – A complete text matching system includes different algorithms pipelined together to handle a variety of text variations. Regular expressions are really helpful for this purpose as well. Other common techniques include exact string matching, lemmatized matching, and compact matching (which takes care of spaces, punctuation, slang etc). A small fuzzy-matching sketch follows below.
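As a hedged illustration of flexible matching (the original names the techniques but gives no code), the sketch below uses Python's standard-library difflib to score approximate matches against a small, made-up candidate list.

```
from difflib import SequenceMatcher, get_close_matches

# Similarity ratio between two slightly different strings.
print(SequenceMatcher(None, "analytics vidhya", "analytics vidya").ratio())

# Pick the closest candidate for a misspelled query (illustrative candidate list).
candidates = ["new york", "new delhi", "newark"]
print(get_close_matches("new yrok", candidates, n=1, cutoff=0.6))
# e.g. ['new york']
```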
D. Cosine Similarity – When the text is represented in vector notation, a general cosine similarity can be applied in order to measure vectorized similarity. The following code converts texts to vectors (using term frequency) and applies cosine similarity to provide the closeness between two texts.

```
import math
from collections import Counter

def get_cosine(vec1, vec2):
    common = set(vec1.keys()) & set(vec2.keys())
    numerator = sum([vec1[x] * vec2[x] for x in common])

    sum1 = sum([vec1[x]**2 for x in vec1.keys()])
    sum2 = sum([vec2[x]**2 for x in vec2.keys()])
    denominator = math.sqrt(sum1) * math.sqrt(sum2)

    if not denominator:
        return 0.0
    else:
        return float(numerator) / denominator

def text_to_vector(text):
    words = text.split()
    return Counter(words)

text1 = 'This is an article on analytics vidhya'
text2 = 'article on analytics vidhya is about natural language processing'

vector1 = text_to_vector(text1)
vector2 = text_to_vector(text2)
cosine = get_cosine(vector1, vector2)
print(cosine)
>>> 0.62
```
4.3 Coreference Resolution

Coreference resolution is the process of finding relational links among the words (or phrases) within sentences. Consider the example sentence: “Donald went to John's office to see the new table. He looked at it for an hour.”

Humans can quickly figure out that “he” denotes Donald (and not John), and that “it” denotes the table (and not John's office). Coreference resolution is the component of NLP that does this job automatically. It is used in document summarization, question answering, and information extraction. Stanford CoreNLP provides coreference resolution, and Python wrappers for it are available.

4.4 Other NLP problems / tasks

 Text Summarization – Given a text article or paragraph, summarize it automatically to produce the most important and relevant sentences in order.

 Machine Translation – Automatically translate text from one human language to another, taking care of grammar, semantics, information about the real world, etc.

 Natural Language Generation and Understanding – Converting information from computer databases or semantic intents into readable human language is called language generation. Converting chunks of text into more logical structures that are easier for computer programs to manipulate is called language understanding.

 Optical Character Recognition – Given an image representing printed text, determine the corresponding text.

 Document to Information – This involves parsing the textual data present in documents (websites, files, PDFs and images) into an analyzable and clean format.

5. Important Libraries for NLP (python)

 Scikit-learn: Machine learning in Python.

 Natural Language Toolkit (NLTK): The complete toolkit for all NLP techniques.

 Pattern – A web mining module with tools for NLP and machine learning.

 TextBlob – Easy-to-use NLP tools API, built on top of NLTK and Pattern.

 spaCy – Industrial-strength NLP with Python and Cython.


Program 1:

AIM: Word Analysis

from collections import Counter

# counts word frequency
def count_words(text):
    skips = [".", ",", ":", ";", "'", '"']
    for ch in skips:
        text = text.replace(ch, "")
    word_counts = {}
    for word in text.split(" "):
        if word in word_counts:
            word_counts[word] += 1
        else:
            word_counts[word] = 1
    return word_counts

# >>> count_words(text)  You can check the function

# counts word frequency using Counter from collections
def count_words_fast(text):
    text = text.lower()
    skips = [".", ",", ":", ";", "'", '"']
    for ch in skips:
        text = text.replace(ch, "")
    word_counts = Counter(text.split(" "))
    return word_counts

# >>> count_words_fast(text)  You can check the function

OUTPUT:

{‘were’: 1, ‘is’: 1, ‘manageable’: 1, ‘to’: 1, ‘things’: 1, ‘keeping’: 1, ‘my’: 1, ‘test’: 1, ‘text’:

2, ‘keep’: 1, ‘short’: 1, ‘this’: 2}

Counter({‘text’: 2, ‘this’: 2, ‘were’: 1, ‘is’: 1, ‘manageable’: 1, ‘to’: 1, ‘things’: 1, ‘keeping’:

1, ‘my’: 1, ‘test’: 1, ‘keep’: 1, ‘short’: 1})


import os
import pandas as pd

book_dir = "./Books"
os.listdir(book_dir)

stats = pd.DataFrame(columns =("language",


"author",
"title",
"length",
"unique"))

# check >>>stats
title_num = 1
for language in os.listdir(book_dir):
for author in os.listdir(book_dir+"/"+language):
for title in os.listdir(book_dir+"/"+language+"/"+author):

inputfile = book_dir+"/"+language+"/"+author+"/"+title
print(inputfile)
text = read_book(inputfile)
(num_unique, counts) = word_stats(count_words_fast(text))
stats.loc[title_num]= language,
author.capitalize(),
title.replace(".txt", ""),
sum(counts), num_unique
title_num+= 1
import matplotlib.pyplot as plt
plt.plot(stats.length, stats.unique, "bo-")

plt.loglog(stats.length, stats.unique, "ro")

stats[stats.language =="English"] #to check information on english books

plt.figure(figsize =(10, 10))


subset = stats[stats.language =="English"]
plt.loglog(subset.length,
subset.unique,
"o",
label ="English",
color ="crimson")

subset = stats[stats.language =="French"]


plt.loglog(subset.length,
subset.unique,
"o",
label ="French",
color ="forestgreen")

subset = stats[stats.language =="German"]


plt.loglog(subset.length,
subset.unique,
"o",
label ="German",
color ="orange")

subset = stats[stats.language =="Portuguese"]


plt.loglog(subset.length,
subset.unique,
"o",
label ="Portuguese",
color ="blueviolet")

plt.legend()
plt.xlabel("Book Length")
plt.ylabel("Number of Unique words")
plt.savefig("fig.pdf")
plt.show()

# read a book and return it as a string
def read_book(title_path):
    with open(title_path, "r", encoding="utf8") as current_file:
        text = current_file.read()
        text = text.replace("\n", "").replace("\r", "")
    return text

# word_counts = count_words_fast(text)
def word_stats(word_counts):
    num_unique = len(word_counts)
    counts = word_counts.values()
    return (num_unique, counts)

text = read_book("./Books / English / shakespeare / Romeo and Juliet.txt")


word_counts = count_words_fast(text)
(num_unique, counts) = word_stats(word_counts)
print(num_unique, sum(counts))
Program 2: To generate jumbled words

import random

WORDS = ("python", "jumble", "easy", "difficult", "answer", "xylophone")

word = random.choice(WORDS)
correct = word
jumble = ""
while word:
    position = random.randrange(len(word))
    jumble += word[position]
    word = word[:position] + word[(position + 1):]

print(
    """
    Unscramble the letters to make a word.
    (press the enter key at prompt to quit)
    """
)
print("The jumble is:", jumble)

guess = input("Your guess: ")
while guess != correct and guess != "":
    print("Sorry, that's not it")
    guess = input("Your guess: ")
if guess == correct:
    print("That's it, you guessed it!\n")
print("Thanks for playing")

input("\n\nPress the enter key to exit")


Randomizing the first 500 words

from urllib.request import Request, urlopen

import random

url="https://svnweb.freebsd.org/csrg/share/dict/words?revision=61569&view=co"

req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})

web_byte = urlopen(req).read()

webpage = web_byte.decode('utf-8')

first500 = webpage[:500].split("\n")

random.shuffle(first500)

print(first500)

Output

['abnegation', 'able', 'aborning', 'Abigail', 'Abidjan', 'ablaze', 'abolish', 'abbe',


'above', 'abort', 'aberrant', 'aboriginal', 'aborigine', 'Aberdeen', 'Abbott',
'Abernathy', 'aback', 'abate', 'abominate', 'AAA', 'abc', 'abed', 'abhorred',
'abolition', 'ablate', 'abbey', 'abbot', 'Abelson', 'ABA', 'Abner', 'abduct',
'aboard', 'Abo', 'abalone', 'a', 'abhorrent', 'Abelian', 'aardvark', 'Aarhus',
'Abe', 'abjure', 'abeyance', 'Abel', 'abetting', 'abash', 'AAAS', 'abdicate',
'abbreviate', 'abnormal', 'abject', 'abacus', 'abide', 'abominable', 'abode',
'abandon', 'abase', 'Ababa', 'abdominal', 'abet', 'abbas', 'aberrate', 'abdomen',
'abetted', 'abound', 'Aaron', 'abhor', 'ablution', 'abeyant', 'about']
PROGRAM 3:
Aim: To perform morphological analysis for an interrogative sentence, declarative sentence and
complex sentences with more than two sentences connected using conjunctions.
The spaCy library is a popular library for natural language processing (NLP) in Python. It
provides a wide range of capabilities for text processing, including tokenization, POS tagging,
named entity recognition, and more. In this program, we are using it for morphological analysis
which is the study of word structure and forms.
Import the spaCy library and load the "en_core_web_sm" model. The "en_core_web_sm" model
is a pre-trained model that includes all of the basic NLP capabilities, such as tokenization, POS
tagging, and dependency parsing. This model is small in size, so it's suitable for small to
medium-sized texts.
import spacy

nlp = spacy.load("en_core_web_sm")

interrogative_sentence = "What is the weather like today?"  # or: interrogative_sentence = input("Enter an interrogative sentence: ")
declarative_sentence = "The weather is sunny."  # or: declarative_sentence = input("Enter a declarative sentence: ")
complex_sentence = "I went to the store, but they were closed, so I had to go to another store."  # or: complex_sentence = input("Enter a complex sentence using conjunctions: ")

interrogative_doc = nlp(interrogative_sentence)
declarative_doc = nlp(declarative_sentence)
complex_doc = nlp(complex_sentence)

for token in interrogative_doc:
    print(token.text, token.pos_)
print("\n")
for token in declarative_doc:
    print(token.text, token.pos_)
print("\n")
for token in complex_doc:
    print(token.text, token.pos_)

output:
What PRON

is AUX

the DET

weather NOUN

like ADP

today NOUN

? PUNCT

The DET

weather NOUN

is AUX

sunny ADJ

. PUNCT

I PRON

went VERB

to ADP

the DET

store NOUN

, PUNCT

but CCONJ

they PRON

were AUX
closed ADJ

, PUNCT

so CCONJ

I PRON

had VERB

to PART

go VERB

PROGRAM 4:

Aim: N-Gram Language Modelling with NLTK

Language modeling is the way of determining the probability of any sequence of words.
Language modeling is used in a wide variety of applications such as Speech Recognition,
Spam filtering, etc. In fact, language modeling is the key aim behind the implementation of
many state-of-the-art Natural Language Processing models.
Methods of Language Modeling:
There are two types of language modeling:
 Statistical Language Modeling: Statistical language modeling is the development of probabilistic models that are able to predict the next word in a sequence given the words that precede it. N-gram language modeling is one example.
 Neural Language Modeling: Neural network methods are achieving better results than classical methods, both as standalone language models and when incorporated into larger models on challenging tasks like speech recognition and machine translation. One way of building a neural language model is through word embeddings.
N-gram
An N-gram can be defined as a contiguous sequence of n items from a given sample of text or speech. The items can be letters, words, or base pairs according to the application. The N-grams are typically collected from a text or speech corpus (a long text dataset).

N-gram Language Model:
An N-gram language model predicts the probability of a given N-gram within any sequence of words in the language. A good N-gram model can predict the next word in a sentence, i.e. the value of p(w|h).

Examples of N-grams are unigrams (“This”, “article”, “is”, “on”, “NLP”) or bigrams (“This article”, “article is”, “is on”, “on NLP”).

We need to calculate p(w|h), where w is the candidate for the next word and h is the history of preceding words. For example, continuing the example above, suppose we want to calculate the probability of the last word being “NLP” given the previous words, i.e. p(“NLP” | “This article is on”). With an N-gram model this is estimated from N-gram counts; for a bigram model, p(w|h) ≈ count(h, w) / count(h).
import string
import random
from collections import Counter, defaultdict

import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('reuters')
from nltk.corpus import reuters, stopwords
from nltk.util import ngrams
from nltk import FreqDist

# input the reuters sentences
sents = reuters.sents()

# build the removal characters: stopwords and punctuation
stop_words = set(stopwords.words('english'))
string.punctuation = string.punctuation + '“' + '”' + '-' + '‘' + '’' + '—'
removal_list = list(stop_words) + list(string.punctuation) + ['lt', 'rt']

# generate unigrams, bigrams and trigrams
unigram = []
bigram = []
trigram = []
tokenized_text = []
for sentence in sents:
    sentence = list(map(lambda x: x.lower(), sentence))
    for word in sentence:
        if word == '.':
            sentence.remove(word)
        else:
            unigram.append(word)

    tokenized_text.append(sentence)
    bigram.extend(list(ngrams(sentence, 2, pad_left=True, pad_right=True)))
    trigram.extend(list(ngrams(sentence, 3, pad_left=True, pad_right=True)))

# remove the n-grams made up only of removable words
def remove_stopwords(x):
    y = []
    for pair in x:
        count = 0
        for word in pair:
            if word in removal_list:
                count = count or 0
            else:
                count = count or 1
        if count == 1:
            y.append(pair)
    return y

unigram = remove_stopwords(unigram)
bigram = remove_stopwords(bigram)
trigram = remove_stopwords(trigram)

# generate frequency distributions of the n-grams
freq_bi = FreqDist(bigram)
freq_tri = FreqDist(trigram)

# map each (word1, word2) prefix to a Counter of possible next words
d = defaultdict(Counter)
for a, b, c in freq_tri:
    if a is not None and b is not None and c is not None:
        d[a, b][c] += freq_tri[a, b, c]

# Next word prediction
def pick_word(counter):
    "Chooses a random element."
    return random.choice(list(counter.elements()))

prefix = "he", "said"
print(" ".join(prefix))
s = " ".join(prefix)
for i in range(19):
    suffix = pick_word(d[prefix])
    s = s + ' ' + suffix
    print(s)
    prefix = prefix[1], suffix

OUTPUT:

he said
he said kotc
he said kotc made
he said kotc made profits
he said kotc made profits of
he said kotc made profits of 265
he said kotc made profits of 265 ,
he said kotc made profits of 265 , 457
he said kotc made profits of 265 , 457 vs
he said kotc made profits of 265 , 457 vs loss
he said kotc made profits of 265 , 457 vs loss eight
he said kotc made profits of 265 , 457 vs loss eight cts
he said kotc made profits of 265 , 457 vs loss eight cts net
he said kotc made profits of 265 , 457 vs loss eight cts net loss
he said kotc made profits of 265 , 457 vs loss eight cts net loss 343
he said kotc made profits of 265 , 457 vs loss eight cts net loss 343 ,
he said kotc made profits of 265 , 457 vs loss eight cts net loss 343 , 266
he said kotc made profits of 265 , 457 vs loss eight cts net loss 343 ,
266 ,
he said kotc made profits of 265 , 457 vs loss eight cts net loss 343 ,
266 , 000
he said kotc made profits of 265 , 457 vs loss eight cts net loss 343 ,
266 , 000 shares

Program 5:

N-Gram smoothing
An N-gram is a sequence of N words: a 2-gram (or bigram) is a two-word
sequence of words like “lütfen ödevinizi”, “ödevinizi çabuk”, or ”çabuk
veriniz”, and a 3-gram (or trigram) is a three-word sequence of words like
“lütfen ödevinizi çabuk”, or “ödevinizi çabuk veriniz”.

Smoothing

To keep a language model from assigning zero probability to unseen events,


we’ll have to shave off a bit of probability mass from some more frequent
events and give it to the events we’ve never seen. This modification is called
smoothing or discounting.

Laplace Smoothing
The simplest way to do smoothing is to add one to all the bigram counts,
before we normalize them into probabilities. All the counts that used to be
zero will now have a count of 1, the counts of 1 will be 2, and so on. This
algorithm is called Laplace smoothing.

Add-k Smoothing
One alternative to add-one smoothing is to move a bit less of the probability
mass from the seen to the unseen events. Instead of adding 1 to each count,
we add a fractional count k. This algorithm is therefore called add-k
smoothing.
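To make the two schemes concrete, here is a small illustrative calculation (my own toy numbers, not part of the NGram library below) of add-one and add-k bigram probabilities from raw counts:

# Illustrative add-one and add-k bigram estimates from toy counts.
bigram_count = 3      # count(w_prev, w)
prev_count = 10       # count(w_prev)
V = 1000              # vocabulary size
k = 0.5               # fractional count for add-k

p_laplace = (bigram_count + 1) / (prev_count + V)       # add-one (Laplace) smoothing
p_add_k = (bigram_count + k) / (prev_count + k * V)     # add-k smoothing
print(p_laplace, p_add_k)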

Requirements

 Python 3.7 or higher


 Git

Python
To check if you have a compatible version of Python installed, use the
following command:

python -V
You can find the latest version of Python here.

Git

Install the latest version of Git.

Pip Install

pip3 install NlpToolkit-NGram

Download Code

In order to work on the code, create a fork from the GitHub page. Use Git to clone the code to your local machine, or run the line below on Ubuntu:

git clone <your-fork-git-link>


A directory called NGram will be created. Alternatively, you can use the link below to explore the code:

git clone https://github.com/starlangsoftware/NGram-Py.git


Open project with Pycharm IDE

Steps for opening the cloned project:

 Start the IDE
 Select File | Open from the main menu
 Choose the NGram-Py directory
 Select the "Open as project" option
 After a couple of seconds, the dependencies will be downloaded.

Detailed Description

 Training NGram
 Using NGram
 Saving NGram
 Loading NGram

Training NGram

To create an empty NGram model:

NGram(N: int)
For example,

a = NGram(2)
this creates an empty NGram model.

To add a sentence to the NGram model:

addNGramSentence(self, symbols: list)


For example,

nGram = NGram(2)
nGram.addNGramSentence(["jack", "read", "books", "john", "mary", "went"])
nGram.addNGramSentence(["jack", "read", "books", "mary", "went"])
with the lines above, an empty NGram model is created and two sentences
are added to the bigram model.

NoSmoothing class is the simplest technique for smoothing. It doesn't


require training. Only probabilities are calculated using counters. For
example, to calculate the probabilities of a given NGram model using
NoSmoothing:

a.calculateNGramProbabilities(NoSmoothing())
LaplaceSmoothing class is a simple smoothing technique for smoothing. It
doesn't require training. Probabilities are calculated adding 1 to each
counter. For example, to calculate the probabilities of a given NGram model
using LaplaceSmoothing:

a.calculateNGramProbabilities(LaplaceSmoothing())
GoodTuringSmoothing class is a complex smoothing technique that doesn't
require training. To calculate the probabilities of a given NGram model using
GoodTuringSmoothing:

a.calculateNGramProbabilities(GoodTuringSmoothing())
AdditiveSmoothing class is a smoothing technique that requires training.

a.calculateNGramProbabilities(AdditiveSmoothing())

Using NGram

To find the probability of an NGram:

getProbability(self, *args) -> float


For example, to find the bigram probability:

a.getProbability("jack", "reads")
To find the trigram probability:

a.getProbability("jack", "reads", "books")

Saving NGram

To save the NGram model:

saveAsText(self, fileName: str)


For example, to save model "a" to the file "model.txt":

a.saveAsText("model.txt");

Loading NGram

To load an existing NGram model:

NGram(fileName: str)
For example,

a = NGram("model.txt")
this loads an NGram model in the file "model.txt".

from setuptools import setup
from pathlib import Path

this_directory = Path(__file__).parent
long_description = (this_directory / "README.md").read_text(encoding="utf-8")

setup(
    name='NlpToolkit-NGram',
    version='1.0.19',
    packages=['NGram', 'test'],
    url='https://github.com/StarlangSoftware/NGram-Py',
    license='',
    author='olcaytaner',
    author_email='[email protected]',
    description='NGram library',
    install_requires=['NlpToolkit-DataStructure', 'NlpToolkit-Sampling'],
    long_description=long_description,
    long_description_content_type='text/markdown'
)
from NGram.NGram import NGram
from NGram.SimpleSmoothing import SimpleSmoothing

class NoSmoothing(SimpleSmoothing):

    def setProbabilities(self, nGram: NGram, level: int):
        nGram.setProbabilityWithPseudoCount(0.0, level)
