
Laboratory Manual

CSDL7013 NATURAL LANGUAGE PROCESSING LAB

BRANCH: COMPUTER ENGINEERING

SEMESTER: 7

AY: 2022-23

SUBJECT TEACHER

PROF.M.VELLADURAI
LIST OF EXPERIMENTS
SL.NO. EXPERIMENT NAME
1. Study various applications of NLP and Formulate the Problem
Statement for Mini Project based on chosen real world NLP
applications
2. Various text preprocessing techniques for any given text: Tokenization, Filtration & Script Validation
3. Text preprocessing techniques for any given text: Stop Word Removal, Lemmatization / Stemming
4. Morphological analysis and word generation for any given text
5. Implementing N-Gram model for the given text input
6. Study the different POS taggers and Perform POS tagging on the
given text
7. Chunking for the given text input
8. Implement Named Entity Recognizer for the given text input
9. Implement Text Similarity Recognizer for the chosen text
documents
10. Exploratory data analysis of a given text (Word Cloud)
11. Mini Project Report: For any one chosen real world NLP application
12. Implementation and Presentation of Mini Project
Ex.1 STUDY VARIOUS APPLICATIONS OF NLP AND FORMULATE THE
PROBLEM STATEMENT FOR MINI PROJECT BASED ON CHOSEN REAL WORLD
NLP APPLICATIONS

LAB OBJECTIVES:

To study the various applications of NLP and formulate the problem statement for the mini project.

LAB OUTCOMES:

On successful completion, the student will be able to understand various NLP applications.

PROCEDURE:

Machine Translation

What is a machine translation and how does it work?

Machine Translation (MT), or automated translation, is the process by which computer software translates text from one language to another without human involvement. At its most basic level, machine translation performs a simple substitution of words in one natural language with words in another.

Using corpus methods, more complex translations can be performed, allowing better handling of differences in linguistic typology, phrase recognition, and the translation of idioms, as well as the isolation of anomalies. Current systems cannot yet match the quality of a human translator, but this may become possible in the future.

In simple terms, machine translation works by using computer software to translate text from a source language into a target language. There are different types of machine translation, and in the next section we will discuss them in detail.

Different types of machine translation in NLP


There are four types of machine translation:
1. Statistical Machine Translation or SMT
SMT works by building statistical models from the analysis of large volumes of bilingual text. It aims to determine the correspondence between a word in the source language and a word in the target language. A well-known example of this approach is Google Translate.

SMT is good for basic translation, but its greatest drawback is that it does not take context into account, which means translations are often wrong; in other words, do not expect high-quality translation. There are several types of statistical machine translation models: hierarchical phrase-based translation, syntax-based translation, phrase-based translation, and word-based translation.

2. Rule-based Machine Translation or RBMT

RBMT translates on the basis of grammatical rules. It performs a grammatical analysis of the source and target languages to generate the translated sentence. However, RBMT requires extensive post-editing, and its heavy reliance on dictionaries means that good proficiency is achieved only after a significant period of use.

3. Hybrid Machine Translation or HMT


HMT, as the term suggests, is a blend of RBMT and SMT. It uses a translation memory, which makes it considerably more effective in terms of quality. Nevertheless, even HMT has several downsides, the biggest of which is the need for substantial editing; human translators will also be needed. There are several approaches to HMT, such as multi-engine, statistical rule generation, multi-pass, and confidence-based.

4. Neural Machine Translation or NMT


NMT is a type of machine translation that relies on neural network models (loosely inspired by the human brain) to build statistical models for translation. The essential advantage of NMT is that it provides a single system that can be trained end to end on the source and target text, so it does not depend on the pipeline of specialized components common to other machine translation systems, particularly SMT.

What are the benefits of machine translation?


One of the crucial benefits of machine translation is speed: computer programs can translate huge amounts of text rapidly. Human translators are more accurate, but they cannot match the speed of a computer.
If you specifically train the machine to your requirements, machine translation offers an ideal blend of quick and cost-effective translation, as it is cheaper than using a human translator. With a specially trained engine, MT can capture the context of full sentences before translating them, which gives high-quality, human-sounding output. Another benefit of machine translation is its ability to learn important terms and reuse them wherever they fit.
Applications of machine translation
Machine translation technology and products are used in many application scenarios, for example business travel and tourism. In terms of the object of translation, there is text translation for written language and speech translation for spoken language.

Text translation
Automated text translation is widely used in a variety of sentence-level and document-level translation applications. Sentence-level applications include the translation of query and retrieval inputs and of the output of optical character recognition (OCR) on images. Document-level applications include the translation of plain documents and of documents containing structured data.
Structured data mainly covers the presentation format of the text content, object type actions, and other information such as fonts, colours, tables, forms, hyperlinks, etc. At present, the translation objects of machine translation systems are mostly based on the sentence level.
Most importantly, a sentence can completely express a topic and naturally forms a unit of expression, and the meaning of each word in a sentence can largely be resolved from the limited context within the sentence.

Also, obtaining data at sentence-level granularity from the training corpus is more effective than working at other levels such as words, phrases, or passages. Finally, sentence-level translation can naturally be extended to support translation at these other levels.

Speech translation
With the rapid advance of mobile applications, voice input has become a convenient mode of human-computer interaction, and speech translation has become an important application scenario. The basic pipeline of speech translation is: source-language speech, source-language text, target-language text, target-language speech.
In this pipeline, automatic text translation from source-language text to target-language text is an important intermediate module. In addition, the front end and back end also require automatic speech recognition (ASR) and text-to-speech (TTS).
Other applications
Essentially, the task of machine translation is to convert a word sequence in one source language into a semantically equivalent word sequence in a target language. In general, it is a sequence transformation task that converts one sequence object into another according to some knowledge and logic, through models and algorithms.
Many task settings involve transformation between sequence objects, and the language in the machine translation task is just one type of sequence object. Therefore, when the concepts of source language and target language are extended from languages to other sequence object types, machine translation methods and techniques can be applied to solve many similar transformation tasks.

Machine Translation vs Human translation


Machine translation hits a sweet spot of cost and speed, offering brands a quick way to translate their documents at scale without much overhead. Yet that does not mean it is always appropriate. Human translation, on the other hand, is best for tasks that require extra care and nuance: skilled translators work on your brand's content to capture the original meaning and convey that feeling or message in another body of work.
Depending on how much content needs to be translated, machine translation can deliver translated content almost instantly, whereas human translators will take more time. Time spent finding, vetting, and managing a team of translators must also be considered. Many translation software providers can offer machine translation at little or no cost, making it a viable solution for organizations that cannot afford professional translation.

Machine translation is the instant conversion of text from one language to another using artificial intelligence, whereas human translation involves actual human intelligence, in the form of one or more translators translating the text manually.

TEXT CLASSIFICATION: (Text Categorization)


Words and Sequences
An NLP system needs to understand text, signs, and semantics properly. Many methods help an NLP system understand text and symbols: text classification, vector semantics, word embedding, probabilistic language models, sequence labeling, and speech recognition.
Text classification
Text classification is the process of categorizing text into groups. Using NLP, text classification can automatically analyze text and then assign a set of predefined tags or categories based on its context. It is used for sentiment analysis, topic detection, and language detection. There are mainly three text classification approaches:
Rule-based systems
Machine learning-based systems
Hybrid systems
In the rule-based approach, texts are separated into organized groups using a set of handcrafted linguistic rules. These rules have users define lists of words that characterize each group. For example, words like Donald Trump and Boris Johnson would be categorized into politics, while people like LeBron James and Ronaldo would be categorized into sports.
A machine learning-based classifier learns to make classifications based on past observations from the data sets. User data is pre-labeled as train and test data. The classifier learns a classification strategy from previous inputs and keeps learning continuously. Machine learning-based classifiers use a bag of words for feature extraction.

In a bag of words, a vector represents the frequency of words from a predefined dictionary (word list). We can perform this classification using machine learning algorithms such as Naïve Bayes, SVM, and deep learning.
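As a small illustration of the bag-of-words idea (not part of the manual's original code), the sketch below uses scikit-learn's CountVectorizer and a Naïve Bayes classifier; the category names and example sentences are made up for demonstration.

# A minimal bag-of-words text classifier (illustrative sketch, not production code)
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny, made-up training data: two categories, politics and sports
texts = [
    "Donald Trump gave a speech on foreign policy",
    "Boris Johnson announced a new government bill",
    "LeBron James scored 40 points last night",
    "Ronaldo signed a contract with a new club",
]
labels = ["politics", "politics", "sports", "sports"]

# Turn each text into a word-frequency vector (bag of words)
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

# Train a Naive Bayes classifier on the vectors
clf = MultinomialNB()
clf.fit(X, labels)

# Classify a new, unseen sentence
new_doc = vectorizer.transform(["The prime minister held a press conference"])
print(clf.predict(new_doc))  # expected to lean towards 'politics'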
The third approach to text classification is the hybrid approach. The hybrid approach combines the rule-based and machine learning-based approaches: it uses the rule-based system to create tags and uses machine learning to train the system and create rules. The machine-generated rule list is then compared with the rule-based rule list, and if something does not match on the tags, humans improve the list manually. It is the best method to implement text classification.

1. Vector Semantics
Vector semantics is another way of analyzing words and sequences. It defines word meaning in terms of features such as similar words and opposite words. The main idea behind vector semantics is that two words are alike if they are used in similar contexts. Vector semantics places words in a multi-dimensional vector space and is useful in sentiment analysis.
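To make the "nearby in vector space means similar meaning" idea concrete, here is a minimal sketch that compares toy word vectors with cosine similarity; the three-dimensional vectors are invented purely for illustration.

import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 = same direction, around 0 = unrelated."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy 3-dimensional "embeddings" (made up for illustration only)
vec_good = np.array([0.9, 0.8, 0.1])
vec_great = np.array([0.85, 0.75, 0.2])
vec_terrible = np.array([-0.8, -0.7, 0.1])

print(cosine_similarity(vec_good, vec_great))     # high: words used in similar contexts
print(cosine_similarity(vec_good, vec_terrible))  # low/negative: opposite sentiment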
2. Word Embedding
Word embedding is another method of word and sequence analysis. An embedding maps sparse vectors into a low-dimensional space that preserves semantic relationships. Word embedding is a type of word representation that allows words with similar meaning to have a similar representation. Two common types of embedding models are:
Word2Vec
Doc2Vec
Word2Vec is a statistical method for effectively learning a standalone word embedding from a
text corpus.

Doc2Vec is similar to Word2Vec, but it analyzes a group of text, such as a whole page or document.
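A minimal sketch of training Word2Vec embeddings, assuming the gensim library (version 4.x, where the size parameter is called vector_size); the toy corpus is invented for illustration and far too small for useful vectors.

# Train a tiny Word2Vec model with gensim (illustrative only; real models need large corpora)
from gensim.models import Word2Vec

# A toy corpus: each "sentence" is a list of tokens
corpus = [
    ["the", "movie", "was", "great", "and", "fun"],
    ["the", "film", "was", "great", "and", "enjoyable"],
    ["the", "plot", "was", "boring", "and", "slow"],
]

model = Word2Vec(sentences=corpus, vector_size=50, window=3, min_count=1, epochs=100)

# Words used in similar contexts end up with similar vectors
print(model.wv.similarity("movie", "film"))
print(model.wv.most_similar("great", topn=3))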

3. Probabilistic Language Model


Another approach to word and sequence analysis is the probabilistic language model. The goal of a probabilistic language model is to calculate the probability of a sentence as a sequence of words. For example, the probability of the word "a" occurring given the previous word "to" is 0.00013131 percent.
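As a rough illustration of how such a conditional probability can be estimated from counts (a maximum-likelihood bigram model; the tiny corpus below is made up), consider:

from collections import Counter

# Tiny made-up corpus; a real language model would use millions of sentences
tokens = "i want to go to a movie and to a party".split()

unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))

def bigram_prob(word, prev):
    """P(word | prev) estimated as count(prev, word) / count(prev)."""
    return bigrams[(prev, word)] / unigrams[prev]

print(bigram_prob("a", "to"))  # P("a" | "to") in this toy corpus: 2/3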


4. Sequence Labeling
Sequence labeling is a typical NLP task that assigns a class or label to each token in a given input sequence. If someone says "play the movie by tom hanks", sequence labeling produces [play, movie, tom hanks]: "play" indicates an action, "movie" is the object of that action, and "Tom Hanks" is a search entity. The system divides the input into tokens and can use an LSTM to analyze it. There are two forms of sequence labeling: token labeling and span labeling. The best-known example is Amazon Alexa.
Parsing is a phase of NLP where the parser determines the syntactic structure of a text by analyzing its constituent words based on an underlying grammar. For example, "tom ate an apple" will be divided into tom (proper noun), ate (verb), an (determiner), apple (noun).
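A minimal sketch of this kind of analysis using NLTK's off-the-shelf POS tagger (it assumes the 'punkt' and 'averaged_perceptron_tagger' resources have been downloaded):

import nltk

# One-time downloads (uncomment on first run)
# nltk.download('punkt')
# nltk.download('averaged_perceptron_tagger')

tokens = nltk.word_tokenize("Tom ate an apple")
print(nltk.pos_tag(tokens))
# e.g. [('Tom', 'NNP'), ('ate', 'VBD'), ('an', 'DT'), ('apple', 'NN')]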
We have discussed how text is classified and how words and sequences are divided so that an algorithm can understand and categorize them. In this project, we perform sentiment analysis of fifty thousand IMDB movie reviews. Our goal is to identify whether a review posted on the IMDB site by a user is positive or negative.
This project covers text mining techniques like Text Embedding, Bags of Words, word context,
and other things. We will also cover the introduction of a bidirectional LSTM sentiment
classifier. We will also look at how to import a labeled dataset from TensorFlow automatically.
This project also covers steps like data cleaning, text processing, data balance through sampling,
and train and test a deep learning model to classify text.

Parsing
A parser determines the syntactic structure of a text by analyzing its constituent words based on an underlying grammar. It divides groups of words into component parts and separates individual words.
Semantic
Text is at the heart of how we communicate. What is really difficult is understanding what is being said in written or spoken conversation. Understanding lengthy articles and books is even more difficult. Semantic analysis is a process that seeks to understand linguistic meaning by constructing a model of the principles that the speaker uses to convey meaning. It has been used in customer feedback analysis, article analysis, fake news detection, and so on.
Example Application
Here is the code Sample:
Importing necessary library
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra


import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory


# For example, running this (by clicking run or pressing Shift+Enter) will list all files
# under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved
# as output when you create a version using "Save & Run All"
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of
# the current session

#Importing require Libraries


import os

import matplotlib.pyplot as plt


import nltk
from tkinter import *
import seaborn as sns
import matplotlib.pyplot as plt
sns.set()
import scipy

import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_datasets as tfds
from tensorflow.python import keras

from tensorflow.keras.models import Sequential


from tensorflow.keras.layers import Dense, Embedding, LSTM

from sklearn.model_selection import train_test_split


from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

Downloading necessary file


# This cell takes time; please run it once.
# Split the training set into 60% and 40%, so we'll end up with 15,000 examples
# for training, 10,000 examples for validation and 25,000 examples for testing.
original_train_data, original_validation_data, original_test_data = tfds.load(
    name="imdb_reviews",
    split=('train[:60%]', 'train[60%:]', 'test'),
    as_supervised=True)
Getting word index from Keras datasets
# Tokenizing via the Keras IMDB word index
word_index = tf.keras.datasets.imdb.get_word_index(
    path='imdb_word_index.json'
)
In [8]:
{k:v for (k,v) in word_index.items() if v < 20}
Out[8]:
{'with': 16, 'i': 10, 'as': 14, 'it': 9, 'is': 6, 'in': 8, 'but': 18, 'of': 4, 'this': 11, 'a': 3, 'for': 15, 'br':
7, 'the': 1, 'was': 13, 'and': 2, 'to': 5, 'film': 19, 'movie': 17, 'that': 12}
Positive and Negative Review Comparison

Creating Train, Test Data

Model and Model Summary
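The original manual shows the model as a screenshot; as a stand-in, here is one plausible architecture for a bidirectional LSTM sentiment classifier in Keras (the layer sizes and the vocab_size value are illustrative assumptions, not the manual's exact model; it reuses the tensorflow-as-tf import from above).

# Illustrative bidirectional LSTM sentiment model (assumed sizes, not the original screenshot)
vocab_size = 10000   # assumed vocabulary size
embedding_dim = 64

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
    tf.keras.layers.Dense(16, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')   # positive vs negative review
])

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()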


Splitting data and fitting the model

Model effect Overview

Confusion Matrix and Correlation Report


Note: The data source and data for this model are publicly available and can be accessed using TensorFlow.
For the complete code and details, please follow this GitHub Repository.
In conclusion, NLP is a field full of opportunities. NLP has a tremendous effect on how we analyze text and speech, and it is getting better every day. Knowledge extraction from large data sets was impractical five years ago; the rise of NLP techniques has made it possible and easy. There are still many opportunities to discover in NLP.

TEXT SUMMARIZATION:
Text summarization is a very useful and important part of Natural Language Processing (NLP). First, let us talk about what text summarization is. Suppose we have too many lines of text data in any form, such as articles, magazines, or social media posts. Because time is scarce, we want only a concise report of that text. We can summarize the text in a few lines by removing unimportant parts and converting the same text into a smaller semantic form.
Now let us see how we can implement NLP in our programming. We will take a look at all the
approaches later, but here we will classify approaches of NLP.
TEXT SUMMARIZATION
In this approach we build algorithms or programs which will reduce the text size and create a
summary of our text data. This is called automatic text summarization in machine learning.
Text summarization is the process of creating shorter text without removing the semantic
structure of text.
There are two approaches to text summarization.
Extractive approaches
Abstractive approaches
EXTRACTIVE APPROACHES:
Using an extractive approach we summarize our text on the basis of simple and traditional
algorithms. For example, when we want to summarize our text on the basis of the frequency
method, we store all the important words and frequency of all those words in the dictionary. On
the basis of high frequency words, we store the sentences containing that word in our final
summary. This means the words which are in our summary confirm that they are part of the
given text.
ABSTRACTIVE APPROACHES:
An abstractive approach is more advanced. Depending on time requirements, we replace some sentences with shorter sentences that preserve the semantics of the original text.

Here we generally use deep learning models, such as transformers, bidirectional transformers (BERT), GPT, etc.
EXTRACTIVE APPROACHES:
We will take a look at a few machine learning models below.
TEXT SUMMARIZATION USING THE FREQUENCY METHOD
In this method we find the frequency of all the words in our text data and store the words and their frequencies in a dictionary. After that, we tokenize our text data into sentences. The sentences which contain more high-frequency words will be kept in the final summary.
import pandas as pd
import numpy as np
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
# nltk.download('punkt') and nltk.download('stopwords') may be needed on first run

data = ("my name is shubham kumar shukla. It is my pleasure to got opportunity "
        "to write article for xyz related to nlp")

def solve(text):
    stopwords1 = set(stopwords.words("english"))
    words = word_tokenize(text)

    # Build a frequency table of non-stopword tokens
    freqTable = {}
    for word in words:
        word = word.lower()
        if word in stopwords1:
            continue
        if word in freqTable:
            freqTable[word] += 1
        else:
            freqTable[word] = 1

    # Score each sentence by the frequencies of the words it contains
    sentences = sent_tokenize(text)
    sentenceValue = {}
    for sentence in sentences:
        for word, freq in freqTable.items():
            if word in sentence.lower():
                if sentence in sentenceValue:
                    sentenceValue[sentence] += freq
                else:
                    sentenceValue[sentence] = freq

    # Keep sentences scoring well above the average in the final summary
    sumValues = 0
    for sentence in sentenceValue:
        sumValues += sentenceValue[sentence]
    average = int(sumValues / len(sentenceValue))

    summary = ''
    for sentence in sentences:
        if (sentence in sentenceValue) and (sentenceValue[sentence] > (1.2 * average)):
            summary += " " + sentence
    return summary
Sumy:
Sumy is a Python summarization library; below we use its TextRank-based summarizer.
# Load packages
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.text_rank import TextRankSummarizer

# Creating text parser using tokenization (`text` holds the document to summarize)
parser = PlaintextParser.from_string(text, Tokenizer("english"))

# Summarize using sumy TextRank (2 sentences)
summarizer = TextRankSummarizer()
summary = summarizer(parser.document, 2)

text_summary = ""
for sentence in summary:
    text_summary += str(sentence)

print(text_summary)
LexRank:
This is an unsupervised machine learning approach in which we use a TextRank-style graph method to select summary sentences. Using cosine similarity over vector representations, it finds the sentences that are most similar to the rest of the document and groups similar content together.
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lex_rank import LexRankSummarizer

def sumy_method(text):
    parser = PlaintextParser.from_string(text, Tokenizer("english"))
    summarizer = LexRankSummarizer()
    # Summarize the document with 2 sentences
    summary = summarizer(parser.document, 2)
    dp = []
    for i in summary:
        lp = str(i)
        dp.append(lp)
    final_sentence = ' '.join(dp)
    return final_sentence
Using Luhn:
This approach is based on the frequency method; here we find TF-IDF (term frequency inverse
document frequency).
from sumy.summarizers.luhn import LuhnSummarizer

def lunh_method(text):
    parser = PlaintextParser.from_string(text, Tokenizer("english"))
    summarizer_luhn = LuhnSummarizer()
    summary_1 = summarizer_luhn(parser.document, 2)
    dp = []
    for i in summary_1:
        lp = str(i)
        dp.append(lp)
    final_sentence = ' '.join(dp)
    return final_sentence
LSA
Latent Semantic Analysis (LSA) is based on decomposing the data into a low-dimensional space. LSA is able to preserve the semantics of the given text while summarizing.
from sumy.summarizers.lsa import LsaSummarizer

def lsa_method(text):
    parser = PlaintextParser.from_string(text, Tokenizer("english"))
    summarizer_lsa = LsaSummarizer()
    summary_2 = summarizer_lsa(parser.document, 2)
    dp = []
    for i in summary_2:
        lp = str(i)
        dp.append(lp)
    final_sentence = ' '.join(dp)
    return final_sentence

CHAT BOT:
Three Pillars of an NLP Based Chatbot
Now it's time to take a closer look at all the core elements that make NLP chatbot happen.
1) Dialog System
To communicate, people use mouths to speak, ears to hear, fingers to type, and eyes to read. A chatbot, too, needs an interface compatible with the ways humans receive and share information. That is what we call a dialog system, or conversational agent.
There are no set dialog system components.
But for a dialog system to truly be a dialog system, it has to be capable of producing output and accepting input. Other than that, dialog systems can take a variety of forms. You can differentiate them based on:
Modality (text-based, speech-based, graphical or mixed)
Device
Style (command-based, menu-driven and - of course - natural language)
Initiative (system, user, or mixed)

2) Natural Language Understanding


So, you already know NLU is an essential sub-domain of NLP and have a general idea
of how it works.
Still, it’s important to point out that the ability to process what the user is saying is
probably the most obvious weakness in NLP based chatbots today. Human
languages are just way too complex. Besides enormous vocabularies, they are filled
with multiple meanings many of which are completely unrelated.
To nail the NLU is more important than making the bot sound 110% human with
impeccable NLG.
Why?
If a bot understands the users and fulfils their intent, most won’t care if that response is a bit
taciturn… It doesn't work the other way around. A bot that can’t derive meaning from the natural
input efficiently can have the smoothest small talk skills and nobody will care. Not even a little!
3) Natural Language Generation
Given that the NLP chatbot has successfully parsed and understood the user's input, its programming will determine an appropriate response and "translate" it back to natural language. Needless to say, that response doesn't appear out of thin air.
For the NLP to produce a human-friendly narrative, the format of the content must be outlined, be it through rules-based workflows, templates, or intent-driven approaches. In other words, the bot must have something to work with in order to create that output.
Currently, every NLG system relies on narrative design, also called conversation design, to produce that output. This narrative design is guided by rules known as "conditional logic".
These rules trigger different outputs based on which conditions are being met and which are not.
Do You Need an NLP Chatbot?
Let’s be clear.
Using NLP for simple and straightforward use cases is over the top and completely
unnecessary.
In fact, if used in an inappropriate context, a natural language processing chatbot can be an absolute buzzkill and hurt rather than help your business. If a task can be accomplished in just a couple of clicks, making the user type it all up is most certainly not making things easier.
On the other hand, if the alternative means presenting the user with an excessive number of options at once, an NLP chatbot can be useful. It can save your clients from confusion and frustration by simply asking them to type or say what they want.
It’s not much different from coming up to the staff member at the counter in the real world. AI is
cool but if it fails to be useful, no one will really care how “modern” your company is.
What Can NLP Chatbots Learn From Rule-Based Bots
There are many who will argue that a chatbot not using AI and natural language isn't even a chatbot but just a mere auto-response sequence on a messaging-like interface.
You have two choices here.
You can decide to stay hung up on nomenclature, or create a chatbot capable of completing tasks, achieving goals and delivering results. Being obsessed with the purity of the AI bot experience is just not good for business.
In fact, when it comes down to it, your NLP bot can learn A LOT about efficiency and
practicality from those rule-based “auto-response sequences” we dare to call chatbots.
1) Constrain the Input & Leverage Rich Controls
Why would you make anyone type out a message if a quick tap or click can do the trick?
At times, constraining user input can be a great way to focus and speed up query resolution.
So, when logical, falling back upon rich elements such as buttons, carousels or quick replies
won’t make your bot seem any less intelligent.
On the contrary. Besides speed, rich controls also help to reduce users' cognitive load, so they don't need to wonder about the right thing to say or ask. When in doubt, always opt for simplicity.
2) Do the Dialog Flow Diagram
NLP bots generate responses based on user inputs.
So, technically, designing a conversation doesn't require you to draw up a diagram of the conversation flow. However, having a branching diagram of the possible conversation paths helps you think through what you are building.
Consequently, it's easier to design a natural-sounding, fluent narrative. You can draw up your map the old-fashioned way or use a digital tool. Either Landbot's visual bot builder or any mind-mapping software will serve the purpose well.
3) Define End to the Conversation
Lack of a conversation ender can easily become an issue, and you would be surprised how many NLP chatbots actually don't have one.
If the user isn't sure whether or not the conversation has ended, your bot might end up looking stupid, or it will force you to work on further intents that would otherwise have been unnecessary.
Rule and choice-based bots don’t have this issue. Once it’s over, it’s over.
No confusion there. Hence, make it clear the conversation has ended, verbally or visually.
You can even offer additional instructions to relaunch the conversation.
4) Don’t Get Caught Up in Handling Corner Cases
Bots that don’t use AI don’t care about corner cases. They take people down one of the outlined
paths and it's over.
There is a lesson here… don’t hinder the bot creation process by handling corner cases.
Especially so if you are still in your prototyping phase.
Focus on developing the core intents and developing them well. Don't waste your time focusing
on use cases that are highly unlikely to occur any time soon. You can come back to those when
your bot is popular and the probability of that corner case taking place is more significant.
If you really want to feel safe, if the user isn’t getting the answers he or she wants, you can set up
a trigger for human agent takeover.
5) Design for an Easy Conversation Restart
Include a restart button and make it obvious. Just because it's a supposedly intelligent natural language processing chatbot, that doesn't mean users can't get frustrated with it or make the conversation "go wrong".
Save your users/clients/visitors the frustration and allow them to restart the conversation whenever they see fit.
Can you Build NLP Chatbot Without Coding?
Unfortunately, a no-code natural language processing chatbot is still a fantasy. You need an
experienced developer/narrative designer to build the classification system and train the bot to
understand and generate human-friendly responses.
However, there are tools that can help you significantly simplify the process.
For example, one of the most widely used NLP chatbot development platforms is
Google’s Dialogflow which connects to the Google Cloud Platform.
By giving developers/narrative designers a clean and user-friendly interface and taking care of the natural language processing, machine learning and other deeper concepts "behind the scenes", it allows you to focus on the conversation flow and build bots.
NLP is far from being simple even with the use of a tool such as DialogFlow. However, it does
make the task at hand more comprehensible and manageable.
Another thing you can do to simplify your NLP chatbot building process is using a visual no-
code bot builder - like Landbot - as your base in which you integrate the NLP element.
We might be a bit biased about Landbot. BUT, when it comes to streamlining the entire process
of bot creation, it’s hard to argue against it. While the builder is usually used to create a choose-
your-adventure type of conversational flows, it does allow for Dialogflow integration.
This means you can offer your users a smart NLP based assistant while taking advantage of
no-code features in the creation and management process such as digital tool integration;
human takeover; rich controls; front-end bot design; web/website/messaging app integration and
so forth…

Even better?
The use of Dialogflow and a no-code chatbot building platform like Landbot allows you
to combine the smart and natural aspects of NLP with the practical and functional aspects
of choice-based bots.

PLAGIARISM:
Plagiarism is rampant on the internet and in the classroom. With so much content out there, it’s
sometimes hard to know when something has been plagiarized. Authors writing blog posts may
want to check if someone has stolen their work and posted it elsewhere. Teachers may want to
check students’ papers against other scholarly articles for copied work. News outlets may want to
check if a content farm has stolen their news articles and claimed the content as its own.
So, how do we guard against plagiarism? Wouldn’t it be nice if we could have software do the
heavy lifting for us? Using machine learning, we can build our own plagiarism checker that
searches a vast database for stolen content. In this article, we’ll do exactly that.
We’ll build a Python Flask app that uses Pinecone — a similarity search service — to find
possibly plagiarized content.
Demo App Overview
Let’s take a look at the demo app we’ll be building today. Below, you can see a brief animation of
the app in action.
The UI features a simple textarea input in which the user can paste the text from an article. When
the user clicks the Submit button, this input is used to query a database of articles. Results and
their match scores are then displayed to the user. To help reduce the amount of noise, the app also
includes a slider input in which the user can specify a similarity threshold to only show extremely
strong matches.

Demo app — plagiarism checker


As you can see, when original content is used as the search input, the match scores for possibly
plagiarized articles are relatively low. However, if we were to copy and paste the text from one of
the articles in our database, the results for the plagiarized article come back with a 99.99% match!
So, how did we do it?
In building the app, we start with a dataset of news articles from Kaggle. This dataset contains
143,000 news articles from 15 major publications, but we’re just using the first 20,000. (The full
dataset that this one is derived from contains over two million articles!)
Next, we clean up the dataset by renaming a couple columns and dropping a few unnecessary
ones. Then, we run the articles through an embedding model to create vector embeddings —
that’s metadata for machine learning algorithms to determine similarities between various inputs.
We use the Average Word Embeddings Model. Finally, we insert these vector embeddings into
a vector database managed by Pinecone.
With the vector embeddings added to the database and indexed, we’re ready to start finding
similar content. When users submit their article text as input, a request is made to an API
endpoint that uses Pinecone’s SDK to query the index of vector embeddings. The endpoint
returns 10 similar articles that were possibly plagiarized and displays them in the app’s UI.
That’s it! Simple enough, right?
If you’d like to try it out for yourself, you can find the code for this app on GitHub.
The README contains instructions for how to run the app locally on your own machine.
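The app itself uses Pinecone and the Average Word Embeddings Model, but the underlying idea of scoring similarity between a submitted text and stored documents can be illustrated with a simpler, self-contained sketch using TF-IDF vectors and cosine similarity (this is a stand-in for illustration, not the app's actual code):

# Simplified similarity search for plagiarism checking (illustrative stand-in)
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# A tiny made-up "database" of articles; the real app indexes 20,000 news articles
articles = [
    "The central bank raised interest rates to curb inflation.",
    "A new species of frog was discovered in the Amazon rainforest.",
    "The football team won the championship after a dramatic final.",
]

vectorizer = TfidfVectorizer(stop_words="english")
article_vectors = vectorizer.fit_transform(articles)

# Text submitted by the user (here, a near-copy of the first article)
query = "The central bank has raised interest rates in order to curb inflation."
query_vector = vectorizer.transform([query])

# Higher cosine similarity suggests possible plagiarism
scores = cosine_similarity(query_vector, article_vectors)[0]
for article, score in sorted(zip(articles, scores), key=lambda p: -p[1]):
    print(f"{score:.2f}  {article}")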
Spelling & Grammar checkers:

Summary of approaches to Grammar Error Correction (GEC). Source: Adapted from Ailani et al. 2019, figs. 1-4.
A well-written article with correct grammar, punctuation and spelling along with an appropriate
tone and style to match the needs of the intended reader or community is always important.
Software tools offer algorithm-based solutions for grammar and spell checking and correction.
Classical rule-based approaches employ a dictionary of words along with a set of rules. Recent neural network-based approaches learn from millions of published articles and offer suggestions for an appropriate choice of words and ways to phrase parts of sentences to adjust the tone, style and semantics of the sentence. They can adapt suggestions to the publication domain of the article, such as academic, news, etc.
Grammar and spelling correction are tasks that belong to a more general NLP process
called lexical disambiguation.
Discussion

What is a grammar and spell checker, and what are its general tasks and uses?

Illustrating grammar and spell checks and suggested corrections. Source: Devopedia 2021.
A grammar and spell checker is a software tool that checks a written text for grammatical
mistakes, appropriate punctuation, misspellings, and issues related to sentence structure. More
recently, neural network-based tools also evaluate tone, style, and semantics to ensure that the
writing is flawless.
Often such tools offer a visual indication by highlighting or underlining spelling and grammar
errors in different colors (often red for spelling and blue for grammar). Upon hovering or
clicking on the highlighted parts, they offer appropriately ranked suggestions to correct those
errors. Certain tools offer a suggestive corrected version by displaying correction as strikeout in
an appropriate color.
Such tools are used to improve writing, produce engaging content, and for assessment and
training purposes. Several tools also offer style correction to adapt the article for specific
domains like academic publications, marketing, and advertising, legal, news reporting, etc.
However, till today, no tool is a perfect alternative to an expert human evaluator.
What are some important terms relevant to a grammar and spell checker?
The following NLP terms and approaches are relevant to grammar and spell checker:
Part-of-Speech (PoS) tagging marks words as noun, verb, adverb, etc. based on definition and
context.
Named Entity Recognition (NER) is labeling a sequence of text into predefined categories such
as name, location, etc. Labels help determine the context of words around them.
Confusion Set is a set of probable words that can appear in a certain context, e.g. set of articles
before a noun.
N-Gram is a sub-sequence of n words or tokens. For example, "The sun is bright" has these 2-
grams: {"the sun", "sun is", "is bright"}.
Parallel Corpus is a collection of text placed alongside its translation, e.g. text with errors and
its corresponding corrected version(s).
Language Model (LM) determines the probability distribution over a sequence of words. It says
how likely is a particular sequence of words.
Machine Translation (MT) is a software approach to translate one sequence of text into
another. In grammar checking, this refers to translating erroneous text into correct text.
What are the various types of grammar and spelling errors?

Types of grammar and spelling errors. Source: Soni and Thakur 2018, fig. 3.
We describe the following types:
Sentence Structure: Parts of speech are organized incorrectly. For example, "she began to
singing" shows misplaced 'to' or '-ing'. Dependent clause without the main clause, run-on
sentence due to missing conjunction, or missing subject are some structural errors.
Syntax Error: Violation of rules of grammar. These can be in relation to subject-verb agreement,
wrong/missing article or preposition, verb tense or verb form error, or a noun number error.
Punctuation Error: Punctuation marks like comma, semi-colon, period, exclamation, question
mark, etc. are missing, unnecessary, or wrongly placed.
Spelling Error: Word is not known in the dictionary.
Semantic Error: Grammar rules are followed but the sentence doesn't make sense, often due to a
wrong choice of words. "I am going to the library to buy a book" is an example where 'bookstore'
should replace 'library'. Rule-based approaches typically can't handle semantic errors. They
require statistical or machine learning approaches, which can also flag other types of errors.
Often a combination of approaches leads to a good solution.

Classical methods of spelling correction match words against a given dictionary, an approach critics consider unreliable because it cannot detect incorrect use of correctly spelled words, nor handle correct words that are not in the dictionary, like technical words, acronyms, etc.
Grammar checkers use hand-coded grammar rules on PoS-tagged text to decide whether sentences are correct or incorrect. For instance, the rule I + Verb (3rd person, singular form) corresponds to incorrect verb form usage, as in the phrase "I has a dog." These methods provide detailed explanations of flagged errors, making them helpful for learning. However, rule maintenance is tedious and devoid of context.
Statistical approaches validate parts of a sentence (n-grams) against their presence in a corpus. These approaches can flag words used out of context. However, it's challenging to provide detailed explanations, and their effectiveness is limited by the choice of corpora.
The noisy channel model is one statistical approach. An LM based on trigrams and bigrams gives better results than just unigrams. Where rare words are wrongly corrected, using a blacklist of words or a probability threshold can help.
What are Machine Learning-based methods for implementing grammar and spell checkers?
ML-based approaches are either Classification (discriminative) or Machine Translation
(generative).
Classification approaches work with well-defined errors. Each error type (article, preposition, etc.) requires training a separate multi-class classifier. For example, a preposition error classifier takes n-grams associated with prepositions in a sentence and outputs a score for every candidate preposition in the confusion set. Contextual corrections also consider features like PoS and NER. A model can be a linear classifier like a Support Vector Machine (SVM), an n-gram LM-based or Naïve Bayes classifier, or even a DNN-based classifier.
Machine Translation approaches can be Statistical Machine Translation (SMT) or Neural
Machine Translation (NMT). Both these use parallel corpora to train a sequence-to-sequence
model, where text with errors translates to corrected text. NMT uses encoder-decoder
architecture, where an encoder determines a latent vector for a sentence based upon the input
word embeddings. The decoder then generates target tokens from the latent vector and relevant
surrounding input and output tokens (attention). These benefit from transfer learning and
advancements in transformer-based architecture. Editor models reduce training time by
outputting edits to input tokens from a reduced confusion set instead of generating target tokens.
How can I train an NMT model for grammar and spell checking?

Training an NMT for GEC. Source: Adapted from Naghshnejad et al. 2020, fig. 3, fig. 5, table 4.
In general, NMT requires training an encoder-decoder model using cross-entropy as the loss
function by comparing maximum likelihood output to the gold standard correct output. To train a
good model requires a large number of parallel corpora and compute capacity. Transformers are
attention-based deep seq2seq architectures. Pre-trained language models generated by
transformer architectures like BERT provide contextual embeddings to find the most likely token
given the surrounding tokens, making it useful to flag contextual errors in an n-gram.
Transfer learning via fine tuning weights of a transformer using the parallel corpus of incorrect
to correct examples makes it suitable for GEC use. Pre-processing or pre-training with synthetic
data improves the performance and accuracy. Further enhancements can be to use separate heads
for different types of errors.
Editor models are better as they output edit sequences instead of corrected versions. Training
and testing of editor models require the generation of edit sequences from source-target parallel
texts.
What datasets are available for training and evaluation of grammar and spell check models?
MT or classification models need datasets with annotated errors. NMT requires a large amount
of data.
Lang-8, the largest available parallel corpus, has 100,051 English entries. The Corpus of Linguistic Acceptability (CoLA) is a dataset of sentences labeled as either grammatically correct or incorrect; it can be used, for example, to fine-tune a pre-trained model. GitHub Typo Corpus is harvested from GitHub and contains errors and their corrections.
Benchmarking data in Standard Generalized Markup Language (SGML) format is
available. Sebastian Ruder offers a detailed list of available benchmarking test datasets along
with the various models (publications and source code).
Noise models use transducers to produce erroneous sentences from correct ones with a specified probability. They induce various error types to generate a larger dataset from a smaller one, for example by replacing a word with one from its confusion set, misplacing or removing punctuation, or inducing spelling, tense, noun number, or verb form mistakes. Round-trip MT, such as English-German-English translation, can also generate parallel corpora. Wikipedia edit sequences offer millions of consecutive snapshots to serve as source-target pairs. However, only a tiny fraction of those edits are language related.
How do I annotate or evaluate the performance of grammar and spell checkers?
The ERRor ANnotation Toolkit (ERRANT) enables suggestions with explanations. It automatically annotates parallel English sentences with error type information, thereby standardizing parallel datasets and facilitating detailed error type evaluation.
Training and evaluation require comparing the output to the target gold standard and giving a
numerical measure of effectiveness or loss. Editor models have an advantage as the sequence
length of input and output is the same. Unequal sequences need alignment with the insertion of
empty tokens.
The Max-Match (M2) scorer determines the smallest edit sequence out of the multiple possible ways to arrive at the gold standard, using the notion of Levenshtein distance. Evaluation is done by computing precision, recall, and the F1 measure between the set of system edits and the set of gold edits for all sentences, after aligning the sequences to the same length.
Dynamic programming can also align multiple sequences to the gold standard when there is
more than one possible correct outcome.
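For reference, here is a minimal dynamic-programming sketch of the Levenshtein (edit) distance computation that underlies such alignment and scoring (token-level or character-level distance only; a real M2 scorer does more than this):

def levenshtein(src, tgt):
    """Minimum number of insertions, deletions and substitutions to turn src into tgt."""
    m, n = len(src), len(tgt)
    # dp[i][j] = edit distance between src[:i] and tgt[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if src[i - 1] == tgt[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution / match
    return dp[m][n]

# Works on characters or on token lists
print(levenshtein("I has a dog".split(), "I have a dog".split()))  # 1 edit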
Could you mention some tools or libraries that implement grammar and spell checking?
GNU Aspell is a standard utility used in GNU OS and other UNIX-like OS. Hunspell is a spell
checker that's part of popular software such as LibreOffice, OpenOffice.org, Mozilla Firefox 3 &
Thunderbird, Google Chrome, and more. Hunspell itself is based on MySpell. Hunspell can use
one or more dictionaries, stemming, morphological analysis, and Unicode text.
Python packages for spell checking include pyspellchecker, textblob and autocorrect.
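A quick sketch of one of these packages, pyspellchecker, under the assumption that it is installed (pip install pyspellchecker):

# Basic spell checking with the pyspellchecker package
from spellchecker import SpellChecker

spell = SpellChecker()  # loads an English word-frequency list by default

words = ["speling", "korrect", "language"]
misspelled = spell.unknown(words)  # words not found in the dictionary

for word in misspelled:
    print(word, "->", spell.correction(word))     # most likely correction
    print("candidates:", spell.candidates(word))  # other plausible corrections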
A search for "grammar spell" on GitHub brings up useful dictionaries or code implemented in
various languages. There's a converter from British to American English. Spellcheckr is a
JavaScript implementation for web frontends.
Deep learning models include Textly-DRF-API and GECwBERT.
Many online services or offline software also exist: WhiteSmoke from 2002, LanguageTool from
2005, Grammarly from 2009, Ginger from 2011, Reverso from 2013, and Trinka from 2020.
Trinka focuses on an academic style of writing. Grammarly focuses on suggestions in terms of
writing style, clarity, engagement, delivery, etc.
Milestones
1960

Abbreviation ABBT maps incorrect word 'absorbant' to the correct word 'absorbent'. Source:
Blair 1960.
Blair implements a simple spelling corrector using heuristics and a dictionary of correct words.
Incorrect spellings are associated with the corrected ones via abbreviations that indicate
similarity between the two. Blair notes that this is in some sense a form of pattern recognition. In
one experiment, the program successfully corrects 89 of 117 misspelled words. In general,
research interest in spell checking and correction begins in the 1960s.
1971
R. E. Gorin writes Ispell in PDP-10 assembly. Ispell becomes the main spell-checking program
for UNIX. Ispell is also credited with introducing the generalized affix description system. Much
later, Geoff Kuenning implements a C++ version with support for many European languages.
This is called International Ispell. GNU Aspell, MySpell and Hunspell are other software
inspired by Ispell.
1980

Evolution of GEC. Source: Naghshnejad et al. 2020, fig 1.


In the 1980s, GEC systems are syntax-based systems, such as EPISTLE. They determine the
syntactic structure of each sentence and the grammatical functions fulfilled by various phrases.
They detect several classes of grammatical errors, such as disagreement in number between the
subject and the verb.
1990
This decade focuses on simple linear classifiers to flag incorrect choice of articles or statistical
methods to identify and flag use of commonly confused words. Confusion can be due to identical
sounding words, typos etc.
2000
Rule-based methods evolve in the 2000s. Rule generation is based on parse trees, designed
heuristically or based on linguistic knowledge or statistical analysis of erratic texts. These
methods don't generalize to new types of errors. New rules need to be constantly added.
2005
The mid-2000s sees methods to record and create aligned corpora of pre- and post-
editing ESL (English as a Second Language) writing samples. SMTs offer improvement in
identifying and correcting writing errors. GEC sees the use of semantic and syntactic features
including PoS tags and NER information for determining the applicable correction. Support
Vector Machines (SVMs), n-gram LM-based and Naïve Bayes classifiers are used to predict the
potential correction.
2010
DNN-based classifier approaches are proposed in the 2000s and early 2010s. However, a specific set of error types has to be defined, and typically only well-defined errors can be addressed with
these approaches. SMT models learn mappings from source text to target text using a noisy
channel model. SMT-based GEC models use parallel corpora of erratic text and grammatically
correct version of the same text in the same language. Open-source SMT engines are available
online and include Moses, Joshua and cdec.
2016
Neural Machine Translation (NMT) shows better prospects by capturing some learner errors
missed by SMT models. This is because NMT can encode structural patterns from training data
and is more likely to capture an unseen error.
2018
With the advent of attention-based transformer architecture in 2017, its application
to GEC gives promising results.
2019
Methods to improve the training data by text augmentation of various types, including cyclic
machine translation, emerge. These improve the performance of GEC tools significantly and
enable better flagging of style or context-based errors or suggestions. Predicting edits instead of
tokens allows the model to pick the output from a smaller confusion set. Thus, editor models lead
to faster training and inference of GEC models.
Sample Code
SPELLING-CORRECTOR
# Source: https://norvig.com/spell-correct.html
# Accessed 2021-04-25
# This is Peter Norvig's implementation from 2007.
# It relies on big.txt, a file of about a million words.
import re
from collections import Counter

def words(text): return re.findall(r'\w+', text.lower())

WORDS = Counter(words(open('big.txt').read()))

def P(word, N=sum(WORDS.values())):
    "Probability of `word`."
    return WORDS[word] / N

def correction(word):
    "Most probable spelling correction for word."
    return max(candidates(word), key=P)

def candidates(word):
    "Generate possible spelling corrections for word."
    return (known([word]) or known(edits1(word)) or known(edits2(word)) or [word])

def known(words):
    "The subset of `words` that appear in the dictionary of WORDS."
    return set(w for w in words if w in WORDS)

def edits1(word):
    "All edits that are one edit away from `word`."
    letters    = 'abcdefghijklmnopqrstuvwxyz'
    splits     = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes    = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces   = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts    = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def edits2(word):
    "All edits that are two edits away from `word`."
    return (e2 for e1 in edits1(word) for e2 in edits1(e1))

# usage:
correction('speling')    # spelling (single insertion)
correction('korrectud')  # corrected (double replacements)

SENTIMENT ANALYSIS:

What is sentiment analysis (opinion mining)?


Sentiment analysis, also referred to as opinion mining, is an approach to natural language
processing (NLP) that identifies the emotional tone behind a body of text. This is a popular way
for organizations to determine and categorize opinions about a product, service, or idea. It
involves the use of data mining, machine learning (ML) and artificial intelligence (AI) to mine
text for sentiment and subjective information.

Sentiment analysis systems help organizations gather insights from unorganized and unstructured
text that comes from online sources such as emails, blog posts, support tickets, web chats, social
media channels, forums and comments. Algorithms replace manual data processing by
implementing rule-based, automatic or hybrid methods. Rule-based systems perform sentiment
analysis based on predefined, lexicon-based rules while automatic systems learn from data with
machine learning techniques. A hybrid sentiment analysis combines both approaches.
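As a small example of the rule-based (lexicon-based) style of sentiment analysis, the sketch below uses NLTK's VADER analyzer, assuming the vader_lexicon resource has been downloaded:

import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# nltk.download('vader_lexicon')  # one-time download

analyzer = SentimentIntensityAnalyzer()

for text in ["The battery life is terrible and the screen cracked in a week.",
             "Absolutely love this product, great value for money!"]:
    scores = analyzer.polarity_scores(text)  # neg / neu / pos / compound scores
    print(scores['compound'], text)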

In addition to identifying sentiment, opinion mining can extract the polarity (or the amount of
positivity and negativity), subject and opinion holder within the text. Furthermore, sentiment
analysis can be applied to varying scopes such as document, paragraph, sentence and sub-
sentence levels.

Vendors that offer sentiment analysis platforms or SaaS products include Brandwatch, Hootsuite,
Lexalytics, NetBase, Sprout Social, Sysomos and Zoho. Businesses that use these tools can
review customer feedback more regularly and proactively respond to changes of opinion within
the market.

Types of sentiment analysis


Fine-grained sentiment analysis provides a more precise level of polarity by breaking it down
into further categories, usually very positive to very negative. This can be considered the opinion
equivalent of ratings on a 5-star scale.
Emotion detection identifies specific emotions rather than positivity and negativity. Examples
could include happiness, frustration, shock, anger and sadness.
Intent-based analysis recognizes actions behind a text in addition to opinion. For example, an online
comment expressing frustration about changing a battery could prompt customer service to reach out to
resolve that specific issue.
Aspect-based analysis gathers the specific component being positively or negatively mentioned. For
example, a customer might leave a review on a product saying the battery life was too short. Then, the
system will return that the negative sentiment is not about the product as a whole, but about the
battery life.
Applications of sentiment analysis
Sentiment analysis tools can be used by organizations for a variety of applications, including:
Identifying brand awareness, reputation and popularity at a specific moment or over time.
Tracking consumer reception of new products or features.
Evaluating the success of a marketing campaign.
Pinpointing the target audience or demographics.
Collecting customer feedback from social media, websites or online forms.
Conducting market research.
Categorizing customer service requests.
Challenges with sentiment analysis
Challenges associated with sentiment analysis typically revolve around inaccuracies in training
models. Objectivity, or comments with a neutral sentiment, tend to pose a problem for systems
and are often misidentified. For example, if a customer received the wrong color item and
submitted a comment "The product was blue," this would be identified as neutral when in fact it
should be negative.
Sentiment can also be challenging to identify when systems cannot understand the context or
tone. Answers to polls or survey questions like "nothing" or "everything" are hard to categorize
when the context is not given, as they could be labeled as positive or negative depending on the
question. Similarly, irony and sarcasm often cannot be explicitly trained and lead to falsely
labeled sentiments.
Computer programs also have trouble when encountering emojis and irrelevant information.
Special attention needs to be given to training models with emojis and neutral data so as to not
improperly flag texts.
Finally, people can be contradictory in their statements. Most reviews will have both positive and
negative comments, which is somewhat manageable by analyzing sentences one at a time.
However, the more informal the medium, the more likely people are to combine different
opinions in the same sentence and the more difficult it will be for a computer to parse.

QUESTION ANSWERING SYSTEM:


Introduction to Question-Answering Systems
Question answering is a critical NLP problem and a long-standing artificial intelligence
milestone. QA systems allow a user to express a question in natural language and get an
immediate and brief response. QA systems are now found in search engines and phone
conversational interfaces, and they’re fairly good at answering simple snippets of information.
On harder questions, however, they normally only go as far as returning a list of snippets
that we, the users, must then browse through to find the answer to our question.
Reading comprehension is the ability to read a piece of text and then answer questions about it.
Reading comprehension is difficult for machines because it requires both natural language
understanding and knowledge of the world.
SQuAD Dataset for building Question-Answering System
The Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset made
up of questions posed by crowd workers on a collection of Wikipedia articles, with the response
to each question being a text segment, or span, from the relevant reading passage, or the question
being unanswerable.
The reading sections in SQuAD are taken from high-quality Wikipedia pages, and they cover a
wide range of topics from music celebrities to abstract notions. A paragraph from an article is
called a passage, and it can be any length. Reading comprehension questions are included with
each passage in SQuAD. These questions are based on the passage’s content and can be
answered by reading it again. Finally, we have one or more answers to each question.
One of SQuAD's distinguishing features is that the answer to every question is a portion of text, or span, from the passage. A span can be a single word or a group of words, and it is not limited to named entities – any span of the passage is fair game.
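To get a feel for what these (context, question, answer span) triples look like, one convenient way to browse SQuAD is the Hugging Face datasets package (an assumption here; it is not part of the original setup):
from datasets import load_dataset

squad = load_dataset("squad", split="train[:3]")   # first three training examples
for ex in squad:
    print(ex["question"])
    print(ex["answers"]["text"], "| answer span starts at character", ex["answers"]["answer_start"])
    print(ex["context"][:100], "...")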
Question-Answering System
This is a very adaptable design, and it accommodates a wide range of questions.
Instead of having a list of options for each question, systems must choose the best answer from
all potential spans in the passage, which means they must deal with a vast number of
possibilities. Spans have the extra benefit of being simple to evaluate.

This span-based QA setting is quite natural. For many user questions typed into search engines, open-domain QA systems can typically identify the right documents that contain the answer. The remaining task is to find the shortest fragment of text in the passage or document that answers the query, which is the final phase of "answer extraction."
Problem Description for Question-Answering System
Given a new question and its context, the goal is to locate the span of text that answers it. This is a closed dataset, so the answer to a question is always part of the context, and it is always a continuous span. For the time being, the problem is divided into two pieces:
Finding the sentence that contains the correct answer (highlighted green in the SQuAD examples)
Extracting the correct answer span from that sentence (highlighted blue)
Each observation in the training set consists of a context, a question, and the answer text. One such observation is shown below.
Facebook Sentence Embedding
We already have word2vec, doc2vec, food2vec and node2vec, so why not sentence2vec? The main idea behind all of these embeddings is to represent entities numerically as vectors of various dimensions, making it easier for computers to use them in downstream NLP tasks.
Traditionally, a bag-of-words approach was used, which averages the vectors of all the words in a sentence: each sentence is tokenized into words, the vector for each word is looked up in GloVe embeddings, and the average of all these vectors is taken as the sentence vector. This method performs reasonably well, but it is not very accurate because it ignores word order.
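A minimal sketch of this averaging baseline is shown below; it uses spaCy's en_core_web_md model (which bundles GloVe-style vectors) instead of loading raw GloVe files, so the model choice is an assumption rather than part of the original pipeline.
import numpy as np
import spacy

nlp = spacy.load("en_core_web_md")   # medium model with 300-dimensional word vectors

def sentence_vector(text):
    doc = nlp(text)
    vectors = [tok.vector for tok in doc if tok.has_vector]
    # average of the word vectors; a zero vector if no word has a vector
    return np.mean(vectors, axis=0) if vectors else np.zeros(nlp.vocab.vectors_length)

print(sentence_vector("Remote sensing helps tribes protect their lands.").shape)   # (300,)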
This is where InferSent comes in. InferSent is a sentence-embedding method that generates semantic sentence representations. It is trained on natural language inference data and generalizes well to a wide range of tasks.
The procedure for building the model:
Build a vocabulary from the training data and use it to train the InferSent model.
I used Python 2.7 (with recent versions of NumPy/SciPy), a recent version of PyTorch, and NLTK >= 3.
If you want to download the model trained on AllNLI, then run:
curl -Lo encoder/infersent.allnli.pickle https://s3.amazonaws.com/senteval/infersent/infersent.allnli.pickle
Load the pre-trained model:
import nltk
nltk.download('punkt')
import torch

infersent = torch.load('InferSent/encoder/infersent.allnli.pickle',
                       map_location=lambda storage, loc: storage)
infersent.set_glove_path("InferSent/dataset/GloVe/glove.840B.300d.txt")
infersent.build_vocab(sentences, tokenize=True)

dict_embeddings = {}
for i in range(len(sentences)):
    print(i)
    dict_embeddings[sentences[i]] = infersent.encode([sentences[i]], tokenize=True)
Here sentences is your list of sentences. You can use infersent.update_vocab(sentences) to update your vocabulary, or infersent.build_vocab_k_words(K=100000) to load the K most common English words directly. If tokenize is set to True (the default), NLTK is used to tokenize the sentences.
We can use these embeddings for a variety of tasks in the future, such as determining whether
two sentences are similar.
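For instance, two of the sentence embeddings stored in dict_embeddings above can be compared with a simple cosine score (a sketch that assumes the dictionary built in the previous loop):
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

emb1 = dict_embeddings[sentences[0]][0]   # encode() returns an array of shape (1, 4096)
emb2 = dict_embeddings[sentences[1]][0]
print(cosine(emb1, emb2))                 # closer to 1.0 means more similar sentences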
Sentence Segmentation:
You can use Doc.has_annotation with the attribute name "SENT_START" to check whether a Doc has sentence boundaries. Here the paragraph is broken into meaningful sentences.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Environmentalists are concerned about the loss of biodiversity that will result from "
          "the destruction of the forest. They are also concerned about the release of the carbon "
          "contained within the vegetation. This release may accelerate global warming.")
assert doc.has_annotation("SENT_START")
for sent in doc.sents:
    print(sent.text)
Split the paragraph/context into individual sentences. spaCy and TextBlob are two common tools for handling text data; TextBlob was used here. Unlike spaCy's sentence detection, which can split naively at every period, TextBlob performs more intelligent splitting. Here is a real-life example:

The paragraph is split into 7 sentences using TextBlob.

Here each sentence is split up into separate tokens:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Environmentalists are concerned about the loss of biodiversity that will result from the "
          "destruction of the forest, and also about the release of the carbon contained within the "
          "vegetation, which could accelerate global warming.")
for token in doc:
    print(token.text)

Using the Infersent model, get the vector representation of each sentence and question.
Machine Learning Models
We tackle the problem with two key approaches: supervised learning and unsupervised learning. In the unsupervised approach the target variable is not used; instead, the sentence in the paragraph that is closest (minimum distance) to the given question is returned.
Unsupervised Learning Model
Let's see if we can use Euclidean distance to find the sentence that is closest to the question. This model's accuracy was roughly 45 per cent. The accuracy rose from 45 per cent to 63 per cent after switching to cosine similarity. This makes sense, because Euclidean distance does not take the alignment or angle of the vectors into account, whereas cosine similarity does; with vector representations, direction is crucial.
This strategy does not take advantage of the rich target labels we are given. Still, because of its simplicity, it produces a solid outcome with no training at all. Facebook's sentence embeddings deserve much of the credit for these results.
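A rough sketch of this unsupervised approach is shown below; it assumes the infersent encoder loaded earlier and simply returns the index of the sentence whose embedding has the highest cosine similarity with the question (not the exact code from the original experiment).
import numpy as np

def closest_sentence(question, sentences):
    q = infersent.encode([question], tokenize=True)[0]
    best_idx, best_score = -1, -1.0
    for i, s in enumerate(sentences):
        v = infersent.encode([s], tokenize=True)[0]
        score = np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v))
        if score > best_score:
            best_idx, best_score = i, score
    return best_idx, best_score   # index of the most similar sentence and its cosine score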
Supervised Learning Model
Creating a training set for this part is tricky, since each passage does not have a fixed number of sentences and answers can range from one word to many words.
I’ve converted the target variable’s text to the sentence index that contains that text. I’ve kept my
paragraphs to a maximum of ten sentences to keep things simple (around 98 percent of the
paragraphs have 10 or fewer sentences). As a result, in this scenario, I have 10 labels to forecast.
I created a feature based on cosine distance for each sentence. If a paragraph has fewer than 10
sentences, I replace its feature value with 1 (maximum cosine distance) to make 10 sentences.
Question – What kind of sensing technology is being used to protect tribal lands in the Amazon?
Context – The use of remote sensing for the conservation of the Amazon is also being used by
the indigenous tribes of the basin to protect their tribal lands from commercial interests. Using
handheld GPS devices and programs like Google Earth, members of the Trio Tribe, who live in
the rainforests of southern Suriname, map out their ancestral lands to help strengthen their
territorial claims. Currently, most tribes in the Amazon do not have clearly defined boundaries,
making it easier for commercial ventures to target their territories.
-From SQuAD
Text – remote sensing
Because the highlighted sentence index is 1, the target variable will be set to 1. There will be ten features, each of which corresponds to one sentence in the paragraph. For sentences that do not appear in the paragraph, the missing values (for example column_cos_2 and column_cos_3) are filled with NaN.

Encode the sentences (list of n sentences):


embeddings = infersent.encode(sentences, tokenize=True)
Next, write the train_nli.py training script in the IDE (abridged from the InferSent repository; the argparse/os imports, the --seed argument and the argument-parsing call were missing from the excerpt and are restored here):
import argparse
import os
import numpy as np
import torch
from torch.autograd import Variable
import torch.nn as nn

from data import get_nli, get_batch, build_vocab
from mutils import get_optimizer
from models import NLINet

GLOVE_PATH = "dataset/GloVe/glove.840B.300d.txt"

parser = argparse.ArgumentParser(description='NLI training')
# paths
parser.add_argument("--nlipath", type=str, default='dataset/SNLI/', help="NLI data path (SNLI or MultiNLI)")
parser.add_argument("--outputdir", type=str, default='savedir/', help="Output directory")
parser.add_argument("--outputmodelname", type=str, default='model.pickle')
# training
parser.add_argument("--n_epochs", type=int, default=20)
parser.add_argument("--batch_size", type=int, default=64)
parser.add_argument("--dpout_model", type=float, default=0., help="encoder dropout")
parser.add_argument("--dpout_fc", type=float, default=0., help="classifier dropout")
parser.add_argument("--nonlinear_fc", type=float, default=0, help="use nonlinearity in fc")
parser.add_argument("--optimizer", type=str, default="sgd,lr=0.1", help="adam or sgd,lr=0.1")
parser.add_argument("--lrshrink", type=float, default=5, help="shrink factor for sgd")
parser.add_argument("--decay", type=float, default=0.99, help="lr decay")
parser.add_argument("--minlr", type=float, default=1e-5, help="minimum lr")
parser.add_argument("--max_norm", type=float, default=5., help="max norm (grad clipping)")
# model
parser.add_argument("--encoder_type", type=str, default='BLSTMEncoder', help="see list of encoders")
parser.add_argument("--enc_lstm_dim", type=int, default=2048, help="encoder nhid dimension")
parser.add_argument("--n_enc_layers", type=int, default=1, help="encoder num layers")
parser.add_argument("--fc_dim", type=int, default=512, help="nhid of fc layers")
parser.add_argument("--n_classes", type=int, default=3, help="entailment/neutral/contradiction")
parser.add_argument("--pool_type", type=str, default='max', help="max or mean")
# seed (needed by the seeding calls below)
parser.add_argument("--seed", type=int, default=1234, help="seed")

params, _ = parser.parse_known_args()

np.random.seed(params.seed)
torch.manual_seed(params.seed)
torch.cuda.manual_seed(params.seed)

# nli_net, optimizer, word_vec, valid and test are created elsewhere in the full script
def evaluate(epoch, eval_type='valid', final_eval=False):
    nli_net.eval()
    correct = 0.
    global val_acc_best, lr, stop_training, adam_stop

    if eval_type == 'valid':
        print('\nVALIDATION : Epoch {0}'.format(epoch))

    s1 = valid['s1'] if eval_type == 'valid' else test['s1']
    s2 = valid['s2'] if eval_type == 'valid' else test['s2']
    target = valid['label'] if eval_type == 'valid' else test['label']

    for i in range(0, len(s1), params.batch_size):
        # prepare batch
        s1_batch, s1_len = get_batch(s1[i:i + params.batch_size], word_vec)
        s2_batch, s2_len = get_batch(s2[i:i + params.batch_size], word_vec)
        s1_batch, s2_batch = Variable(s1_batch.cuda()), Variable(s2_batch.cuda())
        tgt_batch = Variable(torch.LongTensor(target[i:i + params.batch_size])).cuda()

        # model forward
        output = nli_net((s1_batch, s1_len), (s2_batch, s2_len))
        pred = output.data.max(1)[1]
        correct += pred.long().eq(tgt_batch.data.long()).cpu().sum()

    # save model
    eval_acc = round(100 * correct / len(s1), 2)
    if final_eval:
        print('finalgrep : accuracy {0} : {1}'.format(eval_type, eval_acc))
    else:
        print('togrep : results : epoch {0} ; mean accuracy {1} : {2}'.format(epoch, eval_type, eval_acc))

    if eval_type == 'valid' and epoch <= params.n_epochs:
        if eval_acc > val_acc_best:
            print('saving model at epoch {0}'.format(epoch))
            if not os.path.exists(params.outputdir):
                os.makedirs(params.outputdir)
            torch.save(nli_net, os.path.join(params.outputdir, params.outputmodelname))
            val_acc_best = eval_acc
        else:
            if 'sgd' in params.optimizer:
                optimizer.param_groups[0]['lr'] = optimizer.param_groups[0]['lr'] / params.lrshrink
                print('Shrinking lr by : {0}. New lr = {1}'.format(params.lrshrink, optimizer.param_groups[0]['lr']))
                if optimizer.param_groups[0]['lr'] < params.minlr:
                    stop_training = True
            if 'adam' in params.optimizer:
                # early stopping (at 2nd decrease in accuracy)
                stop_training = adam_stop
                adam_stop = True
    return eval_acc
Develop a model based on natural language inference (SNLI):
To replicate the results and train the model on SNLI, set GLOVE_PATH in train_nli.py and then run:
python train_nli.py
After the model has been trained, pass a sentence to the encoder function, which will produce a 4096-dimensional vector regardless of how many words are in the text.
Parsing Dependencies:
The “Dependency Parse Tree” is another feature I used to solve this problem. The model’s
accuracy will improve by 5% because of this. Spacy tree parsing was used since it has a robust
API for traversing through the tree.
Above the text, directed, named arcs from heads to dependents show the relationships between
the words. Because we generate the labels from a pre-defined inventory of grammatical relations,
we call this a Typed Dependency structure. It also comprises a root node, which denotes the
tree’s root, as well as the entire structure’s head.
Let's use spaCy tree parsing to visualize our data, using the same example.
What kind of sensing technology is being used to protect tribal lands in the Amazon?
[to_nltk_tree(sent.root).pretty_print() for sent in en_nlp(predicted.iloc[0, 2]).sents]
(Here to_nltk_tree is a small helper that converts a spaCy subtree into an nltk Tree for printing.)

Sentence having the solution — The use of remote sensing for the conservation of the Amazon
is also being used by the indigenous tribes of the basin to protect their tribal lands from
commercial interests.
All roots of the sentences in the paragraph are printed:
for sent in doc.sents:
    roots = [st.stem(chunk.root.head.text.lower()) for chunk in sent.noun_chunks]
    print(roots)
Lemmatization:
The Lemmatizer is a configurable pipeline component that supports lookup and rule-based
lemmatization methods. As part of its language data, a language can expand the Lemmatizer.
Before comparing the roots of the sentence to the question root, it’s crucial to do stemming and
lemmatization. Protect is the root word for the question in the previous example, while protected
is the root word in the sentence. It will be impossible to match them unless you stem and
lemmatize “protect” to a common phrase.
import spacy
nlp = spacy.load("en_core_web_sm")
lemmatizer = nlp.get_pipe("lemmatizer")
print(lemmatizer.mode) # 'rule'
doc = nlp("The use of remote sensing for the conservation of the Amazon is also being used by
the indigenous tribes of the basin to protect their tribal lands from commercial interests.")
print([token.lemma_ for token in doc])

The goal is to match the root of the question against all the roots and sub-roots of the sentence. Since a sentence can contain several verbs, we can obtain several roots. If the root of the question is present among the roots of the sentence, there is a better chance that the sentence answers the question. With this in mind, a feature is designed for each sentence that takes the value 1 or 0: 1 indicates that the question's root is contained in the sentence roots, and 0 indicates that it is not.
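A sketch of this root-match feature is given below; the example question and sentence are illustrative, and the dependency labels used to collect sub-roots are an assumption rather than the exact set from the original experiment.
import spacy
from nltk.stem.lancaster import LancasterStemmer

nlp = spacy.load("en_core_web_sm")
st = LancasterStemmer()

def root_match(question, sentence):
    # stemmed root verb of the question
    q_root = st.stem(next(iter(nlp(question).sents)).root.text.lower())
    # stemmed roots/sub-roots of the sentence (tokens heading clauses)
    sent_roots = {st.stem(tok.text.lower())
                  for tok in nlp(sentence)
                  if tok.dep_ in ("ROOT", "xcomp", "advcl", "conj", "ccomp")}
    return int(q_root in sent_roots)

print(root_match("What kind of sensing technology is being used to protect tribal lands?",
                 "The tribes use remote sensing to protect their tribal lands."))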
From the processed training data we build the transposed feature table (shown for two observations). So, for ten sentences in a paragraph, we have 20 features combining cosine distance and root match, and the target variable ranges from 0 to 9.
The problem can also be solved with supervised learning, fitting multinomial logistic regression, random forest, and xgboost on 20 features, where for each sentence two features represent its cosine distance and Euclidean distance (again limiting each paragraph to ten sentences).
import numpy as np, pandas as pd
import ast
from sklearn import linear_model
from sklearn import metrics
from sklearn.model_selection import train_test_split   # sklearn.cross_validation is deprecated in newer scikit-learn
import warnings
warnings.filterwarnings('ignore')
import spacy
from nltk import Tree
en_nlp = spacy.load('en_core_web_sm')   # the 'en' shortcut is removed in spaCy v3
from nltk.stem.lancaster import LancasterStemmer
st = LancasterStemmer()
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.ensemble import RandomForestClassifier
import xgboost as xgb
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
Load the dataset CSV file:
data = pd.read_csv("train_detect_sent.csv").reset_index(drop=True)
The sentence lists are stored as strings in the CSV, so parse them back into Python lists with ast.literal_eval:
ast.literal_eval(data["sentences"][0])
Finally, create the features for the DataFrame and then train the model:
def crt_feature(data):
    train = pd.DataFrame()
    for k in range(len(data["euclidean_dis"])):
        dis = ast.literal_eval(data["euclidean_dis"][k])
        for i in range(len(dis)):
            train.loc[k, "column_euc_" + "%s" % i] = dis[i]
    print("Finished")
    for k in range(len(data["cosine_sim"])):
        dis = ast.literal_eval(data["cosine_sim"][k].replace("nan", "1"))
        for i in range(len(dis)):
            train.loc[k, "column_cos_" + "%s" % i] = dis[i]
    train["target"] = data["target"]
    return train

train = crt_feature(data)
train.head(3).transpose()
Train the model using multinomial logistic regression:
X = train.iloc[:, :-1]   # feature columns (assumed; the last column is the target)
train_x, test_x, train_y, test_y = train_test_split(X, train.iloc[:, -1],
                                                    train_size=0.8, random_state=5)
mul_lr = linear_model.LogisticRegression(multi_class='multinomial', solver='newton-cg')
mul_lr.fit(train_x, train_y)
print("Multinomial Logistic regression Train Accuracy : ",
      metrics.accuracy_score(train_y, mul_lr.predict(train_x)))
print("Multinomial Logistic regression Test Accuracy : ",
      metrics.accuracy_score(test_y, mul_lr.predict(test_x)))
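The same train/test split can be fed to the other two models mentioned above; the hyperparameters in this sketch are illustrative defaults, not the tuned values from the original experiment.
rf = RandomForestClassifier(n_estimators=200, random_state=5)
rf.fit(train_x, train_y)
print("Random Forest Test Accuracy : ", metrics.accuracy_score(test_y, rf.predict(test_x)))

xgb_clf = xgb.XGBClassifier(n_estimators=200, random_state=5)
xgb_clf.fit(train_x, train_y)
print("XGBoost Test Accuracy : ", metrics.accuracy_score(test_y, xgb_clf.predict(test_x)))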

The sentence ID that contains the right answer is the target variable, so there are ten labels. The validation accuracy is currently about 63, 65 and 69 per cent for the three models, respectively.

PERSONAL ASSISTANT:
Understanding Natural Language Processing in Virtual Assistants
In a market saturated with offerings from various companies, it is crucial to possess an
understanding of the underlying technologies that make Virtual Assistants effective vs.
ineffective. Of paramount importance is the Assistant’s ability to interface with users and
complete the appropriate action based on the information provided by the user. While it may
seem like common sense, that description merely scratches the surface of the capabilities that
separate a chatbot from a Virtual Assistant.

Critical differences between rudimentary artificial intelligence and its corollary abilities to
complete user-requested actions lie in the framework of Natural Language Processing (NLP),
Understanding (NLU), and Generation (NLG). Natural Language Processing in Virtual
Assistants is key in understanding both the broad picture and the minute details.
When combined in Aisera’s Virtual Assistant, the three technologies far exceed the precedent set
by competitors for conversational intelligence and Robotic Process Automation (RPA) solutions.
In this blog, we will examine the specificities of NLP, and later NLU and NLG, as they appear
across the Aisera AI Service Management (AISM) platform and how these differ from other
Virtual Assistant offerings on the market.
Ecosystem
When interacting with a user, a proper Virtual Assistant must be equipped to capture any
incoming request regardless of the domain and intent of the request and return an immediate and
relevant response. In this way, the Virtual Assistant is like the goalkeeper of a World Cup-
winning soccer team: they catch the incoming ball from any angle and return it up the field to
one of their teammates. Unsurprisingly, Virtual Assistants can become a critical player on any
customer service team as the Assistant keeps track of customers throughout their buying journey,
executes automated processes on backend systems, deflects routine issues from service agents,
and escalates unresolved requests to the most effective agent when the time is right. But this is only one flavor of Virtual Assistant application; there are numerous use cases across Sales and Marketing, Human Resources, IT, Legal and Finance, and more. A capable Virtual Assistant must be equipped to handle many different facets of the customer's journey while understanding the nuances of a given customer's mood and sentiment, and therein lie the most powerful applications of cutting-edge Natural Language Processing and NLU technologies in Virtual Assistants. We go deeper into the gamut of capabilities Aisera's Virtual Assistant has and how it helps with Customer Intelligence along the customer's journey in this blog.
Understanding Intents
For the uninitiated, semantic NLP, NLU, and NLG are technologies built to solve one problem: identifying the user's intent during any given interaction. Humans employ many mechanisms to decipher the intent behind another person's word choice, such as visual cues, differences in inflection across a word, and familiarity with the vernacular of the conversation. Machines, however, do not have most of these luxuries and therefore must rely on different mechanisms to ensure the correct interpretation of user interaction. The components that make up NLP are a message interpreter and an exception handler. These two pieces allow Aisera to process a user request and then execute tasks and actions
based on the extracted information. The message interpreter uses techniques such as
tokenization, spell checking, and lemmatization to break down the nature of the user’s request
prior to classifying the request and passing it along to the NLU module to further analyze the
intent behind the request. For example, an utterance of “I would like to access Zoom” could be
understood as:
Intent: [name: “Provision $Application”], entities: [name: “$Application: Zoom
Videoconferencing”]

From there, the aforementioned interpretation techniques can be added to further breakdown the
utterance, which could look like:

Intent: [domain: “IT”], intent: [type: “action”], entities: [class-name: “videoconferencing”],


sentiment: [score: “positive”]
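As a purely illustrative sketch (this is not Aisera's implementation), a message interpreter's first pass over the same utterance could be imitated with spaCy's tokenizer, lemmatizer, and named entity recognizer:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("I would like to access Zoom")

tokens = [tok.text for tok in doc]
lemmas = [tok.lemma_ for tok in doc]
entities = [(ent.text, ent.label_) for ent in doc.ents]   # may or may not tag "Zoom" as ORG/PRODUCT
print(tokens)
print(lemmas)
print(entities)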
If the request is not able to be dissected and classified, the NLP module engages the exception
handler to inform the user that it cannot process their request and offer an instant escalation if the
user would like to continue to pursue the request. More often than not, the NLP module is able to
skip the failsafe of invoking the exception handler since the message interpreter does a fantastic
job of cleaning and parsing the incoming data.
The final part of the classification process involves relating the interpreted message to the
existing domain ontology and taxonomies. Aisera’s NLP module takes the processed user
utterance and classifies it under a certain domain by mapping the extracted entities from the
request to a list of popular entities within the target domain. This is known as the ontology, a vast
group of associated entities and their logical dependencies. Aisera’s Virtual Assistant is pre-
trained on a broad and deep body of data, which includes a global ontology of over 1.1 Trillion
phrases and 5 Billion intents. The NLP module does more than process text and classify the domain and intent of the utterance. The message interpreter also classifies the intent type and analyzes the user's sentiment, adding supplemental tags that keep the data organized and help Aisera's Virtual Assistant determine the optimal next steps to act on.
What it Means for Users
Natural Language Processing in Virtual Assistants is only one piece of the puzzle, and the
example provided merely scratches the surface of the exact inner workings of Aisera’s Virtual
Assistant, but it does offer a more technical glimpse into how inbound data is handled by the
NLP module. With NLP, businesses gain the ability to instantly engage users without resorting to
the narrow scope of an archaic scripted dialog flow. NLP enables Aisera to be like the World
Cup goalkeeper – catching incoming requests, no matter the speed, spin, or angle of attack. In
our next blog, we will take a look under the hood of how Aisera built its world-class Natural
Language Understanding module and the benefits to businesses and end-users alike.

TUTORING SYSTEMS:
INTRODUCTION
Many Intelligent Tutoring Systems (ITSs) aim to help students become better readers. The
computational challenges involved are (1) to assess the students’ natural language inputs and (2)
to provide appropriate feedback and guide students through the ITS curriculum. To overcome
both challenges, the following non-structural Natural Language Processing (NLP) techniques
have been explored and the first two are already in use: word-matching (WM), latent semantic
analysis (LSA, Landauer, Foltz, & Laham, 1998), and topic models (TM, Steyvers & Griffiths,
2007).
This article describes these NLP techniques, the iSTART (Interactive Strategy Trainer for Active Reading and Thinking; McNamara, Levinstein, & Boonthum, 2004) intelligent tutor and the related Reading Strategies
Assessment Tool (R-SAT, Magliano et al., 2006), and how these NLP techniques can be used in
assessing students’ input in iSTART and R-SAT. This article also discusses other related NLP
techniques which are used in other applications and may be of use in the assessment tools or
intelligent tutoring systems.
BACKGROUND
Interpreting text is critical for intelligent tutoring systems (ITSs) that are designed to interact
meaningfully with, and adapt to, the users’ input. Different ITSs use different Natural Language
Processing (NLP) techniques in their system. NLP systems may be structural, i.e., focused on
grammar and logic, or non-structural, i.e., focused on words and statistics. This article deals with
the latter.

Examples of the structural approach include ExtrAns (Extracting Answers from technical
texts question-answering system; Molla et al., 2003) which uses minimal logical forms (MLF;
that is, the form of first order predicates) to represent both texts and questions and C-Rater
(Leacock & Chodorow, 2003) which scores short-answer questions by analyzing the conceptual
information of an answer in respect to the given question. Turning to the non-structural
approach, AutoTutor (Graesser et al., 2000) uses LSA to analyze the student’s input against
expected sets of answers and CIRCSIM-Tutor (Kim et al., 1989) uses a word-matching
technique to evaluate students’ short answers. The systems considered more fully below,
iSTART (McNamara et al, 2004) and R-SAT (Magliano et al., 2006) use both word-matching
and LSA in assessing quality of students’ self-explanation. Topic models (TM) were explored in
both systems, but have not yet been integrated.
MAIN FOCUS OF THE CHAPTER
This article presents three non-structural NLP techniques (WM, LSA, and TM) which are
currently used or being explored in reading strategies assessment and training applications,
particularly, iSTART and R-SAT.
Word Matching
Word matching is a simple and intuitive way to estimate the nature of an explanation. There are
two ways to compare words from the reader’s input (either answers or explanations) against
benchmarks (collections of words that represent a unit of text or an ideal answer): (1) Literal
word matching and (2) Soundex matching.
Literal word matching – Words are compared character by character and if there is a match of
sufficient length then we call this a literal match. An alternative is to count words that have the
same stem (e.g., indexer and indexing) as matching. If a word is short a complete match may be
required to reduce the number of false-positives.
Soundex matching - This algorithm compensates for misspellings by mapping similar characters to the same soundex symbol (Christian, 1998). Words are transformed to their soundex code by retaining the first character, dropping the vowels, and then converting the other characters into soundex symbols: 1 for b, p; 2 for f, v; 3 for c, k, s; etc. Sometimes only one of several consecutive occurrences of the same symbol is retained. There are many variants of this algorithm designed to reduce the number of false positives (e.g., Philips, 1990). As in literal matching, short words may require a full soundex match, while for longer words the first n soundex symbols may suffice.
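A rough Soundex sketch is shown below; it uses the standard Soundex digit groups, which differ slightly from the grouping quoted above, and real implementations add further special cases.
def soundex(word, length=4):
    codes = {}
    for group, digit in [("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")]:
        for ch in group:
            codes[ch] = digit
    word = word.upper()
    out = word[0]                          # retain the first character
    prev = codes.get(word[0], "")
    for ch in word[1:]:
        code = codes.get(ch, "")           # vowels (and h, w, y) map to nothing
        if code and code != prev:          # keep only one of consecutive identical symbols
            out += code
        prev = code
    return (out + "000")[:length]          # pad/truncate to a fixed length

print(soundex("indexer"), soundex("indekser"))   # both give the same code despite the misspelling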
Word-matching is also used in other applications, such as, CIRCSIM-Tutor (Kim et al., 1989) on
short-answer questions and Short Essay Grading System (Ventura et al., 2004) on questions with
ideal expert answers.
Latent Semantic Analysis (LSA)
Latent Semantic Analysis (LSA; Landauer, Foltz, & Laham, 1998) uses statistical computation
to extract and represent the meaning of words. Meanings are represented in terms of their
similarity to other words in a large corpus of documents. LSA begins by finding the frequency of
terms used and the number of co-occurrences in each document throughout the corpus and then
uses a powerful mathematical transformation to find deeper meanings and relations between
words.
When measuring the similarity between text-objects, LSA’s accuracy improves with the size
of the objects, so it provides the most benefit in finding similarity between two documents but as
it does not take word order into account, short documents may not receive the full benefit. The
details for constructing an LSA corpus matrix are in Landauer & Dumais (1997). Briefly, the
steps are: (1) select a corpus; (2) create a term-document-frequency (TDF) matrix; (3) apply
Singular Value Decomposition (SVD; Press et al., 1986) to the TDF matrix to decompose it into three matrices (L x S x R, where S is a scaling matrix). The leftmost matrix (L) becomes the LSA matrix of that corpus. The optimal size is usually in the range of 300-400 dimensions. Hence, the LSA matrix dimensions become N x D, where N is the number of unique words in the
entire corpus and D is the optimal dimension (reduced from the total number of documents in the
entire corpus).
The similarity of terms (or words) is computed by comparing two rows, each representing a
term vector. This is done by taking the cosine of the two term vectors. To find the similarity of
sentences or documents, (1) for each document, create a document vector using the sum of the
term vectors of all the terms appearing in the document and (2) calculate a cosine between two
document vectors. Cosine values range from -1 to +1, where +1 means highly similar.
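The small sketch below illustrates this pipeline with scikit-learn (a term-document frequency matrix, truncated SVD, and cosine similarity); the tiny corpus and the two-dimensional space are toy choices, not the 300-400 dimensions used in practice.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "the indigenous tribes use remote sensing to protect their lands",
    "satellite images help map ancestral territory in the rainforest",
    "the recipe calls for two cups of flour and one egg",
]
tdf = CountVectorizer().fit_transform(docs)          # term-document frequency matrix
svd = TruncatedSVD(n_components=2, random_state=0)   # reduced "LSA" space
doc_vectors = svd.fit_transform(tdf)

# cosine similarity of document 0 with documents 1 and 2
print(cosine_similarity(doc_vectors[:1], doc_vectors[1:]))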
To use LSA in the tutoring systems, a set of benchmarks is created and compared with the trainee's input. Example benchmarks are the current target sentence, previous sentences, and the
ideal answer. A high cosine value between the current sentence benchmark and the reader’s input
would indicate that the reader understood the sentence and was able to paraphrase what was read.
To provide appropriate feedback, a number of cosines are computed (one for each benchmark).
Various statistical methods, such as discriminant analysis and regression analysis, are used to
construct the feedback formula. McNamara et al. (2007) describe various ways that LSA can be
used to evaluate the reader’s explanations: either LSA alone or a combination of LSA with WM.
The final conclusion is that a fully automated, combined system (i.e., one requiring less hand-crafted benchmark construction) produces the best results.
A number of other intelligent tutoring systems use LSA in their feedback systems, for example Summary Street (Steinhart, 2001), AutoTutor (Graesser et al., 2000), and the tutoring system of Lemaire (1999).
Topic Models
The Topic Models approach (TM; Steyvers & Griffiths, 2007) applies a probabilistic model to
find a relationship between terms and documents in terms of topics. A document is considered to
be generated probabilistically from a number of topics where each topic consists of a number of
terms, each given a probability of selection if that topic is used. By using a TM matrix, the
probability that a certain topic was used in the creation of a given document is estimated. If two
documents are similar, the estimates of the topics within these documents should be similar. TM
is similar to LSA, except that a term-document frequency matrix is factored into two matrices
instead of three: one is the probabilities of terms belonging to the topics (the TM matrix), the
other the probabilities of topics belonging to the documents. The Topic Modeling Toolbox
(Steyvers & Griffiths, 2007) can be used to construct a TM matrix.
To measure the similarity between documents, the Kullback Leibler distance (KL-distance:
Steyvers & Griffiths, 2007) is recommended, rather than the cosine measure (which can also be
used). Using TM in a tutoring system is similar to using LSA, where a set of benchmarks is
defined and the reader's input is compared against each benchmark. The only difference is the use of the KL-distance instead of the LSA cosine value. The preliminary results of investigating TM in
place of LSA (Boonthum, Levinstein, & McNamara, 2006) indicate that TM is as good as LSA
alone (correlation between computerized-scores and human rating scores), but a little bit lower
than a combined system using both WM and LSA. This suggests that the TM should be further
investigated in combination with WM or LSA or both.
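A small sketch of this idea is given below, using scikit-learn's LDA as a stand-in topic model (not the Topic Modeling Toolbox mentioned above) and SciPy's entropy function for the KL distance; the corpus and the number of topics are toy choices.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from scipy.stats import entropy

corpus = [
    "the reader explains the sentence in her own words",
    "students practice reading strategies on science passages",
    "the tutor gives feedback on the quality of the explanation",
    "reading comprehension improves with strategy training",
]
tdf = CountVectorizer().fit_transform(corpus)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(tdf)

doc_topics = lda.transform(tdf)                 # P(topic | document) for each document
kl = entropy(doc_topics[0], doc_topics[1])      # KL distance between documents 0 and 1
print(doc_topics[0], doc_topics[1], kl)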
TM is mostly used in document clustering (grouping documents based on relevancy or similar
topics; Buntine et al., 2005), data mining (Tuulos & Tirri, 2004), and search engines (Perkio et
al., 2004). A variation on TM by Steyvers & Griffiths (2007), is Probabilistic Latent Semantic
Analysis (PLSA; Hofmann, 2001) which models each document as generated from a number of
hidden topics and each topic has its features defined as the conditional probabilities of word
occurrences in that topic.
iSTART and RSAT Applications
iSTART (Interactive Strategy Trainer for Active Reading and Thinking) is a web-based,
automated tutor designed to help students become better readers using multi-media technology.
It provides adolescent to college-aged students with a program of self-explanation and reading
strategy training (McNamara et al., 2004) called Self-Explanation Reading Training, or SERT
(see McNamara et al, 2004). iSTART consists of three modules: Introduction (description of
SERT and reading strategies), Demonstration (illustration of how these reading strategies can be
used), and Practice (hands-on practice of these reading strategies). In the Practice module,
students practice using reading strategies by typing self-explanations of sentences. The system
evaluates each explanation and then provides appropriate feedback to the student. If the
explanation is irrelevant or too short compared to the given sentence and passage, the student is
required to add more information. Otherwise, the feedback is based on the level of its overall
quality.
The computational challenge is to provide appropriate feedback to the students about their
explanations. Doing so requires capturing some sense of both the meaning and quality of their
explanation. A combination of word-matching and LSA provided better results (comparing the
computerized-score using NLP techniques to the human rating score and having higher
correlation between these two sets of scores) than either separately (McNamara, Boonthum,
Levinstein, & Millis, 2007).
R-SAT (Reading Strategy Assessment Tool; Magliano et al., 2007) is an automated web-based
reading assessment tool designed to measure readers’ comprehension and spontaneous use of
reading strategies. The R-SAT is similar to the iSTART Practice module in the sense that it
presents passages to the reader one sentence at a time and asks for the reader’s input. The
difference is that, instead of an explanation, R-SAT asks either an indirect (“What are your
thoughts regarding your understanding of the sentence in the context of the passage?”) or a direct
question (e.g., "Why did the miller want to marry the girl?") at pre-selected target sentences. The
answers to the indirect questions are evaluated on how they are related to the given sentence and
passage; the answers to the direct questions are assessed by comparing them to ideal answers.
The problem is to analyze the answers and generate a set of scores for overall
comprehension and strategy usage. Ultimately, these scores can be used as a pre-assessment
for iSTART allowing the trainer to individualize the iSTART curriculum based on the reader’s
needs. R-SAT was initially proposed to use word-matching, LSA, and other techniques beyond
LSA. However, during the course of development, word-matching was found to produce better
results than LSA or in combination with LSA.
FUTURE TRENDS
These three NLP techniques (WM, LSA, and TM) are used in the ongoing research on
assessing and improving comprehension skills via reading strategies in the R-SAT and iSTART
projects. WM and LSA have been extensively investigated for iSTART and to some extent in R-
SAT. The lack of success of LSA compared to the simpler WM in R-SAT is somewhat
surprising and may be due to particular features of the algorithms used or to the variety of text
genres used in R-SAT. Future work is planned with modified algorithms and substituting genre-
specific LSA spaces for the general space now used. In addition TM needs further exploration,
especially in its use with small units of text where the recommended Kullback Leibler distance
has not proven particularly effective.

Conclusion:

Thus the Study of various applications of NLP is done for real world projects.
EXPT.2 VARIOUS TEXT PREPROCESSING TECHNIQUES FOR ANY GIVEN
TEXT : TOKENIZATION AND FILTRATION & SCRIPT VALIDATION

LAB OBJECTIVES:

To understand the various text preprocessing techniques for Tokenization, Filtration & Script
Validation

LAB OUTCOMES:

On Successful Completion, the Student will be able to understand about various text
preprocessing techniques for Tokenization, Filtration & Script Validation for real world
applications.

PROCEDURE:

In Python, tokenization basically refers to splitting a larger body of text into smaller lines or words, or even creating words for a non-English language. Various tokenization functions are built into the nltk module and can be used in programs as shown below.
Line Tokenization
In the below example we divide a given text into different lines by using the function
sent_tokenize.
import nltk
sentence_data = "The First sentence is about Python. The Second: about Django. You can learn
Python,Django and Data Ananlysis here. "
nltk_tokens = nltk.sent_tokenize(sentence_data)
print (nltk_tokens)
When we run the above program, we get the following output −
['The First sentence is about Python.', 'The Second: about Django.', 'You can learn
Python,Django and Data Ananlysis here.']
Non-English Tokenization
In the below example we tokenize the German text.
import nltk

german_tokenizer = nltk.data.load('tokenizers/punkt/german.pickle')
german_tokens=german_tokenizer.tokenize('Wie geht es Ihnen? Gut, danke.')
print(german_tokens)
When we run the above program, we get the following output −
['Wie geht es Ihnen?', 'Gut, danke.']
Word Tokenization
We tokenize the words using word_tokenize function available as part of nltk.
import nltk
word_data = "It originated from the idea that there are readers who prefer learning new skills
from the comforts of their drawing rooms"
nltk_tokens = nltk.word_tokenize(word_data)
print (nltk_tokens)
When we run the above program we get the following output −
['It', 'originated', 'from', 'the', 'idea', 'that', 'there', 'are', 'readers',
'who', 'prefer', 'learning', 'new', 'skills', 'from', 'the',
'comforts', 'of', 'their', 'drawing', 'rooms']

FILTRATION:
filter() in python
The filter() method filters the given sequence with the help of a function that tests each element
in the sequence to be true or not.
syntax:
filter(function, sequence)
Parameters:
function: function that tests if each element of a
sequence true or not.
sequence: sequence which needs to be filtered, it can
be sets, lists, tuples, or containers of any iterators.
Returns:
returns an iterator that is already filtered.

# function that filters vowels


def fun(variable):
    letters = ['a', 'e', 'i', 'o', 'u']
    if variable in letters:
        return True
    else:
        return False
# sequence
sequence = ['g', 'e', 'e', 'j', 'k', 's', 'p', 'r']

# using filter function


filtered = filter(fun, sequence)

print('The filtered letters are:')


for s in filtered:
    print(s)

Output:
The filtered letters are:
e
e
Application:
It is normally used with Lambda functions to separate list, tuple, or sets.

# a list contains both even and odd numbers.


seq = [0, 1, 2, 3, 5, 8, 13]

# result contains odd numbers of the list


result = filter(lambda x: x % 2 != 0, seq)
print(list(result))

# result contains even numbers of the list


result = filter(lambda x: x % 2 == 0, seq)
print(list(result))

Output:
[1, 3, 5, 13]
[0, 2, 8]

SCRIPT VALIDATION:
Introduction to Python Validation
Whenever the user provides an input, it needs to be validated to check that the input data is what we are expecting. The validation can be done in two different ways: by using a flag variable or by using try/except. With a flag variable, the flag is set to false initially; if we find that the input data is what we are expecting, the flag is set to true and the next action is decided based on the status of the flag. With try/except, a section of code is attempted, and if it fails (raises an exception), the except block of code is run.
Types of Validation in Python
There are three types of validation in python, they are:
Type Check: This validation technique in python is used to check the given input data type. For
example, int, float, etc.
Length Check: This validation technique in python is used to check the given input string’s
length.
Range Check: This validation technique in python is used to check if a given number falls in
between the two numbers.
The syntax for validation in Python is given below:
Syntax using the flag:
flagName = False
while not flagName:
    if [Do check here]:
        flagName = True
    else:
        print('error message')
The flag is initially set to false and is used as the condition of the while-not loop. The validation is performed inside the loop, setting the flag to true if the validation condition is satisfied; otherwise, the error message is printed.
Syntax using an exception:

while True:
    try:
        [run code that might fail here]
        break
    except:
        print('This is the error message if the code fails')

print('run the code from here if the code in the try block above ran successfully')
Here the loop condition is simply True, and the necessary validation is performed by running a block of code inside try. If the code fails, an exception is raised and the error message is displayed; if the try block executes successfully, the loop is exited and the success message is printed.
Examples of Python Validation
Examples of python validation are:
Example #1
Python program using a flag to validate whether the input given by the user is an integer (type check).
# Declare a variable validInt which is used as the flag and set it to false
validInt = False
# Keep prompting the user until the flag becomes true
while not validInt:
    # The user is prompted to enter the input
    age1 = input('Please enter your age ')
    # The input entered by the user is checked to see if it is a digit
    if age1.isdigit():
        # The flag is set to true if the condition is true
        validInt = True
    else:
        print('The input is not a valid number')
# This statement is printed once the input entered by the user is a number
print('The entered input is a number and that is ' + str(age1))
Output:

Example #2
Python program using a flag and an exception to validate the type of input given by the user and determine whether it lies within a given range (range check).
Code:
# Declare a variable areTeenager which is used as the flag and set it to false
areTeenager = False
# Keep prompting the user until the flag becomes true
while not areTeenager:
    try:
        # The user is prompted to enter the input
        age1 = int(input('Please enter your age '))
        # The input entered by the user is checked to see if it lies within the specified range
        if age1 >= 13 and age1 <= 19:
            areTeenager = True
    except:
        print('The age entered by you is not a valid number between 13 and 19')
# This statement is printed once the input entered by the user lies within the specified range
print('You are a teenager whose age lies between 13 and 19 and the entered age is ' + str(age1))

Example #3
Python program using a flag to check the length of the input string (length check).
Code:
# Declare a variable lenstring which is used as the flag and set it to false
lenstring = False
# Keep prompting the user until the flag becomes true
while not lenstring:
    password1 = input('Please enter a password consisting of at least five characters ')
    # The input entered by the user is checked for its length
    if len(password1) >= 5:
        lenstring = True
    else:
        print('The number of characters in the entered password is less than five characters')
# This statement is printed once the entered password has at least five characters
print('The entered password is: ' + password1)
Output

Conclusion:
Thus the various text preprocessing techniques, Tokenization and Filtration & Script Validation
are done and verified.

EXPT.3 VARIOUS OTHER TEXT PREPROCESSING TECHNIQUES FOR ANY


GIVEN TEXT : STOP WORD REMOVAL, LEMMATIZATION / STEMMING.
LAB OBJECTIVES:
To understand various other text preprocessing techniques for any given text: stop word removal, lemmatization / stemming

LAB OUTCOMES:
On Successful Completion, the Student will be able to understand various other text preprocessing techniques for any given text: stop word removal, lemmatization / stemming

PROCEDURE:

Stop Word Removal


Stopwords are common words which do not add much meaning to a sentence. They can safely be ignored without sacrificing the meaning of the sentence, for example words like the, he, have, etc. Such words are already captured in the stopwords corpus that ships with NLTK. We first download it to our Python environment.
import nltk
nltk.download('stopwords')
It will download a file with English stopwords.
Verifying the Stopwords
from nltk.corpus import stopwords
stopwords.words('english')
print(stopwords.words()[620:680])
When we run the above program we get the following output −
[u'your', u'yours', u'yourself', u'yourselves', u'he', u'him', u'his', u'himself', u'she',
u"she's", u'her', u'hers', u'herself', u'it', u"it's", u'its', u'itself', u'they', u'them',
u'their', u'theirs', u'themselves', u'what', u'which', u'who', u'whom', u'this',
u'that', u"that'll", u'these', u'those', u'am', u'is', u'are', u'was', u'were', u'be',
u'been', u'being', u'have', u'has', u'had', u'having', u'do', u'does', u'did', u'doing',
u'a', u'an', u'the', u'and', u'but', u'if', u'or', u'because', u'as', u'until',
u'while', u'of', u'at']
The various language other than English which has these stopwords are as below.

from nltk.corpus import stopwords


print(stopwords.fileids())
When we run the above program we get the following output −
[u'arabic', u'azerbaijani', u'danish', u'dutch', u'english', u'finnish',
u'french', u'german', u'greek', u'hungarian', u'indonesian', u'italian',
u'kazakh', u'nepali', u'norwegian', u'portuguese', u'romanian', u'russian',
u'spanish', u'swedish', u'turkish']
Example
We use the below example to show how the stopwords are removed from the list of words.
from nltk.corpus import stopwords
en_stops = set(stopwords.words('english'))

all_words = ['There', 'is', 'a', 'tree','near','the','river']


for word in all_words:
    if word not in en_stops:
        print(word)
When we run the above program we get the following output −
There
tree
near
river

Lemmatization:
Python | Lemmatization with NLTK
Lemmatization is the process of grouping together the different inflected forms of a word so they
can be analyzed as a single item. Lemmatization is similar to stemming but it brings context to
the words. So it links words with similar meanings to one word.
Text preprocessing includes both Stemming as well as Lemmatization. Many times people find
these two terms confusing. Some treat these two as the same. Actually, lemmatization is
preferred over Stemming because lemmatization does morphological analysis of the words.
Applications of lemmatization are:

Used in comprehensive retrieval systems like search engines.


Used in compact indexing

Examples of lemmatization:

-> rocks : rock


-> corpora : corpus
-> better : good
One major difference from stemming is that lemmatize() takes a part-of-speech parameter, "pos". If it is not supplied, the default is "noun".
Below is the implementation of lemmatizing words using NLTK:

Python3

# import these modules


from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print("rocks :", lemmatizer.lemmatize("rocks"))
print("corpora :", lemmatizer.lemmatize("corpora"))

# a denotes adjective in "pos"


print("better :", lemmatizer.lemmatize("better", pos ="a"))

Output :

rocks : rock
corpora : corpus
better : good
Stemming:Python | Stemming words with NLTK
Stemming is the process of producing morphological variants of a root/base word. Stemming
programs are commonly referred to as stemming algorithms or stemmers. A stemming algorithm
reduces the words “chocolates”, “chocolatey”, and “choco” to the root word, “chocolate” and
“retrieval”, “retrieved”, “retrieves” reduce to the stem “retrieve”.
Prerequisite: Introduction to Stemming

Some more example of stemming for root word "like" include:

-> "likes"
-> "liked"
-> "likely"
-> "liking"
Errors in Stemming: There are mainly two kinds of errors in stemming – over-stemming and under-stemming. Over-stemming occurs when two words that should have different stems are reduced to the same root. Under-stemming occurs when two words that should be reduced to the same root are stemmed to different roots.
Applications of stemming are:
Stemming is used in information retrieval systems like search engines.
It is used to determine domain vocabularies in domain analysis.
Stemming is desirable as it may reduce redundancy as most of the time the word stem and their
inflected/derived words mean the same.
Below is the implementation of stemming words using NLTK:
Code #1:
Python3

# import these modules


from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
ps = PorterStemmer()
# choose some words to be stemmed
words = ["program", "programs", "programmer", "programming", "programmers"]
for w in words:
    print(w, " : ", ps.stem(w))

Output:
program : program
programs : program
programmer : program
programming : program
programmers : program
Code #2: Stemming words from sentences
Python3

# importing modules
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
ps = PorterStemmer()
sentence = "Programmers program with programming languages"
words = word_tokenize(sentence)
for w in words:
    print(w, " : ", ps.stem(w))

Output :
Programmers : program
program : program
with : with
programming : program
languages : language

Conclusion:
Thus the various other text preprocessing techniques for any given text : Stop Word Removal,
Lemmatization / Stemming are done and verified.
EXPT.4 MORPHOLOGICAL ANALYSIS AND WORD GENERATION FOR ANY
GIVEN TEXT

LAB OBJECTIVES:

To understand the concepts of morphological analysis and word generation for any given text

LAB OUTCOMES:

On Successful Completion, the Student will be able to understand about morphological analysis
and word generation for any given text in the real world application.

PROCEDURE:
Like any other python library, we will install polyglot using pip install polyglot.
Morphological Analysis
Polyglot offers trained morfessor models to generate morphemes from words. The goal of the
Morpho project is to develop unsupervised data-driven methods that discover the regularities
behind word forming in natural languages. In particular, Morpho project is focussing on the
discovery of morphemes, which are the primitive units of syntax, the smallest individually
meaningful elements in the utterances of a language. Morphemes are important in automatic
generation and recognition of a language, especially in languages in which words may have
many different inflected forms.
Languages Coverage
Using polyglot vocabulary dictionaries, we trained morfessor models on the 50,000 most frequent words of each language.
from polyglot.downloader import downloader
print(downloader.supported_languages_table("morph2"))
1. Piedmontese language 2. Lombard language 3. Gan Chinese
4. Sicilian 5. Scots 6. Kirghiz, Kyrgyz
7. Pashto, Pushto 8. Kurdish 9. Portuguese
10. Kannada 11. Korean 12. Khmer
13. Kazakh 14. Ilokano 15. Polish
16. Panjabi, Punjabi 17. Georgian 18. Chuvash
19. Alemannic 20. Czech 21. Welsh
22. Chechen 23. Catalan; Valencian 24. Northern Sami
25. Sanskrit (Saṁskṛta) 26. Slovene 27. Javanese
28. Slovak 29. Bosnian-Croatian-Serbian 30. Bavarian
31. Swedish 32. Swahili 33. Sundanese
34. Serbian 35. Albanian 36. Japanese
37. Western Frisian 38. French 39. Finnish
40. Upper Sorbian 41. Faroese 42. Persian
43. Sinhala, Sinhalese 44. Italian 45. Amharic
46. Aragonese 47. Volapük 48. Icelandic
49. Sakha 50. Afrikaans 51. Indonesian
52. Interlingua 53. Azerbaijani 54. Ido
55. Arabic 56. Assamese 57. Yoruba
58. Yiddish 59. Waray-Waray 60. Croatian
61. Hungarian 62. Haitian; Haitian Creole 63. Quechua
64. Armenian 65. Hebrew (modern) 66. Silesian
67. Hindi 68. Divehi; Dhivehi; Mald... 69. German
70. Danish 71. Occitan 72. Tagalog
73. Turkmen 74. Thai 75. Tajik
76. Greek, Modern 77. Telugu 78. Tamil
79. Oriya 80. Ossetian, Ossetic 81. Tatar
82. Turkish 83. Kapampangan 84. Venetian
85. Manx 86. Gujarati 87. Galician
88. Irish 89. Scottish Gaelic; Gaelic 90. Nepali
91. Cebuano 92. Zazaki 93. Walloon
94. Dutch 95. Norwegian 96. Norwegian Nynorsk
97. West Flemish 98. Chinese 99. Bosnian
100. Breton 101. Belarusian 102. Bulgarian
103. Bashkir 104. Egyptian Arabic 105. Tibetan Standard, Tib...
106. Bengali 107. Burmese 108. Romansh
109. Marathi (Marāṭhī) 110. Malay 111. Maltese
112. Russian 113. Macedonian 114. Malayalam
115. Mongolian 116. Malagasy 117. Vietnamese
118. Spanish; Castilian 119. Estonian 120. Basque
121. Bishnupriya Manipuri 122. Asturian 123. English
124. Esperanto 125. Luxembourgish, Letzeb... 126. Latin
127. Uighur, Uyghur 128. Ukrainian 129. Limburgish, Limburgan...
130. Latvian 131. Urdu 132. Lithuanian
133. Fiji Hindi 134. Uzbek 135. Romanian, Moldavian, ...
Download Necessary Models
%%bash
polyglot download morph2.en morph2.ar
[polyglot_data] Downloading package morph2.en to
[polyglot_data] /home/rmyeid/polyglot_data...
[polyglot_data] Package morph2.en is already up-to-date!
[polyglot_data] Downloading package morph2.ar to
[polyglot_data] /home/rmyeid/polyglot_data...
[polyglot_data] Package morph2.ar is already up-to-date!
Example
from polyglot.text import Text, Word

words = ["preprocessing", "processor", "invaluable", "thankful", "crossed"]
for w in words:
    w = Word(w, language="en")
    print("{:<20}{}".format(w, w.morphemes))
preprocessing ['pre', 'process', 'ing']
processor ['process', 'or']
invaluable ['in', 'valuable']
thankful ['thank', 'ful']
crossed ['cross', 'ed']
If the text is not tokenized properly, morphological analysis can offer a smart way of
splitting the text into its original units. Here is an example:
blob = "Wewillmeettoday."
text = Text(blob)
text.language = "en"
text.morphemes
WordList([u'We', u'will', u'meet', u'to', u'day', u'.'])
!polyglot --lang en tokenize --input testdata/cricket.txt | polyglot --lang en morph | tail -n 30
which which
India In_dia
beat beat
Bermuda Ber_mud_a
in in
Port Port
of of
Spain Spa_in
in in
2007 2007
, ,
which which
was wa_s
equalled equal_led
five five
days day_s
ago ago
by by
South South
Africa Africa
in in
their t_heir
victory victor_y
over over
West West
Indies In_dies
in in
Sydney Syd_ney
. .
This is an interface to the implementation described in the "Morfessor 2.0: Python
Implementation and Extensions for Morfessor Baseline" technical report.
@InProceedings{morfessor2,
  title     = {Morfessor 2.0: Python Implementation and Extensions for Morfessor Baseline},
  author    = {Virpioja, Sami and Smit, Peter and Grönroos, Stig-Arne and Kurimo, Mikko},
  year      = {2013},
  publisher = {Department of Signal Processing and Acoustics, Aalto University},
  booktitle = {Aalto University publication series}
}

Word Generation for any given text:


Pulling a random word or string from a line in a text file in Python
File handling in Python is really simple and easy to implement. In order to pull a random word or
string from a text file, we will first open the file in read mode and then use the methods in
Python’s random module to pick a random word.
There are various ways to perform this operation:
This is the text file we will read from:

Method 1: Using random.choice()


Steps:
Using the with statement, open the file in read mode. The with statement takes care of closing the
file automatically.
Read all the text from the file and store in a string
Split the string into words separated by space.
Use random.choice() to pick a word or string.
Python

# Python code to pick a random
# word from a text file
import random

# Open the file in read mode
with open("MyFile.txt", "r") as file:
    allText = file.read()
    words = list(map(str, allText.split()))

# print a random word
print(random.choice(words))

Note: The split() function, by default, splits by white space. If you want any other delimiter like
newline character you can specify that as an argument.
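For instance, to split on newlines instead of whitespace (a tiny illustrative variation on the snippet above, reusing its allText variable):

lines = allText.split("\n")     # one entry per line of the file
print(random.choice(lines))     # picks a random line rather than a random word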
Output: each sample run prints a different randomly chosen word from the file.


The above can be achieved with just a single line of code like this :
Python

# import required module


import random

# print random word


print(random.choice(open("myFile.txt", "r").read().split()))

Method 2: Using random.randint()


Steps:
Open the file in read mode using with function
Store all data from the file in a string and split the string into words.
Count the total number of words.
Use random.randint() to generate a random index between 0 and word_count - 1.
Print the word at that position.
Python

# using randint()
import random

# open file
with open("myFile.txt", "r") as file:
    data = file.read()
    words = data.split()

# Generating a random position for the word
word_pos = random.randint(0, len(words) - 1)
print("Position:", word_pos)
print("Word at position:", words[word_pos])

Output:

Conclusion:
Thus the morphological analysis and word generation for any given text is done and verified for
real world application.
EXPT. 5 N GRAM MODEL FOR THE GIVEN TEXT INPUT

LAB OBJECTIVES:

To understand the concept of N Gram Model for the given text input for real world applications.

LAB OUTCOMES:

On Successful Completion, the Student will be able to understand about N Gram Model for the
given text input for real world applications.

PROCEDURE:
N-Gram Language Modelling with NLTK
Language modeling is the task of determining the probability of any sequence of words.
Language modeling is used in a wide variety of applications such as Speech Recognition, Spam
filtering, etc. In fact, language modeling is the key aim behind the implementation of many state-
of-the-art Natural Language Processing models.
Methods of Language Modeling:
There are two types of language modeling:
Statistical Language Modeling: Statistical language modeling, or simply language modeling, is the
development of probabilistic models that are able to predict the next word in a sequence given
the words that precede it. N-gram language modeling is an example.
Neural Language Modeling: Neural network methods are achieving better results than
classical methods, both as standalone language models and when incorporated into
larger models for challenging tasks like speech recognition and machine translation. A common way of
building a neural language model is through word embeddings.
N-gram
N-gram can be defined as the contiguous sequence of n items from a given sample of text or
speech. The items can be letters, words, or base pairs according to the application. The N-grams
typically are collected from a text or speech corpus (A long text dataset).
N-gram Language Model:
An N-gram language model predicts the probability of a given N-gram within any sequence of
words in the language. A good N-gram model can predict the next word in a sentence, i.e. the
value of p(w|h), the probability of a word w given the history h of preceding words.
Examples of N-grams include unigrams (“This”, “article”, “is”, “on”, “NLP”) and bigrams (“This
article”, “article is”, “is on”, “on NLP”).
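As a small illustration of estimating p(w|h) from counts (a hedged sketch: the toy corpus and variable names are made up for this example, and it assumes the nltk punkt resource is available):

from collections import Counter
from nltk import bigrams, word_tokenize

corpus = "this article is on NLP . this article is short ."   # tiny toy corpus
tokens = word_tokenize(corpus.lower())

unigram_counts = Counter(tokens)
bigram_counts = Counter(bigrams(tokens))

# maximum-likelihood estimate of p(w | h) = count(h, w) / count(h)
h, w = "article", "is"
print("p({} | {}) = {:.2f}".format(w, h, bigram_counts[(h, w)] / unigram_counts[h]))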
Understanding N-grams
Text n-grams are commonly utilized in natural language processing and text mining. An n-gram is
essentially a sequence of words that appear together in the same window.
When computing n-grams, you normally advance the window one word at a time (although in more
complex scenarios you can move by n words). N-grams are used for a variety of purposes.
N Grams Demonstration

For example, while creating language models, n-grams are utilized not only to create unigram
models but also bigrams and trigrams.
Google and Microsoft have created web-scale grammar models that may be used for a variety of
activities such as spelling correction, hyphenation, and text summarization.

Implementing n-grams in Python


To implement n-grams, the ngrams function available in nltk is used, which performs the
n-gram extraction.
from nltk import ngrams

sentence = input("Enter the sentence: ")
n = int(input("Enter the value of n: "))

n_grams = ngrams(sentence.split(), n)
for grams in n_grams:
    print(grams)
Sample Output
Enter the sentence: Let's test the n-grams implementation with this sample sentence! Yay!
Enter the value of n: 3
("Let's", 'test', 'the')
('test', 'the', 'n-grams')
('the', 'n-grams', 'implementation')
('n-grams', 'implementation', 'with')
('implementation', 'with', 'this')
('with', 'this', 'sample')
('this', 'sample', 'sentence!')
('sample', 'sentence!', 'Yay!')
Conclusion:

Thus the concept of N-Gram model for the given text input is done and verified.
EXPT. 6 STUDY THE DIFFERENT POS TAGGERS AND PERFORM POS TAGGING
ON THE GIVEN TEXT

LAB OBJECTIVES:

To study the different POS taggers and perform POS tagging on the given text for
real world applications.

LAB OUTCOMES:

On Successful Completion, the Student will be able to understand the different POS
taggers and perform POS tagging on the given text for real world applications.

PROCEDURE:

POS Tagging
POS Tagging (Parts of Speech Tagging) is the process of marking up the words in a text with a
particular part of speech, based on both their definition and context. It is responsible for reading
text in a language and assigning a specific token (part of speech) to each word. It is also called
grammatical tagging.
Let’s learn with a NLTK Part of Speech example:
Input: Everything to permit us.
Output: [(‘Everything’, NN),(‘to’, TO), (‘permit’, VB), (‘us’, PRP)]
Steps Involved in the POS tagging example:
Tokenize the text (word_tokenize)

Apply pos_tag to the tokens from the above step, i.e. nltk.pos_tag(tokenized_text), as sketched below
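A minimal sketch of these two steps (an illustrative example; it assumes the punkt and averaged_perceptron_tagger resources have been downloaded with nltk.download):

import nltk

tokens = nltk.word_tokenize("Everything to permit us.")
print(nltk.pos_tag(tokens))
# e.g. [('Everything', 'NN'), ('to', 'TO'), ('permit', 'VB'), ('us', 'PRP'), ('.', '.')]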

NLTK POS Tags Examples are as below:
Abbreviation  Meaning
CC    coordinating conjunction
CD    cardinal digit
DT    determiner
EX    existential there
FW    foreign word
IN    preposition / subordinating conjunction
JJ    adjective (large)
JJR   adjective, comparative (larger)
JJS   adjective, superlative (largest)
LS    list item marker
MD    modal (could, will)
NN    noun, singular (cat, tree)
NNS   noun, plural (desks)
NNP   proper noun, singular (Sarah)
NNPS  proper noun, plural (Indians or Americans)
PDT   predeterminer (all, both, half)
POS   possessive ending (parent's)
PRP   personal pronoun (hers, herself, him, himself)
PRP$  possessive pronoun (her, his, mine, my, our)
RB    adverb (occasionally, swiftly)
RBR   adverb, comparative (greater)
RBS   adverb, superlative (biggest)
RP    particle (about)
TO    infinitive marker (to)
UH    interjection (goodbye)
VB    verb, base form (ask)
VBG   verb, gerund (judging)
VBD   verb, past tense (pleaded)
VBN   verb, past participle (reunified)
VBP   verb, present tense, not 3rd person singular (wrap)
VBZ   verb, present tense, 3rd person singular (bases)
WDT   wh-determiner (that, what)
WP    wh-pronoun (who)
WRB   wh-adverb (how)

The above NLTK POS tag list contains all the NLTK POS tags. The NLTK POS tagger is used to
assign grammatical information to each word of the sentence. Installing, importing and
downloading all the required NLTK packages completes the setup for POS tagging.

COUNTING POS TAGS


We have discussed the various POS tags in the previous section. In this section, you will
study how to count these tags. Counting tags is crucial for text classification as well as for
preparing features for natural-language-based operations. The approach followed while preparing
the code is discussed below along with its output.
How to count Tags:
Here first we will write working code and then we will write different steps to explain the code.
from collections import Counter
import nltk

text = "Guru99 is one of the best sites to learn WEB, SAP, Ethical Hacking and much more online."
lower_case = text.lower()
tokens = nltk.word_tokenize(lower_case)
tags = nltk.pos_tag(tokens)
counts = Counter(tag for word, tag in tags)
print(counts)
Output:
Counter({‘NN’: 5, ‘,’: 2, ‘TO’: 1, ‘CC’: 1, ‘VBZ’: 1, ‘NNS’: 1, ‘CD’: 1, ‘.’: 1, ‘DT’: 1, ‘JJS’: 1,
‘JJ’: 1, ‘JJR’: 1, ‘IN’: 1, ‘VB’: 1, ‘RB’: 1})
Elaboration of the code

1. To count the tags, you can use the package Counter from the collection’s module. A
counter is a dictionary subclass which works on the principle of key-value operation. It is
an unordered collection where elements are stored as a dictionary key while the count is
their value.
2. Import nltk which contains modules to tokenize the text.
3. Write the text whose pos_tag you want to count.
4. Some words are in upper case and some in lower case, so it is appropriate to transform all
the words in the lower case before applying tokenization.
5. Pass the words through word_tokenize from nltk.
6. Calculate the pos_tag of each token
Output = [('guru99', 'NN'), ('is', 'VBZ'), ('one', 'CD'), ('of', 'IN'), ('the', 'DT'), ('best', 'JJS'),
('site', 'NN'), ('to', 'TO'), ('learn', 'VB'), ('web', 'NN'), (',', ','), ('sap', 'NN'), (',', ','), ('ethical',
'JJ'), ('hacking', 'NN'), ('and', 'CC'), ('much', 'RB'), ('more', 'JJR'), ('online', 'JJ')]
7. Now comes the role of the Counter dictionary that we imported in code line 1. The tags are
the keys and their counts are the values: Counter tallies the total number of occurrences of
each tag present in the text.

Tagging Sentences
Tagging a sentence, in a broader sense, refers to adding labels such as verb, noun, etc. based on the
context of the sentence. Identification of POS tags is a complicated process; generic POS
tagging is not feasible manually, as some words may have different (ambiguous) meanings
depending on the structure of the sentence. Converting the text into a list is an important step
before tagging, since each word in the list is looped over and counted for a particular tag.
Please see the below code to understand it better
import nltk

text = "Hello Guru99, You have to build a very good site, and I love visiting your site."
sentence = nltk.sent_tokenize(text)
for sent in sentence:
    print(nltk.pos_tag(nltk.word_tokenize(sent)))
Output:
[(‘Hello’, ‘NNP’), (‘Guru99’, ‘NNP’), (‘,’, ‘,’), (‘You’, ‘PRP’), (‘have’, ‘VBP’), (‘build’,
‘VBN’), (‘a’, ‘DT’), (‘very’, ‘RB’), (‘good’, ‘JJ’), (‘site’, ‘NN’), (‘and’, ‘CC’), (‘I’, ‘PRP’),
(‘love’, ‘VBP’), (‘visiting’, ‘VBG’), (‘your’, ‘PRP$’), (‘site’, ‘NN’), (‘.’, ‘.’)]

Code Explanation:
1. Code to import nltk (Natural language toolkit which contains submodules such as
sentence tokenize and word tokenize.)
2. Text whose tags are to be printed.
3. Sentence Tokenization
4. For loop is implemented where words are tokenized from sentence and tag of each word
is printed as output.
In Corpus there are two types of POS taggers:
 Rule-Based
 Stochastic POS Taggers
1.Rule-Based POS Tagger: For the words having ambiguous meaning, rule-based approach on
the basis of contextual information is applied. It is done so by checking or analyzing the meaning
of the preceding or the following word. Information is analyzed from the surrounding of the
word or within itself. Therefore words are tagged by the grammatical rules of a particular
language such as capitalization and punctuation. e.g., Brill’s tagger.
2. Stochastic POS Tagger: Different approaches such as frequency or probability are applied
under this method. If a word is mostly tagged with a particular tag in the training set, then in a
test sentence it is given that particular tag. The tag assigned to a word can depend not only on the
word itself but also on the previous tag. This method is not always accurate. Another way is to
calculate the probability of occurrence of a specific tag in a sentence; the final tag is then chosen
by picking the tag with the highest probability for that word.
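A minimal sketch contrasting the two flavours with NLTK (the regular-expression patterns and the treebank training slice are arbitrary choices made for illustration; it assumes the treebank corpus has been downloaded):

from nltk.corpus import treebank
from nltk.tag import RegexpTagger, UnigramTagger

# rule-based: hand-written patterns assign tags
rule_tagger = RegexpTagger([
    (r'.*ing$', 'VBG'),                 # gerunds
    (r'.*ed$', 'VBD'),                  # simple past
    (r'.*s$', 'NNS'),                   # plural nouns
    (r'^-?[0-9]+(\.[0-9]+)?$', 'CD'),   # cardinal numbers
    (r'.*', 'NN'),                      # default: noun
])

# stochastic: most frequent tag per word, learned from a tagged corpus
train_sents = treebank.tagged_sents()[:3000]
uni_tagger = UnigramTagger(train_sents, backoff=rule_tagger)

sent = "the cats chased running dogs".split()
print(rule_tagger.tag(sent))
print(uni_tagger.tag(sent))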

POS tagging with Hidden Markov Model


Tagging problems can also be modeled using an HMM. It treats the input tokens as the observable
sequence while the tags are considered hidden states, and the goal is to determine the hidden state
sequence. For example, x = x1, x2, ..., xn is the sequence of tokens while y = y1, y2, ..., yn is the
hidden tag sequence.

How Hidden Markov Model (HMM) Works?


An HMM uses the joint distribution P(x, y), where x is the input/token sequence and y is the tag
sequence. The predicted tag sequence for x is argmax over y1, ..., yn of P(x1, ..., xn, y1, ..., yn).
We have categorized tags from the text, but the statistics of such tags are vital, so the next part is
counting these tags for statistical study.
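A hedged sketch of an HMM-based tagger using NLTK's built-in trainer (the treebank slice size is an arbitrary choice, and it assumes the treebank corpus has been downloaded):

import nltk
from nltk.corpus import treebank
from nltk.tag import hmm

train_sents = treebank.tagged_sents()[:3000]        # observable tokens with their hidden tags
trainer = hmm.HiddenMarkovModelTrainer()
hmm_tagger = trainer.train_supervised(train_sents)  # estimates the joint distribution P(x, y) from counts
print(hmm_tagger.tag(nltk.word_tokenize("Everything to permit us.")))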
Conclusion:

Thus the Study of the different POS taggers and Perform POS tagging on the given text is done
and verified.
EXPT.7 PERFORM CHUNKING FOR THE GIVEN TEXT INPUT

LAB OBJECTIVES:

To understand how to perform chunking on the given text input for real world applications.

LAB OUTCOMES:

On Successful Completion, the Student will be able to perform chunking on the given text input
for real world applications.

PROCEDURE:

Python – Divide String into Equal K chunks


Input : test_str = ‘geeksforgeek’, K = 4
Output : [‘gee’, ‘ksf’, ‘org’, ‘eek’]
Explanation : 12/4 = 3, length of each string extracted.
Input : test_str = ‘geeksforgeek’, K = 1
Output : [‘geeksforgeek’]
Explanation : 12/1 = 12, whole string is single chunk.
Method #1: Using len() + loop
In this method, we first compute the required length of each chunk from K and the string
length; after that, the string is split at the desired indices to extract the chunks using slicing.
 Python3
# Python3 code to demonstrate working of
# Divide String into Equal K chunks
# Using len() + loop

# initializing string
test_str = 'geeksforgeeks 1'

# printing original string
print("The original string is : " + str(test_str))

# initializing K
K = 5

# compute chunk length
chnk_len = len(test_str) // K

res = []
for idx in range(0, len(test_str), chnk_len):

    # appending sliced string
    res.append(test_str[idx : idx + chnk_len])

# printing result
print("The K chunked list : " + str(res))

Output
The original string is : geeksforgeeks 1
The K chunked list : ['gee', 'ksf', 'org', 'eek', 's 1']
Method #2: Using list comprehension
This method is similar to the one above, the difference being that the final loop is encapsulated
in a one-liner list comprehension.
 Python3
# Python3 code to demonstrate working of
# Divide String into Equal K chunks
# Using list comprehension

# initializing strings
test_str = 'geeksforgeeks 1'

# printing original string


print("The original string is : " + str(test_str))

# initializing K
K=5

# compute chunk length


chnk_len = len(test_str) // K

# one-liner to perform the task


res = [test_str[idx : idx + chnk_len] for idx in range(0, len(test_str), chnk_len)]

# printing result
print("The K len chunked list : " + str(res))

Output

The original string is : geeksforgeeks 1


The K len chunked list : ['gee', 'ksf', 'org', 'eek', 's 1']
Conclusion:

Thus the experiment to Perform Chunking for the given text input is done and verified.
EXPT. 8 IMPLEMENTING NAMED ENTITY RECOGNIZER FOR THE
GIVEN TEXT INPUT
LAB OBJECTIVES:

To implement the named entity recognizer for the given text input for the real world applications.

LAB OUTCOMES:

On Successful Completion, the Student will be able to understand about the named entity
recognizer for the given text input for the real world applications

PROCEDURE:

Named Entity Recognition


Named entity recognition (NER) is one of the most common data preprocessing tasks. It involves
identifying key information in the text and classifying it into a set of predefined categories.
An entity is basically the thing that is consistently talked about or referred to in the text.
NER is a sub-task of NLP. At its core, NER is just a two-step process; below are the two steps that
are involved:
 Detecting the entities from the text
 Classifying them into different categories
Some of the most important categories in NER are:
 Person
 Organization
 Place/ location
Other common tasks include classifying the following: date/time expressions, numeral
measurements (money, percent, weight, etc.) and e-mail addresses.
Ambiguity in NE
 For a person, the category definition is intuitively quite clear, but for computers, there is
some ambiguity in classification. Let’s look at some ambiguous example:
 England (Organisation) won the 2019 world cup vs The 2019 world cup
happened in England(Location).
 Washington(Location) is the capital of the US vs The first president of the US
was Washington(Person).

Methods of NER
One way is to train a model for multi-class classification using different machine learning
algorithms, but this requires a lot of labelling. In addition to labelling, the model also requires a
deep understanding of context to deal with the ambiguity of sentences, which makes it a
challenging task for simple machine learning approaches.
Another way is the Conditional Random Field (CRF), which is implemented by both the NLP Speech
Tagger and NLTK. It is a probabilistic model that can be used to model sequential data such
as words, and it can capture a deep understanding of the context of a sentence.
Deep Learning Based NER: deep learning NER is much more accurate than the previous
methods, as it is capable of assembling words meaningfully. This is because it uses a method
called word embedding, which captures the semantic and syntactic
relationship between words. It is also able to learn topic-specific as well as
high-level words automatically, which makes deep learning NER applicable to
multiple tasks. Deep learning can do most of the repetitive work itself, so researchers, for
example, can use their time more efficiently.
Implementation
In this implementation, we will perform Named Entity Recognition using two different
frameworks: spaCy and NLTK. This code can be run on Colab; however, for visualization
purposes, a local environment is recommended. We can install the required frameworks using
pip install.
First, we perform Named Entity Recognition using spaCy; a short NLTK sketch follows the spaCy output.

 Python3
# commands to run before the code
! pip install spacy
! pip install nltk
! python -m spacy download en_core_web_sm

# imports and load spacy english language package
import spacy
from spacy import displacy

nlp = spacy.load('en_core_web_sm')

# Load the text and process it
# I copied the text from the python wiki
text = ("Python is an interpreted, high-level and general-purpose programming language. "
        "Pythons design philosophy emphasizes code readability with "
        "its notable use of significant indentation. "
        "Its language constructs and object-oriented approach aim to "
        "help programmers write clear and "
        "logical code for small and large-scale projects")
# text2 = # copy the paragraphs from https://www.python.org/doc/essays/
doc = nlp(text)
# doc2 = nlp(text2)

sentences = list(doc.sents)
print(sentences)

# tokenization
for token in doc:
    print(token.text)

# print entities
ents = [(e.text, e.start_char, e.end_char, e.label_) for e in doc.ents]
print(ents)

# now we use the displacy function on doc
displacy.render(doc, style='ent', jupyter=True)

Output:
[Python is an interpreted, high-level and general-purpose programming language.,
Pythons design philosophy emphasizes code readability with its notable use of significant
indentation.,
Its language constructs and object-oriented approach aim to help programmers write clear,
logical code for small and large-scale projects]
# tokens
Python
is
an
interpreted
,
high
-
level
and
general
-
purpose
programming
language
.
Pythons
design
philosophy
emphasizes
code
readability
with
its
notable
use
of
significant
indentation
.
Its
language
constructs
and
object
-
oriented
approachaim
to
help
programmers
write
clear
,
logical
code
for
small
and
large
-
scale
projects
# named entity
[('Python', 0, 6, 'ORG')]
#here ORG stands for Organization
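The procedure above mentions NLTK as the second framework; here is a hedged sketch of that route (the sample sentence is made up, and it assumes the punkt, averaged_perceptron_tagger, maxent_ne_chunker and words resources have been downloaded with nltk.download):

import nltk

sentence = "Guido van Rossum created Python while working at Centrum Wiskunde & Informatica."
tokens = nltk.word_tokenize(sentence)   # step 1: tokenize
tagged = nltk.pos_tag(tokens)           # step 2: POS tag
tree = nltk.ne_chunk(tagged)            # step 3: chunk named entities

# collect (entity text, label) pairs, with labels such as PERSON, ORGANIZATION or GPE
entities = [(" ".join(word for word, tag in subtree.leaves()), subtree.label())
            for subtree in tree.subtrees() if subtree.label() != 'S']
print(entities)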
Conclusion:

Thus the experiment to Implement Named Entity Recognizer for the given text input is done and
verified.

EXPT.9 IMPLEMENTING TEXT SIMILARITY RECOGNIZER FOR THE CHOSEN TEXT DOCUMENTS

LAB OBJECTIVES:

To implement text similarity recognizer for the chosen text documents for real world
applications.

LAB OUTCOMES:

On Successful Completion, the Student will be able to understand about implement text
similarity recognizer for the chosen text documents for real world applications.

PROCEDURE:

Measuring the Document Similarity in Python


Document similarity, as the name suggests, determines how similar two given documents are.
By “documents”, we mean a collection of strings, for example an essay or a .txt file. Many
organizations use this principle of document similarity to check for plagiarism. It is also used by
many exam-conducting institutions to check whether one student copied from another. Therefore, it is
very important as well as interesting to know how all of this works.
Document similarity is calculated via document distance. Document distance is a
concept where documents are treated as vectors, and the distance is calculated as the angle between
the two document vectors. A document vector holds the frequency of occurrence of each word in the
given document. Let’s see an example:
Say that we are given two documents D1 and D2 as:
D1: “This is a geek”
D2: “This was a geek thing”
The similar words in both these documents then become:
"This a geek"
If we make a 3-D representation of this as vectors by taking D1, D2 and similar words in 3 axis
geometry, then we get:
Now if we take dot product of D1 and D2,
D1.D2 = "This"."This"+"is"."was"+"a"."a"+"geek"."geek"+"thing".0
D1.D2 = 1+0+1+1+0
D1.D2 = 3
Now that we know how to calculate the dot product of these documents, we can now calculate
the angle between the document vectors:
cos d = D1.D2 / (|D1| |D2|)
Here d is the document distance. Its value ranges from 0 degrees to 90 degrees, where 0 degrees
means the two documents are exactly identical and 90 degrees means the two documents
are completely different.
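For the toy documents D1 and D2 above, the numbers work out as follows (a small worked check, separate from the program below):

import math

dot = 3                                  # D1 . D2 computed above
mag_D1 = math.sqrt(1 + 1 + 1 + 1)        # "This is a geek": four distinct words, each with count 1
mag_D2 = math.sqrt(1 + 1 + 1 + 1 + 1)    # "This was a geek thing": five distinct words, each with count 1
d = math.acos(dot / (mag_D1 * mag_D2))
print(round(d, 3), "radians")            # about 0.835 radians (roughly 48 degrees)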
Now that we know about document similarity and document distance, let’s look at a Python
program to calculate the same:
Document similarity program :
Our algorithm to confirm document similarity will consist of three fundamental steps:
 Split the documents in words.
 Compute the word frequencies.
 Calculate the dot product of the document vectors.
For the first step, we will use the .read() method to open and read the content of the files.
As we read the contents, we split them into a list of words. Next, we calculate the word
frequencies of the file contents: the occurrence of each word is counted and stored in a dictionary.

import math
import string
import sys

# reading the text file
# This function will return a
# list of the lines of text
# in the file.
def read_file(filename):
    try:
        with open(filename, 'r') as f:
            data = f.read()
        return data
    except IOError:
        print("Error opening or reading input file: ", filename)
        sys.exit()

# splitting the text lines into words
# translation table is a global variable
# mapping upper case to lower case and
# punctuation to spaces
translation_table = str.maketrans(string.punctuation + string.ascii_uppercase,
                                  " " * len(string.punctuation) + string.ascii_lowercase)

# returns a list of the words
# in the file
def get_words_from_line_list(text):
    text = text.translate(translation_table)
    word_list = text.split()
    return word_list

Now that we have the word list, we will now calculate the frequency of occurrences of the
words.

# counts frequency of each word
# returns a dictionary which maps
# the words to their frequency.
def count_frequency(word_list):
    D = {}
    for new_word in word_list:
        if new_word in D:
            D[new_word] = D[new_word] + 1
        else:
            D[new_word] = 1
    return D

# returns dictionary of (word, frequency)
# pairs from the previous dictionary.
def word_frequencies_for_file(filename):
    line_list = read_file(filename)
    word_list = get_words_from_line_list(line_list)
    freq_mapping = count_frequency(word_list)

    print("File", filename, ":", )
    print(len(line_list), "lines, ", )
    print(len(word_list), "words, ", )
    print(len(freq_mapping), "distinct words")

    return freq_mapping

Lastly, we will calculate the dot product to give the document distance.
# returns the dot product of two documents
def dotProduct(D1, D2):
    Sum = 0.0
    for key in D1:
        if key in D2:
            Sum += (D1[key] * D2[key])
    return Sum

# returns the angle in radians
# between document vectors
def vector_angle(D1, D2):
    numerator = dotProduct(D1, D2)
    denominator = math.sqrt(dotProduct(D1, D1) * dotProduct(D2, D2))
    return math.acos(numerator / denominator)

That’s all! Time to see the document similarity function:

def documentSimilarity(filename_1, filename_2):
    # filename_1 = sys.argv[1]
    # filename_2 = sys.argv[2]
    sorted_word_list_1 = word_frequencies_for_file(filename_1)
    sorted_word_list_2 = word_frequencies_for_file(filename_2)
    distance = vector_angle(sorted_word_list_1, sorted_word_list_2)

    print("The distance between the documents is: % 0.6f (radians)" % distance)

Here is the full source code.


import math
import string
import sys

# reading the text file
# This function will return a
# list of the lines of text
# in the file.
def read_file(filename):
    try:
        with open(filename, 'r') as f:
            data = f.read()
        return data
    except IOError:
        print("Error opening or reading input file: ", filename)
        sys.exit()

# splitting the text lines into words
# translation table is a global variable
# mapping upper case to lower case and
# punctuation to spaces
translation_table = str.maketrans(string.punctuation + string.ascii_uppercase,
                                  " " * len(string.punctuation) + string.ascii_lowercase)

# returns a list of the words
# in the file
def get_words_from_line_list(text):
    text = text.translate(translation_table)
    word_list = text.split()
    return word_list

# counts frequency of each word
# returns a dictionary which maps
# the words to their frequency.
def count_frequency(word_list):
    D = {}
    for new_word in word_list:
        if new_word in D:
            D[new_word] = D[new_word] + 1
        else:
            D[new_word] = 1
    return D

# returns dictionary of (word, frequency)
# pairs from the previous dictionary.
def word_frequencies_for_file(filename):
    line_list = read_file(filename)
    word_list = get_words_from_line_list(line_list)
    freq_mapping = count_frequency(word_list)

    print("File", filename, ":", )
    print(len(line_list), "lines, ", )
    print(len(word_list), "words, ", )
    print(len(freq_mapping), "distinct words")

    return freq_mapping

# returns the dot product of two documents
def dotProduct(D1, D2):
    Sum = 0.0
    for key in D1:
        if key in D2:
            Sum += (D1[key] * D2[key])
    return Sum

# returns the angle in radians
# between document vectors
def vector_angle(D1, D2):
    numerator = dotProduct(D1, D2)
    denominator = math.sqrt(dotProduct(D1, D1) * dotProduct(D2, D2))
    return math.acos(numerator / denominator)

def documentSimilarity(filename_1, filename_2):
    # filename_1 = sys.argv[1]
    # filename_2 = sys.argv[2]
    sorted_word_list_1 = word_frequencies_for_file(filename_1)
    sorted_word_list_2 = word_frequencies_for_file(filename_2)
    distance = vector_angle(sorted_word_list_1, sorted_word_list_2)

    print("The distance between the documents is: % 0.6f (radians)" % distance)

# Driver code
documentSimilarity('GFG.txt', 'file.txt')

Output:
File GFG.txt :
15 lines,
4 words,
4 distinct words
File file.txt :
22 lines,
5 words,
5 distinct words
The distance between the documents is: 0.835482 (radians)
Conclusion:

Thus the experiment to Implement Text Similarity Recognizer for the chosen text documents is
done and verified.

EXPT.10 EXPLORATORY DATA ANALYSIS FOR A GIVEN TEXT (WORD CLOUD)

LAB OBJECTIVES:

To understand about exploratory data analysis for a given text (word cloud) for real world
applications.

LAB OUTCOMES:

On Successful Completion, the Student will be able to understand about exploratory data
analysis for a given text (word cloud) for real world applications.

PROCEDURE:

Generating Word Cloud in Python


Word Cloud is a data visualization technique used for representing text data in which the size of
each word indicates its frequency or importance. Significant textual data points can be
highlighted using a word cloud. Word clouds are widely used for analyzing data from social
network websites.
For generating word cloud in Python, modules needed are – matplotlib, pandas and wordcloud.
To install these packages, run the following commands :
pip install matplotlib
pip install pandas
pip install wordcloud
The dataset used for generating word cloud is collected from UCI Machine Learning Repository.
It consists of YouTube comments on videos of popular artists.
Dataset Link : https://archive.ics.uci.edu/ml/machine-learning-databases/00380/
Below is the implementation :

 Python3

# Python program to generate WordCloud

# importing all necessary modules
from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt
import pandas as pd

# Reads 'Youtube04-Eminem.csv' file
df = pd.read_csv(r"Youtube04-Eminem.csv", encoding="latin-1")

comment_words = ''
stopwords = set(STOPWORDS)

# iterate through the csv file
for val in df.CONTENT:

    # typecaste each val to string
    val = str(val)

    # split the value
    tokens = val.split()

    # Converts each token into lowercase
    for i in range(len(tokens)):
        tokens[i] = tokens[i].lower()

    comment_words += " ".join(tokens) + " "

wordcloud = WordCloud(width=800, height=800,
                      background_color='white',
                      stopwords=stopwords,
                      min_font_size=10).generate(comment_words)

# plot the WordCloud image
plt.figure(figsize=(8, 8), facecolor=None)
plt.imshow(wordcloud)
plt.axis("off")
plt.tight_layout(pad=0)

plt.show()

Output :

The above word cloud has been generated using Youtube04-Eminem.csv file in the dataset. One
interesting task might be generating word clouds using other csv files available in the dataset.
Advantages of Word Clouds :
1. Analyzing customer and employee feedback.
2. Identifying new SEO keywords to target.
Drawbacks of Word Clouds :
1. Word Clouds are not perfect for every situation.
2. Data should be optimized for context.
Reference : https://en.wikipedia.org/wiki/Tag_cloud
Conclusion:

Thus the experiment, Exploratory data analysis of a given text (Word Cloud) is done and
verified.

EXPT.11 MINI PROJECT REPORT: WEB SCRAPING

LAB OBJECTIVES:

To understand about the mini project report: web scraping for real world applications.

LAB OUTCOMES:

On Successful Completion, the Student will be able to understand about the mini project report:
web scraping for real world applications.

PROCEDURE:
Gathering and preparing datasets is one of the critical steps in any machine learning
project. People accumulate datasets through numerous approaches like databases, online
repositories, APIs, survey forms and many others. But when we want to extract data from a website
that offers no API, the best alternative left is web scraping.

In this article, you will learn about web scraping in brief and see how to extract data from

websites with a hands-on demonstration with python. We will be covering the following topics.
Table of Contents
1. What is Web Scraping
2. Why is Web Scraping used
3. Challenges and Guide for Web Scraping
4. Python Libraries for Web Scraping
5. Hands-on Web Scraping with Python
6. Web Scraping using lxml
7. Web Scraping using Scrapy
8. End Notes
What is Web Scraping?

Web scraping is a simple technique for the automatic collection of a huge amount of
data from websites. Data is of three types: structured, unstructured and semi-structured.
Websites mostly hold data in an unstructured way; web scraping is a technique that helps
to collect this unstructured data from websites and store it in a structured way. Today most
corporates use web scraping to drive good business decisions in this competitive market. So
let’s learn why and where web scraping is used the most.
Why is Web Scraping used?

We have already discussed that web scraping is required to automatically fetch data from websites,
but where is it used, and what creates the requirement to do so? To better understand this, let’s
look at some applications of web scraping.


 Price Comparison – Some various platforms and websites provide comparison, pros, and
cons of different products of different companies on their platforms that make customer easy to
choose the right product for them. Parsehub is a great example that compares the prices of
various products from different shopping websites. College Dunia is another example that
compares the rating, courses fee structure of different institutions.
 Research and Development – most of the websites use cookies, privacy policies. They scrape
the user data like timestamp, time spent, etc to conduct various statistical analyses and manage
customer relationships in a better way.
 Job Listing – many job portals display job openings in a different organization according to
location, skills, etc. They scrape the data from the organization’s careers page to list many
openings from different companies on one single platform.
 Email Gathering – companies that use email for marketing purposes use web scraping to gather
lots of emails from different websites and send bulk emails.

We hope it is now clear why web scraping is necessary and where it is used the most; these
applications should also widen your view, and you can think of many other applications it is
used for nowadays.

Challenges and Guide for Scraping


If you know HTML and CSS, then it is very easy to understand and perform web scraping,
because in a nutshell web scraping extracts website data that is returned in the form of an HTML
document, and CSS selectors are used to get the specific data we are looking for. This background
also matters because web scraping faces a few challenges.


1. Variety – Each website is different, with different formatting and different templates.
   You need to inspect the website HTML to extract the relevant information.
2. Durability – Websites update over time; new postings and formatting keep changing, so a
   web scraper that runs flawlessly today is not guaranteed to keep running fine after
   some time.

But in this article, we will work from the ground up so you can follow it easily.
Python Libraries for web scraping

requests – It is the most basic library for web scraping. requests is a Python module

that allows you to send HTTP requests like GET, POST, etc. to websites using Python.

Getting the HTML content of a web page is the first and foremost step of web scraping. Due to

its ease of use, its motto is "HTTP for Humans". However, requests does not parse the

retrieved HTML content; for that, we require other libraries.

Beautiful Soup (bs4) – Beautiful Soup is a Python library used for web scraping. It sits on top

of an HTML or XML parser and provides Pythonic idioms for iterating, searching, and

modifying a parse tree. It automatically converts incoming documents to Unicode and outgoing

documents to UTF-8. Beautiful Soup is easy to learn, robust, beginner-friendly and, the most

used web scraping library in recent times with request.


pip install bs4

lxml – It is a high performance, fast HTML and XML parsing library. It is faster than a beautiful

soup. It works well when we are aiming to scrape large datasets. It also allows you to extract data

from HTML using XPath and CSS selectors.


pip install lxml
Scrapy – Scrapy is not just a library, it is a complete web scraping framework. Scrapy helps you

to scrape a large amount of dataset efficiently and effectively. It can be used for a wide range of

purposes, from data mining to monitoring and automated testing. Scrapy creates spiders that

crawl across websites and retrieve the data. The best thing about scrapy is it is asynchronous, and

with the help of scrapy, you can make multiple HTTP requests simultaneously. You can also

create a pipeline using scrapy.


pip install scrapy
Hands-on Web Scraping with Python
Problem Description

We are going to scrape the data from the Ambition box website. Ambition Box is a platform that

lists job openings in different companies in India. If you visit the companies page you can

observe the name of the company, rating, review, how old the company is, and different

information about it. So we aim to get this data in a table format that consists of the name of the

company, rating, review, company age, number of employees, etc information. There are about

33 pages and on each page, approximately 30 companies are listed and we want to fetch all the

33 pages of each company data in the dataframe.

Let’s get started!


Import the libraries
import pandas as pd
import requests
from bs4 import BeautifulSoup
import numpy as np

Make a request

Now we will create an HTTP request to the Ambition Box website and it will give us a response

HTML document content.


webpage=requests.get('https://www.ambitionbox.com/list-of-companies?page=1').text

Parse through response using beautiful soup

We have the HTML content, and to extract the data present in that we will use beautiful soup

which creates a parser around it. If you print the parser output using prettify function then you

can see the extracted document in a readable format.
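A short sketch of this step (the choice of the 'lxml' parser is an assumption; 'html.parser' also works if lxml is not installed):

from bs4 import BeautifulSoup

soup = BeautifulSoup(webpage, 'lxml')
print(soup.prettify()[:500])   # peek at the beginning of the parsed document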

How will we find each piece of text we need?

Every piece of text in an HTML document is written inside some tag. If you go to the website,

right-click and open the Inspect section, you can see that each company’s information is in a

separate division (div) tag. Each element has a unique class or id, so by using the tag, class or id

we can get the desired results. The page heading, for example, is written in the h1 tag, so

accessing it is easy:

soup.find_all('h1')[0].text
How to get the name of companies?

The names of all the companies are written in h2 tags, so we can run a loop and get all the

company names on the first page. find_all returns a list object; we access it with a zero-based

index, take the element’s text, and use the strip function to remove the extra spaces that are used

in a web page for design, as sketched below.
How to extract other details?

If you inspect the rating and review of a company, they are both written in paragraph (p) or anchor

tags. Along with the tag, they have unique class names by which we can identify them: the rating

has the rating class and the reviews have the review-count class. But the company type, company

age, headquarters location, and number of employees are in the same tag with the same class

name. To access these we will use list indexing.


for i in soup.find_all('p'):
    print(i.text.strip())

Creating a list of each feature

Now, as we have seen, we will access each feature using its tag and class, so let us create a

separate list for each feature, each of which will have length 30. First, we store the list of all 30

divisions (i.e. all the company divisions) in a variable and loop over it.
company=soup.find_all('div',class_='company-content-wrapper')
print(len(company)) #30

Now we can easily loop over the company variables and get all the information on the first page.
name = []
rating = []
reviews = []
comp_type = []
head_q = []
how_old = []
no_of_employees = []

for comp in company:
    name.append(comp.find('h2').text.strip())
    rating.append(comp.find('p', class_="rating").text.strip())
    reviews.append(comp.find('a', class_="review-count").text.strip())
    comp_type.append(comp.find_all('p', class_='infoEntity')[0].text.strip())
    head_q.append(comp.find_all('p', class_='infoEntity')[1].text.strip())
    how_old.append(comp.find_all('p', class_='infoEntity')[2].text.strip())
    no_of_employees.append(comp.find_all('p', class_='infoEntity')[3].text.strip())

# creating dataframe from all the lists
features = {'name': name, 'rating': rating, 'reviews': reviews,
            'company_type': comp_type, 'Head_Quarters': head_q, 'Company_Age': how_old,
            'No_of_Employee': no_of_employees}
df = pd.DataFrame(features)

The above is a complete dataframe of only the first page, and now let’s kickstart our enthusiasm

and fetch data for all the pages.


Creating a Final Dataframe

Let’s Prepare Dataset Using Web Scraping!

Now you have a better understanding of web scraping and of how each feature comes out as a

separate list, so we are ready to create a final dataframe from all 33 pages, each page having data

for about 30 companies. On some pages there are inconsistencies, such as a little information

about a company not being provided, so we need to handle this: we place each feature in a try-

except block and append a null value if the data is not present. To fetch data from every page, we

make a request to a different page on each iteration of a loop and fetch its data; after that,

everything is the same as above.
final = pd.DataFrame()

for j in range(1, 33):
    # make a request to a specific page
    webpage = requests.get('https://www.ambitionbox.com/list-of-companies?page={}'.format(j)).text
    soup = BeautifulSoup(webpage, 'lxml')
    company = soup.find_all('div', class_='company-content-wrapper')

    name = []
    rating = []
    reviews = []
    comp_type = []
    head_q = []
    how_old = []
    no_of_employees = []

    for comp in company:
        try:
            name.append(comp.find('h2').text.strip())
        except:
            name.append(np.nan)
        try:
            rating.append(comp.find('p', class_="rating").text.strip())
        except:
            rating.append(np.nan)
        try:
            reviews.append(comp.find('a', class_="review-count").text.strip())
        except:
            reviews.append(np.nan)
        try:
            comp_type.append(comp.find_all('p', class_='infoEntity')[0].text.strip())
        except:
            comp_type.append(np.nan)
        try:
            head_q.append(comp.find_all('p', class_='infoEntity')[1].text.strip())
        except:
            head_q.append(np.nan)
        try:
            how_old.append(comp.find_all('p', class_='infoEntity')[2].text.strip())
        except:
            how_old.append(np.nan)
        try:
            no_of_employees.append(comp.find_all('p', class_='infoEntity')[3].text.strip())
        except:
            no_of_employees.append(np.nan)

    # creating dataframe from all the lists for this page
    features = {'name': name, 'rating': rating, 'reviews': reviews,
                'company_type': comp_type, 'Head_Quarters': head_q, 'Company_Age': how_old,
                'No_of_Employee': no_of_employees}
    df = pd.DataFrame(features)
    final = final.append(df, ignore_index=True)
We have created a dynamic URL for each page to make the requests and fetch the data, and you can

have a look at the final dataframe. That's it; this is how web scraping is done.

Web Scraping using lxml

Now we have an understanding of how web scraping works, and how to extract a single piece of

information from a website and build a dataframe. What if we want to extract some

paragraphs or some informative lines from a blog or article? That is easy to do with lxml using

XPath.

We will extract a paragraph from one of the Analytics Vidhya articles using lxml with only a few

lines of code. I hope that you have already installed lxml using the pip command, and are ready

to follow the below steps.


Step-1) Inspect the paragraph which has to be scraped

Visit the article and select any paragraph and right-click on it and click on inspect option.
Step-2) Right-click element on source-code to the right

As you click on Inspect, the Elements section opens; right-click on the selected element there,

copy the XPath of the element, then come back to the coding environment and save the path in a

variable as a string.
Step-3) HTTP Request to retrieve HTML content

Make HTTP requests on the Article website to retrieve the HTML content.
import requests
from lxml import html

URL = 'https://www.analyticsvidhya.com/blog/2021/09/a-comprehensive-guide-on-neural-networks-performance-optimization/'
path = '//*[@id="1"]/p[5]'
response = requests.get(URL)
Step-4) Get Byte string and filter source code

The lxml parser parses the response content received from the request and converts it into a

source-code object.
byte_data = response.content
source_code = html.fromstring(byte_data)
Step-5) Jump to preferred HTML element

Now using Xpath retrieve the desired paragraph we aim to get.


tree = source_code.xpath(path)
print(tree[0].text_content())
That's it; it really is this simple to use the lxml parser to extract a large amount of data from a

website into our coding environment.


Hands-on Web Scraping using Scrapy

Scraping data efficiently in a few minutes is everyone's aim, and it is fulfilled by scrapy with

multiple spider bots that crawl a website to retrieve data for you. In this section, we will be

using scrapy in a local Jupyter notebook (or Google Colab) and scrape data into our dataframe.

Scrapy provides a default quote website for learning web scraping using scrapy.

It consists of various quotes along with the author's name and the tags to which each quote belongs.

We will create a dataframe with 3 columns: quote, author, and tags. After installing scrapy, follow

the steps below. After scraping details from the website, we will write them to a JSON file and

load a dataframe from the JSON using the pandas library.


Step-1) set Interactive python shell
from IPython.core.interactiveshell import InteractiveShell
import scrapy
from scrapy.crawler import CrawlerProcess
Step-2) Setup a Pipeline

Here we create a pipeline class that opens a new JSON-lines file, along with a function to write

every item found during scraping into that file, where each line contains one JSON element.
# setup pipeline
import json

class JsonWriterPipeline(object):

    def open_spider(self, spider):
        self.file = open('quoteresult.jl', 'w')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item
step-3) Define the Spider

Now we need to define our crawler (Spider); we pass the URLs from which to start parsing and

specify which values to retrieve. The logging level is set to warning so that the notebook is not overloaded.
# define spider
import logging

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]
    custom_settings = {
        'LOG_LEVEL': logging.WARNING,
        'ITEM_PIPELINES': {'__main__.JsonWriterPipeline': 1},  # Used for pipeline 1
        'FEED_FORMAT': 'json',                                 # Used for pipeline 2
        'FEED_URI': 'quoteresult.json'                         # Used for pipeline 2
    }

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('span small::text').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract(),
            }

Each quote is written in a separate division tag with class name as the quote so we have fetched

all quotes using this division and quote CSS selector.

Step-4) Start the Crawler

define the scrapy crawler process and pass the spider class to start retrieving the data.
# start crawler
process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})
process.crawl(QuotesSpider)
process.start()

Step-5) Create DataFrames

The retrieved data is saved in a JSON file and we will load them as a dataframe using pandas.
import pandas as pd
dfjson = pd.read_json('quoteresult.json')
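As a side note, the pipeline above also wrote a JSON-lines file (quoteresult.jl); assuming a reasonably recent pandas, it can be loaded as well:

dfjl = pd.read_json('quoteresult.jl', lines=True)   # lines=True reads one JSON object per line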
This is how scrapy works and helps you to extract lots of data from websites very quickly.
End Notes

In this article, we have learned about Web scraping, its applications, and why it is being used

everywhere. We have performed hands-on live web scraping from websites to fetch different

companies and prepare a dataframe that is used for further machine learning project purposes

using beautiful soup. We have also learned about the lxml library and performed a practical

demonstration. Apart from this, we have learned about scrapy, often called the boss of web scraping

libraries, and why it is known as such.


Conclusion:

Thus the Mini Project Report: Web Scraping for real world NLP application is done and verified.
EXPT.12 IMPLEMENTATION AND PRESENTATION OF MINI PROJECT: WEB
SCRAPING

LAB OBJECTIVES:

To implement and presentation of mini project: web scraping for real world applications.

LAB OUTCOMES:

On Successful Completion, the Student will be able to implementation and presentation of mini
project: web scraping for real world applications

PROCEDURE:

Python Web Scraping Tutorial


Let's suppose you want to get some information from a website, say an article from the
geeksforgeeks website or some news article. What will you do? The first thing that may come to
your mind is to copy and paste the information into your local media. But what if you want a
large amount of data on a daily basis, as quickly as possible? In such situations, copy and
paste will not work, and that's where you'll need web scraping.
In this article, we will discuss how to perform web scraping using the requests library and
beautifulsoup library in Python.
Requests Module
Requests library is used for making HTTP requests to a specific URL and returns the response.
Python requests provide inbuilt functionalities for managing both the request and response.

Installation

Requests installation depends on the type of operating system, the basic command anywhere
would be to open a command terminal and run,
pip install requests
Making a Request

Python requests module has several built-in methods to make HTTP requests to specified URI
using GET, POST, PUT, PATCH, or HEAD requests. A HTTP request is meant to either retrieve
data from a specified URI or to push data to a server. It works as a request-response protocol
between a client and a server. Here we will be using the GET request.
GET method is used to retrieve information from the given server using a given URI. The GET
method sends the encoded user information appended to the page request.

Example: Python requests making GET request

 Python3
import requests

# Making a GET request


r = requests.get('https://www.geeksforgeeks.org/python-programming-language/')

# check status code for response received


# success code - 200
print(r)

# print content of request


print(r.content)

Output:
Response object

When one makes a request to a URI, it returns a response. This Response object in terms of
python is returned by requests.method(), method being – get, post, put, etc. Response is a
powerful object with lots of functions and attributes that assist in normalizing data or creating
ideal portions of code. For example, response.status_code returns the status code from the
headers itself, and one can check if the request was processed successfully or not.
Response objects can be used to imply lots of features, methods, and functionalities.
Example: Python requests Response Object
Python3

import requests

# Making a GET request


r = requests.get('https://www.geeksforgeeks.org/python-programming-language/')

# print request object


print(r.url)

# print status code


print(r.status_code)

Output:

https://www.geeksforgeeks.org/python-programming-language/
200
For more information, refer to our Python Requests Tutorial.
BeautifulSoup Library
BeautifulSoup is used to extract information from HTML and XML files. It provides a parse
tree and functions to navigate, search or modify this parse tree.
Installation
To install Beautifulsoup on Windows, Linux, or any operating system,
one would need pip package. To check how to install pip on your
operating system, check out – PIP Installation – Windows || Linux. Now
run the below command in the terminal.

pip install beautifulsoup4


Inspecting Website

Before getting out any information from the HTML of the page, we must understand the
structure of the page. This is needed to be done in order to select the desired data from the entire
page. We can do this by right-clicking on the page we want to scrape and select inspect element.

After clicking the inspect button the Developer Tools of the browser gets open. Now almost all
the browsers come with the developers tools installed, and we will be using Chrome for this
tutorial.
The developer’s tools allow seeing the site’s Document Object Model (DOM). If you don’t know
about DOM then don’t worry just consider the text displayed as the HTML structure of the page.

Parsing the HTML


After getting the HTML of the page let’s see how to parse this raw HTML code into some
useful information. First of all, we will create a BeautifulSoup object by specifying the parser we
want to use.
Note: BeautifulSoup library is built on top of the HTML parsing libraries like html5lib, lxml,
html.parser, etc. So BeautifulSoup object and specify the parser library can be created at the
same time.

Example: Python BeautifulSoup Parsing HTML

 Python3
import requests
from bs4 import BeautifulSoup

# Making a GET request


r = requests.get('https://www.geeksforgeeks.org/python-programming-language/')

# check status code for response received


# success code - 200
print(r)
# Parsing the HTML
soup = BeautifulSoup(r.content, 'html.parser')
print(soup.prettify())

Output:

This information is still not very useful to us, so let's see another example to get a clearer
picture. Let's try to extract the title of the page.

 Python3
import requests
from bs4 import BeautifulSoup

# Making a GET request
r = requests.get('https://www.geeksforgeeks.org/python-programming-language/')

# Parsing the HTML
soup = BeautifulSoup(r.content, 'html.parser')

# Getting the title tag
print(soup.title)

# Getting the name of the tag
print(soup.title.name)

# Getting the name of the parent tag
print(soup.title.parent.name)

Output:
<title>Python Programming Language - GeeksforGeeks</title>
title
html
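Just as .parent walks up the parse tree, a tag's direct children can be walked down with the .children iterator or the .contents list. A minimal sketch on a small hand-written snippet (the snippet is only for illustration):

from bs4 import BeautifulSoup

html = "<html><head><title>Demo</title></head><body><p>Hi</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# .children is an iterator over direct children; .contents is the same data as a list
for child in soup.html.children:
    print(child.name)       # head, then body

print(soup.head.contents)   # [<title>Demo</title>]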

Finding Elements

Now, we would like to extract some useful data from the HTML content. The soup object
contains all the data in a nested structure that can be programmatically extracted. The
website we want to scrape contains a lot of text, so let's scrape all of that content. First, let's
inspect the webpage we want to scrape.
Finding Elements by class

In the above image, we can see that all the content of the page is under the div with the class
entry-content. We will use the find() method, which returns the first tag matching the given
name and attributes. In our case, it will find the div having entry-content as its class. This gives
us all the content from the site, but you can see that the images and links are also scraped. So
our next task is to extract only the text content from the above-parsed HTML. On again
inspecting the HTML of our website:

We can see that the content of the page is under the <p> tag. Now we have to find all the p tags
present inside this div. We can use the find_all() method of BeautifulSoup.
 Python3
import requests
from bs4 import BeautifulSoup

# Making a GET request
r = requests.get('https://www.geeksforgeeks.org/python-programming-language/')

# Parsing the HTML
soup = BeautifulSoup(r.content, 'html.parser')

# Finding the div with class entry-content
s = soup.find('div', class_='entry-content')

# All the p tags inside that div
content = s.find_all('p')

print(content)

Output:

Finding Elements by ID

In the above example, we found the elements by class name, but let's see how to find elements
by id. For this task, let's scrape the content of the leftbar of the page. The first step is to inspect
the page and see under which tag the leftbar falls.
The above image shows that the leftbar falls under the <div> tag with the id main. Now let's get
the HTML content under this tag. Let's inspect more of the page to get the content of the
leftbar.

We can see that the list in the leftbar is under the <ul> tag with the class leftBarList, and our
task is to find all the li under this ul.

 Python3
import requests
from bs4 import BeautifulSoup

# Making a GET request
r = requests.get('https://www.geeksforgeeks.org/python-programming-language/')

# Parsing the HTML
soup = BeautifulSoup(r.content, 'html.parser')

# Finding by id
s = soup.find('div', id='main')

# Getting the leftbar
leftbar = s.find('ul', class_='leftBarList')

# All the li under the above ul
content = leftbar.find_all('li')

print(content)

Output:
Extracting Text from the tags

In the above examples, you must have noticed that while scraping the data the tags also get
scraped. But what if we want only the text, without any tags? Don't worry, we will discuss
exactly that in this section. We will be using the text property, which returns only the text
contained in a tag. We will reuse the above examples and remove all the tags from the output.
Example 1: Removing the tags from the content of the page

 Python3
import requests
from bs4 import BeautifulSoup

# Making a GET request
r = requests.get('https://www.geeksforgeeks.org/python-programming-language/')

# Parsing the HTML
soup = BeautifulSoup(r.content, 'html.parser')

s = soup.find('div', class_='entry-content')

lines = s.find_all('p')

for line in lines:
    print(line.text)

Output:
Example 2: Removing the tags from the content of the leftbar

Python3

import requests
from bs4 import BeautifulSoup

# Making a GET request
r = requests.get('https://www.geeksforgeeks.org/python-programming-language/')

# Parsing the HTML
soup = BeautifulSoup(r.content, 'html.parser')

# Finding by id
s = soup.find('div', id='main')

# Getting the leftbar
leftbar = s.find('ul', class_='leftBarList')

# All the li under the above ul
lines = leftbar.find_all('li')

for line in lines:
    print(line.text)

Output:
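A note on whitespace: the .text property is a shortcut for the get_text() method, which also accepts separator and strip arguments that are handy when the extracted text contains stray whitespace or nested tags. A small sketch on a made-up snippet:

from bs4 import BeautifulSoup

html = "<li>  <a href='#'>Python</a>  Tutorial  </li>"
soup = BeautifulSoup(html, "html.parser")

li = soup.find("li")
print(repr(li.text))                           # '  Python  Tutorial  '
print(li.get_text(separator=" ", strip=True))  # Python Tutorial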

Extracting Links

So far we have seen how to extract text; let's now see how to extract the links from the page.
Example: Python BeautifulSoup Extracting Links

 Python3
import requests
from bs4 import BeautifulSoup

# Making a GET request
r = requests.get('https://www.geeksforgeeks.org/python-programming-language/')

# Parsing the HTML
soup = BeautifulSoup(r.content, 'html.parser')

# find all the anchor tags with "href"
for link in soup.find_all('a'):
    print(link.get('href'))

Output:
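Some of the printed href values may be relative paths, and anchors without an href print None. If absolute URLs are needed, urljoin from the standard library can resolve each link against the page URL; a sketch building on the loop above:

from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

page_url = 'https://www.geeksforgeeks.org/python-programming-language/'
r = requests.get(page_url)
soup = BeautifulSoup(r.content, 'html.parser')

for link in soup.find_all('a'):
    href = link.get('href')
    if href:  # skip anchor tags that have no href attribute
        # relative paths become absolute; absolute URLs pass through unchanged
        print(urljoin(page_url, href))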

Extracting Image Information

On again inspecting the page, we can see that images lie inside the
img tag and the link of that image is inside the src attribute. See the
below image –
Example: Python BeautifulSoup Extract Image

Python3

import requests
from bs4 import BeautifulSoup

# Making a GET request
r = requests.get('https://www.geeksforgeeks.org/python-programming-language/')

# Parsing the HTML
soup = BeautifulSoup(r.content, 'html.parser')

images_list = []

images = soup.select('img')
for image in images:
    src = image.get('src')
    alt = image.get('alt')
    images_list.append({"src": src, "alt": alt})

for image in images_list:
    print(image)

Output:
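Once an image URL has been extracted, the file itself can be downloaded with another requests.get() call and written to disk in binary mode. A minimal sketch (the image URL and file name below are hypothetical placeholders, not taken from the page):

import requests

# hypothetical image URL, e.g. one of the src values collected in images_list
img_url = "https://media.geeksforgeeks.org/wp-content/uploads/example.png"

resp = requests.get(img_url)
resp.raise_for_status()  # stop here if the download failed

# write the raw image bytes to a local file
with open("downloaded_image.png", "wb") as f:
    f.write(resp.content)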
Scraping multiple Pages

Now, there may arise various instances where you want to get data from multiple pages of the
same website, or from multiple different URLs, and manually writing code for each webpage is
a time-consuming and tedious task. Plus, it defeats the basic purpose of automation.
To solve this exact problem, we will see two main techniques that help us extract data from
multiple webpages:
 The same website
 Different website URLs

Example 1: Looping through the page numbers

Most websites have pages labeled from 1 to N. This makes it really simple for us to loop through
these pages and extract data from them as these pages have similar structures. For example:

page numbers at the bottom of the GeeksforGeeks website

Here, we can see the page details at the end of the URL. Using this information we can easily
create a for loop iterating over as many pages as we want (by putting page/(i)/ in the URL string
and iterating “i” till N) and scrape all the useful data from them. The following code will give
you more clarity over how to scrape data by using a For Loop in Python.

 Python3
import requests
from bs4 import BeautifulSoup as bs

URL = 'https://www.geeksforgeeks.org/page/1/'

req = requests.get(URL)
soup = bs(req.text, 'html.parser')

titles = soup.find_all('div', attrs={'class', 'head'})

print(titles[4].text)

Output:
7 Most Common Time Wastes During Software Development
Now, using the above code, we can get the titles of all the articles by just sandwiching those lines
with a loop.

 Python3
import requests
from bs4 import BeautifulSoup as bs

URL = 'https://www.geeksforgeeks.org/page/'

for page in range(1, 10):
    req = requests.get(URL + str(page) + '/')
    soup = bs(req.text, 'html.parser')

    titles = soup.find_all('div', attrs={'class', 'head'})

    for i in range(4, 19):
        if page > 1:
            # (page - 1) * 15 keeps the numbering continuous across pages
            print(f"{(i - 3) + (page - 1) * 15}" + titles[i].text)
        else:
            print(f"{i - 3}" + titles[i].text)

Output:

Example 2: Looping through a list of different URLs

The above technique is absolutely wonderful, but what if you need to scrape different pages,
and you don't know their page numbers? You'll need to scrape those different URLs one by
one and manually code a script for every such webpage.

Instead, you could just make a list of these URLs and loop through them. By simply iterating
over the items in the list, i.e. the URLs, we will be able to extract the titles of those pages
without having to write code for each page. Here's an example of how you can do it.

 Python3
import requests
from bs4 import BeautifulSoup as bs

URL = ['https://www.geeksforgeeks.org', 'https://www.geeksforgeeks.org/page/10/']

for url in range(0, 2):
    req = requests.get(URL[url])
    soup = bs(req.text, 'html.parser')

    titles = soup.find_all('div', attrs={'class', 'head'})
    for i in range(4, 19):
        if url + 1 > 1:
            print(f"{(i - 3) + url * 15}" + titles[i].text)
        else:
            print(f"{i - 3}" + titles[i].text)

Output:
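When looping over many pages or URLs like this, it is good practice to check the status code before parsing and to pause between requests so the target server is not flooded. A hedged sketch of the same page loop with these additions (the 1-second delay and the range of pages are arbitrary choices):

import time

import requests
from bs4 import BeautifulSoup as bs

URL = 'https://www.geeksforgeeks.org/page/'

for page in range(1, 4):
    req = requests.get(URL + str(page) + '/')

    # skip pages that did not load successfully
    if req.status_code != 200:
        print(f"Skipping page {page}: status {req.status_code}")
        continue

    soup = bs(req.text, 'html.parser')
    titles = soup.find_all('div', attrs={'class', 'head'})
    print(f"Page {page}: found {len(titles)} title blocks")

    # be polite to the server: wait a moment before the next request
    time.sleep(1)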
For more information, refer to our Python BeautifulSoup Tutorial.
Saving Data to CSV
First we will create a list of dictionaries with the key-value pairs that we want to add to the
CSV file. Then we will use the csv module to write the output to the CSV file. See the below
example for better understanding.

Example: Python BeautifulSoup saving to CSV

 Python3
import requests
from bs4 import BeautifulSoup as bs
import csv

URL = 'https://www.geeksforgeeks.org/page/'

# Making a GET request (needed so that req exists before parsing)
req = requests.get(URL)
soup = bs(req.text, 'html.parser')

titles = soup.find_all('div', attrs={'class', 'head'})

titles_list = []

count = 1
for title in titles:
    d = {}
    d['Title Number'] = f'Title {count}'
    d['Title Name'] = title.text
    count += 1
    titles_list.append(d)

filename = 'titles.csv'
with open(filename, 'w', newline='') as f:
    w = csv.DictWriter(f, ['Title Number', 'Title Name'])
    w.writeheader()
    w.writerows(titles_list)

Output:
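As an alternative to the csv module, the same list of dictionaries can be written out with pandas, which also makes further analysis easier. A brief sketch, assuming pandas is installed (the two rows below are dummy data standing in for titles_list):

import pandas as pd

# dummy rows in the same shape as titles_list built above
titles_list = [
    {"Title Number": "Title 1", "Title Name": "Example article one"},
    {"Title Number": "Title 2", "Title Name": "Example article two"},
]

df = pd.DataFrame(titles_list)
df.to_csv("titles.csv", index=False)  # index=False drops the row-index column
print(df.head())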
Conclusion: Thus the Implementation and Presentation of Mini Project: Web Scraping is
done and verified.
