Lab Manual - NLP
SEMESTER: 7
AY: 2022-23
SUBJECT TEACHER
PROF.M.VELLADURAI
LIST OF EXPERIMENTS
SL.NO. EXPERIMENT NAME
1. Study various applications of NLP and Formulate the Problem
Statement for Mini Project based on chosen real world NLP
applications
2. Various text preprocessing techniques for any given text:
Tokenization and Filtration & Script Validation
3. Text preprocessing techniques for any given text: Stop Word
Removal, Lemmatization / Stemming
4. Morphological analysis and word generation for any given text
5. Implementing N-Gram model for the given text input
6. Study the different POS taggers and Perform POS tagging on the
given text
7. Chunking for the given text input
8. Implement Named Entity Recognizer for the given text input
9. Implement Text Similarity Recognizer for the chosen text
documents
10. Exploratory data analysis of a given text (Word Cloud)
11. Mini Project Report: For any one chosen real world NLP application
12. Implementation and Presentation of Mini Project
Ex.1 STUDY VARIOUS APPLICATIONS OF NLP AND FORMULATE THE
PROBLEM STATEMENT FOR MINI PROJECT BASED ON CHOSEN REAL WORLD
NLP APPLICATIONS
LAB OBJECTIVES:
To study the various applications of NLP and formulate the problem statement for the mini project.
LAB OUTCOMES:
On successful completion, the student will be able to understand various NLP applications.
PROCEDURE:
Machine Translation
Using corpus methods, more complex translations can be performed, with better handling of
differences in linguistic typology, phrase recognition, and translation of idioms, as well as the
isolation of anomalies. Currently, some systems cannot perform as well as a human translator, but
in the coming future this may also become possible.
In simple language, we can say that machine translation works by using computer software to
translate the text from one source language to another target language. There are different
types of machine translation and in the next section, we will discuss them in detail.
Presently, statistical machine translation (SMT) is great for basic translation, however its most
notable disadvantage is that it does not factor in context, which means translations can often be
wrong; in other words, don't expect great quality translation. There are several types of
statistical machine translation models: hierarchical phrase-based translation, syntax-based
translation, phrase-based translation, and word-based translation.
Rule-based machine translation (RBMT) translates on the basis of grammatical rules. It conducts a
grammatical analysis of the source language and the target language to generate the translated
sentence. However, RBMT requires extensive post-editing, and its heavy reliance on dictionaries
means that proficiency is achieved only after a significant period.
Applications of machine translation
Machine translation technology and products have been used in numerous application scenarios,
for example business travel, the tourism industry, etc. In terms of the object of translation, there
is text translation for written language and speech translation for spoken language.
Text translation
Automated text translation is widely used in a variety of sentence-level and text-level
translation applications. Sentence-level applications include the translation of query and
retrieval inputs and the translation of optical character recognition (OCR) output from images.
Text-level applications include the translation of plain documents of all kinds and the
translation of documents containing structured data.
Structured data mostly includes the presentation format of the text content and other
information, for example fonts, colours, tables, forms, hyperlinks, etc. At present, the
translation objects of machine translation systems are mostly at the sentence level.
Most importantly, a sentence can completely express a topic, naturally forming an expression
unit, and the meaning of each word in the sentence can largely be determined from the limited
context within the sentence. Also, the methods for obtaining data at sentence-level granularity
from the training corpus are more effective than those based on other morphological levels, for
example words, phrases, and text passages. Finally, sentence-level translation can naturally be
extended to support translation at other morphological levels.
Speech translation
With the rapid advancement of mobile applications, voice input has become a convenient mode of
human-computer interaction, and speech translation has become an important application scenario.
The basic pipeline of speech translation is "source language speech - source language text -
target language text - target language speech".
In this pipeline, automatic text translation from source-language text to target-language text is
an essential intermediate module. In addition, the front end and back end also require automatic
speech recognition (ASR) and text-to-speech (TTS).
Other applications
Essentially, the task of machine translation is to convert a word sequence in the source language
into a semantically equivalent word sequence in the target language. In general, it performs a
sequence transformation task, converting one sequence object into another according to some
knowledge and logic through models and algorithms.
Many task scenarios involve transformations between sequence objects, and language in the
machine translation task is just one type of sequence object. Therefore, when the concepts of the
source language and target language are extended from languages to other sequence object types,
machine translation methods and techniques can be applied to solve many similar transformation
tasks.
Machine translation is the instant conversion of text from one language to another using
artificial intelligence, whereas human translation involves actual brainpower, in the form of one
or more translators translating the text manually.
In a bag of words, a vector represents the frequency of words in a predefined dictionary of a
word list. We can perform NLP using the following machine learning algorithms: Naïve Bayes,
SVM, and deep learning.
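For illustration, a minimal bag-of-words classifier can be sketched with scikit-learn's CountVectorizer and Multinomial Naïve Bayes; the tiny training texts and labels below are made up for the example.

# A minimal bag-of-words text classifier (illustrative toy data only).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_texts = ["the movie was great", "what a waste of time",
               "really enjoyed the film", "terrible and boring plot"]
train_labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative

# Each document becomes a vector of word counts over the learned vocabulary
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)

clf = MultinomialNB()
clf.fit(X_train, train_labels)

X_test = vectorizer.transform(["the plot was boring"])
print(clf.predict(X_test))  # expected: [0]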
The third approach to text classification is the hybrid approach. The hybrid approach combines a
rule-based and a machine learning-based approach. It uses the rule-based system to create tags and
uses machine learning to train the system and create rules. Then the machine-generated rule list
is compared with the rule-based rule list. If something does not match on the tags, humans
improve the list manually. It is the best method to implement text classification.
1. Vector Semantic
Vector semantics is another way of word and sequence analysis. Vector semantics defines and
interprets word meanings to explain features such as similar words and opposite words. The main
idea behind vector semantics is that two words are alike if they are used in a similar context.
Vector semantics places the words in a multi-dimensional vector space. Vector semantics is
useful in sentiment analysis.
2. Word Embedding
Word embedding is another method of word and sequence analysis. Embedding translates sparse
vectors into a low-dimensional space that preserves semantic relationships. Word embedding is a
type of word representation that allows words with similar meaning to have a similar
representation. There are two types of word embedding:
Word2vec
Doc2Vec.
Word2Vec is a statistical method for effectively learning a standalone word embedding from a
text corpus.
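As a rough illustration (the toy corpus and hyperparameters below are only for demonstration), a Word2Vec model can be trained with the gensim library:

# Training a small Word2Vec model with gensim on a toy corpus.
from gensim.models import Word2Vec

corpus = [["nlp", "makes", "machines", "understand", "language"],
          ["word", "embeddings", "capture", "word", "meaning"],
          ["similar", "words", "get", "similar", "vectors"]]

# vector_size: embedding dimension, window: context size, min_count: ignore rare words
model = Word2Vec(sentences=corpus, vector_size=50, window=3, min_count=1, epochs=50)

print(model.wv["word"][:5])            # first 5 dimensions of the vector for "word"
print(model.wv.most_similar("word"))   # nearest neighbours in the toy embedding space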
Doc2Vec is similar to Word2Vec, but it analyzes a group of text, like pages or documents.
4. Sequence Labeling
Sequence labeling is a typical NLP task that assigns a class or label to each token in a given
input sequence. If someone says "play the movie by tom hanks", the sequence labeling will be
[play, movie, tom hanks]. Play determines an action, movie is an instance of that action, and Tom
Hanks goes in as a search entity. It divides the input into multiple tokens and uses LSTM to
analyze it. There are two forms of sequence labeling: token labeling and span labeling.
The best example is Amazon Alexa.
Parsing is a phase of NLP where the parser determines the syntactic structure of a text by
analyzing its constituent words based on an underlying grammar. For example, "tom ate an apple"
will be divided into proper noun tom, verb ate, determiner an, noun apple.
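A quick way to see such a breakdown is to run the sentence through spaCy's POS tagger (assuming the en_core_web_sm model used later in this manual is installed):

# POS tags for the example sentence.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("tom ate an apple")
for token in doc:
    print(token.text, token.pos_)
# roughly: tom PROPN, ate VERB, an DET, apple NOUN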
We discussed how text is classified and how to divide the word and sequence so that the algorithm
can understand and categorize it. In this project, we are going to perform a sentiment analysis of
fifty thousand IMDB movie reviews. Our goal is to identify whether a review posted on the
IMDB site by a user is positive or negative.
This project covers text mining techniques like Text Embedding, Bags of Words, word context,
and other things. We will also cover the introduction of a bidirectional LSTM sentiment
classifier. We will also look at how to import a labeled dataset from TensorFlow automatically.
This project also covers steps like data cleaning, text processing, data balance through sampling,
and train and test a deep learning model to classify text.
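A compressed sketch of that pipeline is shown below; it assumes TensorFlow and tensorflow_datasets are installed, and the layer sizes and single epoch are illustrative rather than tuned.

# Sketch of the IMDB sentiment pipeline: load the labeled dataset, vectorize the
# raw text, and train a small bidirectional LSTM classifier.
import tensorflow as tf
import tensorflow_datasets as tfds

train_ds = tfds.load("imdb_reviews", split="train", as_supervised=True).batch(64)

encoder = tf.keras.layers.TextVectorization(max_tokens=10000)
encoder.adapt(train_ds.map(lambda text, label: text))

model = tf.keras.Sequential([
    encoder,
    tf.keras.layers.Embedding(len(encoder.get_vocabulary()), 64, mask_zero=True),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1)  # a positive logit means a positive review
])
model.compile(loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              optimizer="adam", metrics=["accuracy"])
model.fit(train_ds, epochs=1)  # a single epoch, just to demonstrate the flow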
Parsing
A parser determines the syntactic structure of a text by analyzing its constituent words based on
an underlying grammar. It divides a group of words into component parts and separates the words.
Semantic
Text is at the heart of how we communicate. What is really difficult is understanding what is
being said in written or spoken conversation. Understanding lengthy articles and books is even
more difficult. Semantic analysis is a process that seeks to understand linguistic meaning by
constructing a model of the principles that the speaker uses to convey meaning. It has been used
in customer feedback analysis, article analysis, fake news detection, etc.
Example Application
Here is the code Sample:
Importing necessary library
# This Python environment is defined by the kaggle/python Docker image:
# https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load.
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets
# preserved as output when you create a version using "Save & Run All".
# You can also write temporary files to /kaggle/temp/, but they won't be saved
# outside of the current session.
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_datasets as tfds
from tensorflow import keras
TEXT SUMMARIZATION:
Text summarization is a very useful and important part of Natural Language Processing (NLP).
First let us talk about what text summarization is. Suppose we have too many lines of text data in
any form, such as from articles or magazines or on social media. We have time scarcity so we
want only a nutshell report of that text. We can summarize our text in a few lines by removing
unimportant text and converting the same text into smaller semantic text form.
Now let us see how we can implement NLP in our programming. We will take a look at all the
approaches later, but here we will classify approaches of NLP.
TEXT SUMMARIZATION
In this approach we build algorithms or programs which will reduce the text size and create a
summary of our text data. This is called automatic text summarization in machine learning.
Text summarization is the process of creating shorter text without removing the semantic
structure of text.
There are two approaches to text summarization.
Extractive approaches
Abstractive approaches
EXTRACTIVE APPROACHES:
Using an extractive approach we summarize our text on the basis of simple and traditional
algorithms. For example, when we want to summarize our text on the basis of the frequency
method, we store all the important words and frequency of all those words in the dictionary. On
the basis of high frequency words, we store the sentences containing that word in our final
summary. This means the words which are in our summary confirm that they are part of the
given text.
ABSTRACTIVE APPROACHES:
An abstractive approach is more advanced. Based on the time requirements, it replaces some
sentences with shorter sentences that preserve the same semantics of our text data.
A simple frequency-based extractive summarizer:
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize

def frequency_summarizer(text):
    # Build a frequency table of the non-stopword terms in the text
    stop_words = set(stopwords.words("english"))
    freqTable = {}
    for word in word_tokenize(text):
        word = word.lower()
        if word in stop_words:
            continue
        freqTable[word] = freqTable.get(word, 0) + 1

    # Score each sentence by the frequencies of the words it contains
    sentences = sent_tokenize(text)
    sentenceValue = {}
    for sentence in sentences:
        for word, freq in freqTable.items():
            if word in sentence.lower():
                if sentence in sentenceValue:
                    sentenceValue[sentence] += freq
                else:
                    sentenceValue[sentence] = freq

    sumValues = 0
    for sentence in sentenceValue:
        sumValues += sentenceValue[sentence]
    average = int(sumValues / len(sentenceValue))

    # Keep only sentences that score well above the average
    summary = ''
    for sentence in sentences:
        if (sentence in sentenceValue) and (sentenceValue[sentence] > (1.2 * average)):
            summary += " " + sentence
    return summary
Sumy:
Sumy is a Python library that provides several extractive summarizers, including one based on
TextRank. Below is the implementation of that model.
# Load Packages
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.text_rank import TextRankSummarizer

# `text` holds the document to be summarized
parser = PlaintextParser.from_string(text, Tokenizer("english"))
summarizer = TextRankSummarizer()
summary = summarizer(parser.document, 2)  # keep 2 sentences

text_summary = ""
for sentence in summary:
    text_summary += str(sentence)
print(text_summary)
LexRank:
This is an unsupervised machine learning based approach in which we use the textrank approach
to find the summary of our sentences. Using cosine similarity and vector based algorithms, we
find minimum cosine distance among various words and store the more similar words together.
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lex_rank import LexRankSummarizer

def sumy_method(text):
    parser = PlaintextParser.from_string(text, Tokenizer("english"))
    summarizer = LexRankSummarizer()
    # Summarize the document with 2 sentences
    summary = summarizer(parser.document, 2)
    dp = []
    for i in summary:
        lp = str(i)
        dp.append(lp)
    final_sentence = ' '.join(dp)
    return final_sentence
Using Luhn:
This approach is based on the frequency method; here we find TF-IDF (term frequency inverse
document frequency).
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.luhn import LuhnSummarizer

def luhn_method(text):
    parser = PlaintextParser.from_string(text, Tokenizer("english"))
    summarizer_luhn = LuhnSummarizer()
    summary_1 = summarizer_luhn(parser.document, 2)
    dp = []
    for i in summary_1:
        lp = str(i)
        dp.append(lp)
    final_sentence = ' '.join(dp)
    return final_sentence
LSA
Latent Semantic Analysis (LSA) is based on decomposing the data into a low-dimensional space.
LSA has the ability to preserve the semantics of the given text while summarizing.
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lsa import LsaSummarizer

def lsa_method(text):
    parser = PlaintextParser.from_string(text, Tokenizer("english"))
    summarizer_lsa = LsaSummarizer()
    summary_2 = summarizer_lsa(parser.document, 2)
    dp = []
    for i in summary_2:
        lp = str(i)
        dp.append(lp)
    final_sentence = ' '.join(dp)
    return final_sentence
CHAT BOT:
Three Pillars of an NLP Based Chatbot
Now it's time to take a closer look at all the core elements that make an NLP chatbot happen.
1) Dialog System
To communicate, people use mouths to speak, ears to hear, fingers to type, and eyes to read.
A chatbot, too, needs to have an interface compatible with the ways humans receive and share
information when communicating. That is what we call a dialog system or, in other words, a
conversational agent.
There are no set dialog system components.
But for a dialog system to, indeed, be a dialog system, it has to be capable of producing output
and accepting input. Other than that, they can adopt a variety of forms. You can differentiate them
based on:
Modality (text-based, speech-based, graphical or mixed)
Device
Style (command-based, menu-driven and - of course - natural language)
Initiative (system, user, or mixed)
Even better?
The use of Dialogflow and a no-code chatbot building platform like Landbot allows you
to combine the smart and natural aspects of NLP with the practical and functional aspects
of choice-based bots.
PLAGIARISM:
Plagiarism is rampant on the internet and in the classroom. With so much content out there, it’s
sometimes hard to know when something has been plagiarized. Authors writing blog posts may
want to check if someone has stolen their work and posted it elsewhere. Teachers may want to
check students’ papers against other scholarly articles for copied work. News outlets may want to
check if a content farm has stolen their news articles and claimed the content as its own.
So, how do we guard against plagiarism? Wouldn’t it be nice if we could have software do the
heavy lifting for us? Using machine learning, we can build our own plagiarism checker that
searches a vast database for stolen content. In this article, we’ll do exactly that.
We’ll build a Python Flask app that uses Pinecone — a similarity search service — to find
possibly plagiarized content.
Demo App Overview
Let’s take a look at the demo app we’ll be building today. Below, you can see a brief animation of
the app in action.
The UI features a simple textarea input in which the user can paste the text from an article. When
the user clicks the Submit button, this input is used to query a database of articles. Results and
their match scores are then displayed to the user. To help reduce the amount of noise, the app also
includes a slider input in which the user can specify a similarity threshold to only show extremely
strong matches.
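In the demo, the vector search itself is handled by the Pinecone service; as a self-contained stand-in, the sketch below scores a submitted text against a small in-memory list of articles with TF-IDF cosine similarity and applies a similarity threshold. The articles, query, and threshold values are illustrative.

# Stand-in for the similarity search: score a submitted text against stored
# articles and keep only matches above a threshold.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

articles = ["Deep learning has transformed natural language processing in recent years.",
            "The Amazon rainforest is home to thousands of plant and animal species.",
            "Transfer learning lets models reuse knowledge from large corpora."]

submitted = "Deep learning transformed natural language processing"
threshold = 0.3  # similarity slider value, illustrative

vectorizer = TfidfVectorizer()
article_vecs = vectorizer.fit_transform(articles)
query_vec = vectorizer.transform([submitted])
scores = cosine_similarity(query_vec, article_vecs).ravel()

for article, score in zip(articles, scores):
    if score >= threshold:
        print(round(score, 2), article[:50])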
GRAMMAR AND SPELL CHECKING:
Summary of approaches to Grammar Error Correction (GEC). Source: Adapted from Ailani et al.
2019, figs. 1-4.
A well-written article with correct grammar, punctuation and spelling along with an appropriate
tone and style to match the needs of the intended reader or community is always important.
Software tools offer algorithm-based solutions for grammar and spell checking and correction.
Classical rule-based approaches employ a dictionary of words along with a set of rules. Recent
neural network-based approaches learn from millions of published articles and offer suggestions
for appropriate choice of words and way to phrase parts of sentences to adjust the tone, style and
semantics of the sentence. They can alter suggestions based on the publication domain of the
article like academic, news, etc.
Grammar and spelling correction are tasks that belong to a more general NLP process
called lexical disambiguation.
Discussion
What is a software grammar and spell checker, and what are its general tasks and uses?
Illustrating grammar and spell checks and suggested corrections. Source: Devopedia 2021.
A grammar and spell checker is a software tool that checks a written text for grammatical
mistakes, appropriate punctuation, misspellings, and issues related to sentence structure. More
recently, neural network-based tools also evaluate tone, style, and semantics to ensure that the
writing is flawless.
Often such tools offer a visual indication by highlighting or underlining spelling and grammar
errors in different colors (often red for spelling and blue for grammar). Upon hovering or
clicking on the highlighted parts, they offer appropriately ranked suggestions to correct those
errors. Certain tools offer a suggestive corrected version by displaying correction as strikeout in
an appropriate color.
Such tools are used to improve writing, produce engaging content, and for assessment and
training purposes. Several tools also offer style correction to adapt the article for specific
domains like academic publications, marketing, and advertising, legal, news reporting, etc.
However, till today, no tool is a perfect alternative to an expert human evaluator.
What are some important terms relevant to a grammar and spell checker?
The following NLP terms and approaches are relevant to grammar and spell checker:
Part-of-Speech (PoS) tagging marks words as noun, verb, adverb, etc. based on definition and
context.
Named Entity Recognition (NER) is labeling a sequence of text into predefined categories such
as name, location, etc. Labels help determine the context of words around them.
Confusion Set is a set of probable words that can appear in a certain context, e.g. set of articles
before a noun.
N-Gram is a sub-sequence of n words or tokens. For example, "The sun is bright" has these 2-
grams: {"the sun", "sun is", "is bright"} (see the short sketch after this list).
Parallel Corpus is a collection of text placed alongside its translation, e.g. text with errors and
its corresponding corrected version(s).
Language Model (LM) determines the probability distribution over a sequence of words. It says
how likely is a particular sequence of words.
Machine Translation (MT) is a software approach to translate one sequence of text into
another. In grammar checking, this refers to translating erroneous text into correct text.
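A minimal sketch of n-gram extraction (the helper function here is just for illustration):

# Extracting word n-grams from a sentence (here: the 2-grams from the example above).
def ngrams(text, n):
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(ngrams("The sun is bright", 2))
# ['the sun', 'sun is', 'is bright']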
What are the various types of grammar and spelling errors?
Types of grammar and spelling errors. Source: Soni and Thakur 2018, fig. 3.
We describe the following types:
Sentence Structure: Parts of speech are organized incorrectly. For example, "she began to
singing" shows misplaced 'to' or '-ing'. Dependent clause without the main clause, run-on
sentence due to missing conjunction, or missing subject are some structural errors.
Syntax Error: Violation of rules of grammar. These can be in relation to subject-verb agreement,
wrong/missing article or preposition, verb tense or verb form error, or a noun number error.
Punctuation Error: Punctuation marks like comma, semi-colon, period, exclamation, question
mark, etc. are missing, unnecessary, or wrongly placed.
Spelling Error: Word is not known in the dictionary.
Semantic Error: Grammar rules are followed but the sentence doesn't make sense, often due to a
wrong choice of words. "I am going to the library to buy a book" is an example where 'bookstore'
should replace 'library'. Rule-based approaches typically can't handle semantic errors. They
require statistical or machine learning approaches, which can also flag other types of errors.
Often a combination of approaches leads to a good solution.
Classical methods of spelling correction match words against a given dictionary, an approach
critics consider unreliable because it cannot detect incorrect use of correctly spelled words, and
it flags correct words that are not in the dictionary, like technical words, acronyms, etc.
Grammar checkers use hand-coded grammar rules on PoS-tagged text to identify correct or incorrect
sentences. For instance, the rule I + Verb (3rd person, singular form) corresponds to the incorrect
verb form usage, as in the phrase "I has a dog." These methods provide detailed explanations of
flagged errors, making them helpful for learning. However, rule maintenance is tedious and devoid
of context.
Statistical approaches validate parts of a sentence (n-grams) against their presence in a corpus.
These approaches can flag words used out of context. However, it's challenging to provide
detailed explanations. Their efficiency is limited to the choice of corpora.
The noisy channel model is one statistical approach. An LM based on trigrams and bigrams gives
better results than one based on unigrams alone. Where rare words are wrongly corrected, using a
blacklist of words or a probability threshold can help.
What are Machine Learning-based methods for implementing grammar and spell checkers?
ML-based approaches are either Classification (discriminative) or Machine Translation
(generative).
Classification approaches work with well-defined errors. Each error type (article, preposition,
etc.) requires training a separate multi-class classifier. For example, a preposition error
classifier takes n-grams associated with prepositions in a sentence and outputs a score for every
candidate preposition in the confusion set. Contextual corrections also consider features like PoS
and NER. A model can be a linear classifier like a Support Vector Machine (SVM), an n-gram
LM-based or Naïve Bayes classifier, or even a DNN-based classifier.
Machine Translation approaches can be Statistical Machine Translation (SMT) or Neural
Machine Translation (NMT). Both these use parallel corpora to train a sequence-to-sequence
model, where text with errors translates to corrected text. NMT uses encoder-decoder
architecture, where an encoder determines a latent vector for a sentence based upon the input
word embeddings. The decoder then generates target tokens from the latent vector and relevant
surrounding input and output tokens (attention). These benefit from transfer learning and
advancements in transformer-based architecture. Editor models reduce training time by
outputting edits to input tokens from a reduced confusion set instead of generating target tokens.
How can I train an NMT model for grammar and spell checking?
Training an NMT for GEC. Source: Adapted from Naghshnejad et al. 2020, fig. 3, fig. 5, table 4.
In general, NMT requires training an encoder-decoder model using cross-entropy as the loss
function by comparing maximum likelihood output to the gold standard correct output. To train a
good model requires a large number of parallel corpora and compute capacity. Transformers are
attention-based deep seq2seq architectures. Pre-trained language models generated by
transformer architectures like BERT provide contextual embeddings to find the most likely token
given the surrounding tokens, making it useful to flag contextual errors in an n-gram.
Transfer learning via fine tuning weights of a transformer using the parallel corpus of incorrect
to correct examples makes it suitable for GEC use. Pre-processing or pre-training with synthetic
data improves the performance and accuracy. Further enhancements can be to use separate heads
for different types of errors.
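As a rough sketch of how such a fine-tuned seq2seq model is used at inference time, the snippet below relies on the Hugging Face transformers API; the checkpoint name "your-gec-model" is a placeholder for whatever GEC-fine-tuned model you have trained or downloaded, not a specific published model.

# Inference with a seq2seq transformer fine-tuned for GEC.
# "your-gec-model" is a placeholder checkpoint name, not a real published model.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("your-gec-model")
model = AutoModelForSeq2SeqLM.from_pretrained("your-gec-model")

text = "I has a dog."
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))  # e.g. "I have a dog."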
Editor models are better as they output edit sequences instead of corrected versions. Training
and testing of editor models require the generation of edit sequences from source-target parallel
texts.
What datasets are available for training and evaluation of grammar and spell check models?
MT or classification models need datasets with annotated errors. NMT requires a large amount
of data.
Lang-8, the largest available parallel corpus, has 100,051 English entries. Corpus of Linguistic
Acceptability (CoLA) is a dataset of sentences labeled as either grammatically correct or
incorrect. It can be used, for example, to fine tune a pre-trained model. GitHub Typo Corpus is
harvested from GitHub and contains errors and their corrections.
Benchmarking data in Standard Generalized Markup Language (SGML) format is
available. Sebastian Ruder offers a detailed list of available benchmarking test datasets along
with the various models (publications and source code).
Noise models use transducers to produce erroneous sentences from correct ones with a specified
probability. They induce various error types to generate a larger dataset from a smaller one, for
example by replacing a word with one from its confusion set, misplacing or removing punctuation,
or inducing spelling, tense, noun number, or verb form mistakes. Round-trip MT, such as English-German-English
translation, can also generate parallel corpora. Wikipedia edit sequences offer millions of
consecutive snapshots to serve as source-target pairs. However, only a tiny fraction of those
edits are language related.
How do I annotate or evaluate the performance of grammar and spell checkers?
The ERRor ANnotation Toolkit (ERRANT) enables suggestions with explanations. It automatically
annotates parallel English sentences with error type information, thereby standardizing parallel
datasets and facilitating detailed error type evaluation.
Training and evaluation require comparing the output to the target gold standard and giving a
numerical measure of effectiveness or loss. Editor models have an advantage as the sequence
length of input and output is the same. Unequal sequences need alignment with the insertion of
empty tokens.
The Max-Match (M2) scorer determines the smallest edit sequence out of the multiple possible
ways to arrive at the gold standard using the notion of Levenshtein distance. The evaluation
happens by computing precision, recall, and F1 measure between the set of system edits and the
set of gold edits for all sentences after aligning the sequences to the same length.
Dynamic programming can also align multiple sequences to the gold standard when there is
more than one possible correct outcome.
Could you mention some tools or libraries that implement grammar and spell checking?
GNU Aspell is a standard utility used in GNU OS and other UNIX-like OS. Hunspell is a spell
checker that's part of popular software such as LibreOffice, OpenOffice.org, Mozilla Firefox 3 &
Thunderbird, Google Chrome, and more. Hunspell itself is based on MySpell. Hunspell can use
one or more dictionaries, stemming, morphological analysis, and Unicode text.
Python packages for spell checking include pyspellchecker, textblob and autocorrect.
A search for "grammar spell" on GitHub brings up useful dictionaries or code implemented in
various languages. There's a converter from British to American English. Spellcheckr is a
JavaScript implementation for web frontends.
Deep learning models include Textly-DRF-API and GECwBERT.
Many online services or offline software also exist: WhiteSmoke from 2002, LanguageTool from
2005, Grammarly from 2009, Ginger from 2011, Reverso from 2013, and Trinka from 2020.
Trinka focuses on an academic style of writing. Grammarly focuses on suggestions in terms of
writing style, clarity, engagement, delivery, etc.
Milestones
1960
Abbreviation ABBT maps incorrect word 'absorbant' to the correct word 'absorbent'. Source:
Blair 1960.
Blair implements a simple spelling corrector using heuristics and a dictionary of correct words.
Incorrect spellings are associated with the corrected ones via abbreviations that indicate
similarity between the two. Blair notes that this is in some sense a form of pattern recognition. In
one experiment, the program successfully corrects 89 of 117 misspelled words. In general,
research interest in spell checking and correction begins in the 1960s.
1971
R. E. Gorin writes Ispell in PDP-10 assembly. Ispell becomes the main spell-checking program
for UNIX. Ispell is also credited with introducing the generalized affix description system. Much
later, Geoff Kuenning implements a C++ version with support for many European languages.
This is called International Ispell. GNU Aspell, MySpell and Hunspell are other software
inspired by Ispell.
1980
def edits2(word):
    "All edits that are two edits away from `word`."
    return (e2 for e1 in edits1(word) for e2 in edits1(e1))

# usage (edits1() and correction() are assumed to be defined, as in the sketch below):
correction('speling')    # spelling (single deletion)
correction('korrectud')  # corrected (double replacement)
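The fragment above assumes that edits1() and correction() already exist. A minimal sketch of those supporting definitions, in the style of Peter Norvig's well-known spelling corrector, is given below; the word-frequency source big.txt is an assumed corpus file, not something provided with this manual.

# Supporting definitions assumed by edits2() above (Norvig-style corrector).
import re
from collections import Counter

WORDS = Counter(re.findall(r'\w+', open('big.txt').read().lower()))  # assumed corpus file

def P(word, N=sum(WORDS.values())):
    "Probability of `word` in the corpus."
    return WORDS[word] / N

def known(words):
    "The subset of `words` that appear in the dictionary WORDS."
    return set(w for w in words if w in WORDS)

def edits1(word):
    "All edits that are one edit away from `word`."
    letters = 'abcdefghijklmnopqrstuvwxyz'
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def candidates(word):
    "Generate possible spelling corrections for `word`."
    return known([word]) or known(edits1(word)) or known(edits2(word)) or [word]

def correction(word):
    "Most probable spelling correction for `word`."
    return max(candidates(word), key=P)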
SENTIMENT ANALYSIS:
Sentiment analysis systems help organizations gather insights from unorganized and unstructured
text that comes from online sources such as emails, blog posts, support tickets, web chats, social
media channels, forums and comments. Algorithms replace manual data processing by
implementing rule-based, automatic or hybrid methods. Rule-based systems perform sentiment
analysis based on predefined, lexicon-based rules while automatic systems learn from data with
machine learning techniques. A hybrid sentiment analysis combines both approaches.
In addition to identifying sentiment, opinion mining can extract the polarity (or the amount of
positivity and negativity), subject and opinion holder within the text. Furthermore, sentiment
analysis can be applied to varying scopes such as document, paragraph, sentence and sub-
sentence levels.
Vendors that offer sentiment analysis platforms or SaaS products include Brandwatch, Hootsuite,
Lexalytics, NetBase, Sprout Social, Sysomos and Zoho. Businesses that use these tools can
review customer feedback more regularly and proactively respond to changes of opinion within
the market.
QUESTION ANSWERING:
The span-based QA setting is quite natural. Open-domain QA systems can typically discover the
right documents that hold the answer to many of the user questions sent to search engines. The
task is to discover the shortest fragment of text in the passage or document that answers the
query, which is the final phase of "answer extraction."
Problem Description for Question-Answering System
The purpose is to locate, for any new question, the text span that answers it, together with the
context. This is a closed dataset, so the answer to a query is always a part of the context, and
the answer is a continuous span of the context. For the time being, I've divided the problem into
two pieces:
Getting the sentence containing the correct answer (highlighted green)
Getting the correct answer from that sentence once we have it (highlighted blue)
We have a context, question, and text for each observation in the training set. One such
observation is:
Facebook Sentence Embedding
We now have word2vec, doc2vec, food2vec, and node2vec, so why not sentence2vec? The main idea
behind these embeddings is to numerically represent entities using vectors of various dimensions,
making it easier for computers to grasp them for various NLP tasks.
Traditionally, we applied the bag of words approach, which averages the vectors of all the words
in a sentence. Each sentence is tokenized into words, and the vectors for these words are obtained
using GloVe embeddings. The average of all these vectors is then calculated. This method performs
admirably, although it is not an accurate method because it ignores word order.
This is where Infersent comes in. It’s a sentence embeddings method that generates semantic
sentence representations. It’s based on natural language inference data and can handle a wide
range of tasks.
The procedure for building the model:
Make a vocabulary out of the training data and use it to train the InferSent model.
I used Python 2.7 (with recent versions of NumPy/SciPy), PyTorch (recent version), and NLTK >= 3.
If you want to download the model trained on AllNLI, then run:
curl -Lo encoder/infersent.allnli.pickle https://s3.amazonaws.com/senteval/infersent/infersent.allnli.pickle
Load the pre-trained model
import nltk
nltk.download('punkt')
import torch

infersent = torch.load('InferSent/encoder/infersent.allnli.pickle',
                       map_location=lambda storage, loc: storage)
infersent.set_glove_path("InferSent/dataset/GloVe/glove.840B.300d.txt")
infersent.build_vocab(sentences, tokenize=True)

dict_embeddings = {}
for i in range(len(sentences)):
    print(i)
    dict_embeddings[sentences[i]] = infersent.encode([sentences[i]], tokenize=True)
Here, sentences is your list of sentences. You can use infersent.update_vocab(sentences) to update
your vocabulary, or infersent.build_vocab_k_words(K=100000) to load the K most common English
words directly. If tokenize is set to True (the default), NLTK is used to tokenize the sentences.
We can use these embeddings for a variety of tasks in the future, such as determining whether
two sentences are similar.
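For example, a simple cosine similarity over the dict_embeddings built above (assuming the snippet above has been run) looks like this:

# Cosine similarity between two sentence embeddings from dict_embeddings above.
import numpy as np

def cosine_sim(a, b):
    a, b = np.ravel(a), np.ravel(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

s1, s2 = sentences[0], sentences[1]
print(cosine_sim(dict_embeddings[s1], dict_embeddings[s2]))  # close to 1.0 means very similar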
Sentence Segmentation:
You can use Doc.has_annotation with the attribute name "SENT_START" to check whether a Doc has
sentence boundaries. Here the paragraph is broken into meaningful sentences.
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("Environmentalists are concerned about the loss of biodiversity that will result from
the destruction of the forest. They are also concerned about the release of the carbon contained
within the vegetation. This release may accelerate global warming.")
assert doc.has_annotation("SENT_START")
for sent in doc.sents:
print(sent.text)
Make many sentences out of the paragraph/context. spaCy and TextBlob are two tools I'm
familiar with for handling text data. TextBlob was used to do this. Unlike spaCy's sentence
detection, which can produce somewhat arbitrary sentences based on the period, it performs
intelligent splitting.
Using the Infersent model, get the vector representation of each sentence and question.
Machine Learning Models
We tackle the problem by utilizing two key methods: supervised learning and unsupervised
learning. In the unsupervised approach, I did not use the target variable; I return the sentence
from the paragraph that is closest to the given question.
Unsupervised Learning Model
Let’s see if we can use Euclidean distance to find the sentence that is closest to the question. This
model’s accuracy was roughly 45 per cent. The accuracy rose from 45 per cent to 63 per cent
after altering the cosine similarity. This makes sense because the Euclidean distance is
unaffected by the alignment or angle of the vectors, whereas cosine is. With vectorial
representations, the direction is crucial.
However, this strategy does not take advantage of the rich data with target labels we are given.
However, because of the solution’s simplicity, it still produces a solid outcome with no training.
Facebook sentence embedding deserves credit for the excellent results.
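A minimal sketch of this unsupervised retrieval step, assuming the dict_embeddings dictionary built earlier and a hypothetical helper name closest_sentence:

# Unsupervised retrieval: pick the sentence whose embedding has the highest
# cosine similarity with the question embedding.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def closest_sentence(question, candidate_sentences):
    q_vec = np.ravel(dict_embeddings[question]).reshape(1, -1)
    scores = [cosine_similarity(q_vec,
                                np.ravel(dict_embeddings[s]).reshape(1, -1))[0, 0]
              for s in candidate_sentences]
    return candidate_sentences[int(np.argmax(scores))]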
Supervised Learning Model
Creating a training set for this section has been difficult since each paragraph does not have a
predetermined number of sentences, and answers can range from one word to many words.
I’ve converted the target variable’s text to the sentence index that contains that text. I’ve kept my
paragraphs to a maximum of ten sentences to keep things simple (around 98 percent of the
paragraphs have 10 or fewer sentences). As a result, in this scenario, I have 10 labels to forecast.
I created a feature based on cosine distance for each sentence. If a paragraph has fewer than 10
sentences, I replace its feature value with 1 (maximum cosine distance) to make 10 sentences.
Question – What kind of sensing technology is being used to protect tribal lands in the
Amazon?
Context – The use of remote sensing for the conservation of the Amazon is also being used by
the indigenous tribes of the basin to protect their tribal lands from commercial interests. Using
handheld GPS devices and programs like Google Earth, members of the Trio Tribe, who live in
the rainforests of southern Suriname, map out their ancestral lands to help strengthen their
territorial claims. Currently, most tribes in the Amazon do not have clearly defined boundaries,
making it easier for commercial ventures to target their territories.
-From SQuAD
Text – remote sensing
Because the highlighted sentence index is 1, the target variable will be changed to 1. There will
be ten features, each of which corresponds to one sentence in the paragraph. Because these
sentences do not appear in the paragraph, the missing values for column cos 2, and column cos 3
are filled with NaN.
Source: SQuAD
Sentence having the solution — The use of remote sensing for the conservation of the Amazon
is also being used by the indigenous tribes of the basin to protect their tribal lands from
commercial interests.
All roots of the sentences in the paragraph are visualized:
for sent in doc.sents:
    roots = [st.stem(chunk.root.head.text.lower()) for chunk in sent.noun_chunks]
    print(roots)
Lemmatization:
The Lemmatizer is a configurable pipeline component that supports lookup and rule-based
lemmatization methods. As part of its language data, a language can expand the Lemmatizer.
Before comparing the roots of the sentence to the question root, it’s crucial to do stemming and
lemmatization. Protect is the root word for the question in the previous example, while protected
is the root word in the sentence. It will be impossible to match them unless you stem and
lemmatize both to a common form.
import spacy
nlp = spacy.load("en_core_web_sm")
lemmatizer = nlp.get_pipe("lemmatizer")
print(lemmatizer.mode) # 'rule'
doc = nlp("The use of remote sensing for the conservation of the Amazon is also being used by
the indigenous tribes of the basin to protect their tribal lands from commercial interests.")
print([token.lemma_ for token in doc])
The goal is to match the root of the question, which in this case is “appear,” to all the sentence’s
roots and sub-roots. We can gain several roots since there are multiple verbs in a sentence. If the
root of the question is present in the roots of the statement, there is a better possibility that the
sentence will answer the question. With this in mind, I’ve designed a feature for each sentence
that has a value of 1 or 0. Here, 1 shows that the question’s root is contained in the sentence
roots, and 0 shows that it is not.
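A hypothetical helper reflecting this root-match feature might look like the following; it assumes spaCy Doc objects for the question and the sentence, plus the Lancaster stemmer created in the imports below.

# Root-match feature: 1 if the stemmed root of the question appears among the
# stemmed roots/sub-roots of the sentence, else 0. (Illustrative helper.)
def root_match(question_doc, sentence_doc, stemmer):
    q_roots = {stemmer.stem(sent.root.text.lower()) for sent in question_doc.sents}
    s_roots = {stemmer.stem(chunk.root.head.text.lower())
               for chunk in sentence_doc.noun_chunks}
    return int(bool(q_roots & s_roots))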
We develop the transposed data with two observations from the processed training data. So, for
ten sentences in a paragraph, we have 20 features combining cosine distance and root match. The
range of the target variable is 0 to 9.
This problem can also be solved using supervised learning, in which we fit multinomial logistic
regression, random forest, and xgboost on these 20 features, two of which represent the cosine
distance and Euclidean distance for one sentence (as a result, we limit each paragraph to ten
sentences).
import numpy as np, pandas as pd
import ast
from sklearn import linear_model
from sklearn import metrics
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings('ignore')
import spacy
from nltk import Tree
en_nlp = spacy.load('en_core_web_sm')
from nltk.stem.lancaster import LancasterStemmer
st = LancasterStemmer()
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.ensemble import RandomForestClassifier
import xgboost as xgb
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
Load the dataset CSV file:
data = pd.read_csv("train_detect_sent.csv").reset_index(drop=True)
Let’s retrieve the Abstract Syntax Tree for the DataFrames:
ast.literal_eval(data["sentences"][0])
After all, create a feature for the Data Frames and then train the model:
def crt_feature(data):
    train = pd.DataFrame()
    for k in range(len(data["euclidean_dis"])):
        dis = ast.literal_eval(data["euclidean_dis"][k])
        for i in range(len(dis)):
            train.loc[k, "column_euc_"+"%s"%i] = dis[i]
    print("Finished")
    for k in range(len(data["cosine_sim"])):
        dis = ast.literal_eval(data["cosine_sim"][k].replace("nan","1"))
        for i in range(len(dis)):
            train.loc[k, "column_cos_"+"%s"%i] = dis[i]
    train["target"] = data["target"]
    return train

train = crt_feature(data)
train.head(3).transpose()
Train the model using multinomial logistic regression:
X = train.iloc[:, :-1]   # feature columns; the last column is the target
train_x, test_x, train_y, test_y = train_test_split(X, train.iloc[:, -1],
                                                    train_size=0.8, random_state=5)
mul_lr = linear_model.LogisticRegression(multi_class='multinomial', solver='newton-cg')
mul_lr.fit(train_x, train_y)
print("Multinomial Logistic regression Train Accuracy : ",
      metrics.accuracy_score(train_y, mul_lr.predict(train_x)))
print("Multinomial Logistic regression Test Accuracy : ",
      metrics.accuracy_score(test_y, mul_lr.predict(test_x)))
The sentence ID containing the right answer is the target variable; as a result, I have ten labels.
The accuracy is currently 63 per cent, 65 per cent, and 69 per cent, respectively, on the
validation set.
PERSONAL ASSISTANT:
Understanding Natural Language Processing in Virtual Assistants
In a market saturated with offerings from various companies, it is crucial to possess an
understanding of the underlying technologies that make Virtual Assistants effective vs.
ineffective. Of paramount importance is the Assistant’s ability to interface with users and
complete the appropriate action based on the information provided by the user. While it may
seem like common sense, that description merely scratches the surface of the capabilities that
separate a chatbot from a Virtual Assistant.
Critical differences between rudimentary artificial intelligence and its corollary abilities to
complete user-requested actions lie in the framework of Natural Language Processing (NLP),
Understanding (NLU), and Generation (NLG). Natural Language Processing in Virtual
Assistants is key in understanding both the broad picture and the minute details.
When combined in Aisera’s Virtual Assistant, the three technologies far exceed the precedent set
by competitors for conversational intelligence and Robotic Process Automation (RPA) solutions.
In this blog, we will examine the specificities of NLP, and later NLU and NLG, as they appear
across the Aisera AI Service Management (AISM) platform and how these differ from other
Virtual Assistant offerings on the market.
Ecosystem
When interacting with a user, a proper Virtual Assistant must be equipped to capture any
incoming request regardless of the domain and intent of the request and return an immediate and
relevant response. In this way, the Virtual Assistant is like the goalkeeper of a World Cup-
winning soccer team: they catch the incoming ball from any angle and return it up the field to
one of their teammates. Unsurprisingly, Virtual Assistants can become a critical player on any
customer service team as the Assistant keeps track of customers throughout their buying journey,
executes automated processes on backend systems, deflects routine issues from service agents,
and escalates unresolved requests to the most effective agent when the time is right. But this is
only one flavor of Virtual Assistant application; there are numerous use cases across
Sales and Marketing, Human Resources, IT, Legal and Finance, and more. A capable Virtual
Assistant must be equipped to handle many different facets of the customer's journey while
understanding the nuances of a given customer's mood and sentiment, and therein lie the most
powerful applications of cutting-edge Natural Language Processing in Virtual Assistants and NLU
technologies. We go deeper into the gamut of capabilities Aisera's Virtual
Assistant has and how it helps with Customer Intelligence along the customer’s journey in this
blog.
Understanding Intents
For the uninitiated, semantic NLP, NLU, and NLG are technologies built to solve one problem:
identifying the user’s intent during any given interaction. In humans, there are many mechanisms
that are employed to aid in deciphering the intent is behind another human’s word choice,
whether they are visual cues, the difference in inflection across a word, and a familiarity with the
vernacular dialect of the conversation. Machines, however, do have most of these luxuries and
therefore must rely on different mechanisms to ensure the correct interpretation of user
interaction. The components that make up NLP are a message interpreter and an exception
handler. These two pieces allow AIsera to process a user request then execute tasks and actions
based on the extracted information. The message interpreter uses techniques such as
tokenization, spell checking, and lemmatization to break down the nature of the user’s request
prior to classifying the request and passing it along to the NLU module to further analyze the
intent behind the request. For example, an utterance of “I would like to access Zoom” could be
understood as:
Intent: [name: “Provision $Application”], entities: [name: “$Application: Zoom
Videoconferencing”]
From there, the aforementioned interpretation techniques can be applied to further break down the
utterance.
TUTORING SYSTEMS:
INTRODUCTION
Many Intelligent Tutoring Systems (ITSs) aim to help students become better readers. The
computational challenges involved are (1) to assess the students’ natural language inputs and (2)
to provide appropriate feedback and guide students through the ITS curriculum. To overcome
both challenges, the following non-structural Natural Language Processing (NLP) techniques
have been explored and the first two are already in use: word-matching (WM), latent semantic
analysis (LSA, Landauer, Foltz, & Laham, 1998), and topic models (TM, Steyvers & Griffiths,
2007).
This article describes these NLP techniques, the iSTART (Interactive Strategy Trainer for Active
Reading and Thinking; McNamara, Levinstein, & Boonthum, 2004) intelligent tutor and the related Reading Strategies
Assessment Tool (R-SAT, Magliano et al., 2006), and how these NLP techniques can be used in
assessing students’ input in iSTART and R-SAT. This article also discusses other related NLP
techniques which are used in other applications and may be of use in the assessment tools or
intelligent tutoring systems.
BACKGROUND
Interpreting text is critical for intelligent tutoring systems (ITSs) that are designed to interact
meaningfully with, and adapt to, the users’ input. Different ITSs use different Natural Language
Processing (NLP) techniques in their system. NLP systems may be structural, i.e., focused on
grammar and logic, or non-structural, i.e., focused on words and statistics. This article deals with
the latter.
Examples of the structural approach include ExtrAns (Extracting Answers from technical
texts question-answering system; Molla et al., 2003) which uses minimal logical forms (MLF;
that is, the form of first order predicates) to represent both texts and questions and C-Rater
(Leacock & Chodorow, 2003) which scores short-answer questions by analyzing the conceptual
information of an answer in respect to the given question. Turning to the non-structural
approach, AutoTutor (Graesser et al., 2000) uses LSA to analyze the student’s input against
expected sets of answers and CIRCSIM-Tutor (Kim et al., 1989) uses a word-matching
technique to evaluate students’ short answers. The systems considered more fully below,
iSTART (McNamara et al, 2004) and R-SAT (Magliano et al., 2006) use both word-matching
and LSA in assessing quality of students’ self-explanation. Topic models (TM) were explored in
both systems, but have not yet been integrated.
MAIN FOCUS OF THE CHAPTER
This article presents three non-structural NLP techniques (WM, LSA, and TM) which are
currently used or being explored in reading strategies assessment and training applications,
particularly, iSTART and R-SAT.
Word Matching
Word matching is a simple and intuitive way to estimate the nature of an explanation. There are
two ways to compare words from the reader’s input (either answers or explanations) against
benchmarks (collections of words that represent a unit of text or an ideal answer): (1) Literal
word matching and (2) Soundex matching.
Literal word matching – Words are compared character by character and if there is a match of
sufficient length then we call this a literal match. An alternative is to count words that have the
same stem (e.g., indexer and indexing) as matching. If a word is short a complete match may be
required to reduce the number of false-positives.
Soundex matching - This algorithm compensates for misspellings by mapping similar
characters to the same soundex symbol (Christian, 1998). Words are transformed to their
soundex code by retaining the first character, dropping the vowels, and then converting other
characters into soundex symbols: 1 for b, p; 2 for f, v; 3 for c, k, s; etc. Sometimes only one
consecutive occurrence of the same symbol is retained. There are many variants of this algorithm
designed to reduce the number of false positives (e.g., Philips, 1990). As in literal matching,
short words may require a full soundex match while for longer words the first n soundex
symbols may suffice.
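A simplified sketch following this description (using only the partial digit mapping given above, so it is not the standard Soundex table):

# Simplified Soundex per the description: keep the first letter, drop vowels,
# map consonants to digit groups, and keep only one consecutive occurrence.
GROUPS = {"b": "1", "p": "1", "f": "2", "v": "2", "c": "3", "k": "3", "s": "3"}

def soundex(word):
    word = word.lower()
    code = word[0]
    prev = ""
    for ch in word[1:]:
        if ch in "aeiou":
            prev = ""
            continue
        symbol = GROUPS.get(ch, ch)   # letters outside the sample mapping kept as-is
        if symbol != prev:            # collapse consecutive repeats
            code += symbol
        prev = symbol
    return code

print(soundex("absorbant") == soundex("absorbent"))  # True: misspelling maps to the same code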
Word-matching is also used in other applications, such as, CIRCSIM-Tutor (Kim et al., 1989) on
short-answer questions and Short Essay Grading System (Ventura et al., 2004) on questions with
ideal expert answers.
Latent Semantic Analysis (LSA)
Latent Semantic Analysis (LSA; Landauer, Foltz, & Laham, 1998) uses statistical computation
to extract and represent the meaning of words. Meanings are represented in terms of their
similarity to other words in a large corpus of documents. LSA begins by finding the frequency of
terms used and the number of co-occurrences in each document throughout the corpus and then
uses a powerful mathematical transformation to find deeper meanings and relations between
words.
When measuring the similarity between text-objects, LSA’s accuracy improves with the size
of the objects, so it provides the most benefit in finding similarity between two documents but as
it does not take word order into account, short documents may not receive the full benefit. The
details for constructing an LSA corpus matrix are in Landauer & Dumais (1997). Briefly, the
steps are: (1) select a corpus; (2) create a term-document-frequency (TDF) matrix; (3) apply
Singular Value Decomposition (SVD; Press et al., 1986) to the TDF matrix to decompose it into
three matrices (L × S × R, where S is a scaling matrix). The leftmost matrix (L) becomes the
LSA matrix of that corpus. The optimal size is usually in the range of 300-400 dimensions.
Hence, the LSA matrix dimensions become N x D where N is the number of unique words in the
entire corpus and D is the optimal dimension (reduced from the total number of documents in the
entire corpus).
The similarity of terms (or words) is computed by comparing two rows, each representing a
term vector. This is done by taking the cosine of the two term vectors. To find the similarity of
sentences or documents, (1) for each document, create a document vector using the sum of the
term vectors of all the terms appearing in the document and (2) calculate a cosine between two
document vectors. Cosine values range from -1 to +1, where +1 means highly similar.
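A miniature end-to-end illustration with scikit-learn (toy documents, 2 retained dimensions instead of the usual 300-400):

# LSA in miniature: document-term count matrix -> truncated SVD -> cosine
# similarity between documents in the reduced space.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

docs = ["the reader explains the sentence in her own words",
        "the student paraphrases the sentence",
        "the recipe needs two cups of flour"]

counts = CountVectorizer().fit_transform(docs)          # document-term frequencies
lsa_vectors = TruncatedSVD(n_components=2).fit_transform(counts)

# Compare the paraphrase pair against the unrelated document
print(cosine_similarity([lsa_vectors[0]], [lsa_vectors[1]]))
print(cosine_similarity([lsa_vectors[0]], [lsa_vectors[2]]))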
To use LSA in the tutoring systems, a set of benchmarks are created and compared with the
trainee’s input. Examples benchmarks are the current target sentence, previous sentences, and the
ideal answer. A high cosine value between the current sentence benchmark and the reader’s input
would indicate that the reader understood the sentence and was able to paraphrase what was read.
To provide appropriate feedback, a number of cosines are computed (one for each benchmark).
Various statistical methods, such as discriminant analysis and regression analysis, are used to
construct the feedback formula. McNamara et al. (2007) describe various ways that LSA can be
used to evaluate the reader’s explanations: either LSA alone or a combination of LSA with WM.
The final conclusion is that a fully-automated (i.e., requiring less hand-crafted benchmark
construction), combined system produces better results.
There are a number of other intelligent tutoring systems that use LSA in their feedback system,
for example, Summary Street (Steinhart, 2001), AutoTutor (Graesser et al., 2000), and
Tutoring System (Lemaire, 1999).
Topic Models
The Topic Models approach (TM; Steyvers & Griffiths, 2007) applies a probabilistic model to
find a relationship between terms and documents in terms of topics. A document is considered to
be generated probabilistically from a number of topics where each topic consists of a number of
terms, each given a probability of selection if that topic is used. By using a TM matrix, the
probability that a certain topic was used in the creation of a given document is estimated. If two
documents are similar, the estimates of the topics within these documents should be similar. TM
is similar to LSA, except that a term-document frequency matrix is factored into two matrices
instead of three: one is the probabilities of terms belonging to the topics (the TM matrix), the
other the probabilities of topics belonging to the documents. The Topic Modeling Toolbox
(Steyvers & Griffiths, 2007) can be used to construct a TM matrix,
To measure the similarity between documents, the Kullback Leibler distance (KL-distance:
Steyvers & Griffiths, 2007) is recommended, rather than the cosine measure (which can also be
used). Using TM in a tutoring system is similar to using LSA, where a set of benchmarks is
defined and the reader’s input is compared against each benchmark. The only difference is the use
of the KL-distance instead of the LSA cosine value. The preliminary results of investigating TM in
place of LSA (Boonthum, Levinstein, & McNamara, 2006) indicate that TM is as good as LSA
alone (correlation between computerized-scores and human rating scores), but a little bit lower
than a combined system using both WM and LSA. This suggests that the TM should be further
investigated in combination with WM or LSA or both.
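As a small illustration of the KL-distance comparison (the topic distributions below are made up):

# Comparing documents by their topic distributions with KL distance;
# scipy's entropy(p, q) computes KL(p || q).
import numpy as np
from scipy.stats import entropy

doc1_topics = np.array([0.70, 0.20, 0.10])  # P(topic | document 1), illustrative
doc2_topics = np.array([0.65, 0.25, 0.10])  # similar document
doc3_topics = np.array([0.05, 0.15, 0.80])  # dissimilar document

print(entropy(doc1_topics, doc2_topics))  # small value: similar topic mixtures
print(entropy(doc1_topics, doc3_topics))  # larger value: dissimilar topic mixtures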
TM is mostly used in document clustering (grouping documents based on relevancy or similar
topics; Buntine et al., 2005), data mining (Tuulos & Tirri, 2004), and search engines (Perkio et
al., 2004). A variation on the TM of Steyvers & Griffiths (2007) is Probabilistic Latent Semantic
Analysis (PLSA; Hofmann, 2001), which models each document as generated from a number of
hidden topics and each topic has its features defined as the conditional probabilities of word
occurrences in that topic.
iSTART and RSAT Applications
iSTART (Interactive Strategy Trainer for Active Reading and Thinking) is a web-based,
automated tutor designed to help students become better readers using multi-media technology.
It provides adolescent to college-aged students with a program of self-explanation and reading
strategy training called Self-Explanation Reading Training, or SERT
(McNamara et al., 2004). iSTART consists of three modules: Introduction (description of
SERT and reading strategies), Demonstration (illustration of how these reading strategies can be
used), and Practice (hands-on practice of these reading strategies). In the Practice module,
students practice using reading strategies by typing self-explanations of sentences. The system
evaluates each explanation and then provides appropriate feedback to the student. If the
explanation is irrelevant or too short compared to the given sentence and passage, the student is
required to add more information. Otherwise, the feedback is based on the level of its overall
quality.
The computational challenge is to provide appropriate feedback to the students about their
explanations. Doing so requires capturing some sense of both the meaning and quality of their
explanation. A combination of word-matching and LSA provided better results (comparing the
computerized-score using NLP techniques to the human rating score and having higher
correlation between these two sets of scores) than either separately (McNamara, Boonthum,
Levinstein, & Millis, 2007).
R-SAT (Reading Strategy Assessment Tool; Magliano et al., 2007) is an automated web-based
reading assessment tool designed to measure readers’ comprehension and spontaneous use of
reading strategies. The R-SAT is similar to the iSTART Practice module in the sense that it
presents passages to the reader one sentence at a time and asks for the reader’s input. The
difference is that, instead of an explanation, R-SAT asks either an indirect (“What are your
thoughts regarding your understanding of the sentence in the context of the passage?”) or a direct
question (e.g., "Why did the miller want to marry the girl?") at pre-selected target sentences. The
answers to the indirect questions are evaluated on how they are related to the given sentence and
passage; the answers to the direct questions are assessed by comparing them to ideal answers.
The problem is to analyze the answers and generate a set of scores for overall
comprehension and strategy usage. Ultimately, these scores can be used as a pre-assessment
for iSTART allowing the trainer to individualize the iSTART curriculum based on the reader’s
needs. R-SAT was initially proposed to use word-matching, LSA, and other techniques beyond
LSA. However, during the course of development, word-matching alone was found to produce
better results than either LSA or a combination of the two.
FUTURE TRENDS
These three NLP techniques (WM, LSA, and TM) are used in the ongoing research on
assessing and improving comprehension skills via reading strategies in the R-SAT and iSTART
projects. WM and LSA have been extensively investigated for iSTART and to some extent in R-
SAT. The lack of success of LSA compared to the simpler WM in R-SAT is somewhat
surprising and may be due to particular features of the algorithms used or to the variety of text
genres used in R-SAT. Future work is planned with modified algorithms and substituting genre-
specific LSA spaces for the general space now used. In addition, TM needs further exploration,
especially in its use with small units of text where the recommended Kullback Leibler distance
has not proven particularly effective.
Conclusion:
Thus the Study of various applications of NLP is done for real world projects.
EXPT.2 VARIOUS TEXT PREPROCESSING TECHNIQUES FOR ANY GIVEN
TEXT : TOKENIZATION AND FILTRATION & SCRIPT VALIDATION
LAB OBJECTIVES:
To understand the various text preprocessing techniques for Tokenization, Filtration & Script
Validation
LAB OUTCOMES:
On Successful Completion, the Student will be able to understand about various text
preprocessing techniques for Tokenization, Filtration & Script Validation for real world
applications.
PROCEDURE:
In Python, tokenization basically refers to splitting up a larger body of text into smaller lines or
words, or even creating tokens for a non-English language. Various tokenization functions are
built into the nltk module itself and can be used in programs as shown below.
Line Tokenization
In the below example we divide a given text into different lines by using the function
sent_tokenize.
import nltk
sentence_data = "The First sentence is about Python. The Second: about Django. You can learn
Python,Django and Data Ananlysis here. "
nltk_tokens = nltk.sent_tokenize(sentence_data)
print (nltk_tokens)
When we run the above program, we get the following output −
['The First sentence is about Python.', 'The Second: about Django.', 'You can learn
Python, Django and Data Analysis here.']
Non-English Tokenization
In the below example we tokenize the German text.
import nltk
german_tokenizer = nltk.data.load('tokenizers/punkt/german.pickle')
german_tokens=german_tokenizer.tokenize('Wie geht es Ihnen? Gut, danke.')
print(german_tokens)
When we run the above program, we get the following output −
['Wie geht es Ihnen?', 'Gut, danke.']
Word Tokenization
We tokenize the words using the word_tokenize function available as part of nltk.
import nltk
word_data = "It originated from the idea that there are readers who prefer learning new skills
from the comforts of their drawing rooms"
nltk_tokens = nltk.word_tokenize(word_data)
print (nltk_tokens)
When we run the above program we get the following output −
['It', 'originated', 'from', 'the', 'idea', 'that', 'there', 'are', 'readers',
'who', 'prefer', 'learning', 'new', 'skills', 'from', 'the',
'comforts', 'of', 'their', 'drawing', 'rooms']
FILTRATION:
filter() in python
The filter() method filters the given sequence with the help of a function that tests whether each
element in the sequence is true or not.
Syntax:
filter(function, sequence)
Parameters:
function: a function that tests whether each element of a sequence is true or not.
sequence: the sequence to be filtered; it can be a set, list, tuple, or any iterable container.
Returns:
an iterator that is already filtered.
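The listing that produces the output below is not shown; a minimal sketch (assuming a character sequence containing two e's) that keeps only the vowels:
# function that tests whether an element is a vowel
def fun(variable):
    letters = ['a', 'e', 'i', 'o', 'u']
    return variable in letters

sequence = ['g', 'e', 'e', 'j', 'k', 's', 'p']
filtered = filter(fun, sequence)

print('The filtered letters are:')
for s in filtered:
    print(s)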
Output:
The filtered letters are:
e
e
Application:
It is normally used with lambda functions to filter lists, tuples, or sets.
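A small sketch with lambda, matching the output shown below (the sequence [0, 1, 2, 3, 5, 8, 13] is assumed):
seq = [0, 1, 2, 3, 5, 8, 13]

# keep the odd numbers
print(list(filter(lambda x: x % 2 != 0, seq)))

# keep the even numbers
print(list(filter(lambda x: x % 2 == 0, seq)))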
Output:
[1, 3, 5, 13]
[0, 2, 8]
SCRIPT VALIDATION:
Introduction to Python Validation
Whenever the user provides an input, it needs to be checked for validation, i.e., whether the
input data is what we are expecting. Validation can be done in two different ways: by using a
flag variable, or by using try/except. With a flag variable, the flag is initially set to false; if we
find that the input data is what we are expecting, the flag is set to true, and what happens next is
decided based on the status of the flag. With try/except, a section of code is attempted; if it
raises an error, the except block of code is run instead.
Types of Validation in Python
There are three types of validation in python, they are:
Type Check: This validation technique in python is used to check the given input data type. For
example, int, float, etc.
Length Check: This validation technique in python is used to check the given input string’s
length.
Range Check: This validation technique in python is used to check if a given number falls in
between the two numbers.
The syntax for validation in Python is given below:
Syntax using the flag:
flagName = False
while not flagName:
    if [Do check here]:
        flagName = True
    else:
        print('error message')
The status of the flag is set to false initially and is used as the condition of the while loop
("while not flag"); the validation is performed and the flag is set to true if the validation
condition is satisfied; otherwise, the error message is printed.
Syntax using an exception:
while True:
    try:
        [run code that might fail here]
        break
    except:
        print('This is the error message if the code fails')

print('run the code from here if the code in the try block above runs successfully')
We set the loop condition to be true and perform the necessary validation by running a block
of code; if the code fails to perform the validation, an exception is raised and the error message is
displayed, while the success message is printed only if the try block executes the code successfully.
Examples of Python Validation
Examples of python validation are:
Example #1
Python program using a flag to validate whether the input given by the user is an integer (type
check).
Code:
#Declare a variable validInt which is also considered as a flag and set it to false
validInt = False
#Consider the while condition to be true and prompt the user to enter the input
while not validInt:
    #The user is prompted to enter the input
    age1 = input('Please enter your age ')
    #The input entered by the user is checked to see if it is a digit or a number
    if age1.isdigit():
        #The flag is set to true if the if condition is true
        validInt = True
    else:
        print('The input is not a valid number')
#This statement is printed if the input entered by the user is a number
print('The entered input is a number and that is ' + str(age1))
Output:
Example #2
Python program using a flag and an exception to validate the type of input given by the user and
determine whether it lies within a given range (range check).
Code:
#Declare a variable areTeenager which is also considered as a flag and set it to false
areTeenager = False
#Consider the while condition to be true and prompt the user to enter the input
while not areTeenager:
    try:
        #The user is prompted to enter the input
        age1 = int(input('Please enter your age '))
        #The input entered by the user is checked if it lies between the range specified
        if age1 >= 13 and age1 <= 19:
            areTeenager = True
    except:
        print('The age entered by you is not a valid number between 13 and 19')
#This statement is printed if the input entered by the user lies between the range of the number specified
print('You are a teenager whose age lies between 13 and 19 and the entered age is ' + str(age1))
Example #3
Python program using a flag to check the length of the input string (length check).
Code:
Code:
#Declare a variable lenstring which is also considered as a flag and set it to false
lenstring = False
#Consider the while condition to be true and prompt the user to enter the input
while not lenstring:
    password1 = input('Please enter a password consisting of five characters ')
    #The input entered by the user is checked for its length
    if len(password1) >= 5:
        lenstring = True
    else:
        print('The number of characters in the entered password is less than five characters')
#This statement is printed if the input entered by the user consists of at least five characters
print('The entered password is: ' + password1)
Output
Conclusion:
Thus the various text preprocessing techniques, Tokenization and Filtration & Script Validation
are done and verified.
EXPT.3 TEXT PREPROCESSING TECHNIQUES FOR ANY GIVEN TEXT : STOP WORD
REMOVAL, LEMMATIZATION / STEMMING
LAB OBJECTIVES:
To understand the various text preprocessing techniques for Stop Word Removal, Lemmatization /
Stemming
LAB OUTCOMES:
On Successful Completion, the Student will be able to understand about the various other text
preprocessing techniques for any given text : stop word removal, lemmatization / stemming
PROCEDURE:
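Stop word removal is listed in the title of this experiment; below is a minimal sketch using NLTK's stopwords corpus (it assumes the punkt and stopwords resources have been downloaded).
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# one-time downloads, if not already present:
# nltk.download('punkt'); nltk.download('stopwords')

text = "This is a sample sentence, showing off the stop words filtration."
stop_words = set(stopwords.words('english'))
tokens = word_tokenize(text)

# keep only the tokens that are not stop words
filtered = [w for w in tokens if w.lower() not in stop_words]
print(filtered)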
Lemmatization:
Python | Lemmatization with NLTK
Lemmatization is the process of grouping together the different inflected forms of a word so they
can be analyzed as a single item. Lemmatization is similar to stemming but it brings context to
the words. So it links words with similar meanings to one word.
Text preprocessing includes both Stemming as well as Lemmatization. Many times people find
these two terms confusing. Some treat these two as the same. Actually, lemmatization is
preferred over Stemming because lemmatization does morphological analysis of the words.
Applications of lemmatization include search engines and information retrieval systems, where
different inflected forms of a query word need to be matched.
Examples of lemmatization:
Python3
# import these modules; nltk.download('wordnet') may be needed once
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print("rocks :", lemmatizer.lemmatize("rocks"))
print("corpora :", lemmatizer.lemmatize("corpora"))
# "a" denotes adjective when passed as the pos argument
print("better :", lemmatizer.lemmatize("better", pos="a"))
Output :
rocks : rock
corpora : corpus
better : good
Stemming: Python | Stemming words with NLTK
Stemming is the process of reducing a word to its root/base form by stripping inflectional or
derivational endings. Stemming programs are commonly referred to as stemming algorithms or
stemmers. A stemming algorithm reduces the words “chocolates”, “chocolatey”, and “choco” to
the root word “chocolate”, and “retrieval”, “retrieved”, “retrieves” reduce to the stem “retrieve”.
Prerequisite: Introduction to Stemming
For example, the root word "like" gives rise to variants such as:
-> "likes"
-> "liked"
-> "likely"
-> "liking"
Errors in Stemming: There are mainly two errors in stemming
– over-stemming and under-stemming. Over-stemming occurs when two words with different
meanings are stemmed to the same root. Under-stemming occurs when two words that should be
stemmed to the same root are not.
Applications of stemming are:
Stemming is used in information retrieval systems like search engines.
It is used to determine domain vocabularies in domain analysis.
Stemming is desirable as it may reduce redundancy, since most of the time a word stem and its
inflected/derived forms mean the same thing.
Below is the implementation of stemming words using NLTK:
Code #1:
Python3
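# A minimal sketch (an assumption, since Code #1 is not shown) of stemming a list of word
# variants with NLTK's PorterStemmer; exact stems can vary slightly with the stemmer version.
from nltk.stem import PorterStemmer

ps = PorterStemmer()
words = ["program", "programs", "programmer", "programming", "programmers"]
for w in words:
    print(w, " : ", ps.stem(w))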
Output:
program : program
programs : program
programmer : program
programming : program
programmers : program
Code #2: Stemming words from sentences
Python3
# importing modules
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
ps = PorterStemmer()
sentence = "Programmers program with programming languages"
words = word_tokenize(sentence)
for w in words:
    print(w, " : ", ps.stem(w))
Output :
Programmers : program
program : program
with : with
programming : program
languages : language
Conclusion:
Thus the various other text preprocessing techniques for any given text : Stop Word Removal,
Lemmatization / Stemming are done and verified.
EXPT.4 MORPHOLOGICAL ANALYSIS AND WORD GENERATION FOR ANY
GIVEN TEXT
LAB OBJECTIVES:
To understand the concepts of morphological analysis and word generation for any given text
LAB OUTCOMES:
On Successful Completion, the Student will be able to understand about morphological analysis
and word generation for any given text in the real world application.
PROCEDURE:
Like any other python library, we will install polyglot using pip install polyglot.
Morphological Analysis
Polyglot offers trained morfessor models to generate morphemes from words. The goal of the
Morpho project is to develop unsupervised data-driven methods that discover the regularities
behind word forming in natural languages. In particular, the Morpho project focuses on the
discovery of morphemes, which are the primitive units of syntax, the smallest individually
meaningful elements in the utterances of a language. Morphemes are important in automatic
generation and recognition of a language, especially in languages in which words may have
many different inflected forms.
Languages Coverage
Using polyglot vocabulary dictionaries, we trained morfessor models on the 50,000 most frequent
words of each language.
from polyglot.downloader import downloader
print(downloader.supported_languages_table("morph2"))
1. Piedmontese language 2. Lombard language 3. Gan Chinese
4. Sicilian 5. Scots 6. Kirghiz, Kyrgyz
7. Pashto, Pushto 8. Kurdish 9. Portuguese
10. Kannada 11. Korean 12. Khmer
13. Kazakh 14. Ilokano 15. Polish
16. Panjabi, Punjabi 17. Georgian 18. Chuvash
19. Alemannic 20. Czech 21. Welsh
22. Chechen 23. Catalan; Valencian 24. Northern Sami
25. Sanskrit (Saṁskṛta) 26. Slovene 27. Javanese
28. Slovak 29. Bosnian-Croatian-Serbian 30. Bavarian
31. Swedish 32. Swahili 33. Sundanese
34. Serbian 35. Albanian 36. Japanese
37. Western Frisian 38. French 39. Finnish
40. Upper Sorbian 41. Faroese 42. Persian
43. Sinhala, Sinhalese 44. Italian 45. Amharic
46. Aragonese 47. Volapük 48. Icelandic
49. Sakha 50. Afrikaans 51. Indonesian
52. Interlingua 53. Azerbaijani 54. Ido
55. Arabic 56. Assamese 57. Yoruba
58. Yiddish 59. Waray-Waray 60. Croatian
61. Hungarian 62. Haitian; Haitian Creole 63. Quechua
64. Armenian 65. Hebrew (modern) 66. Silesian
67. Hindi 68. Divehi; Dhivehi; Mald... 69. German
70. Danish 71. Occitan 72. Tagalog
73. Turkmen 74. Thai 75. Tajik
76. Greek, Modern 77. Telugu 78. Tamil
79. Oriya 80. Ossetian, Ossetic 81. Tatar
82. Turkish 83. Kapampangan 84. Venetian
85. Manx 86. Gujarati 87. Galician
88. Irish 89. Scottish Gaelic; Gaelic 90. Nepali
91. Cebuano 92. Zazaki 93. Walloon
94. Dutch 95. Norwegian 96. Norwegian Nynorsk
97. West Flemish 98. Chinese 99. Bosnian
100. Breton 101. Belarusian 102. Bulgarian
103. Bashkir 104. Egyptian Arabic 105. Tibetan Standard, Tib...
106. Bengali 107. Burmese 108. Romansh
109. Marathi (Marāṭhī) 110. Malay 111. Maltese
112. Russian 113. Macedonian 114. Malayalam
115. Mongolian 116. Malagasy 117. Vietnamese
118. Spanish; Castilian 119. Estonian 120. Basque
121. Bishnupriya Manipuri 122. Asturian 123. English
124. Esperanto 125. Luxembourgish, Letzeb... 126. Latin
127. Uighur, Uyghur 128. Ukrainian 129. Limburgish, Limburgan...
130. Latvian 131. Urdu 132. Lithuanian
133. Fiji Hindi 134. Uzbek 135. Romanian, Moldavian, ...
Download Necessary Models
%%bash
polyglot download morph2.en morph2.ar
[polyglot_data] Downloading package morph2.en to
[polyglot_data] /home/rmyeid/polyglot_data...
[polyglot_data] Package morph2.en is already up-to-date!
[polyglot_data] Downloading package morph2.ar to
[polyglot_data] /home/rmyeid/polyglot_data...
[polyglot_data] Package morph2.ar is already up-to-date!
Example
from polyglot.text import Text, Word
words = ["preprocessing", "processor", "invaluable", "thankful", "crossed"]
for w in words:
    w = Word(w, language="en")
    print("{:<20}{}".format(w, w.morphemes))
preprocessing ['pre', 'process', 'ing']
processor ['process', 'or']
invaluable ['in', 'valuable']
thankful ['thank', 'ful']
crossed ['cross', 'ed']
If the text is not tokenized properly, morphological analysis could offer a smart way of
splitting the text into its original units. Here is an example:
blob = "Wewillmeettoday."
text = Text(blob)
text.language = "en"
text.morphemes
WordList([u'We', u'will', u'meet', u'to', u'day', u'.'])
!polyglot --lang en tokenize --input testdata/cricket.txt | polyglot --lang en morph | tail -n 30
which which
India In_dia
beat beat
Bermuda Ber_mud_a
in in
Port Port
of of
Spain Spa_in
in in
2007 2007
, ,
which which
was wa_s
equalled equal_led
five five
days day_s
ago ago
by by
South South
Africa Africa
in in
their t_heir
victory victor_y
over over
West West
Indies In_dies
in in
Sydney Syd_ney
. .
This is an interface to the implementation described in the Morfessor 2.0: Python
Implementation and Extensions for Morfessor Baseline technical report.
@InProceedings{morfessor2,
    title = {Morfessor 2.0: Python Implementation and Extensions for Morfessor Baseline},
    author = {Virpioja, Sami and Smit, Peter and Grönroos, Stig-Arne and Kurimo, Mikko},
    year = {2013},
    publisher = {Department of Signal Processing and Acoustics, Aalto University},
    booktitle = {Aalto University publication series}
}
Note: The split() function, by default, splits by whitespace. If you want any other delimiter, like
the newline character, you can specify that as an argument.
A simple word-generation example is to pick a random word from a text file:
# using randint() to pick a random word from the file
import random

# open the file and build the word list
with open("myFile.txt", "r") as file:
    data = file.read()
    words = data.split()

# print a randomly chosen word
print(words[random.randint(0, len(words) - 1)])
Output:
Conclusion:
Thus the morphological analysis and word generation for any given text is done and verified for
real world application.
EXPT. 5 N GRAM MODEL FOR THE GIVEN TEXT INPUT
LAB OBJECTIVES:
To understand the concept of N Gram Model for the given text input for real world applications.
LAB OUTCOMES:
On Successful Completion, the Student will be able to understand about N Gram Model for the
given text input for real world applications.
PROCEDURE:
N-Gram Language Modelling with NLTK
Language modeling is the way of determining the probability of any sequence of words.
Language modeling is used in a wide variety of applications such as Speech Recognition, Spam
filtering, etc. In fact, language modeling is the key aim behind the implementation of many state-
of-the-art Natural Language Processing models.
Methods of Language Modelings:
Two types of Language Modelings:
Statistical Language Modelings: Statistical Language Modeling, or Language Modeling, is the
development of probabilistic models that are able to predict the next word in the sequence given
the words that precede it. An example is N-gram language modeling.
Neural Language Modelings: Neural network methods are achieving better results than
classical methods both on standalone language models and when models are incorporated into
larger models on challenging tasks like speech recognition and machine translation. A way of
performing a neural language model is through word embeddings.
N-gram
N-gram can be defined as the contiguous sequence of n items from a given sample of text or
speech. The items can be letters, words, or base pairs according to the application. The N-grams
typically are collected from a text or speech corpus (A long text dataset).
N-gram Language Model:
An N-gram language model predicts the probability of a given N-gram within any sequence of
words in the language. A good N-gram model can predict the next word in a sentence, i.e., the
value of p(w|h), the probability of a word w given the history h of preceding words.
Example of N-gram such as unigram (“This”, “article”, “is”, “on”, “NLP”) or bi-gram (‘This
article’, ‘article is’, ‘is on’,’on NLP’).
Understanding N-grams
Text n-grams are commonly utilized in natural language processing and text mining. It’s
essentially a string of words that appear in the same window at the same time.
When computing n-grams, you normally advance one word (although in more complex scenarios
you can move n-words). N-grams are used for a variety of purposes.
N Grams Demonstration
For example, while creating language models, n-grams are utilized not only to create unigram
models but also bigrams and trigrams.
Google and Microsoft have created web-scale grammar models that may be used for a variety of
activities such as spelling correction, hyphenation, and text summarization.
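A minimal sketch of collecting bigrams and trigrams with NLTK's ngrams utility (the example sentence is assumed):
from nltk.tokenize import word_tokenize
from nltk.util import ngrams

text = "This article is on NLP"
tokens = word_tokenize(text)

print(list(ngrams(tokens, 2)))   # bigrams, e.g. ('This', 'article'), ('article', 'is'), ...
print(list(ngrams(tokens, 3)))   # trigrams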
Conclusion:
Thus the concept of the N-Gram model for the given text input is done and verified.
EXPT. 6 STUDY THE DIFFERENT POS TAGGERS AND PERFORM POS TAGGING
ON THE GIVEN TEXT
LAB OBJECTIVES:
To understand the study the different pos taggers and perform pos tagging on the given text for
real world application.
LAB OUTCOMES:
On Successful Completion, the Student will be able to understand about study the different pos
taggers and perform pos tagging on the given text for real world application.
PROCEDURE:
POS Tagging
POS Tagging (Parts of Speech Tagging) is a process to mark up the words in text format for a
particular part of a speech based on its definition and context. It is responsible for text reading in
a language and assigning some specific token (Parts of Speech) to each word. It is also called
grammatical tagging.
Let’s learn with a NLTK Part of Speech example:
Input: Everything to permit us.
Output: [(‘Everything’, NN),(‘to’, TO), (‘permit’, VB), (‘us’, PRP)]
Steps Involved in the POS tagging example:
Tokenize text (word_tokenize)
apply pos_tag to above step that is nltk.pos_tag(tokenize_text)
NLTK POS Tags Examples are as below:
Abbreviation Meaning
CC coordinating conjunction
CD cardinal digit
DT determiner
EX existential there
FW foreign word
IN preposition/subordinating conjunction
LS list marker
RP particle (about)
UH interjection (goodbye)
VB verb (ask)
The above NLTK POS tag list shows some of the common NLTK POS tags. The NLTK POS
tagger is used to assign grammatical information to each word of the sentence. Installing,
importing and downloading all the packages of POS NLTK must be completed first.
1. To count the tags, you can use the package Counter from the collection’s module. A
counter is a dictionary subclass which works on the principle of key-value operation. It is
an unordered collection where elements are stored as a dictionary key while the count is
their value.
2. Import nltk which contains modules to tokenize the text.
3. Write the text whose pos_tag you want to count.
4. Some words are in upper case and some in lower case, so it is appropriate to transform all
the words in the lower case before applying tokenization.
5. Pass the words through word_tokenize from nltk.
6. Calculate the pos_tag of each token (a sketch of these steps is shown below).
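A minimal sketch of these steps (the sample sentence is assumed from the output shown below):
from collections import Counter
import nltk

text = "Guru99 is one of the best site to learn web, sap, ethical hacking and much more online"
lower_case = text.lower()
tokens = nltk.word_tokenize(lower_case)
tags = nltk.pos_tag(tokens)

# count how often each tag occurs
counts = Counter(tag for word, tag in tags)
print(tags)
print(counts)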
Output = [('guru99', 'NN'), ('is', 'VBZ'), ('one', 'CD'), ('of', 'IN'), ('the', 'DT'), ('best', 'JJS'),
('site', 'NN'), ('to', 'TO'), ('learn', 'VB'), ('web', 'NN'), (',', ','), ('sap', 'NN'), (',', ','), ('ethical',
'JJ'), ('hacking', 'NN'), ('and', 'CC'), ('much', 'RB'), ('more', 'JJR'), ('online', 'JJ')]
7. Now comes the role of the dictionary Counter, which we imported in code line 1. The tags
are the keys and their counts are the values; the counter counts the total occurrences of each
tag present in the text.
Tagging Sentences
Tagging Sentence in a broader sense refers to the addition of labels of the verb, noun, etc., by the
context of the sentence. Identification of POS tags is a complicated process. Thus generic
tagging of POS is manually not possible as some words may have different (ambiguous)
meanings according to the structure of the sentence. Conversion of text in the form of list is an
important step before tagging as each word in the list is looped and counted for a particular tag.
Please see the below code to understand it better
import nltk
text = "Hello Guru99, You have to build a very good site, and I love visiting your site."
sentence = nltk.sent_tokenize(text)
for sent in sentence:
    print(nltk.pos_tag(nltk.word_tokenize(sent)))
Output:
[(‘Hello’, ‘NNP’), (‘Guru99’, ‘NNP’), (‘,’, ‘,’), (‘You’, ‘PRP’), (‘have’, ‘VBP’), (‘build’,
‘VBN’), (‘a’, ‘DT’), (‘very’, ‘RB’), (‘good’, ‘JJ’), (‘site’, ‘NN’), (‘and’, ‘CC’), (‘I’, ‘PRP’),
(‘love’, ‘VBP’), (‘visiting’, ‘VBG’), (‘your’, ‘PRP$’), (‘site’, ‘NN’), (‘.’, ‘.’)]
Code Explanation:
1. Code to import nltk (Natural language toolkit which contains submodules such as
sentence tokenize and word tokenize.)
2. Text whose tags are to be printed.
3. Sentence Tokenization
4. For loop is implemented where words are tokenized from sentence and tag of each word
is printed as output.
In Corpus there are two types of POS taggers:
Rule-Based
Stochastic POS Taggers
1.Rule-Based POS Tagger: For the words having ambiguous meaning, rule-based approach on
the basis of contextual information is applied. It is done so by checking or analyzing the meaning
of the preceding or the following word. Information is analyzed from the surrounding of the
word or within itself. Therefore words are tagged by the grammatical rules of a particular
language such as capitalization and punctuation. e.g., Brill’s tagger.
2.Stochastic POS Tagger: Different approaches such as frequency or probability are applied
under this method. If a word is mostly tagged with a particular tag in training set then in the test
sentence it is given that particular tag. The word tag is dependent not only on its own tag but also
on the previous tag. This method is not always accurate. Another way is to calculate the
probability of occurrence of a specific tag in a sentence. Thus the final tag is calculated by
checking the highest probability of a word with a particular tag.
Conclusion:
Thus the study of the different POS taggers and performing POS tagging on the given text is done
and verified.
EXPT.7 PERFORM CHUNKING FOR THE GIVEN TEXT INPUT
LAB OBJECTIVES:
To understand to perform chunking for the given text input for real world application.
LAB OUTCOMES:
On Successful Completion, the Student will be able to understand to perform chunking for the
given text input for real world application
PROCEDURE:
# initializing string
test_str = 'geeksforgeeks 1'
print("The original string is : " + str(test_str))

# initializing K
K = 5

# chunk length and splitting the string into K chunks
chnk_len = len(test_str) // K
res = []
for idx in range(0, len(test_str), chnk_len):
    res.append(test_str[idx : idx + chnk_len])

# printing result
print("The K chunked list : " + str(res))
Output
The original string is : geeksforgeeks 1
The K chunked list : ['gee', 'ksf', 'org', 'eek', 's 1']
Method #2: Using list comprehension
This method is similar to the above, the difference being that the last step is encapsulated in a
one-liner list comprehension.
Python3
# Python3 code to demonstrate working of
# Divide String into Equal K chunks
# Using list comprehension
# initializing string
test_str = 'geeksforgeeks 1'

# initializing K
K = 5

# chunk length and one-liner list comprehension to divide the string
chnk_len = len(test_str) // K
res = [test_str[idx : idx + chnk_len] for idx in range(0, len(test_str), chnk_len)]

# printing result
print("The K len chunked list : " + str(res))
Output
The K len chunked list : ['gee', 'ksf', 'org', 'eek', 's 1']
Thus the experiment to Perform Chunking for the given text input is done and verified.
EXPT. 8 IMPLEMENTING NAMED ENTITY RECOGNIZER FOR THE
GIVEN TEXT INPUT
LAB OBJECTIVES:
To implement the named entity recognizer for the given text input for the real world applications.
LAB OUTCOMES:
On Successful Completion, the Student will be able to understand about the named entity
recognizer for the given text input for the real world applications
PROCEDURE:
Methods of NER
One way is to train the model for multi-class classification using different machine learning
algorithms, but it requires a lot of labelling. In addition to labelling, the model also requires a
deep understanding of context to deal with the ambiguity of the sentences. This makes it a
challenging task for simple machine learning algorithms.
Another way is the Conditional Random Field (CRF), which is implemented by both the NLP Speech
Tagger and NLTK. It is a probabilistic model that can be used to model sequential data such
as words. The CRF can capture a deep understanding of the context of the sentence.
Deep Learning Based NER: deep learning NER is much more accurate than the previous
methods, as it is capable of assembling words. This is because it uses a method called word
embedding, which is capable of understanding the semantic and syntactic relationship between
various words. It is also able to learn topic-specific as well as high-level words automatically.
This makes deep learning NER applicable for performing multiple tasks. Deep learning can do
most of the repetitive work itself, so researchers, for example, can use their time more
efficiently.
Implementation
In this implementation, we will perform Named Entity Recognition using two different
frameworks: Spacy and NLTK. This code can be run on Colab; however, for visualization
purposes, I recommend a local environment. We can install the required frameworks using
pip install.
First, we performed Named Entity recognition using Spacy.
Python3
# command to run before code
! pip install spacy
! pip install nltk
! python -m spacy download en_core_web_sm
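The listing itself is not shown; a minimal Spacy sketch (an assumption) that matches the structure of the output below:
import spacy

nlp = spacy.load('en_core_web_sm')

text = ("Python is an interpreted, high-level and general-purpose programming language. "
        "Pythons design philosophy emphasizes code readability with its notable use of "
        "significant indentation. Its language constructs and object-oriented approach aim "
        "to help programmers write clear, logical code for small and large-scale projects")

doc = nlp(text)

# sentences
print(list(doc.sents))

# tokens
for token in doc:
    print(token.text)

# named entities with character offsets and labels
print([(ent.text, ent.start_char, ent.end_char, ent.label_) for ent in doc.ents])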
Output:
[Python is an interpreted, high-level and general-purpose programming language.,
Pythons design philosophy emphasizes code readability with its notable use of significant
indentation.,
Its language constructs and object-oriented approach aim to help programmers write clear,
logical code for small and large-scale projects]
# tokens
Python
is
an
interpreted
,
high
-
level
and
general
-
purpose
programming
language
.
Pythons
design
philosophy
emphasizes
code
readability
with
its
notable
use
of
significant
indentation
.
Its
language
constructs
and
object
-
oriented
approachaim
to
help
programmers
write
clear
,
logical
code
for
small
and
large
-
scale
projects
# named entity
[('Python', 0, 6, 'ORG')]
#here ORG stands for Organization
Conclusion:
Thus the experiment to Implement Named Entity Recognizer for the given text input is done and
verified.
EXPT.9 IMPLEMENT TEXT SIMILARITY RECOGNIZER FOR THE CHOSEN TEXT
DOCUMENTS
LAB OBJECTIVES:
To implement text similarity recognizer for the chosen text documents for real world
applications.
LAB OUTCOMES:
On Successful Completion, the Student will be able to understand about implement text
similarity recognizer for the chosen text documents for real world applications.
PROCEDURE:
import math
import string
import sys

# reads the whole text file
def read_file(filename):
    try:
        with open(filename, 'r') as f:
            data = f.read()
        return data
    except IOError:
        print("Error opening or reading input file: ", filename)
        sys.exit()

# maps punctuation to spaces and upper case to lower case before splitting into words
translation_table = str.maketrans(string.punctuation + string.ascii_uppercase,
                                  " " * len(string.punctuation) + string.ascii_lowercase)

def get_words_from_line_list(text):
    text = text.translate(translation_table)
    word_list = text.split()
    return word_list
Now that we have the word list, we will now calculate the frequency of occurrences of the
words.
# counts the frequency of each word and returns a dictionary of (word, frequency) pairs
def count_frequency(word_list):
    D = {}
    for new_word in word_list:
        if new_word in D:
            D[new_word] = D[new_word] + 1
        else:
            D[new_word] = 1
    return D
# returns dictionary of (word, frequency)
# pairs from the previous dictionary.
def word_frequencies_for_file(filename):
    line_list = read_file(filename)
    word_list = get_words_from_line_list(line_list)
    freq_mapping = count_frequency(word_list)

    # report the counts shown in the output below
    print("File", filename, ":", )
    print(len(line_list), "lines, ")
    print(len(word_list), "words, ")
    print(len(freq_mapping), "distinct words")

    return freq_mapping
Lastly, we will calculate the dot product to give the document distance.
# returns the dot product of two documents
def dotProduct(D1, D2):
    Sum = 0.0
    for key in D1:
        if key in D2:
            Sum += (D1[key] * D2[key])
    return Sum
# returns the angle (in radians) between the two document vectors
def vector_angle(D1, D2):
    numerator = dotProduct(D1, D2)
    denominator = math.sqrt(dotProduct(D1, D1) * dotProduct(D2, D2))
    return math.acos(numerator / denominator)
# computes the similarity (angle) between the two documents
def documentSimilarity(filename_1, filename_2):
    # filename_1 = sys.argv[1]
    # filename_2 = sys.argv[2]
    sorted_word_list_1 = word_frequencies_for_file(filename_1)
    sorted_word_list_2 = word_frequencies_for_file(filename_2)
    distance = vector_angle(sorted_word_list_1, sorted_word_list_2)
    print("The distance between the documents is: %0.6f (radians)" % distance)

# Driver code
documentSimilarity('GFG.txt', 'file.txt')
Output:
File GFG.txt :
15 lines,
4 words,
4 distinct words
File file.txt :
22 lines,
5 words,
5 distinct words
The distance between the documents is: 0.835482 (radians)
Conclusion:
Thus the experiment to Implement Text Similarity Recognizer for the chosen text documents is
done and verified.
EXPT.10 EXPLORATORY DATA ANALYSIS OF A GIVEN TEXT (WORD CLOUD)
LAB OBJECTIVES:
To understand about exploratory data analysis for a given text (word cloud) for real world
applications.
LAB OUTCOMES:
On Successful Completion, the Student will be able to understand about exploratory data
analysis for a given text (word cloud) for real world applications.
PROCEDURE:
Python3
comment_words = ''
stopwords = set(STOPWORDS)
# the comments are collected into comment_words, then the WordCloud is generated and
# plotted with matplotlib (a fuller sketch follows below)
plt.show()
Output :
The above word cloud has been generated using Youtube04-Eminem.csv file in the dataset. One
interesting task might be generating word clouds using other csv files available in the dataset.
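A fuller sketch of how such a word cloud can be generated; it assumes the Youtube04-Eminem.csv file is in the working directory and that its comment text sits in a CONTENT column:
# importing the required libraries (assumed setup)
import pandas as pd
import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS

# read the dataset; the CONTENT column is assumed to hold the comments
df = pd.read_csv("Youtube04-Eminem.csv", encoding="latin-1")

comment_words = ''
stopwords = set(STOPWORDS)

# collect all comments into one lower-cased string
for val in df.CONTENT:
    tokens = str(val).lower().split()
    comment_words += " ".join(tokens) + " "

# generate and display the word cloud
wordcloud = WordCloud(width=800, height=800, background_color='white',
                      stopwords=stopwords, min_font_size=10).generate(comment_words)
plt.figure(figsize=(8, 8), facecolor=None)
plt.imshow(wordcloud)
plt.axis("off")
plt.tight_layout(pad=0)
plt.show()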
Advantages of Word Clouds :
1. Analyzing customer and employee feedback.
2. Identifying new SEO keywords to target.
Drawbacks of Word Clouds :
1. Word Clouds are not perfect for every situation.
2. Data should be optimized for context.
Reference : https://en.wikipedia.org/wiki/Tag_cloud
Conclusion:
Thus the experiment, Exploratory data analysis of a given text (Word Cloud) is done and
verified.
EXPT.11 MINI PROJECT REPORT: WEB SCRAPING (FOR A CHOSEN REAL WORLD NLP
APPLICATION)
LAB OBJECTIVES:
To understand about the mini project report: web scraping for real world applications.
LAB OUTCOMES:
On Successful Completion, the Student will be able to understand about the mini project report:
web scraping for real world applications.
PROCEDURE:
Gathering and preparing datasets is one of the critical steps in any Machine Learning
project. People accumulate datasets through numerous approaches like databases, online
repositories, APIs, survey forms, and many others. But when we want to extract a website's
data and no API is available, the best alternative left is Web Scraping.
In this article, you will learn about web scraping in brief and see how to extract data from
websites with a hands-on demonstration with python. We will be covering the following topics.
Table of Contents
1. What is Web Scraping
2. Why is Web Scraping used
3. Challenges and Guide for Web Scraping
4. Python Libraries for Web Scraping
5. Hands-on Web Scraping with Python
6. Web Scraping using lxml
7. Web Scraping using Scrapy
8. End Notes
What is Web Scraping?
Web scraping is a simple technique that describes an automatic collection of a huge amount of
data from websites. Data is of three types: structured, unstructured, and semi-structured.
Websites hold all these types of data in an unstructured way; web scraping helps to collect this
unstructured data and store it in a structured way. Today most corporates use web scraping to
leverage good business decisions in this competitive market. So let's learn why and where web
scraping is used the most.
Why is Web Scraping used?
We have already discussed that web scraping is required to automatically fetch data from
websites, but where is it used, and what creates the requirement to do so? Looking at its common
applications makes it clear why web scraping is necessary and widely used, and should widen
your thinking about the many different applications it serves nowadays.
A basic knowledge of HTML and CSS also helps, because in a nutshell, by web scraping we
extract a website's data, which is returned as an HTML document, and CSS selectors are used to
get the specific data that we are looking for. But in this article, we will start from ground level so
you can follow it easily.
Python Libraries for web scraping
requests – It is the most basic library for web scraping. The requests module allows you to send
an HTTP request like GET, POST, etc. to websites using Python. Getting the HTML content of a
web page is the first and foremost step of web scraping. Due to its ease of use, it comes with the
motto of "HTTP for humans". However, requests does not parse the retrieved HTML content; for
that, we need an HTML parser.
Beautiful Soup (bs4) – Beautiful Soup is a Python library used for web scraping. It sits on top
of an HTML or XML parser and provides Python idioms for iterating, searching, and
modifying a parse tree. It automatically converts incoming documents to Unicode and outgoing
documents to UTF-8. Beautiful Soup is easy to learn, robust, beginner-friendly, and one of the
most widely used web scraping libraries.
lxml – It is a high-performance, fast HTML and XML parsing library. It is faster than Beautiful
Soup and works well when we are aiming to scrape large datasets. It also allows you to extract
data using XPath.
Scrapy – A complete web scraping framework used to scrape large amounts of data efficiently
and effectively. It can be used for a wide range of purposes, from data mining to monitoring and
automated testing. Scrapy creates spiders that crawl across websites and retrieve the data. The
best thing about Scrapy is that it is asynchronous, so with its help you can make multiple HTTP
requests simultaneously. You can also add plugins or middleware to extend its functionality.
We are going to scrape the data from the Ambition box website. Ambition Box is a platform that
lists job openings in different companies in India. If you visit the companies page you can
observe the name of the company, rating, review, how old the company is, and different
information about it. So we aim to get this data in a table format that consists of the name of the
company, rating, review, company age, number of employees, and other information. There are
about 33 pages, and on each page approximately 30 companies are listed, and we want to fetch all
of them.
Make a request
Now we will create an HTTP request to the Ambition Box website, and it will give us a response
containing the page's HTML content. To extract the data present in it we will use Beautiful Soup,
which creates a parser around it. If you print the parser output using the prettify function, you can
see the HTML in a readable, indented form.
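A short sketch of this setup, as assumed by the snippets that follow (pandas and numpy are imported here because the later code uses pd.DataFrame and np.nan):
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np

# fetch the first page of the companies list and build the parser
webpage = requests.get('https://www.ambitionbox.com/list-of-companies?page=1').text
soup = BeautifulSoup(webpage, 'lxml')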
To get any name from an HTML document, there is a specific tag in which it is written. If you go
to the website, right-click, and open the Inspect section, you can see that each company's name
is written in an H2 tag.
Since the names of all the companies are written in H2 tags, we can run a loop and get all the
names of the companies on the first page. find_all extracts a list object; for each element we read
the text written in it, and the strip function is used to avoid the extra spaces that are used in a web
page for design.
If you inspect the rating and review of the companies, they are all written in paragraph (p) tags.
Along with the paragraph tag, they have a unique class name with which we can identify them.
Rating has the rating class and reviews have the review-count class, but company type, company
age, headquarters location, and number of employees are in the same tag and have the same class
name, infoEntity.
Now, as we have seen, we will access each feature using its tag and class. So let us create a
separate list for each feature, each of length 30. First, we will store the list of all 30 company
divisions in a variable and apply a loop over it.
company=soup.find_all('div',class_='company-content-wrapper')
print(len(company)) #30
Now we can easily loop over the company variables and get all the information on the first page.
name = []
rating = []
reviews = []
comp_type = []
head_q = []
how_old = []
no_of_employees = []
for comp in company:
    name.append(comp.find('h2').text.strip())
    rating.append(comp.find('p', class_="rating").text.strip())
    reviews.append(comp.find('a', class_="review-count").text.strip())
    comp_type.append(comp.find_all('p', class_='infoEntity')[0].text.strip())
    head_q.append(comp.find_all('p', class_='infoEntity')[1].text.strip())
    how_old.append(comp.find_all('p', class_='infoEntity')[2].text.strip())
    no_of_employees.append(comp.find_all('p', class_='infoEntity')[3].text.strip())
#creating dataframe for all list
features = {'name':name, 'rating':rating,'reviews':reviews,
'company_type':comp_type,'Head_Quarters':head_q, 'Company_Age':how_old,
'No_of_Employee':no_of_employees }
df = pd.DataFrame(features)
The above is a complete dataframe of only the first page; now let's scrape the remaining pages too.
Now you have a better understanding of web scraping and how data is coming in a separate
feature. So we are ready to create a final dataframe of all 33 pages and each page is having data
of 30 companies. But on some pages, there are some inconsistencies like a little information
about a company is not provided so we need to handle this. so we will place each feature in a try-
except block and if data is not present then we will append the Null value in place of it. For
fetching data from each page, we have to make a request again and again on a different page in a
loop and fetch its data and after that, all the things are the same as above.
final = pd.DataFrame()
for j in range(1, 33):
    # make a request to a specific page
    webpage = requests.get('https://www.ambitionbox.com/list-of-companies?page={}'.format(j)).text
    soup = BeautifulSoup(webpage, 'lxml')
    company = soup.find_all('div', class_='company-content-wrapper')

    name = []
    rating = []
    reviews = []
    comp_type = []
    head_q = []
    how_old = []
    no_of_employees = []

    for comp in company:
        try:
            name.append(comp.find('h2').text.strip())
        except:
            name.append(np.nan)
        try:
            rating.append(comp.find('p', class_="rating").text.strip())
        except:
            rating.append(np.nan)
        try:
            reviews.append(comp.find('a', class_="review-count").text.strip())
        except:
            reviews.append(np.nan)
        try:
            comp_type.append(comp.find_all('p', class_='infoEntity')[0].text.strip())
        except:
            comp_type.append(np.nan)
        try:
            head_q.append(comp.find_all('p', class_='infoEntity')[1].text.strip())
        except:
            head_q.append(np.nan)
        try:
            how_old.append(comp.find_all('p', class_='infoEntity')[2].text.strip())
        except:
            how_old.append(np.nan)
        try:
            no_of_employees.append(comp.find_all('p', class_='infoEntity')[3].text.strip())
        except:
            no_of_employees.append(np.nan)

    # creating dataframe for all lists
    features = {'name': name, 'rating': rating, 'reviews': reviews,
                'company_type': comp_type, 'Head_Quarters': head_q, 'Company_Age': how_old,
                'No_of_Employee': no_of_employees}
    df = pd.DataFrame(features)
    final = final.append(df, ignore_index=True)
We have created a dynamic URL for each page to make a request and fetch the data, and you can
have a look at the final dataframe. That's it; this is how web scraping is done.
Now we have an understanding of how web scraping works, and how to extract a single piece of
information from a website and implement a dataframe. What if we want to extract some
paragraphs or some informant line from some blog or article then It is easy to do with lxml using
XPath.
We will extract a paragraph from one of the Analytics Vidhya articles using lxml with only a few
lines of code. I hope that you have already installed lxml using the pip command and are ready to
follow along.
Step-1) Inspect the paragraph element
Visit the article, select any paragraph, right-click on it, and click on the Inspect option.
Step-2) Right-click element on source-code to the right
As you click on Inspect the Element section will open, and in that right-click on the selected
element and copy XPath of element and come to the coding environment and save the path in a
variable as a string.
Step-3) HTTP Request to retrieve HTML content
Make HTTP requests on the Article website to retrieve the HTML content.
import requests
from lxml import html
URL = 'https://www.analyticsvidhya.com/blog/2021/09/a-comprehensive-guide-on-neural-
networks-performance-optimization/'
path = '//*[@id="1"]/p[5]'
response = requests.get(URL)
Step-4) Get the byte string and filter the source code
Using the lxml html parser, parse the response content received from the request and convert it
into a source-code object.
byte_data = response.content
source_code = html.fromstring(byte_data)
Step-5) Jump to preferred HTML element
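The body of this step is not shown; a minimal sketch (an assumption) of how the saved XPath can be used to jump to the element and print its text:
# jump to the element pointed to by the XPath and print its text
tree = source_code.xpath(path)
print(tree[0].text_content())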
Web Scraping using Scrapy
Scraping data efficiently in a few minutes is everyone's aim, and that is fulfilled by Scrapy, with
multiple spider bots that crawl a website to retrieve data for you. In this section, we will be using
Scrapy in a local Jupyter notebook (or Google Colab) and scrape data into a dataframe.
Scrapy provides a default quotes website for learning web scraping using Scrapy.
It consists of various quotes along with the author's name and the tags to which they belong. We
will create a dataframe with 3 columns: quote, author, and tag. After installing Scrapy, follow the
steps below. After scraping details from the website, we will write the details to a JSON file and
load them into a pandas dataframe.
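Step-1) Install and import Scrapy: a short assumed setup, since the later snippets use scrapy.Spider and CrawlerProcess.
# step-1 setup (assumption)
# ! pip install scrapy
import scrapy
from scrapy.crawler import CrawlerProcess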
Step-2) Setup the pipeline: here we create a class that creates a new JSON file and a function to
write all items found during scraping into that JSON file, where each line contains one JSON
element.
#setup pipeline
import json

class JsonWriterPipeline(object):

    def open_spider(self, spider):
        self.file = open('quoteresult.jl', 'w')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item
step-3) Define the Spider
Now we need to define our crawler(Spider) and we pass the URL from where to start parsing and
which values to retrieve. I set the logging level to a warning so that notebook is not overloaded.
#define spider
import logging

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]
    custom_settings = {
        'LOG_LEVEL': logging.WARNING,
        'ITEM_PIPELINES': {'__main__.JsonWriterPipeline': 1},  # Used for pipeline 1
        'FEED_FORMAT': 'json',                                 # Used for pipeline 2
        'FEED_URI': 'quoteresult.json'                         # Used for pipeline 2
    }

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('span small::text').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract(),
            }
Each quote is written in a separate division tag with class name quote, so we fetch all of them
with the CSS selectors shown above.
Step-4) Start the crawler: define the scrapy crawler process and pass the spider class to it to start
retrieving the data.
#start crawler
process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})

process.crawl(QuotesSpider)
process.start()
The retrieved data is saved in a JSON file and we will load them as a dataframe using pandas.
import pandas as pd
dfjson = pd.read_json('quoteresult.json')
This is how scrapy works and helps you to extract lots of data from websites very quickly.
End Notes
In this article, we have learned about Web scraping, its applications, and why it is being used
everywhere. We have performed hands-on live web scraping from websites to fetch different
companies and prepare a dataframe that is used for further machine learning project purposes
using Beautiful Soup. We have also learned about the lxml library and performed a practical
demonstration. Apart from this, we have learned about the boss of web scraping libraries, named
Scrapy.
Conclusion:
Thus the Mini Project Report: Web Scraping for a real world NLP application is done and verified.
EXPT.12 IMPLEMENTATION AND PRESENTATION OF MINI PROJECT: WEB
SCRAPING
LAB OBJECTIVES:
To implement and presentation of mini project: web scraping for real world applications.
LAB OUTCOMES:
On Successful Completion, the Student will be able to implementation and presentation of mini
project: web scraping for real world applications
PROCEDURE:
Installation
Requests installation depends on the type of operating system, the basic command anywhere
would be to open a command terminal and run,
pip install requests
Making a Request
Python requests module has several built-in methods to make HTTP requests to specified URI
using GET, POST, PUT, PATCH, or HEAD requests. A HTTP request is meant to either retrieve
data from a specified URI or to push data to a server. It works as a request-response protocol
between a client and a server. Here we will be using the GET request.
GET method is used to retrieve information from the given server using a given URI. The GET
method sends the encoded user information appended to the page request.
Python3
import requests
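# a minimal sketch (assumption): making a GET request and inspecting the response
r = requests.get('https://www.geeksforgeeks.org/python-programming-language/')

# check the status code for the response received (success code - 200)
print(r)

# print the content of the request
print(r.content)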
Output:
Response object
When one makes a request to a URI, it returns a response. This Response object in terms of
python is returned by requests.method(), method being – get, post, put, etc. Response is a
powerful object with lots of functions and attributes that assist in normalizing data or creating
ideal portions of code. For example, response.status_code returns the status code from the
headers itself, and one can check if the request was processed successfully or not.
Response objects can be used to imply lots of features, methods, and functionalities.
Example: Python requests Response Object
Python3
import requests
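# a sketch (assumption) matching the output below: inspect the Response object
r = requests.get('https://www.geeksforgeeks.org/python-programming-language/')

# print the url and the status code returned in the response
print(r.url)
print(r.status_code)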
Output:
https://www.geeksforgeeks.org/python-programming-language/
200
For more information, refer to our Python Requests Tutorial.
BeautifulSoup Library
BeautifulSoup is used to extract information from HTML and XML files. It provides a parse
tree and the functions to navigate, search, or modify this parse tree.
Installation
To install BeautifulSoup on Windows, Linux, or any other operating system, one would need the
pip package. To check how to install pip on your operating system, check out – PIP Installation –
Windows || Linux. Now run the below command in the terminal:
pip install beautifulsoup4
Before getting any information out of the HTML of the page, we must understand the
structure of the page. This needs to be done in order to select the desired data from the entire
page. We can do this by right-clicking on the page we want to scrape and selecting inspect element.
After clicking the inspect button, the Developer Tools of the browser open. Almost all browsers
come with the developer tools installed, and we will be using Chrome for this
tutorial.
The developer’s tools allow seeing the site’s Document Object Model (DOM). If you don’t know
about DOM then don’t worry just consider the text displayed as the HTML structure of the page.
Python3
import requests
from bs4 import BeautifulSoup
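# a sketch (assumption): fetch the page and parse it with BeautifulSoup
r = requests.get('https://www.geeksforgeeks.org/python-programming-language/')
soup = BeautifulSoup(r.content, 'html.parser')

# print the parsed HTML of the page
print(soup.prettify())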
Output:
This information is still not useful to us, let’s see another example to make some clear picture
from this. Let’s try to extract the title of the page.
Python3
import requests
from bs4 import BeautifulSoup
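# a sketch (assumption) matching the output below: extract the <title> tag
r = requests.get('https://www.geeksforgeeks.org/python-programming-language/')
soup = BeautifulSoup(r.content, 'html.parser')

print(soup.title)
print(soup.title.name)         # name of the tag
print(soup.title.parent.name)  # name of an enclosing tag (head or html, depending on the page)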
Output:
<title>Python Programming Language - GeeksforGeeks</title>
title
html
Finding Elements
Now, we would like to extract some useful data from the HTML content. The soup object
contains all the data in the nested structure which could be programmatically extracted. The
website we want to scrape contains a lot of text so now let’s scrape all those content. First, let’s
inspect the webpage we want to scrape.
Finding Elements by class
In the above image, we can see that all the content of the page is under the div with class entry-
content. We will use the find() method, which finds the given tag with the given attribute. In
our case, it will find the div having class entry-content. We have got all the content from
the site, but you can see that all the images and links are also scraped. So our next task is to find
only the text content from the above-parsed HTML. On again inspecting the HTML of our website,
we can see that the content of the page is under the <p> tag. Now we have to find all the p tags
present in this div. We can use the find_all() method of BeautifulSoup.
Python3
import requests
from bs4 import BeautifulSoup
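# assumed setup for the lines below: fetch and parse the page first
r = requests.get('https://www.geeksforgeeks.org/python-programming-language/')
soup = BeautifulSoup(r.content, 'html.parser')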
s = soup.find('div', class_='entry-content')
content = s.find_all('p')
print(content)
Output:
Finding Elements by ID
In the above example, we have found the elements by the class name but let’s see how to find
elements by id. Now for this task let’s scrape the content of the leftbar of the page. The first step
is to inspect the page and see the leftbar falls under which tag.
The above image shows that the leftbar falls under the <div> tag with id as main. Now let's get
the HTML content under this tag. Let's inspect more of the page to get the content of the
leftbar.
We can see that the list in the leftbar is under the <ul> tag with the class as leftBarList and our
task is to find all the li under this ul.
Python3
import requests
from bs4 import BeautifulSoup

# Making a GET request
r = requests.get('https://www.geeksforgeeks.org/python-programming-language/')
soup = BeautifulSoup(r.content, 'html.parser')

# Finding by id
s = soup.find('div', id='main')

# Getting the leftbar list and all the li under it
leftbar = s.find('ul', class_='leftBarList')
content = leftbar.find_all('li')
print(content)
Output:
Extracting Text from the tags
In the above examples, you must have seen that while scraping the data the tags also get scraped
but what if we want only the text without any tags. Don’t worry we will discuss the same in this
section. We will be using the text property. It only prints the text from the tag. We will be using
the above example and will remove all the tags from them.
Example 1: Removing the tags from the content of the page
Python3
import requests
from bs4 import BeautifulSoup

r = requests.get('https://www.geeksforgeeks.org/python-programming-language/')
soup = BeautifulSoup(r.content, 'html.parser')
s = soup.find('div', class_='entry-content')
lines = s.find_all('p')
print([line.text for line in lines])  # only the text, tags removed
Output:
Example 2: Removing the tags from the content of the leftbar
Python3
import requests
from bs4 import BeautifulSoup

r = requests.get('https://www.geeksforgeeks.org/python-programming-language/')
soup = BeautifulSoup(r.content, 'html.parser')

# Finding by id and printing only the text of each list item
s = soup.find('div', id='main')
leftbar = s.find('ul', class_='leftBarList')
print([li.text for li in leftbar.find_all('li')])
Output:
Extracting Links
Till now we have seen how to extract text, let’s now see how to extract the links from the page.
Example: Python BeautifulSoup Extracting Links
Python3
import requests
from bs4 import BeautifulSoup
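# a sketch (assumption): find all anchor tags and print their href attributes
r = requests.get('https://www.geeksforgeeks.org/python-programming-language/')
soup = BeautifulSoup(r.content, 'html.parser')

for link in soup.find_all('a'):
    print(link.get('href'))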
Output:
On again inspecting the page, we can see that images lie inside the
img tag and the link of that image is inside the src attribute. See the
below image –
Example: Python BeautifulSoup Extract Image
Python3
import requests
from bs4 import BeautifulSoup
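# assumed setup for the lines below: fetch and parse the page before selecting images
r = requests.get('https://www.geeksforgeeks.org/python-programming-language/')
soup = BeautifulSoup(r.content, 'html.parser')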
images_list = []

images = soup.select('img')
for image in images:
    src = image.get('src')
    alt = image.get('alt')
    images_list.append({"src": src, "alt": alt})

print(images_list)
Output:
Scraping multiple Pages
Now, there may arise various instances where you may want to get data from multiple pages
of the same website or from multiple different URLs as well, and manually writing code for each
webpage is a time-consuming and tedious task. Plus, it defies all basic principles of automation.
Duh!
To solve this exact problem, we will see two main techniques that will help us extract data from
multiple webpages:
The same website
Different website URLs
Here, we can see the page details at the end of the URL. Using this information we can easily
create a for loop iterating over as many pages as we want (by putting page/(i)/ in the URL string
and iterating “i” till N) and scrape all the useful data from them. The following code will give
you more clarity over how to scrape data by using a For Loop in Python.
Python3
import requests
from bs4 import BeautifulSoup as bs

URL = 'https://www.geeksforgeeks.org/page/1/'
req = requests.get(URL)
soup = bs(req.text, 'html.parser')

titles = soup.find_all('div', attrs={'class', 'head'})
print(titles[4].text)
Output:
7 Most Common Time Wastes During Software Development
Now, using the above code, we can get the titles of all the articles by just sandwiching those lines
with a loop.
Python3
import requests
from bs4 import BeautifulSoup as bs
URL = 'https://www.geeksforgeeks.org/page/'
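# a sketch (assumption): loop over the first few pages and print every article title
for page in range(1, 3):
    req = requests.get(URL + str(page) + '/')
    soup = bs(req.text, 'html.parser')
    titles = soup.find_all('div', attrs={'class', 'head'})
    for i, title in enumerate(titles, start=1):
        print(f"{(page - 1) * len(titles) + i} " + title.text)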
Output:
Python3
import requests
from bs4 import BeautifulSoup as bs

URL = ['https://www.geeksforgeeks.org', 'https://www.geeksforgeeks.org/page/10/']

for url in range(0, 2):
    req = requests.get(URL[url])
    soup = bs(req.text, 'html.parser')
    titles = soup.find_all('div', attrs={'class', 'head'})
    for i in range(4, 19):
        if url + 1 > 1:
            print(f"{(i - 3) + url * 15}" + titles[i].text)
        else:
            print(f"{i - 3}" + titles[i].text)
Output:
For more information, refer to our Python BeautifulSoup Tutorial.
Saving Data to CSV
First we will create a list of dictionaries with the key value pairs that we
want to add in the CSV file. Then we will use the csv module to write
the output in the CSV file. See the below example for better
understanding.
Python3
import requests
from bs4 import BeautifulSoup as bs
import csv

URL = 'https://www.geeksforgeeks.org/page/'

# fetch the first page and collect the article titles
req = requests.get(URL + '1/')
soup = bs(req.text, 'html.parser')
titles = soup.find_all('div', attrs={'class', 'head'})

titles_list = []
count = 1
for title in titles:
    d = {}
    d['Title Number'] = f'Title {count}'
    d['Title Name'] = title.text
    count += 1
    titles_list.append(d)

filename = 'titles.csv'
with open(filename, 'w', newline='') as f:
    w = csv.DictWriter(f, ['Title Number', 'Title Name'])
    w.writeheader()
    w.writerows(titles_list)
Output:
Conclusion: Thus the Implementation and Presentation of Mini Project: Web Scraping is
done and verified.