A Practitioner's Guide To Natural Language Processing (Part I) - Processing & Understanding Text
Introduction
Unstructured data, especially text, images and videos, contains a wealth of information.
However, due to the inherent complexity of processing and analyzing this data, people
often refrain from spending the extra time and effort to venture beyond structured
datasets and analyze these unstructured sources, which can be a potential gold
mine.
https://towardsdatascience.com/a-practitioners-guide-to-natural-language-processing-part-i-processing-understanding-text-9f4abfd13e72 1/46
8/23/2019 A Practitioner's Guide to Natural Language Processing (Part I) — Processing & Understanding Text
Natural Language Processing (NLP) is all about leveraging tools, techniques and
algorithms to process and understand natural language-based data, which is usually
unstructured, such as text and speech. In this series of articles, we will be looking at
tried and tested strategies, techniques and workflows which can be leveraged by
practitioners and data scientists to extract useful insights from text data. We will also
cover some useful and interesting use-cases for NLP. This article will be all about
processing and understanding text data with tutorials and hands-on examples.
5. Advanced Topics
Feel free to suggest more ideas as this series progresses, and I will be glad to cover
something I might have missed out on. A lot of these articles will showcase tips and
strategies which have worked well in real-world scenarios.
This article will be covering the following aspects of NLP in detail with hands-on
examples.
4. Shallow Parsing
This should give you a good idea of how to get started with analyzing syntax and
semantics in text corpora.
Motivation
Formally, NLP is a specialized field of computer science and artificial intelligence with
roots in computational linguistics. It is primarily concerned with designing and
building applications and systems that enable interaction between machines and
the natural languages that have evolved for use by humans. Hence, it is often
perceived as a niche area to work in, and people usually tend to focus more on
machine learning or statistical learning.
When I started delving into the world of data science, even I was overwhelmed by the
challenges in analyzing and modeling on text data. However, after working as a Data
Scientist on several challenging problems around NLP over the years, I’ve noticed
certain interesting aspects, including techniques, strategies and workflows which can
be leveraged to solve a wide variety of problems. I have covered several topics around
NLP in my books “Text Analytics with Python” (I’m writing a revised version of this
soon) and “Practical Machine Learning with Python”.
However, based on all the excellent feedback I’ve received from my readers (yes, all you
amazing people out there!), the main objective and motivation in creating this series of
articles is to share my learnings with more people who can’t always find time to sit and
read through a book, and who can refer to these articles on the go! Thus, there is no
prerequisite to buy any of these books to learn NLP.
Getting Started
When building the content and examples for this article, I wondered whether I should focus
on a toy dataset to explain things better, or on an existing dataset from one of the
main sources for data science datasets. Then I thought, why not build an end-to-end
tutorial, where we scrape the web to get some text data and showcase examples based
on that!
The source data which we will be working on will be news articles, which we have
retrieved from inshorts, a website that gives us short, 60-word news articles on a wide
variety of topics, and they even have an app for it!
In this article, we will be working with text data from news articles on technology,
sports and world news. I will be covering some basics on how to scrape and retrieve
these news articles from their website in the next section.
be solved by a methodical workflow that has a sequence of steps. The major steps are
depicted in the following figure.
We usually start with a corpus of text documents and follow standard processes of text
wrangling and pre-processing, parsing and basic exploratory data analysis. Based on
the initial insights, we usually represent the text using relevant feature engineering
techniques. Depending on the problem at hand, we either focus on building predictive
supervised models or unsupervised models, which usually focus more on pattern
mining and grouping. Finally, we evaluate the model and the overall success criteria
with relevant stakeholders or customers, and deploy the final model for future usage.
The landing page for technology news articles and its corresponding HTML structure
Thus, we can see the specific HTML tags which contain the textual content of each
news article in the landing page mentioned above. We will be using this information to
extract news articles by leveraging the BeautifulSoup and requests libraries. Let’s first
load up the following dependencies.
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
%matplotlib inline
We will now build a function which will leverage requests to access and get the HTML
content from the landing pages of each of the three news categories. Then, we will use
BeautifulSoup to parse and extract the news headline and article textual content for all
the news articles in each category. We find the content by accessing the specific HTML
tags and classes, where they are present (a sample of which I depicted in the previous
figure).
seed_urls = ['https://inshorts.com/en/read/technology',
             'https://inshorts.com/en/read/sports',
             'https://inshorts.com/en/read/world']

def build_dataset(seed_urls):
    news_data = []
    for url in seed_urls:
        # the last URL segment doubles as the category label
        news_category = url.split('/')[-1]
        data = requests.get(url)
        soup = BeautifulSoup(data.content, 'html.parser')

        # pair each headline card with its content card
        news_articles = [{'news_headline': headline.find('span',
                                                         attrs={"itemprop": "headline"}).string,
                          'news_article': article.find('div',
                                                       attrs={"itemprop": "articleBody"}).string,
                          'news_category': news_category}
                         for headline, article in
                         zip(soup.find_all('div',
                                           class_=["news-card-title news-right-box"]),
                             soup.find_all('div',
                                           class_=["news-card-content news-right-box"]))
                         ]
        news_data.extend(news_articles)

    df = pd.DataFrame(news_data)
    df = df[['news_headline', 'news_article', 'news_category']]
    return df
It is pretty clear that we extract the news headline, article text and category and build
out a data frame, where each row corresponds to a specific news article. We will now
invoke this function and build our dataset.
news_df = build_dataset(seed_urls)
news_df.head(10)
We now have a neatly formatted dataset of news articles, and you can quickly check
the total number of news articles with the following code.
news_df.news_category.value_counts()
Output:
-------
world 25
sports 25
technology 24
Name: news_category, dtype: int64
and spacy, both state-of-the-art libraries in NLP. Typically, a pip install <library> or a
conda install <library> should suffice. However, in case you face issues with loading
up spacy’s language models, feel free to follow the steps highlighted below to resolve
this issue (I faced this issue on one of my systems).
OR
Linking successful
./spacymodels/en_core_web_md-2.0.0/en_core_web_md -->
./Anaconda3/lib/site-packages/spacy/data/en_core
Let’s now load up the necessary dependencies for text pre-processing. We will remove
negation words from stop words, since we would want to keep them as they might be
useful, especially during sentiment analysis.
❗ IMPORTANT NOTE: A lot of you have messaged me about not being able to load the
contractions module. It’s not a standard Python module. We leverage a standard set of
contractions available in the contractions.py file in my repository. Please add it to the
same directory you run your code from, else it will not work.
import spacy
import pandas as pd
import numpy as np
import nltk
from nltk.tokenize.toktok import ToktokTokenizer
import re
from bs4 import BeautifulSoup
from contractions import CONTRACTION_MAP
import unicodedata
def strip_html_tags(text):
    # parse the markup and keep only the text nodes
    soup = BeautifulSoup(text, "html.parser")
    stripped_text = soup.get_text()
    return stripped_text

strip_html_tags('<html><h2>Some important text</h2></html>')
It is quite evident from the above output that we can remove unnecessary HTML tags
and retain the useful textual information from any document.
def remove_accented_chars(text):
    # decompose accented characters (NFKD), then drop the non-ASCII combining marks
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')
    return text

remove_accented_chars('Sómě Áccěntěd těxt')
The preceding function shows us how we can easily convert accented characters to
normal English characters, which helps standardize the words in our corpus.
Expanding Contractions
Contractions are shortened versions of words or syllables. They often exist in both
written and spoken forms of the English language. These shortened versions of
words are created by removing specific letters and sounds. In the case of
English contractions, they are often created by removing one of the vowels from the
word. Examples would be do not to don’t and I would to I’d. Converting each
contraction to its expanded, original form helps with text standardization.
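As a minimal sketch of the idea, the snippet below expands contractions with a regular expression. It assumes a tiny, hypothetical mapping (TOY_CONTRACTION_MAP) in place of the full CONTRACTION_MAP from contractions.py, and it only handles straight apostrophes, so treat it as an illustration rather than the exact function used in this article.

```python
import re

# Hypothetical, tiny stand-in for the full CONTRACTION_MAP in contractions.py.
TOY_CONTRACTION_MAP = {
    "don't": "do not",
    "i'd": "i would",
    "y'all": "you all",
    "can't": "cannot",
}

def expand_contractions(text, contraction_map=TOY_CONTRACTION_MAP):
    # One alternation pattern over all known contractions, longest first.
    keys = sorted(contraction_map, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(k) for k in keys) + r")\b",
        flags=re.IGNORECASE,
    )

    def expand_match(match):
        found = match.group(0)
        expanded = contraction_map[found.lower()]
        # Keep the casing of the first character of the original token.
        return found[0] + expanded[1:]

    return pattern.sub(expand_match, text)

print(expand_contractions("Y'all can't expand contractions I'd think"))
# You all cannot expand contractions I would think
```

A lookup table like this is simple and predictable, but it cannot resolve ambiguous cases such as I’d (I would versus I had).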
We can see how our function helps expand the contractions from the preceding output.
Are there better ways of doing this? Definitely! If we have enough examples, we can
even train a deep learning model for better performance.
I’ve kept removing digits as optional, because often we might need to keep them in the
pre-processed text.
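A minimal sketch of such a function is shown below, assuming that "special characters" means anything that is not an ASCII letter, a digit or whitespace. The function name and the remove_digits flag are illustrative choices, not necessarily the ones used in the article's repository.

```python
import re

def remove_special_characters(text, remove_digits=False):
    # Keep ASCII letters and whitespace; keep digits too unless remove_digits is set.
    pattern = r"[^a-zA-Z\s]" if remove_digits else r"[^a-zA-Z0-9\s]"
    return re.sub(pattern, "", text)

sample = "Well this was fun! What do you think? 123#@!"
print(remove_special_characters(sample))
# Well this was fun What do you think 123
print(remove_special_characters(sample, remove_digits=True))
```

Because the pattern only keeps ASCII letters, this step is best run after accented characters have been converted, otherwise those letters would be dropped as well.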
Stemming
To understand stemming, you need to gain some perspective on what word stems
represent. Word stems are also known as the base form of a word, and we can create
new words by attaching affixes to them in a process known as inflection. Consider the
word JUMP. You can add affixes to it and form new words like JUMPS, JUMPED, and
JUMPING. In this case, the base word JUMP is the word stem.
Word stem and its inflections (Source: Text Analytics with Python, Apress/Springer 2016)
The figure shows how the word stem is present in all its inflections, since it forms the
base on which each inflection is built using affixes. The reverse process of
obtaining the base form of a word from its inflected form is known as stemming.
Stemming helps us standardize words to their base or root stem, irrespective of
their inflections, which helps many applications like classifying or clustering text, and
even information retrieval. Let’s see the popular Porter stemmer in action now!
def simple_stemmer(text):
    ps = nltk.stem.PorterStemmer()
    # stem each whitespace-separated token and stitch the text back together
    text = ' '.join([ps.stem(word) for word in text.split()])
    return text

simple_stemmer("My system keeps crashing his crashed yesterday, ours crashes daily")
The Porter stemmer is based on the algorithm developed by its inventor, Dr. Martin
Porter. Originally, the algorithm is said to have had a total of five different phases for
reduction of inflections to their stems, where each phase has its own set of rules.
Do note that stemming usually relies on a fixed set of rules; hence, the root stems may not be
lexicographically correct. This means the stemmed words may not be semantically
correct, and might not be present in the dictionary (as evident from
the preceding output).
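To see why fixed rules can produce stems that are not dictionary words, here is a deliberately tiny suffix-stripping sketch. It is not the Porter algorithm; the four suffix rules and the minimum stem length are assumptions made purely for illustration.

```python
# Hypothetical rule list, checked in order; the real Porter stemmer has far more rules.
SUFFIX_RULES = ["ing", "ed", "es", "s"]

def toy_stem(word):
    for suffix in SUFFIX_RULES:
        # Strip the first matching suffix, as long as at least two characters remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[:-len(suffix)]
    return word

def toy_stemmer(text):
    return " ".join(toy_stem(word) for word in text.split())

print(toy_stemmer("My system keeps crashing his crashed yesterday, ours crashes daily"))
# My system keep crash hi crash yesterday, our crash daily
```

The rules happily turn his into hi and ours into our, because nothing checks the result against a vocabulary.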
Lemmatization
Lemmatization is very similar to stemming, where we remove word affixes to get to the
base form of a word. However, the base form in this case is known as the root word, not
the root stem. The difference is that the root word is always a
lexicographically correct word (present in the dictionary), but the root stem may not
be so. Thus, the root word, also known as the lemma, will always be present in the
dictionary. Both nltk and spacy have excellent lemmatizers. We will be using spacy
here.
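# assumes nlp is a spacy language model loaded earlier with spacy.load(...)
def lemmatize_text(text):
    text = nlp(text)
    # spacy 2.x lemmatizes pronouns to '-PRON-'; keep the original token text for those
    text = ' '.join([word.lemma_ if word.lemma_ != '-PRON-' else word.text for word in text])
    return text

lemmatize_text("My system keeps crashing! his crashed yesterday, ours crashes daily")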
'My system keep crash ! his crash yesterday , ours crash daily'
You can see that the semantics of the words are not affected by this, yet our text is still
standardized.
Do note that the lemmatization process is considerably slower than stemming, because an
additional step is involved where the root form or lemma is formed by removing the affix
from the word if and only if the lemma is present in the dictionary.
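The dictionary check described above can be sketched in a few lines. This is a toy illustration only: the two-word dictionary and the suffix list are hypothetical, whereas real lemmatizers consult a full lexicon and usually take the part of speech into account.

```python
# Hypothetical mini-dictionary standing in for a real lexicon.
TOY_DICTIONARY = {"crash", "keep"}
SUFFIXES = ["ing", "ed", "es", "s"]

def toy_lemmatize(word):
    lower = word.lower()
    for suffix in SUFFIXES:
        if lower.endswith(suffix):
            candidate = lower[:-len(suffix)]
            # Accept the stripped form only if it is a known dictionary word.
            if candidate in TOY_DICTIONARY:
                return candidate
    return word

text = "My system keeps crashing his crashed yesterday ours crashes daily"
print(" ".join(toy_lemmatize(w) for w in text.split()))
# My system keep crash his crash yesterday ours crash daily
```

The extra lookup is what keeps his and ours intact here, and it is also part of why lemmatization costs more than plain rule-based stemming.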
Removing Stopwords
Words which have little or no significance, especially when constructing meaningful
features from text, are known as stopwords or stop words. These are usually the words that
end up having the highest frequency if you compute simple term or word frequencies over a
corpus. Typically, these can be articles, conjunctions, prepositions and so on. Some
examples of stopwords are a, an, the, and the like.
remove_stopwords("The, and, if are stopwords, computer is not")
There is no universal stopword list, but we use a standard English language stopwords
list from nltk. You can also add your own domain-specific stopwords as needed.
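The snippet below is a self-contained sketch of stopword removal. It assumes a small, hypothetical stopword set instead of the nltk list, and a simple regular-expression tokenizer instead of the ToktokTokenizer, so it only approximates the behavior of the function used in this article. Negation words such as no and not are deliberately left out of the set so that they survive.

```python
import re

# Hypothetical mini stopword list; "no" and "not" are intentionally excluded.
TOY_STOPWORDS = {"a", "an", "the", "and", "if", "are", "is", "of", "to", "in"}

def remove_stopwords_sketch(text, stopwords=TOY_STOPWORDS, is_lower_case=False):
    # Split into word tokens and single punctuation marks.
    tokens = re.findall(r"\w+|[^\w\s]", text)
    if is_lower_case:
        kept = [t for t in tokens if t not in stopwords]
    else:
        kept = [t for t in tokens if t.lower() not in stopwords]
    return " ".join(kept)

print(remove_stopwords_sketch("The, and, if are stopwords, computer is not"))
# , , stopwords , computer not
```

Adding domain-specific stopwords is then just a matter of passing a larger set through the stopwords parameter.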
Let’s now put this function in action! We will first combine the news headline and the
news article text together to form a document for each piece of news. Then, we will
pre-process them.
Thus, you can see how our text pre-processor helps in pre-processing our news articles!
After this, you can save this dataset to disk if needed, so that you can always load it up
later for future analysis.
Knowledge about the structure and syntax of language is helpful in many areas like text
processing, annotation, and parsing for further operations such as text classification or
summarization. Typical parsing techniques for understanding text syntax are
mentioned below.
Constituency Parsing
Dependency Parsing
N(oun): This usually denotes words that depict some object or entity, which may
be living or nonliving. Some examples would be fox, dog, book, and so on. The
POS tag symbol for nouns is N.
V(erb): Verbs are words that are used to describe certain actions, states, or
occurrences. There are a wide variety of further subcategories, such as auxiliary,
reflexive, and transitive verbs (and many more). Some typical examples of verbs
would be running, jumping, read, and write. The POS tag symbol for verbs is V.
Adj(ective): Adjectives are words used to describe or qualify other words, typically
nouns and noun phrases. The phrase beautiful flower has the noun (N) flower,
which is described or qualified using the adjective (ADJ) beautiful. The POS tag
symbol for adjectives is ADJ.
Adv(erb): Adverbs usually act as modifiers for other words including nouns,
adjectives, verbs, or other adverbs. The phrase very beautiful flower has the adverb
(ADV) very, which modifies the adjective (ADJ) beautiful, indicating the degree to
which the flower is beautiful. The POS tag symbol for adverbs is ADV.
Besides these four major categories of parts of speech, there are other categories that
occur frequently in the English language. These include pronouns, prepositions,
interjections, conjunctions, determiners, and many others. Furthermore, each POS tag
like the noun (N) can be further subdivided into categories like singular nouns (NN),
singular proper nouns (NNP), and plural nouns (NNS).
The process of classifying and labeling POS tags for words is called parts of speech tagging
or POS tagging. POS tags are used to annotate words and depict their POS, which is
really helpful for performing specific analyses, such as narrowing down on nouns and
seeing which ones are the most prominent, word sense disambiguation, and grammar
analysis. We will be leveraging both nltk and spacy, which usually use the Penn
Treebank notation for POS tagging.
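As a small sketch of how POS tags support this kind of analysis, the snippet below counts the most frequent nouns in a list of already tagged tokens. The (word, tag) pairs are written by hand for illustration and stand in for the output of a real tagger; the helper name most_common_nouns is likewise hypothetical.

```python
from collections import Counter

# Hand-written (word, Penn Treebank tag) pairs standing in for real tagger output.
tagged_tokens = [
    ("US", "NNP"), ("unveils", "VBZ"), ("world", "NN"), ("'s", "POS"),
    ("most", "RBS"), ("powerful", "JJ"), ("supercomputer", "NN"),
    (",", ","), ("beats", "VBZ"), ("China", "NNP"),
    ("supercomputer", "NN"), ("runs", "VBZ"), ("simulations", "NNS"),
]

def most_common_nouns(tagged, top_n=3):
    # All Penn Treebank noun tags (NN, NNS, NNP, NNPS) start with "NN".
    nouns = [word.lower() for word, tag in tagged if tag.startswith("NN")]
    return Counter(nouns).most_common(top_n)

print(most_common_nouns(tagged_tokens, top_n=1))
# [('supercomputer', 2)]
```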
We can see that each of these libraries treats tokens in its own way and assigns specific
tags to them. Based on what we see, spacy seems to be doing slightly better than nltk.
Noun phrase (NP): These are phrases where a noun acts as the head word. Noun
phrases act as a subject or object to a verb.
Verb phrase (VP): These phrases are lexical units that have a verb acting as the
head word. Usually, there are two forms of verb phrases. One form has the verb
components as well as other entities such as nouns, adjectives, or adverbs as parts
of the object.
Adjective phrase (ADJP): These are phrases with an adjective as the head word.
Their main role is to describe or qualify nouns and pronouns in a sentence, and
they will be either placed before or after the noun or pronoun.
Adverb phrase (ADVP): These phrases act like adverbs since the adverb acts as the
head word in the phrase. Adverb phrases are used as modifiers for nouns, verbs, or
adverbs themselves by providing further details that describe or qualify them.
We will leverage the conll2000 corpus for training our shallow parser model. This
corpus is available in nltk with chunk annotations and we will be using around 10K
records for training our model. A sample annotated sentence is depicted as follows.
10900 48
(S
Chancellor/NNP
(PP of/IN)
(NP the/DT Exchequer/NNP)
(NP Nigel/NNP Lawson/NNP)
(NP 's/POS restated/VBN commitment/NN)
(PP to/TO)
(NP a/DT firm/NN monetary/JJ policy/NN)
(VP has/VBZ helped/VBN to/TO prevent/VB)
(NP a/DT freefall/NN)
(PP in/IN)
(NP sterling/NN)
(PP over/IN)
(NP the/DT past/JJ week/NN)
./.)
From the preceding output, you can see that our data points are sentences that are
already annotated with phrases and POS tags metadata that will be useful in training
our shallow parser model. We will leverage two chunking utility functions:
tree2conlltags, to get triples of word, tag, and chunk tags for each token, and
conlltags2tree, to generate a parse tree from these token triples. We will be using these
functions to train our parser. A sample is depicted below.
The chunk tags use the IOB format. This notation represents Inside, Outside, and
Beginning. The B- prefix before a tag indicates it is the beginning of a chunk, and I-
prefix indicates that it is inside a chunk. The O tag indicates that the token does not
belong to any chunk. The B- tag is always used when there are subsequent tags of the
same type following it without the presence of O tags between them.
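To make the IOB notation concrete, here is a small pure-Python sketch that flattens a list of chunks into (word, POS tag, chunk tag) triples. It only illustrates the tagging scheme and is not the nltk implementation of tree2conlltags; the chunks_to_iob helper and its input layout are assumptions made for this example, and the tokens are taken from the start of the annotated sentence shown earlier.

```python
def chunks_to_iob(chunks):
    """Flatten [(label_or_None, [(word, pos), ...]), ...] into (word, pos, iob) triples."""
    triples = []
    for label, tokens in chunks:
        for i, (word, pos) in enumerate(tokens):
            if label is None:
                iob = "O"             # token outside any chunk
            elif i == 0:
                iob = "B-" + label    # first token of a chunk
            else:
                iob = "I-" + label    # token inside the same chunk
            triples.append((word, pos, iob))
    return triples

chunks = [
    (None, [("Chancellor", "NNP")]),
    ("PP", [("of", "IN")]),
    ("NP", [("the", "DT"), ("Exchequer", "NNP")]),
    ("NP", [("Nigel", "NNP"), ("Lawson", "NNP")]),
]

for triple in chunks_to_iob(chunks):
    print(triple)
```

Note how the B-NP tag on Nigel is what separates the two adjacent noun phrases, since no O tag sits between them.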
We will now define a function conll_tag_chunks() to extract POS and chunk tags from
sentences with chunked annotations and a function called combined_tagger() to train
multiple taggers with backoff taggers (e.g. unigram and bigram taggers).
from nltk.chunk.util import tree2conlltags, conlltags2tree

def conll_tag_chunks(chunk_sents):
    # turn each chunk tree into (word, POS tag, chunk tag) triples,
    # then keep only the (POS tag, chunk tag) pairs for training
    tagged_sents = [tree2conlltags(tree) for tree in chunk_sents]
    return [[(t, c) for (w, t, c) in sent] for sent in tagged_sents]


def combined_tagger(train_data, taggers, backoff=None):
    # train each tagger in turn, passing the previous one as its backoff
    for tagger in taggers:
        backoff = tagger(train_data, backoff=backoff)
    return backoff
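To see what chaining taggers with a backoff actually does, the following sketch re-implements the idea with two toy taggers in plain Python. ToyUnigramTagger and ToyBigramTagger are hypothetical classes written for this illustration, not the nltk taggers, and the two training sentences of (POS tag, chunk tag) pairs are made up.

```python
from collections import Counter, defaultdict

class ToyUnigramTagger:
    """Predicts the most frequent label seen for each token in training."""
    def __init__(self, train_sents, backoff=None):
        counts = defaultdict(Counter)
        for sent in train_sents:
            for token, label in sent:
                counts[token][label] += 1
        self.table = {key: c.most_common(1)[0][0] for key, c in counts.items()}
        self.backoff = backoff

    def tag_one(self, tokens, index, history):
        label = self.table.get(tokens[index])
        if label is None and self.backoff is not None:
            label = self.backoff.tag_one(tokens, index, history)
        return label

    def tag(self, tokens):
        history = []
        for i in range(len(tokens)):
            history.append(self.tag_one(tokens, i, history))
        return list(zip(tokens, history))

class ToyBigramTagger(ToyUnigramTagger):
    """Uses (previous label, current token) as context; defers to backoff if unseen."""
    def __init__(self, train_sents, backoff=None):
        counts = defaultdict(Counter)
        for sent in train_sents:
            prev = None
            for token, label in sent:
                counts[(prev, token)][label] += 1
                prev = label
        self.table = {key: c.most_common(1)[0][0] for key, c in counts.items()}
        self.backoff = backoff

    def tag_one(self, tokens, index, history):
        prev = history[index - 1] if index > 0 else None
        label = self.table.get((prev, tokens[index]))
        if label is None and self.backoff is not None:
            label = self.backoff.tag_one(tokens, index, history)
        return label

def combined_tagger(train_data, taggers, backoff=None):
    for tagger in taggers:
        backoff = tagger(train_data, backoff=backoff)
    return backoff

# Made-up training data: (POS tag, chunk tag) pairs per sentence.
train = [
    [("DT", "B-NP"), ("NN", "I-NP"), ("VBZ", "B-VP"), ("DT", "B-NP"), ("NN", "I-NP")],
    [("NNP", "B-NP"), ("VBZ", "B-VP"), ("IN", "B-PP"), ("NN", "B-NP")],
]
chunk_tagger = combined_tagger(train, [ToyUnigramTagger, ToyBigramTagger])
print(chunk_tagger.tag(["DT", "NN", "VBZ", "NNP"]))
```

In the last position the bigram context (B-VP, NNP) was never seen in training, so the bigram tagger falls back to the unigram tagger, which still knows that NNP usually starts a noun phrase.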
We will now define a class NGramTagChunker that will take in tagged sentences as
training input, get their (word, POS tag, Chunk tag) WTC triples, and train a
BigramTagger with a UnigramTagger as the backoff tagger. We will also define a parse()
The UnigramTagger, BigramTagger, and TrigramTagger are classes that inherit from the base
class NGramTagger, which itself inherits from the ContextTagger class, which inherits from
the SequentialBackoffTagger class.
We will use this class to train on the conll2000 chunked train_data and evaluate the
model performance on the test_data.
print(ntc.evaluate(test_data))
ChunkParse score:
IOB Accuracy: 90.0%%
Precision: 82.1%%
Recall: 86.3%%
F-Measure: 84.1%%
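As a quick sanity check on these numbers, the F-measure can be recomputed from the reported precision and recall. The sketch assumes the balanced F-measure, i.e. the harmonic mean of the two; the f_measure helper is written here for illustration.

```python
def f_measure(precision, recall):
    # Balanced F-measure: harmonic mean of precision and recall.
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(round(100 * f_measure(0.821, 0.863), 1))
# 84.1
```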
Our chunking model gets an accuracy of around 90% which is quite good! Let’s now
leverage this model to shallow parse and chunk our sample news article headline
which we used earlier, “US unveils world’s most powerful supercomputer, beats
China”.
chunk_tree = ntc.parse(nltk_pos_tagged)
print(chunk_tree)
Output:
-------
(S
(NP US/NNP)
(VP unveils/VBZ world's/VBZ)
(NP most/RBS powerful/JJ supercomputer,/JJ beats/NNS China/NNP))
Thus you can see it has identified two noun phrases (NP) and one verb phrase (VP) in
the news article. Each word’s POS tags are also visible. We can also visualize this in the
form of a tree as follows. You might need to install ghostscript in case nltk throws an
error.
The preceding output gives a good sense of structure after shallow parsing the news
headline.
Constituency Parsing
Constituent-based grammars are used to analyze and determine the constituents of a
sentence. These grammars can be used to model or represent the internal structure of
sentences in terms of a hierarchically ordered structure of their constituents. Every
word usually belongs to a specific lexical category and forms the head
word of different phrases. These phrases are formed based on rules called phrase
structure rules.
Phrase structure rules form the core of constituency grammars, because they talk
about syntax and rules that govern the hierarchy and ordering of the various
constituents in the sentences. These rules cater to two things primarily.
They determine what words are used to construct the phrases or constituents.
The generic representation of a phrase structure rule is S → AB, which depicts that the
structure S consists of constituents A and B, and the ordering is A followed by B. While
there are several rules (refer to Chapter 1, Page 19: Text Analytics with Python, if you
want to dive deeper), the most important rule describes how to divide a sentence or a
clause. The phrase structure rule denotes a binary division for a sentence or a clause as
S → NP VP, where S is the sentence or clause, and it is divided into the subject, denoted
by the noun phrase (NP), and the predicate, denoted by the verb phrase (VP).
A constituency parser can be built based on such grammars/rules, which are usually
collectively available as context-free grammar (CFG) or phrase-structured grammar.
The parser will process input sentences according to these rules, and help in building a
parse tree.
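The idea of parsing with phrase structure rules can be sketched with a toy grammar and a naive top-down parser. Everything in this snippet is an assumption made for illustration: the three rules, the five-word lexicon and the parse helper are hypothetical and far simpler than the grammar a real constituency parser uses.

```python
# Toy phrase structure rules: each left-hand side maps to its possible expansions.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["DT", "NN"], ["NNP"]],
    "VP": [["VBZ", "NP"]],
}
# Toy lexicon mapping words to POS tags (the pre-terminals of the grammar).
LEXICON = {"the": "DT", "fox": "NN", "dog": "NN", "chases": "VBZ", "US": "NNP"}

def parse(symbol, tokens, start):
    """Yield (tree, end) for every way `symbol` can cover tokens[start:end]."""
    if symbol not in GRAMMAR:
        # Pre-terminal: match a single word with the right POS tag.
        if start < len(tokens) and LEXICON.get(tokens[start]) == symbol:
            yield (symbol, tokens[start]), start + 1
        return
    for rhs in GRAMMAR[symbol]:
        # Expand the constituents of this rule from left to right.
        partial = [([], start)]
        for child in rhs:
            partial = [(kids + [tree], end)
                       for kids, pos in partial
                       for tree, end in parse(child, tokens, pos)]
        for kids, end in partial:
            yield (symbol,) + tuple(kids), end

tokens = "the fox chases the dog".split()
trees = [tree for tree, end in parse("S", tokens, 0) if end == len(tokens)]
print(trees[0])
```

The resulting nested tuple mirrors the rule S → NP VP: the first child is the subject noun phrase and the second is the predicate verb phrase. This naive strategy would loop forever on left-recursive rules, which is one reason practical parsers use other algorithms.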
We will be using nltk and the StanfordParser here to generate parse trees.
Prerequisites: Download the official Stanford Parser from here, which seems to work
quite well. You can try out a later version by going to this website and checking the
Release History section. After downloading, unzip it to a known location in your
filesystem. Once done, you are ready to use the parser from nltk, which we will be
exploring soon.
(ROOT
(SINV
(S
(NP (NNP US))
(VP
(VBZ unveils)
(NP
(NP (NN world) (POS 's))
(ADJP (RBS most) (JJ powerful))
(NN supercomputer))))
(, ,)
(VP (VBZ beats))
(NP (NNP China))))
We can see the constituency parse tree for our news headline. Let’s visualize it to
understand the structure better.
We can see the nested hierarchical structure of the constituents in the preceding output
as compared to the flat structure in shallow parsing. In case you are wondering what
SINV means, it represents an inverted declarative sentence, i.e. one in which the subject
follows the tensed verb or modal. Refer to the Penn Treebank reference as needed to
look up other tags.
Dependency Parsing
Considering our sentence “The brown fox is quick and he is jumping over the lazy
dog”, if we wanted to draw the dependency syntax tree for this, we would have the
following structure.
These dependency relationships each have their own meaning and are a part of a list of
universal dependency types. This is discussed in the original paper, Universal Stanford
Dependencies: A Cross-Linguistic Typology (de Marneffe et al., 2014). You can check out
the exhaustive list of dependency types and their meanings here.
The dependency tag det is pretty intuitive: it denotes the determiner relationship
between a nominal head and the determiner. Usually, the word with POS tag DET
will also have the det dependency tag relation. Examples include fox → the and dog
→ the.
The dependency tag amod stands for adjectival modifier and denotes any
adjective that modifies the meaning of a noun. Examples include fox → brown and
dog → lazy.
The dependency tag nsubj stands for an entity that acts as a subject or agent in a
clause. Examples include is → fox and jumping → he.
The dependencies cc and conj have more to do with linkages related to words
connected by coordinating conjunctions. Examples include is → and and is →
jumping.
The dependency tag aux indicates the auxiliary or secondary verb in the clause.
Example: jumping → is.
The dependency tag acomp stands for adjective complement and acts as the
complement or object to a verb in the sentence. Example: is → quick.
The dependency tag prep denotes a prepositional modifier, which usually modifies
the meaning of a noun, verb, adjective, or preposition. Usually, this representation
is used for prepositions having a noun or noun phrase complement. Example:
jumping → over.
The dependency tag pobj is used to denote the object of a preposition. This is
usually the head of a noun phrase following a preposition in the sentence.
Example: over → dog.
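The relations listed above can be written down as (head, relation, dependent) arcs, which makes it easy to inspect the structure programmatically. The snippet is a hand-built sketch: the arcs are transcribed from the examples above using token positions (so that the two occurrences of is and the can be told apart), and the find_root and children helpers are hypothetical names, not part of spacy or nltk.

```python
tokens = "The brown fox is quick and he is jumping over the lazy dog".split()

# (head index, relation, dependent index), transcribed from the examples above.
arcs = [
    (2, "det", 0), (2, "amod", 1), (3, "nsubj", 2), (3, "acomp", 4),
    (3, "cc", 5), (3, "conj", 8), (8, "nsubj", 6), (8, "aux", 7),
    (8, "prep", 9), (9, "pobj", 12), (12, "det", 10), (12, "amod", 11),
]

def find_root(tokens, arcs):
    # The root is the only token that never appears as a dependent.
    dependents = {dep for _, _, dep in arcs}
    return [i for i in range(len(tokens)) if i not in dependents]

def children(head, arcs):
    return [(rel, dep) for h, rel, dep in arcs if h == head]

root = find_root(tokens, arcs)[0]
print(tokens[root])
for rel, dep in children(root, arcs):
    print(f"{tokens[root]} -> {tokens[dep]} [{rel}]")
```

Every token except the root has exactly one head, which is why looking for the token that never appears as a dependent is enough to find the root.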
Spacy had two types of English dependency parsers, depending on which language model
you use; you can find more details here. Depending on the language model, you can use the
Universal Dependencies Scheme or the CLEAR Style Dependency Scheme, which is also
available in NLP4J now. We will now leverage spacy and print out the dependencies for
each token in our news headline.
dependency_pattern = '{left}<---{word}[{w_type}]--->{right}\n--------'
for token in sentence_nlp:
    # show each token with its dependency label and its left/right children
    print(dependency_pattern.format(word=token.orth_,
                                    w_type=token.dep_,
                                    left=[t.orth_ for t in token.lefts],
                                    right=[t.orth_ for t in token.rights]))
[]<---US[compound]--->[]
--------
['US']<---unveils[nsubj]--->['supercomputer', ',']
--------
[]<---world[poss]--->["'s"]
--------
[]<---'s[case]--->[]
--------
[]<---most[amod]--->[]
--------
[]<---powerful[compound]--->[]
--------
['world', 'most', 'powerful']<---supercomputer[appos]--->[]
--------
[]<---,[punct]--->[]
--------
['unveils']<---beats[ROOT]--->['China']
--------
[]<---China[dobj]--->[]
--------
It is evident that the verb beats is the ROOT, since it is the only token that does not
appear as a dependent (child) of any other token. To learn more about each
annotation, you can always refer to the CLEAR dependency scheme. We can also
visualize the above dependencies in a better way.
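To make the ROOT observation concrete, here is a minimal sketch in plain Python (no spaCy required) that re-derives the root from the left and right children printed above. The children dictionary is copied by hand from that output, and find_root is a hypothetical helper for illustration, not part of spaCy. In spaCy itself the root is simply the token whose dep_ is ROOT.

```python
# Children of each token, copied by hand from the printed output above
# (left children followed by right children). Keying by the token text
# only works here because every token in the headline is unique.
children = {
    "US": [],
    "unveils": ["US", "supercomputer", ","],
    "world": ["'s"],
    "'s": [],
    "most": [],
    "powerful": [],
    "supercomputer": ["world", "most", "powerful"],
    ",": [],
    "beats": ["unveils", "China"],
    "China": [],
}


def find_root(children):
    """Return the tokens that never appear as a child of another token."""
    dependents = {child for kids in children.values() for child in kids}
    return [token for token in children if token not in dependents]


print(find_root(children))
```

Every token except beats shows up in some other token's child list, so beats is the only candidate for the root.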
You can also leverage nltk and the StanfordDependencyParser to visualize and build out the
dependency tree. We showcase the dependency tree both in its raw and annotated form
as follows.
You can notice the similarities with the tree we had obtained earlier. The annotations
help with understanding the type of dependency among the different tokens.
SpaCy has some excellent capabilities for named entity recognition. Let’s try and use it
on one of our sample news articles.
sentence = str(news_df.iloc[1].full_text)
sentence_nlp = nlp(sentence)

# print named entities in article
print([(word, word.ent_type_) for word in sentence_nlp if word.ent_type_])

# visualize named entities
displacy.render(sentence_nlp, style='ent', jupyter=True)
We can clearly see that the major named entities have been identified by spacy . To
understand in more detail what each named entity type means, you can refer to the
documentation or check out the following table for convenience.
Let’s now find out the most frequent named entities in our news corpus! For this, we
will build out a data frame of all the named entities and their types using the following
code.
named_entities = []
for sentence in corpus:
    temp_entity_name = ''
    temp_named_entity = None
    sentence = nlp(sentence)
    for word in sentence:
        term = word.text
        tag = word.ent_type_
        if tag:
            # still inside an entity: extend the running entity name
            temp_entity_name = ' '.join([temp_entity_name, term]).strip()
            temp_named_entity = (temp_entity_name, tag)
        else:
            # the entity has ended: store it and reset the buffers
            if temp_named_entity:
                named_entities.append(temp_named_entity)
                temp_entity_name = ''
                temp_named_entity = None
    # store an entity that runs up to the very last token of the article
    if temp_named_entity:
        named_entities.append(temp_named_entity)

entity_frame = pd.DataFrame(named_entities,
                            columns=['Entity Name', 'Entity Type'])
We can now transform and aggregate this data frame to find the top occurring entities
and types.
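As a rough illustration of this aggregation step, the sketch below counts entities with the standard library's Counter instead of pandas. The named_entities list here is a toy stand-in with invented values, shaped like the list of (name, type) tuples built by the loop above. It is not the real output on the news corpus.

```python
from collections import Counter

# Toy stand-in for the `named_entities` list built above; these
# (name, type) pairs are invented for illustration only.
named_entities = [
    ("Facebook", "ORG"), ("India", "GPE"), ("Facebook", "ORG"),
    ("Messenger", "PRODUCT"), ("India", "GPE"), ("Facebook", "ORG"),
]

# Count identical (entity name, entity type) pairs, most frequent first.
top_entities = Counter(named_entities).most_common(2)
print(top_entities)

# Count entity types regardless of the entity name.
top_types = Counter(tag for _, tag in named_entities).most_common()
print(top_types)
```

With the entity_frame data frame, a groupby on the two columns followed by size() and a descending sort would give the same counts.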
Do you notice anything interesting? (Hint: Maybe the supposed summit between Trump
and Kim Jong!). We also see that it has correctly identified ‘Messenger’ as a product
(from Facebook).
We can also group by the entity types to get a sense of what types of entities occur most
in our news corpus.
We can see that people, places and organizations are the most mentioned entities
though interestingly we also have many other entities.
Another nice NER tagger is the StanfordNERTagger available from the nltk interface. For
this, you need to have Java installed and then download the Stanford NER resources.
Unzip them to a location of your choice (I used E:/stanford in my system).
Top named entities and types from Stanford NER on our news corpus
We notice quite similar results, though restricted to only three types of named entities.
Interestingly, we see a number of mentions of several people in various sports.
Usually, sentiment analysis works better on text that has a subjective context than on text
with only an objective context. Objective text usually states plain facts
without expressing any emotion, feelings, or mood. Subjective text is
usually written by a human expressing typical moods, emotions, and feelings.
Sentiment analysis is widely used, especially as a part of social media analysis for any
domain, be it a business, a recent movie, or a product launch, to understand its
reception by the people and what they think of it based on their opinions or, you
guessed it, sentiment!
Typically, sentiment analysis for text data can be computed on several levels, including
on an individual sentence level, paragraph level, or the entire document as a whole.
Often, sentiment is computed on the document as a whole or some aggregations are
done after computing the sentiment for individual sentences. There are two major
approaches to sentiment analysis.
For the first approach (supervised machine learning) we typically need pre-labeled data.
Hence, we will be focusing on the second approach (unsupervised, lexicon-based). For
comprehensive coverage of sentiment analysis, refer to
Chapter 7: Analyzing Movie Reviews Sentiment, Practical Machine Learning with Python,
Springer/Apress, 2018. In this scenario, we do not have the convenience of a
well-labeled training dataset. Hence, we will need to use unsupervised techniques for
predicting the sentiment by using knowledgebases, ontologies, databases, and lexicons
that have detailed information, specially curated and prepared just for sentiment
analysis. A lexicon is a dictionary, vocabulary, or a book of words. In our case, lexicons
are special dictionaries or vocabularies that have been created for analyzing
sentiments. Most of these lexicons have a list of positive and negative polar words with
some score associated with them, and using various techniques like the position of
words, surrounding words, context, parts of speech, phrases, and so on, scores are
assigned to the text documents for which we want to compute the sentiment. After
aggregating these scores, we get the final sentiment.
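The idea of scoring text against a lexicon can be sketched in a few lines of plain Python. Everything in this example is an assumption made for illustration: TOY_LEXICON and its scores are invented (they are not taken from AFINN, VADER, or any real lexicon), and lexicon_score and sentiment_label are hypothetical helper names. The sketch only sums word scores, so it ignores word position, context, and negation, which real lexicon-based analyzers may take into account.

```python
import re

# Hypothetical miniature lexicon: the words and scores are made up for
# illustration and are NOT taken from AFINN, VADER, or any real lexicon.
TOY_LEXICON = {"good": 2, "great": 3, "win": 2,
               "bad": -2, "scam": -3, "loss": -2}


def lexicon_score(text, lexicon=TOY_LEXICON):
    """Sum the polarity scores of every known word in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(lexicon.get(word, 0) for word in words)


def sentiment_label(score):
    """Map an aggregated score to a coarse sentiment category."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


for headline in ["Great win for the home team",
                 "Phone scam leads to heavy loss",
                 "Company releases quarterly report"]:
    score = lexicon_score(headline)
    print(headline, "->", score, sentiment_label(score))
```

Words missing from the lexicon contribute a score of zero, which is why a purely factual headline ends up neutral.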
Various popular lexicons are used for sentiment analysis, including the following.
AFINN lexicon
SentiWordNet
VADER lexicon
TextBlob lexicon
This is not an exhaustive list of lexicons that can be leveraged for sentiment analysis,
and there are several other lexicons which can be easily obtained from the Internet.
Feel free to check out each of these links and explore them. We will be covering two
techniques in this section.
The following code computes sentiment for all our news articles and shows summary
statistics of general sentiment per news category.
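A minimal sketch of this kind of per-category summary, using only the standard library, might look as follows. The scored_articles list is invented for illustration. In practice the scores would come from a lexicon-based analyzer applied to each article, and a pandas groupby followed by describe() would produce a similar table.

```python
from collections import defaultdict
from statistics import mean

# Invented (news_category, sentiment_score) pairs, for illustration only.
scored_articles = [
    ("sports", 4.0), ("sports", -1.0), ("sports", 6.0),
    ("technology", -3.0), ("technology", -1.0), ("technology", 1.0),
    ("world", 2.0), ("world", 0.0),
]

# Group the scores by news category.
scores_by_category = defaultdict(list)
for category, score in scored_articles:
    scores_by_category[category].append(score)

# Summary statistics per category: count, mean, min and max.
summary = {
    category: {
        "count": len(scores),
        "mean": mean(scores),
        "min": min(scores),
        "max": max(scores),
    }
    for category, scores in scores_by_category.items()
}

for category, stats in summary.items():
    print(category, stats)
```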
We can get a good idea of general sentiment statistics across different news categories.
Looks like the average sentiment is very positive in sports and reasonably negative in
technology! Let’s look at some visualizations now.
We can see that the spread of sentiment polarity is much higher in sports and world as
compared to technology, where a lot of the articles seem to have a negative
polarity. We can also visualize the frequency of sentiment labels.
# catplot is the current name of factorplot (renamed in seaborn 0.9)
fc = sns.catplot(x="news_category", hue="sentiment_category",
                 data=df, kind="count",
                 palette={"negative": "#FE2020",
                          "positive": "#BADD07",
                          "neutral": "#68BFF5"})
No surprises here that technology has the largest number of negative articles and world
the largest number of positive articles. Sports might have more neutral articles due to
the presence of articles which are more objective in nature (talking about sporting
events without the presence of any emotion or feelings). Let’s dive deeper into the most
positive and negative sentiment news articles for technology news.
Looks like the most negative article is all about a recent smartphone scam in India and
the most positive article is about a contest to get married in a self-driving shuttle.
Interesting! Let’s do a similar analysis for world news.
Interestingly, Trump features in both the most positive and the most negative world
news articles. Do read the articles to get some more perspective on why the model
selected one of them as the most negative and the other one as the most positive (no
surprises here!).
Looks like the average sentiment is the most positive in world and least positive in
technology! However, these metrics might be indicating that the model is predicting
more articles as positive. Let’s look at the sentiment frequency distribution per news
category.
# catplot is the current name of factorplot (renamed in seaborn 0.9)
fc = sns.catplot(x="news_category", hue="sentiment_category",
                 data=df, kind="count",
                 palette={"negative": "#FE2020",
                          "positive": "#BADD07",
                          "neutral": "#68BFF5"})
There definitely seem to be more positive articles across the news categories here as
compared to our previous model. However, it still looks like technology has the most
negative articles and world the most positive articles, similar to our previous analysis.
Let’s now do a comparative analysis and see if we still get similar articles in the most
positive and negative categories for world news.
Well, looks like the most negative world news article here is even more depressing than
what we saw the last time! The most positive article is still the same as what we had
obtained in our last model.
Finally, we can even compare these two models to see how many
predictions match and how many do not (by leveraging a confusion matrix,
which is often used in classification). We leverage our nifty model_evaluation_utils
In the preceding table, the ‘Actual’ labels are predictions from the Afinn sentiment
analyzer and the ‘Predicted’ labels are predictions from TextBlob . Looks like our
previous assumption was correct. TextBlob definitely predicts several neutral and
negative articles as positive. Overall most of the sentiment predictions seem to match,
which is good!
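To illustrate how such a comparison table is built, here is a small standard-library sketch of a confusion matrix between two sets of labels. It is not the model_evaluation_utils module mentioned above: confusion_matrix is a hypothetical helper, and both label lists are invented stand-ins for the Afinn and TextBlob predictions.

```python
from collections import Counter

LABELS = ["negative", "neutral", "positive"]


def confusion_matrix(actual, predicted, labels=LABELS):
    """Rows are labels from the first model, columns from the second."""
    pair_counts = Counter(zip(actual, predicted))
    return [[pair_counts[(row, col)] for col in labels] for row in labels]


# Invented label sequences standing in for the two models' outputs.
afinn_labels = ["positive", "negative", "neutral", "negative", "positive"]
textblob_labels = ["positive", "positive", "positive", "negative", "positive"]

matrix = confusion_matrix(afinn_labels, textblob_labels)
for label, row in zip(LABELS, matrix):
    print(f"{label:>8}: {row}")

# Agreement is the share of articles that fall on the diagonal.
agreement = sum(matrix[i][i] for i in range(len(LABELS))) / len(afinn_labels)
print("agreement:", agreement)
```

Off-diagonal cells in the last column count articles that the first model labels negative or neutral but the second labels positive, which is the pattern described for TextBlob above.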
Conclusion
This was definitely one of my longer articles! If you are reading this, I really commend
your efforts for staying with me till the end of this article. These examples should give
you a good idea about how to start working with a corpus of text documents and
popular strategies for text retrieval, pre-processing, parsing, understanding structure,
entities and sentiment. We will be covering feature engineering and representation
techniques with hands-on examples in the next article of this series. Stay tuned!
. . .
All the code and datasets used in this article can be accessed from my GitHub.
I often mentor and help students at Springboard to learn essential skills around Data
Science. Thanks to them for helping me develop this content. Do check out Springboard’s
DSC bootcamp if you are interested in a career-focused structured path towards learning
Data Science.
A lot of this code comes from the research and work that I did while writing my
book “Text Analytics with Python”. The code is open-sourced on GitHub. (Python 3.x
edition coming by end of this year!)
“Practical Machine Learning with Python”, my other book also covers text
classification and sentiment analysis in detail. The code is open-sourced on GitHub for
your convenience.
If you have any feedback, comments or interesting insights to share about my article or
data science in general, feel free to reach out to me on my LinkedIn social media
channel.