Final Summary NLP

Introduction to

regular expressions
I N T R O D U C T I O N T O N AT U R A L L A N G U A G E P R O C E S S I N G I N P Y T H O N

Katharine Jarmul
Founder, kjamistan
What is Natural Language Processing?
Field of study focused on making sense of language
Using statistics and computers

You will learn the basics of NLP


Topic identification

Text classification

NLP applications include:


Chatbots

Translation

Sentiment analysis

... and many more!

INTRODUCTION TO NATURAL LANGUAGE PROCESSING IN PYTHON


What exactly are regular expressions?
Strings with a special syntax

Allow us to match patterns in other strings

Applications of regular expressions:
→ Find all web links in a document
→ Parse email addresses
→ Remove/replace unwanted characters

import re
re.match('abc', 'abcdef')

<_sre.SRE_Match object; span=(0, 3), match='abc'>

word_regex = '\w+'
re.match(word_regex, 'hi there!')

<_sre.SRE_Match object; span=(0, 2), match='hi'>

INTRODUCTION TO NATURAL LANGUAGE PROCESSING IN PYTHON


Common regex patterns

pattern     matches           example
\w+         word              'Magic'
\d          digit             9
\s          space             ' '
.*          wildcard          'username74'
+ or *      greedy match      'aaaaaa'
\S          not space         'no_spaces'
[a-z]       lowercase group   'abcdefg'

INTRODUCTION TO NATURAL LANGUAGE PROCESSING IN PYTHON


Python's re module
re module

split : split a string on regex

findall : find all patterns in a string

search : search for a pattern

match : match an entire string or substring based on a pattern

Pattern first, and the string second

May return an iterator, string, or match object

re.split('\s+', 'Split on spaces.')

['Split', 'on', 'spaces.']
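The other listed functions take their arguments the same way, pattern first; a minimal sketch of findall and search (the example strings are illustrative):

import re

# findall returns every non-overlapping match as a list of strings
re.findall(r'\w+', 'Which words appear here?')

['Which', 'words', 'appear', 'here']

# search scans the whole string; match only matches at the beginning
re.search(r'\d+', 'abc 123')

<_sre.SRE_Match object; span=(4, 7), match='123'>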

INTRODUCTION TO NATURAL LANGUAGE PROCESSING IN PYTHON


Let's practice!
I N T R O D U C T I O N T O N AT U R A L L A N G U A G E P R O C E S S I N G I N P Y T H O N
Introduction to
tokenization
I N T R O D U C T I O N T O N AT U R A L L A N G U A G E P R O C E S S I N G I N P Y T H O N

Katharine Jarmul
Founder, kjamistan
What is tokenization?
Turning a string or document into tokens (smaller chunks)

One step in preparing a text for NLP

Many different theories and rules

You can create your own rules using regular expressions

Some examples:
Breaking out words or sentences

Separating punctuation

Separating all hashtags in a tweet

INTRODUCTION TO NATURAL LANGUAGE PROCESSING IN PYTHON


nltk library
nltk : natural language toolkit

from nltk.tokenize import word_tokenize


word_tokenize("Hi there!")

['Hi', 'there', '!']

INTRODUCTION TO NATURAL LANGUAGE PROCESSING IN PYTHON


Why tokenize?
Easier to map part of speech

Matching common words

Removing unwanted tokens

"I don't like Sam's shoes."

"I", "do", "n't", "like", "Sam", "'s", "shoes", "."

INTRODUCTION TO NATURAL LANGUAGE PROCESSING IN PYTHON


Other nltk tokenizers
sent_tokenize : tokenize a document into sentences

regexp_tokenize : tokenize a string or document based on a regular expression pattern

TweetTokenizer : special class just for tweet tokenization, allowing you to separate hashtags, mentions and lots of exclamation points!!! (quick sketches of the latter two below)
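A minimal sketch of the latter two, with illustrative example strings:

from nltk.tokenize import regexp_tokenize, TweetTokenizer

# Tokenize on a custom pattern: here, runs of word characters
regexp_tokenize("SOLDIER #1: Found them?", r'\w+')

['SOLDIER', '1', 'Found', 'them']

# TweetTokenizer keeps hashtags and mentions as single tokens
tknzr = TweetTokenizer()
tknzr.tokenize("#NLP is #awesome, @friends!")

['#NLP', 'is', '#awesome', ',', '@friends', '!']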

INTRODUCTION TO NATURAL LANGUAGE PROCESSING IN PYTHON


More regex practice
Difference between re.search() and re.match()

import re
re.match('abc', 'abcde')

<_sre.SRE_Match object; span=(0, 3), match='abc'>

re.search('abc', 'abcde')

<_sre.SRE_Match object; span=(0, 3), match='abc'>

re.match('cd', 'abcde')
re.search('cd', 'abcde')

<_sre.SRE_Match object; span=(2, 4), match='cd'>

INTRODUCTION TO NATURAL LANGUAGE PROCESSING IN PYTHON


Let's practice!
I N T R O D U C T I O N T O N AT U R A L L A N G U A G E P R O C E S S I N G I N P Y T H O N
Advanced
tokenization with
regex
I N T R O D U C T I O N T O N AT U R A L L A N G U A G E P R O C E S S I N G I N P Y T H O N

Katharine Jarmul
Founder, kjamistan
Regex groups using or "|"
OR is represented using |

You can define a group using ()

You can define explicit character ranges using []

import re
match_digits_and_words = ('(\d+|\w+)')
re.findall(match_digits_and_words, 'He has 11 cats.')

['He', 'has', '11', 'cats']

INTRODUCTION TO NATURAL LANGUAGE PROCESSING IN PYTHON


Regex ranges and groups
pattern         matches                                        example
[A-Za-z]+       upper and lowercase English alphabet           'ABCDEFghijk'
[0-9]           numbers from 0 to 9                            9
[A-Za-z\-\.]+   upper and lowercase English alphabet, - and .  'My-Website.com'
(a-z)           a, - and z                                     'a-z'
(\s+|,)         spaces or a comma                              ', '

INTRODUCTION TO NATURAL LANGUAGE PROCESSING IN PYTHON


Character range with `re.match()`
import re
my_str = 'match lowercase spaces nums like 12, but no commas'
re.match('[a-z0-9 ]+', my_str)

<_sre.SRE_Match object;
span=(0, 42), match='match lowercase spaces nums like 12'>

INTRODUCTION TO NATURAL LANGUAGE PROCESSING IN PYTHON


Let's practice!
I N T R O D U C T I O N T O N AT U R A L L A N G U A G E P R O C E S S I N G I N P Y T H O N
Charting word
length with nltk
I N T R O D U C T I O N T O N AT U R A L L A N G U A G E P R O C E S S I N G I N P Y T H O N

Katharine Jarmul
Founder, kjamistan
Getting started with matplotlib
Charting library used by many open source Python projects

Straightforward functionality with lots of options


Histograms

Bar charts

Line charts

Scatter plots

... and also advanced functionality like 3D graphs and


animations!

INTRODUCTION TO NATURAL LANGUAGE PROCESSING IN PYTHON


Plotting a histogram with matplotlib
from matplotlib import pyplot as plt
plt.hist([1, 5, 5, 7, 7, 7, 9])

(array([ 1., 0., 0., 0., 0., 2., 0., 3., 0., 1.]),
array([ 1., 1.8, 2.6, 3.4, 4.2, 5., 5.8, 6.6, 7.4, 8.2, 9.]),
<a list of 10 Patch objects>)

plt.show()

INTRODUCTION TO NATURAL LANGUAGE PROCESSING IN PYTHON


Generated histogram

INTRODUCTION TO NATURAL LANGUAGE PROCESSING IN PYTHON


Combining NLP data extraction with plotting
from matplotlib import pyplot as plt
from nltk.tokenize import word_tokenize
words = word_tokenize("This is a pretty cool tool!")
word_lengths = [len(w) for w in words]
plt.hist(word_lengths)

(array([ 2., 0., 1., 0., 0., 0., 3., 0., 0., 1.]),
array([ 1., 1.5, 2., 2.5, 3., 3.5, 4., 4.5, 5., 5.5, 6.]),
<a list of 10 Patch objects>)

plt.show()

INTRODUCTION TO NATURAL LANGUAGE PROCESSING IN PYTHON


Word length histogram

INTRODUCTION TO NATURAL LANGUAGE PROCESSING IN PYTHON


Let's practice!
I N T R O D U C T I O N T O N AT U R A L L A N G U A G E P R O C E S S I N G I N P Y T H O N
Word counts with
bag-of-words
I N T R O D U C T I O N T O N AT U R A L L A N G U A G E P R O C E S S I N G I N P Y T H O N

Katharine Jarmul
Founder, kjamistan
Bag-of-words
Basic method for finding topics in a text

Need to first create tokens using tokenization

... and then count up all the tokens

The more frequent a word, the more important it might be

Can be a great way to determine the significant words in a text

INTRODUCTION TO NATURAL LANGUAGE PROCESSING IN PYTHON


Bag-of-words example
Text: "The cat is in the box. The cat likes the box. The box is
over the cat."

Bag of words (stripped punctuation):


"The": 3, "box": 3

"cat": 3, "the": 3

"is": 2

"in": 1, "likes": 1, "over": 1

INTRODUCTION TO NATURAL LANGUAGE PROCESSING IN PYTHON


Bag-of-words in Python
from nltk.tokenize import word_tokenize
from collections import Counter
counter = Counter(word_tokenize("""The cat is in the box. The cat likes the box.
The box is over the cat."""))
counter

Counter({'.': 3,
'The': 3,
'box': 3,
'cat': 3,
'in': 1,
...
'the': 3})

counter.most_common(2)

[('The', 3), ('box', 3)]

INTRODUCTION TO NATURAL LANGUAGE PROCESSING IN PYTHON


Let's practice!
I N T R O D U C T I O N T O N AT U R A L L A N G U A G E P R O C E S S I N G I N P Y T H O N
Simple text
preprocessing
I N T R O D U C T I O N T O N AT U R A L L A N G U A G E P R O C E S S I N G I N P Y T H O N

Katharine Jarmul
Founder, kjamistan
Why preprocess?
Helps make for better input data
When performing machine learning or other statistical
methods

Examples:
Tokenization to create a bag of words

Lowercasing words

Lemmatization/Stemming
Shorten words to their root stems

Removing stop words, punctuation, or unwanted tokens

Good to experiment with different approaches

INTRODUCTION TO NATURAL LANGUAGE PROCESSING IN PYTHON


Preprocessing example
Input text: Cats, dogs and birds are common pets. So are fish.

Output tokens: cat, dog, bird, common, pet, fish
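A minimal sketch of one way to get from that input to those tokens, assuming nltk's WordNetLemmatizer and English stop-word list (and that the nltk data packages are downloaded; other preprocessing pipelines would differ):

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

text = "Cats, dogs and birds are common pets. So are fish."
lemmatizer = WordNetLemmatizer()

# Lowercase, keep alphabetic tokens, drop stop words, then lemmatize
tokens = [t for t in word_tokenize(text.lower()) if t.isalpha()]
no_stops = [t for t in tokens if t not in stopwords.words('english')]
print([lemmatizer.lemmatize(t) for t in no_stops])

['cat', 'dog', 'bird', 'common', 'pet', 'fish']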

INTRODUCTION TO NATURAL LANGUAGE PROCESSING IN PYTHON


Text preprocessing with Python
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from collections import Counter
text = """The cat is in the box. The cat likes the box.
The box is over the cat."""
tokens = [w for w in word_tokenize(text.lower())
if w.isalpha()]
no_stops = [t for t in tokens
if t not in stopwords.words('english')]
Counter(no_stops).most_common(2)

[('cat', 3), ('box', 3)]

INTRODUCTION TO NATURAL LANGUAGE PROCESSING IN PYTHON


Let's practice!
I N T R O D U C T I O N T O N AT U R A L L A N G U A G E P R O C E S S I N G I N P Y T H O N
Introduction to
gensim
I N T R O D U C T I O N T O N AT U R A L L A N G U A G E P R O C E S S I N G I N P Y T H O N

Katharine Jarmul
Founder, kjamistan
What is gensim?
Popular open-source NLP library

Uses top academic models to perform complex tasks


Building document or word vectors

Performing topic identification and document comparison

INTRODUCTION TO NATURAL LANGUAGE PROCESSING IN PYTHON


What is a word vector?

INTRODUCTION TO NATURAL LANGUAGE PROCESSING IN PYTHON


Gensim example

(Source: http://tlfvincent.github.io/2015/10/23/presidential-speech-topics)

INTRODUCTION TO NATURAL LANGUAGE PROCESSING IN PYTHON


from gensim.corpora.dictionary import Dictionary
from nltk.tokenize import word_tokenize
my_documents = ['The movie was about a spaceship and aliens.',
'I really liked the movie!',
'Awesome action scenes, but boring characters.',
'The movie was awful! I hate alien films.',
'Space is cool! I liked the movie.',
'More space films, please!',]

tokenized_docs = [word_tokenize(doc.lower())
                  for doc in my_documents]
dictionary = Dictionary(tokenized_docs)
dictionary.token2id

{'!': 11,
 ',': 17,
 '.': 7,
 'a': 2,
 'about': 4,
 ...}

INTRODUCTION TO NATURAL LANGUAGE PROCESSING IN PYTHON


Creating a gensim corpus
corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]
corpus

[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1)],
[(0, 1), (1, 1), (9, 1), (10, 1), (11, 1), (12, 1)],
...]

gensim models can be easily saved, updated, and reused

Our dictionary can also be updated

This more advanced and feature rich bag-of-words can be


used in future exercises

INTRODUCTION TO NATURAL LANGUAGE PROCESSING IN PYTHON


Let's practice!
I N T R O D U C T I O N T O N AT U R A L L A N G U A G E P R O C E S S I N G I N P Y T H O N
Tf-idf with gensim
I N T R O D U C T I O N T O N AT U R A L L A N G U A G E P R O C E S S I N G I N P Y T H O N

Katharine Jarmul
Founder, kjamistan
What is tf-idf?
Term frequency - inverse document frequency

Allows you to determine the most important words in each document

Each corpus may have shared words beyond just stopwords

These words should be down-weighted in importance

Example from astronomy: "Sky"

Ensures most common words don't show up as key words

Keeps document-specific frequent words weighted high

INTRODUCTION TO NATURAL LANGUAGE PROCESSING IN PYTHON


Tf-idf formula

w_ij = tf_ij * log(N / df_i)

w_ij = tf-idf weight for token i in document j

tf_ij = number of occurrences of token i in document j

df_i = number of documents that contain token i

N = total number of documents
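A quick worked check of the formula, with illustrative values (the logarithm base varies by implementation; natural log here):

import math

tf = 5    # token i occurs 5 times in document j
df = 10   # 10 documents contain token i
N = 100   # 100 documents in total

w = tf * math.log(N / df)
print(w)  # 5 * log(10) ≈ 11.51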

INTRODUCTION TO NATURAL LANGUAGE PROCESSING IN PYTHON


Tf-idf with gensim
from gensim.models.tfidfmodel import TfidfModel
tfidf = TfidfModel(corpus)
tfidf[corpus[1]]

[(0, 0.1746298276735174),
(1, 0.1746298276735174),
(9, 0.29853166221463673),
(10, 0.7716931521027908),
...
]

INTRODUCTION TO NATURAL LANGUAGE PROCESSING IN PYTHON


Let's practice!
I N T R O D U C T I O N T O N AT U R A L L A N G U A G E P R O C E S S I N G I N P Y T H O N
Named Entity
Recognition
I N T R O D U C T I O N T O N AT U R A L L A N G U A G E P R O C E S S I N G I N P Y T H O N

Katharine Jarmul
Founder, kjamistan
What is Named Entity Recognition?
NLP task to identify important named entities in the text
People, places, organizations

Dates, states, works of art

... and other categories!

Can be used alongside topic identification


... or on its own!

Who? What? When? Where?

INTRODUCTION TO NATURAL LANGUAGE PROCESSING IN PYTHON


Example of NER

(Source: Europeana Newspapers, http://www.europeana-newspapers.eu)

INTRODUCTION TO NATURAL LANGUAGE PROCESSING IN PYTHON


nltk and the Stanford CoreNLP Library
The Stanford CoreNLP library:
Integrated into Python via nltk

Java based

Support for NER as well as coreference and dependency


trees

INTRODUCTION TO NATURAL LANGUAGE PROCESSING IN PYTHON


Using nltk for Named Entity Recognition
import nltk
sentence = '''In New York, I like to ride the Metro to
visit MOMA and some restaurants rated
well by Ruth Reichl.'''
tokenized_sent = nltk.word_tokenize(sentence)
tagged_sent = nltk.pos_tag(tokenized_sent)
tagged_sent[:3]

[('In', 'IN'), ('New', 'NNP'), ('York', 'NNP')]

INTRODUCTION TO NATURAL LANGUAGE PROCESSING IN PYTHON


print(nltk.ne_chunk(tagged_sent))

(S
In/IN
(GPE New/NNP York/NNP)
,/,
I/PRP
like/VBP
to/TO
ride/VB
the/DT
(ORGANIZATION Metro/NNP)
to/TO
visit/VB
(ORGANIZATION MOMA/NNP)
and/CC
some/DT
restaurants/NNS
rated/VBN
well/RB
by/IN
(PERSON Ruth/NNP Reichl/NNP)
./.)

INTRODUCTION TO NATURAL LANGUAGE PROCESSING IN PYTHON


Let's practice!
I N T R O D U C T I O N T O N AT U R A L L A N G U A G E P R O C E S S I N G I N P Y T H O N
Introduction to
SpaCy
I N T R O D U C T I O N T O N AT U R A L L A N G U A G E P R O C E S S I N G I N P Y T H O N

Katharine Jarmul
Founder, kjamistan
What is SpaCy?
NLP library similar to gensim , with different implementations

Focus on creating NLP pipelines to generate models and


corpora

Open-source, with extra libraries and tools


Displacy

INTRODUCTION TO NATURAL LANGUAGE PROCESSING IN PYTHON


Displacy entity recognition visualizer

(source: https://demos.explosion.ai/displacy-ent/)

INTRODUCTION TO NATURAL LANGUAGE PROCESSING IN PYTHON


import spacy
nlp = spacy.load('en_core_web_sm')
nlp.entity

<spacy.pipeline.EntityRecognizer at 0x7f76b75e68b8>

doc = nlp("""Berlin is the capital of Germany;


and the residence of Chancellor Angela Merkel.""")
doc.ents

(Berlin, Germany, Angela Merkel)

print(doc.ents[0], doc.ents[0].label_)

Berlin GPE

INTRODUCTION TO NATURAL LANGUAGE PROCESSING IN PYTHON


Why use SpaCy for NER?
Easy pipeline creation

Different entity types compared to nltk

Informal language corpora


Easily find entities in Tweets and chat messages

Quickly growing!

INTRODUCTION TO NATURAL LANGUAGE PROCESSING IN PYTHON


Let's practice!
I N T R O D U C T I O N T O N AT U R A L L A N G U A G E P R O C E S S I N G I N P Y T H O N
Multilingual NER
with polyglot
I N T R O D U C T I O N T O N AT U R A L L A N G U A G E P R O C E S S I N G I N P Y T H O N

Katharine Jarmul
Founder, kjamistan
What is polyglot?
NLP library which uses word
vectors

Why polyglot ?
Vectors for many different languages

More than 130!

INTRODUCTION TO NATURAL LANGUAGE PROCESSING IN PYTHON


Spanish NER with polyglot
from polyglot.text import Text
text = """El presidente de la Generalitat de Cataluña,
Carles Puigdemont, ha afirmado hoy a la alcaldesa
de Madrid, Manuela Carmena, que en su etapa de
alcalde de Girona (de julio de 2011 a enero de 2016)
hizo una gran promoción de Madrid."""
ptext = Text(text)
ptext.entities

[I-ORG(['Generalitat', 'de']),
I-LOC(['Generalitat', 'de', 'Cataluña']),
I-PER(['Carles', 'Puigdemont']),
I-LOC(['Madrid']),
I-PER(['Manuela', 'Carmena']),
I-LOC(['Girona']),
I-LOC(['Madrid'])]

INTRODUCTION TO NATURAL LANGUAGE PROCESSING IN PYTHON


Let's practice!
I N T R O D U C T I O N T O N AT U R A L L A N G U A G E P R O C E S S I N G I N P Y T H O N
Classifying fake
news using
supervised learning
with NLP
I N T R O D U C T I O N T O N AT U R A L L A N G U A G E P R O C E S S I N G I N P Y T H O N

Katharine Jarmul
Founder, kjamistan
What is supervised learning?
Form of machine learning
Problem has predefined training data

This data has a label (or outcome) you want the model to learn

Classification problem

Goal: Make good hypotheses about the species based on geometric features

Sepal length  Sepal width  Petal length  Petal width  Species
5.1           3.5          1.4           0.2          I. setosa
7.0           3.2          4.7           1.4          I. versicolor
6.3           3.3          6.0           2.5          I. virginica

INTRODUCTION TO NATURAL LANGUAGE PROCESSING IN PYTHON


Supervised learning with NLP
Need to use language instead of geometric features

scikit-learn : Powerful open-source library

How to create supervised learning data from text?


Use bag-of-words models or tf-idf as features

INTRODUCTION TO NATURAL LANGUAGE PROCESSING IN PYTHON


IMDB Movie Dataset
Plot                                                Sci-Fi  Action
In a post-apocalyptic world in human decay, a ...     1       0
Mohei is a wandering swordsman. He arrives in ...     0       1
#137 is a SCI/FI thriller about a girl, Marla, ...    1       0

Goal: Predict movie genre based on plot summary

Categorical features generated using preprocessing

INTRODUCTION TO NATURAL LANGUAGE PROCESSING IN PYTHON


Supervised learning steps
Collect and preprocess our data

Determine a label (Example: Movie genre)

Split data into training and test sets

Extract features from the text to help predict the label


Bag-of-words vector built into scikit-learn

Evaluate trained model using the test set

INTRODUCTION TO NATURAL LANGUAGE PROCESSING IN PYTHON


Let's practice!
I N T R O D U C T I O N T O N AT U R A L L A N G U A G E P R O C E S S I N G I N P Y T H O N
Building word count
vectors with scikit-
learn
I N T R O D U C T I O N T O N AT U R A L L A N G U A G E P R O C E S S I N G I N P Y T H O N

Katharine Jarmul
Founder, kjamistan
Predicting movie genre
Dataset consisting of movie plots and corresponding genre

Goal: Create bag-of-word vectors for the movie plots


Can we predict genre based on the words used in the plot
summary?

INTRODUCTION TO NATURAL LANGUAGE PROCESSING IN PYTHON


Count Vectorizer with Python
import pandas as pd

from sklearn.model_selection import train_test_split

from sklearn.feature_extraction.text import CountVectorizer


df = ... # Load data into DataFrame
y = df['Sci-Fi']
X_train, X_test, y_train, y_test = train_test_split(
df['plot'], y,
test_size=0.33,
random_state=53)
count_vectorizer = CountVectorizer(stop_words='english')
count_train = count_vectorizer.fit_transform(X_train.values)
count_test = count_vectorizer.transform(X_test.values)

INTRODUCTION TO NATURAL LANGUAGE PROCESSING IN PYTHON


Let's practice!
I N T R O D U C T I O N T O N AT U R A L L A N G U A G E P R O C E S S I N G I N P Y T H O N
Training and testing
a classification
model with scikit-
learn
I N T R O D U C T I O N T O N AT U R A L L A N G U A G E P R O C E S S I N G I N P Y T H O N

Katharine Jarmul
Founder, kjamistan
Naive Bayes classifier
Naive Bayes Model
Commonly used for testing NLP classification problems

Basis in probability

Given a particular piece of data, how likely is a particular


outcome?

Examples:
If the plot has a spaceship, how likely is it to be sci-fi?

Given a spaceship and an alien, how likely now is it sci-fi?

Each word from CountVectorizer acts as a feature

Naive Bayes: Simple and effective

INTRODUCTION TO NATURAL LANGUAGE PROCESSING IN PYTHON


Naive Bayes with scikit-learn
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics
nb_classifier = MultinomialNB()

nb_classifier.fit(count_train, y_train)
pred = nb_classifier.predict(count_test)
metrics.accuracy_score(y_test, pred)

0.85841849389820424

INTRODUCTION TO NATURAL LANGUAGE PROCESSING IN PYTHON


Confusion matrix
metrics.confusion_matrix(y_test, pred, labels=[0,1])

array([[6410, 563],
[ 864, 2242]])

                 Predicted Action  Predicted Sci-Fi
Actual Action    6410              563
Actual Sci-Fi    864               2242

INTRODUCTION TO NATURAL LANGUAGE PROCESSING IN PYTHON


Let's practice!
I N T R O D U C T I O N T O N AT U R A L L A N G U A G E P R O C E S S I N G I N P Y T H O N
Simple NLP, complex
problems
I N T R O D U C T I O N T O N AT U R A L L A N G U A G E P R O C E S S I N G I N P Y T H O N

Katharine Jarmul
Founder, kjamistan
Translation

(source: https://twitter.com/Lupintweets/status/865533182455685121)

INTRODUCTION TO NATURAL LANGUAGE PROCESSING IN PYTHON


Sentiment analysis

(source: https://nlp.stanford.edu/projects/socialsent/)

INTRODUCTION TO NATURAL LANGUAGE PROCESSING IN PYTHON


Language biases

(related talk: https://www.youtube.com/watch?v=j7FwpZB1hWc)

INTRODUCTION TO NATURAL LANGUAGE PROCESSING IN PYTHON


Let's practice!
I N T R O D U C T I O N T O N AT U R A L L A N G U A G E P R O C E S S I N G I N P Y T H O N
Welcome!
S E N T I M E N T A N A LY S I S I N P Y T H O N

Violeta Misheva
Data Scientist
What is sentiment analysis?

Sentiment analysis is the process of understanding the opinion of an author about a


subject.

SENTIMENT ANALYSIS IN PYTHON


What goes into a sentiment analysis system?
First element: Opinion/emotion

Opinion (polarity): pos, neutral, neg

Emotion

SENTIMENT ANALYSIS IN PYTHON


What goes into a sentiment analysis system?
Second element: subject

Subject of discussion: What is being talked about ?

The camera on this phone is great but its battery life is rather disappointing.

Third element: opinion holder

Opinion holder (entity): By whom?

SENTIMENT ANALYSIS IN PYTHON


Why sentiment analysis?
Social media monitoring
Not only what people are talking about but HOW they are talking about it

Sentiment can be found also in forums, blogs, news

Brand monitoring

Customer service

Product analytics

Market research and analysis

SENTIMENT ANALYSIS IN PYTHON


Let's look at movie reviews!
data.head()

SENTIMENT ANALYSIS IN PYTHON


How many positive and negative reviews?
data.label.value_counts()

0 3782
1 3719
Name: label, dtype: int64

SENTIMENT ANALYSIS IN PYTHON


Percentage of positive and negative reviews
data.label.value_counts() / len(data)

0 0.504199
1 0.495801
Name: label, dtype: float64

SENTIMENT ANALYSIS IN PYTHON


How long is the longest review?
length_reviews = data.review.str.len()

type(length_reviews)
pandas.core.series.Series

# length_reviews is a Series of per-review lengths
length_reviews

0     667
1    2982
2     669
3    1087
....

# Finding the maximum length
max(length_reviews)

SENTIMENT ANALYSIS IN PYTHON


How long is the shortest review?
length_reviews = data.review.str.len()

length_reviews

0     667
1    2982
2     669
3    1087
4     724
....

# Finding the minimum length
min(length_reviews)

SENTIMENT ANALYSIS IN PYTHON


Let's practice!
S E N T I M E N T A N A LY S I S I N P Y T H O N
Sentiment analysis
types and
approaches
S E N T I M E N T A N A LY S I S I N P Y T H O N

Violeta Misheva
Data Scientist
Levels of granularity
1. Document level

2. Sentence level

3. Aspect level

The camera in this phone is pretty good but the battery life is disappointing.

SENTIMENT ANALYSIS IN PYTHON


Type of sentiment analysis algorithms
Rule/lexicon-based

nice:+2, good:+1, terrible: -3 ...

Today was a good day.

Today: 0, was: 0, a: 0, good: +1, day: 0

Total valence: +1 (a minimal scoring sketch follows below)

Automatic/ Machine learning
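A minimal rule-based scoring sketch (the tiny lexicon and the whitespace tokenization are illustrative only, not a real library):

# Toy valence lexicon; real lexicons are far larger
lexicon = {'nice': 2, 'good': 1, 'terrible': -3}

def valence(sentence):
    # Score each lowercased token; unknown words count as 0
    tokens = sentence.lower().rstrip('.').split()
    return sum(lexicon.get(token, 0) for token in tokens)

print(valence("Today was a good day."))  # +1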

SENTIMENT ANALYSIS IN PYTHON


What is the valence of a sentence?
text = "Today was a good day."

from textblob import TextBlob

my_valence = TextBlob(text)
my_valence.sentiment

Sentiment(polarity=0.7, subjectivity=0.6000000000000001)

SENTIMENT ANALYSIS IN PYTHON


Automated or rule-based?
Automated/Machine learning:
Rely on having labelled historical data

Might take a while to train

Latest machine learning models can be quite powerful

Rule/lexicon-based:
Rely on manually crafted valence scores

Different words might have different polarity in different contexts

Can be quite fast

SENTIMENT ANALYSIS IN PYTHON


Let's practice!
S E N T I M E N T A N A LY S I S I N P Y T H O N
Let's build a word
cloud!
S E N T I M E N T A N A LY S I S I N P Y T H O N

Violeta Misheva
Data Scientist
Word cloud example

SENTIMENT ANALYSIS IN PYTHON


How do word clouds work?

The more frequent a word is, the BIGGER and bolder it will appear on the word cloud.

SENTIMENT ANALYSIS IN PYTHON


Word cloud generated by one of the longest reviews

SENTIMENT ANALYSIS IN PYTHON


Why word clouds?
Pros:
Can reveal the essential

Provide an overall sense of the text

Easy to grasp and engaging

Cons:
Sometimes confusing and uninformative

With larger text, require more work

SENTIMENT ANALYSIS IN PYTHON


Let's build a word cloud in Python!
from wordcloud import WordCloud
import matplotlib.pyplot as plt

two_cities = """It was the best of times, it was the worst of times,
it was the age of wisdom, it was the age of foolishness,
it was the epoch of belief, it was the epoch of incredulity,
it was the season of Light, it was the season of Darkness,
it was the spring of hope, it was the winter of despair,
we had everything before us, we had nothing before us,
we were all going direct to Heaven, we were all going
direct the other way – in short, the period was so far
like the present period, that some of its noisiest
authorities insisted on its being received, for good
or for evil, in the superlative degree of comparison only."""

SENTIMENT ANALYSIS IN PYTHON


Define the WordCloud object
cloud_two_cities = WordCloud().generate(two_cities)

# To see all arguments of the function


?WordCloud

Background color

Size and font of the words, scaling

Stopwords

# What does cloud_two_cities look like?


cloud_two_cities
<wordcloud.wordcloud.WordCloud at 0x2585f286d68>

SENTIMENT ANALYSIS IN PYTHON


Displaying the word cloud!
plt.imshow(cloud_two_cities, interpolation='bilinear')

plt.axis('off')
plt.show()

SENTIMENT ANALYSIS IN PYTHON


Let's practice!
S E N T I M E N T A N A LY S I S I N P Y T H O N
Bag-of-words
S E N T I M E N T A N A LY S I S I N P Y T H O N

Violeta Misheva
Data Scientist
What is a bag-of-words (BOW) ?

Describes the occurrence of words within a document or a collection of documents (corpus)

Builds a vocabulary of the words and a measure of their presence

SENTIMENT ANALYSIS IN PYTHON


Amazon product reviews

SENTIMENT ANALYSIS IN PYTHON


Sentiment analysis with BOW: Example
This is the best book ever. I loved the book and highly recommend it!!!

{'This': 1, 'is': 1, 'the': 2 , 'best': 1 , 'book': 2,


'ever': 1, 'I':1 , 'loved':1 , 'and': 1 , 'highly': 1,
'recommend': 1 , 'it': 1 }

Lose word order and grammar rules!

SENTIMENT ANALYSIS IN PYTHON


BOW end result
The output will look something like this:

SENTIMENT ANALYSIS IN PYTHON


CountVectorizer function
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

vect = CountVectorizer(max_features=1000)
vect.fit(data.review)
X = vect.transform(data.review)

SENTIMENT ANALYSIS IN PYTHON


CountVectorizer output
X

<10000x1000 sparse matrix of type '<class 'numpy.int64'>'


with 406668 stored elements in Compressed Sparse Row format>

SENTIMENT ANALYSIS IN PYTHON


Transforming the vectorizer
# Transform to an array
my_array = X.toarray()

# Transform back to a dataframe, assign column names


X_df = pd.DataFrame(my_array, columns=vect.get_feature_names())

SENTIMENT ANALYSIS IN PYTHON


Let's practice!
S E N T I M E N T A N A LY S I S I N P Y T H O N
Getting granular
with n-grams
S E N T I M E N T A N A LY S I S I N P Y T H O N

Violeta Misheva
Data Scientist
Context matters
I am happy, not sad.

I am sad, not happy.

Putting 'not' in front of a word (negation) is one example of how context matters.

SENTIMENT ANALYSIS IN PYTHON


Capturing context with a BOW
Unigrams : single tokens

Bigrams: pairs of tokens

Trigrams: triples of tokens

n-grams: sequence of n-tokens

SENTIMENT ANALYSIS IN PYTHON


Capturing context with BOW
The weather today is wonderful.

Unigrams : { The, weather, today, is, wonderful }

Bigrams: {The weather, weather today, today is, is wonderful}

Trigrams: {The weather today, weather today is, today is wonderful}

SENTIMENT ANALYSIS IN PYTHON


n-grams with the CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

vect = CountVectorizer(ngram_range=(min_n, max_n))

# Only unigrams
ngram_range=(1, 1)

# Uni- and bigrams


ngram_range=(1, 2)

SENTIMENT ANALYSIS IN PYTHON


What is the best n?
Longer sequence of tokens
Results in more features

Higher precision of machine learning models

Risk of overfitting

SENTIMENT ANALYSIS IN PYTHON


Specifying vocabulary size
CountVectorizer(max_features, max_df, min_df)

max_features: if specified, it will include only the top most frequent words in the vocabulary
If max_features = None, all words will be included

max_df: ignore terms with higher than specified document frequency
If set to an integer, an absolute count; if a float, a proportion of documents

Default is 1.0, which means it does not ignore any terms

min_df: ignore terms with lower than specified document frequency
If set to an integer, an absolute count; if a float, a proportion of documents

Default is 1, which means it does not ignore any terms

A usage sketch follows below.
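For example (the threshold values here are illustrative):

from sklearn.feature_extraction.text import CountVectorizer

# Keep the 1000 most frequent terms, ignoring terms that appear
# in more than 90% of documents or in fewer than 5 documents
vect = CountVectorizer(max_features=1000, max_df=0.9, min_df=5)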

SENTIMENT ANALYSIS IN PYTHON


Let's practice!
S E N T I M E N T A N A LY S I S I N P Y T H O N
Build new features
from text
S E N T I M E N T A N A LY S I S I N P Y T H O N

Violeta Misheva
Data Scientist
Goal of the video

Goal : Enrich the existing dataset with features related to the text column (capturing the
sentiment)

SENTIMENT ANALYSIS IN PYTHON


Product reviews data
reviews.head()

SENTIMENT ANALYSIS IN PYTHON


Features from the review column

How long is each review?

How many sentences does it contain?

What parts of speech are involved?

How many punctuation marks?

SENTIMENT ANALYSIS IN PYTHON


Tokenizing a string
from nltk import word_tokenize

anna_k = 'Happy families are all alike, every unhappy family is unhappy in its own way.'

word_tokenize(anna_k)

['Happy','families','are', 'all','alike',',',
'every','unhappy', 'family', 'is','unhappy','in',
'its','own','way','.']

SENTIMENT ANALYSIS IN PYTHON


Tokens from a column
# General form of list comprehension
[expression for item in iterable]

word_tokens = [word_tokenize(review) for review in reviews.review]


type(word_tokens)

list

type(word_tokens[0])

list

SENTIMENT ANALYSIS IN PYTHON


Tokens from a column
len_tokens = []

# Iterate over the word_tokens list


for i in range(len(word_tokens)):
len_tokens.append(len(word_tokens[i]))

# Create a new feature for the length of each review


reviews['n_tokens'] = len_tokens

SENTIMENT ANALYSIS IN PYTHON


Dealing with punctuation
We did not address punctuation here, but you can exclude it

A feature that measures the number of punctuation signs (sketched below)

A review with many punctuation signs could signal a very emotionally charged opinion
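A minimal sketch of such a feature, counting punctuation characters per review (assumes the reviews DataFrame from earlier; the column name n_punct is made up for illustration):

from string import punctuation

# Count punctuation characters in each review
reviews['n_punct'] = [sum(ch in punctuation for ch in review)
                      for review in reviews.review]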

SENTIMENT ANALYSIS IN PYTHON


Reviews with a feature for the length
reviews.head()

SENTIMENT ANALYSIS IN PYTHON


Let's practice!
S E N T I M E N T A N A LY S I S I N P Y T H O N
Can you guess the
language?
S E N T I M E N T A N A LY S I S I N P Y T H O N

Violeta Misheva
Data Scientist
Language of a string in Python
from langdetect import detect_langs
foreign = 'Este libro ha sido uno de los mejores libros que he leido.'

detect_langs(foreign)

[es:0.9999945352697024]

SENTIMENT ANALYSIS IN PYTHON


Language of a column
Problem: Detect the language of each of the strings and capture the most likely language in
a new column

from langdetect import detect_langs


reviews = pd.read_csv('product_reviews.csv')

reviews.head()

SENTIMENT ANALYSIS IN PYTHON


Building a feature for the language
languages = []

for row in range(len(reviews)):


languages.append(detect_langs(reviews.iloc[row, 1]))

languages
[it:0.9999982541301151],
[es:0.9999954153640488],
[es:0.7142833997345875, en:0.2857160465706441],
[es:0.9999942365605781],
[es:0.999997956049055] ...

SENTIMENT ANALYSIS IN PYTHON


Building a feature for the language
# Transform the first list to a string and split on a colon
str(languages[0]).split(':')
['[it', '0.9999982541301151]']

str(languages[0]).split(':')[0]
'[it'

str(languages[0]).split(':')[0][1:]
'it'

SENTIMENT ANALYSIS IN PYTHON


Building a feature for the language
languages = [str(lang).split(':')[0][1:] for lang in languages]

reviews['language'] = languages

SENTIMENT ANALYSIS IN PYTHON


Let's practice!
S E N T I M E N T A N A LY S I S I N P Y T H O N
Stop words
S E N T I M E N T A N A LY S I S I N P Y T H O N

Violeta Misheva
Data Scientist
What are stop words and how to find them?
Stop words: words that occur too frequently and not considered informative

Lists of stop words in most languages

{'the', 'a', 'an', 'and', 'but', 'for', 'on', 'in', 'at' ...}

Context matters

{'movie', 'movies', 'film', 'films', 'cinema'}

SENTIMENT ANALYSIS IN PYTHON


Stop words with word clouds
Word cloud, not removing stop words
Word cloud with stop words removed

SENTIMENT ANALYSIS IN PYTHON


Remove stop words from word clouds
# Import libraries
from wordcloud import WordCloud, STOPWORDS

# Define the stopwords list


my_stopwords = set(STOPWORDS)
my_stopwords.update(["movie", "movies", "film", "films", "watch", "br"])

# Generate and show the word cloud


my_cloud = WordCloud(background_color='white', stopwords=my_stopwords).generate(name_string)
plt.imshow(my_cloud, interpolation='bilinear')

SENTIMENT ANALYSIS IN PYTHON


Stop words with BOW
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

# Define the set of stop words


my_stop_words = ENGLISH_STOP_WORDS.union(['film', 'movie', 'cinema', 'theatre'])

vect = CountVectorizer(stop_words=my_stop_words)
vect.fit(movies.review)
X = vect.transform(movies.review)

SENTIMENT ANALYSIS IN PYTHON


Let's practice!
S E N T I M E N T A N A LY S I S I N P Y T H O N
Capturing a token
pattern
S E N T I M E N T A N A LY S I S I N P Y T H O N

Violeta Misheva
Data Scientist
String operators and comparisons
# Checks if a string is composed only of letters
my_string.isalpha()

# Checks if a string is composed only of digits


my_string.isdigit()

# Checks if a string is composed only of alphanumeric characters


my_string.isalnum()

SENTIMENT ANALYSIS IN PYTHON


String operators with list comprehension
# Original word tokenization
word_tokens = [word_tokenize(review) for review in reviews.review]

# Keeping only tokens composed of letters


cleaned_tokens = [[word for word in item if word.isalpha()] for item in word_tokens]

len(word_tokens[0])

87

len(cleaned_tokens[0])

78

SENTIMENT ANALYSIS IN PYTHON


Regular expressions
import re

my_string = '#Wonderfulday'
# Extract #, followed by any letter, small or capital
x = re.search('#[A-Za-z]', my_string)

x
<re.Match object; span=(0, 2), match='#W'>

SENTIMENT ANALYSIS IN PYTHON


Token pattern with a BOW
# Default token pattern in CountVectorizer
r'\b\w\w+\b'

# Specify a particular token pattern
CountVectorizer(token_pattern=r'\b[^\d\W][^\d\W]+\b')

SENTIMENT ANALYSIS IN PYTHON


Let's practice!
S E N T I M E N T A N A LY S I S I N P Y T H O N
Stemming and
lemmatization
S E N T I M E N T A N A LY S I S I N P Y T H O N

Violeta Misheva
Data Scientist
What is stemming?
Stemming is the process of transforming words to their root forms, even if the stem itself is
not a valid word in the language.

staying, stays, stayed ----> stay


house, houses, housing ----> hous

SENTIMENT ANALYSIS IN PYTHON


What is lemmatization?
Lemmatization is quite similar to stemming but unlike stemming, it reduces the words to
roots that are valid words in the language.

stay, stays, staying, stayed ----> stay


house, houses, housing ----> house

SENTIMENT ANALYSIS IN PYTHON


Stemming vs. lemmatization
Stemming:
Produces roots of words

Fast and efficient to compute

Lemmatization:
Produces actual words

Slower than stemming and can depend on the part-of-speech

SENTIMENT ANALYSIS IN PYTHON


Stemming of strings
from nltk.stem import PorterStemmer

porter = PorterStemmer()

porter.stem('wonderful')

'wonder'

SENTIMENT ANALYSIS IN PYTHON


Non-English stemmers
Snowball Stemmer: Danish, Dutch, English, Finnish, French, German, Hungarian, Italian, Norwegian, Portuguese, Romanian, Russian, Spanish, Swedish

from nltk.stem.snowball import SnowballStemmer

DutchStemmer = SnowballStemmer("dutch")
DutchStemmer.stem("beginnen")

'begin'

SENTIMENT ANALYSIS IN PYTHON


How to stem a sentence?
porter.stem('Today is a wonderful day!')

'today is a wonderful day!'

tokens = word_tokenize('Today is a wonderful day!')


stemmed_tokens = [porter.stem(token) for token in tokens]
stemmed_tokens

['today', 'is', 'a', 'wonder', 'day', '!']

SENTIMENT ANALYSIS IN PYTHON


Lemmatization of a string
from nltk.stem import WordNetLemmatizer

WNlemmatizer = WordNetLemmatizer()

WNlemmatizer.lemmatize('wonderful', pos='a')

'wonderful'

SENTIMENT ANALYSIS IN PYTHON


Let's practice!
S E N T I M E N T A N A LY S I S I N P Y T H O N
TfIdf: More ways to
transform text
S E N T I M E N T A N A LY S I S I N P Y T H O N

Violeta Misheva
Data Scientist
What are the components of TfIdf?
TF: term frequency: How often a given word appears within a document in the corpus

Inverse document frequency: Log-ratio between the total number of documents and the number of documents that contain a specific word
Used to calculate the weight of words that do not occur frequently

SENTIMENT ANALYSIS IN PYTHON


TfIdf score of a word
TfIdf score:

TfIdf = term frequency * inverse document frequency

BOW does not account for length of a document, TfIdf does.

TfIdf likely to capture words common within a document but not across documents.

SENTIMENT ANALYSIS IN PYTHON


How is TfIdf useful?
Twitter airline sentiment
Low TfIdf scores: United, Virgin America

High TfIdf scores: check-in process (if rare across documents)

More on TfIdf
Since it penalizes frequent words, less need to deal with stop words explicitly.

Quite useful in search queries and information retrieval to rank the relevance of returned
results.

SENTIMENT ANALYSIS IN PYTHON


TfIdf in Python
# Import the TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

Arguments of TfidfVectorizer: max_features, ngram_range, stop_words, token_pattern, max_df, min_df

vect = TfidfVectorizer(max_features=100).fit(tweets.text)
X = vect.transform(tweets.text)

SENTIMENT ANALYSIS IN PYTHON


TfidfVectorizer
X
<14640x100 sparse matrix of type '<class 'numpy.float64'>'
with 119182 stored elements in Compressed Sparse Row format>

X_df = pd.DataFrame(X.toarray(), columns=vect.get_feature_names())


X_df.head()

SENTIMENT ANALYSIS IN PYTHON


Let's practice!
S E N T I M E N T A N A LY S I S I N P Y T H O N
Let's predict the
sentiment!
S E N T I M E N T A N A LY S I S I N P Y T H O N

Violeta Misheva
Data Scientist
Classification problems
Product and movie reviews: positive or negative sentiment (binary classification)

Tweets about airline companies: positive, neutral and negative (multi-class classification)

SENTIMENT ANALYSIS IN PYTHON


Linear and logistic regressions

SENTIMENT ANALYSIS IN PYTHON


Logistic function
Linear regression: numeric outcome

Logistic regression: probability:

Probability(sentiment = positive | review)

SENTIMENT ANALYSIS IN PYTHON


Logistic regression in Python
from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression().fit(X, y)

SENTIMENT ANALYSIS IN PYTHON


Measuring model performance
Accuracy: Fraction of predictions our model got right.

The higher and closer the accuracy is to 1, the better

# Accuracy using score


score = log_reg.score(X, y)
print(score)

0.9009

SENTIMENT ANALYSIS IN PYTHON


Using accuracy score
# Accuracy using accuracy_score
from sklearn.metrics import accuracy_score

y_predicted = log_reg.predict(X)
accuracy = accuracy_score(y, y_predicted)

0.9009

SENTIMENT ANALYSIS IN PYTHON


Let's practice!
S E N T I M E N T A N A LY S I S I N P Y T H O N
Did we really predict
the sentiment well?
S E N T I M E N T A N A LY S I S I N P Y T H O N

Violeta Misheva
Data Scientist
Train/test split

Training set: used to train the model (70-80% of the whole data)

Testing set: used to evaluate the performance of the model

SENTIMENT ANALYSIS IN PYTHON


Train/test in Python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123, stratify=y)

X : features

y : labels

test_size: proportion of data used in testing

random_state: seed generator used to make the split

stratify: proportion of classes in the sample produced will be the same as the proportion of
values provided to this parameter

SENTIMENT ANALYSIS IN PYTHON


Logistic regression with train/test split
log_reg = LogisticRegression().fit(X_train, y_train)

print('Accuracy on training data: ', log_reg.score(X_train, y_train))

0.76

print('Accuracy on testing data: ', log_reg.score(X_test, y_test))

0.73

SENTIMENT ANALYSIS IN PYTHON


Accuracy score with train/test split
from sklearn.metrics import accuracy_score

log_reg = LogisticRegression().fit(X_train, y_train)

y_predicted = log_reg.predict(X_test)
print('Accuracy score on test data: ', accuracy_score(y_test, y_predicted))

0.73

SENTIMENT ANALYSIS IN PYTHON


Confusion matrix

SENTIMENT ANALYSIS IN PYTHON


Confusion matrix in Python
from sklearn.metrics import confusion_matrix

log_reg = LogisticRegression().fit(X_train, y_train)


y_predicted = log_reg.predict(X_test)

print(confusion_matrix(y_test, y_predicted)/len(y_test))

[[0.3788 0.1224]
[0.1352 0.3636]]

SENTIMENT ANALYSIS IN PYTHON


Let's practice!
S E N T I M E N T A N A LY S I S I N P Y T H O N
Logistic regression:
revisited
S E N T I M E N T A N A LY S I S I N P Y T H O N

Violeta Misheva
Data Scientist
Complex models and regularization
Complex models:
Complex model that captures the noise in the data (overfitting)

Having a large number of features or parameters

Regularization:
A way to simplify and ensure we have a less complex model

SENTIMENT ANALYSIS IN PYTHON


Regularization in a logistic regression
from sklearn.linear_model import LogisticRegression

# Regularization arguments
LogisticRegression(penalty='l2', C=1.0)

L2: shrinks all coefficients towards zero

High values of C: low penalization, model fits the training data well.

Low values of C: high penalization, model less flexible.

SENTIMENT ANALYSIS IN PYTHON


Predicting a probability vs. predicting a class
log_reg = LogisticRegression().fit(X_train, y_train)

# Predict labels
y_predicted = log_reg.predict(X_test)

# Predict probability
y_probab = log_reg.predict_proba(X_test)

SENTIMENT ANALYSIS IN PYTHON


Predicting a probability vs. predicting a class
y_probab
array([[0.5002245, 0.4997755],
[0.4900345, 0.5099655],
...,
[0.7040499, 0.2959501]])

# Select the probabilities of class 1


y_probab = log_reg.predict_proba(X_test)[:, 1]

array([0.4997755, 0.5099655 ..., 0.2959501]])

SENTIMENT ANALYSIS IN PYTHON


Model metrics with predicted probabilities
Accuracy score and confusion matrix work with classes.

They raise a ValueError when applied to probabilities.

# Default probability encoding:
# If probability >= 0.5, then class 1, else class 0
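A quick sketch of applying that default threshold by hand (NumPy is the only added assumption):

import numpy as np

# Probabilities of class 1, as on the previous slide
y_probab = log_reg.predict_proba(X_test)[:, 1]

# Encode: probability >= 0.5 -> class 1, else class 0
y_classes = (y_probab >= 0.5).astype(int)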

SENTIMENT ANALYSIS IN PYTHON


Let's practice!
S E N T I M E N T A N A LY S I S I N P Y T H O N
Bringing it all
together
S E N T I M E N T A N A LY S I S I N P Y T H O N

Violeta Misheva
Data Scientist
The Sentiment Analysis problem
Sentiment analysis as the process of understanding the opinion of an author about a
subject

Movie reviews

Amazon product reviews

Twitter airline sentiment

Various emotionally charged literary examples

SENTIMENT ANALYSIS IN PYTHON


Exploration of the reviews
Basic information about size of reviews

Word clouds

Features for the length of reviews: number of words, number of sentences

Feature detecting the language of a review

SENTIMENT ANALYSIS IN PYTHON


Numeric transformations of sentiment-carrying
columns
Bag-of-words

TfIdf vectorization

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Vectorizer syntax
vect = CountVectorizer().fit(data.text_column)
X = vect.transform(data.text_column)

SENTIMENT ANALYSIS IN PYTHON


Arguments of the vectorizers
stop words: non-informative, frequently occurring words

n-gram range: use phrases not only single words

control size of vocabulary: max_features, max_df, min_df

capturing a pattern of tokens: remove digits or certain characters

Important but NOT arguments to the vectorizers

lemmas and stems

SENTIMENT ANALYSIS IN PYTHON


Supervised learning model
Logistic regression classifier to predict the sentiment

Evaluated with accuracy and confusion matrix

Importance of train/test split

SENTIMENT ANALYSIS IN PYTHON


Let's practice!
S E N T I M E N T A N A LY S I S I N P Y T H O N
Wrap up
S E N T I M E N T A N A LY S I S I N P Y T H O N

Violeta Misheva
Data Scientist
The Sentiment Analysis world

SENTIMENT ANALYSIS IN PYTHON


Sentiment analysis types

SENTIMENT ANALYSIS IN PYTHON


The automated sentiment analysis system

SENTIMENT ANALYSIS IN PYTHON


Congratulations!
S E N T I M E N T A N A LY S I S I N P Y T H O N
Natural Language
Processing (NLP)
basics
N AT U R A L L A N G U A G E P R O C E S S I N G W I T H S PA C Y

Azadeh Mobasher
Principal Data Scientist
Natural Language Processing (NLP)

A subfield of Artificial Intelligence (AI)

Helps computers to understand human language

Helps extract insights from unstructured data

Incorporates statistics, machine learning models and deep learning models

NATURAL LANGUAGE PROCESSING WITH SPACY


NLP use cases
Sentiment analysis

Use of computers to determine the underlying subjective tone of a piece of writing

NATURAL LANGUAGE PROCESSING WITH SPACY


NLP use cases
Named entity recognition (NER)

Locating and classifying named entities mentioned in unstructured text into pre-defined
categories

Named entities are real-world objects such as a person or location

NATURAL LANGUAGE PROCESSING WITH SPACY


NLP use cases

Generate human-like responses to text input, such as ChatGPT

NATURAL LANGUAGE PROCESSING WITH SPACY


Introduction to spaCy
spaCy is a free, open-source library for NLP in
Python which:

Is designed to build systems for information extraction

Provides production-ready code for NLP use cases

Supports 64+ languages

Is robust and fast and has visualization libraries

NATURAL LANGUAGE PROCESSING WITH SPACY


Install and import spaCy

spaCy can be installed using the Python package manager pip; trained models are downloaded separately. Multiple trained models are available for the English language at spacy.io.

$ python3 -m pip install spacy
$ python3 -m spacy download en_core_web_sm

import spacy
nlp = spacy.load("en_core_web_sm")

NATURAL LANGUAGE PROCESSING WITH SPACY


Read and process text with spaCy
Loaded spaCy model en_core_web_sm = nlp object
nlp object converts text into a Doc object (container) to store processed text

NATURAL LANGUAGE PROCESSING WITH SPACY


spaCy in action
Processing a string using spaCy

import spacy
nlp = spacy.load("en_core_web_sm")
text = "A spaCy pipeline object is created."
doc = nlp(text)

Tokenization
A Token is defined as the smallest meaningful part of the text.

Tokenization: The process of dividing a text into a list of meaningful tokens

print([token.text for token in doc])

['A', 'spaCy', 'pipeline', 'object', 'is', 'created', '.']

NATURAL LANGUAGE PROCESSING WITH SPACY


Let's practice!
N AT U R A L L A N G U A G E P R O C E S S I N G W I T H S PA C Y
spaCy basics
N AT U R A L L A N G U A G E P R O C E S S I N G W I T H S PA C Y

Azadeh Mobasher
Principal Data Scientist
spaCy NLP pipeline
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("Here's my spaCy pipeline.")

Use spacy.load() to return nlp , a Language class

The Language object is the text processing pipeline

Apply nlp() on any text to get a Doc container

NATURAL LANGUAGE PROCESSING WITH SPACY


spaCy NLP pipeline

spaCy applies some processing steps using its Language class:

NATURAL LANGUAGE PROCESSING WITH SPACY


Container objects in spaCy
There are multiple data structures to represent text data in spaCy :

Name Description
Doc A container for accessing linguistic annotations of text

Span A slice from a Doc object

Token An individual token, i.e. a word, punctuation, whitespace, etc.
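A minimal sketch of the three containers (the text and slice indices are illustrative):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("A Doc holds Tokens and Spans.")  # Doc: the whole processed text

token = doc[1]   # Token: a single item, here 'Doc'
span = doc[1:4]  # Span: a slice of the Doc, here 'Doc holds Tokens'
print(token.text, '|', span.text)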

NATURAL LANGUAGE PROCESSING WITH SPACY


Pipeline components
The spaCy language processing pipeline always depends on the loaded model and its
capabilities.

Component         Name        Description
Tokenizer         Tokenizer   Segment text into tokens and create Doc object
Tagger            Tagger      Assign part-of-speech tags
Lemmatizer        Lemmatizer  Reduce the words to their root forms
EntityRecognizer  NER         Detect and label named entities

NATURAL LANGUAGE PROCESSING WITH SPACY


Pipeline components

Each component has unique features to process text


Language

DependencyParser

Sentencizer

NATURAL LANGUAGE PROCESSING WITH SPACY


Tokenization
Always the first operation
All the other operations require tokens

Tokens can be words, numbers and punctuation

import spacy
nlp = spacy.load("en_core_web_sm")

doc = nlp("Tokenization splits a sentence into its tokens.")


print([token.text for token in doc])

['Tokenization', 'splits', 'a', 'sentence', 'into', 'its', 'tokens', '.']

NATURAL LANGUAGE PROCESSING WITH SPACY


Sentence segmentation
More complex than tokenization
Is a part of DependencyParser component

import spacy
nlp = spacy.load("en_core_web_sm")

text = "We are learning NLP. This course introduces spaCy."


doc = nlp(text)
for sent in doc.sents:
print(sent.text)

We are learning NLP.


This course introduces spaCy.

NATURAL LANGUAGE PROCESSING WITH SPACY


Lemmatization
A lemma is the base form of a token
The lemma of eats and ate is eat

Improves accuracy of language models

import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("We are seeing her after one year.")
print([(token.text, token.lemma_) for token in doc])

[('We', 'we'), ('are', 'be'), ('seeing', 'see'), ('her', 'she'),


('after', 'after'), ('one', 'one'), ('year', 'year'), ('.', '.')]

NATURAL LANGUAGE PROCESSING WITH SPACY


Let's practice!
N AT U R A L L A N G U A G E P R O C E S S I N G W I T H S PA C Y
Linguistic features in
spaCy
N AT U R A L L A N G U A G E P R O C E S S I N G W I T H S PA C Y

Azadeh Mobasher
Principal Data Scientist
POS tagging
Categorizing words grammatically, based on function and context within a sentence

POS Description Example


VERB Verb run, eat, ate, take
NOUN Noun man, airplane, tree, flower
ADJ Adjective big, old, incompatible, conflicting
ADV Adverb very, down, there, tomorrow
CONJ Conjunction and, or, but

NATURAL LANGUAGE PROCESSING WITH SPACY


POS tagging with spaCy

POS tagging confirms the meaning of a word

Some words such as watch can be both noun and verb


spaCy captures POS tags in the pos_ feature of the nlp pipeline

spacy.explain() explains a given POS tag

NATURAL LANGUAGE PROCESSING WITH SPACY


POS tagging with spaCy
verb_sent = "I watch TV."
print([(token.text, token.pos_, spacy.explain(token.pos_))
       for token in nlp(verb_sent)])

[('I', 'PRON', 'pronoun'), ('watch', 'VERB', 'verb'),
 ('TV', 'NOUN', 'noun'), ('.', 'PUNCT', 'punctuation')]

noun_sent = "I left without my watch."
print([(token.text, token.pos_, spacy.explain(token.pos_))
       for token in nlp(noun_sent)])

[('I', 'PRON', 'pronoun'), ('left', 'VERB', 'verb'),
 ('without', 'ADP', 'adposition'), ('my', 'PRON', 'pronoun'),
 ('watch', 'NOUN', 'noun'), ('.', 'PUNCT', 'punctuation')]

NATURAL LANGUAGE PROCESSING WITH SPACY


Named entity recognition
A named entity is a word or phrase that refers to a specific entity with a name
Named-entity recognition (NER) classifies named entities into pre-defined categories

Entity type Description


PERSON Named person or family
ORG Companies, institutions, etc.
GPE Geo-political entity, countries, cities, etc.
LOC Non-GPE locations, mountain ranges, etc.
DATE Absolute or relative dates or periods
TIME Time smaller than a day

NATURAL LANGUAGE PROCESSING WITH SPACY


NER and spaCy

spaCy models extract named entities using the NER pipeline component

Named entities are available via the doc.ents property


spaCy will also tag each entity with its entity label ( .label_ )

NATURAL LANGUAGE PROCESSING WITH SPACY


NER and spaCy

import spacy
nlp = spacy.load("en_core_web_sm")
text = "Albert Einstein was genius."
doc = nlp(text)
print([(ent.text, ent.start_char,
ent.end_char, ent.label_) for ent in doc.ents])

>>> [('Albert Einstein', 0, 15, 'PERSON')]

NATURAL LANGUAGE PROCESSING WITH SPACY


NER and spaCy
We can also access entity types of each token in a Doc container

import spacy
nlp = spacy.load("en_core_web_sm")
text = "Albert Einstein was genius."
doc = nlp(text)
print([(token.text, token.ent_type_) for token in doc])

>>> [('Albert', 'PERSON'), ('Einstein', 'PERSON'),


('was', ''), ('genius', ''), ('.', '')]

NATURAL LANGUAGE PROCESSING WITH SPACY


displaCy
spaCy is equipped with a modern visualizer: displaCy

The displaCy entity visualizer highlights named entities and their labels

import spacy
from spacy import displacy

text = "Albert Einstein was genius."
nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
displacy.serve(doc, style="ent")

NATURAL LANGUAGE PROCESSING WITH SPACY


Let's practice!
N AT U R A L L A N G U A G E P R O C E S S I N G W I T H S PA C Y
Linguistic features
N AT U R A L L A N G U A G E P R O C E S S I N G W I T H S PA C Y

Azadeh Mobasher
Principal Data Scientist
POS tagging
POS tags depend on the context, surrounding words and their tags

import spacy
nlp = spacy.load("en_core_web_sm")
text = "My cat will fish for a fish tomorrow in a fishy way."
print([(token.text, token.pos_, spacy.explain(token.pos_))
for token in nlp(text)])

NATURAL LANGUAGE PROCESSING WITH SPACY


What is the importance of POS?

Better accuracy for many NLP tasks

Translation system use case:
I will fish tomorrow. (verb) -> pescaré
I ate fish. (noun) -> pescado

NATURAL LANGUAGE PROCESSING WITH SPACY


What is the importance of POS?

Word-sense disambiguation (WSD) is the problem of deciding in which sense a word is used
in a sentence.

Determining the sense of the word can be crucial in machine translation, etc.

NATURAL LANGUAGE PROCESSING WITH SPACY


Word-sense disambiguation
import spacy
nlp = spacy.load("en_core_web_sm")

verb_text = "I will fish tomorrow."


noun_text = "I ate fish."

print([(token.text, token.pos_) for token in nlp(verb_text) if "fish" in token.text], "\n")
print([(token.text, token.pos_) for token in nlp(noun_text) if "fish" in token.text])

[('fish', 'VERB')]

[('fish', 'NOUN')]

NATURAL LANGUAGE PROCESSING WITH SPACY


Dependency parsing
Explores a sentence syntax
Links between two tokens

Results in a tree

NATURAL LANGUAGE PROCESSING WITH SPACY


Dependency parsing and spaCy

Dependency label describes the type of syntactic relation between two tokens

Dependency label Description


nsubj Nominal subject
root Root
det Determiner
dobj Direct object
aux Auxiliary

NATURAL LANGUAGE PROCESSING WITH SPACY


Dependency parsing and displaCy
displaCy can draw dependency trees

doc = nlp("We understand the differences.")

spacy.displacy.serve(doc, style="dep")

NATURAL LANGUAGE PROCESSING WITH SPACY


Dependency parsing and spaCy
.dep_ attribute to access the dependency label of a token

doc = nlp("We understand the differences.")


print([(token.text, token.dep_, spacy.explain(token.dep_)) for token in doc])

[('We', 'nsubj', 'nominal subject'), ('understand', 'ROOT', 'root'),


('the', 'det', 'determiner'), ('differences', 'dobj', 'direct object'),
('.', 'punct', 'punctuation')]

NATURAL LANGUAGE PROCESSING WITH SPACY


Let's practice!
N AT U R A L L A N G U A G E P R O C E S S I N G W I T H S PA C Y
Introduction to word
vectors
N AT U R A L L A N G U A G E P R O C E S S I N G W I T H S PA C Y

Azadeh Mobasher
Principal Data Scientist
Word vectors (embeddings)

Numerical representations of words

Bag of words method: {"I": 1, "got": 2, ...}

Older methods do not capture the meaning of words:

Sentence            I    got    covid    coronavirus
I got covid         1    2      3        -
I got coronavirus   1    2      -        4

NATURAL LANGUAGE PROCESSING WITH SPACY


Word vectors
A pre-defined number of dimensions
Considers word frequencies and the presence of other words in similar contexts
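For instance, we can check the dimensionality of a vector directly (a minimal sketch, assuming the en_core_web_md model is installed and "cheese" is an example word in its vocabulary):

import spacy

nlp = spacy.load("en_core_web_md")
# Every word vector in this model has the same fixed number of dimensions
print(nlp.vocab["cheese"].vector.shape)

>>> (300,)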

NATURAL LANGUAGE PROCESSING WITH SPACY


Word vectors
Multiple approaches to produce word vectors:
word2vec, GloVe, fastText and transformer-based architectures

An example of a word vector:

NATURAL LANGUAGE PROCESSING WITH SPACY


spaCy vocabulary

A part of many spaCy models.

en_core_web_md has 300-dimensional vectors for 20,000 words.

import spacy
nlp = spacy.load("en_core_web_md")
print(nlp.meta["vectors"])

>>> {'width': 300, 'vectors': 20000, 'keys': 514157,


'name': 'en_vectors', 'mode': 'default'}

NATURAL LANGUAGE PROCESSING WITH SPACY


Word vectors in spaCy
nlp.vocab : to access vocabulary ( Vocab class)

nlp.vocab.strings : to access word IDs in a vocabulary

import spacy
nlp = spacy.load("en_core_web_md")
like_id = nlp.vocab.strings["like"]
print(like_id)

>>> 18194338103975822726

.vocab.vectors : to access the word vectors of a model, or a word's vector given its corresponding ID

print(nlp.vocab.vectors[like_id])

>>> array([-2.3334e+00, -1.3695e+00, -1.1330e+00, -6.8461e-01, ...])

NATURAL LANGUAGE PROCESSING WITH SPACY


Let's practice!
N AT U R A L L A N G U A G E P R O C E S S I N G W I T H S PA C Y
Word vectors and
spaCy
N AT U R A L L A N G U A G E P R O C E S S I N G W I T H S PA C Y

Azadeh Mobasher
Principal Data Scientist
Word vectors visualization
Word vectors allow us to understand how words are grouped

Principal Component Analysis (PCA) can project word vectors into a two-dimensional space

NATURAL LANGUAGE PROCESSING WITH SPACY


Word vectors visualization
Import required libraries and a spaCy model.

import spacy
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

nlp = spacy.load("en_core_web_md")

Extract word vectors for a given list of words and stack them vertically.

words = ["wonderful", "horrible",


"apple", "banana", "orange", "watermelon",
"dog", "cat"]
word_vectors = np.vstack([nlp.vocab.vectors[nlp.vocab.strings[w]] for w in words])

NATURAL LANGUAGE PROCESSING WITH SPACY


Word vectors visualization
Extract two principal components using PCA.

pca = PCA(n_components=2)
word_vectors_transformed = pca.fit_transform(word_vectors)

Visualize the scatter plot of transformed vectors.

plt.figure(figsize=(10, 8))
plt.scatter(word_vectors_transformed[:, 0], word_vectors_transformed[:, 1])
for word, coord in zip(words, word_vectors_transformed):
    x, y = coord
    plt.text(x, y, word, size=10)
plt.show()

NATURAL LANGUAGE PROCESSING WITH SPACY


Analogies and vector operations
A semantic relationship between a pair of words.
Word embeddings can capture analogies such as gender and tense:
queen - woman + man = king
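A minimal sketch of this arithmetic with spaCy vectors (assuming en_core_web_md ; the exact neighbours returned depend on the model's vectors):

import numpy as np
import spacy

nlp = spacy.load("en_core_web_md")
vectors = nlp.vocab.vectors

# Build the analogy vector: queen - woman + man
analogy = (vectors[nlp.vocab.strings["queen"]]
           - vectors[nlp.vocab.strings["woman"]]
           + vectors[nlp.vocab.strings["man"]])

# Look up the closest words in the vocabulary to the analogy vector
keys, _, _ = vectors.most_similar(np.asarray([analogy]), n=3)
print([nlp.vocab.strings[key] for key in keys[0]])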

NATURAL LANGUAGE PROCESSING WITH SPACY


Similar words in a vocabulary
spaCy can find semantically similar terms to a given term

import numpy as np
import spacy
nlp = spacy.load("en_core_web_md")

word = "covid"
most_similar_words = nlp.vocab.vectors.most_similar(
np.asarray([nlp.vocab.vectors[nlp.vocab.strings[word]]]), n=5)

words = [nlp.vocab.strings[w] for w in most_similar_words[0][0]]


print(words)

>>> ['Covi', 'CoVid', 'Covici', 'COVID-19', 'corona']

NATURAL LANGUAGE PROCESSING WITH SPACY


Let's practice!
N AT U R A L L A N G U A G E P R O C E S S I N G W I T H S PA C Y
Measuring semantic
similarity with
spaCy
N AT U R A L L A N G U A G E P R O C E S S I N G W I T H S PA C Y

Azadeh Mobasher
Principal Data Scientist
The semantic similarity method

Process of analyzing texts to identify similarities

Categorizes texts into predefined categories or detects relevant texts


Similarity score measures how similar two pieces of text are

What is the cheapest flight from Boston to Seattle?


Which airline serves Denver, Pittsburgh and Atlanta?
What kinds of planes are used by American Airlines?

NATURAL LANGUAGE PROCESSING WITH SPACY


Similarity score
A metric defined over texts

To measure similarity, use cosine similarity and word vectors

Cosine similarity ranges from -1 to 1; scores near 1 indicate highly similar texts
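As an illustration, the score can be computed directly from two word vectors (a sketch, assuming en_core_web_md ; for in-vocabulary words this matches Token.similarity ):

import numpy as np
import spacy

nlp = spacy.load("en_core_web_md")
v1 = nlp.vocab["pizza"].vector
v2 = nlp.vocab["pasta"].vector

# Cosine similarity: dot product divided by the product of the vector norms
score = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
print(round(float(score), 3))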

NATURAL LANGUAGE PROCESSING WITH SPACY


Token similarity
spaCy calculates similarity scores between Token objects

nlp = spacy.load("en_core_web_md")
doc1 = nlp("We eat pizza")
doc2 = nlp("We like to eat pasta")
token1 = doc1[2]
token2 = doc2[4]
print(f"Similarity between {token1} and {token2} = ", round(token1.similarity(token2), 3))

>>> Similarity between pizza and pasta = 0.685

NATURAL LANGUAGE PROCESSING WITH SPACY


Span similarity
spaCy calculates semantic similarity of two given Span objects

doc1 = nlp("We eat pizza")


doc2 = nlp("We like to eat pasta")

span1 = doc1[1:]
span2 = doc2[1:]
print(f"Similarity between \"{span1}\" and \"{span2}\" = ",
round(span1.similarity(span2), 3))

>>> Similarity between "eat pizza" and "like to eat pasta" = 0.588

print(f"Similarity between \"{doc1[1:]}\" and \"{doc2[3:]}\" = ",


round(doc1[1:].similarity(doc2[3:]), 3))

>>> Similarity between "eat pizza" and "eat pasta" = 0.936

NATURAL LANGUAGE PROCESSING WITH SPACY


Doc similarity
spaCy calculates the similarity scores between two documents

nlp = spacy.load("en_core_web_md")

doc1 = nlp("I like to play basketball")


doc2 = nlp("I love to play basketball")
print("Similarity score :", round(doc1.similarity(doc2), 3))

>>> Similarity score : 0.975

High cosine similarity shows highly semantically similar contents

Doc vectors default to an average of word vectors
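We can verify this averaging behaviour directly (a minimal sketch; it assumes every token is in the model's vocabulary):

import numpy as np

doc = nlp("I like to play basketball")

# The Doc vector should equal the mean of its token vectors
token_mean = np.mean([token.vector for token in doc], axis=0)
print(np.allclose(doc.vector, token_mean))

>>> True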

NATURAL LANGUAGE PROCESSING WITH SPACY


Sentence similarity
spaCy can find content relevant to a given keyword

Finding similar customer questions to the word price:

sentences = nlp("What is the cheapest flight from Boston to Seattle? "
                "Which airline serves Denver, Pittsburgh and Atlanta? "
                "What kinds of planes are used by American Airlines?")

keyword = nlp("price")
for i, sentence in enumerate(sentences.sents):
    print(f"Similarity score with sentence {i+1}: ", round(sentence.similarity(keyword), 5))

>>> Similarity score with sentence 1: 0.26136


Similarity score with sentence 2: 0.14021
Similarity score with sentence 3: 0.13885

NATURAL LANGUAGE PROCESSING WITH SPACY


spaCy pipelines
N AT U R A L L A N G U A G E P R O C E S S I N G W I T H S PA C Y

Azadeh Mobasher
Principal Data Scientist
spaCy pipelines

spaCy first tokenizes the text to produce a Doc object

The Doc is then processed in several different steps of the processing pipeline

import spacy
nlp = spacy.load("en_core_web_sm")

doc = nlp(example_text)  # example_text holds the raw input string

NATURAL LANGUAGE PROCESSING WITH SPACY


spaCy pipelines
A pipeline is a sequence of pipes, or actors on data

A spaCy NER pipeline:


Tokenization
Named entity identification

Named entity classification

print([ent.text for ent in doc.ents])
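The pipes applied to each Doc can be listed by name (a quick check; the component names shown are typical for en_core_web_sm and vary by model and spaCy version):

# Pipes applied to each Doc, in order
print(nlp.pipe_names)

>>> ['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']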

NATURAL LANGUAGE PROCESSING WITH SPACY


Adding pipes

sentencizer : spaCy pipeline component for sentence segmentation.

text = " ".join(["This is a test sentence."]*10000)


en_core_sm_nlp = spacy.load("en_core_web_sm")
start_time = time.time()
doc = en_core_sm_nlp(text)
print(f"Finished processing with en_core_web_sm model in
{round((time.time() - start_time)/60.0 , 5)} minutes")

>>> Finished processing with en_core_web_sm model in 0.09332 minutes

NATURAL LANGUAGE PROCESSING WITH SPACY


Adding pipes

Create a blank model and add a sentencizer pipe:

blank_nlp = spacy.blank("en")
blank_nlp.add_pipe("sentencizer")
start_time = time.time()
doc = blank_nlp(text)
print(f"Finished processing with blank model in "
      f"{round((time.time() - start_time)/60.0, 5)} minutes")

>>> Finished processing with blank model in 0.00091 minutes

NATURAL LANGUAGE PROCESSING WITH SPACY


Analyzing pipeline components
nlp.analyze_pipes() analyzes a spaCy pipeline to determine:
Attributes that pipeline components set

Scores a component produces during training

Presence of all required attributes

Setting pretty to True will print a table instead of only returning the structured data.

import spacy

nlp = spacy.load("en_core_web_sm")
analysis = nlp.analyze_pipes(pretty=True)

NATURAL LANGUAGE PROCESSING WITH SPACY


Analyzing pipeline components

NATURAL LANGUAGE PROCESSING WITH SPACY


Let's practice!
N AT U R A L L A N G U A G E P R O C E S S I N G W I T H S PA C Y
spaCy EntityRuler
N AT U R A L L A N G U A G E P R O C E S S I N G W I T H S PA C Y

Azadeh Mobasher
Principal Data Scientist
spaCy EntityRuler

EntityRuler adds named-entities to a Doc container

It can be used on its own or combined with EntityRecognizer


Phrase entity patterns for exact string matches (string):

{"label": "ORG", "pattern": "Microsoft"}

Token entity patterns with one dictionary describing one token (list):

{"label": "GPE", "pattern": [{"LOWER": "san"}, {"LOWER": "francisco"}]}

NATURAL LANGUAGE PROCESSING WITH SPACY


Adding EntityRuler to spaCy pipeline

Using .add_pipe() method

List of patterns can be added using .add_patterns() method

nlp = spacy.blank("en")
entity_ruler = nlp.add_pipe("entity_ruler")
patterns = [{"label": "ORG", "pattern": "Microsoft"},
{"label": "GPE", "pattern": [{"LOWER": "san"}, {"LOWER": "francisco"}]}]
entity_ruler.add_patterns(patterns)

NATURAL LANGUAGE PROCESSING WITH SPACY


Adding EntityRuler to spaCy pipeline

doc.ents stores the entities added by the EntityRuler component

doc = nlp("Microsoft is hiring software developer in San Francisco.")


print([(ent.text, ent.label_) for ent in doc.ents])

[('Microsoft', 'ORG'), ('San Francisco', 'GPE')]

NATURAL LANGUAGE PROCESSING WITH SPACY


EntityRuler in action

Integrates with spaCy pipeline components

Enhances the named-entity recognizer


spaCy model without EntityRuler :

nlp = spacy.load("en_core_web_sm")

doc = nlp("Manhattan associates is a company in the U.S.")


print([(ent.text, ent.label_) for ent in doc.ents])

>>> [('Manhattan', 'GPE'), ('U.S.', 'GPE')]

NATURAL LANGUAGE PROCESSING WITH SPACY


EntityRuler in action

EntityRuler added after existing ner component:

nlp = spacy.load("en_core_web_sm")
ruler = nlp.add_pipe("entity_ruler", after='ner')
patterns = [{"label": "ORG", "pattern": [{"lower": "manhattan"}, {"lower": "associates"}]}]
ruler.add_patterns(patterns)

doc = nlp("Manhattan associates is a company in the U.S.")


print([(ent.text, ent.label_) for ent in doc.ents])

>>> [('Manhattan', 'GPE'), ('U.S.', 'GPE')]

NATURAL LANGUAGE PROCESSING WITH SPACY


EntityRuler in action

EntityRuler added before existing ner component:

nlp = spacy.load("en_core_web_sm")
ruler = nlp.add_pipe("entity_ruler", before='ner')
patterns = [{"label": "ORG", "pattern": [{"lower": "manhattan"}, {"lower": "associates"}]}]
ruler.add_patterns(patterns)

doc = nlp("Manhattan associates is a company in the U.S.")


print([(ent.text, ent.label_) for ent in doc.ents])

>>> [('Manhattan associates', 'ORG'), ('U.S.', 'GPE')]

NATURAL LANGUAGE PROCESSING WITH SPACY


Let's practice!
N AT U R A L L A N G U A G E P R O C E S S I N G W I T H S PA C Y
RegEx with spaCy
N AT U R A L L A N G U A G E P R O C E S S I N G W I T H S PA C Y

Azadeh Mobasher
Principal Data Scientist
What is RegEx?

Rule-based information extraction (IE) is useful for many NLP tasks

Regular expressions (RegEx) are used for complex string matching patterns

RegEx finds and retrieves patterns or replaces matching patterns

NATURAL LANGUAGE PROCESSING WITH SPACY


RegEx strengths and weaknesses
Pros:

Enables writing robust rules to retrieve information

Can allow us to find many types of variance in strings

Runs fast

Supported by many programming languages

Cons:

Syntax is challenging for beginners

Requires knowledge of all the ways a pattern may be mentioned in texts

NATURAL LANGUAGE PROCESSING WITH SPACY


RegEx in Python

Python comes prepackaged with a RegEx library, re .

The first step in using the re package is to define a pattern .


The resulting pattern is used to find matching content.

import re

pattern = r"((\d){3}-(\d){3}-(\d){4})"
text = "Our phone number is 832-123-5555 and their phone number is 425-123-4567."

NATURAL LANGUAGE PROCESSING WITH SPACY


RegEx in Python

We use .finditer() method from re package

iter_matches = re.finditer(pattern, text)

for match in iter_matches:
    start_char = match.start()
    end_char = match.end()
    print("Start character: ", start_char, "| End character: ", end_char,
          "| Matching text: ", text[start_char:end_char])

>>> Start character: 20 | End character: 32 | Matching text: 832-123-5555


Start character: 59 | End character: 71 | Matching text: 425-123-4567

NATURAL LANGUAGE PROCESSING WITH SPACY


RegEx in spaCy
RegEx in three pipeline components: Matcher , PhraseMatcher and EntityRuler .

text = "Our phone number is 832-123-5555 and their phone number is 425-123-4567."
nlp = spacy.blank("en")
patterns = [{"label": "PHONE_NUMBER", "pattern": [{"SHAPE": "ddd"},
{"ORTH": "-"}, {"SHAPE": "ddd"},
{"ORTH": "-"}, {"SHAPE": "dddd"}]}]
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns(patterns)
doc = nlp(text)
print ([(ent.text, ent.label_) for ent in doc.ents])

>>> [('832-123-5555', 'PHONE_NUMBER'), ('425-123-4567', 'PHONE_NUMBER')]

NATURAL LANGUAGE PROCESSING WITH SPACY


Let's practice!
N AT U R A L L A N G U A G E P R O C E S S I N G W I T H S PA C Y
spaCy Matcher and
PhraseMatcher
N AT U R A L L A N G U A G E P R O C E S S I N G W I T H S PA C Y

Azadeh Mobasher
Principal Data Scientist
Matcher in spaCy

RegEx patterns can be complex, difficult to read and debug.

spaCy provides a readable and production-level alternative, the Matcher class.

import spacy
from spacy.matcher import Matcher
nlp = spacy.load("en_core_web_sm")
doc = nlp("Good morning, this is our first day on campus.")
matcher = Matcher(nlp.vocab)

NATURAL LANGUAGE PROCESSING WITH SPACY


Matcher in spaCy

Matching output includes the start and end token indices of the matched pattern.

pattern = [{"LOWER": "good"}, {"LOWER": "morning"}]


matcher.add("morning_greeting", [pattern])
matches = matcher(doc)
for match_id, start, end in matches:
print("Start token: ", start, " | End token: ", end,
"| Matched text: ", doc[start:end].text)

>>> Start token: 0 | End token: 2 | Matched text: Good morning

NATURAL LANGUAGE PROCESSING WITH SPACY


Matcher extended syntax support

Allows operators in defining the matching patterns.

Similar operators to Python's in , not in and comparison operators

Attribute Value type Description


IN any type Attribute value is a member of a list

NOT_IN any type Attribute value is not a member of a list

== , >= , <= , > , < int, float Comparison operators for equality or inequality checks

NATURAL LANGUAGE PROCESSING WITH SPACY


Matcher extended syntax support
Using IN operator to match both good morning and good evening

doc = nlp("Good morning and good evening.")


matcher = Matcher(nlp.vocab)
pattern = [{"LOWER": "good"}, {"LOWER": {"IN": ["morning", "evening"]}}]
matcher.add("morning_greeting", [pattern])
matches = matcher(doc)

The output of matching using IN operator

for match_id, start, end in matches:
    print("Start token: ", start, " | End token: ", end,
          "| Matched text: ", doc[start:end].text)

>>> Start token: 0 | End token: 2 | Matched text: Good morning


Start token: 3 | End token: 5 | Matched text: good evening

NATURAL LANGUAGE PROCESSING WITH SPACY


PhraseMatcher in spaCy

PhraseMatcher class matches a long list of phrases in a given text.

from spacy.matcher import PhraseMatcher


nlp = spacy.load("en_core_web_sm")
matcher = PhraseMatcher(nlp.vocab)
terms = ["Bill Gates", "John Smith"]

NATURAL LANGUAGE PROCESSING WITH SPACY


PhraseMatcher in spaCy
PhraseMatcher outputs include start and end token indices of the matched pattern

patterns = [nlp.make_doc(term) for term in terms]

matcher.add("PeopleOfInterest", patterns)
doc = nlp("Bill Gates met John Smith for an important discussion "
          "regarding the importance of AI.")
matches = matcher(doc)
for match_id, start, end in matches:
    print("Start token: ", start, " | End token: ", end,
          "| Matched text: ", doc[start:end].text)

>>> Start token: 0 | End token: 2 | Matched text: Bill Gates


Start token: 3 | End token: 5 | Matched text: John Smith

NATURAL LANGUAGE PROCESSING WITH SPACY


PhraseMatcher in spaCy
We can use the attr argument of the PhraseMatcher class to match on other token attributes

matcher = PhraseMatcher(nlp.vocab, attr = "LOWER")


terms = ["Government", "Investment"]
patterns = [nlp.make_doc(term) for term in terms]
matcher.add("InvestmentTerms", patterns)
doc = nlp("It was interesting to the investment division of the government.")

matcher = PhraseMatcher(nlp.vocab, attr = "SHAPE")


terms = ["110.0.0.0", "101.243.0.0"]
patterns = [nlp.make_doc(term) for term in terms]
matcher.add("IPAddresses", patterns)
doc = nlp("The tracked IP address was 234.135.0.0.")

NATURAL LANGUAGE PROCESSING WITH SPACY


Let's practice!
N AT U R A L L A N G U A G E P R O C E S S I N G W I T H S PA C Y
Customizing spaCy
models
N AT U R A L L A N G U A G E P R O C E S S I N G W I T H S PA C Y

Azadeh Mobasher
Principal data scientist
Why train spaCy models?
spaCy models go a long way for general NLP use cases
But they may not have seen data from specific domains during training, e.g.
Twitter data

Medical data

NATURAL LANGUAGE PROCESSING WITH SPACY


Why train spaCy models?

Better results on your specific domain

Essential for domain specific text classification

Before starting to train, ask the following questions:

Do spaCy models perform well enough on our data?

Does our domain include many labels that are absent in spaCy models?

NATURAL LANGUAGE PROCESSING WITH SPACY


Models performance on our data
Do spaCy models perform well enough on our data?
Oxford Street is not correctly classified with a GPE label:

import spacy
nlp = spacy.load("en_core_web_sm")

text = "The car was navigating to the Oxford Street."


doc = nlp(text)
print([(ent.text, ent.label_) for ent in doc.ents])

[('the Oxford Street', 'ORG')]

NATURAL LANGUAGE PROCESSING WITH SPACY


Output labels in spaCy models
Does our domain include many labels that are absent in spaCy models?

NATURAL LANGUAGE PROCESSING WITH SPACY


Output labels in spaCy models

If we need custom model training, we follow these steps:

Collect our domain specific data

Annotate our data

Decide whether to update an existing model or train a model from scratch

NATURAL LANGUAGE PROCESSING WITH SPACY


Let's practice!
N AT U R A L L A N G U A G E P R O C E S S I N G W I T H S PA C Y
Training data
preparation
N AT U R A L L A N G U A G E P R O C E S S I N G W I T H S PA C Y

Azadeh Mobasher
Principal data scientist
Training steps

1. Annotate and prepare input data

2. Initialize the model weights


3. Predict a few examples with the current weights

4. Compare prediction with correct answers

5. Use optimizer to calculate weights that improve model performance

6. Update weights slightly

7. Go back to step 3.

NATURAL LANGUAGE PROCESSING WITH SPACY


Annotating and preparing data
The first step is to prepare training data in the required format

After collecting data, we annotate it

Annotation means labeling the intent, entities, etc.

This is an example of annotated data:

annotated_data = {
    "sentence": "Antiviral drugs used against influenza include neuraminidase inhibitors.",
    "entities": {
        "label": "Medicine",
        "value": "neuraminidase inhibitors",
    }
}

NATURAL LANGUAGE PROCESSING WITH SPACY


Annotating and preparing data
Here's another example of annotated data:

annotated_data = {
    "sentence": "Bill Gates visited the SFO Airport.",
    "entities": [{"label": "PERSON", "value": "Bill Gates"},
                 {"label": "LOC", "value": "SFO Airport"}]
}

NATURAL LANGUAGE PROCESSING WITH SPACY


spaCy training data format
Data annotation prepares training data for what we want the model to learn
Training dataset has to be stored as a dictionary:

training_data = [
    ("I will visit you in Austin.", {"entities": [(20, 26, "GPE")]}),
    ("I'm going to Sam's house.", {"entities": [(13, 18, "PERSON"), (19, 24, "GPE")]}),
    ("I will go.", {"entities": []})
]

Three example pairs:

Each example pair includes a sentence as the first element

The pair's second element is a dictionary of annotated entities with their start and end character offsets
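The character offsets can be sanity-checked by slicing the sentence (a small sketch using the first example pair above):

text, annotations = training_data[0]
start, end, label = annotations["entities"][0]

# Slicing with the start and end offsets recovers the entity text
print(text[start:end], "->", label)

>>> Austin -> GPE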

NATURAL LANGUAGE PROCESSING WITH SPACY


Example object data for training
We cannot feed the raw text directly to spaCy
We need to create an Example object for each training example

import spacy
from spacy.training import Example

nlp = spacy.load("en_core_web_sm")

doc = nlp("I will visit you in Austin.")


annotations = {"entities": [(20, 26, "GPE")]}

example_sentence = Example.from_dict(doc, annotations)


print(example_sentence.to_dict())

NATURAL LANGUAGE PROCESSING WITH SPACY


Let's practice!
N AT U R A L L A N G U A G E P R O C E S S I N G W I T H S PA C Y
Training with spaCy
N AT U R A L L A N G U A G E P R O C E S S I N G W I T H S PA C Y

Azadeh Mobasher
Principal Data Scientist
Training steps

1. Annotate and prepare input data

2. Disable other pipeline components


3. Train a model for a few epochs

4. Evaluate model performance

NATURAL LANGUAGE PROCESSING WITH SPACY


Disabling other pipeline components

Disable all pipeline components except NER:

other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']

nlp.disable_pipes(*other_pipes)
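In spaCy v3, the same effect is available through select_pipes , which also restores the other pipes automatically (a minimal sketch, assuming a v3 pipeline):

# Temporarily run with only the NER component enabled;
# all other pipes are restored when the block exits
with nlp.select_pipes(enable="ner"):
    # training loop goes here
    pass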

NATURAL LANGUAGE PROCESSING WITH SPACY


Model training procedure
Go over the training set several times; one iteration is called an epoch .
In each epoch, update the weights of the model by a small amount.

Optimizers update the model weights.

import random

optimizer = nlp.create_optimizer()

losses = {}
epochs = 10  # number of passes over the training data
for i in range(epochs):
    random.shuffle(training_data)
    for text, annotation in training_data:
        doc = nlp.make_doc(text)
        example = Example.from_dict(doc, annotation)
        nlp.update([example], sgd=optimizer, losses=losses)

NATURAL LANGUAGE PROCESSING WITH SPACY


Save and load a trained model

Save a trained NER model:

ner = nlp.get_pipe("ner")
ner.to_disk("<ner model name>")

Load the saved model:

ner = nlp.create_pipe("ner")
ner.from_disk("<ner model name>")
nlp.add_pipe(ner, "<ner model name>")

NATURAL LANGUAGE PROCESSING WITH SPACY


Model for inference

Use a saved model at inference.

Apply NER model and store tuples of (entity text, entity label):

doc = nlp(text)
entities = [(ent.text, ent.label_) for ent in doc.ents]

NATURAL LANGUAGE PROCESSING WITH SPACY


Let's practice!
N AT U R A L L A N G U A G E P R O C E S S I N G W I T H S PA C Y
Wrap-up
N AT U R A L L A N G U A G E P R O C E S S I N G W I T H S PA C Y

Azadeh Mobasher
Principal data scientist
Chapter 1 - Introduction to NLP and spaCy

Use spaCy 's text processing pipelines to extract linguistic features:

NATURAL LANGUAGE PROCESSING WITH SPACY


Chapter 2 - spaCy linguistic annotations and word
vectors
Work with spaCy 's classes such as Doc , Token and Span and predict semantic similarities
using word vectors:

NATURAL LANGUAGE PROCESSING WITH SPACY


Chapter 3 - Data analysis with spaCy
Write matching patterns to extract terms and phrases using spaCy 's Matcher and
PhraseMatcher :

matcher = Matcher(nlp.vocab)
pattern = [{"LOWER": "good"}, {"LOWER": {"IN": ["morning", "evening"]}}]
matcher.add("morning_greeting", [pattern])

matcher = PhraseMatcher(nlp.vocab, attr = "LOWER")


patterns = [nlp.make_doc(term) for term in terms]
matcher.add("InvestmentTerms", patterns)

NATURAL LANGUAGE PROCESSING WITH SPACY


Chapter 4 - Customizing spaCy models

Annotate and prepare our data for training

Train spaCy models and use them at inference time

NATURAL LANGUAGE PROCESSING WITH SPACY


Recommended resources

Introduction to Deep Learning in Python

Introduction to Deep Learning with PyTorch


Introduction to ChatGPT

NATURAL LANGUAGE PROCESSING WITH SPACY


Congratulations!
N AT U R A L L A N G U A G E P R O C E S S I N G W I T H S PA C Y
Introduction to
audio data in
Python
SPOKEN LANGUAGE PROCESSING IN PYTHON

Daniel Bourke
Machine Learning Engineer/YouTube
Creator
Dealing with audio files in Python
Different kinds of audio files
mp3

wav

m4a

flac

Digital sound is measured in frequency (kHz)


1 kHz = 1000 pieces of information per second

SPOKEN LANGUAGE PROCESSING IN PYTHON


Frequency examples
Streaming songs have a frequency of 32 kHz

Audiobooks and spoken language are between 8 and 16 kHz

We can't see audio files so we have to transform them first

import wave

SPOKEN LANGUAGE PROCESSING IN PYTHON


Opening an audio file in Python
Audio file saved as good-morning.wav

# Import audio file as wave object


good_morning = wave.open("good-morning.wav", "r")

# Convert wave object to bytes


good_morning_soundwave = good_morning.readframes(-1)

# View the wav file in byte form


good_morning_soundwave

b'\xfd\xff\xfb\xff\xf8\xff\xf8\xff\xf7\...

SPOKEN LANGUAGE PROCESSING IN PYTHON


Working with audio is different
Have to convert the audio to something useful

Small sample of audio = large amount of information
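As a rough illustration (an illustrative estimate, assuming mono, 16-bit samples at 48 kHz):

# Bytes per minute = sample rate * bytes per sample * seconds
bytes_per_minute = 48_000 * 2 * 60
print(f"{bytes_per_minute / 1_000_000:.2f} MB per minute")

>>> 5.76 MB per minute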

SPOKEN LANGUAGE PROCESSING IN PYTHON


Let's practice!
SPOKEN LANGUAGE PROCESSING IN PYTHON
Converting sound
wave bytes to
integers
SPOKEN LANGUAGE PROCESSING IN PYTHON

Daniel Bourke
Machine Learning Engineer/YouTube
Creator
Converting bytes to integers
Can't use bytes

Convert bytes to integers using numpy

import numpy as np
# Convert soundwave_gm from bytes to integers
signal_gm = np.frombuffer(soundwave_gm, dtype='int16')
# Show the first 10 items
signal_gm[:10]

array([ -3, -5, -8, -8, -9, -13, -8, -10, -9, -11], dtype=int16)

SPOKEN LANGUAGE PROCESSING IN PYTHON


Finding the frame rate
Frequency (Hz) = length of wave object array/duration of audio file (seconds)

# Get the frame rate


framerate_gm = good_morning.getframerate()
# Show the frame rate
framerate_gm

48000

Duration of audio file (seconds) = length of wave object array/frequency (Hz)
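Putting the two formulas together (a minimal sketch, assuming signal_gm and framerate_gm from the previous slides):

# Duration in seconds = number of samples / sample rate
duration_gm = len(signal_gm) / framerate_gm
print(round(duration_gm, 2))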

SPOKEN LANGUAGE PROCESSING IN PYTHON


Finding sound wave timestamps
# Return evenly spaced values between start and stop
np.linspace(start=1, stop=10, num=10)

array([ 1., 2., 3., 4., 5., 6., 7., 8., 9., 10.])

# Get the timestamps of the good morning sound wave


time_gm = np.linspace(start=0,
                      stop=len(signal_gm)/framerate_gm,
                      num=len(signal_gm))

SPOKEN LANGUAGE PROCESSING IN PYTHON


Finding sound wave timestamps
# View first 10 time stamps of good morning sound wave
time_gm[:10]

array([0.00000000e+00, 2.08334167e-05, 4.16668333e-05, 6.25002500e-05,


8.33336667e-05, 1.04167083e-04, 1.25000500e-04, 1.45833917e-04,
1.66667333e-04, 1.87500750e-04])

SPOKEN LANGUAGE PROCESSING IN PYTHON


Let's practice!
SPOKEN LANGUAGE PROCESSING IN PYTHON
Visualizing sound
waves
SPOKEN LANGUAGE PROCESSING IN PYTHON

Daniel Bourke
Machine Learning Engineer/YouTube
Creator
Adding another sound wave
New audio file: good_afternoon.wav

Both are 48 kHz

Apply the same data transformations to all audio files

SPOKEN LANGUAGE PROCESSING IN PYTHON


Setting up a plot
import matplotlib.pyplot as plt
# Initialize figure and setup title
plt.title("Good Afternoon vs. Good Morning")
# x and y axis labels
plt.xlabel("Time (seconds)")
plt.ylabel("Amplitude")
# Add good morning and good afternoon values
plt.plot(time_ga, soundwave_ga, label="Good Afternoon")
plt.plot(time_gm, soundwave_gm, label="Good Morning",
         alpha=0.5)
# Create a legend and show our plot
plt.legend()
plt.show()

SPOKEN LANGUAGE PROCESSING IN PYTHON


SPOKEN LANGUAGE PROCESSING IN PYTHON
Time to visualize!
SPOKEN LANGUAGE PROCESSING IN PYTHON
SpeechRecognition
Python library
SPOKEN LANGUAGE PROCESSING IN PYTHON

Daniel Bourke
Machine Learning Engineer/YouTube
Creator
Why the SpeechRecognition library?
Some existing python libraries

CMU Sphinx

Kaldi

SpeechRecognition

Wav2letter++ by Facebook

SPOKEN LANGUAGE PROCESSING IN PYTHON


Getting started with SpeechRecognition
Install from PyPi:

$ pip install SpeechRecognition

Compatible with Python 2 and 3

We'll use Python 3

SPOKEN LANGUAGE PROCESSING IN PYTHON


Using the Recognizer class
# Import the SpeechRecognition library
import speech_recognition as sr
# Create an instance of Recognizer
recognizer = sr.Recognizer()
# Set the energy threshold
recognizer.energy_threshold = 300

SPOKEN LANGUAGE PROCESSING IN PYTHON


Using the Recognizer class to recognize speech
Recognizer class has built-in functions which interact with speech APIs
recognize_bing()

recognize_google()

recognize_google_cloud()

recognize_wit()

Input: audio_file

Output: transcribed speech from audio_file

SPOKEN LANGUAGE PROCESSING IN PYTHON


SpeechRecognition Example
Focus on recognize_google()

Recognize speech from an audio file with SpeechRecognition:

# Import SpeechRecognition library


import speech_recognition as sr
# Instantiate Recognizer class
recognizer = sr.Recognizer()
# Transcribe speech using Google web API
recognizer.recognize_google(audio_data=audio_file,
                            language="en-US")

Learning speech recognition on DataCamp is awesome!

SPOKEN LANGUAGE PROCESSING IN PYTHON


Your turn!
SPOKEN LANGUAGE PROCESSING IN PYTHON
Reading audio files
with
SpeechRecognition
SPOKEN LANGUAGE PROCESSING IN PYTHON

Daniel Bourke
Machine Learning Engineer/YouTube
Creator
The AudioFile class
import speech_recognition as sr
# Setup recognizer instance
recognizer = sr.Recognizer()
# Read in audio file
clean_support_call = sr.AudioFile("clean-support-call.wav")
# Check type of clean_support_call
type(clean_support_call)

<class 'speech_recognition.AudioFile'>

SPOKEN LANGUAGE PROCESSING IN PYTHON


From AudioFile to AudioData
recognizer.recognize_google(audio_data=clean_support_call)

AssertionError: ``audio_data`` must be audio data

# Convert from AudioFile to AudioData
with clean_support_call as source:
    # Record the audio
    clean_support_call_audio = recognizer.record(source)

# Check the type
type(clean_support_call_audio)

<class 'speech_recognition.AudioData'>

SPOKEN LANGUAGE PROCESSING IN PYTHON


Transcribing our AudioData
# Transcribe clean support call
recognizer.recognize_google(audio_data=clean_support_call_audio)

hello I'd like to get some help setting up my account please

SPOKEN LANGUAGE PROCESSING IN PYTHON


Duration and offset
duration and offset both None by default

# Leave duration and offset as default
with clean_support_call as source:
    clean_support_call_audio = recognizer.record(source,
                                                 duration=None,
                                                 offset=None)

# Get first 2-seconds of clean support call
with clean_support_call as source:
    clean_support_call_audio = recognizer.record(source,
                                                 duration=2.0)

hello I'd like to get
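Similarly, offset skips the start of the file (a sketch using the same audio; the earlier transcription suggests what would be dropped):

# Skip the first 2 seconds of the recording
with clean_support_call as source:
    rest_of_call_audio = recognizer.record(source, offset=2.0)

# Transcribe everything after the skipped portion
recognizer.recognize_google(audio_data=rest_of_call_audio)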

SPOKEN LANGUAGE PROCESSING IN PYTHON


Let's practice!
SPOKEN LANGUAGE PROCESSING IN PYTHON
Dealing with
different kinds of
audio
SPOKEN LANGUAGE PROCESSING IN PYTHON

Daniel Bourke
Machine Learning Engineer/YouTube
Creator
What language?
# Create a recognizer class
recognizer = sr.Recognizer()
# Pass the Japanese audio to recognize_google
text = recognizer.recognize_google(japanese_good_morning,
language="en-US")
# Print the text
print(text)

Ohio gozaimasu

SPOKEN LANGUAGE PROCESSING IN PYTHON


What language?
# Create a recognizer class
recognizer = sr.Recognizer()
# Pass the Japanese audio to recognize_google
text = recognizer.recognize_google(japanese_good_morning,
language="ja")
# Print the text
print(text)

おはようございます

SPOKEN LANGUAGE PROCESSING IN PYTHON


Non-speech audio
# Import the leopard roar audio file
leopard_roar = sr.AudioFile("leopard_roar.wav")
# Convert the AudioFile to AudioData
with leopard_roar as source:
    leopard_roar_audio = recognizer.record(source)

# Recognize the AudioData
recognizer.recognize_google(leopard_roar_audio)

UnknownValueError:

SPOKEN LANGUAGE PROCESSING IN PYTHON


Non-speech audio
# Import the leopard roar audio file
leopard_roar = sr.AudioFile("leopard_roar.wav")
# Convert the AudioFile to AudioData
with leopard_roar as source:
    leopard_roar_audio = recognizer.record(source)

# Recognize the AudioData with show_all turned on
recognizer.recognize_google(leopard_roar_audio,
                            show_all=True)

[]

SPOKEN LANGUAGE PROCESSING IN PYTHON


Showing all
# Recognizing Japanese audio with show_all=True
text = recognizer.recognize_google(japanese_good_morning,
language="en-US",
show_all=True)
# Print the text
print(text)

{'alternative': [{'transcript': 'Ohio gozaimasu', 'confidence': 0.89041114},


{'transcript': 'all hail gozaimasu'},
{'transcript': 'ohayo gozaimasu'},
{'transcript': 'olho gozaimasu'},
{'transcript': 'all Hale gozaimasu'}],
'final': True}

SPOKEN LANGUAGE PROCESSING IN PYTHON


Multiple speakers
# Import an audio file with multiple speakers
multiple_speakers = sr.AudioFile("multiple-speakers.wav")
# Convert AudioFile to AudioData
with multiple_speakers as source:
    multiple_speakers_audio = recognizer.record(source)

# Recognize the AudioData
recognizer.recognize_google(multiple_speakers_audio)

one of the limitations of the speech recognition library is that it doesn't


recognise different speakers and voices it will just return it all as one block
of text

SPOKEN LANGUAGE PROCESSING IN PYTHON


Multiple speakers
# Import audio files separately
speakers = [sr.AudioFile("s0.wav"), sr.AudioFile("s1.wav"), sr.AudioFile("s2.wav")]

# Transcribe each speaker individually
for i, speaker in enumerate(speakers):
    with speaker as source:
        speaker_audio = recognizer.record(source)
    print(f"Text from speaker {i}: {recognizer.recognize_google(speaker_audio)}")

Text from speaker 0: one of the limitations of the speech recognition library
Text from speaker 1: is that it doesn't recognise different speakers and voices
Text from speaker 2: it will just return it all as one block a text

SPOKEN LANGUAGE PROCESSING IN PYTHON


Noisy audio
If you have trouble hearing the speech, so will the APIs

# Import audio file with background noise
noisy_support_call = sr.AudioFile("noisy_support_call.wav")
with noisy_support_call as source:
    # Adjust for ambient noise and record
    recognizer.adjust_for_ambient_noise(source,
                                        duration=0.5)
    noisy_support_call_audio = recognizer.record(source)
# Recognize the audio
recognizer.recognize_google(noisy_support_call_audio)

hello ID like to get some help setting up my calories

SPOKEN LANGUAGE PROCESSING IN PYTHON


Let's practice!
SPOKEN LANGUAGE PROCESSING IN PYTHON
Introduction to
PyDub
SPOKEN LANGUAGE PROCESSING IN PYTHON

Daniel Bourke
Machine Learning Engineer/YouTube
Creator
Installing PyDub
$ pip install pydub

If using files other than .wav , install ffmpeg via ffmpeg.org

SPOKEN LANGUAGE PROCESSING IN PYTHON


PyDub's main class, AudioSegment
# Import PyDub main class
from pydub import AudioSegment

# Import an audio file


wav_file = AudioSegment.from_file(file="wav_file.wav", format="wav")

# Format parameter only for readability


wav_file = AudioSegment.from_file(file="wav_file.wav")

type(wav_file)

pydub.audio_segment.AudioSegment

SPOKEN LANGUAGE PROCESSING IN PYTHON


Playing an audio file
# Install simpleaudio for wav playback
$ pip install simpleaudio

# Import play function


from pydub.playback import play

# Import audio file


wav_file = AudioSegment.from_file(file="wav_file.wav")

# Play audio file


play(wav_file)

SPOKEN LANGUAGE PROCESSING IN PYTHON


Audio parameters
# Import audio files
wav_file = AudioSegment.from_file(file="wav_file.wav")
two_speakers = AudioSegment.from_file(file="two_speakers.wav")
# Check number of channels
wav_file.channels, two_speakers.channels

1, 2

wav_file.frame_rate

48000

SPOKEN LANGUAGE PROCESSING IN PYTHON


Audio parameters
# Find the number of bytes per sample
wav_file.sample_width

2

# Find the max amplitude
wav_file.max

8488

SPOKEN LANGUAGE PROCESSING IN PYTHON


Audio parameters
# Duration of audio file in milliseconds
len(wav_file)

3284
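Since PyDub lengths are in milliseconds, converting to seconds is a simple division:

# Duration in seconds
print(len(wav_file) / 1000)

>>> 3.284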

SPOKEN LANGUAGE PROCESSING IN PYTHON


Changing audio parameters
# Change ATTRIBUTENAME of AudioSegment to x
changed_audio_segment = audio_segment.set_ATTRIBUTENAME(x)

# Change sample width to 1
wav_file_width_1 = wav_file.set_sample_width(1)
wav_file_width_1.sample_width

1

SPOKEN LANGUAGE PROCESSING IN PYTHON


Changing audio parameters
# Change sample rate
wav_file_16k = wav_file.set_frame_rate(16000)
wav_file_16k.frame_rate

16000

# Change number of channels
wav_file_1_channel = wav_file.set_channels(1)
wav_file_1_channel.channels

1

SPOKEN LANGUAGE PROCESSING IN PYTHON


Let's practice!
SPOKEN LANGUAGE PROCESSING IN PYTHON
Manipulating audio
files with PyDub
SPOKEN LANGUAGE PROCESSING IN PYTHON

Daniel Bourke
Machine Learning Engineer/YouTube
Creator
Turning it down to 11
# Import audio file
wav_file = AudioSegment.from_file("wav_file.wav")
# Minus 60 dB
quiet_wav_file = wav_file - 60

# Try to recognize quiet audio


recognizer.recognize_google(quiet_wav_file)

UnknownValueError:

SPOKEN LANGUAGE PROCESSING IN PYTHON


Increasing the volume
# Increase the volume by 10 dB
louder_wav_file = wav_file + 10

# Try to recognize
recognizer.recognize_google(louder_wav_file)

this is a wav file

SPOKEN LANGUAGE PROCESSING IN PYTHON


This all sounds the same
# Import AudioSegment and normalize
from pydub import AudioSegment
from pydub.effects import normalize
from pydub.playback import play

# Import uneven sound audio file


loud_quiet = AudioSegment.from_file("loud_quiet.wav")
# Normalize the sound levels
normalized_loud_quiet = normalize(loud_quiet)

# Check the sound


play(normalized_loud_quiet)

SPOKEN LANGUAGE PROCESSING IN PYTHON


Remixing your audio files
# Import audio with static at start
static_at_start = AudioSegment.from_file("static_at_start.wav")

# Remove the static via slicing


no_static_at_start = static_at_start[5000:]

# Check the new sound


play(no_static_at_start)

SPOKEN LANGUAGE PROCESSING IN PYTHON


Remixing your audio files
# Import two audio files
wav_file_1 = AudioSegment.from_file("wav_file_1.wav")
wav_file_2 = AudioSegment.from_file("wav_file_2.wav")

# Combine the two audio files


wav_file_3 = wav_file_1 + wav_file_2

# Check the sound


play(wav_file_3)

# Combine two wav files and make the combination louder


louder_wav_file_3 = wav_file_1 + wav_file_2 + 10

SPOKEN LANGUAGE PROCESSING IN PYTHON


Splitting your audio
# Import phone call audio
phone_call = AudioSegment.from_file("phone_call.wav")
# Find number of channels
phone_call.channels

2

# Split stereo to mono
phone_call_channels = phone_call.split_to_mono()
phone_call_channels

[<pydub.audio_segment.AudioSegment>, <pydub.audio_segment.AudioSegment>]

SPOKEN LANGUAGE PROCESSING IN PYTHON


Splitting your audio
# Find number of channels of first list item
phone_call_channels[0].channels

1

# Recognize the first channel
recognizer.recognize_google(phone_call_channels[0])

the pydub library is really useful

SPOKEN LANGUAGE PROCESSING IN PYTHON


Let's code!
SPOKEN LANGUAGE PROCESSING IN PYTHON
Converting and
saving audio files
with PyDub
SPOKEN LANGUAGE PROCESSING IN PYTHON

Daniel Bourke
Machine Learning Engineer/YouTube
Creator
Exporting audio files
from pydub import AudioSegment

# Import audio file


wav_file = AudioSegment.from_file("wav_file.wav")
# Increase by 10 decibels
louder_wav_file = wav_file + 10
# Export louder audio file
louder_wav_file.export(out_f="louder_wav_file.wav", format="wav")

<_io.BufferedRandom name='louder_wav_file.wav'>

SPOKEN LANGUAGE PROCESSING IN PYTHON


Reformatting and exporting multiple audio files
import os

def make_wav(wrong_folder_path, right_folder_path):
    "Converts .mp3 and .flac files in one folder to .wav files in another."
    # Loop through wrongly formatted files
    for file in os.scandir(wrong_folder_path):
        # Only work with files with audio extensions we're fixing
        if file.path.endswith(".mp3") or file.path.endswith(".flac"):
            # Create the new .wav filename
            out_file = right_folder_path + os.path.splitext(os.path.basename(file.path))[0] + ".wav"
            # Read in the audio file and export it in wav format
            AudioSegment.from_file(file.path).export(out_file, format="wav")
            print(f"Creating {out_file}")

SPOKEN LANGUAGE PROCESSING IN PYTHON


Reformatting and exporting multiple audio files
# Call our new function
make_wav("data/wrong_formats/", "data/right_format/")

Creating data/right_format/wav_file.wav
Creating data/right_format/flac_file.wav
Creating data/right_format/mp3_file.wav

SPOKEN LANGUAGE PROCESSING IN PYTHON


Manipulating and exporting
def make_no_static_louder(static_quiet_folder_path, louder_no_static_folder_path):
    "Removes static from the start of .wav files and makes them louder."
    # Loop through files with static and quiet (already in wav format)
    for file in os.scandir(static_quiet_folder_path):
        # Create new file path
        out_file = louder_no_static_folder_path + os.path.splitext(os.path.basename(file.path))[0] + ".wav"
        # Read the audio file
        audio_file = AudioSegment.from_file(file.path)
        # Remove the first 3.1 seconds, add 10 decibels and export
        audio_file = (audio_file[3100:] + 10).export(out_file, format="wav")
        print(f"Creating {out_file}")

SPOKEN LANGUAGE PROCESSING IN PYTHON


Manipulating and exporting
# Remove static and make louder
make_no_static_louder("data/static_quiet/", "data/louder_no_static/")

Creating data/louder_no_static/speech-recognition-services.wav
Creating data/louder_no_static/order-issue.wav
Creating data/louder_no_static/help-with-acount.wav

SPOKEN LANGUAGE PROCESSING IN PYTHON


Your turn!
SPOKEN LANGUAGE PROCESSING IN PYTHON
Creating
transcription helper
functions
SPOKEN LANGUAGE PROCESSING IN PYTHON

Daniel Bourke
Machine Learning Engineer/YouTube
Creator
Exploring audio files
# Import os module
import os

# Check the folder of audio files


os.listdir("acme_audio_files")

['call_1.mp3',
 'call_2.mp3',
 'call_3.mp3',
 'call_4.mp3']

SPOKEN LANGUAGE PROCESSING IN PYTHON


Preparing for the proof of concept
import speech_recognition as sr
from pydub import AudioSegment
# Import call 1 and convert to .wav
call_1 = AudioSegment.from_file("acme_audio_files/call_1.mp3")
call_1.export("acme_audio_files/call_1.wav", format="wav")
# Transcribe call 1
recognizer = sr.Recognizer()
call_1_file = sr.AudioFile("acme_audio_files/call_1.wav")
with call_1_file as source:
    call_1_audio = recognizer.record(source)
recognizer.recognize_google(call_1_audio)

SPOKEN LANGUAGE PROCESSING IN PYTHON


Functions we'll create
convert_to_wav() converts non- .wav files to .wav files.

show_pydub_stats() shows the audio attributes of a .wav file.

transcribe_audio() uses recognize_google() to transcribe a .wav file.

SPOKEN LANGUAGE PROCESSING IN PYTHON


Creating a file format conversion function
# Create function to convert audio file to wav
def convert_to_wav(filename):
    "Takes an audio file of non .wav format and converts to .wav"
    # Import audio file
    audio = AudioSegment.from_file(filename)
    # Create new filename
    new_filename = filename.split(".")[0] + ".wav"
    # Export file as .wav
    audio.export(new_filename, format="wav")
    print(f"Converting {filename} to {new_filename}...")

SPOKEN LANGUAGE PROCESSING IN PYTHON


Using the file format conversion function
convert_to_wav("acme_studios_audio/call_1.mp3")

Converting acme_audio_files/call_1.mp3 to acme_audio_files/call_1.wav...

SPOKEN LANGUAGE PROCESSING IN PYTHON


Creating an attribute showing function
def show_pydub_stats(filename):
    "Returns different audio attributes related to an audio file."
    # Create AudioSegment instance
    audio_segment = AudioSegment.from_file(filename)
    # Print attributes
    print(f"Channels: {audio_segment.channels}")
    print(f"Sample width: {audio_segment.sample_width}")
    print(f"Frame rate (sample rate): {audio_segment.frame_rate}")
    print(f"Frame width: {audio_segment.frame_width}")
    print(f"Length (ms): {len(audio_segment)}")
    print(f"Frame count: {audio_segment.frame_count()}")

SPOKEN LANGUAGE PROCESSING IN PYTHON


Using the attribute showing function
show_pydub_stats("acme_audio_files/call_1.wav")

Channels: 2
Sample width: 2
Frame rate (sample rate): 32000
Frame width: 4
Length (ms): 54888
Frame count: 1756416.0

SPOKEN LANGUAGE PROCESSING IN PYTHON


Creating a transcribe function
# Create a function to transcribe audio
def transcribe_audio(filename):
    "Takes a .wav format audio file and transcribes it to text."
    # Setup a recognizer instance
    recognizer = sr.Recognizer()

    # Import the audio file and convert to audio data
    audio_file = sr.AudioFile(filename)
    with audio_file as source:
        audio_data = recognizer.record(source)

    # Return the transcribed text
    return recognizer.recognize_google(audio_data)

SPOKEN LANGUAGE PROCESSING IN PYTHON


Using the transcribe function
transcribe_audio("acme_audio_files/call_1.wav")

"hello welcome to Acme studio support line my name is Daniel how can I best help
you hey Daniel this is John I've recently bought a smart from you guys and I know
that's not good to hear John let's let's get your cell number and then we
can we can set up a way to fix it for you one number for 1757 varies how long do
you reckon this is going to take about an hour now while John we're going to try
our best hour I will we get the sealing member will start up this support case
I'm just really really really really I've been trying to contact 34 been put on
hold more than an hour and half so I'm not really happy I kind of wanna get this
issue 6 is fossil"

SPOKEN LANGUAGE PROCESSING IN PYTHON


Let's practice!
SPOKEN LANGUAGE PROCESSING IN PYTHON
Sentiment analysis
on spoken language
text
SPOKEN LANGUAGE PROCESSING IN PYTHON

Daniel Bourke
Machine Learning Engineer/YouTube
Creator
Installing sentiment analysis libraries
$ pip install nltk

# Download required NLTK packages


import nltk
nltk.download("punkt")
nltk.download("vader_lexicon")

SPOKEN LANGUAGE PROCESSING IN PYTHON


Sentiment analysis with VADER
# Import sentiment analysis class
from nltk.sentiment.vader import SentimentIntensityAnalyzer
# Create sentiment analysis instance
sid = SentimentIntensityAnalyzer()
# Test sentiment analysis on negative text
print(sid.polarity_scores("This customer service is terrible."))

{'neg': 0.437, 'neu': 0.563, 'pos': 0.0, 'compound': -0.4767}
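For contrast, a positive sentence scores a positive compound value (a quick sketch; compound ranges from -1, most negative, to +1, most positive):

# Test sentiment analysis on positive text
print(sid.polarity_scores("This customer service is wonderful."))
# Expect pos > 0 and a positive compound score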

SPOKEN LANGUAGE PROCESSING IN PYTHON


Sentiment analysis on transcribed text
# Transcribe customer channel of call_3
call_3_channel_2_text = transcribe_audio("call_3_channel_2.wav")
print(call_3_channel_2_text)

"hey Dave is this any better do I order products are currently on July 1st and I haven't
received the product a three-week step down this parable 6987 5"

# Sentiment analysis on customer channel of call_3


sid.polarity_scores(call_3_channel_2_text)

{'neg': 0.0, 'neu': 0.892, 'pos': 0.108, 'compound': 0.4404}

SPOKEN LANGUAGE PROCESSING IN PYTHON


Sentence by sentence
call_3_paid_api_text = "Okay. Yeah. Hi, Diane. This is paid on this call and obvi...

# Import sent tokenizer


from nltk.tokenize import sent_tokenize
# Find sentiment on each sentence
for sentence in sent_tokenize(call_3_paid_api_text):
    print(sentence)
    print(sid.polarity_scores(sentence))

SPOKEN LANGUAGE PROCESSING IN PYTHON


Sentence by sentence
Okay.
{'neg': 0.0, 'neu': 0.0, 'pos': 1.0, 'compound': 0.2263}
Yeah.
{'neg': 0.0, 'neu': 0.0, 'pos': 1.0, 'compound': 0.296}
Hi, Diane.
{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}
This is paid on this call and obviously the status of my orders at three weeks ago,
and that service is terrible.
{'neg': 0.129, 'neu': 0.871, 'pos': 0.0, 'compound': -0.4767}
Is this any better?
{'neg': 0.0, 'neu': 0.508, 'pos': 0.492, 'compound': 0.4404}
Yes...

SPOKEN LANGUAGE PROCESSING IN PYTHON


Time to code!
SPOKEN LANGUAGE PROCESSING IN PYTHON
Named entity
recognition on
transcribed text
SPOKEN LANGUAGE PROCESSING IN PYTHON

Daniel Bourke
Machine Learning Engineer/YouTube
Creator
Installing spaCy
# Install spaCy
$ pip install spacy

# Download spaCy language model


$ python -m spacy download en_core_web_sm

SPOKEN LANGUAGE PROCESSING IN PYTHON


Using spaCy
import spacy

# Load spaCy language model


nlp = spacy.load("en_core_web_sm")

# Create a spaCy doc


doc = nlp("I'd like to talk about a smartphone I ordered on July 31st from your
Sydney store, my order number is 40939440. I spoke to Georgia about it last week.")

SPOKEN LANGUAGE PROCESSING IN PYTHON


spaCy tokens
# Show different tokens and positions
for token in doc:
    print(token.text, token.idx)

I 0
'd 1
like 4
to 9
talk 12
about 17
a 23
smartphone 25...

SPOKEN LANGUAGE PROCESSING IN PYTHON


spaCy sentences
# Show sentences in doc
for sentence in doc.sents:
    print(sentence)

I'd like to talk about a smartphone I ordered on July 31st from your Sydney store,
my order number is 4093829.
I spoke to one of your customer service team, Georgia, yesterday.

SPOKEN LANGUAGE PROCESSING IN PYTHON


spaCy named entities
Some of spaCy's built-in named entities:

PERSON People, including ctional.

ORG Companies, agencies, institutions, etc.

GPE Countries, cities, states.

PRODUCT Objects, vehicles, foods, etc. (Not services.)

DATE Absolute or relative dates or periods.

TIME Times smaller than a day.

MONEY Monetary values, including unit.

CARDINAL Numerals that do not fall under another type.

SPOKEN LANGUAGE PROCESSING IN PYTHON


spaCy named entities
# Find named entities in doc
for entity in doc.ents:
    print(entity.text, entity.label_)

July 31st DATE


Sydney GPE
4093829 CARDINAL
one CARDINAL
Georgia GPE
yesterday DATE

SPOKEN LANGUAGE PROCESSING IN PYTHON


Custom named entities
# Import EntityRuler class
from spacy.pipeline import EntityRuler

# Check spaCy pipeline


print(nlp.pipeline)

[('tagger', <spacy.pipeline.pipes.Tagger at 0x1c3aa8a470>),


('parser', <spacy.pipeline.pipes.DependencyParser at 0x1c3bb60588>),
('ner', <spacy.pipeline.pipes.EntityRecognizer at 0x1c3bb605e8>)]

SPOKEN LANGUAGE PROCESSING IN PYTHON


Changing the pipeline
# Create EntityRuler instance
ruler = EntityRuler(nlp)

# Add token pattern to ruler


ruler.add_patterns([{"label":"PRODUCT", "pattern": "smartphone"}])

# Add new rule to pipeline before ner


nlp.add_pipe(ruler, before="ner")

# Check updated pipeline


nlp.pipeline

SPOKEN LANGUAGE PROCESSING IN PYTHON


Changing the pipeline
[('tagger', <spacy.pipeline.pipes.Tagger at 0x1c1f9c9b38>),
('parser', <spacy.pipeline.pipes.DependencyParser at 0x1c3c9cba08>),
('entity_ruler', <spacy.pipeline.entityruler.EntityRuler at 0x1c1d834b70>),
('ner', <spacy.pipeline.pipes.EntityRecognizer at 0x1c3c9cba68>)]

SPOKEN LANGUAGE PROCESSING IN PYTHON


Testing the new pipeline
# Test new entity rule
for entity in doc.ents:
    print(entity.text, entity.label_)

smartphone PRODUCT
July 31st DATE
Sydney GPE
4093829 CARDINAL
one CARDINAL
Georgia GPE
yesterday DATE

SPOKEN LANGUAGE PROCESSING IN PYTHON


Let's rocket and
practice spaCy!
SPOKEN LANGUAGE PROCESSING IN PYTHON
Classifying
transcribed speech
with Sklearn
SPOKEN LANGUAGE PROCESSING IN PYTHON

Daniel Bourke
Machine Learning Engineer/YouTube
creator
Inspecting the data
# Inspect post purchase audio folder
import os
post_purchase_audio = os.listdir("post_purchase")
print(post_purchase_audio[:5])

['post-purchase-audio-0.mp3',
'post-purchase-audio-1.mp3',
'post-purchase-audio-2.mp3',
'post-purchase-audio-3.mp3',
'post-purchase-audio-4.mp3']

SPOKEN LANGUAGE PROCESSING IN PYTHON


Converting to wav
# Loop through mp3 files
for file in post_purchase_audio:
    print(f"Converting {file} to .wav...")
    # Use previously made function to convert to .wav
    convert_to_wav(file)

Converting post-purchase-audio-0.mp3 to .wav...


Converting post-purchase-audio-1.mp3 to .wav...
Converting post-purchase-audio-2.mp3 to .wav...
Converting post-purchase-audio-3.mp3 to .wav...
Converting post-purchase-audio-4.mp3 to .wav...

SPOKEN LANGUAGE PROCESSING IN PYTHON


Transcribing all phone call excerpts
# Transcribe text from wav files
def create_text_list(folder):
    text_list = []
    # Loop through folder
    for file in folder:
        # Check for .wav extension
        if file.endswith(".wav"):
            # Transcribe audio
            text = transcribe_audio(file)
            # Add transcribed text to list
            text_list.append(text)
    return text_list

SPOKEN LANGUAGE PROCESSING IN PYTHON


Transcribing all phone call excerpts
# Convert post purchase audio to text
post_purchase_text = create_text_list(post_purchase_audio)
print(post_purchase_text[:5])

['hey man I just water product from you guys and I think is amazing but I leave a li
'these clothes I just bought from you guys too small is there anyway I can change t
"I recently got these pair of shoes but they're too big can I change the size",
"I bought a pair of pants from you guys but they're way too small",
"I bought a pair of pants and they're the wrong colour is there any chance I can ch

SPOKEN LANGUAGE PROCESSING IN PYTHON


Organizing transcribed text
import pandas as pd
# Create post purchase dataframe
post_purchase_df = pd.DataFrame({"label": "post_purchase", "text": post_purchase_text})
# Create pre purchase dataframe
pre_purchase_df = pd.DataFrame({"label": "pre_purchase", "text": pre_purchase_text})

# Combine pre purchase and post purchase


df = pd.concat([post_purchase_df, pre_purchase_df])

# View the combined dataframe


df.head()

SPOKEN LANGUAGE PROCESSING IN PYTHON


Organizing transcribed text
label text
0 post_purchase yeah hello someone this morning delivered a pa...
1 post_purchase my shipment arrived yesterday but it's not the...
2 post_purchase hey my name is Daniel I received my shipment y...
3 post_purchase hey mate how are you doing I'm just calling in...
4 pre_purchase hey I was wondering if you know where my new p...

SPOKEN LANGUAGE PROCESSING IN PYTHON


Building a text classifier
# Import text classification packages
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split

# Split data into train and test sets


X_train, X_test, y_train, y_test = train_test_split(
    df["text"],
    df["label"],
    test_size=0.3)

SPOKEN LANGUAGE PROCESSING IN PYTHON


Naive Bayes Pipeline
# Create text classifier pipeline
text_classifier = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("classifier", MultinomialNB())
])

# Fit the classifier pipeline on the training data


text_classifier.fit(X_train, y_train)

SPOKEN LANGUAGE PROCESSING IN PYTHON


Not so Naive
# Make predictions and compare them to test labels
predictions = text_classifier.predict(X_test)
accuracy = 100 * np.mean(predictions == y_test)
print(f"The model is {accuracy:.2f}% accurate.")

The model is 97.87% accurate.
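Beyond raw accuracy, scikit-learn's classification_report gives per-class precision and recall (a sketch using the predictions above):

from sklearn.metrics import classification_report

# Per-class precision, recall and F1 for the two purchase labels
print(classification_report(y_test, predictions))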

SPOKEN LANGUAGE PROCESSING IN PYTHON


Let's practice!
SPOKEN LANGUAGE PROCESSING IN PYTHON
Congratulations!
SPOKEN LANGUAGE PROCESSING IN PYTHON

Daniel Bourke
Machine Learning Engineer/YouTube
creator
What you've done
1. Converted audio files into soundwaves with Python and NumPy .

2. Transcribed speech with speech_recognition .

3. Prepared and manipulated audio files using PyDub .

4. Built a spoken language processing pipeline with NLTK , spaCy and sklearn .

SPOKEN LANGUAGE PROCESSING IN PYTHON


What next?
Practice your skills with a project of your own.

Check out speech_recognition 's Microphone() class.

SPOKEN LANGUAGE PROCESSING IN PYTHON


One last transcription
one_last_transcription = transcribe_audio("congratulations.wav")

print(one_last_transcription)

Congratulations on finishing the Spoken Language Processing with Python course!


You should be proud.
Now get out there and recognize some speech!

SPOKEN LANGUAGE PROCESSING IN PYTHON


Keep learning!
SPOKEN LANGUAGE PROCESSING IN PYTHON
