Final Summary NLP

Introduction to

regular expressions
I N T R O D U C T I O N T O N AT U R A L L A N G U A G E P R O C E S S I N G I N P Y T H O N

Katharine Jarmul
Founder, kjamistan
What is Natural Language Processing?
Field of study focused on making sense of language
Using statistics and computers

You will learn the basics of NLP


Topic identification

Text classification

NLP applications include:


Chatbots

Translation

Sentiment analysis

... and many more!

INTRODUCTION TO NATURAL LANGUAGE PROCESSING IN PYTHON


What exactly are regular expressions?
Strings with a special syntax

Allow us to match patterns in other strings

Applications of regular expressions:
→ Find all web links in a document
→ Parse email addresses
→ Remove/replace unwanted characters

import re
re.match('abc', 'abcdef')

<_sre.SRE_Match object; span=(0, 3), match='abc'>

word_regex = '\w+'
re.match(word_regex, 'hi there!')

<_sre.SRE_Match object; span=(0, 2), match='hi'>

INTRODUCTION TO NATURAL LANGUAGE PROCESSING IN PYTHON


Common regex patterns

pattern     matches           example
\w+         word              'Magic'
\d          digit             9
\s          space             ' '
.*          wildcard          'username74'
+ or *      greedy match      'aaaaaa'
\S          not space         'no_spaces'
[a-z]       lowercase group   'abcdefg'

INTRODUCTION TO NATURAL LANGUAGE PROCESSING IN PYTHON


Python's re module
re module

split : split a string on regex

findall : find all patterns in a string

search : search for a pattern

match : match an entire string or substring based on a pattern

Pattern first, and the string second

May return an iterator, string, or match object

re.split('\s+', 'Split on spaces.')

['Split', 'on', 'spaces.']
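The other listed functions take their arguments the same way, pattern first; a minimal sketch of findall and search (the example strings are illustrative):

import re

# findall returns every non-overlapping match as a list of strings
re.findall(r'\w+', 'Which words appear here?')

['Which', 'words', 'appear', 'here']

# search scans the whole string; match only matches at the beginning
re.search(r'\d+', 'abc 123')

<_sre.SRE_Match object; span=(4, 7), match='123'>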

INTRODUCTION TO NATURAL LANGUAGE PROCESSING IN PYTHON


Let's practice!
I N T R O D U C T I O N T O N AT U R A L L A N G U A G E P R O C E S S I N G I N P Y T H O N
Introduction to
tokenization
I N T R O D U C T I O N T O N AT U R A L L A N G U A G E P R O C E S S I N G I N P Y T H O N

Katharine Jarmul
Founder, kjamistan
What is tokenization?
Turning a string or document into tokens (smaller chunks)

One step in preparing a text for NLP

Many different theories and rules

You can create your own rules using regular expressions

Some examples:
Breaking out words or sentences

Separating punctuation

Separating all hashtags in a tweet

INTRODUCTION TO NATURAL LANGUAGE PROCESSING IN PYTHON


nltk library
nltk : natural language toolkit

from nltk.tokenize import word_tokenize


word_tokenize("Hi there!")

['Hi', 'there', '!']

INTRODUCTION TO NATURAL LANGUAGE PROCESSING IN PYTHON


Why tokenize?
Easier to map part of speech

Matching common words

Removing unwanted tokens

"I don't like Sam's shoes."

"I", "do", "n't", "like", "Sam", "'s", "shoes", "."

INTRODUCTION TO NATURAL LANGUAGE PROCESSING IN PYTHON


Other nltk tokenizers
sent_tokenize : tokenize a document into sentences

regexp_tokenize : tokenize a string or document based on a regular expression pattern

TweetTokenizer : special class just for tweet tokenization, allowing you to separate hashtags, mentions and lots of exclamation points!!! (quick sketches of the latter two below)
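A minimal sketch of the latter two, with illustrative example strings:

from nltk.tokenize import regexp_tokenize, TweetTokenizer

# Tokenize on a custom pattern: here, runs of word characters
regexp_tokenize("SOLDIER #1: Found them?", r'\w+')

['SOLDIER', '1', 'Found', 'them']

# TweetTokenizer keeps hashtags and mentions as single tokens
tknzr = TweetTokenizer()
tknzr.tokenize("#NLP is #awesome, @friends!")

['#NLP', 'is', '#awesome', ',', '@friends', '!']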

INTRODUCTION TO NATURAL LANGUAGE PROCESSING IN PYTHON


More regex practice
Difference between re.search() and re.match()

import re
re.match('abc', 'abcde')

<_sre.SRE_Match object; span=(0, 3), match='abc'>

re.search('abc', 'abcde')

<_sre.SRE_Match object; span=(0, 3), match='abc'>

re.match('cd', 'abcde')
re.search('cd', 'abcde')

<_sre.SRE_Match object; span=(2, 4), match='cd'>

INTRODUCTION TO NATURAL LANGUAGE PROCESSING IN PYTHON


Let's practice!
I N T R O D U C T I O N T O N AT U R A L L A N G U A G E P R O C E S S I N G I N P Y T H O N
Advanced
tokenization with
regex
I N T R O D U C T I O N T O N AT U R A L L A N G U A G E P R O C E S S I N G I N P Y T H O N

Katharine Jarmul
Founder, kjamistan
Regex groups using or "|"
OR is represented using |

You can define a group using ()

You can define explicit character ranges using []

import re
match_digits_and_words = ('(\d+|\w+)')
re.findall(match_digits_and_words, 'He has 11 cats.')

['He', 'has', '11', 'cats']

INTRODUCTION TO NATURAL LANGUAGE PROCESSING IN PYTHON


Regex ranges and groups
pattern         matches                                        example
[A-Za-z]+       upper and lowercase English alphabet           'ABCDEFghijk'
[0-9]           numbers from 0 to 9                            9
[A-Za-z\-\.]+   upper and lowercase English alphabet, - and .  'My-Website.com'
(a-z)           a, - and z                                     'a-z'
(\s+|,)         spaces or a comma                              ', '

INTRODUCTION TO NATURAL LANGUAGE PROCESSING IN PYTHON


Character range with `re.match()`
import re
my_str = 'match lowercase spaces nums like 12, but no commas'
re.match('[a-z0-9 ]+', my_str)

<_sre.SRE_Match object;
span=(0, 42), match='match lowercase spaces nums like 12'>

INTRODUCTION TO NATURAL LANGUAGE PROCESSING IN PYTHON


Let's practice!
I N T R O D U C T I O N T O N AT U R A L L A N G U A G E P R O C E S S I N G I N P Y T H O N
Charting word
length with nltk
I N T R O D U C T I O N T O N AT U R A L L A N G U A G E P R O C E S S I N G I N P Y T H O N

Katharine Jarmul
Founder, kjamistan
Getting started with matplotlib
Charting library used by many open source Python projects

Straightforward functionality with lots of options


Histograms

Bar charts

Line charts

Scatter plots

... and also advanced functionality like 3D graphs and


animations!

INTRODUCTION TO NATURAL LANGUAGE PROCESSING IN PYTHON


Plotting a histogram with matplotlib
from matplotlib import pyplot as plt
plt.hist([1, 5, 5, 7, 7, 7, 9])

(array([ 1., 0., 0., 0., 0., 2., 0., 3., 0., 1.]),
array([ 1., 1.8, 2.6, 3.4, 4.2, 5., 5.8, 6.6, 7.4, 8.2, 9.]),
<a list of 10 Patch objects>)

plt.show()

INTRODUCTION TO NATURAL LANGUAGE PROCESSING IN PYTHON


Generated histogram

INTRODUCTION TO NATURAL LANGUAGE PROCESSING IN PYTHON


Combining NLP data extraction with plotting
from matplotlib import pyplot as plt
from nltk.tokenize import word_tokenize
words = word_tokenize("This is a pretty cool tool!")
word_lengths = [len(w) for w in words]
plt.hist(word_lengths)

(array([ 2., 0., 1., 0., 0., 0., 3., 0., 0., 1.]),
array([ 1., 1.5, 2., 2.5, 3., 3.5, 4., 4.5, 5., 5.5, 6.]),
<a list of 10 Patch objects>)

plt.show()

INTRODUCTION TO NATURAL LANGUAGE PROCESSING IN PYTHON


Word length histogram

INTRODUCTION TO NATURAL LANGUAGE PROCESSING IN PYTHON


Let's practice!
I N T R O D U C T I O N T O N AT U R A L L A N G U A G E P R O C E S S I N G I N P Y T H O N
Word counts with
bag-of-words
I N T R O D U C T I O N T O N AT U R A L L A N G U A G E P R O C E S S I N G I N P Y T H O N

Katharine Jarmul
Founder, kjamistan
Bag-of-words
Basic method for finding topics in a text

Need to first create tokens using tokenization

... and then count up all the tokens

The more frequent a word, the more important it might be

Can be a great way to determine the significant words in a text

INTRODUCTION TO NATURAL LANGUAGE PROCESSING IN PYTHON


Bag-of-words example
Text: "The cat is in the box. The cat likes the box. The box is
over the cat."

Bag of words (stripped punctuation):


"The": 3, "box": 3

"cat": 3, "the": 3

"is": 2

"in": 1, "likes": 1, "over": 1

INTRODUCTION TO NATURAL LANGUAGE PROCESSING IN PYTHON


Bag-of-words in Python
from nltk.tokenize import word_tokenize
from collections import Counter
counter = Counter(word_tokenize("""The cat is in the box. The cat likes the box.
The box is over the cat."""))
counter

Counter({'.': 3,
'The': 3,
'box': 3,
'cat': 3,
'in': 1,
...
'the': 3})

counter.most_common(2)

[('The', 3), ('box', 3)]

INTRODUCTION TO NATURAL LANGUAGE PROCESSING IN PYTHON


Let's practice!
I N T R O D U C T I O N T O N AT U R A L L A N G U A G E P R O C E S S I N G I N P Y T H O N
Simple text
preprocessing
I N T R O D U C T I O N T O N AT U R A L L A N G U A G E P R O C E S S I N G I N P Y T H O N

Katharine Jarmul
Founder, kjamistan
Why preprocess?
Helps make for better input data
When performing machine learning or other statistical
methods

Examples:
Tokenization to create a bag of words

Lowercasing words

Lemmatization/Stemming
Shorten words to their root stems

Removing stop words, punctuation, or unwanted tokens

Good to experiment with different approaches

INTRODUCTION TO NATURAL LANGUAGE PROCESSING IN PYTHON


Preprocessing example
Input text: Cats, dogs and birds are common pets. So are fish.

Output tokens: cat, dog, bird, common, pet, fish
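A minimal sketch of one way to get from that input to those tokens, assuming nltk's WordNetLemmatizer and English stop-word list (and that the nltk data packages are downloaded; other preprocessing pipelines would differ):

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

text = "Cats, dogs and birds are common pets. So are fish."
lemmatizer = WordNetLemmatizer()

# Lowercase, keep alphabetic tokens, drop stop words, then lemmatize
tokens = [t for t in word_tokenize(text.lower()) if t.isalpha()]
no_stops = [t for t in tokens if t not in stopwords.words('english')]
print([lemmatizer.lemmatize(t) for t in no_stops])

['cat', 'dog', 'bird', 'common', 'pet', 'fish']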

INTRODUCTION TO NATURAL LANGUAGE PROCESSING IN PYTHON


Text preprocessing with Python
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from collections import Counter
text = """The cat is in the box. The cat likes the box.
The box is over the cat."""
tokens = [w for w in word_tokenize(text.lower())
if w.isalpha()]
no_stops = [t for t in tokens
if t not in stopwords.words('english')]
Counter(no_stops).most_common(2)

[('cat', 3), ('box', 3)]

INTRODUCTION TO NATURAL LANGUAGE PROCESSING IN PYTHON


Let's practice!
I N T R O D U C T I O N T O N AT U R A L L A N G U A G E P R O C E S S I N G I N P Y T H O N
Introduction to
gensim
I N T R O D U C T I O N T O N AT U R A L L A N G U A G E P R O C E S S I N G I N P Y T H O N

Katharine Jarmul
Founder, kjamistan
What is gensim?
Popular open-source NLP library

Uses top academic models to perform complex tasks


Building document or word vectors

Performing topic identification and document comparison

INTRODUCTION TO NATURAL LANGUAGE PROCESSING IN PYTHON


What is a word vector?

INTRODUCTION TO NATURAL LANGUAGE PROCESSING IN PYTHON


Gensim example

(Source: http://tlfvincent.github.io/2015/10/23/presidential-speech-topics)

INTRODUCTION TO NATURAL LANGUAGE PROCESSING IN PYTHON


from gensim.corpora.dictionary import Dictionary
from nltk.tokenize import word_tokenize
my_documents = ['The movie was about a spaceship and aliens.',
'I really liked the movie!',
'Awesome action scenes, but boring characters.',
'The movie was awful! I hate alien films.',
'Space is cool! I liked the movie.',
'More space films, please!',]

tokenized_docs = [word_tokenize(doc.lower())
                  for doc in my_documents]
dictionary = Dictionary(tokenized_docs)
dictionary.token2id

{'!': 11,
 ',': 17,
 '.': 7,
 'a': 2,
 'about': 4,
 ...}

INTRODUCTION TO NATURAL LANGUAGE PROCESSING IN PYTHON


Creating a gensim corpus
corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]
corpus

[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1)],
[(0, 1), (1, 1), (9, 1), (10, 1), (11, 1), (12, 1)],
...]

gensim models can be easily saved, updated, and reused

Our dictionary can also be updated

This more advanced and feature rich bag-of-words can be


used in future exercises

INTRODUCTION TO NATURAL LANGUAGE PROCESSING IN PYTHON


Let's practice!
I N T R O D U C T I O N T O N AT U R A L L A N G U A G E P R O C E S S I N G I N P Y T H O N
Tf-idf with gensim
I N T R O D U C T I O N T O N AT U R A L L A N G U A G E P R O C E S S I N G I N P Y T H O N

Katharine Jarmul
Founder, kjamistan
What is tf-idf?
Term frequency - inverse document frequency

Allows you to determine the most important words in each document

Each corpus may have shared words beyond just stopwords

These words should be down-weighted in importance

Example from astronomy: "Sky"

Ensures most common words don't show up as key words

Keeps document-specific frequent words weighted high

INTRODUCTION TO NATURAL LANGUAGE PROCESSING IN PYTHON


Tf-idf formula

w_ij = tf_ij * log(N / df_i)

w_ij = tf-idf weight for token i in document j

tf_ij = number of occurrences of token i in document j

df_i = number of documents that contain token i

N = total number of documents
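A quick worked check of the formula, with illustrative values (the logarithm base varies by implementation; natural log here):

import math

tf = 5    # token i occurs 5 times in document j
df = 10   # 10 documents contain token i
N = 100   # 100 documents in total

w = tf * math.log(N / df)
print(w)  # 5 * log(10) ≈ 11.51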

INTRODUCTION TO NATURAL LANGUAGE PROCESSING IN PYTHON


Tf-idf with gensim
from gensim.models.tfidfmodel import TfidfModel
tfidf = TfidfModel(corpus)
tfidf[corpus[1]]

[(0, 0.1746298276735174),
(1, 0.1746298276735174),
(9, 0.29853166221463673),
(10, 0.7716931521027908),
...
]

INTRODUCTION TO NATURAL LANGUAGE PROCESSING IN PYTHON


Let's practice!
I N T R O D U C T I O N T O N AT U R A L L A N G U A G E P R O C E S S I N G I N P Y T H O N
Named Entity
Recognition
I N T R O D U C T I O N T O N AT U R A L L A N G U A G E P R O C E S S I N G I N P Y T H O N

Katharine Jarmul
Founder, kjamistan
What is Named Entity Recognition?
NLP task to identify important named entities in the text
People, places, organizations

Dates, states, works of art

... and other categories!

Can be used alongside topic identification


... or on its own!

Who? What? When? Where?

INTRODUCTION TO NATURAL LANGUAGE PROCESSING IN PYTHON


Example of NER

(Source: Europeana Newspapers, http://www.europeana-newspapers.eu)

INTRODUCTION TO NATURAL LANGUAGE PROCESSING IN PYTHON


nltk and the Stanford CoreNLP Library
The Stanford CoreNLP library:
Integrated into Python via nltk

Java based

Support for NER as well as coreference and dependency


trees

INTRODUCTION TO NATURAL LANGUAGE PROCESSING IN PYTHON


Using nltk for Named Entity Recognition
import nltk
sentence = '''In New York, I like to ride the Metro to
visit MOMA and some restaurants rated
well by Ruth Reichl.'''
tokenized_sent = nltk.word_tokenize(sentence)
tagged_sent = nltk.pos_tag(tokenized_sent)
tagged_sent[:3]

[('In', 'IN'), ('New', 'NNP'), ('York', 'NNP')]

INTRODUCTION TO NATURAL LANGUAGE PROCESSING IN PYTHON


print(nltk.ne_chunk(tagged_sent))

(S
In/IN
(GPE New/NNP York/NNP)
,/,
I/PRP
like/VBP
to/TO
ride/VB
the/DT
(ORGANIZATION Metro/NNP)
to/TO
visit/VB
(ORGANIZATION MOMA/NNP)
and/CC
some/DT
restaurants/NNS
rated/VBN
well/RB
by/IN
(PERSON Ruth/NNP Reichl/NNP)
./.)

INTRODUCTION TO NATURAL LANGUAGE PROCESSING IN PYTHON


Let's practice!
I N T R O D U C T I O N T O N AT U R A L L A N G U A G E P R O C E S S I N G I N P Y T H O N
Introduction to
SpaCy
I N T R O D U C T I O N T O N AT U R A L L A N G U A G E P R O C E S S I N G I N P Y T H O N

Katharine Jarmul
Founder, kjamistan
What is SpaCy?
NLP library similar to gensim , with different implementations

Focus on creating NLP pipelines to generate models and


corpora

Open-source, with extra libraries and tools


Displacy

INTRODUCTION TO NATURAL LANGUAGE PROCESSING IN PYTHON


Displacy entity recognition visualizer

(source: https://demos.explosion.ai/displacy-ent/)

INTRODUCTION TO NATURAL LANGUAGE PROCESSING IN PYTHON


import spacy
nlp = spacy.load('en_core_web_sm')
nlp.entity

<spacy.pipeline.EntityRecognizer at 0x7f76b75e68b8>

doc = nlp("""Berlin is the capital of Germany;


and the residence of Chancellor Angela Merkel.""")
doc.ents

(Berlin, Germany, Angela Merkel)

print(doc.ents[0], doc.ents[0].label_)

Berlin GPE

INTRODUCTION TO NATURAL LANGUAGE PROCESSING IN PYTHON


Why use SpaCy for NER?
Easy pipeline creation

Different entity types compared to nltk

Informal language corpora


Easily find entities in Tweets and chat messages

Quickly growing!

INTRODUCTION TO NATURAL LANGUAGE PROCESSING IN PYTHON


Let's practice!
I N T R O D U C T I O N T O N AT U R A L L A N G U A G E P R O C E S S I N G I N P Y T H O N
Multilingual NER
with polyglot
I N T R O D U C T I O N T O N AT U R A L L A N G U A G E P R O C E S S I N G I N P Y T H O N

Katharine Jarmul
Founder, kjamistan
What is polyglot?
NLP library which uses word
vectors

Why polyglot ?
Vectors for many different languages

More than 130!

INTRODUCTION TO NATURAL LANGUAGE PROCESSING IN PYTHON


Spanish NER with polyglot
from polyglot.text import Text
text = """El presidente de la Generalitat de Cataluña,
Carles Puigdemont, ha afirmado hoy a la alcaldesa
de Madrid, Manuela Carmena, que en su etapa de
alcalde de Girona (de julio de 2011 a enero de 2016)
hizo una gran promoción de Madrid."""
ptext = Text(text)
ptext.entities

[I-ORG(['Generalitat', 'de']),
I-LOC(['Generalitat', 'de', 'Cataluña']),
I-PER(['Carles', 'Puigdemont']),
I-LOC(['Madrid']),
I-PER(['Manuela', 'Carmena']),
I-LOC(['Girona']),
I-LOC(['Madrid'])]

INTRODUCTION TO NATURAL LANGUAGE PROCESSING IN PYTHON


Let's practice!
I N T R O D U C T I O N T O N AT U R A L L A N G U A G E P R O C E S S I N G I N P Y T H O N
Classifying fake
news using
supervised learning
with NLP
I N T R O D U C T I O N T O N AT U R A L L A N G U A G E P R O C E S S I N G I N P Y T H O N

Katharine Jarmul
Founder, kjamistan
What is supervised learning?
Form of machine learning
Problem has predefined training data

This data has a label (or outcome) you want the model to learn

Classification problem

Goal: Make good hypotheses about the species based on geometric features

Sepal length  Sepal width  Petal length  Petal width  Species
5.1           3.5          1.4           0.2          I. setosa
7.0           3.2          4.7           1.4          I. versicolor
6.3           3.3          6.0           2.5          I. virginica

INTRODUCTION TO NATURAL LANGUAGE PROCESSING IN PYTHON


Supervised learning with NLP
Need to use language instead of geometric features

scikit-learn : Powerful open-source library

How to create supervised learning data from text?


Use bag-of-words models or tf-idf as features

INTRODUCTION TO NATURAL LANGUAGE PROCESSING IN PYTHON


IMDB Movie Dataset
Plot                                                Sci-Fi  Action
In a post-apocalyptic world in human decay, a ...     1       0
Mohei is a wandering swordsman. He arrives in ...     0       1
#137 is a SCI/FI thriller about a girl, Marla, ...    1       0

Goal: Predict movie genre based on plot summary

Categorical features generated using preprocessing

INTRODUCTION TO NATURAL LANGUAGE PROCESSING IN PYTHON


Supervised learning steps
Collect and preprocess our data

Determine a label (Example: Movie genre)

Split data into training and test sets

Extract features from the text to help predict the label


Bag-of-words vector built into scikit-learn

Evaluate trained model using the test set

INTRODUCTION TO NATURAL LANGUAGE PROCESSING IN PYTHON


Let's practice!
I N T R O D U C T I O N T O N AT U R A L L A N G U A G E P R O C E S S I N G I N P Y T H O N
Building word count
vectors with scikit-
learn
I N T R O D U C T I O N T O N AT U R A L L A N G U A G E P R O C E S S I N G I N P Y T H O N

Katharine Jarmul
Founder, kjamistan
Predicting movie genre
Dataset consisting of movie plots and corresponding genre

Goal: Create bag-of-word vectors for the movie plots


Can we predict genre based on the words used in the plot
summary?

INTRODUCTION TO NATURAL LANGUAGE PROCESSING IN PYTHON


Count Vectorizer with Python
import pandas as pd

from sklearn.model_selection import train_test_split

from sklearn.feature_extraction.text import CountVectorizer


df = ... # Load data into DataFrame
y = df['Sci-Fi']
X_train, X_test, y_train, y_test = train_test_split(
df['plot'], y,
test_size=0.33,
random_state=53)
count_vectorizer = CountVectorizer(stop_words='english')
count_train = count_vectorizer.fit_transform(X_train.values)
count_test = count_vectorizer.transform(X_test.values)

INTRODUCTION TO NATURAL LANGUAGE PROCESSING IN PYTHON


Let's practice!
I N T R O D U C T I O N T O N AT U R A L L A N G U A G E P R O C E S S I N G I N P Y T H O N
Training and testing
a classification
model with scikit-
learn
I N T R O D U C T I O N T O N AT U R A L L A N G U A G E P R O C E S S I N G I N P Y T H O N

Katharine Jarmul
Founder, kjamistan
Naive Bayes classifier
Naive Bayes Model
Commonly used for testing NLP classification problems

Basis in probability

Given a particular piece of data, how likely is a particular


outcome?

Examples:
If the plot has a spaceship, how likely is it to be sci-fi?

Given a spaceship and an alien, how likely now is it sci-fi?

Each word from CountVectorizer acts as a feature

Naive Bayes: Simple and effective

INTRODUCTION TO NATURAL LANGUAGE PROCESSING IN PYTHON


Naive Bayes with scikit-learn
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics
nb_classifier = MultinomialNB()

nb_classifier.fit(count_train, y_train)
pred = nb_classifier.predict(count_test)
metrics.accuracy_score(y_test, pred)

0.85841849389820424

INTRODUCTION TO NATURAL LANGUAGE PROCESSING IN PYTHON


Confusion matrix
metrics.confusion_matrix(y_test, pred, labels=[0,1])

array([[6410, 563],
[ 864, 2242]])

                 Predicted Action  Predicted Sci-Fi
Actual Action    6410              563
Actual Sci-Fi    864               2242

INTRODUCTION TO NATURAL LANGUAGE PROCESSING IN PYTHON


Let's practice!
I N T R O D U C T I O N T O N AT U R A L L A N G U A G E P R O C E S S I N G I N P Y T H O N
Simple NLP, complex
problems
I N T R O D U C T I O N T O N AT U R A L L A N G U A G E P R O C E S S I N G I N P Y T H O N

Katharine Jarmul
Founder, kjamistan
Translation

(source: https://twitter.com/Lupintweets/status/865533182455685121)

INTRODUCTION TO NATURAL LANGUAGE PROCESSING IN PYTHON


Sentiment analysis

(source: https://nlp.stanford.edu/projects/socialsent/)

INTRODUCTION TO NATURAL LANGUAGE PROCESSING IN PYTHON


Language biases

(related talk: https://www.youtube.com/watch?v=j7FwpZB1hWc)

INTRODUCTION TO NATURAL LANGUAGE PROCESSING IN PYTHON


Let's practice!
I N T R O D U C T I O N T O N AT U R A L L A N G U A G E P R O C E S S I N G I N P Y T H O N
Welcome!
S E N T I M E N T A N A LY S I S I N P Y T H O N

Violeta Misheva
Data Scientist
What is sentiment analysis?

Sentiment analysis is the process of understanding the opinion of an author about a


subject.

SENTIMENT ANALYSIS IN PYTHON


What goes into a sentiment analysis system?
First element: Opinion/emotion

Opinion (polarity): pos, neutral, neg

Emotion

SENTIMENT ANALYSIS IN PYTHON


What goes into a sentiment analysis system?
Second element: subject

Subject of discussion: What is being talked about ?

The camera on this phone is great but its battery life is rather disappointing.

Third element: opinion holder

Opinion holder (entity): By whom?

SENTIMENT ANALYSIS IN PYTHON


Why sentiment analysis?
Social media monitoring
Not only what people are talking about but HOW they are talking about it

Sentiment can be found also in forums, blogs, news

Brand monitoring

Customer service

Product analytics

Market research and analysis

SENTIMENT ANALYSIS IN PYTHON


Let's look at movie reviews!
data.head()

SENTIMENT ANALYSIS IN PYTHON


How many positive and negative reviews?
data.label.value_counts()

0 3782
1 3719
Name: label, dtype: int64

SENTIMENT ANALYSIS IN PYTHON


Percentage of positive and negative reviews
data.label.value_counts() / len(data)

0 0.504199
1 0.495801
Name: label, dtype: float64

SENTIMENT ANALYSIS IN PYTHON


How long is the longest review?
length_reviews = data.review.str.len()

type(length_reviews)
pandas.core.series.Series

# length_reviews is a Series of per-review lengths
length_reviews

0     667
1    2982
2     669
3    1087
....

# Finding the maximum length
max(length_reviews)

SENTIMENT ANALYSIS IN PYTHON


How long is the shortest review?
length_reviews = data.review.str.len()

length_reviews

0     667
1    2982
2     669
3    1087
4     724
....

# Finding the minimum length
min(length_reviews)

SENTIMENT ANALYSIS IN PYTHON


Let's practice!
S E N T I M E N T A N A LY S I S I N P Y T H O N
Sentiment analysis
types and
approaches
S E N T I M E N T A N A LY S I S I N P Y T H O N

Violeta Misheva
Data Scientist
Levels of granularity
1. Document level

2. Sentence level

3. Aspect level

The camera in this phone is pretty good but the battery life is disappointing.

SENTIMENT ANALYSIS IN PYTHON


Type of sentiment analysis algorithms
Rule/lexicon-based

nice:+2, good:+1, terrible: -3 ...

Today was a good day.

Today: 0, was: 0, a: 0, good: +1, day: 0

Total valence: +1 (a minimal scoring sketch follows below)

Automatic/ Machine learning
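A minimal rule-based scoring sketch (the tiny lexicon and the whitespace tokenization are illustrative only, not a real library):

# Toy valence lexicon; real lexicons are far larger
lexicon = {'nice': 2, 'good': 1, 'terrible': -3}

def valence(sentence):
    # Score each lowercased token; unknown words count as 0
    tokens = sentence.lower().rstrip('.').split()
    return sum(lexicon.get(token, 0) for token in tokens)

print(valence("Today was a good day."))  # +1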

SENTIMENT ANALYSIS IN PYTHON


What is the valence of a sentence?
text = "Today was a good day."

from textblob import TextBlob

my_valence = TextBlob(text)
my_valence.sentiment

Sentiment(polarity=0.7, subjectivity=0.6000000000000001)

SENTIMENT ANALYSIS IN PYTHON


Automated or rule-based?
Automated/Machine learning:
Rely on having labelled historical data

Might take a while to train

Latest machine learning models can be quite powerful

Rule/lexicon-based:
Rely on manually crafted valence scores

Different words might have different polarity in different contexts

Can be quite fast

SENTIMENT ANALYSIS IN PYTHON


Let's practice!
S E N T I M E N T A N A LY S I S I N P Y T H O N
Let's build a word
cloud!
S E N T I M E N T A N A LY S I S I N P Y T H O N

Violeta Misheva
Data Scientist
Word cloud example

SENTIMENT ANALYSIS IN PYTHON


How do word clouds work?

The more frequent a word is, the BIGGER and bolder it will appear on the word cloud.

SENTIMENT ANALYSIS IN PYTHON


Word cloud generated by one of the longest reviews

SENTIMENT ANALYSIS IN PYTHON


Why word clouds?
Pros:
Can reveal the essential

Provide an overall sense of the text

Easy to grasp and engaging

Cons:
Sometimes confusing and uninformative

With larger text, require more work

SENTIMENT ANALYSIS IN PYTHON


Let's build a word cloud in Python!
from wordcloud import WordCloud
import matplotlib.pyplot as plt

two_cities = """It was the best of times, it was the worst of times,
it was the age of wisdom, it was the age of foolishness,
it was the epoch of belief, it was the epoch of incredulity,
it was the season of Light, it was the season of Darkness,
it was the spring of hope, it was the winter of despair,
we had everything before us, we had nothing before us,
we were all going direct to Heaven, we were all going
direct the other way – in short, the period was so far
like the present period, that some of its noisiest
authorities insisted on its being received, for good
or for evil, in the superlative degree of comparison only."""

SENTIMENT ANALYSIS IN PYTHON


Define the WordCloud object
cloud_two_cities = WordCloud().generate(two_cities)

# To see all arguments of the function


?WordCloud

Background color

Size and font of the words, scaling

Stopwords

# What does cloud_two_cities look like?


cloud_two_cities
<wordcloud.wordcloud.WordCloud at 0x2585f286d68>

SENTIMENT ANALYSIS IN PYTHON


Displaying the word cloud!
plt.imshow(cloud_two_cities, interpolation='bilinear')

plt.axis('off')
plt.show()

SENTIMENT ANALYSIS IN PYTHON


Let's practice!
S E N T I M E N T A N A LY S I S I N P Y T H O N
Bag-of-words
S E N T I M E N T A N A LY S I S I N P Y T H O N

Violeta Misheva
Data Scientist
What is a bag-of-words (BOW) ?

Describes the occurrence of words within a document or a collection of documents (corpus)

Builds a vocabulary of the words and a measure of their presence

SENTIMENT ANALYSIS IN PYTHON


Amazon product reviews

SENTIMENT ANALYSIS IN PYTHON


Sentiment analysis with BOW: Example
This is the best book ever. I loved the book and highly recommend it!!!

{'This': 1, 'is': 1, 'the': 2 , 'best': 1 , 'book': 2,


'ever': 1, 'I':1 , 'loved':1 , 'and': 1 , 'highly': 1,
'recommend': 1 , 'it': 1 }

Lose word order and grammar rules!

SENTIMENT ANALYSIS IN PYTHON


BOW end result
The output will look something like this:

SENTIMENT ANALYSIS IN PYTHON


CountVectorizer function
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

vect = CountVectorizer(max_features=1000)
vect.fit(data.review)
X = vect.transform(data.review)

SENTIMENT ANALYSIS IN PYTHON


CountVectorizer output
X

<10000x1000 sparse matrix of type '<class 'numpy.int64'>'


with 406668 stored elements in Compressed Sparse Row format>

SENTIMENT ANALYSIS IN PYTHON


Transforming the vectorizer
# Transform to an array
my_array = X.toarray()

# Transform back to a dataframe, assign column names


X_df = pd.DataFrame(my_array, columns=vect.get_feature_names())

SENTIMENT ANALYSIS IN PYTHON


Let's practice!
S E N T I M E N T A N A LY S I S I N P Y T H O N
Getting granular
with n-grams
S E N T I M E N T A N A LY S I S I N P Y T H O N

Violeta Misheva
Data Scientist
Context matters
I am happy, not sad.

I am sad, not happy.

Putting 'not' in front of a word (negation) is one example of how context matters.

SENTIMENT ANALYSIS IN PYTHON


Capturing context with a BOW
Unigrams : single tokens

Bigrams: pairs of tokens

Trigrams: triples of tokens

n-grams: sequence of n-tokens

SENTIMENT ANALYSIS IN PYTHON


Capturing context with BOW
The weather today is wonderful.

Unigrams : { The, weather, today, is, wonderful }

Bigrams: {The weather, weather today, today is, is wonderful}

Trigrams: {The weather today, weather today is, today is wonderful}

SENTIMENT ANALYSIS IN PYTHON


n-grams with the CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

vect = CountVectorizer(ngram_range=(min_n, max_n))

# Only unigrams
ngram_range=(1, 1)

# Uni- and bigrams


ngram_range=(1, 2)

SENTIMENT ANALYSIS IN PYTHON


What is the best n?
Longer sequence of tokens
Results in more features

Higher precision of machine learning models

Risk of overfitting

SENTIMENT ANALYSIS IN PYTHON


Specifying vocabulary size
CountVectorizer(max_features, max_df, min_df)

max_features: if specified, it will include only the top most frequent words in the vocabulary
If max_features = None, all words will be included

max_df: ignore terms with higher than specified document frequency
If set to an integer, an absolute count; if a float, a proportion of documents

Default is 1.0, which means it does not ignore any terms

min_df: ignore terms with lower than specified document frequency
If set to an integer, an absolute count; if a float, a proportion of documents

Default is 1, which means it does not ignore any terms

A usage sketch follows below.
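For example (the threshold values here are illustrative):

from sklearn.feature_extraction.text import CountVectorizer

# Keep the 1000 most frequent terms, ignoring terms that appear
# in more than 90% of documents or in fewer than 5 documents
vect = CountVectorizer(max_features=1000, max_df=0.9, min_df=5)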

SENTIMENT ANALYSIS IN PYTHON


Let's practice!
S E N T I M E N T A N A LY S I S I N P Y T H O N
Build new features
from text
S E N T I M E N T A N A LY S I S I N P Y T H O N

Violeta Misheva
Data Scientist
Goal of the video

Goal : Enrich the existing dataset with features related to the text column (capturing the
sentiment)

SENTIMENT ANALYSIS IN PYTHON


Product reviews data
reviews.head()

SENTIMENT ANALYSIS IN PYTHON


Features from the review column

How long is each review?

How many sentences does it contain?

What parts of speech are involved?

How many punctuation marks?

SENTIMENT ANALYSIS IN PYTHON


Tokenizing a string
from nltk import word_tokenize

anna_k = 'Happy families are all alike, every unhappy family is unhappy in its own way.'

word_tokenize(anna_k)

['Happy','families','are', 'all','alike',',',
'every','unhappy', 'family', 'is','unhappy','in',
'its','own','way','.']

SENTIMENT ANALYSIS IN PYTHON


Tokens from a column
# General form of list comprehension
[expression for item in iterable]

word_tokens = [word_tokenize(review) for review in reviews.review]


type(word_tokens)

list

type(word_tokens[0])

list

SENTIMENT ANALYSIS IN PYTHON


Tokens from a column
len_tokens = []

# Iterate over the word_tokens list


for i in range(len(word_tokens)):
len_tokens.append(len(word_tokens[i]))

# Create a new feature for the length of each review


reviews['n_tokens'] = len_tokens

SENTIMENT ANALYSIS IN PYTHON


Dealing with punctuation
We did not address punctuation here, but you can exclude it

A feature that measures the number of punctuation signs (sketched below)

A review with many punctuation signs could signal a very emotionally charged opinion
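A minimal sketch of such a feature, counting punctuation characters per review (assumes the reviews DataFrame from earlier; the column name n_punct is made up for illustration):

from string import punctuation

# Count punctuation characters in each review
reviews['n_punct'] = [sum(ch in punctuation for ch in review)
                      for review in reviews.review]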

SENTIMENT ANALYSIS IN PYTHON


Reviews with a feature for the length
reviews.head()

SENTIMENT ANALYSIS IN PYTHON


Let's practice!
S E N T I M E N T A N A LY S I S I N P Y T H O N
Can you guess the
language?
S E N T I M E N T A N A LY S I S I N P Y T H O N

Violeta Misheva
Data Scientist
Language of a string in Python
from langdetect import detect_langs
foreign = 'Este libro ha sido uno de los mejores libros que he leido.'

detect_langs(foreign)

[es:0.9999945352697024]

SENTIMENT ANALYSIS IN PYTHON


Language of a column
Problem: Detect the language of each of the strings and capture the most likely language in
a new column

from langdetect import detect_langs


reviews = pd.read_csv('product_reviews.csv')

reviews.head()

SENTIMENT ANALYSIS IN PYTHON


Building a feature for the language
languages = []

for row in range(len(reviews)):


languages.append(detect_langs(reviews.iloc[row, 1]))

languages
[it:0.9999982541301151],
[es:0.9999954153640488],
[es:0.7142833997345875, en:0.2857160465706441],
[es:0.9999942365605781],
[es:0.999997956049055] ...

SENTIMENT ANALYSIS IN PYTHON


Building a feature for the language
# Transform the first list to a string and split on a colon
str(languages[0]).split(':')
['[it', '0.9999982541301151]']

str(languages[0]).split(':')[0]
'[it'

str(languages[0]).split(':')[0][1:]
'it'

SENTIMENT ANALYSIS IN PYTHON


Building a feature for the language
languages = [str(lang).split(':')[0][1:] for lang in languages]

reviews['language'] = languages

SENTIMENT ANALYSIS IN PYTHON


Let's practice!
S E N T I M E N T A N A LY S I S I N P Y T H O N
Stop words
S E N T I M E N T A N A LY S I S I N P Y T H O N

Violeta Misheva
Data Scientist
What are stop words and how to find them?
Stop words: words that occur too frequently and not considered informative

Lists of stop words in most languages

{'the', 'a', 'an', 'and', 'but', 'for', 'on', 'in', 'at' ...}

Context matters

{'movie', 'movies', 'film', 'films', 'cinema'}

SENTIMENT ANALYSIS IN PYTHON


Stop words with word clouds
Word cloud, not removing stop words
Word cloud with stop words removed

SENTIMENT ANALYSIS IN PYTHON


Remove stop words from word clouds
# Import libraries
from wordcloud import WordCloud, STOPWORDS

# Define the stopwords list


my_stopwords = set(STOPWORDS)
my_stopwords.update(["movie", "movies", "film", "films", "watch", "br"])

# Generate and show the word cloud


my_cloud = WordCloud(background_color='white', stopwords=my_stopwords).generate(name_string)
plt.imshow(my_cloud, interpolation='bilinear')

SENTIMENT ANALYSIS IN PYTHON


Stop words with BOW
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

# Define the set of stop words


my_stop_words = ENGLISH_STOP_WORDS.union(['film', 'movie', 'cinema', 'theatre'])

vect = CountVectorizer(stop_words=my_stop_words)
vect.fit(movies.review)
X = vect.transform(movies.review)

SENTIMENT ANALYSIS IN PYTHON


Let's practice!
S E N T I M E N T A N A LY S I S I N P Y T H O N
Capturing a token
pattern
S E N T I M E N T A N A LY S I S I N P Y T H O N

Violeta Misheva
Data Scientist
String operators and comparisons
# Checks if a string is composed only of letters
my_string.isalpha()

# Checks if a string is composed only of digits


my_string.isdigit()

# Checks if a string is composed only of alphanumeric characters


my_string.isalnum()

SENTIMENT ANALYSIS IN PYTHON


String operators with list comprehension
# Original word tokenization
word_tokens = [word_tokenize(review) for review in reviews.review]

# Keeping only tokens composed of letters


cleaned_tokens = [[word for word in item if word.isalpha()] for item in word_tokens]

len(word_tokens[0])

87

len(cleaned_tokens[0])

78

SENTIMENT ANALYSIS IN PYTHON


Regular expressions
import re

my_string = '#Wonderfulday'
# Extract #, followed by any letter, small or capital
x = re.search('#[A-Za-z]', my_string)

x
<re.Match object; span=(0, 2), match='#W'>

SENTIMENT ANALYSIS IN PYTHON


Token pattern with a BOW
# Default token pattern in CountVectorizer
r'\b\w\w+\b'

# Specify a particular token pattern
CountVectorizer(token_pattern=r'\b[^\d\W][^\d\W]+\b')

SENTIMENT ANALYSIS IN PYTHON


Let's practice!
S E N T I M E N T A N A LY S I S I N P Y T H O N
Stemming and
lemmatization
S E N T I M E N T A N A LY S I S I N P Y T H O N

Violeta Misheva
Data Scientist
What is stemming?
Stemming is the process of transforming words to their root forms, even if the stem itself is
not a valid word in the language.

staying, stays, stayed ----> stay


house, houses, housing ----> hous

SENTIMENT ANALYSIS IN PYTHON


What is lemmatization?
Lemmatization is quite similar to stemming but unlike stemming, it reduces the words to
roots that are valid words in the language.

stay, stays, staying, stayed ----> stay


house, houses, housing ----> house

SENTIMENT ANALYSIS IN PYTHON


Stemming vs. lemmatization
Stemming:
Produces roots of words

Fast and efficient to compute

Lemmatization:
Produces actual words

Slower than stemming and can depend on the part-of-speech

SENTIMENT ANALYSIS IN PYTHON


Stemming of strings
from nltk.stem import PorterStemmer

porter = PorterStemmer()

porter.stem('wonderful')

'wonder'

SENTIMENT ANALYSIS IN PYTHON


Non-English stemmers
Snowball Stemmer: Danish, Dutch, English, Finnish, French, German, Hungarian, Italian, Norwegian, Portuguese, Romanian, Russian, Spanish, Swedish

from nltk.stem.snowball import SnowballStemmer

DutchStemmer = SnowballStemmer("dutch")
DutchStemmer.stem("beginnen")

'begin'

SENTIMENT ANALYSIS IN PYTHON


How to stem a sentence?
porter.stem('Today is a wonderful day!')

'today is a wonderful day!'

tokens = word_tokenize('Today is a wonderful day!')


stemmed_tokens = [porter.stem(token) for token in tokens]
stemmed_tokens

['today', 'is', 'a', 'wonder', 'day', '!']

SENTIMENT ANALYSIS IN PYTHON


Lemmatization of a string
from nltk.stem import WordNetLemmatizer

WNlemmatizer = WordNetLemmatizer()

WNlemmatizer.lemmatize('wonderful', pos='a')

'wonderful'

SENTIMENT ANALYSIS IN PYTHON


Let's practice!
S E N T I M E N T A N A LY S I S I N P Y T H O N
TfIdf: More ways to
transform text
S E N T I M E N T A N A LY S I S I N P Y T H O N

Violeta Misheva
Data Scientist
What are the components of TfIdf?
TF: term frequency: How often a given word appears within a document in the corpus

Inverse document frequency: Log-ratio between the total number of documents and the number of documents that contain a specific word
Used to calculate the weight of words that do not occur frequently

SENTIMENT ANALYSIS IN PYTHON


TfIdf score of a word
TfIdf score:

TfIdf = term frequency * inverse document frequency

BOW does not account for length of a document, TfIdf does.

TfIdf likely to capture words common within a document but not across documents.

SENTIMENT ANALYSIS IN PYTHON


How is TfIdf useful?
Twitter airline sentiment
Low TfIdf scores: United, Virgin America

High TfIdf scores: check-in process (if rare across documents)

More on TfIdf
Since it penalizes frequent words, less need to deal with stop words explicitly.

Quite useful in search queries and information retrieval to rank the relevance of returned
results.

SENTIMENT ANALYSIS IN PYTHON


TfIdf in Python
# Import the TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

Arguments of TfidfVectorizer: max_features, ngram_range, stop_words, token_pattern, max_df, min_df

vect = TfidfVectorizer(max_features=100).fit(tweets.text)
X = vect.transform(tweets.text)

SENTIMENT ANALYSIS IN PYTHON


TfidfVectorizer
X
<14640x100 sparse matrix of type '<class 'numpy.float64'>'
with 119182 stored elements in Compressed Sparse Row format>

X_df = pd.DataFrame(X.toarray(), columns=vect.get_feature_names())


X_df.head()

SENTIMENT ANALYSIS IN PYTHON


Let's practice!
S E N T I M E N T A N A LY S I S I N P Y T H O N
Let's predict the
sentiment!
S E N T I M E N T A N A LY S I S I N P Y T H O N

Violeta Misheva
Data Scientist
Classification problems
Product and movie reviews: positive or negative sentiment (binary classification)

Tweets about airline companies: positive, neutral and negative (multi-class classification)

SENTIMENT ANALYSIS IN PYTHON


Linear and logistic regressions

SENTIMENT ANALYSIS IN PYTHON


Logistic function
Linear regression: numeric outcome

Logistic regression: probability:

Probability(sentiment = positive | review)

SENTIMENT ANALYSIS IN PYTHON


Logistic regression in Python
from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression().fit(X, y)

SENTIMENT ANALYSIS IN PYTHON


Measuring model performance
Accuracy: Fraction of predictions our model got right.

The higher and closer the accuracy is to 1, the better

# Accuracy using score


score = log_reg.score(X, y)
print(score)

0.9009

SENTIMENT ANALYSIS IN PYTHON


Using accuracy score
# Accuracy using accuracy_score
from sklearn.metrics import accuracy_score

y_predicted = log_reg.predict(X)
accuracy = accuracy_score(y, y_predicted)

0.9009

SENTIMENT ANALYSIS IN PYTHON


Let's practice!
S E N T I M E N T A N A LY S I S I N P Y T H O N
Did we really predict
the sentiment well?
S E N T I M E N T A N A LY S I S I N P Y T H O N

Violeta Misheva
Data Scientist
Train/test split

Training set: used to train the model (70-80% of the whole data)

Testing set: used to evaluate the performance of the model

SENTIMENT ANALYSIS IN PYTHON


Train/test in Python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123, stratify=y)

X : features

y : labels

test_size: proportion of data used in testing

random_state: seed generator used to make the split

stratify: proportion of classes in the sample produced will be the same as the proportion of
values provided to this parameter

SENTIMENT ANALYSIS IN PYTHON


Logistic regression with train/test split
log_reg = LogisticRegression().fit(X_train, y_train)

print('Accuracy on training data: ', log_reg.score(X_train, y_train))

0.76

print('Accuracy on testing data: ', log_reg.score(X_test, y_test))

0.73

SENTIMENT ANALYSIS IN PYTHON


Accuracy score with train/test split
from sklearn.metrics import accuracy_score

log_reg = LogisticRegression().fit(X_train, y_train)

y_predicted = log_reg.predict(X_test)
print('Accuracy score on test data: ', accuracy_score(y_test, y_predicted))

0.73

SENTIMENT ANALYSIS IN PYTHON


Confusion matrix

SENTIMENT ANALYSIS IN PYTHON


Confusion matrix in Python
from sklearn.metrics import confusion_matrix

log_reg = LogisticRegression().fit(X_train, y_train)


y_predicted = log_reg.predict(X_test)

print(confusion_matrix(y_test, y_predicted)/len(y_test))

[[0.3788 0.1224]
[0.1352 0.3636]]

SENTIMENT ANALYSIS IN PYTHON


Let's practice!
S E N T I M E N T A N A LY S I S I N P Y T H O N
Logistic regression:
revisited
S E N T I M E N T A N A LY S I S I N P Y T H O N

Violeta Misheva
Data Scientist
Complex models and regularization
Complex models:
Complex model that captures the noise in the data (overfitting)

Having a large number of features or parameters

Regularization:
A way to simplify and ensure we have a less complex model

SENTIMENT ANALYSIS IN PYTHON


Regularization in a logistic regression
from sklearn.linear_model import LogisticRegression

# Regularization arguments
LogisticRegression(penalty='l2', C=1.0)

L2: shrinks all coefficients towards zero

High values of C: low penalization, model fits the training data well.

Low values of C: high penalization, model less flexible.

SENTIMENT ANALYSIS IN PYTHON


Predicting a probability vs. predicting a class
log_reg = LogisticRegression().fit(X_train, y_train)

# Predict labels
y_predicted = log_reg.predict(X_test)

# Predict probability
y_probab = log_reg.predict_proba(X_test)

SENTIMENT ANALYSIS IN PYTHON


Predicting a probability vs. predicting a class
y_probab
array([[0.5002245, 0.4997755],
[0.4900345, 0.5099655],
...,
[0.7040499, 0.2959501]])

# Select the probabilities of class 1


y_probab = log_reg.predict_proba(X_test)[:, 1]

array([0.4997755, 0.5099655 ..., 0.2959501]])

SENTIMENT ANALYSIS IN PYTHON


Model metrics with predicted probabilities
Accuracy score and confusion matrix work with classes.

They raise a ValueError when applied to probabilities.

# Default probability encoding:
# If probability >= 0.5, then class 1, else class 0
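A quick sketch of applying that default threshold by hand (NumPy is the only added assumption):

import numpy as np

# Probabilities of class 1, as on the previous slide
y_probab = log_reg.predict_proba(X_test)[:, 1]

# Encode: probability >= 0.5 -> class 1, else class 0
y_classes = (y_probab >= 0.5).astype(int)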

SENTIMENT ANALYSIS IN PYTHON


Let's practice!
S E N T I M E N T A N A LY S I S I N P Y T H O N
Bringing it all
together
S E N T I M E N T A N A LY S I S I N P Y T H O N

Violeta Misheva
Data Scientist
The Sentiment Analysis problem
Sentiment analysis as the process of understanding the opinion of an author about a
subject

Movie reviews

Amazon product reviews

Twitter airline sentiment

Various emotionally charged literary examples

SENTIMENT ANALYSIS IN PYTHON


Exploration of the reviews
Basic information about size of reviews

Word clouds

Features for the length of reviews: number of words, number of sentences

Feature detecting the language of a review

SENTIMENT ANALYSIS IN PYTHON


Numeric transformations of sentiment-carrying
columns
Bag-of-words

TfIdf vectorization

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Vectorizer syntax
vect = CountVectorizer().fit(data.text_column)
X = vect.transform(data.text_column)

SENTIMENT ANALYSIS IN PYTHON


Arguments of the vectorizers
stop words: non-informative, frequently occurring words

n-gram range: use phrases not only single words

control size of vocabulary: max_features, max_df, min_df

capturing a pattern of tokens: remove digits or certain characters

Important but NOT arguments to the vectorizers

lemmas and stems

SENTIMENT ANALYSIS IN PYTHON


Supervised learning model
Logistic regression classifier to predict the sentiment

Evaluated with accuracy and confusion matrix

Importance of train/test split

SENTIMENT ANALYSIS IN PYTHON


Let's practice!
S E N T I M E N T A N A LY S I S I N P Y T H O N
Wrap up
S E N T I M E N T A N A LY S I S I N P Y T H O N

Violeta Misheva
Data Scientist
The Sentiment Analysis world

SENTIMENT ANALYSIS IN PYTHON


Sentiment analysis types

SENTIMENT ANALYSIS IN PYTHON


The automated sentiment analysis system

SENTIMENT ANALYSIS IN PYTHON


Congratulations!
S E N T I M E N T A N A LY S I S I N P Y T H O N
Natural Language
Processing (NLP)
basics
N AT U R A L L A N G U A G E P R O C E S S I N G W I T H S PA C Y

Azadeh Mobasher
Principal Data Scientist
Natural Language Processing (NLP)

A subfield of Artificial Intelligence (AI)

Helps computers to understand human language

Helps extract insights from unstructured data

Incorporates statistics, machine learning models and deep learning models

NATURAL LANGUAGE PROCESSING WITH SPACY


NLP use cases
Sentiment analysis

Use of computers to determine the underlying subjective tone of a piece of writing

NATURAL LANGUAGE PROCESSING WITH SPACY


NLP use cases
Named entity recognition (NER)

Locating and classifying named entities mentioned in unstructured text into pre-defined
categories

Named entities are real-world objects such as a person or location

NATURAL LANGUAGE PROCESSING WITH SPACY


NLP use cases

Generate human-like responses to text input, such as ChatGPT

NATURAL LANGUAGE PROCESSING WITH SPACY


Introduction to spaCy
spaCy is a free, open-source library for NLP in
Python which:

Is designed to build systems for information extraction

Provides production-ready code for NLP use cases

Supports 64+ languages

Is robust and fast and has visualization libraries

NATURAL LANGUAGE PROCESSING WITH SPACY


Install and import spaCy

spaCy can be installed using the Python package manager pip; trained models are downloaded separately. Multiple trained models are available for the English language at spacy.io.

$ python3 -m pip install spacy
$ python3 -m spacy download en_core_web_sm

import spacy
nlp = spacy.load("en_core_web_sm")

NATURAL LANGUAGE PROCESSING WITH SPACY


Read and process text with spaCy
Loaded spaCy model en_core_web_sm = nlp object
nlp object converts text into a Doc object (container) to store processed text

NATURAL LANGUAGE PROCESSING WITH SPACY


spaCy in action
Processing a string using spaCy

import spacy
nlp = spacy.load("en_core_web_sm")
text = "A spaCy pipeline object is created."
doc = nlp(text)

Tokenization
A Token is defined as the smallest meaningful part of the text.

Tokenization: The process of dividing a text into a list of meaningful tokens

print([token.text for token in doc])

['A', 'spaCy', 'pipeline', 'object', 'is', 'created', '.']

NATURAL LANGUAGE PROCESSING WITH SPACY


Let's practice!
N AT U R A L L A N G U A G E P R O C E S S I N G W I T H S PA C Y
spaCy basics
N AT U R A L L A N G U A G E P R O C E S S I N G W I T H S PA C Y

Azadeh Mobasher
Principal Data Scientist
spaCy NLP pipeline
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("Here's my spaCy pipeline.")

Use spacy.load() to return nlp , a Language class

The Language object is the text processing pipeline

Apply nlp() on any text to get a Doc container

NATURAL LANGUAGE PROCESSING WITH SPACY


spaCy NLP pipeline

spaCy applies some processing steps using its Language class:

NATURAL LANGUAGE PROCESSING WITH SPACY


Container objects in spaCy
There are multiple data structures to represent text data in spaCy :

Name Description
Doc A container for accessing linguistic annotations of text

Span A slice from a Doc object

Token An individual token, i.e. a word, punctuation, whitespace, etc.
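A minimal sketch of the three containers (the text and slice indices are illustrative):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("A Doc holds Tokens and Spans.")  # Doc: the whole processed text

token = doc[1]   # Token: a single item, here 'Doc'
span = doc[1:4]  # Span: a slice of the Doc, here 'Doc holds Tokens'
print(token.text, '|', span.text)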

NATURAL LANGUAGE PROCESSING WITH SPACY


Pipeline components
The spaCy language processing pipeline always depends on the loaded model and its
capabilities.

Component         Name        Description
Tokenizer         Tokenizer   Segment text into tokens and create Doc object
Tagger            Tagger      Assign part-of-speech tags
Lemmatizer        Lemmatizer  Reduce the words to their root forms
EntityRecognizer  NER         Detect and label named entities

NATURAL LANGUAGE PROCESSING WITH SPACY


Pipeline components

Each component has unique features to process text


Language

DependencyParser

Sentencizer

NATURAL LANGUAGE PROCESSING WITH SPACY


Tokenization
Always the first operation
All the other operations require tokens

Tokens can be words, numbers and punctuation

import spacy
nlp = spacy.load("en_core_web_sm")

doc = nlp("Tokenization splits a sentence into its tokens.")


print([token.text for token in doc])

['Tokenization', 'splits', 'a', 'sentence', 'into', 'its', 'tokens', '.']

NATURAL LANGUAGE PROCESSING WITH SPACY


Sentence segmentation
More complex than tokenization
Is a part of DependencyParser component

import spacy
nlp = spacy.load("en_core_web_sm")

text = "We are learning NLP. This course introduces spaCy."


doc = nlp(text)
for sent in doc.sents:
print(sent.text)

We are learning NLP.


This course introduces spaCy.

NATURAL LANGUAGE PROCESSING WITH SPACY


Lemmatization
A lemma is the base form of a token
The lemma of eats and ate is eat

Improves accuracy of language models

import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("We are seeing her after one year.")
print([(token.text, token.lemma_) for token in doc])

[('We', 'we'), ('are', 'be'), ('seeing', 'see'), ('her', 'she'),


('after', 'after'), ('one', 'one'), ('year', 'year'), ('.', '.')]

NATURAL LANGUAGE PROCESSING WITH SPACY


Let's practice!
N AT U R A L L A N G U A G E P R O C E S S I N G W I T H S PA C Y
Linguistic features in
spaCy
N AT U R A L L A N G U A G E P R O C E S S I N G W I T H S PA C Y

Azadeh Mobasher
Principal Data Scientist
POS tagging
Categorizing words grammatically, based on function and context within a sentence

POS Description Example


VERB Verb run, eat, ate, take
NOUN Noun man, airplane, tree, flower
ADJ Adjective big, old, incompatible, conflicting
ADV Adverb very, down, there, tomorrow
CONJ Conjunction and, or, but

NATURAL LANGUAGE PROCESSING WITH SPACY


POS tagging with spaCy

POS tagging confirms the meaning of a word

Some words such as watch can be both noun and verb


spaCy captures POS tags in the pos_ feature of the nlp pipeline

spacy.explain() explains a given POS tag

NATURAL LANGUAGE PROCESSING WITH SPACY


POS tagging with spaCy
verb_sent = "I watch TV."
print([(token.text, token.pos_, spacy.explain(token.pos_))
       for token in nlp(verb_sent)])

[('I', 'PRON', 'pronoun'), ('watch', 'VERB', 'verb'),
 ('TV', 'NOUN', 'noun'), ('.', 'PUNCT', 'punctuation')]

noun_sent = "I left without my watch."
print([(token.text, token.pos_, spacy.explain(token.pos_))
       for token in nlp(noun_sent)])

[('I', 'PRON', 'pronoun'), ('left', 'VERB', 'verb'),
 ('without', 'ADP', 'adposition'), ('my', 'PRON', 'pronoun'),
 ('watch', 'NOUN', 'noun'), ('.', 'PUNCT', 'punctuation')]

NATURAL LANGUAGE PROCESSING WITH SPACY


Named entity recognition
A named entity is a word or phrase that refers to a specific entity with a name
Named-entity recognition (NER) classifies named entities into pre-defined categories

Entity type Description


PERSON Named person or family
ORG Companies, institutions, etc.
GPE Geo-political entity, countries, cities, etc.
LOC Non-GPE locations, mountain ranges, etc.
DATE Absolute or relative dates or periods
TIME Time smaller than a day

NATURAL LANGUAGE PROCESSING WITH SPACY


NER and spaCy

spaCy models extract named entities using the NER pipeline component

Named entities are available via the doc.ents property


spaCy will also tag each entity with its entity label ( .label_ )

NATURAL LANGUAGE PROCESSING WITH SPACY


NER and spaCy

import spacy
nlp = spacy.load("en_core_web_sm")
text = "Albert Einstein was genius."
doc = nlp(text)
print([(ent.text, ent.start_char,
ent.end_char, ent.label_) for ent in doc.ents])

>>> [('Albert Einstein', 0, 15, 'PERSON')]

NATURAL LANGUAGE PROCESSING WITH SPACY


NER and spaCy
We can also access entity types of each token in a Doc container

import spacy
nlp = spacy.load("en_core_web_sm")
text = "Albert Einstein was genius."
doc = nlp(text)
print([(token.text, token.ent_type_) for token in doc])

>>> [('Albert', 'PERSON'), ('Einstein', 'PERSON'),


('was', ''), ('genius', ''), ('.', '')]

NATURAL LANGUAGE PROCESSING WITH SPACY


displaCy
spaCy is equipped with a modern visualizer: displaCy

The displaCy entity visualizer highlights named entities and their labels

import spacy
from spacy import displacy

text = "Albert Einstein was genius."
nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
displacy.serve(doc, style="ent")

NATURAL LANGUAGE PROCESSING WITH SPACY


Let's practice!
N AT U R A L L A N G U A G E P R O C E S S I N G W I T H S PA C Y
Linguistic features
N AT U R A L L A N G U A G E P R O C E S S I N G W I T H S PA C Y

Azadeh Mobasher
Principal Data Scientist
POS tagging
POS tags depend on the context, surrounding words and their tags

import spacy
nlp = spacy.load("en_core_web_sm")
text = "My cat will fish for a fish tomorrow in a fishy way."
print([(token.text, token.pos_, spacy.explain(token.pos_))
for token in nlp(text)])

NATURAL LANGUAGE PROCESSING WITH SPACY


What is the importance of POS?

Better accuracy for many NLP tasks

Translation system use case:
I will fish tomorrow. (verb) -> pescaré
I ate fish. (noun) -> pescado

NATURAL LANGUAGE PROCESSING WITH SPACY


What is the importance of POS?

Word-sense disambiguation (WSD) is the problem of deciding in which sense a word is used
in a sentence.

Determining the sense of the word can be crucial in machine translation, etc.

NATURAL LANGUAGE PROCESSING WITH SPACY


Word-sense disambiguation
import spacy
nlp = spacy.load("en_core_web_sm")

verb_text = "I will fish tomorrow."


noun_text = "I ate fish."

print([(token.text, token.pos_) for token in nlp(verb_text) if "fish" in token.text], "\n")
print([(token.text, token.pos_) for token in nlp(noun_text) if "fish" in token.text])

[('fish', 'VERB')]

[('fish', 'NOUN')]

NATURAL LANGUAGE PROCESSING WITH SPACY


Dependency parsing
Explores a sentence syntax
Links between two tokens

Results in a tree

NATURAL LANGUAGE PROCESSING WITH SPACY


Dependency parsing and spaCy

Dependency label describes the type of syntactic relation between two tokens

Dependency label Description


nsubj Nominal subject
root Root
det Determiner
dobj Direct object
aux Auxiliary

NATURAL LANGUAGE PROCESSING WITH SPACY


Dependency parsing and displaCy
displaCy can draw dependency trees

doc = nlp("We understand the differences.")

spacy.displacy.serve(doc, style="dep")

NATURAL LANGUAGE PROCESSING WITH SPACY


Dependency parsing and spaCy
.dep_ attribute to access the dependency label of a token

doc = nlp("We understand the differences.")


print([(token.text, token.dep_, spacy.explain(token.dep_)) for token in doc])

[('We', 'nsubj', 'nominal subject'), ('understand', 'ROOT', 'root'),


('the', 'det', 'determiner'), ('differences', 'dobj', 'direct object'),
('.', 'punct', 'punctuation')]

NATURAL LANGUAGE PROCESSING WITH SPACY


Let's practice!
N AT U R A L L A N G U A G E P R O C E S S I N G W I T H S PA C Y
Introduction to word
vectors
N AT U R A L L A N G U A G E P R O C E S S I N G W I T H S PA C Y

Azadeh Mobasher
Principal Data Scientist
Word vectors (embeddings)

Numerical representations of words

Bag of words method: {"I": 1, "got": 2, ...}

Older methods do not capture the meaning of words:

Sentence            I    got    covid    coronavirus
I got covid         1    2      3        -
I got coronavirus   1    2      -        4

NATURAL LANGUAGE PROCESSING WITH SPACY


Word vectors
A pre-defined number of dimensions
Considers word frequencies and the presence of other words in similar contexts
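For instance, we can check the dimensionality of a vector directly (a minimal sketch, assuming the en_core_web_md model is installed and "cheese" is an example word in its vocabulary):

import spacy

nlp = spacy.load("en_core_web_md")
# Every word vector in this model has the same fixed number of dimensions
print(nlp.vocab["cheese"].vector.shape)

>>> (300,)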

NATURAL LANGUAGE PROCESSING WITH SPACY


Word vectors
Multiple approaches to produce word vectors:
word2vec, GloVe, fastText and transformer-based architectures

An example of a word vector:

NATURAL LANGUAGE PROCESSING WITH SPACY


spaCy vocabulary

A part of many spaCy models.

en_core_web_md has 300-dimensional vectors for 20,000 words.

import spacy
nlp = spacy.load("en_core_web_md")
print(nlp.meta["vectors"])

>>> {'width': 300, 'vectors': 20000, 'keys': 514157,


'name': 'en_vectors', 'mode': 'default'}

NATURAL LANGUAGE PROCESSING WITH SPACY


Word vectors in spaCy
nlp.vocab : to access vocabulary ( Vocab class)

nlp.vocab.strings : to access word IDs in a vocabulary

import spacy
nlp = spacy.load("en_core_web_md")
like_id = nlp.vocab.strings["like"]
print(like_id)

>>> 18194338103975822726

.vocab.vectors : to access the word vectors of a model, or a word's vector given its corresponding ID

print(nlp.vocab.vectors[like_id])

>>> array([-2.3334e+00, -1.3695e+00, -1.1330e+00, -6.8461e-01, ...])

NATURAL LANGUAGE PROCESSING WITH SPACY


Let's practice!
N AT U R A L L A N G U A G E P R O C E S S I N G W I T H S PA C Y
Word vectors and
spaCy
N AT U R A L L A N G U A G E P R O C E S S I N G W I T H S PA C Y

Azadeh Mobasher
Principal Data Scientist
Word vectors visualization
Word vectors allow us to understand how words are grouped

Principal Component Analysis (PCA) can project word vectors into a two-dimensional space

NATURAL LANGUAGE PROCESSING WITH SPACY


Word vectors visualization
Import required libraries and a spaCy model.

import spacy
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

nlp = spacy.load("en_core_web_md")

Extract word vectors for a given list of words and stack them vertically.

words = ["wonderful", "horrible",


"apple", "banana", "orange", "watermelon",
"dog", "cat"]
word_vectors = np.vstack([nlp.vocab.vectors[nlp.vocab.strings[w]] for w in words])

NATURAL LANGUAGE PROCESSING WITH SPACY


Word vectors visualization
Extract two principal components using PCA.

pca = PCA(n_components=2)
word_vectors_transformed = pca.fit_transform(word_vectors)

Visualize the scatter plot of transformed vectors.

plt.figure(figsize=(10, 8))
plt.scatter(word_vectors_transformed[:, 0], word_vectors_transformed[:, 1])
for word, coord in zip(words, word_vectors_transformed):
    x, y = coord
    plt.text(x, y, word, size=10)
plt.show()

NATURAL LANGUAGE PROCESSING WITH SPACY


Analogies and vector operations
A semantic relationship between a pair of words.
Word embeddings can capture analogies such as gender and tense:
queen - woman + man = king
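A minimal sketch of this arithmetic with spaCy vectors (assuming en_core_web_md ; the exact neighbours returned depend on the model's vectors):

import numpy as np
import spacy

nlp = spacy.load("en_core_web_md")
vectors = nlp.vocab.vectors

# Build the analogy vector: queen - woman + man
analogy = (vectors[nlp.vocab.strings["queen"]]
           - vectors[nlp.vocab.strings["woman"]]
           + vectors[nlp.vocab.strings["man"]])

# Look up the closest words in the vocabulary to the analogy vector
keys, _, _ = vectors.most_similar(np.asarray([analogy]), n=3)
print([nlp.vocab.strings[key] for key in keys[0]])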

NATURAL LANGUAGE PROCESSING WITH SPACY


Similar words in a vocabulary
spaCy can find semantically similar terms to a given term

import numpy as np
import spacy
nlp = spacy.load("en_core_web_md")

word = "covid"
most_similar_words = nlp.vocab.vectors.most_similar(
np.asarray([nlp.vocab.vectors[nlp.vocab.strings[word]]]), n=5)

words = [nlp.vocab.strings[w] for w in most_similar_words[0][0]]


print(words)

>>> ['Covi', 'CoVid', 'Covici', 'COVID-19', 'corona']

NATURAL LANGUAGE PROCESSING WITH SPACY


Let's practice!
N AT U R A L L A N G U A G E P R O C E S S I N G W I T H S PA C Y
Measuring semantic
similarity with
spaCy
N AT U R A L L A N G U A G E P R O C E S S I N G W I T H S PA C Y

Azadeh Mobasher
Principal Data Scientist
The semantic similarity method

Process of analyzing texts to identify similarities

Categorizes texts into predefined categories or detects relevant texts


Similarity score measures how similar two pieces of text are

What is the cheapest flight from Boston to Seattle?


Which airline serves Denver, Pittsburgh and Atlanta?
What kinds of planes are used by American Airlines?

NATURAL LANGUAGE PROCESSING WITH SPACY


Similarity score
A metric defined over texts

To measure similarity, use cosine similarity and word vectors

Cosine similarity ranges from -1 to 1; scores near 1 indicate highly similar texts
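As an illustration, the score can be computed directly from two word vectors (a sketch, assuming en_core_web_md ; for in-vocabulary words this matches Token.similarity ):

import numpy as np
import spacy

nlp = spacy.load("en_core_web_md")
v1 = nlp.vocab["pizza"].vector
v2 = nlp.vocab["pasta"].vector

# Cosine similarity: dot product divided by the product of the vector norms
score = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
print(round(float(score), 3))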

NATURAL LANGUAGE PROCESSING WITH SPACY


Token similarity
spaCy calculates similarity scores between Token objects

nlp = spacy.load("en_core_web_md")
doc1 = nlp("We eat pizza")
doc2 = nlp("We like to eat pasta")
token1 = doc1[2]
token2 = doc2[4]
print(f"Similarity between {token1} and {token2} = ", round(token1.similarity(token2), 3))

>>> Similarity between pizza and pasta = 0.685

NATURAL LANGUAGE PROCESSING WITH SPACY


Span similarity
spaCy calculates semantic similarity of two given Span objects

doc1 = nlp("We eat pizza")


doc2 = nlp("We like to eat pasta")

span1 = doc1[1:]
span2 = doc2[1:]
print(f"Similarity between \"{span1}\" and \"{span2}\" = ",
round(span1.similarity(span2), 3))

>>> Similarity between "eat pizza" and "like to eat pasta" = 0.588

print(f"Similarity between \"{doc1[1:]}\" and \"{doc2[3:]}\" = ",


round(doc1[1:].similarity(doc2[3:]), 3))

>>> Similarity between "eat pizza" and "eat pasta" = 0.936

NATURAL LANGUAGE PROCESSING WITH SPACY


Doc similarity
spaCy calculates the similarity scores between two documents

nlp = spacy.load("en_core_web_md")

doc1 = nlp("I like to play basketball")


doc2 = nlp("I love to play basketball")
print("Similarity score :", round(doc1.similarity(doc2), 3))

>>> Similarity score : 0.975

High cosine similarity shows highly semantically similar contents

Doc vectors default to an average of word vectors
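We can verify this averaging behaviour directly (a minimal sketch; it assumes every token is in the model's vocabulary):

import numpy as np

doc = nlp("I like to play basketball")

# The Doc vector should equal the mean of its token vectors
token_mean = np.mean([token.vector for token in doc], axis=0)
print(np.allclose(doc.vector, token_mean))

>>> True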

NATURAL LANGUAGE PROCESSING WITH SPACY


Sentence similarity
spaCy can find content relevant to a given keyword

Finding similar customer questions to the word price:

sentences = nlp("What is the cheapest flight from Boston to Seattle? "
                "Which airline serves Denver, Pittsburgh and Atlanta? "
                "What kinds of planes are used by American Airlines?")

keyword = nlp("price")
for i, sentence in enumerate(sentences.sents):
    print(f"Similarity score with sentence {i+1}: ", round(sentence.similarity(keyword), 5))

>>> Similarity score with sentence 1: 0.26136


Similarity score with sentence 2: 0.14021
Similarity score with sentence 3: 0.13885

NATURAL LANGUAGE PROCESSING WITH SPACY


spaCy pipelines
N AT U R A L L A N G U A G E P R O C E S S I N G W I T H S PA C Y

Azadeh Mobasher
Principal Data Scientist
spaCy pipelines

spaCy first tokenizes the text to produce a Doc object

The Doc is then processed in several different steps of the processing pipeline

import spacy
nlp = spacy.load("en_core_web_sm")

doc = nlp(example_text)  # example_text holds the raw input string

NATURAL LANGUAGE PROCESSING WITH SPACY


spaCy pipelines
A pipeline is a sequence of pipes, or actors on data

A spaCy NER pipeline:


Tokenization
Named entity identification

Named entity classification

print([ent.text for ent in doc.ents])
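The pipes applied to each Doc can be listed by name (a quick check; the component names shown are typical for en_core_web_sm and vary by model and spaCy version):

# Pipes applied to each Doc, in order
print(nlp.pipe_names)

>>> ['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']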

NATURAL LANGUAGE PROCESSING WITH SPACY


Adding pipes

sentencizer : spaCy pipeline component for sentence segmentation.

text = " ".join(["This is a test sentence."]*10000)


en_core_sm_nlp = spacy.load("en_core_web_sm")
start_time = time.time()
doc = en_core_sm_nlp(text)
print(f"Finished processing with en_core_web_sm model in
{round((time.time() - start_time)/60.0 , 5)} minutes")

>>> Finished processing with en_core_web_sm model in 0.09332 minutes

NATURAL LANGUAGE PROCESSING WITH SPACY


Adding pipes

Create a blank model and add a sentencizer pipe:

blank_nlp = spacy.blank("en")
blank_nlp.add_pipe("sentencizer")
start_time = time.time()
doc = blank_nlp(text)
print(f"Finished processing with blank model in "
      f"{round((time.time() - start_time)/60.0, 5)} minutes")

>>> Finished processing with blank model in 0.00091 minutes

NATURAL LANGUAGE PROCESSING WITH SPACY


Analyzing pipeline components
nlp.analyze_pipes() analyzes a spaCy pipeline to determine:
Attributes that pipeline components set

Scores a component produces during training

Presence of all required attributes

Setting pretty to True will print a table instead of only returning the structured data.

import spacy

nlp = spacy.load("en_core_web_sm")
analysis = nlp.analyze_pipes(pretty=True)

NATURAL LANGUAGE PROCESSING WITH SPACY


Analyzing pipeline components

NATURAL LANGUAGE PROCESSING WITH SPACY


Let's practice!
N AT U R A L L A N G U A G E P R O C E S S I N G W I T H S PA C Y
spaCy EntityRuler
N AT U R A L L A N G U A G E P R O C E S S I N G W I T H S PA C Y

Azadeh Mobasher
Principal Data Scientist
spaCy EntityRuler

EntityRuler adds named-entities to a Doc container

It can be used on its own or combined with EntityRecognizer


Phrase entity patterns for exact string matches (string):

{"label": "ORG", "pattern": "Microsoft"}

Token entity patterns with one dictionary describing one token (list):

{"label": "GPE", "pattern": [{"LOWER": "san"}, {"LOWER": "francisco"}]}

NATURAL LANGUAGE PROCESSING WITH SPACY


Adding EntityRuler to spaCy pipeline

Using .add_pipe() method

List of patterns can be added using .add_patterns() method

nlp = spacy.blank("en")
entity_ruler = nlp.add_pipe("entity_ruler")
patterns = [{"label": "ORG", "pattern": "Microsoft"},
{"label": "GPE", "pattern": [{"LOWER": "san"}, {"LOWER": "francisco"}]}]
entity_ruler.add_patterns(patterns)

NATURAL LANGUAGE PROCESSING WITH SPACY


Adding EntityRuler to spaCy pipeline

doc.ents stores the entities added by the EntityRuler component

doc = nlp("Microsoft is hiring software developer in San Francisco.")


print([(ent.text, ent.label_) for ent in doc.ents])

[('Microsoft', 'ORG'), ('San Francisco', 'GPE')]

NATURAL LANGUAGE PROCESSING WITH SPACY


EntityRuler in action

Integrates with spaCy pipeline components

Enhances the named-entity recognizer


spaCy model without EntityRuler :

nlp = spacy.load("en_core_web_sm")

doc = nlp("Manhattan associates is a company in the U.S.")


print([(ent.text, ent.label_) for ent in doc.ents])

>>> [('Manhattan', 'GPE'), ('U.S.', 'GPE')]

NATURAL LANGUAGE PROCESSING WITH SPACY


EntityRuler in action

EntityRuler added after existing ner component:

nlp = spacy.load("en_core_web_sm")
ruler = nlp.add_pipe("entity_ruler", after='ner')
patterns = [{"label": "ORG", "pattern": [{"lower": "manhattan"}, {"lower": "associates"}]}]
ruler.add_patterns(patterns)

doc = nlp("Manhattan associates is a company in the U.S.")


print([(ent.text, ent.label_) for ent in doc.ents])

>>> [('Manhattan', 'GPE'), ('U.S.', 'GPE')]

NATURAL LANGUAGE PROCESSING WITH SPACY


EntityRuler in action

EntityRuler added before existing ner component:

nlp = spacy.load("en_core_web_sm")
ruler = nlp.add_pipe("entity_ruler", before='ner')
patterns = [{"label": "ORG", "pattern": [{"lower": "manhattan"}, {"lower": "associates"}]}]
ruler.add_patterns(patterns)

doc = nlp("Manhattan associates is a company in the U.S.")


print([(ent.text, ent.label_) for ent in doc.ents])

>>> [('Manhattan associates', 'ORG'), ('U.S.', 'GPE')]

NATURAL LANGUAGE PROCESSING WITH SPACY


Let's practice!
N AT U R A L L A N G U A G E P R O C E S S I N G W I T H S PA C Y
RegEx with spaCy
N AT U R A L L A N G U A G E P R O C E S S I N G W I T H S PA C Y

Azadeh Mobasher
Principal Data Scientist
What is RegEx?

Rule-based information extraction (IE) is useful for many NLP tasks

Regular expressions (RegEx) are used for complex string matching patterns

RegEx finds and retrieves patterns or replaces matching patterns

NATURAL LANGUAGE PROCESSING WITH SPACY


RegEx strengths and weaknesses
Pros:

Enables writing robust rules to retrieve information

Can allow us to find many types of variance in strings

Runs fast

Supported by many programming languages

Cons:

Syntax is challenging for beginners

Requires knowledge of all the ways a pattern may be mentioned in texts

NATURAL LANGUAGE PROCESSING WITH SPACY


RegEx in Python

Python comes prepackaged with a RegEx library, re .

The first step in using the re package is to define a pattern .


The resulting pattern is used to find matching content.

import re

pattern = r"((\d){3}-(\d){3}-(\d){4})"
text = "Our phone number is 832-123-5555 and their phone number is 425-123-4567."

NATURAL LANGUAGE PROCESSING WITH SPACY


RegEx in Python

We use .finditer() method from re package

iter_matches = re.finditer(pattern, text)

for match in iter_matches:
    start_char = match.start()
    end_char = match.end()
    print("Start character: ", start_char, "| End character: ", end_char,
          "| Matching text: ", text[start_char:end_char])

>>> Start character: 20 | End character: 32 | Matching text: 832-123-5555


Start character: 59 | End character: 71 | Matching text: 425-123-4567

NATURAL LANGUAGE PROCESSING WITH SPACY


RegEx in spaCy
RegEx in three pipeline components: Matcher , PhraseMatcher and EntityRuler .

text = "Our phone number is 832-123-5555 and their phone number is 425-123-4567."
nlp = spacy.blank("en")
patterns = [{"label": "PHONE_NUMBER", "pattern": [{"SHAPE": "ddd"},
{"ORTH": "-"}, {"SHAPE": "ddd"},
{"ORTH": "-"}, {"SHAPE": "dddd"}]}]
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns(patterns)
doc = nlp(text)
print ([(ent.text, ent.label_) for ent in doc.ents])

>>> [('832-123-5555', 'PHONE_NUMBER'), ('425-123-4567', 'PHONE_NUMBER')]

NATURAL LANGUAGE PROCESSING WITH SPACY


Let's practice!
N AT U R A L L A N G U A G E P R O C E S S I N G W I T H S PA C Y
spaCy Matcher and
PhraseMatcher
N AT U R A L L A N G U A G E P R O C E S S I N G W I T H S PA C Y

Azadeh Mobasher
Principal Data Scientist
Matcher in spaCy

RegEx patterns can be complex, difficult to read and debug.

spaCy provides a readable and production-level alternative, the Matcher class.

import spacy
from spacy.matcher import Matcher
nlp = spacy.load("en_core_web_sm")
doc = nlp("Good morning, this is our first day on campus.")
matcher = Matcher(nlp.vocab)

NATURAL LANGUAGE PROCESSING WITH SPACY


Matcher in spaCy

Matching output includes the start and end token indices of the matched pattern.

pattern = [{"LOWER": "good"}, {"LOWER": "morning"}]


matcher.add("morning_greeting", [pattern])
matches = matcher(doc)
for match_id, start, end in matches:
print("Start token: ", start, " | End token: ", end,
"| Matched text: ", doc[start:end].text)

>>> Start token: 0 | End token: 2 | Matched text: Good morning

NATURAL LANGUAGE PROCESSING WITH SPACY


Matcher extended syntax support

Allows operators in defining the matching patterns.

Similar operators to Python's in , not in and comparison operators

Attribute Value type Description


IN any type Attribute value is a member of a list

NOT_IN any type Attribute value is not a member of a list

== , >= , <= , > , < int, float Comparison operators for equality or inequality checks

NATURAL LANGUAGE PROCESSING WITH SPACY


Matcher extended syntax support
Using IN operator to match both good morning and good evening

doc = nlp("Good morning and good evening.")


matcher = Matcher(nlp.vocab)
pattern = [{"LOWER": "good"}, {"LOWER": {"IN": ["morning", "evening"]}}]
matcher.add("morning_greeting", [pattern])
matches = matcher(doc)

The output of matching using IN operator

for match_id, start, end in matches:
    print("Start token: ", start, " | End token: ", end,
          "| Matched text: ", doc[start:end].text)

>>> Start token: 0 | End token: 2 | Matched text: Good morning


Start token: 3 | End token: 5 | Matched text: good evening

NATURAL LANGUAGE PROCESSING WITH SPACY


PhraseMatcher in spaCy

PhraseMatcher class matches a long list of phrases in a given text.

from spacy.matcher import PhraseMatcher


nlp = spacy.load("en_core_web_sm")
matcher = PhraseMatcher(nlp.vocab)
terms = ["Bill Gates", "John Smith"]

NATURAL LANGUAGE PROCESSING WITH SPACY


PhraseMatcher in spaCy
PhraseMatcher outputs include start and end token indices of the matched pattern

patterns = [nlp.make_doc(term) for term in terms]

matcher.add("PeopleOfInterest", patterns)
doc = nlp("Bill Gates met John Smith for an important discussion "
          "regarding the importance of AI.")
matches = matcher(doc)
for match_id, start, end in matches:
    print("Start token: ", start, " | End token: ", end,
          "| Matched text: ", doc[start:end].text)

>>> Start token: 0 | End token: 2 | Matched text: Bill Gates


Start token: 3 | End token: 5 | Matched text: John Smith

NATURAL LANGUAGE PROCESSING WITH SPACY


PhraseMatcher in spaCy
We can use the attr argument of the PhraseMatcher class to match on other token attributes

matcher = PhraseMatcher(nlp.vocab, attr = "LOWER")


terms = ["Government", "Investment"]
patterns = [nlp.make_doc(term) for term in terms]
matcher.add("InvestmentTerms", patterns)
doc = nlp("It was interesting to the investment division of the government.")

matcher = PhraseMatcher(nlp.vocab, attr = "SHAPE")


terms = ["110.0.0.0", "101.243.0.0"]
patterns = [nlp.make_doc(term) for term in terms]
matcher.add("IPAddresses", patterns)
doc = nlp("The tracked IP address was 234.135.0.0.")

NATURAL LANGUAGE PROCESSING WITH SPACY


Let's practice!
N AT U R A L L A N G U A G E P R O C E S S I N G W I T H S PA C Y
Customizing spaCy
models
N AT U R A L L A N G U A G E P R O C E S S I N G W I T H S PA C Y

Azadeh Mobasher
Principal data scientist
Why train spaCy models?
spaCy models go a long way for general NLP use cases
But they may not have seen data from specific domains during training, e.g.
Twitter data

Medical data

NATURAL LANGUAGE PROCESSING WITH SPACY


Why train spaCy models?

Better results on your specific domain

Essential for domain specific text classification

Before starting to train, ask the following questions:

Do spaCy models perform well enough on our data?

Does our domain include many labels that are absent in spaCy models?

NATURAL LANGUAGE PROCESSING WITH SPACY


Models performance on our data
Do spaCy models perform well enough on our data?
Oxford Street is not correctly classified with a GPE label:

import spacy
nlp = spacy.load("en_core_web_sm")

text = "The car was navigating to the Oxford Street."


doc = nlp(text)
print([(ent.text, ent.label_) for ent in doc.ents])

[('the Oxford Street', 'ORG')]

NATURAL LANGUAGE PROCESSING WITH SPACY


Output labels in spaCy models
Does our domain include many labels that are absent in spaCy models?

NATURAL LANGUAGE PROCESSING WITH SPACY


Output labels in spaCy models

If we need custom model training, we follow these steps:

Collect our domain specific data

Annotate our data

Decide whether to update an existing model or train a model from scratch

NATURAL LANGUAGE PROCESSING WITH SPACY


Let's practice!
N AT U R A L L A N G U A G E P R O C E S S I N G W I T H S PA C Y
Training data
preparation
N AT U R A L L A N G U A G E P R O C E S S I N G W I T H S PA C Y

Azadeh Mobasher
Principal data scientist
Training steps

1. Annotate and prepare input data

2. Initialize the model weights


3. Predict a few examples with the current weights

4. Compare prediction with correct answers

5. Use optimizer to calculate weights that improve model performance

6. Update weights slightly

7. Go back to step 3.

NATURAL LANGUAGE PROCESSING WITH SPACY


Annotating and preparing data
The first step is to prepare training data in the required format

After collecting data, we annotate it

Annotation means labeling the intent, entities, etc.

This is an example of annotated data:

annotated_data = {
    "sentence": "Antiviral drugs used against influenza include neuraminidase inhibitors.",
    "entities": {
        "label": "Medicine",
        "value": "neuraminidase inhibitors",
    }
}

NATURAL LANGUAGE PROCESSING WITH SPACY


Annotating and preparing data
Here's another example of annotated data:

annotated_data = {
    "sentence": "Bill Gates visited the SFO Airport.",
    "entities": [{"label": "PERSON", "value": "Bill Gates"},
                 {"label": "LOC", "value": "SFO Airport"}]
}

NATURAL LANGUAGE PROCESSING WITH SPACY


spaCy training data format
Data annotation prepares training data for what we want the model to learn
Training dataset has to be stored as a dictionary:

training_data = [
    ("I will visit you in Austin.", {"entities": [(20, 26, "GPE")]}),
    ("I'm going to Sam's house.", {"entities": [(13, 18, "PERSON"), (19, 24, "GPE")]}),
    ("I will go.", {"entities": []})
]

Three example pairs:

Each example pair includes a sentence as the first element

The pair's second element is a dictionary of annotated entities with their start and end character offsets
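The character offsets can be sanity-checked by slicing the sentence (a small sketch using the first example pair above):

text, annotations = training_data[0]
start, end, label = annotations["entities"][0]

# Slicing with the start and end offsets recovers the entity text
print(text[start:end], "->", label)

>>> Austin -> GPE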

NATURAL LANGUAGE PROCESSING WITH SPACY


Example object data for training
We cannot feed the raw text directly to spaCy
We need to create an Example object for each training example

import spacy
from spacy.training import Example

nlp = spacy.load("en_core_web_sm")

doc = nlp("I will visit you in Austin.")


annotations = {"entities": [(20, 26, "GPE")]}

example_sentence = Example.from_dict(doc, annotations)


print(example_sentence.to_dict())

NATURAL LANGUAGE PROCESSING WITH SPACY


Let's practice!
N AT U R A L L A N G U A G E P R O C E S S I N G W I T H S PA C Y
Training with spaCy
N AT U R A L L A N G U A G E P R O C E S S I N G W I T H S PA C Y

Azadeh Mobasher
Principal Data Scientist
Training steps

1. Annotate and prepare input data

2. Disable other pipeline components


3. Train a model for a few epochs

4. Evaluate model performance

NATURAL LANGUAGE PROCESSING WITH SPACY


Disabling other pipeline components

Disable all pipeline components except NER:

other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']

nlp.disable_pipes(*other_pipes)
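In spaCy v3, the same effect is available through select_pipes , which also restores the other pipes automatically (a minimal sketch, assuming a v3 pipeline):

# Temporarily run with only the NER component enabled;
# all other pipes are restored when the block exits
with nlp.select_pipes(enable="ner"):
    # training loop goes here
    pass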

NATURAL LANGUAGE PROCESSING WITH SPACY


Model training procedure
Go over the training set several times; one iteration is called an epoch .
In each epoch, update the weights of the model by a small amount.

Optimizers update the model weights.

import random

optimizer = nlp.create_optimizer()

losses = {}
epochs = 10  # number of passes over the training data
for i in range(epochs):
    random.shuffle(training_data)
    for text, annotation in training_data:
        doc = nlp.make_doc(text)
        example = Example.from_dict(doc, annotation)
        nlp.update([example], sgd=optimizer, losses=losses)

NATURAL LANGUAGE PROCESSING WITH SPACY


Save and load a trained model

Save a trained NER model:

ner = nlp.get_pipe("ner")
ner.to_disk("<ner model name>")

Load the saved model:

ner = nlp.create_pipe("ner")
ner.from_disk("<ner model name>")
nlp.add_pipe(ner, "<ner model name>")

NATURAL LANGUAGE PROCESSING WITH SPACY


Model for inference

Use a saved model at inference.

Apply NER model and store tuples of (entity text, entity label):

doc = nlp(text)
entities = [(ent.text, ent.label_) for ent in doc.ents]

NATURAL LANGUAGE PROCESSING WITH SPACY


Let's practice!
N AT U R A L L A N G U A G E P R O C E S S I N G W I T H S PA C Y
Wrap-up
N AT U R A L L A N G U A G E P R O C E S S I N G W I T H S PA C Y

Azadeh Mobasher
Principal data scientist
Chapter 1 - Introduction to NLP and spaCy

Use spaCy 's text processing pipelines to extract linguistic features:

NATURAL LANGUAGE PROCESSING WITH SPACY


Chapter 2 - spaCy linguistic annotations and word
vectors
Work with spaCy 's classes such as Doc , Token and Span and predict semantic similarities
using word vectors:

NATURAL LANGUAGE PROCESSING WITH SPACY


Chapter 3 - Data analysis with spaCy
Write matching patterns to extract terms and phrases using spaCy 's Matcher and
PhraseMatcher :

matcher = Matcher(nlp.vocab)
pattern = [{"LOWER": "good"}, {"LOWER": {"IN": ["morning", "evening"]}}]
matcher.add("morning_greeting", [pattern])

matcher = PhraseMatcher(nlp.vocab, attr = "LOWER")


patterns = [nlp.make_doc(term) for term in terms]
matcher.add("InvestmentTerms", patterns)

NATURAL LANGUAGE PROCESSING WITH SPACY


Chapter 4 - Customizing spaCy models

Annotate and prepare our data for training

Train spaCy models and use them at inference time

NATURAL LANGUAGE PROCESSING WITH SPACY


Recommended resources

Introduction to Deep Learning in Python

Introduction to Deep Learning with PyTorch


Introduction to ChatGPT

NATURAL LANGUAGE PROCESSING WITH SPACY


Congratulations!
N AT U R A L L A N G U A G E P R O C E S S I N G W I T H S PA C Y
Introduction to
audio data in
Python
SPOKEN LANGUAGE PROCESSING IN PYTHON

Daniel Bourke
Machine Learning Engineer/YouTube
Creator
Dealing with audio files in Python
Different kinds of audio files
mp3

wav

m4a

flac

Digital sound is measured in frequency (kHz)


1 kHz = 1000 pieces of information per second

SPOKEN LANGUAGE PROCESSING IN PYTHON


Frequency examples
Streaming songs have a frequency of 32 kHz

Audiobooks and spoken language are between 8 and 16 kHz

We can't see audio files so we have to transform them first

import wave

SPOKEN LANGUAGE PROCESSING IN PYTHON


Opening an audio file in Python
Audio file saved as good-morning.wav

# Import audio file as wave object


good_morning = wave.open("good-morning.wav", "r")

# Convert wave object to bytes


good_morning_soundwave = good_morning.readframes(-1)

# View the wav file in byte form


good_morning_soundwave

b'\xfd\xff\xfb\xff\xf8\xff\xf8\xff\xf7\...

SPOKEN LANGUAGE PROCESSING IN PYTHON


Working with audio is different
Have to convert the audio to something useful

Small sample of audio = large amount of information
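As a rough illustration (an illustrative estimate, assuming mono, 16-bit samples at 48 kHz):

# Bytes per minute = sample rate * bytes per sample * seconds
bytes_per_minute = 48_000 * 2 * 60
print(f"{bytes_per_minute / 1_000_000:.2f} MB per minute")

>>> 5.76 MB per minute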

SPOKEN LANGUAGE PROCESSING IN PYTHON


Let's practice!
SPOKEN LANGUAGE PROCESSING IN PYTHON
Converting sound
wave bytes to
integers
SPOKEN LANGUAGE PROCESSING IN PYTHON

Daniel Bourke
Machine Learning Engineer/YouTube
Creator
Converting bytes to integers
Can't use bytes

Convert bytes to integers using numpy

import numpy as np
# Convert soundwave_gm from bytes to integers
signal_gm = np.frombuffer(soundwave_gm, dtype='int16')
# Show the first 10 items
signal_gm[:10]

array([ -3, -5, -8, -8, -9, -13, -8, -10, -9, -11], dtype=int16)

SPOKEN LANGUAGE PROCESSING IN PYTHON


Finding the frame rate
Frequency (Hz) = length of wave object array/duration of audio file (seconds)

# Get the frame rate


framerate_gm = good_morning.getframerate()
# Show the frame rate
framerate_gm

48000

Duration of audio file (seconds) = length of wave object array/frequency (Hz)
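Putting the two formulas together (a minimal sketch, assuming signal_gm and framerate_gm from the previous slides):

# Duration in seconds = number of samples / sample rate
duration_gm = len(signal_gm) / framerate_gm
print(round(duration_gm, 2))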

SPOKEN LANGUAGE PROCESSING IN PYTHON


Finding sound wave timestamps
# Return evenly spaced values between start and stop
np.linspace(start=1, stop=10, num=10)

array([ 1., 2., 3., 4., 5., 6., 7., 8., 9., 10.])

# Get the timestamps of the good morning sound wave


time_gm = np.linspace(start=0,
                      stop=len(signal_gm)/framerate_gm,
                      num=len(signal_gm))

SPOKEN LANGUAGE PROCESSING IN PYTHON


Finding sound wave timestamps
# View first 10 time stamps of good morning sound wave
time_gm[:10]

array([0.00000000e+00, 2.08334167e-05, 4.16668333e-05, 6.25002500e-05,


8.33336667e-05, 1.04167083e-04, 1.25000500e-04, 1.45833917e-04,
1.66667333e-04, 1.87500750e-04])

SPOKEN LANGUAGE PROCESSING IN PYTHON


Let's practice!
SPOKEN LANGUAGE PROCESSING IN PYTHON
Visualizing sound
waves
SPOKEN LANGUAGE PROCESSING IN PYTHON

Daniel Bourke
Machine Learning Engineer/YouTube
Creator
Adding another sound wave
New audio file: good_afternoon.wav

Both are 48 kHz

Apply the same data transformations to all audio files

SPOKEN LANGUAGE PROCESSING IN PYTHON


Setting up a plot
import matplotlib.pyplot as plt
# Initialize figure and setup title
plt.title("Good Afternoon vs. Good Morning")
# x and y axis labels
plt.xlabel("Time (seconds)")
plt.ylabel("Amplitude")
# Add good morning and good afternoon values
plt.plot(time_ga, soundwave_ga, label="Good Afternoon")
plt.plot(time_gm, soundwave_gm, label="Good Morning",
         alpha=0.5)
# Create a legend and show our plot
plt.legend()
plt.show()

SPOKEN LANGUAGE PROCESSING IN PYTHON


SPOKEN LANGUAGE PROCESSING IN PYTHON
Time to visualize!
SPOKEN LANGUAGE PROCESSING IN PYTHON
SpeechRecognition
Python library
SPOKEN LANGUAGE PROCESSING IN PYTHON

Daniel Bourke
Machine Learning Engineer/YouTube
Creator
Why the SpeechRecognition library?
Some existing python libraries

CMU Sphinx

Kaldi

SpeechRecognition

Wav2letter++ by Facebook

SPOKEN LANGUAGE PROCESSING IN PYTHON


Getting started with SpeechRecognition
Install from PyPi:

$ pip install SpeechRecognition

Compatible with Python 2 and 3

We'll use Python 3

SPOKEN LANGUAGE PROCESSING IN PYTHON


Using the Recognizer class
# Import the SpeechRecognition library
import speech_recognition as sr
# Create an instance of Recognizer
recognizer = sr.Recognizer()
# Set the energy threshold
recognizer.energy_threshold = 300

SPOKEN LANGUAGE PROCESSING IN PYTHON


Using the Recognizer class to recognize speech
Recognizer class has built-in functions which interact with speech APIs
recognize_bing()

recognize_google()

recognize_google_cloud()

recognize_wit()

Input: audio_file

Output: transcribed speech from audio_file

SPOKEN LANGUAGE PROCESSING IN PYTHON


SpeechRecognition Example
Focus on recognize_google()

Recognize speech from an audio file with SpeechRecognition:

# Import SpeechRecognition library


import speech_recognition as sr
# Instantiate Recognizer class
recognizer = sr.Recognizer()
# Transcribe speech using Google web API
recognizer.recognize_google(audio_data=audio_file,
                            language="en-US")

Learning speech recognition on DataCamp is awesome!

SPOKEN LANGUAGE PROCESSING IN PYTHON


Your turn!
SPOKEN LANGUAGE PROCESSING IN PYTHON
Reading audio files
with
SpeechRecognition
SPOKEN LANGUAGE PROCESSING IN PYTHON

Daniel Bourke
Machine Learning Engineer/YouTube
Creator
The AudioFile class
import speech_recognition as sr
# Setup recognizer instance
recognizer = sr.Recognizer()
# Read in audio file
clean_support_call = sr.AudioFile("clean-support-call.wav")
# Check type of clean_support_call
type(clean_support_call)

<class 'speech_recognition.AudioFile'>

SPOKEN LANGUAGE PROCESSING IN PYTHON


From AudioFile to AudioData
recognizer.recognize_google(audio_data=clean_support_call)

AssertionError: ``audio_data`` must be audio data

# Convert from AudioFile to AudioData
with clean_support_call as source:
    # Record the audio
    clean_support_call_audio = recognizer.record(source)

# Check the type
type(clean_support_call_audio)

<class 'speech_recognition.AudioData'>

SPOKEN LANGUAGE PROCESSING IN PYTHON


Transcribing our AudioData
# Transcribe clean support call
recognizer.recognize_google(audio_data=clean_support_call_audio)

hello I'd like to get some help setting up my account please

SPOKEN LANGUAGE PROCESSING IN PYTHON


Duration and offset
duration and offset both None by default

# Leave duration and offset as default
with clean_support_call as source:
    clean_support_call_audio = recognizer.record(source,
                                                 duration=None,
                                                 offset=None)

# Get first 2-seconds of clean support call
with clean_support_call as source:
    clean_support_call_audio = recognizer.record(source,
                                                 duration=2.0)

hello I'd like to get
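Similarly, offset skips the start of the file (a sketch using the same audio; the earlier transcription suggests what would be dropped):

# Skip the first 2 seconds of the recording
with clean_support_call as source:
    rest_of_call_audio = recognizer.record(source, offset=2.0)

# Transcribe everything after the skipped portion
recognizer.recognize_google(audio_data=rest_of_call_audio)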

SPOKEN LANGUAGE PROCESSING IN PYTHON


Let's practice!
SPOKEN LANGUAGE PROCESSING IN PYTHON
Dealing with
different kinds of
audio
SPOKEN LANGUAGE PROCESSING IN PYTHON

Daniel Bourke
Machine Learning Engineer/YouTube
Creator
What language?
# Create a recognizer class
recognizer = sr.Recognizer()
# Pass the Japanese audio to recognize_google
text = recognizer.recognize_google(japanese_good_morning,
language="en-US")
# Print the text
print(text)

Ohio gozaimasu

SPOKEN LANGUAGE PROCESSING IN PYTHON


What language?
# Create a recognizer class
recognizer = sr.Recognizer()
# Pass the Japanese audio to recognize_google
text = recognizer.recognize_google(japanese_good_morning,
language="ja")
# Print the text
print(text)

おはようございます

SPOKEN LANGUAGE PROCESSING IN PYTHON


Non-speech audio
# Import the leopard roar audio file
leopard_roar = sr.AudioFile("leopard_roar.wav")
# Convert the AudioFile to AudioData
with leopard_roar as source:
    leopard_roar_audio = recognizer.record(source)

# Recognize the AudioData
recognizer.recognize_google(leopard_roar_audio)

UnknownValueError:

SPOKEN LANGUAGE PROCESSING IN PYTHON


Non-speech audio
# Import the leopard roar audio file
leopard_roar = sr.AudioFile("leopard_roar.wav")
# Convert the AudioFile to AudioData
with leopard_roar as source:
    leopard_roar_audio = recognizer.record(source)

# Recognize the AudioData with show_all turned on
recognizer.recognize_google(leopard_roar_audio,
                            show_all=True)

[]

SPOKEN LANGUAGE PROCESSING IN PYTHON


Showing all
# Recognizing Japanese audio with show_all=True
text = recognizer.recognize_google(japanese_good_morning,
language="en-US",
show_all=True)
# Print the text
print(text)

{'alternative': [{'transcript': 'Ohio gozaimasu', 'confidence': 0.89041114},


{'transcript': 'all hail gozaimasu'},
{'transcript': 'ohayo gozaimasu'},
{'transcript': 'olho gozaimasu'},
{'transcript': 'all Hale gozaimasu'}],
'final': True}

SPOKEN LANGUAGE PROCESSING IN PYTHON


Multiple speakers
# Import an audio file with multiple speakers
multiple_speakers = sr.AudioFile("multiple-speakers.wav")
# Convert AudioFile to AudioData
with multiple_speakers as source:
    multiple_speakers_audio = recognizer.record(source)

# Recognize the AudioData
recognizer.recognize_google(multiple_speakers_audio)

one of the limitations of the speech recognition library is that it doesn't


recognise different speakers and voices it will just return it all as one block
of text

SPOKEN LANGUAGE PROCESSING IN PYTHON


Multiple speakers
# Import audio files separately
speakers = [sr.AudioFile("s0.wav"), sr.AudioFile("s1.wav"), sr.AudioFile("s2.wav")]

# Transcribe each speaker individually
for i, speaker in enumerate(speakers):
    with speaker as source:
        speaker_audio = recognizer.record(source)
    print(f"Text from speaker {i}: {recognizer.recognize_google(speaker_audio)}")

Text from speaker 0: one of the limitations of the speech recognition library
Text from speaker 1: is that it doesn't recognise different speakers and voices
Text from speaker 2: it will just return it all as one block a text

SPOKEN LANGUAGE PROCESSING IN PYTHON


Noisy audio
If you have trouble hearing the speech, so will the APIs

# Import audio file with background noise
noisy_support_call = sr.AudioFile("noisy_support_call.wav")
with noisy_support_call as source:
    # Adjust for ambient noise and record
    recognizer.adjust_for_ambient_noise(source,
                                        duration=0.5)
    noisy_support_call_audio = recognizer.record(source)
# Recognize the audio
recognizer.recognize_google(noisy_support_call_audio)

hello ID like to get some help setting up my calories

SPOKEN LANGUAGE PROCESSING IN PYTHON


Let's practice!
SPOKEN LANGUAGE PROCESSING IN PYTHON
Introduction to
PyDub
SPOKEN LANGUAGE PROCESSING IN PYTHON

Daniel Bourke
Machine Learning Engineer/YouTube
Creator
Installing PyDub
$ pip install pydub

If using files other than .wav , install ffmpeg via ffmpeg.org

SPOKEN LANGUAGE PROCESSING IN PYTHON


PyDub's main class, AudioSegment
# Import PyDub main class
from pydub import AudioSegment

# Import an audio file


wav_file = AudioSegment.from_file(file="wav_file.wav", format="wav")

# Format parameter only for readability


wav_file = AudioSegment.from_file(file="wav_file.wav")

type(wav_file)

pydub.audio_segment.AudioSegment

SPOKEN LANGUAGE PROCESSING IN PYTHON


Playing an audio file
# Install simpleaudio for wav playback
$ pip install simpleaudio

# Import play function


from pydub.playback import play

# Import audio file


wav_file = AudioSegment.from_file(file="wav_file.wav")

# Play audio file


play(wav_file)

SPOKEN LANGUAGE PROCESSING IN PYTHON


Audio parameters
# Import audio files
wav_file = AudioSegment.from_file(file="wav_file.wav")
two_speakers = AudioSegment.from_file(file="two_speakers.wav")
# Check number of channels
wav_file.channels, two_speakers.channels

1, 2

wav_file.frame_rate

48000

SPOKEN LANGUAGE PROCESSING IN PYTHON


Audio parameters
# Find the number of bytes per sample
wav_file.sample_width

2

# Find the max amplitude
wav_file.max

8488

SPOKEN LANGUAGE PROCESSING IN PYTHON


Audio parameters
# Duration of audio file in milliseconds
len(wav_file)

3284
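Since PyDub lengths are in milliseconds, converting to seconds is a simple division:

# Duration in seconds
print(len(wav_file) / 1000)

>>> 3.284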

SPOKEN LANGUAGE PROCESSING IN PYTHON


Changing audio parameters
# Change ATTRIBUTENAME of AudioSegment to x
changed_audio_segment = audio_segment.set_ATTRIBUTENAME(x)

# Change sample width to 1
wav_file_width_1 = wav_file.set_sample_width(1)
wav_file_width_1.sample_width

1

SPOKEN LANGUAGE PROCESSING IN PYTHON


Changing audio parameters
# Change sample rate
wav_file_16k = wav_file.set_frame_rate(16000)
wav_file_16k.frame_rate

16000

# Change number of channels
wav_file_1_channel = wav_file.set_channels(1)
wav_file_1_channel.channels

1

SPOKEN LANGUAGE PROCESSING IN PYTHON


Let's practice!
SPOKEN LANGUAGE PROCESSING IN PYTHON
Manipulating audio
files with PyDub
SPOKEN LANGUAGE PROCESSING IN PYTHON

Daniel Bourke
Machine Learning Engineer/YouTube
Creator
Turning it down to 11
# Import audio file
wav_file = AudioSegment.from_file("wav_file.wav")
# Minus 60 dB
quiet_wav_file = wav_file - 60

# Try to recognize quiet audio


recognizer.recognize_google(quiet_wav_file)

UnknownValueError:

SPOKEN LANGUAGE PROCESSING IN PYTHON


Increasing the volume
# Increase the volume by 10 dB
louder_wav_file = wav_file + 10

# Try to recognize
recognizer.recognize_google(louder_wav_file)

this is a wav file

SPOKEN LANGUAGE PROCESSING IN PYTHON


This all sounds the same
# Import AudioSegment and normalize
from pydub import AudioSegment
from pydub.effects import normalize
from pydub.playback import play

# Import uneven sound audio file


loud_quiet = AudioSegment.from_file("loud_quiet.wav")
# Normalize the sound levels
normalized_loud_quiet = normalize(loud_quiet)

# Check the sound


play(normalized_loud_quiet)

SPOKEN LANGUAGE PROCESSING IN PYTHON


Remixing your audio files
# Import audio with static at start
static_at_start = AudioSegment.from_file("static_at_start.wav")

# Remove the static via slicing


no_static_at_start = static_at_start[5000:]

# Check the new sound


play(no_static_at_start)

SPOKEN LANGUAGE PROCESSING IN PYTHON


Remixing your audio files
# Import two audio files
wav_file_1 = AudioSegment.from_file("wav_file_1.wav")
wav_file_2 = AudioSegment.from_file("wav_file_2.wav")

# Combine the two audio files


wav_file_3 = wav_file_1 + wav_file_2

# Check the sound


play(wav_file_3)

# Combine two wav files and make the combination louder


louder_wav_file_3 = wav_file_1 + wav_file_2 + 10

SPOKEN LANGUAGE PROCESSING IN PYTHON


Splitting your audio
# Import phone call audio
phone_call = AudioSegment.from_file("phone_call.wav")
# Find number of channels
phone_call.channels

2

# Split stereo to mono
phone_call_channels = phone_call.split_to_mono()
phone_call_channels

[<pydub.audio_segment.AudioSegment>, <pydub.audio_segment.AudioSegment>]

SPOKEN LANGUAGE PROCESSING IN PYTHON


Splitting your audio
# Find number of channels of first list item
phone_call_channels[0].channels

1

# Recognize the first channel
recognizer.recognize_google(phone_call_channels[0])

the pydub library is really useful

SPOKEN LANGUAGE PROCESSING IN PYTHON


Let's code!
SPOKEN LANGUAGE PROCESSING IN PYTHON
Converting and
saving audio files
with PyDub
SPOKEN LANGUAGE PROCESSING IN PYTHON

Daniel Bourke
Machine Learning Engineer/YouTube
Creator
Exporting audio files
from pydub import AudioSegment

# Import audio file


wav_file = AudioSegment.from_file("wav_file.wav")
# Increase by 10 decibels
louder_wav_file = wav_file + 10
# Export louder audio file
louder_wav_file.export(out_f="louder_wav_file.wav", format="wav")

<_io.BufferedRandom name='louder_wav_file.wav'>

SPOKEN LANGUAGE PROCESSING IN PYTHON


Reformatting and exporting multiple audio files
import os

def make_wav(wrong_folder_path, right_folder_path):
    "Converts .mp3 and .flac files in one folder to .wav files in another."
    # Loop through wrongly formatted files
    for file in os.scandir(wrong_folder_path):
        # Only work with files with audio extensions we're fixing
        if file.path.endswith(".mp3") or file.path.endswith(".flac"):
            # Create the new .wav filename
            out_file = right_folder_path + os.path.splitext(os.path.basename(file.path))[0] + ".wav"
            # Read in the audio file and export it in wav format
            AudioSegment.from_file(file.path).export(out_file, format="wav")
            print(f"Creating {out_file}")

SPOKEN LANGUAGE PROCESSING IN PYTHON


Reformatting and exporting multiple audio files
# Call our new function
make_wav("data/wrong_formats/", "data/right_format/")

Creating data/right_format/wav_file.wav
Creating data/right_format/flac_file.wav
Creating data/right_format/mp3_file.wav

SPOKEN LANGUAGE PROCESSING IN PYTHON


Manipulating and exporting
def make_no_static_louder(static_quiet_folder_path, louder_no_static_folder_path):
    "Removes static from the start of .wav files and makes them louder."
    # Loop through files with static and quiet (already in wav format)
    for file in os.scandir(static_quiet_folder_path):
        # Create new file path
        out_file = louder_no_static_folder_path + os.path.splitext(os.path.basename(file.path))[0] + ".wav"
        # Read the audio file
        audio_file = AudioSegment.from_file(file.path)
        # Remove the first 3.1 seconds, add 10 decibels and export
        audio_file = (audio_file[3100:] + 10).export(out_file, format="wav")
        print(f"Creating {out_file}")

SPOKEN LANGUAGE PROCESSING IN PYTHON


Manipulating and exporting
# Remove static and make louder
make_no_static_louder("data/static_quiet/", "data/louder_no_static/")

Creating data/louder_no_static/speech-recognition-services.wav
Creating data/louder_no_static/order-issue.wav
Creating data/louder_no_static/help-with-acount.wav

SPOKEN LANGUAGE PROCESSING IN PYTHON


Your turn!
SPOKEN LANGUAGE PROCESSING IN PYTHON
Creating
transcription helper
functions
SPOKEN LANGUAGE PROCESSING IN PYTHON

Daniel Bourke
Machine Learning Engineer/YouTube
Creator
Exploring audio files
# Import os module
import os

# Check the folder of audio files


os.listdir("acme_audio_files")

['call_1.mp3',
 'call_2.mp3',
 'call_3.mp3',
 'call_4.mp3']

SPOKEN LANGUAGE PROCESSING IN PYTHON


Preparing for the proof of concept
import speech_recognition as sr
from pydub import AudioSegment
# Import call 1 and convert to .wav
call_1 = AudioSegment.from_file("acme_audio_files/call_1.mp3")
call_1.export("acme_audio_files/call_1.wav", format="wav")
# Transcribe call 1
recognizer = sr.Recognizer()
call_1_file = sr.AudioFile("acme_audio_files/call_1.wav")
with call_1_file as source:
    call_1_audio = recognizer.record(source)
recognizer.recognize_google(call_1_audio)

SPOKEN LANGUAGE PROCESSING IN PYTHON


Functions we'll create
convert_to_wav() converts non- .wav files to .wav files.

show_pydub_stats() shows the audio attributes of a .wav file.

transcribe_audio() uses recognize_google() to transcribe a .wav file.

SPOKEN LANGUAGE PROCESSING IN PYTHON


Creating a file format conversion function
# Create function to convert audio file to wav
def convert_to_wav(filename):
    "Takes an audio file of non .wav format and converts to .wav"
    # Import audio file
    audio = AudioSegment.from_file(filename)
    # Create new filename
    new_filename = filename.split(".")[0] + ".wav"
    # Export file as .wav
    audio.export(new_filename, format="wav")
    print(f"Converting {filename} to {new_filename}...")

SPOKEN LANGUAGE PROCESSING IN PYTHON


Using the file format conversion function
convert_to_wav("acme_studios_audio/call_1.mp3")

Converting acme_audio_files/call_1.mp3 to acme_audio_files/call_1.wav...

SPOKEN LANGUAGE PROCESSING IN PYTHON


Creating an attribute showing function
def show_pydub_stats(filename):
    "Returns different audio attributes related to an audio file."
    # Create AudioSegment instance
    audio_segment = AudioSegment.from_file(filename)
    # Print attributes
    print(f"Channels: {audio_segment.channels}")
    print(f"Sample width: {audio_segment.sample_width}")
    print(f"Frame rate (sample rate): {audio_segment.frame_rate}")
    print(f"Frame width: {audio_segment.frame_width}")
    print(f"Length (ms): {len(audio_segment)}")
    print(f"Frame count: {audio_segment.frame_count()}")

SPOKEN LANGUAGE PROCESSING IN PYTHON


Using the attribute showing function
show_pydub_stats("acme_audio_files/call_1.wav")

Channels: 2
Sample width: 2
Frame rate (sample rate): 32000
Frame width: 4
Length (ms): 54888
Frame count: 1756416.0

SPOKEN LANGUAGE PROCESSING IN PYTHON


Creating a transcribe function
# Create a function to transcribe audio
def transcribe_audio(filename):
    "Takes a .wav format audio file and transcribes it to text."
    # Setup a recognizer instance
    recognizer = sr.Recognizer()

    # Import the audio file and convert to audio data
    audio_file = sr.AudioFile(filename)
    with audio_file as source:
        audio_data = recognizer.record(source)

    # Return the transcribed text
    return recognizer.recognize_google(audio_data)

SPOKEN LANGUAGE PROCESSING IN PYTHON


Using the transcribe function
transcribe_audio("acme_audio_files/call_1.wav")

"hello welcome to Acme studio support line my name is Daniel how can I best help
you hey Daniel this is John I've recently bought a smart from you guys and I know
that's not good to hear John let's let's get your cell number and then we
can we can set up a way to fix it for you one number for 1757 varies how long do
you reckon this is going to take about an hour now while John we're going to try
our best hour I will we get the sealing member will start up this support case
I'm just really really really really I've been trying to contact 34 been put on
hold more than an hour and half so I'm not really happy I kind of wanna get this
issue 6 is fossil"

SPOKEN LANGUAGE PROCESSING IN PYTHON


Let's practice!
SPOKEN LANGUAGE PROCESSING IN PYTHON
Sentiment analysis
on spoken language
text
SPOKEN LANGUAGE PROCESSING IN PYTHON

Daniel Bourke
Machine Learning Engineer/YouTube
Creator
Installing sentiment analysis libraries
$ pip install nltk

# Download required NLTK packages


import nltk
nltk.download("punkt")
nltk.download("vader_lexicon")

SPOKEN LANGUAGE PROCESSING IN PYTHON


Sentiment analysis with VADER
# Import sentiment analysis class
from nltk.sentiment.vader import SentimentIntensityAnalyzer
# Create sentiment analysis instance
sid = SentimentIntensityAnalyzer()
# Test sentiment analysis on negative text
print(sid.polarity_scores("This customer service is terrible."))

{'neg': 0.437, 'neu': 0.563, 'pos': 0.0, 'compound': -0.4767}
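For contrast, a positive sentence scores a positive compound value (a quick sketch; compound ranges from -1, most negative, to +1, most positive):

# Test sentiment analysis on positive text
print(sid.polarity_scores("This customer service is wonderful."))
# Expect pos > 0 and a positive compound score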

SPOKEN LANGUAGE PROCESSING IN PYTHON


Sentiment analysis on transcribed text
# Transcribe customer channel of call_3
call_3_channel_2_text = transcribe_audio("call_3_channel_2.wav")
print(call_3_channel_2_text)

"hey Dave is this any better do I order products are currently on July 1st and I haven't
received the product a three-week step down this parable 6987 5"

# Sentiment analysis on customer channel of call_3


sid.polarity_scores(call_3_channel_2_text)

{'neg': 0.0, 'neu': 0.892, 'pos': 0.108, 'compound': 0.4404}

SPOKEN LANGUAGE PROCESSING IN PYTHON


Sentence by sentence
call_3_paid_api_text = "Okay. Yeah. Hi, Diane. This is paid on this call and obvi...

# Import sent tokenizer


from nltk.tokenize import sent_tokenize
# Find sentiment on each sentence
for sentence in sent_tokenize(call_3_paid_api_text):
    print(sentence)
    print(sid.polarity_scores(sentence))

SPOKEN LANGUAGE PROCESSING IN PYTHON


Sentence by sentence
Okay.
{'neg': 0.0, 'neu': 0.0, 'pos': 1.0, 'compound': 0.2263}
Yeah.
{'neg': 0.0, 'neu': 0.0, 'pos': 1.0, 'compound': 0.296}
Hi, Diane.
{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}
This is paid on this call and obviously the status of my orders at three weeks ago,
and that service is terrible.
{'neg': 0.129, 'neu': 0.871, 'pos': 0.0, 'compound': -0.4767}
Is this any better?
{'neg': 0.0, 'neu': 0.508, 'pos': 0.492, 'compound': 0.4404}
Yes...

SPOKEN LANGUAGE PROCESSING IN PYTHON


Time to code!
SPOKEN LANGUAGE PROCESSING IN PYTHON
Named entity
recognition on
transcribed text
SPOKEN LANGUAGE PROCESSING IN PYTHON

Daniel Bourke
Machine Learning Engineer/YouTube
Creator
Installing spaCy
# Install spaCy
$ pip install spacy

# Download spaCy language model


$ python -m spacy download en_core_web_sm

SPOKEN LANGUAGE PROCESSING IN PYTHON


Using spaCy
import spacy

# Load spaCy language model


nlp = spacy.load("en_core_web_sm")

# Create a spaCy doc


doc = nlp("I'd like to talk about a smartphone I ordered on July 31st from your
Sydney store, my order number is 40939440. I spoke to Georgia about it last week.")

SPOKEN LANGUAGE PROCESSING IN PYTHON


spaCy tokens
# Show different tokens and positions
for token in doc:
    print(token.text, token.idx)

I 0
'd 1
like 4
to 9
talk 12
about 17
a 23
smartphone 25...

SPOKEN LANGUAGE PROCESSING IN PYTHON


spaCy sentences
# Show sentences in doc
for sentence in doc.sents:
    print(sentence)

I'd like to talk about a smartphone I ordered on July 31st from your Sydney store,
my order number is 4093829.
I spoke to one of your customer service team, Georgia, yesterday.

SPOKEN LANGUAGE PROCESSING IN PYTHON


spaCy named entities
Some of spaCy's built-in named entities:

PERSON People, including ctional.

ORG Companies, agencies, institutions, etc.

GPE Countries, cities, states.

PRODUCT Objects, vehicles, foods, etc. (Not services.)

DATE Absolute or relative dates or periods.

TIME Times smaller than a day.

MONEY Monetary values, including unit.

CARDINAL Numerals that do not fall under another type.

SPOKEN LANGUAGE PROCESSING IN PYTHON


spaCy named entities
# Find named entities in doc
for entity in doc.ents:
    print(entity.text, entity.label_)

July 31st DATE


Sydney GPE
4093829 CARDINAL
one CARDINAL
Georgia GPE
yesterday DATE

SPOKEN LANGUAGE PROCESSING IN PYTHON


Custom named entities
# Import EntityRuler class
from spacy.pipeline import EntityRuler

# Check spaCy pipeline


print(nlp.pipeline)

[('tagger', <spacy.pipeline.pipes.Tagger at 0x1c3aa8a470>),


('parser', <spacy.pipeline.pipes.DependencyParser at 0x1c3bb60588>),
('ner', <spacy.pipeline.pipes.EntityRecognizer at 0x1c3bb605e8>)]

SPOKEN LANGUAGE PROCESSING IN PYTHON


Changing the pipeline
# Create EntityRuler instance
ruler = EntityRuler(nlp)

# Add token pattern to ruler


ruler.add_patterns([{"label":"PRODUCT", "pattern": "smartphone"}])

# Add new rule to pipeline before ner


nlp.add_pipe(ruler, before="ner")

# Check updated pipeline


nlp.pipeline

SPOKEN LANGUAGE PROCESSING IN PYTHON


Changing the pipeline
[('tagger', <spacy.pipeline.pipes.Tagger at 0x1c1f9c9b38>),
('parser', <spacy.pipeline.pipes.DependencyParser at 0x1c3c9cba08>),
('entity_ruler', <spacy.pipeline.entityruler.EntityRuler at 0x1c1d834b70>),
('ner', <spacy.pipeline.pipes.EntityRecognizer at 0x1c3c9cba68>)]

SPOKEN LANGUAGE PROCESSING IN PYTHON


Testing the new pipeline
# Test new entity rule
for entity in doc.ents:
    print(entity.text, entity.label_)

smartphone PRODUCT
July 31st DATE
Sydney GPE
4093829 CARDINAL
one CARDINAL
Georgia GPE
yesterday DATE

SPOKEN LANGUAGE PROCESSING IN PYTHON


Let's rocket and
practice spaCy!
SPOKEN LANGUAGE PROCESSING IN PYTHON
Classifying
transcribed speech
with Sklearn
SPOKEN LANGUAGE PROCESSING IN PYTHON

Daniel Bourke
Machine Learning Engineer/YouTube
creator
Inspecting the data
# Inspect post purchase audio folder
import os
post_purchase_audio = os.listdir("post_purchase")
print(post_purchase_audio[:5])

['post-purchase-audio-0.mp3',
'post-purchase-audio-1.mp3',
'post-purchase-audio-2.mp3',
'post-purchase-audio-3.mp3',
'post-purchase-audio-4.mp3']

SPOKEN LANGUAGE PROCESSING IN PYTHON


Converting to wav
# Loop through mp3 files
for file in post_purchase_audio:
    print(f"Converting {file} to .wav...")
    # Use previously made function to convert to .wav
    convert_to_wav(file)

Converting post-purchase-audio-0.mp3 to .wav...


Converting post-purchase-audio-1.mp3 to .wav...
Converting post-purchase-audio-2.mp3 to .wav...
Converting post-purchase-audio-3.mp3 to .wav...
Converting post-purchase-audio-4.mp3 to .wav...

SPOKEN LANGUAGE PROCESSING IN PYTHON


Transcribing all phone call excerpts
# Transcribe text from wav files
def create_text_list(folder):
    text_list = []
    # Loop through folder
    for file in folder:
        # Check for .wav extension
        if file.endswith(".wav"):
            # Transcribe audio
            text = transcribe_audio(file)
            # Add transcribed text to list
            text_list.append(text)
    return text_list

SPOKEN LANGUAGE PROCESSING IN PYTHON


Transcribing all phone call excerpts
# Convert post purchase audio to text
post_purchase_text = create_text_list(post_purchase_audio)
print(post_purchase_text[:5])

['hey man I just water product from you guys and I think is amazing but I leave a li
'these clothes I just bought from you guys too small is there anyway I can change t
"I recently got these pair of shoes but they're too big can I change the size",
"I bought a pair of pants from you guys but they're way too small",
"I bought a pair of pants and they're the wrong colour is there any chance I can ch

SPOKEN LANGUAGE PROCESSING IN PYTHON


Organizing transcribed text
import pandas as pd
# Create post purchase dataframe
post_purchase_df = pd.DataFrame({"label": "post_purchase", "text": post_purchase_text})
# Create pre purchase dataframe
pre_purchase_df = pd.DataFrame({"label": "pre_purchase", "text": pre_purchase_text})

# Combine pre purchase and post purchase


df = pd.concat([post_purchase_df, pre_purchase_df])

# View the combined dataframe


df.head()

SPOKEN LANGUAGE PROCESSING IN PYTHON


Organizing transcribed text
label text
0 post_purchase yeah hello someone this morning delivered a pa...
1 post_purchase my shipment arrived yesterday but it's not the...
2 post_purchase hey my name is Daniel I received my shipment y...
3 post_purchase hey mate how are you doing I'm just calling in...
4 pre_purchase hey I was wondering if you know where my new p...

SPOKEN LANGUAGE PROCESSING IN PYTHON


Building a text classifier
# Import text classification packages
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split

# Split data into train and test sets


X_train, X_test, y_train, y_test = train_test_split(
    df["text"],
    df["label"],
    test_size=0.3)

SPOKEN LANGUAGE PROCESSING IN PYTHON


Naive Bayes Pipeline
# Create text classifier pipeline
text_classifier = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("classifier", MultinomialNB())
])

# Fit the classifier pipeline on the training data


text_classifier.fit(X_train, y_train)

SPOKEN LANGUAGE PROCESSING IN PYTHON


Not so Naive
# Make predictions and compare them to test labels
predictions = text_classifier.predict(X_test)
accuracy = 100 * np.mean(predictions == y_test)
print(f"The model is {accuracy:.2f}% accurate.")

The model is 97.87% accurate.
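Beyond raw accuracy, scikit-learn's classification_report gives per-class precision and recall (a sketch using the predictions above):

from sklearn.metrics import classification_report

# Per-class precision, recall and F1 for the two purchase labels
print(classification_report(y_test, predictions))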

SPOKEN LANGUAGE PROCESSING IN PYTHON


Let's practice!
SPOKEN LANGUAGE PROCESSING IN PYTHON
Congratulations!
SPOKEN LANGUAGE PROCESSING IN PYTHON

Daniel Bourke
Machine Learning Engineer/YouTube
creator
What you've done
1. Converted audio files into soundwaves with Python and NumPy .

2. Transcribed speech with speech_recognition .

3. Prepared and manipulated audio files using PyDub .

4. Built a spoken language processing pipeline with NLTK , spaCy and sklearn .

SPOKEN LANGUAGE PROCESSING IN PYTHON


What next?
Practice your skills with a project of your own.

Check out speech_recognition 's Microphone() class.

SPOKEN LANGUAGE PROCESSING IN PYTHON


One last transcription
one_last_transcription = transcribe_audio("congratulations.wav")

print(one_last_transcription)

Congratulations on finishing the Spoken Language Processing with Python course!


You should be proud.
Now get out there and recognize some speech!

SPOKEN LANGUAGE PROCESSING IN PYTHON


Keep learning!
SPOKEN LANGUAGE PROCESSING IN PYTHON
