Final Summary NLP
Regular expressions
INTRODUCTION TO NATURAL LANGUAGE PROCESSING IN PYTHON
Katharine Jarmul
Founder, kjamistan
What is Natural Language Processing?
Field of study focused on making sense of language
Using statistics and computers
Translation
Sentiment analysis
word_regex = r'\w+'
re.match(word_regex, 'hi there!')
<_sre.SRE_Match object; span=(0, 2), match='hi'>
What is tokenization?
Turning a string or document into tokens (smaller chunks)
Some examples:
Breaking out words or sentences
Separating punctuation
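A minimal sketch of word and sentence tokenization with nltk (an assumption: nltk and its punkt tokenizer data are installed; the example text is hypothetical):
from nltk.tokenize import word_tokenize, sent_tokenize
text = "Hi there! This is a sample sentence."
print(sent_tokenize(text))  # ['Hi there!', 'This is a sample sentence.']
print(word_tokenize(text))  # ['Hi', 'there', '!', 'This', 'is', 'a', 'sample', 'sentence', '.']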
import re
re.match('abc', 'abcde')   # match: 'abc' is found at the start of the string
re.search('abc', 'abcde')  # match: 'abc' is found anywhere in the string
re.match('cd', 'abcde')    # no match: 'cd' is not at the start
re.search('cd', 'abcde')   # match: 'cd' is found at position 2
Regex groups using or "|"
OR is represented using |
import re
match_digits_and_words = r'(\d+|\w+)'
re.findall(match_digits_and_words, 'He has 11 cats.')
['He', 'has', '11', 'cats']
Getting started with matplotlib
Charting library used by many open source Python projects
Bar charts
Line charts
Scatter plots
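A minimal sketch of producing a histogram (assuming matplotlib is installed; the data are hypothetical word lengths):
import matplotlib.pyplot as plt
word_lengths = [1, 5, 5, 6, 6, 6, 8, 9]  # hypothetical data
plt.hist(word_lengths)
plt.show()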
plt.hist() returns the bin counts, the bin edges, and the drawn patches, for example:
(array([ 1., 0., 0., 0., 0., 2., 0., 3., 0., 1.]),
 array([ 1., 1.8, 2.6, 3.4, 4.2, 5., 5.8, 6.6, 7.4, 8.2, 9.]),
 <a list of 10 Patch objects>)
plt.show()
Bag-of-words
Basic method for finding topics in a text
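A minimal sketch of building such word counts with collections.Counter (assuming nltk is installed; the sentence is chosen to match the counts below):
from collections import Counter
from nltk.tokenize import word_tokenize
text = "The cat is in the box. The cat likes the box. The box is over the cat."
counter = Counter(word_tokenize(text))
counter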
"cat": 3, "the": 3
"is": 2
Counter({'.': 3,
'The': 3,
'box': 3,
'cat': 3,
'in': 1,
...
'the': 3})
counter.most_common(2)
[('The', 3), ('cat', 3)]
Why preprocess?
Helps make for better input data when performing machine learning or other statistical methods
Examples:
Tokenization to create a bag of words
Lowercasing words
Lemmatization/Stemming
Shorten words to their root stems
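A minimal sketch combining these preprocessing steps (assuming nltk with its stopwords and wordnet data downloaded; the example text is hypothetical):
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
text = "The cats are running in the boxes."
tokens = [t for t in word_tokenize(text.lower()) if t.isalpha()]  # lowercase, keep only words
no_stops = [t for t in tokens if t not in stopwords.words('english')]
lemmatizer = WordNetLemmatizer()
print([lemmatizer.lemmatize(t) for t in no_stops])  # ['cat', 'running', 'box']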
What is gensim?
Popular open-source NLP library
(Source: http://tlfvincent.github.io/2015/10/23/presidential-speech-topics)
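A minimal sketch of how such a corpus is built with gensim (tokenized_docs, a list of token lists, is an assumed input):
from gensim.corpora.dictionary import Dictionary
dictionary = Dictionary(tokenized_docs)  # maps each token to an integer id
corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]  # (token id, count) pairs per document
corpus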
[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1)],
[(0, 1), (1, 1), (9, 1), (10, 1), (11, 1), (12, 1)],
...]
What is tf-idf?
Term frequency - inverse document frequency
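A minimal sketch of computing such weights with gensim's TfidfModel (corpus is the bag-of-words corpus built above):
from gensim.models.tfidfmodel import TfidfModel
tfidf = TfidfModel(corpus)  # fit tf-idf weights on the whole corpus
tfidf[corpus[1]]            # weights for the second document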
[(0, 0.1746298276735174),
(1, 0.1746298276735174),
(9, 0.29853166221463673),
(10, 0.7716931521027908),
...
]
What is Named Entity Recognition?
NLP task to identify important named entities in the text
People, places, organizations
The Stanford CoreNLP library, integrated into Python via nltk, is Java based
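A minimal sketch producing a tree like the one below with nltk (assuming the required taggers and chunkers are downloaded; the sentence is reconstructed from the tree itself):
import nltk
sentence = "In New York, I like to ride the Metro to visit MOMA and some restaurants rated well by Ruth Reichl."
tagged = nltk.pos_tag(nltk.word_tokenize(sentence))  # part-of-speech tag each token
print(nltk.ne_chunk(tagged))                         # chunk named entities into a tree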
(S
In/IN
(GPE New/NNP York/NNP)
,/,
I/PRP
like/VBP
to/TO
ride/VB
the/DT
(ORGANIZATION Metro/NNP)
to/TO
visit/VB
(ORGANIZATION MOMA/NNP)
and/CC
some/DT
restaurants/NNS
rated/VBN
well/RB
by/IN
(PERSON Ruth/NNP Reichl/NNP)
./.)
What is spaCy?
NLP library similar to gensim, with different implementations
(source: https://demos.explosion.ai/displacy-ent/)
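A minimal sketch (assuming the en_core_web_sm model is installed; the sentence is chosen to match the output below):
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("Berlin is the capital of Germany.")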
The loaded pipeline includes an entity recognizer component:
<spacy.pipeline.EntityRecognizer at 0x7f76b75e68b8>
print(doc.ents[0], doc.ents[0].label_)
Berlin GPE
Quickly growing!
What is polyglot?
NLP library which uses word vectors
Why polyglot?
Vectors for many different languages
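A minimal sketch of extracting entities with polyglot (assuming polyglot and its language models are installed; text is an assumed Spanish-language article):
from polyglot.text import Text
ptext = Text(text)
ptext.entities  # entity chunks tagged I-ORG / I-LOC / I-PER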
[I-ORG(['Generalitat', 'de']),
I-LOC(['Generalitat', 'de', 'Cataluña']),
I-PER(['Carles', 'Puigdemont']),
I-LOC(['Madrid']),
I-PER(['Manuela', 'Carmena']),
I-LOC(['Girona']),
I-LOC(['Madrid'])]
What is supervised learning?
Form of machine learning
Problem has predefined training data
This data has a label (or outcome) you want the model to learn
Predicting movie genre
Dataset consisting of movie plots and corresponding genre
Naive Bayes classifier
Commonly used for testing NLP classification problems
Basis in probability
Examples:
If the plot has a spaceship, how likely is it to be sci-fi?
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics
nb_classifier = MultinomialNB()
nb_classifier.fit(count_train, y_train)
pred = nb_classifier.predict(count_test)
metrics.accuracy_score(y_test, pred)
0.85841849389820424
array([[6410,  563],
       [ 864, 2242]])

            Action   Sci-Fi
Action        6410      563
Sci-Fi         864     2242
Translation
(source: https://twitter.com/Lupintweets/status/865533182455685121)
(source: https://nlp.stanford.edu/projects/socialsent/)
Violeta Misheva
Data Scientist
What is sentiment analysis?
The camera on this phone is great but its battery life is rather disappointing.
Brand monitoring
Customer service
Product analytics
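A minimal sketch of producing the outputs below with pandas (assuming a DataFrame data with label and review columns):
data.label.value_counts()                # raw counts per class
data.label.value_counts() / len(data)    # class proportions
length_reviews = data.review.apply(len)  # length of each review in characters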
0 3782
1 3719
Name: label, dtype: int64
0 0.504199
1 0.495801
Name: label, dtype: float64
type(length_reviews)
pandas.core.series.Series
0     667
1    2982
2     669
3    1087
4     724
....
Levels of granularity
1. Document level
2. Sentence level
3. Aspect level
The camera in this phone is pretty good but the battery life is disappointing.
from textblob import TextBlob
my_valence = TextBlob(text)
my_valence.sentiment
Sentiment(polarity=0.7, subjectivity=0.6000000000000001)
Automated (machine learning) systems rely on having labelled historical data; lexicon-based approaches rely on manually crafted valence scores.
Word cloud example
The more frequent a word is, the BIGGER and bolder it will appear on the word cloud.
two_cities = """It was the best of times, it was the worst of times,
it was the age of wisdom, it was the age of foolishness,
it was the epoch of belief, it was the epoch of incredulity,
it was the season of Light, it was the season of Darkness,
it was the spring of hope, it was the winter of despair,
we had everything before us, we had nothing before us,
we were all going direct to Heaven, we were all going
direct the other way – in short, the period was so far
like the present period, that some of its noisiest
authorities insisted on its being received, for good
or for evil, in the superlative degree of comparison only."""
Background color
Stopwords
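A minimal sketch of generating and displaying the cloud (assuming the wordcloud package is installed):
from wordcloud import WordCloud
import matplotlib.pyplot as plt
# background_color and stopwords are the tunable parameters mentioned above
cloud = WordCloud(background_color='white').generate(two_cities)
plt.imshow(cloud, interpolation='bilinear')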
plt.axis('off')
plt.show()
What is a bag-of-words (BOW)?
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer(max_features=1000)  # keep the 1000 most frequent words
vect.fit(data.review)
X = vect.transform(data.review)
Context matters
I am happy, not sad.
Putting 'not' in front of a word (negation) is one example of how context matters.
# Only unigrams
ngram_range=(1, 1)
max_features: if specified, only the most frequent words will be included in the vocabulary; see the sketch below
If max_features=None, all words will be included
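A minimal sketch combining both arguments (the values are illustrative):
from sklearn.feature_extraction.text import CountVectorizer
# unigrams and bigrams, capped at the 1000 most frequent terms
vect = CountVectorizer(ngram_range=(1, 2), max_features=1000)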
Goal of the video
Goal: enrich the existing dataset with features related to the text column (capturing the sentiment)
from nltk import word_tokenize
anna_k = 'Happy families are all alike, every unhappy family is unhappy in its own way.'
word_tokenize(anna_k)
['Happy', 'families', 'are', 'all', 'alike', ',',
 'every', 'unhappy', 'family', 'is', 'unhappy', 'in',
 'its', 'own', 'way', '.']
word_tokens = [word_tokenize(review) for review in reviews]  # tokenize every review
type(word_tokens)
list
type(word_tokens[0])
list
Language of a string in Python
from langdetect import detect_langs
foreign = 'Este libro ha sido uno de los mejores libros que he leido.'
detect_langs(foreign)
[es:0.9999945352697024]
reviews.head()
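A minimal sketch of detecting the language of every review (an assumption: the review text sits in the second column of reviews):
languages = []
for row in range(len(reviews)):
    languages.append(detect_langs(reviews.iloc[row, 1]))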
languages
[it:0.9999982541301151],
[es:0.9999954153640488],
[es:0.7142833997345875, en:0.2857160465706441],
[es:0.9999942365605781],
[es:0.999997956049055] ...
str(languages[0]).split(':')[0]
'[es'
str(languages[0]).split(':')[0][1:]
'es'
reviews['language'] = languages
What are stop words and how to find them?
Stop words: words that occur too frequently and are not considered informative
{'the', 'a', 'an', 'and', 'but', 'for', 'on', 'in', 'at' ...}
Context matters
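A minimal sketch of building a custom stop word set (the extra words are illustrative):
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
my_stop_words = ENGLISH_STOP_WORDS.union(['film', 'movie'])  # extend the default set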
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer(stop_words=my_stop_words)
vect.fit(movies.review)
X = vect.transform(movies.review)
String operators and comparisons
# Checks if a string is composed only of letters
my_string.isalpha()
len(word_tokens[0])     # number of tokens in the first review
87
len(cleaned_tokens[0])  # after keeping only alphabetic tokens
78
my_string = '#Wonderfulday'
# Match '#' followed by a single letter, small or capital
x = re.search('#[A-Za-z]', my_string)
x
<re.Match object; span=(0, 2), match='#W'>
What is stemming?
Stemming is the process of transforming words to their root forms, even if the stem itself is not a valid word in the language.
Stemming: fast and efficient to compute. Lemmatization: slower than stemming and can depend on the part of speech.
from nltk.stem import PorterStemmer, SnowballStemmer, WordNetLemmatizer
porter = PorterStemmer()
porter.stem('wonderful')
'wonder'
DutchStemmer = SnowballStemmer("dutch")
DutchStemmer.stem("beginnen")
'begin'
WNlemmatizer = WordNetLemmatizer()
WNlemmatizer.lemmatize('wonderful', pos='a')
'wonderful'
What are the components of TfIdf?
TF: term frequency: how often a given word appears within a document in the corpus
IDF: inverse document frequency: log-ratio between the total number of documents and the number of documents that contain a specific word
It boosts the weight of words that occur in only a few documents
TfIdf is likely to capture words that are common within a document but not across documents.
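In formula form (one common variant): tfidf(t, d) = tf(t, d) * log(N / df(t)), where N is the total number of documents and df(t) is the number of documents containing term t.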
More on TfIdf
Since it penalizes frequent words, there is less need to deal with stop words explicitly.
Quite useful in search queries and information retrieval to rank the relevance of returned results.
from sklearn.feature_extraction.text import TfidfVectorizer
vect = TfidfVectorizer(max_features=100).fit(tweets.text)
X = vect.transform(tweets.text)
Classification problems
Product and movie reviews: positive or negative sentiment (binary classification)
Tweets about airline companies: positive, neutral and negative (multi-class classification)
P(sentiment = positive | review)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
log_reg = LogisticRegression().fit(X, y)
log_reg.score(X, y)  # mean accuracy on the given data
0.9009
y_predicted = log_reg.predict(X)
accuracy = accuracy_score(y, y_predicted)
0.9009
Train/test split
Training set: used to train the model (70-80% of the whole data)
X : features
y : labels
stratify: the proportion of classes in the produced sample will match the proportion of values provided to this parameter
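A minimal sketch of the split (test_size and random_state values are illustrative):
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)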
Accuracy on the train set: 0.76
Accuracy on the test set: 0.73
y_predicted = log_reg.predict(X_test)
print('Accuracy score on test data: ', accuracy_score(y_test, y_predicted))
0.73
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test, y_predicted)/len(y_test))
[[0.3788 0.1224]
[0.1352 0.3636]]
Complex models and regularization
Complex models:
A model that is too complex captures the noise in the data (overfitting)
Regularization:
A way to simplify and ensure we have a less complex model
# Regularization arguments
LogisticRegression(penalty='l2', C=1.0)
# Predict labels
y_predicted = log_reg.predict(X_test)
# Predict probability
y_probab = log_reg.predict_proba(X_test)
The Sentiment Analysis problem
Sentiment analysis is the process of understanding the opinion of an author about a subject
Movie reviews
Word clouds
TfIdf vectorization
# Vectorizer syntax
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer().fit(data.text_column)
X = vect.transform(data.text_column)
The Sentiment Analysis world
Azadeh Mobasher
Principal Data Scientist
Natural Language Processing (NLP)
Named entity recognition (NER): locating and classifying named entities mentioned in unstructured text into pre-defined categories
As the first step, spaCy can be installed using the Python package manager pip:
$ python3 -m pip install spacy
import spacy
nlp = spacy.load("en_core_web_sm")
text = "A spaCy pipeline object is created."
doc = nlp(text)
Tokenization
A Token is defined as the smallest meaningful part of the text.
spaCy NLP pipeline
Import spaCy and use spacy.load() to return nlp, a Language class. The Language object is the text processing pipeline.
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("Here's my spaCy pipeline.")
Name              Description
Doc               A container for accessing linguistic annotations of text
DependencyParser  A pipeline component for syntactic dependency parsing
Sentencizer       A pipeline component for rule-based sentence boundary detection
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("We are seeing her after one year.")
print([(token.text, token.lemma_) for token in doc])
POS tagging
Categorizing words grammatically, based on function and context within a sentence
Named entity recognition: spaCy models extract named entities using the NER pipeline component
import spacy
nlp = spacy.load("en_core_web_sm")
text = "Albert Einstein was a genius."
doc = nlp(text)
print([(ent.text, ent.start_char,
ent.end_char, ent.label_) for ent in doc.ents])
import spacy
nlp = spacy.load("en_core_web_sm")
text = "Albert Einstein was a genius."
doc = nlp(text)
print([(token.text, token.ent_type_) for token in doc])
POS tagging
POS tags depend on the context, surrounding words and their tags
import spacy
nlp = spacy.load("en_core_web_sm")
text = "My cat will fish for a fish tomorrow in a fishy way."
# 'fish' is tagged as a VERB in 'will fish' and as a NOUN in 'a fish'
print([(token.text, token.pos_, spacy.explain(token.pos_))
       for token in nlp(text)])
POS tagging gives better accuracy for many NLP tasks, for example in translation systems
Word-sense disambiguation (WSD) is the problem of deciding in which sense a word is used in a sentence.
Determining the sense of the word can be crucial in machine translation, etc.
Dependency parsing results in a tree
A dependency label describes the type of syntactic relation between two tokens
spacy.displacy.serve(doc, style="dep")
Word vectors (embeddings)
import spacy
nlp = spacy.load("en_core_web_md")
print(nlp.meta["vectors"])  # metadata about the model's word vectors (e.g. count and width)
import spacy
nlp = spacy.load("en_core_web_md")
like_id = nlp.vocab.strings["like"]   # hash id for the string "like"
print(like_id)
>>> 18194338103975822726
print(nlp.vocab.vectors[like_id])     # the word vector stored for that id
Word vectors visualization
Word vectors allow us to understand how words are grouped; Principal Component Analysis (PCA) can project word vectors into a two-dimensional space for visualization.
Extract word vectors for a given list of words and stack them vertically.
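A minimal sketch of that extraction step (words, a list of strings, is an assumed input):
import numpy as np
word_vectors = np.vstack(
    [nlp.vocab.vectors[nlp.vocab.strings[w]] for w in words])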
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
pca = PCA(n_components=2)
word_vectors_transformed = pca.fit_transform(word_vectors)
plt.figure(figsize=(10, 8))
plt.scatter(word_vectors_transformed[:, 0], word_vectors_transformed[:, 1])
for word, coord in zip(words, word_vectors_transformed):
    x, y = coord
    plt.text(x, y, word, size=10)
plt.show()
import numpy as np
import spacy
nlp = spacy.load("en_core_web_md")
word = "covid"
# find the 5 vectors most similar to the vector of "covid"
most_similar_words = nlp.vocab.vectors.most_similar(
    np.asarray([nlp.vocab.vectors[nlp.vocab.strings[word]]]), n=5)
The semantic similarity method
nlp = spacy.load("en_core_web_md")
doc1 = nlp("We eat pizza")
doc2 = nlp("We like to eat pasta")
token1 = doc1[2]  # "pizza"
token2 = doc2[4]  # "pasta"
print(f"Similarity between {token1} and {token2} = ", round(token1.similarity(token2), 3))
span1 = doc1[1:]
span2 = doc2[1:]
print(f"Similarity between \"{span1}\" and \"{span2}\" = ",
round(span1.similarity(span2), 3))
>>> Similarity between "eat pizza" and "like to eat pasta" = 0.588
nlp = spacy.load("en_core_web_md")
keyword = nlp("price")
# sentences is a Doc object; .sents iterates over its sentences
for i, sentence in enumerate(sentences.sents):
    print(f"Similarity score with sentence {i+1}: ", round(sentence.similarity(keyword), 5))
spaCy pipelines
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp(example_text)
blank_nlp = spacy.blank("en")      # a blank English pipeline
blank_nlp.add_pipe("sentencizer")  # add rule-based sentence segmentation
import time
start_time = time.time()
doc = blank_nlp(text)
print(f"Finished processing with blank model in "
      f"{round((time.time() - start_time)/60.0, 5)} minutes")
Setting pretty to True will print a table instead of only returning the structured data.
import spacy
nlp = spacy.load("en_core_web_sm")
analysis = nlp.analyze_pipes(pretty=True)
spaCy EntityRuler
Entity patterns can be phrase patterns (a string) or token patterns (a list with one dictionary per token):
nlp = spacy.blank("en")
entity_ruler = nlp.add_pipe("entity_ruler")
patterns = [{"label": "ORG", "pattern": "Microsoft"},
{"label": "GPE", "pattern": [{"LOWER": "san"}, {"LOWER": "francisco"}]}]
entity_ruler.add_patterns(patterns)
nlp = spacy.load("en_core_web_sm")
ruler = nlp.add_pipe("entity_ruler", after='ner')
patterns = [{"label": "ORG", "pattern": [{"LOWER": "manhattan"}, {"LOWER": "associates"}]}]
ruler.add_patterns(patterns)
nlp = spacy.load("en_core_web_sm")
ruler = nlp.add_pipe("entity_ruler", before='ner')
patterns = [{"label": "ORG", "pattern": [{"LOWER": "manhattan"}, {"LOWER": "associates"}]}]
ruler.add_patterns(patterns)
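A minimal sketch of testing the ruler (the sentence is illustrative):
doc = nlp("Manhattan Associates is a company.")
print([(ent.text, ent.label_) for ent in doc.ents])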
What is RegEx?
Runs fast
import re
pattern = r"((\d){3}-(\d){3}-(\d){4})"
text = "Our phone number is 832-123-5555 and their phone number is 425-123-4567."
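A minimal sketch of applying the pattern with re.finditer():
for match in re.finditer(pattern, text):
    print(match.group())  # prints each phone number found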
nlp = spacy.blank("en")
patterns = [{"label": "PHONE_NUMBER", "pattern": [{"SHAPE": "ddd"},
{"ORTH": "-"}, {"SHAPE": "ddd"},
{"ORTH": "-"}, {"SHAPE": "dddd"}]}]
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns(patterns)
doc = nlp(text)
print([(ent.text, ent.label_) for ent in doc.ents])
Matcher in spaCy
import spacy
from spacy.matcher import Matcher
nlp = spacy.load("en_core_web_sm")
doc = nlp("Good morning, this is our first day on campus.")
matcher = Matcher(nlp.vocab)
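A minimal sketch of adding a pattern and running the matcher (the pattern mirrors the recap at the end of this document):
pattern = [{"LOWER": "good"}, {"LOWER": "morning"}]
matcher.add("morning_greeting", [pattern])
for match_id, start, end in matcher(doc):
    print("Matched span:", doc[start:end])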
Matching output includes the start and end token indices of the matched pattern.
Extended patterns support comparison operators (==, >=, <=, >, <) on int and float values for equality or inequality checks.
Why train spaCy models?
Pre-trained spaCy models go a long way for general NLP use cases
But they may not have seen domain-specific data during their training, e.g.
Twitter data
Medical data
Does our domain include many labels that are absent in spaCy models?
import spacy
nlp = spacy.load("en_core_web_sm")
Training steps
Training is iterative: the final step, "7. Go back to step 3.", repeats the predict-compare-update loop over the training examples.
annotated_data = {
    "sentence": "Antiviral drugs used against influenza include neuraminidase inhibitors.",
    "entities": {
        "label": "Medicine",
        "value": "neuraminidase inhibitors",
    }
}
annotated_data = {
"sentence": "Bill Gates visited the SFO Airport.",
"entities": [{"label": "PERSON", "value": "Bill Gates"},
{"label": "LOC", "value": "SFO Airport"}]
}
training_data = [
("I will visit you in Austin.", {"entities": [(20, 26, "GPE")]}),
("I'm going to Sam's house.", {"entities": [(13,18, "PERSON"), (19, 24, "GPE")]}),
("I will go.", {"entities": []})
]
Each pair's second element is a dictionary listing the annotated entities as (start character, end character, label) tuples
import spacy
from spacy.training import Example
nlp = spacy.load("en_core_web_sm")
Training steps
# Disable other pipeline components while training the NER component
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "ner"]
nlp.disable_pipes(*other_pipes)
import random
optimizer = nlp.create_optimizer()
losses = {}
for i in range(epochs):
    random.shuffle(training_data)
    for text, annotation in training_data:
        doc = nlp.make_doc(text)
        example = Example.from_dict(doc, annotation)
        nlp.update([example], sgd=optimizer, losses=losses)
# Save the trained NER component to disk
ner = nlp.get_pipe("ner")
ner.to_disk("<ner model name>")
# Load a saved NER component back into a pipeline (spaCy 3 style)
ner = nlp.add_pipe("ner", name="<ner model name>")
ner.from_disk("<ner model name>")
Apply NER model and store tuples of (entity text, entity label):
doc = nlp(text)
entities = [(ent.text, ent.label_) for ent in doc.ents]
Chapter 1 - Introduction to NLP and spaCy
matcher = Matcher(nlp.vocab)
pattern = [{"LOWER": "good"}, {"LOWER": {"IN": ["morning", "evening"]}}]
matcher.add("morning_greeting", [pattern])
Daniel Bourke
Machine Learning Engineer / YouTube Creator
Dealing with audio files in Python
Different kinds of audio files:
mp3
wav
m4a
flac
import wave
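A minimal sketch of producing the byte output below (assuming a file good_morning.wav):
good_morning = wave.open("good_morning.wav", "r")
soundwave_gm = good_morning.readframes(-1)  # read all frames as raw bytes
soundwave_gm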
b'\xfd\xff\xfb\xff\xf8\xff\xf8\xff\xf7\...
Converting bytes to integers
Raw bytes can't be used for analysis directly, so convert them to integers first
import numpy as np
# Convert soundwave_gm from bytes to integers
signal_gm = np.frombuffer(soundwave_gm, dtype='int16')
# Show the first 10 items
signal_gm[:10]
array([ -3, -5, -8, -8, -9, -13, -8, -10, -9, -11], dtype=int16)
Frame rate: 48,000 samples per second
np.linspace() can be used to generate an evenly spaced timestamp for each sample, e.g.:
array([ 1., 2., 3., 4., 5., 6., 7., 8., 9., 10.])
Adding another sound wave
New audio file: good_afternoon.wav
Why the SpeechRecognition library?
Some existing Python libraries:
CMU Sphinx
Kaldi
SpeechRecognition
recognize_google()
recognize_google_cloud()
recognize_wit()
Input: audio_file
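A minimal sketch of the call (assuming audio_data is an AudioData instance, e.g. produced by recognizer.record()):
import speech_recognition as sr
recognizer = sr.Recognizer()
text = recognizer.recognize_google(audio_data, language="en-US")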
The AudioFile class
import speech_recognition as sr
# Setup recognizer instance
recognizer = sr.Recognizer()
# Read in audio file
clean_support_call = sr.AudioFile("clean-support-call.wav")
# Check type of clean_support_call
type(clean_support_call)
<class 'speech_recognition.AudioFile'>
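Recognizers work on AudioData, which record() produces from an AudioFile (a minimal sketch):
with clean_support_call as source:
    clean_support_call_audio = recognizer.record(source)
type(clean_support_call_audio)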
<class 'speech_recognition.AudioData'>
What language?
# Create a recognizer class
recognizer = sr.Recognizer()
# Pass the Japanese audio to recognize_google with English set as the language
text = recognizer.recognize_google(japanese_good_morning,
                                   language="en-US")
# Print the text
print(text)
Ohio gozaimasu
With language="ja" the transcription is returned in Japanese instead: おはようございます
Audio the API cannot transcribe (e.g. non-speech sounds) raises UnknownValueError, or returns an empty list when called with show_all=True:
UnknownValueError:
[]
Text from speaker 0: one of the limitations of the speech recognition library
Text from speaker 1: is that it doesn't recognise different speakers and voices
Text from speaker 2: it will just return it all as one block a text
Installing PyDub
$ pip install pydub
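A minimal sketch of loading an audio file with PyDub (the filename is illustrative):
from pydub import AudioSegment
wav_file = AudioSegment.from_file("wav_file.wav")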
type(wav_file)
pydub.audio_segment.AudioSegment
wav_file.channels can be 1 (mono) or 2 (stereo)
wav_file.frame_rate
48000
Other attributes (e.g. sample_width, frame_width, max) return similar numeric metadata about the audio.
Turning it down to 11
# Import audio file
wav_file = AudioSegment.from_file("wav_file.wav")
# Minus 60 dB
quiet_wav_file = wav_file - 60
# Trying to recognize very quiet audio often fails
UnknownValueError:
# Try to recognize a louder version instead
recognizer.recognize_google(louder_wav_file)
[<pydub.audio_segment.AudioSegment object>, <pydub.audio_segment.AudioSegment object>]
Exporting audio files
from pydub import AudioSegment
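A minimal sketch of the export step that produces the file handle below:
louder_wav_file.export("louder_wav_file.wav", format="wav")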
<_io.BufferedRandom name='louder_wav_file.wav'>
# Inside a conversion loop, report each exported file
print(f"Creating {out_file}")
Creating data/right_types/wav_file.wav
Creating data/right_types/flac_file.wav
Creating data/right_types/mp3_file.wav
Creating data/louder_no_static/speech-recognition-services.wav
Creating data/louder_no_static/order-issue.wav
Creating data/louder_no_static/help-with-acount.wav
Exploring audio files
# Import os module
import os
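A minimal sketch of listing the call recordings (the folder name is an assumption):
call_audio_files = os.listdir("call_audio")
call_audio_files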
['call_1.mp3',
 'call_2.mp3',
 'call_3.mp3',
 'call_4.mp3']
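A minimal sketch of printing these attributes with PyDub (the filename is an assumption):
from pydub import AudioSegment
call_1 = AudioSegment.from_file("call_1.mp3")
print(f"Channels: {call_1.channels}")
print(f"Sample width: {call_1.sample_width}")
print(f"Frame rate (sample rate): {call_1.frame_rate}")
print(f"Frame width: {call_1.frame_width}")
print(f"Length (ms): {len(call_1)}")
print(f"Frame count: {call_1.frame_count()}")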
Channels: 2
Sample width: 2
Frame rate (sample rate): 32000
Frame width: 4
Length (ms): 54888
Frame count: 1756416.0
"hello welcome to Acme studio support line my name is Daniel how can I best help
you hey Daniel this is John I've recently bought a smart from you guys and I know
that's not good to hear John let's let's get your cell number and then we
can we can set up a way to fix it for you one number for 1757 varies how long do
you reckon this is going to take about an hour now while John we're going to try
our best hour I will we get the sealing member will start up this support case
I'm just really really really really I've been trying to contact 34 been put on
hold more than an hour and half so I'm not really happy I kind of wanna get this
issue 6 is fossil"
Installing sentiment analysis libraries
$ pip install nltk
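A minimal sketch of scoring sentiment with nltk's VADER (assuming the vader_lexicon resource has been downloaded; the sentence is illustrative):
from nltk.sentiment.vader import SentimentIntensityAnalyzer
sid = SentimentIntensityAnalyzer()
sid.polarity_scores("I haven't received the product and I'm not happy.")  # dict of neg/neu/pos/compound scores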
"hey Dave is this any better do I order products are currently on July 1st and I haven't
received the product a three-week step down this parable 6987 5"
Installing spaCy
# Install spaCy
$ pip install spacy
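A minimal sketch producing the token/offset output below (assuming text holds the transcription shown further down):
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
for token in doc:
    print(token.text, token.idx)  # token and its starting character index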
I 0
'd 1
like 4
to 9
talk 12
about 17
a 23
smartphone 25...
I'd like to talk about a smartphone I ordered on July 31st from your Sydney store,
my order number is 4093829.
I spoke to one of your customer service team, Georgia, yesterday.
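The entity output below can be produced by iterating over doc.ents (a minimal sketch):
for entity in doc.ents:
    print(entity.text, entity.label_)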
smartphone PRODUCT
July 31st DATE
Sydney GPE
4093829 CARDINAL
one CARDINAL
Georgia GPE
yesterday DATE
Inspecting the data
# Inspect post purchase audio folder
import os
post_purchase_audio = os.listdir("post_purchase")
print(post_purchase_audio[:5])
['post-purchase-audio-0.mp3',
'post-purchase-audio-1.mp3',
'post-purchase-audio-2.mp3',
'post-purchase-audio-3.mp3',
'post-purchase-audio-4.mp3']
['hey man I just water product from you guys and I think is amazing but I leave a li
'these clothes I just bought from you guys too small is there anyway I can change t
"I recently got these pair of shoes but they're too big can I change the size",
"I bought a pair of pants from you guys but they're way too small",
"I bought a pair of pants and they're the wrong colour is there any chance I can ch
What you've done
1. Converted audio files into soundwaves with Python and NumPy.
2. Transcribed speech to text with the SpeechRecognition library.
3. Manipulated and improved audio files with PyDub.
4. Built a spoken language processing pipeline with NLTK, spaCy and sklearn.
print(one_last_transcription)