Data Science with Python
Natural Language Processing (NLP) with
SciKit Learn
Learning Objectives
By the end of this lesson, you will be able to:
Define natural language processing
Explain the importance of natural language processing
List the applications of natural language processing
Outline the modules to load content and category
Apply feature extraction techniques
Implement the approaches of natural language processing
Introduction to Natural Language
Processing
Natural Language Processing (NLP)
Natural language processing is an automated way to understand and analyze natural human languages and extract
information from such data by applying machine learning algorithms.
Data from various sources → natural language processing (machine algorithms and translations: mathematics and statistics) → analyze human languages → extract information
It is also referred to as the field of computer science or AI that extracts linguistic information from the underlying data.
Why Natural Language Processing
The world is now globally connected due to advances in technology and devices. NLP helps with:
• Analyzing tons of data
• Identifying various languages
• Applying quantitative analysis
• Handling ambiguities
Why Natural Language Processing
NLP can achieve full automation by using modern software libraries, modules, and packages.
• Full automation
• Intelligent processing
• Knowledge about languages and the world
• Modern software libraries
• Machine models
NLP Terminology
• Tokenization: splits text data into words, phrases, and idioms
• Word boundaries: determines where one word ends and the other begins
• Stemming: maps a word to its valid root word
• Topic models: discover topics in a collection of documents
• Disambiguation: determines the meaning and sense of words (context vs. intent)
• Tf-idf: represents term frequency and inverse document frequency
• Semantic analytics: compares words, phrases, and idioms in a set of documents to extract meaning
NLP Approach for Text Data
Let us look at the Natural Language Processing approaches to analyze text data.
• Conduct basic text processing
• Categorize and tag words
• Analyze sentence structure
• Build feature-based structure
• Classify text
• Extract information
• Analyze the meaning
NLP Environmental Setup
Problem Statement: Demonstrate the installation of the NLP environment
Access: Click on the Practice Labs tab on the left side panel of the LMS. Copy or note the
username and password that is generated. Click on the Launch Lab button. On the page that
appears, enter the username and password in the respective fields, and click Login.
Sentence Analysis
Problem Statement: Demonstrate how to perform sentence analysis
Access: Click on the Practice Labs tab on the left side panel of the LMS. Copy or note the
username and password that is generated. Click on the Launch Lab button. On the page that
appears, enter the username and password in the respective fields, and click Login.
Applications of NLP
Applications of NLP
Machine Translation: Machine translation is used to translate one language into another. Google Translate is an example. It uses NLP to translate the input data from one language to another.
Applications of NLP
Speech Recognition: The speech recognition application understands human speech and uses it as input information. It is useful for applications like Siri, Google Now, and Microsoft Cortana.
Applications of NLP
Sentiment Analysis: Sentiment analysis is achieved by processing tons of data received from different interfaces and sources. For example, NLP uses all social media activities to find out the most popular or important topics of discussion.
Major NLP Libraries
NLTK
Scikit-learn
TextBlob
spaCy
The Scikit-Learn Approach
The Scikit-Learn Approach
It is a very powerful library with a set of modules to process and analyze natural language data, such as text and
images, and extract information using machine learning algorithms.
• Built-in module: contains built-in modules to load the dataset’s content and categories.
• Feature extraction: a way to extract information from data, which can be text or images.
• Model training: analyzes the content based on particular categories and then trains it according to a specific model.
The Scikit-Learn Approach
It is a very powerful library with a set of modules to process and analyze natural language data, such as texts and
images, and extract information using machine learning algorithms.
• Pipeline building mechanism: a technique to streamline the NLP learning process into stages.
• Stages of the pipeline:
  1. Vectorization
  2. Transformation
  3. Model training and application
The Scikit-Learn Approach
It is a very powerful library with a set of modules to process and analyze natural language data, such as texts and
images, and extract information using machine learning algorithms.
• Pipeline building mechanism: a technique in the Scikit-learn approach to streamline the NLP process into stages.
• Performance optimization: in this stage, we train the models to optimize the overall process.
• Grid search for finding good parameters: a powerful way to search the parameters affecting the outcome, for model training purposes.
Modules to Load Content and Category
Modules to Load Content and Category
Scikit-learn has many built-in datasets. There are several methods to load these datasets with the help of a data
load object.
Container folder (Category 1, Category 2) → data load object
Modules to Load Content and Category
The text files are loaded with categories as subfolder names.
Container folder (Category 1, Category 2) → extract features → NumPy array / SciPy matrix
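As a minimal sketch of this layout, scikit-learn's load_files reads a container folder whose subfolders name the categories; the folder path below is a hypothetical placeholder.

# Minimal sketch: load text files whose categories are the subfolder names.
# "container_folder" is a hypothetical path with one subfolder per category.
from sklearn.datasets import load_files

dataset = load_files("container_folder", encoding="utf-8", decode_error="replace")

print(dataset.target_names)  # subfolder names become the category names
print(len(dataset.data))     # raw documents, later turned into NumPy/SciPy feature arrays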
Modules to Load Content and Category
In [ ]: # Build a feature extraction transformer
from sklearn.feature_extraction.text import <appropriate transformer>
Modules to Load Content and Category
The attributes of a data load object are:
• Bunch: contains fields and can be accessed as dict keys or as an object
• Target names: holds the list of requested categories
• Data: refers to an attribute in memory
Modules to Load Content and Category
The example shows how a dataset can be loaded using Scikit-learn:
Import the dataset
Load dataset
Describe the dataset
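For illustration (a sketch, not the lab's exact dataset), the 20 Newsgroups corpus can stand in as a loadable dataset; fetch_20newsgroups downloads it on first use.

from sklearn.datasets import fetch_20newsgroups

# Import and load the dataset (two categories kept for brevity)
twenty_train = fetch_20newsgroups(subset="train", categories=["sci.space", "rec.autos"])

# Describe the dataset
print(twenty_train.DESCR[:300])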
Modules to Load Content and Category
Let us see how the type() function and the .data and .target attributes help in analyzing a dataset.
View type of dataset
View data
View target
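Continuing the same assumption (a Bunch object returned by a scikit-learn loader), a short inspection sketch:

from sklearn.datasets import fetch_20newsgroups

twenty_train = fetch_20newsgroups(subset="train")

print(type(twenty_train))             # view type of dataset: sklearn.utils.Bunch
print(twenty_train.data[0][:200])     # view data: raw text of the first document
print(twenty_train.target[:10])       # view target: integer category labels
print(twenty_train.target_names[:5])  # names of the requested categories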
Feature Extraction
Feature extraction is a technique to convert content into numerical vectors in order to perform machine learning.
• Text feature extraction (for example: large datasets or documents)
• Image feature extraction (for example: patch extraction, hierarchical clustering)
Bag of Words
Bag of Words
Bag of words is used to convert text data into numerical feature vectors with a fixed size.
• Tokenizing: assign a fixed integer id to each word
• Counting: count the number of occurrences of each word
• Storing: store the count value as the feature

Corpus of documents (token count matrix):
              Token 1   Token 2   Token 3   Token 4
Document 1         42        32       119         3
Document 2       1118         0         0        89
Document 3          0         0         0        55
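A minimal bag-of-words sketch with CountVectorizer on a made-up corpus; tokens get fixed integer ids and their occurrence counts become the features:

from sklearn.feature_extraction.text import CountVectorizer

corpus = ["the cat sat on the mat",
          "the dog sat on the log",
          "cats and dogs"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)        # sparse document-term count matrix

print(vectorizer.get_feature_names_out())   # tokens in column order (their fixed integer ids)
print(X.toarray())                          # rows = documents, columns = token counts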
CountVectorizer Class Signature
class sklearn.feature_extraction.text.CountVectorizer(input='content', encoding='utf-8', decode_error='strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, stop_words=None, token_pattern='(?u)\b\w\w+\b', ngram_range=(1, 1), analyzer='word', max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False, dtype=<class 'numpy.int64'>)

• input: file name or sequence of strings
• encoding: encoding used to decode the input
• strip_accents: removes accents
• tokenizer: overrides the string tokenizer
• stop_words: built-in stop words list
• max_df / min_df: maximum and minimum document-frequency thresholds
• max_features: specifies the number of components (features) to keep
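A short usage sketch exercising a few of these parameters; the values below are illustrative choices, not recommendations:

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(stop_words="english",   # drop built-in English stop words
                             ngram_range=(1, 2),     # unigrams and bigrams
                             min_df=2)               # keep terms seen in at least 2 documents

docs = ["Machine learning with text", "Text mining and machine learning"]
X = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())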
Bag of Words
Problem Statement: Demonstrate the Bag of Words technique
Access: Click on the Practice Labs tab on the left side panel of the LMS. Copy or note the
username and password that is generated. Click on the Launch Lab button. On the page that
appears, enter the username and password in the respective fields, and click Login.
Text Feature Extraction Considerations
Text Feature Extraction Considerations
• Sparse: this utility deals with sparse matrices while storing them in memory. Sparse data is commonly noticed when extracting feature values, especially for large document datasets.
• Vectorizer: implements tokenization and occurrence counting. Words with a minimum of two letters get tokenized. We can use the analyzer function to vectorize the text data.
• Tf-idf: a term weighting utility for term frequency and inverse document frequency. Term frequency indicates how often a particular term occurs in a document. Inverse document frequency is a factor that diminishes the weight of terms that occur frequently across documents.
• Decoding: this utility can decode text files if their encoding is specified.
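A small sketch tying these points together: counts from CountVectorizer are re-weighted with TfidfTransformer, and the result stays a sparse matrix (the corpus is invented):

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

corpus = ["the cat sat on the mat",
          "the dog sat on the log",
          "the bird flew over the log"]

counts = CountVectorizer().fit_transform(corpus)    # tokenization + occurrence counts (sparse)
tfidf = TfidfTransformer().fit_transform(counts)    # tf-idf weighting, still sparse

print(tfidf.shape)   # (documents, terms)
print(tfidf.nnz)     # only the non-zero entries are stored in memory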
Model Training
An important task in model training is to identify the right model for the given dataset. The choice of model
completely depends on the type of dataset.
• Supervised: models predict the outcome of new observations and datasets, and classify documents based on the features and response of a given dataset. Examples: Naïve Bayes, SVM, linear regression, K-nearest neighbors (K-NN).
• Unsupervised: models identify patterns in the data and extract its structure. They are also used to group documents using clustering algorithms. Example: K-means.
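As a hedged sketch of the unsupervised case (the supervised Naïve Bayes case is shown in the following slides), K-means can group a small invented corpus by vocabulary:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = ["cheap flights and hotel deals",
        "book a hotel with the flight included",
        "python machine learning tutorial",
        "learning python for data science"]

X = TfidfVectorizer().fit_transform(docs)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)   # documents with similar vocabulary land in the same cluster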
Naïve Bayes Classifier
It is the most basic technique for classification of text.
Advantages:
• It is efficient as it uses limited CPU and memory.
• It is fast as model training takes less time.

Uses:
• Naïve Bayes is used for sentiment analysis, email spam detection, categorization of documents, and language detection.
• Multinomial Naïve Bayes is used when multiple occurrences of the words matter.
Naïve Bayes Classifier
Let us take a look at the signature of the multinomial Naïve Bayes classifier:
class sklearn.naive_bayes.MultinomialNB(alpha=1.0, fit_prior=True, class_prior=None)

• alpha: smoothing parameter (0 for no smoothing)
• fit_prior: whether to learn class prior probabilities
• class_prior: prior probabilities of the classes
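A minimal training sketch with MultinomialNB on bag-of-words counts; the tiny labeled corpus is invented for illustration:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_docs = ["win a free prize now", "cheap meds available",
              "meeting at noon tomorrow", "project status update"]
train_labels = ["spam", "spam", "ham", "ham"]

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_docs)

clf = MultinomialNB(alpha=1.0, fit_prior=True)   # defaults from the signature above
clf.fit(X_train, train_labels)

print(clf.predict(vectorizer.transform(["free prize meeting"])))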
Grid Search and Multiple Parameters
Document classifiers can have many parameters. A grid search approach helps find the best parameters for model training and for predicting the outcome accurately.
Extract features of a document → document classifier → Category 1 or Category 2
Grid Search and Multiple Parameters
Document classifier parameters → grid searcher → best parameter
Grid Search and Multiple Parameters
In the grid search mechanism, the whole dataset can be divided into multiple grids, and a search can be run on the entire grid or a combination of grids.
Grid searcher: Parameter 1, Parameter 2, Parameter 3 → best parameter
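A hedged sketch of grid search with GridSearchCV; the parameter grid and the data are illustrative only:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import GridSearchCV

docs = ["win a free prize now", "cheap meds available", "limited offer just for you",
        "meeting at noon tomorrow", "project status update", "see the attached report"]
labels = [1, 1, 1, 0, 0, 0]

X = CountVectorizer().fit_transform(docs)

param_grid = {"alpha": [0.1, 0.5, 1.0]}              # candidate parameter values
grid = GridSearchCV(MultinomialNB(), param_grid,
                    cv=3, n_jobs=-1)                 # n_jobs=-1 uses all CPU cores
grid.fit(X, labels)

print(grid.best_params_)   # best parameter combination found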
Pipeline
A pipeline is a combination of vectorizers, transformers, and model training.
Vectorizer → Transformer (tf-idf) → Model training (document classifiers)
• Vectorizer: converts a collection of text documents into a numerical feature vector and extracts features around the word of interest.
• Transformer (tf-idf): helps the model predict accurately.
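A minimal pipeline sketch chaining the three stages named above; the documents and labels are invented:

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

text_clf = Pipeline([
    ("vect", CountVectorizer()),     # text -> token counts
    ("tfidf", TfidfTransformer()),   # counts -> tf-idf weights
    ("clf", MultinomialNB()),        # document classifier
])

docs = ["win a free prize now", "cheap meds available",
        "meeting at noon tomorrow", "project status update"]
labels = ["spam", "spam", "ham", "ham"]

text_clf.fit(docs, labels)
print(text_clf.predict(["free meds prize"]))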
Pipeline and Grid Search
Problem Statement: Demonstrate the Pipeline and Grid Search technique.
Access: Click on the Practice Labs tab on the left side panel of the LMS. Copy or note the
username and password that is generated. Click on the Launch Lab button. On the page that
appears, enter the username and password in the respective fields, and click Login.
Analyzing the Spam Collection Dataset
Problem Statement:
Analyze the given Spam Collection dataset to:
1. View information on the spam data
2. View the length of messages
3. Define a function to eliminate stop words
4. Apply Bag of Words
5. Apply the tf-idf transformer
6. Detect spam with the Naïve Bayes model
Analyzing the Spam Collection Dataset
Instructions on performing the assignment:
• Download the Spam Collection dataset from the “Resources” tab. Upload it using the right syntax to use and analyze it.
Common instructions:
• If you are new to Python, download the “Anaconda Installation Instructions” document from the “Resources” tab to view the steps for installing Anaconda and the Jupyter notebook.
• Download the “Assignment 01” notebook and upload it to the Jupyter notebook to access it.
• Follow the provided cues to complete the assignment.
Analyzing the Sentiment Dataset using NLP
Problem Statement:
Analyze the Sentiment dataset using NLP to:
1. View the observations
2. Verify the length of the messages and add it as a new column
3. Apply a transformer and fit the data in the bag of words
4. Print the shape for the transformer
5. Check the model for predicted and expected values
Analyzing the Sentiment Dataset using NLP
Instructions on performing the assignment:
• Download the Sentiment dataset from the “Resources” tab. Upload it to your Jupyter
notebook to work on it.
Common instructions:
• If you are new to Python, download the “Anaconda Installation Instructions” document
from the “Resources” tab to view the steps for installing Anaconda and the Jupyter
notebook.
• Download the “Assignment 02” notebook and upload it to the Jupyter notebook to access it.
• Follow the provided cues to complete the assignment.
Key Takeaways
You are now able to:
Define natural language processing
Explain the importance of natural language processing
List the applications of natural language processing
Outline the modules to load content and category
Apply feature extraction techniques
Implement the approaches of natural language processing
Knowledge Check
Knowledge
Check
In NLP, tokenization is a way to _______________________.
1
a. Find the grammar of the text
b. Analyze the sentence structure
c. Find ambiguities
d. Split text data into words, phrases, and idioms
Knowledge
Check
In NLP, tokenization is a way to _______________________.
1
a. Find the grammar of the text
b. Analyze the sentence structure
c. Find ambiguities
d. Split text data into words, phrases, and idioms
The correct answer is d
Splitting text data into words, phrases, and idioms is known as tokenization, and each individual word is known as a token.
Knowledge
Check
What is the tf-idf value in a document?
2
a. Directly proportional to the number of times a word appears
b. Inversely proportional to the number of times a word appears
c. Offset by frequency of the words in corpus
d. Increase with frequency of the words in corpus
Knowledge
Check
What is the tf-idf value in a document?
2
a. Directly proportional to the number of times a word appears
b. Inversely proportional to the number of times a word appears
c. Offset by frequency of the words in corpus
d. Increase with frequency of the words in corpus
The correct answer is a,c
The tf-idf value reflects how important a word is to a document. It is directly proportional to the number of times a word appears in the document and is offset by the frequency of the word in the corpus.
Knowledge
Check
In grid search, if n_jobs = -1, then which of the following is correct?
3
a. Uses only 1 CPU core
b. Detects all installed cores and uses them all
c. Searches for only one parameter
d. All parameters will be searched on a given grid
Knowledge
Check
In grid search, if n_jobs = -1, then which of the following is correct?
3
a. Uses only 1 CPU core
b. Detects all installed cores and uses them all
c. Searches for only one parameter
d. All parameters will be searched on a given grid
The correct answer is b
With n_jobs = -1, grid search detects all installed cores on the machine and uses all of them.
Knowledge
Check
Identify the correct example of Topic Modeling from the following options:
4
a. Machine translation
b. Speech recognition
c. News aggregators
d. Sentiment analysis
Knowledge
Check
Identify the correct example of Topic Modeling from the following options:
4
a. Machine translation
b. Speech recognition
c. News aggregators
d. Sentiment analysis
The correct answer is c
Topic modeling is statistical modeling used to find latent groupings in documents based upon their words. News aggregators are an example.
Knowledge
Check
How do we save memory while operating on Bag of Words representations, which typically contain high-dimensional sparse datasets?
5
a. Distribute datasets in several blocks or chunks
b. Store only non-zero parts of the feature vectors
c. Flatten the dataset
d. Decode them
Knowledge
Check
How do we save memory while operating on Bag of Words representations, which typically contain high-dimensional sparse datasets?
5
a. Distribute datasets in several blocks or chunks
b. Store only non-zero parts of the feature vectors
c. Flatten the dataset
d. Decode them
The correct answer is b
In feature vectors, there will be many zero values. The best way to save memory is to store only the non-zero parts of the feature vectors.
Knowledge
Check
What is the function of the sub-module feature_extraction.text.CountVectorizer?
6
a. Convert a collection of text documents to a matrix of token counts
b. Convert a collection of text documents to a matrix of token occurrences
c. Transform a count matrix to a normalized form
d. Convert a collection of raw documents to a matrix of TF-IDF features
Knowledge
Check
What is the function of the sub-module feature_extraction.text.CountVectorizer?
6
a. Convert a collection of text documents to a matrix of token counts
b. Convert a collection of text documents to a matrix of token occurrences
c. Transform a count matrix to a normalized form
d. Convert a collection of raw documents to a matrix of TF-IDF features
The correct answer is a
The function of the sub-module feature_extraction.text.CountVectorizer is to convert a collection of text
documents to a matrix of token counts.
Thank You