Data Science with Python
Natural Language Processing (NLP) with
SciKit Learn
Learning Objectives
By the end of this lesson, you will be able to:
Define natural language processing
Explain the importance of natural language processing
List the applications of natural language processing
Outline the modules to load content and category
Apply feature extraction techniques
Implement the approaches of natural language processing
Introduction to Natural Language
Processing
Natural Language Processing (NLP)
Natural language processing is an automated way to understand and analyze natural human languages and extract
information from such data by applying machine learning algorithms.
Data from various sources → natural language processing (machine algorithms and translations: mathematics and statistics) → analyze human languages → extract information
It is also referred to as the field of computer science or AI that extracts linguistic information from the underlying data.
Why Natural Language Processing
The world is now globally connected due to advances in technology and devices. NLP helps with:
• Analyzing tons of data
• Identifying various languages
• Applying quantitative analysis
• Handling ambiguities
Why Natural Language Processing
NLP can achieve full automation by using modern software libraries, modules, and packages.
• Full automation
• Intelligent processing
• Knowledge about languages and the world
• Modern software libraries
• Machine models
NLP Terminology
• Tokenization: splits text data into words, phrases, and idioms
• Word boundaries: determines where one word ends and the other begins
• Stemming: maps a word to its valid root word
• Topic models: discover topics in a collection of documents
• Disambiguation: determines the meaning and sense of words (context vs. intent)
• Tf-idf: represents term frequency and inverse document frequency
• Semantic analytics: compares words, phrases, and idioms in a set of documents to extract meaning
NLP Approach for Text Data
Let us look at the Natural Language Processing approaches to analyze text data.
• Conduct basic text processing
• Categorize and tag words
• Analyze sentence structure
• Build feature-based structure
• Classify text
• Extract information
• Analyze the meaning
NLP Environmental Setup
Problem Statement: Demonstrate the installation of the NLP environment
Access: Click on the Practice Labs tab on the left side panel of the LMS. Copy or note the
username and password that is generated. Click on the Launch Lab button. On the page that
appears, enter the username and password in the respective fields, and click Login.
Sentence Analysis
Problem Statement: Demonstrate how to perform sentence analysis
Access: Click on the Practice Labs tab on the left side panel of the LMS. Copy or note the
username and password that is generated. Click on the Launch Lab button. On the page that
appears, enter the username and password in the respective fields, and click Login.
Applications of NLP
Applications of NLP
Machine Translation: Machine translation is used to translate one language into another. Google Translate is an example. It uses NLP to translate the input data from one language to another.
Applications of NLP
Speech Recognition: The speech recognition application understands human speech and uses it as input information. It is useful for applications like Siri, Google Now, and Microsoft Cortana.
Applications of NLP
Sentiment Analysis: Sentiment analysis is achieved by processing tons of data received from different interfaces and sources. For example, NLP uses all social media activities to find out the most popular or important topics of discussion.
Major NLP Libraries
NLTK
Scikit-learn
TextBlob
spaCy
The Scikit-Learn Approach
The Scikit-Learn Approach
It is a very powerful library with a set of modules to process and analyze natural language data, such as text and
images, and extract information using machine learning algorithms.
• Built-in module: contains built-in modules to load the dataset’s content and categories.
• Feature extraction: a way to extract information from data, which can be text or images.
• Model training: analyzes the content based on particular categories and then trains it according to a specific model.
The Scikit-Learn Approach
It is a very powerful library with a set of modules to process and analyze natural language data, such as texts and
images, and extract information using machine learning algorithms.
• Pipeline building mechanism: a technique to streamline the NLP learning process into stages.
• Stages of the pipeline:
  1. Vectorization
  2. Transformation
  3. Model training and application
The Scikit-Learn Approach
It is a very powerful library with a set of modules to process and analyze natural language data, such as texts and
images, and extract information using machine learning algorithms.
• Pipeline building mechanism: a technique in the Scikit-learn approach to streamline the NLP process into stages.
• Performance optimization: in this stage, we train the models to optimize the overall process.
• Grid search for finding good parameters: a powerful way to search the parameters affecting the outcome, for model training purposes.
Modules to Load Content and Category
Modules to Load Content and Category
Scikit-learn has many built-in datasets. There are several methods to load these datasets with the help of a data
load object.
Container folder (Category 1, Category 2) → data load object
Modules to Load Content and Category
The text files are loaded with categories as subfolder names.
Container folder (Category 1, Category 2) → extract features → NumPy array / SciPy matrix
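As a minimal sketch of this layout, scikit-learn's load_files reads a container folder whose subfolders name the categories; the folder path below is a hypothetical placeholder.

# Minimal sketch: load text files whose categories are the subfolder names.
# "container_folder" is a hypothetical path with one subfolder per category.
from sklearn.datasets import load_files

dataset = load_files("container_folder", encoding="utf-8", decode_error="replace")

print(dataset.target_names)  # subfolder names become the category names
print(len(dataset.data))     # raw documents, later turned into NumPy/SciPy feature arrays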
Modules to Load Content and Category
In [ ]: # Build a feature extraction transformer
from sklearn.feature_extraction.text import <appropriate transformer>
Modules to Load Content and Category
The attributes of a data load object are:
• Bunch: contains fields and can be accessed as dict keys or as an object
• Target names: holds the list of requested categories
• Data: refers to an attribute in memory
Modules to Load Content and Category
The example shows how a dataset can be loaded using Scikit-learn:
Import the dataset
Load dataset
Describe the dataset
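For illustration (a sketch, not the lab's exact dataset), the 20 Newsgroups corpus can stand in as a loadable dataset; fetch_20newsgroups downloads it on first use.

from sklearn.datasets import fetch_20newsgroups

# Import and load the dataset (two categories kept for brevity)
twenty_train = fetch_20newsgroups(subset="train", categories=["sci.space", "rec.autos"])

# Describe the dataset
print(twenty_train.DESCR[:300])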
Modules to Load Content and Category
Let us see how the type() function and the .data and .target attributes help in analyzing a dataset.
View type of dataset
View data
View target
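Continuing the same assumption (a Bunch object returned by a scikit-learn loader), a short inspection sketch:

from sklearn.datasets import fetch_20newsgroups

twenty_train = fetch_20newsgroups(subset="train")

print(type(twenty_train))             # view type of dataset: sklearn.utils.Bunch
print(twenty_train.data[0][:200])     # view data: raw text of the first document
print(twenty_train.target[:10])       # view target: integer category labels
print(twenty_train.target_names[:5])  # names of the requested categories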
Feature Extraction
Feature extraction is a technique to convert content into numerical vectors in order to perform machine learning.
• Text feature extraction (for example: large datasets or documents)
• Image feature extraction (for example: patch extraction, hierarchical clustering)
Bag of Words
Bag of Words
Bag of words is used to convert text data into numerical feature vectors with a fixed size.
• Tokenizing: assign a fixed integer id to each word
• Counting: count the number of occurrences of each word
• Storing: store the count value as the feature

Corpus of documents (token count matrix):
              Token 1   Token 2   Token 3   Token 4
Document 1         42        32       119         3
Document 2       1118         0         0        89
Document 3          0         0         0        55
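A minimal bag-of-words sketch with CountVectorizer on a made-up corpus; tokens get fixed integer ids and their occurrence counts become the features:

from sklearn.feature_extraction.text import CountVectorizer

corpus = ["the cat sat on the mat",
          "the dog sat on the log",
          "cats and dogs"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)        # sparse document-term count matrix

print(vectorizer.get_feature_names_out())   # tokens in column order (their fixed integer ids)
print(X.toarray())                          # rows = documents, columns = token counts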
CountVectorizer Class Signature
class sklearn.feature_extraction.text.CountVectorizer(input='content', encoding='utf-8', decode_error='strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, stop_words=None, token_pattern='(?u)\b\w\w+\b', ngram_range=(1, 1), analyzer='word', max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False, dtype=<class 'numpy.int64'>)

• input: file name or sequence of strings
• encoding: encoding used to decode the input
• strip_accents: removes accents
• tokenizer: overrides the string tokenizer
• stop_words: built-in stop words list
• max_df / min_df: maximum and minimum document-frequency thresholds
• max_features: specifies the number of components (features) to keep
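A short usage sketch exercising a few of these parameters; the values below are illustrative choices, not recommendations:

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(stop_words="english",   # drop built-in English stop words
                             ngram_range=(1, 2),     # unigrams and bigrams
                             min_df=2)               # keep terms seen in at least 2 documents

docs = ["Machine learning with text", "Text mining and machine learning"]
X = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())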
Bag of Words
Problem Statement: Demonstrate the Bag of Words technique
Access: Click on the Practice Labs tab on the left side panel of the LMS. Copy or note the
username and password that is generated. Click on the Launch Lab button. On the page that
appears, enter the username and password in the respective fields, and click Login.
Text Feature Extraction Considerations
Text Feature Extraction Considerations
• Sparse: this utility deals with sparse matrices while storing them in memory. Sparse data is commonly noticed when extracting feature values, especially for large document datasets.
• Vectorizer: implements tokenization and occurrence counting. Words with a minimum of two letters get tokenized. We can use the analyzer function to vectorize the text data.
• Tf-idf: a term weighting utility for term frequency and inverse document frequency. Term frequency indicates how often a particular term occurs in a document. Inverse document frequency is a factor that diminishes the weight of terms that occur frequently across documents.
• Decoding: this utility can decode text files if their encoding is specified.
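A small sketch tying these points together: counts from CountVectorizer are re-weighted with TfidfTransformer, and the result stays a sparse matrix (the corpus is invented):

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

corpus = ["the cat sat on the mat",
          "the dog sat on the log",
          "the bird flew over the log"]

counts = CountVectorizer().fit_transform(corpus)    # tokenization + occurrence counts (sparse)
tfidf = TfidfTransformer().fit_transform(counts)    # tf-idf weighting, still sparse

print(tfidf.shape)   # (documents, terms)
print(tfidf.nnz)     # only the non-zero entries are stored in memory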
Model Training
An important task in model training is to identify the right model for the given dataset. The choice of model
completely depends on the type of dataset.
• Supervised: models predict the outcome of new observations and datasets, and classify documents based on the features and response of a given dataset. Examples: Naïve Bayes, SVM, linear regression, K-nearest neighbors (K-NN).
• Unsupervised: models identify patterns in the data and extract its structure. They are also used to group documents using clustering algorithms. Example: K-means.
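As a hedged sketch of the unsupervised case (the supervised Naïve Bayes case is shown in the following slides), K-means can group a small invented corpus by vocabulary:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = ["cheap flights and hotel deals",
        "book a hotel with the flight included",
        "python machine learning tutorial",
        "learning python for data science"]

X = TfidfVectorizer().fit_transform(docs)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)   # documents with similar vocabulary land in the same cluster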
Naïve Bayes Classifier
It is the most basic technique for classification of text.
Advantages:
• It is efficient as it uses limited CPU and memory.
• It is fast as model training takes less time.

Uses:
• Naïve Bayes is used for sentiment analysis, email spam detection, categorization of documents, and language detection.
• Multinomial Naïve Bayes is used when multiple occurrences of the words matter.
Naïve Bayes Classifier
Let us take a look at the signature of the multinomial Naïve Bayes classifier:
class sklearn.naive_bayes.MultinomialNB(alpha=1.0, fit_prior=True, class_prior=None)

• alpha: smoothing parameter (0 for no smoothing)
• fit_prior: whether to learn class prior probabilities
• class_prior: prior probabilities of the classes
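A minimal training sketch with MultinomialNB on bag-of-words counts; the tiny labeled corpus is invented for illustration:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_docs = ["win a free prize now", "cheap meds available",
              "meeting at noon tomorrow", "project status update"]
train_labels = ["spam", "spam", "ham", "ham"]

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_docs)

clf = MultinomialNB(alpha=1.0, fit_prior=True)   # defaults from the signature above
clf.fit(X_train, train_labels)

print(clf.predict(vectorizer.transform(["free prize meeting"])))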
Grid Search and Multiple Parameters
Document classifiers can have many parameters. A grid search approach helps find the best parameters for model training and for predicting the outcome accurately.
Extract features of a document → document classifier → Category 1 or Category 2
Grid Search and Multiple Parameters
Document classifier parameters → grid searcher → best parameter
Grid Search and Multiple Parameters
In the grid search mechanism, the whole dataset can be divided into multiple grids, and a search can be run on the entire grid or a combination of grids.
Grid searcher: Parameter 1, Parameter 2, Parameter 3 → best parameter
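A hedged sketch of grid search with GridSearchCV; the parameter grid and the data are illustrative only:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import GridSearchCV

docs = ["win a free prize now", "cheap meds available", "limited offer just for you",
        "meeting at noon tomorrow", "project status update", "see the attached report"]
labels = [1, 1, 1, 0, 0, 0]

X = CountVectorizer().fit_transform(docs)

param_grid = {"alpha": [0.1, 0.5, 1.0]}              # candidate parameter values
grid = GridSearchCV(MultinomialNB(), param_grid,
                    cv=3, n_jobs=-1)                 # n_jobs=-1 uses all CPU cores
grid.fit(X, labels)

print(grid.best_params_)   # best parameter combination found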
Pipeline
A pipeline is a combination of vectorizers, transformers, and model training.
Vectorizer → Transformer (tf-idf) → Model training (document classifiers)
• Vectorizer: converts a collection of text documents into a numerical feature vector and extracts features around the word of interest.
• Transformer (tf-idf): helps the model predict accurately.
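A minimal pipeline sketch chaining the three stages named above; the documents and labels are invented:

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

text_clf = Pipeline([
    ("vect", CountVectorizer()),     # text -> token counts
    ("tfidf", TfidfTransformer()),   # counts -> tf-idf weights
    ("clf", MultinomialNB()),        # document classifier
])

docs = ["win a free prize now", "cheap meds available",
        "meeting at noon tomorrow", "project status update"]
labels = ["spam", "spam", "ham", "ham"]

text_clf.fit(docs, labels)
print(text_clf.predict(["free meds prize"]))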
Pipeline and Grid Search
Problem Statement: Demonstrate the Pipeline and Grid Search technique.
Access: Click on the Practice Labs tab on the left side panel of the LMS. Copy or note the
username and password that is generated. Click on the Launch Lab button. On the page that
appears, enter the username and password in the respective fields, and click Login.
Analyzing the Spam Collection Dataset
Problem Statement:
Analyze the given Spam Collection dataset to:
1. View information on the spam data
2. View the length of messages
3. Define a function to eliminate stop words
4. Apply Bag of Words
5. Apply the tf-idf transformer
6. Detect spam with the Naïve Bayes model
Analyzing the Spam Collection Dataset
Instructions on performing the assignment:
• Download the Spam Collection dataset from the “Resources” tab. Upload it using the right syntax to use and analyze it.
Common instructions:
• If you are new to Python, download the “Anaconda Installation Instructions” document from the “Resources” tab to view the steps for installing Anaconda and the Jupyter notebook.
• Download the “Assignment 01” notebook and upload it to the Jupyter notebook to access it.
• Follow the provided cues to complete the assignment.
Analyzing the Sentiment Dataset using NLP
Problem Statement:
Analyze the Sentiment dataset using NLP to:
1. View the observations
2. Verify the length of the messages and add it as a new column
3. Apply a transformer and fit the data in the bag of words
4. Print the shape for the transformer
5. Check the model for predicted and expected values
Analyzing the Sentiment Dataset using NLP
Instructions on performing the assignment:
• Download the Sentiment dataset from the “Resources” tab. Upload it to your Jupyter
notebook to work on it.
Common instructions:
• If you are new to Python, download the “Anaconda Installation Instructions” document
from the “Resources” tab to view the steps for installing Anaconda and the Jupyter
notebook.
• Download the “Assignment 02” notebook and upload it to the Jupyter notebook to access it.
• Follow the provided cues to complete the assignment.
Key Takeaways
You are now able to:
Define natural language processing
Explain the importance of natural language processing
List the applications of natural language processing
Outline the modules to load content and category
Apply feature extraction techniques
Implement the approaches of natural language processing
Knowledge Check
Knowledge
Check
In NLP, tokenization is a way to _______________________.
1
a. Find the grammar of the text
b. Analyze the sentence structure
c. Find ambiguities
d. Split text data into words, phrases, and idioms
Knowledge
Check
In NLP, tokenization is a way to _______________________.
1
a. Find the grammar of the text
b. Analyze the sentence structure
c. Find ambiguities
d. Split text data into words, phrases, and idioms
The correct answer is d
Splitting text data into words, phrases, and idioms is known as tokenization, and each individual word is known as a token.
Knowledge
Check
What is the tf-idf value in a document?
2
a. Directly proportional to the number of times a word appears
b. Inversely proportional to the number of times a word appears
c. Offset by frequency of the words in corpus
d. Increase with frequency of the words in corpus
Knowledge
Check
What is the tf-idf value in a document?
2
a. Directly proportional to the number of times a word appears
b. Inversely proportional to the number of times a word appears
c. Offset by frequency of the words in corpus
d. Increase with frequency of the words in corpus
The correct answer is a,c
The tf-idf value reflects how important a word is to a document. It is directly proportional to the number of times a word appears in the document and is offset by the frequency of the word in the corpus.
Knowledge
Check
In grid search, if n_jobs = -1, then which of the following is correct?
3
a. Uses only 1 CPU core
b. Detects all installed cores and uses them all
c. Searches for only one parameter
d. All parameters will be searched on a given grid
Knowledge
Check
In grid search, if n_jobs = -1, then which of the following is correct?
3
a. Uses only 1 CPU core
b. Detects all installed cores and uses them all
c. Searches for only one parameter
d. All parameters will be searched on a given grid
The correct answer is b
With n_jobs = -1, grid search detects all installed cores on the machine and uses all of them.
Knowledge
Check
Identify the correct example of Topic Modeling from the following options:
4
a. Machine translation
b. Speech recognition
c. News aggregators
d. Sentiment analysis
Knowledge
Check
Identify the correct example of Topic Modeling from the following options:
4
a. Machine translation
b. Speech recognition
c. News aggregators
d. Sentiment analysis
The correct answer is c
Topic modeling is statistical modeling used to find latent groupings in documents based upon their words. News aggregators are an example.
Knowledge
Check
How do we save memory while operating on Bag of Words representations, which typically contain high-dimensional sparse datasets?
5
a. Distribute datasets in several blocks or chunks
b. Store only non-zero parts of the feature vectors
c. Flatten the dataset
d. Decode them
Knowledge
Check
How do we save memory while operating on Bag of Words representations, which typically contain high-dimensional sparse datasets?
5
a. Distribute datasets in several blocks or chunks
b. Store only non-zero parts of the feature vectors
c. Flatten the dataset
d. Decode them
The correct answer is b
In feature vectors, there will be many zero values. The best way to save memory is to store only the non-zero parts of the feature vectors.
Knowledge
Check
What is the function of the sub-module feature_extraction.text.CountVectorizer?
6
a. Convert a collection of text documents to a matrix of token counts
b. Convert a collection of text documents to a matrix of token occurrences
c. Transform a count matrix to a normalized form
d. Convert a collection of raw documents to a matrix of TF-IDF features
Knowledge
Check
What is the function of the sub-module feature_extraction.text.CountVectorizer?
6
a. Convert a collection of text documents to a matrix of token counts
b. Convert a collection of text documents to a matrix of token occurrences
c. Transform a count matrix to a normalized form
d. Convert a collection of raw documents to a matrix of TF-IDF features
The correct answer is a
The function of the sub-module feature_extraction.text.CountVectorizer is to convert a collection of text
documents to a matrix of token counts.
Thank You