
Advanced Python for NLP

CONTENTS:

• NLP
• NLTK
• NLP Pre-processing
WHAT IS NATURAL LANGUAGE PROCESSING (NLP)?
• Natural Language Processing is an interdisciplinary field of Artificial Intelligence.
• It is a set of techniques used to teach computers to understand and interpret human languages, much as we do.
• It is the art of extracting information and hidden insights from unstructured text.
• It is a sophisticated field that enables computers to process text data at large scale.
• The ultimate goal of NLP is to make computers and computer-controlled bots understand and interpret human languages, just as we do.
COMPONENTS OF NLP
• Natural Language Understanding
• Natural Language Generation

Figure: Components of NLP
NATURAL LANGUAGE UNDERSTANDING
• NLU helps the machine understand and analyze human language by extracting information such as keywords, emotions, relations, and semantics from large amounts of text.
Let's see what challenges a machine faces:
He is looking for a match.
• What do you understand by the keyword 'match'?
• This is Lexical Ambiguity. It happens when a word has multiple meanings. Lexical ambiguity can be resolved using part-of-speech (POS) tagging techniques.
The fish is ready to eat.
• What do you understand by the above example?
• This is Syntactic Ambiguity, also called Grammatical Ambiguity: a sequence of words can be read with more than one meaning.
NATURAL LANGUAGE GENERATION
• It is the process of producing meaningful phrases and sentences in the form of natural language.
• It consists of:
• Text planning − retrieving the relevant content from the domain.
• Sentence planning − selecting the important words, meaningful phrases, or sentences.
APPLICATIONS OF NLP
• Sentiment Analysis
• Chatbots
• Virtual Assistants
• Speech Recognition
• Machine Translation
• Advertisement Matching
• Information Extraction
• Grammatical Error Detection
• Fake News Detection
• Text Summarization
LIBRARIES FOR NLP
Here are some of the libraries for leveraging the power of Natural Language Processing:
• Natural Language Toolkit (NLTK)
• spaCy
• Gensim
• Stanford CoreNLP
• TextBlob
WHAT IS THE NATURAL LANGUAGE TOOLKIT?
• NLTK, or Natural Language Toolkit, is a Python package that you can use for NLP.
• NLTK is a leading platform for building Python programs to work with human language data.
Installing NLTK
pip install nltk
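A minimal usage sketch (the corpora and model names below are standard NLTK downloads; an internet connection is assumed for the one-time download):

import nltk

# NLTK ships the code only; corpora and models are downloaded separately.
nltk.download('punkt')        # tokenizer models
nltk.download('stopwords')    # stop word lists

from nltk.tokenize import word_tokenize
print(word_tokenize("NLTK is a leading platform for building Python programs."))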
WHAT IS spaCy?

spaCy is a free, open-source library for industrial-strength NLP in Python. Its key features include:
• Tokenization
• POS Tagging
• NER
• Lemmatization
• Sentence Boundary Detection
• Text Classification
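A brief sketch of these features, assuming the small English model has been installed (python -m spacy download en_core_web_sm):

import spacy

nlp = spacy.load("en_core_web_sm")                 # small English pipeline
doc = nlp("Apple is looking at buying a U.K. startup for $1 billion.")

for token in doc:
    print(token.text, token.pos_, token.lemma_)    # tokenization, POS tags, lemmas

for ent in doc.ents:
    print(ent.text, ent.label_)                    # named entities

for sent in doc.sents:
    print(sent.text)                               # sentence boundary detection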
WHAT IS Gensim?

Gensim is an open-source Python library for topic modeling and word embeddings. Its key features include:
• Topic Modeling
• Word2Vec
• Document Similarity
• Efficient Data Handling
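A small illustrative sketch of training Word2Vec vectors with Gensim (the toy corpus and parameter values are only for demonstration; Gensim 4.x API):

from gensim.models import Word2Vec

# Toy corpus: a list of tokenized sentences (real corpora are much larger).
sentences = [
    ["nlp", "makes", "computers", "understand", "language"],
    ["word", "vectors", "capture", "semantic", "meaning"],
    ["gensim", "trains", "word", "vectors", "efficiently"],
]

model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=50)

print(model.wv["word"].shape)                   # a 50-dimensional dense vector
print(model.wv.most_similar("word", topn=2))    # nearest words in the vector space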


DATA PREPROCESSING USING
NLTK
• Data preprocessing is the process of cleaning unstructured text data so that it can be used to predict, analyze, and extract information. Real-world text data is unstructured and inconsistent, so data preprocessing becomes a necessary step.
• The various Data Preprocessing methods are:
• Tokenization
• Frequency Distribution of Words
• Filtering Stop Words
• Stemming
• Lemmatization
• Parts of Speech (POS) Tagging
• Named Entity Recognition
• WordNet
• These are some of the methods used to process text data in NLP. The list is not exhaustive, but it serves as a great starting point for anyone who wants to get started with NLP.
TOKENIZING

• The process of breaking text data into individual tokens (words, sentences, characters) is known as Tokenization. It is the foremost step in text analytics.
• It's your first step in turning unstructured data into structured data, which is easier to analyze.
• Tokenizing can be done in two ways:
• Tokenizing by word
• Tokenizing by sentence
• Importing the tokenizers from NLTK:
from nltk.tokenize import sent_tokenize, word_tokenize
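A quick sketch of both tokenizers (the example sentence is illustrative; nltk.download('punkt') is assumed to have been run):

from nltk.tokenize import sent_tokenize, word_tokenize

text = "NLP is fascinating. It turns unstructured text into structured data."

print(sent_tokenize(text))   # ['NLP is fascinating.', 'It turns unstructured text into structured data.']
print(word_tokenize(text))   # ['NLP', 'is', 'fascinating', '.', 'It', 'turns', ...]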
STOPWORDS

• Stop words are words that are filtered out because they are repetitive and don't hold any information. For example, words like {that, these, below, is, are, etc.} don't provide any information, so they need to be removed from the text. Stop words are considered noise. NLTK provides a huge list of stop words.

• Very common words like 'in', 'is', and 'an' are often used as stop words since they don't add a lot of meaning to a text in and of themselves.
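A short sketch of filtering stop words with NLTK (the sentence is illustrative):

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('stopwords')    # one-time download of the stop word lists
nltk.download('punkt')

stop_words = set(stopwords.words('english'))
tokens = word_tokenize("This is an example showing how stop words are removed from the text.")

filtered = [w for w in tokens if w.lower() not in stop_words]
print(filtered)    # uninformative words such as 'this', 'is', 'an' are gone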
STEMMING

• Stemming is a text processing task in which you reduce words to their root, which is the core part of a word. For example, the words "helping" and "helper" share the root "help."
• Stemming allows you to zero in on the basic meaning of a word rather than all the details of how it's being used.
• NLTK has
• Porter Stemmer
• Snowball Stemmer
• Understemming and overstemming are two ways stemming
can go wrong:
• Understemming happens when two related words should be
reduced to the same stem but aren’t. This is a
false negative.
• Overstemming happens when two unrelated words are
reduced to the same stem even though they shouldn’t be.
This is a false positive.
Porter Stemmer
• It is one of the earliest and most widely used stemming algorithms. It
applies a series of rules to strip suffixes from words, reducing them
to their root form.

Snowball Stemmer
• The Snowball Stemmer, also known as the Porter2 Stemmer, is an improvement over the original Porter Stemmer. It was developed by Martin Porter to address some of the limitations of the original algorithm.
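A small sketch comparing the two stemmers NLTK provides (the word list is illustrative):

from nltk.stem import PorterStemmer, SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer("english")    # Snowball (Porter2) needs a language argument

for word in ["helping", "running", "connection", "studies"]:
    print(word, "->", porter.stem(word), "/", snowball.stem(word))

# Both stemmers strip suffixes, e.g. "helping" -> "help", "running" -> "run";
# note that a stem such as "studi" is not always a dictionary word.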
POS TAGGING
A summary that you can use to get started with NLTK's POS tags:

Tags that start with    Deal with
JJ                      Adjectives
NN                      Nouns
RB                      Adverbs
PRP                     Pronouns
VB                      Verbs
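A minimal tagging sketch (the averaged perceptron tagger is NLTK's default model; the sentence is illustrative):

import nltk

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')    # default POS tagger model

tokens = nltk.word_tokenize("The quick brown fox jumps over the lazy dog")
print(nltk.pos_tag(tokens))    # a list of (word, tag) pairs, e.g. ('The', 'DT')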
LEMMATIZING

• Like stemming, lemmatization is used to reduce a word to its root form. Lemmatization returns a complete word that makes sense. It uses vocabulary and morphological analysis to transform a word into its root word (the lemma).
• For example:
• "engineers" is lemmatized to "engineer"
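A minimal sketch with NLTK's WordNet-based lemmatizer (nltk.download('wordnet') provides the lexical database it needs):

import nltk
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()

print(lemmatizer.lemmatize("engineers"))          # engineer
print(lemmatizer.lemmatize("caring", pos="v"))    # care (lemmatized as a verb)
print(lemmatizer.lemmatize("better", pos="a"))    # good (lemmatized as an adjective)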
STEMMING V/S LEMMATIZATION

Stemming: A process that stems or removes the last few characters from a word, often leading to incorrect meanings and spellings.
Lemmatization: Considers the context and converts the word to its meaningful base form, which is called the lemma.

Stemming: For instance, stemming the word 'Caring' would return 'Car'.
Lemmatization: For instance, lemmatizing the word 'Caring' would return 'Care'.

Stemming: Used for large datasets where performance is an issue.
Lemmatization: Computationally expensive since it involves look-up tables.
CHUNKING

• While tokenizing allows you to identify words and sentences, chunking allows you to identify phrases.
• Note: A phrase is a word or group of words that works as a single unit to perform a grammatical function. Noun phrases are built around a noun.
• Here are some examples:
• "A planet"
• "A tilting planet"
• "A swiftly tilting planet"
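A small chunking sketch with NLTK's regular-expression parser; the sentence is hand-tagged so the example does not depend on the tagger's output:

import nltk

# Noun-phrase grammar: an optional determiner, any number of adjectives, then a noun.
grammar = "NP: {<DT>?<JJ>*<NN>}"
parser = nltk.RegexpParser(grammar)

sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"),
            ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]

tree = parser.parse(sentence)
print(tree)    # "the little yellow dog" and "the cat" are grouped as NP chunks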
CHINKING

• Chinking is used together with chunking, but while chunking is used to include a pattern, chinking is used to exclude a pattern.
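A sketch of the same grammar idea with a chink rule added (the }{ braces exclude a pattern from the chunks):

import nltk

grammar = r"""
  NP:
    {<.*>+}          # chunk every sequence of tags ...
    }<VBD|IN>+{      # ... then chink (exclude) verbs and prepositions
"""
parser = nltk.RegexpParser(grammar)

sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"),
            ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]
print(parser.parse(sentence))    # the verb and preposition fall outside the NP chunks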
USING NAMED ENTITY RECOGNITION
(NER)
• Named entities are noun phrases that refer to specific
locations, people, organizations, and so on. With named entity
recognition, you can find the named entities in your texts
and also determine what kind of named entity they are.
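A minimal NER sketch with NLTK's built-in chunker (the downloads are the models it relies on; the sentence is illustrative):

import nltk

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')    # named-entity chunker model
nltk.download('words')                # word list used by the chunker

tokens = nltk.word_tokenize("Barack Obama was born in Hawaii and worked in Washington.")
tree = nltk.ne_chunk(nltk.pos_tag(tokens))
print(tree)    # named entities appear as labelled subtrees such as PERSON or GPE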
Word Embedding
• Word embeddings are a type of word representation that allows words to be represented as vectors in a continuous vector space. These vectors capture the semantic meanings of words, such that words with similar meanings have similar vector representations.

• Word embeddings are crucial for various natural language processing (NLP) tasks because they help in understanding the context and relationships between words in a more meaningful way than traditional bag-of-words or one-hot encoding methods.
Word Embedding
• Vector Space Representation: Words are represented as dense vectors of fixed size. Each word is mapped to a point in a continuous vector space.

• Semantic Similarity: Words with similar meanings are located close to each other in the vector space. For example, the vectors for "king" and "queen" might be close to each other.

• Dimensionality Reduction: Word embeddings reduce the dimensionality of word representations, making them more computationally efficient while preserving semantic information.
Word Embedding

• Word2Vec
• GloVe (Global Vectors for Word Representation)
• FastText
• BERT (Bidirectional Encoder Representations from Transformers)
• Python Implementation
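One possible Python implementation sketch, using pre-trained GloVe vectors via Gensim's downloader (the first call fetches the 'glove-wiki-gigaword-50' vectors over the network):

import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-50")    # 50-dimensional GloVe word vectors

print(vectors["king"].shape)                    # (50,)
print(vectors.most_similar("king", topn=3))     # semantically close words
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# the classic analogy king - man + woman lands near "queen"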
THANKS!
Any questions?
k-Nearest Neighbors

• The k-NN algorithm is arguably the simplest machine learning algorithm.

• Building the model consists only of storing the training dataset.

• To make a prediction for a new data point, the algorithm finds the closest data points in the training dataset: its “nearest neighbors.”
k- Neighbors Classification

• In its simplest version, the k-NN algorithm considers exactly one nearest neighbor, which is the closest training data point to the point we want to make a prediction for.

• The prediction is then simply the known output for this training point.
k- Neighbors Classification
• Here, three new data points have been added, marked as Star.

• For each of them, I marked the closest point in the training set. The
prediction of the one-nearest-neighbor algorithm is the label of
that point.

• Instead of considering only the closest neighbor, we can also consider an arbitrary number, k, of neighbors. This is where the name of the k-nearest neighbors algorithm comes from.
k- Neighbors Classification

• When considering more than one neighbor, we use voting to assign a label. This means that for each test point, we count how many neighbors belong to class 0 and how many belong to class 1.

• We then assign the class that is more frequent: in other words, the majority class among the k-nearest neighbors.
k- Neighbors Classification
• import mglearn
• mglearn.plots.plot_knn_classification(n_neighbors=3)
k- Neighbors Classification

• You can see that the prediction for the new data point at the top left is not the same as the prediction when we used only one neighbor.

• While this illustration is for a binary classification problem, this method can be applied to datasets with any number of classes.

• For more classes, we count how many neighbors belong to each class and again predict the most common class.
KNN- Working
KNN

• Step 1 − To implement any algorithm, we need a dataset. So during the first step of KNN, we must load the training as well as the test data.

• Step 2 − Next, we need to choose the value of K, i.e. the number of nearest data points to consider. K can be any integer.

• Step 3 − For each point in the test data, do the following:


KNN
• 3.1 − Calculate the distance between the test point and each row of the training data using a distance measure such as Euclidean, Manhattan, or Hamming distance. The most commonly used measure is Euclidean distance.

• 3.2 − Based on the distance values, sort the training rows in ascending order.

• 3.3 − Choose the top K rows from the sorted array.

• 3.4 − Assign a class to the test point based on the most frequent class of these rows (a minimal sketch of these steps follows).
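A minimal from-scratch sketch of steps 3.1-3.4 (the tiny dataset, the query point, and the value of k are illustrative):

import math
from collections import Counter

def knn_predict(train_X, train_y, query, k):
    distances = [math.dist(row, query) for row in train_X]                    # 3.1 Euclidean distance
    nearest = sorted(range(len(distances)), key=lambda i: distances[i])[:k]   # 3.2 + 3.3
    votes = Counter(train_y[i] for i in nearest)                              # 3.4 majority vote
    return votes.most_common(1)[0][0]

train_X = [(1, 1), (2, 1), (4, 3), (5, 4)]
train_y = [0, 0, 1, 1]
print(knn_predict(train_X, train_y, (3, 2), k=3))   # two of the three nearest are class 0 -> 0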
KNN- Example
Task: Classify the given instance according to the classes of the training data using the KNN algorithm. The value of K = 3.

Query = X = (Matric = 51, Intermediate = 69)

Matric %   Intermediate %   University Entry Test
55         60               PASS
60         52               FAIL
70         68               PASS
55         85               FAIL
87         56               PASS
66         70               FAIL
KNN- Example

• Euclidean distance: d = ((x1 - x2)^2 + (y1 - y2)^2)^(1/2)
KNN- Example

• D1 = ((51 - 55)^2 + (69 - 60)^2)^(1/2) = 97^(1/2) = 9.84

• D2 = ((51 - 60)^2 + (69 - 52)^2)^(1/2) = 370^(1/2) = 19.23

• D3 = ((51 - 70)^2 + (69 - 68)^2)^(1/2) = 362^(1/2) = 19.02

• D4 = ((51 - 55)^2 + (69 - 85)^2)^(1/2) = 272^(1/2) = 16.49

• D5 = ((51 - 87)^2 + (69 - 56)^2)^(1/2) = 1465^(1/2) = 38.27

• D6 = ((51 - 66)^2 + (69 - 70)^2)^(1/2) = 226^(1/2) = 15.03


KNN- Example

• Now, based on the distance value, sort them in ascending order


• D1, D6, D4, D3, D2, D5

• 9.84, 15.03, 16.49, 19.02, 19.23, 38.27

• Next, it will choose the top K rows from the sorted array.

• 9.84, 15.03, 16.49 (D1, D6, D4)


KNN- Example

• Now, assign a class to the test point based on the most frequent class of these rows.

• D1 = Pass

• D6 = Fail

• D4 = Fail

• Query = X = (Matric = 51, Intermediate = 69) = Fail
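As a cross-check, the same result can be reproduced with scikit-learn (a sketch; Euclidean distance is the classifier's default metric):

from sklearn.neighbors import KNeighborsClassifier

X = [[55, 60], [60, 52], [70, 68], [55, 85], [87, 56], [66, 70]]
y = ["PASS", "FAIL", "PASS", "FAIL", "PASS", "FAIL"]

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)
print(knn.predict([[51, 69]]))    # ['FAIL'], matching the manual calculation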


KNN- Summary

• What is “K” in KNN algorithm?


• K = Number of nearest neighbors you want to select to predict the class
of a given item
How to Choose K in KNN
• If K is small, then the results might not be reliable because noise will have a higher influence on the result.

• If K is large, then there will be a lot of processing, which may adversely impact the performance of the algorithm. So, the following must be considered while choosing the value of K:

a. K should be approximately the square root of n (the number of data points in the training dataset).

b. K should be odd so that there are no ties. If the square root is even, then add or subtract 1 from it.
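A tiny helper that follows this rule of thumb (choose_k is a hypothetical name, not a library function):

import math

def choose_k(n_samples):
    k = round(math.sqrt(n_samples))     # K is roughly the square root of the number of training points
    return k if k % 2 == 1 else k + 1   # make it odd to avoid ties

print(choose_k(100))   # sqrt(100) = 10 is even, so choose 11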
Why KNN- Lazy Learner

• When it gets the training data, it does not learn and build a model; it just stores the data.

• It does not derive any discriminative function from the training data.

• It uses the training data only when it actually needs to make a prediction. So, KNN does not immediately learn a model but delays the learning, which is why it is called a lazy learner.
KNN- ADVANTAGES

• It can be used for both regression and classification problems.

• It is very simple and easy to implement.

• There is not much time cost in the training phase.

• KNN doesn't make any assumption about the distribution of the given data.
KNN- Dis-ADVANTAGES

• Finding the optimum value of 'k' can be difficult.

• We need to store the whole training set for every test set, so it requires a lot of space.

• It is not suitable for high-dimensional data.


KNN- Implementation Using
Python
KNN- Implementation Steps

• Import Necessary Libraries

• import numpy as np
• import matplotlib.pyplot as plt
• from sklearn.datasets import load_diabetes
• from sklearn.model_selection import train_test_split
• from sklearn.preprocessing import StandardScaler
• from sklearn.neighbors import KNeighborsClassifier
• from sklearn.metrics import accuracy_score, confusion_matrix
KNN- Implementation Steps

• Load the Dataset


• diabetes = load_diabetes()
• X = diabetes.data
• y = (diabetes.target > 150).astype(int)
• print(diabetes.DESCR)

KNN- Implementation Steps

• Split the Dataset

• X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
KNN- Implementation Steps

• Feature Scaling
• scaler = StandardScaler()
• X_train_scaled = scaler.fit_transform(X_train)
• X_test_scaled = scaler.transform(X_test)
KNN- Implementation Steps

• Train the Classifier


• knn = KNeighborsClassifier(n_neighbors=5)
• knn.fit(X_train_scaled, y_train)
KNN- Implementation Steps

• Train the Classifier


• knn = KNeighborsClassifier(n_neighbors=5)
• knn.fit(X_train_scaled, y_train)

• Make Prediction
• y_pred = knn.predict(X_test_scaled)
KNN- Implementation Steps

• Model Accuracy
• accuracy = accuracy_score(y_test, y_pred)
• print("Model Accuracy:", accuracy)
KNN- Implementation Steps

• Confusion matrix
• conf_matrix = confusion_matrix(y_test, y_pred)
• print("Confusion Matrix:")
• print(conf_matrix)
KNN- Implementation Steps

• Visualize Confusion Matrix

• plt.figure(figsize=(8, 6))
• plt.imshow(conf_matrix, cmap=plt.cm.Blues)
• plt.title('Confusion Matrix')
• plt.colorbar()
• plt.xlabel('Predicted Label')
• plt.ylabel('True Label')
• plt.xticks([0, 1], ['Negative', 'Positive'])
• plt.yticks([0, 1], ['Negative', 'Positive'])
• plt.show()
KNN- Implementation Steps

• Visualize using Scatter Plot

# Plotting the scatter plot
• plt.figure(figsize=(8, 6))

# Plotting points of class 0 (using the first two features of the test set)
• plt.scatter(X_test[y_test == 0][:, 0], X_test[y_test == 0][:, 1], c='blue', label='Class 0')

# Plotting points of class 1
• plt.scatter(X_test[y_test == 1][:, 0], X_test[y_test == 1][:, 1], c='red', label='Class 1')

• plt.xlabel('Feature 1')
• plt.ylabel('Feature 2')
• plt.title('Diabetes Dataset - Scatter Plot')
• plt.legend()
• plt.show()
