
Advanced Python for NLP

CONTENTS:

• NLP
• NLTK
• NLP Pre-processing
WHAT IS NATURAL LANGUAGE PROCESSING (NLP)?
• Natural Language Processing is an interdisciplinary field of Artificial Intelligence.
• It is a set of techniques used to teach computers to understand and interpret human languages, much as we do.
• It is the art of extracting information and hidden insights from unstructured text.
• It is a sophisticated field that enables computers to process text data at large scale.
• The ultimate goal of NLP is to make computers and computer-controlled bots understand and interpret human languages, just as we do.
COMPONENTS OF NLP
• Natural Language Understanding
• Natural Language Generation

Figure: Components of NLP
NATURAL LANGUAGE UNDERSTANDING
• NLU helps the machine understand and analyze human language by extracting information such as keywords, emotions, relations, and semantics from large amounts of text.
Let's see what challenges a machine faces:
He is looking for a match.
• What do you understand by the keyword 'match'?
• This is Lexical Ambiguity. It happens when a word has multiple meanings. Lexical ambiguity can be resolved using part-of-speech (POS) tagging techniques.
The fish is ready to eat.
• What do you understand by the above example?
• This is Syntactic Ambiguity, also called Grammatical Ambiguity: a sequence of words can be read with more than one meaning.
NATURAL LANGUAGE GENERATION
• It is the process of producing meaningful phrases and sentences in the form of natural language.
• It consists of:
• Text planning − retrieving the relevant content from the domain.
• Sentence planning − selecting the important words, meaningful phrases, or sentences.
APPLICATIONS OF NLP
• Sentiment Analysis
• Chatbots
• Virtual Assistants
• Speech Recognition
• Machine Translation
• Advertisement Matching
• Information Extraction
• Grammatical Error Detection
• Fake News Detection
• Text Summarization
LIBRARIES FOR NLP
Here are some of the libraries for leveraging the power of Natural Language Processing:
• Natural Language Toolkit (NLTK)
• spaCy
• Gensim
• Stanford CoreNLP
• TextBlob
WHAT IS THE NATURAL LANGUAGE TOOLKIT?
• NLTK, or Natural Language Toolkit, is a Python package that you can use for NLP.
• NLTK is a leading platform for building Python programs to work with human language data.
Installing NLTK
pip install nltk
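A minimal usage sketch (the corpora and model names below are standard NLTK downloads; an internet connection is assumed for the one-time download):

import nltk

# NLTK ships the code only; corpora and models are downloaded separately.
nltk.download('punkt')        # tokenizer models
nltk.download('stopwords')    # stop word lists

from nltk.tokenize import word_tokenize
print(word_tokenize("NLTK is a leading platform for building Python programs."))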
WHAT IS spaCy?

spaCy is a free, open-source library for industrial-strength NLP in Python. Its key features include:
• Tokenization
• POS Tagging
• NER
• Lemmatization
• Sentence Boundary Detection
• Text Classification
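A brief sketch of these features, assuming the small English model has been installed (python -m spacy download en_core_web_sm):

import spacy

nlp = spacy.load("en_core_web_sm")                 # small English pipeline
doc = nlp("Apple is looking at buying a U.K. startup for $1 billion.")

for token in doc:
    print(token.text, token.pos_, token.lemma_)    # tokenization, POS tags, lemmas

for ent in doc.ents:
    print(ent.text, ent.label_)                    # named entities

for sent in doc.sents:
    print(sent.text)                               # sentence boundary detection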
WHAT IS Gensim?

Gensim is an open-source Python library for topic modeling and word embeddings. Its key features include:
• Topic Modeling
• Word2Vec
• Document Similarity
• Efficient Data Handling
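A small illustrative sketch of training Word2Vec vectors with Gensim (the toy corpus and parameter values are only for demonstration; Gensim 4.x API):

from gensim.models import Word2Vec

# Toy corpus: a list of tokenized sentences (real corpora are much larger).
sentences = [
    ["nlp", "makes", "computers", "understand", "language"],
    ["word", "vectors", "capture", "semantic", "meaning"],
    ["gensim", "trains", "word", "vectors", "efficiently"],
]

model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=50)

print(model.wv["word"].shape)                   # a 50-dimensional dense vector
print(model.wv.most_similar("word", topn=2))    # nearest words in the vector space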


DATA PREPROCESSING USING
NLTK
• Data preprocessing is the process of cleaning unstructured text data so that it can be used to predict, analyze, and extract information. Real-world text data is unstructured and inconsistent, so data preprocessing becomes a necessary step.
• The various Data Preprocessing methods are:
• Tokenization
• Frequency Distribution of Words
• Filtering Stop Words
• Stemming
• Lemmatization
• Parts of Speech (POS) Tagging
• Named Entity Recognition
• WordNet
• These are some of the methods used to process text data in NLP. The list is not exhaustive, but it serves as a great starting point for anyone who wants to get started with NLP.
TOKENIZING

• The process of breaking text data into individual tokens (words, sentences, characters) is known as Tokenization. It is the foremost step in text analytics.
• It's your first step in turning unstructured data into structured data, which is easier to analyze.
• Tokenizing can be done in two ways:
• Tokenizing by word
• Tokenizing by sentence
• Importing the tokenizers from NLTK:
from nltk.tokenize import sent_tokenize, word_tokenize
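A quick sketch of both tokenizers (the example sentence is illustrative; nltk.download('punkt') is assumed to have been run):

from nltk.tokenize import sent_tokenize, word_tokenize

text = "NLP is fascinating. It turns unstructured text into structured data."

print(sent_tokenize(text))   # ['NLP is fascinating.', 'It turns unstructured text into structured data.']
print(word_tokenize(text))   # ['NLP', 'is', 'fascinating', '.', 'It', 'turns', ...]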
STOPWORDS

• Stop words are words that are filtered out because they are repetitive and don't hold any information. For example, words like {that, these, below, is, are, etc.} don't provide any information, so they need to be removed from the text. Stop words are considered noise. NLTK provides a huge list of stop words.

• Very common words like 'in', 'is', and 'an' are often used as stop words since they don't add a lot of meaning to a text in and of themselves.
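A short sketch of filtering stop words with NLTK (the sentence is illustrative):

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('stopwords')    # one-time download of the stop word lists
nltk.download('punkt')

stop_words = set(stopwords.words('english'))
tokens = word_tokenize("This is an example showing how stop words are removed from the text.")

filtered = [w for w in tokens if w.lower() not in stop_words]
print(filtered)    # uninformative words such as 'this', 'is', 'an' are gone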
STEMMING

• Stemming is a text processing task in which you reduce words to their root, which is the core part of a word. For example, the words "helping" and "helper" share the root "help."
• Stemming allows you to zero in on the basic meaning of a word rather than all the details of how it's being used.
• NLTK has
• Porter Stemmer
• Snowball Stemmer
• Understemming and overstemming are two ways stemming
can go wrong:
• Understemming happens when two related words should be
reduced to the same stem but aren’t. This is a
false negative.
• Overstemming happens when two unrelated words are
reduced to the same stem even though they shouldn’t be.
This is a false positive.
Porter Stemmer
• It is one of the earliest and most widely used stemming algorithms. It
applies a series of rules to strip suffixes from words, reducing them
to their root form.

Snowball Stemmer
• The Snowball Stemmer, also known as the Porter2 Stemmer, is an improvement over the original Porter Stemmer. It was developed by Martin Porter to address some of the limitations of the original algorithm.
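A small sketch comparing the two stemmers NLTK provides (the word list is illustrative):

from nltk.stem import PorterStemmer, SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer("english")    # Snowball (Porter2) needs a language argument

for word in ["helping", "running", "connection", "studies"]:
    print(word, "->", porter.stem(word), "/", snowball.stem(word))

# Both stemmers strip suffixes, e.g. "helping" -> "help", "running" -> "run";
# note that a stem such as "studi" is not always a dictionary word.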
POS TAGGING
A summary that you can use to get started with NLTK's POS tags:

Tags that start with    Deal with
JJ                      Adjectives
NN                      Nouns
RB                      Adverbs
PRP                     Pronouns
VB                      Verbs
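A minimal tagging sketch (the averaged perceptron tagger is NLTK's default model; the sentence is illustrative):

import nltk

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')    # default POS tagger model

tokens = nltk.word_tokenize("The quick brown fox jumps over the lazy dog")
print(nltk.pos_tag(tokens))    # a list of (word, tag) pairs, e.g. ('The', 'DT')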
LEMMATIZING

• Like stemming, lemmatization is used to reduce a word to its root form. Lemmatization returns a complete word that makes sense. It uses vocabulary and morphological analysis to transform a word into its root word (the lemma).
• For example:
• "engineers" is lemmatized to "engineer"
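A minimal sketch with NLTK's WordNet-based lemmatizer (nltk.download('wordnet') provides the lexical database it needs):

import nltk
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()

print(lemmatizer.lemmatize("engineers"))          # engineer
print(lemmatizer.lemmatize("caring", pos="v"))    # care (lemmatized as a verb)
print(lemmatizer.lemmatize("better", pos="a"))    # good (lemmatized as an adjective)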
STEMMING V/S LEMMATIZATION

Stemming: A process that stems or removes the last few characters from a word, often leading to incorrect meanings and spellings.
Lemmatization: Considers the context and converts the word to its meaningful base form, which is called the lemma.

Stemming: For instance, stemming the word 'Caring' would return 'Car'.
Lemmatization: For instance, lemmatizing the word 'Caring' would return 'Care'.

Stemming: Used for large datasets where performance is an issue.
Lemmatization: Computationally expensive since it involves look-up tables.
CHUNKING

• While tokenizing allows you to identify words and sentences, chunking allows you to identify phrases.
• Note: A phrase is a word or group of words that works as a single unit to perform a grammatical function. Noun phrases are built around a noun.
• Here are some examples:
• "A planet"
• "A tilting planet"
• "A swiftly tilting planet"
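A small chunking sketch with NLTK's regular-expression parser; the sentence is hand-tagged so the example does not depend on the tagger's output:

import nltk

# Noun-phrase grammar: an optional determiner, any number of adjectives, then a noun.
grammar = "NP: {<DT>?<JJ>*<NN>}"
parser = nltk.RegexpParser(grammar)

sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"),
            ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]

tree = parser.parse(sentence)
print(tree)    # "the little yellow dog" and "the cat" are grouped as NP chunks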
CHINKING

• Chinking is used together with chunking, but while chunking is used to include a pattern, chinking is used to exclude a pattern.
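A sketch of the same grammar idea with a chink rule added (the }{ braces exclude a pattern from the chunks):

import nltk

grammar = r"""
  NP:
    {<.*>+}          # chunk every sequence of tags ...
    }<VBD|IN>+{      # ... then chink (exclude) verbs and prepositions
"""
parser = nltk.RegexpParser(grammar)

sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"),
            ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]
print(parser.parse(sentence))    # the verb and preposition fall outside the NP chunks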
USING NAMED ENTITY RECOGNITION
(NER)
• Named entities are noun phrases that refer to specific
locations, people, organizations, and so on. With named entity
recognition, you can find the named entities in your texts
and also determine what kind of named entity they are.
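A minimal NER sketch with NLTK's built-in chunker (the downloads are the models it relies on; the sentence is illustrative):

import nltk

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')    # named-entity chunker model
nltk.download('words')                # word list used by the chunker

tokens = nltk.word_tokenize("Barack Obama was born in Hawaii and worked in Washington.")
tree = nltk.ne_chunk(nltk.pos_tag(tokens))
print(tree)    # named entities appear as labelled subtrees such as PERSON or GPE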
Word Embedding
• Word embeddings are a type of word representation that allows words to be represented as vectors in a continuous vector space. These vectors capture the semantic meanings of words, such that words with similar meanings have similar vector representations.

• Word embeddings are crucial for various natural language processing (NLP) tasks because they help in understanding the context and relationships between words in a more meaningful way than traditional bag-of-words or one-hot encoding methods.
Word Embedding
• Vector Space Representation: Words are represented as dense vectors of fixed size. Each word is mapped to a point in a continuous vector space.

• Semantic Similarity: Words with similar meanings are located close to each other in the vector space. For example, the vectors for "king" and "queen" might be close to each other.

• Dimensionality Reduction: Word embeddings reduce the dimensionality of word representations, making them more computationally efficient while preserving semantic information.
Word Embedding

• Word2Vec
• GloVe (Global Vectors for Word Representation)
• FastText
• BERT (Bidirectional Encoder Representations from Transformers)
• Python Implementation
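One possible Python implementation sketch, using pre-trained GloVe vectors via Gensim's downloader (the first call fetches the 'glove-wiki-gigaword-50' vectors over the network):

import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-50")    # 50-dimensional GloVe word vectors

print(vectors["king"].shape)                    # (50,)
print(vectors.most_similar("king", topn=3))     # semantically close words
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# the classic analogy king - man + woman lands near "queen"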
THANKS!
Any questions?
k-Nearest Neighbors

• The k-NN algorithm is arguably the simplest machine learning algorithm.

• Building the model consists only of storing the training dataset.

• To make a prediction for a new data point, the algorithm finds the closest data points in the training dataset: its “nearest neighbors.”
k- Neighbors Classification

• In its simplest version, the k-NN algorithm considers exactly one nearest neighbor, which is the closest training data point to the point we want to make a prediction for.

• The prediction is then simply the known output for this training point.
k- Neighbors Classification
• Here, three new data points have been added, marked as Star.

• For each of them, I marked the closest point in the training set. The
prediction of the one-nearest-neighbor algorithm is the label of
that point.

• Instead of considering only the closest neighbor, we can also consider an arbitrary number, k, of neighbors. This is where the name of the k-nearest neighbors algorithm comes from.
k- Neighbors Classification

• When considering more than one neighbor, we use voting to assign a label. This means that for each test point, we count how many neighbors belong to class 0 and how many belong to class 1.

• We then assign the class that is more frequent: in other words, the majority class among the k-nearest neighbors.
k- Neighbors Classification
• import mglearn
• mglearn.plots.plot_knn_classification(n_neighbors=3)
k- Neighbors Classification

• You can see that the prediction for the new data point at the top left is not the same as the prediction when we used only one neighbor.

• While this illustration is for a binary classification problem, this method can be applied to datasets with any number of classes.

• For more classes, we count how many neighbors belong to each class and again predict the most common class.
KNN- Working
KNN

• Step 1 − To implement any algorithm, we need a dataset. So during the first step of KNN, we must load the training as well as the test data.

• Step 2 − Next, we need to choose the value of K, i.e. the number of nearest data points to consider. K can be any integer.

• Step 3 − For each point in the test data, do the following:


KNN
• 3.1 − Calculate the distance between the test point and each row of the training data using a distance measure such as Euclidean, Manhattan, or Hamming distance. The most commonly used measure is Euclidean distance.

• 3.2 − Based on the distance values, sort the training rows in ascending order.

• 3.3 − Choose the top K rows from the sorted array.

• 3.4 − Assign a class to the test point based on the most frequent class of these rows (a minimal sketch of these steps follows).
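A minimal from-scratch sketch of steps 3.1-3.4 (the tiny dataset, the query point, and the value of k are illustrative):

import math
from collections import Counter

def knn_predict(train_X, train_y, query, k):
    distances = [math.dist(row, query) for row in train_X]                    # 3.1 Euclidean distance
    nearest = sorted(range(len(distances)), key=lambda i: distances[i])[:k]   # 3.2 + 3.3
    votes = Counter(train_y[i] for i in nearest)                              # 3.4 majority vote
    return votes.most_common(1)[0][0]

train_X = [(1, 1), (2, 1), (4, 3), (5, 4)]
train_y = [0, 0, 1, 1]
print(knn_predict(train_X, train_y, (3, 2), k=3))   # two of the three nearest are class 0 -> 0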
KNN- Example
Task: Classify the given instance according to the classes of the training data using the KNN algorithm. The value of K = 3.

Query = X = (Matric = 51, Intermediate = 69)

Matric %   Intermediate %   University Entry Test
55         60               PASS
60         52               FAIL
70         68               PASS
55         85               FAIL
87         56               PASS
66         70               FAIL
KNN- Example

• Euclidean distance: d = ((x1 - x2)^2 + (y1 - y2)^2)^(1/2)
KNN- Example

• D1 = ((51 - 55)^2 + (69 - 60)^2)^(1/2) = 97^(1/2) = 9.84

• D2 = ((51 - 60)^2 + (69 - 52)^2)^(1/2) = 370^(1/2) = 19.23

• D3 = ((51 - 70)^2 + (69 - 68)^2)^(1/2) = 362^(1/2) = 19.02

• D4 = ((51 - 55)^2 + (69 - 85)^2)^(1/2) = 272^(1/2) = 16.49

• D5 = ((51 - 87)^2 + (69 - 56)^2)^(1/2) = 1465^(1/2) = 38.27

• D6 = ((51 - 66)^2 + (69 - 70)^2)^(1/2) = 226^(1/2) = 15.03


KNN- Example

• Now, based on the distance value, sort them in ascending order


• D1, D6, D4, D3, D2, D5

• 9.84, 15.03, 16.49, 19.02, 19.23, 38.27

• Next, it will choose the top K rows from the sorted array.

• 9.84, 15.03, 16.49 (D1, D6, D4)


KNN- Example

• Now, assign a class to the test point based on the most frequent class of these rows.

• D1 = Pass

• D6 = Fail

• D4 = Fail

• Query = X = (Matric = 51, Intermediate = 69) = Fail
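As a cross-check, the same result can be reproduced with scikit-learn (a sketch; Euclidean distance is the classifier's default metric):

from sklearn.neighbors import KNeighborsClassifier

X = [[55, 60], [60, 52], [70, 68], [55, 85], [87, 56], [66, 70]]
y = ["PASS", "FAIL", "PASS", "FAIL", "PASS", "FAIL"]

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)
print(knn.predict([[51, 69]]))    # ['FAIL'], matching the manual calculation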


KNN- Summary

• What is “K” in KNN algorithm?


• K = Number of nearest neighbors you want to select to predict the class
of a given item
How to Choose K in KNN
• If K is small, then the results might not be reliable because noise will have a higher influence on the result.

• If K is large, then there will be a lot of processing, which may adversely impact the performance of the algorithm. So, the following must be considered while choosing the value of K:

a. K should be approximately the square root of n (the number of data points in the training dataset).

b. K should be odd so that there are no ties. If the square root is even, then add or subtract 1 from it.
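A tiny helper that follows this rule of thumb (choose_k is a hypothetical name, not a library function):

import math

def choose_k(n_samples):
    k = round(math.sqrt(n_samples))     # K is roughly the square root of the number of training points
    return k if k % 2 == 1 else k + 1   # make it odd to avoid ties

print(choose_k(100))   # sqrt(100) = 10 is even, so choose 11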
Why KNN- Lazy Learner

• When it gets the training data, it does not learn and build a model; it just stores the data.

• It does not derive any discriminative function from the training data.

• It uses the training data only when it actually needs to make a prediction. So, KNN does not immediately learn a model but delays the learning, which is why it is called a lazy learner.
KNN- ADVANTAGES

• It can be used for both regression and classification problems.

• It is very simple and easy to implement.

• There is not much time cost in the training phase.

• KNN doesn't make any assumption about the distribution of the given data.
KNN- Dis-ADVANTAGES

• Finding the optimum value of 'k' can be difficult.

• We need to store the whole training set for every test set, so it requires a lot of space.

• It is not suitable for high-dimensional data.


KNN- Implementation Using
Python
KNN- Implementation Steps

• Import Necessary Libraries

• import numpy as np
• import matplotlib.pyplot as plt
• from sklearn.datasets import load_diabetes
• from sklearn.model_selection import train_test_split
• from sklearn.preprocessing import StandardScaler
• from sklearn.neighbors import KNeighborsClassifier
• from sklearn.metrics import accuracy_score, confusion_matrix
KNN- Implementation Steps

• Load the Dataset


• diabetes = load_diabetes()
• X = diabetes.data
• y = (diabetes.target > 150).astype(int)
• print(diabetes.DESCR)

KNN- Implementation Steps

• Split the Dataset

• X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
KNN- Implementation Steps

• Feature Scaling
• scaler = StandardScaler()
• X_train_scaled = scaler.fit_transform(X_train)
• X_test_scaled = scaler.transform(X_test)
KNN- Implementation Steps

• Train the Classifier


• knn = KNeighborsClassifier(n_neighbors=5)
• knn.fit(X_train_scaled, y_train)
KNN- Implementation Steps

• Train the Classifier


• knn = KNeighborsClassifier(n_neighbors=5)
• knn.fit(X_train_scaled, y_train)

• Make Prediction
• y_pred = knn.predict(X_test_scaled)
KNN- Implementation Steps

• Model Accuracy
• accuracy = accuracy_score(y_test, y_pred)
• print("Model Accuracy:", accuracy)
KNN- Implementation Steps

• Confusion matrix
• conf_matrix = confusion_matrix(y_test, y_pred)
• print("Confusion Matrix:")
• print(conf_matrix)
KNN- Implementation Steps

• Visualize Confusion Matrix

• plt.figure(figsize=(8, 6))
• plt.imshow(conf_matrix, cmap=plt.cm.Blues)
• plt.title('Confusion Matrix')
• plt.colorbar()
• plt.xlabel('Predicted Label')
• plt.ylabel('True Label')
• plt.xticks([0, 1], ['Negative', 'Positive'])
• plt.yticks([0, 1], ['Negative', 'Positive'])
• plt.show()
KNN- Implementation Steps

• Visualize using Scatter Plot

# Plotting the scatter plot
• plt.figure(figsize=(8, 6))

# Plotting points of class 0 (using the first two features of the test set)
• plt.scatter(X_test[y_test == 0][:, 0], X_test[y_test == 0][:, 1], c='blue', label='Class 0')

# Plotting points of class 1
• plt.scatter(X_test[y_test == 1][:, 0], X_test[y_test == 1][:, 1], c='red', label='Class 1')

• plt.xlabel('Feature 1')
• plt.ylabel('Feature 2')
• plt.title('Diabetes Dataset - Scatter Plot')
• plt.legend()
• plt.show()
