
NLP PIPELINE

Introduction
AI and NLP are deeply interrelated because Natural Language
Processing (NLP) is a crucial subfield of Artificial Intelligence
(AI) that enables machines to understand, interpret, and
generate human language, thus facilitating human-computer
interaction.
What is Artificial Intelligence (AI)?
Definition: AI enables computers to mimic human
intelligence.
AI can reason, learn, and make decisions.
Examples: Chatbots, autonomous vehicles, recommendation
systems.
What is Machine Learning (ML)?
Definition: ML is a subfield of AI in which systems learn patterns from data rather than being explicitly programmed.
Examples: Spam filters, recommendation systems, image classifiers.
What is Natural Language Processing (NLP)?
A branch of AI that enables computers to understand and
interpret human language.
Key tasks: Sentiment analysis, automatic text summarization,
speech recognition.
Examples: Google Translate, virtual assistants (Siri, Alexa),
chatbots.
Introduction to NLP
NLP enables machines to understand and process human
language.
It has evolved from rule-based systems to AI-driven
technologies.
NLP impacts industries like healthcare, finance, and customer
service.
The Genesis of NLP (1950s – 1960s)
• Alan Turing's 1950 paper introduced the Turing Test.
• 1954: Georgetown-IBM experiment – early machine translation.
• 1960s: Development of rule-based systems like SHRDLU and ELIZA.
Rise of Statistical Methods (1980s – 2000s)
• Shift from rule-based to statistical models.
• Use of N-grams, Hidden Markov Models (HMMs), and Support Vector Machines (SVMs).
• 2006: Google Translate launched, showcasing NLP's commercial success.
The Era of Machine Learning (2000s – 2010s)
• Introduction of Word2Vec (2013) for word embeddings.
• Development of sequence-to-sequence models (2014).
• 2017: Introduction of the Transformer model revolutionized NLP.

Current State & Innovations
• BERT and Transformers improving contextual understanding.
• Chatbots and virtual assistants (e.g., ChatGPT).
• Speech recognition and multilingual language models.

Future Directions & Challenges
APPLICATIONS OF NLP
HOW NLP WORKS
Key Techniques in NLP
NLP FRAMEWORKS
Natural Language Toolkit (NLTK)
• The Natural Language Toolkit (NLTK) is the most widely used platform for building Python programs that work with human language. It simplifies the various text preprocessing steps and provides a wide range of algorithms for Natural Language Processing. We can install NLTK as follows:
INSTALLATION
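Assuming a standard Python environment with pip available; the nltk.download calls fetch the resources used by the examples in the rest of this deck:

    pip install nltk

    # Then, inside Python, download the resources used later:
    import nltk
    nltk.download("punkt")                        # tokenizer models
    nltk.download("stopwords")                    # stop word lists
    nltk.download("wordnet")                      # lemmatizer dictionary
    nltk.download("averaged_perceptron_tagger")   # POS tagger
    nltk.download("maxent_ne_chunker")            # named entity chunker
    nltk.download("words")                        # word list used by the chunker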
BUILD AN NLP PIPELINE
• In Natural Language Processing (NLP), an NLP pipeline is a sequence of interconnected steps that systematically transforms raw text data into a desired output suitable for further analysis or application. It is analogous to a factory assembly line, where each step refines the material until it reaches its final form.
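A minimal sketch of such a pipeline in Python, assuming NLTK is installed; the particular steps chosen here (lowercasing, punctuation removal, stop word removal, stemming) are one reasonable configuration, not the only one:

    import string
    from nltk.tokenize import word_tokenize
    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer

    def preprocess(text):
        # Lowercase and strip punctuation
        text = text.lower().translate(str.maketrans("", "", string.punctuation))
        # Break the cleaned text into word tokens
        tokens = word_tokenize(text)
        # Drop stop words, then stem what remains
        stop = set(stopwords.words("english"))
        stemmer = PorterStemmer()
        return [stemmer.stem(t) for t in tokens if t not in stop]

    print(preprocess("The strikers were running faster than expected!"))
    # ['striker', 'run', 'faster', 'expect']  (may vary slightly by NLTK version)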
Key stages of NLP pipeline
Data Acquisition
• Collecting raw text data
• Scenarios:
• Data already available (on disk, in a database)
• Data from external sources (web scraping, APIs)
• No data available: collect data from clients or generate synthetic data
Text Preprocessing - Basic Cleaning
• Remove HTML tags
• Handle emojis
• Spell checking
Text Preprocessing - Basic Steps
• Tokenization
• Stop Word Removal
• Stemming/Lemmatization
• Lowercasing
• Language Detection
TOKENIZATION
• Word Tokenization: This is the first step in any NLP process that uses text data. Tokenization is a mandatory step, which simplifies things for our machine learning model. It is the process of breaking down a piece of text into individual components or smaller units called tokens. The ultimate goal of tokenization is to process the raw text data and create a vocabulary from it.
Text PRE-PROCESSING

WORD TOKENIZATION
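A minimal example with NLTK's word_tokenize (assuming the punkt resource from the installation step):

    from nltk.tokenize import word_tokenize

    text = "NLP enables machines to understand human language."
    tokens = word_tokenize(text)
    print(tokens)
    # ['NLP', 'enables', 'machines', 'to', 'understand', 'human', 'language', '.']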
LOWERCASING
• Lowercasing: This step reduces complexity. We convert the text data into the same case, preferably lowercase, so that we don't have to work with both cases.
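In Python this is a one-line operation on the raw string:

    text = "The Quick Brown FOX"
    print(text.lower())   # 'the quick brown fox'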
Punctuation removal
• Punctuation removal: In this step, all punctuation marks present in the text are removed.
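A common approach is to strip the punctuation characters defined in Python's standard string module:

    import string

    text = "Hello, world! NLP's pipeline: step #2."
    clean = text.translate(str.maketrans("", "", string.punctuation))
    print(clean)   # 'Hello world NLPs pipeline step 2'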
Stop word removal
• Stop word removal: The most commonly used words are called stopwords. They contribute very little to predictions and add little analytical value. Hence, removing stopwords makes it easier for our models to train on the text data. We can use the gensim library in Python to remove stopwords, as sketched below.
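A minimal sketch using gensim's built-in stop word list, as the slide suggests (NLTK's stopwords corpus is a common alternative):

    from gensim.parsing.preprocessing import remove_stopwords

    text = "the quick brown fox jumps over the lazy dog"
    print(remove_stopwords(text))
    # 'quick brown fox jumps lazy dog'  (exact output depends on gensim's stop word list)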
Stemming
• Stemming: Stemming, or text standardization, converts each word into its root/base form. For example, the word "faster" will change into "fast". The drawback of stemming is that it ignores the semantic meaning of words and simply strips words of their prefixes and suffixes. The word "laziness" will be converted to "lazi" and not "lazy".
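A sketch using NLTK's PorterStemmer, one of several available stemmers (output differs slightly between stemmers):

    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()
    for word in ["running", "laziness", "studies"]:
        print(word, "->", stemmer.stem(word))
    # running -> run
    # laziness -> lazi
    # studies -> studi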
Lemmatization
• Lemmatization: This process overcomes the drawback of stemming and makes sure that a word does not lose its meaning. Lemmatization uses a predefined dictionary to store the context of words and reduces the words in the data based on their context.
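A sketch using NLTK's WordNetLemmatizer (assuming the wordnet resource is downloaded; the pos argument supplies the word's part of speech):

    from nltk.stem import WordNetLemmatizer

    lemmatizer = WordNetLemmatizer()
    print(lemmatizer.lemmatize("laziness", pos="n"))  # 'laziness'
    print(lemmatizer.lemmatize("better", pos="a"))    # 'good'
    print(lemmatizer.lemmatize("running", pos="v"))   # 'run'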
TWEET PRE-PROCESSING
• The raw data is pre-processed using enhanced text pre-processing NLP techniques.
• In this pre-processing step, operations such as text tokenization, stop word removal, hashtag removal, POS tagging, stemming, and lemmatization are used.
• Emoticons in the data are converted to text.
Real Time Tweet Data

TOKENIZATION
• Tokenization breaks the raw text into small chunks, such as words or sentences, called tokens.
N-gram tokenization
• Unigram tokenization splits text into single tokens.
• Bigram tokenization splits text into sequences of two adjacent tokens.
• N-gram tokenization splits text into sequences of n adjacent tokens, as sketched below.
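A sketch using nltk.ngrams on a tokenized sentence:

    from nltk import ngrams
    from nltk.tokenize import word_tokenize

    tokens = word_tokenize("I love natural language processing")
    print(list(ngrams(tokens, 2)))   # bigrams
    # [('I', 'love'), ('love', 'natural'), ('natural', 'language'), ('language', 'processing')]
    print(list(ngrams(tokens, 3)))   # trigrams
    # [('I', 'love', 'natural'), ('love', 'natural', 'language'), ('natural', 'language', 'processing')]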
Text Tokenization

Stop word removal
• Articles and pronouns are generally classified as stop words. These words have no significance in some NLP tasks.
• Stop words are removed from the tweet data to give better accuracy.
PART OF SPEECH TAGGING
• Transformation-based part-of-speech tagging is used to tag the tweet data set.
• Known words are tagged based on the lexicon, and unknown words are tagged based on the most frequent tag in the training corpus.
• Lexical tagging is based on the vocabulary.
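A sketch using NLTK's default pos_tag (an averaged perceptron tagger rather than the transformation-based tagger described above):

    from nltk import pos_tag
    from nltk.tokenize import word_tokenize

    tokens = word_tokenize("The team launched the new model yesterday")
    print(pos_tag(tokens))
    # [('The', 'DT'), ('team', 'NN'), ('launched', 'VBD'), ('the', 'DT'),
    #  ('new', 'JJ'), ('model', 'NN'), ...]  (tags may vary by NLTK version)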
POS TAGGING OF TWEET DATA

STEMMING AND LEMMATIZATION
• Stemming is a process that stems, or removes, the last few characters from a word, often leading to incorrect meanings and spellings.
• Lemmatization considers the context and converts the word to its meaningful base form, which is called the lemma.
STEMMING OF THE TWEET

LEMMATIZATION OF THE TWEET
Named entity recognition
• Named Entity Recognition (NER) is a valuable natural language processing (NLP) technique that can be used for fake news detection. NER helps identify and classify named entities in text, such as names of people, organizations, locations, dates, and more.
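A sketch using NLTK's ne_chunk, which chunks named entities from POS-tagged tokens (assuming the maxent_ne_chunker and words resources from the installation step; spaCy is a common alternative):

    from nltk import ne_chunk, pos_tag, word_tokenize

    sentence = "Barack Obama visited Google in California"
    tree = ne_chunk(pos_tag(word_tokenize(sentence)))
    print(tree)
    # Entities such as PERSON, ORGANIZATION, and GPE (location)
    # appear as labelled subtrees in the printed parse tree.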
Output of Pre-Processed Tweet
Word cloud of Pre-processed tweet
Feature engineering
• Feature engineering in Natural Language Processing (NLP) involves transforming raw text data into numerical features that machine learning models can comprehend and utilize effectively. The goal is to represent text in a format that captures semantic meaning, contextual information, and relationships between words.
TEXT VECTORIZATION
Text vectorization is the process of converting text data into
numerical representations (vectors) that machine learning
algorithms can understand, enabling computers to process
and analyze text.
Feature Extraction
• (i) Bag of Words (BoW)
• (ii) Term Frequency-Inverse Document Frequency (TF-IDF)
• (iii) One-Hot Encoding
• (iv) Word Embeddings (Word2Vec, GloVe, FastText)
• (v) N-Gram Models
Bag of Words (BoW)
• The bag-of-words model is a simple way to convert words to a numerical representation by conceptualizing a document as a "bag" of words and noting the frequency of each word. Documents can then be embedded and fed into machine learning algorithms.
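A sketch using scikit-learn's CountVectorizer on two toy documents (assuming scikit-learn is installed):

    from sklearn.feature_extraction.text import CountVectorizer

    docs = ["the cat sat on the mat", "the dog sat on the log"]
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(docs)
    print(vectorizer.get_feature_names_out())
    # ['cat' 'dog' 'log' 'mat' 'on' 'sat' 'the']
    print(X.toarray())   # each row is a document's word-count vector
    # [[1 0 0 1 1 1 2]
    #  [0 1 1 0 1 1 2]]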
TF-IDF
• Term Frequency (TF): Measures how often a term appears in a document.
• Inverse Document Frequency (IDF): Weighs down frequent terms and increases the weight of rare terms across the corpus.
• TF-IDF Calculation: The TF-IDF score is the product of TF and IDF: tf-idf(t, d) = tf(t, d) × log(N / df(t)), where N is the number of documents in the corpus and df(t) is the number of documents containing term t.
• Purpose: TF-IDF helps in understanding the relevance of words in a document and is used in various NLP applications, including text classification, document similarity, and search engine optimization.
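A sketch using scikit-learn's TfidfVectorizer; note that scikit-learn uses a smoothed IDF and L2 normalization by default, so its scores differ slightly from the plain formula above:

    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = ["the cat sat on the mat", "the dog sat on the log"]
    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(docs)
    # Words unique to one document ('cat', 'mat') score higher than
    # words shared by every document ('the', 'sat', 'on').
    print(dict(zip(vectorizer.get_feature_names_out(), X.toarray()[0].round(2))))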
One-Hot Encoding
• One-hot encoding is a technique used to convert categorical data into a binary format, where each category is represented by a separate column with a 1 indicating its presence and 0s for all other categories.
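A minimal pure-Python sketch against a tiny, hypothetical vocabulary (real systems use library encoders such as scikit-learn's OneHotEncoder):

    vocab = ["cat", "dog", "mat"]

    def one_hot(word):
        # A vector of zeros with a single 1 at the word's vocabulary index
        return [1 if w == word else 0 for w in vocab]

    print(one_hot("dog"))  # [0, 1, 0]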
VECTOR REPRESENTATION
NLP LANGUAGE MODEL
• Building language models in NLP involves using probabilistic models to predict the likelihood of a word sequence in a sentence based on previous words. These models are key to tasks like predictive text, speech recognition, machine translation, and spelling correction.
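A minimal bigram model sketch on a toy corpus, estimating P(next word | current word) from raw counts (no smoothing, purely for illustration):

    from collections import Counter

    corpus = "i love nlp i love python i like nlp".split()

    # Count each bigram (w1, w2) and each context word w1
    bigrams = Counter(zip(corpus, corpus[1:]))
    unigrams = Counter(corpus[:-1])

    # P(w2 | w1) = count(w1, w2) / count(w1)
    def prob(w1, w2):
        return bigrams[(w1, w2)] / unigrams[w1]

    print(prob("i", "love"))    # 2/3, since 'i' is followed by 'love' twice out of three times
    print(prob("love", "nlp"))  # 1/2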
EVALUATION METRIC
• Evaluating the performance of Natural Language Processing (NLP) models is crucial for understanding their strengths and weaknesses, guiding further development, and ensuring they meet the intended goals. Common metrics include:
• Accuracy
• Precision
• Recall
• F1-Score
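A sketch computing all four metrics with scikit-learn; the y_true and y_pred label vectors below are hypothetical values for illustration:

    from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

    y_true = [1, 0, 1, 1, 0, 1]   # gold labels (hypothetical)
    y_pred = [1, 0, 0, 1, 0, 1]   # model predictions (hypothetical)

    print("Accuracy :", accuracy_score(y_true, y_pred))   # 0.833
    print("Precision:", precision_score(y_true, y_pred))  # 1.0
    print("Recall   :", recall_score(y_true, y_pred))     # 0.75
    print("F1-score :", f1_score(y_true, y_pred))         # 0.857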
DEPLOYMENT
• Deploying a Natural Language Processing (NLP) model as an API involves creating a web service that allows users to send text data to the model and receive predictions, often using frameworks like FastAPI or Flask, and potentially containerizing the model with Docker.
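A minimal FastAPI sketch, assuming some previously trained model object with a predict method; the model file name and the placeholder label here are hypothetical:

    from fastapi import FastAPI
    from pydantic import BaseModel
    # import joblib  # hypothetical: load a saved scikit-learn pipeline

    app = FastAPI()
    # model = joblib.load("sentiment_model.joblib")  # hypothetical saved model

    class Request(BaseModel):
        text: str

    @app.post("/predict")
    def predict(req: Request):
        # label = model.predict([req.text])[0]  # real prediction would go here
        label = "positive"  # placeholder so the sketch runs standalone
        return {"text": req.text, "prediction": label}

    # Run with: uvicorn app:app --reload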
NLP FOR TEACHING
• Teachers can establish strong connections with their students via various NLP techniques like reading non-verbal cues, mirroring and matching, and active listening skills. This also helps students understand their mentors better and helps create a trusting environment.
THANK YOU
