
NLP PIPELINE

Introduction
AI and NLP are deeply interrelated because Natural Language
Processing (NLP) is a crucial subfield of Artificial Intelligence
(AI) that enables machines to understand, interpret, and
generate human language, thus facilitating human-computer
interaction.
What is Artificial Intelligence (AI)?
Definition: AI enables computers to mimic human
intelligence.
AI can reason, learn, and make decisions.
Examples: Chatbots, autonomous vehicles, recommendation
systems.
What is Machine Learning (ML)?
Definition: ML is a subfield of AI in which systems learn patterns from data rather than being explicitly programmed.
Examples: Spam filters, recommendation systems, image classifiers.
What is Natural Language Processing (NLP)?
A branch of AI that enables computers to understand and
interpret human language.
Key tasks: Sentiment analysis, automatic text summarization,
speech recognition.
Examples: Google Translate, virtual assistants (Siri, Alexa),
chatbots.
Introduction to NLP
NLP enables machines to understand and process human
language.
It has evolved from rule-based systems to AI-driven
technologies.
NLP impacts industries like healthcare, finance, and customer
service.
The Genesis of NLP (1950s – 1960s)
• Alan Turing's 1950 paper introduced the Turing Test.
• 1954: Georgetown-IBM experiment – early machine translation.
• 1960s: Development of rule-based systems like SHRDLU and ELIZA.
Rise of Statistical Methods (1980s – 2000s)
• Shift from rule-based to statistical models.
• Use of N-grams, Hidden Markov Models (HMMs), and Support Vector Machines (SVMs).
• 2006: Google Translate launched, showcasing NLP's commercial success.
The Era of Machine Learning (2000s – 2010s)
• Introduction of Word2Vec (2013) for word embeddings.
• Development of sequence-to-sequence models (2014).
• 2017: Introduction of the Transformer model revolutionized NLP.

Current State & Innovations
• BERT and Transformers improving contextual understanding.
• Chatbots and virtual assistants (e.g., ChatGPT).
• Speech recognition and multilingual language models.

Future Directions & Challenges
APPLICATIONS OF NLP
HOW NLP WORKS
Key Techniques in NLP
NLP FRAMEWORKS
Natural Language Toolkit (NLTK)
• The Natural Language Toolkit (NLTK) is the most widely used platform for building Python programs that work with human language. It simplifies the various text preprocessing steps and provides a wide range of algorithms for Natural Language Processing. We can install NLTK as follows:
INSTALLATION
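Assuming a standard Python environment with pip available; the nltk.download calls fetch the resources used by the examples in the rest of this deck:

    pip install nltk

    # Then, inside Python, download the resources used later:
    import nltk
    nltk.download("punkt")                        # tokenizer models
    nltk.download("stopwords")                    # stop word lists
    nltk.download("wordnet")                      # lemmatizer dictionary
    nltk.download("averaged_perceptron_tagger")   # POS tagger
    nltk.download("maxent_ne_chunker")            # named entity chunker
    nltk.download("words")                        # word list used by the chunker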
BUILD AN NLP PIPELINE
• In Natural Language Processing (NLP), an NLP pipeline is a sequence of interconnected steps that systematically transforms raw text data into a desired output suitable for further analysis or application. It is analogous to a factory assembly line, where each step refines the material until it reaches its final form.
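A minimal sketch of such a pipeline in Python, assuming NLTK is installed; the particular steps chosen here (lowercasing, punctuation removal, stop word removal, stemming) are one reasonable configuration, not the only one:

    import string
    from nltk.tokenize import word_tokenize
    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer

    def preprocess(text):
        # Lowercase and strip punctuation
        text = text.lower().translate(str.maketrans("", "", string.punctuation))
        # Break the cleaned text into word tokens
        tokens = word_tokenize(text)
        # Drop stop words, then stem what remains
        stop = set(stopwords.words("english"))
        stemmer = PorterStemmer()
        return [stemmer.stem(t) for t in tokens if t not in stop]

    print(preprocess("The strikers were running faster than expected!"))
    # ['striker', 'run', 'faster', 'expect']  (may vary slightly by NLTK version)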
Key stages of NLP pipeline
Data Acquisition
• Collecting raw text data
• Scenarios:
• Data already available (on disk, in a database)
• Data from external sources (web scraping, APIs)
• No data available: collect data from clients or generate synthetic data
Text Preprocessing - Basic Cleaning
• Remove HTML tags
• Handle emojis
• Spell checking
Text Preprocessing - Basic Steps
• Tokenization
• Stop Word Removal
• Stemming/Lemmatization
• Lowercasing
• Language Detection
TOKENIZATION
• Word Tokenization: This is the first step in any NLP process that uses text data. Tokenization is a mandatory step, which simplifies things for our machine learning model. It is the process of breaking down a piece of text into individual components or smaller units called tokens. The ultimate goal of tokenization is to process the raw text data and create a vocabulary from it.
Text PRE-PROCESSING

WORD TOKENIZATION
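A minimal example with NLTK's word_tokenize (assuming the punkt resource from the installation step):

    from nltk.tokenize import word_tokenize

    text = "NLP enables machines to understand human language."
    tokens = word_tokenize(text)
    print(tokens)
    # ['NLP', 'enables', 'machines', 'to', 'understand', 'human', 'language', '.']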
LOWERCASING
• Lowercasing: This step reduces complexity. We convert the text data into the same case, preferably lowercase, so that we don't have to work with both cases.
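In Python this is a one-line operation on the raw string:

    text = "The Quick Brown FOX"
    print(text.lower())   # 'the quick brown fox'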
Punctuation removal
• Punctuation removal: In this step, all punctuation marks present in the text are removed.
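A common approach is to strip the punctuation characters defined in Python's standard string module:

    import string

    text = "Hello, world! NLP's pipeline: step #2."
    clean = text.translate(str.maketrans("", "", string.punctuation))
    print(clean)   # 'Hello world NLPs pipeline step 2'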
Stop word removal
• Stop word removal: The most commonly used words are called stopwords. They contribute very little to predictions and add little analytical value. Hence, removing stopwords makes it easier for our models to train on the text data. We can use the gensim library in Python to remove stopwords, as sketched below.
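A minimal sketch using gensim's built-in stop word list, as the slide suggests (NLTK's stopwords corpus is a common alternative):

    from gensim.parsing.preprocessing import remove_stopwords

    text = "the quick brown fox jumps over the lazy dog"
    print(remove_stopwords(text))
    # 'quick brown fox jumps lazy dog'  (exact output depends on gensim's stop word list)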
Stemming
• Stemming: Stemming, or text standardization, converts each word into its root/base form. For example, the word "faster" will change into "fast". The drawback of stemming is that it ignores the semantic meaning of words and simply strips words of their prefixes and suffixes. The word "laziness" will be converted to "lazi" and not "lazy".
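A sketch using NLTK's PorterStemmer, one of several available stemmers (output differs slightly between stemmers):

    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()
    for word in ["running", "laziness", "studies"]:
        print(word, "->", stemmer.stem(word))
    # running -> run
    # laziness -> lazi
    # studies -> studi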
Lemmatization
• Lemmatization: This process overcomes the drawback of stemming and makes sure that a word does not lose its meaning. Lemmatization uses a predefined dictionary to store the context of words and reduces the words in the data based on their context.
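A sketch using NLTK's WordNetLemmatizer (assuming the wordnet resource is downloaded; the pos argument supplies the word's part of speech):

    from nltk.stem import WordNetLemmatizer

    lemmatizer = WordNetLemmatizer()
    print(lemmatizer.lemmatize("laziness", pos="n"))  # 'laziness'
    print(lemmatizer.lemmatize("better", pos="a"))    # 'good'
    print(lemmatizer.lemmatize("running", pos="v"))   # 'run'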
TWEET PRE-PROCESSING
• The raw data is pre-processed using enhanced text pre-processing NLP techniques.
• In this pre-processing step, operations such as text tokenization, stop word removal, hashtag removal, POS tagging, stemming, and lemmatization are used.
• Emoticons in the data are converted to text.
Real Time Tweet Data

TOKENIZATION
• Tokenization breaks the raw text into small chunks, such as words or sentences, called tokens.
N-gram tokenization
• Unigram tokenization splits text into single tokens.
• Bigram tokenization splits text into sequences of two adjacent tokens.
• N-gram tokenization splits text into sequences of n adjacent tokens, as sketched below.
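A sketch using nltk.ngrams on a tokenized sentence:

    from nltk import ngrams
    from nltk.tokenize import word_tokenize

    tokens = word_tokenize("I love natural language processing")
    print(list(ngrams(tokens, 2)))   # bigrams
    # [('I', 'love'), ('love', 'natural'), ('natural', 'language'), ('language', 'processing')]
    print(list(ngrams(tokens, 3)))   # trigrams
    # [('I', 'love', 'natural'), ('love', 'natural', 'language'), ('natural', 'language', 'processing')]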
Text Tokenization

Stop word removal
• Articles and pronouns are generally classified as stop words. These words have no significance in some NLP tasks.
• Stop words are removed from the tweet data to give better accuracy.
PART OF SPEECH TAGGING
• Transformation-based part-of-speech tagging is used to tag the tweet data set.
• Known words are tagged based on the lexicon, and unknown words are tagged based on the most frequent tag in the training corpus.
• Lexical tagging is based on the vocabulary.
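A sketch using NLTK's default pos_tag (an averaged perceptron tagger rather than the transformation-based tagger described above):

    from nltk import pos_tag
    from nltk.tokenize import word_tokenize

    tokens = word_tokenize("The team launched the new model yesterday")
    print(pos_tag(tokens))
    # [('The', 'DT'), ('team', 'NN'), ('launched', 'VBD'), ('the', 'DT'),
    #  ('new', 'JJ'), ('model', 'NN'), ...]  (tags may vary by NLTK version)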
POS TAGGING OF TWEET DATA

STEMMING AND LEMMATIZATION
• Stemming is a process that stems, or removes, the last few characters from a word, often leading to incorrect meanings and spellings.
• Lemmatization considers the context and converts the word to its meaningful base form, which is called the lemma.
STEMMING OF THE TWEET

LEMMATIZATION OF THE TWEET
Named entity recognition
• Named Entity Recognition (NER) is a valuable natural language processing (NLP) technique that can be used for fake news detection. NER helps identify and classify named entities in text, such as names of people, organizations, locations, dates, and more.
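A sketch using NLTK's ne_chunk, which chunks named entities from POS-tagged tokens (assuming the maxent_ne_chunker and words resources from the installation step; spaCy is a common alternative):

    from nltk import ne_chunk, pos_tag, word_tokenize

    sentence = "Barack Obama visited Google in California"
    tree = ne_chunk(pos_tag(word_tokenize(sentence)))
    print(tree)
    # Entities such as PERSON, ORGANIZATION, and GPE (location)
    # appear as labelled subtrees in the printed parse tree.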
Output of Pre-Processed Tweet
Word cloud of Pre-processed tweet
Feature engineering
• Feature engineering in Natural Language Processing (NLP) involves transforming raw text data into numerical features that machine learning models can comprehend and utilize effectively. The goal is to represent text in a format that captures semantic meaning, contextual information, and relationships between words.
TEXT VECTORIZATION
Text vectorization is the process of converting text data into
numerical representations (vectors) that machine learning
algorithms can understand, enabling computers to process
and analyze text.
Feature Extraction
• (i) Bag of Words (BoW)
• (ii) Term Frequency-Inverse Document Frequency (TF-IDF)
• (iii) One-Hot Encoding
• (iv) Word Embeddings (Word2Vec, GloVe, FastText)
• (v) N-Gram Models
Bag of Words (BoW)
• The bag-of-words model is a simple way to convert words to a numerical representation by conceptualizing a document as a "bag" of words and noting the frequency of each word. Documents can then be embedded and fed into machine learning algorithms.
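A sketch using scikit-learn's CountVectorizer on two toy documents (assuming scikit-learn is installed):

    from sklearn.feature_extraction.text import CountVectorizer

    docs = ["the cat sat on the mat", "the dog sat on the log"]
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(docs)
    print(vectorizer.get_feature_names_out())
    # ['cat' 'dog' 'log' 'mat' 'on' 'sat' 'the']
    print(X.toarray())   # each row is a document's word-count vector
    # [[1 0 0 1 1 1 2]
    #  [0 1 1 0 1 1 2]]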
TF-IDF
• Term Frequency (TF): Measures how often a term appears in a document.
• Inverse Document Frequency (IDF): Weighs down frequent terms and increases the weight of rare terms across the corpus.
• TF-IDF Calculation: The TF-IDF score is the product of TF and IDF: tf-idf(t, d) = tf(t, d) × log(N / df(t)), where N is the number of documents in the corpus and df(t) is the number of documents containing term t.
• Purpose: TF-IDF helps in understanding the relevance of words in a document and is used in various NLP applications, including text classification, document similarity, and search engine optimization.
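A sketch using scikit-learn's TfidfVectorizer; note that scikit-learn uses a smoothed IDF and L2 normalization by default, so its scores differ slightly from the plain formula above:

    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = ["the cat sat on the mat", "the dog sat on the log"]
    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(docs)
    # Words unique to one document ('cat', 'mat') score higher than
    # words shared by every document ('the', 'sat', 'on').
    print(dict(zip(vectorizer.get_feature_names_out(), X.toarray()[0].round(2))))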
One-Hot Encoding
• One-hot encoding is a technique used to convert categorical data into a binary format, where each category is represented by a separate column with a 1 indicating its presence and 0s for all other categories.
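A minimal pure-Python sketch against a tiny, hypothetical vocabulary (real systems use library encoders such as scikit-learn's OneHotEncoder):

    vocab = ["cat", "dog", "mat"]

    def one_hot(word):
        # A vector of zeros with a single 1 at the word's vocabulary index
        return [1 if w == word else 0 for w in vocab]

    print(one_hot("dog"))  # [0, 1, 0]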
VECTOR REPRESENTATION
NLP LANGUAGE MODEL
• Building language models in NLP involves using probabilistic models to predict the likelihood of a word sequence in a sentence based on previous words. These models are key to tasks like predictive text, speech recognition, machine translation, and spelling correction.
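A minimal bigram model sketch on a toy corpus, estimating P(next word | current word) from raw counts (no smoothing, purely for illustration):

    from collections import Counter

    corpus = "i love nlp i love python i like nlp".split()

    # Count each bigram (w1, w2) and each context word w1
    bigrams = Counter(zip(corpus, corpus[1:]))
    unigrams = Counter(corpus[:-1])

    # P(w2 | w1) = count(w1, w2) / count(w1)
    def prob(w1, w2):
        return bigrams[(w1, w2)] / unigrams[w1]

    print(prob("i", "love"))    # 2/3, since 'i' is followed by 'love' twice out of three times
    print(prob("love", "nlp"))  # 1/2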
EVALUATION METRIC
• Evaluating the performance of Natural Language Processing (NLP) models is crucial for understanding their strengths and weaknesses, guiding further development, and ensuring they meet the intended goals. Common metrics include:
• Accuracy
• Precision
• Recall
• F1-Score
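A sketch computing all four metrics with scikit-learn; the y_true and y_pred label vectors below are hypothetical values for illustration:

    from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

    y_true = [1, 0, 1, 1, 0, 1]   # gold labels (hypothetical)
    y_pred = [1, 0, 0, 1, 0, 1]   # model predictions (hypothetical)

    print("Accuracy :", accuracy_score(y_true, y_pred))   # 0.833
    print("Precision:", precision_score(y_true, y_pred))  # 1.0
    print("Recall   :", recall_score(y_true, y_pred))     # 0.75
    print("F1-score :", f1_score(y_true, y_pred))         # 0.857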
DEPLOYMENT
• Deploying a Natural Language Processing (NLP) model as an API involves creating a web service that allows users to send text data to the model and receive predictions, often using frameworks like FastAPI or Flask, and potentially containerizing the model with Docker.
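A minimal FastAPI sketch, assuming some previously trained model object with a predict method; the model file name and the placeholder label here are hypothetical:

    from fastapi import FastAPI
    from pydantic import BaseModel
    # import joblib  # hypothetical: load a saved scikit-learn pipeline

    app = FastAPI()
    # model = joblib.load("sentiment_model.joblib")  # hypothetical saved model

    class Request(BaseModel):
        text: str

    @app.post("/predict")
    def predict(req: Request):
        # label = model.predict([req.text])[0]  # real prediction would go here
        label = "positive"  # placeholder so the sketch runs standalone
        return {"text": req.text, "prediction": label}

    # Run with: uvicorn app:app --reload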
NLP FOR TEACHING
• Teachers can establish strong connections with their students via various NLP techniques like reading non-verbal cues, mirroring and matching, and active listening skills. This also helps students understand their mentors better and helps create a trusting environment.
THANK YOU
