Unit 1b
Pre-processing
Basic NLP Pipeline
NLP uses Language Processing Pipelines to read, decipher, and understand human languages.
Extended NLP pipeline
spaCy Data Processing Pipeline
Tokenization
• Tokenization is the process of breaking raw text into small chunks, such as
words or sentences, called tokens. These tokens help in understanding the
context and in developing models for NLP. Tokenization helps in interpreting
the meaning of the text by analyzing the sequence of words.
• For example, the text “It is raining” can be tokenized into ‘It’, ‘is’, ‘raining’
• There are different methods and libraries available to perform tokenization.
NLTK, Gensim, and Keras are some of the libraries that can be used to
accomplish the task.
• Stop words are those words in the text which do not add any meaning to
the sentence, and their removal will not affect the processing of the text for
the defined purpose. They are removed from the vocabulary to reduce noise
and to reduce the dimensionality of the feature set.
Various Tokenization Techniques
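One such technique can be sketched with a regular expression. This is a simplified stand-in for library tokenizers (e.g. NLTK's `word_tokenize` or spaCy's tokenizer), which handle contractions, abbreviations, and language-specific rules far more robustly:

```python
import re

def tokenize(text):
    # Match runs of word characters, or any single non-space,
    # non-word character (so punctuation becomes its own token).
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize("It is raining"))   # ['It', 'is', 'raining']
print(tokenize("Hello, world!"))   # ['Hello', ',', 'world', '!']
```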
Stop words removal
• The words which are generally filtered out before processing a natural
language are called stop words. These are actually the most common
words in any language (like articles, prepositions, pronouns,
conjunctions, etc.) and do not add much information to the text.
Examples of a few stop words in English are “the”, “a”, “an”, “so”, and
“what”.
• Many libraries are available to carry this out.
We can remove stop words while performing
the following tasks:
• Text Classification
• Spam Filtering
• Language Classification
• Genre Classification
• Caption Generation
• Auto-Tag Generation
Remove Stop Words using spaCy
Stop Word Removal using NLTK
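With either library, stop-word removal reduces to a set-membership filter over the tokens. The sketch below uses a small hand-picked stop list for illustration; in practice the list would come from `nltk.corpus.stopwords.words('english')` or spaCy's `nlp.Defaults.stop_words`:

```python
# Tiny illustrative stop list; real lists from NLTK/spaCy contain
# a few hundred entries.
STOP_WORDS = {"the", "a", "an", "is", "on", "so", "what", "are", "not"}

def remove_stop_words(tokens):
    # Keep only tokens that are not in the stop list (case-insensitive).
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(remove_stop_words(["The", "cat", "is", "on", "the", "mat"]))
# ['cat', 'mat']
```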
Avoid Stop word Removal
• Machine Translation
• Language Modeling
• Text Summarization
• Question-Answering problems
Text Normalization
• When we normalize text, we attempt to reduce its randomness,
bringing it closer to a predefined “standard”. This helps us to reduce
the amount of different information that the computer has to deal
with, and therefore improves efficiency. The goal of normalization
techniques like stemming and lemmatization is to reduce inflectional
forms and sometimes derivationally related forms of a word to a
common base form.
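Before stemming or lemmatization, normalization usually starts with simpler steps such as case-folding and punctuation removal. A minimal sketch:

```python
import re

def normalize(text):
    # Case-fold, then strip punctuation: two simple normalization
    # steps that reduce surface variation in the text.
    text = text.lower()
    return re.sub(r"[^\w\s]", "", text)

print(normalize("It's RAINING!"))  # "its raining"
```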
Stemming
• We use Stemming to remove suffixes from words and end up with a so-called word stem. The
words “likes”, “likely” and “liked”, for example, all result in their common word stem “like” which
can be used as a synonym for all three words. That way, an NLP model can learn that all three
words are somehow similar and are used in a similar context.
• Stemming lets us standardize words to their base stem irrespective of their inflections, which
helps many applications like clustering or classifying text. Search engines use these techniques
extensively to give better results irrespective of the word form. Before Google implemented word
stemming in 2003, a search for “fish” did not include websites on fishes or fishing.
• Over-stemming: where a much larger part of a word is chopped off than required, which
in turn leads to words being incorrectly reduced to the same root word or stem when they
should have been reduced to different stems. For example, the words “university” and
“universe” both get reduced to “univers”.
• Under-stemming: occurs when two or more words are wrongly reduced to more than one
root word when they actually should be reduced to the same root word. For example, the
words “data” and “datum” get reduced to “dat” and “datu” respectively (instead of the same
stem “dat”).
Stemming is an elementary rule-based process for removing
inflectional forms from a given token. The output is the
stem of the word. For example, “laughing”, “laughed”, “laughs”, and
“laugh” will all become “laugh” after the stemming process.
Stemming is not always a good process for normalization, since it can produce
non-meaningful words which are not present in the dictionary. Consider the
sentence “His teams are not winning”. After stemming we get “Hi team are not
winn”. Notice that the keyword “winn” is not a regular word; also, “hi” has changed
the context of the entire sentence.
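A toy suffix-stripping stemmer makes both the idea and its failure modes concrete. This is a deliberately crude sketch, not a real algorithm like NLTK's `PorterStemmer`, which applies much more careful rules:

```python
# Strip the first matching suffix, keeping at least 3 characters of stem.
SUFFIXES = ["ing", "ed", "s"]

def stem(word):
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["laughing", "laughed", "laughs", "laugh"]:
    print(w, "->", stem(w))   # all four map to "laugh"

# The crude rules also reproduce the non-word "winn" from the text:
print(stem("winning"))        # "winn"
```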
Lemmatization
• Unlike stemming, lemmatization reduces words to their base word, reducing the
inflected words properly and ensuring that the root word belongs to the
language. It is usually more sophisticated than stemming, since stemmers work
on an individual word without knowledge of the context. In lemmatization, the
root word is called a lemma. A lemma is the canonical form, dictionary form, or
citation form of a set of words.
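Because a lemma must be a real dictionary word, lemmatization needs a vocabulary lookup rather than suffix rules. The sketch below uses a tiny hand-made lookup table for illustration; real lemmatizers (e.g. NLTK's `WordNetLemmatizer` or spaCy's `token.lemma_`) consult a full vocabulary plus the word's part of speech:

```python
# Toy lemma dictionary; a real lemmatizer covers the whole language
# and disambiguates using part-of-speech tags.
LEMMAS = {"are": "be", "is": "be", "winning": "win",
          "teams": "team", "better": "good"}

def lemmatize(word):
    # Fall back to the lowercased word when no lemma is known.
    return LEMMAS.get(word.lower(), word.lower())

print([lemmatize(w) for w in "His teams are not winning".split()])
# ['his', 'team', 'be', 'not', 'win']
```

Note how, unlike the stemmed “winn”, every output here is a valid English word.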
Tags in spaCy