Natural Language Processing (NLP)

Uploaded by

yogini.prabhu

Introduction
• Natural Language Processing (NLP) is a subfield of artificial intelligence and
machine learning concerned with the analysis, generation, and understanding of
human language to derive meaningful insights.
• NLP is growing in popularity as Large Language Models (LLMs) mature and
are widely adopted in the market. Foundational knowledge of NLP concepts
and techniques can help you become an NLP data scientist, NLP engineer,
or distinguished ML engineer and stand out in the job market.
NLP topics to understand
❑Text pre-processing 8 marks (3+2+2)
❑Text feature extraction
❑Text sort (text classification)
❑Named entity recognition
❑Parts-of-speech tagging
❑Text generation
❑Text-to-speech and speech-to-text techniques
Text pre-processing

• Pre-processing techniques such as tokenization, stemming, and
lemmatization help convert raw text into a format that can be easily
analyzed.
Tokenization
• Tokenization is a fundamental step in NLP.
• It is the process of breaking a piece of text into smaller units called tokens,
such as words, subwords, or sentences.
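As a minimal sketch of word-level tokenization, the snippet below splits text with a regular expression from Python's standard library. The function name tokenize and the pattern are illustrative assumptions, not the method any particular library uses; library tokenizers such as NLTK's word_tokenize handle many more edge cases.

```python
import re

def tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    # \w+ matches a run of letters, digits, or underscores;
    # [^\w\s] matches one character that is neither a word character nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text)

tokens = tokenize("The quick brown fox jumps over the lazy dog.")
print(tokens)
```

The final full stop comes out as its own token, which is also how the POS-tagging output later in these notes treats it.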
Stemming
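• Stemming is the process of reducing words to their base or root form, typically
by stripping suffixes with heuristic rules; the resulting stem need not be a
dictionary word. This can be useful in classification or information retrieval tasks.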
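To make the idea concrete, here is a deliberately crude suffix-stripping stemmer. It is a toy under stated assumptions (the suffix list and the minimum stem length of 3 are invented for illustration) and is not the Porter algorithm; in practice one would more likely reach for a library stemmer such as NLTK's PorterStemmer.

```python
# Suffixes are tried in this order; the list is an illustrative assumption.
SUFFIXES = ("ing", "ed", "es", "s")

def crude_stem(word):
    """Strip the first matching suffix, keeping at least three characters of stem."""
    w = word.lower()
    for suffix in SUFFIXES:
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[: -len(suffix)]
    return w

words = ["jumps", "jumped", "jumping", "dogs", "lazy"]
stems = [crude_stem(w) for w in words]
print(stems)
```

Because the rules ignore meaning, a stemmer like this conflates "jumps", "jumped", and "jumping" cheaply, but it will also mangle words whose endings merely look like suffixes.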

Lemmatization
• Lemmatization is the process of reducing a word to its dictionary form, which is
known as the lemma.
• It is a more sophisticated alternative to stemming, as it takes into account the
context and the part of speech of the word.
• The lemmatize() method of WordNetLemmatizer in nltk.stem treats every word as a
noun by default; pass the pos argument (for example pos='v') to lemmatize verbs,
adjectives, or adverbs.
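The role of the part of speech can be sketched with a small lookup table. Everything here is hypothetical: LEMMA_TABLE is a hand-made mini-lexicon and this lemmatize function is not NLTK's implementation; it only mimics the behaviour of defaulting to the noun reading when no part of speech is given.

```python
# Hypothetical mini-lexicon: (word, part of speech) -> lemma
LEMMA_TABLE = {
    ("mice", "n"): "mouse",
    ("geese", "n"): "goose",
    ("running", "v"): "run",
    ("better", "a"): "good",
}

def lemmatize(word, pos="n"):
    """Return the lemma for (word, pos); fall back to the lowercased word."""
    key = (word.lower(), pos)
    return LEMMA_TABLE.get(key, word.lower())

print(lemmatize("mice"))          # noun is the default reading
print(lemmatize("running"))       # read as a noun, so left unchanged
print(lemmatize("running", "v"))  # read as a verb
print(lemmatize("better", "a"))   # read as an adjective
```

The second and third calls show why the part of speech matters: the same surface form maps to different lemmas depending on how it is read.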
Text feature extraction

• Techniques such as vocabulary/bag-of-words, n-grams, count
vectorization, and word embeddings are used to represent text as
numerical features for machine learning models.
• Vocabulary or Bag-of-Words
• Vocabulary in NLP refers to the set of unique words or tokens in a
given text or corpus.
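A vocabulary can be built in a few lines. The sketch below assumes whitespace tokenization and a two-sentence toy corpus; the names corpus, vocabulary, and word_to_index are illustrative.

```python
corpus = ["the cat sat on the mat", "the dog sat on the log"]

# Vocabulary: the set of unique tokens across the corpus, sorted for a stable order
vocabulary = sorted({token for doc in corpus for token in doc.split()})

# Map each word to a column index, as needed later for vectorization
word_to_index = {word: i for i, word in enumerate(vocabulary)}

print(vocabulary)
print(word_to_index["the"])
```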
N-grams

• An n-gram is a contiguous sequence of n items from a given sample of
text or speech, where n can be any positive integer. In NLP, n-grams
are often used to capture the context of words in a text.
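The definition above translates directly into a sliding window over the token list. This is a minimal sketch; the function name ngrams is an assumption, and tokens are obtained by a plain whitespace split.

```python
from typing import List, Tuple

def ngrams(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    """Return every contiguous run of n tokens, in order."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    # A window of width n fits at positions 0 .. len(tokens) - n
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the quick brown fox".split()
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
print(bigrams)
print(trigrams)
```

If n exceeds the number of tokens, the function returns an empty list rather than raising an error.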
Count vectorization
• Text vectorization is the process of converting text data into numerical
vectors, which can be used as input for machine learning models. One
of the most common techniques for text vectorization is bag-of-words,
which represents text as a bag (or multiset) of its words,
disregarding grammar and word order but keeping track of the
number of occurrences of each word. Count vectorization stores these
counts as one vector per document, with one position per vocabulary word.
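The bag-of-words counts can be sketched with collections.Counter. The helper count_vector and the toy documents are assumptions for illustration; scikit-learn's CountVectorizer offers the same idea as a ready-made component.

```python
from collections import Counter

docs = ["the cat sat on the mat", "the dog sat"]

# Shared vocabulary fixes the meaning of each vector position
vocabulary = sorted({word for doc in docs for word in doc.split()})

def count_vector(doc, vocabulary):
    """Return the number of times each vocabulary word occurs in doc."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]

vectors = [count_vector(doc, vocabulary) for doc in docs]
print(vocabulary)
for vector in vectors:
    print(vector)
```

Word order is lost: "the cat sat on the mat" and "the mat sat on the cat" would produce the same vector.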
Text sort (text classification)

• Text classification covers techniques for assigning text to predefined
categories; sentiment analysis and spam detection are typical examples.
• It is one of the most important NLP tasks: given a piece of text, a model
assigns it one or more predefined categories or labels.
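One simple way to classify text is a multinomial Naive Bayes model over word counts. The sketch below is written from scratch with add-one smoothing; the four training messages, the labels "spam" and "ham", and the function name predict are all made up for illustration, and a real project would use a much larger dataset and a library implementation.

```python
import math
from collections import Counter, defaultdict

# Tiny hypothetical training set: (text, label)
train = [
    ("win cash prize now", "spam"),
    ("cheap prize win win", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday", "ham"),
]

class_docs = Counter()              # number of documents per class
word_counts = defaultdict(Counter)  # word frequencies per class
for text, label in train:
    class_docs[label] += 1
    word_counts[label].update(text.split())

vocabulary = {w for counts in word_counts.values() for w in counts}

def predict(text):
    """Return the class with the highest log-probability under Naive Bayes."""
    scores = {}
    for label in class_docs:
        total = sum(word_counts[label].values())
        # Log prior: share of training documents with this label
        score = math.log(class_docs[label] / len(train))
        for word in text.split():
            # Add-one smoothing keeps unseen words from zeroing the score
            score += math.log((word_counts[label][word] + 1) / (total + len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get)

print(predict("win a prize"))
print(predict("agenda for lunch"))
```

Working in log space avoids multiplying many small probabilities together, which would otherwise underflow on longer texts.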
Named entity recognition
• Named entity recognition (NER) covers techniques for identifying and
extracting named entities from text, such as people, organizations,
and locations.
• An entity can be thought of as an instance of a category present in a given
text, for example the name of a person, an organization, or a location.

• https://spacy.io/usage
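The simplest form of entity extraction is a dictionary (gazetteer) lookup, sketched below. This is a stand-in for illustration only: statistical NER systems such as spaCy's learn to recognise entities from context rather than from a fixed list. The gazetteer entries, the example sentence, and the function name find_entities are all hypothetical.

```python
import re

# Hypothetical gazetteer: surface form -> entity type
GAZETTEER = {
    "Alice Smith": "PERSON",
    "Acme Corp": "ORGANIZATION",
    "Paris": "LOCATION",
}

def find_entities(text):
    """Return (entity text, entity type) pairs in order of appearance."""
    found = []
    for name, label in GAZETTEER.items():
        # \b anchors the match to word boundaries so "Paris" does not match "Parisian"
        for match in re.finditer(r"\b" + re.escape(name) + r"\b", text):
            found.append((match.start(), match.group(), label))
    return [(name, label) for _, name, label in sorted(found)]

text = "Alice Smith joined Acme Corp in Paris."
entities = find_entities(text)
print(entities)
```

A lookup like this cannot recognise names it has never seen, which is the main reason trained models are preferred in practice.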
Part-of-speech tagging
• Part-of-speech (POS) tagging identifies the part of speech of each word
in a sentence, such as noun, verb, or adjective.

• The NLTK library for Python provides a function called ‘pos_tag’ that
tags parts of speech in just one line of code.
The Different Parts of Speech and Their Tags

• There are nine main parts of speech: noun, pronoun, verb, adjective,
adverb, conjunction, preposition, interjection, and article.

• Part-of-speech (POS) tags are labels assigned to words in a
text, indicating their grammatical role in a sentence. The most
common types of POS tags include:
• Noun (NN): A person, place, thing, or idea
• Verb (VB): An action or occurrence
• Adjective (JJ): A word that describes a noun or pronoun
• Adverb (RB): A word that describes a verb, adjective, or other adverb
• Pronoun (PRP): A word that takes the place of a noun
• Conjunction (CC): A word that connects words, phrases, or clauses
• Preposition (IN): A word that shows a relationship between a noun or pronoun
and other elements in a sentence
• Interjection (UH): A word or phrase used to express strong emotion
POS and categorisation
• This is just a sample of the most common POS tags; the abbreviations shown follow
the Penn Treebank tag set used by NLTK's default tagger. Different libraries and
models may use different sets of tags, but the purpose remains the same:
to categorise words based on their grammatical function.
• Words and phrases can also be categorised by the grammatical function they serve
in a sentence. There are three primary categories: subjects (which perform the
action), objects (which receive the action), and modifiers (which describe
or modify the subject or object). Each primary category can be further
divided into subcategories. For example, subjects can be further classified
as simple (one word), compound (two or more words), or complex
(containing subordinate clauses).
Text generation

• Techniques for generating new text based on a given input, such as
machine translation and text summarization.
• Text generation is the task of automatically producing new text based
on a given input or model. It is a popular area of research in Natural
Language Processing (NLP) and has numerous applications such as
chatbots, content creation, and language translation.
Techniques for text generation.
• Markov Chain: A statistical model that predicts the next word from a probability
distribution conditioned on the previous word or words. It generates new text by starting
with an initial state and then repeatedly sampling the next word from the distribution learned
from the input text.
• Sequence-to-Sequence (Seq2Seq) Model: A deep learning model that consists of two recurrent
neural networks (RNNs): an encoder and a decoder. The encoder takes the input text and
produces a fixed-length vector representation, while the decoder generates the output text based
on that vector representation.
• Generative Adversarial Network (GAN): A deep learning model that consists of two neural
networks: a generator and a discriminator. The generator produces new text samples, while the
discriminator tries to distinguish between the generated text and the real text. The two networks
are trained in an adversarial manner, where the generator tries to produce more realistic text,
while the discriminator tries to become better at recognizing fake text.
• Transformer-based Models: Transformers are a neural network architecture built around
self-attention, originally introduced for machine translation and now used across NLP tasks
such as text classification. They underpin modern LLMs and perform well on text-generation tasks.
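Of the techniques above, the Markov chain is small enough to sketch in full. The version below is a first-order, word-level chain: it conditions only on the single previous word. The function names build_chain and generate, the toy corpus, and the fixed random seed are assumptions made for illustration.

```python
import random
from collections import defaultdict

def build_chain(text):
    """Map each word to the list of words that follow it in the training text."""
    words = text.split()
    chain = defaultdict(list)
    for current, following in zip(words, words[1:]):
        # Repeated followers stay in the list, so sampling reflects their frequency
        chain[current].append(following)
    return chain

def generate(chain, start, length, seed=0):
    """Sample up to `length` words, starting from `start`."""
    rng = random.Random(seed)  # seeded so that runs are reproducible
    word = start
    output = [word]
    for _ in range(length - 1):
        followers = chain.get(word)
        if not followers:  # dead end: this word was never followed by anything
            break
        word = rng.choice(followers)
        output.append(word)
    return " ".join(output)

corpus = "the cat sat on the mat and the dog sat on the log"
chain = build_chain(corpus)
sentence = generate(chain, "the", 8)
print(sentence)
```

Every adjacent word pair in the output also occurs in the training text, which is why such a model produces locally plausible but globally aimless sentences.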

Text-to-Speech and Speech-to-Text

• Techniques for converting speech to text and text to speech.

• Text-to-Speech (TTS) and Speech-to-Text (STT) are two important
applications of Natural Language Processing (NLP).
• Text-to-Speech (TTS): TTS is a technology that allows computers to
generate human-like speech from written text. The goal of TTS is to
produce speech that is natural, expressive, and matches the
intonation, rhythm, and prosody of human speech as closely as
possible. TTS systems typically consist of two components: a text
analysis module that analyzes the input text, and a speech synthesis
module that converts the analyzed text into speech.
• Speech-to-Text (STT): STT, also known as automatic speech recognition
(ASR), is a technology that allows computers to transcribe spoken words
into written text. STT systems are used in a wide range of applications,
including voice-controlled virtual assistants and dictation software.
STT systems typically use acoustic models and
language models to transcribe speech into text.
Application based questions
No. 1) Code based:
How can the following output be generated from the text "The quick brown fox jumps
over the lazy dog."? Explain the output as well.

Code answer:
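import nltk

# One-time downloads of the tokenizer and POS-tagger models
# (newer NLTK releases may ask for 'punkt_tab' and 'averaged_perceptron_tagger_eng' instead)
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

text = "The quick brown fox jumps over the lazy dog."
# Tokenize the text into words
tokens = nltk.word_tokenize(text)
# Perform part-of-speech tagging on the tokenized words
tagged_words = nltk.pos_tag(tokens)
print(tagged_words)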
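OUTPUT
[('The', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN'), ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN'), ('.', '.')]
Parts of Speech and Their Tags
Each tuple pairs a token with its tag: DT is a determiner, JJ an adjective, NN a singular noun,
VBZ a third-person singular present-tense verb, and IN a preposition; punctuation is tagged as itself.
Note that in this output the tagger labels "brown" as a noun (NN) even though it acts as an adjective here.
There are nine main parts of speech: noun, pronoun, verb, adjective, adverb, conjunction, preposition,
interjection, and article.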

Application based questions

No. 2) Research based:
What are the scikit-learn based alternatives
for each of the NLP concepts applied in the ‘20 newsgroup data’
project?
