Natural Language Processing (NLP)

Uploaded by

yogini.prabhu
Introduction
• Natural Language Processing (NLP) is a subfield of artificial intelligence
that uses machine learning to analyse, generate, and understand human
language in order to derive meaningful insights.
• NLP is growing in popularity as Large Language Models (LLMs) become
more capable and more widely used in the market. Foundational
knowledge of NLP concepts and techniques can help you become an
NLP data scientist, NLP engineer, or distinguished ML engineer and
stand out in the job market.
NLP topics to understand
❑Text pre-processing 8 marks (3+2+2)
❑Text feature extraction
❑Text classification
❑Named entity recognition
❑Parts-of-speech tagging
❑Text generation
❑Text-to-speech and speech-to-text techniques
Text pre-processing

• Pre-processing techniques such as tokenization, stemming, and
lemmatization help convert raw text into a format that can be easily
analyzed.
Tokenization
• Tokenization is a fundamental concept in NLP.
• It is the process of breaking down a complex piece of text into smaller units called
tokens.
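As a rough illustration of tokenization, the sketch below splits a sentence into word and punctuation tokens with a regular expression. The `tokenize` helper is a hypothetical name introduced here, and the rule is deliberately simple; library tokenizers handle many more cases (contractions, abbreviations, and so on).

```python
import re

def tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    # \w+ matches a run of letters, digits, or underscores;
    # [^\w\s] matches one character that is neither a word character nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text)

tokens = tokenize("The quick brown fox jumps over the lazy dog.")
print(tokens)
# ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '.']
```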
Stemming
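• Stemming is the process of reducing words to their base or root form, usually
by stripping suffixes with simple rules, so the resulting stem is not always a
dictionary word.
This can be useful in classification or information retrieval tasks.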
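To make the idea concrete, here is a toy suffix-stripping stemmer. It is only a sketch: the `SUFFIXES` list and the `toy_stem` function are assumptions made for illustration, not the Porter algorithm or any library's stemmer.

```python
# Hypothetical toy rule list, checked in this order.
SUFFIXES = ("ing", "ed", "es", "s")

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters of stem."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["jumps", "jumping", "jumped", "studies"]:
    print(w, "->", toy_stem(w))
# jumps -> jump
# jumping -> jump
# jumped -> jump
# studies -> studi
```

Note that "studi" is not a real word, which is typical of stemmers: they aim for a shared stem, not a valid dictionary form.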

Lemmatization
• Lemmatization is the process of reducing a word to its base or root form, which is
known as the lemma.
• It is a more sophisticated version of stemming, as it takes into account the
context and the part of speech of the word.
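• The lemmatize() method of WordNetLemmatizer in nltk.stem treats every word as a noun by default; pass the pos argument (for example pos='v' for verbs) to lemmatize other parts of speech.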
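The role of the part of speech can be sketched with a tiny lookup table. Everything below (the `LEMMAS` table and `toy_lemmatize`) is hypothetical and stands in for a real lexical resource such as WordNet; the default of `pos="n"` is chosen only to mirror the noun default described above.

```python
# Hypothetical miniature lemma table keyed by (word, part of speech).
LEMMAS = {
    ("better", "a"): "good",
    ("running", "v"): "run",
    ("geese", "n"): "goose",
    ("saw", "v"): "see",
    ("saw", "n"): "saw",
}

def toy_lemmatize(word, pos="n"):
    """Return the lemma for (word, pos), or the lowercased word if it is not in the table."""
    return LEMMAS.get((word.lower(), pos), word.lower())

print(toy_lemmatize("saw", pos="v"))      # see
print(toy_lemmatize("saw"))               # saw
print(toy_lemmatize("running"))           # running (looked up as a noun by default)
print(toy_lemmatize("running", pos="v"))  # run
```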
Text feature extraction

• Techniques such as Vocabulary/bag-of-words, n-grams, count
vectorization, and word embeddings are used to represent text as
numerical features in machine learning models.
• Vocabulary or Bag-of-Words
• Vocabulary in NLP refers to the set of unique words or tokens in a
given text or corpus.
N-grams

• An n-gram is a contiguous sequence of n items from a given sample of
text or speech, where n can be any positive integer. In NLP, n-grams
are often used to capture the context of words in a text.
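A minimal sketch of n-gram extraction over a list of word tokens follows; the `ngrams` helper is a name introduced here for illustration.

```python
def ngrams(tokens, n):
    """Return every contiguous run of n tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the quick brown fox".split()
print(ngrams(tokens, 2))
# [('the', 'quick'), ('quick', 'brown'), ('brown', 'fox')]
print(ngrams(tokens, 3))
# [('the', 'quick', 'brown'), ('quick', 'brown', 'fox')]
```

With n = 2 these are usually called bigrams and with n = 3 trigrams.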
Count vectorization
• Text vectorization is the process of converting text data into numerical
vectors, which can be used as input for machine learning models. One
of the most common techniques for text vectorization is bag-of-words,
which represents text as a bag (or multiset) of its words,
disregarding grammar and word order but keeping track of the
number of occurrences of each word. Count vectorization stores these
counts as one vector per document, with one position per vocabulary word.
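The following sketch builds a vocabulary and bag-of-words count vectors by hand. The `count_vectorize` function is hypothetical and uses whitespace splitting for brevity; it is meant to show the idea, not to reproduce any library's vectorizer.

```python
from collections import Counter

def count_vectorize(documents):
    """Return (vocabulary, vectors): a sorted word list and one count vector per document."""
    tokenized = [doc.lower().split() for doc in documents]
    # The vocabulary is the set of unique tokens across all documents.
    vocabulary = sorted({token for doc in tokenized for token in doc})
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors

docs = ["the cat sat", "the cat saw the dog"]
vocabulary, vectors = count_vectorize(docs)
print(vocabulary)  # ['cat', 'dog', 'sat', 'saw', 'the']
print(vectors)     # [[1, 0, 1, 0, 1], [1, 1, 0, 1, 2]]
```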
Text classification

• Text classification covers techniques for assigning text to predefined
categories, as in sentiment analysis and spam detection.
• It is one of the most important NLP tasks: a given piece of text is
assigned one or more predefined categories or labels.
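As one possible illustration, the sketch below trains a small multinomial Naive Bayes classifier with add-one smoothing on four made-up messages. The training data, the labels "spam" and "ham", and the function names are all assumptions for the example; Naive Bayes is just one of many classification techniques.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(samples):
    """samples: list of (text, label). Returns the counts needed for prediction."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in samples:
        tokens = text.lower().split()
        label_counts[label] += 1
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return label_counts, word_counts, vocabulary

def predict(text, model):
    """Return the label with the highest log-probability for the text."""
    label_counts, word_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        # Log prior plus log likelihood of each known word, with add-one smoothing.
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for token in text.lower().split():
            if token not in vocabulary:
                continue  # ignore words never seen in training
            score += math.log((word_counts[label][token] + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Made-up training data for illustration only.
training = [
    ("win a free prize now", "spam"),
    ("free money click now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch with the project team", "ham"),
]
model = train_naive_bayes(training)
print(predict("free prize money", model))           # spam
print(predict("project meeting on monday", model))  # ham
```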
Named entity recognition
• Named entity recognition (NER) is the task of identifying and
extracting named entities from text, such as people, organizations,
and locations.
• An entity can be thought of as a category type present in a given text.
For example, the name of a certain personality, the name of an
organization, location, etc.

• https://spacy.io/usage
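A very simple way to see what NER produces is a gazetteer lookup, sketched below. This is not how statistical libraries such as spaCy work; the `GAZETTEER` table, its labels, and `find_entities` are hypothetical and only recognise names that are listed in advance.

```python
# Hypothetical gazetteer: known entity strings mapped to an entity type.
GAZETTEER = {
    "Ada Lovelace": "PERSON",
    "Google": "ORG",
    "London": "LOC",
}

def find_entities(text):
    """Return (entity, type, start_index) for every gazetteer entry found in text."""
    found = []
    for name, label in GAZETTEER.items():
        start = text.find(name)
        while start != -1:
            found.append((name, label, start))
            start = text.find(name, start + 1)
    # Report entities in the order they appear in the text.
    return sorted(found, key=lambda item: item[2])

for name, label, start in find_entities("Ada Lovelace never worked at Google in London."):
    print(name, label, start)
```

A trained NER model generalises to names it has never seen, which a fixed lookup table cannot do.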
Part-of-speech tagging
• Part-of-speech tagging is the task of identifying the part of speech
of each word in a sentence, such as noun, verb, or adjective.

• The NLTK library for Python provides a function called ‘pos_tag’ that
tags parts of speech with just one line of code.
The Different Parts of Speech and Their Tags

• There are nine main parts of speech: noun, pronoun, verb, adjective,
adverb, conjunction, preposition, interjection, and article.

• Part-of-speech (POS) tags are labels that are assigned to words in a
text, indicating their grammatical role in a sentence. The most
common types of POS tags include:
• Noun (NN): A person, place, thing, or idea
• Verb (VB): An action or occurrence
• Adjective (JJ): A word that describes a noun or pronoun
• Adverb (RB): A word that describes a verb, adjective, or other adverb
• Pronoun (PRP): A word that takes the place of a noun
• Conjunction (CC): A word that connects words, phrases, or clauses
• Preposition (IN): A word that shows a relationship between a noun or pronoun
and other elements in a sentence
• Interjection (UH): A word or phrase used to express strong emotion
POS tags and categorisation
• This is just a sample of the most common POS tags. Different libraries and
models may use different sets of tags, but the purpose remains the same:
to categorise words based on their grammatical function.
• Parts of speech can also be categorised by their grammatical function in a
sentence. There are three primary categories: subjects (which perform the
action), objects (which receive the action), and modifiers (which describe
or modify the subject or object). Each primary category can be further
divided into subcategories. For example, subjects can be further classified
as simple (one word), compound (two or more words), or complex
(sentences containing subordinate clauses).
Text generation

• Techniques for generating new text based on a given input, such as
machine translation and text summarization.
• Text generation is the task of automatically producing new text based
on a given input or model. It is a popular area of research in Natural
Language Processing (NLP) and has numerous applications such as
chatbots, content creation, and language translation.
Techniques for text generation.
• Markov Chain: A statistical model that predicts the next word based on the probability
distribution of the previous words in the text. It generates new text by starting with an initial
state, and then repeatedly sampling the next word based on the probability distribution learned
from the input text.
• Sequence-to-Sequence (Seq2Seq) Model: A deep learning model that consists of two recurrent
neural networks (RNNs): an encoder and a decoder. The encoder takes the input text and
produces a fixed-length vector representation, while the decoder generates the output text based
on the vector representation.
• Generative Adversarial Network (GAN): A deep learning model that consists of two neural
networks: a generator and a discriminator. The generator produces new text samples, while the
discriminator tries to distinguish between the generated text and the real text. The two networks
are trained in an adversarial manner, where the generator tries to produce more realistic text,
while the discriminator tries to become better at recognizing fake text.
• Transformer-based Models: Transformers are a neural network architecture built around
self-attention and used for NLP tasks such as text classification and machine translation. They
also perform well on text-generation tasks and are the architecture behind most Large Language
Models (LLMs).
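Of the techniques listed above, the Markov chain is the easiest to sketch in a few lines. The example below is a first-order word-level chain over a made-up corpus; `build_chain`, `generate`, and the `seed` parameter are names introduced here for illustration.

```python
import random
from collections import defaultdict

def build_chain(text):
    """Map each word to the list of words that follow it in the text."""
    words = text.split()
    chain = defaultdict(list)
    for current, following in zip(words, words[1:]):
        # Repeated followers stay in the list, so sampling reflects their frequency.
        chain[current].append(following)
    return chain

def generate(chain, start, length, seed=None):
    """Generate up to `length` words by repeatedly sampling a follower of the current word."""
    rng = random.Random(seed)
    word, output = start, [start]
    for _ in range(length - 1):
        followers = chain.get(word)
        if not followers:
            break  # dead end: this word was never followed by anything
        word = rng.choice(followers)
        output.append(word)
    return " ".join(output)

corpus = "the cat sat on the mat and the cat saw the dog"
chain = build_chain(corpus)
print(generate(chain, start="the", length=8, seed=0))
```

Because each step looks only at the current word, the output is locally plausible but has no long-range coherence, which is the main limitation the neural approaches address.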

Text-to-Speech and Speech-to-Text

• Techniques for converting speech to text and text to speech.
• Text-to-Speech (TTS) and Speech-to-Text (STT) are two important
applications of Natural Language Processing (NLP).
• Text-to-Speech (TTS): TTS is a technology that allows computers to
generate human-like speech from written text. The goal of TTS is to
produce speech that is natural, expressive, and matches the
intonation, rhythm, and prosody of human speech as closely as
possible. TTS systems typically consist of two components: a text
analysis module that analyzes the input text, and a speech synthesis
module that converts the analyzed text into speech.
• Speech-to-Text (STT): STT, also known as automatic speech recognition
(ASR), is a technology that allows computers to transcribe spoken words
into written text. STT systems are used in a wide range of applications,
including voice-controlled virtual assistants and dictation software.
They typically combine an acoustic model with a language model to
transcribe speech into text.
Application based questions
No. 1) Code based:
How would you generate the following output from the given text "The quick brown fox jumps
over the lazy dog."? Explain the output as well.

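Code Answer:
import nltk
# Download the tokenizer and tagger models (only needed once).
# Newer NLTK releases may name these 'punkt_tab' and 'averaged_perceptron_tagger_eng'.
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
text = "The quick brown fox jumps over the lazy dog."
# Tokenize the text into words
tokens = nltk.word_tokenize(text)
# Perform part-of-speech tagging on the tokenized words
tagged_words = nltk.pos_tag(tokens)
print(tagged_words)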
OUTPUT
[('The', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN'), ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN'), ('.', '.')]
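One way to explain this output is to pair each tag with its Penn Treebank meaning, as in the sketch below. The `TAG_MEANINGS` table covers only the tags that appear here, and the `explain` helper is a hypothetical name; the tagged list is copied from the output above so the example runs without NLTK.

```python
# Penn Treebank meanings for the tags that appear in this output only.
TAG_MEANINGS = {
    "DT": "determiner",
    "JJ": "adjective",
    "NN": "noun, singular or mass",
    "VBZ": "verb, third-person singular present",
    "IN": "preposition or subordinating conjunction",
    ".": "sentence-final punctuation",
}

def explain(tagged):
    """Attach a human-readable description to each (word, tag) pair."""
    return [(word, tag, TAG_MEANINGS.get(tag, "unknown tag")) for word, tag in tagged]

tagged_words = [("The", "DT"), ("quick", "JJ"), ("brown", "NN"), ("fox", "NN"),
                ("jumps", "VBZ"), ("over", "IN"), ("the", "DT"), ("lazy", "JJ"),
                ("dog", "NN"), (".", ".")]

for word, tag, meaning in explain(tagged_words):
    print(f"{word:6} {tag:4} {meaning}")
```

In this output 'brown' is tagged NN (noun) even though it acts as an adjective in the sentence; taggers are statistical and can mislabel individual words.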
Parts of Speech and Their Tags
There are nine main parts of speech: noun, pronoun, verb, adjective, adverb, conjunction, preposition,
interjection, and article.

Application based questions


No. 2) Research based:
What are the scikit-learn based alternatives
for each of the NLP concepts applied in the ‘20 newsgroup data’
project?
