NLP Unit1

Natural Language Processing (NLP) is a multidisciplinary field that focuses on enabling computers to understand and respond to human language. Key concepts include tokenization, part-of-speech tagging, named entity recognition, and various techniques for text analysis such as stemming, lemmatization, and word embeddings. The document also discusses challenges like ambiguity in language and the importance of dependency parsing in understanding grammatical structures.

(21CS121) NATURAL LANGUAGE PROCESSING
NLP CONCEPT
 Natural Language Processing (NLP) is a field at the intersection of
computer science, artificial intelligence, and linguistics, focusing on the
interaction between computers and human (natural) languages.

 The goal of NLP is to enable computers to understand, interpret, and
respond to human language in a way that is both meaningful and useful.
Some basic concepts in NLP are shown below:

 Tokenization: Tokenization is the process of splitting text into individual
words, phrases, symbols, or other meaningful elements called tokens. For
example, the sentence "Hello world!" might be tokenized into ["Hello",
"world", "!"].
CONT...
 Part-of-Speech (POS) Tagging: POS tagging involves labeling each word in
a sentence with its corresponding part of speech, such as noun, verb, adjective,
etc. For example, in the sentence "The cat sits on the mat," the tags might be:
i. The/DT (determiner)
ii. cat/NN (noun)
iii. sits/VBZ (verb, third-person singular present)
iv. on/IN (preposition)
v. the/DT (determiner)
vi. mat/NN (noun)
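 A sketch of POS tagging with NLTK's pretrained tagger (assumes NLTK is installed; newer NLTK versions name the tagger resource "averaged_perceptron_tagger_eng"):

import nltk
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

from nltk import word_tokenize, pos_tag

tags = pos_tag(word_tokenize("The cat sits on the mat"))
print(tags)  # e.g. [('The', 'DT'), ('cat', 'NN'), ('sits', 'VBZ'), ('on', 'IN'), ...]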

 Named Entity Recognition (NER): NER identifies and classifies named
entities in text into predefined categories such as person names, organizations,
locations, dates, and more. For example, in the sentence "Barack Obama was
born in Hawaii," the entities are:
i. Barack Obama (Person)
ii. Hawaii (Location)
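 A sketch using spaCy's small English pipeline (install with pip install spacy and python -m spacy download en_core_web_sm; note that spaCy labels places like Hawaii as GPE, a geopolitical entity):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Barack Obama was born in Hawaii.")
for ent in doc.ents:
    print(ent.text, ent.label_)  # e.g. Barack Obama PERSON, Hawaii GPE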
CONT...
 Lemmatization and Stemming: These are techniques used to reduce words to
their base or root form.
i. Stemming: Reduces words to their root form by removing suffixes (e.g.,
"running" becomes "run").
ii. Lemmatization: Reduces words to their base form considering the context
(e.g., "running" becomes "run", "better" becomes "good").
 Stop Words: are common words that are often filtered out in NLP tasks
because they carry less meaning (e.g., "the", "is", "in"). Removing stop words
can improve the efficiency of text processing.
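 A minimal stop-word filter using NLTK's English stop-word list (assumes the "stopwords" resource has been downloaded):

import nltk
nltk.download("stopwords", quiet=True)

from nltk.corpus import stopwords

stop_words = set(stopwords.words("english"))
tokens = ["the", "cat", "is", "in", "the", "garden"]
print([t for t in tokens if t not in stop_words])  # ['cat', 'garden']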

 Bag of Words (BoW): is a representation of text that describes the
occurrence of words within a document.
 It involves creating a vocabulary of all words in the document and then
representing each document by a vector of word counts.
 This model disregards grammar and word order but keeps multiplicity.
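 A sketch with scikit-learn's CountVectorizer (assumes scikit-learn is installed; the vocabulary is sorted alphabetically):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat", "the cat sat on the mat"]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())  # ['cat' 'mat' 'on' 'sat' 'the']
print(X.toarray())  # [[1 0 0 1 1], [1 1 1 1 2]] - counts kept, word order lost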
CONT...
 TF-IDF (Term Frequency-Inverse Document Frequency): It is a statistical
measure used to evaluate the importance of a word in a document relative to a
collection of documents (corpus). It combines:
i. Term Frequency (TF): How often a word appears in a document.
ii. Inverse Document Frequency (IDF): How common or rare a word is across
all documents in the corpus.
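 A common formulation is tfidf(t, d) = tf(t, d) x log(N / df(t)), where N is the number of documents and df(t) is the number of documents containing t. A sketch with scikit-learn's TfidfVectorizer (which, by default, adds smoothing and L2-normalizes each row):

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat",
        "the dog sat on the log",
        "cats and dogs"]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

# Rows are documents, columns are vocabulary terms, values are TF-IDF weights;
# words shared by many documents get lower weight than words unique to one.
print(vectorizer.get_feature_names_out())
print(X.toarray().round(2))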

 Word Embeddings: These are dense vector representations of words that
capture semantic meaning. Examples include Word2Vec, GloVe, and FastText.
These embeddings map words to vectors in a continuous vector space where
semantically similar words are closer together.
 N-grams are contiguous sequences of n items (words,
characters, etc.) from a given text. Common examples are:
i. Unigrams (1-gram): ["The", "cat", "sits"]
ii. Bigrams (2-gram): ["The cat", "cat sits"]
iii. Trigrams (3-gram): ["The cat sits"]
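 A tiny helper that builds word n-grams from a token list:

def ngrams(tokens, n):
    # slide a window of size n over the token list
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["The", "cat", "sits"]
print(ngrams(tokens, 1))  # ['The', 'cat', 'sits']
print(ngrams(tokens, 2))  # ['The cat', 'cat sits']
print(ngrams(tokens, 3))  # ['The cat sits']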
CONT...
 Sentiment Analysis: is the process of determining the
emotional tone behind words, sentences, or texts. It classifies
the text into positive, negative, or neutral sentiments.
 Syntax and Parsing: Parsing involves analyzing the
grammatical structure of a sentence to identify relationships
between words. Syntax parsing can be:
i. Dependency Parsing: Identifies dependencies between words
(e.g., subject-verb relationships).
ii. Constituency Parsing: Breaks sentences into sub-phrases or
constituents (e.g., noun phrases, verb phrases).
 Machine Translation: is the automatic translation of text from
one language to another. Techniques range from rule-based
approaches to statistical and neural machine translation models.
CONT...
 Language Models: predict the probability of a sequence of words. They are
fundamental for tasks like text generation and are often built using neural
networks. Examples include LSTM-based models.
 Text Classification: is the process of assigning predefined categories to text.
Examples include spam detection, topic classification, and sentiment
analysis.
 Speech Recognition: involves converting spoken language into text. It
combines NLP with signal processing and often uses models like Hidden
Markov Models (HMM) and deep learning techniques.
AMBIGUITY IN LANGUAGE
 Ambiguity refers to the phenomenon where a word, phrase, or sentence has
multiple interpretations.
 Ambiguity can occur at various levels of language processing, such as
lexical (word-level), syntactic (sentence structure), and semantic
(meaning) levels. Understanding and resolving ambiguity is a significant
challenge in natural language processing (NLP).
 1. Lexical Ambiguity
 Lexical ambiguity arises when a word has multiple meanings.
 Example:
 "I went to the bank."
 Explanation:
 The word "bank" can refer to a financial institution or the side of a river.
 Resolution:
 Context is used to determine the correct meaning. For example, additional
context like "to deposit money" clarifies that "bank" refers to a financial
institution.
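 One classic dictionary-based approach is the Lesk algorithm; a sketch with NLTK's implementation (assumes the "wordnet" resource has been downloaded; Lesk is a simple gloss-overlap heuristic, so its choice is not always the intuitive one):

import nltk
nltk.download("wordnet", quiet=True)

from nltk.wsd import lesk

context = "I went to the bank to deposit money".split()
sense = lesk(context, "bank")
if sense is not None:
    print(sense.name(), "-", sense.definition())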
CONT...
 2. Syntactic Ambiguity
 Syntactic ambiguity occurs when a sentence can be parsed in multiple ways
due to its structure.
 Example:
 "I saw the man with the telescope."
 Explanation:
 This sentence can be interpreted as either:
 "I used the telescope to see the man."
 "I saw a man who had a telescope."
 Resolution:
 Parsing algorithms and contextual understanding are used to determine the
most likely structure.
CONT...
 3. Semantic Ambiguity
 Semantic ambiguity happens when a sentence can have multiple meanings,
even if its syntactic structure is clear.
 Example:
 "He gave her cat food."
 Explanation:
 This sentence can mean:
 "He gave food to her cat."
 "He gave her some cat food."
 Resolution:
 Semantic analysis and context are used to infer the intended meaning.
CONT...
 4. Pragmatic Ambiguity
 Pragmatic ambiguity involves the interpretation of language in context,
considering the speaker's intentions and the situational context.
 Example:
 "Can you pass the salt?"
 Explanation:
 This sentence can be interpreted as:
 A question about the listener's ability to pass the salt.
 A polite request for the listener to pass the salt.
 Resolution:
 Understanding the social and conversational context helps resolve
pragmatic ambiguity.
QUESTIONS
 Identify the ambiguity in each of the following statements:
 "More than 100 students attended the seminar. 50 of them were from our
college."
 "The project will be completed in 10 days."
 "The temperature will rise by 5 to 10 degrees."
SEGMENTATION
 Segmentation in Natural Language Processing (NLP) refers to the process
of dividing text into smaller meaningful units. These units can be sentences,
words, phrases, or other subunits.
 Effective segmentation is crucial for many downstream NLP tasks such as
tokenization, part-of-speech tagging, named entity recognition, and parsing.
 1. Sentence Segmentation
 Sentence segmentation, also known as sentence boundary detection, involves
splitting a text into individual sentences.
 2. Word Segmentation
 Word segmentation, also known as tokenization, involves splitting a sentence
into individual words or tokens.
 3. Subword Segmentation
 Subword segmentation involves splitting words into smaller units, such as
morphemes or subwords, which can be useful for handling out-of-vocabulary
words in machine translation or language modeling.
CONT...
 4. Paragraph Segmentation
 Paragraph segmentation involves splitting a text into paragraphs. This is less
common in typical NLP tasks but can be important for document-level
analysis.
 5. Chunking (Shallow Parsing)
 Chunking involves segmenting and labeling multi-token sequences, such as
noun phrases (NP), verb phrases (VP), etc.
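 A sketch of sentence and word segmentation with NLTK (assumes the "punkt" models have been downloaded; the pretrained model should recognize common abbreviations such as "Dr." and not split after them):

import nltk
nltk.download("punkt", quiet=True)

from nltk.tokenize import sent_tokenize, word_tokenize

text = "Dr. Smith arrived. He gave a talk on NLP."
print(sent_tokenize(text))  # ['Dr. Smith arrived.', 'He gave a talk on NLP.']
print(word_tokenize(text))  # ['Dr.', 'Smith', 'arrived', '.', 'He', ...]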
STEMMING
 Stemming is a text normalization technique in Natural Language Processing
(NLP) that reduces words to their base or root form.
 The root form is usually not a valid word by itself but is a common
representation of words that allows for the conflation of different inflected
forms of a word.
 Stemming helps in reducing the dimensionality of text data and is particularly
useful in search engines, text mining, and information retrieval systems.
 Common Stemming Algorithms
i. Porter Stemmer: One of the most widely used stemming algorithms, known
for its simplicity and efficiency.
ii. Lancaster Stemmer: A more aggressive stemming algorithm compared to the
Porter Stemmer.
iii. Snowball Stemmer: Also known as the Porter2 stemmer, it is an
improvement over the original Porter stemmer and is available for multiple
languages.
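 A quick comparison of the three stemmers, all available in NLTK (outputs vary by algorithm; the Lancaster stemmer typically cuts words the shortest):

from nltk.stem import PorterStemmer, LancasterStemmer, SnowballStemmer

porter = PorterStemmer()
lancaster = LancasterStemmer()
snowball = SnowballStemmer("english")

for word in ["running", "flies", "happily", "organization"]:
    print(word, "->", porter.stem(word), lancaster.stem(word), snowball.stem(word))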
TOKENIZATION
 Tokenization is a fundamental step in natural language processing (NLP) that
involves splitting text into individual units called tokens. These tokens can be
words, phrases, or other meaningful elements. Tokenization facilitates further
processing and analysis of text data by breaking it down into manageable
pieces.
 Types of Tokenization
 Word Tokenization: Splitting text into individual words.
 Sentence Tokenization: Splitting text into individual sentences.
 Subword Tokenization: Splitting words into smaller units, such as morphemes
or subwords, useful in dealing with unknown words or for languages with rich
morphology.
 Libraries for Tokenization
 Several NLP libraries provide robust tokenization tools, including:
 NLTK (Natural Language Toolkit)
 spaCy
 Transformers by Hugging Face
 Gensim
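 For example, a sketch of subword (WordPiece) tokenization with the Transformers library (assumes pip install transformers and access to the pretrained model files; the exact split depends on the model's vocabulary):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize("tokenization"))  # e.g. ['token', '##ization']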
WORD EMBEDDING
 Word Embedding refers to a technique for representing words as dense
vectors of real numbers in a continuous vector space.
 Unlike traditional methods such as one-hot encoding, which represent words
as sparse, high-dimensional vectors, word embeddings capture semantic
relationships between words in a more compact and meaningful way.
 Key points are as follows:
i. Dimensionality Reduction: Word embeddings reduce the dimensionality of
word representations compared to one-hot encoding, which typically results
in a sparse vector of the size of the vocabulary. Embeddings represent each
word as a dense vector of fixed size, often in the range of 50 to 300
dimensions.
ii. Semantic Meaning: Word embeddings capture semantic meaning and
relationships between words. Words with similar meanings or contexts are
represented by similar vectors. For example, "king" and "queen" may have
vectors that are closer to each other than "king" and "car."
CONT...
 Contextual Information: Word embeddings are learned from large corpora
of text and can reflect syntactic and semantic patterns. Popular embeddings
like Word2Vec, GloVe, and FastText are trained using various methods to
capture these patterns.

 Pre-trained Embeddings: Pre-trained word embeddings can be used to
initialize models, allowing them to leverage learned semantic relationships
from large datasets without having to train embeddings from scratch.

 Applications: Word embeddings are used in various NLP tasks such as text
classification, sentiment analysis, machine translation, and information
retrieval. They are foundational for many modern NLP techniques and
models.
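 A sketch of training a small Word2Vec model with gensim (API as of gensim 4.x; on a toy corpus this small the nearest neighbors are noisy and shown only to illustrate the interface):

from gensim.models import Word2Vec

sentences = [["the", "king", "rules"], ["the", "queen", "rules"],
             ["the", "car", "drives"], ["the", "truck", "drives"]]
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, epochs=50)

print(model.wv["king"].shape)                 # (50,) dense vector
print(model.wv.most_similar("king", topn=2))  # nearest words by cosine similarity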
WORD SENSES
 Word senses refer to the different meanings or interpretations that a word can have
depending on its context. A single word can have multiple senses, each with
its own specific meaning.
 Key Points About Word Senses:
 Polysemy: This is the phenomenon where a single word has multiple related
meanings. For example, the word "mouth" can refer to a part of the body or
the opening of a river. The related meanings are considered different senses of
the word.
 Homonymy: This is when a word has multiple meanings that are unrelated or
only loosely related. For instance, "bat" can refer to a flying mammal or a
piece of sports equipment. These are considered different senses of the word
and are usually distinguished by context.
 Contextual Disambiguation: To understand the intended sense of a word in
a given context, disambiguation techniques are used. This process is crucial
for tasks such as machine translation, information retrieval, and text
understanding.
CONT...
 Word Sense Disambiguation (WSD): This is a subtask of NLP
focused on determining which sense of a word is used in a particular
context. WSD can be approached using various methods, including:
 Dictionary-based methods: Leveraging predefined lexical resources
like WordNet, which provide detailed sense definitions and relations.
 Supervised learning: Training models on labeled datasets where the
senses of words are annotated.
 Unsupervised and semi-supervised learning: Using clustering or
co-occurrence patterns to infer word senses without extensive labeled
data.
 Lexical Resources: Resources such as WordNet provide structured
information about word senses and their relationships, including
synonyms, antonyms, hypernyms, and hyponyms. These resources
are valuable for sense disambiguation and other NLP tasks.
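 A sketch of browsing WordNet senses through NLTK (assumes the "wordnet" resource has been downloaded):

import nltk
nltk.download("wordnet", quiet=True)

from nltk.corpus import wordnet as wn

for synset in wn.synsets("bank")[:3]:   # first few senses of "bank"
    print(synset.name(), "-", synset.definition())

dog = wn.synset("dog.n.01")
print(dog.hypernyms())  # more general concepts, e.g. canine.n.02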
CONT...
 Applications: Understanding word senses is critical for many NLP
applications, including:
 Machine Translation: Ensuring the correct translation of words based on
their intended meanings.
 Information Retrieval: Improving search results by understanding the
context of search queries.
 Text Summarization: Generating accurate summaries that reflect the
correct meanings of words.
DEPENDENCY PARSING
 It is a key aspect of syntactic analysis in natural language processing
(NLP) and computational linguistics.
 It focuses on analyzing the grammatical structure of a sentence by
identifying the relationships between words, particularly how each word
depends on others.
 Key Concepts in Dependency Parsing:
 Dependency Relations: In dependency parsing, the grammatical structure
of a sentence is represented by a set of dependency relations. Each relation
consists of a head and a dependent. The head is a word that governs or
influences another word (the dependent), establishing a syntactic
connection between them.
CONT...
 Dependency Tree: The result of dependency parsing is often visualized as a
dependency tree or dependency graph. In this tree, each node represents a
word, and directed edges represent dependency relations. The root of the tree
is typically the main verb or another central element of the sentence.
 Head and Dependent:
 Head: The governing word in a dependency relation.
 Dependent: The word that is governed by the head. For example, in the
phrase "The cat sleeps," "sleeps" is the head of "cat," which is the dependent.
 Types of Dependencies: Common dependency relations include:
 Subject: The noun or noun phrase that performs the action (e.g., "cat" in "The
cat sleeps").
 Object: The noun or noun phrase that receives the action (e.g., "ball" in "She
throws the ball").
 Modifier: Words that provide additional information about another word
(e.g., adjectives describing nouns).
CONT...
 Dependency Parsing Models: Several algorithms and models are used for dependency
parsing, including:
 Transition-based parsing: Constructs the dependency tree by making a sequence of parsing
decisions based on transitions between different states.
 Graph-based parsing: Constructs the entire dependency graph and selects the best tree by
optimizing a scoring function.
 Neural network-based models: Leverage deep learning techniques to learn complex patterns
in dependency structures, improving accuracy and flexibility.
 Applications: Dependency parsing is crucial for various NLP tasks, including:
 Semantic Role Labeling: Understanding the roles played by different words in a sentence.
 Machine Translation: Improving the accuracy of translations by capturing grammatical
relationships.
 Information Extraction: Identifying and extracting specific information based on
grammatical structure.
 Text Summarization: Generating coherent summaries by understanding sentence structure.
 Tools and Resources: Popular tools for dependency parsing include:
 spaCy: An NLP library with built-in support for dependency parsing.
 Stanford Parser: A widely used tool from the Stanford NLP group that provides dependency
parsing capabilities.
 NLTK: The Natural Language Toolkit, which includes functions for dependency parsing.
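 A sketch of inspecting a dependency parse with spaCy (install with pip install spacy and python -m spacy download en_core_web_sm):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The cat sleeps")
for token in doc:
    # each token reports its dependency label and its head (governing word)
    print(token.text, token.dep_, token.head.text)
# e.g. The det cat / cat nsubj sleeps / sleeps ROOT sleeps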
