NLP Unit1

Natural Language Processing (NLP) is a multidisciplinary field that focuses on enabling computers to understand and respond to human language. Key concepts include tokenization, part-of-speech tagging, named entity recognition, and various techniques for text analysis such as stemming, lemmatization, and word embeddings. The document also discusses challenges like ambiguity in language and the importance of dependency parsing in understanding grammatical structures.

(21CS121) NATURAL LANGUAGE PROCESSING
NLP CONCEPT
 Natural Language Processing (NLP) is a field at the intersection of
computer science, artificial intelligence, and linguistics, focusing on the
interaction between computers and human (natural) languages.

 The goal of NLP is to enable computers to understand, interpret, and
respond to human language in a way that is both meaningful and useful.
Some basic concepts in NLP are shown below:

 Tokenization: Tokenization is the process of splitting text into individual
words, phrases, symbols, or other meaningful elements called tokens. For
example, the sentence "Hello world!" might be tokenized into ["Hello",
"world", "!"].
CONT...
 Part-of-Speech (POS) Tagging: POS tagging involves labeling each word in
a sentence with its corresponding part of speech, such as noun, verb, adjective,
etc. For example, in the sentence "The cat sits on the mat," the tags might be:
i. The/DT (determiner)
ii. cat/NN (noun)
iii. sits/VBZ (verb, third-person singular present)
iv. on/IN (preposition)
v. the/DT (determiner)
vi. mat/NN (noun)
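 A sketch of POS tagging with NLTK's pretrained tagger (assumes NLTK is installed; newer NLTK versions name the tagger resource "averaged_perceptron_tagger_eng"):

import nltk
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

from nltk import word_tokenize, pos_tag

tags = pos_tag(word_tokenize("The cat sits on the mat"))
print(tags)  # e.g. [('The', 'DT'), ('cat', 'NN'), ('sits', 'VBZ'), ('on', 'IN'), ...]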

 Named Entity Recognition (NER): NER identifies and classifies named
entities in text into predefined categories such as person names, organizations,
locations, dates, and more. For example, in the sentence "Barack Obama was
born in Hawaii," the entities are:
i. Barack Obama (Person)
ii. Hawaii (Location)
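 A sketch using spaCy's small English pipeline (install with pip install spacy and python -m spacy download en_core_web_sm; note that spaCy labels places like Hawaii as GPE, a geopolitical entity):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Barack Obama was born in Hawaii.")
for ent in doc.ents:
    print(ent.text, ent.label_)  # e.g. Barack Obama PERSON, Hawaii GPE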
CONT...
 Lemmatization and Stemming: These are techniques used to reduce words to
their base or root form.
i. Stemming: Reduces words to their root form by removing suffixes (e.g.,
"running" becomes "run").
ii. Lemmatization: Reduces words to their base form considering the context
(e.g., "running" becomes "run", "better" becomes "good").
 Stop Words: are common words that are often filtered out in NLP tasks
because they carry less meaning (e.g., "the", "is", "in"). Removing stop words
can improve the efficiency of text processing.
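 A minimal stop-word filter using NLTK's English stop-word list (assumes the "stopwords" resource has been downloaded):

import nltk
nltk.download("stopwords", quiet=True)

from nltk.corpus import stopwords

stop_words = set(stopwords.words("english"))
tokens = ["the", "cat", "is", "in", "the", "garden"]
print([t for t in tokens if t not in stop_words])  # ['cat', 'garden']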

 Bag of Words (BoW): is a representation of text that describes the
occurrence of words within a document.
 It involves creating a vocabulary of all words in the document and then
representing each document by a vector of word counts.
 This model disregards grammar and word order but keeps multiplicity.
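 A sketch with scikit-learn's CountVectorizer (assumes scikit-learn is installed; the vocabulary is sorted alphabetically):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat", "the cat sat on the mat"]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())  # ['cat' 'mat' 'on' 'sat' 'the']
print(X.toarray())  # [[1 0 0 1 1], [1 1 1 1 2]] - counts kept, word order lost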
CONT...
 TF-IDF (Term Frequency-Inverse Document Frequency): It is a statistical
measure used to evaluate the importance of a word in a document relative to a
collection of documents (corpus). It combines:
i. Term Frequency (TF): How often a word appears in a document.
ii. Inverse Document Frequency (IDF): How common or rare a word is across
all documents in the corpus.
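 A common formulation is tfidf(t, d) = tf(t, d) x log(N / df(t)), where N is the number of documents and df(t) is the number of documents containing t. A sketch with scikit-learn's TfidfVectorizer (which, by default, adds smoothing and L2-normalizes each row):

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat",
        "the dog sat on the log",
        "cats and dogs"]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

# Rows are documents, columns are vocabulary terms, values are TF-IDF weights;
# words shared by many documents get lower weight than words unique to one.
print(vectorizer.get_feature_names_out())
print(X.toarray().round(2))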

 Word Embeddings: These are dense vector representations of words that
capture semantic meaning. Examples include Word2Vec, GloVe, and FastText.
These embeddings map words to vectors in a continuous vector space where
semantically similar words are closer together.
 N-grams are contiguous sequences of n items (words,
characters, etc.) from a given text. Common examples are:
i. Unigrams (1-gram): ["The", "cat", "sits"]
ii. Bigrams (2-gram): ["The cat", "cat sits"]
iii. Trigrams (3-gram): ["The cat sits"]
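 A tiny helper that builds word n-grams from a token list:

def ngrams(tokens, n):
    # slide a window of size n over the token list
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["The", "cat", "sits"]
print(ngrams(tokens, 1))  # ['The', 'cat', 'sits']
print(ngrams(tokens, 2))  # ['The cat', 'cat sits']
print(ngrams(tokens, 3))  # ['The cat sits']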
CONT...
 Sentiment Analysis: is the process of determining the
emotional tone behind words, sentences, or texts. It classifies
the text into positive, negative, or neutral sentiments.
 Syntax and Parsing: Parsing involves analyzing the
grammatical structure of a sentence to identify relationships
between words. Syntax parsing can be:
i. Dependency Parsing: Identifies dependencies between words
(e.g., subject-verb relationships).
ii. Constituency Parsing: Breaks sentences into sub-phrases or
constituents (e.g., noun phrases, verb phrases).
 Machine Translation: is the automatic translation of text from
one language to another. Techniques range from rule-based
approaches to statistical and neural machine translation models.
CONT...
 Language Models: predict the probability of a sequence of words. They are
fundamental for tasks like text generation and are often built using neural
networks. Examples include LSTM-based models.
 Text Classification: is the process of assigning predefined categories to text.
Examples include spam detection, topic classification, and sentiment
analysis.
 Speech Recognition: involves converting spoken language into text. It
combines NLP with signal processing and often uses models like Hidden
Markov Models (HMM) and deep learning techniques.
AMBIGUITY IN LANGUAGE
 Ambiguity refers to the phenomenon where a word, phrase, or sentence has
multiple interpretations.
 Ambiguity can occur at various levels of language processing, such as
lexical (word-level), syntactic (sentence structure), and semantic
(meaning) levels. Understanding and resolving ambiguity is a significant
challenge in natural language processing (NLP).
 1. Lexical Ambiguity
 Lexical ambiguity arises when a word has multiple meanings.
 Example:
 "I went to the bank."
 Explanation:
 The word "bank" can refer to a financial institution or the side of a river.
 Resolution:
 Context is used to determine the correct meaning. For example, additional
context like "to deposit money" clarifies that "bank" refers to a financial
institution.
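 One classic dictionary-based approach is the Lesk algorithm; a sketch with NLTK's implementation (assumes the "wordnet" resource has been downloaded; Lesk is a simple gloss-overlap heuristic, so its choice is not always the intuitive one):

import nltk
nltk.download("wordnet", quiet=True)

from nltk.wsd import lesk

context = "I went to the bank to deposit money".split()
sense = lesk(context, "bank")
if sense is not None:
    print(sense.name(), "-", sense.definition())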
CONT...
 2. Syntactic Ambiguity
 Syntactic ambiguity occurs when a sentence can be parsed in multiple ways
due to its structure.
 Example:
 "I saw the man with the telescope."
 Explanation:
 This sentence can be interpreted as either:
 "I used the telescope to see the man."
 "I saw a man who had a telescope."
 Resolution:
 Parsing algorithms and contextual understanding are used to determine the
most likely structure.
CONT...
 3. Semantic Ambiguity
 Semantic ambiguity happens when a sentence can have multiple meanings,
even if its syntactic structure is clear.
 Example:
 "He gave her cat food."
 Explanation:
 This sentence can mean:
 "He gave food to her cat."
 "He gave her some cat food."
 Resolution:
 Semantic analysis and context are used to infer the intended meaning.
CONT...
 4. Pragmatic Ambiguity
 Pragmatic ambiguity involves the interpretation of language in context,
considering the speaker's intentions and the situational context.
 Example:
 "Can you pass the salt?"
 Explanation:
 This sentence can be interpreted as:
 A question about the listener's ability to pass the salt.
 A polite request for the listener to pass the salt.
 Resolution:
 Understanding the social and conversational context helps resolve
pragmatic ambiguity.
QUESTIONS
 Identify the ambiguity in each of the following statements:
 "More than 100 students attended the seminar. 50 of them were from our
college."
 "The project will be completed in 10 days."
 "The temperature will rise by 5 to 10 degrees."
SEGMENTATION
 Segmentation in Natural Language Processing (NLP) refers to the process
of dividing text into smaller meaningful units. These units can be sentences,
words, phrases, or other subunits.
 Effective segmentation is crucial for many downstream NLP tasks such as
tokenization, part-of-speech tagging, named entity recognition, and parsing.
 1. Sentence Segmentation
 Sentence segmentation, also known as sentence boundary detection, involves
splitting a text into individual sentences.
 2. Word Segmentation
 Word segmentation, also known as tokenization, involves splitting a sentence
into individual words or tokens.
 3. Subword Segmentation
 Subword segmentation involves splitting words into smaller units, such as
morphemes or subwords, which can be useful for handling out-of-vocabulary
words in machine translation or language modeling.
CONT...
 4. Paragraph Segmentation
 Paragraph segmentation involves splitting a text into paragraphs. This is less
common in typical NLP tasks but can be important for document-level
analysis.
 5. Chunking (Shallow Parsing)
 Chunking involves segmenting and labeling multi-token sequences, such as
noun phrases (NP), verb phrases (VP), etc.
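 A sketch of sentence and word segmentation with NLTK (assumes the "punkt" models have been downloaded; the pretrained model should recognize common abbreviations such as "Dr." and not split after them):

import nltk
nltk.download("punkt", quiet=True)

from nltk.tokenize import sent_tokenize, word_tokenize

text = "Dr. Smith arrived. He gave a talk on NLP."
print(sent_tokenize(text))  # ['Dr. Smith arrived.', 'He gave a talk on NLP.']
print(word_tokenize(text))  # ['Dr.', 'Smith', 'arrived', '.', 'He', ...]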
STEMMING
 Stemming is a text normalization technique in Natural Language Processing
(NLP) that reduces words to their base or root form.
 The root form is usually not a valid word by itself but is a common
representation of words that allows for the conflation of different inflected
forms of a word.
 Stemming helps in reducing the dimensionality of text data and is particularly
useful in search engines, text mining, and information retrieval systems.
 Common Stemming Algorithms
i. Porter Stemmer: One of the most widely used stemming algorithms, known
for its simplicity and efficiency.
ii. Lancaster Stemmer: A more aggressive stemming algorithm compared to the
Porter Stemmer.
iii. Snowball Stemmer: Also known as the Porter2 stemmer, it is an
improvement over the original Porter stemmer and is available for multiple
languages.
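 A quick comparison of the three stemmers, all available in NLTK (outputs vary by algorithm; the Lancaster stemmer typically cuts words the shortest):

from nltk.stem import PorterStemmer, LancasterStemmer, SnowballStemmer

porter = PorterStemmer()
lancaster = LancasterStemmer()
snowball = SnowballStemmer("english")

for word in ["running", "flies", "happily", "organization"]:
    print(word, "->", porter.stem(word), lancaster.stem(word), snowball.stem(word))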
TOKENIZATION
 Tokenization is a fundamental step in natural language processing (NLP) that
involves splitting text into individual units called tokens. These tokens can be
words, phrases, or other meaningful elements. Tokenization facilitates further
processing and analysis of text data by breaking it down into manageable
pieces.
 Types of Tokenization
 Word Tokenization: Splitting text into individual words.
 Sentence Tokenization: Splitting text into individual sentences.
 Subword Tokenization: Splitting words into smaller units, such as morphemes
or subwords, useful in dealing with unknown words or for languages with rich
morphology.
 Libraries for Tokenization
 Several NLP libraries provide robust tokenization tools, including:
 NLTK (Natural Language Toolkit)
 spaCy
 Transformers by Hugging Face
 Gensim
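 For example, a sketch of subword (WordPiece) tokenization with the Transformers library (assumes pip install transformers and access to the pretrained model files; the exact split depends on the model's vocabulary):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize("tokenization"))  # e.g. ['token', '##ization']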
WORD EMBEDDING
 Word Embedding refers to a technique for representing words as dense
vectors of real numbers in a continuous vector space.
 Unlike traditional methods such as one-hot encoding, which represent words
as sparse, high-dimensional vectors, word embeddings capture semantic
relationships between words in a more compact and meaningful way.
 Key points are as follows:
i. Dimensionality Reduction: Word embeddings reduce the dimensionality of
word representations compared to one-hot encoding, which typically results
in a sparse vector of the size of the vocabulary. Embeddings represent each
word as a dense vector of fixed size, often in the range of 50 to 300
dimensions.
ii. Semantic Meaning: Word embeddings capture semantic meaning and
relationships between words. Words with similar meanings or contexts are
represented by similar vectors. For example, "king" and "queen" may have
vectors that are closer to each other than "king" and "car."
CONT...
 Contextual Information: Word embeddings are learned from large corpora
of text and can reflect syntactic and semantic patterns. Popular embeddings
like Word2Vec, GloVe, and FastText are trained using various methods to
capture these patterns.

 Pre-trained Embeddings: Pre-trained word embeddings can be used to
initialize models, allowing them to leverage learned semantic relationships
from large datasets without having to train embeddings from scratch.

 Applications: Word embeddings are used in various NLP tasks such as text
classification, sentiment analysis, machine translation, and information
retrieval. They are foundational for many modern NLP techniques and
models.
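 A sketch of training a small Word2Vec model with gensim (API as of gensim 4.x; on a toy corpus this small the nearest neighbors are noisy and shown only to illustrate the interface):

from gensim.models import Word2Vec

sentences = [["the", "king", "rules"], ["the", "queen", "rules"],
             ["the", "car", "drives"], ["the", "truck", "drives"]]
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, epochs=50)

print(model.wv["king"].shape)                 # (50,) dense vector
print(model.wv.most_similar("king", topn=2))  # nearest words by cosine similarity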
WORD SENSES
 Word senses refer to the different meanings or interpretations that a word can have
depending on its context. A single word can have multiple senses, each with
its own specific meaning.
 Key Points About Word Senses:
 Polysemy: This is the phenomenon where a single word has multiple related
meanings. For example, the word "mouth" can refer to a part of the body or
the opening of a river. The related meanings are considered different senses of
the word.
 Homonymy: This is when a word has multiple meanings that are unrelated or
only loosely related. For instance, "bat" can refer to a flying mammal or a
piece of sports equipment. These are considered different senses of the word
and are usually distinguished by context.
 Contextual Disambiguation: To understand the intended sense of a word in
a given context, disambiguation techniques are used. This process is crucial
for tasks such as machine translation, information retrieval, and text
understanding.
CONT...
 Word Sense Disambiguation (WSD): This is a subtask of NLP
focused on determining which sense of a word is used in a particular
context. WSD can be approached using various methods, including:
 Dictionary-based methods: Leveraging predefined lexical resources
like WordNet, which provide detailed sense definitions and relations.
 Supervised learning: Training models on labeled datasets where the
senses of words are annotated.
 Unsupervised and semi-supervised learning: Using clustering or
co-occurrence patterns to infer word senses without extensive labeled
data.
 Lexical Resources: Resources such as WordNet provide structured
information about word senses and their relationships, including
synonyms, antonyms, hypernyms, and hyponyms. These resources
are valuable for sense disambiguation and other NLP tasks.
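 A sketch of browsing WordNet senses through NLTK (assumes the "wordnet" resource has been downloaded):

import nltk
nltk.download("wordnet", quiet=True)

from nltk.corpus import wordnet as wn

for synset in wn.synsets("bank")[:3]:   # first few senses of "bank"
    print(synset.name(), "-", synset.definition())

dog = wn.synset("dog.n.01")
print(dog.hypernyms())  # more general concepts, e.g. canine.n.02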
CONT...
 Applications: Understanding word senses is critical for many NLP
applications, including:
 Machine Translation: Ensuring the correct translation of words based on
their intended meanings.
 Information Retrieval: Improving search results by understanding the
context of search queries.
 Text Summarization: Generating accurate summaries that reflect the
correct meanings of words.
DEPENDENCY PARSING
 It is a key aspect of syntactic analysis in natural language processing
(NLP) and computational linguistics.
 It focuses on analyzing the grammatical structure of a sentence by
identifying the relationships between words, particularly how each word
depends on others.
 Key Concepts in Dependency Parsing:
 Dependency Relations: In dependency parsing, the grammatical structure
of a sentence is represented by a set of dependency relations. Each relation
consists of a head and a dependent. The head is a word that governs or
influences another word (the dependent), establishing a syntactic
connection between them.
CONT...
 Dependency Tree: The result of dependency parsing is often visualized as a
dependency tree or dependency graph. In this tree, each node represents a
word, and directed edges represent dependency relations. The root of the tree
is typically the main verb or another central element of the sentence.
 Head and Dependent:
 Head: The governing word in a dependency relation.
 Dependent: The word that is governed by the head. For example, in the
phrase "The cat sleeps," "sleeps" is the head of "cat," which is the dependent.
 Types of Dependencies: Common dependency relations include:
 Subject: The noun or noun phrase that performs the action (e.g., "cat" in "The
cat sleeps").
 Object: The noun or noun phrase that receives the action (e.g., "ball" in "She
throws the ball").
 Modifier: Words that provide additional information about another word
(e.g., adjectives describing nouns).
CONT...
 Dependency Parsing Models: Several algorithms and models are used for dependency
parsing, including:
 Transition-based parsing: Constructs the dependency tree by making a sequence of parsing
decisions based on transitions between different states.
 Graph-based parsing: Constructs the entire dependency graph and selects the best tree by
optimizing a scoring function.
 Neural network-based models: Leverage deep learning techniques to learn complex patterns
in dependency structures, improving accuracy and flexibility.
 Applications: Dependency parsing is crucial for various NLP tasks, including:
 Semantic Role Labeling: Understanding the roles played by different words in a sentence.
 Machine Translation: Improving the accuracy of translations by capturing grammatical
relationships.
 Information Extraction: Identifying and extracting specific information based on
grammatical structure.
 Text Summarization: Generating coherent summaries by understanding sentence structure.
 Tools and Resources: Popular tools for dependency parsing include:
 spaCy: An NLP library with built-in support for dependency parsing.
 Stanford Parser: A widely used tool from the Stanford NLP group that provides dependency
parsing capabilities.
 NLTK: The Natural Language Toolkit, which includes functions for dependency parsing.
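 A sketch of inspecting a dependency parse with spaCy (install with pip install spacy and python -m spacy download en_core_web_sm):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The cat sleeps")
for token in doc:
    # each token reports its dependency label and its head (governing word)
    print(token.text, token.dep_, token.head.text)
# e.g. The det cat / cat nsubj sleeps / sleeps ROOT sleeps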
