
UNIT I: Introduction to Natural Language Processing

---Natural Language Processing:


Natural Language Processing (NLP) is a subfield of artificial intelligence (AI) that focuses on the
interaction between computers and human languages.
It enables machines to understand, interpret, generate, and manipulate natural language, whether in
written or spoken form.
NLP combines computational linguistics, machine learning, and deep learning techniques to bridge
the gap between human communication and machine comprehension.
The goal of NLP is to allow computers to process and analyze large amounts of natural language data
in a way that is meaningful and useful.
Applications of NLP include chatbots, virtual assistants (such as Siri and Alexa), machine translation
(like Google Translate), speech recognition, sentiment analysis, and many more.

---Challenges in NLP:
Despite its advancements, NLP faces several challenges due to the complexity and variability of
human language:
1. Ambiguity
o Words and sentences can have multiple meanings.
o Example: "I saw the man with a telescope." (Who has the telescope?)
2. Context Understanding
o Some sentences require prior knowledge to be correctly interpreted.
o Example: "She put the book on the table and sat on it." (Did she sit on the table or the
book?)
3. Idioms and Sarcasm
o Figurative language and sarcasm can be difficult for machines to recognize.
o Example: "Oh, great! Another traffic jam." (The tone is negative, even though the
words suggest something positive.)
4. Multiple Languages and Dialects
o NLP models need to be trained on different languages, dialects, and writing styles.
5. Slang and Informal Language
o Social media posts and informal conversations often include slang, abbreviations, and
emojis.

o Example: "LOL, that’s lit! " (Understanding this requires cultural and contextual
knowledge)
---Programming Languages vs. Natural Languages [Important]
---Stages of NLP: [Important]
Natural Language Processing (NLP) is a field within artificial intelligence that allows computers to
comprehend, analyze, and interact with human language effectively.
The process of NLP can be divided into five distinct phases: Lexical Analysis, Syntactic Analysis,
Semantic Analysis, Discourse Integration, and Pragmatic Analysis.
Each phase plays a crucial role in the overall understanding and processing of natural language.

----First Phase of NLP: Lexical and Morphological Analysis


1. The lexical phase in Natural Language Processing (NLP) involves scanning text and breaking it
down into smaller units such as paragraphs, sentences, and words.
2. This process, known as tokenization, converts raw text into manageable units called tokens or
lexemes. Tokenization is essential for understanding and processing text at the word level.
3. In addition to tokenization, various data cleaning and feature extraction techniques are applied,
including:
• Lemmatization: Reducing words to their base or root form.
• Stopwords Removal: Eliminating common words that do not carry significant meaning, such
as "and," "the," and "is."
• Correcting Misspelled Words: Ensuring the text is free of spelling errors to maintain
accuracy.
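
A minimal sketch of this phase in Python using NLTK is shown below; the sample sentence and the library choice are illustrative assumptions, not specified above:

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# One-time downloads of the required NLTK resources.
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

text = "The cats were chasing the mice in the garden."

# Tokenization: split raw text into word-level tokens.
tokens = nltk.word_tokenize(text.lower())

# Stopword removal: drop common words such as "the" and "in".
stop_words = set(stopwords.words('english'))
filtered = [t for t in tokens if t.isalpha() and t not in stop_words]

# Lemmatization: reduce each word to its base form.
lemmatizer = WordNetLemmatizer()
print([lemmatizer.lemmatize(t) for t in filtered])
# e.g. ['cat', 'chasing', 'mouse', 'garden']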

----Second Phase of NLP: Syntactic Analysis (Parsing)


1. Syntactic analysis, also known as parsing, is the second phase of Natural Language Processing
(NLP).
2. This phase is essential for understanding the structure of a sentence and assessing its grammatical
correctness.
3. It involves analyzing the relationships between words and ensuring their logical consistency by
comparing their arrangement against standard grammatical rules.
4. Parsing examines the grammatical structure and relationships within a given text. It assigns Part-of-Speech (POS) tags to each word, categorizing them as nouns, verbs, adverbs, etc.
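
As an illustration of parsing, the sketch below checks a sentence against a toy context-free grammar with NLTK; the grammar itself is a made-up example, not a standard resource:

import nltk

# A toy grammar covering one sentence pattern (assumed for illustration).
grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> Det N
VP -> V NP
Det -> 'the'
N -> 'dog' | 'cat'
V -> 'chased'
""")

# The parser assigns a grammatical structure to the token sequence.
parser = nltk.ChartParser(grammar)
for tree in parser.parse("the dog chased the cat".split()):
    print(tree)
# (S (NP (Det the) (N dog)) (VP (V chased) (NP (Det the) (N cat))))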

----Third Phase of NLP: Semantic Analysis


1. Semantic Analysis is the third phase of Natural Language Processing (NLP), focusing on
extracting the meaning from text.
2. Unlike syntactic analysis, which deals with grammatical structure, semantic analysis is concerned
with the literal and contextual meaning of words, phrases, and sentences.
3. Semantic analysis aims to understand the dictionary definitions of words and their usage in
context.
4. It determines whether the arrangement of words in a sentence makes logical sense.
5. This phase helps in finding context and logic by ensuring the semantic coherence of sentences.
----Fourth Phase of NLP: Discourse Integration
1. Discourse Integration is the fourth phase of Natural Language Processing (NLP).
2. This phase deals with comprehending the relationship between the current sentence and earlier
sentences or the larger context.
3. Discourse integration is crucial for contextualizing text and understanding the overall message
conveyed.
4. Discourse integration examines how words, phrases, and sentences relate to each other within a
larger context.
5. It assesses the impact a word or sentence has on the structure of a text and how the combination
of sentences affects the overall meaning.
----Fifth Phase of NLP: Pragmatic Analysis
1. Pragmatic Analysis is the fifth and final phase of Natural Language Processing (NLP), focusing
on interpreting the inferred meaning of a text beyond its literal content.
2. Human language is often complex and layered with underlying assumptions, implications, and
intentions that go beyond straightforward interpretation.
3. This phase aims to grasp these deeper meanings in communication.
4. Pragmatic analysis goes beyond the literal meanings examined in semantic analysis, aiming to
understand what the writer or speaker truly intends to convey.
5. In natural language, words and phrases can carry different meanings depending on context, tone,
and the situation in which they are used.
Basics of text processing:
--Tokenization:- [Important]
• Tokenization is the process of dividing a text into smaller units known as tokens.
• Tokens are typically words or sub-words in the context of natural language processing.
• Tokenization is a critical step in many NLP tasks, including text processing, language modelling,
and machine translation.
• The process involves splitting a string or text into a list of tokens. One can think of tokens as parts of a whole: a word is a token in a sentence, and a sentence is a token in a paragraph.
• Tokenization involves using a tokenizer to segment unstructured data and natural language text
into distinct chunks of information, treating them as different elements.
• The tokens within a document can be used as a vector, transforming an unstructured text document into a numerical data structure suitable for machine learning.
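
For example, scikit-learn's CountVectorizer (an assumed tool choice, not named above) turns documents into count vectors:

from sklearn.feature_extraction.text import CountVectorizer

docs = ["Tokenization is an important NLP task.",
        "It helps break down text into smaller units."]

# Each document becomes a row of token counts over the shared vocabulary.
vectorizer = CountVectorizer()
matrix = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())
print(matrix.toarray())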

-Types of Tokenization
Tokenization can be classified into several types based on how the text is segmented. Here are some
types of tokenization:
1.Word Tokenization:
Word tokenization divides the text into individual words. Many NLP tasks use this approach, in which
words are treated as the basic units of meaning.
Example:
Input: "Tokenization is an important NLP task."
Output: ["Tokenization", "is", "an", "important", "NLP", "task", "."]
2.Sentence Tokenization:
The text is segmented into sentences during sentence tokenization. This is useful for tasks requiring
individual sentence analysis or processing.
Example:
Input: "Tokenization is an important NLP task. It helps break down text into smaller units."
Output: ["Tokenization is an important NLP task.", "It helps break down text into smaller units."]
3.Subword Tokenization:
Subword tokenization entails breaking down words into smaller units, which can be especially useful
when dealing with morphologically rich languages or rare words.
Example: Input: "tokenization"
Output: ["token", "ization"]
4.Character Tokenization:
This process divides the text into individual characters. This can be useful for character-level language modelling.
Example: Input: "Tokenization"
Output: ["T", "o", "k", "e", "n", "i", "z", "a", "t", "i", "o", "n"]
--Stemming:-
• Stemming is a method in text processing that eliminates prefixes and suffixes from words,
transforming them into their fundamental or root form.
• The main objective of stemming is to streamline and standardize words, enhancing the effectiveness of natural language processing tasks.
• Simplifying words to their most basic form is called stemming, and it is carried out by stemmers or stemming algorithms. For example, “chocolates” becomes “chocolate” and “retrieval” becomes “retrieve.”
• Stemming in natural language processing reduces words to their base or root form, aiding in text
normalization for easier processing.
• This technique is crucial in tasks like text classification, information retrieval, and text
summarization.
• It is important to note that stemming is different from lemmatization. Lemmatization also reduces a word to its base form, but unlike stemming it takes the context of the word into account and produces a valid word, whereas stemming may produce a non-word as the root form.
Some more examples of words that stem to the root "like" include:
->"likes"
->"liked"
->"likely"
->"liking"

--Lemmatization:-
• Lemmatization is a fundamental text pre-processing technique widely applied in natural
language processing (NLP) and machine learning.
• Lemmatization is the process of grouping together the different inflected forms of a word so they
can be analyzed as a single item.
• Lemmatization is similar to stemming, but it brings context to the words, linking different inflected forms with the same meaning to one base word.
• Text preprocessing includes both stemming and lemmatization. The two terms are often confused, and some even treat them as the same. Lemmatization is preferred over stemming because it performs a morphological analysis of the words.
Examples of lemmatization:
-> rocks : rock
-> corpora : corpus
-> better : good
Lemmatization Techniques
1. Rule Based Lemmatization
Rule-based lemmatization involves the application of predefined rules to derive the base or root form
of a word. Unlike machine learning-based approaches, which learn from data, rule-based
lemmatization relies on linguistic rules and patterns.
Example:
• Word: “walked”
• Rule Application: Remove “-ed”
• Result: “walk”
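
A minimal sketch of such a rule in Python, assuming a single illustrative "-ed" stripping rule:

def rule_lemmatize(word):
    # One illustrative rule: strip a trailing "-ed".
    if word.endswith("ed"):
        return word[:-2]
    return word

print(rule_lemmatize("walked"))  # walk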
2. Dictionary-Based Lemmatization
Dictionary-based lemmatization relies on predefined dictionaries or lookup tables to map words to
their corresponding base forms or lemmas. Each word is matched against the dictionary entries to find
its lemma. This method is effective for languages with well-defined rules.
Suppose we have a dictionary with lemmatized forms for some words:
• ‘running’ -> ‘run’
• ‘better’ -> ‘good’
• ‘went’ -> ‘go’
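
A minimal sketch of dictionary-based lemmatization using the tiny lookup table above (real systems use far larger dictionaries):

# A small, hand-built lemma dictionary.
lemma_dict = {"running": "run", "better": "good", "went": "go"}

def dictionary_lemmatize(word):
    # Fall back to the word itself when it has no dictionary entry.
    return lemma_dict.get(word, word)

print(dictionary_lemmatize("went"))   # go
print(dictionary_lemmatize("table"))  # table (no entry, returned unchanged)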

---Part of Speech Tagging:-[Important]


• One of the core tasks in Natural Language Processing (NLP) is Parts of Speech (PoS) tagging, which involves giving each word in a text a grammatical category, such as noun, verb, adjective, or adverb.
• Through improved comprehension of phrase structure and semantics, this technique makes it
possible for machines to study and comprehend human language more accurately.
• Parts of Speech tagging is a linguistic activity in Natural Language Processing (NLP) wherein
each word in a document is given a particular part of speech (adverb, adjective, verb, etc.) or
grammatical category.
• Through the addition of a layer of syntactic and semantic information to the words, this
procedure makes it easier to comprehend the sentence’s structure and meaning.
• In NLP applications, POS tagging is useful for machine translation, named entity recognition,
and information extraction, among other things.
• It also works well for resolving ambiguity in words with multiple meanings and for revealing a sentence’s grammatical structure.

Default tagging is a basic step in part-of-speech tagging. It is performed using the DefaultTagger class.
The DefaultTagger class takes ‘tag’ as its single argument; NN is the tag for a singular noun.
DefaultTagger is most useful when it works with the most common part-of-speech tag, which is why the noun tag is recommended.
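
A minimal sketch using NLTK's DefaultTagger:

from nltk.tag import DefaultTagger

# Tag every token as a singular noun (NN), the most common POS tag.
tagger = DefaultTagger('NN')
print(tagger.tag(['Hello', 'World']))
# [('Hello', 'NN'), ('World', 'NN')]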

Example of POS Tagging


Consider the sentence: “The quick brown fox jumps over the lazy dog.”
After performing POS Tagging:
• “The” is tagged as determiner (DT)
• “quick” is tagged as adjective (JJ)
• “brown” is tagged as adjective (JJ)
• “fox” is tagged as noun (NN)
• “jumps” is tagged as verb (VBZ)
• “over” is tagged as preposition (IN)
• “the” is tagged as determiner (DT)
• “lazy” is tagged as adjective (JJ)
• “dog” is tagged as noun (NN)
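
This tagging can be reproduced with NLTK's pos_tag, as sketched below (tags may differ slightly between tagger models):

import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')  # tagger model

sentence = "The quick brown fox jumps over the lazy dog."
tokens = nltk.word_tokenize(sentence)

# Each token is paired with its predicted part-of-speech tag.
print(nltk.pos_tag(tokens))
# e.g. [('The', 'DT'), ('quick', 'JJ'), ..., ('jumps', 'VBZ'), ...]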

Need for Part-of-Speech (POS) Tagging in NLP:


1. Understanding Sentence Structure
POS tagging helps machines understand the grammatical structure of sentences, making it easier to
process human language.
Example:

• "She can fish."


o "Can" (Verb: ability) OR "Can" (Noun: container)?
o "Fish" (Verb: action) OR "Fish" (Noun: animal)?
o POS tagging helps distinguish the meaning based on sentence structure.
2. Improving Machine Translation
In languages like English and Hindi, word order differs. POS tagging helps align words correctly
when translating.
Example:

• English: "He is eating an apple."

• Hindi: "वह एक सेब खा रहा है ।"

o POS tagging ensures the correct word order in translation.


3. Sentiment Analysis
POS tagging helps identify adjectives, adverbs, and verbs, which play a key role in sentiment analysis.
Example:

• "The movie was absolutely amazing!"


o "amazing" (Adjective) → Strong positive sentiment.
o Without POS tagging, sentiment detection might fail.
4. Question Answering Systems & Chatbots
To generate relevant responses, AI chatbots need to understand what type of word each term is.
Example:

• "Who is the president of the USA?"


o "Who" (Interrogative Pronoun) signals a person is expected as an answer.
o POS tagging helps the system determine the right answer format.
5. Speech Recognition & Text-to-Speech (TTS)
POS tagging helps disambiguate words in spoken language where multiple pronunciations exist.
Example:

• "Lead" (Noun: metal) vs. "Lead" (Verb: to guide)


o POS tagging helps text-to-speech systems pronounce words correctly.

---------------------------------------------------Least Important---------------------------------------------------

----Are natural languages regular?


Answer: No, natural languages are not regular. This means that they cannot be fully described using
regular expressions or finite automata (FA). Instead, natural languages require more powerful models,
like Context-Free Grammars (CFGs) and Turing Machines, to handle their complexity.
Why Are Natural Languages Not Regular?
1. Long-Distance Dependencies
o Natural languages contain dependencies between words that are far apart in a
sentence.
o Example (Subject-Verb Agreement):
▪ "The dog that chased the cats that ate the fish is happy."
▪ The verb "is" must agree with "dog", not "cats", which is beyond the power
of finite automata.
2. Nested Structures (Recursion)
o Many natural languages allow recursive sentence structures (sentences inside
sentences).
o Example (Embedded Clauses):
▪ "The book that the student who won the award wrote is interesting."
▪ Regular languages cannot handle nested or recursive structures.
3. Cross-Serial Dependencies (Non-Regular Patterns)
o Some languages, like Swiss German, have cross-serial dependencies, which regular
languages cannot handle.
o Example:
▪ "Jan glaubt dass Maria den Hans das Buch geben will."
▪ In this sentence, the verbs match with specific nouns non-linearly, requiring a
more powerful grammar.
4. Center-Embedding (Cannot Be Processed by FA)
o Example:
▪ "The rat the cat the dog chased bit ran."
▪ This type of sentence structure cannot be parsed using a finite automaton.

Finite Automata for NLP


Although finite automata (FA) are not sufficient for modeling entire natural languages, they are still
useful for specific NLP tasks, such as:
1. Lexical Analysis (Tokenization)
o Breaking text into words, numbers, or symbols.
o Example: Converting "Hello, world!" → ["Hello", ",", "world", "!"].
o Regular expressions (which can be implemented using finite automata) work well for this; see the regex sketch after this list.
2. Part-of-Speech Tagging (Simple Cases)
o Example: Identifying words as noun, verb, adjective.
o Finite State Transducers (FSTs) can be used for simple tagging tasks.
3. Spell Checking
o Recognizing incorrect words and suggesting corrections.
o Finite automata help by storing a dictionary of correct words.
4. Keyword Search and Pattern Matching
o Used in search engines and text editors (e.g., grep in Unix).
o Regular expressions, implemented with finite-state automata, help in fast searching.
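
A minimal sketch of item 1 above using Python's re module (regular expressions of this kind correspond to finite automata):

import re

# "\w+" matches runs of word characters; "[^\w\s]" matches single
# punctuation marks, so "Hello, world!" splits into words and symbols.
print(re.findall(r"\w+|[^\w\s]", "Hello, world!"))
# ['Hello', ',', 'world', '!']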

