NLP m2

The document outlines the syllabus for a module on Lexical Analysis in a Computer Engineering program, focusing on Natural Language Processing (NLP). It covers key topics such as tokenization, morphological analysis, language modeling, and techniques like stemming and lemmatization. The importance of morphological analysis in enhancing text comprehension and language model performance is emphasized throughout the document.

Uploaded by

bhavinjain408


UG Program in Computer Engineering

(Accredited by NBA for 3 years from AY 2022-23)

Natural Language Processing

MS. VANDANA SONI

SYLLABUS OF MODULE 2
Module 2: Lexical Analysis (12 Hours)
2.1 Text Processing: Tokenization, Sentence Segmentation, Handling hyphenation, Stemming, Lemmatization; Computational Morphology: Morphemes, Types of morphemes, Inflectional Morphology, Derivational Morphology, Word formation, Processing morphology;
2.2 Morphological Models: Dictionary lookup, Finite State Morphology; Morphological parsing with FST (Finite State Transducer), Orthographic rules and FST; Lexicon-free FST: Porter Stemmer algorithm;
2.3 Language modeling: Grams and its variation: Bigram, Trigram; Simple (Unsmoothed) N-grams; N-gram language models, Probabilistic Language model, Markov Assumption, Maximum Likelihood, Sentence probability, Spelling correction: N-gram, edit distance; Evaluation of Language Models: Perplexity; Smoothing: Laplace Smoothing, Add-k smoothing
Introduction to Morphological Analysis

Morphology is the branch of linguistics concerned with the structure and form of words in a language. Morphological analysis, in the context of NLP, refers to the computational processing of word structures. It aims to break down words into their constituent parts, such as roots, prefixes, and suffixes, and understand their roles and meanings. This process is essential for various NLP tasks, including language modeling, text analysis, and machine translation.
Importance of Morphological Analysis
Morphological analysis is a critical step in NLP for several reasons:
1. Understanding Word Formation: It helps in identifying the basic building blocks of words,
which is crucial for language comprehension.
2. Improving Text Analysis: By breaking down words into their roots and affixes, it enhances
the accuracy of text analysis tasks like sentiment analysis and topic modeling.
3. Enhancing Language Models: Morphological analysis provides detailed insights into word
formation, improving the performance of language models used in tasks like speech
recognition and text generation.
4. Facilitating Multilingual Processing: It aids in handling the morphological diversity of
different languages, making NLP systems more robust and versatile.
Natural Language Processing
Natural language processing (NLP) is a field of computer science and a subfield of artificial intelligence that aims to make computers understand human language. NLP uses computational linguistics, which is the study of how language works, and various models based on statistics, machine learning, and deep learning. These technologies allow computers to analyze and process text or voice data, and to grasp their full meaning, including the speaker’s or writer’s intentions and emotions.

What is Tokenization
Tokenization is the process of converting text or data into smaller units, called tokens. These tokens can be words, subwords, characters, or even sentences, depending on the level of granularity chosen for tokenization. The idea is to break down complex data (such as natural language text) into manageable, discrete pieces that can be more easily processed, analyzed, or understood by computers, especially in fields like natural language processing (NLP) and machine learning.
Tokenization in NLP

In the context of Natural Language Processing (NLP):
● Word Tokenization: The process of splitting a sentence into words. For example, the sentence "I love pizza" would be tokenized into the words ["I", "love", "pizza"].
● Subword Tokenization: Sometimes words are further broken down into smaller units, such as prefixes, suffixes, or even parts of words (e.g., breaking "unhappiness" into ["un", "happi", "ness"]). This is common in models like BERT or GPT, whose learned subword vocabularies may split a given word differently.
● Character Tokenization: In some cases, individual characters are treated as tokens. For instance, "hello" could be tokenized as ["h", "e", "l", "l", "o"].
● Sentence Tokenization: Breaking a document or passage of text into sentences. For example, "Hello! How are you?" would be tokenized into the sentences ["Hello!", "How are you?"].
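The word, character, and sentence levels above can be sketched with regular expressions alone. This is a minimal illustration, not a production tokenizer: the function names and the patterns are assumptions made for this example, and subword tokenization is left out because it needs a learned vocabulary.

```python
import re

def word_tokenize(text):
    # Words are runs of letters, digits, or apostrophes;
    # every other non-space character becomes its own token.
    return re.findall(r"[A-Za-z0-9']+|[^\sA-Za-z0-9']", text)

def char_tokenize(word):
    # Each character is a token.
    return list(word)

def sentence_tokenize(text):
    # Split on whitespace that follows ., ! or ? (a rough heuristic
    # that will mis-split abbreviations such as "Dr. Smith").
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

print(word_tokenize("I love pizza"))             # ['I', 'love', 'pizza']
print(char_tokenize("hello"))                    # ['h', 'e', 'l', 'l', 'o']
print(sentence_tokenize("Hello! How are you?"))  # ['Hello!', 'How are you?']
```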

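Sentence segmentation (sentence boundary detection) is the process of dividing running text into individual sentences, typically using punctuation, capitalization, and abbreviation cues. The term is also used in early-literacy teaching for counting the words in a sentence; the classroom activities below come from that usage: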
● Counting words: Students can count the words in a sentence and use their fingers to map the words.
● Using counters: Students can use counters to represent each word in a sentence.
● Using clothespins: Students can count the words in a sentence and clip a clothespin on the correct number.
● Building sentences: Students can use word cards to build sentences.
● Sentence segmentation games: Students can use a game board and marker to count the words in a
sentence and move their marker that many spaces.

Here are some sample sentences for sentence segmentation:

● "I ate an apple"
● "We went to the store"
● "My dog likes bones"
● "I like cheese pizza"
● "We went to the party"
● "My cat likes to climb"
● "I have a brother"
● "I drank chocolate milk"
Commands to install the library and download the English model:

    pip install spacy
    python -m spacy download en_core_web_sm

Here en_core_web_sm is the small core English language model, available for download online.

    # import the spaCy library
    import spacy

    # load the core English model
    nlp = spacy.load("en_core_web_sm")

    # take a Unicode string
    # (the u prefix stands for Unicode; it is optional in Python 3)
    doc = nlp(u"I Love Coding. Sakec College help to us .i am so happy")

    # print each detected sentence
    for sent in doc.sents:
        print(sent)
Handling hyphenation
1. Word Splitting at Line Breaks:
○ When a word is split across two lines due to space constraints, it’s important to merge the parts back together.

○ Example: a word broken as "hyphen-" at the end of one line and "ation" at the start of the next should be restored to "hyphenation" (without the line break).
2. Hyphenated Compound Words:
○ Compound words that use hyphens to combine two or more words into one, such as "high-quality" or "well-known".
○ Example: "He is a well-known scientist."
Should be recognized as "well-known" rather than two separate tokens ("well" and "known").
3. Hyphenation in Foreign Words:
○ Some foreign words or names may use hyphens as part of their structure, and these should not be split or misunderstood by
NLP models.
○ Example: "Franco-British" or "co-op".
4. Hyphens as Separators in Lists or Numbers:
○ Hyphens might appear in numeric expressions, phone numbers, or lists.
○ Example: "A 10-15% increase in profits."
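Cases 1, 2, and 4 above can be sketched with two regular expressions. This is a minimal illustration under stated assumptions: the function names and the token pattern are invented for this example, and a hyphen directly before a line break is always treated as a layout artifact.

```python
import re

def merge_line_break_hyphens(text):
    # "hyphen-\nation" -> "hyphenation": drop a hyphen that sits directly
    # before a line break, together with the break itself.
    return re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", text)

# A token is a run of word characters, optionally joined by internal hyphens,
# so "well-known" and "10-15" stay whole; any other non-space character
# (such as "%" or ".") is its own token.
TOKEN = re.compile(r"\w+(?:-\w+)*|[^\w\s]")

def tokenize_keep_hyphens(text):
    return TOKEN.findall(merge_line_break_hyphens(text))

print(merge_line_break_hyphens("hyphen-\nation"))
print(tokenize_keep_hyphens("He is a well-known scientist."))
print(tokenize_keep_hyphens("A 10-15% increase in profits."))
```

One known limitation of this sketch: a genuine compound that happens to break at its hyphen ("well-" / "known") is merged into "wellknown", so a real system would check the merged form against a lexicon.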
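Stemming is a text normalization technique used in Natural Language Processing (NLP) that reduces words to a root or base form (the stem) by stripping affixes with heuristic rules. The purpose of stemming is to treat different inflected forms of a word (e.g., "running", "runs") as a single entity, making it easier to process and analyze the underlying concepts. The stem need not be a dictionary word, and irregular forms such as "ran" or "better" are not mapped to "run" or "good"; that requires lemmatization.
For example (Porter stemmer):
● "running" → "run"
● "happiness" → "happi"
● "better" → "better" (unchanged; "good" is its lemma, not its stem)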
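A minimal sketch of suffix-stripping stemming follows. It assumes a small hand-picked suffix list and is far simpler than the Porter stemmer named in the syllabus; the names `SUFFIXES` and `simple_stem` are hypothetical.

```python
# Illustrative suffix list, checked in this order (not the Porter rule set).
SUFFIXES = ("ing", "ed", "es", "s")

def simple_stem(word):
    word = word.lower()
    for suffix in SUFFIXES:
        # Strip a suffix only if at least three characters remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # Undo consonant doubling ("runn" -> "run"), but keep
            # doubled l, s, z as in "pass".
            if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeioulsz":
                stem = stem[:-1]
            return stem
    return word

for w in ["running", "walked", "cats", "ran", "better"]:
    print(w, "->", simple_stem(w))
```

Because the rules only look at spelling, "ran" and "better" come back unchanged: relating them to "run" and "good" needs a dictionary, which is the job of lemmatization.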
Lemmatization
Lemmatization is a text pre-processing technique that reduces words to their dictionary form, or lemma, to make them easier to analyze. Unlike a stem, a lemma is always a valid word (for example, "went" → "go").
Lemmatization Techniques

1. Rule Based Lemmatization

2. Dictionary-Based
Lemmatization

3. Machine Learning-Based
Lemmatization
1. Rule Based Lemmatization

Rule-based lemmatization involves the application of predefined rules to derive the base or root form of a word. Unlike machine learning-based approaches, which learn from data, rule-based lemmatization relies on linguistic rules and patterns.
Here’s a simplified example of rule-based lemmatization for English verbs:
Rule: For regular verbs ending in “-ed,” remove the “-ed” suffix.
Example:
● Word: “walked”
● Rule Application: Remove “-ed”
● Result: “walk”
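The “-ed” rule above can be sketched in a few lines. The function name and the minimum-length guard are assumptions added for this illustration, not part of the rule as stated.

```python
def lemmatize_regular_past(word):
    # Rule from the text: strip "-ed" from regular past-tense verbs.
    # The length guard (an added assumption) keeps short words
    # such as "red" or "need" from being truncated.
    if word.endswith("ed") and len(word) > 4:
        return word[:-2]
    return word

print(lemmatize_regular_past("walked"))  # walk
print(lemmatize_regular_past("played"))  # play
print(lemmatize_regular_past("need"))    # need
```

A single rule is deliberately naive: "raced" would come out as "rac", so practical rule sets add further rules (for example, restoring a final "e") and exception lists.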
2. Dictionary-Based
Lemmatization
Dictionary-based lemmatization relies on predefined dictionaries or lookup tables to map words to their corresponding base forms or lemmas. Each word is matched against the dictionary entries to find its lemma. This method handles irregular forms (such as ‘went’ -> ‘go’) reliably, but it only covers words that appear in the dictionary.
Suppose we have a dictionary with lemmatized forms for some words:
● ‘running’ -> ‘run’
● ‘better’ -> ‘good’
● ‘went’ -> ‘go’

When we apply dictionary-based lemmatization to a text like “I was running to become a better athlete, and then I went home,” the resulting lemmatized form would be: “I was run to become a good athlete, and then I go home.”
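The lookup described above can be sketched with a plain Python dictionary. The three entries come from the example; the names `LEMMA_DICT` and `lemmatize_text` are hypothetical.

```python
import re

# The three-entry dictionary from the example above.
LEMMA_DICT = {"running": "run", "better": "good", "went": "go"}

def lemmatize_text(text, lemma_dict=LEMMA_DICT):
    # Replace every alphabetic word that has a dictionary entry;
    # words without an entry and punctuation are left untouched.
    def lookup(match):
        word = match.group(0)
        return lemma_dict.get(word.lower(), word)
    return re.sub(r"[A-Za-z]+", lookup, text)

print(lemmatize_text("I was running to become a better athlete, and then I went home."))
# I was run to become a good athlete, and then I go home.
```

Note that "was" is left unchanged only because the toy dictionary has no entry for it; a fuller dictionary would map it to "be".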
3. Machine Learning-Based
Lemmatization

Machine learning-based lemmatization leverages computational models to automatically learn the relationships between words and their base forms. Unlike rule-based or dictionary-based approaches, machine learning models, such as neural networks or statistical models, are trained on large text datasets to generalize patterns in language.
Example:
Consider a machine learning-based lemmatizer trained on diverse texts.
When encountering the word ‘went,’ the model, having learned patterns,
predicts the base form as ‘go.’ Similarly, for ‘happier,’ the model deduces
‘happy’ as the lemma. The advantage lies in the model’s ability to adapt
to varied linguistic nuances and handle irregularities, making it robust for
lemmatizing diverse vocabularies.
Computational Morphology: Morphemes, Types of morphemes, Inflectional Morphology, Derivational Morphology, Word formation, Processing morphology
What are morphemes?
● Morphemes are the smallest units of language that have
meaning.
● They are the building blocks of words.
● For example, the word "dogs" has two morphemes: "dog"
and "s".
● The suffix "-s" marks the plural of the noun.
Inflectional morphology
● Adds suffixes to nouns, verbs, adjectives, or adverbs to indicate grammatical properties like tense, number, possession, or comparison

● Does not change the part of speech or core meaning of the word

● Ensures that a word is in the correct form to make a sentence grammatically correct
Derivational morphology
● Adds prefixes or suffixes to create new words or change the
meaning or grammatical class of existing words

● Often changes the part of speech of a word

Examples
● Derivational morphemes
○ -ness: Creates a noun, as in "kindness" from "kind"
○ -ment: Creates a noun, as in "development" from "develop"
○ -ize: Creates a verb, as in "modernize" from "modern"
○ -less: Creates an adjective, as in "hopeless" from "hope"
○ -ful: Creates an adjective, as in "bountiful" from "bounty"
● Inflectional morphemes
○ -s: Plural, as in "cats"
○ -ed: Past tense, as in "raced"
○ -ing: Present participle, as in "racing"
○ -'s: Possessive, as in "Alex's"
○ -er: Comparative, as in "faster"
○ -est: Superlative, as in "fastest"
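The two suffix lists above can be turned into a small lookup that splits a word into stem and suffix and reports the suffix type. This is a minimal sketch based on spelling only; the table layout and the name `analyze` are assumptions for illustration, and no lexicon or spelling rules are applied.

```python
# Suffix -> what it creates or marks, taken from the two lists above.
DERIVATIONAL = {"ness": "noun", "ment": "noun", "ize": "verb",
                "less": "adjective", "ful": "adjective"}
INFLECTIONAL = {"'s": "possessive", "est": "superlative",
                "ing": "present participle", "ed": "past tense",
                "er": "comparative", "s": "plural"}

def analyze(word):
    table = [(s, "derivational", f) for s, f in DERIVATIONAL.items()]
    table += [(s, "inflectional", f) for s, f in INFLECTIONAL.items()]
    # Try longer suffixes first so "kindness" matches "-ness", not "-s".
    for suffix, kind, function in sorted(table, key=lambda e: len(e[0]), reverse=True):
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return (word[: -len(suffix)], suffix, kind, function)
    return (word, "", "none", "")

for w in ["kindness", "hopeless", "cats", "fastest", "Alex's"]:
    print(analyze(w))
```

Because the match is purely orthographic, "racing" is split as "rac" + "ing" and a word like "bus" is wrongly split as "bu" + "s"; a lexicon of known stems and orthographic rules, as in the finite-state approaches listed in the syllabus, would be needed to avoid this.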
2.3 Language modeling: Grams and its variation: Bigram, Trigram; Simple (Unsmoothed) N-grams; N-gram language models, Probabilistic Language model, Markov Assumption, Maximum Likelihood, Sentence probability, Spelling correction: N-gram, edit distance; Evaluation of Language Models: Perplexity; Smoothing: Laplace Smoothing, Add-k smoothing
Played With: 1/3 ≈ 0.33
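The fraction above has the shape of a maximum-likelihood bigram estimate, P(word | prev) = count(prev word) / count(prev). The sketch below is an assumption-laden illustration: the three-sentence corpus is made up for this example (it is not from the slides) and is chosen so that "played" occurs three times and is followed by "with" once. No sentence-boundary markers or smoothing are used.

```python
from collections import Counter

def bigram_mle(sentences):
    # Count unigrams and adjacent word pairs over whitespace tokens.
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))

    def prob(prev, word):
        # Unsmoothed MLE: count(prev word) / count(prev); 0.0 if prev is unseen.
        return bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0

    return prob

# Hypothetical toy corpus, invented for illustration.
corpus = ["she played with the ball", "he played football", "they played outside"]
prob = bigram_mle(corpus)
print(round(prob("played", "with"), 2))  # 0.33
```

An unseen pair such as ("played", "chess") gets probability 0 here, which is the problem that the Laplace and add-k smoothing listed in the syllabus address.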
