
UG Program in Computer Engineering

(Accredited by NBA for 3 years from AY 2022-23)

Natural Language Processing

MS. VANDANA SONI


SYLLABUS OF MODULE 2
Module 2: Lexical Analysis 12 Hours
2.1 Text Processing: Tokenization, Sentence Segmentation, Handling hyphenation, Stemming, Lemmatization;
Computational Morphology: Morphemes, Types of morphemes, Inflectional Morphology, Derivational Morphology, Word formation, Processing morphology;
2.2 Morphological Models: Dictionary lookup, Finite State Morphology; Morphological parsing with FST (Finite State Transducer), Orthographic rules and FST; Lexicon-free FST: Porter Stemmer algorithm;
2.3 Language modeling: N-grams and their variations: Bigram, Trigram; Simple (Unsmoothed) N-grams; N-gram language models, Probabilistic Language model, Markov Assumption, Maximum Likelihood, Sentence probability, Spelling correction: N-gram, edit distance; Evaluation of Language Models: Perplexity; Smoothing: Laplace Smoothing, Add-k smoothing
Introduction to Morphological Analysis

Morphology is the branch of linguistics concerned with the structure and form of words in a language. Morphological analysis, in the context of NLP, refers to the computational processing of word structures. It aims to break down words into their constituent parts, such as roots, prefixes, and suffixes, and understand their roles and meanings. This process is essential for various NLP tasks, including language modeling, text analysis, and machine translation.
Importance of Morphological Analysis
Morphological analysis is a critical step in NLP for several reasons:
1. Understanding Word Formation: It helps in identifying the basic building blocks of words,
which is crucial for language comprehension.
2. Improving Text Analysis: By breaking down words into their roots and affixes, it enhances
the accuracy of text analysis tasks like sentiment analysis and topic modeling.
3. Enhancing Language Models: Morphological analysis provides detailed insights into word
formation, improving the performance of language models used in tasks like speech
recognition and text generation.
4. Facilitating Multilingual Processing: It aids in handling the morphological diversity of
different languages, making NLP systems more robust and versatile.
Natural Language Processing

Natural language processing (NLP) is a field of computer science and a subfield of artificial intelligence that aims to make computers understand human language. NLP uses computational linguistics, which is the study of how language works, and various models based on statistics, machine learning, and deep learning. These technologies allow computers to analyze and process text or voice data, and to grasp their full meaning, including the speaker's or writer's intentions and emotions.

What is Tokenization

Tokenization is the process of converting text or data into smaller units, called tokens. These tokens can be words, subwords, characters, or even sentences, depending on the level of granularity chosen for tokenization. The idea is to break down complex data (such as natural language text) into manageable, discrete pieces that can be more easily processed, analyzed, or understood by computers, especially in fields like natural language processing (NLP) and machine learning.
Tokenization in NLP

In the context of Natural Language Processing (NLP):
● Word Tokenization: The process of splitting a sentence into words. For example, the sentence "I love pizza" would be tokenized into the words ["I", "love", "pizza"].
● Subword Tokenization: Sometimes words are further broken down into smaller units, such as prefixes, suffixes, or even parts of words (e.g., breaking "unhappiness" into ["un", "happi", "ness"]). This is common in models like BERT or GPT.
● Character Tokenization: In some cases, individual characters are treated as tokens. For instance, "hello" could be tokenized as ["h", "e", "l", "l", "o"].
● Sentence Tokenization: Breaking a document or passage of text into sentences. For example, "Hello! How are you?" would be tokenized into the sentences ["Hello!", "How are you?"].
A minimal code sketch of these four granularities follows below.

Sentence segmentation is the process of breaking a passage of text down into individual sentences. The usual cue is sentence-ending punctuation (".", "!", "?"), but punctuation alone is ambiguous: a period may mark an abbreviation ("Dr.", "U.S.") or a decimal number rather than a sentence boundary, so practical segmenters combine punctuation rules with contextual cues.

Here are some sample sentences for sentence segmentation:


● "I ate an apple"
● "We went to the store"
● "My dog likes bones"
● "I like cheese pizza"
● "We went to the party"
● "My cat likes to climb"
● "I have a brother"
● "I drank chocolate milk"
Commands to install this library (en_core_web_sm is the small English core model, downloaded online):

pip install spacy
python -m spacy download en_core_web_sm

# import spacy library
import spacy

# load core English model
nlp = spacy.load("en_core_web_sm")

# take a unicode string (the u prefix stands for unicode)
doc = nlp(u"I Love Coding. Sakec College help to us. I am so happy")

# print the segmented sentences
for sent in doc.sents:
    print(sent)
Handling hyphenation
1. Word Splitting at Line Breaks:
○ When a word is split across two lines due to space constraints, it's important to merge the parts back together.
○ Example: "hyphen-" at the end of one line followed by "ation" at the start of the next should be rejoined as "hyphenation" (without the hyphen and line break); see the code sketch after this list.


2. Hyphenated Compound Words:
○ Compound words that use hyphens to combine two or more words into one, such as "high-quality" or "well-known".
○ Example: "He is a well-known scientist."
Should be recognized as "well-known" rather than two separate tokens ("well" and "known").
3. Hyphenation in Foreign Words:
○ Some foreign words or names may use hyphens as part of their structure, and these should not be split or misunderstood by
NLP models.
○ Example: "Franco-British" or "co-op".
4. Hyphens as Separators in Lists or Numbers:
○ Hyphens might appear in numeric expressions, phone numbers, or lists.
○ Example: "A 10-15% increase in profits."
Stemming is a text normalization technique used in Natural Language Processing (NLP) that reduces words to their root or base form. The purpose of stemming is to treat different forms of a word (e.g., "running", "runs", "runner") as a single entity, making it easier to process and analyze the underlying concepts. The resulting stem need not be a valid dictionary word, and irregular forms (e.g., "ran") are generally not handled by stemming; that requires lemmatization.
For example:
● "running" → "run"
● "happiness" → "happi"
● "connection" → "connect"
Lemmatization

Lemmatization is a text pre-processing technique that reduces words to their root form, or lemma, to make them easier to analyze. Unlike a stem, a lemma is always a valid dictionary word.

Lemmatization Techniques
1. Rule-Based Lemmatization
2. Dictionary-Based Lemmatization
3. Machine Learning-Based Lemmatization
1. Rule-Based Lemmatization

Rule-based lemmatization involves the application of predefined rules to derive the base or root form of a word. Unlike machine learning-based approaches, which learn from data, rule-based lemmatization relies on linguistic rules and patterns.
Here's a simplified example of rule-based lemmatization for English verbs:
Rule: For regular verbs ending in "-ed," remove the "-ed" suffix.
Example:
● Word: "walked"
● Rule Application: Remove "-ed"
● Result: "walk"
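
A minimal sketch of this single rule in Python (a toy illustration: real rule-based lemmatizers also check the part of speech and handle exceptions):

def rule_based_lemmatize(word):
    # Rule: for regular verbs ending in "-ed", remove the "-ed" suffix
    if word.endswith("ed") and len(word) > 4:
        return word[:-2]
    return word

print(rule_based_lemmatize("walked"))   # walk
print(rule_based_lemmatize("jumped"))   # jump
# Note the rule over-applies to words like "raced" -> "rac",
# which is why dictionaries or learned models are also used.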
2. Dictionary-Based Lemmatization

Dictionary-based lemmatization relies on predefined dictionaries or lookup tables to map words to their corresponding base forms or lemmas. Each word is matched against the dictionary entries to find its lemma. This method is effective for languages with well-defined rules.
Suppose we have a dictionary with lemmatized forms for some words:
● 'running' -> 'run'
● 'better' -> 'good'
● 'went' -> 'go'
When we apply dictionary-based lemmatization to a text like "I was running to become a better athlete, and then I went home," the resulting lemmatized form would be: "I was run to become a good athlete, and then I go home."
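
The same example as a Python sketch (unknown words are simply passed through unchanged):

lemma_dict = {"running": "run", "better": "good", "went": "go"}

def dictionary_lemmatize(text):
    # look each word up in the table; keep it unchanged if absent
    return " ".join(lemma_dict.get(w, w) for w in text.split())

text = "I was running to become a better athlete, and then I went home"
print(dictionary_lemmatize(text))
# I was run to become a good athlete, and then I go home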
3. Machine Learning-Based Lemmatization

Machine learning-based lemmatization leverages computational models to automatically learn the relationships between words and their base forms. Unlike rule-based or dictionary-based approaches, machine learning models, such as neural networks or statistical models, are trained on large text datasets to generalize patterns in language.
Example:
Consider a machine learning-based lemmatizer trained on diverse texts. When encountering the word 'went,' the model, having learned patterns, predicts the base form as 'go.' Similarly, for 'happier,' the model deduces 'happy' as the lemma. The advantage lies in the model's ability to adapt to varied linguistic nuances and handle irregularities, making it robust for lemmatizing diverse vocabularies.
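
In practice, a trained pipeline such as the spaCy model loaded earlier exposes lemmas via token.lemma_ (note that spaCy's English lemmatizer mixes lookup tables and rules under the hood, so this only approximates a purely learned lemmatizer):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("I went home feeling happier")
for token in doc:
    print(token.text, "->", token.lemma_)
# e.g. went -> go, happier -> happy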
Computational Morphology: Morphemes, Types of morphemes, Inflectional Morphology, Derivational Morphology, Word formation, Processing morphology
What are morphemes?
● Morphemes are the smallest units of language that have
meaning.
● They are the building blocks of words.
● For example, the word "dogs" has two morphemes: "dog" and "-s".
● "-s" is a plural marker on the noun.
Inflectional morphology
● Adds suffixes to nouns, verbs, adjectives, or adverbs to indicate grammatical properties like tense, number, possession, or comparison
● Does not change the part of speech or meaning of the word
● Ensures that a word is in the correct form to make a sentence grammatically correct
Derivational morphology
● Adds prefixes or suffixes to create new words or change the
meaning or grammatical class of existing words

● Often changes the part of speech of a word


Examples
● Derivational morphemes
○ -ness: Creates a noun, as in "kindness" from "kind"
○ -ment: Creates a noun, as in "development" from "develop"
○ -ize: Creates a verb, as in "modernize" from "modern"
○ -less: Creates an adjective, as in "hopeless" from "hope"
○ -ful: Creates an adjective, as in "bountiful" from "bounty"
Inflectional morphemes
● -s: Plural, as in "cats"
● -ed: Past tense, as in "raced"
● -ing: Present participle, as in "racing"
● -'s: Possessive, as in "Alex's"
● -er: Comparative, as in "faster"
● -est: Superlative, as in "fastest"
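
Putting the two lists together, "processing morphology" can be sketched as stripping inflectional suffixes from a word form (a toy illustration; real morphological analyzers, such as the FST-based ones in 2.2, also model orthographic changes like "racing" -> "race" + "-ing"):

# toy list of the inflectional suffixes above, longest first
SUFFIXES = ["est", "ing", "'s", "ed", "er", "s"]

def analyze(word):
    # return (stem, suffix) if an inflectional suffix is found
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)], "-" + suffix
    return word, ""

for w in ["cats", "walked", "faster", "fastest"]:
    print(w, "->", analyze(w))
# cats -> ('cat', '-s')
# walked -> ('walk', '-ed')
# faster -> ('fast', '-er')
# fastest -> ('fast', '-est')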
2.3 Language modeling: N-grams and their variations: Bigram, Trigram; Simple (Unsmoothed) N-grams; N-gram language models, Probabilistic Language model, Markov Assumption, Maximum Likelihood, Sentence probability, Spelling correction: N-gram, edit distance; Evaluation of Language Models: Perplexity; Smoothing: Laplace Smoothing, Add-k smoothing
Worked example (maximum likelihood estimate for a bigram): if "played" occurs 3 times in a corpus and the bigram "played with" occurs once, then
P(with | played) = count("played with") / count("played") = 1/3 = 0.33
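
A minimal sketch of this computation over an assumed toy corpus, including the Laplace (add-one) smoothed variant from the syllabus:

from collections import Counter

# toy corpus chosen so that "played" occurs 3 times and "played with" once
corpus = ["they played with the ball",
          "she played cricket",
          "we played well"]

unigrams = Counter()
bigrams = Counter()
for sentence in corpus:
    words = sentence.split()
    unigrams.update(words)
    bigrams.update(zip(words, words[1:]))

# Maximum likelihood estimate: P(w2 | w1) = count(w1 w2) / count(w1)
p_mle = bigrams[("played", "with")] / unigrams["played"]
print(p_mle)        # 0.333... = 1/3

# Laplace (add-one) smoothing: P(w2 | w1) = (count(w1 w2) + 1) / (count(w1) + V)
V = len(unigrams)   # vocabulary size (9 distinct words here)
p_laplace = (bigrams[("played", "with")] + 1) / (unigrams["played"] + V)
print(p_laplace)    # (1 + 1) / (3 + 9) = 0.1667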
