Unit I
---Challenges in NLP:
Despite its advancements, NLP faces several challenges due to the complexity and variability of
human language:
1. Ambiguity
o Words and sentences can have multiple meanings.
o Example: "I saw the man with a telescope." (Who has the telescope?)
2. Context Understanding
o Some sentences require prior knowledge to be correctly interpreted.
o Example: "She put the book on the table and sat on it." (Did she sit on the table or the
book?)
3. Idioms and Sarcasm
o Figurative language and sarcasm can be difficult for machines to recognize.
o Example: "Oh, great! Another traffic jam." (The tone is negative, even though the
words suggest something positive.)
4. Multiple Languages and Dialects
o NLP models need to be trained on different languages, dialects, and writing styles.
5. Slang and Informal Language
o Social media posts and informal conversations often include slang, abbreviations, and
emojis.
o Example: "LOL, that’s lit! " (Understanding this requires cultural and contextual
knowledge.)
---Programming Languages vs. Natural Languages [Important]
---Stages of NLP: [Important]
Natural Language Processing (NLP) is a field within artificial intelligence that allows computers to
comprehend, analyze, and interact with human language effectively.
The process of NLP can be divided into five distinct phases: Lexical Analysis, Syntactic Analysis,
Semantic Analysis, Discourse Integration, and Pragmatic Analysis.
Each phase plays a crucial role in the overall understanding and processing of natural language.
-Types of Tokenization
Tokenization can be classified into several types based on how the text is segmented. Here are some
types of tokenization:
1.Word Tokenization:
Word tokenization divides the text into individual words. Many NLP tasks use this approach, in which
words are treated as the basic units of meaning.
Example:
Input: "Tokenization is an important NLP task."
Output: ["Tokenization", "is", "an", "important", "NLP", "task", "."]
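The example above can be reproduced with a minimal regex-based sketch (real systems typically use a library tokenizer such as NLTK's word_tokenize, which handles many more edge cases):

```python
import re

def word_tokenize(text):
    # Match either a run of word characters or a single
    # punctuation mark, so punctuation becomes its own token.
    return re.findall(r"\w+|[^\w\s]", text)

print(word_tokenize("Tokenization is an important NLP task."))
# → ['Tokenization', 'is', 'an', 'important', 'NLP', 'task', '.']
```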
2.Sentence Tokenization:
The text is segmented into sentences during sentence tokenization. This is useful for tasks requiring
individual sentence analysis or processing.
Example:
Input: "Tokenization is an important NLP task. It helps break down text into smaller units."
Output: ["Tokenization is an important NLP task.", "It helps break down text into smaller units."]
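A minimal sketch of sentence tokenization splits after sentence-ending punctuation. This toy rule breaks on abbreviations like "Dr."; trained sentence tokenizers (e.g., NLTK's Punkt) handle those cases:

```python
import re

def sent_tokenize(text):
    # Split wherever whitespace follows ., !, or ?
    # (lookbehind keeps the punctuation with its sentence).
    return re.split(r"(?<=[.!?])\s+", text.strip())

text = ("Tokenization is an important NLP task. "
        "It helps break down text into smaller units.")
print(sent_tokenize(text))
# → ['Tokenization is an important NLP task.',
#    'It helps break down text into smaller units.']
```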
3.Subword Tokenization:
Subword tokenization entails breaking down words into smaller units, which can be especially useful
when dealing with morphologically rich languages or rare words.
Example: Input: "tokenization"
Output: ["token", "ization"]
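Subword tokenizers such as BPE or WordPiece learn their vocabulary from data; the segmentation step itself can be sketched as a greedy longest-match against a (here, hand-picked) vocabulary:

```python
def subword_tokenize(word, vocab):
    # Greedy longest-match-first segmentation: at each position,
    # take the longest vocabulary piece, falling back to a single
    # character when nothing matches.
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            pieces.append(word[i])
            i += 1
    return pieces

vocab = {"token", "ization", "izat", "ion"}  # toy vocabulary for illustration
print(subword_tokenize("tokenization", vocab))
# → ['token', 'ization']
```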
4.Character Tokenization:
This process divides the text into individual characters, which is useful for character-level
language modelling.
Example: Input: "Tokenization"
Output: ["T", "o", "k", "e", "n", "i", "z", "a", "t", "i", "o", "n"]
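In Python, character tokenization is simply a matter of converting the string to a list:

```python
text = "Tokenization"

# Each character becomes its own token.
print(list(text))
# → ['T', 'o', 'k', 'e', 'n', 'i', 'z', 'a', 't', 'i', 'o', 'n']
```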
--Stemming:-
• Stemming is a method in text processing that eliminates prefixes and suffixes from words,
transforming them into their fundamental or root form.
• The main objective of stemming is to streamline and standardize words, enhancing the
effectiveness of the natural language processing tasks.
• Simplifying words to their most basic form is called stemming, and it is made easier by stemmers
or stemming algorithms. For example, “chocolates” becomes “chocolate” and “retrieval” becomes
“retrieve.”
• Stemming in natural language processing reduces words to their base or root form, aiding in text
normalization for easier processing.
• This technique is crucial in tasks like text classification, information retrieval, and text
summarization.
• It is important to note that stemming is different from Lemmatization. Lemmatization is the
process of reducing a word to its base form, but unlike stemming, it takes into account the context
of the word, and it produces a valid word, unlike stemming which may produce a non-word as the
root form.
Some more examples of words that stem to the root word "like" include:
->"likes"
->"liked"
->"likely"
->"liking"
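The idea can be sketched with a toy suffix stripper (not a real stemming algorithm such as Porter's; in practice one would use, e.g., NLTK's PorterStemmer). Note that outputs like "lik" illustrate that a stem need not be a valid word:

```python
def simple_stem(word):
    # Toy rule: strip the first matching suffix, but only if a
    # reasonably long stem (3+ characters) remains.
    for suffix in ("ing", "ly", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ("likes", "liked", "likely", "liking"):
    print(w, "->", simple_stem(w))
# likes -> like, liked -> lik, likely -> like, liking -> lik
```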
--Lemmatization:-
• Lemmatization is a fundamental text pre-processing technique widely applied in natural
language processing (NLP) and machine learning.
• Lemmatization is the process of grouping together the different inflected forms of a word so they
can be analyzed as a single item.
• Lemmatization is similar to stemming but it brings context to the words. So, it links words with
similar meanings to one word.
• Text preprocessing includes both stemming and lemmatization. The two terms are often
confused, and some treat them as the same. Lemmatization is generally preferred over
stemming because it performs morphological analysis of the words.
Examples of lemmatization:
-> rocks : rock
-> corpora : corpus
-> better : good
Lemmatization Techniques
1. Rule Based Lemmatization
Rule-based lemmatization involves the application of predefined rules to derive the base or root form
of a word. Unlike machine learning-based approaches, which learn from data, rule-based
lemmatization relies on linguistic rules and patterns.
Example:
• Word: “walked”
• Rule Application: Remove “-ed”
• Result: “walk”
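The "walked" example can be sketched with a single hypothetical rule; a real rule-based lemmatizer needs many such rules plus exception lists (irregular forms like "went" would not match):

```python
def rule_lemmatize(word):
    # Toy rule: strip "-ed" from regular past-tense verbs,
    # keeping at least a 3-character base.
    if word.endswith("ed") and len(word) > 4:
        return word[:-2]
    return word

print(rule_lemmatize("walked"))  # → walk
print(rule_lemmatize("went"))    # → went (irregular, rule does not apply)
```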
2. Dictionary-Based Lemmatization
Dictionary-based lemmatization relies on predefined dictionaries or lookup tables to map words to
their corresponding base forms or lemmas. Each word is matched against the dictionary entries to find
its lemma. This method is effective for languages with well-defined rules.
Suppose we have a dictionary with lemmatized forms for some words:
• ‘running’ -> ‘run’
• ‘better’ -> ‘good’
• ‘went’ -> ‘go’
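A lookup-based sketch of this approach, using the toy dictionary above (real systems use large lexicons such as WordNet):

```python
# Toy lookup table mirroring the entries above.
lemma_dict = {"running": "run", "better": "good", "went": "go"}

def dict_lemmatize(word):
    # Fall back to the word itself when it is not in the dictionary.
    return lemma_dict.get(word.lower(), word)

print(dict_lemmatize("went"))   # → go
print(dict_lemmatize("table"))  # → table (not in dictionary)
```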
--Default Tagging:-
Default tagging is a basic step in part-of-speech tagging. It is performed using the DefaultTagger
class.
The DefaultTagger class takes the tag to assign as a single argument; ‘NN’ is the tag for a singular
noun.
DefaultTagger is most useful as a fallback that assigns the most common part-of-speech tag, which is
why the noun tag is recommended.
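Assuming NLTK is installed, default tagging looks like this (every token receives ‘NN’ regardless of context):

```python
from nltk.tag import DefaultTagger

# Tag every token with 'NN'; DefaultTagger is typically used as
# the last fallback in a backoff chain of taggers.
tagger = DefaultTagger("NN")
print(tagger.tag(["Hello", "World"]))
# → [('Hello', 'NN'), ('World', 'NN')]
```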