Module 1

The document outlines a course on Introduction to Natural Language Processing (NLP), detailing its objectives, modules, and applications. It covers essential topics such as text preprocessing, tokenization, and the importance of NLP in AI, highlighting various applications like chatbots, sentiment analysis, and machine translation. Additionally, it emphasizes the significance of text filtration and script validation in enhancing the quality of NLP tasks.

Uploaded by

shanmukh899
Course Name: Introduction to Natural Language Processing

20-03-2025 Introduction to Natural Language Processing 1


Course Objectives:
The objectives of this course are to:

❑ Gain basic knowledge of Natural Language Processing.

❑ Apply text cleaning and morphological and lexical analysis to text data.

❑ Build algorithmic knowledge of how syntactic and semantic parsing are applied in the NLP process.

❑ Develop a strong understanding of natural language generation in the NLP process.



Module 1: Introduction to Natural Language Processing

❑ Introduction to NLP and Text Processing

❑ Tokenization: Word and Sentence Tokenization

❑ Text Filtration and Script Validation

❑ Stop Word Removal Techniques

❑ Stemming and Lemmatization

❑ Text preprocessing pipeline (Tokenization, Filtration, Script Validation, Stop Word Removal, Stemming)

Natural Language Processing (NLP)
❑ Natural Language Processing (NLP) is a branch of Artificial Intelligence (AI) that enables computers to understand, interpret, and generate human language.

❑ As a field of study, NLP focuses on processing the information contained in natural language text.

❑ NLP overlaps with, and is sometimes referred to as:

✓ Computational Linguistics (CL)

✓ Human Language Technology (HLT)

✓ Natural Language Engineering (NLE)



❑ Natural Language Processing (NLP):

➢ Enables machines to analyze, understand, and generate human languages in ways similar to how humans do.

➢ Applies computational techniques to the study of language.

➢ Draws on linguistic theories to build systems that benefit society.

➢ Originated as a branch of Artificial Intelligence and has since evolved into a field of its own.

➢ Incorporates knowledge from Linguistics, Psycholinguistics, Cognitive Science, and Statistics.

➢ Aims to make computers learn human language rather than requiring humans to learn machine language.



Importance of NLP in AI and Machine Learning
❖ Bridges the gap between human communication and machine understanding.

❖ Enables automation of text and speech-related tasks.

❖ Powers AI-driven applications like chatbots, search engines, and sentiment analysis.

❑Examples of NLP Applications

✓ Chatbots & Virtual Assistants: Siri, Alexa, Google Assistant.

✓ Machine Translation: Google Translate, DeepL.

✓ Speech Recognition: Voice commands in smart devices.

✓ Text Analysis: Sentiment analysis for customer reviews.

✓ Summarization: AI-generated article and report summaries.



Applications of NLP

1. Chatbots & Virtual Assistants

•AI-powered chatbots like Siri, Alexa, Google Assistant help users with queries, reminders, and automation.

•Customer service chatbots handle inquiries efficiently (e.g., banks, e-commerce).

2. Sentiment Analysis

•NLP is used to analyze emotions in text (positive, negative, neutral).

•Examples: Analyzing social media posts, product reviews, and customer feedback.

3. Machine Translation

•Automatically translates text between languages (e.g., Google Translate, DeepL).

•Helps in breaking language barriers for global communication.



4. Speech Recognition

•Converts spoken language into text.

•Used in voice assistants, transcription software, and voice commands in smart devices.

5. Information Retrieval (Search Engines)

•NLP helps search engines like Google understand and rank relevant content.

•Enables semantic search (understanding user intent beyond keywords).

6. Text Summarization

•AI-generated summaries of long documents, news articles, and reports.

•Example: Automatic news summaries (Google News, SummarizeBot).



What is Text Preprocessing in NLP?

❖ Text preprocessing is the backbone of any successful generative AI or Natural Language Processing (NLP) project.

❖ It is the phase in which raw text data is cleaned and transformed into a structured format that is easier for machines to analyze and model.

❖ Since raw text often contains noise, inconsistencies, and unnecessary elements, preprocessing is a crucial step for improving the performance of NLP and machine learning models.



Why text pre-processing is essential:
1. Cleaning and Normalization
➢ Removes unwanted characters, punctuation, and special symbols.
➢ Converts text to lowercase for uniformity.
➢ Handles spelling corrections and stopword removal to reduce noise.
2. Tokenization
➢ Splits text into words or sentences, enabling efficient processing.
➢ Helps in identifying meaningful units for further analysis.
3. Lemmatization and Stemming
➢ Reduces words to their root forms (e.g., "running" → "run").
➢ Improves consistency in text data by reducing word variations.
4. Stopword Removal
➢ Eliminates frequently occurring words (e.g., "the", "is") that do not add meaningful information.
➢ Reduces the dimensionality of data, improving processing speed.



5. Feature Extraction

➢ Converts text into numerical representations like TF-IDF, Word Embeddings, or Bag of Words.

➢ Helps machine learning models understand text contextually.


6. Handling Ambiguity and Noise

➢ Detects and corrects spelling errors or typos.

➢ Resolves issues related to synonyms, polysemy (multiple meanings of a word), and homonyms.
7. Improves Model Accuracy

➢ Preprocessing enhances text quality, leading to better accuracy in NLP tasks such as sentiment analysis, machine translation, and text summarization.

➢ Without proper text processing, NLP models would struggle with inconsistencies, noise, and irrelevant information, reducing their effectiveness.
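
The feature extraction step (item 5 above) names TF-IDF as one numerical representation. The sketch below shows one common textbook variant, assuming term frequency is the count divided by document length and inverse document frequency is log(N / document frequency). The function name `tf_idf` is illustrative, and libraries often use smoothed variants that give different numbers.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: score} dict per tokenized, non-empty document."""
    n_docs = len(documents)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in documents:
        counts = Counter(tokens)
        scores.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


docs = [["nlp", "is", "fun"], ["nlp", "is", "hard"]]
scores = tf_idf(docs)
print(scores[0])
```

Terms that occur in every document ("nlp", "is") receive a score of zero under this variant, while terms unique to one document ("fun", "hard") receive a positive score.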

TOKENIZATION
•Unstructured text data, such as articles, social media posts, or emails, lacks a predefined structure that machines can readily interpret.
•Tokenization bridges this gap by breaking down the text into smaller units called tokens.
•These tokens can be words, characters, or even subwords, depending on the chosen tokenization strategy. By transforming unstructured text into a structured format, tokenization lays the foundation for further analysis and processing.

What is Tokenization?

•Tokenization is the process of splitting text into smaller units called tokens (words, phrases, or sentences).

•It is the first step in many NLP applications, helping computers understand text structure.



Why Do We Need Tokenization?

•Tokenization is the first step in converting textual data into a numerical representation that machine learning algorithms can process: each token is mapped to a number, such as a vocabulary index. With this numeric representation we can train a model to perform tasks such as classification, sentiment analysis, or language generation.

•Tokens can also be used directly as features in machine learning pipelines. These features capture important linguistic information and can drive more complex decisions or behaviors.

•For example, in text classification, the presence or absence of specific tokens can influence the prediction of a particular class. Tokenization therefore plays a pivotal role in extracting meaningful features and enabling effective machine learning models.
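
As a minimal sketch of the two ideas above (tokens as numbers, and token presence as a feature), the code below builds a vocabulary that maps each token to an integer ID and then encodes a sentence both as a list of IDs and as a presence/absence vector. The helper names `build_vocabulary`, `encode`, and `presence_vector` are illustrative and not taken from any library.

```python
def build_vocabulary(token_lists):
    """Assign each distinct token the next free integer ID, in order of first appearance."""
    vocab = {}
    for tokens in token_lists:
        for token in tokens:
            vocab.setdefault(token, len(vocab))
    return vocab


def encode(tokens, vocab):
    """Replace each known token with its integer ID; unknown tokens are skipped."""
    return [vocab[token] for token in tokens if token in vocab]


def presence_vector(tokens, vocab):
    """Return a 0/1 vector marking which vocabulary tokens occur in `tokens`."""
    vector = [0] * len(vocab)
    for token in tokens:
        if token in vocab:
            vector[vocab[token]] = 1
    return vector


corpus = [["i", "love", "nlp"], ["nlp", "is", "amazing"]]
vocab = build_vocabulary(corpus)
print(vocab)
print(encode(["nlp", "is", "love"], vocab))
print(presence_vector(["i", "love", "nlp"], vocab))
```

A classifier can consume the presence vector directly, which is the sense in which the presence or absence of a token influences a prediction.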



Types of Tokenization
1. Word Tokenization

2. Sentence Tokenization



❑ Word tokenization splits a sentence or paragraph into individual words (tokens).
Examples:

• Input:
"Natural Language Processing is amazing!"

• Output:
["Natural", "Language", "Processing", "is", "amazing", "!"]

• Input:
"I love Natural Language Processing!"

• Output:
["I", "love", "Natural", "Language", "Processing", "!"]
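
The word tokenization examples above can be reproduced with a single regular expression. This is a minimal sketch that assumes a token is either a run of word characters or one punctuation mark. The function name `tokenize_words` is illustrative, and library tokenizers apply many more rules.

```python
import re


def tokenize_words(text):
    """Split text into word tokens and single punctuation tokens."""
    # \w+ matches a run of letters/digits; [^\w\s] matches one punctuation character.
    return re.findall(r"\w+|[^\w\s]", text)


print(tokenize_words("Natural Language Processing is amazing!"))
print(tokenize_words("I love Natural Language Processing!"))
```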



❑ Sentence tokenization (also called sentence segmentation) breaks a paragraph or long text into individual sentences.
Example:
• Input:
"Hello! How are you? I'm fine."
• Output:
["Hello!", "How are you?", "I'm fine."]

• Input:
"Natural Language Processing is fascinating. It helps computers understand text."

• Output:
["Natural Language Processing is fascinating.", "It helps computers understand text."]



• Input:
"NLP is amazing... but complex. (It's evolving fast!)"

• Output:
["NLP is amazing... but complex.", "(It's evolving fast!)"]

Note: The parentheses and ellipses are handled correctly.
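
One way to obtain the sentence splits shown above is a rule-based splitter. The sketch below assumes a sentence boundary is whitespace that follows ".", "!" or "?" and precedes an uppercase letter, an opening parenthesis, or a double quote. This heuristic is an assumption made for illustration: it keeps "amazing... but" together because "but" is lowercase, but it would wrongly split after abbreviations such as "Dr." followed by a name.

```python
import re

# Boundary: whitespace preceded by . ! or ? and followed by a capital letter, "(" or '"'.
SENTENCE_BOUNDARY = re.compile(r'(?<=[.!?])\s+(?=[A-Z("])')


def tokenize_sentences(text):
    """Split text into sentences using the heuristic boundary pattern above."""
    return [sentence for sentence in SENTENCE_BOUNDARY.split(text.strip()) if sentence]


print(tokenize_sentences("Hello! How are you? I'm fine."))
print(tokenize_sentences("NLP is amazing... but complex. (It's evolving fast!)"))
```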

Python Code for Tokenization



Challenges in Tokenization

❖ Handling Punctuation: In "U.S.A. is a country.", the abbreviation "U.S.A." should be kept as one token, not split into "U", "S", "A".

❖ Dealing with Contractions: Contractions such as "I'm" should be resolved into their parts ("I", "am") rather than left as one opaque token ("I'm").

❖ Multilingual Texts: Some languages (e.g., Chinese, Japanese) don’t have spaces between words.
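
The first two challenges can be illustrated with a small tokenizer. This is a sketch under stated assumptions: dotted abbreviations are matched as two or more "letter plus period" pairs, and contractions are expanded through a tiny hand-written table (`CONTRACTIONS`, a hypothetical stand-in for a fuller resource). It does not address languages written without spaces, which generally need dictionary-based or statistical segmentation.

```python
import re

# Hypothetical, deliberately tiny expansion table; a real one would be much larger.
CONTRACTIONS = {"i'm": ["I", "am"], "don't": ["do", "not"]}

# Alternatives are tried in order: dotted abbreviation, word (optionally with
# an apostrophe part), then any single punctuation character.
TOKEN_PATTERN = re.compile(r"(?:[A-Za-z]\.){2,}|\w+(?:'\w+)?|[^\w\s]")


def tokenize(text):
    """Tokenize text, keeping abbreviations whole and expanding known contractions."""
    tokens = []
    for raw in TOKEN_PATTERN.findall(text):
        tokens.extend(CONTRACTIONS.get(raw.lower(), [raw]))
    return tokens


print(tokenize("U.S.A. is a country."))
print(tokenize("I'm fine."))
```

Note that the abbreviation pattern absorbs the final period of "U.S.A.", so a sentence ending in an abbreviation loses its separate full-stop token; this is one reason production tokenizers rely on abbreviation lists.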



Text Filtration
Text filtration is the process of removing unnecessary, irrelevant, or harmful content from text data to improve the quality and relevance for NLP applications.

Why is Text Filtration Important?

➢ Removes stopwords, special characters, or profanity to clean data.

➢ Helps in spam filtering, content moderation, and privacy protection.

➢ Enhances the efficiency of NLP models by reducing noise.



Methods of Text Filtration

❑Stopword Removal – Removing common words like the, is, in, and that do not contribute much meaning.

❑Profanity & Offensive Content Filtering – Detecting and replacing inappropriate words.

❑Duplicate & Redundant Data Removal – Avoiding repeated words or sentences.

❑Special Character & HTML Tag Removal – Cleaning unnecessary symbols like <html>, @, #, etc.

Example of Text Filtration

Input:
"I am very very happy!!! This is the best day <3 <html>."

After Filtration:
"I am very happy. This is the best day."



❑Stopwords
Stopwords are common words like the, is, and, of that don’t add much meaning.
•Input:
"The quick brown fox jumps over the lazy dog."
•After Filtration:
"quick brown fox jumps over lazy dog."

❑Filters out unnecessary symbols and emojis


Input:
"Hello!!! How are you??? #Excited"
After Filtration:
"Hello How are you Excited"



Replaces offensive words with asterisks or removes them.

Input:

"This is a stupid idea!"

After Filtration:

"This is a **** idea!"



Simple Python Demo for Text Filtration
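
The original demo code is not present in this text, so what follows is a minimal sketch of such a demo, not the author's version. It assumes three independent filters: stripping HTML-like tags and symbols, collapsing immediately repeated words, and masking words from a blocklist. The function names and the one-word `BLOCKLIST` are hypothetical.

```python
import re

BLOCKLIST = {"stupid"}  # hypothetical blocklist, for illustration only


def strip_markup_and_symbols(text):
    """Remove HTML-like tags, punctuation, and symbols, then tidy whitespace."""
    text = re.sub(r"<[^<>]+>", " ", text)     # tags such as <html>
    text = re.sub(r"[^\w\s]", "", text)       # symbols such as ! ? # @
    return re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace


def collapse_repeats(text):
    """Reduce an immediately repeated word ("very very") to a single occurrence."""
    return re.sub(r"\b(\w+)(?:\s+\1\b)+", r"\1", text, flags=re.IGNORECASE)


def mask_blocked_words(text, mask="****"):
    """Replace every blocklisted word with a fixed mask."""
    def replace(match):
        word = match.group(0)
        return mask if word.lower() in BLOCKLIST else word
    return re.sub(r"\w+", replace, text)


print(strip_markup_and_symbols("Hello!!! How are you??? #Excited"))
print(collapse_repeats("I am very very happy"))
print(mask_blocked_words("This is a stupid idea!"))
```

Keeping each filter as its own function makes it easy to enable only the ones a given application needs; for example, sentiment analysis may want to keep exclamation marks.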



Script Validation
Script validation ensures that text input follows predefined rules, is structured correctly, and does not contain invalid, malicious, or unintended content before further NLP processing.

Why is Script Validation Important?

➢ Ensures text is well-formed and error-free.

➢ Helps prevent injection attacks, syntax errors, and invalid input in applications like chatbots, search engines, and form validation.

➢ Validates language-specific rules (e.g., Hindi vs. English scripts).



Key Aspects of Script Validation
1. Character Encoding Validation

Ensures correct character encoding (e.g., UTF-8, ASCII) to prevent errors in text processing.

Example:

Valid Input: "Hello" (UTF-8 encoded)

Invalid Input: "Héllo" (if ASCII encoding is enforced)

2. Language & Script Detection

Identifies the correct language script to prevent mixing of different languages in restricted applications.

Example:

Input: "Bonjour! Comment ça va?" (French text in an English-only system)

After Validation: "Error: Non-English text detected!"


3. Input Sanitization

Prevents malicious code injection (e.g., SQL injection, XSS attacks) by filtering harmful inputs.

Example:

Input: "SELECT * FROM users WHERE username='admin' --"

After Validation: "Invalid input detected!"

4. Grammar & Syntax Validation

Ensures that the text follows proper grammar and sentence structure.

Example:

Input: "He going to school"

Corrected Output: "He is going to school"



Python Demo for Script Validation
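
The original demo code is not present in this text. The sketch below is one possible illustration of the first three aspects (encoding, script, and input sanitization) using only the standard library. All function names and the `SUSPICIOUS` pattern are assumptions made for this example. Pattern matching of this kind is only a coarse filter: parameterized queries and output escaping remain the actual defenses against SQL injection and XSS.

```python
import html
import re
import unicodedata


def is_ascii(text):
    """Return True if every character can be encoded as ASCII."""
    try:
        text.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False


def uses_only_script(text, script_prefix="LATIN"):
    """Return True if every letter's Unicode name starts with `script_prefix`."""
    return all(
        unicodedata.name(ch, "").startswith(script_prefix)
        for ch in text
        if ch.isalpha()
    )


# Illustrative pattern only: SQL comment marker, statement separator,
# script tag, or a SELECT ... FROM fragment.
SUSPICIOUS = re.compile(r"--|;|<\s*script\b|\bselect\b.*\bfrom\b", re.IGNORECASE)


def looks_malicious(text):
    """Flag input that matches the coarse SUSPICIOUS pattern."""
    return bool(SUSPICIOUS.search(text))


print(is_ascii("Hello"), is_ascii("Héllo"))
print(uses_only_script("Bonjour! Comment ça va?"))
print(looks_malicious("SELECT * FROM users WHERE username='admin' --"))
print(html.escape("<script>alert(1)</script>"))
```

The French sentence passes the Latin-script check, which shows that script validation alone cannot distinguish French from English; that requires language identification. Grammar validation (aspect 4) is likewise beyond a rule sketch of this size.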



Stop Word Removal Techniques in NLP
❖Stop words are common words (e.g., the, is, in, and, of, a, to) that do not contribute significant meaning in Natural Language Processing (NLP) tasks. Removing them helps in text processing by reducing noise and improving model efficiency.

Why Remove Stop Words?

Reduces text size and computation time.

Eliminates words that don’t add much value to text analysis.

Can improve accuracy in NLP tasks like sentiment analysis, text classification, and search.



1. Stop Word Removal in a Sentence
Removes unnecessary words to retain only meaningful content.

Input Sentence:
"The quick brown fox jumps over the lazy dog."

After Stop Word Removal (using a list that includes "over", as NLTK's English list does):
"quick brown fox jumps lazy dog."

2. Stop Word Removal in a Paragraph


Cleans a longer text by filtering out stop words while keeping essential information.

Input Paragraph:

"Natural language processing is a field of artificial intelligence that helps computers understand human language. It involves techniques such as tokenization, stop word removal, stemming, and lemmatization."

After Stop Word Removal:

"Natural language processing field artificial intelligence helps computers understand human language. Involves techniques tokenization, stop word removal, stemming, lemmatization."



❑ Challenges & Considerations

Context Matters – Removing stop words may change the meaning (e.g., "To be or not to be" → "be not be").

Custom Stop Words Needed – Industry-specific texts (medical, legal, finance) require tailored stop word lists.

Multilingual Processing – Different languages need their own stop word lists.
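
The examples and caveats above can be sketched with a small, configurable filter. The stop word list here is a short illustrative subset, not a standard list, and the function name `remove_stop_words` is an assumption. Passing a different list is how domain-specific or language-specific stop words would be supplied.

```python
import re

# Illustrative subset only; real lists (e.g., NLTK's) are much longer.
STOP_WORDS = frozenset({"the", "is", "in", "and", "of", "a", "to", "or", "over"})


def remove_stop_words(text, stop_words=STOP_WORDS):
    """Return the words of `text` that are not in `stop_words` (case-insensitive)."""
    words = re.findall(r"\w+", text)
    return [word for word in words if word.lower() not in stop_words]


print(remove_stop_words("The quick brown fox jumps over the lazy dog."))
print(remove_stop_words("To be or not to be"))

# A tailored list for a hypothetical medical corpus.
medical_stop_words = STOP_WORDS | {"patient"}
print(remove_stop_words("The patient is stable", medical_stop_words))
```

The second call reproduces the "be not be" result from the Context Matters caveat, showing how much of the original meaning can be lost.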



Stemming in NLP
Stemming is a rule-based process that removes suffixes to obtain the stem of a word.
It does not consider meaning or context, so the result may not be a valid word.
Example of Stemming

Word → After Stemming
Running → Run
Studies → Studi
Happily → Happili
Better → Better (unchanged; a stemmer cannot relate it to "Good")
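
To make "rule-based suffix removal" concrete, here is a deliberately simplified toy stemmer. It is not the Porter algorithm: the rule list, the minimum stem length, and the name `toy_stem` are all assumptions chosen for illustration, and it does not reproduce every row of the table above (for instance, it leaves "happily" unchanged).

```python
# (suffix, replacement) pairs, checked in order; the first match wins.
SUFFIX_RULES = [
    ("sses", "ss"),
    ("ies", "i"),
    ("ss", "ss"),
    ("ing", ""),
    ("ed", ""),
    ("s", ""),
]


def toy_stem(word, min_stem_length=3):
    """Strip one common suffix by rule, with no knowledge of meaning or context."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        if not word.endswith(suffix):
            continue
        stem = word[: len(word) - len(suffix)]
        if len(stem) < min_stem_length:
            return word  # too short to strip safely, e.g. "sing"
        # Undo consonant doubling left behind by -ing / -ed ("runn" -> "run").
        if suffix in ("ing", "ed") and stem[-1] == stem[-2] and stem[-1] not in "lsz":
            stem = stem[:-1]
        return stem + replacement
    return word


for example in ["Running", "Studies", "Better", "cats"]:
    print(example, "->", toy_stem(example))
```

Because the rules only look at spelling, "Studies" becomes the non-word "studi" and "Better" passes through untouched, which matches the behaviour described above.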



Popular Stemmer Algorithms:
Porter Stemmer
The Porter Stemmer is a simple algorithm that reduces words to their stem by removing common suffixes such as "-ing", "-ed", and "-s". It is widely used in text analysis and information retrieval.

Snowball Stemmer
The Snowball Stemmer for English, also known as the Porter2 Stemmer, is a refinement of the Porter algorithm; Snowball stemmers are also available for a number of other languages.

Lancaster Stemmer
The Lancaster Stemmer, also known as the Paice/Husk Stemmer, is a very aggressive algorithm that reduces words to their base form by removing suffixes, often resulting in shorter, sometimes non-existent words.



Lemmatization in NLP
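Lemmatization reduces a word to its dictionary form (lemma) by considering grammar and meaning.
It requires a linguistic database (like WordNet) to find the correct base form.
Example of Lemmatization

Word → After Lemmatization
Running → Run
Studies → Study
Happily → Happy
Better → Good

Results depend on the tool and on the part of speech supplied: for example, "Better" maps to "Good" only when it is treated as an adjective, and WordNet-based lemmatizers typically leave adverbs such as "Happily" unchanged.

Popular Lemmatization Tools:
WordNet Lemmatizer (NLTK)
spaCy Lemmatizer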
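
The idea of a dictionary lookup that depends on part of speech can be sketched without any external resources. The three-entry `LEMMA_TABLE` below is a hypothetical stand-in for a linguistic database such as WordNet, and the one-letter tags ("n", "v", "a") are an assumed convention for noun, verb, and adjective.

```python
# Hypothetical miniature lexicon keyed by (word, part-of-speech tag).
LEMMA_TABLE = {
    ("running", "v"): "run",
    ("studies", "n"): "study",
    ("better", "a"): "good",
}


def lemmatize(word, pos="n"):
    """Look up the lemma for (word, pos); fall back to the lowercased word."""
    key = (word.lower(), pos)
    return LEMMA_TABLE.get(key, word.lower())


print(lemmatize("Better", pos="a"))  # found: treated as an adjective
print(lemmatize("Better"))           # not found as a noun, returned unchanged
print(lemmatize("Running", pos="v"))
```

The two calls on "Better" show why lemmatizers usually need a part-of-speech tag to return the expected lemma.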
Here’s a simple Python code snippet demonstrating stemming and lemmatization using the NLTK library:

import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize

# Download necessary resources
# (recent NLTK releases may also require 'punkt_tab')
nltk.download('punkt')
nltk.download('wordnet')

# Initialize stemmer and lemmatizer
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Example text
text = "The cats are running faster than the dogs."

# Tokenize words
words = word_tokenize(text)

# Apply stemming
stemmed_words = [stemmer.stem(word) for word in words]

# Apply lemmatization
lemmatized_words = [lemmatizer.lemmatize(word) for word in words]

# Print results
print("Original Words:", words)
print("Stemmed Words:", stemmed_words)
print("Lemmatized Words:", lemmatized_words)


❑ What are Stemming and Lemmatization?

Both stemming and lemmatization are text normalization techniques in Natural Language Processing (NLP) used to reduce words to their root forms. However, they work differently:

Feature | Stemming | Lemmatization
Definition | Reduces a word to its root form by removing suffixes. | Converts a word to its base form (lemma) using a dictionary.
Accuracy | Less accurate, may produce non-existent words. | More accurate, produces valid words.
Speed | Faster as it follows rule-based truncation. | Slower since it checks word meaning and grammar.
Use Cases | When speed is more important than accuracy. | When meaning and grammatical correctness are required.



Text Preprocessing Pipeline
A Text Preprocessing Pipeline is a sequence of steps used to clean and prepare raw text data for Natural Language Processing (NLP) tasks. This pipeline ensures that the text is in a structured and standardized format, making it suitable for analysis by machine learning models or linguistic algorithms.

The pipeline consists of the following key steps:

1. Tokenization – Splitting text into individual words or subwords.

2. Filtration – Removing unnecessary characters or symbols.

3. Script Validation – Ensuring valid characters and language script.

4. Stop Word Removal – Eliminating commonly used words that do not contribute to meaning (e.g., "is," "the," "and").

5. Stemming – Reducing words to their root form (e.g., "running" → "run").

This pipeline is commonly used in Natural Language Processing (NLP) for text analysis and machine learning applications.
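
The five steps can be chained as plain functions. The following is a minimal standard-library sketch, not a production pipeline: the stop word list is an illustrative subset, script validation is reduced to "keep ASCII-only tokens" (an assumption that suits an English-only system), and `stem` is a toy suffix stripper that stands in for a real stemmer such as Porter. It relies on `str.isascii`, available from Python 3.7.

```python
import re

STOP_WORDS = {"the", "is", "and", "a", "of", "in", "to", "are", "than"}  # illustrative subset


def tokenize(text):
    """Step 1: split into word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


def filter_tokens(tokens):
    """Step 2: drop tokens that are not purely alphanumeric (punctuation, symbols)."""
    return [token for token in tokens if token.isalnum()]


def validate_script(tokens):
    """Step 3: keep only ASCII tokens (assumes an English-only system)."""
    return [token for token in tokens if token.isascii()]


def remove_stop_words(tokens):
    """Step 4: drop tokens found in the stop word list."""
    return [token for token in tokens if token.lower() not in STOP_WORDS]


def stem(token):
    """Step 5: toy suffix stripping, standing in for a real stemmer."""
    word = token.lower()
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            base = word[: -len(suffix)]
            # Undo consonant doubling after -ing / -ed ("runn" -> "run").
            if suffix != "s" and base[-1] == base[-2]:
                base = base[:-1]
            return base
    return word


def preprocess(text):
    """Run the five steps in order and return the resulting stems."""
    tokens = tokenize(text)
    tokens = filter_tokens(tokens)
    tokens = validate_script(tokens)
    tokens = remove_stop_words(tokens)
    return [stem(token) for token in tokens]


print(preprocess("The cats are running faster than the dogs."))
```

Each step takes and returns a list of tokens, so steps can be reordered, replaced (for example, swapping the toy stemmer for a lemmatizer), or skipped without changing the others.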





THANK YOU
