Module 1
❑ To enrich algorithmic knowledge of the various syntactic and semantic parsing techniques applied in the NLP process.
❑ To gain a strong understanding of natural language generation in the NLP process.
❑ Text preprocessing pipeline (Tokenization, Filtration, Script Validation, Stop Word Removal,
Stemming)
❑ Natural Language Processing (NLP) is a field of study that focuses on processing information contained in natural
language text. It enables computers to understand, interpret, and generate human language.
➢ Analyze, understand, and generate human languages, much as humans do.
➢ Explain linguistic theories and use them to develop systems that benefit society.
➢ Make computers learn human language rather than requiring humans to learn machine language.
❖ Powers AI-driven applications like chatbots, search engines, and sentiment analysis.
1. Chatbots & Virtual Assistants
•AI-powered chatbots like Siri, Alexa, and Google Assistant help users with queries, reminders, and automation.
2. Sentiment Analysis
•Examples: Analyzing social media posts, product reviews, and customer feedback.
3. Machine Translation
•Automatically translates text between languages, as in Google Translate.
4. Speech Recognition
•Used in voice assistants, transcription software, and voice commands in smart devices.
5. Search Engines
•NLP helps search engines like Google understand and rank relevant content.
6. Text Summarization
•Condenses long documents into short summaries while preserving key information.
❖ Text preprocessing is the backbone of any successful Generative or Natural Language Processing (NLP) project.
❖ It’s the phase where raw text data undergoes various transformations to make it suitable for analysis and modeling.
❖ Text preprocessing in Natural Language Processing (NLP) is the process of cleaning and transforming raw text into a
structured format that is easier for machines to analyze. Since raw text often contains noise, inconsistencies, and
irrelevant information, preprocessing standardizes the data before analysis.
❖ It is a crucial step in Natural Language Processing (NLP) to improve the performance of machine learning models.
➢ Converts text into numerical representations like TF-IDF, Word Embeddings, or Bag of Words (see the sketch after this list).
➢ Resolves issues related to synonyms, polysemy (multiple meanings of a word), and homonyms.
7. Improves Model Accuracy
➢ Preprocessing enhances text quality, leading to better accuracy in NLP tasks such as sentiment analysis, machine
translation, and text classification.
➢ Without proper text processing, NLP models would struggle with inconsistencies, noise, and irrelevant information,
leading to poor performance.
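To illustrate the numerical representations mentioned above, here is a minimal sketch using scikit-learn (an assumed library, not part of these slides; the toy corpus is also illustrative):

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = ["NLP is amazing", "NLP helps computers understand text"]

# Bag of Words: raw term counts per document
bow = CountVectorizer()
print(bow.fit_transform(corpus).toarray())
print(bow.get_feature_names_out())

# TF-IDF: counts down-weighted for terms common across documents
tfidf = TfidfVectorizer()
print(tfidf.fit_transform(corpus).toarray())

Each row of the resulting matrices is one document; each column is one vocabulary term, which is exactly the numeric form that downstream models consume.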
What is Tokenization?
•Tokenization is the process of splitting text into smaller units called tokens (words, phrases, or sentences).
•It is the first step in many NLP applications, helping computers understand text structure.
•One of the primary reasons for tokenization is to convert textual data into a numerical representation that can be processed by
machine learning algorithms. With this numeric representation we can train a model to perform various tasks, such as
classification, translation, or summarization.
•Tokens not only serve as numeric representations of text but can also be used as features in machine learning pipelines. These
features capture important linguistic information and can trigger more complex decisions or behaviors.
•For example, in text classification, the presence or absence of specific tokens can influence the prediction of a particular class.
Tokenization, therefore, plays a pivotal role in extracting meaningful features and enabling effective machine learning models.
1. Word Tokenization
• Input:
"Natural Language Processing is amazing!"
• Output:
["Natural", "Language", "Processing", "is", "amazing", "!"]
• Input:
"I love Natural Language Processing!"
• Output:
["I", "love", "Natural", "Language", "Processing", "!"]
2. Sentence Tokenization
• Input:
"Natural Language Processing is fascinating. It helps computers understand text."
• Output:
["Natural Language Processing is fascinating.", "It helps computers understand text."]
• Input:
"NLP is amazing... but complex. (It's evolving fast!)"
• Output:
["NLP is amazing... but complex.", "(It's evolving fast!)"]
❖ Handling Punctuation: "U.S.A. is a country." should be one entity, not "U", "S", "A" separately.
❖ Dealing with Contractions: "I'm" should be split into "I" and "am" (correct) rather than kept as the single token "I'm" (incorrect).
❖ Multilingual Texts: Some languages (e.g., Chinese, Japanese) don’t have spaces between words.
❑Stopword Removal – Removing common words like the, is, in, and that do not contribute much meaning.
❑Profanity & Offensive Content Filtering – Detecting and replacing inappropriate words.
❑Special Character & HTML Tag Removal – Cleaning unnecessary symbols like <html>, @, #, etc.
Input:
"I am very very happy!!! This is the best day <3 <html>."
After Filtration:
"I am happy. This is the best day."
❖ Script validation ensures that text input follows predefined rules and does not contain invalid, malicious, or
unintended content.
➢ Helps prevent injection attacks, syntax errors, and invalid input in applications like chatbots, search engines, and form
validation.
Ensures correct character encoding (e.g., UTF-8, ASCII) to prevent errors in text processing.
Example: "café" saved with the wrong encoding appears as "cafÃ©"; encoding validation catches this.
Identifies the correct language script to prevent mixing of different languages in restricted applications.
Example: An English-only form rejects input containing Cyrillic or Devanagari characters.
Prevents malicious code injection (e.g., SQL injection, XSS attacks) by filtering harmful inputs.
Example: The input "'; DROP TABLE users; --" is rejected or escaped before it reaches a database.
Ensures that the text follows proper grammar and sentence structure.
Example: "He go to school" is flagged and corrected to "He goes to school."
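A simplified sketch of these checks in Python; the Latin-only script rule and the injection blocklist are illustrative assumptions, not a production validator:

import unicodedata

def validate(text):
    # Encoding check: text must encode cleanly as UTF-8
    try:
        text.encode("utf-8")
    except UnicodeEncodeError:
        return False
    # Script check: allow only Latin letters among alphabetic characters
    for ch in text:
        if ch.isalpha() and not unicodedata.name(ch, "").startswith("LATIN"):
            return False
    # Injection check: reject obvious SQL/HTML payloads (illustrative, not exhaustive)
    lowered = text.lower()
    if any(bad in lowered for bad in ("<script", "drop table", "--")):
        return False
    return True

print(validate("Hello, world!"))            # True
print(validate("'; DROP TABLE users; --"))  # False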
❖ Stop words are common words (e.g., "the," "is," "in") that carry little meaning on their own in Natural
Language Processing (NLP) tasks. Removing them helps in text processing by reducing noise and improving model
efficiency.
Improves accuracy in NLP tasks like sentiment analysis, text classification, and search engines.
Input Sentence:
"The quick brown fox jumps over the lazy dog."
After Stop Word Removal:
"quick brown fox jumps lazy dog."
Input Paragraph:
"Natural language processing is a field of artificial intelligence that helps computers understand human language. It involves techniques such as
tokenization, stop word removal, stemming, and lemmatization."
"Natural language processing field artificial intelligence helps computers understand human language. Involves techniques tokenization, stop
word removal, stemming, lemmatization."
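The outputs above can be reproduced with NLTK's built-in English stop word list; a minimal sketch:

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download("stopwords"); nltk.download("punkt")  # one-time downloads

text = "The quick brown fox jumps over the lazy dog."
stop_words = set(stopwords.words("english"))

filtered = [w for w in word_tokenize(text) if w.lower() not in stop_words]
print(" ".join(filtered))  # quick brown fox jumps lazy dog .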
Context Matters – Removing stop words may change the meaning (e.g., "To be or not to be" → "be not be").
Custom Stop Words Needed – Industry-specific texts (medical, legal, finance) require tailored stop word
lists.
Multilingual Processing – Different languages need their own stop word lists.
Snowball Stemmer
The Snowball Stemmer, also known as the Porter2 Stemmer, is an effective stemming algorithm designed to process and reduce
words to their stems.
Lancaster Stemmer
The Lancaster Stemmer, also known as the Paice Stemmer, is a very aggressive algorithm used in natural language processing
(NLP) to reduce words to their base or root form by removing suffixes, often resulting in shorter, sometimes non-existent words.
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize

nltk.download("punkt"); nltk.download("wordnet")  # one-time model downloads

text = "The children are running faster than the dogs"
# Tokenize words
words = word_tokenize(text)
# Apply stemming
stemmer = PorterStemmer()
stemmed_words = [stemmer.stem(word) for word in words]
# Apply lemmatization
lemmatizer = WordNetLemmatizer()
lemmatized_words = [lemmatizer.lemmatize(word) for word in words]
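To see how the Snowball and Lancaster stemmers described above differ in aggressiveness, a small comparison sketch (the word list is illustrative):

from nltk.stem import SnowballStemmer, LancasterStemmer

snowball = SnowballStemmer("english")
lancaster = LancasterStemmer()

for word in ["running", "happiness", "maximum"]:
    # Lancaster typically produces shorter, sometimes non-word stems than Snowball
    print(word, snowball.stem(word), lancaster.stem(word))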
Both stemming and lemmatization are text normalization techniques in Natural Language Processing (NLP) used to reduce
words to their base or root forms; stemming chops suffixes heuristically, while lemmatization maps words to valid dictionary forms.
❖ A text preprocessing pipeline combines these steps to prepare raw text for Natural Language
Processing (NLP) tasks. This pipeline ensures that the text is in a structured and standardized format, making it suitable for
analysis and modeling.
1. Tokenization – Splitting text into smaller units such as words or sentences.
2. Filtration – Removing noise such as special characters, HTML tags, and offensive content.
3. Script Validation – Ensuring the text uses valid characters, encoding, and script.
4. Stop Word Removal – Eliminating commonly used words that do not contribute to meaning (e.g., "is," "the," "and").
5. Stemming – Reducing words to their root form (e.g., "running" → "run").
This pipeline is commonly used in Natural Language Processing (NLP) for text analysis and machine learning applications.
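Putting the steps together, a condensed sketch of the full pipeline; the filtration rules and the ASCII-only validation are illustrative assumptions, and it assumes the NLTK punkt and stopwords data from the earlier sketches are already downloaded:

import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

def preprocess(text):
    text = re.sub(r"<[^>]*>|[@#]\w+", " ", text)   # filtration: strip tags, mentions, hashtags
    if not text.isascii():                          # script validation (ASCII-only assumption)
        raise ValueError("unsupported characters in input")
    tokens = word_tokenize(text.lower())            # tokenization (plus lowercasing)
    sw = set(stopwords.words("english"))
    tokens = [t for t in tokens if t.isalpha() and t not in sw]  # stop word removal
    stemmer = PorterStemmer()
    return [stemmer.stem(t) for t in tokens]        # stemming

print(preprocess("The <b>quick</b> brown fox jumps over the lazy dog!"))
# -> ['quick', 'brown', 'fox', 'jump', 'lazi', 'dog']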
• Covers advanced syntactic analysis, POS tagging, chunking, and information extraction techniques.
• Focuses on statistical approaches to NLP, including POS tagging, chunking, and language modeling.
"Deep Learning for Natural Language Processing" – Palash Goyal, Sumit Pandey, & Karan Jain
• Best for modern NLP applications like Named Entity Recognition, Transformers, and chatbot development.
"Natural Language Processing with Transformers" – Lewis Tunstall, Leandro von Werra, & Thomas Wolf