Chapter 1

Introduction to calculus:
• Functions and Graphs
• Limits and derivatives
• Applications of differentiation
• Integrals
• Applications of integrals

• Vectors and Vector Spaces
• Introduction to Matrices, matrix transformations, determinants
• Vector space
• Eigen value and vector, SVD and PCA

• Description of data
• Probability
• Random variables
• Probability distribution
• Statistical inference

• Introduction and ER Model
• Relational Data Model
• Relational Algebra
• SQL, Database design
• Transaction processing
• Concurrency system

First Year
Second Year
Unit - 1
1 Introduction to NLP (04 hrs)
Introduction to Natural Language Processing, Applications of Natural Language Processing, Word embeddings. Parsing techniques - Dependency Grammar, Neural dependency parsing.
2 Machine Translation, Autoencoders and Decoders (06 hrs)
Machine Translation, Seq2Seq and Attention, Autoencoders and decoders.
3 Generative Adversarial Networks (05 hrs)
Generative vs. Discriminative models, Generative Adversarial Networks and Language Models, types of GANs.
Unit - 2
4 Transformer Networks & Diffusion Models (07 hrs)
Transformer Networks, transformers for text generation, Diffusion models – continuous vs discrete, deterministic vs stochastic models.
1950: Alan Turing publishes "Computing Machinery and Intelligence", introducing the idea of machines
understanding and generating human language.
Heuristics-Based NLP (Early Years):
• Rule-based methods using predefined, hand-crafted rules from domain experts (e.g., regular expressions).
• Limitations: Limited scalability for complex language processing.
Statistical NLP (1990s):
• Shift to machine learning algorithms using statistical patterns learned from data.
• Examples: Naive Bayes, Support Vector Machines (SVM), Hidden Markov Models (HMMs).
Neural Network-Based NLP (Present):
• Deep learning models provide better accuracy but require large datasets and high computational power.
• Examples: Recurrent Neural Networks (RNNs), Long Short-Term Memory (LSTM), Transformers (e.g., BERT, GPT).
Components of NLP
• Lexical Analysis: Breaks down the text into words (tokens) and identifies their part of speech (nouns, verbs, etc.).
• Syntactic Analysis (Parsing): Examines the grammatical structure of a sentence, checking if the arrangement of
words follows the rules of a language.
• Semantic Analysis: Focuses on understanding the meaning of individual words and how they combine in a sentence.
• Pragmatic Analysis: Interprets the meaning of the sentence in context, considering background knowledge or the
speaker's intent.
• Discourse Integration: Ensures that individual sentences are connected to form a coherent text or conversation.
Regular Expressions
A regular expression (sometimes called a rational expression) is a sequence of characters that defines a search pattern, mainly for use in pattern matching with strings, or string matching, i.e., “find and replace”-like operations. Regular expressions are a generalized way to match patterns with sequences of characters.
Rules for Regular Expressions
• Every letter of Σ can be made into a regular expression; the null string ε is itself a regular expression.
• If r1 and r2 are regular expressions, then (r1), the concatenation r1.r2, the union r1 + r2, r1* (zero or more repetitions), and r1+ (one or more repetitions) are also regular expressions.
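As a quick illustration, the sketch below uses Python's built-in re module (the sample text and patterns are made up for illustration) to show searching and "find and replace" with regular expressions:

```python
import re

text = "NLP dates back to 1950, and by 2024 NLP powered chatbots and translation."

# Find all 4-digit numbers (years) in the text
print(re.findall(r"\b\d{4}\b", text))        # ['1950', '2024']

# Check whether the text starts with the token "NLP"
print(bool(re.match(r"NLP\b", text)))        # True

# "Find and replace": expand every standalone occurrence of "NLP"
print(re.sub(r"\bNLP\b", "Natural Language Processing", text))
```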
Terminology
Corpus
• Definition: A corpus is a large collection of text used for training and evaluating NLP
models.
• Example: Wikipedia articles, news datasets, customer reviews.
• Types of Corpus:
• Monolingual Corpus: Text in one language (e.g., English News Corpus).
• Parallel Corpus: Text in multiple languages for translation tasks (e.g., Europarl dataset).
• Domain-Specific Corpus: Text from specialized fields like medical or legal texts.
Example:
If we collect 10,000 articles from news websites, we call it a news corpus.
Documents
• Definition: A document is a single unit of text in a corpus.
• Example: An email, a tweet, a book chapter, a product review.
• Relationship: A corpus is made up of multiple documents.
Example:
• A news corpus contains thousands of news articles (each article is a document).
• A book corpus contains individual book chapters as documents.
Terminology
Vocabulary
• Definition: A vocabulary is the set of unique words found in a corpus.
• Example: [“AI”, “Machine”, “Learning”, “Python”, “Deep”, “Neural”]
• Size: Vocabulary size depends on the dataset; large corpora have extensive vocabularies.
• Preprocessing Impact:
• Removing stopwords (common words like “is”, “the”) reduces vocabulary size.
• Applying stemming/lemmatization normalizes words and minimizes redundant terms.
Example:
If we process a news corpus with 1 million words, and find 50,000 unique words, then the
vocabulary size is 50,000.
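A minimal sketch (with a hypothetical two-document corpus) of how a vocabulary is derived from a corpus:

```python
# Hypothetical toy corpus: each string is one document
corpus = [
    "AI and Machine Learning with Python",
    "Deep Learning builds Neural networks with Python",
]

# Simple whitespace tokenization plus lowercasing
tokens = [word.lower() for doc in corpus for word in doc.split()]

vocabulary = sorted(set(tokens))        # unique words in the corpus
print(vocabulary)
print("Vocabulary size:", len(vocabulary))
```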
Words (Tokens)
• Definition: Words (or tokens) are the smallest units in a text after tokenization.
• Example: [“I”, “love”, “NLP”, “!”]
• Types of Tokens:
• Unigram Tokens: Single words (“deep”, “learning”).
• Bigram Tokens: Pairs of words (“deep learning”, “neural networks”).
• Subword Tokens: Smaller parts of words used in NLP models (e.g., “learn” and “##ing” in BERT).
Example:
Text: “Natural Language Processing is fun!”
• Tokens: [“Natural”, “Language”, “Processing”, “is”, “fun”, “!”]
Text Processing
Tokenization
Definition:
Tokenization is the process of splitting text into smaller units called tokens (words,
phrases, or sentences). It helps in preprocessing text for NLP tasks.
Types of Tokenization:
• Word Tokenization: Splitting text into words.
• Sentence Tokenization: Splitting text into sentences.
• Subword Tokenization: Used in deep learning models like BERT (e.g., WordPiece,
Byte Pair Encoding).
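A short sketch of word and sentence tokenization using the NLTK library (assumed installed; depending on the NLTK version, the required tokenizer data package may be "punkt" or "punkt_tab"):

```python
import nltk
nltk.download("punkt", quiet=True)      # one-time download of tokenizer models

from nltk.tokenize import sent_tokenize, word_tokenize

text = "Natural Language Processing is fun! It powers chatbots and translation."

print(sent_tokenize(text))   # ['Natural Language Processing is fun!', 'It powers chatbots and translation.']
print(word_tokenize(text))   # ['Natural', 'Language', 'Processing', 'is', 'fun', '!', 'It', ...]
```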
Stemming
Definition:
Stemming reduces a word to its root form by chopping off suffixes, but it does not always
produce meaningful words.
Lemmatization
Definition:
Lemmatization reduces words to their dictionary base form (lemma), considering the
word’s meaning and grammar.
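A minimal sketch contrasting stemming and lemmatization with NLTK (assumes NLTK and its WordNet data are available; the words are chosen only for illustration):

```python
import nltk
nltk.download("wordnet", quiet=True)    # data needed by the lemmatizer

from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

words = ["running", "studies", "ate"]

# Stemming chops suffixes; the result is not always a real word (e.g. "studi")
print([stemmer.stem(w) for w in words])

# Lemmatization maps to dictionary forms, given a part of speech
print([lemmatizer.lemmatize(w, pos="v") for w in words])   # as verbs: run, study, eat
print(lemmatizer.lemmatize("better", pos="a"))             # adjective -> "good"
```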
Word Embeddings - One Hot Encoding
Solution:
Vocabulary (unique words present in the dataset) = 7:
“The Food Is Good Bad Pizza Amazing”
One-Hot Encoding is done based on these words.
Advantage:
1. Easy to implement and understand.
Disadvantages:
1. The resulting matrix is sparse (mostly zeros), which is inefficient for storage and computation and can also lead to overfitting.
2. As seen in the above example, the matrix for d3 is of size 3*7; the representation therefore does not have a fixed size for every document, whereas training an ML algorithm requires a fixed-size matrix.
3. No semantic meaning is captured between the words.
4. One-hot encoding ignores out-of-vocabulary (OOV) words in test data.
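A minimal sketch of one-hot encoding over the 7-word vocabulary above (the example document "pizza is amazing" is assumed here only to illustrate the 3*7 matrix mentioned for d3):

```python
import numpy as np

# Vocabulary from the example (7 unique words)
vocab = ["the", "food", "is", "good", "bad", "pizza", "amazing"]
index = {word: i for i, word in enumerate(vocab)}

def one_hot_encode(document):
    """Return one vector of length |vocab| per token in the document."""
    tokens = document.lower().split()
    matrix = np.zeros((len(tokens), len(vocab)), dtype=int)
    for row, token in enumerate(tokens):
        if token in index:              # out-of-vocabulary tokens are simply ignored
            matrix[row, index[token]] = 1
    return matrix

print(one_hot_encode("pizza is amazing"))   # 3 x 7 matrix, one row per token
```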
Word Embeddings - Bag of Words
Bag of Words
• The Bag of Words model represents text data by counting the occurrences of each word in a document, ignoring grammar and word order but keeping track of frequency.
Word Embeddings - Bag of Words
Example on Bag of Words:
Given the dataset below, solve using Bag of Words:
D1 - He is a good boy
D2 - She is a good girl
D3 - Boy and girl are good
Solution:
First step - lowercase all the words.
Second step - exclude the stop words (words that do not contribute to sentiment analysis).
Hence the reframed text is as follows:
S1 - good boy
S2 - good girl
S3 - boy girl good
Word Embeddings - Bag of Words
Vocabulary   Frequency
good         3
boy          2
girl         2

Document   Feature vector (good, boy, girl)
S1         [1 1 0]
S2         [1 0 1]
S3         [1 1 1]
These vectors are used to train the machine learning model.
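The same counts can be produced with scikit-learn's CountVectorizer, sketched below (assumes scikit-learn is installed; its built-in English stop-word list happens to remove the same words as the manual reframing above, and the columns come out in alphabetical order):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "He is a good boy",
    "She is a good girl",
    "Boy and girl are good",
]

# Lowercasing is the default; drop common English stop words
vectorizer = CountVectorizer(stop_words="english")
bow = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())   # ['boy' 'girl' 'good']
print(bow.toarray())                        # one count vector per document
```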
Advantages
• Simple to implement and understand.
• Results in a fixed-size input, which benefits ML algorithms.
Disadvantages
• The resulting matrix is sparse, which can lead to overfitting.
• Word order is lost.
• Ignores out-of-vocabulary (OOV) words in test data.
• Sometimes fails to capture the semantic meaning of the sentences.
Word Embeddings: TF-IDF
Example on TF-IDF
Given the dataset below, solve using TF-IDF:
Document   Text                    Output
D1         He is a good boy        1
D2         She is a good girl      1
D3         Boy and girl are good   1
Solution:
First step - lowercase all the words.
Second step - exclude the stop words (words that do not contribute to sentiment analysis).
Hence the reframed text is as follows:
S1 - good boy
S2 - good girl
S3 - boy girl good
Word Embeddings: TF-IDF
Term Frequency (TF)
Words in Vocabulary   S1    S2    S3
good                  1/2   1/2   1/3
boy                   1/2   0     1/3
girl                  0     1/2   1/3

IDF
Words   IDF
good    log(3/3) = 0
boy     log(3/2)
girl    log(3/2)

Final TF-IDF
Sentence     good   boy            girl
Sentence 1   0      1/2 log(3/2)   0
Sentence 2   0      0              1/2 log(3/2)
Sentence 3   0      1/3 log(3/2)   1/3 log(3/2)
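The table above can be reproduced with a few lines of Python, sketched below (log base 10 and no smoothing, matching the hand calculation; scikit-learn's TfidfVectorizer uses a smoothed variant of the formula and would give different numbers):

```python
import math

docs = {
    "S1": ["good", "boy"],
    "S2": ["good", "girl"],
    "S3": ["boy", "girl", "good"],
}
vocab = ["good", "boy", "girl"]
N = len(docs)

# IDF(w) = log10(N / number of documents containing w)
idf = {w: math.log10(N / sum(w in tokens for tokens in docs.values())) for w in vocab}

# TF(w, d) = count of w in d / number of words in d;  TF-IDF = TF * IDF
for name, tokens in docs.items():
    tfidf = [tokens.count(w) / len(tokens) * idf[w] for w in vocab]
    print(name, [round(x, 3) for x in tfidf])
```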
Word Embeddings – TF-IDF
Advantages
• Simple to implement and understand.
• Results in a fixed-size input, which benefits ML algorithms.
• Word importance is captured.
Disadvantages
• The resulting matrix is sparse, which can lead to overfitting.
• Ignores out-of-vocabulary (OOV) words in test data.
• TF-IDF treats words as independent entities and does not consider semantic relationships between them.
Word Embeddings – Word2Vec
• Example: each unique word is represented as a vector over a set of features.
• The numerical values are assigned based on the relationship between the vocabulary word and the corresponding feature.
Word Embeddings – Word2Vec: CBOW and Skip Gram
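CBOW and Skip-gram differ in their prediction direction: CBOW predicts a word from its context, while Skip-gram predicts the context from a word. A minimal sketch with the gensim library (assumed installed; the toy sentences are hypothetical):

```python
from gensim.models import Word2Vec

# Tiny, hypothetical tokenized corpus
sentences = [
    ["natural", "language", "processing", "is", "fun"],
    ["deep", "learning", "powers", "natural", "language", "models"],
    ["word", "embeddings", "capture", "word", "meaning"],
]

# sg=0 trains CBOW, sg=1 trains Skip-gram
cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

print(cbow.wv["language"][:5])                       # first 5 dimensions of the word vector
print(skipgram.wv.most_similar("language", topn=3))  # nearest neighbours in the toy space
```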
• Dependency grammar is a fundamental concept in natural language processing (NLP) that allows us to understand how words connect within sentences.
• It provides a syntactic framework for representing sentence structure based on word-to-word relationships: words are connected by directed links (dependencies).
• It focuses on head-dependent relations rather than phrase structure.
Grammar in the sentence
Dependency Parsing
• Example: 'The cat chased the mouse' • 'She enjoys playing the piano'
• chased (head) → cat (subject) • enjoys (head) → She (subject)
• chased (head) → mouse (object) • enjoys (head) → playing (object)
• cat (head) → the (determiner) • playing (head) → piano (object)
• mouse (head) → the (determiner)
• piano (head) → the (determiner)
Examples and Visualization
• Using spaCy
• spaCy is an open-source Python library for
Natural Language Processing.
• To get started, first install spaCy and load
the required language model.
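A minimal sketch of dependency parsing with spaCy (assumes spaCy is installed and the small English model has been downloaded with "python -m spacy download en_core_web_sm"):

```python
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")          # small English pipeline
doc = nlp("The cat chased the mouse")

# Print each token with its dependency label and its head word
for token in doc:
    print(f"{token.text:<8} {token.dep_:<10} head: {token.head.text}")

# Visualize the dependency tree (serves an HTML page when run as a script)
# displacy.serve(doc, style="dep")
```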
Part-of-Speech (POS) Tagging
In NLP applications, POS tagging is useful for machine translation, named entity recognition, and information extraction, among other things. It also helps resolve ambiguity in words that have multiple meanings and reveals a sentence's grammatical structure.
Consider the sentence: “The quick brown fox jumps over the lazy dog.”
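A short sketch of POS tagging for this sentence, again with spaCy and the same en_core_web_sm model assumed above:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The quick brown fox jumps over the lazy dog.")

# Coarse (Universal POS) tag and fine-grained tag for each token
for token in doc:
    print(f"{token.text:<7} {token.pos_:<6} {token.tag_}")
# e.g. The -> DET, quick -> ADJ, fox -> NOUN, jumps -> VERB, over -> ADP, dog -> NOUN
```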