7-Text Classification-13-11-2024
Dr K Ganesan
Professor – HAG
SCORE, VIT
[email protected]
Phone : 6382203768
• In NLP, information is often expressed as unstructured text.
• The following steps are typically involved in extracting structured information from unstructured texts:
  1. Initial processing
  2. Proper names identification
  3. Parsing
  4. Extraction of events and relations
  5. Anaphora resolution
  6. Output result generation
• 1. Initial processing
• The first step is to break down a text into
fragments such as zones, phrases, segments, and
tokens.
• This function can be performed by tokenizers.
• Tokenizers in NLP break text down into smaller units (words, subwords, or characters) called tokens.
• Tokenization is a crucial step in tasks such as text classification, sentiment analysis, and machine translation.
• sentence = "Tokenization is an important step in NLP tasks."
• ['Tokenization', 'is', 'an', 'important', 'step', 'in',
'NLP', 'tasks', '.']
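• The same tokenization can be reproduced with a standard library tokenizer; below is a minimal sketch using NLTK's word_tokenize (an assumption - any tokenizer could be used; it needs the 'punkt' tokenizer data).

# Minimal tokenization sketch with NLTK (assumes nltk is installed and
# the 'punkt' tokenizer data can be downloaded)
import nltk
nltk.download('punkt', quiet=True)          # one-time download of tokenizer models
from nltk.tokenize import word_tokenize

sentence = "Tokenization is an important step in NLP tasks."
print(word_tokenize(sentence))
# ['Tokenization', 'is', 'an', 'important', 'step', 'in', 'NLP', 'tasks', '.']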
• Parts of speech tagging means words in a text are categorized
into their parts of speech - nouns, verbs, adjectives.
• It helps determine the grammatical structure of sentences and is used in tasks such as sentiment analysis, named entity recognition, and machine learning.
• Input Text: "The quick brown fox jumps over the lazy dog."
• POS Tagged Output: "The" - Determiner
• "quick" - Adjective
• "brown" - Adjective
• "fox" - Noun
• "jumps" - Verb
• "over" - Preposition
• "the" - Determiner
• "lazy" - Adjective
• "dog" - Noun
• Here each word is tagged with its corresponding part of speech.
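• A minimal sketch of POS tagging with NLTK is shown below (assumes the 'punkt' and 'averaged_perceptron_tagger' resources are available; the exact Penn Treebank tags may vary between model versions).

# POS tagging sketch with NLTK
import nltk
nltk.download('punkt', quiet=True)
nltk.download('averaged_perceptron_tagger', quiet=True)
from nltk import word_tokenize, pos_tag

text = "The quick brown fox jumps over the lazy dog."
print(pos_tag(word_tokenize(text)))
# expected (tags may vary by model version):
# [('The', 'DT'), ('quick', 'JJ'), ('brown', 'JJ'), ('fox', 'NN'),
#  ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'),
#  ('dog', 'NN'), ('.', '.')]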
• In natural language processing, phrasal unit identification means
identifying phrases or units of words that have a specific syntactic
or semantic meaning.
• Simple noun phrase: "the red car"
• Complex verb phrase: "will be going to the store."
• Example: "The quick brown fox jumps over the lazy dog."
• Noun Phrase (NP): "The quick brown fox" - This is a noun phrase
consisting of determiner "the," adjectives "quick" and "brown," and
the noun "fox."
• Verb Phrase (VP): "jumps over the lazy dog" - This is a verb phrase
consisting of the verb "jumps," the preposition "over," and the noun
phrase "the lazy dog."
• Prepositional Phrase (PP): "over the lazy dog" - This is a
prepositional phrase consisting of the preposition "over" and the
noun phrase "the lazy dog."
• Identifying these phrasal units helps to understand the structure
and meaning of sentences.
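• Phrasal units such as noun phrases can be extracted automatically; a minimal sketch using spaCy's noun chunker is shown below (assumes spaCy and the small English model en_core_web_sm are installed).

# Noun phrase (chunk) identification sketch with spaCy
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The quick brown fox jumps over the lazy dog.")
for chunk in doc.noun_chunks:               # noun phrases detected by the parser
    print(chunk.text)
# The quick brown fox
# the lazy dog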
2. Proper names identification
• Also known as named entity recognition (NER), this step identifies and
classifies named entities in the text into predefined categories
such as person names, organization names, locations, dates, etc.
• "John Smith works at Google headquarters in NewYork City"
• We can identify the following named entities:
• Person Name: "John Smith"
• Organization Name: "Google"
• Location: "New York City"
• NER uses ML models, such as conditional random fields (CRFs)
or deep learning models like recurrent neural networks
(RNNs) and transformers, to recognize and classify named
entities in text.
• These models are trained on annotated datasets that provide labeled examples of entities and their categories.
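• A minimal NER sketch with spaCy is given below; it uses a pretrained pipeline (en_core_web_sm, an assumption) rather than training a CRF/RNN model from scratch.

# Named entity recognition sketch with spaCy
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("John Smith works at Google headquarters in New York City")
for ent in doc.ents:                        # recognized entity spans and labels
    print(ent.text, "->", ent.label_)
# e.g. John Smith -> PERSON, Google -> ORG, New York City -> GPE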
• 3. Parsing
• Parsing refers to the process of analyzing sentences to determine
their syntactic structure.
• Here we break sentences into smaller components such as phrases and
identify the relationships between them, represented in the form of a
parse tree.
• Here's an example to illustrate parsing:
• Example: "The quick brown fox jumps over the lazy dog."
• Tokenization: The sentence is tokenized into individual words:
"The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog".
• Part-of-Speech (POS) Tagging: Each word is tagged with its
grammatical category (e.g., noun, verb, adjective):
• "The_DET", "quick_ADJ", "brown_ADJ", "fox_NOUN",
"jumps_VERB", "over_ADP", "the_DET", "lazy_ADJ",
"dog_NOUN".
• Parsing: The parsed tree structure represents syntactic
relationships between the words.
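• A minimal parsing sketch with spaCy's dependency parser is shown below (assumes en_core_web_sm); each token is printed with its POS tag, dependency relation, and head word.

# Dependency parsing sketch with spaCy
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The quick brown fox jumps over the lazy dog.")
for token in doc:
    print(f"{token.text:6} {token.pos_:6} {token.dep_:8} head={token.head.text}")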
• Example document-term count matrix (rows are documents, columns are vocabulary terms):
doc[0]  0 0 0 1 1 0 0 1 0 1 0 1
doc[1]  0 0 1 0 2 0 1 0 0 0 0 1
doc[2]  1 1 0 1 1 1 0 1 1 0 1 0
• Here we call the fit_transform() method on our training data and the
transform() method on our test data.
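• A count matrix like the one above can be produced with scikit-learn's CountVectorizer; a minimal sketch follows (the documents here are placeholders, since the vocabulary behind the matrix above is not shown).

# Document-term count matrix sketch with CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ["the cat sat on the mat", "the dog chased the cat"]
test_docs = ["the cat and the dog"]

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_docs)   # learn vocabulary and transform
X_test = vectorizer.transform(test_docs)         # reuse the learned vocabulary

print(vectorizer.get_feature_names_out())
print(X_train.toarray())
print(X_test.toarray())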
• A common way to scale the features is to use scikit-learn's
StandardScaler class.
• Data standardization is the process of rescaling the attributes so
that they have a mean of 0 and a variance of 1.
• Goal of standardization is to bring down all features to a
common scale without distorting differences in the range of
values.
• In sklearn.preprocessing.StandardScaler(), centering and scaling
happens independently on each feature.
• Here, the model built by us will learn the mean and variance of
the features of the training set.
• The parameters learned by our model from the training data will also be used to transform the test data.
• The fit method is calculating the mean and variance of each of
the features present in our data.
• The transform method is transforming all the features using
the respective mean and variance.
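• A minimal sketch of this fit/transform workflow with StandardScaler is shown below (the numbers are illustrative only).

# Standardization sketch with StandardScaler
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[30.0], [25.0], [40.0], [35.0], [28.0]])   # e.g. an Age feature
X_test = np.array([[32.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)   # learn mean/variance, then transform
X_test_scaled = scaler.transform(X_test)         # apply the training mean/variance

print(scaler.mean_, scaler.var_)
print(X_train_scaled.ravel())
print(X_test_scaled.ravel())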
• We then train a Multinomial Naïve Bayes model on the
training set and evaluate its accuracy on the testing set.
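• A minimal sketch of training and scoring a Multinomial Naive Bayes text classifier is given below; the tiny dataset and labels are made up purely for illustration (note that MultinomialNB expects non-negative count-style features).

# Multinomial Naive Bayes text classification sketch
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

texts = ["good movie", "great film", "bad movie", "awful film",
         "great acting", "bad plot"]
labels = [1, 1, 0, 0, 1, 0]                      # 1 = positive, 0 = negative

X_tr, X_te, y_tr, y_te = train_test_split(texts, labels,
                                          test_size=0.33, random_state=42)
vec = CountVectorizer()
model = MultinomialNB()
model.fit(vec.fit_transform(X_tr), y_tr)
print("accuracy:", model.score(vec.transform(X_te), y_te))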
• Multimodal Naive Bayes is an extension of the Naive Bayes
algorithm that can handle multiple types of features (or
modalities) in the input data.
• Each modality refers to a different type of attribute or
feature, say, text, numerical values, categorical variables, …
• It assumes that features are conditionally independent given
the class label, like in Naive Bayes algorithm.
• Consider a classification problem where we want to predict
whether a customer will buy a product based on two types of
features: age (numerical) and profession (categorical).
• Step 1: Data Preparation - We have following training data:
Age Profession Bought
30 Engineer Yes
25 Doctor Yes
40 Teacher No
35 Engineer No
28 Doctor Yes
• Step 2: One-Hot Encoding - One-hot encoding converts categorical variables into a binary format. It creates new binary columns (0s and 1s), one for each category of the original column, where a value of 1 indicates the presence of that category and 0 indicates its absence.
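• A minimal one-hot encoding sketch with pandas for the training data above (scikit-learn's OneHotEncoder would work similarly):

# One-hot encoding sketch with pandas
import pandas as pd

df = pd.DataFrame({
    "Age": [30, 25, 40, 35, 28],
    "Profession": ["Engineer", "Doctor", "Teacher", "Engineer", "Doctor"],
    "Bought": ["Yes", "Yes", "No", "No", "Yes"],
})
encoded = pd.get_dummies(df, columns=["Profession"])
print(encoded)   # adds binary columns Profession_Doctor, Profession_Engineer, Profession_Teacher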
• Step 3: Calculate Class Probabilities
• Calculate prior probabilities P(Bought=Yes) and P(Bought=No):
• P(Bought=Yes) = (Number of 'Yes' instances) / (Total number of
instances) = 3/5
• P(Bought=No) = (Number of 'No' instances) / (Total number of
instances) = 2/5
• Step 4: Calculate Feature Probabilities
• For numerical features (like Age), use probability density functions
(PDFs) such as Gaussian distribution for simplicity.
• Let's assume Gaussian distributions for each class:
• For Bought=Yes: Mean = (30+25+28)/3 ≈ 27.7
Age: Mean = 27.7, Variance = 15.67 (calculated from 'Yes' instances)
• For Bought=No: Mean = (40+35)/2 = 37.5
Age: Mean = 37.5, Variance = 12.5 (calculated from 'No' instances)
• Step 5: Predictions
• Given a new instance with Age=32 and Profession=Engineer, we
calculate the conditional probabilities for each class:
• For Bought=Yes:
P(Age=32 | Bought=Yes) = Gaussian(32, 27.7, 15.67) ≈ 0.071
P(Engineer | Bought=Yes) = 2/3 (since 2 out of 3 'Yes' instances are Engineers)
P(Bought=Yes) = 3/5
Posterior score = P(Age=32 | Bought=Yes) × P(Engineer | Bought=Yes) × P(Bought=Yes) = 0.071 × (2/3) × (3/5) ≈ 0.028
• For Bought=No:
P(Age=32 | Bought=No) = Gaussian(32, 37.5, 12.5) ≈ 0.055
P(Engineer | Bought=No) = 1/2 (as 1 out of 2 'No' instances is an Engineer)
P(Bought=No) = 2/5
Posterior score = P(Age=32 | Bought=No) × P(Engineer | Bought=No) × P(Bought=No) = 0.055 × (1/2) × (2/5) ≈ 0.011
• As the posterior score for Bought=Yes is higher, we predict that
the customer will buy the product.
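• The whole hand calculation can be reproduced with a short NumPy sketch; note that the exact density values depend on the variance convention used (population vs. sample variance), so they may differ slightly from the figures above, but the predicted class is the same.

# Hand-rolled Naive Bayes prediction sketch (NumPy)
import numpy as np

def gaussian_pdf(x, mean, var):
    # f(x) = exp(-(x - mean)^2 / (2*var)) / sqrt(2*pi*var)
    return np.exp(-(x - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

ages_yes, ages_no = np.array([30, 25, 28]), np.array([40, 35])
p_yes, p_no = 3 / 5, 2 / 5                       # class priors
p_eng_yes, p_eng_no = 2 / 3, 1 / 2               # P(Engineer | class)

# Likelihood of Age=32 under each class, using sample variance (ddof=1)
p_age_yes = gaussian_pdf(32, ages_yes.mean(), ages_yes.var(ddof=1))
p_age_no = gaussian_pdf(32, ages_no.mean(), ages_no.var(ddof=1))

score_yes = p_age_yes * p_eng_yes * p_yes
score_no = p_age_no * p_eng_no * p_no
print("Yes:", score_yes, "No:", score_no)
print("Prediction:", "Yes" if score_yes > score_no else "No")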
• To calculate Gaussian function values, we need to know the formula
for the Gaussian distribution, which is also known as the normal
distribution.
• The Gaussian distribution is given by the formula:
f(x) = (1 / (σ√(2π))) · exp(−(x − μ)² / (2σ²))
where x is the input value (e.g., the age in our example), μ is the mean of
the distribution, and σ is the standard deviation of the distribution.
• Suppose we have a Gaussian distribution with mean 𝜇=37.5 and
standard deviation 𝜎=12.5. Calculate 𝑓(32), which is Gaussian
function value for age =32 in this distribution. We get 0.055
• Scikit-learn's model.score(X_test, y_test) for regression models is based on the
coefficient of determination, i.e. R² (its value typically lies between 0 and 1;
for classifiers, score() returns accuracy instead).
• u = ((y_test - y_predicted) ** 2).sum()
• v = ((y_test - y_test.mean()) ** 2).sum()
• R2 = 1 - (u/v); [If R2 = 0, dependent variable cannot be predicted
from independent variable. If R2 = 1, then dependent variable can
be predicted from the independent variable without any error.]
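• A minimal sketch of this R² calculation with NumPy (the y values are illustrative only):

# R^2 (coefficient of determination) sketch
import numpy as np

y_test = np.array([3.0, 5.0, 7.0, 9.0])
y_predicted = np.array([2.8, 5.1, 7.2, 8.7])

u = ((y_test - y_predicted) ** 2).sum()          # residual sum of squares
v = ((y_test - y_test.mean()) ** 2).sum()        # total sum of squares
r2 = 1 - u / v
print("R^2:", r2)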
News Classification- Another example of text classification is news
classification, where we can train a model to predict the category of
a news article based on its content.
• Load dataset of news articles into a Pandas DataFrame.
• Then split the data into training and testing sets using the
train_test_split function from scikit-learn.
• TfidfVectorizer is a feature extraction technique used in
NLP and text mining.
• It stands for Term Frequency-Inverse Document Frequency
Vectorizer.
• This technique is used to convert a collection of raw
documents into a matrix of TF-IDF features.
• TF-IDF evaluates importance of a word in a document
relative to a collection of documents.
• Suppose we have three documents:
• "The quick brown fox jumps over the lazy dog."
• "The dog chased the cat."
• "The cat climbed the tree."
• Step 1: Tokenization
• Text is tokenized into individual words. Tokenization results are:
• Document 1: ['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the',
'lazy', 'dog']
• Document 2: ['the', 'dog', 'chased', 'the', 'cat']
• Document 3: ['the', 'cat', 'climbed', 'the', 'tree']
• Step 2: TF (Term Frequency) Calculation
• TfidfVectorizer calculates TF for each word in each document.
• TF is count of how many times a word appears in a document,
normalized by the total number of words in the document.
• For example, let's calculate TF for the word 'the' in Document 1:
• Total words in Document 1 = 9
• Count of 'the' in Document 1 = 2
• TF('the', Document 1) = 2/9 ≈ 0.2222
• Similarly, TF values are calculated for all words in all documents.
• Step 3: IDF (Inverse Document Frequency) Calculation
• Now, TfidfVectorizer computes the inverse document
frequency (IDF) for each word.
• IDF measures how important a word is in all documents.
• Words that appear frequently in many documents are
penalized, while rare words are given higher weight.
• A common formulation is IDF(w) = log(N / df(w)), where N is the total number of documents and df(w) is the number of documents that contain w.
• Finally, the TF-IDF score is computed for each word in each document using the formula:
TF-IDF(w, d) = TF(w, d) × IDF(w)
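• A minimal sketch with scikit-learn's TfidfVectorizer on the three documents above is given below; note that scikit-learn uses a smoothed IDF (log((1+N)/(1+df(w))) + 1) and L2-normalizes each row, so its values differ slightly from the hand calculation.

# TF-IDF feature extraction sketch with TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "The quick brown fox jumps over the lazy dog.",
    "The dog chased the cat.",
    "The cat climbed the tree.",
]
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)            # tokenize, compute TF-IDF, normalize
print(vectorizer.get_feature_names_out())
print(tfidf.toarray().round(3))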
• When a surfer reaches a page, there are two possibilities: with probability β,
the surfer jumps to a random page, and with probability 1−β, the surfer follows
a link on the current page.
• This gives the PageRank update used in the iterations below:
PR(p) = (1 − d) + d · Σ PR(q)/C(q), where the sum is over pages q that link to p,
C(q) is the number of outgoing links of q, and the damping factor d = 1 − β = 0.8 here.
Iteration 0:
For iteration 0, assume that each page has PageRank = 1 / (total number of nodes)
PR(A) = PR(B) = PR(C) = PR(D) = PR(E) = PR(F) = 1/6 ≈ 0.16
Iteration 1:
Using the formula above:
PR(A) = (1-0.8) + 0.8 * (PR(B)/4 + PR(C)/2) = (1-0.8) + 0.8 * (0.16/4 + 0.16/2) ≈ 0.3
• What we have done for node A is to look at its incoming links: here they
come from B and C, so PR(B) and PR(C) appear in the sum.
• For each incoming link, we divide by the number of outgoing links of the
page it comes from, i.e. B has 4 outgoing links and C has 2 outgoing links.
• The same procedure applies to the remaining nodes and iterations.
• USE UPDATED PAGE RANK FOR FURTHER CALCULATIONS.
• PR(B) = (1-0.8) + 0.8 * PR(A)/2 = (1-0.8) + 0.8 * 0.3/2 = 0.32
• PR(C) = (1-0.8) + 0.8 * PR(A)/2 = (1-0.8) + 0.8 * 0.3/2 = 0.32
• PR(D) = (1-0.8) + 0.8 * PR(B)/4 = (1-0.8) + 0.8 * 0.32/4 = 0.264
• PR(E) = (1-0.8) + 0.8 * PR(B)/4 = (1-0.8) + 0.8 * 0.32/4 = 0.264
• PR(F) = (1-0.8) + 0.8 * (PR(B)/4 + PR(C)/2) = (1-0.8) + 0.8 *
(0.32/4 + 0.32/2) = 0.392
• This is for iteration 1, now let us calculate for iteration 2.
• Iteration 2:
• By using the above-mentioned formula
• PR(A) = (1-0.8) + 0.8 * (PR(B)/4 + PR(C)/2) = (1-0.8) + 0.8 *
(0.32/4 + 0.32/2) = 0.392
USE THE UPDATED PAGE RANK FOR FURTHER CALCULATIONS.
• PR(B) = (1-0.8) + 0.8 * PR(A)/2 = (1-0.8) + 0.8 * 0.392/2 =
0.3568
• PR(C) = (1-0.8) + 0.8 * PR(A)/2 = (1-0.8) + 0.8 * 0.392/2 =
0.3568
• PR(D) = (1-0.8) + 0.8 * PR(B)/4 = (1-0.8) + 0.8 * 0.3568/4 =
0.2714
• PR(E) = (1-0.8) + 0.8 * PR(B)/4 = (1-0.8) + 0.8 * 0.3568/4 =
0.2714
• PR(F) = (1-0.8) + 0.8 * (PR(B)/4 + PR(C)/2) = (1-0.8) + 0.8 *
(0.3568/4 + 0.3568/2) ≈ 0.4141
NODES ITERATION 0 ITERATION 1 ITERATION 2
A 1/6 = 0.16 0.3 0.392
B 1/6 = 0.16 0.32 0.3568
C 1/6 = 0.16 0.32 0.3568
D 1/6 = 0.16 0.264 0.2714
E 1/6 = 0.16 0.264 0.2714
F 1/6 = 0.16 0.392 0.4141
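• The iterations above can be reproduced with a short Python sketch, assuming the link structure implied by the calculations (A→B, A→C; B→A, B→D, B→E, B→F; C→A, C→F) and damping factor d = 0.8.

# Iterative PageRank sketch for the graph implied by the calculations above
links = {
    "A": ["B", "C"],
    "B": ["A", "D", "E", "F"],
    "C": ["A", "F"],
    "D": [], "E": [], "F": [],
}
d = 0.8
pr = {node: 1 / len(links) for node in links}    # iteration 0: 1/6 each

for it in range(1, 3):
    for node in links:
        # use already-updated ranks within the same iteration, as in the slides
        incoming = [q for q in links if node in links[q]]
        pr[node] = (1 - d) + d * sum(pr[q] / len(links[q]) for q in incoming)
    print(f"Iteration {it}:", {n: round(v, 4) for n, v in pr.items()})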