Applying Multinomial Naive Bayes to NLP Problems

Last Updated : 11 Jul, 2025

Multinomial Naive Bayes (MNB) is a popular machine learning algorithm for text classification problems in Natural Language Processing (NLP). It is particularly useful for problems that involve text data with discrete features such as word frequency counts. MNB is based on Bayes' theorem and assumes that the features are conditionally independent given the class variable.

Here are the steps for applying Multinomial Naive Bayes to NLP problems:

Preprocessing the text data: The text data needs to be preprocessed before applying the algorithm. This involves steps such as tokenization, stop-word removal, and stemming or lemmatization.

Feature extraction: The text data needs to be converted into a feature vector format that can be used as input to the MNB algorithm. The most common method of feature extraction is to use a bag-of-words model, where each document is represented by a vector of word frequency counts.

Splitting the data: The data needs to be split into training and testing sets. The training set is used to train the MNB model, while the testing set is used to evaluate its performance.

Training the MNB model: The MNB model is trained on the training set by estimating the probabilities of each feature given each class. This involves calculating the prior probabilities of each class and the likelihood of each feature given each class.

Evaluating the performance of the model: The performance of the model is evaluated using metrics such as accuracy, precision, recall, and F1-score on the testing set.

Using the model to make predictions: Once the model is trained, it can be used to make predictions on new text data. The text data is preprocessed and transformed into the feature vector format, which is then input to the trained model to obtain the predicted class label.
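
The steps above can be sketched with scikit-learn as follows. This is a minimal illustration rather than a tuned setup: the tiny spam/ham corpus is made up for the example, the train/test split is written out by hand because there are too few documents to split randomly, and stop-word removal is delegated to `CountVectorizer`'s built-in English list instead of a separate preprocessing step.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus, invented for this sketch.
train_texts = [
    "win cash prize now",
    "free prize claim now",
    "meeting at noon tomorrow",
    "lunch tomorrow with team",
]
train_labels = ["spam", "spam", "ham", "ham"]
test_texts = ["free cash prize", "team meeting tomorrow"]
test_labels = ["spam", "ham"]

# Preprocessing and feature extraction: lowercasing, tokenization,
# stop-word removal and word counts, followed by the MNB classifier.
model = make_pipeline(CountVectorizer(stop_words="english"), MultinomialNB())

# Training: estimates class priors and per-class word likelihoods.
model.fit(train_texts, train_labels)

# Evaluation on the held-out documents.
test_predictions = model.predict(test_texts)
test_accuracy = accuracy_score(test_labels, test_predictions)
print("accuracy:", test_accuracy)

# Prediction on new text; the pipeline applies the same vectorizer first.
new_prediction = model.predict(["claim your free cash"])[0]
print("prediction:", new_prediction)
```

Wrapping the vectorizer and the classifier in one pipeline means new text is always transformed with the vocabulary learned during training.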

MNB is a simple and efficient algorithm that works well for many NLP problems such as sentiment analysis, spam detection, and topic classification. However, it has some limitations, such as the assumption of independence between features, which may not hold true in some cases. Therefore, it is important to carefully evaluate the performance of the model before using it in a real-world application.

The Naive Bayes classifier is a family of probabilistic algorithms based on applying Bayes’ theorem with the “naive” assumption of conditional independence between every pair of features, given the class.
Bayes’ theorem gives the probability P(c|x), where c is one of the possible classes and x is the instance to be classified, described by its features.
P(c|x) = P(x|c) * P(c) / P(x)
Naive Bayes classifiers are widely used in natural language processing (NLP) problems to predict the tag of a text. They calculate the probability of each tag for a given text and then output the tag with the highest probability.
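
To make the formula concrete, the sketch below evaluates Bayes' theorem for two classes. The priors and likelihoods are made-up numbers chosen for illustration, and the helper name `posterior` is hypothetical. P(x) is obtained by summing P(x|c) * P(c) over the classes.

```python
def posterior(priors, likelihoods):
    """Return P(c|x) for every class c, given P(c) and P(x|c)."""
    # Numerator of Bayes' theorem for each class: P(x|c) * P(c)
    joint = {c: likelihoods[c] * priors[c] for c in priors}
    # Denominator P(x): total probability of x over all classes
    evidence = sum(joint.values())
    return {c: joint[c] / evidence for c in joint}


# Illustrative numbers only (not taken from any dataset)
priors = {"positive": 0.6, "negative": 0.4}
likelihoods = {"positive": 0.02, "negative": 0.005}

result = posterior(priors, likelihoods)
predicted = max(result, key=result.get)
print(result, predicted)
```

Because P(x) is the same for every class, the class with the largest numerator is also the class with the largest posterior.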
How the Naive Bayes Algorithm Works
Let's consider an example: classifying whether a review is positive or negative.
Training Dataset:

Text	Reviews
“I liked the movie”	positive
“It’s a good movie. Nice story”	positive
“Nice songs. But sadly boring ending.”	negative
“Hero’s acting is bad but heroine looks good. Overall nice movie”	positive
“Sad, boring movie”	negative

We want to classify whether the text “overall liked the movie” is a positive or a negative review. We have to calculate:
P(positive | overall liked the movie) — the probability that the tag of a sentence is positive given that the sentence is “overall liked the movie”.
P(negative | overall liked the movie) — the probability that the tag of a sentence is negative given that the sentence is “overall liked the movie”.
Before that, we first remove stopwords and apply stemming to the text.
Removing stopwords: stopwords are common words that add little to the classification, such as “able”, “either”, “else”, “ever” and so on.
Stemming: reducing each word to its root form.
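After applying these two techniques (with NLTK's English stopword list and the Porter stemmer), our text becomes the following. For readability, the probability calculations further below are written with the original words.

Text	Reviews
“like movi”	positive
“good movi nice stori”	positive
“nice song sadli bore end”	negative
“hero act bad heroin look good overal nice movi”	positive
“sad bore movi”	negative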
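
A minimal sketch of the stopword-removal step, assuming a small hand-picked stopword set (`STOPWORDS` below is illustrative, not a standard list, and `preprocess` is a hypothetical helper). Stemming is left out here because it is normally taken from a library such as NLTK's `PorterStemmer`.

```python
import re

# Illustrative stopword set; real lists (for example NLTK's) are much longer.
STOPWORDS = {"i", "the", "it", "s", "a", "is", "but"}


def preprocess(text):
    """Lowercase, keep alphabetic tokens only, and drop stopwords."""
    # Non-letters become spaces so neighbouring words are not glued together.
    letters_only = re.sub(r"[^a-zA-Z]", " ", text)
    tokens = letters_only.lower().split()
    return [token for token in tokens if token not in STOPWORDS]


print(preprocess("I liked the movie"))
print(preprocess("Nice songs. But sadly boring ending."))
```

Replacing punctuation with a space rather than an empty string matters: otherwise every review collapses into a single long token.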

Feature Engineering:
The important part is to find features in the data that machine learning algorithms can work with. In this case, we have text, which we need to convert into numbers that we can do calculations on. We use word frequencies: every document is treated as a bag of the words it contains, ignoring their order, and our features are the counts of each of these words.
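
As an illustration of word-count features, the sketch below builds a vocabulary and one count vector per document with the standard library. The whitespace tokenization and the helper name `count_vector` are simplifications assumed for this example.

```python
from collections import Counter

documents = [
    "i liked the movie",
    "sad boring movie",
]

# Vocabulary: every distinct word, in sorted order so columns are stable.
vocabulary = sorted({word for doc in documents for word in doc.split()})


def count_vector(doc):
    """Map a document to its word counts, one entry per vocabulary word."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]


vectors = [count_vector(doc) for doc in documents]
print(vocabulary)
print(vectors)
```

Each position in a vector corresponds to one vocabulary word, so all documents share the same feature layout regardless of their length.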
In our case, we want P(positive | overall liked the movie). By Bayes’ theorem:

P(positive | overall liked the movie) = P(overall liked the movie | positive) * P(positive) / P(overall liked the movie)

Since our classifier only has to find out which tag has the bigger probability, we can discard the divisor, which is the same for both tags, and compare
P(overall liked the movie | positive) * P(positive) with P(overall liked the movie | negative) * P(negative).
There’s a problem though: “overall liked the movie” doesn’t appear in our training dataset, so the estimated probability of the whole sentence is zero. Here, we make the “naive” assumption that every word in a sentence is independent of the other ones. This means that we now look at individual words.
We can write this as:

P(overall liked the movie) = P(overall) * P(liked) * P(the) * P(movie)

The next step is to apply the same assumption to the probability of the sentence given a tag:

P(overall liked the movie| positive) = P(overall | positive) * P(liked | positive) * P(the | positive) * P(movie | positive)

And now, these individual words actually show up several times in our training data, and we can calculate them!
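
Written generally, for a text made of words w_1, ..., w_n and a tag c, the comparison described above amounts to the following decision rule. The symbols w_i, n and \hat{c} are notation introduced here, not part of the original example.

```latex
\hat{c} = \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(w_i \mid c)
```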
Calculating probabilities:
First, we calculate the a priori probability of each tag: for a given sentence in our training data, the probability that it is positive P(positive) is 3/5. Then, P(negative) is 2/5.
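
The prior probabilities can be reproduced by counting the tags, as in this small sketch. The `labels` list simply restates the tags of the five training reviews, and exact fractions are used so the output matches 3/5 and 2/5.

```python
from collections import Counter
from fractions import Fraction

# Tags of the five training reviews, in the order of the training table.
labels = ["positive", "positive", "negative", "positive", "negative"]

label_counts = Counter(labels)
priors = {tag: Fraction(count, len(labels)) for tag, count in label_counts.items()}

for tag, prior in priors.items():
    print(f"P({tag}) = {prior}")
```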
Then, calculating P(overall | positive) means counting how many times the word “overall” appears in positive texts (1) divided by the total number of words in positive texts (17). Therefore, P(overall | positive) = 1/17, P(liked | positive) = 1/17, P(the | positive) = 2/17, P(movie | positive) = 3/17.
If a word never appears with a tag, its probability comes out as zero, which would make the whole product zero. We avoid this with Laplace smoothing: we add 1 to every count so that it is never zero. To balance this, we add the number of possible words to the divisor, so the result will never be greater than 1. In our case, the number of possible words is 21.
Applying smoothing, the results are:

Word	P(word | positive)	P(word | negative)
overall	(1 + 1)/(17 + 21)	(0 + 1)/(7 + 21)
liked	(1 + 1)/(17 + 21)	(0 + 1)/(7 + 21)
the	(2 + 1)/(17 + 21)	(0 + 1)/(7 + 21)
movie	(3 + 1)/(17 + 21)	(1 + 1)/(7 + 21)

Now we just multiply all the probabilities and see which is bigger:

P(overall | positive) * P(liked | positive) * P(the | positive) * P(movie | positive) * P(positive) = 1.38 * 10^{-5} = 0.0000138
P(overall | negative) * P(liked | negative) * P(the | negative) * P(movie | negative) * P(negative) = 0.13 * 10^{-5} = 0.0000013

Our classifier gives “overall liked the movie” the positive tag.
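
The arithmetic above can be checked with a short script. This is only a sketch: the word counts, class totals (17 and 7) and vocabulary size (21) are copied from the worked example as stated in the text, and the helper names `smoothed_likelihood` and `score` are hypothetical.

```python
from fractions import Fraction
from math import prod

VOCABULARY_SIZE = 21  # number of possible words, as stated in the text
TOTAL_WORDS = {"positive": 17, "negative": 7}
PRIORS = {"positive": Fraction(3, 5), "negative": Fraction(2, 5)}

# Raw counts of each query word per tag, as stated in the worked example.
WORD_COUNTS = {
    "positive": {"overall": 1, "liked": 1, "the": 2, "movie": 3},
    "negative": {"overall": 0, "liked": 0, "the": 0, "movie": 1},
}


def smoothed_likelihood(word, tag):
    """Laplace-smoothed P(word | tag): (count + 1) / (total + vocabulary)."""
    return Fraction(WORD_COUNTS[tag][word] + 1, TOTAL_WORDS[tag] + VOCABULARY_SIZE)


def score(words, tag):
    """Unnormalized posterior: P(tag) times the product of word likelihoods."""
    return PRIORS[tag] * prod(smoothed_likelihood(w, tag) for w in words)


query = ["overall", "liked", "the", "movie"]
scores = {tag: score(query, tag) for tag in PRIORS}
best_tag = max(scores, key=scores.get)

for tag, value in scores.items():
    print(f"{tag}: {float(value):.3g}")
print("predicted tag:", best_tag)
```

Using `Fraction` keeps the intermediate values exact, so the only rounding happens when the scores are printed.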
Below is the implementation:

Python

# cleaning texts
import re

import nltk
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer

dataset = [["I liked the movie", "positive"],
           ["It’s a good movie. Nice story", "positive"],
           ["Hero’s acting is bad but heroine looks good. "
            "Overall nice movie", "positive"],
           ["Nice songs. But sadly boring ending.", "negative"],
           ["Sad, boring movie", "negative"]]

dataset = pd.DataFrame(dataset, columns=["Text", "Reviews"])

nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
ps = PorterStemmer()

corpus = []

for review in dataset['Text']:
    # replace every non-letter with a space so word boundaries survive
    text = re.sub('[^a-zA-Z]', ' ', review)
    words = text.lower().split()
    # drop stopwords, then stem the remaining words
    words = [ps.stem(word) for word in words if word not in stop_words]
    corpus.append(' '.join(words))

# creating bag of words model
cv = CountVectorizer(max_features=1500)

X = cv.fit_transform(corpus).toarray()
y = dataset.iloc[:, 1].values

Python

# splitting the data set into training set and test set
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

Python

# fitting multinomial naive bayes to the training set
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix

# alpha=1.0 is Laplace (add-one) smoothing
classifier = MultinomialNB(alpha=1.0)
classifier.fit(X_train, y_train)

# predicting test set results
y_pred = classifier.predict(X_test)

# making the confusion matrix (rows: true labels, columns: predicted labels)
cm = confusion_matrix(y_test, y_pred, labels=["negative", "positive"])
print(cm)
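
Once a model is fitted, a new sentence can be classified by transforming it with the same fitted vectorizer. The sketch below is self-contained and makes two simplifying assumptions that differ from the code above: it trains on all five reviews rather than on a split, and it feeds the raw text to `CountVectorizer` without stopword removal or stemming.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = [
    "I liked the movie",
    "It's a good movie. Nice story",
    "Nice songs. But sadly boring ending.",
    "Hero's acting is bad but heroine looks good. Overall nice movie",
    "Sad, boring movie",
]
tags = ["positive", "positive", "negative", "positive", "negative"]

# Learn the vocabulary on the training texts only.
vectorizer = CountVectorizer()
X_all = vectorizer.fit_transform(texts)

# alpha=1.0 corresponds to Laplace (add-one) smoothing.
model = MultinomialNB(alpha=1.0)
model.fit(X_all, tags)

# Reuse the fitted vectorizer: transform (never fit again) on new text.
new_text = ["overall liked the movie"]
X_new = vectorizer.transform(new_text)

predicted_tag = model.predict(X_new)[0]
probabilities = dict(zip(model.classes_, model.predict_proba(X_new)[0]))
print(predicted_tag, probabilities)
```

Calling `transform` instead of `fit_transform` on the new sentence keeps its columns aligned with the vocabulary the classifier was trained on.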


Darshika Sanghi
