Complement Naive Bayes (CNB) Algorithm

Last Updated : 02 Aug, 2025

Classification is a task where we assign labels to data based on input features. Among the various approaches available, Naive Bayes classifiers are popular for their simplicity and speed. Standard versions like Gaussian or Multinomial Naive Bayes can struggle with unbalanced datasets (where one class significantly outnumbers others). This bias toward majority classes can lead to poor performance on minority classes, which are often the most important to identify.

The Complement Naive Bayes (CNB) algorithm was developed as an extension of Multinomial Naive Bayes to address this challenge. CNB is very effective for unbalanced datasets, especially in text classification tasks.

The Challenge of Unbalanced Datasets

Unbalanced datasets occur in applications such as fraud detection, spam filtering and medical diagnosis, where the majority class dominates the data. A classifier might achieve high accuracy by predicting only the majority class, yet it can completely fail on the minority class.

Example:

In a dataset with 95% "not fraud" and 5% "fraud" cases, a model that predicts "not fraud" for all samples achieves 95% accuracy but misses all fraudulent cases. This highlights the need for approaches designed to fairly handle uneven class distributions.
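As a quick illustration of this accuracy trap, here is a minimal sketch using hypothetical labels with the same 95/5 split and a degenerate "model" that always predicts the majority class (the data is made up purely for illustration):

Python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical labels: 95 "not fraud" (0) and 5 "fraud" (1) samples
y_true = np.array([0] * 95 + [1] * 5)

# A degenerate "model" that always predicts the majority class
y_pred = np.zeros_like(y_true)

print("Accuracy:", accuracy_score(y_true, y_pred))                   # high accuracy (0.95)
print("Recall on fraud class:", recall_score(y_true, y_pred))        # 0.0: every fraud case is missed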

Complement Naive Bayes: Key Idea

CNB addresses class imbalance by estimating feature probabilities from the complement of each class, i.e., from all training documents that do not belong to that class, rather than from the class itself. Because each complement pools data from every other class, the parameter estimates are far less skewed toward the majority class, which gives more balanced estimates and better performance on imbalanced datasets:

\theta_{i,y} = \frac{ \displaystyle\sum_{d \notin y} f_{i,d} + \alpha}{ \displaystyle\sum_{j} \sum_{d \notin y} f_{j,d} + \alpha V}

Where:

  • f_{i,d}: Frequency of feature i in document d
  • d \notin y: The sums run over all documents that do not belong to class y (the complement)
  • \alpha: Smoothing parameter (commonly 1, i.e., Laplace smoothing)
  • V: Number of features (the vocabulary size in text classification)

During prediction, CNB computes a score for each class:

\text{score}(y) = \log P(y) - \sum_i f_i \log \theta_{i,y}

The class with the highest score is selected.
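The two formulas above can be written out directly. Below is a minimal NumPy sketch (not scikit-learn's implementation); the count matrix, labels and new document are made up purely to illustrate the computation:

Python
import numpy as np

# Hypothetical feature-count matrix: rows are documents, columns are features
X = np.array([[2, 1, 0],
              [3, 0, 1],
              [0, 2, 4]])
y = np.array([0, 0, 1])      # class label of each document
alpha = 1.0                  # smoothing parameter

def complement_theta(X, y, c, alpha=1.0):
    """theta_{i,c}: smoothed feature frequencies over documents NOT in class c."""
    comp = X[y != c]                        # the complement of class c
    n_features = X.shape[1]                 # V in the formula
    return (comp.sum(axis=0) + alpha) / (comp.sum() + alpha * n_features)

def cnb_score(x, prior, theta):
    """score(c) = log P(c) - sum_i f_i * log(theta_{i,c})."""
    return np.log(prior) - np.sum(x * np.log(theta))

# Score a new document for every class and pick the highest
x_new = np.array([1, 0, 2])
priors = np.bincount(y) / len(y)
scores = [cnb_score(x_new, priors[c], complement_theta(X, y, c, alpha))
          for c in np.unique(y)]
print("Scores:", scores)
print("Predicted class:", int(np.argmax(scores)))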

Example

Suppose we are classifying sentences as Apples or Bananas based on the frequencies of words such as Round, Red and Soft.

To classify a new sentence with word counts (Round=1, Red=1, Soft=1):

  • MNB estimates the probabilities for Apples using only the Apples sentences
  • CNB estimates the probabilities for Apples using the Bananas sentences (the complement), and vice versa

This adjustment reduces the bias toward the majority class, especially in unbalanced scenarios.
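A minimal scikit-learn sketch of this comparison is shown below; the word counts for Round, Red and Soft are invented purely for illustration:

Python
import numpy as np
from sklearn.naive_bayes import MultinomialNB, ComplementNB

# Hypothetical word-count matrix with columns [Round, Red, Soft]
X = np.array([
    [3, 2, 1],   # Apples sentence
    [2, 3, 0],   # Apples sentence
    [1, 0, 2],   # Bananas sentence
])
y = np.array(["Apples", "Apples", "Bananas"])

new_sentence = np.array([[1, 1, 1]])   # Round=1, Red=1, Soft=1

# Fit both variants on the same counts and compare their predictions
mnb = MultinomialNB().fit(X, y)
cnb = ComplementNB().fit(X, y)

print("MNB prediction:", mnb.predict(new_sentence)[0])
print("CNB prediction:", cnb.predict(new_sentence)[0])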

Implementing CNB

We can implement CNB using scikit-learn on the wine dataset (for demonstration purposes).

1. Import libraries and load data

We will import the required libraries and load the dataset:

  • Import load_wine for dataset loading.
  • Use train_test_split to divide data into training and test sets.
  • Import ComplementNB as the classifier.
  • Import evaluation metrics: classification_report and accuracy_score.
Python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import ComplementNB
from sklearn.metrics import classification_report, accuracy_score

# Load the wine dataset
data = load_wine()
X, y = data.data, data.target

2. Split into training and test sets

We will split the dataset into training and test sets:

  • Split the dataset into 70% training and 30% testing data.
  • Set random_state=42 for reproducibility.
Python
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

3. Train the CNB classifier

We will train the Complement Naive Bayes classifier:

  • Create a ComplementNB instance.
  • Fit the classifier on the training data.
Python
cnb = ComplementNB()
cnb.fit(X_train, y_train)

4. Evaluate the model

We will now evaluate the trained model:

  • Predict class labels for the test set using predict().
  • Print the accuracy score and the classification report for detailed metrics.
Python
y_pred = cnb.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
Output: the model's accuracy score and the classification report (per-class precision, recall and F1-score) on the test set.

Note: CNB is better suited for discrete data like text. For continuous features (as in this dataset), Gaussian Naive Bayes might perform better.
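As a quick sanity check of this note, the sketch below recreates the same split and compares ComplementNB with GaussianNB on the continuous wine features; the exact scores will depend on the split and the scikit-learn version:

Python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import ComplementNB, GaussianNB
from sklearn.metrics import accuracy_score

# Recreate the same 70/30 split used above
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Compare CNB against Gaussian Naive Bayes on the same continuous features
for model in (ComplementNB(), GaussianNB()):
    acc = accuracy_score(y_test, model.fit(X_train, y_train).predict(X_test))
    print(f"{model.__class__.__name__} accuracy: {acc:.3f}")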

When to Use CNB

  • Imbalanced class distributions: The complement approach ensures minority classes receive fairer parameter estimates.
  • Text classification: CNB handles discrete feature counts (e.g., word frequencies) very effectively.
  • Large feature spaces: CNB is computationally efficient and easy to interpret, even with many features.
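The sketch below shows the typical text-classification setup: CountVectorizer turns raw text into word counts, which ComplementNB then models. The tiny, heavily imbalanced corpus is made up purely for illustration:

Python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import ComplementNB
from sklearn.pipeline import make_pipeline

# Tiny hypothetical corpus, heavily imbalanced toward "ham"
texts = [
    "meeting at noon tomorrow", "lunch with the team", "project update attached",
    "quarterly report draft", "see you at the office",
    "win a free prize now",                      # the lone "spam" example
]
labels = ["ham", "ham", "ham", "ham", "ham", "spam"]

# CountVectorizer produces the discrete word-count features that CNB handles well
clf = make_pipeline(CountVectorizer(), ComplementNB())
clf.fit(texts, labels)

print(clf.predict(["free prize waiting for you", "team meeting moved to noon"]))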

Limitations of CNB

  • Feature independence assumption: Like all Naive Bayes variants, CNB assumes that features are conditionally independent given the class. This assumption is rarely true in real-world datasets and can reduce accuracy when violated.
  • Best suited for discrete features: CNB is primarily designed for tasks with discrete data, such as word counts in text classification. Continuous data typically requires preprocessing for optimal results (a discretization sketch follows this list).
  • Bias in balanced datasets: The complement-based parameter estimation can introduce unnecessary bias when classes are already balanced. This may reduce its advantage compared to standard Naive Bayes models.
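One possible preprocessing step for continuous data, mentioned in the second point above, is to bin each feature before fitting CNB. Below is a minimal sketch using KBinsDiscretizer on the wine dataset; the number of bins and the binning strategy are arbitrary choices for illustration:

Python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import ComplementNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.metrics import accuracy_score

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Bin each continuous feature into ordinal buckets, then fit CNB on the binned values
binned_cnb = make_pipeline(
    KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="quantile"),
    ComplementNB(),
)
binned_cnb.fit(X_train, y_train)
print("Accuracy with binned features:",
      accuracy_score(y_test, binned_cnb.predict(X_test)))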
