Complement Naive Bayes (CNB) Algorithm

Last Updated : 02 Aug, 2025

Classification is a task where we assign labels to data based on input features. Among the various approaches available, Naive Bayes classifiers are popular for their simplicity and speed. Standard versions like Gaussian or Multinomial Naive Bayes can struggle with unbalanced datasets (where one class significantly outnumbers others). This bias toward majority classes can lead to poor performance on minority classes, which are often the most important to identify.

The Complement Naive Bayes (CNB) algorithm was developed as an extension of Multinomial Naive Bayes to address this challenge. CNB is very effective for unbalanced datasets, especially in text classification tasks.

The Challenge of Unbalanced Datasets

Unbalanced datasets occur in applications such as fraud detection, spam filtering and medical diagnosis, where the majority class dominates the data. A classifier might achieve high accuracy by predicting only the majority class, yet it can completely fail on the minority class.

Example:

In a dataset with 95% "not fraud" and 5% "fraud" cases, a model that predicts "not fraud" for all samples achieves 95% accuracy but misses all fraudulent cases. This highlights the need for approaches designed to fairly handle uneven class distributions.
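As a quick illustration of this accuracy trap, here is a minimal sketch using hypothetical labels with the same 95/5 split and a degenerate "model" that always predicts the majority class (the data is made up purely for illustration):

Python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical labels: 95 "not fraud" (0) and 5 "fraud" (1) samples
y_true = np.array([0] * 95 + [1] * 5)

# A degenerate "model" that always predicts the majority class
y_pred = np.zeros_like(y_true)

print("Accuracy:", accuracy_score(y_true, y_pred))                   # high accuracy (0.95)
print("Recall on fraud class:", recall_score(y_true, y_pred))        # 0.0: every fraud case is missed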

Complement Naive Bayes: Key Idea

CNB addresses class imbalance by estimating feature probabilities from the complement of each class, i.e., from all training documents that do not belong to that class, rather than from the class itself. Because each complement pools data from every other class, the parameter estimates are far less skewed toward the majority class, which gives more balanced estimates and better performance on imbalanced datasets:

\theta_{i,y} = \frac{ \displaystyle\sum_{d \notin y} f_{i,d} + \alpha}{ \displaystyle\sum_{j} \sum_{d \notin y} f_{j,d} + \alpha V}

Where:

  • f_{i,d}: Frequency of feature i in document d
  • d \notin y: The sums run over all documents that do not belong to class y (the complement)
  • \alpha: Smoothing parameter (commonly 1, i.e., Laplace smoothing)
  • V: Number of features (the vocabulary size in text classification)

During prediction, CNB computes a score for each class:

\text{score}(y) = \log P(y) - \sum_i f_i \log \theta_{i,y}

The class with the highest score is selected.
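The two formulas above can be written out directly. Below is a minimal NumPy sketch (not scikit-learn's implementation); the count matrix, labels and new document are made up purely to illustrate the computation:

Python
import numpy as np

# Hypothetical feature-count matrix: rows are documents, columns are features
X = np.array([[2, 1, 0],
              [3, 0, 1],
              [0, 2, 4]])
y = np.array([0, 0, 1])      # class label of each document
alpha = 1.0                  # smoothing parameter

def complement_theta(X, y, c, alpha=1.0):
    """theta_{i,c}: smoothed feature frequencies over documents NOT in class c."""
    comp = X[y != c]                        # the complement of class c
    n_features = X.shape[1]                 # V in the formula
    return (comp.sum(axis=0) + alpha) / (comp.sum() + alpha * n_features)

def cnb_score(x, prior, theta):
    """score(c) = log P(c) - sum_i f_i * log(theta_{i,c})."""
    return np.log(prior) - np.sum(x * np.log(theta))

# Score a new document for every class and pick the highest
x_new = np.array([1, 0, 2])
priors = np.bincount(y) / len(y)
scores = [cnb_score(x_new, priors[c], complement_theta(X, y, c, alpha))
          for c in np.unique(y)]
print("Scores:", scores)
print("Predicted class:", int(np.argmax(scores)))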

Example

Suppose we are classifying sentences as Apples or Bananas based on the frequencies of words such as Round, Red and Soft.

To classify a new sentence with word counts (Round=1, Red=1, Soft=1):

  • MNB estimates the probabilities for Apples using only the Apples sentences
  • CNB estimates the probabilities for Apples using the Bananas sentences (the complement), and vice versa

This adjustment reduces the bias toward the majority class, especially in unbalanced scenarios.
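A minimal scikit-learn sketch of this comparison is shown below; the word counts for Round, Red and Soft are invented purely for illustration:

Python
import numpy as np
from sklearn.naive_bayes import MultinomialNB, ComplementNB

# Hypothetical word-count matrix with columns [Round, Red, Soft]
X = np.array([
    [3, 2, 1],   # Apples sentence
    [2, 3, 0],   # Apples sentence
    [1, 0, 2],   # Bananas sentence
])
y = np.array(["Apples", "Apples", "Bananas"])

new_sentence = np.array([[1, 1, 1]])   # Round=1, Red=1, Soft=1

# Fit both variants on the same counts and compare their predictions
mnb = MultinomialNB().fit(X, y)
cnb = ComplementNB().fit(X, y)

print("MNB prediction:", mnb.predict(new_sentence)[0])
print("CNB prediction:", cnb.predict(new_sentence)[0])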

Implementing CNB

We can implement CNB using scikit-learn on the wine dataset (for demonstration purposes).

1. Import libraries and load data

We will import the required libraries and load the dataset:

  • Import load_wine for dataset loading.
  • Use train_test_split to divide data into training and test sets.
  • Import ComplementNB as the classifier.
  • Import evaluation metrics: classification_report and accuracy_score.
Python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import ComplementNB
from sklearn.metrics import classification_report, accuracy_score

# Load the wine dataset
data = load_wine()
X, y = data.data, data.target

2. Split into training and test sets

We will split the dataset into training and test sets:

  • Split the dataset into 70% training and 30% testing data.
  • Set random_state=42 for reproducibility.
Python
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

3. Train the CNB classifier

We will train the Complement Naive Bayes classifier:

  • Create a ComplementNB instance.
  • Fit the classifier on the training data.
Python
cnb = ComplementNB()
cnb.fit(X_train, y_train)

4. Evaluate the model

We will now evaluate the trained model:

  • Predict class labels for the test set using predict().
  • Print the accuracy score and the classification report for detailed metrics.
Python
y_pred = cnb.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
Output: the model's accuracy score and the classification report (per-class precision, recall and F1-score) on the test set.

Note: CNB is better suited for discrete data like text. For continuous features (as in this dataset), Gaussian Naive Bayes might perform better.
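As a quick sanity check of this note, the sketch below recreates the same split and compares ComplementNB with GaussianNB on the continuous wine features; the exact scores will depend on the split and the scikit-learn version:

Python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import ComplementNB, GaussianNB
from sklearn.metrics import accuracy_score

# Recreate the same 70/30 split used above
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Compare CNB against Gaussian Naive Bayes on the same continuous features
for model in (ComplementNB(), GaussianNB()):
    acc = accuracy_score(y_test, model.fit(X_train, y_train).predict(X_test))
    print(f"{model.__class__.__name__} accuracy: {acc:.3f}")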

When to Use CNB

  • Imbalanced class distributions: The complement approach ensures minority classes receive fairer parameter estimates.
  • Text classification: CNB handles discrete feature counts (e.g., word frequencies) very effectively.
  • Large feature spaces: CNB is computationally efficient and easy to interpret, even with many features.
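The sketch below shows the typical text-classification setup: CountVectorizer turns raw text into word counts, which ComplementNB then models. The tiny, heavily imbalanced corpus is made up purely for illustration:

Python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import ComplementNB
from sklearn.pipeline import make_pipeline

# Tiny hypothetical corpus, heavily imbalanced toward "ham"
texts = [
    "meeting at noon tomorrow", "lunch with the team", "project update attached",
    "quarterly report draft", "see you at the office",
    "win a free prize now",                      # the lone "spam" example
]
labels = ["ham", "ham", "ham", "ham", "ham", "spam"]

# CountVectorizer produces the discrete word-count features that CNB handles well
clf = make_pipeline(CountVectorizer(), ComplementNB())
clf.fit(texts, labels)

print(clf.predict(["free prize waiting for you", "team meeting moved to noon"]))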

Limitations of CNB

  • Feature independence assumption: Like all Naive Bayes variants, CNB assumes that features are conditionally independent given the class. This assumption is rarely true in real-world datasets and can reduce accuracy when violated.
  • Best suited for discrete features: CNB is primarily designed for tasks with discrete data, such as word counts in text classification. Continuous data typically requires preprocessing for optimal results (a discretization sketch follows this list).
  • Bias in balanced datasets: The complement-based parameter estimation can introduce unnecessary bias when classes are already balanced. This may reduce its advantage compared to standard Naive Bayes models.
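One possible preprocessing step for continuous data, mentioned in the second point above, is to bin each feature before fitting CNB. Below is a minimal sketch using KBinsDiscretizer on the wine dataset; the number of bins and the binning strategy are arbitrary choices for illustration:

Python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import ComplementNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.metrics import accuracy_score

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Bin each continuous feature into ordinal buckets, then fit CNB on the binned values
binned_cnb = make_pipeline(
    KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="quantile"),
    ComplementNB(),
)
binned_cnb.fit(X_train, y_train)
print("Accuracy with binned features:",
      accuracy_score(y_test, binned_cnb.predict(X_test)))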
