Handling Categorical Data in Python
Last Updated :
02 Aug, 2025
Categorical data refers to features that contain a fixed set of possible values or categories that data points can belong to. Handling categorical data correctly is important because improper handling can lead to inaccurate analysis and poor model performance. In this article, we will see how to handle categorical data and its related concepts.
Why Do We Need to Handle Categorical Data?
Handling categorical data is important because:
- Algorithms Require Numerical Inputs: Most machine learning algorithms cannot directly process categorical data and need it to be converted into numerical formats.
- Inconsistent Categories: Categorical data contains inconsistencies like typos, case sensitivity or alternate spellings. We must standardize these to avoid treating them as separate categories.
- Remapping Categories: Some categories might need to be grouped for simplicity and relevance. For example, remapping rare categories into an "Other" group.
- Improves Model Performance: Proper encoding techniques like one-hot encoding or label encoding help models understand the relationships between categories, leading to better predictions.
- Handles Real-World Complexity: Categorical features appear across domains such as e-commerce, finance and healthcare, so handling them well makes analyses and models more robust.
Implementation for Handling Categorical Data
Here we will be using a Demographics dataset which contains some incorrect, invalid or meaningless entries (bogus values), for example due to human error while filling the survey form. You can download the dataset from here.
Step 1: Importing necessary Libraries
We will be using the NumPy, Pandas, Matplotlib, Seaborn and Scikit-learn libraries for the implementation.
Python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
Step 2: Loading the Dataset
We load the dataset into a Pandas DataFrame for manipulation.
Python
file_path = '/content/demographics.csv'
main_data = pd.read_csv(file_path)
print(main_data.head())
Output:
First five rows of the dataset
Step 3: Identifying and Removing Bogus Blood Types
First, we create a DataFrame of all valid blood types to check for bogus values in the dataset:
Python
valid_blood_type_list = ['A+', 'A-', 'B+', 'B-', 'AB+', 'AB-', 'O+', 'O-']
blood_type_categories = pd.DataFrame({'blood_type': valid_blood_type_list})
print(blood_type_categories)
Output:
  blood_type
0         A+
1         A-
2         B+
3         B-
4        AB+
5        AB-
6         O+
7         O-
Let's find the bogus blood types by comparing the dataset values to this valid list:
Python
unique_blood_types_main = set(main_data['blood_type'])
valid_blood_types_set = set(blood_type_categories['blood_type'])
bogus_blood_types = unique_blood_types_main.difference(valid_blood_types_set)
bogus_blood_types
Output:
{'C+', 'D-'}
Once the bogus values are found, the corresponding rows can be dropped from the dataset.
Python
bogus_records_index = main_data['blood_type'].isin(bogus_blood_types)
without_bogus_records = main_data[~bogus_records_index].copy()
without_bogus_records['blood_type'].unique()
Output:
array(['A+', 'B+', 'A-', 'AB-', 'AB+', 'B-', 'O-', 'O+'], dtype=object)
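As an alternative to the set-difference approach, pandas can enforce the valid category list directly: converting the column with `pd.Categorical` turns any value outside the list into NaN, which can then be dropped. A minimal, self-contained sketch with a small made-up sample:

```python
import pandas as pd

valid_blood_type_list = ['A+', 'A-', 'B+', 'B-', 'AB+', 'AB-', 'O+', 'O-']

# Small made-up sample containing one bogus value ('C+')
sample = pd.DataFrame({'blood_type': ['A+', 'C+', 'O-', 'B+']})

# Values not in the valid category list become NaN
sample['blood_type'] = pd.Categorical(sample['blood_type'],
                                      categories=valid_blood_type_list)

# Drop rows whose blood type was invalid
cleaned = sample.dropna(subset=['blood_type']).reset_index(drop=True)
print(cleaned['blood_type'].tolist())  # ['A+', 'O-', 'B+']
```

This has the same effect as the explicit `isin` filter above, with the bonus that the column is now a proper categorical dtype.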
Step 4: Handling Inconsistent Marriage Status Categories
Checking the unique values in the marriage_status column:
Python
main_data['marriage_status'].unique()
Output:
array(['married', 'MARRIED', ' married', 'unmarried ', 'divorced', 'unmarried', 'UNMARRIED', 'separated'], dtype=object)
Standardizing the categories by converting all text to lowercase.
Python
inconsistent_data = main_data.copy()
inconsistent_data['marriage_status'] = inconsistent_data['marriage_status'].str.lower()
inconsistent_data['marriage_status'].unique()
Output:
array(['married', ' married', 'unmarried ', 'divorced', 'unmarried', 'separated'], dtype=object)
Now we will standardize the categories by stripping extra spaces:
Python
inconsistent_data['marriage_status'] = inconsistent_data['marriage_status'].str.strip()
inconsistent_data['marriage_status'].unique()
Output:
array(['married', 'unmarried', 'divorced', 'separated'], dtype=object)
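Lowercasing and stripping handle case and whitespace issues, but genuine typos or alternate spellings need an explicit remapping. A small sketch using a hypothetical correction dictionary (the misspellings below are made up for illustration):

```python
import pandas as pd

statuses = pd.Series(['married', 'marride', 'un-married', 'divorced'])

# Hypothetical corrections for typos and alternate spellings
corrections = {'marride': 'married', 'un-married': 'unmarried'}

standardized = statuses.replace(corrections)
print(standardized.unique())  # ['married' 'unmarried' 'divorced']
```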
Step 5: Grouping Income into Meaningful Bins
Numerical data like age or income can be mapped to different groups. Let us check income range to define bin intervals:
Python
print(f"Max income - {main_data['income'].max()}, Min income - {main_data['income'].min()}")
Output:
Max income - 190000, Min income - 40000
Now, let us create the ranges and labels for the income feature using Pandas' cut method:
Python
income_bins = [40000, 75000, 100000, 125000, 150000, np.inf]
income_labels = ['40k-75k', '75k-100k', '100k-125k', '125k-150k', '150k+']
remapping_data = main_data.copy()
remapping_data['income_groups'] = pd.cut(
    remapping_data['income'],
    bins=income_bins,
    labels=income_labels,
    include_lowest=True  # keep the minimum income (40000) in the first bin
)
remapping_data.head()
Output:
First five rows of the dataset.
Step 6: Visualizing Income Group Distribution
Now let's visualize the distribution of income groups:
Python
remapping_data['income_groups'].value_counts().sort_index().plot.bar()
plt.title('Income Group Distribution')
plt.xlabel('Income Groups')
plt.ylabel('Count')
plt.tight_layout()
plt.show()
Output:
Visualize the distribution
Step 7: Cleaning Phone Number Data
Simulating phone numbers with inconsistent formats and cleaning them:
Python
import random
phone_numbers = []
for i in range(100):
    number = random.randint(100000000, 9999999999)  # length can be 9 or 10 digits
    if i % 2 == 0:
        phone_numbers.append('+91 ' + str(number))  # add +91 prefix for some
    else:
        phone_numbers.append(str(number))

phone_numbers_data = pd.DataFrame({
    'phone_numbers': phone_numbers
})
phone_numbers_data.head()
Output:
Phone numbers created
Depending on the use case, the country code could be dropped, or added to numbers missing it. Similarly, phone numbers with fewer than 10 digits should be discarded.
Python
phone_numbers_data['phone_numbers'] = phone_numbers_data['phone_numbers'].str.replace(r'\+91 ', '', regex=True)
num_digits = phone_numbers_data['phone_numbers'].str.len()
invalid_numbers_index = phone_numbers_data[num_digits < 10].index
phone_numbers_data.drop(invalid_numbers_index, inplace=True)
phone_numbers_data.dropna(inplace=True)
phone_numbers_data.reset_index(drop=True, inplace=True)
phone_numbers_data.head()
Output:
After invalid phone numbers are discarded
Finally, we can verify that the data is clean:
Python
assert not phone_numbers_data['phone_numbers'].str.contains(r'\+91 ').any(), "Found phone numbers with '+91 ' prefix"
assert (phone_numbers_data['phone_numbers'].str.len() == 10).all(), "Some phone numbers do not have 10 digits"
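We dropped the country code above; depending on the use case we might instead standardize by adding '+91' to numbers that lack it. A hedged sketch of that alternative (zero-padding 9-digit numbers to 10 digits is an assumption, not a rule from the original dataset):

```python
import pandas as pd

numbers = pd.Series(['+91 9876543210', '876543210', '9123456789'])

# Remove any existing prefix so every entry is digits only
digits = numbers.str.replace(r'\+91 ', '', regex=True)

# Assumption: left-pad 9-digit numbers with '0' to reach 10 digits
digits = digits.str.pad(width=10, side='left', fillchar='0')

# Re-attach a uniform country code
standardized = '+91 ' + digits
print(standardized.tolist())
# ['+91 9876543210', '+91 0876543210', '+91 9123456789']
```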
Step 8: Visualizing Categorical Data
Various plots can be used to visualize categorical data and gain more insight. Let us visualize the number of people belonging to each blood type.
Python
import seaborn as sns
sns.countplot(x='blood_type', data=without_bogus_records)
plt.title('Count of Blood Types')
plt.xlabel('Blood Type')
plt.ylabel('Count')
plt.tight_layout()
plt.show()
Output:
Visualizing Categorical Data
Now we can see the relationship between income and marital status using a boxplot.
Python
sns.boxplot(x='marriage_status', y='income', data=inconsistent_data)
plt.title('Income Distribution by Marriage Status')
plt.xlabel('Marriage Status')
plt.ylabel('Income')
plt.tight_layout()
plt.show()
Output:
Boxplot of income distribution by marriage status
Step 9: Encoding Categorical Data
Certain learning algorithms like regression and neural networks require their input to be numbers. Hence categorical data must be converted to numbers to use these algorithms. Let us see some encoding methods.
1. Label Encoding
With label encoding we can number the categories from 0 to num_categories - 1. Let us apply label encoding on the blood type feature.
Python
le = LabelEncoder()
without_bogus_records['blood_type_encoded'] = le.fit_transform(without_bogus_records['blood_type'])
without_bogus_records[['blood_type', 'blood_type_encoded']].drop_duplicates()
Output:
Label Encoding
2. One-hot Encoding in Python
One-hot encoding addresses several limitations of label encoding:
- Creates a false order: It gives numbers like 0, 1, 2 to categories which may make models think one category is bigger or better than the other.
- Misleads models: Algorithms like linear regression or decision trees might assume there's a ranking which can reduce accuracy.
- Problem with distance-based models: In models like KNN or K-Means, the numeric labels can wrongly influence distance calculations.
- Bias in training: Some models may give more importance to higher label values, even if all categories are equal.
- Not suitable for nominal data: Label encoding is not a good choice when categories have no natural order, like colors or city names.
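The false-order problem is easy to demonstrate on a nominal feature. Below, label encoding a hypothetical color column assigns arbitrary integers (which a model could misread as a ranking), while one-hot encoding produces independent 0/1 columns:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

colors = pd.DataFrame({'color': ['red', 'green', 'blue', 'green']})

# Label encoding: alphabetical integers imply blue < green < red
le = LabelEncoder()
print(le.fit_transform(colors['color']))  # [2 1 0 1]

# One-hot encoding: one independent 0/1 column per category
print(pd.get_dummies(colors['color']).columns.tolist())  # ['blue', 'green', 'red']
```

Applying the same idea to our marriage status column: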
Python
inconsistent_data = pd.get_dummies(inconsistent_data, columns=['marriage_status'])
inconsistent_data.head()
Output:
One-hot Encoding
3. Ordinal Encoding in Python
Categorical data can be ordinal, where the order of categories matters. For such features we want to preserve that order after encoding. We will perform ordinal encoding on the income groups, preserving the order 40k-75k < 75k-100k < 100k-125k < 125k-150k < 150k+.
Python
custom_map = {
    '40k-75k': 1,
    '75k-100k': 2,
    '100k-125k': 3,
    '125k-150k': 4,
    '150k+': 5
}
remapping_data['income_groups_encoded'] = remapping_data['income_groups'].map(custom_map)
remapping_data[['income', 'income_groups', 'income_groups_encoded']].head()
Output:
Ordinal Encoding
With these techniques we can prepare categorical data for meaningful analysis and effective machine learning models.
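As a library-based alternative to the hand-written map, scikit-learn's OrdinalEncoder accepts an explicit category order. A minimal sketch with sample values for illustration (note it numbers categories from 0 rather than 1):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

order = ['40k-75k', '75k-100k', '100k-125k', '125k-150k', '150k+']
sample = pd.DataFrame({'income_groups': ['75k-100k', '40k-75k', '150k+']})

# categories fixes the ordering; encoded values follow it as 0..4
enc = OrdinalEncoder(categories=[order])
sample['encoded'] = enc.fit_transform(sample[['income_groups']]).ravel()
print(sample['encoded'].tolist())  # [1.0, 0.0, 4.0]
```

Using the encoder object also lets us apply the same mapping consistently to new data via `enc.transform`.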