Categorical Data Encoding Techniques in Machine Learning
Last Updated :
30 Jul, 2025
Categorical data represents variables that fall into distinct categories such as labels, names or types. Machine learning algorithms typically require numerical input, making categorical data encoding a crucial preprocessing step. Proper encoding helps models interpret categorical variables effectively, improves predictive accuracy and minimizes bias.
Types of Categorical Data
1. Nominal Data
Nominal data consists of categories without any inherent order or ranking. These are simple labels used to classify data.
- Example: 'Red', 'Blue', 'Green' (Car Color).
- Encoding Options: One-Hot Encoding or Label Encoding, depending on the model's needs.
2. Ordinal Data
Ordinal data includes categories with a defined order or ranking, where the relationship between values is important.
- Example: 'Low', 'Medium', 'High' (Car Engine Power).
- Encoding Options: Ordinal Encoding.
Using the right encoding techniques, we can effectively transform categorical data for machine learning models which improves their performance and predictive capabilities.
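As a quick sketch of the distinction, the nominal column can be one-hot encoded while the ordinal column is mapped to integers that preserve its ranking. The column names ('color', 'power') and category values below are illustrative, not from a real dataset:

```python
import pandas as pd

df = pd.DataFrame({
    'color': ['Red', 'Blue', 'Green'],   # nominal: no inherent order
    'power': ['Low', 'High', 'Medium'],  # ordinal: Low < Medium < High
})

# Nominal -> one-hot: one binary column per category
one_hot = pd.get_dummies(df['color'], prefix='color')

# Ordinal -> integer codes that preserve the ranking
order = ['Low', 'Medium', 'High']
df['power_encoded'] = df['power'].map({cat: i for i, cat in enumerate(order)})

print(one_hot.columns.tolist())      # ['color_Blue', 'color_Green', 'color_Red']
print(df['power_encoded'].tolist())  # [0, 2, 1]
```

Note that the one-hot columns carry no ordering, while the ordinal codes deliberately do.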
Why Encode Categorical Data?
Before diving into the encoding techniques, it's important to understand why encoding is necessary:
- Machine Learning Algorithms: Most machine learning algorithms such as linear regression, support vector machines and neural networks require numerical input. Categorical data needs to be converted into a numerical format to be used effectively.
- Model Performance: Proper encoding can significantly impact the performance of a machine learning model. Incorrect or suboptimal encoding can lead to poor model performance and inaccurate predictions.
- Data Preprocessing: Encoding is a crucial step in the data preprocessing pipeline, ensuring that the data is in a suitable format for training and evaluation.
Techniques
1. Label Encoding
Label Encoding assigns a unique integer to each category. Because the integers imply an ordering that may not exist in the data, it is best suited to ordinal data or to tree-based models; applied to nominal data, other models may misread the codes as magnitudes.
- Pros: Simple and memory-efficient.
- Cons: Introduces implicit order which may be misinterpreted by non-tree models when used with nominal data.
Let's look at the following example:
Python
from sklearn.preprocessing import LabelEncoder
data = ['Red', 'Green', 'Blue', 'Red']
le = LabelEncoder()
encoded_data = le.fit_transform(data)
print(f"Encoded Data: {encoded_data}")
Output:
Encoded Data: [2 1 0 2]
Here, 'Blue' becomes 0, 'Green' becomes 1 and 'Red' becomes 2, based on the lexicographical order of the category names.
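A fitted LabelEncoder exposes its mapping through the classes_ attribute (categories in the order of their assigned integers), and inverse_transform decodes the integers back. A small sketch:

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
codes = le.fit_transform(['Red', 'Green', 'Blue', 'Red'])

# classes_ lists the categories in assigned-integer order
print(list(le.classes_))                  # ['Blue', 'Green', 'Red']
# inverse_transform recovers the original labels from the codes
print(list(le.inverse_transform(codes)))  # ['Red', 'Green', 'Blue', 'Red']
```

Inspecting classes_ is a quick way to verify which integer each category received.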
2. One-Hot Encoding
One-Hot Encoding converts categorical data into a binary format, where each category is represented by a new column. A value of 1 indicates the presence of that category, while 0 indicates its absence. This technique is ideal for nominal data (categories without an order), preventing the model from assuming any relationship between the categories.
- Pros: Does not assume order; widely supported.
- Cons: Can cause high dimensionality and sparse data when feature has many categories.
Let's look at the following example:
Python
import pandas as pd
data = ['Red', 'Blue', 'Green', 'Red']
df = pd.DataFrame(data, columns=['Color'])
one_hot_encoded = pd.get_dummies(df['Color'])
print(one_hot_encoded)
Output:
Each unique category ('Blue', 'Green', 'Red') is transformed into a separate binary column, with 1 representing the presence of the category and 0 its absence.
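One practical refinement: for linear models, the full set of dummy columns is redundant (any one column is implied by the others), which can cause collinearity. pandas supports dropping the first column via the drop_first parameter of get_dummies; a brief sketch:

```python
import pandas as pd

df = pd.DataFrame({'Color': ['Red', 'Blue', 'Green', 'Red']})

# drop_first=True drops the first category alphabetically ('Blue'),
# which is then represented by all-zeros in the remaining columns
encoded = pd.get_dummies(df['Color'], drop_first=True, dtype=int)

print(encoded.columns.tolist())  # ['Green', 'Red']
print(encoded['Red'].tolist())   # [1, 0, 0, 1]
```

Tree-based models are unaffected by the redundancy, so they typically keep all columns.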
3. Ordinal Encoding
Ordinal Encoding is used for ordinal data, where categories have a natural order. It converts categorical values into numeric values, preserving the inherent order.
- Pros: Maintains order; reduces dimensionality.
- Cons: Not suitable for nominal categories.
Let's consider the following example:
Python
from sklearn.preprocessing import OrdinalEncoder
data = [['Low'], ['Medium'], ['High'], ['Medium'], ['Low']]
encoder = OrdinalEncoder(categories=[['Low', 'Medium', 'High']])
encoded_data = encoder.fit_transform(data)
print(f"Encoded Ordinal Data: {encoded_data}")
Output:
In this case, 'Low' is encoded as 0, 'Medium' as 1 and 'High' as 2, preserving the natural order of the categories.
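In production, new data may contain categories that were never seen at fit time. OrdinalEncoder can map these to a sentinel value instead of raising an error (the handle_unknown option, available in scikit-learn 0.24 and later); a sketch:

```python
from sklearn.preprocessing import OrdinalEncoder

encoder = OrdinalEncoder(
    categories=[['Low', 'Medium', 'High']],
    handle_unknown='use_encoded_value',  # map unseen labels instead of raising
    unknown_value=-1,
)
encoder.fit([['Low'], ['Medium'], ['High']])

# 'Extreme' was never seen during fit, so it becomes -1
print(encoder.transform([['High'], ['Extreme']]))
```

The sentinel (-1 here) should be chosen so downstream models cannot confuse it with a real rank.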
4. Target Encoding
Target Encoding (also known as Mean Encoding) is a technique where each category in a feature is replaced by the mean of the target variable for that category. This technique is especially useful when there is a relationship between the categorical feature and the target variable.
- Pros: Captures relationship to target variable.
- Cons: Risk of overfitting; must apply smoothing/statistical techniques and ensure leakage prevention (e.g., CV or holdout techniques).
Python
import pandas as pd
import category_encoders as ce
df = pd.DataFrame(
{'City': ['London', 'Paris', 'London', 'Berlin'], 'Target': [1, 0, 1, 0]}
)
encoder = ce.TargetEncoder(cols=['City'])
df_tgt = encoder.fit_transform(df['City'], df['Target'])
print(f"Encoded Target Data:\n{df_tgt}")
Output:
In this case, each city is encoded using the mean of the target variable for that city, shrunk toward the global mean by the encoder's built-in smoothing. For instance, 'London' appears twice, both times with a target of 1, so its encoded value lies close to 1.
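The leakage prevention mentioned above can be sketched with out-of-fold encoding: each row's encoding is computed from target means in the *other* fold, so a row never sees its own target. This is a minimal sketch using only pandas and scikit-learn's KFold (the 'City_te' column name is our own choice):

```python
import pandas as pd
from sklearn.model_selection import KFold

df = pd.DataFrame({'City': ['London', 'Paris', 'London', 'Berlin'],
                   'Target': [1, 0, 1, 0]})

global_mean = df['Target'].mean()
df['City_te'] = global_mean  # fallback for categories unseen in a fold

for train_idx, val_idx in KFold(n_splits=2, shuffle=False).split(df):
    # means computed only on the training fold
    fold_means = df.iloc[train_idx].groupby('City')['Target'].mean()
    # applied to the validation fold; unseen cities get the global mean
    df.loc[df.index[val_idx], 'City_te'] = (
        df.iloc[val_idx]['City'].map(fold_means).fillna(global_mean).values
    )

print(df)
```

Real pipelines would add smoothing and more folds; the point is that no row's own target leaks into its encoding.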
5. Binary Encoding
Binary encoding is a more compact version of one-hot encoding. Each category is assigned an ordinal code, which is then written in binary and split across multiple columns. This method is suitable for datasets with high cardinality (many unique categories), as it results in fewer columns than one-hot encoding.
- Pros: Reduces dimensionality; more memory-efficient than one-hot encoding.
- Cons: Slightly more complex; requires careful handling of missing values.
Python
import pandas as pd
import category_encoders as ce
data = ['Red', 'Green', 'Blue', 'Red']
encoder = ce.BinaryEncoder(cols=['Color'])
encoded_data = encoder.fit_transform(pd.DataFrame(data, columns=['Color']))
print(encoded_data)
Output:
Here, each category is first given an ordinal code in order of first appearance ('Red' = 1, 'Green' = 2, 'Blue' = 3), and that code is written in binary across the columns Color_0 and Color_1: 'Red' becomes 01, 'Green' becomes 10 and 'Blue' becomes 11.
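The dimensionality saving can be quantified: if the codes start at 1 (as category_encoders assigns them), N categories need roughly ceil(log2(N + 1)) binary columns versus N one-hot columns. A quick back-of-the-envelope illustration, no encoder required:

```python
import math

for n in (4, 16, 100, 1000):
    one_hot_cols = n                           # one column per category
    binary_cols = math.ceil(math.log2(n + 1))  # bits needed for codes 1..n
    print(f"{n} categories: one-hot={one_hot_cols} cols, binary={binary_cols} cols")
```

For 1000 categories this is 10 columns instead of 1000, which is where binary encoding pays off.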
6. Frequency Encoding
Frequency Encoding assigns each category a value based on its frequency in the dataset. This technique can be useful for handling high-cardinality categorical features (features with many unique categories).
- Pros: Low computational and storage requirements.
- Cons: Can introduce data leakage if not handled properly.
Python
import pandas as pd
data = ['Red', 'Green', 'Blue', 'Red', 'Red']
series_data = pd.Series(data)
frequency_encoding = series_data.value_counts()
encoded_data = [int(frequency_encoding[x]) for x in data]
print("Encoded Data:", encoded_data)
Output:
Encoded Data: [3, 1, 1, 3, 3]
Here, 'Red' appears 3 times, so it is encoded as 3, while 'Green' and 'Blue' appear once, so they are encoded as 1.
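To avoid the leakage noted above, the frequencies should be computed on the training split only and then applied to unseen data, with a fallback for categories that never appeared in training. A minimal sketch (the train/test values are illustrative):

```python
import pandas as pd

train = pd.Series(['Red', 'Red', 'Blue', 'Green'])
test = pd.Series(['Red', 'Yellow'])  # 'Yellow' never seen in training

# Counts come from the training data only
freq = train.value_counts()

# Unseen categories map to NaN, which we fall back to 0
test_encoded = test.map(freq).fillna(0).astype(int)
print(test_encoded.tolist())  # [2, 0]
```

Using relative frequencies (dividing by len(train)) works the same way and keeps values comparable across dataset sizes.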
Differences between Various Techniques
| Technique | Suitable For | Dimensionality | Overfitting Risk | Interpretability |
|---|---|---|---|---|
| One-Hot Encoding | Nominal | High | Low | High |
| Label Encoding | Ordinal (sometimes Nominal) | Low | Medium | Medium |
| Ordinal Encoding | Ordinal | Low | Medium | High |
| Binary Encoding | High-cardinality features | Medium | Medium | Medium |
| Frequency Encoding | High-cardinality features | Low | High | Medium |
| Target Encoding | High-cardinality features | Low | High | Low-Medium |
Categorical data encoding is essential for translating real-world information into a format machine learning algorithms can understand. By selecting appropriate encoding techniques we can ensure our models interpret categories correctly, enabling better predictive accuracy and robust performance.