
Categorical Data Encoding Techniques in Machine Learning

Last Updated : 30 Jul, 2025

Categorical data represents variables that fall into distinct categories such as labels, names or types. Machine learning algorithms typically require numerical input, making categorical data encoding a crucial preprocessing step. Proper encoding techniques help models interpret categorical variables effectively, improve predictive accuracy and minimize bias.

Types of Categorical Data

1. Nominal Data

Nominal data consists of categories without any inherent order or ranking. These are simple labels used to classify data.

  • Example: 'Red', 'Blue', 'Green' (Car Color).
  • Encoding Options: One-Hot Encoding or Label Encoding, depending on the model's needs.

2. Ordinal Data

Ordinal data includes categories with a defined order or ranking, where the relationship between values is important.

  • Example: 'Low', 'Medium', 'High' (Car Engine Power).
  • Encoding Options: Ordinal Encoding.

Using the right encoding techniques, we can effectively transform categorical data for machine learning models, which improves their performance and predictive capabilities.

Why Encode Categorical Data?

Before diving into the encoding techniques, it's important to understand why encoding is necessary:

  1. Machine Learning Algorithms: Most machine learning algorithms such as linear regression, support vector machines and neural networks require numerical input. Categorical data needs to be converted into a numerical format to be used effectively.
  2. Model Performance: Proper encoding can significantly impact the performance of a machine learning model. Incorrect or suboptimal encoding can lead to poor model performance and inaccurate predictions.
  3. Data Preprocessing: Encoding is a crucial step in the data preprocessing pipeline, ensuring that the data is in a suitable format for training and evaluation.

Techniques to perform Categorical Data Encoding

1. Label Encoding

Label Encoding assigns a unique integer to each category, sorting the class labels alphabetically by default. Because the assigned integers imply an ordering that may not actually exist in the data, it is safest for ordinal data or for tree-based models; with nominal data, distance- or weight-based models can misinterpret the artificial order.

  • Pros: Simple and memory-efficient.
  • Cons: Imposes an arbitrary implicit order, which non-tree models may misinterpret when applied to nominal data.

Let's look at the following example:

Python
from sklearn.preprocessing import LabelEncoder

data = ['Red', 'Green', 'Blue', 'Red']

le = LabelEncoder()
encoded_data = le.fit_transform(data)
print(f"Encoded Data: {encoded_data}")

Output:

Encoded Data: [2 1 0 2]

Here, 'Blue' becomes 0, 'Green' becomes 1 and 'Red' becomes 2, because LabelEncoder assigns integers in lexicographical (alphabetical) order of the class labels.
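As a quick check of the mapping, `LabelEncoder` exposes its sorted classes via `classes_` and can decode integers back to labels with `inverse_transform`; a minimal sketch:

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
encoded = le.fit_transform(['Red', 'Green', 'Blue', 'Red'])

# Classes are stored in sorted (alphabetical) order: Blue < Green < Red
print(list(le.classes_))                     # ['Blue', 'Green', 'Red']

# inverse_transform recovers the original string labels from the integers
print(list(le.inverse_transform(encoded)))   # ['Red', 'Green', 'Blue', 'Red']
```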

2. One-Hot Encoding


One-Hot Encoding converts categorical data into a binary format, where each category is represented by a new column. A value of 1 indicates the presence of that category, while 0 indicates its absence. This technique is ideal for nominal data (categories without an order), preventing the model from assuming any relationship between the categories.

  • Pros: Does not assume order; widely supported.
  • Cons: Can cause high dimensionality and sparse data when a feature has many categories.

Let's look at the following example:

Python
import pandas as pd

data = ['Red', 'Blue', 'Green', 'Red']

df = pd.DataFrame(data, columns=['Color'])
# Cast the dummy columns to 0/1 integers (recent pandas returns booleans)
one_hot_encoded = pd.get_dummies(df['Color']).astype(int)

print(one_hot_encoded)

Output:

   Blue  Green  Red
0     0      0    1
1     1      0    0
2     0      1    0
3     0      0    1

Each unique category ('Red', 'Blue', 'Green') is transformed into a separate binary column, with 1 representing the presence of the category and 0 its absence.

3. Ordinal Encoding

Ordinal Encoding is used for ordinal data, where categories have a natural order. It converts categorical values into numeric values, preserving the inherent order.

  • Pros: Maintains order; reduces dimensionality.
  • Cons: Not suitable for nominal categories.

Let's consider the following example:

Python
from sklearn.preprocessing import OrdinalEncoder
data = [['Low'], ['Medium'], ['High'], ['Medium'], ['Low']]

encoder = OrdinalEncoder(categories=[['Low', 'Medium', 'High']])
encoded_data = encoder.fit_transform(data)

print(f"Encoded Ordinal Data: {encoded_data}")

Output:

Encoded Ordinal Data: [[0.]
 [1.]
 [2.]
 [1.]
 [0.]]

In this case, 'Low' is encoded as 0, 'Medium' as 1 and 'High' as 2, preserving the natural order of the categories.
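The same ordinal mapping can also be written as an explicit dictionary in pandas, which keeps the intended ranking visible in the code; a minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({'Power': ['Low', 'Medium', 'High', 'Medium', 'Low']})

# An explicit mapping documents the ranking and avoids relying on
# alphabetical order, which would be wrong here (High < Low < Medium)
order = {'Low': 0, 'Medium': 1, 'High': 2}
df['Power_encoded'] = df['Power'].map(order)

print(df['Power_encoded'].tolist())  # [0, 1, 2, 1, 0]
```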

4. Target Encoding

Target Encoding (also known as Mean Encoding) is a technique where each category in a feature is replaced by the mean of the target variable for that category. This technique is especially useful when there is a relationship between the categorical feature and the target variable.

  • Pros: Captures relationship to target variable.
  • Cons: Prone to overfitting and target leakage; requires smoothing and leakage-prevention measures (e.g., out-of-fold or holdout encoding).
Python
import pandas as pd
import category_encoders as ce

df = pd.DataFrame(
    {'City': ['London', 'Paris', 'London', 'Berlin'], 'Target': [1, 0, 1, 0]}
)

encoder = ce.TargetEncoder(cols=['City'])
df_tgt = encoder.fit_transform(df['City'], df['Target'])

print(f"Encoded Target Data:\n{df_tgt}")

Each city is replaced by a smoothed mean of the target variable: category_encoders blends the per-city mean with the global mean (0.5 here), so 'London' (raw target mean 1.0 across its two rows) is encoded above 0.5, while 'Paris' and 'Berlin' (raw mean 0.0) are encoded below it. The exact values depend on the encoder's smoothing parameters.
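A common way to prevent target leakage is out-of-fold target encoding: each row is encoded using target means computed on the other folds only. A sketch using plain pandas and KFold (the fallback to the global mean for categories unseen in a fold is a design choice, not a fixed rule):

```python
import pandas as pd
from sklearn.model_selection import KFold

df = pd.DataFrame({
    'City': ['London', 'Paris', 'London', 'Berlin', 'Paris', 'London'],
    'Target': [1, 0, 1, 0, 1, 0],
})

# Each row is encoded with target means computed on the *other* folds,
# so no row ever sees its own target value
encoded = pd.Series(index=df.index, dtype=float)
global_mean = df['Target'].mean()

kf = KFold(n_splits=3, shuffle=True, random_state=0)
for train_idx, val_idx in kf.split(df):
    fold_means = df.iloc[train_idx].groupby('City')['Target'].mean()
    # Cities absent from the training folds fall back to the global mean
    encoded.iloc[val_idx] = (
        df['City'].iloc[val_idx].map(fold_means).fillna(global_mean).to_numpy()
    )

df['City_encoded'] = encoded
print(df)
```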

5. Binary Encoding


Binary encoding is a more compact version of one-hot encoding. Each category is assigned a unique binary code. The binary code is then split into multiple columns. This method is suitable for datasets with high cardinality (many unique categories), as it results in fewer columns compared to one-hot encoding.

  • Pros: Reduces dimensionality; more memory-efficient than one-hot encoding.
  • Cons: Slightly more complex; requires careful handling of missing values.
Python
import pandas as pd
import category_encoders as ce
data = ['Red', 'Green', 'Blue', 'Red']
encoder = ce.BinaryEncoder(cols=['Color'])
encoded_data = encoder.fit_transform(pd.DataFrame(data, columns=['Color']))
print(encoded_data)

Output:

   Color_0  Color_1
0        0        1
1        1        0
2        1        1
3        0        1

Here, each category is first given an ordinal code in order of appearance (starting at 1) and that code is written in binary: 'Red' gets 1 → '01', 'Green' gets 2 → '10' and 'Blue' gets 3 → '11'. Each binary digit is placed in a separate column (Color_0 and Color_1).
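The dimensionality saving can be made concrete: assuming codes start at 1, n categories need only ceil(log2(n + 1)) binary columns, versus n columns for one-hot encoding:

```python
import math

def binary_cols(n):
    # Codes run from 1 to n, so ceil(log2(n + 1)) bits suffice
    return math.ceil(math.log2(n + 1))

for n in [3, 10, 100, 1000]:
    print(f"{n} categories: one-hot = {n} cols, binary = {binary_cols(n)} cols")
```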

6. Frequency Encoding

Frequency Encoding assigns each category a value based on its frequency in the dataset. This technique can be useful for handling high-cardinality categorical features (features with many unique categories).

  • Pros: Low computational and storage requirements.
  • Cons: Categories with identical frequencies become indistinguishable; computing frequencies on the full dataset (including validation data) can leak information.
Python
import pandas as pd
data = ['Red', 'Green', 'Blue', 'Red', 'Red']
series_data = pd.Series(data)
frequency_encoding = series_data.value_counts()

# Convert numpy integers to plain Python ints for clean printing
encoded_data = [int(frequency_encoding[x]) for x in data]
print("Encoded Data:", encoded_data)

Output:

Encoded Data: [3, 1, 1, 3, 3]

Here, 'Red' appears 3 times, so it is encoded as 3, while 'Green' and 'Blue' appear once, so they are encoded as 1.
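A variant worth knowing is normalized frequency encoding, which uses relative frequencies instead of raw counts so the encoded values stay in [0, 1]; a minimal sketch:

```python
import pandas as pd

data = pd.Series(['Red', 'Green', 'Blue', 'Red', 'Red'])

# value_counts(normalize=True) returns each category's share of the data,
# which keeps encoded values on a bounded [0, 1] scale
freq = data.value_counts(normalize=True)
encoded = data.map(freq)

print(encoded.tolist())  # [0.6, 0.2, 0.2, 0.6, 0.6]
```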

Differences between Various Techniques

Technique          | Suitable For                | Dimensionality | Overfitting Risk | Interpretability
-------------------|-----------------------------|----------------|------------------|-----------------
One-Hot Encoding   | Nominal                     | High           | Low              | High
Label Encoding     | Ordinal (sometimes Nominal) | Low            | Medium           | Medium
Ordinal Encoding   | Ordinal                     | Low            | Medium           | High
Binary Encoding    | High-cardinality features   | Medium         | Medium           | Medium
Frequency Encoding | High-cardinality            | Low            | High             | Medium
Target Encoding    | High-cardinality            | Low            | High             | Low-Medium

Categorical data encoding is essential for translating real-world information into a format machine learning algorithms can understand. By selecting appropriate encoding techniques we can ensure our models interpret categories correctly, enabling better predictive accuracy and robust performance.

