Categorical Data Encoding Techniques in Machine Learning
Categorical data represents variables that fall into distinct categories such as labels, names or types. Machine learning algorithms typically require numerical input, making categorical data encoding a crucial preprocessing step. Proper encoding techniques help models interpret categorical variables effectively, improve predictive accuracy and minimize bias.
Types of Categorical Data
1. Nominal Data
Nominal data consists of categories without any inherent order or ranking. These are simple labels used to classify data.
- Example: 'Red', 'Blue', 'Green' (Car Color).
- Encoding Options: One-Hot Encoding or Label Encoding, depending on the model's needs.
2. Ordinal Data
Ordinal data includes categories with a defined order or ranking, where the relationship between values is important.
- Example: 'Low', 'Medium', 'High' (Car Engine Power).
- Encoding Options: Ordinal Encoding.
Using the right encoding techniques, we can transform categorical data into a form machine learning models can use effectively, improving their performance and predictive capabilities.
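As a quick illustration of the nominal/ordinal distinction, pandas can represent both kinds explicitly with its category dtype; this is only a minimal sketch with hypothetical column names:
Python
import pandas as pd

# Hypothetical car data: 'Color' is nominal, 'Power' is ordinal
df = pd.DataFrame({'Color': ['Red', 'Blue', 'Green'],
                   'Power': ['Low', 'High', 'Medium']})

# Nominal: categories carry no order
df['Color'] = df['Color'].astype('category')

# Ordinal: categories carry an explicit order (Low < Medium < High)
df['Power'] = pd.Categorical(df['Power'],
                             categories=['Low', 'Medium', 'High'],
                             ordered=True)

print(df['Power'].cat.codes.tolist())
Output:
[0, 2, 1]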
Why Encode Categorical Data?
Before diving into the encoding techniques, it's important to understand why encoding is necessary:
- Machine Learning Algorithms: Most machine learning algorithms such as linear regression, support vector machines and neural networks require numerical input. Categorical data needs to be converted into a numerical format to be used effectively.
- Model Performance: Proper encoding can significantly impact the performance of a machine learning model. Incorrect or suboptimal encoding can lead to poor model performance and inaccurate predictions.
- Data Preprocessing: Encoding is a crucial step in the data preprocessing pipeline, ensuring that the data is in a suitable format for training and evaluation.
Techniques
1. Label Encoding
Label Encoding assigns a unique integer to each category. The integers imply a ranking that nominal categories do not actually have, so it is best suited to target labels and tree-based models; for nominal features fed to other models it should be used with care.
- Pros: Simple and memory-efficient.
- Cons: Introduces an implicit order that non-tree models may misinterpret when applied to nominal data.
Let's look at the following example:
Python
from sklearn.preprocessing import LabelEncoder
data = ['Red', 'Green', 'Blue', 'Red']
le = LabelEncoder()
encoded_data = le.fit_transform(data)
print(f"Encoded Data: {encoded_data}")
Output:
Encoded Data: [2 1 0 2]
Here, 'Blue' becomes 0, 'Green' becomes 1 and 'Red' becomes 2: LabelEncoder assigns integers following the lexicographical order of the class names.
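Since the fitted encoder stores the mapping, the original labels can be recovered with inverse_transform (continuing the example above):
Python
decoded = le.inverse_transform(encoded_data)
print(list(decoded))
Output:
['Red', 'Green', 'Blue', 'Red']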
2. One-Hot Encoding
One-Hot Encoding converts categorical data into a binary format, where each category is represented by a new column. A value of 1 indicates the presence of that category, while 0 indicates its absence. This technique is ideal for nominal data (categories without an order), preventing the model from assuming any relationship between the categories.
- Pros: Does not assume order; widely supported.
- Cons: Can cause high dimensionality and sparse data when a feature has many categories.
Let's look at the following example:
Python
import pandas as pd
data = ['Red', 'Blue', 'Green', 'Red']
df = pd.DataFrame(data, columns=['Color'])
one_hot_encoded = pd.get_dummies(df['Color'])
print(one_hot_encoded)
Output:
    Blue  Green    Red
0  False  False   True
1   True  False  False
2  False   True  False
3  False  False   True
Each unique category ('Red', 'Blue', 'Green') is transformed into a separate binary column; a true value (1 in older pandas versions, which display 0/1 instead of booleans) marks the presence of the category and a false value (0) its absence.
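In a training pipeline, scikit-learn's OneHotEncoder is often preferred over pd.get_dummies because it remembers the categories seen during fitting and can ignore unseen ones at prediction time. A minimal sketch (the sparse_output argument assumes scikit-learn 1.2 or newer; older versions use sparse=False):
Python
from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
enc.fit([['Red'], ['Blue'], ['Green']])

# An unseen category ('Yellow') encodes as all zeros instead of raising an error
print(enc.transform([['Green'], ['Yellow']]))
Output:
[[0. 1. 0.]
 [0. 0. 0.]]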
3. Ordinal Encoding
Ordinal Encoding is used for ordinal data, where categories have a natural order. It converts categorical values into numeric values, preserving the inherent order.
- Pros: Maintains order; reduces dimensionality.
- Cons: Not suitable for nominal categories.
Let's consider the following example:
Python
from sklearn.preprocessing import OrdinalEncoder
data = [['Low'], ['Medium'], ['High'], ['Medium'], ['Low']]
encoder = OrdinalEncoder(categories=[['Low', 'Medium', 'High']])
encoded_data = encoder.fit_transform(data)
print(f"Encoded Ordinal Data: {encoded_data}")
Output:
Encoded Ordinal Data: [[0.]
 [1.]
 [2.]
 [1.]
 [0.]]
In this case, 'Low' is encoded as 0, 'Medium' as 1 and 'High' as 2, preserving the natural order of the categories.
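The same result can be produced with a plain dictionary and pandas' map, which keeps the assumed ranking explicit in the code; a minimal sketch:
Python
import pandas as pd

order = {'Low': 0, 'Medium': 1, 'High': 2}  # the assumed ranking
s = pd.Series(['Low', 'Medium', 'High', 'Medium', 'Low'])
print(s.map(order).tolist())
Output:
[0, 1, 2, 1, 0]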
4. Target Encoding
Target Encoding (also known as Mean Encoding) is a technique where each category in a feature is replaced by the mean of the target variable for that category. This technique is especially useful when there is a relationship between the categorical feature and the target variable.
- Pros: Captures relationship to target variable.
- Cons: Risk of overfitting; requires smoothing and leakage prevention (e.g., computing the category statistics with cross-validation or on a holdout split).
Python
import pandas as pd
import category_encoders as ce
df = pd.DataFrame(
{'City': ['London', 'Paris', 'London', 'Berlin'], 'Target': [1, 0, 1, 0]}
)
encoder = ce.TargetEncoder(cols=['City'])
df_tgt = encoder.fit_transform(df['City'], df['Target'])
print(f"Encoded Target Data:\n{df_tgt}")
Output:
In this case, each city is replaced by a smoothed mean of the target variable for that city. 'London' (both of its rows have target 1) is pulled toward 1, while 'Paris' and 'Berlin' (target 0) are pulled toward 0; the encoder's smoothing shrinks every value toward the global mean of 0.5, so rare categories stay closer to 0.5.
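To make the smoothing idea concrete, here is a hedged sketch that computes a smoothed target encoding by hand, shrinking each category mean toward the global mean. The smoothing weight m is an arbitrary illustrative choice, and in practice the statistics should be computed on training folds only to prevent leakage:
Python
import pandas as pd

df = pd.DataFrame({'City': ['London', 'Paris', 'London', 'Berlin'],
                   'Target': [1, 0, 1, 0]})

m = 2.0                             # smoothing strength (illustrative choice)
global_mean = df['Target'].mean()   # 0.5

stats = df.groupby('City')['Target'].agg(['mean', 'count'])
# Rare categories are shrunk harder toward the global mean
smoothed = (stats['count'] * stats['mean'] + m * global_mean) / (stats['count'] + m)

print(df['City'].map(smoothed).tolist())
Output:
[0.75, 0.3333333333333333, 0.75, 0.3333333333333333]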
5. Binary Encoding
Binary Encoding is a more compact version of one-hot encoding. Each category is assigned a unique binary code, and the code's digits are then split across multiple columns. This method is suitable for datasets with high cardinality (many unique categories), as it results in fewer columns compared to one-hot encoding.
- Pros: Reduces dimensionality; more memory-efficient than one-hot encoding.
- Cons: Slightly more complex; requires careful handling of missing values.
Python
import pandas as pd
import category_encoders as ce
data = ['Red', 'Green', 'Blue', 'Red']
encoder = ce.BinaryEncoder(cols=['Color'])
encoded_data = encoder.fit_transform(pd.DataFrame(data, columns=['Color']))
print(encoded_data)
Output:
   Color_0  Color_1
0        0        1
1        1        0
2        1        1
3        0        1
Here, each category is first given an ordinal code in order of first appearance (Red = 1, Green = 2, Blue = 3), and that code is written in binary: 'Red' becomes '01', 'Green' becomes '10' and 'Blue' becomes '11'. Each binary digit is placed in a separate column (Color_0 and Color_1).
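To see where the bits come from, the mapping can be reproduced by hand: categories receive ordinal codes in order of first appearance starting at 1 (matching category_encoders' default), and each code is then written out in binary. A minimal sketch:
Python
data = ['Red', 'Green', 'Blue', 'Red']

# Ordinal codes in order of first appearance, starting at 1
codes = {}
for cat in data:
    codes.setdefault(cat, len(codes) + 1)

width = max(codes.values()).bit_length()  # digits needed for the largest code
for cat, code in codes.items():
    print(cat, format(code, f'0{width}b'))
Output:
Red 01
Green 10
Blue 11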
6. Frequency Encoding
Frequency Encoding assigns each category a value based on its frequency in the dataset. This technique can be useful for handling high-cardinality categorical features (features with many unique categories).
- Pros: Low computational and storage requirements.
- Cons: Can introduce data leakage if frequencies are computed on the full dataset before splitting; categories with equal frequencies receive identical codes.
Python
import pandas as pd
data = ['Red', 'Green', 'Blue', 'Red', 'Red']
series_data = pd.Series(data)
frequency_encoding = series_data.value_counts()
encoded_data = series_data.map(frequency_encoding).tolist()
print("Encoded Data:", encoded_data)
Output:
Encoded Data: [3, 1, 1, 3, 3]
Here, 'Red' appears 3 times, so it is encoded as 3, while 'Green' and 'Blue' appear once, so they are encoded as 1.
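To avoid the leakage mentioned above, compute the frequencies on the training split only and then map them onto new data, with unseen categories falling back to a default value. A minimal sketch (the train/test split here is illustrative):
Python
import pandas as pd

train = pd.Series(['Red', 'Green', 'Blue', 'Red', 'Red'])
test = pd.Series(['Red', 'Yellow'])          # 'Yellow' never appeared in training

freq = train.value_counts()                  # learned from the training data only
encoded_test = test.map(freq).fillna(0)      # unseen categories default to 0
print(encoded_test.tolist())
Output:
[3.0, 0.0]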
Differences between Various Techniques
| Technique | Suitable For | Dimensionality | Overfitting Risk | Interpretability |
|---|---|---|---|---|
| One-Hot Encoding | Nominal | High | Low | High |
| Label Encoding | Ordinal (sometimes Nominal) | Low | Medium | Medium |
| Ordinal Encoding | Ordinal | Low | Medium | High |
| Binary Encoding | High-cardinality features | Medium | Medium | Medium |
| Frequency Encoding | High-cardinality features | Low | High | Medium |
| Target Encoding | High-cardinality features | Low | High | Low-Medium |
Categorical data encoding is essential for translating real-world information into a format machine learning algorithms can understand. By selecting appropriate encoding techniques, we can ensure our models interpret categories correctly, enabling better predictive accuracy and robust performance.