Handling of Categorical Data
• The performance of a machine learning model depends not only on
the model and its hyperparameters but also on how we process and
feed different types of variables to the model.
• Since most machine learning models only accept numerical variables,
preprocessing the categorical variables becomes a necessary step.
• We need to convert these categorical variables to numbers so that
the model can extract useful information from them.
• Data scientists are often said to spend 70 to 80% of their time
cleaning and preparing the data.
• Converting categorical data is an unavoidable activity. It not only
improves model quality but also supports better feature
engineering.
• Which categorical data encoding method should we use?
What is Categorical Data?
• Categorical variables are usually represented as ‘strings’ or
‘categories’ and take a finite number of distinct values. Here are a
few examples:
• The city where a person lives: Delhi, Mumbai, Ahmedabad, Bangalore,
etc.
• The department a person works in: Finance, Human resources, IT,
Production.
• The highest degree a person has: High school, Diploma, Bachelors,
Masters, PhD.
• The grades of a student: A+, A, B+, B, B-, etc.
Two types of categorical data
• Ordinal Data: The categories have an inherent order
• Nominal Data: The categories do not have an inherent order
• When encoding ordinal data, we should retain the order of the
categories. In the example above, the highest degree a person holds gives
vital information about their qualification, and the degree is an important
feature when deciding whether a person is suitable for a post.
• When encoding nominal data, we only need to record the presence or absence
of each category. No notion of order is involved. For example, for the city a
person lives in, it is important to retain where the person lives, but the
cities have no order or sequence: neither Delhi nor Bangalore ranks above
the other.
Types of encoding techniques
• Label Encoding or Ordinal Encoding
• One Hot Encoding
• Dummy Encoding
• Effect Encoding
• Hash Encoder
• Binary Encoding
• Base N Encoding
• Target Encoding
Label Encoding
• We use this categorical data encoding technique when the categorical
feature is ordinal. In this case, retaining the order is important. Hence
encoding should reflect the sequence.
• In label encoding, each label is converted into an integer value.
• As an example, consider a variable that contains the categories
representing the education qualification of a person.
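The idea can be sketched in plain Python as follows. This is a minimal illustration rather than a reference implementation: the ranking in DEGREE_ORDER is an assumption taken from the degree example earlier, and the names degree_to_code and encode_degrees are hypothetical.

```python
# Ordinal (label) encoding for an ordered categorical feature.
# Assumption: the ranking below follows the degree example, lowest first.
DEGREE_ORDER = ["High school", "Diploma", "Bachelors", "Masters", "PhD"]

# Map each category to an integer that preserves the order (1 = lowest).
degree_to_code = {
    degree: rank for rank, degree in enumerate(DEGREE_ORDER, start=1)
}


def encode_degrees(degrees):
    """Return the integer code of each degree; unknown labels raise KeyError."""
    return [degree_to_code[degree] for degree in degrees]


degrees = ["Masters", "High school", "PhD", "Bachelors", "Diploma", "Masters"]
encoded_degrees = encode_degrees(degrees)
print(encoded_degrees)  # [4, 1, 5, 3, 2, 4]
```

Because the codes follow the ranking, comparisons such as "Masters is higher than Diploma" carry over to the encoded values.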
One Hot Encoding
• We use this categorical data encoding technique when the features
are nominal (have no inherent order).
• In one hot encoding, for each level of a categorical feature, we create
a new variable.
• Each category is mapped to a binary variable containing either 0 or
1. Here, 0 represents the absence, and 1 represents the presence of
that category.
• These newly created binary features are known as dummy
variables. The number of dummy variables equals the number of levels
present in the categorical variable.
Example
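A small sketch of one hot encoding in plain Python, using the city example from earlier. The function name one_hot_encode and the alphabetical column order are assumptions made for illustration only.

```python
def one_hot_encode(values):
    """Return (column names, rows) with one 0/1 column per distinct category."""
    categories = sorted(set(values))  # fixed, reproducible column order
    rows = [
        [1 if value == category else 0 for category in categories]
        for value in values
    ]
    return categories, rows


cities = ["Delhi", "Mumbai", "Ahmedabad", "Bangalore", "Delhi"]
columns, one_hot_rows = one_hot_encode(cities)

print(columns)  # ['Ahmedabad', 'Bangalore', 'Delhi', 'Mumbai']
for city, row in zip(cities, one_hot_rows):
    print(city, row)  # e.g. Delhi [0, 0, 1, 0]
```

Each row contains exactly one 1, in the column of the category it belongs to, so four cities produce four dummy variables.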
Dummy Encoding
• The dummy coding scheme is similar to one-hot encoding. This categorical data
encoding method transforms the categorical variable into a set of binary variables
(also known as dummy variables). One-hot encoding uses N binary variables for
the N categories in a variable. Dummy encoding is a small improvement over
one-hot encoding: it uses N-1 features to represent N labels/categories, because
the remaining category is implied when all N-1 features are 0.
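The difference from one-hot encoding can be sketched as follows, using the department example from earlier. Which category is dropped is an arbitrary choice; this sketch assumes the alphabetically first one serves as the baseline, and the name dummy_encode is hypothetical.

```python
def dummy_encode(values):
    """Encode N categories with N-1 binary columns.

    The first category (alphabetically) is the baseline and gets no column;
    it is represented by a row of all zeros.
    """
    categories = sorted(set(values))
    baseline, kept = categories[0], categories[1:]
    rows = [
        [1 if value == category else 0 for category in kept]
        for value in values
    ]
    return baseline, kept, rows


departments = ["Finance", "Human resources", "IT", "Production", "IT"]
baseline, dummy_columns, dummy_rows = dummy_encode(departments)

print(baseline)       # Finance
print(dummy_columns)  # ['Human resources', 'IT', 'Production']
for department, row in zip(departments, dummy_rows):
    print(department, row)  # Finance -> [0, 0, 0], IT -> [0, 1, 0]
```

Four departments need only three columns here, and no information is lost because a row of zeros can only mean the baseline category.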
Effect Encoding
Base N Encoding
• In a numeral system, the base (or radix) is the number of unique digits,
or digits and letters, used to represent numbers. The most common
base in everyday life is 10, the decimal system, which uses 10 unique
digits, i.e. 0 to 9, to represent all numbers. Another widely used system
is binary, i.e. base 2, which uses only the 2 digits 0 and 1 to express
all numbers.
• For binary encoding, the base is 2, which means the numerical value
assigned to a category is converted into its binary form. To change
the base of the encoding scheme, you may use a Base N encoder. When
there are many categories and binary encoding still produces too many
columns, we can use a larger base such as 4 or 8.
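A rough sketch of this idea in plain Python: each category first receives an integer code, and that code is then written out digit by digit in the chosen base, with one column per digit. Starting the codes at 1 and assigning them alphabetically are assumptions of this sketch, and the function names are hypothetical.

```python
def to_base_n_digits(number, base, width):
    """Return `number` as `width` base-`base` digits, most significant first."""
    digits = []
    for _ in range(width):
        number, remainder = divmod(number, base)
        digits.append(remainder)
    return digits[::-1]


def base_n_encode(values, base=2):
    """Encode each value as the base-`base` digits of its integer code."""
    categories = sorted(set(values))
    # Assumption: codes start at 1 and follow alphabetical order.
    codes = {category: i for i, category in enumerate(categories, start=1)}
    # Count the digits needed to write the largest code in this base.
    width, largest = 1, len(categories)
    while largest >= base:
        largest //= base
        width += 1
    return [to_base_n_digits(codes[value], base, width) for value in values]


cities = ["Delhi", "Mumbai", "Ahmedabad", "Bangalore"]
binary_rows = base_n_encode(cities, base=2)
base4_rows = base_n_encode(cities, base=4)

print(binary_rows)  # [[0, 1, 1], [1, 0, 0], [0, 0, 1], [0, 1, 0]]
print(base4_rows)   # [[0, 3], [1, 0], [0, 1], [0, 2]]
```

With four cities, base 2 needs three columns and base 4 needs two, compared with four columns for one-hot encoding. The saving grows as the number of categories increases.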
Target Encoding