
Categorical Variable

There are three main approaches to prepare categorical data for modeling: 1. Drop categorical variables, which simply removes them. This is easiest but works only if the variables contain no useful information. 2. Label encoding, which assigns an integer value to each category. It works best for ordinal variables. 3. One-hot encoding, which creates a new column for each category. Unlike label encoding, it does not assume an ordering of the categories. It is generally not used for variables with more than 15 values.


--> A categorical variable takes only a limited number of values

Three Approaches to prepare categorical data

1. Drop categorical variables

The easiest way is to drop them from the dataset. This only works if the columns contain
no useful information.

2. Label Encoding

assigns each unique value to a different integer


For tree-based models, label encoding works best with ordinal variables.
If a variable has a clear ordering of its categories, it is an ordinal
variable, ---- e.g., economic status with three categories (low, medium, and high)
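A minimal sketch of encoding an ordinal variable by hand, using the economic-status example above (the DataFrame and column name here are made up for illustration):

```python
import pandas as pd

# Hypothetical ordinal variable with a clear ordering: low < medium < high
df = pd.DataFrame({"economic_status": ["low", "high", "medium", "low"]})

# Map each category to an integer that respects the ordering
ordering = {"low": 0, "medium": 1, "high": 2}
df["economic_status_encoded"] = df["economic_status"].map(ordering)

print(df["economic_status_encoded"].tolist())  # [0, 2, 1, 0]
```

An explicit mapping like this guarantees the integers follow the real-world order, whereas LabelEncoder assigns integers alphabetically.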

3. One-Hot Encoding

One-hot encoding creates new columns indicating the presence (or absence) of
each possible value in the original data.
In contrast to label encoding, one-hot encoding does not assume an ordering
of the categories.
We refer to categorical variables without an intrinsic ranking as nominal
variables.
One-hot encoding generally does not perform well if the categorical variable
takes on a large number of values
(i.e., you generally won't use it for variables taking more than 15 different
values).
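A quick sketch of what one-hot encoding produces, using pandas' get_dummies on a made-up nominal variable (column name and values are illustrative):

```python
import pandas as pd

# Hypothetical nominal variable with no intrinsic ranking
df = pd.DataFrame({"color": ["red", "green", "red", "blue"]})

# One new column per category; each row marks the presence of its category
encoded = pd.get_dummies(df["color"])
print(list(encoded.columns))          # ['blue', 'green', 'red']
print(int(encoded.loc[0, "red"]))     # 1  (row 0 is "red")
```

With 15+ categories this would add 15+ columns, which is why one-hot encoding is avoided for high-cardinality variables.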

The object dtype indicates the column has text

#Get list of categorical variables

s = (X_train.dtypes == 'object')
object_cols = list(s[s].index)
print("Categorical variables:")
print(object_cols)

#Score from Approach 1 (Drop Categorical Variables)

--we drop the object columns with the select_dtypes() method


drop_X_train = X_train.select_dtypes(exclude=['object'])
drop_X_valid = X_valid.select_dtypes(exclude=['object'])
print("MAE from Approach 1 (Drop Categorical Variables)")
print(score_dataset(drop_X_train, drop_X_valid, y_train, y_valid))

#Score from Approach 2 (Label Encoding)


--Scikit-learn has a LabelEncoder class that can be used to get label encodings. We
loop over the categorical variables and apply the label encoder separately to each
column.

from sklearn.preprocessing import LabelEncoder

#Make copy to avoid changing original data


label_X_train = X_train.copy()
label_X_valid = X_valid.copy()

#apply labelencoder to each column with categorical data


label_encoder = LabelEncoder()
for col in object_cols:
    label_X_train[col] = label_encoder.fit_transform(X_train[col])
    label_X_valid[col] = label_encoder.transform(X_valid[col])

print("MAE from Approach 2 (Label Encoding)")
print(score_dataset(label_X_train, label_X_valid, y_train, y_valid))

#Score from Approach 3 (One-Hot Encoding)

We use the OneHotEncoder class from scikit-learn to get one-hot encodings. There
are a number of parameters that can be used to customize its behavior.

We set handle_unknown='ignore' to avoid errors when the validation data contains
classes that aren't represented in the training data, and
setting sparse=False ensures that the encoded columns are returned as a numpy array
(instead of a sparse matrix). (Note: in recent scikit-learn versions this parameter
is named sparse_output.)

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

#Apply one-hot encoder to each column with categorical data


OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)
OH_cols_train = pd.DataFrame(OH_encoder.fit_transform(X_train[object_cols]))
OH_cols_valid = pd.DataFrame(OH_encoder.transform(X_valid[object_cols]))

#One-hot encoding removed the index; put it back


OH_cols_train.index = X_train.index
OH_cols_valid.index = X_valid.index

#Remove categorical columns (will replace with one-hot encoded columns)


num_X_train = X_train.drop(object_cols, axis=1)
num_X_valid = X_valid.drop(object_cols, axis=1)

#Add one-hot encoded columns to numerical features

OH_X_train = pd.concat([num_X_train, OH_cols_train], axis=1)
OH_X_valid = pd.concat([num_X_valid, OH_cols_valid], axis=1)

print("MAE from Approach 3 (One-Hot Encoding)")
print(score_dataset(OH_X_train, OH_X_valid, y_train, y_valid))
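The score_dataset helper used above is never defined in these notes. A minimal sketch of what it is assumed to do (fit a random forest and return the validation mean absolute error; the model choice and parameters here are assumptions, not taken from this document):

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

def score_dataset(X_train, X_valid, y_train, y_valid):
    # Fit a random forest and report validation MAE (lower is better)
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)
    preds = model.predict(X_valid)
    return mean_absolute_error(y_valid, preds)

# Tiny illustrative call on made-up numeric data
X_tr = pd.DataFrame({"a": [1, 2, 3, 4]})
X_va = pd.DataFrame({"a": [2, 3]})
mae = score_dataset(X_tr, X_va,
                    pd.Series([1.0, 2.0, 3.0, 4.0]),
                    pd.Series([2.0, 3.0]))
print(mae >= 0)  # True
```

Because all three approaches return an MAE from the same helper, their scores are directly comparable.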
