
Categorical Variable

There are three main approaches to prepare categorical data for modeling: 1. Drop categorical variables, which simply removes them. This is easiest but works only if the variables contain no useful information. 2. Label encoding, which assigns an integer value to each category. It works best for ordinal variables. 3. One-hot encoding, which creates a new column for each category. Unlike label encoding, it does not assume an ordering of the categories. It is generally not used for variables with more than 15 values.


--> A categorical variable takes only a limited number of values

Three Approaches to prepare categorical data

1. Drop categorical variables

The easiest way is to drop them from the dataset. This only works if the columns contain
no useful information.

2. Label Encoding

assigns each unique value to a different integer


For tree-based models, label encoding works best with ordinal variables.
If a variable has a clear ordering of its categories, it is an ordinal
variable, ---- e.g., economic status with three categories (low, medium, and high)
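A minimal sketch of encoding an ordinal variable by hand, using the economic-status example above (the DataFrame and column name here are made up for illustration):

```python
import pandas as pd

# Hypothetical ordinal variable with a clear ordering: low < medium < high
df = pd.DataFrame({"economic_status": ["low", "high", "medium", "low"]})

# Map each category to an integer that respects the ordering
ordering = {"low": 0, "medium": 1, "high": 2}
df["economic_status_encoded"] = df["economic_status"].map(ordering)

print(df["economic_status_encoded"].tolist())  # [0, 2, 1, 0]
```

An explicit mapping like this guarantees the integers follow the real-world order, whereas LabelEncoder assigns integers alphabetically.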

3. One-Hot Encoding

One-hot encoding creates new columns indicating the presence (or absence) of
each possible value in the original data.
In contrast to label encoding, one-hot encoding does not assume an ordering
of the categories.
We refer to categorical variables without an intrinsic ranking as nominal
variables.
One-hot encoding generally does not perform well if the categorical variable
takes on a large number of values
(i.e., you generally won't use it for variables taking more than 15 different
values).
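A quick sketch of what one-hot encoding produces, using pandas' get_dummies on a made-up nominal variable (column name and values are illustrative):

```python
import pandas as pd

# Hypothetical nominal variable with no intrinsic ranking
df = pd.DataFrame({"color": ["red", "green", "red", "blue"]})

# One new column per category; each row marks the presence of its category
encoded = pd.get_dummies(df["color"])
print(list(encoded.columns))          # ['blue', 'green', 'red']
print(int(encoded.loc[0, "red"]))     # 1  (row 0 is "red")
```

With 15+ categories this would add 15+ columns, which is why one-hot encoding is avoided for high-cardinality variables.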

The object dtype indicates the column has text

#Get list of categorical variables

s = (X_train.dtypes == 'object')
object_cols = list(s[s].index)
print("Categorical variables:")
print(object_cols)

#Score from Approach 1 (Drop Categorical Variables)

--we drop the object columns with the select_dtypes() method


drop_X_train = X_train.select_dtypes(exclude=['object'])
drop_X_valid = X_valid.select_dtypes(exclude=['object'])
print("MAE from Approach 1 (Drop Categorical Variables)")
print(score_dataset(drop_X_train, drop_X_valid, y_train, y_valid))

#Score from Approach 2 (Label Encoding)


--Scikit-learn has a LabelEncoder class that can be used to get label encodings. We
loop over the categorical variables and apply the label encoder separately to each
column.

from sklearn.preprocessing import LabelEncoder

#Make copy to avoid changing original data


label_X_train = X_train.copy()
label_X_valid = X_valid.copy()

#apply labelencoder to each column with categorical data


label_encoder = LabelEncoder()
for col in object_cols:
    label_X_train[col] = label_encoder.fit_transform(X_train[col])
    label_X_valid[col] = label_encoder.transform(X_valid[col])

print("MAE from Approach 2 (Label Encoding)")
print(score_dataset(label_X_train, label_X_valid, y_train, y_valid))

#Score from Approach 3 (One-Hot Encoding)

We use the OneHotEncoder class from scikit-learn to get one-hot encodings. There
are a number of parameters that can be used to customize its behavior.

We set handle_unknown='ignore' to avoid errors when the validation data contains
classes that aren't represented in the training data, and
setting sparse=False ensures that the encoded columns are returned as a numpy array
(instead of a sparse matrix). (Note: in recent scikit-learn versions this parameter
is named sparse_output.)

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

#Apply one-hot encoder to each column with categorical data


OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)
OH_cols_train = pd.DataFrame(OH_encoder.fit_transform(X_train[object_cols]))
OH_cols_valid = pd.DataFrame(OH_encoder.transform(X_valid[object_cols]))

#One-hot encoding removed the index; put it back


OH_cols_train.index = X_train.index
OH_cols_valid.index = X_valid.index

#Remove categorical columns (will replace with one-hot encoded columns)


num_X_train = X_train.drop(object_cols, axis=1)
num_X_valid = X_valid.drop(object_cols, axis=1)

#Add one-hot encoded columns to numerical features

OH_X_train = pd.concat([num_X_train, OH_cols_train], axis=1)
OH_X_valid = pd.concat([num_X_valid, OH_cols_valid], axis=1)

print("MAE from Approach 3 (One-Hot Encoding)")
print(score_dataset(OH_X_train, OH_X_valid, y_train, y_valid))
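The score_dataset helper used above is never defined in these notes. A minimal sketch of what it is assumed to do (fit a random forest and return the validation mean absolute error; the model choice and parameters here are assumptions, not taken from this document):

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

def score_dataset(X_train, X_valid, y_train, y_valid):
    # Fit a random forest and report validation MAE (lower is better)
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)
    preds = model.predict(X_valid)
    return mean_absolute_error(y_valid, preds)

# Tiny illustrative call on made-up numeric data
X_tr = pd.DataFrame({"a": [1, 2, 3, 4]})
X_va = pd.DataFrame({"a": [2, 3]})
mae = score_dataset(X_tr, X_va,
                    pd.Series([1.0, 2.0, 3.0, 4.0]),
                    pd.Series([2.0, 3.0]))
print(mae >= 0)  # True
```

Because all three approaches return an MAE from the same helper, their scores are directly comparable.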
