6 One Hot Encoding

This document discusses one hot encoding for categorical variables. It explains converting categorical variables into numeric values using one hot encoding. It provides examples of implementing one hot encoding using pandas get_dummies and sklearn OneHotEncoder. It also discusses avoiding the dummy variable trap by dropping one dummy variable. Finally, it provides an example predicting house prices after applying one hot encoding to categorical variables like town.

Uploaded by Sudheer Redus

ONE HOT ENCODING

When a categorical variable is found in any column of the dataset, we have to convert it into
numerical values. This is done using the one-hot encoding technique.
Example:
See the following data: homeprices.csv

Above, 'town' is a categorical variable since it contains 3 town names repeated: Monroe Township,
West Windsor and Robbinsville. These town names can be converted into 0, 1 and 2.

Other examples:
1. Male, Female
2. Green, Red, Blue
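As a quick sketch (the color values below are made up), pandas can map such categories to integer codes. Note that plain integer codes impose an artificial order (0 < 1 < 2), which is why nominal variables like these are usually one-hot encoded instead:

```python
import pandas as pd

# hypothetical sample of a categorical column
colors = pd.Series(["Green", "Red", "Blue", "Green"])

# codes are assigned from the sorted category list:
# Blue -> 0, Green -> 1, Red -> 2
codes = colors.astype("category").cat.codes
print(list(codes))  # [1, 2, 0, 1]
```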

Categorical variable values can be represented with a limited set of values like 0, 1, 2, etc. A dummy
variable is a 0/1 variable that indicates whether a row belongs to one particular category.

The Dummy Variable trap is a scenario in which the independent variables are multicollinear — a
scenario in which two or more variables are highly correlated; in simple terms one variable can be
predicted from the others.

In linear regression, the independent variables should not depend on each other. But the full set of
dummy variables is linearly dependent: across any row they always sum to 1, so each dummy can be
predicted from the others. To eliminate this problem, we delete one dummy variable, thus removing
the multicollinearity.
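A minimal sketch of the trap (the town values here are made up): the full set of dummies sums to 1 in every row, and pd.get_dummies can drop one dummy automatically via drop_first:

```python
import pandas as pd

# hypothetical sample of the 'town' column
towns = pd.Series(["monroe township", "west windsor", "robinsville",
                   "monroe township", "robinsville"], name="town")

# full set of dummies: the three columns always sum to 1 in every row,
# so any one of them is predictable from the other two (multicollinearity)
full = pd.get_dummies(towns)

# drop_first=True removes the first dummy and breaks the linear dependence
reduced = pd.get_dummies(towns, drop_first=True)
print(list(reduced.columns))
```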

Program – one hot encoding using dummy variables

# one hot encoding - using pandas get_dummies()


import pandas as pd
df = pd.read_csv("F://datascience-notes/ml/2-onehot-encoding/homeprices.csv")
df

# create dummy variables


dummies = pd.get_dummies(df.town)
dummies

# add these dummies to original df. add columns of both


merged = pd.concat([df, dummies], axis='columns')
merged

# we do not require the 'town' column as it is replaced by the dummy vars, hence drop town.
# we should also delete one dummy variable, as keeping all of them leads to the
# multicollinearity problem: when 5 dummy vars are there, we should take any 4 only.
# hence we drop town and one dummy variable ('west windsor').
final = merged.drop(['town', 'west windsor'], axis='columns')
final

# we have to delete the price column as it is the target column to be predicted


x = final.drop(['price'], axis= 'columns')
x

y = final['price']
y

# even though we do not drop a dummy variable, the linear regression model will
# still work, because its least-squares solver tolerates the collinear columns.

# let us create linear regression model


from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(x,y) # train the model

# predict the price of house with 2800 sft area located at robinsville
# parameters: 1st: area, 2nd: monroe township, 3rd: robinsville
model.predict([[2800, 0, 1]]) # array([590775.63964739])

# predict the price of house with 3400 sft at west windsor


model.predict([[3400, 0, 0]]) # array([681241.66845839])

# find the accuracy of our model


model.score(x,y) # 0.9573929037221873
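Since homeprices.csv may not be at hand, the same flow can be sketched end to end on a small made-up dataset (every town, area and price below is invented for illustration):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# made-up stand-in for homeprices.csv
df = pd.DataFrame({
    "town": ["monroe township", "monroe township", "robinsville",
             "robinsville", "west windsor", "west windsor"],
    "area": [2600, 3000, 2600, 2900, 2600, 3100],
    "price": [550000, 565000, 575000, 600000, 585000, 620000],
})

dummies = pd.get_dummies(df.town)
merged = pd.concat([df, dummies], axis="columns")

# drop town (replaced by the dummies) and one dummy to avoid the trap
final = merged.drop(["town", "west windsor"], axis="columns")

x = final.drop(["price"], axis="columns")
y = final["price"]

model = LinearRegression().fit(x, y)

# predicting with a DataFrame keeps the feature names consistent
pred = model.predict(pd.DataFrame([[2800, 0, 1]], columns=x.columns))
print(pred, model.score(x, y))
```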

Program – one hot encoding using one hot encoder

# one hot encoding - using sklearn OneHotEncoder


import pandas as pd
df = pd.read_csv("F://datascience-notes/ml/2-onehot-encoding/homeprices.csv")
df

# to use one hot encoding, first we should use Label encoding


from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

# fit and transform the data frame using le on town column


df.town = le.fit_transform(df.town)
df.town

# see the new data frame where town will be 0, 1, or 2


df
# output: array([0, 0, 0, 0, 0, 2, 2, 2, 2, 1, 1, 1, 1])
# here, 0= monroe township, 2=west windsor, 1=robinsville

# retrieve training data


x = df[['town', 'area']].values # when values used, we get 2D
x
# retrieve target data
y = df.price
y

from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

# categorical_features was removed from OneHotEncoder in scikit-learn 0.22;
# ColumnTransformer now applies the encoder to the 0th column (town) and
# passes the remaining 'area' column through unchanged
ohe = ColumnTransformer([('town', OneHotEncoder(sparse_output=False), [0])],
                        remainder='passthrough')
x = ohe.fit_transform(x) # the town dummies come first, then area
x

# to avoid dummy variable trap, drop 0th column


x = x[:, 1:] # take all rows. take from 1st col onwards.i.e.avoid 0th col
x

# let us create linear regression model


from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(x,y) # train the model

# predict the price of house with 2800 sft area located at robinsville
# parameters: 1st: robinsville, 2nd: west windsor, 3rd: area
model.predict([[1, 0, 2800]]) # array([590775.63964739])

# predict the price of house at west windsor.


model.predict([[0,1,3400]]) # array([681241.6684584])

Task on One hot encoding

carprices.csv contains car selling prices for 3 different models. First plot the data points on a scatter
chart to see whether a linear regression model can be applied. Then build a model that can answer the
following questions:
a. Predict the price of a Mercedez Benz that is 4 yrs old with mileage 45000
b. Predict the price of a BMW X5 that is 7 yrs old with mileage 86000
c. Tell the score (accuracy) of your model.
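A starter sketch for this task on a made-up stand-in dataset; the column names 'Car Model', 'Mileage', 'Age(yrs)' and 'Sell Price($)', and every value below, are assumptions that may differ from the real carprices.csv:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# made-up stand-in for carprices.csv; real column names/values may differ
df = pd.DataFrame({
    "Car Model": ["BMW X5", "BMW X5", "Audi A5", "Audi A5",
                  "Mercedez Benz C class", "Mercedez Benz C class"],
    "Mileage": [69000, 46000, 59000, 52000, 72000, 31000],
    "Age(yrs)": [6, 4, 5, 5, 6, 2],
    "Sell Price($)": [18000, 32000, 29400, 32000, 19300, 38600],
})

# one dummy dropped to avoid the dummy variable trap
dummies = pd.get_dummies(df["Car Model"], drop_first=True)
x = pd.concat([df[["Mileage", "Age(yrs)"]], dummies], axis="columns")
y = df["Sell Price($)"]

model = LinearRegression().fit(x, y)
print(model.score(x, y))
```

From here, scatter-plot Mileage and Age(yrs) against the selling price before trusting the fit, then call model.predict with the same column layout as x.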
