6 One Hot Encoding
6 One Hot Encoding
When categorical variables are found in any column in the dataset, we have to convert it into numerical
values. This is done using One hot encoding technique.
Example:
See the following data: homeprices.csv
Above, we have ‘town’ as categorical variable since it has 3 towns names repeated: Monroe township,
west Windsor and Robbinsville. These towns’ names can be converted into 0, 1, 2.
Other examples:
1 Male, Female
2 Green, Red, Blue
Categorical variable values can be represented with limited values like 0,1,2 etc. A dummy variable is a
variable that represents a categorical variable value using 2 or more values.
The Dummy Variable trap is a scenario in which the independent variables are multicollinear — a
scenario in which two or more variables are highly correlated; in simple terms one variable can be
predicted from the others.
In linear regression, the independent variables should not depend on each other. But dummy variables
are dependent. To eliminate this problem, we have to delete one dummy variable. Thus we are
eliminating multicollinear problem.
# we do not require 'town' variable as it is replaced by dummy vars. Hence drop town.
# we should delete one dummy variable out of them as they lead to multicollinear problem.
# when 5 dummy vars are there, we should take any 4 only. t\
# hence we drop town and the last dummy variable.
final = merged.drop(['town', 'west windsor'], axis='columns')
final
y = final['price']
y
# even though we do not drop the dummy variable, linear regression model will work
# correctly. The reason is it will internally drop a column.
# predict the price of house with 2800 sft area located at robinsville
# parameters: 1st: area, 2nd: monroe township, 3rd: robinsvielle
model.predict([[2800, 0, 1]]) # array([590775.63964739])
# predict the price of house with 2800 sft area located at robinsville
# parameters: 1st: robinsville, 2nd: west windsor, 3rd: area
model.predict([[1, 0, 2800]]) # array([590775.63964739])
carprices.csv contains car sell prices for 3 different models. First plot data points on a scatter plot chart
to see if linear regression model can be applied. Then build a model that can answer the following
questions:
a. Predict price of a mercedez benz that is 4 yrs old with mileage 45000
b. Predict price of a BMW X5 that is 7 yrs old with mileage 86000
c. Tell me the score (accuracy) of your model.