
5_one_hot_encoding.ipynb - Colaboratory

The document discusses one hot encoding of categorical variables in pandas and scikit-learn. It shows how to encode a 'town' column into dummy variables, then use those dummies as features in a linear regression model to predict 'price' based on 'town' and 'area'. It compares using pandas get_dummies versus scikit-learn's OneHotEncoder to create the dummy variables. The model fits the data well (R² ≈ 0.96) and produces consistent price predictions for different towns and areas.


Categorical Variables and One Hot Encoding

import pandas as pd

df = pd.read_csv("homeprices.csv")
df

town area price

0 monroe township 2600 550000
1 monroe township 3000 565000
2 monroe township 3200 610000
3 monroe township 3600 680000
4 monroe township 4000 725000
5 west windsor 2600 585000
6 west windsor 2800 615000
7 west windsor 3300 650000
8 west windsor 3600 710000
9 robinsville 2600 575000
10 robinsville 2900 600000
11 robinsville 3100 620000
12 robinsville 3600 695000

Using pandas to create dummy variables

dummies = pd.get_dummies(df.town)
dummies

monroe township robinsville west windsor

0 1 0 0
1 1 0 0
2 1 0 0
3 1 0 0
4 1 0 0
5 0 0 1
6 0 0 1
7 0 0 1
8 0 0 1
9 0 1 0
10 0 1 0
11 0 1 0
12 0 1 0

merged = pd.concat([df,dummies],axis='columns')
merged

town area price monroe township robinsville west windsor

0 monroe township 2600 550000 1 0 0
1 monroe township 3000 565000 1 0 0
2 monroe township 3200 610000 1 0 0
3 monroe township 3600 680000 1 0 0
4 monroe township 4000 725000 1 0 0
5 west windsor 2600 585000 0 0 1
6 west windsor 2800 615000 0 0 1
7 west windsor 3300 650000 0 0 1
8 west windsor 3600 710000 0 0 1
9 robinsville 2600 575000 0 1 0
10 robinsville 2900 600000 0 1 0
11 robinsville 3100 620000 0 1 0
12 robinsville 3600 695000 0 1 0

final = merged.drop(['town'], axis='columns')

final

area price monroe township robinsville west windsor

0 2600 550000 1 0 0
1 3000 565000 1 0 0
2 3200 610000 1 0 0
3 3600 680000 1 0 0
4 4000 725000 1 0 0
5 2600 585000 0 0 1
6 2800 615000 0 0 1
7 3300 650000 0 0 1
8 3600 710000 0 0 1
9 2600 575000 0 1 0
10 2900 600000 0 1 0
11 3100 620000 0 1 0
12 3600 695000 0 1 0

Dummy Variable Trap

When one variable can be derived from the other variables, the variables are said to be multicollinear. Here, if you know the values of the monroe township and robinsville columns, you can infer the value of west windsor (monroe township=0 and robinsville=0 implies west windsor=1), so these dummy columns are multicollinear. In this situation linear regression won't work as expected, hence you need to drop one column.

NOTE: the sklearn library takes care of the dummy variable trap, so the model will still work even if you don't drop one of the dummy columns. However, it is good practice to handle the dummy variable trap yourself, in case the library you are using does not handle it for you.
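As an alternative to manually dropping a column (as done in the next cell), pandas can drop one dummy column for you. A minimal sketch, assuming the same df as above; drop_first drops the alphabetically first category, monroe township:

dummies = pd.get_dummies(df.town, drop_first=True) # one column fewer, avoids the dummy variable trap
final = pd.concat([df.drop('town', axis='columns'), dummies], axis='columns')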

final = final.drop(['west windsor'], axis='columns')

final

area price monroe township robinsville

0 2600 550000 1 0
1 3000 565000 1 0
2 3200 610000 1 0
3 3600 680000 1 0
4 4000 725000 1 0
5 2600 585000 0 0
6 2800 615000 0 0
7 3300 650000 0 0
8 3600 710000 0 0
9 2600 575000 0 1
10 2900 600000 0 1
11 3100 620000 0 1
12 3600 695000 0 1

X = final.drop('price', axis='columns')

X

area monroe township robinsville

0 2600 1 0
1 3000 1 0
2 3200 1 0
3 3600 1 0
4 4000 1 0
5 2600 0 0
6 2800 0 0
7 3300 0 0
8 3600 0 0
9 2600 0 1
10 2900 0 1
11 3100 0 1
12 3600 0 1

y = final.price

from sklearn.linear_model import LinearRegression

model = LinearRegression()

model.fit(X,y)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
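To see what the model learned, you can inspect the fitted parameters; a quick sketch using the standard LinearRegression attributes:

model.coef_ # one weight per feature: area, monroe township, robinsville
model.intercept_ # baseline (the dropped town, west windsor, at area 0)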

model.predict(X) # predictions for all rows of the training data

array([539709.7398409 , 590468.71640508, 615848.20468716, 666607.18125134,
       717366.15781551, 579723.71533005, 605103.20361213, 668551.92431735,
       706621.15674048, 565396.15136531, 603465.38378844, 628844.87207052,
       692293.59277574])

model.score(X,y)

0.9573929037221873
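Note that this score is the R² value computed on the same data the model was trained on. A held-out split gives a fairer estimate; a minimal sketch using train_test_split (random_state chosen arbitrarily, and a separate variable m so the model fitted above is untouched):

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=10)
m = LinearRegression()
m.fit(X_train, y_train)
m.score(X_test, y_test) # R^2 on unseen rows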

model.predict([[3400,0,0]]) # 3400 sqr ft home in west windsor

array([681241.66845839])

model.predict([[2800,0,1]]) # 2800 sqr ft home in robinsville

array([590775.63964739])

Using sklearn OneHotEncoder

The first step is to use LabelEncoder to convert the town names into numbers

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

dfle = df.copy() # copy so the original df is not modified in place

dfle.town = le.fit_transform(dfle.town)

dfle

town area price

0 0 2600 550000
1 0 3000 565000
2 0 3200 610000
3 0 3600 680000
4 0 4000 725000
5 2 2600 585000
6 2 2800 615000
7 2 3300 650000
8 2 3600 710000
9 1 2600 575000
10 1 2900 600000
11 1 3100 620000
12 1 3600 695000

X = dfle[['town','area']].values

X

array([[   0, 2600],
       [   0, 3000],
       [   0, 3200],
       [   0, 3600],
       [   0, 4000],
       [   2, 2600],
       [   2, 2800],
       [   2, 3300],
       [   2, 3600],
       [   1, 2600],
       [   1, 2900],
       [   1, 3100],
       [   1, 3600]])

y = dfle.price.values

array([550000, 565000, 610000, 680000, 725000, 585000, 615000, 650000,
       710000, 575000, 600000, 620000, 695000])

Now use OneHotEncoder to create dummy variables for each of the towns

from sklearn.preprocessing import OneHotEncoder

from sklearn.compose import ColumnTransformer

ct = ColumnTransformer([('town', OneHotEncoder(), [0])], remainder = 'passthrough')

X = ct.fit_transform(X)

array([[1.0e+00, 0.0e+00, 0.0e+00, 2.6e+03],
       [1.0e+00, 0.0e+00, 0.0e+00, 3.0e+03],
       [1.0e+00, 0.0e+00, 0.0e+00, 3.2e+03],
       [1.0e+00, 0.0e+00, 0.0e+00, 3.6e+03],
       [1.0e+00, 0.0e+00, 0.0e+00, 4.0e+03],
       [0.0e+00, 0.0e+00, 1.0e+00, 2.6e+03],
       [0.0e+00, 0.0e+00, 1.0e+00, 2.8e+03],
       [0.0e+00, 0.0e+00, 1.0e+00, 3.3e+03],
       [0.0e+00, 0.0e+00, 1.0e+00, 3.6e+03],
       [0.0e+00, 1.0e+00, 0.0e+00, 2.6e+03],
       [0.0e+00, 1.0e+00, 0.0e+00, 2.9e+03],
       [0.0e+00, 1.0e+00, 0.0e+00, 3.1e+03],
       [0.0e+00, 1.0e+00, 0.0e+00, 3.6e+03]])

X = X[:,1:] # drop the first dummy column to avoid the dummy variable trap

array([[0.0e+00, 0.0e+00, 2.6e+03],
       [0.0e+00, 0.0e+00, 3.0e+03],
       [0.0e+00, 0.0e+00, 3.2e+03],
       [0.0e+00, 0.0e+00, 3.6e+03],
       [0.0e+00, 0.0e+00, 4.0e+03],
       [0.0e+00, 1.0e+00, 2.6e+03],
       [0.0e+00, 1.0e+00, 2.8e+03],
       [0.0e+00, 1.0e+00, 3.3e+03],
       [0.0e+00, 1.0e+00, 3.6e+03],
       [1.0e+00, 0.0e+00, 2.6e+03],
       [1.0e+00, 0.0e+00, 2.9e+03],
       [1.0e+00, 0.0e+00, 3.1e+03],
       [1.0e+00, 0.0e+00, 3.6e+03]])
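As an alternative sketch, OneHotEncoder (scikit-learn 0.21+) can drop the first category itself via drop='first', making the manual X = X[:,1:] slice unnecessary; ct2 and X_alt are illustrative names, applied to the label-encoded town/area values from above:

ct2 = ColumnTransformer([('town', OneHotEncoder(drop='first'), [0])], remainder='passthrough')
X_alt = ct2.fit_transform(dfle[['town','area']].values) # first dummy column already dropped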

model.fit(X,y)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

model.predict([[0,1,3400]]) # 3400 sqr ft home in west windsor

array([681241.6684584])

model.predict([[1,0,2800]]) # 2800 sqr ft home in robinsville

array([590775.63964739])

Exercise

At the same level as this notebook on github, there is an Exercise folder that contains carprices.csv. This file has car sale prices for 3 different models. First plot the data points on a scatter plot to see whether a linear regression model can be applied. If yes, build a model that can answer the following questions:

1) Predict the price of a Mercedez Benz that is 4 yr old with mileage 45000

2) Predict the price of a BMW X5 that is 7 yr old with mileage 86000

3) Tell me the score of your model (Hint: use LinearRegression().score(); for regression this is the R² score, not classification accuracy)
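A minimal sketch of a solution, assuming carprices.csv has columns named 'Car Model', 'Mileage', 'Age(yrs)' and 'Sell Price($)' (check the actual headers in the file and adjust):

import pandas as pd
from sklearn.linear_model import LinearRegression

cars = pd.read_csv("carprices.csv")
# Assumed column names; one-hot encode the car model and drop one dummy
dummies = pd.get_dummies(cars['Car Model'], drop_first=True)
X = pd.concat([cars[['Mileage','Age(yrs)']], dummies], axis='columns')
y = cars['Sell Price($)']

model = LinearRegression()
model.fit(X, y)
model.score(X, y) # R^2 of the fit
X.columns # inspect the column order before building prediction rows by hand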

