
Experiment 1: Data Preprocessing with Python

This experiment covers data preprocessing in Python: importing the required libraries, loading a dataset, handling missing data, encoding categorical variables, splitting the data into training and test sets, and feature scaling, using numpy, pandas, and scikit-learn.

Importing the libraries

import numpy as np                # numerical arrays and np.nan
import matplotlib.pyplot as plt   # plotting (kept for later visualization steps)
import pandas as pd               # reading and manipulating the dataset

Importing the dataset (Data.csv)

Contents of Data.csv:

Country,Age,Salary,Purchased
France,44,72000,No
Spain,27,48000,Yes
Germany,30,54000,No
Spain,38,61000,No
Germany,40,,Yes
France,35,58000,Yes
Spain,,52000,No
France,48,79000,Yes
Germany,50,83000,No
France,37,67000,Yes

dataset = pd.read_csv('Data.csv')
X = dataset.iloc[:, :-1].values   # features: Country, Age, Salary
y = dataset.iloc[:, -1].values    # target: Purchased

print(X)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 nan]
 ['France' 35.0 58000.0]
 ['Spain' nan 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]

print(y)

['No' 'Yes' 'No' 'No' 'Yes' 'Yes' 'No' 'Yes' 'No' 'Yes']
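
Before imputing, it can help to confirm which columns actually contain missing values. This check is not part of the original listing; a minimal sketch using pandas would be:

print(dataset.isnull().sum())   # count of missing entries per column

Given the Data.csv above, this should report one missing value each in Age and Salary, and none in Country or Purchased.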

Taking care of missing data

from sklearn.impute import SimpleImputer

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')  # replace NaN with the column mean
imputer.fit(X[:, 1:3])                     # fit on the numeric columns: Age, Salary
X[:, 1:3] = imputer.transform(X[:, 1:3])   # write the imputed values back into X

print(X)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 63777.77777777778]
 ['France' 35.0 58000.0]
 ['Spain' 38.77777777777778 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]
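
The mean strategy fills the missing Age with 38.78 and the missing Salary with 63777.78 (the column means). If the columns contained outliers, a median-based fill can be preferable; a minimal variation, not part of the original experiment, would be:

imputer = SimpleImputer(missing_values=np.nan, strategy='median')  # median is more robust to outliers
X[:, 1:3] = imputer.fit_transform(X[:, 1:3])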

Encoding categorical data

Encoding the Independent Variable

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])],
                       remainder='passthrough')   # one-hot encode Country (column 0), keep the rest
X = np.array(ct.fit_transform(X))

print(X)

[[1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [0.0 1.0 0.0 30.0 54000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 35.0 58000.0]
 [0.0 0.0 1.0 38.77777777777778 52000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 37.0 67000.0]]
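
OneHotEncoder orders the dummy columns by the sorted category values, so the first three columns correspond to France, Germany, and Spain respectively. This can be confirmed, although the original listing does not show it, by inspecting the fitted encoder:

print(ct.named_transformers_['encoder'].categories_)
# roughly: [array(['France', 'Germany', 'Spain'], dtype=object)]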

Encoding the Dependent Variable

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
y = le.fit_transform(y)   # 'No' -> 0, 'Yes' -> 1

print(y)

[0 1 0 0 1 1 0 1 0 1]
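
LabelEncoder assigns integer codes in sorted order of the labels, which is why 'No' becomes 0 and 'Yes' becomes 1. If needed, the mapping can be recovered from the fitted encoder (this check is not part of the original listing):

print(le.classes_)   # ['No' 'Yes'] — the index in this array is the encoded value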

Splitting the dataset into the Training set and Test set

from sklearn.model_selection import train_test_split

# 80% of the 10 rows go to training (8 rows) and 20% to testing (2 rows);
# random_state=1 makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

print(X_train)

[[0.0 0.0 1.0 38.77777777777778 52000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 35.0 58000.0]]

print(X_test)

[[0.0 1.0 0.0 30.0 54000.0]
 [1.0 0.0 0.0 37.0 67000.0]]

print(y_train)

[0 1 0 0 1 1 0 1]

print(y_test)

[0 1]
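
Because the dataset is tiny, the class balance in the 2-row test set depends entirely on the random split. If a proportional split of the 'Yes'/'No' labels is wanted, train_test_split accepts a stratify argument; a possible variation, not used in the original experiment, is:

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y)   # keep the class ratio in both splits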

Feature Scaling

from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
X_train = sc.fit_transform(X_train)   # fit the scaler on the training data only, then transform
X_test = sc.transform(X_test)         # reuse the training mean/std to avoid data leakage

print(X_train)

[[-0.77459667 -0.57735027 1.29099445 -0.19159184 -1.07812594]
 [-0.77459667 1.73205081 -0.77459667 -0.01411729 -0.07013168]
 [ 1.29099445 -0.57735027 -0.77459667 0.56670851 0.63356243]
 [-0.77459667 -0.57735027 1.29099445 -0.30453019 -0.30786617]
 [-0.77459667 -0.57735027 1.29099445 -1.90180114 -1.42046362]
 [ 1.29099445 -0.57735027 -0.77459667 1.14753431 1.23265336]
 [-0.77459667 1.73205081 -0.77459667 1.43794721 1.57499104]
 [ 1.29099445 -0.57735027 -0.77459667 -0.74014954 -0.56461943]]

print(X_test)

[[-0.77459667 1.73205081 -0.77459667 -1.46618179 -0.9069571 ]
 [ 1.29099445 -0.57735027 -0.77459667 -0.44973664 0.20564034]]
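
Note that the scaler above standardises the one-hot dummy columns as well as Age and Salary, which is why the first three columns no longer contain only 0s and 1s. A common alternative, shown here only as a sketch and not part of the original experiment, is to scale just the numeric columns and leave the dummy variables untouched:

sc = StandardScaler()
X_train[:, 3:] = sc.fit_transform(X_train[:, 3:])   # scale Age and Salary only
X_test[:, 3:] = sc.transform(X_test[:, 3:])         # apply the same training statistics to the test set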
