Exp2-Dm - KS

The document outlines data preprocessing tasks in Python, focusing on handling categorical data, scaling features, and splitting datasets into training and testing sets. It provides code examples for encoding categorical variables, using the OneHotEncoder and LabelEncoder, and demonstrates how to split the dataset using train_test_split. Additionally, it emphasizes the importance of scaling features for accurate machine learning predictions and includes visualizations before and after scaling.

2/4 B. Tech AI&DS Data Mining Using Python Lab

EXP2.
2. Demonstrate the following data preprocessing tasks using Python libraries.
a) Dealing with categorical data
b) Scaling the features
c) Splitting dataset into Training and Testing Sets

A) Dealing with Categorical Data:

When dealing with large, real-world datasets, categorical data is almost inevitable. Categorical variables represent data that can be divided into groups. Examples of categorical variables are race, sex, age group and educational level. These variables often have letters or words as their values. Since machine learning models operate on numbers, categorical variables need to be coded into numbers. Coding a categorical variable as integers may not be enough on its own, because the integer codes suggest an order among the categories that may not exist; one-hot encoding avoids this by giving each category its own 0/1 column.
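To illustrate the difference between integer codes and one-hot columns, here is a minimal plain-Python sketch. The helper names `label_encode` and `one_hot_encode` and the sample list are hypothetical, chosen only for illustration; this is not scikit-learn's implementation, although it mirrors the idea of assigning codes to the sorted distinct categories.

```python
def label_encode(values):
    """Map each distinct category to an integer code (categories sorted alphabetically)."""
    classes = sorted(set(values))
    index = {category: code for code, category in enumerate(classes)}
    return classes, [index[v] for v in values]


def one_hot_encode(values):
    """Turn each value into a row with a single 1 in the column of its category."""
    classes, codes = label_encode(values)
    rows = [[1 if col == code else 0 for col in range(len(classes))] for code in codes]
    return classes, rows


countries = ["INDIA", "FRANCE", "JAPAN", "INDIA"]  # hypothetical sample values
classes, codes = label_encode(countries)
_, one_hot = one_hot_encode(countries)

print(classes)  # ['FRANCE', 'INDIA', 'JAPAN']
print(codes)    # [1, 0, 2, 1]
print(one_hot)  # [[0, 1, 0], [1, 0, 0], [0, 0, 1], [0, 1, 0]]
```

With integer codes, JAPAN (2) looks "twice" INDIA (1), which has no meaning for country names; in the one-hot rows every category is equally distant from every other.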

DATASETEXP1:

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import dtale

from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

datas = pd.read_csv('C:\\Users\\AIDS\\Desktop\\EXP1DATASET.csv')

datas.head(20)

encoder = OneHotEncoder(handle_unknown='ignore')
encoder_datas = pd.DataFrame(encoder.fit_transform(datas[['SALARY']]).toarray())
df2 =datas.join(encoder_datas)
df2.drop('SALARY', axis=1, inplace=True)
df2.head()

   INDEX COUNTRY  FACULTYID    0    1    2    3    4    5    6    7    8    9   10   11   12
0      0   INDIA      270.0  0.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
1      1  FRANCE      301.0  0.0  0.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
2      2   JAPAN       50.0  0.0  0.0  0.0  0.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0  0.0  0.0
3      3   SPAIN       35.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  1.0
4      4   CHINA      390.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0

from sklearn.preprocessing import LabelEncoder


le = LabelEncoder()
datas.iloc[:,-1] = le.fit_transform(datas.iloc[:,-1])
datas

    INDEX    COUNTRY  FACULTYID  SALARY
0       0      INDIA      270.0     3.0
1       1     FRANCE      301.0     4.0
2       2      JAPAN       50.0     6.0
3       3      SPAIN       35.0    12.0
4       4      CHINA      390.0     8.0
5       5    GERMANY       35.0     7.0
6       6      INDIA        NaN     5.0
7       7  AUSTRALIA      422.0    10.0
8       8     RUSSIA       50.0    11.0
9       9        USA      534.0    12.0
10     10      KORIA       21.0     0.0
11     11        USA        NaN     8.0
12     12      INDIA       63.0     9.0
13     13     LONDON      200.0     2.0
14     14      INDIA      105.0     1.0
15     15      JAPAN      321.0     9.0

As you can see, the SALARY column has been successfully transformed into integer codes.

With the SALARY column encoded, we can move to the next step. Note that the COUNTRY column is still stored as text and would need the same treatment before it could be passed to a model.

C) Splitting the Dataset into Training and Testing Sets:

Machine learning models need a training set from which the model learns the relations between the features, so that it can predict for new observations. When we are given a single large dataset with many observations, it is a good idea to split it into two parts, a training set and a test set, so that we can test the model on data it did not see while it was being trained.
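The idea of a split can be sketched in plain Python: shuffle the row indices, hold out a fraction for testing, and keep the rest for training. The function `simple_train_test_split` below is a hypothetical helper written for illustration, not scikit-learn's implementation; rounding the test count up is an assumption made here so that 16 rows with `test_size=0.2` give 4 test rows.

```python
import math
import random


def simple_train_test_split(rows, test_size=0.2, seed=None):
    """Shuffle row indices and hold out a fraction of the rows for testing."""
    rng = random.Random(seed)
    indices = list(range(len(rows)))
    rng.shuffle(indices)
    n_test = math.ceil(len(rows) * test_size)  # round the test count up
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]


rows = list(range(16))  # stand-in for 16 observations
train, test = simple_train_test_split(rows, test_size=0.2, seed=0)
print(len(train), len(test))  # 12 4
```

Every row ends up in exactly one of the two sets, so the test set contains only observations the model has not seen during training.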

X = datas.iloc[:, :-1].values
y = datas.iloc[:, -1].values
# .values converts the selected columns into NumPy arrays
print("Independent Variable\n")
print(X)
print("\nDependent Variable\n")
print(y)

Independent Variable
[[0 'INDIA' 270.0]
[1 'FRANCE' 301.0]
[2 'JAPAN' 50.0]
[3 'SPAIN' 35.0]
[4 'CHINA' 390.0]
[5 'GERMANY' 35.0]
[6 'INDIA' nan]
[7 'AUSTRALIA' 422.0]
[8 'RUSSIA' 50.0]
[9 'USA' 534.0]
[10 'KORIA' 21.0]
[11 'USA' nan]
[12 'INDIA' 63.0]
[13 'LONDON' 200.0]
[14 'INDIA' 105.0]
[15 'JAPAN' 321.0]]

Dependent Variable
[ 3. 4. 6. 12. 8. 7. 5. 10. 11. 12. 0. 8. 9. 2. 1. 9.]

from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(X,y, test_size = 0.2)

print(x_train)

[[0 'INDIA' 270.0]
[3 'SPAIN' 35.0]
[7 'AUSTRALIA' 422.0]
[13 'LONDON' 200.0]
[5 'GERMANY' 35.0]
[12 'INDIA' 63.0]
[14 'INDIA' 105.0]
[9 'USA' 534.0]
[2 'JAPAN' 50.0]
[15 'JAPAN' 321.0]
[4 'CHINA' 390.0]
[1 'FRANCE' 301.0]]

print(x_test)

[[10 'KORIA' 21.0]
[8 'RUSSIA' 50.0]
[11 'USA' nan]
[6 'INDIA' nan]]

from sklearn.model_selection import train_test_split


x_train, x_test, y_train, y_test= train_test_split(X,y, test_size= 0.2,random_state=0)

x_train

Out[41]:
array([[13, 'LONDON', 200.0],
[4, 'CHINA', 390.0],
[2, 'JAPAN', 50.0],
[14, 'INDIA', 105.0],
[10, 'KORIA', 21.0],
[7, 'AUSTRALIA', 422.0],
[15, 'JAPAN', 321.0],
[11, 'USA', nan],
[3, 'SPAIN', 35.0],
[0, 'INDIA', 270.0],
[5, 'GERMANY', 35.0],
[12, 'INDIA', 63.0]], dtype=object)

x_test

Out[42]:
array([[1, 'FRANCE', 301.0],
[6, 'INDIA', nan],
[8, 'RUSSIA', 50.0],
[9, 'USA', 534.0]], dtype=object)

y_train

array([ 2., 8., 6., 1., 0., 10., 9., 8., 12., 3., 7., 9.])

y_test
Out[45]:
array([ 4., 5., 11., 12.])
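The first split above was made without `random_state` and the second with `random_state=0`, and they select different test rows. Fixing the seed makes the shuffle repeatable, so the same rows land in the test set on every run. A plain-Python sketch of that idea, using the standard `random` module rather than scikit-learn (the helper `shuffled_indices` is hypothetical):

```python
import random


def shuffled_indices(n, seed=None):
    """Return the indices 0..n-1 in shuffled order; a fixed seed fixes the order."""
    rng = random.Random(seed)
    indices = list(range(n))
    rng.shuffle(indices)
    return indices


first = shuffled_indices(16, seed=0)
second = shuffled_indices(16, seed=0)
print(first == second)  # True: same seed, same order
```

Without a seed, each call may produce a different order, which is why results from an unseeded split are hard to reproduce.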

B) Scaling the features:

Since machine learning models rely on numbers to learn relations, it is important that the features in a dataset are on similar scales. Scaling brings all features into a comparable range; standardization, used here, rescales each feature to zero mean and unit variance. Unscaled data can lead to inaccurate predictions, because features with large numeric ranges can dominate those with small ranges. Some machine learning algorithms, such as decision trees, are insensitive to feature scale and do not require explicit scaling.
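Standardization can be written as z = (x - mean) / std for each value x in a column. A minimal sketch in plain Python, assuming two hypothetical columns on very different scales; the helper `standardize` is illustrative and uses the population standard deviation (divisor n), which is also what `StandardScaler` uses:

```python
import statistics


def standardize(column):
    """Return z-scores: subtract the mean, divide by the population standard deviation."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)
    return [(x - mean) / std for x in column]


salaries = [30000.0, 50000.0, 70000.0]  # hypothetical values, large scale
ages = [25.0, 35.0, 45.0]               # hypothetical values, small scale

z_salaries = standardize(salaries)
z_ages = standardize(ages)
print(z_salaries)
print(z_ages)
```

Although the raw columns differ by three orders of magnitude, both come out as roughly -1.22, 0 and 1.22 after standardization, so neither dominates the other.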

from sklearn.preprocessing import StandardScaler
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

matplotlib.style.use('ggplot')

# Three normally distributed columns with different means and spreads
np.random.seed(1)
df = pd.DataFrame({
    'Col_1': np.random.normal(0, 2, 30000),
    'Col_2': np.random.normal(5, 3, 30000),
    'Col_3': np.random.normal(-5, 5, 30000)
})

df.head()

      Col_1      Col_2      Col_3
0  3.248691   6.158165   3.702313
1 -1.223513   3.502543   0.381228
2 -1.056344   8.859463  -0.195733
3 -2.145937   3.943360  -8.099449
4  1.730815  11.728704 -10.434062

col_names = df.columns
features = df[col_names]

# Learn each column's mean and standard deviation, then rescale
scaler = StandardScaler().fit(features.values)
features = scaler.transform(features.values)

scaled_features = pd.DataFrame(features, columns=col_names)
scaled_features.head()

      Col_1     Col_2     Col_3
0  1.624908  0.376945  1.751563
1 -0.614084 -0.504704  1.083641
2 -0.530391  1.273758  0.967605
3 -1.075893 -0.358355 -0.621956
4  0.864990  2.226327 -1.091483


scaled_features.describe()

              Col_1         Col_2         Col_3
count  3.000000e+04  3.000000e+04  3.000000e+04
mean   1.231607e-17 -5.933032e-17  6.300146e-17
std    1.000017e+00  1.000017e+00  1.000017e+00
min   -4.240174e+00 -4.079611e+00 -4.349633e+00
25%   -6.757204e-01 -6.750500e-01 -6.728936e-01
50%    1.140547e-03  4.150084e-03  1.111792e-03
75%    6.657449e-01  6.767212e-01  6.671877e-01
max    4.171969e+00  3.600454e+00  4.021416e+00
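The `std` row shows 1.000017 rather than exactly 1. A likely explanation is that `StandardScaler` divides by the population standard deviation (divisor n), while pandas `describe()` reports the sample standard deviation (divisor n - 1), so the reported value is sqrt(n / (n - 1)). A quick check of that arithmetic for n = 30000:

```python
import math

n = 30000  # number of rows in the generated DataFrame

# Ratio between the sample (n - 1) and population (n) standard deviations
ratio = math.sqrt(n / (n - 1))
print(f"{ratio:.6f}")  # 1.000017
```

The means are likewise not exactly zero but of the order 1e-17, which is floating-point rounding error.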

fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(6, 5))
ax1.set_title('Before Scaling')
sns.kdeplot(df['Col_1'], ax=ax1)
sns.kdeplot(df['Col_2'], ax=ax1)
sns.kdeplot(df['Col_3'], ax=ax1)
ax2.set_title('After Standard Scaler')
sns.kdeplot(scaled_features['Col_1'], ax=ax2)
sns.kdeplot(scaled_features['Col_2'], ax=ax2)
sns.kdeplot(scaled_features['Col_3'], ax=ax2)
plt.show()
