LAB EXERCISE 2 - Data Preprocessing

Uploaded by

shreya halaswamy
LAB EXERCISE – 2

Data Preprocessing

Aim of the Experiment.


The main aim of this experiment is to preprocess the given dataset. The dataset has already
been created and is available in the file sample.csv.
Sample Dataset

id  first        last       gender  Marks  selected
1   Leone        Debrick    Female  50     TRUE
2   Romola       Phinnessy  Female  60     FALSE
3   Geri         Prium      Male    65     FALSE
4   Sandy        Doveston   Female  95     FALSE
5   Jacenta      Jansik     Female  31     TRUE
6   Diane-marie  Medhurst   Female  45     TRUE
7   Austen       Pool       Male    45     TRUE
8   Vanya        Teffrey    Male    70     FALSE
9   Giordano     Elloy      Male    36     FALSE
10  Rozele       Fawcett    Female  50     FALSE

The objectives of this experiment are:

1. Explore LabelEncoder
2. Explore scikit-learn preprocessing routines such as scaling
3. Explore scikit-learn preprocessing routines such as Binarizer

Reference to the Textbook and Explanation

All the fundamentals are given in Chapter 2 and Appendix 2.

The values Female and Male in the gender column of the dataset can be encoded as 0 and 1 using
LabelEncoder. It is done as given below:

df_gender_encode=LabelEncoder()

df.gender=df_gender_encode.fit_transform(df.gender)
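
The following is a minimal sketch of how the resulting codes can be inspected. The five-row frame df_demo and the names encoder, mapping and restored are illustrative assumptions and are not part of sample.csv or the listings. LabelEncoder assigns codes in the sorted order of the labels, which is why Female receives 0 and Male receives 1.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical five-row stand-in for the gender column of sample.csv
df_demo = pd.DataFrame({"gender": ["Female", "Female", "Male", "Female", "Male"]})

encoder = LabelEncoder()
df_demo["gender_code"] = encoder.fit_transform(df_demo["gender"])

# classes_ holds the original labels in sorted order; the position is the code
mapping = {label: int(code) for code, label in enumerate(encoder.classes_)}
print(mapping)
print(df_demo)

# inverse_transform maps the integer codes back to the original labels
restored = encoder.inverse_transform(df_demo["gender_code"])
print(list(restored))
```

Keeping the fitted encoder around is useful because inverse_transform can later turn the codes back into readable labels.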

Scaling can be done as follows:

scaled_df = preprocessing.scale(df.Marks)

df.Marks = scaled_df

Scaling removes the mean and divides by the standard deviation, so the scaled marks have zero mean and unit variance.
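
As a rough check of what preprocessing.scale computes, the sketch below compares it with the z-score formula (x - mean) / std worked out by hand. The marks are the ten values from the sample dataset; the names marks, scaled and manual are illustrative assumptions.

```python
import numpy as np
from sklearn import preprocessing

# The ten Marks values from the sample dataset
marks = np.array([50, 60, 65, 95, 31, 45, 45, 70, 36, 50], dtype=float)

scaled = preprocessing.scale(marks)

# Manual z-score; np.std uses the population standard deviation (ddof=0),
# which is the convention preprocessing.scale follows
manual = (marks - marks.mean()) / marks.std()

print(np.allclose(scaled, manual))
print(round(float(scaled.mean()), 10), round(float(scaled.std()), 10))
```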

Copyright @ Oxford University Press, India 2021


Binarization uses a threshold to convert values to binary (values greater than the threshold become 1, all other values become 0), as shown below:

scaled_df_bin = preprocessing.Binarizer(threshold=0.5).transform(newarr)
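
A small sketch of the threshold behaviour, using made-up values rather than the marks from sample.csv. The names values, binarizer and binary are illustrative assumptions. Note that a value exactly equal to the threshold maps to 0, because only values strictly greater than the threshold become 1.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# Hypothetical values; Binarizer expects a 2-D array, hence the reshape
values = np.array([-1.2, 0.0, 0.5, 0.51, 2.0]).reshape(-1, 1)

binarizer = Binarizer(threshold=0.5)
binary = binarizer.fit_transform(values)

# Flatten back to one dimension for display
print(binary.ravel())
```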

Duplicates can be removed as follows:

df_duplicates_removed = df_duplicated.drop_duplicates()
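
The sketch below shows two related options on a tiny made-up frame: duplicated() flags the repeated rows before they are dropped, and the subset and keep parameters of drop_duplicates control which columns are compared and which copy survives. The frame df_small and the other variable names are illustrative assumptions.

```python
import pandas as pd

# Hypothetical three-row frame, repeated once to create duplicates
df_small = pd.DataFrame({"id": [1, 2, 3], "Marks": [50, 60, 65]})
df_twice = pd.concat([df_small] * 2, ignore_index=True)

# duplicated() marks every row that repeats an earlier row
dup_mask = df_twice.duplicated()
print(dup_mask.tolist())

# Default behaviour: compare all columns and keep the first occurrence
df_unique = df_twice.drop_duplicates()
print(df_unique)

# Compare only the Marks column and keep the last occurrence instead
by_marks = df_twice.drop_duplicates(subset=["Marks"], keep="last")
print(by_marks)
```

The surviving rows keep their original index labels; calling reset_index(drop=True) afterwards renumbers them if that is preferred.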

The NaN values of a column can be replaced as shown below:

df['m5']=df['m5'].fillna(0)

This replaces every NaN in column m5 with zero.

The command

df=df.dropna(axis=1)

removes every column that contains at least one NaN.
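
The options above can be compared side by side on a small made-up frame. This is a sketch only: df_nan and the other names are illustrative assumptions. Counting the NaN values per column first, with isna().sum(), helps in deciding whether to fill or to drop.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with missing values in m1 and m5
df_nan = pd.DataFrame({
    "m1": [50.0, np.nan, 60.0],
    "m2": [60.0, 70.0, 80.0],
    "m5": [np.nan, 10.0, 20.0],
})

# Number of NaN values in each column
nan_counts = df_nan.isna().sum()
print(nan_counts)

# Replace NaN with a constant, or with the mean of the column
filled_zero = df_nan["m5"].fillna(0)
filled_mean = df_nan["m1"].fillna(df_nan["m1"].mean())

# axis=1 drops columns containing NaN; axis=0 drops rows containing NaN
cols_dropped = df_nan.dropna(axis=1)
rows_dropped = df_nan.dropna(axis=0)
print(cols_dropped)
print(rows_dropped)
```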

Listing 1

import pandas as pd

col_list=["id","first","last","gender","Marks","selected"]

df = pd.read_csv("sample.csv",usecols=col_list)

print(df)

print("End of Listing\n\n\n")

# Convert the gender column so that Female becomes 0 and

# Male becomes 1, using LabelEncoder from scikit-learn

from sklearn.preprocessing import LabelEncoder

df_gender_encode=LabelEncoder()

df.gender=df_gender_encode.fit_transform(df.gender)

# LabelEncoder sorts the labels, so Female is coded as 0 and Male as 1

print(df)

print("End of Listing\n\n\n")

# Now scale the marks to zero mean and unit variance

from sklearn import preprocessing

scaled_df = preprocessing.scale(df.Marks)

df.Marks = scaled_df

print(df)

print("Scaling of marks is completed\n\n\n\n")

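# Binarizer expects a 2-D array, so reshape the marks into one column

newarr = scaled_df.reshape(-1,1)

# Scaled marks above 0.5 become 1; all others become 0

scaled_df_bin = preprocessing.Binarizer(threshold=0.5).transform(newarr)

df['Marks']=scaled_df_bin.ravel()

print(df)

print("Binarization of marks is completed\n\n\n\n")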

Output

Listing 2

import pandas as pd

col_list=["id","first","last","gender","Marks","selected"]

df = pd.read_csv("sample.csv",usecols=col_list)

print(df)

print("End of Listing\n\n\n")

# Create duplicate rows by concatenating the dataset with itself

df_duplicated = pd.concat([df]*2, ignore_index=True)

print(df_duplicated)

print("Display before removing duplicates\n\n\n\n")

# drop_duplicates() keeps the first occurrence of each repeated row

df_duplicates_removed = df_duplicated.drop_duplicates()

print(df_duplicates_removed)

print("Display after removing duplicates\n\n\n\n")

Output

Listing 3

import pandas as pd

df = pd.DataFrame({

'm1':[50,'A',60,'A',80],

'm2':[60,'A','60','A',80],

'm3':[50,70,'A','A',60],

'm4':[60,'A','A','A',60],

'm5':['A','A','A',10,20]

})

df = df.apply(pd.to_numeric,errors='coerce')

print(df)

print('Dataframe with NaN\n\n\n')

# Replace every NaN in column m5 with zero

df['m5']=df['m5'].fillna(0)

print(df)

print('Making m5 NaN as 0 using fillna() function\n\n\n\n')

# Work on a copy and replace the NaN in m2 with the column mean

df1 = df.copy()

df1['m2'] = df1['m2'].fillna(df1['m2'].mean())

print(df1)

print('Making m2 NaN as mean using fillna() function\n\n\n\n')

# Work on another copy and replace the NaN in m3 with the column median

df2 = df.copy()

df2['m3'] = df2['m3'].fillna(df2['m3'].median())

print(df2)

print('Making m3 NaN as median using fillna() function\n\n\n\n')


# Dropping all columns having NaN

df=df.dropna(axis=1)

print(df)

print('Dropping all columns having NaN\n\n\n\n')

Output


Listing 4

This listing illustrates min-max scaling with MinMaxScaler and standardization to z-scores with StandardScaler.

from numpy import asarray

from sklearn.preprocessing import MinMaxScaler

from sklearn.preprocessing import StandardScaler

data = asarray([[1,3],[8,5],[6,7],[8,9]])

print("\n Original Data")

print(data)


scaler1 = MinMaxScaler()

scaler2 = StandardScaler()

scaled1 = scaler1.fit_transform(data)

scaled2 = scaler2.fit_transform(data)

print("\n\nThe output of MinMax Scaling")

print(scaled1)

print("\n\nThe output of Standard scaling as z-score")

print(scaled2)
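
The two scalers in Listing 4 can be cross-checked against their formulas computed by hand, column by column: min-max scaling is (x - min) / (max - min), and the z-score is (x - mean) / std. The sketch below reuses the data array from the listing; the names manual_minmax and manual_z are illustrative assumptions.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Same data as in Listing 4, stored as floats
data = np.asarray([[1, 3], [8, 5], [6, 7], [8, 9]], dtype=float)

# Min-max scaling by hand: each column is mapped onto the range [0, 1]
col_min = data.min(axis=0)
col_max = data.max(axis=0)
manual_minmax = (data - col_min) / (col_max - col_min)

# z-score by hand: subtract the column mean, divide by the column
# standard deviation (population form, ddof=0)
manual_z = (data - data.mean(axis=0)) / data.std(axis=0)

# Compare the hand computation with the scikit-learn scalers
print(np.allclose(manual_minmax, MinMaxScaler().fit_transform(data)))
print(np.allclose(manual_z, StandardScaler().fit_transform(data)))
```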

Output
