LAB EXERCISE 2 - Data Preprocessing

Uploaded by

shreya halaswamy

LAB EXERCISE – 2

Data Preprocessing

Aim of the Experiment.

The main aim of this experiment is to preprocess the given dataset. The dataset is provided
in the file sample.csv.

Sample Dataset

id  first        last       gender  Marks  selected

1   Leone        Debrick    Female  50     TRUE
2   Romola       Phinnessy  Female  60     FALSE
3   Geri         Prium      Male    65     FALSE
4   Sandy        Doveston   Female  95     FALSE
5   Jacenta      Jansik     Female  31     TRUE
6   Diane-marie  Medhurst   Female  45     TRUE
7   Austen       Pool       Male    45     TRUE
8   Vanya        Teffrey    Male    70     FALSE
9   Giordano     Elloy      Male    36     FALSE
10  Rozele       Fawcett    Female  50     FALSE

The objectives of this experiment are

1. Explore Label Encoder

2. Explore Scikit Preprocessing routines like Scaling
3. Explore Scikit Preprocessing routines like Binarizer

Reference to the Textbook and Explanation

All the fundamentals are given in Chapter 2 and Appendix 2.

The values Female and Male in the gender column can be encoded as 0 and 1 using LabelEncoder. It is
done as given below:

df_gender_encode=LabelEncoder()

df.gender=df_gender_encode.fit_transform(df.gender)
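
To make the mapping explicit, here is a minimal sketch on a small stand-in for the gender column (the five values are illustrative, not read from sample.csv). LabelEncoder sorts the distinct labels and uses each label's position as its code, which is why Female becomes 0 and Male becomes 1.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Small stand-in for the gender column of sample.csv (illustrative values).
gender = pd.Series(["Female", "Female", "Male", "Female", "Male"])

encoder = LabelEncoder()
codes = encoder.fit_transform(gender)

# classes_ holds the sorted labels; a label's position is its code.
print(encoder.classes_.tolist())   # ['Female', 'Male']
print(codes.tolist())              # [0, 0, 1, 0, 1]

# inverse_transform maps the codes back to the original labels.
restored = encoder.inverse_transform(codes)
print(list(restored))
```

Keeping the fitted encoder around is useful because inverse_transform can recover the original labels later.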

Scaling can be done as follows:

scaled_df = preprocessing.scale(df.Marks)

df.Marks = scaled_df

Scaling removes the mean and rescales the values to unit variance.
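
As a rough check of what preprocessing.scale computes, the sketch below applies it to the Marks values from the sample table and compares the result with a hand-written standardization. The variable names marks, scaled and manual are chosen here for illustration.

```python
import numpy as np
from sklearn import preprocessing

# Marks column from the sample dataset above.
marks = np.array([50, 60, 65, 95, 31, 45, 45, 70, 36, 50], dtype=float)

scaled = preprocessing.scale(marks)

# The same result by hand: subtract the mean, then divide by the
# population standard deviation (NumPy's default, ddof=0).
manual = (marks - marks.mean()) / marks.std()

print(np.allclose(scaled, manual))

# After scaling, the mean is (numerically) zero and the standard deviation is one.
print(abs(scaled.mean()) < 1e-9, abs(scaled.std() - 1.0) < 1e-9)
```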

Copyright @ Oxford University Press, India 2021

Binarization compares each value with a threshold: values greater than the threshold become 1 and all other values become 0, as shown below:

scaled_df_bin = preprocessing.Binarizer(threshold=0.5).transform(newarr)
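
The threshold behaviour can be seen on a handful of made-up values (assumed here purely for illustration). Note that Binarizer expects a two-dimensional array, which is why the values are reshaped into a single column first.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# Binarizer expects a 2-D array: one row per sample, one column per feature.
values = np.array([-1.2, 0.0, 0.5, 0.51, 2.0]).reshape(-1, 1)

binarized = Binarizer(threshold=0.5).fit_transform(values)

# Values strictly greater than the threshold become 1; the rest become 0.
print(binarized.ravel().tolist())   # [0.0, 0.0, 0.0, 1.0, 1.0]
```

A value exactly equal to the threshold (0.5 here) maps to 0.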

Duplicates can be removed as follows:

df_duplicates_removed = df_duplicated.drop_duplicates()
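
A minimal sketch of duplicate removal, using three illustrative rows in place of the full sample.csv: the frame is stacked on itself, and duplicated() shows which rows drop_duplicates() would discard.

```python
import pandas as pd

# Three illustrative rows taken from the sample dataset.
df = pd.DataFrame({
    "id": [1, 2, 3],
    "first": ["Leone", "Romola", "Geri"],
    "Marks": [50, 60, 65],
})

# Stack the frame on itself so that every row appears twice.
df_duplicated = pd.concat([df] * 2, ignore_index=True)

# Keep the first occurrence of each row (the default, keep="first").
df_unique = df_duplicated.drop_duplicates()

# duplicated() flags the rows that drop_duplicates() would discard.
print(df_duplicated.duplicated().tolist())
# [False, False, False, True, True, True]
print(len(df_duplicated), len(df_unique))   # 6 3
```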

The NaN values of a column can be replaced as shown below:

df['m5']=df['m5'].fillna(0)

This replaces every NaN in column m5 with zero.

The command,

df=df.dropna(axis=1)

removes every column that contains at least one NaN.
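
The difference between filling and dropping is easiest to see side by side. The small frame below is an assumed example (it is not the dataset used in the listings); it contrasts fillna on one column with dropna along each axis.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "m1": [50.0, 60.0, 80.0],
    "m2": [60.0, np.nan, 80.0],
    "m5": [np.nan, 10.0, 20.0],
})

# Replace NaN in one column only; the other columns are untouched.
filled = df.copy()
filled["m5"] = filled["m5"].fillna(0)

# axis=1 drops columns that contain any NaN; axis=0 (the default) drops rows.
cols_dropped = df.dropna(axis=1)
rows_dropped = df.dropna(axis=0)

print(filled["m5"].tolist())           # [0.0, 10.0, 20.0]
print(cols_dropped.columns.tolist())   # ['m1']
print(rows_dropped.index.tolist())     # [2]
```

Dropping columns can discard a lot of data when NaN values are scattered, so filling is often the gentler option.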

Listing 1

import pandas as pd

col_list=["id","first","last","gender","Marks","selected"]

df = pd.read_csv("sample.csv",usecols=col_list)

print(df)

print("End of Listing\n\n\n")

# Encode the gender column with scikit-learn's LabelEncoder:

# Female becomes 0 and Male becomes 1

from sklearn.preprocessing import LabelEncoder

df_gender_encode=LabelEncoder()

df.gender=df_gender_encode.fit_transform(df.gender)

# Female is now coded as 0 and Male as 1

print(df)

print("End of Listing\n\n\n")

# Standardize the marks: subtract the mean and divide by the standard deviation

from sklearn import preprocessing

scaled_df = preprocessing.scale(df.Marks)

df.Marks = scaled_df

print(df)

print("Scaling of marks is completed\n\n\n\n")

newarr = scaled_df.reshape(-1,1)

scaled_df_bin = preprocessing.Binarizer(threshold=0.5).transform(newarr)

df['Marks'] = scaled_df_bin.ravel()

print(df)

print("Binarization of marks is completed\n\n\n\n")

Output

Listing 2

import pandas as pd

col_list=["id","first","last","gender","Marks","selected"]

df = pd.read_csv("sample.csv",usecols=col_list)

print(df)

print("End of Listing\n\n\n")

# Create duplicate rows by concatenating the DataFrame with itself

df_duplicated = pd.concat([df]*2, ignore_index=True)

print(df_duplicated)

print("Display before removing duplicates\n\n\n\n")

df_duplicates_removed = df_duplicated.drop_duplicates()

print(df_duplicates_removed)

print("Display after removing duplicates\n\n\n\n")

Output

Listing 3

import pandas as pd

df = pd.DataFrame({

'm1':[50,'A',60,'A',80],

'm2':[60,'A','60','A',80],

'm3':[50,70,'A','A',60],

'm4':[60,'A','A','A',60],

'm5':['A','A','A',10,20]

})

df = df.apply(pd.to_numeric,errors='coerce')

print(df)

print('Dataframe with NaN\n\n\n')

# Replace every NaN in column m5 with zero

df['m5']=df['m5'].fillna(0)

print(df)

print('Making m5 NaN as 0 using fillna() function\n\n\n\n')

df1 = df.copy()

# Replace every NaN in column m2 with the mean of m2

df1['m2'] = df1['m2'].fillna(df1['m2'].mean())

print(df1)

print('Making m2 NaN as mean using fillna() function\n\n\n\n')

df2 = df.copy()

# Replace every NaN in column m3 with the median of m3

df2['m3'] = df2['m3'].fillna(df2['m3'].median())

print(df2)

print('Making m3 NaN as median using fillna() function\n\n\n\n')

# Dropping all columns having NaN

df=df.dropna(axis=1)

print(df)

print('Dropping all columns having NaN\n\n\n\n')

Output

Listing 4

This listing illustrates the use of MinMax scaling and Standard scaling for finding Z-scores.
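
As a cross-check on what the two scalers compute, the following sketch reproduces both transformations with plain NumPy and compares them with scikit-learn's results. The names minmax_manual and zscore_manual are introduced here for illustration. Both scalers work column by column.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

data = np.array([[1, 3], [8, 5], [6, 7], [8, 9]], dtype=float)

# Min-max scaling, column by column: (x - min) / (max - min)
col_min, col_max = data.min(axis=0), data.max(axis=0)
minmax_manual = (data - col_min) / (col_max - col_min)

# Z-score, column by column: (x - mean) / std, using the population std (ddof=0)
zscore_manual = (data - data.mean(axis=0)) / data.std(axis=0)

print(np.allclose(minmax_manual, MinMaxScaler().fit_transform(data)))
print(np.allclose(zscore_manual, StandardScaler().fit_transform(data)))
```

Min-max scaling maps each column onto the range 0 to 1, while the z-score centres each column at zero with unit standard deviation.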

from numpy import asarray

from sklearn.preprocessing import MinMaxScaler

from sklearn.preprocessing import StandardScaler

data = asarray([[1,3],[8,5],[6,7],[8,9]])

print("\n Original Data")

print(data)

scaler1 = MinMaxScaler()

scaler2 = StandardScaler()

scaled1 = scaler1.fit_transform(data)

scaled2 = scaler2.fit_transform(data)

print("\n\nThe output of MinMax Scaling")

print(scaled1)

print("\n\nThe output of Standard scaling as z-score")

print(scaled2)

Output
