Data Preprocessing ML Lab

The experiment aimed to apply various data preprocessing techniques on a dataset to prepare it for machine learning algorithms. The steps included checking for missing values and handling them by dropping rows, encoding categorical variables, splitting the data into training and test sets, and performing feature scaling. The code demonstrated reading in a dataset, checking for missing values, label encoding categorical variables, splitting the data, and scaling features.


Experiment No. 1

Aim: To apply various data-preprocessing techniques on a data set to prepare it for machine learning algorithms.

Theory

What is Data Preprocessing?
Data preprocessing is a data mining technique that involves transforming raw data into an understandable format. Real-world data is often incomplete, inconsistent, and/or lacking in certain behaviors or trends, and is likely to contain many errors. Data preprocessing is a proven method of resolving such issues.

Steps
1. Check for missing values
2. Handle categorical variables (Label Encoding and One-Hot Encoding)
3. Split data into training and testing sets
4. Feature scaling

Code and Output

In [1]:
import numpy as np
import pandas as pd

In [2]:
data = pd.read_csv('50_Startups.csv')

In [3]:
data.head(5)

Out[3]:

R&D Spend Administration Marketing Spend State Profit

0 165349.20 136897.80 471784.10 New York 192261.83

1 162597.70 151377.59 443898.53 California 191792.06

2 153441.51 101145.55 407934.54 Florida 191050.39

3 144372.41 118671.85 383199.62 New York 182901.99

4 142107.34 91391.77 366168.42 Florida 166187.94

In [4]:

data.shape

Out[4]:
(50, 5)

In [5]:

data.columns #features

Out[5]:
Index(['R&D Spend', 'Administration', 'Marketing Spend', 'State', 'Profit'], dtype='object')

Checking missing values

In [6]:
# check for missing values
data.isnull().any()
# It is observed that no column has missing values

Out[6]:
R&D Spend False
Administration False
Marketing Spend False
State False
Profit False
dtype: bool

Handling missing values

1. Drop rows having null values

2. Fill missing values with mean/median/mode or any relevant value
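
Option 2 is not demonstrated in this run (the dataset has no nulls), but a minimal sketch of mean/mode imputation with pandas would look like the following; the column names are simply the ones from this dataset:

# numeric columns: fill missing entries with the column mean
data['R&D Spend'] = data['R&D Spend'].fillna(data['R&D Spend'].mean())
# categorical columns: fill missing entries with the most frequent value (mode)
data['State'] = data['State'].fillna(data['State'].mode()[0])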

In [7]:
# Dropping null rows (there are none here, so no rows are removed)
data.dropna(inplace=True)
data.isnull().any()
# Still no null values

Out[7]:
R&D Spend False
Administration False
Marketing Spend False
State False
Profit False
dtype: bool

In [8]:
print(data.shape)

(50, 5)

Handling categorical variables

In [17]:
data2 = pd.read_csv('50_Startups.csv')
data2.head()

Out[17]:

R&D Spend Administration Marketing Spend State Profit

0 165349.20 136897.80 471784.10 New York 192261.83

1 162597.70 151377.59 443898.53 California 191792.06

2 153441.51 101145.55 407934.54 Florida 191050.39

3 144372.41 118671.85 383199.62 New York 182901.99

4 142107.34 91391.77 366168.42 Florida 166187.94

In [18]:

data2['Profit'].unique()

Out[18]:
array([192261.83, 191792.06, 191050.39, 182901.99, 166187.94, 156991.12,
156122.51, 155752.6 , 152211.77, 149759.96, 146121.95, 144259.4 ,
141585.52, 134307.35, 132602.65, 129917.04, 126992.93, 125370.37,
124266.9 , 122776.86, 118474.03, 111313.02, 110352.25, 108733.99,
108552.04, 107404.34, 105733.54, 105008.31, 103282.38, 101004.64,
99937.59, 97483.56, 97427.84, 96778.92, 96712.8 , 96479.51,
90708.19, 89949.14, 81229.06, 81005.76, 78239.91, 77798.83,
71498.49, 69758.98, 65200.33, 64926.08, 49490.75, 42559.73,
35673.41, 14681.4 ])

In [160]:

from sklearn.preprocessing import LabelEncoder


label_encoder = LabelEncoder()

In [19]:

data_LE = data2.copy()
data_LE['State'] = label_encoder.fit_transform(data_LE['State'])

In [20]:

data_LE.head()

Out[20]:

R&D Spend Administration Marketing Spend State Profit

0 165349.20 136897.80 471784.10 2 192261.83

1 162597.70 151377.59 443898.53 0 191792.06

2 153441.51 101145.55 407934.54 1 191050.39

3 144372.41 118671.85 383199.62 2 182901.99

4 142107.34 91391.77 366168.42 1 166187.94

In [21]:
# data_LE is already a DataFrame, so this copy is redundant but harmless
data_LE_df = pd.DataFrame(data_LE)

In [22]:
data_LE_df.dropna(inplace=True)

In [23]:
data_LE_df

Out[23]:

R&D Spend Administration Marketing Spend State Profit

0 165349.20 136897.80 471784.10 2 192261.83

1 162597.70 151377.59 443898.53 0 191792.06

2 153441.51 101145.55 407934.54 1 191050.39

3 144372.41 118671.85 383199.62 2 182901.99

4 142107.34 91391.77 366168.42 1 166187.94

5 131876.90 99814.71 362861.36 2 156991.12

6 134615.46 147198.87 127716.82 0 156122.51

7 130298.13 145530.06 323876.68 1 155752.60

8 120542.52 148718.95 311613.29 2 152211.77

9 123334.88 108679.17 304981.62 0 149759.96

10 101913.08 110594.11 229160.95 1 146121.95

11 100671.96 91790.61 249744.55 0 144259.40

12 93863.75 127320.38 249839.44 1 141585.52


13 91992.39 135495.07 252664.93 0 134307.35

14 119943.24 156547.42 256512.92 1 132602.65

15 114523.61 122616.84 261776.23 2 129917.04

16 78013.11 121597.55 264346.06 0 126992.93

17 94657.16 145077.58 282574.31 2 125370.37

18 91749.16 114175.79 294919.57 1 124266.90

19 86419.70 153514.11 0.00 2 122776.86

20 76253.86 113867.30 298664.47 0 118474.03

21 78389.47 153773.43 299737.29 2 111313.02

22 73994.56 122782.75 303319.26 1 110352.25

23 67532.53 105751.03 304768.73 1 108733.99

24 77044.01 99281.34 140574.81 2 108552.04

25 64664.71 139553.16 137962.62 0 107404.34

26 75328.87 144135.98 134050.07 1 105733.54

27 72107.60 127864.55 353183.81 2 105008.31

28 66051.52 182645.56 118148.20 1 103282.38

29 65605.48 153032.06 107138.38 2 101004.64

30 61994.48 115641.28 91131.24 1 99937.59

31 61136.38 152701.92 88218.23 2 97483.56

32 63408.86 129219.61 46085.25 0 97427.84

33 55493.95 103057.49 214634.81 1 96778.92

34 46426.07 157693.92 210797.67 0 96712.80

35 46014.02 85047.44 205517.64 2 96479.51

36 28663.76 127056.21 201126.82 1 90708.19

37 44069.95 51283.14 197029.42 0 89949.14

38 20229.59 65947.93 185265.10 2 81229.06

39 38558.51 82982.09 174999.30 0 81005.76

40 28754.33 118546.05 172795.67 0 78239.91

41 27892.92 84710.77 164470.71 1 77798.83

42 23640.93 96189.63 148001.11 0 71498.49

43 15505.73 127382.30 35534.17 2 69758.98

44 22177.74 154806.14 28334.72 0 65200.33

45 1000.23 124153.04 1903.93 2 64926.08

46 1315.46 115816.21 297114.46 1 49490.75

47 0.00 135426.92 0.00 0 42559.73

48 542.05 51743.15 0.00 2 35673.41

49 0.00 116983.80 45173.06 0 14681.40
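
The steps above also list One-Hot Encoding, which avoids the artificial ordering (0 < 1 < 2) that label encoding imposes on the states. It is not part of this run's output; a minimal sketch using the standard pandas get_dummies function:

# replace 'State' with one binary indicator column per category
data_OHE = pd.get_dummies(data2, columns=['State'])
data_OHE.head()  # columns: State_California, State_Florida, State_New York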

Splitting into training and testing sets

In [26]:
# train_test_split now lives in sklearn.model_selection
# (sklearn.cross_validation was deprecated in 0.18 and later removed)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(data_LE_df, data_LE_df['Profit'], test_size=0.2)

In [27]:
X_train.head()

Out[27]:

R&D Spend Administration Marketing Spend State Profit

25 64664.71 139553.16 137962.62 0 107404.34

0 165349.20 136897.80 471784.10 2 192261.83

10 101913.08 110594.11 229160.95 1 146121.95

14 119943.24 156547.42 256512.92 1 132602.65

35 46014.02 85047.44 205517.64 2 96479.51

In [28]:
y_train.head()

Out[28]:
25 107404.34
0 192261.83
10 146121.95
14 132602.65
35 96479.51
Name: Profit, dtype: float64
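
Note that X_train above still contains the Profit column, because the whole data frame was passed as the feature matrix. A more conventional split separates features from the target first; a minimal sketch (random_state is added only to make the split reproducible):

X = data_LE_df.drop('Profit', axis=1)  # features only, target removed
y = data_LE_df['Profit']               # target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)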

Feature Scaling

In [29]:
from sklearn.preprocessing import StandardScaler
standard_X = StandardScaler()

In [30]:
X_train = standard_X.fit_transform(X_train)
# use transform (not fit_transform) on the test set so it is scaled
# with the mean and variance learned from the training data
X_test = standard_X.transform(X_test)

In [31]:
pd.DataFrame(X_train) #SCALED

Out[31]:

0 1 2 3 4

0 -0.147778 0.768777 -0.732925 -1.248168 -0.078585

1 2.099133 0.672035 2.246595 1.187282 2.114855

2 0.683470 -0.286287 0.081064 -0.030443 0.922208

3 1.085838 1.387929 0.325194 -0.030443 0.572754

4 -0.563993 -1.217028 -0.129964 1.187282 -0.360975

5 -0.949166 0.003426 -0.422023 -1.248168 -0.832442

6 -1.590858 -0.053492 -1.561117 -1.248168 -2.475335

7 0.158509 1.286864 0.710993 1.187282 0.022449

8 -0.730373 -1.292275 -0.402355 -1.248168 -0.760949


9 0.521545 0.970048 0.557805 1.187282 0.385810

10 1.316921 0.986533 0.926449 -0.030443 1.171146

11 -0.607378 -2.447162 -0.205725 -1.248168 -0.529776

12 -0.352435 -0.560869 -0.048589 -0.030443 -0.353236

13 0.964891 0.151737 0.372172 1.187282 0.503335

14 -1.244827 0.325357 -1.647149 1.187282 -1.051661

15 -1.578762 -2.430403 -1.964309 1.187282 -1.932723

16 -1.139408 -1.912880 -0.310728 1.187282 -0.755177

17 -1.561502 -0.096031 0.687583 -0.030443 -1.575565

18 -0.968390 -1.229294 -0.496328 -0.030443 -0.843843

19 1.631008 0.008009 1.455935 1.187282 1.872917

20 0.018321 0.342926 1.188029 1.187282 -0.140519

21 -1.095932 1.324489 -1.711408 -1.248168 -1.169496

22 -0.083778 -0.462735 0.755901 -0.030443 -0.044215

23 0.090208 0.935743 -0.767847 -0.030443 -0.121772

24 0.150110 0.114601 0.395109 -1.248168 0.427751

25 0.462077 0.620929 0.290849 -1.248168 0.616818

26 1.099212 1.102714 0.816992 1.187282 1.079621

27 1.413268 1.047333 -0.824374 -1.248168 1.180707

28 1.580460 -0.985886 1.303923 -0.030443 1.440884

29 1.352154 -0.679013 1.274406 1.187282 1.203160

30 -0.951188 0.313476 -0.169154 -0.030443 -0.510155

31 0.655773 -0.971355 0.264783 -1.248168 0.874064

32 -0.554798 1.429699 -0.082837 -1.248168 -0.354945

33 -1.063279 -0.811085 -0.643327 -1.248168 -1.006698

34 -1.568537 0.207705 -1.947316 1.187282 -1.176585

35 0.503839 0.323101 0.265630 -0.030443 0.804948

36 0.060431 0.157781 0.742964 -0.030443 -0.002386

37 0.110850 -0.167035 0.701417 -1.248168 0.207550

38 0.456649 -0.155796 0.667992 -0.030443 0.357287

39 0.337715 1.277416 -1.964309 1.187282 0.318772
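
StandardScaler standardises each feature to zero mean and unit variance. An alternative form of feature scaling is min-max normalization, which maps each feature to the [0, 1] range. A minimal sketch, assuming it replaces StandardScaler in cell In [30]:

from sklearn.preprocessing import MinMaxScaler

minmax_X = MinMaxScaler()
X_train = minmax_X.fit_transform(X_train)  # learn per-feature min/max from training data
X_test = minmax_X.transform(X_test)        # reuse the training min/max on the test set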

Result
The data set was pre-processed by checking for and handling missing values, label encoding the categorical 'State' column, splitting the data into training and testing sets, and feature scaling (standardisation).

Conclusion
Real-world data is generally incomplete (lacking attribute values or certain attributes of interest, or containing only aggregate data), noisy (containing errors or outliers), and inconsistent (containing discrepancies in codes or names). Hence, it is essential to pre-process data so that machine learning algorithms can be applied without hindrance.
