Data Preprocessing ML Lab

The experiment aimed to apply various data preprocessing techniques on a dataset to prepare it for machine learning algorithms. The steps included checking for missing values and handling them by dropping rows, encoding categorical variables, splitting the data into training and test sets, and performing feature scaling. The code demonstrated reading in a dataset, checking for missing values, label encoding categorical variables, splitting the data, and scaling features.


Experiment No. 1

Aim: To apply various data-preprocessing techniques on a data set to prepare it for machine learning algorithms.

Theory

What is Data Preprocessing?
Data preprocessing is a data mining technique that involves transforming raw data into an understandable format. Real-world data is often incomplete, inconsistent, and/or lacking in certain behaviors or trends, and is likely to contain many errors. Data preprocessing is a proven method of resolving such issues.

Steps
1. Check for missing values
2. Handle categorical variables (Label Encoding and One-Hot Encoding)
3. Split data into training and testing sets
4. Feature scaling

Code and Output

In [1]:
import numpy as np
import pandas as pd

In [2]:
data = pd.read_csv('50_Startups.csv')

In [3]:
data.head(5)

Out[3]:

R&D Spend Administration Marketing Spend State Profit

0 165349.20 136897.80 471784.10 New York 192261.83

1 162597.70 151377.59 443898.53 California 191792.06

2 153441.51 101145.55 407934.54 Florida 191050.39

3 144372.41 118671.85 383199.62 New York 182901.99

4 142107.34 91391.77 366168.42 Florida 166187.94

In [4]:

data.shape

Out[4]:
(50, 5)

In [5]:

data.columns #features

Out[5]:
Index(['R&D Spend', 'Administration', 'Marketing Spend', 'State', 'Profit'], dtype='object')

Checking missing values

In [6]:
# check for missing values
data.isnull().any()
# It is observed that no column has missing values

Out[6]:
R&D Spend False
Administration False
Marketing Spend False
State False
Profit False
dtype: bool

Handling missing values

1. Drop rows having null values

2. Fill missing values with mean/median/mode or any relevant value
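
Option 2 is not demonstrated in this run (the dataset has no nulls), but a minimal sketch of mean/mode imputation with pandas would look like the following; the column names are simply the ones from this dataset:

# numeric columns: fill missing entries with the column mean
data['R&D Spend'] = data['R&D Spend'].fillna(data['R&D Spend'].mean())
# categorical columns: fill missing entries with the most frequent value (mode)
data['State'] = data['State'].fillna(data['State'].mode()[0])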

In [7]:
# Dropping null rows (there are none here, so no rows are removed)
data.dropna(inplace=True)
data.isnull().any()
# Still no null values

Out[7]:
R&D Spend False
Administration False
Marketing Spend False
State False
Profit False
dtype: bool

In [8]:
print(data.shape)

(50, 5)

Handling categorical variables

In [17]:
data2 = pd.read_csv('50_Startups.csv')
data2.head()

Out[17]:

R&D Spend Administration Marketing Spend State Profit

0 165349.20 136897.80 471784.10 New York 192261.83

1 162597.70 151377.59 443898.53 California 191792.06

2 153441.51 101145.55 407934.54 Florida 191050.39

3 144372.41 118671.85 383199.62 New York 182901.99

4 142107.34 91391.77 366168.42 Florida 166187.94

In [18]:

data2['Profit'].unique()

Out[18]:
array([192261.83, 191792.06, 191050.39, 182901.99, 166187.94, 156991.12,
156122.51, 155752.6 , 152211.77, 149759.96, 146121.95, 144259.4 ,
141585.52, 134307.35, 132602.65, 129917.04, 126992.93, 125370.37,
124266.9 , 122776.86, 118474.03, 111313.02, 110352.25, 108733.99,
108552.04, 107404.34, 105733.54, 105008.31, 103282.38, 101004.64,
99937.59, 97483.56, 97427.84, 96778.92, 96712.8 , 96479.51,
90708.19, 89949.14, 81229.06, 81005.76, 78239.91, 77798.83,
71498.49, 69758.98, 65200.33, 64926.08, 49490.75, 42559.73,
35673.41, 14681.4 ])

In [160]:

from sklearn.preprocessing import LabelEncoder


label_encoder = LabelEncoder()

In [19]:

data_LE = data2.copy()
data_LE['State'] = label_encoder.fit_transform(data_LE['State'])

In [20]:

data_LE.head()

Out[20]:

R&D Spend Administration Marketing Spend State Profit

0 165349.20 136897.80 471784.10 2 192261.83

1 162597.70 151377.59 443898.53 0 191792.06

2 153441.51 101145.55 407934.54 1 191050.39

3 144372.41 118671.85 383199.62 2 182901.99

4 142107.34 91391.77 366168.42 1 166187.94

In [21]:
# data_LE is already a DataFrame, so this copy is redundant but harmless
data_LE_df = pd.DataFrame(data_LE)

In [22]:
data_LE_df.dropna(inplace=True)

In [23]:
data_LE_df

Out[23]:

R&D Spend Administration Marketing Spend State Profit

0 165349.20 136897.80 471784.10 2 192261.83

1 162597.70 151377.59 443898.53 0 191792.06

2 153441.51 101145.55 407934.54 1 191050.39

3 144372.41 118671.85 383199.62 2 182901.99

4 142107.34 91391.77 366168.42 1 166187.94

5 131876.90 99814.71 362861.36 2 156991.12

6 134615.46 147198.87 127716.82 0 156122.51

7 130298.13 145530.06 323876.68 1 155752.60

8 120542.52 148718.95 311613.29 2 152211.77

9 123334.88 108679.17 304981.62 0 149759.96

10 101913.08 110594.11 229160.95 1 146121.95

11 100671.96 91790.61 249744.55 0 144259.40

12 93863.75 127320.38 249839.44 1 141585.52


13 91992.39 135495.07 252664.93 0 134307.35

14 119943.24 156547.42 256512.92 1 132602.65

15 114523.61 122616.84 261776.23 2 129917.04

16 78013.11 121597.55 264346.06 0 126992.93

17 94657.16 145077.58 282574.31 2 125370.37

18 91749.16 114175.79 294919.57 1 124266.90

19 86419.70 153514.11 0.00 2 122776.86

20 76253.86 113867.30 298664.47 0 118474.03

21 78389.47 153773.43 299737.29 2 111313.02

22 73994.56 122782.75 303319.26 1 110352.25

23 67532.53 105751.03 304768.73 1 108733.99

24 77044.01 99281.34 140574.81 2 108552.04

25 64664.71 139553.16 137962.62 0 107404.34

26 75328.87 144135.98 134050.07 1 105733.54

27 72107.60 127864.55 353183.81 2 105008.31

28 66051.52 182645.56 118148.20 1 103282.38

29 65605.48 153032.06 107138.38 2 101004.64

30 61994.48 115641.28 91131.24 1 99937.59

31 61136.38 152701.92 88218.23 2 97483.56

32 63408.86 129219.61 46085.25 0 97427.84

33 55493.95 103057.49 214634.81 1 96778.92

34 46426.07 157693.92 210797.67 0 96712.80

35 46014.02 85047.44 205517.64 2 96479.51

36 28663.76 127056.21 201126.82 1 90708.19

37 44069.95 51283.14 197029.42 0 89949.14

38 20229.59 65947.93 185265.10 2 81229.06

39 38558.51 82982.09 174999.30 0 81005.76

40 28754.33 118546.05 172795.67 0 78239.91

41 27892.92 84710.77 164470.71 1 77798.83

42 23640.93 96189.63 148001.11 0 71498.49

43 15505.73 127382.30 35534.17 2 69758.98

44 22177.74 154806.14 28334.72 0 65200.33

45 1000.23 124153.04 1903.93 2 64926.08

46 1315.46 115816.21 297114.46 1 49490.75

47 0.00 135426.92 0.00 0 42559.73

48 542.05 51743.15 0.00 2 35673.41

49 0.00 116983.80 45173.06 0 14681.40
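
The steps above also list One-Hot Encoding, which avoids the artificial ordering (0 < 1 < 2) that label encoding imposes on the states. It is not part of this run's output; a minimal sketch using the standard pandas get_dummies function:

# replace 'State' with one binary indicator column per category
data_OHE = pd.get_dummies(data2, columns=['State'])
data_OHE.head()  # columns: State_California, State_Florida, State_New York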

Splitting into training and testing sets

In [26]:
# train_test_split now lives in sklearn.model_selection
# (sklearn.cross_validation was deprecated in 0.18 and later removed)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(data_LE_df, data_LE_df['Profit'], test_size=0.2)

In [27]:
X_train.head()

Out[27]:

R&D Spend Administration Marketing Spend State Profit

25 64664.71 139553.16 137962.62 0 107404.34

0 165349.20 136897.80 471784.10 2 192261.83

10 101913.08 110594.11 229160.95 1 146121.95

14 119943.24 156547.42 256512.92 1 132602.65

35 46014.02 85047.44 205517.64 2 96479.51

In [28]:
y_train.head()

Out[28]:
25 107404.34
0 192261.83
10 146121.95
14 132602.65
35 96479.51
Name: Profit, dtype: float64
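
Note that X_train above still contains the Profit column, because the whole data frame was passed as the feature matrix. A more conventional split separates features from the target first; a minimal sketch (random_state is added only to make the split reproducible):

X = data_LE_df.drop('Profit', axis=1)  # features only, target removed
y = data_LE_df['Profit']               # target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)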

Feature Scaling

In [29]:
from sklearn.preprocessing import StandardScaler
standard_X = StandardScaler()

In [30]:
X_train = standard_X.fit_transform(X_train)
# use transform (not fit_transform) on the test set so it is scaled
# with the mean and variance learned from the training data
X_test = standard_X.transform(X_test)

In [31]:
pd.DataFrame(X_train) #SCALED

Out[31]:

0 1 2 3 4

0 -0.147778 0.768777 -0.732925 -1.248168 -0.078585

1 2.099133 0.672035 2.246595 1.187282 2.114855

2 0.683470 -0.286287 0.081064 -0.030443 0.922208

3 1.085838 1.387929 0.325194 -0.030443 0.572754

4 -0.563993 -1.217028 -0.129964 1.187282 -0.360975

5 -0.949166 0.003426 -0.422023 -1.248168 -0.832442

6 -1.590858 -0.053492 -1.561117 -1.248168 -2.475335

7 0.158509 1.286864 0.710993 1.187282 0.022449

8 -0.730373 -1.292275 -0.402355 -1.248168 -0.760949


9 0.521545 0.970048 0.557805 1.187282 0.385810

10 1.316921 0.986533 0.926449 -0.030443 1.171146

11 -0.607378 -2.447162 -0.205725 -1.248168 -0.529776

12 -0.352435 -0.560869 -0.048589 -0.030443 -0.353236

13 0.964891 0.151737 0.372172 1.187282 0.503335

14 -1.244827 0.325357 -1.647149 1.187282 -1.051661

15 -1.578762 -2.430403 -1.964309 1.187282 -1.932723

16 -1.139408 -1.912880 -0.310728 1.187282 -0.755177

17 -1.561502 -0.096031 0.687583 -0.030443 -1.575565

18 -0.968390 -1.229294 -0.496328 -0.030443 -0.843843

19 1.631008 0.008009 1.455935 1.187282 1.872917

20 0.018321 0.342926 1.188029 1.187282 -0.140519

21 -1.095932 1.324489 -1.711408 -1.248168 -1.169496

22 -0.083778 -0.462735 0.755901 -0.030443 -0.044215

23 0.090208 0.935743 -0.767847 -0.030443 -0.121772

24 0.150110 0.114601 0.395109 -1.248168 0.427751

25 0.462077 0.620929 0.290849 -1.248168 0.616818

26 1.099212 1.102714 0.816992 1.187282 1.079621

27 1.413268 1.047333 -0.824374 -1.248168 1.180707

28 1.580460 -0.985886 1.303923 -0.030443 1.440884

29 1.352154 -0.679013 1.274406 1.187282 1.203160

30 -0.951188 0.313476 -0.169154 -0.030443 -0.510155

31 0.655773 -0.971355 0.264783 -1.248168 0.874064

32 -0.554798 1.429699 -0.082837 -1.248168 -0.354945

33 -1.063279 -0.811085 -0.643327 -1.248168 -1.006698

34 -1.568537 0.207705 -1.947316 1.187282 -1.176585

35 0.503839 0.323101 0.265630 -0.030443 0.804948

36 0.060431 0.157781 0.742964 -0.030443 -0.002386

37 0.110850 -0.167035 0.701417 -1.248168 0.207550

38 0.456649 -0.155796 0.667992 -0.030443 0.357287

39 0.337715 1.277416 -1.964309 1.187282 0.318772
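
StandardScaler standardises each feature to zero mean and unit variance. An alternative form of feature scaling is min-max normalization, which maps each feature to the [0, 1] range. A minimal sketch, assuming it replaces StandardScaler in cell In [30]:

from sklearn.preprocessing import MinMaxScaler

minmax_X = MinMaxScaler()
X_train = minmax_X.fit_transform(X_train)  # learn per-feature min/max from training data
X_test = minmax_X.transform(X_test)        # reuse the training min/max on the test set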

Result
The data set was pre-processed by checking for and handling missing values, label encoding the categorical 'State' column, splitting the data into training and testing sets, and feature scaling (standardisation).

Conclusion
Real-world data is generally incomplete (lacking attribute values or certain attributes of interest, or containing only aggregate data), noisy (containing errors or outliers), and inconsistent (containing discrepancies in codes or names). Hence, it is essential to pre-process data so that machine learning algorithms can be applied without hindrance.
