Data Preprocessing ML Lab
Data Preprocessing ML Lab
1
Aim : To apply various data-preprocessing techniques on a data set to prepare it for machine learning algorithms. Theory What is
Data Preprocessing ? Data preprocessing is a data mining technique that involves transforming raw data into an understandable
format. Real-world data is often incomplete, inconsistent, and/or lacking in certain behaviors or trends, and is likely to contain many
errors. Data preprocessing is a proven method of resolving such issues.
Steps Check missing values Categorical variables (Label Encoding and One-Hot Enconding) Split data into training and testing sets
Feature scaling
In [1]:
import numpy as np
import pandas as pd
In [2]:
data = pd.read_csv('50_Startups.csv')
In [3]:
data.head(5)
Out[3]:
In [4]:
data.shape
Out[4]:
(50, 5)
In [5]:
data.columns #features
Out[5]:
Index(['R&D Spend', 'Administration', 'Marketing Spend', 'State', 'Profit'], dtype='object')
In [6]:
#check for missing values
data.isnull().any()
#It is observed that every column has missing values
Out[6]:
R&D Spend False
Administration False
Marketing Spend False
State False
Profit False
dtype: bool
In [7]:
# Dropping null rows
data.dropna(inplace=True)
data.isnull().any()
#No null values now
Out[7]:
R&D Spend False
Administration False
Marketing Spend False
State False
Profit False
dtype: bool
In [8]:
print(data.shape)
(50, 5)
In [17]:
data2 = pd.read_csv('50_Startups.csv')
data2.head()
Out[17]:
In [18]:
data2['Profit'].unique()
Out[18]:
array([192261.83, 191792.06, 191050.39, 182901.99, 166187.94, 156991.12,
156122.51, 155752.6 , 152211.77, 149759.96, 146121.95, 144259.4 ,
141585.52, 134307.35, 132602.65, 129917.04, 126992.93, 125370.37,
124266.9 , 122776.86, 118474.03, 111313.02, 110352.25, 108733.99,
108552.04, 107404.34, 105733.54, 105008.31, 103282.38, 101004.64,
99937.59, 97483.56, 97427.84, 96778.92, 96712.8 , 96479.51,
99937.59, 97483.56, 97427.84, 96778.92, 96712.8 , 96479.51,
90708.19, 89949.14, 81229.06, 81005.76, 78239.91, 77798.83,
71498.49, 69758.98, 65200.33, 64926.08, 49490.75, 42559.73,
35673.41, 14681.4 ])
In [160]:
In [19]:
data_LE = data2.copy()
data_LE['State'] = label_encoder.fit_transform(data_LE['State'])
In [20]:
data_LE.head()
Out[20]:
In [21]:
data_LE_df = pd.DataFrame(data_LE)
In [22]:
data_LE_df.dropna(inplace=True)
In [23]:
data_LE_df
Out[23]:
In [26]:
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(data_LE_df,data_LE_df['Profit'],test_size=0.2)
In [27]:
X_train.head()
Out[27]:
In [28]:
y_train.head()
Out[28]:
25 107404.34
0 192261.83
10 146121.95
14 132602.65
35 96479.51
Name: Profit, dtype: float64
Feature Scaling
In [29]:
from sklearn.preprocessing import StandardScaler
standard_X = StandardScaler()
In [30]:
X_train = standard_X.fit_transform(X_train)
X_test = standard_X.fit_transform(X_test)
In [31]:
pd.DataFrame(X_train) #SCALED
Out[31]:
0 1 2 3 4
Result
The data set was pre-processed by using label encoding, feature scaling (standardisation, normalization), checking for missing values
and splitting data into training and testing sets.
Conclusion
In Real world data are generally incomplete: lacking attribute values, lacking certain attributes of interest, or containing only
aggregate data. Noisy: containing errors or outliers. Inconsistent: containing discrepancies in codes or names. Hence, it is essential
to pre-process data so that algorithms can be applied without any hindrance.