Data Cleaning in ML
Data Cleaning
Python
import pandas as pd
import numpy as np
# Load the Titanic dataset
df = pd.read_csv('titanic.csv')
df.head()
Output:
(first five rows of the Titanic dataset)
Let's first understand the data by inspecting its structure and identifying
missing values, outliers, and inconsistencies. We can check for duplicate rows
with the following Python code:
Python
df.duplicated()
Output:
0 False
1 False
2 False
3 False
4 False
...
886 False
887 False
888 False
889 False
890 False
Length: 891, dtype: bool
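Every entry above is False, so this dataset has no exact duplicate rows. If duplicates were present, a minimal sketch for counting and removing them (using the same df as above):
Python
# Count exact duplicate rows, then keep only the first occurrence of each
print(df.duplicated().sum())
df = df.drop_duplicates()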
Python
df.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 891 non-null int64
1 Survived 891 non-null int64
2 Pclass 891 non-null int64
3 Name 891 non-null object
4 Sex 891 non-null object
5 Age 714 non-null float64
6 SibSp 891 non-null int64
7 Parch 891 non-null int64
8 Ticket 891 non-null object
9 Fare 891 non-null float64
10 Cabin 204 non-null object
11 Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
From the df.info() output above, we can see that Age and Cabin have fewer
non-null entries than the other columns, which means they contain missing
values. We can also see that some columns are categorical (dtype object),
while others hold integer or float values.
Python
# Categorical columns
cat_col = [col for col in df.columns if df[col].dtype == 'object']
print('Categorical columns :', cat_col)

# Numerical columns
num_col = [col for col in df.columns if df[col].dtype != 'object']
print('Numerical columns :', num_col)
Output:
Categorical columns : ['Name', 'Sex', 'Ticket', 'Cabin', 'Embarked']
Numerical columns : ['PassengerId', 'Survived', 'Pclass', 'Age', 'SibSp', 'Parch', 'Fare']
Python
df[cat_col].nunique()
Output:
Name 891
Sex 2
Ticket 681
Cabin 147
Embarked 3
dtype: int64
Machine learning models cannot work with raw text, so we must either drop
the categorical columns or convert them into numerical form. Here we drop
the Name column, because names are unique to each passenger and have little
influence on the target variable. For Ticket, let's first print the first 50
unique values.
Python
df['Ticket'].unique()[:50]
Output:
(an array of the first 50 unique ticket strings, beginning with 'A/5 21171')
From the tickets above, we can observe that many values are made of two
parts: for example, 'A/5 21171' joins the prefix 'A/5' with the number
'21171'. Such a prefix might influence the target variable, which would be a
case for feature engineering, where we derive new features from a column or
a group of columns. In the current case, however, we drop both the Name and
Ticket columns.
Python
df1 = df.drop(columns=['Name','Ticket'])
df1.shape
Output:
(891, 10)
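As an aside, if we had chosen to keep the ticket information instead, the prefix idea above could be turned into a new feature. A minimal sketch (the Ticket_prefix column and the split rule are illustrative assumptions, not part of this tutorial's pipeline):
Python
# Split each ticket into an optional prefix and a trailing number;
# tickets that are a bare number (e.g. '21171') get the placeholder 'NONE'
parts = df['Ticket'].str.rsplit(' ', n=1)
df['Ticket_prefix'] = parts.str[0].where(parts.str.len() > 1, 'NONE')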
Handling Missing Data
Missing data is a common issue in real-world datasets, and it can occur due
to various reasons such as human errors, system failures, or data collection
issues. Various techniques can be used to handle missing data, such as
imputation, deletion, or substitution.
Let's check the percentage of missing values in each column. df1.isnull()
returns a Boolean DataFrame marking which entries are null, .sum() counts
the null entries per column, and dividing by the number of rows
(df1.shape[0]) and multiplying by 100 gives the percentage of missing values
per column:
Python
round((df1.isnull().sum()/df1.shape[0])*100,2)
Output:
PassengerId 0.00
Survived 0.00
Pclass 0.00
Sex 0.00
Age 19.87
SibSp 0.00
Parch 0.00
Fare 0.00
Cabin 77.10
Embarked 0.22
dtype: float64
The two most common ways to deal with missing data are:
o Dropping the rows or columns that contain missing values. Note that the fact that a value was missing may be informative in itself.
o Imputing the missing values, for example with the mean, median, or mode of the column.
Since about 77% of the Cabin values are missing, it is not a good idea to
impute them, so we drop the Cabin column entirely. The Embarked column has
only 0.22% missing values, so we drop just the rows where Embarked is null.
Python
df2 = df1.drop(columns='Cabin')
# Drop the two rows where Embarked is missing
df2 = df2.dropna(subset=['Embarked'])
df2.shape
Output:
(889, 9)
Note: The Age column still contains missing values. Since Age is numeric, we can impute the missing entries with the mean of the column.
Python
# Mean imputation: fill the missing Age values with the column mean
df3 = df2.fillna({'Age': df2['Age'].mean()})
df3.isnull().sum()
Output:
PassengerId 0
Survived 0
Pclass 0
Sex 0
Age 0
SibSp 0
Parch 0
Fare 0
Embarked 0
dtype: int64
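Mean imputation is just one choice. A minimal sketch of two common alternatives (the df_med and df_cat names are illustrative, not part of this tutorial's pipeline):
Python
# Median imputation is more robust to outliers than the mean
df_med = df2.fillna({'Age': df2['Age'].median()})

# For a categorical column, the most frequent value (mode) is a common choice;
# df1 is used here because its Embarked column still contains missing values
df_cat = df1.fillna({'Embarked': df1['Embarked'].mode()[0]})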
Handling Outliers
Outliers are extreme values that deviate significantly from the majority of the
data. They can negatively impact the analysis and model performance.
Techniques such as clustering, interpolation, or transformation can be used
to handle outliers.
To check the outliers, We generally use a box plot. A box plot, also referred to
as a box-and-whisker plot, is a graphical representation of a dataset’s
distribution. It shows a variable’s median, quartiles, and potential outliers.
The line inside the box denotes the median, while the box itself denotes the
interquartile range (IQR). The whiskers extend to the most extreme non-
outlier values within 1.5 times the IQR. Individual points beyond the whiskers
are considered potential outliers. A box plot offers an easy-to-understand
overview of the range of the data and makes it possible to identify outliers or
skewness in the distribution.
Python
import matplotlib.pyplot as plt

plt.boxplot(df3['Age'], vert=False)
plt.ylabel('Variable')
plt.xlabel('Age')
plt.title('Box Plot')
plt.show()
Output:
Box plot of the Age column; points beyond the whiskers are potential outliers.
As we can see from the box-and-whisker plot above, the Age column contains
outliers: the values below roughly 5 and above roughly 55 lie beyond the
whiskers.
Python
mean = df3['Age'].mean()
std = df3['Age'].std()
# Keep only ages within mean ± 2 standard deviations,
# consistent with the cut-offs noted above
df4 = df3[(df3['Age'] >= mean - 2*std) & (df3['Age'] <= mean + 2*std)]
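Alternatively, the 1.5 × IQR whisker rule described with the box plot can be applied directly. A minimal sketch (df4_iqr is an illustrative name; it does not replace df4 above):
Python
# Bounds from the interquartile range, matching the box plot's whiskers
q1 = df3['Age'].quantile(0.25)
q3 = df3['Age'].quantile(0.75)
iqr = q3 - q1
df4_iqr = df3[df3['Age'].between(q1 - 1.5*iqr, q3 + 1.5*iqr)]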
Data Transformation
Data transformation involves converting the data from one form to another
to make it more suitable for analysis. Techniques such as normalization,
scaling, or encoding can be used to transform the data.
Data validation and verification
Data validation and verification involve ensuring that the data is accurate
and consistent by comparing it with external sources or expert knowledge.
Python
# Separate the feature columns from the Survived target
X = df3[['Pclass','Sex','Age', 'SibSp','Parch','Fare','Embarked']]
Y = df3['Survived']
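Note that Sex and Embarked are still text columns, so the encoding mentioned earlier has to happen before most models can consume X. A minimal sketch using one-hot encoding with pandas (X_encoded is an illustrative name):
Python
# One-hot encode the remaining categorical features;
# drop_first avoids a redundant column per category
X_encoded = pd.get_dummies(X, columns=['Sex', 'Embarked'], drop_first=True)
X_encoded.head()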
Data formatting
Data formatting involves converting the data into a standardized format or
structure that models can process easily; two commonly used techniques are
scaling and standardization.
Scaling
Scaling involves transforming the values of features to a specific range.
It maintains the shape of the original distribution while changing the
scale.
Python
from sklearn.preprocessing import MinMaxScaler

# Numerical columns of the feature set
num_col_ = [col for col in X.columns if X[col].dtype != 'object']

# Min-max scaling squeezes each numerical column into the range [0, 1]
# (assumed here, since it maps values to a specific range as described above)
scaler = MinMaxScaler(feature_range=(0, 1))
x1 = X.copy()
# learning the statistical parameters for each of the data and transforming
x1[num_col_] = scaler.fit_transform(x1[num_col_])
x1.head()
Output:
(first five rows, with the numerical columns rescaled to [0, 1])
Standardization (Z-score scaling) is another common formatting technique; it
rescales each feature as:
Z = (X - μ) / σ
Where,
X = Data
μ = Mean value of X
σ = Standard deviation of X
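In code, this standardization corresponds to scikit-learn's StandardScaler. A minimal sketch applied to the same numerical columns (x2 is an illustrative name; it does not modify x1 above):
Python
from sklearn.preprocessing import StandardScaler

# Z-score each numerical column: subtract the mean, divide by the std
scaler_z = StandardScaler()
x2 = X.copy()
x2[num_col_] = scaler_z.fit_transform(x2[num_col_])
x2.head()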
Some commonly used data cleansing tools:
o OpenRefine
o Trifacta Wrangler
o TIBCO Clarity
o Cloudingo
Advantages of data cleaning in machine learning:
o Improved data quality: cleaning makes the data more reliable and accurate.