Overview of Data Cleaning
Python3
import pandas as pd
import numpy as np

# Load the Titanic training data (the CSV file path is assumed here)
df = pd.read_csv('titanic.csv')
df.head()
Output:
   PassengerId  Survived  Pclass                                                Name     Sex   Age  SibSp  Parch            Ticket     Fare Cabin Embarked
0            1         0       3                             Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500   NaN        S
1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  71.2833   C85        C
2            3         1       3                              Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250   NaN        S
3            4         1       1        Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0            113803  53.1000  C123        S
4            5         0       3                            Allen, Mr. William Henry    male  35.0      0      0            373450   8.0500   NaN        S
1. Data inspection and exploration
This step involves understanding the data by inspecting its structure and identifying missing values, outliers, and inconsistencies.
Check for duplicate rows.
Python3
df.duplicated()
Output:
0 False
1 False
...
889 False
890 False
Length: 891, dtype: bool
Check the data information using df.info()
Python3
df.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 891 non-null int64
1 Survived 891 non-null int64
2 Pclass 891 non-null int64
3 Name 891 non-null object
4 Sex 891 non-null object
5 Age 714 non-null float64
6 SibSp 891 non-null int64
7 Parch 891 non-null int64
8 Ticket 891 non-null object
9 Fare 891 non-null float64
10 Cabin 204 non-null object
11 Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
From the data info above, we can see that the Age and Cabin columns have fewer non-null entries than the rest, i.e. they contain missing values. Some of the columns are categorical with data type object, while others hold integer and float values.
Let’s look at the descriptive statistics of the data using df.describe()
Python3
df.describe()
Output:
[Table: df.describe() output showing count, mean, std, min, 25%, 50%, 75% and max for the numerical columns PassengerId, Survived, Pclass, Age, SibSp, Parch and Fare. Note that the Age count is 714 because of missing values.]
Python3
# Categorical columns
cat_col = [col for col in df.columns if df[col].dtype == 'object']
print('Categorical columns :', cat_col)
# Numerical columns
num_col = [col for col in df.columns if df[col].dtype != 'object']
print('Numerical columns :', num_col)
Output:
Categorical columns : ['Name', 'Sex', 'Ticket', 'Cabin',
'Embarked']
Numerical columns : ['PassengerId', 'Survived', 'Pclass', 'Age',
'SibSp', 'Parch', 'Fare']
Check the total number of unique values in the Categorical columns
Python3
df[cat_col].nunique()
Output:
Name 891
Sex 2
Ticket 681
Cabin 147
Embarked 3
dtype: int64
2. Removal of unwanted observations
This includes deleting duplicate/redundant or irrelevant values from your dataset. Duplicate observations most frequently arise during data collection, while irrelevant observations are those that don’t actually fit the specific problem you are trying to solve.
Redundant observations hurt efficiency considerably: because the data repeats, it can tip the results towards the correct or the incorrect side, producing unreliable results (a minimal drop_duplicates() sketch follows below).
Irrelevant observations are any type of data that is of no use to us and can be removed directly.
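For completeness, here is a minimal sketch (not part of the original pipeline) of how duplicate rows flagged by df.duplicated() could be dropped; since the Titanic data has no duplicates, the shape stays (891, 12):
Python3
# Drop exact duplicate rows, keeping the first occurrence.
# PassengerId is unique, so nothing is actually removed here.
deduped = df.drop_duplicates()
print(deduped.shape)  # (891, 12) - same shape, no duplicates to remove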
Now we have to decide, according to the subject of the analysis, which factors are important for our discussion. As we know, machine learning models do not understand text data, so we have to either drop the categorical columns or convert their values into numerical types. Here we drop the Name column, because names are almost always unique and have little influence on the target variable. For the Ticket column, let’s first print the first 50 unique tickets.
Python3
df['Ticket'].unique()[:50]
Output:
array(['A/5 21171', 'PC 17599', 'STON/O2. 3101282', '113803',
'373450',
'330877', '17463', '349909', '347742', '237736', 'PP 9549',
'113783', 'A/5. 2151', '347082', '350406', '248706',
'382652',
'244373', '345763', '2649', '239865', '248698', '330923',
'113788',
'347077', '2631', '19950', '330959', '349216', 'PC 17601',
'PC 17569', '335677', 'C.A. 24579', 'PC 17604', '113789',
'2677',
'A./5. 2152', '345764', '2651', '7546', '11668', '349253',
'SC/Paris 2123', '330958', 'S.C./A.4. 23567', '370371',
'14311',
'2662', '349237', '3101295'], dtype=object)
From the tickets above, we can observe that many values are made up of two parts: for example, the first value ‘A/5 21171’ joins the prefix ‘A/5’ with the number ‘21171’, and such a prefix might influence the target variable. Extracting it would be a case of feature engineering, where we derive new features from a column or a group of columns. In the current case, we are dropping the “Name” and “Ticket” columns.
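Although we drop Ticket here, a minimal sketch of that feature-engineering idea, extracting the prefix into a hypothetical prefixes series (our own helper, not used further in this article), could look like this:
Python3
# Illustrative only: split each ticket into an optional prefix and a number.
# Purely numeric tickets such as '113803' get the placeholder 'NONE'.
def ticket_prefix(ticket):
    parts = ticket.rsplit(' ', 1)
    return parts[0] if len(parts) == 2 else 'NONE'

prefixes = df['Ticket'].apply(ticket_prefix)
print(prefixes.value_counts().head())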
Drop Name and Ticket columns.
Python3
df1 = df.drop(columns=['Name','Ticket'])
df1.shape
Output:
(891, 10)
3. Handling missing data:
Missing data is a common issue in real-world datasets, and it can occur due
to various reasons such as human errors, system failures, or data collection
issues. Various techniques can be used to handle missing data, such as
imputation, deletion, or substitution.
Let’s check the percentage of missing values in each column. df.isnull() checks whether each value is null and returns a boolean, .sum() counts the null values per column, and dividing by the total number of rows and multiplying by 100 gives the percentage of missing values (i.e. how many values out of every 100 are null).
Python3
round((df1.isnull().sum()/df1.shape[0])*100,2)
Output:
PassengerId 0.00
Survived 0.00
Pclass 0.00
Sex 0.00
Age 19.87
SibSp 0.00
Parch 0.00
Fare 0.00
Cabin 77.10
Embarked 0.22
dtype: float64
We cannot simply ignore or remove missing observations. They must be handled carefully, as they can be an indication of something important.
The two most common ways to deal with missing data are:
Dropping observations with missing values.
The fact that the value was missing may be informative in itself.
Plus, in the real world, you often need to make predictions on new data even if some of the features are missing!
As we can see from the result above, Cabin has 77% null values, Age has 19.87%, and Embarked has 0.22%. It is not a good idea to impute 77% null values, so we will drop the Cabin column. The Embarked column has only 0.22% null values, so we drop the rows where Embarked is null.
Python3
df2 = df1.drop(columns='Cabin')
# Drop the rows where Embarked is null
df2 = df2.dropna(subset=['Embarked'])
df2.shape
Output:
(889, 9)
Imputing the missing values from past observations.
Again, “missingness” is almost always informative in itself, and
you should tell your algorithm if a value was missing.
Even if you build a model to impute your values, you’re not
adding any real information. You’re just reinforcing the patterns
already provided by other features.
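As a small sketch of how to “tell the algorithm” that a value was missing (the age_was_missing name is our own, and this flag is not used further in the article), an indicator can be recorded before imputing:
Python3
# Record which Age values are missing before imputation, so this information
# can later be added to the feature set as a 0/1 indicator if desired.
age_was_missing = df2['Age'].isnull().astype(int)
print(age_was_missing.value_counts())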
From the describe output above, we can see that there is very little difference between the mean and the median of Age, i.e. about 29.7 and 28. So we can use either mean imputation or median imputation here.
Note:
Mean imputation is suitable when the data is normally distributed and has
no extreme outliers.
Median imputation is preferable when the data contains outliers or is
skewed.
Python3
# Mean imputation: at this point Age is the only column with missing values,
# so filling every NaN with the Age mean only affects the Age column
df3 = df2.fillna(df2.Age.mean())
df3.isnull().sum()
Output:
PassengerId 0
Survived 0
Pclass 0
Sex 0
Age 0
SibSp 0
Parch 0
Fare 0
Embarked 0
dtype: int64
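If the Age distribution were more skewed, the median-based alternative mentioned in the note above (a sketch only; the mean-imputed df3 is what we use from here on) would look like this:
Python3
# Median imputation: fill the missing Age values with the column median instead.
df3_median = df2.copy()
df3_median['Age'] = df3_median['Age'].fillna(df3_median['Age'].median())
print(df3_median['Age'].isnull().sum())  # 0 missing values remain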
4. Handling outliers:
Outliers are extreme values that deviate significantly from the majority of the
data. They can negatively impact the analysis and model performance.
Techniques such as clustering, interpolation, or transformation can be used
to handle outliers.
To check the outliers, We generally use a box plot. A box plot, also referred
to as a box-and-whisker plot, is a graphical representation of a dataset’s
distribution. It shows a variable’s median, quartiles, and potential outliers.
The line inside the box denotes the median, while the box itself denotes the
interquartile range (IQR). The whiskers extend to the most extreme non-outlier values within 1.5 times the IQR. Individual points beyond the whiskers
are considered potential outliers. A box plot offers an easy-to-understand
overview of the range of the data and makes it possible to identify outliers or
skewness in the distribution.
Let’s plot the box plot for Age column data.
Python3
import matplotlib.pyplot as plt

plt.boxplot(df3['Age'], vert=False)
plt.ylabel('Variable')
plt.xlabel('Age')
plt.title('Box Plot')
plt.show()
Output:
[Figure: box plot of the Age column]
As we can see from the box-and-whisker plot above, the Age data contains outlier values: roughly, values below 5 and above 55 are outliers.
Python3
mean = df3['Age'].mean()
std = df3['Age'].std()
# Calculate the lower and upper bounds (mean +/- 2 standard deviations)
lower_bound = mean - std * 2
upper_bound = mean + std * 2
print('Lower Bound :', lower_bound)
print('Upper Bound :', upper_bound)
Output:
Lower Bound : 3.705400107925648
Upper Bound : 55.578785285332785
Similarly, we can remove the outliers of the remaining columns.
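For example, a minimal sketch of the IQR rule described above, applied here to the Fare column (the 1.5 multiplier and the variable names are our own choices), could look like this:
Python3
# IQR-based outlier bounds: points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
# are treated as potential outliers and filtered out.
q1 = df3['Fare'].quantile(0.25)
q3 = df3['Fare'].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
fare_filtered = df3[(df3['Fare'] >= lower) & (df3['Fare'] <= upper)]
print(fare_filtered.shape)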
5. Data transformation
Data transformation involves converting the data from one form to another to
make it more suitable for analysis. Techniques such as normalization,
scaling, or encoding can be used to transform the data.
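The scaling step is shown with the article’s own code further below; for the encoding part mentioned here, a minimal sketch using one-hot encoding of the Sex and Embarked columns (pd.get_dummies, not used in the rest of this article) could look like this:
Python3
# One-hot encode the remaining categorical columns of df3.
# drop_first=True drops one redundant dummy column per category.
df3_encoded = pd.get_dummies(df3, columns=['Sex', 'Embarked'], drop_first=True)
print(df3_encoded.columns.tolist())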
Data validation and verification: Data validation and verification involve
ensuring that the data is accurate and consistent by comparing it with
external sources or expert knowledge.
For the machine learning prediction, we first separate the independent features from the target. Here we will consider only ‘Pclass’, ‘Sex’, ‘Age’, ‘SibSp’, ‘Parch’, ‘Fare’ and ‘Embarked’ as the independent features and Survived as the target variable, because PassengerId will not affect the survival rate.
Python3
from sklearn.preprocessing import MinMaxScaler

X = df3[['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']]
Y = df3['Survived']

# Numerical columns of the feature matrix
num_col_ = [col for col in X.columns if X[col].dtype != 'object']

# Min-Max scaling: learn the statistical parameters for each numerical
# column and transform the values into the [0, 1] range
scaler = MinMaxScaler(feature_range=(0, 1))
x1 = X.copy()
x1[num_col_] = scaler.fit_transform(x1[num_col_])
x1.head()
Output:
   Pclass     Sex       Age  SibSp  Parch      Fare Embarked
0     1.0    male  0.271174  0.125    0.0  0.014151        S
1     0.0  female  0.472229  0.125    0.0  0.139136        C
2     1.0  female  0.321438  0.000    0.0  0.015469        S
3     0.0  female  0.434531  0.125    0.0  0.103644        S
4     1.0    male  0.434531  0.000    0.0  0.015713        S
Standardization (Z-score scaling):
Standardization transforms the values to have a mean of 0 and a
standard deviation of 1.
It centers the data around the mean and scales it based on the standard
deviation.
Standardization makes the data more suitable for algorithms that assume
a Gaussian distribution or require features to have zero mean and unit
variance.
Z = (X - μ) / σ
Where,
X = Data
μ = Mean value of X
σ = Standard deviation of X
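As a sketch of how this formula maps to code, sklearn’s StandardScaler can be applied to the same numerical columns (reusing X and num_col_ from above) as an alternative to the min-max scaling shown earlier:
Python3
from sklearn.preprocessing import StandardScaler

# Standardize the numerical columns to zero mean and unit variance.
std_scaler = StandardScaler()
x2 = X.copy()
x2[num_col_] = std_scaler.fit_transform(x2[num_col_])
print(x2[num_col_].mean().round(2))  # approximately 0 for each column
print(x2[num_col_].std().round(2))   # approximately 1 for each column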
Some data cleansing tools:
OpenRefine
Trifacta Wrangler
TIBCO Clarity
Cloudingo
IBM Infosphere Quality Stage