Overview of Data Cleaning

The document discusses data cleaning which is an important part of machine learning. It involves identifying and removing missing, duplicate, or irrelevant data to ensure accurate and consistent data. The key steps in data cleaning are: 1) Data inspection and exploration to understand the data structure and identify issues. This includes checking for duplicates, missing values, outliers and inconsistencies. 2) Removal of unwanted observations like duplicates, redundant or irrelevant values to improve data quality and efficiency. 3) Handling of missing values which is common and needs to be addressed for modeling. Methods include dropping rows, imputing values, and creating indicator variables.

Uploaded by

Shobha Kumari Choudhary


Data cleaning is one of the most important parts of machine learning and plays
a significant part in building a model. It surely isn’t the fanciest part of
machine learning, and there aren’t any hidden tricks or secrets to uncover.
However, the success or failure of a project relies on proper data cleaning.
Professional data scientists usually invest a very large portion of their time
in this step because of the belief that “Better data beats fancier
algorithms”.
If we have a well-cleaned dataset, there is a good chance that we can achieve
good results even with simple algorithms, which can be very beneficial in
terms of computation when the dataset is large. Different types of data will
require different types of cleaning, but the systematic approach described
here can always serve as a good starting point.
Steps Involved in Data Cleaning
Data cleaning is a crucial step in the machine learning (ML) pipeline. It
involves identifying and removing missing, duplicate, or irrelevant data.
The goal is to ensure that the data is accurate, consistent, and free of
errors, because incorrect or inconsistent data can negatively impact the
performance of the ML model.
Data cleaning, also known as data cleansing and often treated as part of data
preprocessing, identifies and corrects or removes errors, inconsistencies,
and inaccuracies in the data to improve its quality and usability. It is
essential because raw data is often noisy, incomplete, and inconsistent,
which can reduce the accuracy and reliability of the insights derived from it.
The following are the most common steps involved in data cleaning:

Data Cleaning

 Import the necessary libraries
 Load the dataset
 Preview the first rows using df.head()
 Python3

import pandas as pd
import numpy as np

# Load the dataset
df = pd.read_csv('train.csv')

# Show the first five rows
df.head()

Output:
   PassengerId  Survived  Pclass  Name                                             Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
0  1            0         3       Braund, Mr. Owen Harris                          male    22.0  1      0      A/5 21171         7.2500   NaN    S
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th…  female  38.0  1      0      PC 17599          71.2833  C85    C
2  3            1         3       Heikkinen, Miss. Laina                           female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S
3  4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)     female  35.0  1      0      113803            53.1000  C123   S
4  5            0         3       Allen, Mr. William Henry                         male    35.0  0      0      373450            8.0500   NaN    S


1. Data inspection and exploration:

This step involves understanding the data by inspecting its structure and
identifying missing values, outliers, and inconsistencies.
 Check the duplicate rows.
 Python3

df.duplicated()

Output:
0 False
1 False
...
889 False
890 False
Length: 891, dtype: bool
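A boolean value per row is hard to scan when there are 891 rows. As a minimal sketch, assuming a small made-up DataFrame (the `toy` data below is hypothetical and is not the Titanic file), the mask can be summed to count duplicates, and `drop_duplicates()` can remove them:

```python
import pandas as pd

# Hypothetical toy data: the last row repeats the first one exactly
toy = pd.DataFrame({
    'PassengerId': [1, 2, 3, 1],
    'Sex': ['male', 'female', 'female', 'male'],
    'Age': [22.0, 38.0, 26.0, 22.0],
})

# Summing the boolean mask counts the fully duplicated rows
n_duplicates = int(toy.duplicated().sum())
print('Duplicate rows :', n_duplicates)

# Keep the first occurrence of each row and drop the repeats
deduped = toy.drop_duplicates().reset_index(drop=True)
print(deduped.shape)
```

A count of zero on the real dataset would mean there is nothing to drop at this stage.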
 Check the data information using df.info()
 Python3

df.info()

Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   PassengerId  891 non-null    int64
 1   Survived     891 non-null    int64
 2   Pclass       891 non-null    int64
 3   Name         891 non-null    object
 4   Sex          891 non-null    object
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64
 7   Parch        891 non-null    int64
 8   Ticket       891 non-null    object
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object
 11  Embarked     889 non-null    object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
From the output above, we can see that Age (714), Cabin (204), and Embarked
(889) have fewer non-null entries than the 891 rows in the dataset, so these
columns contain missing values. Five columns hold text and have dtype object;
the remaining seven hold integer or float values.
Let’s look at the summary statistics of the numerical columns using df.describe()
 Python3

df.describe()
Output:
       PassengerId  Survived    Pclass      Age         SibSp       Parch       Fare
count  891.000000   891.000000  891.000000  714.000000  891.000000  891.000000  891.000000
mean   446.000000   0.383838    2.308642    29.699118   0.523008    0.381594    32.204208
std    257.353842   0.486592    0.836071    14.526497   1.102743    0.806057    49.693429
min    1.000000     0.000000    1.000000    0.420000    0.000000    0.000000    0.000000
25%    223.500000   0.000000    2.000000    20.125000   0.000000    0.000000    7.910400
50%    446.000000   0.000000    3.000000    28.000000   0.000000    0.000000    14.454200
75%    668.500000   1.000000    3.000000    38.000000   1.000000    0.000000    31.000000
max    891.000000   1.000000    3.000000    80.000000   8.000000    6.000000    512.329200
Check the categorical and numerical columns
 Python3

# Categorical columns
cat_col = [col for col in df.columns if df[col].dtype == 'object']
print('Categorical columns :',cat_col)

# Numerical columns
num_col = [col for col in df.columns if df[col].dtype != 'object']
print('Numerical columns :',num_col)

Output:
Categorical columns : ['Name', 'Sex', 'Ticket', 'Cabin',
'Embarked']
Numerical columns : ['PassengerId', 'Survived', 'Pclass', 'Age',
'SibSp', 'Parch', 'Fare']
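The same split can also be written with `select_dtypes`, which filters columns by dtype family instead of comparing dtype names. This is only a sketch on a hypothetical three-column frame (`toy` is made up for illustration):

```python
import pandas as pd

# Hypothetical toy frame with mixed column types
toy = pd.DataFrame({
    'Name': ['A', 'B'],
    'Age': [22.0, 38.0],
    'Pclass': [3, 1],
})

# 'number' matches every integer and float dtype
num_cols = toy.select_dtypes(include='number').columns.tolist()
cat_cols = toy.select_dtypes(exclude='number').columns.tolist()

print('Categorical columns :', cat_cols)
print('Numerical columns :', num_cols)
```

This form keeps working if a numeric column happens to be stored as a different integer or float width.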
Check the total number of unique values in the Categorical columns

 Python3

df[cat_col].nunique()

Output:
Name 891
Sex 2
Ticket 681
Cabin 147
Embarked 3
dtype: int64
2. Removal of unwanted observations
This step deletes duplicate, redundant, or irrelevant observations from the
dataset. Duplicate observations most frequently arise during data collection,
while irrelevant observations are those that don’t actually fit the specific
problem that you’re trying to solve.
 Redundant observations reduce efficiency to a great extent because the data
repeats, and the repeats may push results toward the correct side or toward
the incorrect side, thereby producing unreliable results.
 Irrelevant observations are any type of data that is of no use to us and
can be removed directly.
Next, we have to decide which features are important for the subject of the
analysis. Most machine learning algorithms cannot work with raw text, so we
have to either drop the categorical columns or convert their values into
numerical types. Here we drop the Name column because every name is unique
and it has little direct influence on the target variable. For Ticket, let’s
first print the first 50 unique values.

 Python3

df['Ticket'].unique()[:50]

Output:
array(['A/5 21171', 'PC 17599', 'STON/O2. 3101282', '113803',
'373450',
'330877', '17463', '349909', '347742', '237736', 'PP 9549',
'113783', 'A/5. 2151', '347082', '350406', '248706',
'382652',
'244373', '345763', '2649', '239865', '248698', '330923',
'113788',
'347077', '2631', '19950', '330959', '349216', 'PC 17601',
'PC 17569', '335677', 'C.A. 24579', 'PC 17604', '113789',
'2677',
'A./5. 2152', '345764', '2651', '7546', '11668', '349253',
'SC/Paris 2123', '330958', 'S.C./A.4. 23567', '370371',
'14311',
'2662', '349237', '3101295'], dtype=object)
From the tickets above, we can observe that some values are made of two parts:
the first value, ‘A/5 21171’, joins ‘A/5’ and ‘21171’. These parts may
influence the target variable, but using them is a case of feature
engineering, where we derive new features from a column or a group of
columns. In the current case, we drop both the “Name” and “Ticket” columns.
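Although the walkthrough drops Ticket, the two-part structure noted above could be turned into features. The following is a rough sketch, not part of the original workflow: the column names `TicketPrefix` and `TicketNumber` and the placeholder `'NONE'` are assumptions chosen for illustration.

```python
import pandas as pd

# A few ticket strings copied from the output above
tickets = pd.Series(['A/5 21171', 'PC 17599', 'STON/O2. 3101282', '113803'])

# Split on the last space: the text before it is the prefix,
# the text after it is the number. Tickets without a space have no prefix.
parts = tickets.str.rsplit(' ', n=1)
ticket_prefix = parts.apply(lambda p: p[0] if len(p) == 2 else 'NONE')
ticket_number = parts.apply(lambda p: p[-1])

features = pd.DataFrame({'TicketPrefix': ticket_prefix,
                         'TicketNumber': ticket_number})
print(features)
```

Whether such features help the model would still need to be checked; the sketch only shows how the split could be done.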
Drop Name and Ticket columns.
 Python3

df1 = df.drop(columns=['Name','Ticket'])

df1.shape

Output:
(891, 10)
3. Handling missing data:
Missing data is a common issue in real-world datasets, and it can occur due
to various reasons such as human errors, system failures, or data collection
issues. Various techniques can be used to handle missing data, such as
imputation, deletion, or substitution.
Let’s check the percentage of missing values in each column. df.isnull()
checks whether each value is null and returns boolean values, and .sum()
counts the null values in each column. We divide that count by the total
number of rows in the dataset and multiply by 100 to get a percentage, i.e.
how many values out of every 100 are null.

 Python3

round((df1.isnull().sum()/df1.shape[0])*100,2)

Output:
PassengerId 0.00
Survived 0.00
Pclass 0.00
Sex 0.00
Age 19.87
SibSp 0.00
Parch 0.00
Fare 0.00
Cabin 77.10
Embarked 0.22
dtype: float64
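Because the mean of a boolean mask is the fraction of True values, the same percentages can be obtained without dividing by the row count explicitly. A small sketch on hypothetical data (`toy` is made up for illustration):

```python
import pandas as pd
import numpy as np

# Hypothetical toy frame: 1 of 4 ages and 3 of 4 cabins are missing
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, 35.0],
    'Cabin': [np.nan, 'C85', np.nan, np.nan],
    'Fare': [7.25, 71.28, 7.92, 53.10],
})

# Fraction of missing values per column, expressed as a percentage
missing_pct = (toy.isna().mean() * 100).round(2)
print(missing_pct)
```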
Missing observations should not simply be ignored or removed without thought.
They must be handled carefully as they can be an indication of something
important.
The two most common ways to deal with missing data are:
 Dropping observations with missing values. This has drawbacks:
 The fact that the value was missing may be informative in itself.
 Plus, in the real world, you often need to make predictions on
new data even if some of the features are missing!
As the result above shows, Cabin has 77.10% null values, Age has 19.87%, and
Embarked has 0.22%. Filling in 77% of a column is not a good idea, so we drop
the Cabin column. The Embarked column has only 0.22% null values, so we drop
the rows where Embarked is null.

 Python3

df2 = df1.drop(columns='Cabin')

df2.dropna(subset=['Embarked'], axis=0, inplace=True)

df2.shape

Output:
(889, 9)
 Imputing the missing values from past observations.
 Again, “missingness” is almost always informative in itself, and
you should tell your algorithm if a value was missing.
 Even if you build a model to impute your values, you’re not
adding any real information. You’re just reinforcing the patterns
already provided by other features.
From the describe() table above, we can see that the mean and median of Age
differ very little (about 29.7 and 28.0). So here we can use either mean
imputation or median imputation.
Note:
 Mean imputation is suitable when the data is normally distributed and has
no extreme outliers.
 Median imputation is preferable when the data contains outliers or is
skewed.
 Python3

# Mean imputation: fill only the Age column with its mean
df3 = df2.fillna({'Age': df2['Age'].mean()})

# Let's check the null values again
df3.isnull().sum()

Output:
PassengerId 0
Survived 0
Pclass 0
Sex 0
Age 0
SibSp 0
Parch 0
Fare 0
Embarked 0
dtype: int64
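The bullets above suggest telling the algorithm that a value was missing, and the note mentions median imputation as the alternative for skewed data. Both ideas can be combined as in the sketch below. The data is hypothetical and the column name `Age_missing` is an assumption made for illustration:

```python
import pandas as pd
import numpy as np

# Hypothetical toy ages with two missing entries
toy = pd.DataFrame({'Age': [22.0, np.nan, 26.0, 35.0, np.nan]})

# Record which rows were missing before filling them in
toy['Age_missing'] = toy['Age'].isna().astype(int)

# Fill the gaps with the median of the observed ages
age_median = toy['Age'].median()
toy['Age'] = toy['Age'].fillna(age_median)

print(toy)
```

The indicator column keeps the "was missing" information that plain imputation would otherwise erase.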

4. Handling outliers:

Outliers are extreme values that deviate significantly from the majority of the
data. They can negatively impact the analysis and model performance.
Techniques such as clustering, interpolation, or transformation can be used
to handle outliers.
To check for outliers, we generally use a box plot. A box plot, also referred
to as a box-and-whisker plot, is a graphical representation of a dataset’s
distribution. It shows a variable’s median, quartiles, and potential outliers.
The line inside the box denotes the median, while the box itself spans the
interquartile range (IQR). The whiskers extend to the most extreme values
that lie within 1.5 times the IQR of the box edges. Individual points beyond
the whiskers are considered potential outliers. A box plot offers an
easy-to-understand overview of the range of the data and makes it possible to
identify outliers or skewness in the distribution.
Let’s plot the box plot for the Age column.

 Python3

import matplotlib.pyplot as plt

plt.boxplot(df3['Age'], vert=False)

plt.ylabel('Variable')

plt.xlabel('Age')
plt.title('Box Plot')

plt.show()

Output:
[Figure: box plot of Age, titled “Box Plot”]

As we can see from the box-and-whisker plot above, the Age column contains
outliers: values below about 5 and above about 55 appear as outliers.

 Python3

# Calculate summary statistics
mean = df3['Age'].mean()
std = df3['Age'].std()

# Calculate the lower and upper bounds
lower_bound = mean - std*2
upper_bound = mean + std*2

print('Lower Bound :',lower_bound)
print('Upper Bound :',upper_bound)

# Drop the outliers
df4 = df3[(df3['Age'] >= lower_bound)
          & (df3['Age'] <= upper_bound)]

Output:
Lower Bound : 3.705400107925648
Upper Bound : 55.578785285332785
Similarly, we can remove the outliers of the remaining columns.
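The code above uses a mean ± 2 standard deviations rule. The box plot described earlier is based on a different rule, 1.5 times the IQR beyond the quartiles. As a sketch of that alternative, on hypothetical ages rather than the Titanic data:

```python
import pandas as pd

# Hypothetical ages; 2, 54 and 80 lie far from the bulk of the values
ages = pd.Series([22, 38, 26, 35, 35, 27, 54, 2, 80, 29], dtype=float)

# Quartiles and interquartile range
q1 = ages.quantile(0.25)
q3 = ages.quantile(0.75)
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the quartiles, as in a box plot
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# between() is inclusive on both ends by default
is_inlier = ages.between(lower_fence, upper_fence)
inliers = ages[is_inlier]
outliers = ages[~is_inlier]

print('Fences :', lower_fence, upper_fence)
print('Outliers :', outliers.tolist())
```

The two rules generally flag different points, so the choice between them is a judgment call that depends on the shape of the distribution.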
5. Data transformation
Data transformation involves converting the data from one form to another to
make it more suitable for analysis. Techniques such as normalization,
scaling, or encoding can be used to transform the data.
 Data validation and verification: Data validation and verification involve
ensuring that the data is accurate and consistent by comparing it with
external sources or expert knowledge.
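Validation rules drawn from expert knowledge can be written as simple boolean checks. The sketch below is only an illustration: the data is hypothetical, and the rules (ages between 0 and 120, embarkation codes limited to S, C and Q) are assumptions, not rules stated in this document.

```python
import pandas as pd

# Hypothetical rows; the second has an impossible age and an unknown port code
toy = pd.DataFrame({
    'Age': [22.0, -4.0, 35.0],
    'Embarked': ['S', 'X', 'C'],
})

# Assumed rules for illustration only
valid_age = toy['Age'].between(0, 120)
valid_port = toy['Embarked'].isin(['S', 'C', 'Q'])

# Rows that break at least one rule
invalid_rows = toy[~(valid_age & valid_port)]
print(invalid_rows)
```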
For machine learning prediction, we first separate the independent features
from the target. Here we use only ‘Pclass’, ‘Sex’, ‘Age’, ‘SibSp’, ‘Parch’,
‘Fare’, and ‘Embarked’ as the independent features and Survived as the target
variable. PassengerId is left out because it will not affect the survival rate.
 Python3
X = df3[['Pclass','Sex','Age', 'SibSp','Parch','Fare','Embarked']]

Y = df3['Survived']
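X still contains the text columns Sex and Embarked, which, as noted earlier, need converting to numerical types before most models can use them. One common encoding is one-hot encoding with `pd.get_dummies`. This is a sketch on hypothetical rows; the original walkthrough does not show an encoding step.

```python
import pandas as pd

# Hypothetical rows with the two text columns that remain in X
toy = pd.DataFrame({
    'Sex': ['male', 'female', 'female'],
    'Embarked': ['S', 'C', 'S'],
    'Age': [22.0, 38.0, 26.0],
})

# One-hot encode the text columns; numeric columns pass through unchanged
encoded = pd.get_dummies(toy, columns=['Sex', 'Embarked'], dtype=int)
print(encoded.columns.tolist())
print(encoded)
```

Each category becomes its own 0/1 column, so no artificial ordering is imposed on the categories.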

 Data formatting: Data formatting involves converting the data into a
standard format or structure that can be easily processed by the
algorithms or models used for analysis. Here we will discuss commonly
used data formatting techniques i.e. Scaling and Normalization.
Scaling:
 Scaling involves transforming the values of features to a specific range. It
maintains the shape of the original distribution while changing the scale.
 Scaling is particularly useful when features have different scales, and
certain algorithms are sensitive to the magnitude of the features.
 Common scaling methods include Min-Max scaling and Standardization
(Z-score scaling).
Min-Max Scaling:
 Min-Max scaling rescales the values to a specified range, typically
between 0 and 1.
 It preserves the original distribution and ensures that the minimum value
maps to 0 and the maximum value maps to 1.
 Python3

from sklearn.preprocessing import MinMaxScaler

# Initialise the MinMaxScaler
scaler = MinMaxScaler(feature_range=(0, 1))

# Numerical columns
num_col_ = [col for col in X.columns if X[col].dtype != 'object']

# Work on a copy so the original X is not modified
x1 = X.copy()

# Learn the min and max of each numerical column and transform the values
x1[num_col_] = scaler.fit_transform(x1[num_col_])

x1.head()

Output:
   Pclass  Sex     Age       SibSp  Parch  Fare      Embarked
0  1.0     male    0.271174  0.125  0.0    0.014151  S
1  0.0     female  0.472229  0.125  0.0    0.139136  C
2  1.0     female  0.321438  0.000  0.0    0.015469  S
3  0.0     female  0.434531  0.125  0.0    0.103644  S
4  1.0     male    0.434531  0.000  0.0    0.015713  S
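The scaled values can be checked by hand with the min-max formula x' = (x - min) / (max - min). The sketch below assumes that the minimum and maximum of Age in the scaled data are still 0.42 and 80.0, as in the describe() table; the helper name `min_max_scale` is made up for illustration.

```python
# Values taken from the describe() table: Age ranges from 0.42 to 80.0
age_min, age_max = 0.42, 80.0

def min_max_scale(x, lo, hi):
    """Rescale x linearly so that lo maps to 0 and hi maps to 1."""
    return (x - lo) / (hi - lo)

# The first passenger is 22 years old
scaled = min_max_scale(22.0, age_min, age_max)
print(round(scaled, 6))
```

The result agrees with the Age value shown for row 0 in the output above.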
Standardization (Z-score scaling):
 Standardization transforms the values to have a mean of 0 and a
standard deviation of 1.
 It centers the data around the mean and scales it based on the standard
deviation.
 Standardization makes the data more suitable for algorithms that assume
a Gaussian distribution or require features to have zero mean and unit
variance.
Z = (X - μ) / σ
Where,
 X = Data
 μ = Mean value of X
 σ = Standard deviation of X
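The formula can be applied directly with NumPy. This is a minimal sketch on hypothetical ages; it uses the population standard deviation (ddof=0), which is also what scikit-learn's StandardScaler uses.

```python
import numpy as np

# Hypothetical ages used only for illustration
ages = np.array([22.0, 38.0, 26.0, 35.0, 35.0])

mu = ages.mean()
sigma = ages.std()  # population standard deviation (ddof=0)

# Z = (X - mu) / sigma
z = (ages - mu) / sigma

print('Mean after scaling :', round(float(z.mean()), 6))
print('Std after scaling  :', round(float(z.std()), 6))
```

After the transformation the values have a mean of 0 and a standard deviation of 1, up to floating-point rounding.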
Some data cleansing tools:
 OpenRefine
 Trifacta Wrangler
 TIBCO Clarity
 Cloudingo
 IBM InfoSphere QualityStage

Advantages of Data Cleaning in Machine Learning:

1. Improved model performance: Data cleaning helps improve the
performance of the ML model by removing errors, inconsistencies, and
irrelevant data, which can help the model to better learn from the data.
2. Increased accuracy: Data cleaning helps ensure that the data is accurate,
consistent, and free of errors, which can help improve the accuracy of the
ML model.
3. Better representation of the data: Data cleaning allows the data to be
transformed into a format that better represents the underlying
relationships and patterns in the data, making it easier for the ML model
to learn from the data.
4. Improved data quality: Data cleaning helps to improve the quality of the
data, making it more reliable and accurate. This ensures that the machine
learning models are trained on high-quality data, which can lead to better
predictions and outcomes.
5. Improved data security: Data cleaning can help to identify and remove
sensitive or confidential information that could compromise data security.
By eliminating this information, data cleaning can help to ensure that only
the necessary and relevant data is used for machine learning.

Disadvantages of Data Cleaning in Machine Learning:

1. Time-consuming: Data cleaning can be a time-consuming task, especially
for large and complex datasets.
2. Error-prone: Data cleaning can be error-prone, as it involves transforming
and cleaning the data, which can result in the loss of important
information or the introduction of new errors.
3. Limited understanding of the data: Data cleaning can lead to a limited
understanding of the data, as the transformed data may not be
representative of the underlying relationships and patterns in the data.
4. Data loss: Data cleaning can result in the loss of important information
that may be valuable for machine learning analysis. In some cases, data
cleaning may result in the removal of data that appears to be irrelevant or
inconsistent, but which may contain valuable insights or patterns.
5. Cost and resource-intensive: Data cleaning can be a resource-intensive
process that requires significant time, effort, and expertise. It can also
require the use of specialized software tools, which can add to the cost
and complexity of data cleaning.
6. Overfitting: Overfitting occurs when a machine learning model is trained
too closely on a particular dataset, resulting in poor performance when
applied to new or different data. Data cleaning can inadvertently
contribute to overfitting by removing too much data, leading to a loss of
information that could be important for model training and performance.
Conclusion: We have discussed five different steps in data cleaning that
make the data more reliable and help produce good results. After properly
completing these steps, we’ll have a robust dataset that avoids many of the
most common pitfalls. This step should not be rushed, as it benefits every
later stage of the process.
In summary, data cleaning is a crucial step in the data science pipeline that
involves identifying and correcting errors, inconsistencies, and inaccuracies
in the data to improve its quality and usability. It involves various techniques
such as handling missing data, handling outliers, data transformation, data
integration, data validation and verification, and data formatting. The goal of
data cleaning is to prepare the data for analysis and ensure that the insights
derived from it are accurate and reliable.
