Data Cleaning
Data is one of the most valuable assets for analytics and machine learning, and it is needed everywhere in computing and business. Real-world data, however, often contains incomplete, inconsistent, or missing values. If the data is corrupted, it can hinder the analysis or produce inaccurate results. Let's look at an example of why data cleaning matters.
Suppose you are the general manager of a company that collects data about the customers who buy its products. You want to know which products people are most interested in so that you can increase production of those products. But if the data is corrupted or contains missing values, it will misguide you, and you will struggle to make the correct decision.
Now let's take a closer look at the different ways of cleaning data.
Inconsistent columns:
import pandas as pd

data = {'Name': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H'],
        'Height': [5.2, 5.7, 5.6, 5.5, 5.3, 5.8, 5.6, 5.5],
        'Roll': [55, 99, 15, 80, 1, 12, 47, 104],
        'Department': ['CSE', 'EEE', 'BME', 'CSE', 'ME', 'ME', 'CE', 'CSE'],
        'Address': ['polashi', 'banani', 'farmgate', 'mirpur', 'dhanmondi', 'ishwardi', 'khulna', 'uttara']}
df = pd.DataFrame(data)
print(df)
Figure 2: Student data set
Let us drop the Height column. To do this, pass the column name to the columns keyword of pandas.DataFrame.drop.
df=df.drop(columns='Height')
print(df.head())
Figure 3: “Height” column dropped
Missing data:
It is rare to find a real-world dataset without any missing values; when you start working with real-world data, you will see that most datasets contain them. Handling missing values is important because leaving them as they are can distort your analysis and machine learning models. So you need to check whether your dataset contains missing values, and if it does, you must handle them. For any missing values you find, you can do one of three things:
1. Leave them as they are
2. Fill them in
3. Drop them
Different methods can be used to fill in missing values. For example, Figure 4 shows that the airquality dataset has missing values; NaN indicates a missing value at that position. After finding missing values in your dataset, you can use pandas.DataFrame.fillna to fill them.
You can fill missing values with different statistical measures according to your needs. For example, in Figure 5 we fill the missing values with the column mean.
airquality['Ozone'] = airquality['Ozone'].fillna(airquality.Ozone.mean())
airquality.head()
Figure 5: Filling missing values with the mean value.
You can see that the missing values in the "Ozone" column are filled with the mean of that column.
You can also drop the rows or columns where missing values are found, with the help of pandas.DataFrame.dropna. Here we drop the rows containing missing values.
airquality = airquality.dropna()
airquality.head()
Figure 6: Rows are dropped having at least one missing value.
Here, in Figure 6, you can see that the rows with missing values in the Solar.R column have been dropped.
airquality.isnull().sum(axis=0)
Figure 7: The number of missing values in each column.
Outliers:
If you are new to data science, the first question that will arise in your head is "what do these outliers mean?" Let's talk about outliers first, and then about detecting them in a dataset and what to do after detecting them.
According to wikipedia,
“In statistics, an outlier is a data point that differs significantly from other
observations.”
That means an outlier is a data point that is significantly different from the other data points in the dataset. Outliers can be created by errors in experiments or by variability in measurements. Let's look at an example to make the concept clear.
Figure 8: Table contains outlier.
In Figure 8, all the values in the math column are in the range 90–95 except 20, which is significantly different from the others. It could be an input error in the dataset, so we can call it an outlier. One thing should be added here: "Not all outliers are bad data points. Some can be errors, but others are valid values."
So, now the question is: how can we detect outliers in a dataset? Some common methods are:
1. Box Plot
2. Scatter plot
3. Z-score etc.
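As a minimal sketch of the Z-score method, here is how it could flag the value 20 from a small series of marks like the one in Figure 8 (the exact values and the |z| > 2 cutoff are assumptions for illustration):

```python
import pandas as pd

# Hypothetical marks similar to the table in Figure 8:
# every value sits between 90 and 95 except the 20.
marks = pd.Series([92, 94, 90, 95, 91, 20])

# Z-score: how many standard deviations each value is from the mean.
z = (marks - marks.mean()) / marks.std()

# A common (adjustable) rule flags |z| > 2 as an outlier.
outliers = marks[z.abs() > 2]
print(outliers)
```

With this rule, only the 20 is flagged; the threshold can be tightened or loosened depending on how aggressive you want the detection to be.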
We will use the scatter plot method here. Let's draw a scatter plot of the dataset with the outliers removed.
import matplotlib.pyplot as plt

# keep only the rows whose total_est_fee falls below the outlier cutoff
df_removed_outliers = dataset[dataset.total_est_fee < 17500]
# plot the remaining fees against the row index
plt.scatter(df_removed_outliers.index, df_removed_outliers['total_est_fee'])
plt.show()
Figure 10: Scatter plotting with removed outliers.
Duplicate rows:
Datasets may contain duplicate entries. Deleting duplicate rows is one of the easiest cleaning tasks; you can use pandas.DataFrame.drop_duplicates —
dataset = dataset.drop_duplicates()
print(dataset)
Tidy data:
A tidy dataset means each column represents a separate variable and each row represents an individual observation. In untidy data, columns may represent values rather than variables. Tidy data makes it easier to fix common data problems. You can turn untidy data into tidy data by using pandas.melt.
You can also see pandas.DataFrame.pivot for un-melting the tidy data.
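To make this concrete, here is a small sketch of melting and un-melting, using a hypothetical untidy table whose treatment columns hold values rather than variables:

```python
import pandas as pd

# Hypothetical untidy data: the two treatment columns hold values,
# not variables.
untidy = pd.DataFrame({
    'Name': ['A', 'B'],
    'treatment_a': [10, 24],
    'treatment_b': [17, 30],
})

# pandas.melt turns those columns into (variable, value) pairs.
tidy = pd.melt(untidy, id_vars='Name',
               var_name='treatment', value_name='result')
print(tidy)

# pandas.DataFrame.pivot reverses the operation (un-melts the tidy data).
wide_again = tidy.pivot(index='Name', columns='treatment',
                        values='result')
print(wide_again)
```

After melting, each row holds one observation (a name, a treatment, a result), which is the tidy shape described above.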
Data types:
A pandas column can hold different kinds of data, for example:
1. Categorical data
2. Object data
3. Numeric data
4. Boolean data
A column's data type can change for various reasons, or a column can arrive with an inconsistent data type. You can convert from one data type to another by using pandas.DataFrame.astype.
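As a small sketch of such a conversion (the frame below is hypothetical, loosely based on the student data from earlier), astype can take a single type or a column-to-type mapping:

```python
import pandas as pd

# Hypothetical frame where Roll arrived as strings and
# Department would be better stored as a categorical.
df = pd.DataFrame({
    'Roll': ['55', '99', '15'],
    'Department': ['CSE', 'EEE', 'CSE'],
})

# Convert one column at a time...
df['Roll'] = df['Roll'].astype(int)
# ...or several at once by passing a {column: dtype} mapping.
df = df.astype({'Department': 'category'})

print(df.dtypes)
```

Checking df.dtypes before and after a conversion is a quick way to confirm the cast worked as intended.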
String manipulation:
One of the most important and interesting parts of data cleaning is string manipulation. In the real world, most data is unstructured. String manipulation means changing, parsing, matching, or analyzing strings, and it usually requires some knowledge of regular expressions. Sometimes you need to extract a value from a larger sentence, and this is where string manipulation really pays off. Say we have:
“This umbrella costs $12 and he took this money from his mother.”
If you want to extract the "$12" information from the sentence, you have to build a regular expression that matches that pattern and then apply it with one of the many built-in and external Python libraries for string manipulation.
import re

# \$\d+ matches a literal dollar sign followed by one or more digits
pattern = re.compile(r'\$\d+')
result = pattern.match("$12312312")
print(bool(result))
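Going back to the umbrella sentence, the same pattern can pull the "$12" out of the surrounding text with re.search, which scans the whole string rather than matching only at the start:

```python
import re

sentence = "This umbrella costs $12 and he took this money from his mother."

# \$\d+ matches a dollar sign followed by one or more digits,
# anywhere in the sentence.
match = re.search(r'\$\d+', sentence)
if match:
    print(match.group())  # $12
```

re.match anchors at the beginning of the string, so re.search is the right tool when the value sits in the middle of a sentence like this.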
Data Concatenation:
In this modern era of data science, the volume of data is increasing day by day, and large datasets are often stored in separate files. If you work with multiple files, you can concatenate them for simplicity using pandas.concat.
concatenated_data=pd.concat([dataset1,dataset2])
print(concatenated_data)
Figure 15: Concatenated dataset.
https://towardsdatascience.com/what-is-data-cleaning-how-to-process-data-for-analytics-and-machine-learning-modeling-c2afcf4fbf45