Data Cleaning
Data is one of the most valuable assets for analytics and machine learning, and it is needed everywhere in computing and business. Real-world data, however, often contains incomplete, inconsistent, or missing values. If the data is corrupted, it can hinder the analysis or produce inaccurate results. Let's look at an example of why data cleaning matters.
Suppose you are the general manager of a company that collects data about the customers who buy its products. You want to know which products people are most interested in so that you can increase production of those products. But if the data is corrupted or contains missing values, it will misguide you, and you will struggle to make the correct decision.
Now let's take a closer look at the different ways of cleaning data.
Inconsistent columns:
import pandas as pd

data = {'Name': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H'],
        'Height': [5.2, 5.7, 5.6, 5.5, 5.3, 5.8, 5.6, 5.5],
        'Roll': [55, 99, 15, 80, 1, 12, 47, 104],
        'Department': ['CSE', 'EEE', 'BME', 'CSE', 'ME', 'ME', 'CE', 'CSE'],
        'Address': ['polashi', 'banani', 'farmgate', 'mirpur', 'dhanmondi', 'ishwardi', 'khulna', 'uttara']}
df = pd.DataFrame(data)
print(df)
Figure 2: Student data set
Let us drop the Height column. To do this, pass the column name to the columns keyword of pandas.DataFrame.drop.
df=df.drop(columns='Height')
print(df.head())
Figure 3: “Height” column dropped
Missing data:
It is rare to find a real-world dataset without any missing values; when you start working with real-world data, you will see that most datasets contain them. Handling missing values is important because leaving them as they are can distort your analysis and machine learning models. So you need to check whether your dataset contains missing values, and if it does, you must handle them. For any missing values you find, you can do one of three things:
1. Leave them as they are
2. Fill them in
3. Drop them
Different methods can be used to fill in missing values. For example, Figure 4 shows that the airquality dataset has missing values; NaN indicates a missing value at that position. After finding missing values in your dataset, you can use pandas.DataFrame.fillna to fill them.
You can fill missing values with different statistical measures according to your needs. For example, in Figure 5 we fill the missing values with the column mean.
airquality['Ozone'] = airquality['Ozone'].fillna(airquality.Ozone.mean())
airquality.head()
Figure 5: Filling missing values with the mean value.
You can see that the missing values in the "Ozone" column are filled with the mean of that column.
You can also drop the rows or columns where missing values are found, with the help of pandas.DataFrame.dropna. Here we drop the rows containing missing values.
airquality = airquality.dropna()
airquality.head()
Figure 6: Rows are dropped having at least one missing value.
Here, in Figure 6, you can see that the rows with missing values in the Solar.R column have been dropped.
airquality.isnull().sum(axis=0)
Figure 7: The number of missing values in each column.
Outliers:
If you are new to data science, the first question that will arise in your head is "what do these outliers mean?" Let's talk about outliers first, and then about detecting them in a dataset and what to do after detecting them.
According to wikipedia,
“In statistics, an outlier is a data point that differs significantly from other
observations.”
That means an outlier is a data point that is significantly different from the other data points in the dataset. Outliers can be created by errors in experiments or by variability in measurements. Let's look at an example to make the concept clear.
Figure 8: Table contains outlier.
In Figure 8, all the values in the math column are in the range 90–95 except 20, which is significantly different from the others. It could be an input error in the dataset, so we can call it an outlier. One thing should be added here: "Not all outliers are bad data points. Some can be errors, but others are valid values."
So, now the question is: how can we detect outliers in a dataset? Some common methods are:
1. Box Plot
2. Scatter plot
3. Z-score etc.
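As a minimal sketch of the Z-score method, here is how it could flag the value 20 from a small series of marks like the one in Figure 8 (the exact values and the |z| > 2 cutoff are assumptions for illustration):

```python
import pandas as pd

# Hypothetical marks similar to the table in Figure 8:
# every value sits between 90 and 95 except the 20.
marks = pd.Series([92, 94, 90, 95, 91, 20])

# Z-score: how many standard deviations each value is from the mean.
z = (marks - marks.mean()) / marks.std()

# A common (adjustable) rule flags |z| > 2 as an outlier.
outliers = marks[z.abs() > 2]
print(outliers)
```

With this rule, only the 20 is flagged; the threshold can be tightened or loosened depending on how aggressive you want the detection to be.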
We will use the scatter plot method here. Let's draw a scatter plot of the dataset with the outliers removed.
import matplotlib.pyplot as plt

# keep only the rows whose total_est_fee falls below the outlier cutoff
df_removed_outliers = dataset[dataset.total_est_fee < 17500]
# plot the remaining fees against the row index
plt.scatter(df_removed_outliers.index, df_removed_outliers['total_est_fee'])
plt.show()
Figure 10: Scatter plotting with removed outliers.
Duplicate rows:
Datasets may contain duplicate entries. Deleting duplicate rows is one of the easiest cleaning tasks; you can use pandas.DataFrame.drop_duplicates —
dataset = dataset.drop_duplicates()
print(dataset)
Tidy data:
A tidy dataset means each column represents a separate variable and each row represents an individual observation. In untidy data, columns may represent values rather than variables. Tidy data makes it easier to fix common data problems. You can turn untidy data into tidy data by using pandas.melt.
You can also see pandas.DataFrame.pivot for un-melting the tidy data.
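To make this concrete, here is a small sketch of melting and un-melting, using a hypothetical untidy table whose treatment columns hold values rather than variables:

```python
import pandas as pd

# Hypothetical untidy data: the two treatment columns hold values,
# not variables.
untidy = pd.DataFrame({
    'Name': ['A', 'B'],
    'treatment_a': [10, 24],
    'treatment_b': [17, 30],
})

# pandas.melt turns those columns into (variable, value) pairs.
tidy = pd.melt(untidy, id_vars='Name',
               var_name='treatment', value_name='result')
print(tidy)

# pandas.DataFrame.pivot reverses the operation (un-melts the tidy data).
wide_again = tidy.pivot(index='Name', columns='treatment',
                        values='result')
print(wide_again)
```

After melting, each row holds one observation (a name, a treatment, a result), which is the tidy shape described above.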
Data types:
A pandas column can hold different kinds of data, for example:
1. Categorical data
2. Object data
3. Numeric data
4. Boolean data
A column's data type can change for various reasons, or a column can arrive with an inconsistent data type. You can convert from one data type to another by using pandas.DataFrame.astype.
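As a small sketch of such a conversion (the frame below is hypothetical, loosely based on the student data from earlier), astype can take a single type or a column-to-type mapping:

```python
import pandas as pd

# Hypothetical frame where Roll arrived as strings and
# Department would be better stored as a categorical.
df = pd.DataFrame({
    'Roll': ['55', '99', '15'],
    'Department': ['CSE', 'EEE', 'CSE'],
})

# Convert one column at a time...
df['Roll'] = df['Roll'].astype(int)
# ...or several at once by passing a {column: dtype} mapping.
df = df.astype({'Department': 'category'})

print(df.dtypes)
```

Checking df.dtypes before and after a conversion is a quick way to confirm the cast worked as intended.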
String manipulation:
One of the most important and interesting parts of data cleaning is string manipulation. In the real world, most data is unstructured. String manipulation means changing, parsing, matching, or analyzing strings, and it usually requires some knowledge of regular expressions. Sometimes you need to extract a value from a larger sentence, and this is where string manipulation really pays off. Say we have:
“This umbrella costs $12 and he took this money from his mother.”
If you want to extract the "$12" information from the sentence, you have to build a regular expression that matches that pattern and then apply it with one of the many built-in and external Python libraries for string manipulation.
import re

# \$\d+ matches a literal dollar sign followed by one or more digits
pattern = re.compile(r'\$\d+')
result = pattern.match("$12312312")
print(bool(result))
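Going back to the umbrella sentence, the same pattern can pull the "$12" out of the surrounding text with re.search, which scans the whole string rather than matching only at the start:

```python
import re

sentence = "This umbrella costs $12 and he took this money from his mother."

# \$\d+ matches a dollar sign followed by one or more digits,
# anywhere in the sentence.
match = re.search(r'\$\d+', sentence)
if match:
    print(match.group())  # $12
```

re.match anchors at the beginning of the string, so re.search is the right tool when the value sits in the middle of a sentence like this.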
Data Concatenation:
In this modern era of data science, the volume of data is increasing day by day, and large datasets are often stored in separate files. If you work with multiple files, you can concatenate them for simplicity using pandas.concat.
concatenated_data=pd.concat([dataset1,dataset2])
print(concatenated_data)
Figure 15: Concatenated dataset.
https://towardsdatascience.com/what-is-data-cleaning-how-to-process-data-for-analytics-and-machine-learning-modeling-c2afcf4fbf45