Lab 4: Data Wrangling with Python
1. Objective
• Data Wrangling
• Data Cleanup and its usage
• Dealing with Unnecessary Columns
• Manipulating DataFrame in Python
• Dealing with Missing Values
• Discretizing and Binning
• Aggregation and Grouping in DataFrame
• Outlier Detection
• Outlier Removal
2. Data Wrangling
Data wrangling is the process of converting data from its initial format to a format that may be
better suited for analysis.
3. Data Cleanup:
Cleaning up your data is not the most glamorous of tasks, but it’s an essential part of
data wrangling. Becoming a data cleaning expert requires precision and a healthy knowledge of
your area of research or study. Knowing how to properly clean and assemble your data will set
you miles apart from others in your field. Python is well designed for data cleanup; it helps you
build functions around patterns, eliminating repetitive work.
4. Why Clean Data?
Some data may come to you properly formatted and ready to use. If this is the case,
consider yourself lucky! Most data, even if it is cleaned, has some formatting inconsistencies or
readability issues (e.g., acronyms or mismatched description headers). This is especially true if
you are using data from more than one dataset. It’s unlikely your data will properly join and be
useful unless you spend time formatting and standardizing it.
Data scientists spend a large amount of their time cleaning datasets and getting them down to a
form with which they can work. In fact, a lot of data scientists argue that the initial steps of
obtaining and cleaning data constitute 80% of the job.
4.1 Dropping unnecessary columns in data
• Data set we are using: books.txt – A file containing information about books from the
British Library
This lab assumes a basic understanding of the Pandas and NumPy libraries, including Pandas’
workhorse Series and DataFrame objects, common methods that can be applied to these objects,
and familiarity with NumPy’s NaN values.
Let’s import the required modules and get started!
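The examples in this lab use Pandas throughout and NumPy later on (for np.where and NaN values), so a minimal set of imports is:
# pandas for DataFrame handling, NumPy for NaN values and array operations
import pandas as pd
import numpy as np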
Pandas provides a handy way of removing unwanted columns or rows from a DataFrame with
the drop() function. Let’s look at a simple example where we drop a number of columns from a
DataFrame.
First, let’s create a DataFrame out of the file ‘books.txt’.
df = pd.read_csv("books.txt")
print(df.columns)
Removing unnecessary columns
We can see that a handful of columns provide information that would be helpful to the library but
isn’t very descriptive of the books themselves: Edition Statement, Corporate Author, Corporate
Contributors, Former owner, Engraver, Contributors, Issuance type, and Shelfmarks.
We can drop these columns in the following way:
to_drop = ['Edition Statement','Corporate Author','Corporate Contributors',
'Former owner', 'Engraver', 'Contributors', 'Issuance type', 'Shelfmarks']
df.drop(to_drop, inplace=True, axis=1)
Here, we defined a list that contains the names of all the columns we want to drop. Next, we call the
drop() function on our object, passing in the inplace parameter as True and the axis parameter as
1. This tells Pandas that we want the changes to be made directly in our object and that it should
look for the values to be dropped in the columns of the object.
When we inspect the DataFrame again, we’ll see that the unwanted columns have been removed:
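For example, printing the column list again is a quick sanity check:
print(df.columns)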
4.2 Manipulating the Indexes of a DataFrame
A Pandas Index extends the functionality of NumPy arrays to allow for more versatile
slicing and labeling. In many cases, it is helpful to use a uniquely valued identifying field of
the data as its index.
print(df['Identifier'].is_unique)
Let’s replace the existing index with this column using set_index:
df = df.set_index('Identifier')
Now we can extract values from any row by specifying its index (the Identifier column):
print(df.loc[472])
Note: You may have noticed that we reassigned the variable to the object returned by the
method with df = df.set_index(...). This is because, by default, the method returns a
modified copy of our object and does not make the changes directly to the object. We can
avoid this by setting the inplace parameter:
df.set_index('Identifier', inplace=True)
4.3 Dealing with missing values
With every dataset it is vital to evaluate the missing values. How many are there? Is it an error?
Are there too many missing values? Does a missing value have a meaning relative to its context?
We can sum up the total missing values using the following:
# Any missing values?
print(df.isnull().values.any())
# Check a single column
print(df['Publisher'].isnull().values.any())
# Total count of NaN values in each column
print(df.isna().sum())
isnull() and isna() are aliases of each other, so print(df.isnull().sum()) gives the same result.
Now that we have identified our missing values, we have a few options. We can fill them in with
a certain value (zero, mean/max/median by column, string) or drop them by row.
I. Drop null value rows
new = df.dropna(axis = 0, how = 'any')
print(new)
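dropna() also accepts how='all' (drop a row only when every value in it is missing) and a subset parameter that restricts the check to particular columns. A small sketch:
# Drop only the rows where the Publisher column is missing
new = df.dropna(subset=['Publisher'])
print(new)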
II. Fill Values
Oftentimes you’ll have to figure out how you want to handle missing values. Sometimes
you’ll simply want to delete those rows; other times you’ll replace them.
# Replace missing values with a placeholder string
newdf = df.fillna('Test')
More likely, you might want to do a location-based imputation. Here’s how you would do that.
newdf.loc[216,'Publisher'] = 'ICAP'
print(newdf)
III. Drop duplicates
# Read a new call records file that has duplicate data
df1 = pd.read_csv("call records.csv")
print(df1['date'].duplicated().any())
# Drop duplicates, keeping the first occurrence and deleting the rest
df1 = df1.drop_duplicates('date', keep="first")
IV. Fill Data Using the Median
A very common way to replace missing values is to use the median.
# Phone data: replace missing durations with the median duration
median = df1['duration'].median()
df1['duration'].fillna(median, inplace=True)
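Note: in recent versions of pandas, calling fillna() with inplace=True on a single column may raise a chained-assignment warning; the plain assignment form is equivalent and avoids it:
# Equivalent form without inplace
df1['duration'] = df1['duration'].fillna(median)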
4.4 Python Data Aggregation
import pandas as pd
import dateutil.parser
# Convert the date column from strings to datetimes
df1['date'] = df1['date'].apply(dateutil.parser.parse, dayfirst=True)
# How many rows are in the dataset?
print('How many rows are in the dataset: ', df1['item'].count())
# What was the longest phone call / data entry?
print('What was the longest phone call: ', df1['duration'].max() )
# Total recording time?
print('How many seconds recorded in total: ', df1['duration'].sum())
# How many seconds of phone calls are recorded in total?
print('How many seconds of phone calls are recorded in total: ',
      df1['duration'][df1['item'] == 'call'].sum())
# Number of non-null unique network entries
print('Number of non-null unique network entries: ', df1['network'].nunique() )
Note: nunique() counts the number of unique entries; unique() would instead return all the unique values themselves.
# How many entries are there for each month?
print('How many entries are there for each month: ', df1['month'].value_counts())
5. Groups in DataFrame
There’s further power put into your hands by mastering the Pandas groupby()
functionality. groupby() essentially splits the data into different groups depending on a variable of
your choice. For example, the expression df1.groupby('month') will split our current DataFrame
by month.
The groupby() function returns a GroupBy object, which essentially describes how the rows of the
original dataset have been split. The GroupBy object’s .groups attribute is a dictionary whose keys
are the computed unique groups and whose corresponding values are the axis labels belonging to
each group. For example:
print(df1.groupby(['month']).groups.keys())
# groupby() groups the data by month; .keys() returns the names of those
# groups, because the group names are the keys of the .groups dictionary
print(len(df1.groupby(['month']).groups['2014-11']))  # len() gives the number of items in a group
print(len(df1.groupby(['month']).groups['2014-12']))
Functions like max(), min(), mean(), first(), and last() can be quickly applied to the GroupBy object
to obtain summary statistics for each group, which is immensely useful.
# Get the first entry for each month
print( df1.groupby(['month']).first())
# Get the sum of the durations per month
print( df1.groupby(['month'])['duration'].sum())
# Get the number of dates / entries in each month
print( df1.groupby(['month'])['date'].count())
# What is the sum of durations, for calls only, to each network
print(df1[df1['item'] == 'call'].groupby(['network'])['duration'].sum())
You can also group by more than one variable, allowing more complex queries.
# How many calls, sms, and data entries are in each month?
print(df1.groupby(['month', 'item'])['date'].count())
# How many calls, sms, and data are sent per month, split by network_type?
print(df1.groupby(['month', 'network_type'])['date'].count())
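Several summary statistics can also be computed in one call with the agg() method; a small sketch on the same df1:
# Sum, mean, and count of the durations per month in a single call
print(df1.groupby('month')['duration'].agg(['sum', 'mean', 'count']))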
5.1 Detecting and Filtering Outliers
Outlier Identification: There can be many reasons for the presence of outliers in the data.
Sometimes the outliers may be genuine, while in other cases, they could exist because of data entry
errors. It is important to understand the reasons for the outliers before cleaning them. We will start
the process of finding outliers by running the summary statistics on the variables. This is done
using the describe() function below, which provides a statistical summary of all the quantitative
variables.
print(df1.describe())
5.2 Identifying Outliers with Interquartile Range (IQR)
The range is the difference between the maximum and the minimum observation of the
distribution. It is defined by
Range = Xmax – Xmin
Quartiles are the partition values that divide the whole series into four equal parts, so
there are three quartiles. The first quartile, denoted Q1, is known as the lower quartile; the second
quartile, denoted Q2, is the median; and the third quartile, denoted Q3, is known as the upper quartile.
The interquartile range (IQR) is a measure of statistical dispersion and is calculated as the
difference between the 75th and 25th percentiles. It is represented by the formula IQR = Q3 − Q1.
The lines of code below calculate the interquartile range of the duration variable and use the
common 1.5 × IQR rule to flag outliers.
Q1 = df1['duration'].quantile(0.25)
Q3 = df1['duration'].quantile(0.75)
IQR = Q3 - Q1
# Identify outliers
outliers = ((df1['duration'] < (Q1 - 1.5 * IQR)) |
            (df1['duration'] > (Q3 + 1.5 * IQR)))
print(outliers)
Data points marked False are valid, whereas True indicates the presence of an outlier.
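Once this boolean mask is built, inverting it with ~ keeps only the valid rows, for example:
# Keep only the rows that are not flagged as outliers
valid = df1[~outliers]
print(valid['duration'].describe())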
6. Identifying Outliers with Visualization
Box Plot
The box plot is a standardized way of displaying the distribution of data based on the five-number
summary (minimum, first quartile (Q1), median, third quartile (Q3), and maximum). It is
often used to identify data distribution and detect outliers. The lines of code below plot the box
plot of the numeric variable duration.
from matplotlib import pyplot as plt
plt.boxplot(df1["duration"])
plt.show()
Histogram
A histogram is used to visualize the distribution of a numerical variable. An outlier will appear
outside the overall pattern of distribution.
plt.hist(df1['duration'])
plt.show()
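The bin count can be adjusted through the bins parameter for a finer view of the distribution (30 below is an arbitrary choice):
# A finer-grained histogram; 30 bins chosen arbitrarily
plt.hist(df1['duration'], bins=30)
plt.show()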
6.1 Outlier Treatment
In the previous sections, we learned about techniques for outlier detection. However, this is only
half of the task. Once we have identified the outliers, we need to treat them. There are several
techniques for this, and we will discuss the most widely used ones below.
6.1.1 Quantile-based Flooring and Capping
In this technique, we will do the flooring (e.g., at the 10th percentile) for the lower values and
capping (e.g., at the 90th percentile) for the higher values. The lines of code below print the 10th
and 90th percentiles of the variable 'duration', respectively. These values will be used for
quantile-based flooring and capping.
print(df1['duration'].quantile(0.10))
print(df1['duration'].quantile(0.90))
df1["duration"]=np.where(df1["duration"] <1.0, 1.0,df1['duration'])
df1["duration"]=np.where(df1["duration"] >383.4,383.4,df1['duration'])
df1['duration'].describe() # to see how minimum/maximum values changed
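An equivalent and more concise alternative is pandas’ clip() method, which applies the floor and the cap in one call:
# The same flooring and capping using clip()
df1['duration'] = df1['duration'].clip(lower=1.0, upper=383.4)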
6.1.2 Trimming
In this method, we completely remove data points that are outliers. Consider the 'duration'
variable, which had a minimum value of 1 and a maximum value of 383.4.
# Get the index (row locations) of the items that need to be deleted
index = df1[(df1['duration'] >= 383.4) | (df1['duration'] <= 1)].index
df1.drop(index, inplace=True)
print(df1['duration'].describe())
7. Practice Task
Please load the autos.csv data given in the folder.
7.1 Practice Task 1
Find the '?' entries in the given data and replace them with NaN.
7.2 Practice Task 2
Count Missing values in each column and display the results.
7.3 Practice Task 3
Calculate the median value of the 'horsepower' column.
7.4 Practice Task 4
Replace "NaN" in ‘horsepower’ column by median value:
7.5 Practice Task 5
Find the car that has the maximum highway miles per gallon.
7.6 Practice Task 6
Find the details of all Honda cars.
7.7 Practice Task 7
Count the total number of cars per company.
7.8 Practice Task 8
Find each company’s highest-priced car.