Lab 4: Data Wrangling with Python
1. Objective
• Data Wrangling
• Data Cleanup and its usage
• Dealing with Unnecessary Columns
• Manipulating DataFrame in Python
• Dealing with Missing Values
• Discretizing and Binning
• Aggregation and Grouping in DataFrame
• Outlier Detection
• Outlier Removal
2. Data Wrangling
Data wrangling is the process of converting data from its initial format to a format that may be
better suited for analysis.
3. Data Cleanup:
Cleaning up your data is not the most glamorous of tasks, but it’s an essential part of
data wrangling. Becoming a data cleaning expert requires precision and a healthy knowledge of
your area of research or study. Knowing how to properly clean and assemble your data will set
you miles apart from others in your field. Python is well designed for data cleanup; it helps you
build functions around patterns, eliminating repetitive work.
4. Why Clean Data?
Some data may come to you properly formatted and ready to use. If this is the case,
consider yourself lucky! Most data, even if it is cleaned, has some formatting inconsistencies or
readability issues (e.g., acronyms or mismatched description headers). This is especially true if
you are using data from more than one dataset. It’s unlikely your data will properly join and be
useful unless you spend time formatting and standardizing it.
Data scientists spend a large amount of their time cleaning datasets and getting them down to a
form with which they can work. In fact, a lot of data scientists argue that the initial steps of
obtaining and cleaning data constitute 80% of the job.
4.1 Dropping unnecessary columns in data
• Data set we are using: books.txt – A file containing information about books from the
British Library
This lab assumes a basic understanding of the Pandas and NumPy libraries, including Pandas’
workhorse Series and DataFrame objects, common methods that can be applied to these objects,
and familiarity with NumPy’s NaN values.
Let’s import the required modules and get started!
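The examples in this lab use Pandas throughout and NumPy later on (for np.where and NaN values), so a minimal set of imports is:
# pandas for DataFrame handling, NumPy for NaN values and array operations
import pandas as pd
import numpy as np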
Pandas provides a handy way of removing unwanted columns or rows from a DataFrame with
the drop() function. Let’s look at a simple example where we drop a number of columns from a
DataFrame.
First, let’s create a DataFrame out of the file ‘books.txt’.
df = pd.read_csv("books.txt")
print(df.columns)
Removing unnecessary columns
We can see that a handful of columns provide information that would be helpful to the library but
isn’t very descriptive of the books themselves: Edition Statement, Corporate Author, Corporate
Contributors, Former owner, Engraver, Contributors, Issuance type, and Shelfmarks.
We can drop these columns in the following way:
to_drop = ['Edition Statement','Corporate Author','Corporate Contributors',
'Former owner', 'Engraver', 'Contributors', 'Issuance type', 'Shelfmarks']
df.drop(to_drop, inplace=True, axis=1)
Here, we defined a list that contains the names of all the columns we want to drop. Next, we call the
drop() function on our object, passing in the inplace parameter as True and the axis parameter as
1. This tells Pandas that we want the changes to be made directly in our object and that it should
look for the values to be dropped in the columns of the object.
When we inspect the DataFrame again, we’ll see that the unwanted columns have been removed:
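For example, printing the column list again is a quick sanity check:
print(df.columns)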
4.2 Manipulating the Indexes of a DataFrame
A Pandas Index extends the functionality of NumPy arrays to allow for more versatile
slicing and labeling. In many cases, it is helpful to use a uniquely valued identifying field of
the data as its index.
print(df['Identifier'].is_unique)
Let’s replace the existing index with this column using set_index:
df = df.set_index('Identifier')
Now we can extract values from any row by specifying its index (the Identifier column):
print(df.loc[472])
Note: You may have noticed that we reassigned the variable to the object returned by the
method with df = df.set_index(...). This is because, by default, the method returns a
modified copy of our object and does not make the changes directly to the object. We can
avoid this by setting the inplace parameter:
df.set_index('Identifier', inplace=True)
4.3 Dealing with missing values
With every dataset it is vital to evaluate the missing values. How many are there? Is it an error?
Are there too many missing values? Does a missing value have a meaning relative to its context?
We can sum up the total missing values using the following:
# Any missing values?
print(df.isnull().values.any())
# Check a single column
print(df['Publisher'].isnull().values.any())
# Total count of NaN values in each column
print(df.isna().sum())
isnull() and isna() are aliases of each other, so print(df.isnull().sum()) gives the same result.
Now that we have identified our missing values, we have a few options. We can fill them in with
a certain value (zero, mean/max/median by column, string) or drop them by row.
I. Drop null value rows
new = df.dropna(axis = 0, how = 'any')
print(new)
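dropna() also accepts how='all' (drop a row only when every value in it is missing) and a subset parameter that restricts the check to particular columns. A small sketch:
# Drop only the rows where the Publisher column is missing
new = df.dropna(subset=['Publisher'])
print(new)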
II. Fill Values
Oftentimes you’ll have to figure out how you want to handle missing values. Sometimes
you’ll simply want to delete those rows; other times you’ll replace them.
# Replace missing values with a placeholder string
newdf = df.fillna('Test')
More likely, you might want to do a location-based imputation. Here’s how you would do that.
newdf.loc[216,'Publisher'] = 'ICAP'
print(newdf)
III. Drop duplicates
# Read a new call records file that has duplicate data
df1 = pd.read_csv("call records.csv")
print(df1['date'].duplicated().any())
# Drop duplicates, keeping the first occurrence and deleting the rest
df1 = df1.drop_duplicates('date', keep="first")
IV. Fill Data Using the Median
A very common way to replace missing values is to use the median.
# Phone data: replace missing durations with the median duration
median = df1['duration'].median()
df1['duration'].fillna(median, inplace=True)
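Note: in recent versions of pandas, calling fillna() with inplace=True on a single column may raise a chained-assignment warning; the plain assignment form is equivalent and avoids it:
# Equivalent form without inplace
df1['duration'] = df1['duration'].fillna(median)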
4.4 Python Data Aggregation
import pandas as pd
import dateutil.parser
# Convert the date column from strings to datetimes
df1['date'] = df1['date'].apply(dateutil.parser.parse, dayfirst=True)
# How many rows are in the dataset?
print('How many rows are in the dataset: ', df1['item'].count())
# What was the longest phone call / data entry?
print('What was the longest phone call: ', df1['duration'].max() )
# Total recording time?
print('How many seconds recorded in total: ', df1['duration'].sum())
# How many seconds of phone calls are recorded in total?
print('How many seconds of phone calls are recorded in total: ',
      df1['duration'][df1['item'] == 'call'].sum())
# Number of non-null unique network entries
print('Number of non-null unique network entries: ', df1['network'].nunique() )
Note: nunique() counts the number of unique entries; unique() would instead return all the unique values themselves.
# How many entries are there for each month?
print('How many entries are there for each month: ', df1['month'].value_counts())
5. Groups in DataFrame
There’s further power put into your hands by mastering the Pandas groupby()
functionality. groupby() essentially splits the data into different groups depending on a variable of
your choice. For example, the expression df1.groupby('month') will split our current DataFrame
by month.
The groupby() function returns a GroupBy object, which essentially describes how the rows of the
original dataset have been split. The GroupBy object’s .groups attribute is a dictionary whose keys
are the computed unique groups and whose corresponding values are the axis labels belonging to
each group. For example:
print(df1.groupby(['month']).groups.keys())
# groupby() groups the data by month; .keys() returns the names of those
# groups, because the group names are the keys of the .groups dictionary
print(len(df1.groupby(['month']).groups['2014-11']))  # len() gives the number of items in a group
print(len(df1.groupby(['month']).groups['2014-12']))
Functions like max(), min(), mean(), first(), and last() can be quickly applied to the GroupBy object
to obtain summary statistics for each group, which is immensely useful.
# Get the first entry for each month
print( df1.groupby(['month']).first())
# Get the sum of the durations per month
print( df1.groupby(['month'])['duration'].sum())
# Get the number of dates / entries in each month
print( df1.groupby(['month'])['date'].count())
# What is the sum of durations, for calls only, to each network
print(df1[df1['item'] == 'call'].groupby(['network'])['duration'].sum())
You can also group by more than one variable, allowing more complex queries.
# How many calls, sms, and data entries are in each month?
print(df1.groupby(['month', 'item'])['date'].count())
# How many calls, sms, and data are sent per month, split by network_type?
print(df1.groupby(['month', 'network_type'])['date'].count())
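Several summary statistics can also be computed in one call with the agg() method; a small sketch on the same df1:
# Sum, mean, and count of the durations per month in a single call
print(df1.groupby('month')['duration'].agg(['sum', 'mean', 'count']))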
5.1 Detecting and Filtering Outliers
Outlier Identification: There can be many reasons for the presence of outliers in the data.
Sometimes the outliers may be genuine, while in other cases, they could exist because of data entry
errors. It is important to understand the reasons for the outliers before cleaning them. We will start
the process of finding outliers by running the summary statistics on the variables. This is done
using the describe() function below, which provides a statistical summary of all the quantitative
variables.
print(df1.describe())
5.2 Identifying Outliers with Interquartile Range (IQR)
The range is the difference between the maximum and the minimum observation of the
distribution. It is defined by
Range = Xmax – Xmin
Quartiles are the partition values that divide the whole series into four equal parts, so
there are three quartiles. The first quartile, denoted Q1, is known as the lower quartile; the second
quartile, denoted Q2, is the median; and the third quartile, denoted Q3, is known as the upper quartile.
The interquartile range (IQR) is a measure of statistical dispersion and is calculated as the
difference between the 75th and 25th percentiles. It is represented by the formula IQR = Q3 − Q1.
The lines of code below calculate the interquartile range of the duration variable and use the
common 1.5 × IQR rule to flag outliers.
Q1 = df1['duration'].quantile(0.25)
Q3 = df1['duration'].quantile(0.75)
IQR = Q3 - Q1
# Identify outliers
outliers = ((df1['duration'] < (Q1 - 1.5 * IQR)) |
            (df1['duration'] > (Q3 + 1.5 * IQR)))
print(outliers)
Data points marked False are valid, whereas True indicates the presence of an outlier.
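Once this boolean mask is built, inverting it with ~ keeps only the valid rows, for example:
# Keep only the rows that are not flagged as outliers
valid = df1[~outliers]
print(valid['duration'].describe())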
6. Identifying Outliers with Visualization
Box Plot
The box plot is a standardized way of displaying the distribution of data based on the five-number
summary (minimum, first quartile (Q1), median, third quartile (Q3), and maximum). It is
often used to identify data distribution and detect outliers. The lines of code below plot the box
plot of the numeric variable duration.
from matplotlib import pyplot as plt
plt.boxplot(df1["duration"])
plt.show()
Histogram
A histogram is used to visualize the distribution of a numerical variable. An outlier will appear
outside the overall pattern of distribution.
plt.hist(df1['duration'])
plt.show()
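The bin count can be adjusted through the bins parameter for a finer view of the distribution (30 below is an arbitrary choice):
# A finer-grained histogram; 30 bins chosen arbitrarily
plt.hist(df1['duration'], bins=30)
plt.show()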
6.1 Outlier Treatment
In the previous sections, we learned about techniques for outlier detection. However, this is only
half of the task. Once we have identified the outliers, we need to treat them. There are several
techniques for this, and we will discuss the most widely used ones below.
6.1.1 Quantile-based Flooring and Capping
In this technique, we will do the flooring (e.g., at the 10th percentile) for the lower values and
capping (e.g., at the 90th percentile) for the higher values. The lines of code below print the 10th
and 90th percentiles of the variable 'duration', respectively. These values will be used for
quantile-based flooring and capping.
print(df1['duration'].quantile(0.10))
print(df1['duration'].quantile(0.90))
df1["duration"]=np.where(df1["duration"] <1.0, 1.0,df1['duration'])
df1["duration"]=np.where(df1["duration"] >383.4,383.4,df1['duration'])
df1['duration'].describe() # to see how minimum/maximum values changed
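An equivalent and more concise alternative is pandas’ clip() method, which applies the floor and the cap in one call:
# The same flooring and capping using clip()
df1['duration'] = df1['duration'].clip(lower=1.0, upper=383.4)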
6.1.2 Trimming
In this method, we completely remove data points that are outliers. Consider the 'duration'
variable, which had a minimum value of 1 and a maximum value of 383.4.
# Get the index (row locations) of the items that need to be deleted
index = df1[(df1['duration'] >= 383.4) | (df1['duration'] <= 1)].index
df1.drop(index, inplace=True)
print(df1['duration'].describe())
7. Practice Task
Please load the autos.csv data given in the folder.
7.1 Practice Task 1
Find the '?' entries in the given data and replace them with NaN.
7.2 Practice Task 2
Count Missing values in each column and display the results.
7.3 Practice Task 3
Calculate the median value of the 'horsepower' column.
7.4 Practice Task 4
Replace "NaN" in ‘horsepower’ column by median value:
7.5 Practice Task 5
Find the car that has the maximum highway miles per gallon.
7.6 Practice Task 6
Find the details of all Honda cars.
7.7 Practice Task 7
Count the total number of cars per company.
7.8 Practice Task 8
Find each company’s highest-priced car.