Unit - 4 - Part 2

Data wrangling is a crucial step in data science that involves transforming raw data into a more suitable format for analysis, including tasks like data exploration, handling null values, and feature engineering. The document outlines various Python libraries such as NumPy, Pandas, and SciKit-Learn that facilitate data manipulation and analysis. It also covers methods for reading, filtering, grouping, and aggregating data using Pandas DataFrames.


DATA WRANGLING

Data wrangling (otherwise known as data munging or preprocessing) is a key component of any data science project. Wrangling is the process of transforming "raw" data to make it more suitable for analysis, improving the quality of your data. It typically involves:
• Data Exploration: checking feature data types, unique values, and describing the data.
• Null Values: counting null values and deciding what to do with them.
• Reshaping and Feature Engineering: transforming raw data into a more useful format. Examples of feature engineering include one-hot encoding, aggregation, joins, and grouping.
• Text Processing: BeautifulSoup and regular expressions (among other tools) are often used to clean and extract web-scraped text from HTML and XML documents.

AGENDA
• Reading data
• Selecting and filtering data
• Grouping
• Slicing
• Data manipulation
• loc, iloc
• Sorting
• Handling missing values
• Aggregation

Python Libraries for Data Science
NumPy:
▪ introduces objects for multidimensional arrays and matrices, as well as functions that make it easy to perform advanced mathematical and statistical operations on those objects
▪ provides vectorization of mathematical operations on arrays and matrices, which significantly improves performance (a sketch follows below)
▪ many other Python libraries are built on NumPy

Link: http://www.numpy.org/
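A minimal sketch of the vectorization idea (the arrays here are illustrative):

In [ ]: #Element-wise operations run in compiled code, with no Python loop
import numpy as np
a = np.array([1.0, 2.0, 3.0])
b = np.array([10.0, 20.0, 30.0])
a + b         # array([11., 22., 33.])
a * b         # array([10., 40., 90.])
np.sqrt(a)    # element-wise square root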

Python Libraries for Data Science
SciPy:
▪ collection of algorithms for linear algebra, differential equations, numerical integration, optimization, statistics and more
▪ part of the SciPy Stack
▪ built on NumPy

Link: https://www.scipy.org/scipylib/
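For example, a small numerical-integration sketch (the integrand is illustrative):

In [ ]: #Numerically integrate x^2 from 0 to 1 (exact answer: 1/3)
from scipy import integrate
value, error = integrate.quad(lambda x: x ** 2, 0, 1)
value    # approximately 0.3333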

Python Libraries for Data Science
Pandas:
▪ adds data structures and tools designed to work with table-like data (similar to Series and Data Frames in R)
▪ provides tools for data manipulation: reshaping, merging, sorting, slicing, aggregation etc.
▪ allows handling missing data

Link: http://pandas.pydata.org/
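A minimal sketch of the idea (the df_demo name and its columns are made up for illustration):

In [ ]: #A small table built from a dict of columns
import pandas as pd
df_demo = pd.DataFrame({'name': ['Alice', 'Bob'],
                        'salary': [85000, 92000]})
df_demo['salary'].mean()    # 88500.0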

Python Libraries for Data Science
SciKit-Learn:
▪ provides machine learning algorithms: classification, regression, clustering, model validation etc.
▪ built on NumPy, SciPy and matplotlib

Link: http://scikit-learn.org/
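A minimal sketch of the typical fit/score workflow, using the iris dataset bundled with SciKit-Learn:

In [ ]: #Fit a classifier and score it on held-out data
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
model.score(X_test, y_test)    # accuracy on the test split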

Python Libraries for Data Science
matplotlib:
▪ Python 2D plotting library which produces publication-quality figures in a variety of hardcopy formats
▪ a set of functionalities similar to those of MATLAB
▪ line plots, scatter plots, bar charts, histograms, pie charts etc.
▪ relatively low-level; some effort is needed to create advanced visualizations

Link: https://matplotlib.org/
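A minimal line-plot sketch (the data and file name are illustrative):

In [ ]: #A basic line plot saved to a hardcopy format
import matplotlib.pyplot as plt
plt.plot([1, 2, 3, 4], [1, 4, 9, 16])
plt.xlabel('x')
plt.ylabel('x squared')
plt.savefig('squares.png')    # PNG, PDF, SVG etc. are supported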

Python Libraries for Data Science
Seaborn:
▪ based on matplotlib
▪ provides a high-level interface for drawing attractive statistical graphics
▪ similar (in style) to the popular ggplot2 library in R

Link: https://seaborn.pydata.org/
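A minimal sketch using one of seaborn's bundled example datasets:

In [ ]: #One line of seaborn for a statistical plot
import seaborn as sns
tips = sns.load_dataset('tips')    # downloads a small sample dataset
sns.boxplot(x='day', y='total_bill', data=tips)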

Loading Python Libraries

In [ ]: #Import Python Libraries
import numpy as np
import scipy as sp
import pandas as pd
import matplotlib as mpl
import seaborn as sns

Press Shift+Enter to execute the Jupyter cell.

Reading data using pandas
In [ ]: #Read csv file
df = pd.read_csv("http://rcs.bu.edu/examples/python/data_analysis/Salaries.csv")

Note: the above command has many optional arguments to fine-tune the data import process.

There are a number of pandas commands to read other data formats:

pd.read_excel('myfile.xlsx',sheet_name='Sheet1',
index_col=None, na_values=['NA'])
pd.read_stata('myfile.dta')
pd.read_sas('myfile.sas7bdat')
pd.read_hdf('myfile.h5','df')

Exploring data frames

In [3]: #List first 5 records
df.head()

Data Frame data types

Pandas Type                Native Python Type    Description
object                     string                The most general dtype. Will be assigned to your column if the column has mixed types (numbers and strings).
int64                      int                   Numeric values. 64 refers to the number of bits allocated to hold the value.
float64                    float                 Numeric values with decimals. If a column contains numbers and NaNs (see below), pandas will default to float64, in case your missing value has a decimal.
datetime64, timedelta[ns]  N/A (see the datetime module in Python's standard library)    Values meant to hold time data. Look into these for time series experiments.

Data Frame data types

In [4]: #Check a particular column type
df['salary'].dtype
Out[4]: dtype('int64')

In [5]: #Check types for all the columns
df.dtypes
Out[5]: rank          object
        discipline    object
        phd            int64
        service        int64
        sex           object
        salary         int64
        dtype: object

Data Frames attributes

Python objects have attributes and methods.

df.attribute description
dtypes list the types of the columns
columns list the column names
axes list the row labels and column names
ndim number of dimensions
size number of elements
shape return a tuple representing the dimensionality
values numpy representation of the data
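For instance, checking a few of these attributes on the df loaded earlier (the printed values depend on the data):

In [ ]: #Attributes are accessed without parentheses
df.shape     # (number of rows, number of columns) as a tuple
df.columns   # the column names
df.ndim      # 2 for a data frame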

Data Frames methods

Unlike attributes, Python methods have parentheses.

All attributes and methods can be listed with the dir() function:
dir(df)

df.method() description
head( [n] ), tail( [n] ) first/last n rows

describe() generate descriptive statistics (for numeric columns only)

max(), min() return max/min values for all numeric columns

mean(), median() return mean/median values for all numeric columns

std() standard deviation

sample([n]) returns a random sample of the data frame

dropna() drop all the records with missing values

Selecting a column in a Data Frame
Method 1: Subset the data frame using the column name:
df['sex']

Method 2: Use the column name as an attribute:
df.sex

Note: there is a built-in rank() method on pandas data frames, so to select a column named "rank" we should use method 1, as the sketch below shows.

Data Frames groupby method

Using "group by" method we can:

• Split the data into groups based on some criteria


• Calculate statistics (or apply a function) to each group
• Similar to dplyr() function in R

In [ ]: #Group data using rank


df_rank = df.groupby(['rank'])

In [ ]:
#Calculate mean value for each numeric column per each
group
df_rank.mean()

Data Frames groupby method

Once the groupby object is created, we can calculate various statistics for each group:

In [ ]: #Calculate mean salary for each professor rank:
df.groupby('rank')[['salary']].mean()

Note: If single brackets are used to specify the column (e.g. salary), then the output is a Pandas Series object. When double brackets are used, the output is a Data Frame.
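A quick way to see the difference:

In [ ]: #Single vs. double brackets change the output type
type(df.groupby('rank')['salary'].mean())      # pandas Series
type(df.groupby('rank')[['salary']].mean())    # pandas DataFrame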
Data Frames groupby method

groupby performance notes:
- no grouping/splitting occurs until it's needed; creating the groupby object only verifies that you have passed a valid mapping
- by default the group keys are sorted during the groupby operation. You may want to pass sort=False for a potential speedup:

In [ ]: #Calculate mean salary for each professor rank:
df.groupby(['rank'], sort=False)[['salary']].mean()

Data Frame: filtering

To subset the data we can apply Boolean indexing. This indexing is commonly known as a filter. For example, if we want to subset the rows in which the salary value is greater than $120K:

In [ ]: #Subset the rows with salary above $120K:
df_sub = df[ df['salary'] > 120000 ]

Any Boolean operator can be used to subset the data:
> greater; >= greater or equal;
< less; <= less or equal;
== equal; != not equal;

In [ ]: #Select only those rows that contain female professors:
df_f = df[ df['sex'] == 'Female' ]
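Multiple conditions can be combined with & (and), | (or) and ~ (not), wrapping each condition in parentheses. A small sketch:

In [ ]: #Female professors earning more than $120K
df_fh = df[ (df['sex'] == 'Female') & (df['salary'] > 120000) ]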

Data Frames: Slicing

There are a number of ways to subset the Data Frame:
• one or more columns
• one or more rows
• a subset of rows and columns

Rows and columns can be selected by their position or label.

Data Frames: Slicing

When selecting one column, it is possible to use a single set of brackets, but the resulting object will be a Series (not a DataFrame):

In [ ]: #Select column salary:
df['salary']

When we need to select more than one column and/or want the output to be a DataFrame, we should use double brackets:

In [ ]: #Select columns rank and salary:
df[['rank','salary']]

Data Frames: Selecting rows

If we need to select a range of rows, we can specify the range using ":"

In [ ]: #Select rows by their position:
df[10:20]

Notice that the first row has position 0, and the last value in the range is omitted: for the range 0:10, the first 10 rows are returned, with positions starting at 0 and ending at 9.

Data Frames: method loc

If we need to select a range of rows using their labels, we can use the method loc:

In [ ]: #Select rows by their labels:
df_sub.loc[10:20, ['rank','sex','salary']]

Data Frames: method iloc

If we need to select a range of rows and/or columns using their positions, we can use the method iloc:

In [ ]: #Select rows and columns by their positions:
df_sub.iloc[10:20, [0, 3, 4, 5]]

Data Frames: method iloc (summary)

df.iloc[0]            # First row of a data frame
df.iloc[i]            # (i+1)th row
df.iloc[-1]           # Last row

df.iloc[:, 0]         # First column
df.iloc[:, -1]        # Last column

df.iloc[0:7]          # First 7 rows
df.iloc[:, 0:2]       # First 2 columns
df.iloc[1:3, 0:2]     # Second through third rows and first 2 columns
df.iloc[[0,5], [1,3]] # 1st and 6th rows and 2nd and 4th columns

Data Frames: Sorting

We can sort the data by the values in a column. By default the sort occurs in ascending order and a new data frame is returned.

In [ ]: #Create a new data frame from the original, sorted by the column service
df_sorted = df.sort_values(by='service')
df_sorted.head()

Data Frames: Sorting

We can sort the data using 2 or more columns:

In [ ]: df_sorted = df.sort_values(by=['service', 'salary'], ascending=[True, False])
df_sorted.head(10)

Missing Values

Missing values are marked as NaN.

In [ ]: #Read a dataset with missing values
flights = pd.read_csv("http://rcs.bu.edu/examples/python/data_analysis/flights.csv")

In [ ]: #Select the rows that have at least one missing value
flights[flights.isnull().any(axis=1)].head()

Missing Values
There are a number of methods to deal with missing values in the data frame (a usage sketch follows the table):

df.method()                 description
dropna()                    Drop rows with missing observations
dropna(how='all')           Drop rows where all cells are NA
dropna(axis=1, how='all')   Drop a column if all of its values are missing
dropna(thresh=5)            Drop rows that contain fewer than 5 non-missing values
fillna(0)                   Replace missing values with zeros
isnull()                    Returns True if the value is missing
notnull()                   Returns True for non-missing values
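For example, applied to the flights data frame read earlier (the delay column names are as in that file):

In [ ]: #Drop rows with any missing value, or fill the delays with zeros
flights_clean = flights.dropna()
flights_filled = flights.fillna({'dep_delay': 0, 'arr_delay': 0})
flights_clean.isnull().any().any()    # False: no missing values remain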

Missing Values

• When summing the data, missing values are treated as zero
• If all values are missing, the sum will be equal to NaN
• The cumsum() and cumprod() methods ignore missing values but preserve them in the resulting arrays
• Missing values in the GroupBy method are excluded (just like in R)
• Many descriptive statistics methods have a skipna option to control whether missing data should be excluded. It is set to True by default (unlike in R); see the illustration below
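A small illustration of these rules, using a hand-made Series:

In [ ]: #NaN handling in reductions
import numpy as np
s = pd.Series([1.0, np.nan, 3.0])
s.sum()                # 4.0 - the NaN is treated as zero
s.sum(skipna=False)    # nan - keep the missing value in the result
s.cumsum()             # 1.0, NaN, 4.0 - NaN preserved in place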

Aggregation Functions in Pandas

Aggregation - computing a summary statistic about each group, i.e.
• compute group sums or means
• compute group sizes/counts

Common aggregation functions:
min, max
count, sum, prod
mean, median, mode, mad
std, var

Aggregation Functions in Pandas

The agg() method is useful when multiple statistics are computed per column:

In [ ]: flights[['dep_delay','arr_delay']].agg(['min','mean','max'])

Basic Descriptive Statistics

df.method() description
describe Basic statistics (count, mean, std, min, quantiles, max)

min, max Minimum and maximum values

mean, median, mode Arithmetic average, median and mode

var, std Variance and standard deviation

sem Standard error of mean

skew Sample skewness

kurt Kurtosis
