Lecture 14

Data science

Uploaded by

aaaaalshammari

STATS 701

Data Analysis using Python

Lecture 14: Advanced pandas
Recap
Previous lecture: basics of pandas
Series and DataFrames
Indexing, changing entries
Function application

This lecture: more complicated operations

Statistical computations
Group-By operations
Reshaping, stacking and pivoting
Caveat: pandas is a large, complicated package, so I will not endeavor to mention every feature here. These slides should be enough to get you started, but there’s no substitute for reading the documentation.
Percent change over time

The pct_change method is supported by both Series and DataFrames. Series.pct_change returns a new Series representing the step-wise percent change.

pct_change includes control over how missing data is imputed, how large a time-lag to use, etc. See documentation for more detail:
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.pct_change.html
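A minimal sketch of Series.pct_change. The price values are made up for illustration. Note that the result is expressed as a fraction (0.1 means a 10% increase), and the first entry is NaN because it has no predecessor.

```python
import pandas as pd

# Hypothetical prices, invented for this example.
prices = pd.Series([100.0, 110.0, 99.0, 99.0])

# Step-wise change: (x[t] - x[t-1]) / x[t-1].
changes = prices.pct_change()
print(changes)
```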
Percent change over time

pct_change operates on columns of a DataFrame, by default. The periods argument specifies the time-lag to use in computing percent change. So periods=2 looks at percent change compared to two time steps ago.

Note: pandas has extensive support for time series data, which we mostly won’t talk about in this course.
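As a rough illustration of the periods argument, the sketch below applies pct_change with periods=2 to a small DataFrame. The column names and values are assumptions chosen so the arithmetic is easy to check by hand.

```python
import pandas as pd

# Hypothetical data; each column is treated as its own series.
df = pd.DataFrame({
    "x": [1.0, 2.0, 4.0, 8.0],
    "y": [10.0, 10.0, 20.0, 5.0],
})

# Compare each entry with the entry two rows earlier in the same column.
lag2 = df.pct_change(periods=2)
print(lag2)
```

The first two rows are NaN because there is nothing two steps back to compare against.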

The cov method is also supported by DataFrame, but there it instead computes a new DataFrame of covariances between columns.

cov supports extra arguments for further specifying behavior:
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.cov.html
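To make the Series versus DataFrame distinction concrete, here is a small sketch with invented data. Series.cov takes a second Series and returns one number, while DataFrame.cov returns the full matrix of pairwise (sample) covariances.

```python
import pandas as pd

# Hypothetical columns: b is exactly 2*a, c decreases as a increases.
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.0],
    "c": [4.0, 3.0, 2.0, 1.0],
})

# Covariance between two Series: a single float.
pair_cov = df["a"].cov(df["b"])

# Covariance between every pair of columns: a 3x3 DataFrame.
cov_matrix = df.cov()
print(pair_cov)
print(cov_matrix)
```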
Pairwise correlations

The DataFrame corr method computes pairwise correlations between columns. corr takes no axis keyword; to correlate rows instead, transpose the DataFrame first (df.T.corr()). The method argument controls which correlation score to use (default is Pearson’s correlation).
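A short sketch of DataFrame.corr with two choices of the method argument. The data are invented: b is a monotone but nonlinear function of a, so the Spearman (rank) correlation is exactly 1 while the Pearson correlation is slightly below 1. Row-wise correlations are obtained here by transposing.

```python
import pandas as pd

# Hypothetical data: b = a**2, c decreases linearly in a.
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [1.0, 4.0, 9.0, 16.0],
    "c": [4.0, 3.0, 2.0, 1.0],
})

pearson = df.corr()                     # default: Pearson's correlation
spearman = df.corr(method="spearman")   # rank-based correlation
row_corr = df.T.corr()                  # correlations between rows

print(pearson)
print(spearman)
```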
Ranking data

The rank method returns a new Series whose values are the data ranks.

Ties are broken by assigning the mean rank to each of the tied values.
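For instance, with the made-up values below, the two 30s would occupy ranks 3 and 4, so each receives the mean rank 3.5. The method argument of rank selects a different tie rule; "min" is shown as one alternative.

```python
import pandas as pd

# Hypothetical values with one tie.
s = pd.Series([10, 30, 20, 30])

ranks = s.rank()                    # default tie rule: mean rank
min_ranks = s.rank(method="min")    # tied values share the lowest rank

print(ranks)
print(min_ranks)
```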
Ranking data

By default, rank ranks columns of a DataFrame individually.

Rank rows instead by supplying an axis argument.

Note: more complicated ranking of whole rows (i.e., sorting whole rows rather than sorting columns individually) is possible, but requires we define an ordering on Series.
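A small sketch contrasting the two directions, using invented data. With no arguments each column is ranked on its own; with axis=1 each row is ranked on its own.

```python
import pandas as pd

# Hypothetical 3x3 table.
df = pd.DataFrame({
    "x": [3, 1, 2],
    "y": [1, 2, 3],
    "z": [2, 3, 1],
})

col_ranks = df.rank()         # rank within each column
row_ranks = df.rank(axis=1)   # rank within each row

print(col_ranks)
print(row_ranks)
```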
Group By: reorganizing data
“Group By” operations are a concept from databases
Splitting data based on some criteria
Applying functions to different splits
Combining results into a single data structure

Fundamental object: pandas GroupBy objects

Group By: reorganizing data

The DataFrame groupby method returns a pandas GroupBy object.
Group By: reorganizing data

Every groupby object has an attribute groups, which is a dictionary that maps group labels to the indices in the DataFrame.

In this example, we are splitting on the column ‘A’, which has two values: ‘plant’ and ‘animal’, so the groups dictionary has two keys.
The important point is that the groupby object is storing information about how to partition the rows of the original DataFrame according to the argument(s) passed to the groupby method.
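The slide's original table is not reproduced in these notes, so the sketch below uses a stand-in DataFrame: column ‘A’ takes the values ‘plant’ and ‘animal’ as described, while the contents of columns ‘B’ and ‘C’ are assumptions made up for illustration.

```python
import pandas as pd

# Stand-in data; only the two values in column "A" come from the slide text.
df = pd.DataFrame({
    "A": ["plant", "animal", "plant", "animal"],
    "B": ["apple", "goat", "kiwi", "lion"],
    "C": [1.0, 2.0, 3.0, 4.0],
})

grouped = df.groupby("A")

# Maps each group label to the row labels that belong to that group.
groups = grouped.groups
for label, row_labels in groups.items():
    print(label, list(row_labels))
```

No data is copied or aggregated at this point; the groupby object only records which rows belong to which group.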
Group By: aggregation

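Split on group ‘A’, then compute the means within each group. Note that columns for which means are not supported are removed, so column ‘B’ doesn’t show up in the result. (This silent dropping is the behavior of older pandas releases; from pandas 2.0 on, pass numeric_only=True to mean to get the same result.)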
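A sketch of this aggregation on stand-in data (the values in ‘B’ and ‘C’ are invented). Passing numeric_only=True explicitly keeps the example working on recent pandas versions, where non-numeric columns are no longer dropped silently.

```python
import pandas as pd

# Stand-in data: "B" holds strings, so it has no meaningful mean.
df = pd.DataFrame({
    "A": ["plant", "animal", "plant", "animal"],
    "B": ["apple", "goat", "kiwi", "lion"],
    "C": [1.0, 2.0, 3.0, 4.0],
})

# Split on "A", then average the numeric columns within each group.
means = df.groupby("A").mean(numeric_only=True)
print(means)
```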
Group By: aggregation

Here we’re building a hierarchically-indexed Series (i.e., multi-indexed), recording (fictional) scores of students by major and handedness.

Suppose I want to collapse over handedness to get average scores by major. In essence, I want to group by major and ignore handedness.
Group by the 0-th level of the hierarchy (i.e., ‘major’), and take means.

We could have equivalently written groupby('major') here.
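The following sketch builds a multi-indexed Series of fictional scores and collapses over handedness. The level names ‘major’ and ‘handedness’ follow the slide text; the majors other than ‘econ’ and all the score values are assumptions. The level can be given by position or by name.

```python
import pandas as pd

# Two-level index: every combination of major and handedness.
index = pd.MultiIndex.from_product(
    [["econ", "math", "stats"], ["left", "right"]],
    names=["major", "handedness"],
)
# Fictional scores, one per (major, handedness) pair.
scores = pd.Series([80.0, 90.0, 70.0, 75.0, 85.0, 95.0], index=index)

by_major = scores.groupby(level=0).mean()        # level by position
by_name = scores.groupby(level="major").mean()   # same level, by name

print(by_major)
```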
Group By: examining groups

groupby.get_group lets us pick out an individual group. Here, we’re grabbing just the data from the ‘econ’ group, after grouping by ‘major’.
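A minimal sketch of get_group, again on fictional scores (the values and the ‘math’ major are made up). The returned object keeps the original index entries of the selected rows.

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [["econ", "math"], ["left", "right"]],
    names=["major", "handedness"],
)
# Fictional scores.
scores = pd.Series([80.0, 90.0, 70.0, 75.0], index=index)

# Pick out only the rows whose "major" level equals "econ".
econ = scores.groupby(level="major").get_group("econ")
print(econ)
```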
Group By: aggregation
Similar aggregation to what we did a
few slides ago, but now we have a
DataFrame instead of a Series.
Groupby objects also support the aggregate method, which is often more convenient.
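One reason aggregate can be more convenient is that it accepts several summaries at once. The sketch below is an illustration under assumed data: the column names ‘score’ and ‘hours’ and all values are hypothetical.

```python
import pandas as pd

# Hypothetical student records.
df = pd.DataFrame({
    "major": ["econ", "econ", "math", "math"],
    "score": [80.0, 90.0, 70.0, 75.0],
    "hours": [5, 7, 6, 4],
})

# A dict maps each column to the summary (or list of summaries) to compute.
summary = df.groupby("major").aggregate({
    "score": ["mean", "max"],
    "hours": "sum",
})
print(summary)
```

The result has hierarchical column labels such as ("score", "mean"), one per requested summary.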
Transforming data

From the documentation: “The transform method returns an object that is indexed the same (same size) as the one being grouped.”

Building a time series, indexed by year-month-day.

Suppose we want to standardize these scores within each year. Group the data according to the output of the key function, apply the given transformation within each group, then un-group the data.

Important point: the result of groupby.transform has the same dimension as the original DataFrame or Series.
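The steps above can be sketched as follows. The date range, the random scores, and the helper name year_of are assumptions made for illustration; the key function is called on each index entry, and its return value decides the group.

```python
import numpy as np
import pandas as pd

# Hypothetical daily scores covering two calendar years.
dates = pd.date_range("2019-01-01", periods=730, freq="D")
rng = np.random.default_rng(0)
ts = pd.Series(rng.normal(loc=5.0, scale=2.0, size=len(dates)), index=dates)

def year_of(timestamp):
    """Key function: map an index entry (a Timestamp) to its year."""
    return timestamp.year

# Standardize within each year: subtract the year's mean, divide by its std.
standardized = ts.groupby(year_of).transform(lambda x: (x - x.mean()) / x.std())

print(standardized.head())
```

The output has one entry per original observation, in the original order, rather than one entry per group.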
Filtering data

From the documentation: “The argument of filter must be a function that, applied to the group as a whole, returns True or False.”

So this will throw out all the groups with sum <= 2.

Like transform, the result is ungrouped.
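A small sketch of a filter that keeps only groups whose sum exceeds 2. The data are chosen for illustration: grouping the Series by its own values gives groups {1, 1}, {2} and {3, 3, 3}, with sums 2, 2 and 9.

```python
import pandas as pd

s = pd.Series([1, 1, 2, 3, 3, 3])

# Group equal values together, then keep a group only if its sum is > 2.
kept = s.groupby(s).filter(lambda group: group.sum() > 2)
print(kept)
```

Only the three 3s survive, and they keep their original index labels.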
Combining DataFrames

The pandas concat function concatenates DataFrames into a single DataFrame.

Repeated indices remain repeated in the resulting DataFrame. Missing values get NaN.

pandas.concat accepts numerous optional arguments for finer control over how concatenation is performed. See the documentation for more.
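To illustrate both points, the sketch below concatenates two made-up DataFrames that share column ‘A’ but not ‘B’ or ‘C’. The ignore_index argument is shown as one example of the optional controls.

```python
import pandas as pd

# Hypothetical inputs; both use the default index 0, 1.
df1 = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
df2 = pd.DataFrame({"A": [5, 6], "C": [7, 8]})

# Rows are stacked; index labels 0 and 1 each appear twice.
stacked = pd.concat([df1, df2])

# ignore_index=True discards the old labels and renumbers the rows.
renumbered = pd.concat([df1, df2], ignore_index=True)

print(stacked)
print(renumbered)
```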
Merges and joins
pandas DataFrames support many common database operations
Most notably, join and merge operations

We’ll learn about these when we discuss SQL later in the semester
So we won’t discuss them here

Important: What we learn for SQL later has analogues in pandas

If you are already familiar with SQL, you might like to read this:
https://pandas.pydata.org/pandas-docs/stable/comparison_with_sql.html
Pivoting and Stacking

Data in this format is usually called stacked. It is common to store data in this form in a file, but once it’s read into a table, it often makes more sense to create columns for A, B and C. That is, we want to unstack this DataFrame.

Pivoting and Stacking

The pivot method takes care of unstacking DataFrames. We supply indices for the new DataFrame, and tell it to turn the variable column in the old DataFrame into a set of column names in the unstacked one.

https://en.wikipedia.org/wiki/Pivot_table
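A minimal sketch of pivot on a stacked table. The column names ‘date’, ‘variable’ and ‘value’ and all values are assumptions standing in for the slide's example, which is not reproduced in these notes.

```python
import pandas as pd

# Stacked ("long") form: one row per (date, variable) observation.
long_df = pd.DataFrame({
    "date": ["2020-01-01"] * 3 + ["2020-01-02"] * 3,
    "variable": ["A", "B", "C"] * 2,
    "value": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})

# Unstacked ("wide") form: one row per date, one column per variable.
wide = long_df.pivot(index="date", columns="variable", values="value")
print(wide)
```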
Pivoting and Stacking

How do we stack this? That is, how do we get a non-pivot version of this DataFrame? The answer is to use the DataFrame stack method.
Pivoting and Stacking

The DataFrame stack method makes a stacked version of the calling DataFrame. In the event that the resulting column index set is trivial, the result is a Series. Note that df.stack() no longer has columns A or B. The column labels A and B have become an extra index level.
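A short sketch of stack on a two-column DataFrame with invented values and row labels. Since no column structure is left after stacking, the result is a Series whose index pairs each row label with a former column label.

```python
import pandas as pd

# Hypothetical wide table.
df = pd.DataFrame({"A": [1, 2], "B": [3, 4]}, index=["x", "y"])

# Column labels A and B move into a second index level.
stacked = df.stack()
print(stacked)

# unstack reverses the operation.
restored = stacked.unstack()
print(restored)
```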
Pivoting and Stacking

Here is a more complicated example. Notice that the column labels have a three-level hierarchical structure.

There are multiple ways to stack this data. At one extreme, we could stack all three levels, moving each of them from the column labels into the row index. At the other extreme, we could choose only one level to stack.
Pivoting and Stacking
Stack only according to level 1 (i.e., the animal column index).

Missing animal x cond x hair_length conditions default to NaN.
Pivoting and Stacking

Stacking across all three levels yields a Series, since there is no longer any column structure. This is often called flattening a table.

Notice that the NaN entries are not necessary here, since we have an entry in the Series only for entries of the original DataFrame.
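The two extremes can be sketched as below. The level names cond, animal and hair_length follow the slide text, but the particular labels and values are assumptions (modeled on the example in the pandas reshaping documentation). Recent pandas versions may emit a FutureWarning about the stack implementation; the results shown here should be unaffected.

```python
import numpy as np
import pandas as pd

# Three-level column labels; only 4 of the 8 possible combinations occur.
columns = pd.MultiIndex.from_tuples(
    [("A", "cat", "long"), ("B", "dog", "short"),
     ("B", "cat", "long"), ("A", "dog", "short")],
    names=["cond", "animal", "hair_length"],
)
df = pd.DataFrame(np.arange(16.0).reshape(4, 4), columns=columns)

# Stack only level 1 (animal): combinations that never occur become NaN.
by_animal = df.stack(level=1)

# Stack all three levels: a flat Series, one entry per original cell.
flat = df.stack(level=[0, 1, 2])

print(by_animal)
print(flat)
```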
Plotting DataFrames

cumsum gets partial sums, just like in numpy.

Note: this requires that you have imported matplotlib.

Note that the legend is automatically populated and x-ticks are automatically date formatted.
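A sketch of the kind of plot described above, assuming random-walk data indexed by date (the column names, dates, and the helper plot_walks are all made up for illustration). The plotting call is wrapped in a function so that the cumulative sums can be inspected even where matplotlib is not installed.

```python
import numpy as np
import pandas as pd

# Hypothetical daily increments for three series.
dates = pd.date_range("2020-01-01", periods=100, freq="D")
rng = np.random.default_rng(1)
steps = pd.DataFrame(rng.normal(size=(100, 3)), index=dates,
                     columns=["A", "B", "C"])

# Running totals down each column, like numpy.cumsum along axis 0.
walks = steps.cumsum()

def plot_walks(frame):
    """Draw one line per column; requires matplotlib to be installed."""
    import matplotlib
    matplotlib.use("Agg")  # non-interactive backend; omit in a notebook
    return frame.plot()    # legend and date ticks are filled in by pandas
```

Calling plot_walks(walks) returns a matplotlib Axes object with one line per column.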
Plotting DataFrames
The DataFrame.plot() method works much like matplotlib.pyplot
So you already mostly know how to use it!

Additional plot types:
https://pandas.pydata.org/pandas-docs/stable/visualization.html#other-plots

More advanced plotting tools:
https://pandas.pydata.org/pandas-docs/stable/visualization.html#plotting-tools
Readings
Required:
Group By:
https://pandas.pydata.org/pandas-docs/stable/groupby.html
Reshaping and pivoting:
https://pandas.pydata.org/pandas-docs/stable/reshaping.html

Recommended:
Merge, join and concatenation:
https://pandas.pydata.org/pandas-docs/stable/merging.html
Time series functionality:
https://pandas.pydata.org/pandas-docs/stable/timeseries.html
