Lecture 14

Data science

Uploaded by

aaaaalshammari
Copyright
© © All Rights Reserved
STATS 701

Data Analysis using Python


Lecture 14: Advanced pandas
Recap
Previous lecture: basics of pandas
Series and DataFrames
Indexing, changing entries
Function application

This lecture: more complicated operations


Statistical computations
Group-By operations
Reshaping, stacking and pivoting
Caveat: pandas is a large, complicated package, so I will not endeavor to mention every feature here. These slides should be enough to get you started, but there’s no substitute for reading the documentation.
Percent change over time

The pct_change method is supported by both Series and DataFrames. Series.pct_change returns a new Series representing the step-wise percent change (expressed as a fraction, so 0.1 means 10%).

pct_change includes control over how missing data is imputed, how large a time-lag to use, etc. See documentation for more detail:
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.pct_change.html
Percent change over time
pct_change operates on the columns of a DataFrame by default. The periods argument specifies the time-lag to use in computing percent change, so periods=2 looks at percent change compared to two time steps ago.

Note: pandas has extensive support for time series data, which we mostly won’t talk about in this course.
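
A minimal sketch of both calls, assuming a made-up Series of four observations (the data and variable names are illustrative, not taken from the lecture):

```python
import pandas as pd

# Hypothetical data: four consecutive observations of some quantity.
prices = pd.Series([100.0, 110.0, 99.0, 121.0])

# Step-wise change, (x[t] - x[t-1]) / x[t-1], returned as a fraction.
# The first entry is NaN because it has no predecessor.
step_change = prices.pct_change()

# Change relative to the value two time steps earlier.
lag2_change = prices.pct_change(periods=2)

print(step_change)
print(lag2_change)
```

With periods=2 the first two entries are NaN, since neither has a value two steps back to compare against.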

Computing covariances
The cov method computes the covariance between a Series and another Series.

The cov method is also supported by DataFrame, but instead computes a new DataFrame of covariances between columns.

cov supports extra arguments for further specifying behavior:
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.cov.html
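
As a rough illustration of the two forms, here is a small sketch on invented data (the Series x and y are assumptions made for the example):

```python
import pandas as pd

# Hypothetical data: y is exactly twice x.
x = pd.Series([1.0, 2.0, 3.0, 4.0])
y = pd.Series([2.0, 4.0, 6.0, 8.0])

# Series.cov: sample covariance (normalized by N - 1) between two Series.
cov_xy = x.cov(y)

# DataFrame.cov: a DataFrame holding the covariance of every pair of columns.
df = pd.DataFrame({"x": x, "y": y})
cov_matrix = df.cov()

print(cov_xy)
print(cov_matrix)
```

The diagonal of cov_matrix holds each column's sample variance, and the off-diagonal entry matches the value returned by x.cov(y).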
Pairwise correlations

The DataFrame corr method computes pairwise correlations between columns. corr itself takes no axis keyword; to correlate rows instead, transpose the DataFrame first (df.T.corr()). The method argument controls which correlation score to use (the default is Pearson’s correlation).
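
A short sketch of the method argument, using an invented three-column DataFrame (column names and values are assumptions for illustration only):

```python
import pandas as pd

# Hypothetical data: b is a monotone but nonlinear function of a,
# and c decreases linearly as a increases.
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [1.0, 4.0, 9.0, 16.0],
    "c": [4.0, 3.0, 2.0, 1.0],
})

# Default: Pearson correlation between every pair of columns.
pearson = df.corr()

# Rank-based alternative: Spearman correlation.
spearman = df.corr(method="spearman")

print(pearson)
print(spearman)
```

Because b is monotone in a, the Spearman score for that pair is 1 even though the Pearson score is slightly below 1.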
Ranking data

The rank method returns a new Series whose values are the data ranks.

By default, ties are broken by assigning the mean rank to each of the tied values.
Ranking data
By default, rank ranks the columns of a DataFrame individually.

Rank rows instead by supplying an axis argument.

Note: more complicated ranking of whole rows (i.e., sorting whole rows rather than sorting columns individually) is possible, but requires that we define an ordering on Series.
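
The ranking behavior described on the last two slides can be sketched as follows, with small made-up inputs (all values are illustrative):

```python
import pandas as pd

# Series: the two 10s tie for ranks 1 and 2, so each receives the mean, 1.5.
s = pd.Series([10, 20, 10, 30])
ranks = s.rank()

# DataFrame: rank within each column (the default, axis=0) ...
df = pd.DataFrame({"a": [3, 1, 2], "b": [1, 2, 3]})
col_ranks = df.rank()

# ... or within each row by passing axis=1.
row_ranks = df.rank(axis=1)

print(ranks)
print(col_ranks)
print(row_ranks)
```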
Group By: reorganizing data
“Group By” operations are a concept from databases
Splitting data based on some criteria
Applying functions to different splits
Combining results into a single data structure

Fundamental object: pandas GroupBy objects


Group By: reorganizing data

The DataFrame groupby method returns a pandas GroupBy object.
Group By: reorganizing data

Every groupby object has an attribute groups, which is a dictionary that maps group labels to the indices in the DataFrame.

In this example, we are splitting on the column ‘A’, which has two values: ‘plant’ and ‘animal’, so the groups dictionary has two keys.
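
The slide's original code is not reproduced here, so the following is only a sketch with an invented DataFrame whose column ‘A’ takes the values ‘plant’ and ‘animal’ (columns ‘B’ and ‘C’ and all entries are assumptions):

```python
import pandas as pd

# Hypothetical data with a grouping column A, a text column B,
# and a numeric column C.
df = pd.DataFrame({
    "A": ["plant", "animal", "plant", "plant"],
    "B": ["apple", "goat", "kiwi", "grape"],
    "C": [4.0, 2.0, 1.0, 3.0],
})

grouped = df.groupby("A")

# groups maps each group label to the row labels belonging to that group.
print(grouped.groups)
```

Here the ‘plant’ key maps to row labels 0, 2 and 3, and the ‘animal’ key maps to row label 1.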
Group By: reorganizing data


The important point is that the groupby object is storing information about how to partition the rows of the original DataFrame according to the argument(s) passed to the groupby method.

Group By: aggregation

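Split on column ‘A’, then compute the means within each group. Note that means are not defined for non-numeric columns such as ‘B’. Older versions of pandas silently dropped such columns from the result; since pandas 2.0, you must pass numeric_only=True (or select the numeric columns first), otherwise mean raises a TypeError.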
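
A minimal sketch of this aggregation, assuming an invented DataFrame with a text column ‘B’ and a numeric column ‘C’. Passing numeric_only=True is a choice made here so that the non-numeric column is excluded explicitly:

```python
import pandas as pd

# Hypothetical data: B is text, so it has no meaningful mean.
df = pd.DataFrame({
    "A": ["plant", "animal", "plant", "plant"],
    "B": ["apple", "goat", "kiwi", "grape"],
    "C": [4.0, 2.0, 1.0, 3.0],
})

# Split on A, then average the numeric columns within each group.
means = df.groupby("A").mean(numeric_only=True)

print(means)
```

The result is indexed by the group labels and contains only column ‘C’.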
Group By: aggregation

Here we’re building a hierarchically-indexed Series (i.e., multi-indexed), recording (fictional) scores of students by major and handedness.

Suppose I want to collapse over handedness to get average scores by major. In essence, I want to group by major and ignore handedness.
Group By: aggregation

Group by the 0-th level of the hierarchy (i.e., ‘major’), and take means.

We could have equivalently written groupby('major') here.
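
The scores below are invented to illustrate the idea; only the level names ‘major’ and ‘handedness’ come from the slides. The sketch groups by level position and by level name, which should give the same result:

```python
import pandas as pd

# Hypothetical two-level index: major x handedness.
index = pd.MultiIndex.from_product(
    [["math", "econ", "stats"], ["left", "right"]],
    names=["major", "handedness"],
)
scores = pd.Series([72.0, 90.0, 90.0, 80.0, 78.0, 81.0], index=index)

# Collapse over handedness by grouping on level 0 of the index ...
by_major = scores.groupby(level=0).mean()

# ... or, equivalently, by referring to that level by name.
by_major_named = scores.groupby(level="major").mean()

print(by_major)
```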
Group By: examining groups

groupby.get_group lets us pick out an individual group. Here, we’re grabbing just the data from the ‘econ’ group, after grouping by ‘major’.
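
A sketch of get_group on the same kind of made-up, multi-indexed scores Series (the values are assumptions):

```python
import pandas as pd

# Hypothetical scores indexed by major and handedness.
index = pd.MultiIndex.from_product(
    [["math", "econ", "stats"], ["left", "right"]],
    names=["major", "handedness"],
)
scores = pd.Series([72.0, 90.0, 90.0, 80.0, 78.0, 81.0], index=index)

# Group by major, then pull out only the rows belonging to 'econ'.
econ = scores.groupby(level="major").get_group("econ")

print(econ)
```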
Group By: aggregation
Similar aggregation to what we did a
few slides ago, but now we have a
DataFrame instead of a Series.

GroupBy objects also support the aggregate method, which is often more convenient.
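
One way aggregate might be used on a DataFrame, sketched with invented exam columns (midterm and final are hypothetical names, not from the lecture):

```python
import pandas as pd

# Hypothetical data: two scores per student, plus major and handedness.
df = pd.DataFrame({
    "major": ["math", "math", "econ", "econ"],
    "handedness": ["left", "right", "left", "right"],
    "midterm": [72.0, 90.0, 90.0, 80.0],
    "final": [80.0, 85.0, 70.0, 95.0],
})

grouped = df.groupby("major")[["midterm", "final"]]

# Several aggregations at once: the result has two-level column labels.
summary = grouped.aggregate(["mean", "max"])

# A different aggregation per column, specified with a dictionary.
per_column = grouped.aggregate({"midterm": "mean", "final": "max"})

print(summary)
print(per_column)
```

Selecting the numeric columns before aggregating keeps the text column handedness out of the computation.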
Transforming data
From the documentation: “The transform method returns an object that is indexed the same (same size) as the one being grouped.”

Building a time series, indexed by year-month-day.

Suppose we want to standardize these scores within each year. Group the data according to the output of the key function, apply the given transformation within each group, then un-group the data.

Important point: the result of groupby.transform has the same dimension as the original DataFrame or Series.
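
The steps above can be sketched as follows. The random data, the date range, and the helper names year_of and zscore are all assumptions made for this example:

```python
import numpy as np
import pandas as pd

# Hypothetical daily time series covering roughly two calendar years.
dates = pd.date_range("2015-01-01", periods=730, freq="D")
rng = np.random.default_rng(0)
ts = pd.Series(rng.normal(loc=5.0, scale=2.0, size=len(dates)), index=dates)

def year_of(timestamp):
    """Key function: map each index entry (a Timestamp) to its year."""
    return timestamp.year

def zscore(x):
    """Standardize one group: subtract its mean, divide by its std."""
    return (x - x.mean()) / x.std()

# Group by year, standardize within each year, and return the result
# in the shape and index order of the original Series.
standardized = ts.groupby(year_of).transform(zscore)

print(standardized.head())
```

Within each year, the standardized values have mean approximately 0 and standard deviation approximately 1.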
Filtering data
From the documentation: “The argument of filter must be a function that, applied to the group as a whole, returns True or False.”

So this will throw out all the groups with sum <= 2.

Like transform, the result is ungrouped.
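
Since the slide's code is not shown here, this is a small sketch of a filter with the same threshold, using an invented Series grouped by its own values:

```python
import pandas as pd

# Hypothetical data: group 1 sums to 2, group 2 sums to 2, group 3 sums to 9.
sf = pd.Series([1, 1, 2, 3, 3, 3])

# Keep only the groups whose sum exceeds 2; all other groups are dropped.
kept = sf.groupby(sf).filter(lambda group: group.sum() > 2)

print(kept)
```

Only the three 3s survive, and they keep their original index labels.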
Combining DataFrames

The pandas concat function concatenates DataFrames into a single DataFrame.

Repeated indices remain repeated in the resulting DataFrame. Missing values get NaN.

pandas.concat accepts numerous optional arguments for finer control over how concatenation is performed. See the documentation for more.
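
A minimal sketch of these points, assuming two small invented DataFrames that share only column B:

```python
import pandas as pd

# Hypothetical inputs: both use row labels 0 and 1, and they overlap
# only in column B.
df1 = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
df2 = pd.DataFrame({"B": [5, 6], "C": [7, 8]})

# Rows are stacked; the row labels 0 and 1 each appear twice, and cells
# with no source value (A for df2's rows, C for df1's rows) become NaN.
combined = pd.concat([df1, df2])

# One of the optional arguments: discard the old labels and renumber rows.
renumbered = pd.concat([df1, df2], ignore_index=True)

print(combined)
print(renumbered)
```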
Merges and joins
pandas DataFrames support many common database operations
Most notably, join and merge operations

We’ll learn about these when we discuss SQL later in the semester
So we won’t discuss them here

Important: What we learn for SQL later has analogues in pandas

If you are already familiar with SQL, you might like to read this:
https://pandas.pydata.org/pandas-docs/stable/comparison_with_sql.html
Pivoting and Stacking
Data in this format is usually called stacked. It is common to store data in this form in a file, but once it’s read into a table, it often makes more sense to create columns for A, B and C. That is, we want to unstack this DataFrame.
Pivoting and Stacking
The pivot method takes care of unstacking DataFrames. We supply indices for the new DataFrame, and tell it to turn the variable column in the old DataFrame into a set of column names in the unstacked one.

https://en.wikipedia.org/wiki/Pivot_table
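
A sketch of pivoting stacked data. The column names date, variable and value, and all of the entries, are assumptions chosen to resemble the format described above:

```python
import pandas as pd

# Hypothetical stacked data: one row per (date, variable) pair.
stacked = pd.DataFrame({
    "date": ["2000-01-03"] * 3 + ["2000-01-04"] * 3,
    "variable": ["A", "B", "C"] * 2,
    "value": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
})

# One row per date, one column per distinct entry of 'variable'.
wide = stacked.pivot(index="date", columns="variable", values="value")

print(wide)
```

pivot expects each (index, columns) pair to occur at most once; duplicated pairs raise an error.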
Pivoting and Stacking

How do we stack this? That is, how do we get a non-pivot version of this DataFrame? The answer is to use the DataFrame stack method.
Pivoting and Stacking

The DataFrame stack method makes a stacked version of the calling DataFrame. In the event that the resulting column index set is trivial, the result is a Series. Note that df.stack() no longer has columns A or B. The column labels A and B have become an extra index level.
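
A small sketch of stack on an invented two-column DataFrame (the row labels x and y are assumptions):

```python
import pandas as pd

# Hypothetical wide table with columns A and B.
df = pd.DataFrame({"A": [1, 2], "B": [3, 4]}, index=["x", "y"])

# The column labels move into a new innermost index level, leaving a
# Series indexed by (row label, former column label).
stacked = df.stack()

# unstack reverses the operation.
round_trip = stacked.unstack()

print(stacked)
```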
Pivoting and Stacking

Here is a more complicated example. Notice that the column labels have a three-level hierarchical structure.

There are multiple ways to stack this data. At one extreme, we could make all three levels into columns. At the other extreme, we could choose only one to make into a column.
Pivoting and Stacking
Stack only according to level 1 (i.e., the animal column index).

Missing animal x cond x hair_length conditions default to NaN.
Pivoting and Stacking

Stacking across all three levels yields a Series, since there is no longer any column structure. This is often called flattening a table.

Notice that the NaN entries are not necessary here, since we have an entry in the Series only for entries of the original DataFrame.
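
The two extremes can be sketched on an invented DataFrame with three column levels. The level names cond, animal and hair_length follow the slide text, while the particular labels and values are assumptions:

```python
import numpy as np
import pandas as pd

# Hypothetical three-level column index. Only four of the possible
# (cond, animal, hair_length) combinations actually occur.
columns = pd.MultiIndex.from_tuples(
    [("A", "cat", "long"), ("B", "cat", "long"),
     ("A", "dog", "short"), ("B", "dog", "short")],
    names=["cond", "animal", "hair_length"],
)
df = pd.DataFrame(np.arange(8.0).reshape(2, 4), columns=columns)

# Stack only the animal level: combinations that never occurred
# (e.g. cat with short hair) show up as NaN.
by_animal = df.stack(level="animal")

# Stack all three levels: the table is flattened into a Series.
flat = df.stack(level=["cond", "animal", "hair_length"])

print(by_animal)
print(flat)
```

Recent pandas releases may emit a FutureWarning about the stack implementation changing; the sketch relies only on behavior that is expected to be the same under both implementations.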
Plotting DataFrames

cumsum gets partial sums, just like in numpy.

Note: this requires that you have imported matplotlib.

Note that legend is automatically populated and x-ticks are automatically date formatted.
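
A sketch of the kind of plot described here, assuming matplotlib is installed. The random data and column names are invented, and the Agg backend is selected only so the example runs without a display:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this in a notebook
import numpy as np
import pandas as pd

# Hypothetical data: 100 days of random steps in three columns.
dates = pd.date_range("2018-01-01", periods=100, freq="D")
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, 3)), index=dates,
                  columns=["A", "B", "C"])

# Partial sums down each column turn the steps into random walks.
walks = df.cumsum()

# One line per column; the legend is filled in from the column names.
ax = walks.plot()
```

DataFrame.plot returns a matplotlib Axes object, so the usual matplotlib calls (titles, axis labels, savefig) can be applied to the result.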
Plotting DataFrames
The DataFrame.plot() method is largely identical to matplotlib.pyplot
So you already mostly know how to use it!

Additional plot types:
https://pandas.pydata.org/pandas-docs/stable/visualization.html#other-plots

More advanced plotting tools:
https://pandas.pydata.org/pandas-docs/stable/visualization.html#plotting-tools
Readings
Required:
Group By:
https://pandas.pydata.org/pandas-docs/stable/groupby.html
Reshaping and pivoting:
https://pandas.pydata.org/pandas-docs/stable/reshaping.html

Recommended:
Merge, join and concatenation:
https://pandas.pydata.org/pandas-docs/stable/merging.html
Time series functionality:
https://pandas.pydata.org/pandas-docs/stable/timeseries.html
