.. currentmodule:: pandas
.. ipython:: python
   :suppress:

   import pandas as pd
   import numpy as np
   pd.options.display.max_rows = 15

Since ``pandas`` aims to provide much of the data manipulation and analysis
functionality that people use R for, this page takes a more detailed look at
the R language and its many third-party libraries as they relate to ``pandas``.
In comparisons with R and CRAN libraries, we care about the following things:

- Functionality / flexibility: what can and cannot be done with each tool
- Performance: how fast operations are. Hard numbers/benchmarks are preferable
- Ease of use: whether one tool is easier or harder to use (you may have to be
  the judge of this, given side-by-side code comparisons)

This page also serves as a translation guide for users of these R packages.

.. versionadded:: 0.13
The :meth:`~pandas.DataFrame.query` method is similar to the base R ``subset``
function. In R you might want to get the rows of a ``data.frame`` where one
column's values are less than or equal to another column's values:

.. code-block:: r

   df <- data.frame(a=rnorm(10), b=rnorm(10))
   subset(df, a <= b)
   df[df$a <= df$b,]  # note the comma

In ``pandas``, there are a few ways to perform subsetting. You can use
:meth:`~pandas.DataFrame.query`, pass an expression as if it were an
index/slice, or use standard boolean indexing:

.. ipython:: python

   from pandas import DataFrame
   from numpy import random

   df = DataFrame({'a': random.randn(10), 'b': random.randn(10)})
   df.query('a <= b')
   df[df.a <= df.b]
   df.loc[df.a <= df.b]

For more details and examples see :ref:`the query documentation <indexing.query>`.
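
As a rough check that these approaches select the same rows, the sketch below
builds a small frame and compares the results. The fixed seed and the variable
names (``by_query``, ``by_mask``, ``by_loc``) are illustrative assumptions, not
part of the original example.

```python
import numpy as np
import pandas as pd

# Fixed seed so the example is reproducible (an arbitrary, illustrative choice).
rng = np.random.RandomState(0)
df = pd.DataFrame({'a': rng.randn(10), 'b': rng.randn(10)})

by_query = df.query('a <= b')         # expression passed as a string
by_mask = df[df['a'] <= df['b']]      # boolean mask used as an index
by_loc = df.loc[df['a'] <= df['b']]   # the same mask through .loc

# All three approaches return the same rows with the same index labels.
same_rows = by_query.equals(by_mask) and by_mask.equals(by_loc)
```

Unlike R's ``df[mask, ]``, no trailing comma is needed: a boolean mask passed to
``[]`` or ``.loc`` filters rows.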
.. versionadded:: 0.13
An expression using a ``data.frame`` called ``df`` in R with the columns ``a``
and ``b`` would be evaluated using ``with`` like so:

.. code-block:: r

   df <- data.frame(a=rnorm(10), b=rnorm(10))
   with(df, a + b)
   df$a + df$b  # same as the previous expression

In ``pandas`` the equivalent expression, using the
:meth:`~pandas.DataFrame.eval` method, would be:

.. ipython:: python

   df = DataFrame({'a': random.randn(10), 'b': random.randn(10)})
   df.eval('a + b')
   df.a + df.b  # same as the previous expression

In certain cases :meth:`~pandas.DataFrame.eval` will be much faster than evaluation in pure Python. For more details and examples see :ref:`the eval documentation <enhancingperf.eval>`.
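
A minimal sketch of the equivalence, assuming a recent pandas release; the seed
and the new column name ``c`` are illustrative. It also shows that an
assignment inside the expression returns a new frame with the extra column.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
df = pd.DataFrame({'a': rng.randn(10), 'b': rng.randn(10)})

via_eval = df.eval('a + b')      # column names are resolved inside the string
via_columns = df['a'] + df['b']  # ordinary element-wise addition

# An assignment expression adds a column; by default a new DataFrame is returned.
with_sum = df.eval('c = a + b')
```

Any speed-up from ``eval`` depends on the size of the data and on whether the
optional ``numexpr`` package is installed, so it is worth timing on real data.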
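``plyr`` is an R library for the split-apply-combine strategy for data
analysis. The functions revolve around three data structures in R, ``a``
for ``arrays``, ``l`` for ``lists``, and ``d`` for ``data.frame``. The
table below shows how these data structures could be mapped in Python.

+------------+-------------------------------+
| R          | Python                        |
+============+===============================+
| array      | list                          |
+------------+-------------------------------+
| lists      | dictionary or list of objects |
+------------+-------------------------------+
| data.frame | dataframe                     |
+------------+-------------------------------+
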
An expression using a ``data.frame`` called ``df`` in R where you want to
summarize ``x`` by ``month`` and ``week``:

.. code-block:: r

   require(plyr)
   df <- data.frame(
     x = runif(120, 1, 168),
     y = runif(120, 7, 334),
     z = runif(120, 1.7, 20.7),
     month = rep(c(5,6,7,8),30),
     week = sample(1:4, 120, TRUE)
   )

   ddply(df, .(month, week), summarize,
         mean = round(mean(x), 2),
         sd = round(sd(x), 2))

In ``pandas`` the equivalent expression, using the
:meth:`~pandas.DataFrame.groupby` method, would be:

.. ipython:: python

   df = DataFrame({
       'x': random.uniform(1., 168., 120),
       'y': random.uniform(7., 334., 120),
       'z': random.uniform(1.7, 20.7, 120),
       'month': [5, 6, 7, 8] * 30,
       'week': random.randint(1, 5, 120)  # upper bound is exclusive: weeks 1-4
   })

   grouped = df.groupby(['month', 'week'])
   grouped['x'].agg(['mean', 'std'])

For more details and examples see :ref:`the groupby documentation <groupby.aggregate>`.
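
To get closer to the shape of the ``ddply`` output (rounded values in a flat
table), one possible sketch is shown below. The seed and the ``summary`` name
are illustrative assumptions; only the ``x`` column is built here to keep the
example short.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
df = pd.DataFrame({
    'x': rng.uniform(1., 168., 120),
    'month': [5, 6, 7, 8] * 30,
    'week': rng.randint(1, 5, 120),  # upper bound is exclusive, so weeks 1-4
})

summary = (
    df.groupby(['month', 'week'])['x']
      .agg(['mean', 'std'])  # 'std' is the sample standard deviation, like R's sd()
      .round(2)              # mirrors round(..., 2) in the R call
      .reset_index()         # turn the group keys back into ordinary columns
)
```

``reset_index`` is what makes the result resemble the ``data.frame`` returned by
``ddply``; without it the group keys form a hierarchical index.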
An expression using a 3 dimensional array called ``a`` in R where you want to
melt it into a ``data.frame``:

.. code-block:: r

   a <- array(c(1:23, NA), c(2,3,4))
   data.frame(melt(a))

In Python, ``a`` is a NumPy array, so you can iterate over its indices and
values with ``np.ndenumerate`` in a list comprehension.

.. ipython:: python

   a = np.array(list(range(1, 24)) + [np.nan]).reshape(2, 3, 4)
   DataFrame([tuple(list(x) + [val]) for x, val in np.ndenumerate(a)])

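
The same idea with labelled columns can be sketched as follows. The column
names ``dim1``, ``dim2``, ``dim3`` and ``value`` are hypothetical, chosen only
for illustration.

```python
import numpy as np
import pandas as pd

a = np.array(list(range(1, 24)) + [np.nan]).reshape(2, 3, 4)

# np.ndenumerate yields ((i, j, k), value) pairs in row-major order.
rows = [idx + (val,) for idx, val in np.ndenumerate(a)]
melted = pd.DataFrame(rows, columns=['dim1', 'dim2', 'dim3', 'value'])
```

Two differences from the R result are worth keeping in mind: NumPy indices start
at 0 rather than 1, and ``reshape`` fills the array in row-major order whereas
R's ``array`` fills in column-major order, so a given value may sit at different
index positions in the two languages.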
An expression using a list called ``a`` in R where you want to melt it
into a ``data.frame``:

.. code-block:: r

   a <- as.list(c(1:4, NA))
   data.frame(melt(a))

In Python, this list would be a list of tuples, so the
:class:`~pandas.DataFrame` constructor converts it to a dataframe as required.

.. ipython:: python

   a = list(enumerate(list(range(1, 5)) + [np.nan]))
   DataFrame(a)

For more details and examples see :ref:`the Intro to Data Structures documentation <basics.dataframe.from_items>`.

An expression using a ``data.frame`` called ``cheese`` in R where you want to
reshape the ``data.frame``:

.. code-block:: r

   cheese <- data.frame(
     first = c('John', 'Mary'),
     last = c('Doe', 'Bo'),
     height = c(5.5, 6.0),
     weight = c(130, 150)
   )
   melt(cheese, id=c("first", "last"))

In Python, the :func:`~pandas.melt` function is the equivalent of R's ``melt``:

.. ipython:: python

   cheese = DataFrame({'first': ['John', 'Mary'],
                       'last': ['Doe', 'Bo'],
                       'height': [5.5, 6.0],
                       'weight': [130, 150]})
   pd.melt(cheese, id_vars=['first', 'last'])
   cheese.set_index(['first', 'last']).stack()  # alternative way

For more details and examples see :ref:`the reshaping documentation <reshaping.melt>`.
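
As a small illustration of what the long format contains, the sketch below
melts the same data and then reverses the operation. The names ``measure`` and
``amount`` passed to ``var_name`` and ``value_name`` are arbitrary choices made
for this example.

```python
import pandas as pd

cheese = pd.DataFrame({'first': ['John', 'Mary'],
                       'last': ['Doe', 'Bo'],
                       'height': [5.5, 6.0],
                       'weight': [130, 150]})

# Long format: one row per (first, last, measure) combination.
long_df = pd.melt(cheese, id_vars=['first', 'last'],
                  var_name='measure', value_name='amount')

# Going back to wide format: index on the identifiers, then unstack the measure.
wide = long_df.set_index(['first', 'last', 'measure'])['amount'].unstack('measure')
```

Because ``height`` and ``weight`` end up in a single ``amount`` column, the
integer weights are stored as floats after melting.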
An expression using a ``data.frame`` called ``df`` in R to cast into a higher
dimensional array:

.. code-block:: r

   df <- data.frame(
     x = runif(12, 1, 168),
     y = runif(12, 7, 334),
     z = runif(12, 1.7, 20.7),
     month = rep(c(5,6,7),4),
     week = rep(c(1,2), 6)
   )

   mdf <- melt(df, id=c("month", "week"))
   acast(mdf, week ~ month ~ variable, mean)

In Python the best way is to make use of :func:`~pandas.pivot_table`:

.. ipython:: python

   df = DataFrame({
       'x': random.uniform(1., 168., 12),
       'y': random.uniform(7., 334., 12),
       'z': random.uniform(1.7, 20.7, 12),
       'month': [5, 6, 7] * 4,
       'week': [1, 2] * 6
   })
   mdf = pd.melt(df, id_vars=['month', 'week'])
   pd.pivot_table(mdf, values='value', index=['variable', 'week'],
                  columns=['month'], aggfunc='mean')

For more details and examples see :ref:`the reshaping documentation <reshaping.pivot>`.
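
The sketch below repeats the pivot with a fixed seed and checks one cell by
hand. It uses the ``index`` and ``columns`` keyword names accepted by recent
pandas releases; the seed and the ``table`` and ``expected`` names are
illustrative.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
df = pd.DataFrame({
    'x': rng.uniform(1., 168., 12),
    'y': rng.uniform(7., 334., 12),
    'z': rng.uniform(1.7, 20.7, 12),
    'month': [5, 6, 7] * 4,
    'week': [1, 2] * 6,
})

mdf = pd.melt(df, id_vars=['month', 'week'])
table = pd.pivot_table(mdf, values='value',
                       index=['variable', 'week'],
                       columns=['month'],
                       aggfunc='mean')

# One cell computed directly from the wide frame: mean of x for week 1, month 5.
expected = df[(df['week'] == 1) & (df['month'] == 5)]['x'].mean()
```

The result is a two-dimensional table with a hierarchical row index rather than
a true three-dimensional array, so ``table.loc[('x', 1), 5]`` plays the role of
indexing the ``acast`` result by week, month and variable.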