Description
Sometimes I need to "apply" complicated functions that return differently shaped DataFrames.
In these situations I often have to explicitly handle the case in which the result of the function is empty and return a properly shaped empty DataFrame.
Failing to return a properly shaped DataFrame normally raises an exception, but I found a case where the exception is raised only for a particular order of the groups:
In [1]: import pandas as pd
In [2]: pd.__version__
Out[2]: '0.10.1'
def f1(x):
    y = x[(x.b % 2) == 1]**2
    if y.empty:
        multiindex = pd.MultiIndex(
            levels = [[]]*2,
            labels = [[]]*2,
            names = ['b', 'c']
        )
        res = pd.DataFrame(None,
                           columns=['a'],
                           index=multiindex)
        return res
    else:
        y = y.set_index(['b','c'])
        return y
def f2(x):
    y = x[(x.b % 2) == 1]**2
    if y.empty:
        return pd.DataFrame()
    else:
        y = y.set_index(['b','c'])
        return y
def f3(x):
    y = x[(x.b % 2) == 1]**2
    if y.empty:
        multiindex = pd.MultiIndex(
            levels = [[]]*2,
            labels = [[]]*2,
            names = ['foo', 'bar']
        )
        res = pd.DataFrame(None,
                           columns=['a','b'],
                           index=multiindex)
        return res
    else:
        return y
df = pd.DataFrame({'a':[1,2,2,2],
                   'b':range(4),
                   'c':range(5,9)})
df2 = pd.DataFrame({'a':[3,2,2,2],
                    'b':range(4),
                    'c':range(5,9)})
f1 is the correct function, and it works.
f2 is wrong because it returns an empty DataFrame with a simple index, and it fails (as it should).
f3 is wrong too, but the exception is raised only with df2.
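The only difference between df and df2 is which group ends up empty after the odd-b filter, and where that group falls in the iteration order. The sketch below reproduces that in plain Python, without pandas. It assumes groups are visited in sorted key order (the groupby default), and the helper name empty_by_group is made up for this illustration.

```python
def empty_by_group(a, b):
    """For each value of `a`, in sorted order, report whether the
    odd-`b` filter used by f1/f2/f3 would leave no rows."""
    groups = {}
    for key, value in zip(a, b):
        groups.setdefault(key, []).append(value)
    # Assumption: keys are visited in sorted order, as groupby does by default.
    return [(key, not any(v % 2 == 1 for v in groups[key]))
            for key in sorted(groups)]


df_order = empty_by_group([1, 2, 2, 2], range(4))
df2_order = empty_by_group([3, 2, 2, 2], range(4))

print(df_order)   # [(1, True), (2, False)]  -> empty result comes first
print(df2_order)  # [(2, False), (3, True)]  -> empty result comes last
```

So with df the empty result is the first element of the list handed to concat, and with df2 it is the last.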
In [4]: df.groupby('a').apply(f1)
Out[4]:
        a
a b c
2 1 36  4
  9 64  4
In [5]: df2.groupby('a').apply(f1)
Out[5]:
        a
a b c
2 1 36  4
  9 64  4
In [6]: df.groupby('a').apply(f3)
Out[6]:
     a  b   c
a
2 1  4  1  36
  3  4  9  64
In [7]: df2.groupby('a').apply(f3)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
/home/ldeleo/<ipython console> in <module>()
/home/ldeleo/.local/lib/python2.7/site-packages/pandas-0.10.1-py2.7-linux-x86_64.egg/pandas/core/groupby.pyc in apply(self, func, *args, **kwargs)
320 func = _intercept_function(func)
321 f = lambda g: func(g, *args, **kwargs)
--> 322 return self._python_apply_general(f)
323
324 def _python_apply_general(self, f):
/home/ldeleo/.local/lib/python2.7/site-packages/pandas-0.10.1-py2.7-linux-x86_64.egg/pandas/core/groupby.pyc in _python_apply_general(self, f)
326
327 return self._wrap_applied_output(keys, values,
--> 328 not_indexed_same=mutated)
329
330 def aggregate(self, func, *args, **kwargs):
/home/ldeleo/.local/lib/python2.7/site-packages/pandas-0.10.1-py2.7-linux-x86_64.egg/pandas/core/groupby.pyc in _wrap_applied_output(self, keys, values, not_indexed_same)
1742 if isinstance(values[0], DataFrame):
1743 return self._concat_objects(keys, values,
-> 1744 not_indexed_same=not_indexed_same)
1745 elif hasattr(self.grouper, 'groupings'):
1746 if len(self.grouper.groupings) > 1:
/home/ldeleo/.local/lib/python2.7/site-packages/pandas-0.10.1-py2.7-linux-x86_64.egg/pandas/core/groupby.pyc in _concat_objects(self, keys, values, not_indexed_same)
483 group_names = self.grouper.names
484 result = concat(values, axis=self.axis, keys=group_keys,
--> 485 levels=group_levels, names=group_names)
486 else:
487 result = concat(values, axis=self.axis)
/home/ldeleo/.local/lib/python2.7/site-packages/pandas-0.10.1-py2.7-linux-x86_64.egg/pandas/tools/merge.pyc in concat(objs, axis, join, join_axes, ignore_index, keys, levels, names, verify_integrity)
892 ignore_index=ignore_index, join=join,
893 keys=keys, levels=levels, names=names,
--> 894 verify_integrity=verify_integrity)
895 return op.get_result()
896
/home/ldeleo/.local/lib/python2.7/site-packages/pandas-0.10.1-py2.7-linux-x86_64.egg/pandas/tools/merge.pyc in __init__(self, objs, axis, join, join_axes, keys, levels, names, ignore_index, verify_integrity)
962 self.verify_integrity = verify_integrity
963
--> 964 self.new_axes = self._get_new_axes()
965
966 def get_result(self):
/home/ldeleo/.local/lib/python2.7/site-packages/pandas-0.10.1-py2.7-linux-x86_64.egg/pandas/tools/merge.pyc in _get_new_axes(self)
1134 concat_axis = None
1135 else:
-> 1136 concat_axis = self._get_concat_axis()
1137
1138 new_axes[self.axis] = concat_axis
/home/ldeleo/.local/lib/python2.7/site-packages/pandas-0.10.1-py2.7-linux-x86_64.egg/pandas/tools/merge.pyc in _get_concat_axis(self)
1169 else:
1170 concat_axis = _make_concat_multiindex(indexes, self.keys,
-> 1171 self.levels, self.names)
1172
1173 self._maybe_check_integrity(concat_axis)
/home/ldeleo/.local/lib/python2.7/site-packages/pandas-0.10.1-py2.7-linux-x86_64.egg/pandas/tools/merge.pyc in _make_concat_multiindex(indexes, keys, levels, names)
1243 names = names + _get_consensus_names(indexes)
1244
-> 1245 return MultiIndex(levels=levels, labels=label_list, names=names)
1246
1247 new_index = indexes[0]
/home/ldeleo/.local/lib/python2.7/site-packages/pandas-0.10.1-py2.7-linux-x86_64.egg/pandas/core/index.pyc in __new__(cls, levels, labels, sortorder, names)
1343 if len(names) != subarr.nlevels:
1344 raise AssertionError(('Length of names must be same as level '
-> 1345 '(%d), got %d') % (subarr.nlevels))
1346
1347 subarr.names = list(names)
TypeError: not enough arguments for format string
There are actually two bugs. The first is simply that on line 1345 the format arguments should be (subarr.nlevels, len(names)) instead of (subarr.nlevels), so the intended AssertionError is masked by a TypeError.
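The first bug can be reproduced in isolation. The two functions below are stand-ins written for this illustration (they are not the pandas source), and they copy only the message formatting from the traceback.

```python
def check_names_buggy(nlevels, names):
    if len(names) != nlevels:
        # Two %d placeholders but one value: (nlevels) is just nlevels,
        # not a tuple, so the formatting itself raises TypeError.
        raise AssertionError(('Length of names must be same as level '
                              '(%d), got %d') % (nlevels))


def check_names_fixed(nlevels, names):
    if len(names) != nlevels:
        # Both values supplied as a tuple, so the AssertionError is raised.
        raise AssertionError(('Length of names must be same as level '
                              '(%d), got %d') % (nlevels, len(names)))


for check in (check_names_buggy, check_names_fixed):
    try:
        check(1, [None, None])
    except (TypeError, AssertionError) as exc:
        print(type(exc).__name__, exc)
```

The buggy version reports "not enough arguments for format string", as in the traceback. The fixed version reports the length mismatch that the check was meant to signal.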
The second is more serious and goes back to _get_consensus_names (index.py:2671).
I guess that the task of this function is, given a list of possibly different indexes, to return the minimal "shape" that can contain all of them.
What it does now, however, is return the "shape" of the second type encountered in the list.
For this reason, when applied to df, it acts on
[MultiIndex[], Int64Index([1,3])]
and returns [[None]].
When applied to df2, which differs only in the order of the groups, it acts on
[Int64Index([1,3]), MultiIndex[]]
and returns [[None], [None]], and then fails.
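The order dependence described above can be sketched with a toy model. It assumes the behavior is as guessed (the first entry that differs from the first one wins) and represents each index only by a tuple of its level names. Both functions are hypothetical: second_shape_names is not the pandas implementation, and strict_consensus_names is just one possible order-independent alternative.

```python
def second_shape_names(name_lists):
    """Toy model of the reported behavior: return the first entry that
    differs from entry 0, or entry 0 if they are all equal."""
    consensus = name_lists[0]
    for names in name_lists[1:]:
        if names != consensus:
            return names
    return consensus


def strict_consensus_names(name_lists):
    """Hypothetical alternative whose answer does not depend on input order."""
    level_counts = {len(names) for names in name_lists}
    if len(level_counts) != 1:
        raise ValueError("indexes have different numbers of levels: %s"
                         % sorted(level_counts))
    # Keep a level name only where every input agrees on it.
    return tuple(col[0] if len(set(col)) == 1 else None
                 for col in zip(*name_lists))


one_level = (None,)        # stands in for Int64Index([1,3])
two_levels = (None, None)  # stands in for the empty two-level MultiIndex

print(second_shape_names([two_levels, one_level]))  # (None,)
print(second_shape_names([one_level, two_levels]))  # (None, None)
```

In this sketch the strict variant raises ValueError for both orderings, so a badly shaped result would fail in the same way whichever group happens to be empty.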
I agree that the case I showed is quite abstruse, but the unpredictability of the behavior is quite dangerous in my opinion.