Groupby - apply concatenation of indexes depends on the order of the groups #2808

@l736x

Sometimes I need to "apply" complicated functions that return differently shaped dataframes.
In these situations I often have to handle explicitly the case in which the result of the function is empty, and return a properly shaped empty dataframe.

Failing to return a proper df normally raises an exception, but I found a case where whether the exception is raised depends on the order of the groups:

In [1]: import pandas as pd

In [2]: pd.__version__
Out[2]: '0.10.1'

def f1(x):
    y = x[(x.b % 2) == 1]**2
    if y.empty:
        multiindex = pd.MultiIndex(
                levels = [[]]*2,
                labels = [[]]*2,
                names = ['b', 'c']
        )
        res = pd.DataFrame(None,
                           columns=['a'],
                           index=multiindex)
        return res
    else:
        y = y.set_index(['b','c'])
        return y

def f2(x):
    y = x[(x.b % 2) == 1]**2
    if y.empty:
        return pd.DataFrame()
    else:
        y = y.set_index(['b','c'])
        return y

def f3(x):
    y = x[(x.b % 2) == 1]**2
    if y.empty:
        multiindex = pd.MultiIndex(
                levels = [[]]*2,
                labels = [[]]*2,
                names = ['foo', 'bar']
        )
        res = pd.DataFrame(None,
                           columns=['a','b'],
                           index=multiindex)
        return res
    else:
        return y

df = pd.DataFrame({'a':[1,2,2,2],
                   'b':range(4),
                   'c':range(5,9)})

df2 = pd.DataFrame({'a':[3,2,2,2],
                    'b':range(4),
                    'c':range(5,9)})

f1 is the correct function and it works.
f2 is wrong because it returns an empty dataframe with a simple index, and it fails (as it should).
f3 is wrong, but the exception is raised only with df2.

In [4]: df.groupby('a').apply(f1)
Out[4]:
        a
a b c
2 1 36  4
  9 64  4

In [5]: df2.groupby('a').apply(f1)
Out[5]:
        a
a b c
2 1 36  4
  9 64  4

In [6]: df.groupby('a').apply(f3)
Out[6]:
     a  b   c
a
2 1  4  1  36
  3  4  9  64

In [7]: df2.groupby('a').apply(f3)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)

/home/ldeleo/<ipython console> in <module>()

/home/ldeleo/.local/lib/python2.7/site-packages/pandas-0.10.1-py2.7-linux-x86_64.egg/pandas/core/groupby.pyc in apply(self, func, *args, **kwargs)
    320         func = _intercept_function(func)
    321         f = lambda g: func(g, *args, **kwargs)
--> 322         return self._python_apply_general(f)
    323
    324     def _python_apply_general(self, f):

/home/ldeleo/.local/lib/python2.7/site-packages/pandas-0.10.1-py2.7-linux-x86_64.egg/pandas/core/groupby.pyc in _python_apply_general(self, f)
    326
    327         return self._wrap_applied_output(keys, values,
--> 328                                          not_indexed_same=mutated)
    329
    330     def aggregate(self, func, *args, **kwargs):

/home/ldeleo/.local/lib/python2.7/site-packages/pandas-0.10.1-py2.7-linux-x86_64.egg/pandas/core/groupby.pyc in _wrap_applied_output(self, keys, values, not_indexed_same)
   1742         if isinstance(values[0], DataFrame):
   1743             return self._concat_objects(keys, values,
-> 1744                                         not_indexed_same=not_indexed_same)
   1745         elif hasattr(self.grouper, 'groupings'):
   1746             if len(self.grouper.groupings) > 1:

/home/ldeleo/.local/lib/python2.7/site-packages/pandas-0.10.1-py2.7-linux-x86_64.egg/pandas/core/groupby.pyc in _concat_objects(self, keys, values, not_indexed_same)
    483             group_names = self.grouper.names
    484             result = concat(values, axis=self.axis, keys=group_keys,
--> 485                             levels=group_levels, names=group_names)
    486         else:
    487             result = concat(values, axis=self.axis)

/home/ldeleo/.local/lib/python2.7/site-packages/pandas-0.10.1-py2.7-linux-x86_64.egg/pandas/tools/merge.pyc in concat(objs, axis, join, join_axes, ignore_index, keys, levels, names, verify_integrity)
    892                        ignore_index=ignore_index, join=join,
    893                        keys=keys, levels=levels, names=names,
--> 894                        verify_integrity=verify_integrity)
    895     return op.get_result()
    896

/home/ldeleo/.local/lib/python2.7/site-packages/pandas-0.10.1-py2.7-linux-x86_64.egg/pandas/tools/merge.pyc in __init__(self, objs, axis, join, join_axes, keys, levels, names, ignore_index, verify_integrity)
    962         self.verify_integrity = verify_integrity
    963
--> 964         self.new_axes = self._get_new_axes()
    965
    966     def get_result(self):

/home/ldeleo/.local/lib/python2.7/site-packages/pandas-0.10.1-py2.7-linux-x86_64.egg/pandas/tools/merge.pyc in _get_new_axes(self)
   1134             concat_axis = None
   1135         else:
-> 1136             concat_axis = self._get_concat_axis()
   1137
   1138         new_axes[self.axis] = concat_axis

/home/ldeleo/.local/lib/python2.7/site-packages/pandas-0.10.1-py2.7-linux-x86_64.egg/pandas/tools/merge.pyc in _get_concat_axis(self)
   1169         else:
   1170             concat_axis = _make_concat_multiindex(indexes, self.keys,
-> 1171                                                   self.levels, self.names)
   1172
   1173         self._maybe_check_integrity(concat_axis)

/home/ldeleo/.local/lib/python2.7/site-packages/pandas-0.10.1-py2.7-linux-x86_64.egg/pandas/tools/merge.pyc in _make_concat_multiindex(indexes, keys, levels, names)
   1243             names = names + _get_consensus_names(indexes)
   1244
-> 1245         return MultiIndex(levels=levels, labels=label_list, names=names)
   1246
   1247     new_index = indexes[0]

/home/ldeleo/.local/lib/python2.7/site-packages/pandas-0.10.1-py2.7-linux-x86_64.egg/pandas/core/index.pyc in __new__(cls, levels, labels, sortorder, names)
   1343             if len(names) != subarr.nlevels:
   1344                 raise AssertionError(('Length of names must be same as level '
-> 1345                                       '(%d), got %d') % (subarr.nlevels))
   1346
   1347             subarr.names = list(names)

TypeError: not enough arguments for format string

There are actually two bugs. The first is simply that on line 1345 of index.py the format arguments should be (subarr.nlevels, len(names)) instead of (subarr.nlevels), so the intended AssertionError is masked by a TypeError.
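
The first bug can be reproduced without pandas. The following is a minimal sketch in plain Python; the values of nlevels and names are made up for illustration and are not taken from the failing call.

```python
# Illustrative values only (assumptions, not read from the traceback).
nlevels = 2
names = ['a', 'foo', 'bar']

template = 'Length of names must be same as level (%d), got %d'

# Buggy form: two %d placeholders but a single value. (nlevels) is just an
# int in parentheses, not a tuple, so the formatting itself raises TypeError
# before any AssertionError can be constructed.
try:
    template % (nlevels)
    buggy_error = None
except TypeError as exc:
    buggy_error = str(exc)

# Fixed form: one value per placeholder.
fixed_message = template % (nlevels, len(names))

print(buggy_error)
print(fixed_message)
```

With the fix, the caller would see the AssertionError about mismatched names, which points much more directly at the real problem.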

The second is more serious and goes back to _get_consensus_names (index.py:2671).
I guess that the task of this function is, given a list of possibly different indexes, to return the minimal "shape" that can contain all of them.
What it does now, however, is return the "shape" of the second type encountered in the list.

For this reason, when applied to df, it acts on

[MultiIndex[], Int64Index([1, 3])]

giving back [[None]]

When applied to df2, which differs only in the order of the groups, it acts on

[Int64Index([1, 3]), MultiIndex[]]

and it gives back [[None], [None]] and then fails.
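
To make the order dependence concrete, here is a toy model in plain Python. It is not the pandas source: each index is represented only by its list of level names, consensus_names_order_dependent is a hypothetical stand-in that mimics the behavior described above, and consensus_names_order_independent is one possible alternative (the widest index decides the shape), offered as an assumption about what "minimal shape" could mean.

```python
def consensus_names_order_dependent(name_lists):
    """Toy model of the described behavior (hypothetical, not pandas code)."""
    consensus = list(name_lists[0])
    for names in name_lists[1:]:
        if list(names) != consensus:
            # The shape comes from the first entry that differs from the
            # first one, so the result depends on the order of the input.
            return [None] * len(names)
    return consensus


def consensus_names_order_independent(name_lists):
    """Hypothetical alternative: the widest index decides the shape."""
    first = list(name_lists[0])
    if all(list(names) == first for names in name_lists):
        return first
    return [None] * max(len(names) for names in name_lists)


multi = ['foo', 'bar']  # stands in for the empty two-level MultiIndex
flat = [None]           # stands in for Int64Index([1, 3])

print(consensus_names_order_dependent([multi, flat]))    # df-like order
print(consensus_names_order_dependent([flat, multi]))    # df2-like order
print(consensus_names_order_independent([multi, flat]))
print(consensus_names_order_independent([flat, multi]))
```

Whether the widest shape is the right answer is a design question for the maintainers; the point of the sketch is only that the result should not change when the groups are reordered.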

I agree that the case I showed is quite abstruse, but the unpredictability of the behavior is quite dangerous in my opinion.
