BUG: Groupby lost index, when one of the agg keys had no function allocated #33086
Conversation
@@ -439,7 +439,13 @@ def is_any_frame() -> bool:
# we have a dict of DataFrames
# return a MI DataFrame

- return concat([result[k] for k in keys], keys=keys, axis=1), True
+ keys_to_use = [k for k in keys if not result[k].empty]
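For illustration, a minimal sketch of what the changed line does, using hypothetical stand-ins for result and keys; the all-empty fallback follows the PR description at the bottom and may not match the merged code exactly:

import pandas as pd

# Hypothetical stand-ins: "b" had no aggregation function assigned, so its
# result frame is empty; "c" holds the result of the min aggregation.
result = {
    "b": pd.DataFrame(),
    "c": pd.DataFrame({"min": [6, 8]}, index=pd.Index([1, 2], name="a")),
}
keys = ["b", "c"]

# The patched line: keep only keys whose result frame is non-empty, so the
# empty frame cannot wipe out the group index during the concat below.
keys_to_use = [k for k in keys if not result[k].empty]
# Guard for the case where every frame is empty (sketch only).
keys_to_use = keys_to_use or keys

out = pd.concat([result[k] for k in keys_to_use], keys=keys_to_use, axis=1)
print(out.index.name)  # "a" - the group index keeps its name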
I think the bug is not actually here; rather, please see where concat actually mishandles this and adjust there. concat handles None / empty frames, so it must not be discarding the keys when that happens.
we don't want to compare vs [], rather not len(...)
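A tiny illustration of the emptiness check being asked for here (hypothetical variable, not the actual patch):

keys_to_use = []  # hypothetical result of the filtering step

if keys_to_use == []:      # discouraged: compares against a literal list
    print("empty, via == []")

if not len(keys_to_use):   # preferred form from the review
    print("empty, via not len(...)")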
@jreback The desired behavior for concat would be that it keeps all information and returns the DataFrame with the right index. Did I get that right?
Then I will look into this.
yes
@jreback
I looked into this now. I think everything in concat works as expected. We have the following starting point:
The concat function receives two DataFrames as input:
- the empty DataFrame with index Index([], dtype="object", name=None)
- the c-DataFrame with index Int64Index([1, 2], dtype="int64", name="a")

concat performs the following relevant steps in our case:
- It casts both indices to datatype object, because they do not match beforehand.
- It determines the name as follows: if both names are equal, the common name is returned. If the names differ, None is returned (our case, because None != "a"); see the sketch right after this list. You can look this up here: https://github.com/pandas-dev/pandas/blob/master/pandas/core/ops/__init__.py#L139
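A minimal sketch of these two steps on indexes like the ones above (hypothetical, for illustration only; exact dtype behavior may vary between pandas versions):

import pandas as pd

idx_c = pd.Index([1, 2], name="a")        # index of the c-DataFrame
idx_empty = pd.Index([], dtype="object")  # index of the empty frame, name=None

combined = idx_c.union(idx_empty)
print(combined.dtype)  # object on the versions discussed here, since the dtypes differ
print(combined.name)   # None, because "a" != None - the name is dropped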
As far as I understood the code, we could do the following to change this behavior:
- Change the logic during the name definition to return the non-None name if one of them is None (this breaks tests such as test_maybe_match_name, so I think this idea is not that good; a rough sketch follows this list).
- We could also change the function which casts both indices to datatype object, to avoid dtype issues in the resulting index. We would have to modify the code there.
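A rough sketch of the first alternative, i.e. preferring the non-None name (a hypothetical helper, not pandas API, and not what was merged, since it breaks tests such as test_maybe_match_name):

def match_name_prefer_non_none(left_name, right_name):
    # Equal names are kept, as today.
    if left_name == right_name:
        return left_name
    # Proposed change: if exactly one side has a name, keep it.
    if left_name is None:
        return right_name
    if right_name is None:
        return left_name
    # Genuinely different names still resolve to None.
    return None

print(match_name_prefer_non_none("a", None))  # "a" instead of None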
Either change would affect the index handling itself, which is not directly related to our groupby issue.
Alternatively, we could modify the DataFrames in the concat code before the final index is defined, but I think that is not a good idea.
Could you tell me how to proceed?
I am not sure I agree with supporting this case. What does an empty list in an aggregation even mean?
Should the calculation fail as a whole in this case?
will look at this again - might be ok with your soln
can you merge master and will have a look
# Conflicts:
#	doc/source/whatsnew/v1.1.0.rst
@jreback merged master
kk ping on green.
@jreback green, can be merged
thanks @phofl
- black pandas
- git diff upstream/master -u -- "*.py" | flake8 --diff
The issue was the concatenation of an empty DataFrame with the result of the min function, which caused the index to be lost.
I changed the input for the concatenation so that only non-empty DataFrames are concatenated. We have to catch the case where all DataFrames are empty, because that would otherwise result in an error.
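For context, a minimal sketch of the kind of aggregation being discussed (hypothetical data; the exact reproducer from the original report may differ):

import pandas as pd

# One agg key ("b") gets an empty list of functions, the other ("c") gets min.
df = pd.DataFrame({"a": [1, 1, 2], "b": [3, 4, 5], "c": [6, 7, 8]})
result = df.groupby("a").agg({"b": [], "c": ["min"]})

print(result.index)
# On the affected pandas versions the group index reportedly came back without
# its values/name; with this fix it should be the "a" index with values [1, 2].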