
BUG(string dtype): groupby/resampler.min/max returns float on all NA strings #60985


Open
rhshadrach wants to merge 8 commits into base: main

Conversation

rhshadrach
Member

Built on top of #60936

@rhshadrach rhshadrach added Groupby Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Strings String extension data type and string data Reduction Operations sum, mean, min, max, etc. Bug labels Mar 19, 2025
@rhshadrach rhshadrach added this to the 2.3 milestone Mar 19, 2025
@rhshadrach rhshadrach marked this pull request as ready for review March 23, 2025 12:07
expected_dtype, expected_value = dtype, pd.NA
if reduction_func in ["all", "any"]:
    expected_dtype = "bool"
    # TODO: For skipna=False, bool(pd.NA) raises; should groupby?
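For context on the TODO above, a quick illustration of the underlying issue: pd.NA deliberately has no truth value, so coercing it to bool raises, which is what makes skipna=False all/any on all-NA groups awkward.

```python
import pandas as pd

# pd.NA has no defined truthiness, so coercing it to bool raises
# TypeError ("boolean value of NA is ambiguous").
try:
    bool(pd.NA)
    raised = False
except TypeError:
    raised = True
```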
Member

It looks like there are a few TODOs / inconsistencies in our interface here. I think that rather than branching and trying to document all of them, it would help to simplify this test and just skip/xfail the cases where things are not consistent. It may even be helpful to split this up into multiple tests that are more focused.

Member Author

I think rather than branching and trying to document all of them, it would help to simplify this test and just skip/xfail the cases where things are not consistent.

If they added significant complexity, I would agree. However, the complexity added seems minimal to me, and testing the current behavior tells us when it changes. So even if it's not the final behavior we'd desire, testing it seems better than skipping or xfailing.

It may even be helpful to split this up into multiple tests that are more focused

If there were a good way of doing this while ensuring we are going through all the reduction funcs, I'd definitely be on board. However, to my knowledge there is not.

Member

It's challenging in the current state because it's nearly impossible to tell what this test is trying to say about the expected behavior of things. Perhaps I am misreading: how would you summarize this unit test?

Member Author

how would you summarize this unit test?

Testing groupby on input of string dtype where all values are NA. But I fear I might be misunderstanding the question.
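As a minimal sketch of the behavior under test (the column names here are illustrative; the actual test parametrizes over all reduction funcs):

```python
import pandas as pd

# An all-NA string column reduced per group. On a fixed pandas,
# min preserves the "string" dtype and yields pd.NA; the bug
# reported here produced a float64 NaN instead.
df = pd.DataFrame(
    {"key": [1, 1], "val": pd.array([pd.NA, pd.NA], dtype="string")}
)
result = df.groupby("key")["val"].min()
```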

Member Author

@WillAyd - friendly ping.

Member
@WillAyd WillAyd Apr 14, 2025

I still think this needs to be split into multiple tests and/or simplified so that it remains a unit test (or tests).

Member Author

  • I would be happy to do so if there is a way to ensure it runs across all reducers, but I do not see a way to make that happen.
  • A large number of bugs I've worked on over the years existed because so much of our test suite only checks one or a couple of ops, rather than all of them.
  • This is a common pattern in our tests and by no means the worst offender, e.g.
    def test_empty_groupby(columns, keys, values, method, op, dropna, using_infer_string):
        # GH8093 & GH26411
        override_dtype = None

        if isinstance(values, BooleanArray) and op in ["sum", "prod"]:
            # We expect to get Int64 back for these
            override_dtype = "Int64"

        if isinstance(values[0], bool) and op in ("prod", "sum"):
            # sum/product of bools is an integer
            override_dtype = "int64"

        df = DataFrame({"A": values, "B": values, "C": values}, columns=list("ABC"))

        if hasattr(values, "dtype"):
            # check that we did the construction right
            assert (df.dtypes == values.dtype).all()

        df = df.iloc[:0]

        gb = df.groupby(keys, group_keys=False, dropna=dropna, observed=False)[columns]

        def get_result(**kwargs):
            if method == "attr":
                return getattr(gb, op)(**kwargs)
            else:
                return getattr(gb, method)(op, **kwargs)

        def get_categorical_invalid_expected():
            # Categorical is special without 'observed=True', we get an NaN entry
            # corresponding to the unobserved group. If we passed observed=True
            # to groupby, expected would just be 'df.set_index(keys)[columns]'
            # as below
            lev = Categorical([0], dtype=values.dtype)
            if len(keys) != 1:
                idx = MultiIndex.from_product([lev, lev], names=keys)
            else:
                # all columns are dropped, but we end up with one row
                # Categorical is special without 'observed=True'
                idx = Index(lev, name=keys[0])

            if using_infer_string:
                columns = Index([], dtype="str")
            else:
                columns = []
            expected = DataFrame([], columns=columns, index=idx)
            return expected

        is_per = isinstance(df.dtypes.iloc[0], pd.PeriodDtype)
        is_dt64 = df.dtypes.iloc[0].kind == "M"
        is_cat = isinstance(values, Categorical)
        is_str = isinstance(df.dtypes.iloc[0], pd.StringDtype)

        if (
            isinstance(values, Categorical)
            and not values.ordered
            and op in ["min", "max", "idxmin", "idxmax"]
        ):
            if op in ["min", "max"]:
                msg = f"Cannot perform {op} with non-ordered Categorical"
                klass = TypeError
            else:
                msg = f"Can't get {op} of an empty group due to unobserved categories"
                klass = ValueError
            with pytest.raises(klass, match=msg):
                get_result()

            if op in ["min", "max", "idxmin", "idxmax"] and isinstance(columns, list):
                # i.e. DataframeGroupBy, not SeriesGroupBy
                result = get_result(numeric_only=True)
                expected = get_categorical_invalid_expected()
                tm.assert_equal(result, expected)
            return

        if op in ["prod", "sum", "skew", "kurt"]:
            # ops that require more than just ordered-ness
            if is_dt64 or is_cat or is_per or (is_str and op != "sum"):
                # GH#41291
                # datetime64 -> prod and sum are invalid
                if is_dt64:
                    msg = "datetime64 type does not support"
                elif is_per:
                    msg = "Period type does not support"
                elif is_str:
                    msg = f"dtype 'str' does not support operation '{op}'"
                else:
                    msg = "category type does not support"
                if op in ["skew", "kurt"]:
                    msg = "|".join([msg, f"does not support operation '{op}'"])
                with pytest.raises(TypeError, match=msg):
                    get_result()

                if not isinstance(columns, list):
                    # i.e. SeriesGroupBy
                    return
                elif op in ["skew", "kurt"]:
                    # TODO: test the numeric_only=True case
                    return
                else:
                    # i.e. op in ["prod", "sum"]:
                    # i.e. DataFrameGroupBy
                    # ops that require more than just ordered-ness
                    # GH#41291
                    result = get_result(numeric_only=True)

                    # with numeric_only=True, these are dropped, and we get
                    # an empty DataFrame back
                    expected = df.set_index(keys)[[]]
                    if is_cat:
                        expected = get_categorical_invalid_expected()
                    tm.assert_equal(result, expected)
                    return

        result = get_result()
        expected = df.set_index(keys)[columns]
        if op in ["idxmax", "idxmin"]:
            expected = expected.astype(df.index.dtype)
        if override_dtype is not None:
            expected = expected.astype(override_dtype)
        if len(keys) == 1:
            expected.index.name = keys[0]
        tm.assert_equal(result, expected)

Member Author

@WillAyd - friendly ping

Member

I still have the same feedback as before about making this look more like a unit test. If you don't want to do that, it's no problem, but it's probably worth pinging someone else to review then.

Member

I think ideally we would have had some EA tests like BaseReduceTests, but for groupby, which would give us a more "standard" way to cover this. Although this test appears onerous, I'm fine having it like this for now so as not to block this PR.

    res_values = res_values.astype(object, copy=False)
elif is_string_dtype(dtype) and how in ["min", "max"]:
Member

Is there a way to avoid special-casing these functions here? Where is the return value from other functions being handled?

Member Author

Yeah, good call. We only get here with min/max today. If we do end up here with other ops at some point in the future, either (a) the dtype is already correct, in which case _from_sequence is O(1), or (b) we want to cast. So I've removed the condition.
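For readers following along, a minimal sketch of the cast-back step being discussed (this uses the pandas-internal `ExtensionDtype.construct_array_type` hook and the private `_from_sequence` constructor; the helper name is hypothetical and signatures may vary across versions):

```python
import pandas as pd

# Hypothetical helper mirroring the discussion above: rebuild an
# extension array of the requested dtype from raw result values.
# If the values already match the dtype this is cheap; otherwise
# it performs the desired cast.
def cast_back(res_values, dtype):
    cls = dtype.construct_array_type()
    return cls._from_sequence(res_values, dtype=dtype)

out = cast_back(["a", pd.NA], pd.StringDtype())
```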

@rhshadrach rhshadrach requested a review from WillAyd March 29, 2025 12:49
Labels
Bug Groupby Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Reduction Operations sum, mean, min, max, etc. Strings String extension data type and string data

Successfully merging this pull request may close these issues.

BUG: Inconsistent dtype with GroupBy for StrDtype and all missing values
3 participants