Pandas cut method gives an error if labels are non-unique #33141


Closed
harmbuisman opened this issue Mar 30, 2020 · 8 comments · Fixed by #33480
Labels: Bug, cut (cut, qcut)
Milestone: 1.1

@harmbuisman

Code Sample, a copy-pastable example if possible

import pandas as pd
data = pd.DataFrame({'morf':[8142, 8153, 8161]})
bins = [8140, 8150, 8160, 8163]
labels = ['Adenocarcinomas', 'Other specific carcinomas', 'Adenocarcinomas']

data['group'] = pd.cut(data['morf'], bins, labels=labels)

Problem description

I want to use pandas cut to fill in labels for ranges that are scattered. E.g. I want to map the following table onto my data, where the definitions span ranges that are scattered across the morphology continuum:
[image: table mapping morphology code ranges to group labels]

The above code gives the error:
ValueError: Categorical categories must be unique

Expected Output

I expect the example above to run and give back the labeled dataframe without an error.

My workaround is to add the index to the labels and then remove that again after the cut.
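The workaround described above might look roughly like the following. This is a minimal sketch, not the reporter's actual code: the separator `SEP` and the name `unique_labels` are hypothetical, and the separator is assumed not to occur inside any label.

```python
import pandas as pd

data = pd.DataFrame({"morf": [8142, 8153, 8161]})
bins = [8140, 8150, 8160, 8163]
labels = ["Adenocarcinomas", "Other specific carcinomas", "Adenocarcinomas"]

# Hypothetical separator; assumed not to appear in any label.
SEP = "|"

# Prefix every label with its position so that all labels are unique.
unique_labels = [f"{i}{SEP}{label}" for i, label in enumerate(labels)]

# cut now accepts the labels because no two of them are equal.
cut = pd.cut(data["morf"], bins, labels=unique_labels)

# Strip the positional prefix again to recover the original label text.
# Note: values outside all bins would become the string "nan" here.
data["group"] = [s.split(SEP, 1)[1] for s in cut.astype(str)]
```

The resulting column holds plain strings rather than a Categorical, which is usually acceptable when the labels have no meaningful order.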

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.7.6.final.0
python-bits : 64
OS : Windows
OS-release : 10
machine : AMD64
processor : Intel64 Family 6 Model 142 Stepping 10, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.None

pandas : 1.0.1
numpy : 1.18.1
pytz : 2019.3
dateutil : 2.8.1
pip : 20.0.2
setuptools : 45.2.0.post20200209
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.1
IPython : 7.12.0
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : 3.1.3
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pytest : None
pyxlsb : None
s3fs : None
scipy : 1.4.1
sqlalchemy : None
tables : None
tabulate : None
xarray : 0.15.0
xlrd : 1.2.0
xlwt : None
xlsxwriter : None
numba : 0.48.0

@TomAugspurger
Contributor

I don't think that non-unique labels are possible to support with the current implementation of cut and Categorical, and I don't think that is likely to change any time soon.

@harmbuisman
Author

> I don't think that non-unique labels are possible to support with the current implementation of cut and Categorical, and I don't think that is likely to change any time soon.

@TomAugspurger, thanks for your assessment. Just to clarify, I do not want multiple categories with the same label.

Would it be possible to skip the conversion to Categorical when a new pd.cut argument is set, e.g. labels_as_categories=False? The conversion happens in tile.py, lines 415-416, in the _bins_to_cuts method:

if not is_categorical_dtype(labels):
    labels = Categorical(labels, categories=labels, ordered=True)

@TomAugspurger
Contributor

TomAugspurger commented Mar 30, 2020 via email

@dsaxton
Member

dsaxton commented Mar 30, 2020

@harmbuisman One workaround would be to set labels to False and apply a map to the output:

pd.cut(data["morf"], bins, labels=False).map({i: x for i, x in enumerate(labels)})

@jschendel jschendel added the cut cut, qcut label Mar 31, 2020
@jschendel
Member

You can get around this by coercing labels to an unordered Categorical prior to calling cut:

In [1]: import pandas as pd; pd.__version__ 
Out[1]: '1.0.3'

In [2]: data = pd.DataFrame({'morf':[8142, 8153, 8161]}) 
   ...: bins = [8140, 8150, 8160, 8163] 
   ...: labels = ['Adenocarcinomas', 'Other specific carcinomas', 'Adenocarcinomas']

In [3]: pd.cut(data['morf'], bins, labels=pd.Categorical(labels))
Out[3]: 
0              Adenocarcinomas
1    Other specific carcinomas
2              Adenocarcinomas
Name: morf, dtype: category
Categories (2, object): [Adenocarcinomas, Other specific carcinomas]

Maybe I'm reading too much into the implementation of cut, but based on how we always coerce to an ordered Categorical, it looks like one of the assumptions is that labels are strictly ordered (i.e. label0 < label1 < label2), probably because the corresponding bins are required to be. This assumption doesn't appear to be documented anywhere, and it's not always enforced as shown above since we don't coerce an unordered Categorical to an ordered Categorical or raise.

It seems like a simple fix here could be to attempt to coerce to an ordered Categorical first and if it fails fall back to trying to coerce to an unordered Categorical:

diff --git a/pandas/core/reshape/tile.py b/pandas/core/reshape/tile.py
index 11fb8cc12..9ab863bcf 100644
--- a/pandas/core/reshape/tile.py
+++ b/pandas/core/reshape/tile.py
@@ -413,7 +413,11 @@ def _bins_to_cuts(
                 )
 
         if not is_categorical_dtype(labels):
-            labels = Categorical(labels, categories=labels, ordered=True)
+            try:
+                labels = Categorical(labels, categories=labels, ordered=True)
+            except ValueError:
+                # GH 33141: maybe dupes, attempt to coerce to an unordered categorical
+                labels = Categorical(labels)
 
         np.putmask(ids, na_mask, 0)
         result = algos.take_nd(labels, ids - 1)
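The fallback in the diff can be tried in isolation, outside of cut. The sketch below only mirrors the try/except logic; `coerce_labels` is a hypothetical helper name and is not part of pandas.

```python
import pandas as pd


def coerce_labels(labels):
    """Hypothetical helper mirroring the try/except proposed in the diff."""
    try:
        # Unique labels: keep the given order as the category order.
        return pd.Categorical(labels, categories=labels, ordered=True)
    except ValueError:
        # Duplicate labels: categories must be unique, so fall back to an
        # unordered Categorical whose categories are inferred from the values.
        return pd.Categorical(labels)


unique = coerce_labels(["low", "mid", "high"])
dupes = coerce_labels(
    ["Adenocarcinomas", "Other specific carcinomas", "Adenocarcinomas"]
)
```

With unique labels the result is ordered; with duplicates it is unordered and has only two categories, which is the mixed behaviour discussed in the rest of this comment.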

I could see an argument for not allowing this though, as having cut return both ordered and unordered categoricals depending on labels could be a bit confusing/surprising. If we don't want to allow it we should also probably raise when labels is passed in as an unordered Categorical for consistency (breaking my proposed workaround above).

That being said, having duplicate labels seems like a reasonable use case to support. In addition to the example in the bug description, I could imagine another use case where one wants to give the same label to outliers occurring at both the extreme low and high ends of the data.
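The outlier use case could be sketched as follows, using the labels=False plus map approach suggested earlier in the thread. The values and bin edges are made up for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative data: anything at or below 0, or above 10, counts as an outlier.
values = pd.Series([-50, 3, 7, 120])
bins = [-np.inf, 0, 10, np.inf]
labels = ["outlier", "typical", "outlier"]

# labels=False returns the integer index of the bin for each value.
codes = pd.cut(values, bins, labels=False)

# Map each bin index to its (possibly repeated) label.
flagged = codes.map(dict(enumerate(labels)))
```

Both tails end up with the same label, at the cost of getting an object-dtype Series back instead of a Categorical.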

@dsaxton
Member

dsaxton commented Mar 31, 2020

@jschendel I could actually understand just always treating the labels as-is without trying to impose the same order as bins, since there are situations where you might want labels that aren't monotone (like if you're doing target encoding for a machine learning model for example) or just have no order at all. If you do want that order it's very easy to provide labels that have it by default (like a range). Consider this unusual example caused by using the bin order (1 < 2 < 0?):

In [4]: pd.cut([1, 3, 5], bins=[0, 2, 4, 6], labels=[1, 2, 0])                  
Out[4]: 
[1, 2, 0]
Categories (3, int64): [1 < 2 < 0]
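A small sketch of what that imposed order means in practice: sorting and max follow the bin order of the labels, not their numeric value.

```python
import pandas as pd

result = pd.cut([1, 3, 5], bins=[0, 2, 4, 6], labels=[1, 2, 0])
s = pd.Series(result)

# Sorting follows the category order 1 < 2 < 0, not the numeric order.
sorted_values = s.sort_values().tolist()

# The "largest" category is the label of the last bin, i.e. 0.
largest = s.max()
```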

@TomAugspurger
Contributor

Doesn't cut necessarily imply some kind of ordering on the data?

> I could imagine another use case where one wants to give the same label to outliers occurring at both the extreme low and high ends of the data.

That does seem useful.

I also don't like implicitly returning an unordered Categorical based on the presence of duplicates in labels. If we want to support this, I say that we offer an ordered=True option in pd.cut. Then users who need non-unique labels can pass labels=labels, ordered=False.
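A sketch of how the proposed option would be used on the example from the issue description. It assumes a pandas version in which pd.cut accepts the `ordered` keyword (the thread records that PR #33480 adds it, under the 1.1 milestone); on older versions the call fails with a TypeError.

```python
import pandas as pd

data = pd.DataFrame({"morf": [8142, 8153, 8161]})
bins = [8140, 8150, 8160, 8163]
labels = ["Adenocarcinomas", "Other specific carcinomas", "Adenocarcinomas"]

# ordered=False makes the caller opt in explicitly to an unordered result,
# which is what allows the duplicate label.
data["group"] = pd.cut(data["morf"], bins, labels=labels, ordered=False)
```

The result is an unordered categorical column with two categories, and the first and third rows share the same one.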

@mroeschke mroeschke added the Bug label Apr 3, 2020
mabelvj added a commit to mabelvj/pandas that referenced this issue Apr 11, 2020
@mabelvj
Contributor

mabelvj commented Apr 11, 2020

I just opened PR #33480, which adds the ordered option to pd.cut.

If no labels are provided and ordered is False, it raises an error.
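That error case can be checked with a short sketch, again assuming a pandas version that includes the `ordered` keyword. The exact message text is not asserted here, since it may differ between versions.

```python
import pandas as pd

try:
    # No labels given: the default interval labels are inherently ordered,
    # so asking for an unordered result is rejected.
    pd.cut([1, 2, 3], bins=3, ordered=False)
except ValueError as exc:
    error_message = str(exc)
else:
    error_message = None
```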

mabelvj added a commit to mabelvj/pandas that referenced this issue Apr 15, 2020
mabelvj added a commit to mabelvj/pandas that referenced this issue Apr 28, 2020
mabelvj added a commit to mabelvj/pandas that referenced this issue Apr 29, 2020
mabelvj added a commit to mabelvj/pandas that referenced this issue Apr 30, 2020
@jreback jreback added this to the 1.1 milestone May 1, 2020
7 participants