Pandas cut method gives an error if labels are non-unique #33141
Comments
I don't think that non-unique labels are possible to support with the current implementation of cut and Categorical, and I don't think that is likely to change any time soon. |
@TomAugspurger, thanks for your assessment. Just to clarify, I do not want multiple categories with the same label. Would it be possible to skip the conversion to Categorical when a new pd.cut argument is set, e.g. labels_as_categories=False? The conversion happens in tile.py, lines 415-416, in the _bins_to_cuts method.
|
I'm not sure. Would that change the result type? If so, it's probably not an option.
|
@harmbuisman One workaround would be to set labels to pd.cut(data["morf"], bins, labels=False).map({i: x for i, x in enumerate(labels)}) |
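The workaround above can be sketched as a runnable snippet. The sample data, bins and labels are borrowed from the example later in this thread; the variable name `bin_ids` and the column name `group` are illustrative assumptions only.

```python
import pandas as pd

data = pd.DataFrame({"morf": [8142, 8153, 8161]})
bins = [8140, 8150, 8160, 8163]
labels = ["Adenocarcinomas", "Other specific carcinomas", "Adenocarcinomas"]

# labels=False makes pd.cut return the integer position of each bin
# instead of building a Categorical from the labels.
bin_ids = pd.cut(data["morf"], bins, labels=False)

# Map each bin position back to its (possibly repeated) label.
data["group"] = bin_ids.map(dict(enumerate(labels)))
print(data)
```

Note that the resulting column holds plain labels rather than a Categorical, so it can be converted afterwards with `astype("category")` if a categorical dtype is needed.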
You can get around this by coercing labels to an unordered Categorical:
In [1]: import pandas as pd; pd.__version__
Out[1]: '1.0.3'
In [2]: data = pd.DataFrame({'morf':[8142, 8153, 8161]})
   ...: bins = [8140, 8150, 8160, 8163]
   ...: labels = ['Adenocarcinomas', 'Other specific carcinomas', 'Adenocarcinomas']
In [3]: pd.cut(data['morf'], bins, labels=pd.Categorical(labels))
Out[3]:
0              Adenocarcinomas
1    Other specific carcinomas
2              Adenocarcinomas
Name: morf, dtype: category
Categories (2, object): [Adenocarcinomas, Other specific carcinomas]
Maybe I'm reading too much into the implementation of […]
It seems like a simple fix here could be to attempt to coerce to an ordered Categorical, falling back to an unordered one:
diff --git a/pandas/core/reshape/tile.py b/pandas/core/reshape/tile.py
index 11fb8cc12..9ab863bcf 100644
--- a/pandas/core/reshape/tile.py
+++ b/pandas/core/reshape/tile.py
@@ -413,7 +413,11 @@ def _bins_to_cuts(
                 )
         if not is_categorical_dtype(labels):
-            labels = Categorical(labels, categories=labels, ordered=True)
+            try:
+                labels = Categorical(labels, categories=labels, ordered=True)
+            except ValueError:
+                # GH 33141: maybe dupes, attempt to coerce to an unordered categorical
+                labels = Categorical(labels)
         np.putmask(ids, na_mask, 0)
         result = algos.take_nd(labels, ids - 1)
I could see an argument for not allowing this though, as having […]
That being said, having duplicate labels seems like a reasonable use case to support. In addition to the example in the bug description, I could imagine another use case where one wants to give the same label to outliers occurring at both the extreme low and high ends of the data. |
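The fallback proposed in the diff can be tried outside of pandas internals. The following is a minimal sketch, assuming a standalone helper named `coerce_labels` (a hypothetical name, not part of pandas) that mirrors the try/except logic using only the public `pd.Categorical` constructor.

```python
import pandas as pd


def coerce_labels(labels):
    """Hypothetical helper mirroring the try/except fallback in the diff."""
    try:
        # Works only when every label is unique: the labels themselves
        # become the ordered categories.
        return pd.Categorical(labels, categories=labels, ordered=True)
    except ValueError:
        # Duplicate labels: let pandas infer unique, unordered categories.
        return pd.Categorical(labels)


unique = coerce_labels(["low", "mid", "high"])
dupes = coerce_labels(["outlier", "normal", "outlier"])
print(unique.ordered, list(unique.categories))
print(dupes.ordered, list(dupes.categories))
```

The design drawback raised in the thread is visible here: whether the result is ordered depends silently on whether the labels happen to contain duplicates.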
@jschendel I could actually understand just always treating the labels as-is without trying to impose the same order as bins, since there are situations where you might want labels that aren't monotone (like if you're doing target encoding for a machine learning model for example) or just have no order at all. If you do want that order it's very easy to provide labels that have it by default (like a range). Consider this unusual example caused by using the bin order (1 < 2 < 0?):
In [4]: pd.cut([1, 3, 5], bins=[0, 2, 4, 6], labels=[1, 2, 0])
Out[4]:
[1, 2, 0]
Categories (3, int64): [1 < 2 < 0] |
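One consequence of imposing the bin order on the labels can be sketched as follows. This is an illustration under the assumption that the result is an ordered Categorical, as in the output above; the variable name `result` is arbitrary.

```python
import pandas as pd

result = pd.cut([1, 3, 5], bins=[0, 2, 4, 6], labels=[1, 2, 0])

# The categories keep the order in which the labels were passed,
# which is the order of the bins, not the numeric order of the labels.
print(list(result.categories))
print(result.ordered)

# Reductions on an ordered Categorical follow the category order,
# so max() reports the label attached to the last bin.
print(result.max())
```

Here the numerically smallest label is treated as the largest category, which is the surprising behaviour the comment points at.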
Doesn't […]
That does seem useful. I also don't like implicitly returning an unordered Categorical based on the presence of duplicates in labels. If we want to support this, I say that we offer an […] |
Just made a PR (#33480) adding the ordered option to pd.cut. If no labels are provided and ordered is False, it will raise an error. |
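For readers on a pandas release that includes this option (the pandas documentation lists `ordered` as new in version 1.1.0), usage might look like the sketch below. The data reuses the example from this thread; the variable names are illustrative.

```python
import pandas as pd

data = pd.DataFrame({"morf": [8142, 8153, 8161]})
bins = [8140, 8150, 8160, 8163]
labels = ["Adenocarcinomas", "Other specific carcinomas", "Adenocarcinomas"]

# ordered=False permits duplicate labels and yields an unordered Categorical.
grouped = pd.cut(data["morf"], bins, labels=labels, ordered=False)
print(grouped)

# Without explicit labels, ordered=False is rejected.
missing_labels_error = None
try:
    pd.cut(data["morf"], bins, ordered=False)
except ValueError as err:
    missing_labels_error = err
    print(f"ValueError: {err}")
```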
Code Sample, a copy-pastable example if possible
Problem description
I want to use pandas cut to fill in labels for ranges that are scattered. E.g. I want to map the following table onto my data, where the definitions span ranges that are scattered across the morphology continuum:

The above code gives the error:
ValueError: Categorical categories must be unique
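The original code sample did not survive in this copy of the issue. The following is a reconstruction, not the reporter's code, assuming the data, bins and labels quoted in the comments; it triggers a ValueError for duplicate labels, although the exact message text may differ between pandas versions.

```python
import pandas as pd

data = pd.DataFrame({"morf": [8142, 8153, 8161]})
bins = [8140, 8150, 8160, 8163]
# The first and last bins deliberately share the same label.
labels = ["Adenocarcinomas", "Other specific carcinomas", "Adenocarcinomas"]

duplicate_error = None
try:
    data["group"] = pd.cut(data["morf"], bins, labels=labels)
except ValueError as err:
    duplicate_error = err
    print(f"ValueError: {err}")
```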
Expected Output
I expect the example above to run and give back the labeled dataframe without an error.
My workaround is to add the index to the labels and then remove that again after the cut.
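That workaround is not shown in the issue; one way it could look is sketched here. The `|` separator and the names `unique_labels` and `group` are assumptions made for illustration, and the separator must not occur inside the index prefix.

```python
import pandas as pd

data = pd.DataFrame({"morf": [8142, 8153, 8161]})
bins = [8140, 8150, 8160, 8163]
labels = ["Adenocarcinomas", "Other specific carcinomas", "Adenocarcinomas"]

# Prefix every label with its position so that all labels become unique.
unique_labels = [f"{i}|{label}" for i, label in enumerate(labels)]
cut = pd.cut(data["morf"], bins, labels=unique_labels)

# Strip the position prefix again to recover the original labels.
data["group"] = cut.astype(str).map(lambda s: s.split("|", 1)[1])
print(data)
```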
Output of pd.show_versions()
INSTALLED VERSIONS
commit : None
python : 3.7.6.final.0
python-bits : 64
OS : Windows
OS-release : 10
machine : AMD64
processor : Intel64 Family 6 Model 142 Stepping 10, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.None
pandas : 1.0.1
numpy : 1.18.1
pytz : 2019.3
dateutil : 2.8.1
pip : 20.0.2
setuptools : 45.2.0.post20200209
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.1
IPython : 7.12.0
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : 3.1.3
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pytest : None
pyxlsb : None
s3fs : None
scipy : 1.4.1
sqlalchemy : None
tables : None
tabulate : None
xarray : 0.15.0
xlrd : 1.2.0
xlwt : None
xlsxwriter : None
numba : 0.48.0