Categorical(vals, cats) bad performance with NaNs #12077

Closed
Winand opened this issue Jan 18, 2016 · 4 comments
Labels
Categorical (Categorical Data Type), Performance (Memory or execution speed performance)

Comments

@Winand
Contributor

Winand commented Jan 18, 2016

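NaT values in datetime64 data greatly reduce the performance of Categorical(values, cats). In the timings below, a single NaT among 1,000,000 values raises the runtime from about 250 ms to about 10 s: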

import pandas as pd
tmp=pd.Series(pd.DatetimeIndex(pd.np.datetime64('1995-01-01 00:00')+i for i in range(1000000)))
to = tmp.astype('category')
cats = to.cat.categorical._categories.values

%timeit pd.Categorical(tmp.values, cats)
1 loops, best of 3: 250 ms per loop

tmp[500000] = pd.NaT
%timeit pd.Categorical(tmp.values, cats)
1 loops, best of 3: 10.1 s per loop

tmp[tmp.isnull()] = pd.np.datetime64('0')
%timeit pd.Categorical(tmp.values, cats)
1 loops, best of 3: 251 ms per loop
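
For anyone who wants to rerun the comparison above on a recent pandas, here is a minimal sketch under a few assumptions: it uses `np` directly instead of `pd.np`, builds the categories with `Series.unique()` rather than the private `_categories` attribute, and uses a smaller size (`n` is an illustrative value, not the one from the report). It only checks that the NaT ends up as a missing code; the timings are left to `%timeit`.

```python
import numpy as np
import pandas as pd

n = 1000  # illustrative size; the report above used 1,000,000

# One timestamp per minute, starting at the date used in the report.
values = pd.Series(pd.date_range("1995-01-01", periods=n, freq="min"))

# Assumption: unique() stands in for the private `_categories` access.
cats = values.unique()

# Same data with one missing value in the middle.
values_with_nat = values.copy()
values_with_nat[n // 2] = pd.NaT

clean = pd.Categorical(values, categories=cats)
with_nat = pd.Categorical(values_with_nat, categories=cats)

# Every clean value maps to a category; the NaT maps to code -1.
all_clean_found = bool((clean.codes >= 0).all())
nat_code = int(with_nat.codes[n // 2])
n_missing = int(with_nat.isna().sum())
```

In IPython, `%timeit pd.Categorical(values, categories=cats)` and the same call with `values_with_nat` would give the two numbers to compare.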

Small issue with printing Categorical datetime64:

>>> ds = pd.Categorical([pd.np.datetime64("2014-01-01"), pd.NaT])
>>> ds
[2014-01-01, 2014-01-01] <--- NO, -1 is NOT a category:-)
Categories (1, datetime64[ns]): [2014-01-01]
>>> ds.astype('datetime64')
array(['2014-01-01T03:00:00.000000000+0300', 'NaT'], dtype='datetime64[ns]')
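
A way to look at this without relying on the printed form is to inspect the codes directly. The sketch below is an illustration, not part of the original report; it assumes a recent pandas and swaps `pd.np.datetime64` for `pd.Timestamp`. A missing value is stored as code -1, which is a sentinel and not an index into the categories.

```python
import pandas as pd

# Assumption: pd.Timestamp replaces pd.np.datetime64 from the report.
ds = pd.Categorical([pd.Timestamp("2014-01-01"), pd.NaT])

codes = ds.codes.tolist()          # position of each value in ds.categories
n_categories = len(ds.categories)  # NaT is not counted as a category
missing = ds.isna().tolist()       # True where the value is missing
```

If the repr showed the first category for a code of -1, it would be indexing the categories with the sentinel, which matches the output reported above.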

Versions:

commit: None
python: 3.4.3.final.0
python-bits: 32
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 58 Stepping 9, GenuineIntel
byteorder: little
LC_ALL: None
LANG: en

pandas: 0.17.1
nose: 1.3.7
pip: 7.1.2
setuptools: 18.7.1
Cython: 0.23.4
numpy: 1.9.3
scipy: 0.16.1
statsmodels: 0.6.1
IPython: 4.0.1
sphinx: 1.3.3
patsy: 0.4.1
dateutil: 2.4.2
pytz: 2015.7
blosc: 1.2.8
bottleneck: 1.0.0
tables: 3.2.2
numexpr: 2.4.6
matplotlib: 1.5.0
openpyxl: None
xlrd: 0.9.4
xlwt: None
xlsxwriter: 0.7.7
lxml: 3.5.0
bs4: 4.4.1
html5lib: 0.9999999
httplib2: None
apiclient: None
sqlalchemy: 1.0.9
pymysql: None
psycopg2: None
Jinja2: 2.8
@jreback
Contributor

jreback commented Jan 18, 2016

Please show a copy-pastable example and the output of
pd.show_versions()

@jreback
Contributor

jreback commented Jan 18, 2016

Can you show the timings using
%%timeit in IPython instead?
It's much easier to read.

@Winand
Contributor Author

Winand commented Jan 18, 2016

At first I tried to initialize it like this:

to=pd.Series(pd.DatetimeIndex(range(1000000))).astype('category')
cats = to.cat.categorical._categories.values
tmp=pd.Series(pd.DatetimeIndex(range(1000000)))

But it gives wrong results in the first case (a bug?):

>>>c1
[1970-01-01, 1970-01-01, 1970-01-01, 1970-01-01, 1970-01-01, ..., 1970-01-01 00:00:00.000999, 1970-01-01 00:00:00.000999, 1970-01-01 00:00:00.000999, 1970-01-01 00:00:00.000999, 1970-01-01 00:00:00.000999]
Length: 1000000
Categories (1000000, datetime64[ns]): [1970-01-01 00:00:00.000000000, 1970-01-01 00:00:00.000000001,
                                       1970-01-01 00:00:00.000000002, 1970-01-01 00:00:00.000000003, ...,
                                       1970-01-01 00:00:00.000999996, 1970-01-01 00:00:00.000999997,
                                       1970-01-01 00:00:00.000999998, 1970-01-01 00:00:00.000999999]

>>>c2
[1970-01-01 00:00:00.000000000, 1970-01-01 00:00:00.000000001, 1970-01-01 00:00:00.000000002, 1970-01-01 00:00:00.000000003, 1970-01-01 00:00:00.000000004, ..., 1970-01-01 00:00:00.000999995, 1970-01-01 00:00:00.000999996, 1970-01-01 00:00:00.000999997, 1970-01-01 00:00:00.000999998, 1970-01-01 00:00:00.000999999]
Length: 1000000
Categories (1000000, datetime64[ns]): [1970-01-01 00:00:00.000000000, 1970-01-01 00:00:00.000000001,
                                       1970-01-01 00:00:00.000000002, 1970-01-01 00:00:00.000000003, ...,
                                       1970-01-01 00:00:00.000999996, 1970-01-01 00:00:00.000999997,
                                       1970-01-01 00:00:00.000999998, 1970-01-01 00:00:00.000999999]
equal? False

@jreback jreback added Performance Memory or execution speed performance Categorical Categorical Data Type labels Jan 19, 2016
jreback added a commit to jreback/pandas that referenced this issue Jan 25, 2016
@jreback
Contributor

jreback commented Jan 25, 2016

@Winand

#12128 should fix a multitude of issues and performance problems with categoricals that contain NaT.

The code was converting the values to object dtype under the hood (bad) and was not treating NaT like NaN.
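
The two points above can be checked on a recent pandas with a small sketch. This is an illustration under the assumption that the fix is in place, not code taken from #12128: the datetime categories should keep a datetime64 dtype instead of object, and NaT should get the same missing code that NaN gets in a float categorical.

```python
import numpy as np
import pandas as pd

float_cat = pd.Categorical([1.5, np.nan, 2.5])
dt_cat = pd.Categorical(pd.to_datetime(["2014-01-01", None, "2014-01-02"]))

float_codes = float_cat.codes.tolist()  # NaN is encoded as -1
dt_codes = dt_cat.codes.tolist()        # NaT is encoded as -1 as well

# NumPy dtype kind of the categories: "M" is datetime64, "O" would be object.
dt_kind = dt_cat.categories.dtype.kind
```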
