Categorical(vals, cats) bad performance with NaNs #12077

Winand · 2016-01-18T11:31:50Z

NaT values in datetime64 data greatly reduce the performance of Categorical(values, cats):

import pandas as pd
tmp=pd.Series(pd.DatetimeIndex(pd.np.datetime64('1995-01-01 00:00')+i for i in range(1000000)))
to = tmp.astype('category')
cats = to.cat.categorical._categories.values

%timeit pd.Categorical(tmp.values, cats)
1 loops, best of 3: 250 ms per loop

tmp[500000] = pd.NaT
%timeit pd.Categorical(tmp.values, cats)
1 loops, best of 3: 10.1 s per loop

tmp[tmp.isnull()] = pd.np.datetime64('0')
%timeit pd.Categorical(tmp.values, cats)
1 loops, best of 3: 251 ms per loop
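For anyone who wants to rerun this outside IPython, here is a minimal sketch of the same comparison using the standard-library timeit module. The array size, the use of pd.date_range, and the helper name build are my own choices for illustration, not part of the original report, and the timings will depend on the pandas version installed.

```python
import timeit

import pandas as pd

n = 100_000  # smaller than the 1,000,000 in the report, to keep the run short

# Unique datetime values, plus categories taken from the NaT-free data.
values = pd.Series(pd.date_range("1995-01-01", periods=n, freq="s"))
cats = pd.DatetimeIndex(values)


def build(vals):
    """Hypothetical helper: construct a Categorical against fixed categories."""
    return pd.Categorical(vals, categories=cats)


# Same data with a single missing value in the middle.
with_nat = values.copy()
with_nat[n // 2] = pd.NaT

t_clean = timeit.timeit(lambda: build(values), number=3)
t_nat = timeit.timeit(lambda: build(with_nat), number=3)

print(f"without NaT: {t_clean / 3:.4f} s per call")
print(f"with NaT:    {t_nat / 3:.4f} s per call")
```

If the two numbers are of the same order of magnitude, the slowdown described above is not present in the installed version.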

Small issue with printing Categorical datetime64:

>>> ds = pd.Categorical([pd.np.datetime64("2014-01-01"), pd.NaT])
>>> ds
[2014-01-01, 2014-01-01] <--- NO, -1 is NOT a category:-)
Categories (1, datetime64[ns]): [2014-01-01]
>>> ds.astype('datetime64')
array(['2014-01-01T03:00:00.000000000+0300', 'NaT'], dtype='datetime64[ns]')
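To check whether this is only a display problem, the integer codes can be inspected directly. The sketch below uses pd.Timestamp instead of pd.np.datetime64 (an assumption on my side, so that it also runs on pandas versions where pd.np is no longer available); a missing value is expected to be stored as code -1 rather than as a category.

```python
import pandas as pd

# One real value and one missing value, as in the example above.
ds = pd.Categorical([pd.Timestamp("2014-01-01"), pd.NaT])

codes = ds.codes    # integer position of each value in ds.categories
missing = ds.isna()  # boolean mask of missing entries

print(codes)
print(missing)
print(ds.categories)
```

If the second code is -1 and the mask marks it as missing, the stored data is correct and only the repr is wrong.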

Versions:

commit: None
python: 3.4.3.final.0
python-bits: 32
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 58 Stepping 9, GenuineIntel
byteorder: little
LC_ALL: None
LANG: en

pandas: 0.17.1
nose: 1.3.7
pip: 7.1.2
setuptools: 18.7.1
Cython: 0.23.4
numpy: 1.9.3
scipy: 0.16.1
statsmodels: 0.6.1
IPython: 4.0.1
sphinx: 1.3.3
patsy: 0.4.1
dateutil: 2.4.2
pytz: 2015.7
blosc: 1.2.8
bottleneck: 1.0.0
tables: 3.2.2
numexpr: 2.4.6
matplotlib: 1.5.0
openpyxl: None
xlrd: 0.9.4
xlwt: None
xlsxwriter: 0.7.7
lxml: 3.5.0
bs4: 4.4.1
html5lib: 0.9999999
httplib2: None
apiclient: None
sqlalchemy: 1.0.9
pymysql: None
psycopg2: None
Jinja2: 2.8


jreback · 2016-01-18T11:36:31Z

Please show a copy-pastable example and the output of
pd.show_versions()

jreback · 2016-01-18T12:16:17Z

Can you show the timings using
%%timeit in IPython instead?
It's much easier to read.

Winand · 2016-01-18T12:17:01Z

At first I tried to initialize like this:

to=pd.Series(pd.DatetimeIndex(range(1000000))).astype('category')
cats = to.cat.categorical._categories.values
tmp=pd.Series(pd.DatetimeIndex(range(1000000)))

but it gives wrong results in the 1st case (a bug?):

>>>c1
[1970-01-01, 1970-01-01, 1970-01-01, 1970-01-01, 1970-01-01, ..., 1970-01-01 00:00:00.000999, 1970-01-01 00:00:00.000999, 1970-01-01 00:00:00.000999, 1970-01-01 00:00:00.000999, 1970-01-01 00:00:00.000999]
Length: 1000000
Categories (1000000, datetime64[ns]): [1970-01-01 00:00:00.000000000, 1970-01-01 00:00:00.000000001,
                                       1970-01-01 00:00:00.000000002, 1970-01-01 00:00:00.000000003, ...,
                                       1970-01-01 00:00:00.000999996, 1970-01-01 00:00:00.000999997,
                                       1970-01-01 00:00:00.000999998, 1970-01-01 00:00:00.000999999]

>>>c2
[1970-01-01 00:00:00.000000000, 1970-01-01 00:00:00.000000001, 1970-01-01 00:00:00.000000002, 1970-01-01 00:00:00.000000003, 1970-01-01 00:00:00.000000004, ..., 1970-01-01 00:00:00.000999995, 1970-01-01 00:00:00.000999996, 1970-01-01 00:00:00.000999997, 1970-01-01 00:00:00.000999998, 1970-01-01 00:00:00.000999999]
Length: 1000000
Categories (1000000, datetime64[ns]): [1970-01-01 00:00:00.000000000, 1970-01-01 00:00:00.000000001,
                                       1970-01-01 00:00:00.000000002, 1970-01-01 00:00:00.000000003, ...,
                                       1970-01-01 00:00:00.000999996, 1970-01-01 00:00:00.000999997,
                                       1970-01-01 00:00:00.000999998, 1970-01-01 00:00:00.000999999]
equal? False

jreback · 2016-01-25T01:40:37Z

@Winand

#12128 should fix a multitude of categorical with NaT issues/perf.

The constructor was converting them to object dtype under the hood (bad) and not treating NaT like NaN.
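As a rough illustration of what "treating NaT like NaN" means for a factorizer, here is a small sketch using the public pd.factorize function. It only shows the expected behaviour (missing values get the sentinel code -1 and are left out of the uniques); it is not the internal hashtable code touched by #12128.

```python
import pandas as pd

# Datetime values with one missing entry (None is parsed as NaT).
values = pd.to_datetime(["2014-01-01", None, "2014-01-02", "2014-01-01"])

# factorize maps each distinct value to an integer code.
codes, uniques = pd.factorize(values)

print(codes)    # NaT should get the sentinel -1, not its own code
print(uniques)  # NaT should not appear among the uniques
```

The same convention is what Categorical relies on: a code of -1 means "missing", so NaT never has to become a category.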

jreback added the Performance and Categorical labels Jan 19, 2016

jreback mentioned this issue Jan 25, 2016

PERF: add support for NaT in hashtable factorizers, improving Categorical construction with NaT #12128

Closed

jreback added a commit to jreback/pandas that referenced this issue Jan 25, 2016

PERF: add support for NaT in hashtable factorizers, improving Categor…

e1385d8

…ical construction with NaT, pandas-dev#12077

jreback closed this as completed in 81bb972 Jan 25, 2016
