Skip to content

Bug in DataFrame.duplicated() when dealing with datetime64 #1833

@snth

Description

@snth

The following looks like a bug to me as DataFrame.duplicated() gives different results on what should be identical inputs. To me it looks like the problem is with the datetime64 values because if you look at the output of dates.values it's clear that the last 4 values are duplicates.

Please see the code below that reproduces the problem:

In [701]: dates = date_range('2010-07-01', end='2010-08-05')

In [702]: tst = DataFrame({'symbol': 'AAA', 'date': dates})

In [703]: tst.tail()
Out[703]: 
                  date symbol
31 2010-08-01 00:00:00    AAA
32 2010-08-02 00:00:00    AAA
33 2010-08-03 00:00:00    AAA
34 2010-08-04 00:00:00    AAA
35 2010-08-05 00:00:00    AAA

In [704]: tst.duplicated().tail()
Out[704]: 
31    False
32    False
33    False
34    False
35    False

In [705]: tst.duplicated(['date', 'symbol']).tail()
Out[705]: 
31    False
32     True
33     True
34     True
35     True

In [706]: dates.values
Out[706]: 
array([1970-01-15 40:00:00, 1970-01-15 64:00:00, 1970-01-15 88:00:00,
       1970-01-15 112:00:00, 1970-01-15 136:00:00, 1970-01-15 160:00:00,
       1970-01-15 184:00:00, 1970-01-15 208:00:00, 1970-01-15 232:00:00,
       1970-01-15 00:00:00, 1970-01-15 24:00:00, 1970-01-15 48:00:00,
       1970-01-15 72:00:00, 1970-01-15 96:00:00, 1970-01-15 120:00:00,
       1970-01-15 144:00:00, 1970-01-15 168:00:00, 1970-01-15 192:00:00,
       1970-01-15 216:00:00, 1970-01-15 240:00:00, 1970-01-15 08:00:00,
       1970-01-15 32:00:00, 1970-01-15 56:00:00, 1970-01-15 80:00:00,
       1970-01-15 104:00:00, 1970-01-15 128:00:00, 1970-01-15 152:00:00,
       1970-01-15 176:00:00, 1970-01-15 200:00:00, 1970-01-15 224:00:00,
       1970-01-15 248:00:00, 1970-01-15 16:00:00, 1970-01-15 40:00:00,
       1970-01-15 64:00:00, 1970-01-15 88:00:00, 1970-01-15 112:00:00], dtype=datetime64[ns])

In [707]: dates
Out[707]: 
<class 'pandas.tseries.index.DatetimeIndex'>
[2010-07-01 00:00:00, ..., 2010-08-05 00:00:00]
Length: 36, Freq: D, Timezone: None

In [708]: dates.dtype
Out[708]: dtype('datetime64[ns]')

In [709]: dates.values.dtype
Out[709]: dtype('datetime64[ns]')

In [710]: sys.version
Out[710]: '2.7.3 (default, Aug  1 2012, 05:14:39) \n[GCC 4.6.3]'

In [711]: np.version.version
Out[711]: '1.6.1'

In [713]: pd.version.version
Out[713]: '0.8.1'

In [714]: 

I remember reading somewhere that there are problems with datetime64 in numpy 1.6 but I don't understand what coercions are taking place behind the scenes. Also, if someone could please explain to me why the dates in dates.values above are wrong and how to avoid this, I would appreciate it.

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions