-
-
Notifications
You must be signed in to change notification settings - Fork 18.7k
Closed
Labels
Milestone
Description
The following looks like a bug to me as DataFrame.duplicated() gives different results on what should be identical inputs. To me it looks like the problem is with the datetime64 values because if you look at the output of dates.values
it's clear that the last 4 values are duplicates.
Please see the code below that reproduces the problem:
In [701]: dates = date_range('2010-07-01', end='2010-08-05')
In [702]: tst = DataFrame({'symbol': 'AAA', 'date': dates})
In [703]: tst.tail()
Out[703]:
date symbol
31 2010-08-01 00:00:00 AAA
32 2010-08-02 00:00:00 AAA
33 2010-08-03 00:00:00 AAA
34 2010-08-04 00:00:00 AAA
35 2010-08-05 00:00:00 AAA
In [704]: tst.duplicated().tail()
Out[704]:
31 False
32 False
33 False
34 False
35 False
In [705]: tst.duplicated(['date', 'symbol']).tail()
Out[705]:
31 False
32 True
33 True
34 True
35 True
In [706]: dates.values
Out[706]:
array([1970-01-15 40:00:00, 1970-01-15 64:00:00, 1970-01-15 88:00:00,
1970-01-15 112:00:00, 1970-01-15 136:00:00, 1970-01-15 160:00:00,
1970-01-15 184:00:00, 1970-01-15 208:00:00, 1970-01-15 232:00:00,
1970-01-15 00:00:00, 1970-01-15 24:00:00, 1970-01-15 48:00:00,
1970-01-15 72:00:00, 1970-01-15 96:00:00, 1970-01-15 120:00:00,
1970-01-15 144:00:00, 1970-01-15 168:00:00, 1970-01-15 192:00:00,
1970-01-15 216:00:00, 1970-01-15 240:00:00, 1970-01-15 08:00:00,
1970-01-15 32:00:00, 1970-01-15 56:00:00, 1970-01-15 80:00:00,
1970-01-15 104:00:00, 1970-01-15 128:00:00, 1970-01-15 152:00:00,
1970-01-15 176:00:00, 1970-01-15 200:00:00, 1970-01-15 224:00:00,
1970-01-15 248:00:00, 1970-01-15 16:00:00, 1970-01-15 40:00:00,
1970-01-15 64:00:00, 1970-01-15 88:00:00, 1970-01-15 112:00:00], dtype=datetime64[ns])
In [707]: dates
Out[707]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2010-07-01 00:00:00, ..., 2010-08-05 00:00:00]
Length: 36, Freq: D, Timezone: None
In [708]: dates.dtype
Out[708]: dtype('datetime64[ns]')
In [709]: dates.values.dtype
Out[709]: dtype('datetime64[ns]')
In [710]: sys.version
Out[710]: '2.7.3 (default, Aug 1 2012, 05:14:39) \n[GCC 4.6.3]'
In [711]: np.version.version
Out[711]: '1.6.1'
In [713]: pd.version.version
Out[713]: '0.8.1'
In [714]:
I remember reading somewhere that there are problems with datetime64 in numpy 1.6 but I don't understand what coercions are taking place behind the scenes. Also, if someone could please explain to me why the dates in dates.values
above are wrong and how to avoid this, I would appreciate it.