Description
If a DataFrame has a repeated (non-unique) column label, assignments to that label fail when they change the column's dtype.
import numpy as np
import pandas as pd
df2 = pd.DataFrame(np.ones((10, 2)), columns=['that', 'that'])
df2
Out[10]:
that that
0 1 1
1 1 1
2 1 1
3 1 1
4 1 1
5 1 1
6 1 1
7 1 1
8 1 1
9 1 1
[10 rows x 2 columns]
The columns hold float data, so assigning a float works:
df2['that'] = 1.0
However, assigning an integer fails with a TypeError and leaves the DataFrame in a broken state (for example, a subsequent repr also fails):
df2['that'] = 1
Traceback (most recent call last):
File "/users/is/dbew/pyenvs/timeseries/lib/python2.7/site-packages/ipython-1.1.0_1_ahl1-py2.7.egg/IPython/core/interactiveshell.py", line 2830, in run_code
exec code_obj in self.user_global_ns, self.user_ns
File "<ipython-input-11-8701f5b0efe4>", line 1, in <module>
df2['that'] = 1
File "/users/is/dbew/pyenvs/timeseries/lib/python2.7/site-packages/pandas-0.13.0_292_g4dcecb0-py2.7-linux-x86_64.egg/pandas/core/frame.py", line 1879, in __setitem__
self._set_item(key, value)
File "/users/is/dbew/pyenvs/timeseries/lib/python2.7/site-packages/pandas-0.13.0_292_g4dcecb0-py2.7-linux-x86_64.egg/pandas/core/frame.py", line 1960, in _set_item
NDFrame._set_item(self, key, value)
File "/users/is/dbew/pyenvs/timeseries/lib/python2.7/site-packages/pandas-0.13.0_292_g4dcecb0-py2.7-linux-x86_64.egg/pandas/core/generic.py", line 1057, in _set_item
self._data.set(key, value)
File "/users/is/dbew/pyenvs/timeseries/lib/python2.7/site-packages/pandas-0.13.0_292_g4dcecb0-py2.7-linux-x86_64.egg/pandas/core/internals.py", line 2968, in set
_set_item(item, arr[None, :])
File "/users/is/dbew/pyenvs/timeseries/lib/python2.7/site-packages/pandas-0.13.0_292_g4dcecb0-py2.7-linux-x86_64.egg/pandas/core/internals.py", line 2927, in _set_item
self._add_new_block(item, arr, loc=None)
File "/users/is/dbew/pyenvs/timeseries/lib/python2.7/site-packages/pandas-0.13.0_292_g4dcecb0-py2.7-linux-x86_64.egg/pandas/core/internals.py", line 3108, in _add_new_block
new_block = make_block(value, self.items[loc:loc + 1].copy(),
TypeError: unsupported operand type(s) for +: 'slice' and 'int'
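I stepped through the code, and it looks like most places handle repeated columns correctly. The exception is the code that reallocates arrays when the dtype changes (`_add_new_block` in `pandas/core/internals.py`, the last frame of the traceback above).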
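The error message is consistent with `loc` being a slice rather than an integer by the time `loc + 1` is evaluated. The sketch below illustrates that under one assumption: that the location is obtained from an `Index.get_loc`-style lookup, which is a guess about the 0.13 internals rather than something confirmed here. It runs against a current pandas. `as_positions` is a hypothetical helper written for this illustration, not a pandas function, and it is not the actual fix.

```python
import numbers

import pandas as pd

# Duplicate labels, as in the report above.
columns = pd.Index(['that', 'that'])

# For a repeated label the lookup does not return a single integer.
loc = columns.get_loc('that')
print(type(loc).__name__, loc)

# Integer arithmetic on that result fails in the same way as the
# `loc:loc + 1` expression in the last traceback frame.
try:
    columns[loc:loc + 1]
    message = None
except TypeError as exc:
    message = str(exc)
print(message)


def as_positions(loc, n):
    """Hypothetical helper: normalise a lookup result to integer positions.

    Accepts an integer (unique label), a slice (contiguous duplicates)
    or a boolean mask (scattered duplicates) over `n` items.
    """
    if isinstance(loc, numbers.Integral):
        return [int(loc)]
    if isinstance(loc, slice):
        return list(range(n)[loc])
    return [i for i, hit in enumerate(loc) if hit]


print(as_positions(loc, len(columns)))
```

Code that needs to reallocate a block for a label could iterate over positions like these instead of assuming a single integer location.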
I've tested this against pandas 0.13.0 and the latest master. Here's the output of installed versions when running on the master:
commit: None
python: 2.7.3.final.0
python-bits: 64
OS: Linux
OS-release: 2.6.18-308.el5
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_GB
pandas: 0.13.0-292-g4dcecb0
Cython: 0.16
numpy: 1.7.1
scipy: 0.9.0
statsmodels: None
patsy: None
scikits.timeseries: None
dateutil: 1.5
pytz: None
bottleneck: 0.6.0
tables: 2.3.1-1
numexpr: 2.0.1
matplotlib: 1.1.1
openpyxl: None
xlrd: 0.8.0
xlwt: None
xlsxwriter: None
sqlalchemy: None
lxml: 2.3.6
bs4: None
html5lib: None
bq: None
apiclient: None