Skip to content

read_fwf(), read_table() are mangling columns #3594

@danmd

Description

@danmd

version 0.11.0 has introduced a confusing behavior when importing data via read_fwf (and, I'm pretty sure, read_table)

import pandas as pd
from cStringIO import StringIO
from datetime import datetime

tzlist = [1,10,20,30,60,80,100]
ntz = len(tzlist)
tcolspecs = [16]+[8]*ntz
tcolnames = ['SST'] + ["T%03d" % z for z in tzlist[1:]]
data = '''  2009164202000   9.5403  9.4105  8.6571  7.8372  6.0612  5.8843  5.5192
  2009164203000   9.5435  9.2010  8.6167  7.8176  6.0804  5.8728  5.4869
  2009164204000   9.5873  9.1326  8.4694  7.5889  6.0422  5.8526  5.4657
  2009164205000   9.5810  9.0896  8.4009  7.4652  6.0322  5.8189  5.4379
  2009164210000   9.6034  9.0897  8.3822  7.4905  6.0908  5.7904  5.4039'''
dftemp = pd.read_fwf(StringIO(data),
                     index_col=0,
                     header=None,
                     names=tcolnames,
                     widths=tcolspecs,
                     parse_dates=True,
                     date_parser=lambda s: datetime.strptime(s,'%Y%j%H%M%S'))

With version 0.10.1

In [1]: pd.__version__
Out[1]: '0.10.1'

In [2]: dftemp
Out[2]: 
                        SST    T010    T020    T030    T060    T080    T100
2009-06-13 20:20:00  9.5403  9.4105  8.6571  7.8372  6.0612  5.8843  5.5192
2009-06-13 20:30:00  9.5435  9.2010  8.6167  7.8176  6.0804  5.8728  5.4869
2009-06-13 20:40:00  9.5873  9.1326  8.4694  7.5889  6.0422  5.8526  5.4657
2009-06-13 20:50:00  9.5810  9.0896  8.4009  7.4652  6.0322  5.8189  5.4379
2009-06-13 21:00:00  9.6034  9.0897  8.3822  7.4905  6.0908  5.7904  5.4039

In [3]: dftemp.T030
Out[3]: 
2009-06-13 20:20:00    7.8372
2009-06-13 20:30:00    7.8176
2009-06-13 20:40:00    7.5889
2009-06-13 20:50:00    7.4652
2009-06-13 21:00:00    7.4905
Name: T030, dtype: float64

In [4]: dftemp.T060
Out[4]: 
2009-06-13 20:20:00    6.0612
2009-06-13 20:30:00    6.0804
2009-06-13 20:40:00    6.0422
2009-06-13 20:50:00    6.0322
2009-06-13 21:00:00    6.0908
Name: T060

In [5]: dftemp.T080
Out[5]: 
2009-06-13 20:20:00    5.8843
2009-06-13 20:30:00    5.8728
2009-06-13 20:40:00    5.8526
2009-06-13 20:50:00    5.8189
2009-06-13 21:00:00    5.7904
Name: T080

In [6]: dftemp.T100
Out[6]: 
2009-06-13 20:20:00    5.5192
2009-06-13 20:30:00    5.4869
2009-06-13 20:40:00    5.4657
2009-06-13 20:50:00    5.4379
2009-06-13 21:00:00    5.4039
Name: T100, dtype: float64

and, with version 0.11.0

In [1]: pd.__version__
Out[1]: '0.11.0'

In [2]: dftemp
Out[2]: 
                        SST    T010    T020    T030    T060    T080    T100
2009-06-13 20:20:00  9.5403  9.4105  8.6571  7.8372  6.0612  5.8843  5.5192
2009-06-13 20:30:00  9.5435  9.2010  8.6167  7.8176  6.0804  5.8728  5.4869
2009-06-13 20:40:00  9.5873  9.1326  8.4694  7.5889  6.0422  5.8526  5.4657
2009-06-13 20:50:00  9.5810  9.0896  8.4009  7.4652  6.0322  5.8189  5.4379
2009-06-13 21:00:00  9.6034  9.0897  8.3822  7.4905  6.0908  5.7904  5.4039

In [3]: dftemp.T030
Out[3]: 
2009-06-13 20:20:00    7.8372
2009-06-13 20:30:00    7.8176
2009-06-13 20:40:00    7.5889
2009-06-13 20:50:00    7.4652
2009-06-13 21:00:00    7.4905
Name: T030, dtype: float64

In [4]: dftemp.T060
Out[4]: 
Empty DataFrame
Columns: [SST, T010, T020, T030, T060, T080, T100]
Index: []

In [5]: dftemp.T080
Out[5]: 
Empty DataFrame
Columns: [SST, T010, T020, T030, T060, T080, T100]
Index: []

In [6]: dftemp.T100
Out[6]: 
2009-06-13 20:20:00    5.5192
2009-06-13 20:30:00    5.4869
2009-06-13 20:40:00    5.4657
2009-06-13 20:50:00    5.4379
2009-06-13 21:00:00    5.4039
Name: T100, dtype: float64

No matter how many columns I've been importing (up to 32 in some cases), it seems like it is always the 5th and 6th columns getting hit.

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions