Closed
Labels
Dtype Conversions, Enhancement, Output-Formatting
Description
DataFrame.to_records() outputs string columns with the object dtype, which is sometimes inefficient (e.g. for short, similar-length strings, or when storing with np.save()). I wrote the following function to fix this:
def to_records_plain(df):
    """Return a NumPy recarray like df.to_records() but with strings stored as bytes, not objects.

    This gives more compact storage and does not require pickling objects when saving to disk.
    Assumes all object arrays in df are strings.

    >>> df = pd.DataFrame({'a': [1, 2], 'b': [0.5, 0.9], 'c': ['x', 'yyy']})
    >>> to_records_plain(df)
    rec.array([(0, 1, 0.5, b'x'), (1, 2, 0.9, b'yyy')],
              dtype=[('index', '<i8'), ('a', '<i8'), ('b', '<f8'), ('c', 'S3')])
    """
    records = df.to_records()
    descr = records.dtype.descr
    for ii, (name, dtype) in enumerate(descr):
        if dtype == '|O':  # object field, assumed to hold strings
            # Size the fixed-width field to the longest string in the column
            length = df[name].str.len().max()
            descr[ii] = (name, 'S{}'.format(length))
    return records.astype(descr)
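To illustrate the pickling point, here is a minimal sketch that uses plain NumPy structured arrays rather than a DataFrame. The `roundtrip` helper and the sample arrays are made up for the example; it only relies on `np.save` and `np.load` with `allow_pickle=False`.

```python
import io

import numpy as np

# Object-dtype field: np.save has to pickle the Python str objects.
obj = np.array([(1, "x"), (2, "yyy")], dtype=[("a", "i8"), ("c", "O")])
# Fixed-width bytes field: stored as plain binary data, no pickle involved.
fixed = obj.astype([("a", "i8"), ("c", "S3")])


def roundtrip(arr):
    """Hypothetical helper: save to an in-memory buffer, reload with pickling disabled."""
    buf = io.BytesIO()
    np.save(buf, arr)
    buf.seek(0)
    return np.load(buf, allow_pickle=False)


loaded = roundtrip(fixed)  # loads fine

try:
    roundtrip(obj)
    object_needs_pickle = False
except ValueError:
    # np.load refuses object arrays when allow_pickle=False
    object_needs_pickle = True
```

The fixed-width array reloads without pickle support, while the object-dtype one can only be read back with `allow_pickle=True`.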
I suggest exposing something like this as an option in DataFrame.to_records(). An option to convert to Unicode ('U') would be good too (NumPy's 'S' is effectively bytes in Python 3).
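A rough sketch of what such an option might do, written as a standalone function. The name `to_records_fixed` and its `kind` parameter are hypothetical and not part of pandas. It drops the index to keep the example short, and like the function above it assumes every object column holds only strings.

```python
import numpy as np
import pandas as pd


def to_records_fixed(df, kind="U"):
    """Hypothetical helper: like df.to_records(index=False), but with object
    columns converted to fixed-width 'U' (Unicode) or 'S' (bytes) fields.

    Assumes every object column holds only str values.
    """
    if kind not in ("U", "S"):
        raise ValueError("kind must be 'U' or 'S'")
    records = df.to_records(index=False)
    descr = records.dtype.descr
    for i, (name, dtype) in enumerate(descr):
        if dtype == "|O":
            # Width in characters; at least 1 so the field is never zero-width.
            width = max(int(df[name].str.len().max()), 1)
            descr[i] = (name, "{}{}".format(kind, width))
    return records.astype(descr)


# dtype=object is set explicitly so the example does not depend on the
# default string dtype of the installed pandas version.
df = pd.DataFrame({"a": [1, 2], "c": pd.Series(["x", "yyy"], dtype=object)})
rec = to_records_fixed(df)  # field 'c' becomes a 3-character Unicode field
```

The width is counted in characters, which matches 'U' exactly. With kind="S" it is only safe for ASCII data, since non-ASCII strings cannot be cast to bytes this way.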