ENH: add to_records() option to output NumPy string dtypes, not objects

`DataFrame.to_records()` outputs string columns with the `object` dtype, which can be inefficient (e.g. for short, similar-length strings, or when saving with `np.save()`, which has to pickle object arrays).  I wrote the following function to work around this:

```
import pandas as pd  # only needed for the doctest below


def to_records_plain(df):
    """Return a NumPy recarray like df.to_records() but with strings stored as bytes, not objects.

    This gives more compact storage and does not require pickling objects when saving to disk.
    Assumes all object arrays in df are strings.

    >>> df = pd.DataFrame({'a': [1, 2], 'b': [0.5, 0.9], 'c': ['x', 'yyy']})
    >>> to_records_plain(df)
    rec.array([(0, 1,  0.5, b'x'), (1, 2,  0.9, b'yyy')],
              dtype=[('index', '<i8'), ('a', '<i8'), ('b', '<f8'), ('c', 'S3')])
    """
    records = df.to_records()
    descr = records.dtype.descr
    for ii, (name, dtype) in enumerate(descr):
        if dtype == '|O':  # object field, assumed to hold only strings
            # Size the field to the longest string in the column (in characters)
            length = df[name].str.len().max()
            descr[ii] = (name, 'S{}'.format(length))

    return records.astype(descr)
```

I suggest exposing something like this as an option in `DataFrame.to_records()`.  An option to convert to Unicode (`'U'`) would be good too, since NumPy's `'S'` is effectively `bytes` in Python 3.
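
For illustration, here is a rough sketch of what the Unicode variant could look like. The name `to_records_unicode` is hypothetical and not part of pandas. Like the function above, it assumes that every object field holds only strings. It measures widths from the records themselves rather than from `df[name]`, so an object-dtype index is sized the same way as a column.

```python
import numpy as np
import pandas as pd


def to_records_unicode(df):
    """Hypothetical helper: like df.to_records(), but object fields become
    fixed-width NumPy unicode ('U') fields. Assumes they hold only strings."""
    records = df.to_records()
    descr = []
    for name, dtype in records.dtype.descr:
        if dtype == "|O":
            # Width in characters of the longest string; at least 1 so that
            # an empty column or all-empty strings still give a sized field.
            width = max((len(s) for s in records[name]), default=1)
            dtype = "U{}".format(max(width, 1))
        descr.append((name, dtype))
    return records.astype(descr)


df = pd.DataFrame({"a": [1, 2], "b": [0.5, 0.9], "c": ["x", "yyy"]})
rec = to_records_unicode(df)
print(rec.dtype)
```

The resulting array has no object fields, so it should be savable with `np.save(..., allow_pickle=False)`. The trade-off is that `'U'` uses four bytes per character, so it is less compact than `'S'` for ASCII-only data.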

ENH: add to_records() option to output NumPy string dtypes, not objects #18146
