ENH: Serialize view of ArrowStringArray #42600

Closed
@mrocklin

Currently, pandas serializes a view of an ArrowStringArray by serializing the whole underlying array rather than only the subset the view covers. Here is an example:

In [1]: import pandas as pd

In [2]: s = pd.Series([c * 1000 for c in "abcdefghijklmnopqrstuvwxyz"])

In [3]: s
Out[3]: 
0     aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa...
1     bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb...
2     cccccccccccccccccccccccccccccccccccccccccccccc...
3     dddddddddddddddddddddddddddddddddddddddddddddd...
4     eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee...
5     ffffffffffffffffffffffffffffffffffffffffffffff...
6     gggggggggggggggggggggggggggggggggggggggggggggg...
7     hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh...
8     iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiii...
9     jjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjj...
10    kkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkk...
11    llllllllllllllllllllllllllllllllllllllllllllll...
12    mmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmm...
13    nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn...
14    oooooooooooooooooooooooooooooooooooooooooooooo...
15    pppppppppppppppppppppppppppppppppppppppppppppp...
16    qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq...
17    rrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrr...
18    ssssssssssssssssssssssssssssssssssssssssssssss...
19    tttttttttttttttttttttttttttttttttttttttttttttt...
20    uuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuu...
21    vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv...
22    wwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwww...
23    xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx...
24    yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy...
25    zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz...
dtype: object

In [4]: import pickle

In [5]: len(pickle.dumps(s))
Out[5]: 26758

In [6]: len(pickle.dumps(s.astype("string[pyarrow]")))
Out[6]: 26891

In [7]: len(pickle.dumps(s.head(5)))
Out[7]: 5632

In [8]: len(pickle.dumps(s.astype("string[pyarrow]").head(5)))
Out[8]: 26891

This negatively affects Dask DataFrame operations, which cut pandas DataFrames into small pieces, move them to different machines, and then piece them back together again.
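To make the requested behavior concrete, here is a minimal toy sketch using only the standard library. It is not pandas or pyarrow code: `StringView` and `CompactStringView` are hypothetical classes that model a sliced string array as a window onto a shared byte buffer, which is an assumption about the layout made purely for illustration. The first class pickles the full buffer regardless of the window, and the second trims the buffer to the visible window before pickling.

```python
import pickle


class StringView:
    """Toy model of a sliced string array (hypothetical, for illustration only)."""

    def __init__(self, buffer, offsets, start, stop):
        self.buffer = buffer    # concatenated UTF-8 bytes of all strings
        self.offsets = offsets  # offsets[i]:offsets[i + 1] bounds string i
        self.start = start      # first element visible through this view
        self.stop = stop        # one past the last visible element

    @classmethod
    def from_strings(cls, strings):
        encoded = [s.encode("utf-8") for s in strings]
        offsets = [0]
        for chunk in encoded:
            offsets.append(offsets[-1] + len(chunk))
        return cls(b"".join(encoded), offsets, 0, len(encoded))

    def head(self, n):
        # A view: shares the buffer and only narrows the window.
        return type(self)(self.buffer, self.offsets, self.start,
                          min(self.stop, self.start + n))

    def to_list(self):
        return [
            self.buffer[self.offsets[i]:self.offsets[i + 1]].decode("utf-8")
            for i in range(self.start, self.stop)
        ]


class CompactStringView(StringView):
    """Same toy view, but pickles only the bytes inside its window."""

    def __reduce__(self):
        lo = self.offsets[self.start]
        hi = self.offsets[self.stop]
        # Copy just the visible bytes and rebase the offsets to start at zero.
        trimmed_buffer = self.buffer[lo:hi]
        trimmed_offsets = [o - lo for o in self.offsets[self.start:self.stop + 1]]
        return (type(self),
                (trimmed_buffer, trimmed_offsets, 0, self.stop - self.start))


strings = [c * 1000 for c in "abcdefghijklmnopqrstuvwxyz"]

naive = StringView.from_strings(strings).head(5)
compact = CompactStringView.from_strings(strings).head(5)

naive_size = len(pickle.dumps(naive))      # includes the full shared buffer
compact_size = len(pickle.dumps(compact))  # includes only the five visible strings
roundtrip = pickle.loads(pickle.dumps(compact))

print(naive_size, compact_size)
```

In this toy model the compact variant pays for one extra copy of the visible bytes at pickle time, in exchange for a payload proportional to the view rather than to the original array. Whether the same trade-off is right for ArrowStringArray is a design question for the pandas maintainers.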

Labels: Arrow, Enhancement, IO Parquet, IO Pickle, Strings
