Description
Currently, pandas serializes a view of an ArrowStringArray by serializing the whole underlying array rather than only the subset the view covers. Here is an example:
In [1]: import pandas as pd
In [2]: s = pd.Series([c * 1000 for c in "abcdefghijklmnopqrstuvwxyz"])
In [3]: s
Out[3]:
0 aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa...
1 bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb...
2 cccccccccccccccccccccccccccccccccccccccccccccc...
3 dddddddddddddddddddddddddddddddddddddddddddddd...
4 eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee...
5 ffffffffffffffffffffffffffffffffffffffffffffff...
6 gggggggggggggggggggggggggggggggggggggggggggggg...
7 hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh...
8 iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiii...
9 jjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjj...
10 kkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkk...
11 llllllllllllllllllllllllllllllllllllllllllllll...
12 mmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmm...
13 nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn...
14 oooooooooooooooooooooooooooooooooooooooooooooo...
15 pppppppppppppppppppppppppppppppppppppppppppppp...
16 qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq...
17 rrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrrr...
18 ssssssssssssssssssssssssssssssssssssssssssssss...
19 tttttttttttttttttttttttttttttttttttttttttttttt...
20 uuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuu...
21 vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv...
22 wwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwww...
23 xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx...
24 yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy...
25 zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz...
dtype: object
In [4]: import pickle
In [5]: len(pickle.dumps(s))
Out[5]: 26758
In [6]: len(pickle.dumps(s.astype("string[pyarrow]")))
Out[6]: 26891
In [7]: len(pickle.dumps(s.head(5)))
Out[7]: 5632
In [8]: len(pickle.dumps(s.astype("string[pyarrow]").head(5)))
Out[8]: 26891
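Note that the pickled size of the five-row `head` of the `string[pyarrow]` series (26891 bytes) is identical to that of the full series, whereas the object-dtype `head` shrinks to 5632 bytes. This negatively affects Dask DataFrame operations that cut pandas DataFrames into small pieces, move them between machines, and then piece them back together again.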
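To make the mechanism concrete without depending on pandas or pyarrow internals, here is a toy sketch using only the standard library. It is not how either library implements pickling. The `(buffer, offset, length)` tuple layout and the helper names `dumps_naive`, `dumps_compact`, and `view_bytes` are hypothetical, chosen only to mimic a zero-copy view over a shared buffer. The idea is that a view should copy out just the bytes it covers before being serialized.

```python
import pickle

# Toy model of a zero-copy view: (shared buffer, start offset, length in bytes).
# This layout is an assumption for illustration, not the Arrow memory format.


def view_bytes(view):
    """Return only the bytes that the view actually covers."""
    buffer, offset, length = view
    return buffer[offset:offset + length]


def dumps_naive(view):
    """Pickle the view as-is, so the entire shared buffer is serialized."""
    return pickle.dumps(view)


def dumps_compact(view):
    """Copy out the covered bytes first, then pickle a self-contained view."""
    _, _, length = view
    return pickle.dumps((view_bytes(view), 0, length))


# Same shape of data as the example above: 26 strings of 1000 characters each.
data = "".join(c * 1000 for c in "abcdefghijklmnopqrstuvwxyz").encode("ascii")
full = (data, 0, len(data))
head = (data, 0, 5 * 1000)  # analogous to .head(5): same buffer, shorter length

print("naive full:  ", len(dumps_naive(full)))
print("naive head:  ", len(dumps_naive(head)))
print("compact head:", len(dumps_compact(head)))
```

In this toy model, the naive payload for `head` carries all 26000 bytes of the shared buffer, while the compact payload carries only the 5000 bytes the view covers and still round-trips to the same content. A real fix would presumably need to do the equivalent for each Arrow buffer (offsets, data, validity), which this sketch does not attempt.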