Allow StringArray[python] to be backed by numpy StringDType in numpy 2.0 #58578


Draft: lithomas1 wants to merge 73 commits into base: main
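For context, NumPy 2.0 ships the variable-width UTF-8 string dtype from NEP 55; its arrays report dtype kind "T", which is the check this diff relies on throughout. A minimal sketch of the dtype being targeted (assuming numpy >= 2.0; the pandas-facing API in this draft is still in flux):

```python
import numpy as np  # requires numpy >= 2.0

# NEP 55 variable-width string dtype; na_object opts in to missing values
dt = np.dtypes.StringDType(na_object=np.nan)
arr = np.array(["pandas", np.nan, "stringdtype"], dtype=dt)

print(arr.dtype.kind)  # "T" -- the kind checked throughout this diff
print(np.isnan(arr))   # [False  True False]
```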
Commits (73)
56ae252 WIP: preliminary support for stringdtype (ngoldbaum, Apr 4, 2023)
206d2f0 add NumpyStringArray and string[numpy] dtype (ngoldbaum, Apr 26, 2023)
5adadfa WIP: making progress (ngoldbaum, May 15, 2023)
a1175f2 fix factorize (ngoldbaum, May 19, 2023)
d6d21c8 Merge branch 'main' into stringdtype (ngoldbaum, May 26, 2023)
7426cd5 adapt to new PandasStringDType and circular dependency on pandas (ngoldbaum, May 31, 2023)
2f4ab45 Merge branch 'main' into stringdtype (ngoldbaum, May 31, 2023)
d92f0cb Merge branch 'main' into stringdtype (ngoldbaum, Jun 1, 2023)
8e59bba fix more tests (ngoldbaum, Jun 7, 2023)
f2f0798 Merge branch 'main' into stringdtype (ngoldbaum, Jun 7, 2023)
64f85d3 fix remaining ExtensionArray tests (ngoldbaum, Jun 14, 2023)
82922d0 Merge branch 'main' into stringdtype (ngoldbaum, Jun 14, 2023)
1654f8b deal with stringdtype not coercing NaN and None to NA (ngoldbaum, Jun 16, 2023)
2d736ae Merge branch 'main' into stringdtype (ngoldbaum, Jun 23, 2023)
87e2d14 adapt to stringdtype getting rid of PandasStringDType (ngoldbaum, Jul 11, 2023)
c68ea5a Merge branch 'main' into stringdtype (ngoldbaum, Jul 20, 2023)
0f0589e support latest version of stringdtype (ngoldbaum, Aug 1, 2023)
ca39aaf Merge branch 'main' into stringdtype (ngoldbaum, Aug 1, 2023)
88c7d5d Merge branch 'main' into stringdtype (ngoldbaum, Aug 10, 2023)
41ab894 adapt to changes in pandas and stringdtype (ngoldbaum, Aug 10, 2023)
43b3ce7 avoid copy when loading numpy string data (ngoldbaum, Aug 29, 2023)
13cf458 Merge branch 'main' into stringdtype (ngoldbaum, Aug 29, 2023)
ffb5ab7 Merge branch 'main' into stringdtype (ngoldbaum, Nov 15, 2023)
e6a6d6d Merge branch 'main' into stringdtype (ngoldbaum, Dec 6, 2023)
8cf1081 Merge branch 'main' into stringdtype (ngoldbaum, Feb 19, 2024)
7e5ea63 update to work with stringdtype in numpy (ngoldbaum, Feb 20, 2024)
6a5563f Merge branch 'main' into stringdtype (ngoldbaum, Feb 20, 2024)
65abaa6 some fixes for numpy support (ngoldbaum, Mar 11, 2024)
43a6a2a Merge branch 'main' into stringdtype (ngoldbaum, Mar 11, 2024)
23f594b Merge branch 'main' into stringdtype (ngoldbaum, Mar 14, 2024)
85609ca fix coercion tests (ngoldbaum, Mar 14, 2024)
86ffe1c more test fixes (ngoldbaum, Mar 15, 2024)
dc9419d fix memory usage test (ngoldbaum, Mar 18, 2024)
155ec68 Avoid copying in NumpyStringArray initializer (ngoldbaum, Mar 19, 2024)
8dadaf9 more fixes (ngoldbaum, Mar 26, 2024)
aad5f32 fix SyntaxError (ngoldbaum, Mar 27, 2024)
190ffe3 fix comparisons with scalars (ngoldbaum, Mar 27, 2024)
dcf2cec Implement some ufuncs (ngoldbaum, Apr 2, 2024)
b5cdea8 Add index/rindex (ngoldbaum, Apr 2, 2024)
ba0a8b4 drop unnecessary type annotations in map_infer_mask (ngoldbaum, Apr 2, 2024)
10437a0 Merge branch 'main' into stringdtype (ngoldbaum, Apr 19, 2024)
5691409 Add more string method implementations (ngoldbaum, Apr 19, 2024)
4b3e48b delete unnecessary input sanitization (ngoldbaum, Apr 19, 2024)
1e1d651 hotfix issue with hashing (ngoldbaum, Apr 24, 2024)
d27816c Avoid unnecessary copies in NumpyStringArray initializer (ngoldbaum, Apr 26, 2024)
19d85bb copy to hotfix issue in groupby (ngoldbaum, Apr 26, 2024)
11778ed Add stringdtype to more test fixtures (ngoldbaum, Apr 26, 2024)
2034a25 revert unnecessary changes to ObjectStringArrayMixin._str_map (ngoldbaum, Apr 26, 2024)
151fe64 handle NA values for inputs that might be coerced to string (ngoldbaum, Apr 26, 2024)
8394495 remove implementations for string methods that won't be available unt… (ngoldbaum, Apr 26, 2024)
aa7cec9 delegate to superclass for some startswith and endswith parameters (ngoldbaum, Apr 26, 2024)
dfedd1e fix null entries in findlike ufuncs (ngoldbaum, Apr 26, 2024)
d64dcf8 revert np min API version and try to fix tests (lithomas1, Apr 29, 2024)
8e32211 modify base object string array instead (lithomas1, May 4, 2024)
187d068 go for green (lithomas1, May 5, 2024)
3626c63 try again for green (lithomas1, May 5, 2024)
908c9e1 hopefully fix hashtable stuff (lithomas1, May 6, 2024)
70be1f6 wip (lithomas1, May 7, 2024)
ffe133b Update test for directly passing in numpy StringDType arrays (ngoldbaum, May 10, 2024)
b684da0 xfail memory usage test (ngoldbaum, May 10, 2024)
a202f1b Merge branch 'stringdtype2' of github.com:lithomas1/pandas into strin… (lithomas1, May 17, 2024)
1a0e783 fixup merge conflict (lithomas1, May 17, 2024)
7e0649f update (lithomas1, May 19, 2024)
fd2ba65 fix ci (lithomas1, May 20, 2024)
f301506 try to fix rest (lithomas1, May 20, 2024)
2c46b75 avoid nanops test failures (ngoldbaum, May 24, 2024)
4a538a0 fix ruff lints (ngoldbaum, May 24, 2024)
c88884a fix cython lints (ngoldbaum, May 24, 2024)
d0e3f1e fix more fuff lints (ngoldbaum, May 24, 2024)
a175c7a run ruff-format (ngoldbaum, May 24, 2024)
961a67c tweak for nanops case (ngoldbaum, May 24, 2024)
37143be Merge branch 'main' into stringdtype2 (ngoldbaum, May 30, 2024)
fbabedc Merge branch 'main' of github.com:pandas-dev/pandas into stringdtype2 (lithomas1, Aug 30, 2024)
1 change: 1 addition & 0 deletions asv_bench/asv.conf.json
@@ -42,6 +42,7 @@
// followed by the pip installed packages).
"matrix": {
"pip+build": [],
"numpy": ["2.0rc1"],
"Cython": ["3.0"],
"matplotlib": [],
"sqlalchemy": [],
3 changes: 3 additions & 0 deletions pandas/_libs/hashtable.pyx
@@ -25,6 +25,9 @@ from pandas._libs.khash cimport (
are_equivalent_float64_t,
are_equivalent_khcomplex64_t,
are_equivalent_khcomplex128_t,
kh_end,
kh_exist,
kh_key,
kh_needed_n_buckets,
kh_python_hash_equal,
kh_python_hash_func,
80 changes: 71 additions & 9 deletions pandas/_libs/hashtable_class_helper.pxi.in
@@ -5,6 +5,17 @@ WARNING: DO NOT edit .pxi FILE directly, .pxi is generated from .pxi.in
"""
from cpython.unicode cimport PyUnicode_AsUTF8

from numpy cimport (
flatiter,
PyArray_GETITEM,
PyArray_ITER_DATA,
PyArray_ITER_NEXT,
PyArray_IterNew,
)


from libc.string cimport strdup

{{py:

# name
@@ -970,7 +981,12 @@ cdef class StringHashTable(HashTable):
kh_resize_str(self.table, size_hint)

def __dealloc__(self):
cdef:
khiter_t k
if self.table is not NULL:
for k in range(kh_end(self.table)):
if kh_exist(self.table, k):
free(<char*>kh_key(self.table, k))
kh_destroy_str(self.table)
self.table = NULL

@@ -1013,6 +1029,8 @@ cdef class StringHashTable(HashTable):

v = PyUnicode_AsUTF8(key)

v = strdup(v)

k = kh_put_str(self.table, v, &ret)
if kh_exist_str(self.table, k):
self.table.vals[k] = val
@@ -1051,7 +1069,7 @@ cdef class StringHashTable(HashTable):
return labels

@cython.boundscheck(False)
def lookup(self, ndarray[object] values, object mask = None) -> ndarray:
def lookup(self, ndarray values, object mask = None) -> ndarray:
# -> np.ndarray[np.intp]
# mask not yet implemented
cdef:
@@ -1061,22 +1079,34 @@
const char *v
khiter_t k
intp_t[::1] locs = np.empty(n, dtype=np.intp)
flatiter it = PyArray_IterNew(values)

# these by-definition *must* be strings
vecs = <const char **>malloc(n * sizeof(char *))
if vecs is NULL:
raise MemoryError()
for i in range(n):
val = values[i]
val = PyArray_GETITEM(values, PyArray_ITER_DATA(it))

if isinstance(val, str):
# GH#31499 if we have a np.str_ PyUnicode_AsUTF8 won't recognize
# it as a str, even though isinstance does.
v = PyUnicode_AsUTF8(<str>val)
else:
v = PyUnicode_AsUTF8(self.na_string_sentinel)

# Need to copy result from PyUnicode_AsUTF8 when we have
# numpy strings
# Since numpy strings aren't backed by object arrays
# the buffer returned by PyUnicode_AsUTF8 will get freed
# in the next iteration when the created str object is GC'ed,
# clobbering the value of v
v = strdup(v)
Review comment (Member):

I'm a bit wary of managing the lifecycle this way - so the existing implementation has no ownership of the string lifecycle then, right? It's probably easier to make this a StringView hash table, then, and create a dedicated String hash table which does copy.

This is another case where using C++ would be a better language choice than tempita (see also https://github.com/pandas-dev/pandas/pull/57730/files).

Review comment (Member Author):

Yeah, editing anything in tempita kinda sucks in general.

But yes, I think the existing implementation doesn't have ownership of the Python string objects.

Turning this into StringViewHashTable, and subclassing it, sounds good to me.

Review comment (Contributor):

You could get the UTF-8 string data from the array entry directly, without going through PyArray_GETITEM, via the NumPy C API:

https://numpy.org/neps/nep-0055-string_dtype.html#packing-and-loading-strings

There aren't Cython bindings for this API yet in the numpy Cython bindings, but it's on my list of things to do. It probably makes sense to manage the allocators with a context manager, for example.

I also see that the new C API isn't yet covered in the C API docs, and I need to make sure there are docs for the stringdtype C API before the 2.0 release happens. Thank you for prompting me to notice that oversight!
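A minimal Python illustration of the lifetime issue the strdup() calls above work around (assuming numpy >= 2.0): indexing an object array hands back the stored str, whose UTF-8 buffer lives as long as the array does, while indexing a StringDType array materializes a fresh temporary str on every access.

```python
import numpy as np

obj = np.array(["hello"], dtype=object)
# Object arrays return the stored str itself, so a pointer obtained
# from PyUnicode_AsUTF8 on it stays valid while the array is alive.
assert obj[0] is obj[0]

sdt = np.array(["hello"], dtype=np.dtypes.StringDType())
# StringDType arrays build a new str on each access; once that
# temporary is garbage-collected, a PyUnicode_AsUTF8 pointer into it
# dangles, which is why the loops in this diff copy with strdup().
assert sdt[0] is not sdt[0]
```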


vecs[i] = v

PyArray_ITER_NEXT(it)

with nogil:
for i in range(n):
v = vecs[i]
@@ -1086,11 +1116,16 @@ cdef class StringHashTable(HashTable):
else:
locs[i] = -1

if values.dtype.kind == "T":
# free copied strings
for i in range(n):
free(vecs[i])

free(vecs)
return np.asarray(locs)

@cython.boundscheck(False)
def map_locations(self, ndarray[object] values, object mask = None) -> None:
def map_locations(self, ndarray values, object mask = None) -> None:
# mask not yet implemented
cdef:
Py_ssize_t i, n = len(values)
@@ -1099,32 +1134,45 @@ cdef class StringHashTable(HashTable):
const char *v
const char **vecs
khiter_t k
flatiter it = PyArray_IterNew(values)

# these by-definition *must* be strings
vecs = <const char **>malloc(n * sizeof(char *))
if vecs is NULL:
raise MemoryError()
for i in range(n):
val = values[i]
val = PyArray_GETITEM(values, PyArray_ITER_DATA(it))

if isinstance(val, str):
# GH#31499 if we have a np.str_ PyUnicode_AsUTF8 won't recognize
# it as a str, even though isinstance does.
v = PyUnicode_AsUTF8(<str>val)
else:
v = PyUnicode_AsUTF8(self.na_string_sentinel)

# Need to copy result from PyUnicode_AsUTF8 when we have
# numpy strings
# Since numpy strings aren't backed by object arrays
# the buffer returned by PyUnicode_AsUTF8 will get freed
# in the next iteration when the created str object is GC'ed,
# clobbering the value of v
v = strdup(v)
Review comment (Member):

This probably leaks in the current implementation.

Review comment (Member Author, lithomas1, May 8, 2024):

I should be freeing these strings in __dealloc__ if I didn't mess this up.

EDIT: Nevermind, I'm stupid 😓

Review comment (Member):

No, I wouldn't put it there either - __dealloc__ is the inverse of __cinit__; any memory allocations performed outside of those functions need to be managed with their own explicit lifecycle.


vecs[i] = v

PyArray_ITER_NEXT(it)

with nogil:
for i in range(n):
v = vecs[i]
k = kh_put_str(self.table, v, &ret)
self.table.vals[k] = i

free(vecs)

@cython.boundscheck(False)
@cython.wraparound(False)
def _unique(self, ndarray[object] values, ObjectVector uniques,
def _unique(self, ndarray values, ObjectVector uniques,
Py_ssize_t count_prior=0, Py_ssize_t na_sentinel=-1,
object na_value=None, bint ignore_na=False,
bint return_inverse=False):
@@ -1171,11 +1219,13 @@ cdef class StringHashTable(HashTable):
const char **vecs
khiter_t k
bint use_na_value
flatiter it = PyArray_IterNew(values)
bint non_null_na_value

if return_inverse:
labels = np.zeros(n, dtype=np.intp)
uindexer = np.empty(n, dtype=np.int64)

use_na_value = na_value is not None
non_null_na_value = not checknull(na_value)

@@ -1184,7 +1234,7 @@
if vecs is NULL:
raise MemoryError()
for i in range(n):
val = values[i]
val = PyArray_GETITEM(values, PyArray_ITER_DATA(it))

if (ignore_na
and (not isinstance(val, str)
Expand All @@ -1202,10 +1252,22 @@ cdef class StringHashTable(HashTable):
# if ignore_na is False, we also stringify NaN/None/etc.
try:
v = PyUnicode_AsUTF8(<str>val)
except UnicodeEncodeError:
except (UnicodeEncodeError, TypeError):
# pd.NA will raise TypeError
v = PyUnicode_AsUTF8(<str>repr(val))

# Need to copy result from PyUnicode_AsUTF8 when we have
# numpy strings
# Since numpy strings aren't backed by object arrays
# the buffer returned by PyUnicode_AsUTF8 will get freed
# in the next iteration when the created str object is GC'ed,
# clobbering the value of v
v = strdup(v)

vecs[i] = v

PyArray_ITER_NEXT(it)

# compute
with nogil:
for i in range(n):
@@ -1239,7 +1301,7 @@ cdef class StringHashTable(HashTable):
return uniques.to_array(), labels.base # .base -> underlying ndarray
return uniques.to_array()

def unique(self, ndarray[object] values, *, bint return_inverse=False, object mask=None):
def unique(self, ndarray values, *, bint return_inverse=False, object mask=None):
"""
Calculate unique values and labels (no sorting!)

@@ -1264,7 +1326,7 @@
return self._unique(values, uniques, ignore_na=False,
return_inverse=return_inverse)

def factorize(self, ndarray[object] values, Py_ssize_t na_sentinel=-1,
def factorize(self, ndarray values, Py_ssize_t na_sentinel=-1,
object na_value=None, object mask=None, ignore_na=True):
"""
Calculate unique values and labels (no sorting!)
8 changes: 8 additions & 0 deletions pandas/_libs/khash.pxd
@@ -125,5 +125,13 @@ cdef extern from "pandas/vendored/klib/khash_python.h":

khuint_t kh_needed_n_buckets(khuint_t element_n) nogil

# Needed to free the strings we copied in StringHashTable

khuint_t kh_end(kh_str_t* h) nogil

int kh_exist(kh_str_t* h, khuint_t x) nogil

void* kh_key(kh_str_t* h, khuint_t x) nogil


include "khash_for_primitive_helper.pxi"
51 changes: 27 additions & 24 deletions pandas/_libs/lib.pyx
@@ -675,41 +675,36 @@ def is_sequence_range(const int6432_t[:] sequence, int64_t step) -> bool:
return True


ctypedef fused ndarr_object:
ndarray[object, ndim=1]
ndarray[object, ndim=2]

# TODO: get rid of this in StringArray and modify
# and go through ensure_string_array instead


@cython.wraparound(False)
@cython.boundscheck(False)
def convert_nans_to_NA(ndarr_object arr) -> ndarray:
def convert_nans_to_NA(ndarray arr) -> ndarray:
"""
Helper for StringArray that converts null values that
are not pd.NA(e.g. np.nan, None) to pd.NA. Assumes elements
have already been validated as null.
"""
cdef:
Py_ssize_t i, m, n
Py_ssize_t i
Py_ssize_t n = len(arr)
object val
ndarr_object result
result = np.asarray(arr, dtype="object")
if arr.ndim == 2:
m, n = arr.shape[0], arr.shape[1]
for i in range(m):
for j in range(n):
val = arr[i, j]
if not isinstance(val, str):
result[i, j] = <object>C_NA
else:
n = len(arr)
for i in range(n):
val = arr[i]
if not isinstance(val, str):
result[i] = <object>C_NA
return result
flatiter it = cnp.PyArray_IterNew(arr)

for i in range(n):
# The PyArray_GETITEM and PyArray_ITER_NEXT are faster
# equivalents to `val = values[i]`
val = PyArray_GETITEM(arr, PyArray_ITER_DATA(it))

# Not string so has to be null since they're already validated
if not isinstance(val, str):
val = <object>C_NA

PyArray_SETITEM(arr, PyArray_ITER_DATA(it), val)

PyArray_ITER_NEXT(it)
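A rough pure-Python equivalent of the rewritten loop, to show what the helper does (a sketch only; the real implementation is the Cython above, which also covers n-dimensional input via the flat iterator):

```python
import numpy as np
import pandas as pd

# Input is assumed pre-validated: every non-str entry is some null value.
arr = np.array(["a", np.nan, None, pd.NA], dtype=object)
for i, val in enumerate(arr.flat):
    if not isinstance(val, str):
        arr.flat[i] = pd.NA  # normalize np.nan/None/etc. to pd.NA

print(arr)  # ['a' <NA> <NA> <NA>]
```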


@cython.wraparound(False)
Expand Down Expand Up @@ -1475,6 +1470,8 @@ def infer_dtype(value: object, skipna: bool = True) -> str:
- mixed
- unknown-array

Returns a dtype object for non-legacy numpy dtypes

Raises
------
TypeError
@@ -1585,6 +1582,9 @@ def infer_dtype(value: object, skipna: bool = True) -> str:
if inferred is not None:
# Anything other than object-dtype should return here.
return inferred
elif values.dtype.kind == "T":
# NumPy StringDType
return values.dtype

if values.descr.type_num != NPY_OBJECT:
# i.e. values.dtype != np.object_
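Per the docstring note and the kind == "T" branch added above, under this draft inference on a StringDType array short-circuits to the dtype object itself rather than a string label (a sketch of the draft behavior, assuming numpy >= 2.0):

```python
import numpy as np
from pandas.api.types import infer_dtype

arr = np.array(["x", "y"], dtype=np.dtypes.StringDType())
# Draft behavior per the branch above: the dtype object comes back
# for non-legacy numpy dtypes, instead of a label like "string".
print(infer_dtype(arr))  # StringDType()
```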
@@ -1600,7 +1600,7 @@
it = PyArray_IterNew(values)
for i in range(n):
# The PyArray_GETITEM and PyArray_ITER_NEXT are faster
# equivalents to `val = values[i]`
# equivalents to `val = values[i]`
val = PyArray_GETITEM(values, PyArray_ITER_DATA(it))
PyArray_ITER_NEXT(it)

@@ -1911,7 +1911,10 @@ cdef class StringValidator(Validator):
return isinstance(value, str)

cdef bint is_array_typed(self) except -1:
return self.dtype.type_num == cnp.NPY_UNICODE
if self.dtype.char == "T" or self.dtype.char == "U":
return True
# this lets user-defined string DTypes through
return issubclass(<object>self.dtype.typeobj, (np.str_, str))


cpdef bint is_string_array(ndarray values, bint skipna=False):
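With the StringValidator change above, is_string_array accepts fixed-width unicode arrays (kind "U"), NEP 55 arrays (kind "T"), and user-defined string dtypes whose scalar type subclasses str. A quick sketch of the intended behavior under this draft:

```python
import numpy as np
from pandas._libs import lib

# Fixed-width unicode passes today ("U" dtype)
print(lib.is_string_array(np.array(["a", "b"])))  # True
# Under this draft, NEP 55 StringDType arrays ("T") pass the same check
print(lib.is_string_array(np.array(["a", "b"], dtype=np.dtypes.StringDType())))
```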
9 changes: 8 additions & 1 deletion pandas/_libs/missing.pyi
@@ -1,3 +1,5 @@
from typing import overload

import numpy as np
from numpy import typing as npt

@@ -12,5 +14,10 @@ def is_matching_na(
def isposinf_scalar(val: object) -> bool: ...
def isneginf_scalar(val: object) -> bool: ...
def checknull(val: object) -> bool: ...
def isnaobj(arr: np.ndarray) -> npt.NDArray[np.bool_]: ...
@overload
def isnaobj(arr: np.ndarray, check_for_any_na=...) -> npt.NDArray[np.bool_]: ...
@overload
def isnaobj(
arr: np.ndarray, check_for_any_na=True
) -> tuple[npt.NDArray[np.bool_], bool]: ...
def is_numeric_na(values: np.ndarray) -> npt.NDArray[np.bool_]: ...
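As the updated stub suggests, isnaobj grows an opt-in second return value. A hypothetical usage sketch of the draft API (check_for_any_na is the parameter added in this PR, not part of released pandas):

```python
import numpy as np
from pandas._libs import missing

arr = np.array(["a", None], dtype=object)
mask = missing.isnaobj(arr)  # ndarray[bool], unchanged default behavior
# Draft addition: also report whether any NA was encountered at all.
mask, has_na = missing.isnaobj(arr, check_for_any_na=True)
```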
1 change: 1 addition & 0 deletions pandas/arrays/__init__.py
@@ -33,5 +33,6 @@
"PeriodArray",
"SparseArray",
"StringArray",
"ObjectStringArray",
"TimedeltaArray",
]
1 change: 0 additions & 1 deletion pandas/conftest.py
@@ -150,7 +150,6 @@ def pytest_collection_modifyitems(items, config) -> None:
("SeriesGroupBy.fillna", "SeriesGroupBy.fillna is deprecated"),
("SeriesGroupBy.idxmin", "The behavior of Series.idxmin"),
("SeriesGroupBy.idxmax", "The behavior of Series.idxmax"),
("to_pytimedelta", "The behavior of TimedeltaProperties.to_pytimedelta"),
("NDFrame.reindex_like", "keyword argument 'method' is deprecated"),
# Docstring divides by zero to show behavior difference
("missing.mask_zero_div_zero", "divide by zero encountered"),