Allow StringArray[python] to be backed by numpy StringDType in numpy 2.0 #58578
# in the next iteration when the created str object is GC'ed,
# clobbering the value of v
#if values.dtype.kind == "T":
v = strdup(v)
I'm a bit wary of managing lifecycle this way - so the existing implementation has no ownership of the string lifecycle then, right? It's probably easier to make that a StringView hash table then, and create a dedicated String hash table which does copy.
This is another case where using C++ would be a better language choice than tempita (see also https://github.com/pandas-dev/pandas/pull/57730/files)
Yeah, editing anything in tempita kinda sucks in general.
But yes, I think the existing implementation doesn't have ownership of the Python string objects.
Turning this into StringViewHashTable, and subclassing this sounds good to me.
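The view-versus-copy distinction discussed above can be illustrated with a rough plain-Python analogy. This is only a sketch, not the tempita/Cython hash table code: `ViewTable` and `OwningTable` are hypothetical names, a reused `bytearray` stands in for the temporary string buffer, and `bytes(view)` plays the role of `strdup`.

```python
class ViewTable:
    """Hypothetical table that stores borrowed views and never copies keys."""

    def __init__(self):
        self.keys = []

    def insert(self, view: memoryview) -> None:
        # The stored view still aliases the caller's buffer.
        self.keys.append(view)


class OwningTable:
    """Hypothetical table that copies every key, so it owns what it stores."""

    def __init__(self):
        self.keys = []

    def insert(self, view: memoryview) -> None:
        # Independent copy of the bytes, analogous to strdup in C.
        self.keys.append(bytes(view))


# A scratch buffer that gets reused for the next value (assumption: this
# models the temporary str object being replaced on the next iteration).
scratch = bytearray(b"alpha")

views, owned = ViewTable(), OwningTable()
views.insert(memoryview(scratch))
owned.insert(memoryview(scratch))

# Overwrite the buffer in place with a value of the same length.
scratch[:] = b"gamma"

view_key = bytes(views.keys[0])   # follows the buffer: the old key is gone
owned_key = owned.keys[0]         # unaffected: the table holds its own copy
```

The view table is cheaper but only correct while the caller keeps the underlying data alive and unchanged; the owning table pays for a copy per key and then has to release those copies itself.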
You could get the UTF-8 string data from the array entry directly, without going through PyArray_GETITEM, via the NumPy C API:
https://numpy.org/neps/nep-0055-string_dtype.html#packing-and-loading-strings
NumPy's Cython bindings don't cover this API yet, but it's on my list of things to do. It probably makes sense to manage the allocators with a context manager, for example.
I also see that the new C API isn't yet covered in the C API docs and I need to make sure there are docs for the stringdtype C API before the 2.0 release happens. Thank you for prompting me to notice that oversight!
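The context-manager idea mentioned above could look roughly like the sketch below. Everything here is an assumption for illustration: `StandInAllocator` and `acquired` are hypothetical names, not part of NumPy or its Cython bindings, and a `threading.Lock` merely stands in for whatever the real allocator guards. The point is only that the release step runs even when the body raises.

```python
import threading
from contextlib import contextmanager


class StandInAllocator:
    """Hypothetical stand-in for an allocator handle; NOT a NumPy object."""

    def __init__(self):
        self._lock = threading.Lock()
        self.held = False

    def acquire(self) -> None:
        self._lock.acquire()
        self.held = True

    def release(self) -> None:
        self.held = False
        self._lock.release()


@contextmanager
def acquired(allocator: StandInAllocator):
    """Pair every acquire with a release, including on error paths."""
    allocator.acquire()
    try:
        yield allocator
    finally:
        allocator.release()


alloc = StandInAllocator()
with acquired(alloc) as a:
    # Strings would be loaded or packed here while the allocator is held.
    held_inside = a.held
held_after = alloc.held
```

Wrapping the acquire/release pair this way keeps the release out of the caller's hands, which is the main appeal over calling the two functions manually.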
# in the next iteration when the created str object is GC'ed,
# clobbering the value of v
#if values.dtype.kind == "T":
v = strdup(v)
This probably leaks in the current implementation
I should be freeing these strings in __dealloc__, if I didn't mess this up.
EDIT: Nevermind, I'm stupid 😓
No, I wouldn't put it there either - __dealloc__ is the inverse of __cinit__; any memory allocations performed outside of those functions need to be managed with their own explicit lifecycle.
pandas/core/arrays/_mixins.py
value = value._ndarray

# np.where will not preserve the StringDType
# TODO: ask Nathan about this
I opened numpy/numpy#26420 for this.
This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.
doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.
Just testing for CI again.