ENH: expose datetime.c functions to cython #21199
We have cython tests in … Maybe a datetime64_c.h in …? I still would prefer we declare datetime64 to be a user-defined dtype and split the code out into a separate PyPI-installable package. This would be a good test of the new dtype machinery, as well as clean up NumPy internals.

So create a file "numpy/core/include/numpy/datetime_c.h" and in it put:
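The header contents were not captured in this excerpt. As a rough sketch only, here is what such a header could declare, assuming the two internal conversion functions named later in this thread (a real public header presumably would not use `NPY_NO_EXPORT`, so plain declarations are shown):

```c
/* Sketch of a hypothetical numpy/core/include/numpy/datetime_c.h.
 * The declarations are assumptions based on the internal datetime.c
 * functions named later in this thread, not the header actually proposed. */
#ifndef NUMPY_DATETIME_C_H_
#define NUMPY_DATETIME_C_H_

#include <numpy/ndarraytypes.h>  /* npy_datetime, npy_datetimestruct,
                                    PyArray_DatetimeMetaData */

/* broken-down datetimestruct -> datetime64 value; returns 0 on success */
int convert_datetimestruct_to_datetime(PyArray_DatetimeMetaData *meta,
                                       const npy_datetimestruct *dts,
                                       npy_datetime *out);

/* inverse: datetime64 value -> broken-down datetimestruct */
int convert_datetime_to_datetimestruct(PyArray_DatetimeMetaData *meta,
                                       npy_datetime dt,
                                       npy_datetimestruct *out);

#endif
```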
I've talked with @seberg about something similar to this. A nice upside would be that we could start using it in pandas immediately(ish) instead of waiting for our numpy min-version to catch up.
numpy/core/src/multiarray/datetime.c (outdated):

```diff
@@ -2035,7 +2033,7 @@ metastr_to_unicode(PyArray_DatetimeMetaData *meta, int skip_brackets)
 /*
  * Adjusts a datetimestruct based on a seconds offset. Assumes
  * the current values are valid.
  */
 NPY_NO_EXPORT void
 add_seconds_to_datetimestruct(npy_datetimestruct *dts, int seconds)
```
Disabled this function entirely in the last commit to confirm it is unused. Any objection to ripping it out? OK to consider this out of scope.
FYI, another way to check this would be to declare the function as `static` and use gcc's `-Wunused-function` flag. That flag may already be part of `-Wall` or `-Wextra`, so you might not need an extra flag at all, but it only warns for static functions.
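To illustrate why the warning only fires for static functions (file and function names here are invented for the example): with `static` the compiler knows the function cannot be reached from another translation unit, so an uncalled one is provably dead; with external linkage (which `NPY_NO_EXPORT` still has) it cannot know that.

```c
/* unused_demo.c -- toy example; names are invented for illustration.
 * Compile with:  gcc -c -Wall unused_demo.c
 * gcc warns: 'add_seconds' defined but not used [-Wunused-function] */

/* static => internal linkage: an uncalled function is provably dead */
static int
add_seconds(int seconds, int offset)
{
    return seconds + offset;
}

/* no 'static' => external linkage: another .c file might call this,
 * so -Wunused-function stays silent even though nothing here uses it */
int
add_minutes(int minutes, int offset)
{
    return minutes + offset;
}
```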
Ah, sorry for not noticing this. I am happy to delete it here or otherwise. The issue is that we have `NPY_NO_EXPORT` in a few places too many; it should be `static`, and then our CI would already take note.
The random number approach seems reasonable on first sight. I am a bit curious how it compares to most of the API, which uses the API table rather than relying on linking? Splitting out …
This is way outside my region of competence and I'll definitely need help with... whatever it means.
Depending on how much of the datetime code we're talking about splitting out, at some point we'd need to reconcile with the place/places where pandas intentionally diverges from the numpy behavior. Most of those discussions occurred before my time so I don't have a good read on what that would entail.
Even if this PR were fixed and merged today, it would take a year or two before pandas could actually rely on it and rip out our own code (minimum supported numpy on our last release was 1.18.5). So I'm not at all concerned about having to wait a while.
That question was for @mattip, since he was directly involved in exposing the random number API :).
I was talking about splitting it out completely. NumPy would probably keep its current ones around for backcompat, but we would deprecate them soon after and point to the new DType defined outside of NumPy. That is the question for you :). We could start on new DTypes for datetimes/timedeltas. Obviously, it would be targeted to do whatever pandas wants. But both times and pandas are complicated, so while I could take care of most of the DType part, we will need to work together and create a plan. And I would probably need help with many of the time-related things to make good progress. My guess is something like: …
The problem is that it seems hard to tell how quickly this can materialize into something usable. I don't think a prototype will take super long if pandas adoption isn't extremely tricky for some reason. But we may have to spend a few weeks of work just to see whether or not it is difficult. Are you interested in attempting that?
The big question mark this raises for me is "to what extent can numpy with new DTypes be dropped into existing pandas code and Just Work?" Have you tried running the pandas test suite with the new DType system in place? I'll be happy to serve as a guinea pig for the new DTypes, but can't make any promises about them actually getting used inside pandas anytime soon.
Setting aside my ignorance of what attributes/methods a DType needs, the idea I've been pursuing has been to create a Localizer class that would look a little like a dtype if you squint. It would be pinned as a private attribute on a Timestamp and on a DatetimeTZDtype, and its attributes would include a tzinfo and an NPY_DATETIMEUNIT. My last attempt to implement Localizer as a class was https://github.com/pandas-dev/pandas/pull/46246/files#diff-23ab6ca878fcae5ed2cb46d90ad29a18c2c4f5ba18bf16918b91b8ef613afefdR537, and its utc_val_to_local_val method was the guts of it. The problem I faced was that making that a method instead of repeating it inline really hurt performance (see the ASVs posted in the OP of that PR). In the interim, is pushing forward with this PR worthwhile?
The API table is for NumPy C-API functions (like …).
Well, I am OK with this if Matti is. I have a bit of a bad feeling about linking against … The alternative is to rename the functions …
I'm not sure what the distinction is. The functions touched in this PR are all ones that pandas re-implements near-verbatim.
If there's a non-trivial chance of some of these getting split out into a third package, maybe the future-proof thing to do would be to not put them into the "official" C-API?
Let's mark it for discussion at next week's meeting. The distinction is about how we link the functions: this PR links them like a normal dynamic library, which we do for some math and random functions.
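For background on that distinction: most of NumPy's C API avoids dynamic-library linking entirely. Consumers fetch a table of function pointers through a PyCapsule at import time (`import_array()`), so the dynamic linker never has to resolve the symbols. A rough sketch of the idea follows; all names are hypothetical, since NumPy's real table is generated code:

```c
/* Sketch of the function-pointer-table approach; names are hypothetical.
 * NumPy's real table is reached via import_array() and generated headers. */
#include <Python.h>

static void **MyLib_API = NULL;  /* resolved at runtime, not at link time */

/* callers go through the table instead of a linked symbol */
#define MyLib_DoThing (*(int (*)(int))MyLib_API[0])

static int
import_mylib(void)
{
    /* NumPy does the equivalent with its "_ARRAY_API" capsule */
    MyLib_API = (void **)PyCapsule_Import("mylib._API", 0);
    return (MyLib_API == NULL) ? -1 : 0;
}
```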
Noting as I go (in implementing non-nanosecond support in pandas) other numpy functions that I'd like to re-use rather than re-implement (will update this list if/when I find more):

- get_datetime_metadata_from_dtype

cc @WillAyd if I'm missing anything
I think there might be a few in the JSON space as well that we use for serialization: …
Which meeting(s) should I plan to attend? This will be my first one.
@jbrockmendel Wednesday at 16:00 UTC: https://hackmd.io/68i_JvOYQfy9ERiHgXMPvg. I wanted to see if we can nail down the linking thing with Matti and others, but happy if you come. I am seriously considering making a (minimal) "new" Datetime dtype so that you can see what is missing for proper support in pandas. My …
Zoom wouldn't let me into the meeting, which was probably OK because I was barely awake.
@jbrockmendel we discussed this briefly. I think there is a tendency of "can't we remove this eventually?", but that doesn't matter too much...
The rest should happen automatically, but you should copy-paste that magic comment from somewhere, and if you get strange compilation errors, formatting can be the reason (e.g. a marked function has to have the initial …).

As for a user DType, we did not really discuss it. But I am happy to create a prototype for an external DType, so that you could see what modifications in pandas may make it work. Something that would be useful in any case, probably.
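The "magic comment" referred to here is the `/*NUMPY_API` marker that NumPy's code generators scan for. Roughly, it looks like the following; the function shown is one discussed in this thread, but the docstring wording and body are only illustrative, and the types are assumed to be in scope as they would be inside numpy's datetime.c:

```c
/*NUMPY_API
 * Convert a datetimestruct to a datetime64 value.
 * (Docstring text here is illustrative, not the merged wording.)
 */
NPY_NO_EXPORT int
NpyDatetime_ConvertDatetimeStructToDatetime64(PyArray_DatetimeMetaData *meta,
                                              const npy_datetimestruct *dts,
                                              npy_datetime *out)
{
    /* the real implementation lives in numpy/core/src/multiarray/datetime.c */
    return 0;
}
```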
One option would be to move the duplicated code to a third package upstream of both numpy and pandas. That's a complication I'd prefer to avoid for now, but if it makes things easier on the numpy devs, can revisit it.
Based on our last phone call I got the impression that this is mostly-orthogonal to what I'm working on. Which is great! Because it means that I can spend time helping you with your project and feel less bad about taking up your time!
OK, I think I get what you're asking. In addition, I'm going to see if there's anything worth salvaging from #16364.
I take it back. I'm lost on what you're asking for in those bullet points, with the exception of …
I assume this is an internal convention, despite the fact that none of the affected functions take ndarrays. If renaming is under consideration, I'd suggest clarifying `convert_datetimestruct_to_datetime` by adding a "64" at the end to avoid confusion with pydatetime. Same with the reverse function, `convert_datetime_to_datetimestruct`.
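One possible reading of that suggestion, as declarations (the parameter lists follow the internal datetime.c signatures and the exact placement of the "64" is an assumption):

```c
/* Sketch of the suggested renames -- placement of "64" is one reading:
 *   convert_datetimestruct_to_datetime -> convert_datetimestruct_to_datetime64
 *   convert_datetime_to_datetimestruct -> convert_datetime64_to_datetimestruct */
int convert_datetimestruct_to_datetime64(PyArray_DatetimeMetaData *meta,
                                         const npy_datetimestruct *dts,
                                         npy_datetime *out);
int convert_datetime64_to_datetimestruct(PyArray_DatetimeMetaData *meta,
                                         npy_datetime dt,
                                         npy_datetimestruct *out);
```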
It's a bit annoying; basically, you list the function here: numpy/core/code_generators/numpy_api.py, line 356 (at aaae9d1).
The next will be …
In principle this should only mean adding that … I think there may be a hash hardcoded somewhere that needs updating (i.e. compilation will complain), but I would have to … The header for a function marked with …
Thanks for looking into this. I've spread myself too thin and need to put this on the back-burner. It may also turn out that we will want to get access to `parse_iso_8601_datetime`. Will revisit in a few weeks or months.
Revisiting this, I think I got the numpy_api.py file thing sorted out. Having trouble with the …
Either way, it says the file is not found. The actual location of the file is numpy/core/src/multiarray/datetime_strings.h.
Are your .h files located at numpy/core/include/numpy/file-name.h? The …

Look in …

These functions also need documentation; probably add a page in …
I don't know if @jbrockmendel cares, but I wouldn't mind pushing this through. If the docs are the only thing blocking it, can we maybe backport that, since docs don't actually matter for the RC process? (Sorry, don't want to block the release on this, but I had assumed we could push this through.)
I could compromise on a release note.
I tried rebasing this morning, had a merge conflict, and got discouraged.
Getting this in for 1.2x doesn't make much difference for me. Getting it in by 2.0 is a bigger deal.
I just tried writing a cast from the string dtype I've been working on to numpy datetimes and realized it would be nice to have …

That sounds great.

Merged with main, added release notes, and new C API datetime docs. If there are any test failures I'll clean those up. Is there a way to see the doc build generated by the CI job for a PR?
```diff
@@ -73,4 +73,6 @@
 0x00000011 = ca1aebdad799358149567d9d93cbca09

+# Version 18 (NumPy 2.0.0)
+0x00000012 = 5af92e858ce8e95409ae1fcc8f508ddd
+# Many API deprecations have been finalized
+# Add datetime conversion functions GH#21199
```
If we're not enumerating every single change for numpy 2.0 I'm happy to remove these.
Yah, let's not; there are already a few removals, too.
Oops, forgot the cython binding for …

Click on …
Time to put this in. The docs looked good, without checking every tiny detail. Thanks @jbrockmendel and @ngoldbaum for pushing it over the finish line!
Nice! Thanks to the reviewers for walking me through this, and to @ngoldbaum for getting it across the finish line. Looking forward to ripping a bunch of this out of pandas next year!
I'd be interested in particular if we could get pyarrow to share the implementation, since install size has been coming up as a pain point recently. @mattip are you the person to talk to about making this a reality?
I would think that, in parallel to the string dtype that is progressing quite nicely, a datetime64-like dtype could take form there. It is something to discuss at a community meeting: who are the interested parties to design the underlying structures, and how do they find the time (haha) to make progress?
xref #9675: it would be nice to cimport these rather than duplicating them in pandas.

Need to add a

```cython
cdef extern from "numpy/foobar.h":
    ...
```

for these in `__init__.pxd`; I need guidance on what that foobar.h should be. Is there a standard way of testing what is exposed to cython? cc @bashtage