
ENH: expose datetime.c functions to cython #21199


Merged: 28 commits into numpy:main from jbrockmendel:npy_yes_export, Jul 3, 2023

Conversation

jbrockmendel (Contributor)

xref #9675: it would be nice to cimport these rather than duplicating them in pandas.

We need to add a cdef extern from "numpy/foobar.h": block for these in __init__.pxd; I need guidance on what that foobar.h should be.
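Something like the following sketch is what I have in mind, where the header name and the function signature are placeholders, not real numpy API:

cdef extern from "numpy/foobar.h":
    # Placeholder signature: convert a datetimestruct to a datetime64
    # value in the given unit. Real names/signatures to be decided.
    npy_datetime datetimestruct_to_datetime64_placeholder(
        NPY_DATETIMEUNIT unit, npy_datetimestruct *dts) nogil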

Is there a standard way of testing what is exposed to cython? cc @bashtage

@mattip (Member) commented Mar 15, 2022

Is there a standard way of testing what is exposed to cython

We have cython tests in numpy/core/tests/test_cython.py which builds and imports the checks.pyx cython file. You could add to there or add a new file in numpy/core/tests/examples.
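For example, a new check in checks.pyx could look like the sketch below (the function name is invented for illustration; the real check would exercise whatever declarations end up in __init__.pxd), paired with a test in test_cython.py that imports the built module and calls it:

# sketch of an addition to the checks.pyx example module
cimport numpy as cnp

cnp.import_array()

def is_datetime64_dtype(dtype):
    # If this compiles, the NPY_DATETIME enum value was successfully
    # cimported from numpy/__init__.pxd.
    return dtype.num == cnp.NPY_DATETIME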

what that foobar.h should be

Maybe datetime64_c.h in numpy/core/include/numpy/? This would be similar to numpy/core/include/numpy/random/distributions.h

I still would prefer we declare datetime64 to be a user-defined dtype, and split the code out into a separate PyPI-installable package. This would be a good test of the new dtype machinery as well as clean up NumPy internals.

@jbrockmendel (Contributor, Author)

Maybe datetime64_c.h in numpy/core/include/numpy/? This would be similar to numpy/core/include/numpy/random/distributions.h

So create a file "numpy/core/include/numpy/datetime_c.h" and put #includes for core/src/multiarray/_datetime.h and core/src/multiarray/datetime_strings.h in it [somehow]?

I still would prefer we declare datetime64 to be a user-defined dtype, and split the code out into a separate PyPI-installable package. This would be a good test of the new dtype machinery as well as clean up NumPy internals.

I've talked with @seberg about something similar to this. A nice upside would be that we could start using it in pandas immediately(ish) instead of waiting for our numpy min-version to catch up.

@@ -2035,7 +2033,7 @@ metastr_to_unicode(PyArray_DatetimeMetaData *meta, int skip_brackets)
/*
* Adjusts a datetimestruct based on a seconds offset. Assumes
* the current values are valid.
*/

NPY_NO_EXPORT void
add_seconds_to_datetimestruct(npy_datetimestruct *dts, int seconds)
@jbrockmendel (Contributor, Author):
I disabled this function entirely in the last commit to confirm it is unused. Any objection to ripping it out? OK to consider this out of scope.

Another contributor:

FYI, another way that might help check this would be to declare the function as static and use gcc's -Wunused-function flag. That flag may be part of -Wall or -Wextra, so you might not need an extra flag at all, but it only warns for static functions.

A member:

Ah, sorry for not noticing this. I am happy to delete it here or elsewhere. The issue is that we use NPY_NO_EXPORT in a few too many places; this should be static, and then our CI would already take note.

@seberg (Member) commented Mar 15, 2022

The random number approach seems reasonable at first sight. I am a bit curious how it compares to most of the API, which uses the API table rather than relying on linking?

Splitting out datetime would be cool and it might also be a good time for some cleanups. I.e. we could change some things and keep the current datetimes unmodified and deprecated for a while.
To be clear, I would love a new external DType approach and I do expect it is the right way forward and I am willing to do most of the ground-work related to porting the current datetimes.
However, I don't want to stand in the way of more pragmatic solutions, since I don't know the full picture in pandas. Also a user DType will require at least the next NumPy version to actually function (which may just be OK for you?).

@jbrockmendel (Contributor, Author)

I am a bit curious how it compares to most of the API which uses the API table rather than relying on linking?

This is way outside my region of competence and I'll definitely need help with... whatever it means.

Splitting out datetime would be cool and it might also be a good time for some cleanups

Depending on how much of the datetime code we're talking about splitting out, at some point we'd need to reconcile with the place/places where pandas intentionally diverges from the numpy behavior. Most of those discussions occurred before my time so I don't have a good read on what that would entail.

Also a user DType will require at least the next NumPy version to actually function (which may just be OK for you?).

Even if this PR were fixed and merged today, it would take a year or two before pandas could actually rely on it and rip out our own code (minimum supported numpy on our last release was 1.18.5). So I'm not at all concerned about having to wait a while.

@seberg (Member) commented Mar 15, 2022

I am a bit curious how it compares to most of the API which uses the API table rather than relying on linking?

This is way outside my region of competence and I'll definitely need help with... whatever it means.

That question was for @mattip, since he was directly involved in exposing the random number API :).

Depending on how much of the datetime code we're talking about splitting out, at some point we'd need to reconcile with the place/places where pandas intentionally diverges from the numpy behavior

I was talking about splitting it out completely. NumPy would keep its current ones around for backcompat probably, but we would deprecate it soon after, and point to the new DType defined outside of NumPy.

That is the question for you :). We could start on new DTypes for datetime/timedelta. Obviously, it would be targeted to do whatever pandas wants. But both times and pandas are complicated, so while I could take care of most of the DType part, we will need to work together and create a plan. And I would probably need help with many of the time-related things to make good progress. My guess is something like:

  • Figure out the basic requirements (timezones? resolution?). One way to solve this for me would be if we had "scalars" for blueprints (that could be pandas scalars or even datetime.datetime)
  • Create a corresponding DType (this I can do, maybe not full featured, but with the basics there)
  • Fixes, fixes, fixes:
    • Pandas won't work with it out of the box; I have no idea how hard that will be
    • There will be holes in the NumPy API that need to get fixed (I should be able to do most of this of course)
  • "Pandas adoption": Even if pandas works OK with the new dtype, there is probably quite a bit of things that need to work with the new dtype explicitly. Maybe that is easy (dtype is very similar to the old one anyway) or maybe not.

The problem is that it seems hard to tell how quickly this can materialize into something usable. I don't think a prototype will take super long if pandas adoption isn't extremely tricky for some reason. But we may have to spend a few weeks of work just to see whether or not it is difficult. Are you interested in attempting that?

@jbrockmendel (Contributor, Author)

That is the question for you :). We could start on new DTypes for datetimes/timedelta. Obviously, it would be targeted to do whatever pandas wants

The big question mark this raises for me is "to what extent can numpy with the new DTypes be dropped into existing pandas code and Just Work?" Have you tried running the pandas test suite with the new DType system in place?

I'll be happy to serve as a guinea pig for the new Dtypes, but can't make any promises about them actually getting used inside pandas anytime soon.

Figure out the basic requirements (timezones? resolution?). One way to solve this for me would be if we had "scalars" for blueprints (that could be pandas scalars or even datetime.datetime)
Create a corresponding DType (this I can do, maybe not full featured, but with the basics there)

Setting aside my ignorance of what attributes/methods a DType needs, the idea I've been pursuing has been to create a Localizer class that would look a little bit like a dtype if you squint. It would be pinned as a private attribute on a Timestamp and on a DatetimeTZDtype, and its attributes would include a tzinfo and an NPY_DATETIMEUNIT.

My last attempt to implement Localizer as a class was https://github.com/pandas-dev/pandas/pull/46246/files#diff-23ab6ca878fcae5ed2cb46d90ad29a18c2c4f5ba18bf16918b91b8ef613afefdR537, and its utc_val_to_local_val method was the guts of it. The problem I faced was that making that a method, instead of repeating the logic inline, really hurt performance (see the asv benchmarks posted in the OP of that PR).


In the interim, is pushing forward with this PR worthwhile?

@mattip (Member) commented Mar 16, 2022

[H]ow it compares to most of the API which uses the API table rather than relying on linking?

The API table is for NumPy C-API functions (like PyArray_NDIM), not for pure C functions (like the ones in npymath). The random distribution functions are more similar to the npymath ones. I guess there could be room for discussion around exporting these functions: are they NumPy C-API functions or plain C functions?

@seberg (Member) commented Mar 16, 2022

Well, I am OK with this if Matti is. npymath is its own library, though, and doesn't even need Python to be loaded? This code does create Python errors and works with Python datetime objects. The header would also have to be in core/include/numpy/*, I think.

I have a bit of a bad feeling about linking against multiarray, though. I am sure this cython example works on Linux, but I don't know if we can link against multiarray easily everywhere. If that is indeed OK, then this is OK, but right now I don't quite trust that approach.

The alternative is to rename the functions PyArray_... to align them with most of our C-API and put them in the C-API table like everything else. They all have error returns, I think, so we can reasonably deprecate them when we want to.
I somewhat expect that would be the way to go. I don't have an opinion on whether this will be useful, so I am willing to just say that it should be more use than nuisance.

@jbrockmendel (Contributor, Author)

are they NumPy C-API or C functions?

I'm not sure what the distinction is. The functions touched in this PR are all ones that pandas re-implements near-verbatim.

The alternative is to rename the functions PyArray_... to align them with most of our C-API, and put them in the C-API table like everything else

If there's a non-trivial chance of some of these getting split out into a third package, maybe the future-proof thing to do would be to not put them into the "official" C-API?

@seberg added the "triage review" label (Issue/PR to be discussed at the next triage meeting) on Mar 16, 2022
@seberg (Member) commented Mar 16, 2022

Let's mark it for discussion at next week's meeting.

The distinction is about how we link the functions. This PR links them like a normal dynamic library, which we do for some math and random functions.
But the rest of the NumPy C-API uses an API "table". I.e., when you do import_array(), some Python (not C!) function calls fill in a table of function pointers for you.
The API functions like PyArray_NewFromDescr(...) are then actually macros expanding into something like multiarray_api_table[9](...) (if this were function number 9).
So from within the NumPy Python module, there is currently no C linking at all. Because of that, I am unsure that it is something we can do reliably.
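To make the mechanism concrete, a minimal Cython consumer looks like this (a sketch; the table index k below stands for whatever number the generator assigned):

cimport numpy as cnp

# import_array() runs the Python-level setup that fills in the C-API
# function-pointer table; without it, table-backed PyArray_* calls
# would dereference a NULL table and crash.
cnp.import_array()

def descr_for(int typenum):
    # PyArray_DescrFromType goes through the table: the C macro expands
    # to something like (*(PyArray_Descr *(*)(int))table[k])(typenum),
    # so no C-level linking against multiarray is involved.
    return cnp.PyArray_DescrFromType(typenum)

A Python caller can then pass any type number (e.g. np.dtype("M8[s]").num) and get the corresponding descriptor back, without the extension ever linking against multiarray.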

@jbrockmendel (Contributor, Author)

Noting as I go (while implementing non-nanosecond support in pandas) other numpy functions that I'd like to re-use rather than re-implement (I'll update this list if/when I find more):

get_datetime_metadata_from_dtype
npy_timedeltastruct

cc @WillAyd if I'm missing anything

@WillAyd (Contributor) commented Mar 16, 2022

I think there might be a few in the JSON space as well that we use for serialization:

https://github.com/pandas-dev/pandas/blob/main/pandas/_libs/src/ujson/python/date_conversions.h

@jbrockmendel (Contributor, Author)

Let's mark it for discussion at next week's meeting.

Which meeting(s) should I plan to attend? This will be my first one.

@seberg (Member) commented Mar 21, 2022

@jbrockmendel Wednesday at 16:00 UTC: https://hackmd.io/68i_JvOYQfy9ERiHgXMPvg. I wanted to see if we can nail down the linking thing with Matti and others, but happy if you come. I am seriously considering making a (minimal) "new" datetime DType so that you can see what is missing for proper support in pandas. My unitdtype prototype doesn't quite work, but maybe a datetime/timedelta DType will be an incentive to try that.

@jbrockmendel (Contributor, Author)

Zoom wouldn't let me into the meeting, which was probably OK because I was barely awake.

@seberg (Member) commented Mar 23, 2022

@jbrockmendel we discussed this briefly. I think there is a tendency of "can't we remove this eventually?", but that doesn't matter too much...
The point is, if we are to expose this, I think we need to do it via the C-API table, which means:

The rest should happen automatically, but you should copy-paste that magic comment from somewhere, and if you get strange compilation errors, formatting can be the reason (e.g. a marked function has to have the initial { bracket on the next line, I think).

As for a user DType, we did not really discuss it. But I am happy to create a prototype for an external DType, so that you could see what modifications in pandas may make it work. Something that would be useful in any case, probably.

@jbrockmendel (Contributor, Author)

I think there is a tendency of "can't we remove this eventually?", but that doesn't matter too much...

One option would be to move the duplicated code to a third package upstream of both numpy and pandas. That's a complication I'd prefer to avoid for now, but if it makes things easier on the numpy devs, we can revisit it.

As for a user DType, we did not really discuss it. But I am happy to create a prototype for an external DType, so that you could see what modifications in pandas may make it work. Something that would be useful in any case, probably.

Based on our last phone call I got the impression that this is mostly-orthogonal to what I'm working on. Which is great! Because it means that I can spend time helping you with your project and feel less bad about taking up your time!

[...] The rest should happen automatically, but you should copy paste that magic comment from somewhere [...]

OK, I think I get what you're asking. In addition, I'm going to see if there's anything worth salvaging from #16364.

@jbrockmendel (Contributor, Author)

I take it back. I'm lost on what you're asking for in those bullet points, with the exception of

I would want to rename the functions to PyArray_* or some similar pattern.

I assume this is an internal convention, despite the fact that none of the affected functions take ndarrays. If renaming is under consideration, I'd suggest clarifying "convert_datetimestruct_to_datetime" by adding a "64" at the end to avoid confusion with pydatetime, and the same for the reverse function "convert_datetime_to_datetimestruct".

@seberg (Member) commented Mar 23, 2022

It's a bit annoying; basically, you list the function here:

The next entry will be (307,) because there is a 306 at the top (this is really just an enumeration). The function itself has to be NPY_NO_EXPORT ..., but it must have a comment right before the definition that is formatted as:

/*NUMPY_API
 *
 * Actual comment
 */
NPY_NO_EXPORT int
myfunction(...)
{

In principle this should only mean adding that NUMPY_API thing on the first line of the comment, assuming that there is already a code comment. (Look e.g. at PyArray_All or any other C-API function.)

I think there may be a hash hardcoded somewhere that needs updating (i.e. compilation will complain), but I would also have to grep for it right now.

The header for a function marked with /*NUMPY_API is auto-generated, so you have to remove it from the current headers.
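Concretely, the registration is just a name-to-index mapping in numpy/core/code_generators/numpy_api.py; a hypothetical entry (the function name here is invented for illustration) would look like:

multiarray_funcs_api = {
    # ... existing entries, the highest currently being index 306 ...
    'NpyDatetime_HypotheticalNewFunction': (307,),
}

The code generators then emit both the header declaration and the table slot from this dict, which is why the hand-written declaration has to come out of the existing headers.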

@jbrockmendel (Contributor, Author)

Thanks for looking into this. I've spread myself too thin and need to put this on the back burner. It may also turn out that we will want access to parse_iso_8601_datetime. I'll revisit in a few weeks or months.

@charris closed this on Apr 6, 2022
@charris reopened this on Apr 6, 2022
@InessaPawson added the "triaged" label (Issue/PR that was discussed in a triage meeting) and removed the "triage review" label on Apr 20, 2022
@jbrockmendel (Contributor, Author)

Revisiting this: I think I got the numpy_api.py file sorted out, but I'm having trouble with the __init__.pxd files when using

cdef extern from "numpy/datetime_strings.h":
# or
cdef extern from "numpy/multiarray/datetime_strings.h":
# or
cdef extern from "numpy/core/src/multiarray/datetime_strings.h":

Either way, it says the file is not found. The actual location of the file is numpy/core/src/multiarray/datetime_strings.h.

@bashtage (Contributor) commented Jul 13, 2022

Are your .h files located at numpy/core/include/numpy/file-name.h? The numpy/header.h include works because the include path is numpy/core/include.
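In other words, the path in a cdef extern block is resolved against the compiler's include directories, not the repository layout. A sketch, assuming the build adds numpy's include root via numpy.get_include():

# The build must add numpy's include root, e.g. in setup.py:
#   Extension(..., include_dirs=[numpy.get_include()])
# numpy.get_include() points at .../numpy/core/include, so a header
# installed at numpy/core/include/numpy/arrayobject.h is spelled
# relative to that root:
cdef extern from "numpy/arrayobject.h":
    pass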

@charris (Member) commented May 19, 2023

Look in doc/release/upcoming_changes to see how the release notes are done.

@charris (Member) commented May 22, 2023

These functions also need documentation, probably add a page in doc/source/reference/c-api. Given that I would like to branch today or tomorrow, I'm going to push this off to the next release, which will require some additional changes.

@charris modified the milestones (1.25.0 release, 2.0.0 release) on May 22, 2023
@seberg (Member) commented May 22, 2023

I don't know if @jbrockmendel cares, but I wouldn't mind pushing this through. If the docs are the only thing blocking it, can we maybe backport them? Docs don't actually matter for the RC process.

(Sorry, don't want to block the release on this, but I had assumed we can push this through.)

@charris (Member) commented May 22, 2023

but I had assumed we can push this through.

I could compromise on a release note.

@jbrockmendel (Contributor, Author)

I tried rebasing this morning, had a merge conflict, and got discouraged.

I don't know if @jbrockmendel cares, but I wouldn't mind pushing this through

Getting this in for 1.2x doesn't make much difference for me. Getting it in by 2.0 is a bigger deal.

@ngoldbaum (Member)

I just tried writing a cast from the string dtype I've been working on to numpy datetimes and realized it would be nice to have parse_iso_8601_datetime exposed in the C API. @jbrockmendel would you mind if I pushed some commits to this PR to expose it and add it to the cython bindings? Also happy to fix the other issues raised by reviewers so this can go in sooner rather than later.

@jbrockmendel (Contributor, Author)

That sounds great.

@ngoldbaum (Member)

I merged with main, added release notes, and wrote new C API datetime docs. If there are any test failures, I'll clean those up.

Is there a way to see the doc build generated by the CI job for a PR?

@@ -73,4 +73,6 @@
0x00000011 = ca1aebdad799358149567d9d93cbca09

# Version 18 (NumPy 2.0.0)
0x00000012 = 5af92e858ce8e95409ae1fcc8f508ddd
# Many API deprecations have been finalized
# Add datetime conversion functions GH#21199
A member:
If we're not enumerating every single change for numpy 2.0 I'm happy to remove these.

Another member:

Yeah, let's not; there are already a few removals too.

@ngoldbaum (Member)

Oops, I forgot the cython binding for parse_iso_8601_datetime; I'll add that too.

@ganesh-k13 (Member)

Is there a way to see the doc build generated by the CI job for a PR?

https://output.circle-artifacts.com/output/job/2a322bc7-156b-4c32-b269-ca21c17a5534/artifacts/0/doc/build/html/index.html

Click on "Details" for the ci/circleci: build artifacts check.

@seberg (Member) commented Jul 3, 2023

Time to put this in. The docs looked good (without checking every tiny detail). Thanks @jbrockmendel and @ngoldbaum for pushing it over the finish line!

@seberg merged commit 10ab6aa into numpy:main on Jul 3, 2023
@jbrockmendel deleted the npy_yes_export branch on July 3, 2023, 14:21
@jbrockmendel (Contributor, Author)

Nice! Thanks to the reviewers for walking me through this, and to @ngoldbaum for getting it across the finish line. Looking forward to ripping a bunch of this out of pandas next year!

@jbrockmendel (Contributor, Author)

I still would prefer we declare datetime64 to be a user-defined dtype, and split the code out into a separate PyPI-installable package. This would be a good test of the new dtype machinery as well as clean up NumPy internals.

I'd be interested in particular if we could get pyarrow to share the implementation, since install size has been coming up as a pain point recently. @mattip are you the person to talk to about making this a reality?

@mattip (Member) commented Aug 7, 2023

I would think that, in parallel to the string dtype work that is progressing quite nicely, a datetime64-like dtype could take form there. It is something to discuss at a community meeting: who are the interested parties to design the underlying structures, and how will they find the time (haha) to make progress?
