BUG: pd.NA.__format__ fails with format_specs #34740

Merged
merged 3 commits into from
Jun 15, 2020

Conversation

topper-123
Contributor

Formatting pd.NA raises a TypeError whenever a non-empty format spec is supplied. np.nan does not behave this way, and the difference makes converting arrays containing pd.NA to strings brittle and annoying.

Examples:

>>> format(pd.NA)
'<NA>'  # master and PR, ok
>>> format(pd.NA, ".1f")
TypeError  # master
'<NA>'  # this PR
>>> format(pd.NA, ">5")
TypeError  # master
' <NA>'  # this PR, tries to behave like a string, then falls back to '<NA>', like np.nan

The new behaviour mirrors the behaviour of np.nan.
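
For reference, the np.nan behaviour being mirrored can be checked directly. This is a minimal sketch that uses the builtin float("nan") as a stand-in for np.nan (np.nan is an ordinary Python float), so it needs no imports; the variable names are illustrative only:

```python
# float("nan") stands in for np.nan here; np.nan is an ordinary Python float.
nan = float("nan")

no_spec = format(nan)            # no format spec at all
precision = format(nan, ".1f")   # a precision is accepted and has no visible effect
padded = format(nan, ">5")       # a width pads the three-character "nan" on the left

print(repr(no_spec), repr(precision), repr(padded))
```

None of these calls raises, which is the property the PR wants pd.NA to share.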

@jorisvandenbossche
Member

@topper-123 Thanks for looking into this!

Personally, instead of relying on a try/except of NaN to check what is supported, I would rather try to understand how and what works for NaN, and try to implement the same logic here.

For example, I suppose that format(pd.NA, ">10.1f") will fail on this branch? While for NaN this works.

Now, properly implementing __format__ manually might be too complicated though, and the "fallback" of formatting the string might already be useful anyway.

@jorisvandenbossche
Member

jorisvandenbossche commented Jun 13, 2020

Hmm, np.nan is just a float, so I think it uses the builtin float.__format__, which is probably a bit complicated to replicate ...

Another idea: how robust would it be if we format some other value (eg np.nan), and then replace "nan" with "<NA>" in the result? We would need a bit of logic to potentially replace " nan" instead of "nan" if possible, but for the rest it might work in many cases?

@topper-123
Contributor Author

Another idea: how robust would it be if we format some other value (eg np.nan), and then replace "nan" with "<NA>" in the result?

Wouldn't work out of the box, e.g. "nantes_{}".format(np.nan). I don't think adding logic to get the correct "nan" is the right approach; it's too complicated IMO.

Another idea: pd.NA is supposed to work with all dtypes, not just floats, so it probably shouldn't be restricted to the format specs accepted by float. How about a simple:

def __format__(self, format_spec):
    try:
        return self.__repr__().__format__(format_spec)
    except ValueError:
        return self.__repr__()

This would allow string format specs to work (as they do for floats already) and make self.__repr__() a fallback that always works.
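
To see what this proposal does with the specs from the description, here is a self-contained sketch. FakeNA is a hypothetical stand-in class, not the pandas implementation; only its __format__ body is taken from the snippet above:

```python
class FakeNA:
    """Hypothetical stand-in for pd.NA, used only to illustrate the fallback."""

    def __repr__(self) -> str:
        return "<NA>"

    def __format__(self, format_spec: str) -> str:
        try:
            # Treat the repr as an ordinary string and let str handle the spec.
            return self.__repr__().__format__(format_spec)
        except ValueError:
            # The spec is not valid for str (e.g. the "f" type code): plain repr.
            return self.__repr__()


na = FakeNA()
width_only = format(na, ">5")      # valid string spec, so padding is applied
float_spec = format(na, ".1f")     # "f" is not valid for str, falls back
mixed_spec = format(na, ">10.1f")  # also falls back, so the width is ignored

print(repr(width_only), repr(float_spec), repr(mixed_spec))
```

Under these assumptions a combined spec such as ">10.1f" never raises, but it also loses its padding, because the whole spec is rejected by str before the fallback kicks in.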

@jorisvandenbossche
Member

Wouldn't work out of the box, e.g. "nantes_{}".format(np.nan),

I don't fully know how the inner Python details of this method work, but I suppose the above would end up calling pd.NA.__format__("")? As long as the nan -> NA replacement happens inside the __format__ function, I would think the above would work fine.

How about a simple:

I think that is certainly better (avoiding only accepting the rules valid for float), but that still wouldn't work for the example I gave of format(pd.NA, ">10.1f") (I think).

(now, it's certainly already fixing a set of use cases, so could also be a good start)

@jorisvandenbossche
Member

jorisvandenbossche commented Jun 13, 2020

Very quick try with

    def __format__(self, format_spec) -> str:
        res = format(np.nan, format_spec)
        return res.replace("nan", "<NA>")

works for the example you gave, and also for the example I gave:

In [1]: "nantes_{}".format(pd.NA)  
Out[1]: 'nantes_<NA>'

In [3]: format(pd.NA, ">10.1f")
Out[3]: '       <NA>'

Of course, the above still needs to 1) take the 1-character length difference into account when there is whitespace padding (as in the second example), and 2) still fall back to formatting the string repr, and finally to the plain <NA> string repr (like your example impl at #34740 (comment)).
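
A rough sketch of how those two points could be combined is below. It was not part of the PR; format_na_like_nan and NA_REPR are hypothetical names, float("nan") stands in for np.nan, and only space padding is handled (other fill characters, the space sign option, and upper-case "NAN" output are left alone or sent to the string fallback):

```python
NA_REPR = "<NA>"  # assumed string repr of the missing-value marker


def format_na_like_nan(format_spec: str) -> str:
    """Format NA by formatting NaN and swapping in "<NA>" (illustrative only)."""
    try:
        res = format(float("nan"), format_spec)
    except ValueError:
        res = None  # the spec is not valid for floats

    if res is not None and "nan" in res:
        out = res.replace("nan", NA_REPR, 1)
        # "<NA>" is one character longer than "nan": give back one padding space.
        if out.startswith(" "):
            out = out[1:]
        elif out.endswith(" "):
            out = out[:-1]
        return out

    # Fallbacks: treat the repr as a string, then give up and return it as-is.
    try:
        return format(NA_REPR, format_spec)
    except ValueError:
        return NA_REPR


right_aligned = format_na_like_nan(">10.1f")
left_aligned = format_na_like_nan("<6")
print(repr(right_aligned), repr(left_aligned))
```

With this approach the ">10.1f" case keeps its requested width of 10, at the cost of the extra branching discussed in this thread.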

@topper-123
Contributor Author

topper-123 commented Jun 13, 2020

Yeah, __format__ only works inside the brackets, so you're right there.
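
This can be checked with a small probe object. The sketch below uses a hypothetical SpecRecorder class that records every format spec passed to its __format__ method; str.format calls it once per replacement field and never passes it the surrounding literal text:

```python
class SpecRecorder:
    """Hypothetical helper that records every format spec it receives."""

    def __init__(self) -> None:
        self.seen = []

    def __format__(self, format_spec: str) -> str:
        self.seen.append(format_spec)
        return "<NA>"


probe = SpecRecorder()
plain = "nantes_{}".format(probe)      # __format__ is called with ""
padded = "nantes_{:>5}".format(probe)  # __format__ is called with ">5"

print(plain, padded, probe.seen)
```

Because the literal "nantes_" never reaches __format__, a nan-to-<NA> replacement done inside the method cannot touch it.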

The length format spec would be one special case that would need to be handled, but are there others? I don't think so for floats, but there could be for other format specs?

@topper-123
Contributor Author

I've made the simpler implementation that I suggested. I'm a bit worried that adding the special cases would make this too complex.

@jreback added the Missing-data (np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate) and Output-Formatting (__repr__ of pandas objects, to_string) labels on Jun 14, 2020
@jreback added this to the 1.1 milestone on Jun 14, 2020
Contributor

@jreback jreback left a comment

Member

@jorisvandenbossche jorisvandenbossche left a comment

Yeah, I am fine with the simplest solution that at least fixes the basic formatting, for now. I still think it wouldn't be hard to support proper floating point / numeric formatting (with the NaN formatting and replace afterwards)

@jreback merged commit 594dc2a into pandas-dev:master on Jun 15, 2020
@jreback
Contributor

jreback commented Jun 15, 2020

thanks @topper-123

@topper-123 deleted the format_na branch on June 15, 2020, 22:34
3 participants