Json fix normalize #49920

Merged: 5 commits, Nov 28, 2022

@WillAyd WillAyd commented Nov 26, 2022

closes #49861

@@ -153,7 +153,7 @@ def _normalise_json(
                 # to avoid adding the separator to the start of every key
                 # GH#43831 avoid adding key if key_string blank
                 key_string=new_key
-                if new_key[: len(separator)] != separator
+                if key_string or new_key[: len(separator)] != separator

Member

Does key_string being falsy imply that new_key[: len(separator)] == separator?

If so, can this be simplified to just if key_string?

Member Author

If key_string is set, the implication is that you are within at least one recursive call. It seems like the in-place string substitution should only affect the very top of the hierarchy.

There is probably a cleaner way to represent it; this was just a quick bolt-on to the existing code.

Member

Sorry, I'm not sure I understand the reply (or perhaps my suggestion was unclear). I was suggesting:

            _normalise_json(
                data=value,
                # to avoid adding the separator to the start of every key
                # GH#43831 avoid adding key if key_string blank
                key_string=new_key if key_string else removeprefix(new_key, separator),
                normalized_dict=normalized_dict,
                separator=separator,
            )

because if key_string is falsey, then new_key[: len(separator)] != separator must also be falsey, and so the latter isn't needed (if you have a or b and you know not a implies not b, then a or b is the same as a). Wouldn't this also be a quick bolt-on to the existing code, but simpler?
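The equivalence claimed above can be checked by brute force. The following is only a sketch: the function names and the sample parents, separators, and keys are made up for illustration and are not part of pandas. It builds new_key the same way the hunk does and compares the original condition with the simplified one.

```python
from itertools import product

# Hypothetical sample inputs; an empty parent models the top-level call.
parents = ["", "p", "_p", "p_q"]
separators = ["_", ".", "__", "->"]
keys = ["a", "_a", ".a", "__a", "", "->x"]


def original_condition(key_string, new_key, separator):
    # Condition as written in the hunk under review.
    return bool(key_string or new_key[: len(separator)] != separator)


def simplified_condition(key_string, new_key, separator):
    # Suggested simplification: only look at key_string.
    return bool(key_string)


mismatches = []
for parent, sep, key in product(parents, separators, keys):
    new_key = f"{parent}{sep}{key}"  # same construction as in the hunk
    if original_condition(parent, new_key, sep) != simplified_condition(
        parent, new_key, sep
    ):
        mismatches.append((parent, sep, key))

print(mismatches)
```

When key_string is empty, new_key is the separator followed by the key, so it always starts with the separator and the second operand of the or can never be true.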

Member Author

Nice idea. We still support Python 3.8 though, right? I think removeprefix was added in 3.9.

I can also move this out of the argument list if that helps readability; I agree that even the original form was less than desirable in that respect.

Member

True, but there's a 3.8 version of it in pandas, which is used in some places, e.g.

    if sys.version_info < (3, 9):
        from pandas.util._str_methods import (
            removeprefix,
            removesuffix,
        )

        stripped_name = removesuffix(removeprefix(name, "__"), "__")
    else:
        stripped_name = name.removeprefix("__").removesuffix("__")

If you write it like that (with the if sys.version_info < (3, 9): check) then pyupgrade will automatically only keep the 3.9+ version when pandas drops 3.8
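A self-contained sketch of that pattern follows. The helper _removeprefix_backport and the function strip_leading are hypothetical stand-ins written for this example (the real backport lives in pandas.util._str_methods and is not reproduced here); only the shape of the sys.version_info branch mirrors the comment above.

```python
import sys


def _removeprefix_backport(s: str, prefix: str) -> str:
    # Hypothetical stand-in for a Python 3.8 backport of str.removeprefix.
    if prefix and s.startswith(prefix):
        return s[len(prefix) :]
    return s


def strip_leading(name: str, prefix: str) -> str:
    # The explicit version check is what lets a static tool drop the
    # first branch once the minimum supported version is 3.9.
    if sys.version_info < (3, 9):
        return _removeprefix_backport(name, prefix)
    else:
        return name.removeprefix(prefix)


print(strip_leading("__name__", "__"))
```

Both branches remove the prefix at most once, so a name that starts with the prefix twice keeps the second copy.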

Member Author

Ah, very nice. Cool, let me do a refactor with this; it should make things cleaner.

@@ -148,13 +149,13 @@ def _normalise_json(
     if isinstance(data, dict):
         for key, value in data.items():
             new_key = f"{key_string}{separator}{key}"
+
+            if not key_string:
+                new_key = removeprefix(new_key, separator)

Member

This is only available in Python 3.8 and under. That is on purpose, because it "forces" you to write it like

Suggested change
-                new_key = removeprefix(new_key, separator)
+                if sys.version_info < (3, 9):
+                    from pandas.util._str_methods import removeprefix
+
+                    new_key = removeprefix(new_key, separator)
+                else:
+                    new_key = new_key.removeprefix(separator)

and then when Python 3.8 is dropped, pyupgrade will rewrite this automatically to only keep

new_key = new_key.removeprefix(separator)

(you can see what will happen with pyupgrade pandas/io/json/_normalize.py --py39-plus)
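For context, the behavior change under review can be sketched outside pandas. The function flatten below is a simplified, hypothetical stand-in for _normalise_json, and the top_level_only flag exists only to compare the old and new conditions side by side. It uses slicing instead of removeprefix so it runs on any Python 3 version.

```python
def flatten(data, key_string="", separator=".", out=None, top_level_only=True):
    """Flatten nested dicts into a single dict with joined keys (sketch)."""
    if out is None:
        out = {}
    if isinstance(data, dict):
        for key, value in data.items():
            new_key = f"{key_string}{separator}{key}"
            if top_level_only:
                # New behavior: strip the leading separator only when
                # there is no parent key, i.e. at the top of the hierarchy.
                if not key_string:
                    new_key = new_key[len(separator) :]
            elif new_key[: len(separator)] == separator:
                # Old behavior: strip whenever the joined key happens to
                # start with the separator, even inside recursive calls.
                new_key = new_key[len(separator) :]
            flatten(value, new_key, separator, out, top_level_only)
    else:
        out[key_string] = data
    return out


nested = {"_id": {"a1": 10}}
print(flatten(nested, separator="_"))
print(flatten(nested, separator="_", top_level_only=False))
```

With the old condition, a top-level key that itself begins with the separator loses that character once a nested key is joined onto it, which matches the title of the linked issue.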

Member Author

Ah gotcha - sorry, I misunderstood before and thought that compat was handled directly in pandas.util._str_methods

@@ -21,6 +21,7 @@
     Scalar,
 )
 from pandas.util._decorators import deprecate
+from pandas.util._str_methods import removeprefix

Member Author

Suggested change
-from pandas.util._str_methods import removeprefix

Member

@MarcoGorelli MarcoGorelli left a comment

Looks good to me, thanks @WillAyd !

(as an aside, this could probably be rewritten better without recursion? I'll take a look when I get a chance)
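One possible shape for a non-recursive version, purely as an illustration: flatten_iterative is a hypothetical name, this is not the pandas implementation, and it is not necessarily what the reviewer had in mind. It replaces the call stack with an explicit list.

```python
def flatten_iterative(data, separator="."):
    """Flatten nested dicts without recursion, using an explicit stack."""
    out = {}
    stack = [("", data)]
    while stack:
        key_string, value = stack.pop()
        if isinstance(value, dict):
            # Push children in reverse so they are popped, and therefore
            # emitted, in the dict's insertion order.
            for key, child in reversed(list(value.items())):
                # No parent key means top level: do not prepend a separator.
                new_key = f"{key_string}{separator}{key}" if key_string else f"{key}"
                stack.append((new_key, child))
        else:
            out[key_string] = value
    return out


print(flatten_iterative({"a": {"b": 1, "c": {"d": 2}}, "e": 3}))
print(flatten_iterative({"_id": {"a1": 10}}, separator="_"))
```

Because the top level is detected by the empty parent key rather than by inspecting the joined string, keys that begin with the separator are left intact.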

@MarcoGorelli MarcoGorelli added the IO JSON read_json, to_json, json_normalize label Nov 27, 2022
@MarcoGorelli MarcoGorelli added this to the 2.0 milestone Nov 27, 2022
@@ -148,13 +149,18 @@ def _normalise_json(
     if isinstance(data, dict):
         for key, value in data.items():
             new_key = f"{key_string}{separator}{key}"
+
+            if not key_string:
+                if sys.version_info < (3, 9):

Member

@MarcoGorelli if we use if not PY310, where PY310 is from pandas.compat, would pyupgrade still flag this?

Member

It wouldn't, no. pyupgrade just does static analysis (it wouldn't know what the symbol PY310 means). In fact, I was kinda tempted to replace all the PY310 and other pandas.compat constants with sys.version_info checks, so we don't need to remember what to clean up when dropping versions each year.

Member

I would be open to this change.
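To illustrate why a purely syntactic tool can act on sys.version_info but not on a named constant, here is a small sketch using the standard ast module. It is not how pyupgrade is implemented; is_static_version_check and first_if_test are hypothetical helpers written for this example.

```python
import ast


def is_static_version_check(test: ast.expr) -> bool:
    """True if `test` is literally `sys.version_info <op> (<int>, ...)`."""
    if not isinstance(test, ast.Compare) or len(test.comparators) != 1:
        return False
    left, right = test.left, test.comparators[0]
    is_version_info = (
        isinstance(left, ast.Attribute)
        and left.attr == "version_info"
        and isinstance(left.value, ast.Name)
        and left.value.id == "sys"
    )
    is_int_tuple = isinstance(right, ast.Tuple) and all(
        isinstance(elt, ast.Constant) and isinstance(elt.value, int)
        for elt in right.elts
    )
    return is_version_info and is_int_tuple


def first_if_test(source: str) -> ast.expr:
    # Parsing only; the names in the source never need to be defined.
    node = ast.parse(source).body[0]
    assert isinstance(node, ast.If)
    return node.test


explicit = "if sys.version_info < (3, 9):\n    pass\nelse:\n    pass\n"
opaque = "if not PY310:\n    pass\nelse:\n    pass\n"

print(is_static_version_check(first_if_test(explicit)))
print(is_static_version_check(first_if_test(opaque)))
```

In the second snippet the syntax tree contains only a bare name, so without executing or resolving imports there is nothing that ties it to a Python version.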

@mroeschke mroeschke merged commit cd58f3b into pandas-dev:main Nov 28, 2022
@mroeschke
Member

Thanks @WillAyd

@rhshadrach
Member

This patch may have induced a potential regression. Please check the links below. If any ASVs are parameterized, the combinations of parameters for which a regression has been detected appear as subbullets. This is a partially automated message.

  • https://asv-runner.github.io/asv-collection/pandas/#io.json.NormalizeJSON.time_normalize_json
    • orient='columns'; frame='df'
    • orient='columns'; frame='df_date_idx'
    • orient='columns'; frame='df_int_float_str'
    • orient='columns'; frame='df_int_floats'
    • orient='columns'; frame='df_td_int_ts'
    • orient='index'; frame='df'
    • orient='index'; frame='df_date_idx'
    • orient='index'; frame='df_int_float_str'
    • orient='index'; frame='df_int_floats'
    • orient='index'; frame='df_td_int_ts'
    • orient='records'; frame='df'
    • orient='records'; frame='df_date_idx'
    • orient='records'; frame='df_int_float_str'
    • orient='records'; frame='df_int_floats'
    • orient='records'; frame='df_td_int_ts'
    • orient='split'; frame='df'
    • orient='split'; frame='df_date_idx'
    • orient='split'; frame='df_int_float_str'
    • orient='split'; frame='df_int_floats'
    • orient='split'; frame='df_td_int_ts'
    • orient='values'; frame='df'
    • orient='values'; frame='df_date_idx'
    • orient='values'; frame='df_int_float_str'
    • orient='values'; frame='df_int_floats'
    • orient='values'; frame='df_td_int_ts'

@WillAyd WillAyd deleted the json-fix-normalize branch December 24, 2022 22:05
@WillAyd
Member Author

WillAyd commented Jan 3, 2023

@rhshadrach awesome bot. Will take a look - moving the import to the global space might help
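A rough way to probe that idea in isolation is sketched below. It assumes the cost in question is re-executing an import statement inside a per-key loop, which the thread does not confirm as the cause of the regression. The two functions are hypothetical, and operator.getitem is only a stand-in for the imported helper.

```python
import timeit
from operator import getitem  # hoisted: resolved once, at module import time


def strip_with_local_import(keys, separator):
    out = []
    for key in keys:
        # Executed on every iteration: after the first time this is only a
        # sys.modules lookup plus a name binding, but it is not free.
        from operator import getitem as local_getitem

        stripped = local_getitem(key, slice(len(separator), None))
        out.append(stripped if key.startswith(separator) else key)
    return out


def strip_with_hoisted_import(keys, separator):
    out = []
    for key in keys:
        stripped = getitem(key, slice(len(separator), None))
        out.append(stripped if key.startswith(separator) else key)
    return out


if __name__ == "__main__":
    sample = [f"_key{i}" for i in range(1000)]
    for func in (strip_with_local_import, strip_with_hoisted_import):
        seconds = timeit.timeit(lambda: func(sample, "_"), number=200)
        print(f"{func.__name__}: {seconds:.4f}s")
```

Both functions return the same result; only the placement of the import differs, so any timing gap on a given machine is attributable to that.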

Labels
IO JSON read_json, to_json, json_normalize
Development

Successfully merging this pull request may close these issues.

json_normalize - incorrect removing separator from beginning of key
4 participants