replace() alters unrelated values #7140

Closed
toobaz opened this issue May 16, 2014 · 8 comments · Fixed by #7304
Labels: Bug; Dtype Conversions (unexpected or buggy dtype conversions)
Milestone: 0.14.1

Comments

Member

toobaz commented May 16, 2014

I don't know whether this is the same bug as #7126 (it looks similar, but I didn't check the implementation), but:

In [1]: from pandas import DataFrame

In [2]: import numpy as np

In [3]: df = DataFrame(index=range(2))

In [4]: df['a'] = True

In [5]: df.a

Out[5]: 
0    True
1    True
Name: a, dtype: bool

In [6]: df.replace([np.inf, -np.inf], np.nan).a

Out[6]: 
0   NaN
1   NaN
Name: a, dtype: float64

Note that running such a replace on boolean columns may seem pointless, but it is less so once you consider that "replace" (unlike, e.g., dropna) does not support a "subset" argument.
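The effect of a "subset" argument can be approximated by calling replace on a column selection and assigning the result back. The following is a minimal sketch under stated assumptions: the helper name `replace_subset` is hypothetical (it is not part of pandas), and the float column `x` is added here purely for illustration.

```python
import numpy as np
import pandas as pd


def replace_subset(df, to_replace, value, subset):
    """Hypothetical helper (not a pandas API): run replace() only on `subset`."""
    out = df.copy()
    # Only the selected columns are passed through replace();
    # every other column, including boolean ones, is left untouched.
    out[subset] = out[subset].replace(to_replace, value)
    return out


# Illustrative frame: "a" mirrors the boolean column from the report,
# "x" is an assumed float column that actually contains an infinity.
df = pd.DataFrame({"a": [True, True], "x": [np.inf, 1.0]})
result = replace_subset(df, [np.inf, -np.inf], np.nan, subset=["x"])
```

Because the boolean column never reaches replace(), its values and dtype cannot be altered by the call.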

@cpcloud cpcloud added the Bug label May 16, 2014
@cpcloud cpcloud added this to the 0.14.1 milestone May 16, 2014
@cpcloud cpcloud self-assigned this May 16, 2014
Member

cpcloud commented May 16, 2014

seems to be replacing "truthy" values ... again the dtype issues arise :)
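One way such a "truthy" match could arise is if the values to replace are coerced to the column's boolean dtype before comparison. The plain-Python sketch below only illustrates that suspected mechanism; it is an assumption, not a trace of the actual pandas code path.

```python
inf = float("inf")

# Compared directly, infinity is not equal to True.
direct_match = inf == True  # False

# Coerced to bool first, any non-zero float (including inf) becomes True.
coerced_match = bool(inf) == True  # True

column = [True, True]
to_replace = [inf, -inf]

# Matching after coercing the targets to bool: every True value "matches".
coerced_targets = {bool(t) for t in to_replace}
replaced_after_coercion = [None if v in coerced_targets else v for v in column]

# Matching on the original values: nothing in the column matches.
replaced_strict = [None if any(v == t for t in to_replace) else v for v in column]
```

Under that assumption, the coerced comparison wipes out the whole column while the strict comparison leaves it alone, which is the difference between the reported and the expected output.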

Contributor

hayd commented May 30, 2014

The docs on replace suggest that, if to_replace is a list, it should be a list of strings... so I'm not sure this is a bug per se. Wow, replace really does do a lot!

I would definitely consider using one of these instead:

df.where(~df.isin([np.inf, -np.inf]))
df.where(np.isfinite(df.values))

Note: These work with both series and dataframes.
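As a hedged illustration of the two suggestions: `where` keeps the entries for which the condition holds and fills the rest with NaN by default, so its return value is already the cleaned object. The Series `s` and the float column `x` below are assumptions added for the example.

```python
import numpy as np
import pandas as pd

# Series case: keep finite entries, turn +/-inf into NaN.
s = pd.Series([1.0, np.inf, -np.inf])
s_clean = s.where(np.isfinite(s))

# DataFrame case with a boolean column next to a float column.
df = pd.DataFrame({"a": [True, True], "x": [np.inf, 1.0]})
# isin() marks the cells equal to +/-inf; where() keeps everything else.
df_clean = df.where(~df.isin([np.inf, -np.inf]))
```

Note that `np.isfinite` only accepts numeric input, so the `isin` form is the safer choice for frames that may also hold object columns.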

Member Author

toobaz commented May 30, 2014

@hayd : I interpret "list of str, regex, or numeric" as "list of str, list of regex, or list of numeric"... after all,

df['b'] = range(2)
df.replace([-1, 0], np.nan)

works as expected (and there is no theoretical reason why it shouldn't).

Moreover, "where" does work on dataframes... but I don't see what I am supposed to do with its result in order to get the result of "replace" (though I may just be temporarily blind).

I do understand that a "replace" on a Series of bools may want to match based on truthiness, but I'm pretty sure that I want a quick way to replace any numeric value in a DataFrame without touching booleans. So one possible solution would be to clarify the problem in the docs and add a "subset" argument, as for dropna. A cleaner solution, though, would be to add a "dtype" argument that determines the dtype to use for the comparison (i.e. convert booleans to float, not vice versa). Its default would be the Series' dtype for a Series, and the dtype of the item to be replaced for a DataFrame with non-homogeneous dtypes.
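In the absence of a "dtype" argument, one way to replace numeric values without touching booleans is to select the columns by dtype first. This is a sketch of a workaround, not of the proposed API; it relies on `select_dtypes(include="number")` leaving boolean columns out, and the columns `b` and `x` are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [True, True], "b": [0, 1], "x": [np.inf, 1.0]})

# "number" selects integer and float columns; the boolean column "a" is excluded.
numeric_cols = df.select_dtypes(include="number").columns

cleaned = df.copy()
# replace() only ever sees the numeric columns, so "a" cannot be affected.
cleaned[numeric_cols] = cleaned[numeric_cols].replace([np.inf, -np.inf], np.nan)
```

The selection is driven by the columns' dtypes rather than by names, so it keeps working when columns are added or renamed.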

Note that the same problem arises with

df['c'] = False
df.replace([None], np.nan)

Contributor

hayd commented May 30, 2014

Good point, that seems like a reasonable interpretation!

Member

cpcloud commented Jun 1, 2014

@toobaz while adding those things (kwarg, subset) might make this use case a bit more clear, replace already tries to do too much. In fact we're considering deprecating the replace method (over at least a couple of releases since it's fairly old) into a namespace like we do with str for vectorized string methods. See #5541 (comment) and that issue (#5541).

Member

cpcloud commented Jun 1, 2014

@toobaz all that said, this IS a bug which will be fixed by the next minor release (0.14.1)

Member Author

toobaz commented Jun 1, 2014

@cpcloud Great!
(particularly considering that the API change proposed in #5541 would by itself only move the problem from "replace" to "replace_list")

Member

cpcloud commented Jun 1, 2014

yep that's the idea, though. I've spent many hours debugging replace (most likely because of my own misunderstanding or breakage of something) and one thing that makes this process annoying is that it's enormously overloaded with different types (and dtypes). Every time I need to step through the code, I can't put a breakpoint at the top of the function (well, I could, but then I have to keep hitting n until I get to the code that actually matters); I have to actually find where the branch on the types of value and to_replace occurs and put it there. This is trivial to do a couple of times, but to do it every time there's a problem is extremely annoying. Plus, user code will be much clearer IMO because it'll let the reader know that "hey, I'm doing a (list|scalar|dict) replace" rather than having to go up the call stack and look at the types.
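To make the idea concrete, here is a rough plain-Python sketch of what separate entry points per argument type could look like. All function names are hypothetical and operate on ordinary lists; this is not the pandas implementation or the API proposed in #5541. The strict `_same` comparison is an added assumption, chosen so that booleans never match numbers.

```python
from typing import Any, List, Mapping, Sequence


def _same(a: Any, b: Any) -> bool:
    # Strict match: identical type and equal value, so True never matches 1 or 1.0.
    return type(a) is type(b) and a == b


def replace_scalar(values: Sequence[Any], to_replace: Any, value: Any) -> List[Any]:
    """Replace every element that strictly matches one scalar."""
    return [value if _same(v, to_replace) else v for v in values]


def replace_list(values: Sequence[Any], to_replace: Sequence[Any], value: Any) -> List[Any]:
    """Replace every element that strictly matches any item of a list."""
    return [value if any(_same(v, t) for t in to_replace) else v for v in values]


def replace_dict(values: Sequence[Any], mapping: Mapping[Any, Any]) -> List[Any]:
    """Replace each element by the mapped value of the first key it strictly matches."""
    out = []
    for v in values:
        for old, new in mapping.items():
            if _same(v, old):
                v = new
                break
        out.append(v)
    return out
```

Each function has a single code path, so a breakpoint on its first line lands directly in the relevant logic, and the call site states which kind of replacement is meant.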
