Transforms like 'scale' need some way to handle missing data #82

Open
rsgmon opened this issue Apr 6, 2016 · 3 comments

rsgmon commented Apr 6, 2016

In[92]: df = pd.DataFrame([(1,3),(2,6),(4,2),(6,5),(7,3),(4,6),(2,2),(6,4)], columns=['y','X'])
In[93]: pt.dmatrices('y ~ X.diff()', df)
Out[93]: 
(DesignMatrix with shape (7, 1)
   y
   2
   4
   6
   7
   4
   2
   6
   Terms:
     'y' (column 0),
 DesignMatrix with shape (7, 2)
   Intercept  X.diff()
           1         3
           1        -4
           1         3
           1        -2
           1         3
           1        -4
           1         2
   Terms:
     'Intercept' (column 0)
     'X.diff()' (column 1))

In[94]: pt.dmatrices('y ~ scale(X.diff())', df)

Traceback (most recent call last):
  File "C:\Python34\lib\site-packages\IPython\core\formatters.py", line 222, in catch_format_error
    r = method(self, *args, **kwargs)
  File "C:\Python34\lib\site-packages\IPython\core\formatters.py", line 699, in __call__
    printer.pretty(obj)
  File "C:\Python34\lib\site-packages\IPython\lib\pretty.py", line 368, in pretty
    return self.type_pprinters[cls](obj, self, cycle)
  File "C:\Python34\lib\site-packages\IPython\lib\pretty.py", line 552, in inner
    p.pretty(x)
  File "C:\Python34\lib\site-packages\IPython\lib\pretty.py", line 382, in pretty
    return meth(obj, self, cycle)
  File "C:\Python34\lib\site-packages\patsy\design_info.py", line 1089, in _repr_pretty_
    for col in formatted_cols]
  File "C:\Python34\lib\site-packages\patsy\design_info.py", line 1089, in <listcomp>
    for col in formatted_cols]
ValueError: max() arg is an empty sequence
Out[94]: 
Member

njsmith commented Apr 6, 2016

FYI -- to paste multi-line code blocks on GitHub, use triple-backquotes. (I just fixed your original post -- if you click "edit" on it you can see how I modified it.)

The main problem you are hitting here is that your X.diff() thing has a NaN in it:

In [11]: df["X"].diff()
Out[11]: 
0   NaN
1     3
2    -4
3     3
4    -2
5     3
6    -4
7     2
Name: X, dtype: float64

Then when you pass that to scale, it tries to calculate the mean/stddev of the array, the NaN propagates, and it returns an array of all NaNs:

In [12]: pt.builtins.scale(df["X"].diff())
Out[12]: 
0   NaN
1   NaN
2   NaN
3   NaN
4   NaN
5   NaN
6   NaN
7   NaN
Name: X, dtype: float128

And then patsy's missing-data handling kicks in and throws away all of the rows that contain NaNs, so you get back a design matrix with zero rows in it.
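
The chain described above can be reproduced outside patsy. This is a minimal sketch, assuming NumPy is available; it uses a plain mean/std standardization and a boolean mask as rough stand-ins for what scale and the missing-data handling do, not patsy's actual code:

```python
import numpy as np

# The X column from the original post; the leading NaN mimics pandas .diff().
x = np.array([3, 6, 2, 5, 3, 6, 2, 4], dtype=float)
d = np.concatenate(([np.nan], np.diff(x)))

# Plain mean/std propagate NaN, so every standardized value becomes NaN.
scaled = (d - d.mean()) / d.std()

# Rough stand-in for dropping rows with missing values: keep non-NaN entries.
kept = scaled[~np.isnan(scaled)]

print(np.isnan(scaled).all())
print(kept.shape)
```

A single NaN in the input is enough to make every row look missing, which is why nothing survives the drop.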

And then there's a bug in patsy which I should fix, where if you try to print a design matrix with zero rows then it throws an error. But that's not really your main problem, it just obscures it :-)
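
For illustration, the kind of failure in the traceback can be sketched in plain Python. The helper `column_width` below is hypothetical and is not patsy's code; it assumes the printer computes a column width by taking `max()` over the formatted cells, which raises `ValueError` when there are no rows:

```python
def column_width(cells):
    # Width needed to print a column: the length of the longest formatted cell.
    # default=0 keeps max() from raising ValueError when there are no rows.
    return max((len(c) for c in cells), default=0)

# With rows present, the width is the longest cell.
print(column_width(["3", "-4", "3"]))

# With zero rows, an unguarded max() raises ValueError.
try:
    max(len(c) for c in [])
except ValueError as exc:
    print("unguarded max() failed:", exc)

# The guarded version falls back to a width of 0.
print(column_width([]))
```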

njsmith added a commit to njsmith/patsy that referenced this issue Apr 6, 2016
njsmith changed the title from "Can't seem to get scaled diff's" to "Transforms like 'scale' need some way to handle missing data" Apr 7, 2016
Member

njsmith commented Apr 7, 2016

The deeper issue, which is a genuine issue, is that scale doesn't have any way to handle data with missing values inside it :-/. I never implemented this because I'm not really sure what the right approach is -- there are different ways to handle missing values, and there are different ways to flag them, and there isn't really any way right now to propagate the current settings (see the NA_action argument to dmatrix and friends) into scale. So there's definitely something to fix here, but I don't know how to do it right now, so I'll rename this issue to serve as a marker and hopefully come back to it at some point...
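
One possible stopgap, sketched here under stated assumptions rather than as a proposed fix: a NaN-aware standardization built on NumPy's `nanmean` and `nanstd`. The name `nan_scale` is hypothetical and not part of patsy, it picks just one of the several ways to treat missing values (ignore them when computing the statistics, leave them as NaN in the result), and it does not consult NA_action at all:

```python
import numpy as np

def nan_scale(x, ddof=0):
    """Hypothetical helper (not part of patsy): standardize x, ignoring NaNs.

    The mean and standard deviation come from the non-missing values only;
    missing entries stay NaN in the result.
    """
    x = np.asarray(x, dtype=float)
    return (x - np.nanmean(x)) / np.nanstd(x, ddof=ddof)

# The differenced X column from the original post, leading NaN included.
d = np.array([np.nan, 3, -4, 3, -2, 3, -4, 2])
scaled = nan_scale(d)
print(scaled)
```

If a function like this is defined in the scope where the formula is evaluated, something like `'y ~ nan_scale(X.diff())'` should leave only the first row missing, so the row-dropping would remove one row instead of all of them. Unlike the built-in scale, a plain function like this does not remember the mean and standard deviation it computed, so it would recompute them on any new data it is applied to.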

Author

rsgmon commented Apr 7, 2016

Thanks, Nathaniel, for the edit tip and the explanation of the underlying issue. I can find a workaround for it now that you've explained it.
