Fixes multiple bugs in mstats.theilslopes #3574

Merged
rgommers merged 9 commits into scipy:master from tomflannaghan:theilslopes-fix
May 17, 2014

Conversation

tomflannaghan
Contributor

This fixes multiple bugs in scipy.stats.mstats.theilslopes. I believe that the original code is implementing the method given in Sen (1968) for the slope and confidence intervals (this is the standard method), and I have therefore corrected the implementation to follow Sen (1968) faithfully. I have also added a test that highlights some of the failures of the original code.

An outline of the changes:

  1. The original implementation erroneously always masked the last element of the input data (line 699 in the original code), and so the last element was always omitted from the remaining calculations.
  2. The code retained the masked values in the slopes array, which caused problems when the confidence intervals were calculated by array index. I now use x.compressed() and y.compressed() to remove masked values.
  3. Slopes were being computed for repeated values of x, not in accordance with the Sen (1968) method. This is fixed by replacing the routine for computing slopes with one that only includes values where deltax > 0.
  4. The original calculation of sigsq does not divide the second term by 18 (line 722 in original code). It also includes a yties term for handling repeated y values (line 723 in original code) that does not appear in Sen (1968) equation (2.6).
  5. The confidence intervals are computed by computing indices in the slopes array. In Sen (1968), these indices are 1-indexed, but Python is 0-indexed. Therefore, we must subtract 1 from the values computed by Sen (1968) to index the slopes array correctly (see lines 728-729 in the new code).
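
The points above can be sketched in a few lines of NumPy. This is an illustrative sketch only, not the code merged in this PR: the function name `sen_slope_sketch` is hypothetical, the normal quantile comes from the standard library rather than SciPy, and the tie corrections for both x and y follow the state of the PR after the later comment that restores y-tie handling.

```python
from statistics import NormalDist

import numpy as np


def sen_slope_sketch(y, x=None, alpha=0.95):
    """Hypothetical helper: median pairwise slope and its confidence bounds."""
    y = np.asarray(y, dtype=float).ravel()
    if x is None:
        x = np.arange(len(y), dtype=float)
    else:
        x = np.asarray(x, dtype=float).ravel()
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")

    # Point 3: only pairs with deltax > 0 contribute a slope, so repeated
    # x values never produce one and each pair is counted once.
    deltax = x[:, np.newaxis] - x
    deltay = y[:, np.newaxis] - y
    keep = deltax > 0
    slopes = np.sort(deltay[keep] / deltax[keep])
    medslope = np.median(slopes)

    # Two-sided normal quantile; z is negative here.
    if alpha > 0.5:
        alpha = 1.0 - alpha
    z = NormalDist().inv_cdf(alpha / 2.0)

    # Point 4: the whole variance expression is divided by 18.
    def tie_term(values):
        _, counts = np.unique(values, return_counts=True)
        # Untied values have count 1 and contribute 0.
        return np.sum(counts * (counts - 1) * (2 * counts + 5))

    n = len(y)
    sigma = np.sqrt((n * (n - 1) * (2 * n + 5) - tie_term(x) - tie_term(y)) / 18.0)

    # Point 5: Sen's ranks are 1-based, so shift to 0-based indices and clip.
    nt = len(slopes)
    upper = min(int(np.round((nt - z * sigma) / 2.0)), nt - 1)
    lower = max(int(np.round((nt + z * sigma) / 2.0)) - 1, 0)
    return medslope, slopes[lower], slopes[upper]
```

The sketch assumes at least two distinct x values; with fewer, the slopes array is empty and no estimate exists.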

@coveralls

Coverage Status

Changes Unknown when pulling 8010f14 on tomflannaghan:theilslopes-fix into scipy:master.

@pv pv added the PR label Apr 27, 2014
@rgommers
Member

Hi @tomflannaghan, thanks for fixing this up. Looks good from a quick browse, but I haven't had time to check the original paper yet.

Adding the reference to that paper to the docstring might be useful. And bonus points for a small example of course:)

The implementation was hardly touched after it was introduced in 9f05b93. @pierregm was this code indeed based on the Sen paper, and does this look OK to you?

@rgommers
Member

Also, I think this should be in the stats namespace instead of only in stats.mstats.

@jseabold
Contributor

On Tue, Apr 29, 2014 at 5:30 PM, Ralf Gommers wrote:

> Also, I think this should be in the stats namespace instead of only in
> stats.mstats.

+1

FWIW, there are several implementations of Theil-Sen estimators in R (and
BSD-licensed matlab code) to compare against, though the test cases here
look fairly straightforward.

@tomflannaghan
Contributor Author

Hi @rgommers and @jseabold, thanks for the feedback! I agree that it should be in stats too, so I have now added it. I have changed the mstats implementation to call stats.theilslopes after the masked arrays are handled to avoid repeating the code. I have also added a test for stats.theilslopes, and the reference and an example to the docstrings.
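
As a rough illustration of the relationship described above, the following sketch calls both entry points on the same data. It assumes a SciPy version in which both `scipy.stats.theilslopes` and `scipy.stats.mstats.theilslopes` are available and return four values (slope, intercept, lower and upper slope bounds); the data are made up for the example.

```python
import numpy as np
import numpy.ma as ma
from scipy import stats
from scipy.stats import mstats

# Made-up data: an exact line with one corrupted sample.
x = np.arange(10, dtype=float)
y = 3.0 * x + 2.0
y[4] = 500.0

# Masked-array route: the bad sample is masked and dropped before fitting.
y_masked = ma.masked_array(y, mask=(y > 100))
m_slope, m_intercept, m_lo, m_hi = mstats.theilslopes(y_masked, x)

# Plain-array route: drop the same sample by hand.
good = ~ma.getmaskarray(y_masked)
p_slope, p_intercept, p_lo, p_hi = stats.theilslopes(y[good], x[good])

print(m_slope, p_slope)          # both routes should agree
print(m_intercept, p_intercept)
```

If the masked version simply compresses the input and delegates, as the comment describes, the two routes should give matching results.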

I have also added back the handling of ties in y (removed as part of my point 4 in the original pull request) as I realised that removing it was an error on my part.

@rgommers rgommers added this to the 0.15.0 milestone May 10, 2014
@rgommers
Member

@tomflannaghan I have a branch here with some minor fixes and an updated example: https://github.com/rgommers/scipy/tree/theilslopes-fix. Somehow github doesn't let me send you a PR, but maybe you can cherry-pick those commits?

The example does show an issue - the confidence interval is only for slope and not for offset, which makes the plots look quite weird. Do you think it makes sense to fix this?

Finally, I don't have access to the original paper, so didn't review the algorithm changes in detail. If you know of a public version maybe you can direct me to it?

@coveralls

Coverage Status

Changes Unknown when pulling a38b7d2 on tomflannaghan:theilslopes-fix into scipy:master.

@tomflannaghan
Contributor Author

@rgommers I've merged your commits with my branch. I'm not sure why github didn't allow you to send me a PR but I'm fairly new to it so maybe I messed something up. I'm happy with your changes and just merged them directly into this PR.

The confidence interval for the intercept would probably be useful. However, I am not sure how to implement it properly. The Theil-Sen method only gives the slope, and there are various different methods to compute the intercept, let alone its confidence interval. I looked at an equivalent R package (zyp) and their confidence interval calculation doesn't agree at all well with a simple bootstrap test that I performed on your example. They compute it using the standard deviation of all possible intercepts, which is not robust to outliers and doesn't fit with the distribution-free philosophy of the rest of the method.

So, in summary, I think not including it is probably better than including something potentially misleading. I suspect most people would bootstrap to get the confidence interval for the intercept if they really needed it (which is very computationally expensive).
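
For readers who do want a rough interval for the intercept, a bootstrap along these lines is one option. This is a minimal sketch under stated assumptions, not the test described above: it uses a pairs (case-resampling) percentile bootstrap, the helper name `bootstrap_intercept_ci` and its parameters are hypothetical, and the data are synthetic.

```python
import numpy as np
from scipy import stats


def bootstrap_intercept_ci(x, y, n_boot=500, level=0.95, seed=0):
    """Hypothetical helper: percentile bootstrap interval for the intercept."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    intercepts = np.empty(n_boot)
    for i in range(n_boot):
        # Resample (x, y) pairs with replacement and refit each time.
        idx = rng.integers(0, n, size=n)
        intercepts[i] = stats.theilslopes(y[idx], x[idx])[1]
    tail = (1.0 - level) / 2.0
    lo, hi = np.quantile(intercepts, [tail, 1.0 - tail])
    return lo, hi


# Synthetic example data (an assumption, not taken from the PR).
rng = np.random.default_rng(42)
x = np.linspace(0.0, 10.0, 30)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=x.size)
lo, hi = bootstrap_intercept_ci(x, y)
print(lo, hi)
```

Every resample recomputes all pairwise slopes, so the cost is roughly `n_boot` times that of a single fit, which is the expense mentioned above.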

As for access to the paper, sadly I don't know any way to get it publicly. I can email you a copy though.

@rgommers
Member

If you could send me the paper at ralf.gommers at gmail.com that would be great, thanks.

After reading the stackexchange discussion you link to above, I get the impression that indeed a confidence interval on the intercept doesn't help. Maybe it's best to explain the issue with the intercept value (not well-defined), and that what the example does (which to me looked like the simplest way to use the slope confidence interval) makes little sense.

@tomflannaghan
Contributor Author

I have changed the wording of the example a bit to reflect the fact that the intercept confidence interval is missing. I've also added a note explaining that the intercept is not defined by Sen, and that there are several definitions that are used.
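
To make the "several definitions" point concrete, here is a small sketch of two intercept conventions that are commonly paired with a Theil-Sen slope. The function names are hypothetical, and which convention a given library uses should be checked against its documentation.

```python
import numpy as np


def intercept_from_medians(x, y, slope):
    """Hypothetical helper: median(y) - slope * median(x)."""
    return np.median(y) - slope * np.median(x)


def intercept_from_residuals(x, y, slope):
    """Hypothetical helper: median of y - slope * x."""
    return np.median(np.asarray(y) - slope * np.asarray(x))


# On an exact line both conventions coincide.
x = np.arange(9, dtype=float)
y = 2.0 * x + 1.0
a = intercept_from_medians(x, y, 2.0)
b = intercept_from_residuals(x, y, 2.0)
print(a, b)
```

On noisy or skewed data the two values generally differ, which is why the docstring needs to say which definition is in use.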

rgommers pushed a commit that referenced this pull request May 17, 2014
Fixes multiple bugs in mstats.theilslopes
@rgommers rgommers merged commit 472fde8 into scipy:master May 17, 2014
@rgommers
Member

All looks good to me, merged. Thanks Tom!

@tomflannaghan tomflannaghan deleted the theilslopes-fix branch May 17, 2014 20:40
@tomflannaghan
Contributor Author

Thanks Ralf!


5 participants