
BUG/ENH: Removed non-standard scaling of the covariance matrix and added option to disable scaling completely. #11197

Merged: 1 commit merged into numpy:master on Nov 27, 2018

Conversation

@wummo (Contributor) commented May 30, 2018

Fixes #11196

As discussed in the bug report, polyfit uses a non-standard scaling factor for the covariance matrix; this PR corrects it.

Furthermore, an option is added to disable the scaling of the covariance matrix completely. This is useful when the weights are given by 1/sigma, with sigma the (known) standard errors of (Gaussian-distributed) data points; in that case the unscaled matrix is already a correct estimate of the covariance matrix.
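
To see the change concretely, here is a minimal sketch (hypothetical data, not the PR's actual diff) of how the two scaling factors relate. The base covariance comes from the inverse normal matrix of the least-squares fit; only the multiplicative factor changes:

```python
import numpy as np

# Hypothetical data for a straight-line fit.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 20)
y = 1.0 + 2.0 * x + rng.normal(scale=0.1, size=x.size)

deg = 1
A = np.vander(x, deg + 1)                # design matrix, columns [x, 1]
coef, resids, _, _ = np.linalg.lstsq(A, y, rcond=None)
Vbase = np.linalg.inv(A.T @ A)           # unscaled covariance shape

M, N = len(x), deg + 1                   # data points, fitted parameters
fac_old = resids[0] / (M - N - 2.0)      # old, non-standard factor (the "- 2")
fac_new = resids[0] / (M - N)            # standard chi2 / dof factor
print(Vbase * fac_old)                   # roughly what polyfit used to return
print(Vbase * fac_new)                   # what it returns after this fix
```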

@mattip (Member) commented May 30, 2018

Thanks for the pull request, and welcome!

Needs a mention in doc/release/1.15.0-notes.rst under the Changes section, as well as a mention in the Improvements section for the additional option.

Also tests are missing for the new option. It would be nice to have a "demonstration of desired behavior" type of test that simply demonstrates the power of the new option, as well as a test for any new error modes. For instance, what happens if w is None but absolute_weights is True?

Since we are changing default behavior, a heads-up to the numpy-discussion mailing list with a link to this commit is also necessary.

@wummo (Contributor, Author) commented Jun 1, 2018

Hi Matti,

thanks for the welcome.

I changed the release notes and added some notes in the sections "changes" and "improvements".
Also, I added a check for the case where absolute_weights is set to True but w is not given; a ValueError is then raised. Another possible clash of options is absolute_weights=True with cov=False, but I believe that can be neglected. Finally, I introduced two tests, one to check that the covariance matrix is calculated correctly and another to see that the ValueError is raised. Lastly, I wrote a short note to numpy-discussion about the change in default behavior that comes with this patch.

Best,
Andreas

@mhvk (Contributor) left a comment

Overall, definitely a good idea, but I think the name should reflect what is actually done more closely.

Also, definitely include the test!

@@ -423,6 +424,19 @@ def polyfit(x, y, deg, rcond=None, full=False, w=None, cov=False):
cov : bool, optional
Return the estimate and the covariance matrix of the estimate
If full is True, then cov is not returned.
absolute_weights: bool, optional
Contributor:

I realize it amounts to bike-shedding, but I find this name confusing, since I've never encountered this term. If we stick close to this, I'd very much prefer a relative_weights=True, since I think that more clearly indicates that there is something weird about the weights.

But really what this does is force the reduced chi2 to unity, so maybe that is what the parameter name should reflect? Indeed, in the actual code, the weights are not used at all. Now, force_redchi2_to_unity is a bit long... Maybe rescale_covariance=True? Or just cov_scale or scale_cov?

Reply:

Just to explain the current choice of absolute_weights: it was suggested by @josef-pkt on the mailing list. It is the analogue of scipy.optimize.curve_fit's absolute_sigma parameter, whose name was settled in a lengthy discussion spanning two earlier PRs. At least I somewhat like the analogy to the curve_fit terminology, but I don't know whether that really is a valid argument here.

Contributor:

OK, the comments in the first thread do argue specifically that scale_cov is bad ... I do think a bit of a mistake was made in scipy in not calling it relative_sigma, but on the other hand there is an advantage to newly introduced flags defaulting to False for the "old behaviour".

Let me try another suggestion, though: unlike scipy's curve_fit, right now we already have a flag to ask for the covariance matrix. Could we not broaden its purpose instead to also tell what type we want? If falsy, we do not return it as now, and if truthy, we do return it, but exactly what we return will depend on its value. Specifically, I suggest,

cov : bool or str, optional
         If given and not `False`, return not just the estimate but also its covariance matrix.
         By default, the covariance is scaled by chi2/sqrt(N-dof), i.e., the weights are presumed
         to be unreliable except in a relative sense and everything is scaled such that the reduced
         chi2 is unity. This scaling is omitted if ``cov='unscaled'``, as is relevant for the case that
         the weights are 1/sigma**2, with sigma known to be a reliable estimate of the uncertainty.
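
A short usage sketch of this suggested interface, which is what was eventually merged (the string value requires a NumPy release containing this PR, i.e. 1.16 or later); the data, sigma, and weights here are made up for illustration:

```python
import numpy as np

sigma = 0.5                                # assumed-known uncertainty
rng = np.random.default_rng(42)
x = np.linspace(0.0, 10.0, 50)
y = 3.0 * x - 1.0 + rng.normal(scale=sigma, size=x.size)
w = np.full(x.size, 1.0 / sigma)           # weights w = 1/sigma

# Default: covariance rescaled so that the reduced chi2 becomes unity.
coef, cov_scaled = np.polyfit(x, y, 1, w=w, cov=True)

# 'unscaled': trust the weights as absolute uncertainties.
coef, cov_unscaled = np.polyfit(x, y, 1, w=w, cov='unscaled')
```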

Contributor:

p.s. In the docstring proper, be careful with single back quotes - with those, there should be an actual link target; something like `False` works because it links to the Python API.

Contributor:

What about this suggestion of using cov to indicate whether or not scaling should be done?

Contributor Author:

To be honest, I like the simple relative_weights better. Like @jotasi already mentioned, it behaves like absolute_sigma in curve_fit. Still, if you insist, I can implement your proposal. In that case it would be nice if you could point me to another function with a similar parameter, so I can have a look at what kind of parameter check is performed.

@@ -552,6 +566,8 @@ def polyfit(x, y, deg, rcond=None, full=False, w=None, cov=False):
raise TypeError("expected 1D or 2D array for y")
if x.shape[0] != y.shape[0]:
raise TypeError("expected x and y to have same length")
if absolute_weights and (w is None):
Contributor:

I don't see why this is necessary: the rescaling could be done or omitted (as is arguably meaningful) independent of whether weights are present. I'd remove this.

Reply:

Wouldn't omitting the rescaling without specifying weights effectively mean that all points' standard deviations are considered 1, or did I misunderstand that?

Contributor:

Yes, for normal distributions that would be the case. But mostly I see no reason to force a user to pass on w=np.ones(y.shape[0]) when the flag is set. The default is not really no weights, but weight equals 1.

Contributor Author:

In the case of relative weights (scaling), having all weights set to one means that the errors on all data points are of equal magnitude. In the new case (absolute weights, no scaling), I'm not sure that is a sensible default. Why would a data point have the error sigma==1? I guess that would just be by coincidence, or in special cases where you draw from a distribution with known width (like your unit-test example)?

Contributor:

@wummo - it indeed implies sigma=1 - which I agree is not necessarily all that meaningful, but I don't see a reason to specifically forbid someone from entering it - it just makes the code longer and more complex for no benefit.

Contributor Author:

OK, here is my last argument: Isn't sigma==1 a detail of the implementation that might (maybe) change? Then, giving no weights is something like "undefined behavior".

Contributor:

But it isn't ;-) After all, this is a possibly weighted least-squares fit, not a chi2 one, and the meaning is clear without the weights (the meaning of the covariance admittedly less so, but I don't think one has to hand-hold people that much).

Contributor Author:

In my opinion, the test is useful, but having very little experience with numpy development guidelines, I followed your hints and removed the check and the corresponding unit test.
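
As a sanity check of the point argued above (that omitting w simply means unit weights, i.e. sigma = 1), here is a small sketch against the merged interface; the data are made up:

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([0.1, 1.9, 4.2, 5.8, 8.1])

# No weights given: polyfit behaves exactly as with unit weights,
# and no ValueError is raised for the unscaled covariance.
c1, v1 = np.polyfit(x, y, 1, cov='unscaled')
c2, v2 = np.polyfit(x, y, 1, w=np.ones(x.size), cov='unscaled')
assert np.allclose(c1, c2) and np.allclose(v1, v2)
```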

else:
if len(x) <= order:
raise ValueError("the number of data points must exceed order "
"for estimate the covariance matrix")
Contributor:

This error message is not correct any more; it only needs to hold if rescaling is done. Maybe just replace "for estimate" (weird grammar anyway) with "to scale".

Contributor:

Still to be done: "for estimate" -> "to scale" (or "in order to be able to scale").

[0, 1, 3], [0, 1, 3], deg=0, cov=True)
[1], [1], deg=0, cov=True)

# Check exception when option absolute_weights is True, but no weights
Contributor:

This would need to be removed again...

"for Bayesian estimate the covariance matrix")
fac = resids / (len(x) - order - 2.0)
if absolute_weights:
fac = 1.
Member:

Best to just use 1 here - it cooperates better with Decimal, if that ever becomes supported

Contributor Author:

OK, I changed this.

"for estimate the covariance matrix")
# note, this used to be: fac = resids / (len(x) - order - 2.0)
# it was deciced that the "- 2" (originally justified by "Bayesian
# uncertainty analysis") is not was the user expects
Contributor:

Add "(see gh- and gh-11197)"

Contributor:

For some reason my comment was weird here: should have been "gh-11196 and gh-11197"

@mhvk (Contributor) commented Jun 6, 2018

Looks good, except for the two remaining issues:

  1. Do we want to forbid getting an unscaled covariance without weights? I feel this is unnecessary hand-holding.
  2. Do we want to fold absolute_weights into the cov boolean by allowing a string value? Personally, I find routines whose outputs depend on flags super-annoying to start with, and adding even more flags seems silly, so I'd prefer a single flag that tells what type of covariance to return (where False is "don't bother, I don't need them").

@mhvk (Contributor) commented Jun 6, 2018

@wummo - this function is a bit of a mess already... And IIRC it is in fact recommended to use np.polynomial.polynomial.polyfit, since it presents coefficients in a more logical order, but it does not even have cov. Adding another argument makes it deviate even more, but I'm certainly not stuck on sticking to one argument. Let me ping @charris, since he has much more experience with the polynomial stuff.

@josef-pkt:

One argument in favor of rolling it into a three-valued cov is that users only have to change one flag to get a fixed scale.

(I find it annoying in some cases in statsmodels, where we have one flag to switch away from the default and another flag to choose an option for the alternative. It's easy to forget to switch the first keyword, and I often have to correct my initial code to fix it.)

@wummo (Contributor, Author) commented Jun 8, 2018

@mhvk did you have a chance to contact @charris?

@wummo force-pushed the correct_covariance_scaling branch from 7e86c15 to 977a38a on June 18, 2018
@wummo force-pushed the correct_covariance_scaling branch from 977a38a to 3250479 on August 10, 2018
@wummo (Contributor, Author) commented Aug 10, 2018

I rebased the code to 1.16 to fix the merge conflicts. Is there anything I can do to move this pull request forward?

@wummo force-pushed the correct_covariance_scaling branch from 3250479 to 39efaf4 on August 23, 2018
@wummo (Contributor, Author) commented Sep 3, 2018

In light of PR #11733, is there still interest in fixing this issue here, or will this function eventually be dropped anyway, @mhvk @charris?

@jsdodge (Contributor) commented Oct 3, 2018

I just became aware of this issue in numpy.polyfit and would strongly recommend including this fix, at least until numpy.polynomial.polynomial.polyfit provides an option to return the covariance matrix. Otherwise, users will be stuck with a default that provides no easy way to determine parameter uncertainties.

Also, since @josef-pkt expressed confusion in the mailing-list discussion about why a user would want this feature, it's worth noting that it is common practice in physics (my field) to determine the measurement sample variance independently and then treat it as known when fitting a model to data from the same apparatus. Introductory textbooks typically focus on this case.

@mattip (Member) commented Nov 15, 2018

Maybe this would be more palatable as two PRs - one to remove the non-standard -2 and another to add the kwarg. Or is it ready to go in as-is? The -2 has been a long-standing issue (5 years).

@mhvk (Contributor) commented Nov 15, 2018

Sorry that this has slipped so far. I'd still like the opinion of @charris, because it would be good to move this to the polynomial classes. Absent that, I'm happy to merge the -2 removal, but would prefer the option not to add a new argument, but rather to use a string for the type of covariance one wants.

@wummo (Contributor, Author) commented Nov 16, 2018

@mhvk Do you think it makes sense to wait any longer, given that we have already waited half a year? If you decide that a string for the type of covariance is the better interface, I will implement it and the PR can be merged.

@mhvk (Contributor) commented Nov 16, 2018

@wummo - fair enough. Yes, please do the string interface and we'll merge this.

@mhvk (Contributor) left a comment

@wummo - thanks for making the changes. Only some small left-overs...

@@ -239,6 +239,15 @@ single elementary function for four related but different signatures,
The ``out`` argument to these functions is now always tested for memory overlap
to avoid corrupted results when memory overlap occurs.

New option ``absolute_weights`` in ``np.polyfit``
-------------------------------------------------
Like ``absolute_sigma`` in ``scipy.optimize.curve_fit`` a boolean option
Contributor:

Need to change the release notes as well...

weights are given by 1/sigma with sigma being the (known) standard errors of
(Gaussian distributed) data points, in which case the unscaled matrix is already
a correct estimate for the covariance matrix. In case ``absolute_weights`` is set
to true, but no weights are given, a ``ValueError`` is thrown.
Detailed docstrings for scalar numeric types
Contributor:

Note that the rebase has removed the empty line that should be here.

covariance matrix. Namely, rather than using the standard chisq/(M-N), it
scales it with chisq/(M-N-2) where M is the number of data points and N is the
number of parameters. This scaling is inconsistent with other fitting programs
such as e.g. ``scipy.optimize.curve_fit`` and was changed to chisq/(M-N).
Contributor:

And another empty line to be added back in.

except in a relative sense and everything is scaled such that the
reduced chi2 is unity. This scaling is omitted if ``cov='unscaled'``,
as is relevant for the case that the weights are 1/sigma**2, with
sigma known to be a reliable estimate of the uncertainty.
Contributor:

Very clear, thanks!
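
To make the docstring text above concrete, here is a small numerical check (made-up data) that the default covariance is just the unscaled one multiplied by the reduced chi2 - which is exactly what "scaled such that the reduced chi2 is unity" means:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0.0, 5.0, 30)
w = np.full(x.size, 2.0)                     # weights 1/sigma with sigma = 0.5
y = 0.5 * x + 1.0 + rng.normal(scale=0.5, size=x.size)

coef, cov_default = np.polyfit(x, y, 1, w=w, cov=True)
_, cov_unscaled = np.polyfit(x, y, 1, w=w, cov='unscaled')

resid = w * (y - np.polyval(coef, x))        # weighted residuals
chi2_red = np.sum(resid**2) / (len(x) - 2)   # dof = M - N, here N = deg + 1 = 2
assert np.allclose(cov_default, cov_unscaled * chi2_red)
```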

else:
if len(x) <= order:
raise ValueError("the number of data points must exceed order "
"for estimate the covariance matrix")
Contributor:

Still to be done: "for estimate" -> "to scale" (or "in order to be able to scale").

"for estimate the covariance matrix")
# note, this used to be: fac = resids / (len(x) - order - 2.0)
# it was deciced that the "- 2" (originally justified by "Bayesian
# uncertainty analysis") is not was the user expects
Contributor:

For some reason my comment was weird here: should have been "gh-11196 and gh-11197"

@mhvk (Contributor) commented Nov 19, 2018

p.s. While making the last changes, could you also rebase & squash the commits? Thanks again, and apologies that this has all taken so long.

@charris (Member) commented Nov 19, 2018

I've definitely considered adding a covariance computation to the polynomial package fitting functions; I've done it for myself in practice. I agree that the -2 is a bit too clever; in a case like this it is best to follow the conventions folks are used to. Note that for large data sets it won't make much difference, and for small data sets the estimated covariance will have a large error regardless, unless the variance of the measurement errors is known by other means.

For std and var, the default of the "ddof" parameter is 0, which is also non-standard but unlikely to change. There are reasons for that convention, as @rkern has explained, but it is surprising to many.
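
For reference, the std/var convention charris mentions:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0, 4.0])
print(np.var(a))           # 1.25     -> divides by N (default ddof=0)
print(np.var(a, ddof=1))   # 1.666... -> divides by N - 1 (sample variance)
```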

@wummo (Contributor, Author) commented Nov 20, 2018

@mhvk I fixed the problems with the documentation. If everything looks OK, I will do the rebase & squash.

-----------------------------------------------------------

A further possible value has been added to the ``cov`` parameter of the
``np.polyfit`` function. With ``cov=unscaled`` the scaling of the covariance
Contributor:

One last small thing: missing quotes around unscaled, i.e., cov='unscaled'

@mhvk (Contributor) commented Nov 20, 2018

Looks good modulo the missing quotes. Please go ahead and rebase/squash as well, and I'll merge.

@wummo force-pushed the correct_covariance_scaling branch from 632802a to 1837df7 on November 21, 2018
@wummo (Contributor, Author) commented Nov 27, 2018

@mhvk I did the rebase and squashing and just wanted to ask whether there is anything more that needs to be done.

@mhvk (Contributor) commented Nov 27, 2018

@wummo - I hadn't seen that the branch was pushed - now all is OK so I'll merge. Thanks for the contribution and more thanks for your patience!

@mhvk merged commit 1e2cb50 into numpy:master on Nov 27, 2018
@wummo (Contributor, Author) commented Nov 27, 2018

It's great that it got merged. Thanks a lot for your help.
