MAINT: add test against generic fit method for vonmises distribution #18128


Merged
merged 2 commits into scipy:main on Mar 18, 2023

Conversation

dschmitz89
Contributor

Reference issue

Follow-up of #18013

What does this implement/fix?

In #18013, an additional test was requested to compare the overwritten fit method of the vonmises distribution with the generic one. This PR adds that test.

Additional information

With this change, the vonmises tests take 5.5 seconds on my machine. Marking the new tests as slow once the PR is accepted would be a good way to save a little CI time.

I did not fully understand the _assert_less_or_close_loglike machinery, which is why I ended up implementing the log-likelihood function as the criterion. The approach used for the other distributions did not work, probably because _fitstart does not exist for vonmises.

@dschmitz89 dschmitz89 requested a review from mdhaber March 10, 2023 21:08
@j-bowhay j-bowhay added the scipy.stats and maintenance labels Mar 10, 2023
Contributor

@mdhaber mdhaber left a comment

You were using _assert_less_or_close_loglike correctly. It's fine not to use _reduce_func like some of the other tests - I believe that's just the log-likelihood function with a penalty for out-of-bounds values. TBH, existing tests that use _reduce_func should probably be changed - out-of-bounds values should result in NaN. I will do that separately.

Instead, ideally this would use nnlf (which is an erroneous acronym for "negative log-likelihood function", NLLF). But I see why it can't just do that: scale is broken for the vonmises distribution, and it turns out that making the scale tiny results in very good NLLF, giving the default implementation a huge advantage in the comparison. This wouldn't be fair, so your negative_loglikehood function ignores scale.

The best thing to do, though, would be to pass fscale=1 to level the playing field. When I do that, this test fails: the default implementation is a little better than the override. I've made the test xslow and skipped CI, but can you check this out locally? The difference in log-likelihood is small, but all the other fit overrides pass this test, so I think this should, too. It may just require tightening a root-solving tolerance?
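
For context, a minimal sketch of this comparison under the fscale=1 suggestion (not the actual test; calling rv_continuous.fit on the vonmises instance is one way to bypass the override):

import numpy as np
from scipy import stats

rng = np.random.default_rng(123)
data = stats.vonmises.rvs(1.5, size=1000, random_state=rng)

# Fit with the override and with the generic method, fixing scale=1
# so that both solve the same two-parameter problem.
params_override = stats.vonmises.fit(data, fscale=1)
params_generic = stats.rv_continuous.fit(stats.vonmises, data, fscale=1)

# nnlf is the negative log-likelihood; smaller is better
print(stats.vonmises.nnlf(params_override, data))
print(stats.vonmises.nnlf(params_generic, data))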

@dschmitz89
Contributor Author


Thanks for helping out with the likelihood function. I tried setting xtol and rtol as low as possible, but the results stayed the same. I think the reason the generic optimizer finds a better solution according to the log-likelihood is that it simultaneously optimizes both $\kappa$ and $\mu$, whereas the custom fitting method estimates them separately. At the same time, this separate estimation directly originates from the likelihood equations 😕.

What I did now was add a relative tolerance parameter to the _assert_less_or_close_loglike function. I am open to suggestions, though. In my view, cases like this one can happen for other distributions as well.

@mdhaber
Contributor

mdhaber commented Mar 11, 2023

In my view, cases like this one can happen for other distributions as well.

That's something that can be tested. If you find such a case, it would probably be considered a shortcoming, too.

At the same time, this separate estimation directly originates from the likelihood equations

For many distributions, the MLE of one parameter is strictly independent of the value of another. For instance, the MLE of the location parameter of the normal distribution is the sample mean, regardless of scale. The two parameters do not need to be optimized simultaneously to get machine precision MLEs.
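
A quick numerical illustration of that claim for the normal distribution (illustrative values, not from the discussion):

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(3.0, 2.0, size=500)
# With scale fixed at several different values, the fitted loc is
# always the sample mean.
for fscale in (0.5, 2.0, 10.0):
    loc, scale = stats.norm.fit(x, fscale=fscale)
    assert np.isclose(loc, x.mean())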

Is this not the case for von Mises? (Or do you mean the equations say it is, but that's not what you're observing in practice?)

One thing to test is what happens when the data is definitely not sampled from the von Mises distribution (e.g. bimodal data). Do the results disagree even more?

@mdhaber
Contributor

mdhaber commented Mar 17, 2023

I've done quite a few fit overrides. I run a standard stress test in which I define a space of possible distribution parameters (e.g. $\mu \in [-10^3, 10^3]$, $\sigma \in [10^{-3}, 10^3]$) and draw ~1000 sets of parameters (QMC log-uniformly distributed) from that space. For each set of parameters, I generate a (QMC log-uniformly distributed) random number of points from the distribution and fit with a) the overridden fit method and b) the generic fit method. I'm interested in 1) the number of failed fits, 2) the relative time, and 3) the relative log-likelihood change. Ideally, I do this for all combinations of fixed and free parameters; when parameters are fixed, I set them to the ideal value, a perturbed ideal value, or a random value (depending on how stressful I want the test to be). It's also good to try drawing random samples from distributions other than the one being fit, but I usually omit that because it can be unreasonably demanding.
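
A rough sketch of that stress test (the names, ranges, and sample counts here are illustrative, not the actual script):

import numpy as np
from scipy import stats
from scipy.stats import qmc

rng = np.random.default_rng(42)
# QMC samples over the parameter space: log-uniform kappa, uniform loc
u = qmc.Sobol(d=2, seed=rng).random(64)
kappas = 10**(6*u[:, 0] - 3)      # kappa in [1e-3, 1e3]
locs = 2*np.pi*u[:, 1] - np.pi    # loc in [-pi, pi]

for kappa, loc in zip(kappas, locs):
    n = int(10**rng.uniform(1, 3))  # random sample size in [10, 1000]
    data = stats.vonmises.rvs(kappa, loc=loc, size=n, random_state=rng)
    p_override = stats.vonmises.fit(data, fscale=1)
    p_generic = stats.rv_continuous.fit(stats.vonmises, data, fscale=1)
    # record failures, timings, and the relative change in
    # stats.vonmises.nnlf between the two results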


For 1000 sets of parameters, fitting with all parameters treated as free

  1. There are no failed fits (e.g. MLE is NaN because data lies outside the support of the fitted distribution) for either your override or the generic fit method. Good.

  2. Your fit override is much faster than the generic fit method.
    [plot: relative fit times, override vs. generic]

  3. The relative change in negative log-likelihood of your override is almost always good (closer to negative infinity). In a few cases, the change was big (e.g. 1% better). Any bad changes are very small, on the order of machine precision. That is always OK; it usually just means that both your override and the superclass method did about as well as possible, but there is a tiny bit of numerical noise.

[plot: relative change in NLLF, all parameters free]


For 1000 sets of parameters, fitting with the location fixed

  1. There are no failed fits for either your override or the generic fit method. Good.

  2. Your fit override is still much faster than the generic fit method.
    [plot: relative fit times, location fixed]

  3. The relative change in log-likelihood of your override is usually worse than the generic fit method's. This probably indicates a problem in the math. It's probably not valid to fit the shape using the same method regardless of the value of the location. If you didn't derive the equations yourself, this probably means that your reference assumed that both parameters would be fit, so they took advantage of a simplification that occurs when the location is at its optimal value. You would need to do the math yourself to get that term back.

[plot: relative change in NLLF, location fixed]


To further test this hypothesis (a problem in the math), I ran the following script:

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

rng = np.random.default_rng(1638083107694713882823079058616272161)

n = 100
kappa, loc, scale = 1, 0, 1
dist_true = stats.vonmises(kappa, loc, scale)
rvs = dist_true.rvs(size=n, random_state=rng)

p_mle = stats.vonmises.fit(rvs)
loc_mle = p_mle[1]

n_flocs = 50
offsets = np.linspace(-1, 1, n_flocs)
nllf_overrides = np.zeros(n_flocs)
nllf_supers = np.zeros(n_flocs)
for i, offset in enumerate(offsets):
    print(i)

    p_override = stats.vonmises.fit(rvs, floc=loc_mle + offset)
    nllf_overrides[i] = stats.vonmises.nnlf(p_override, rvs)

    # superfit=True forces the generic superclass fit method
    p_super = stats.vonmises.fit(rvs, floc=loc_mle + offset, fscale=1, superfit=True)
    nllf_supers[i] = stats.vonmises.nnlf(p_super, rvs)

diff = (nllf_overrides - nllf_supers)/nllf_supers

plt.plot(offsets, diff)
plt.xlabel('Offset of fixed location from MLE location')
plt.ylabel('Relative change in NLLF (positive is bad)')

The further the offset of floc from the MLE of the location, the greater the regression in the NLLF.
[plot: relative change in NLLF vs. offset of fixed location from MLE location]

In my view, cases like this one can happen for other distributions as well.

I've studied this sort of thing. A summary of my findings for most distributions with overrides is at #11782 (comment). Some seem to be robust to everything I throw at them (e.g. the normal distribution, duh), in which case I've marked them "good". For the most part, I haven't merged a PR if I have observed any sign of regression, although there were some fit overrides that predate me. I don't recall merging a fit override if it fails systematically. If I've noticed a problem for certain parameters held fixed, I've suggested just falling back to the generic fit rather than returning an incorrect result.

In this PR, I'd suggest just falling back to the generic fit method if either the location or shape is fixed. (If you are thinking of making any follow-up PRs, I'd recommend lining up a maintainer to review them first. fit PRs are often very complex and tend to stall; e.g. the truncpareto fit PR.)

@mdhaber
Contributor

mdhaber commented Mar 17, 2023

Actually, the solution might be simple. My script tells me that the likelihood equation for the shape is:

$$\frac{I_1(\kappa)}{I_0(\kappa)} = \frac{1}{n}\sum_{i=1}^{n} \cos(x_i - \mu)$$

where $\mu$ is the value of the location parameter (MLE or not).

Modifying the term r in find_kappa seems to solve the problem:
It was r = 1 - stats.circvar(data). Now, based on the likelihood equation:

r = np.sum(np.cos(floc - data))/len(data)

And the plot looks much better (note the scale):
[plots: relative change in NLLF vs. offset after the fix, at a much smaller scale]

Note: after this fix, occasionally your root_scalar bracket doesn't work (the function values at the endpoints do not have opposite signs... often this happens when one or both are numerically zero), so further study is needed.
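
For context, the bracketed solve in question might look roughly like the sketch below (the real find_kappa in the PR may differ in its bracket selection and edge-case handling):

import numpy as np
from scipy import special as sc
from scipy.optimize import root_scalar

def solve_kappa(data, loc):
    # likelihood equation: I1(kappa)/I0(kappa) = mean(cos(x_i - loc));
    # the exponentially scaled i1e/i0e keep the ratio finite for
    # large kappa
    r = np.sum(np.cos(loc - data))/len(data)
    f = lambda kappa: sc.i1e(kappa)/sc.i0e(kappa) - r
    # root_scalar raises ValueError when f has the same sign at both
    # endpoints of the bracket
    return root_scalar(f, bracket=(1e-8, 1e8)).root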


The likelihood equation for location is:
$$\sum_{i=1}^{n} \sin(x_i - \mu) = 0$$

This suggests that there is a solution of this equation independent of $\kappa$, so the maximum likelihood estimate for the location you have now should work even if the user fixes $\kappa$ at an arbitrary value.

This is confirmed by experiment.
[plot: experimental confirmation]
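
For reference, the location equation has a closed-form solution, the circular mean of the sample; a minimal sketch (solve_mu is an illustrative name, not necessarily the PR's find_mu):

import numpy as np

def solve_mu(data):
    # sum(sin(x_i - mu)) = 0  =>  tan(mu) = sum(sin x_i)/sum(cos x_i);
    # arctan2 selects the root that maximizes the likelihood
    return np.arctan2(np.sum(np.sin(data)), np.sum(np.cos(data)))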

@dschmitz89
Contributor Author

Thanks for this thorough analysis! It's getting late at my place, but from what I understand, I did not handle the fixed-floc case correctly. A fix is available, though, by adjusting the formula that goes into the root finder. Should I submit that as a separate PR before we look at this one again?

@mdhaber
Contributor

mdhaber commented Mar 17, 2023

I'll go ahead and submit the fix to this branch if that sounds good.

@dschmitz89
Contributor Author

Sounds good to me, and thanks again for your help.

From what I saw, the case of fixed $\mu$ was probably not treated anywhere in the literature. I wonder how much statistics knowledge is buried only in code and never gets published. In scipy alone, so many things have probably been derived for the first time...

@mdhaber
Contributor

mdhaber commented Mar 17, 2023

Yes, so much so that I thought about writing a book compiling distribution MLE results : )

# location likelihood equation has a solution independent of kappa
loc = floc if floc is not None else find_mu(data)
# shape likelihood equation depends on location
shape = fshape if fshape is not None else find_kappa(data, loc)
Contributor

This also allows the user to fix the shape.

Note that it is preferable to call the final values of the location and shape loc and shape rather than floc and fshape, since they were not actually fixed.

Contributor

@mdhaber mdhaber left a comment

Something I'd suggest studying, @dschmitz89, is whether there is a relatively simple but more robust way of solving the kappa equation. For instance:

from scipy import stats
rvs = [-0.92923506, -0.32498224,  0.13054989, -0.97252014,  2.79658071,
       -0.89110948,  1.22520295,  1.44398065,  2.49163859,  1.50315096,
        3.05437696, -2.73126329, -3.06272048,  1.64647173,  1.94509247,
       -1.14328023,  0.8499056 ,  2.36714682, -1.6823179 , -0.88359996,
        1.3990757 ,  0.50580527,  0.39606279,  0.59775193, -0.96931391,
       -2.94886662,  2.51401895,  1.85529066, -2.16612654,  0.69811142,
        1.55245103,  1.93724126,  1.04246362,  2.14759188, -2.59632106,
        1.44772926,  0.13118682, -0.70731962, -3.12222871,  1.1710668 ,
       -2.38427605,  1.177051  , -2.17754098, -0.15177223,  1.23153502,
       -2.17272573,  2.36591449, -2.70966423,  2.78704888,  0.04717811,
       -2.06487457,  1.99469452,  0.39622628, -2.84217529, -2.98719481,
        3.12200669, -1.51914078,  2.58646588, -1.64947704, -2.94559419,
       -3.00862607, -1.19894746,  1.46864545, -2.98512437,  2.02929986,
       -1.44114382,  2.02391158,  2.05532412,  2.97241311,  2.60627323,
        1.97437403,  2.8264543 , -0.86461338,  0.2307659 ,  0.71714181,
        2.93683305,  0.05672313,  2.6922025 ,  2.04911119, -2.10693874,
        1.3875065 ,  2.97284149, -0.08860371, -2.91405584, -2.75601588,
       -2.46343408,  0.29537451,  0.57600184, -1.40230045,  1.73518165,
       -3.09919971,  2.55072157,  3.04286114, -2.32435821]

fixed = {'floc': 0.005191972335031776}  # ValueError: f(a) and f(b) must have different signs
stats.vonmises.fit(rvs, **fixed)

Take a look at how some other fit overrides find their starting brackets, for instance, or consider whether a gradient-based solver would work.

Typically when the fit override is prone to failure like this, I would fall back on the generic fit method. If we can't fix this before release, I think we'll want to do that. But this is much better than in main, so for now I'll go ahead and merge if tests pass.

@mdhaber
Contributor

mdhaber commented Mar 18, 2023

I went ahead and force-pushed a rebase with each of our commits squashed, since there was a pretty clean separation. In case you need a copy of the original history, it's here.

I'll merge when a few CI jobs finish so I know the rebase went ok.

@mdhaber mdhaber merged commit 5aca4de into scipy:main Mar 18, 2023
@dschmitz89
Contributor Author


For these data, the $\kappa$ equation does not have a root with fixed loc:

[plot: $I_1(\kappa)/I_0(\kappa) - r$ vs. $\kappa$ for fixed and free loc; no sign change with fixed loc]

import numpy as np
from scipy import stats
from scipy import special as sc
import matplotlib.pyplot as plt

rvs = [-0.92923506, -0.32498224,  0.13054989, -0.97252014,  2.79658071,
       -0.89110948,  1.22520295,  1.44398065,  2.49163859,  1.50315096,
        3.05437696, -2.73126329, -3.06272048,  1.64647173,  1.94509247,
       -1.14328023,  0.8499056 ,  2.36714682, -1.6823179 , -0.88359996,
        1.3990757 ,  0.50580527,  0.39606279,  0.59775193, -0.96931391,
       -2.94886662,  2.51401895,  1.85529066, -2.16612654,  0.69811142,
        1.55245103,  1.93724126,  1.04246362,  2.14759188, -2.59632106,
        1.44772926,  0.13118682, -0.70731962, -3.12222871,  1.1710668 ,
       -2.38427605,  1.177051  , -2.17754098, -0.15177223,  1.23153502,
       -2.17272573,  2.36591449, -2.70966423,  2.78704888,  0.04717811,
       -2.06487457,  1.99469452,  0.39622628, -2.84217529, -2.98719481,
        3.12200669, -1.51914078,  2.58646588, -1.64947704, -2.94559419,
       -3.00862607, -1.19894746,  1.46864545, -2.98512437,  2.02929986,
       -1.44114382,  2.02391158,  2.05532412,  2.97241311,  2.60627323,
        1.97437403,  2.8264543 , -0.86461338,  0.2307659 ,  0.71714181,
        2.93683305,  0.05672313,  2.6922025 ,  2.04911119, -2.10693874,
        1.3875065 ,  2.97284149, -0.08860371, -2.91405584, -2.75601588,
       -2.46343408,  0.29537451,  0.57600184, -1.40230045,  1.73518165,
       -3.09919971,  2.55072157,  3.04286114, -2.32435821]

data = np.asarray(rvs)

fixed = 0.005191972335031776
r_fixed = np.sum(np.cos(fixed - data))/len(data)
r_free = 1 - stats.circvar(data, low=-np.pi, high=np.pi)

kappas = np.logspace(-20, 20, 1000)
plt.semilogx(kappas, sc.i1e(kappas)/sc.i0e(kappas) - r_fixed, label="fixed loc")
plt.semilogx(kappas, sc.i1e(kappas)/sc.i0e(kappas) - r_free, label="free loc")
plt.legend()
plt.show()

Defaulting to the generic fit for fixed loc is probably a good idea then.

@mdhaber
Contributor

mdhaber commented Mar 18, 2023

Oh, that makes it easier.
If there is no solution to the likelihood equation, then the constraints on the parameter get involved. In that case, I think the MLE will be $\kappa \rightarrow 0$ or $\kappa \rightarrow \infty$, and the sign of the partial derivative of the log-likelihood w.r.t. $\kappa$ will tell you which. Instead of falling back to the generic optimization, I'd suggest catching the error and figuring out whether to return the largest or smallest possible number. Maybe emit a warning so the user doesn't think it's a bug that they get an outrageous value.
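
A minimal sketch of that suggestion, building on the solve_kappa sketch above (hypothetical, not the merged code):

import numpy as np
from scipy import special as sc
from scipy.optimize import root_scalar

def solve_kappa_with_fallback(data, loc, tiny=1e-8, huge=1e8):
    r = np.sum(np.cos(loc - data))/len(data)
    f = lambda kappa: sc.i1e(kappa)/sc.i0e(kappa) - r
    try:
        return root_scalar(f, bracket=(tiny, huge)).root
    except ValueError:
        # No sign change: d(log-likelihood)/d(kappa) = -n*f(kappa) has
        # constant sign, so the MLE sits at a boundary. f > 0 everywhere
        # means the likelihood decreases in kappa (return the smallest
        # kappa); f < 0 everywhere means it increases (return the
        # largest). A warning here would tell the user the extreme
        # value is intentional, not a bug.
        return tiny if f(tiny) > 0 else huge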
