
BUG: stats: handle infinite df in t distribution #14781


Merged: 11 commits merged into scipy:master from tirthasheshpatel:fix-t-limit on Dec 31, 2021

Conversation

tirthasheshpatel (Member)

Reference issue

fixes #14777

What does this implement/fix?

The `t` distribution returned garbage values when `df` was infinite. It is known
that the `t` distribution converges to the normal distribution as `df` approaches
infinity, so it is better to handle infinite `df` by branching to the normal
distribution than to return wrong values.

Appropriate tests have been added for the fix.

Additional information

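A quick way to see the limiting behavior from the public API (a minimal illustration, not part of the patch):

import numpy as np
from scipy import stats

# The t PDF approaches the standard normal PDF as df grows.
for df in (1, 10, 100, 1e6):
    print(df, stats.t.pdf(1.5, df))
print("norm:", stats.norm.pdf(1.5))  # ~0.129518, the df -> inf limit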
@tirthasheshpatel tirthasheshpatel added the defect A clear bug or issue that prevents SciPy from being installed or used as expected label Sep 28, 2021
@rgommers rgommers added this to the 1.8.0 milestone Nov 11, 2021
@rgommers (Member) left a comment

Thanks @tirthasheshpatel. The tests look good, but I'm not sure this is the right fix. The code becomes much harder to understand, and the underlying special functions are still wrong:

In [10]: special.stdtr(1, 1.5)
Out[10]: 0.8128329581890013

In [11]: special.stdtr(10, 1.5)
Out[11]: 0.9177463367772799

In [12]: special.stdtr(100, 1.5)
Out[12]: 0.9316174709376556

In [13]: special.stdtr(100000, 1.5)
Out[13]: 0.9331912202385847

In [14]: special.stdtr(np.inf, 1.5)
Out[14]: 0.5

Did you consider fixing those special functions instead? That should also be much more performant I'd think.
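(For context: the sequence above should converge to the standard normal CDF.)

>>> from scipy.stats import norm
>>> norm.cdf(1.5)
0.9331927987311419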

@tirthasheshpatel (Member, Author) commented Dec 1, 2021

> The tests look good, but I'm not sure this is the right fix. The code becomes much harder to understand, and the underlying special functions are still wrong

I agree; I'm also not very happy with this. At the time, I thought branching made sense, but your suggestion of fixing the underlying special functions makes more sense. I will look into it. gh-14782 also seems like a good approach.

I won't be able to come back to this until mid-December as I have my final exams. Can we bump the milestone? I don't want to stall the improvement, so if any other contributor is interested, feel free to beat me to it.

@rgommers rgommers modified the milestones: 1.8.0, 1.9.0 Dec 1, 2021
@rgommers (Member) commented Dec 1, 2021

Of course. Good luck with your final exams Tirth!

@tirthasheshpatel (Member, Author) commented:

I will submit two separate PRs: one that fixes the underlying special functions, and another that fixes some methods of the t distribution (like stats and moment) to work with df = inf.

@tirthasheshpatel (Member, Author) commented:

@mdhaber @rgommers Now that #15253 is merged, I have reverted the changes in cdf, sf, ppf, and isf. pdf, logpdf, stats, and entropy will still need to be fixed. Currently, I just use masking. Let me know if the changes look OK.

r = df[imask]*1.0
res[imask] = (sc.gammaln((r+1)/2) - sc.gammaln(r/2)
              - (0.5*np.log(r*np.pi)
                 + (r+1)/2*np.log(1 + (x[imask]**2)/r)))
@mdhaber (Contributor) commented Dec 28, 2021

This PR includes manual masking, np.where, and _lazywhere. Is there a reason for switching between these strategies, or should it be consistent (especially with the rest of the distributions, which tend to prefer the lazy functions) as a matter of style?

@tirthasheshpatel (Member, Author) replied:

> This PR includes manual masking, np.where, and _lazywhere. Is there a reason for switching between these strategies, or should it be consistent (especially with the rest of the distributions, which tend to prefer the lazy functions) as a matter of style?

Made it consistent. I have refactored to only use _lazywhere. Thanks for pointing it out!

        np.exp(sc.gammaln((df+1)/2) - sc.gammaln(df/2))
        / (np.sqrt(df*np.pi)*(1 + (x**2)/df)**((df+1)/2))
    )
)
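For context, the fragment above is the finite-df branch; a sketch of how it might sit inside _lazywhere (assumed surrounding code, reconstructed here rather than quoted from the diff):

def _pdf(self, x, df):
    return _lazywhere(
        df == np.inf, (x, df),
        f=lambda x, df: norm._pdf(x),  # normal limit when df is infinite
        f2=lambda x, df: (
            np.exp(sc.gammaln((df+1)/2) - sc.gammaln(df/2))
            / (np.sqrt(df*np.pi)*(1 + (x**2)/df)**((df+1)/2))
        )
    )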

@mdhaber (Contributor) replied:

Much better, thanks!
Do you think there is a numerical advantage to this separate implementation of pdf or would it be just as good or better to exponentiate logpdf?
It could be considered out of scope, but it's easier to review one function than two.
(That said, both of these look like they fix the issue and preserve behavior otherwise.)

@mdhaber (Contributor) commented:

ping @tirthasheshpatel
(I was thinking of merging this today)

@tirthasheshpatel (Member, Author) replied:

> Do you think there is a numerical advantage to this separate implementation of pdf or would it be just as good or better to exponentiate logpdf?

Sorry, I missed this! I think it's safe to exponentiate the logpdf; I don't see how the pdf implementation would be more numerically stable than the logpdf version. On the contrary, logpdf should be more accurate in the tails, so exponentiating it should be better. I will change that and see if the tests pass.

@tirthasheshpatel (Member, Author) commented Dec 31, 2021

The difference between t.pdf(x, df) and np.exp(t.logpdf(x, df)) is quite small, so I don't think we need to worry too much about numerical stability:

>>> import numpy as np
>>> from scipy.stats import t
>>> dist = t(5)
>>> x = np.linspace(0, dist.isf(1e-10), num=100_000)
>>> np.mean((dist.pdf(x) - np.exp(dist.logpdf(x)))**2)
3.5557301413708695e-35
>>> np.mean(np.abs(dist.pdf(x) - np.exp(dist.logpdf(x))))
5.971167081946323e-19

I can change the PDF to np.exp(self._logpdf(x, df)) in a follow-up PR if that sounds good to you.

        return mu, mu2, g1, g2

    def _entropy(self, df):
        if df == np.inf:
            return norm._entropy()
@mdhaber (Contributor) commented:

Note to self... when we rewrite rv_continuous, have a consistent policy for what is passed into these methods. I was not aware that _entropy would be automatically vectorized.

That said, it would be nice if the SciPy developer had the choice to implement either a vectorized _<methodname>_vectorized or a scalar _<methodname>_scalar. To save future headaches, <methodname> should do the following before calling _<methodname>_vectorized (a rough sketch follows the list):

  • Ensure that all the inputs are at least 1d arrays
  • Broadcast all the input arrays together
  • Take care of returning a scalar or 0d array instead of a 1d array as needed, depending on what the user passed into <methodname>
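A rough sketch of that dispatch with hypothetical names (methodname and _methodname_vectorized are placeholders for illustration, not real SciPy API):

import numpy as np

def methodname(dist, x, df):
    # Remember whether the caller passed scalars.
    scalar = np.ndim(x) == 0 and np.ndim(df) == 0
    # Ensure >=1-d arrays and broadcast them together.
    x, df = np.broadcast_arrays(np.atleast_1d(x), np.atleast_1d(df))
    out = dist._methodname_vectorized(x, df)
    # Return a scalar instead of a 1-element array when appropriate.
    return out.item() if scalar else out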

@mdhaber (Contributor) commented Dec 28, 2021

We're also getting

scipy/stats/_continuous_distns.py:5900:17: E128 continuation line under-indented for visual indent

@tirthasheshpatel (Member, Author) commented:

Thanks for resolving the lint failure, @mdhaber! Feel free to merge once tests pass.

@mdhaber (Contributor) commented Dec 31, 2021

OK, merging, but let me know what you think about using _logpdf to calculate _pdf here. About half of the distributions that define _logpdf implement _pdf as return np.exp(self._logpdf(...)).
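That pattern is simply (a generic sketch):

def _pdf(self, x, *args):
    # exponentiate the log-density instead of re-deriving the pdf
    return np.exp(self._logpdf(x, *args))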

@mdhaber mdhaber merged commit cf71f7e into scipy:master Dec 31, 2021
@tirthasheshpatel (Member, Author) commented:

I commented about it here: #14781 (comment) and #14781 (comment)

@tirthasheshpatel tirthasheshpatel deleted the fix-t-limit branch December 31, 2021 20:48
@mdhaber (Contributor) commented Dec 31, 2021

Brief tests suggest that the pdf implementation is actually more accurate, at least in some cases. For instance, Wolfram Alpha says that with df=4, the PDF is given by:
[image: ref(x) = (3/8)(1 + x²/4)^(-5/2), the closed-form t PDF for df=4]

>>> (t.pdf(x, df = 4) - ref(x))/(ref(x))
2.6295626461790996e-16
>>> (np.exp(t.logpdf(x, df = 4)) - ref(x))/(ref(x))
1.3147813230895499e-15


So never mind.

@tylerjereddy tylerjereddy added this to the 1.9.0 milestone Dec 31, 2021
@tylerjereddy (Contributor) commented:

Labelled as "defect": does it warrant a backport?

@mdhaber (Contributor) commented Dec 31, 2021

I don't know how much effort it is to backport, but it is a defect, and it was reported by a user. I'm not sure how often one would really want to use stats.t(df=np.inf), though. So if this were not backported, I'm not sure that anyone else would get hit by this bug.

gh-15253 is probably even more important than this one, because for t.cdf the answer was just wrong.

>>> t.cdf(x=10, df=np.inf)
0.5
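(For reference, the correct value is the standard normal CDF, which is 1.0 to double precision:)

>>> from scipy.stats import norm
>>> norm.cdf(10)
1.0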

This PR seems a little less important because at least the answer was nan, so the user would know they needed to try something else.

@tylerjereddy (Contributor) commented:

Ok, I added the backport label in that other PR but not here. The real question for me is often related to merge conflict resolution. I'll ping for help if it gets gruesome.

Labels: defect (a clear bug or issue that prevents SciPy from being installed or used as expected), scipy.stats
Linked issue: BUG: Wrong limit and no warning in stats.t for df=np.inf
4 participants