BUG: stats: handle infinite df in t distribution #14781
Conversation
The `t` distribution returned garbage values when `df` was infinity. It is well known that the `t` distribution converges to the normal distribution as `df` approaches infinity, so it is much better to handle infinite `df` by branching to the normal distribution than to return wrong values.
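For illustration, here is a standalone sketch of the branching idea (this is not the code in the PR, just a hypothetical helper): where `df` is infinite, defer to the normal distribution instead of evaluating the t formulas.

```python
import numpy as np
from scipy import stats

def t_pdf_handling_inf_df(x, df):
    """Illustrative sketch only, not the PR's implementation: evaluate the
    Student t pdf where df is finite and fall back to the normal pdf where
    df is infinite, since t(df) -> norm as df -> inf."""
    x, df = np.broadcast_arrays(np.asarray(x, float), np.asarray(df, float))
    out = np.empty_like(x)
    finite = np.isfinite(df)
    out[finite] = stats.t.pdf(x[finite], df[finite])
    out[~finite] = stats.norm.pdf(x[~finite])
    return out

# t with huge df is already very close to normal; infinite df now matches it exactly.
print(t_pdf_handling_inf_df([0.0, 1.5], [1e6, np.inf]))
print(stats.norm.pdf([0.0, 1.5]))
```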
Thanks @tirthasheshpatel. The tests look good, but I'm not sure this is the right fix. The code becomes much harder to understand, and the underlying special functions are still wrong:
In [10]: special.stdtr(1, 1.5)
Out[10]: 0.8128329581890013
In [11]: special.stdtr(10, 1.5)
Out[11]: 0.9177463367772799
In [12]: special.stdtr(100, 1.5)
Out[12]: 0.9316174709376556
In [13]: special.stdtr(100000, 1.5)
Out[13]: 0.9331912202385847
In [14]: special.stdtr(np.inf, 1.5)
Out[14]: 0.5
Did you consider fixing those special functions instead? That should also be much more performant, I'd think.
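For reference, the value that `stdtr(np.inf, 1.5)` should converge to is the standard normal CDF at 1.5, which the sequence above is clearly approaching; a quick check:

```python
import numpy as np
from scipy import special, stats

# The df -> inf limit of the Student t CDF is the standard normal CDF,
# so stdtr(np.inf, 1.5) should be about 0.93319 rather than 0.5.
print(stats.norm.cdf(1.5))         # ~0.93319
print(special.stdtr(100000, 1.5))  # ~0.93319, consistent with the limit
```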
I agree, I am also not very happy with this. At the time, I thought branching would make sense, but your suggestion about fixing the underlying special functions makes more sense. I will look into it. gh-14782 also seems like a good approach. I won't be able to come back to this until mid-December as I have my final exams. Can we bump the milestone? I don't want to stall the improvement, so any other interested contributor, feel free to beat me to it.
Of course. Good luck with your final exams, Tirth!
I will submit two separate PRs: one that fixes the underlying special functions and another that fixes some methods of the t distribution (like …).
scipy/stats/_continuous_distns.py (outdated):
```python
r = df[imask]*1.0
res[imask] = (sc.gammaln((r+1)/2) - sc.gammaln(r/2)
              - (0.5*np.log(r*np.pi)
                 + (r+1)/2*np.log(1+(x[imask]**2)/r)))
```
This PR includes manual masking, `np.where`, and `_lazywhere`. Is there a reason for switching between these strategies, or should it be consistent (especially with the rest of the distributions, which tend to prefer the lazy functions) as a matter of style?
Made it consistent. I have refactored to use only `_lazywhere`. Thanks for pointing it out!
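For context, `_lazywhere` is a private helper in `scipy._lib._util` (so its location and signature may change between versions); it evaluates one callable on the elements where a condition holds and another elsewhere, which avoids manual masking. A rough, hypothetical sketch of how a t `_logpdf` could branch on finite `df` with it (not the exact code merged here):

```python
import numpy as np
from scipy import special as sc
from scipy._lib._util import _lazywhere  # private helper; may differ across versions

def t_logpdf_sketch(x, df):
    # Hypothetical sketch, not the exact code in this PR.
    def finite_df(x, df):
        # Student t log-density for finite df.
        return (sc.gammaln((df + 1) / 2) - sc.gammaln(df / 2)
                - 0.5 * np.log(df * np.pi)
                - (df + 1) / 2 * np.log1p(x**2 / df))

    def infinite_df(x, df):
        # Standard normal log-density; df is passed by _lazywhere but unused here.
        return -0.5 * np.log(2 * np.pi) - x**2 / 2

    return _lazywhere(np.isfinite(df), (x, df), f=finite_df, f2=infinite_df)

x = np.array([0.0, 1.5, 1.5])
df = np.array([10.0, 10.0, np.inf])
print(t_logpdf_sketch(x, df))
```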
```python
            np.exp(sc.gammaln((df+1)/2) - sc.gammaln(df/2))
            / (np.sqrt(df*np.pi)*(1+(x**2)/df)**((df+1)/2))
        )
    )
```
Much better, thanks!
Do you think there is a numerical advantage to this separate implementation of `pdf`, or would it be just as good or better to exponentiate `logpdf`?
It could be considered out of scope, but it's easier to review one function than two. (That said, both of these look like they fix the issue and preserve behavior otherwise.)
ping @tirthasheshpatel
(I was thinking of merging this today)
> Do you think there is a numerical advantage to this separate implementation of `pdf`, or would it be just as good or better to exponentiate `logpdf`?
Sorry, I missed this! I think it's safe to exponentiate the `logpdf`. I don't see how the `pdf` implementation is more numerically stable than the `logpdf` version. On the contrary, `logpdf` should be more accurate in the tails, so exponentiating it should be better. I will change that and see if the tests pass.
The difference between `t.pdf(x, df)` and `np.exp(t.logpdf(x, df))` is quite small, so I don't think we need to worry too much about numerical stability:
>>> import numpy as np
>>> from scipy.stats import t
>>> dist = t(5)
>>> x = np.linspace(0, dist.isf(1e-10), num=100_000)
>>> np.mean((dist.pdf(x) - np.exp(dist.logpdf(x)))**2)
3.5557301413708695e-35
>>> np.mean(np.abs(dist.pdf(x) - np.exp(dist.logpdf(x))))
5.971167081946323e-19
I can change the PDF to `np.exp(self._logpdf(x, df))` in a follow-up PR, if that sounds good to you.
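If that follow-up lands, the change would presumably be as small as something like this (hypothetical, reusing the existing `_logpdf`):

```python
def _pdf(self, x, df):
    # Hypothetical follow-up: exponentiate the log-density instead of
    # keeping a separate closed-form expression for the pdf.
    return np.exp(self._logpdf(x, df))
```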
```python
        return mu, mu2, g1, g2

    def _entropy(self, df):
        if df == np.inf:
            return norm._entropy()
```
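A quick sanity check of the `_entropy` branch (assuming a SciPy build that includes this fix): the t entropy should approach the standard normal entropy, 0.5*log(2*pi*e) ≈ 1.4189, as `df` grows.

```python
import numpy as np
from scipy import stats

print(stats.norm().entropy())     # 0.5*log(2*pi*e) ~ 1.4189385
print(stats.t(1e6).entropy())     # close to the normal value for large df
print(stats.t(np.inf).entropy())  # with this fix, equal to the normal value
```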
Note to self... when we rewrite `rv_continuous`, have a consistent policy for what is passed into these methods. I was not aware that `_entropy` would be automatically vectorized.
That said, it would be nice if the SciPy developer had the choice to implement either a vectorized `_<methodname>_vectorized` or an `_<methodname>_scalar`. To save future headaches, `<methodname>` should do the following before calling `_<methodname>_vectorized` (a rough sketch follows below):
- Ensure that all the inputs are at least 1d arrays
- Broadcast all the input arrays together
- Take care of returning a scalar or 0d array instead of a 1d array as needed, depending on what the user passed into `<methodname>`
We're also getting …
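The dispatch policy described in the list above might look roughly like this (a hypothetical helper with made-up names; nothing like this exists in `rv_continuous` today):

```python
import numpy as np

def _call_vectorized(method_vectorized, *args):
    # Hypothetical wrapper illustrating the policy above: promote inputs to
    # at least 1-d, broadcast them together, call the vectorized
    # implementation, and return a scalar if the caller passed only scalars.
    all_scalar = all(np.ndim(a) == 0 for a in args)
    arrays = np.broadcast_arrays(*(np.atleast_1d(np.asarray(a, float)) for a in args))
    out = method_vectorized(*arrays)
    return out[0] if all_scalar else out

# Toy vectorized implementation, used only to demonstrate the wrapper.
entropy_vectorized = lambda df: 0.5 * np.log(df)
print(_call_vectorized(entropy_vectorized, 4.0))         # scalar in -> scalar out
print(_call_vectorized(entropy_vectorized, [4.0, 9.0]))  # array in -> array out
```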
Co-authored-by: Matt Haberland <[email protected]>
MAINT: stats: refactor t distribution _stats method
Thanks for resolving the lint failure, @mdhaber! Feel free to merge once tests pass.
OK, merging, but let me know what you think about using …
I commented about it here: #14781 (comment) and #14781 (comment)
Labelled as "defect" - does it warrant a backport?
I don't know how much effort it is to backport, but it is a defect, and it was reported by a user. I'm not sure how often one would really want to use …
gh-15253 is probably even more important than this one, because for …
This PR seems a little less important because at least the answer was …
Ok, I added the backport label in that other PR but not here. The real question for me is often related to merge conflict resolution. I'll ping for help if it gets gruesome.
Reference issue
fixes #14777
What does this implement/fix?
The `t` distribution returned garbage values when `df` was infinity. It is well known that the `t` distribution converges to the normal distribution as `df` approaches infinity, so it is much better to handle infinite `df` by branching to the normal distribution than to return wrong values.
Appropriate tests have been added for the fix.
Additional information