
BUG: stats: handle infinite df in t distribution #14781


Merged: 11 commits merged into scipy:master from tirthasheshpatel:fix-t-limit on Dec 31, 2021

Conversation

tirthasheshpatel (Member)

Reference issue

fixes #14777

What does this implement/fix?

The `t` distribution returned garbage values when `df` was infinite. It is known
that the `t` distribution converges to the normal distribution as `df` approaches
infinity, so it is better to handle infinite `df` by branching to the normal
distribution than to return wrong values.

Appropriate tests have been added for the fix.

Additional information

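A quick way to see the limiting behavior from the public API (a minimal illustration, not part of the patch):

import numpy as np
from scipy import stats

# The t PDF approaches the standard normal PDF as df grows.
for df in (1, 10, 100, 1e6):
    print(df, stats.t.pdf(1.5, df))
print("norm:", stats.norm.pdf(1.5))  # ~0.129518, the df -> inf limit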
@tirthasheshpatel tirthasheshpatel added the defect A clear bug or issue that prevents SciPy from being installed or used as expected label Sep 28, 2021
@rgommers rgommers added this to the 1.8.0 milestone Nov 11, 2021
@rgommers (Member) left a comment

Thanks @tirthasheshpatel. The tests look good, but I'm not sure this is the right fix. The code becomes much harder to understand, and the underlying special functions are still wrong:

In [10]: special.stdtr(1, 1.5)
Out[10]: 0.8128329581890013

In [11]: special.stdtr(10, 1.5)
Out[11]: 0.9177463367772799

In [12]: special.stdtr(100, 1.5)
Out[12]: 0.9316174709376556

In [13]: special.stdtr(100000, 1.5)
Out[13]: 0.9331912202385847

In [14]: special.stdtr(np.inf, 1.5)
Out[14]: 0.5

Did you consider fixing those special functions instead? That should also be much more performant I'd think.
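(For context: the sequence above should converge to the standard normal CDF.)

>>> from scipy.stats import norm
>>> norm.cdf(1.5)
0.9331927987311419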

@tirthasheshpatel (Member, Author) commented Dec 1, 2021

> The tests look good, but I'm not sure this is the right fix. The code becomes much harder to understand, and the underlying special functions are still wrong

I agree; I'm also not very happy with this. At the time, I thought branching made sense, but your suggestion of fixing the underlying special functions makes more sense. I will look into it. gh-14782 also seems like a good approach.

I won't be able to come back to this until mid-December as I have my final exams. Can we bump the milestone? I don't want to stall the improvement, so if any other contributor is interested, feel free to beat me to it.

@rgommers rgommers modified the milestones: 1.8.0, 1.9.0 Dec 1, 2021
@rgommers (Member) commented Dec 1, 2021

Of course. Good luck with your final exams Tirth!

@tirthasheshpatel (Member, Author) commented:

I will submit two separate PRs: one that fixes the underlying special functions, and another that fixes some methods of the t distribution (like stats and moment) to work with df = inf.

@tirthasheshpatel (Member, Author) commented:

@mdhaber @rgommers Now that #15253 is merged, I have reverted the changes in cdf, sf, ppf, and isf. pdf, logpdf, stats, and entropy will still need to be fixed. Currently, I just use masking. Let me know if the changes look OK.

r = df[imask]*1.0
res[imask] = (sc.gammaln((r+1)/2) - sc.gammaln(r/2)
              - (0.5*np.log(r*np.pi)
                 + (r+1)/2*np.log(1 + (x[imask]**2)/r)))
@mdhaber (Contributor) commented Dec 28, 2021

This PR includes manual masking, np.where, and _lazywhere. Is there a reason for switching between these strategies, or should it be consistent (especially with the rest of the distributions, which tend to prefer the lazy functions) as a matter of style?

@tirthasheshpatel (Member, Author) replied:

> This PR includes manual masking, np.where, and _lazywhere. Is there a reason for switching between these strategies, or should it be consistent (especially with the rest of the distributions, which tend to prefer the lazy functions) as a matter of style?

Made it consistent. I have refactored to only use _lazywhere. Thanks for pointing it out!

        np.exp(sc.gammaln((df+1)/2) - sc.gammaln(df/2))
        / (np.sqrt(df*np.pi)*(1 + (x**2)/df)**((df+1)/2))
    )
)
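For context, the fragment above is the finite-df branch; a sketch of how it might sit inside _lazywhere (assumed surrounding code, reconstructed here rather than quoted from the diff):

def _pdf(self, x, df):
    return _lazywhere(
        df == np.inf, (x, df),
        f=lambda x, df: norm._pdf(x),  # normal limit when df is infinite
        f2=lambda x, df: (
            np.exp(sc.gammaln((df+1)/2) - sc.gammaln(df/2))
            / (np.sqrt(df*np.pi)*(1 + (x**2)/df)**((df+1)/2))
        )
    )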

@mdhaber (Contributor) replied:

Much better, thanks!
Do you think there is a numerical advantage to this separate implementation of pdf or would it be just as good or better to exponentiate logpdf?
It could be considered out of scope, but it's easier to review one function than two.
(That said, both of these look like they fix the issue and preserve behavior otherwise.)

@mdhaber (Contributor) commented:

ping @tirthasheshpatel
(I was thinking of merging this today)

@tirthasheshpatel (Member, Author) replied:

> Do you think there is a numerical advantage to this separate implementation of pdf or would it be just as good or better to exponentiate logpdf?

Sorry, I missed this! I think it's safe to exponentiate the logpdf; I don't see how the pdf implementation would be more numerically stable than the logpdf version. On the contrary, logpdf should be more accurate in the tails, so exponentiating it should be better. I will change that and see if the tests pass.

@tirthasheshpatel (Member, Author) commented Dec 31, 2021

The difference between t.pdf(x, df) and np.exp(t.logpdf(x, df)) is quite small, so I don't think we need to worry too much about numerical stability:

>>> import numpy as np
>>> from scipy.stats import t
>>> dist = t(5)
>>> x = np.linspace(0, dist.isf(1e-10), num=100_000)
>>> np.mean((dist.pdf(x) - np.exp(dist.logpdf(x)))**2)
3.5557301413708695e-35
>>> np.mean(np.abs(dist.pdf(x) - np.exp(dist.logpdf(x))))
5.971167081946323e-19

I can change the PDF to np.exp(self._logpdf(x, df)) in a follow-up PR if that sounds good to you.

        return mu, mu2, g1, g2

    def _entropy(self, df):
        if df == np.inf:
            return norm._entropy()
@mdhaber (Contributor) commented:

Note to self... when we rewrite rv_continuous, have a consistent policy for what is passed into these methods. I was not aware that _entropy would be automatically vectorized.

That said, it would be nice if the SciPy developer had the choice to implement either a vectorized _<methodname>_vectorized or a scalar _<methodname>_scalar. To save future headaches, <methodname> should do the following before calling _<methodname>_vectorized (a rough sketch follows the list):

  • Ensure that all the inputs are at least 1d arrays
  • Broadcast all the input arrays together
  • Take care of returning a scalar or 0d array instead of a 1d array as needed, depending on what the user passed into <methodname>
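A rough sketch of that dispatch with hypothetical names (methodname and _methodname_vectorized are placeholders for illustration, not real SciPy API):

import numpy as np

def methodname(dist, x, df):
    # Remember whether the caller passed scalars.
    scalar = np.ndim(x) == 0 and np.ndim(df) == 0
    # Ensure >=1-d arrays and broadcast them together.
    x, df = np.broadcast_arrays(np.atleast_1d(x), np.atleast_1d(df))
    out = dist._methodname_vectorized(x, df)
    # Return a scalar instead of a 1-element array when appropriate.
    return out.item() if scalar else out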

@mdhaber (Contributor) commented Dec 28, 2021

We're also getting

scipy/stats/_continuous_distns.py:5900:17: E128 continuation line under-indented for visual indent

@tirthasheshpatel (Member, Author) commented:

Thanks for resolving the lint failure, @mdhaber! Feel free to merge once tests pass.

@mdhaber (Contributor) commented Dec 31, 2021

OK, merging, but let me know what you think about using _logpdf to calculate _pdf here. About half of the distributions that define _logpdf implement _pdf as return np.exp(self._logpdf(...)).
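That pattern is simply (a generic sketch):

def _pdf(self, x, *args):
    # exponentiate the log-density instead of re-deriving the pdf
    return np.exp(self._logpdf(x, *args))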

@mdhaber mdhaber merged commit cf71f7e into scipy:master Dec 31, 2021
@tirthasheshpatel (Member, Author) commented:

I commented about it here: #14781 (comment) and #14781 (comment)

@tirthasheshpatel tirthasheshpatel deleted the fix-t-limit branch December 31, 2021 20:48
@mdhaber (Contributor) commented Dec 31, 2021

Brief tests suggest that the pdf implementation is actually more accurate, at least in some cases. For instance, Wolfram Alpha says that with df=4, the PDF is given by:
[image: ref(x) = (3/8)(1 + x²/4)^(-5/2), the closed-form t PDF for df=4]

>>> (t.pdf(x, df = 4) - ref(x))/(ref(x))
2.6295626461790996e-16
>>> (np.exp(t.logpdf(x, df = 4)) - ref(x))/(ref(x))
1.3147813230895499e-15


So never mind.

@tylerjereddy tylerjereddy added this to the 1.9.0 milestone Dec 31, 2021
@tylerjereddy (Contributor) commented:

Labelled as "defect": does it warrant a backport?

@mdhaber (Contributor) commented Dec 31, 2021

I don't know how much effort it is to backport, but it is a defect, and it was reported by a user. I'm not sure how often one would really want to use stats.t(df=np.inf), though. So if this were not backported, I'm not sure that anyone else would get hit by this bug.

gh-15253 is probably even more important than this one, because for t.cdf the answer was just wrong.

>>> t.cdf(x=10, df=np.inf)
0.5
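(For reference, the correct value is the standard normal CDF, which is 1.0 to double precision:)

>>> from scipy.stats import norm
>>> norm.cdf(10)
1.0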

This PR seems a little less important because at least the answer was nan, so the user would know they needed to try something else.

@tylerjereddy (Contributor) commented:

Ok, I added the backport label in that other PR but not here. The real question for me is often related to merge conflict resolution. I'll ping for help if it gets gruesome.

Labels: defect (a clear bug or issue that prevents SciPy from being installed or used as expected), scipy.stats
Linked issue: BUG: Wrong limit and no warning in stats.t for df=np.inf
4 participants