ENH: stats: add alternative to masked normality tests #13960

Merged
mdhaber merged 3 commits into scipy:master from gh12506-normal-masked on Jun 3, 2021

tirthasheshpatel
Member

Reference issue

Addresses gh-12506.
gh-13549 added the alternative parameter to some normality tests but forgot to add it to the masked version. This is a continuation of that work.

What does this implement/fix?

skewtest and kurtosistest were missing an alternative parameter
in their masked version. It has been added and tested now.
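
As a usage sketch of the new parameter (assuming a SciPy release that includes this change; the data and the positions of the missing values are made up for illustration), the masked `skewtest` can be called once per alternative:

```python
import numpy as np
from scipy.stats import mstats

# Illustrative data: 50 normal draws with a few values marked as missing.
rng = np.random.default_rng(0)
data = rng.normal(loc=10, scale=2.5, size=50)
data[::10] = np.nan                   # pretend five observations are missing
masked = np.ma.masked_invalid(data)   # mask the NaNs instead of dropping them

p_values = {}
for alternative in ("two-sided", "less", "greater"):
    statistic, pvalue = mstats.skewtest(masked, alternative=alternative)
    p_values[alternative] = float(pvalue)
    print(f"{alternative:>9}: statistic={float(statistic):.4f}, p={float(pvalue):.4f}")
```

Because the statistic is compared against a normal distribution, the two one-sided p-values should sum to one, and the two-sided p-value should be twice the smaller of them.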

Additional information

N/A

`skewtest` and `kurtosistest` were missing an `alternative` parameter
in their masked version. It has been added and tested now.
@tirthasheshpatel added the enhancement (A new feature or improvement) and scipy.stats labels on Apr 30, 2021
Contributor

@mdhaber mdhaber left a comment

Good start, but please review what we did in gh-13549 and apply some of that here. For instance, better document the meaning of the less vs greater and rely on _normtest_finish to raise the error for an invalid alternative argument.
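
To illustrate what `less` and `greater` mean for a statistic that is compared against the standard normal distribution, here is a minimal pure-Python sketch. The helper name `normal_p_value` is hypothetical and this is not SciPy's `_normtest_finish`; it only mirrors the idea that the error for an invalid `alternative` is raised at the point where the p-value is selected.

```python
import math

def normal_p_value(z, alternative="two-sided"):
    """Hypothetical helper: p-value of a standard-normal statistic ``z``."""
    cdf = 0.5 * math.erfc(-z / math.sqrt(2))  # P(Z <= z), lower tail
    sf = 0.5 * math.erfc(z / math.sqrt(2))    # P(Z >= z), upper tail
    if alternative == "less":
        return cdf            # small when z is far below 0
    if alternative == "greater":
        return sf             # small when z is far above 0
    if alternative == "two-sided":
        return 2 * min(cdf, sf)
    raise ValueError("alternative must be 'less', 'greater' or 'two-sided'")

for alt in ("two-sided", "less", "greater"):
    print(alt, round(normal_p_value(1.96, alt), 4))
```

With a positive statistic, `greater` gives the small p-value and `less` the large one; the two always sum to one.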

@tirthasheshpatel
Member Author

Hi, @mdhaber. Thanks for the review!

better document the meaning of the less vs greater

Ah, I didn't document the masked versions because their documentation points to the stats versions, where less and greater are documented in more detail. But I am also OK with copy-pasting those docs to the masked versions. What do you think?

rely on _normtest_finish to raise the error for an invalid alternative argument

On an unrelated note, relying on _normtest_finish would result in computing the Z value and then throwing the error. What we probably want is to throw an error before doing all the heavy computations. I don't know if it is a big problem but just something I thought would be worth pointing out/discussing.
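
The early-validation idea raised above could look roughly like the following sketch. All names here (`VALID_ALTERNATIVES`, `expensive_statistic`, `toy_test`) are hypothetical stand-ins, not SciPy functions; the counter exists only to show that the costly step is skipped when the argument is invalid.

```python
import math

VALID_ALTERNATIVES = ("two-sided", "less", "greater")
calls = {"heavy": 0}  # counts how often the costly step actually runs

def expensive_statistic(sample):
    """Stand-in for the costly part of a test (here a z-score of the mean)."""
    calls["heavy"] += 1
    n = len(sample)
    mean = sum(sample) / n
    var = sum((x - mean) ** 2 for x in sample) / (n - 1)
    return mean / math.sqrt(var / n)

def toy_test(sample, alternative="two-sided"):
    # Validate first, so a bad argument fails before any heavy computation.
    if alternative not in VALID_ALTERNATIVES:
        raise ValueError("alternative must be 'less', 'greater' or 'two-sided'")
    z = expensive_statistic(sample)
    return z  # p-value selection for the chosen alternative would follow here

try:
    toy_test([1.0, 2.0, 4.0], alternative="sideways")
except ValueError as exc:
    print("rejected early:", exc, "| heavy calls:", calls["heavy"])
```

The trade-off discussed in this thread is consistency: validating inside the shared finishing helper keeps every test uniform, while validating up front saves work on invalid input.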

@mdhaber
Contributor

mdhaber commented Jun 2, 2021

But I am also OK with copy-pasting those docs to the masked versions. What do you think?

I do think that the definitions should appear here; we should not just refer the user to stats for documentation. (If we were to refer the user to stats for a portion of the documentation, why not refer the user to stats for all of the function documentation?)

For implementing this, it would be nice to define the text (e.g. as a variable) in one place and use it in many places, but for now let's copy-paste. There are a lot of other places where we share text that could be cleaned up at the same time (if we decide to go that route at some point).

On an unrelated note, relying on _normtest_finish would result in computing the Z value and then throwing the error

I agree it's not ideal, but there is the same problem in the stats version. I'd keep it consistent here. Maybe just rely on _normtest_finish here, and if you are so inspired, have another PR that does early input validation for hypothesis tests (and remove it from _normtest_finish).
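
The "define the text in one place" idea mentioned above could be sketched as follows. This is only an illustration under assumed names (`ALTERNATIVE_DOC`, `share_alternative_doc`, and the two toy functions are hypothetical); it is not how SciPy shares docstrings, and the parameter description is abbreviated.

```python
# Shared parameter description, written once.
ALTERNATIVE_DOC = """alternative : {'two-sided', 'less', 'greater'}, optional
        Defines the alternative hypothesis. Default is 'two-sided'."""

def share_alternative_doc(func):
    """Fill the ``%(alternative)s`` placeholder in ``func``'s docstring."""
    func.__doc__ = func.__doc__.replace("%(alternative)s", ALTERNATIVE_DOC)
    return func

@share_alternative_doc
def skewtest_like(a, alternative="two-sided"):
    """Toy stand-in for a skewness test.

    Parameters
    ----------
    a : array_like
        The data to be tested.
    %(alternative)s
    """

@share_alternative_doc
def kurtosistest_like(a, alternative="two-sided"):
    """Toy stand-in for a kurtosis test.

    Parameters
    ----------
    a : array_like
        The data to be tested.
    %(alternative)s
    """

print(skewtest_like.__doc__)
```

Both functions end up with the same parameter text, so a later wording fix only has to be made in one place.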

* copy-paste docs explaining `less` and `greater` to `mstats` version.
* add test with masked arrays in `test_mstats_basic`
@tirthasheshpatel
Member Author

I do think that the definitions should appear here

Got it. Copy-pasted the docs in the latest commit.

I agree it's not ideal, but there is the same problem in the stats version. I'd keep it consistent here.

Makes sense, done.

if you are so inspired, have another PR that does early input validation for hypothesis tests (and remove it from _normtest_finish).

👍

@@ -964,7 +964,7 @@ def regression_test_9033(self):
     @pytest.mark.parametrize("test", ["skewtest", "kurtosistest"])
     @pytest.mark.parametrize("alternative", ["less", "greater"])
     def test_alternative(self, test, alternative):
-        x = stats.norm.rvs(loc=10, scale=2.5, size=20, random_state=123)
+        x = stats.norm.rvs(loc=10, scale=2.5, size=30, random_state=123)
Contributor

This is fine - as long as it wasn't necessary to make the test pass...? What was the motivation?

Member Author

Oh, yes. It wasn't done to make the test pass but because skewtest requires at least 20 samples. As I add nans to some samples, I had to increase the sample size.
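
The effect of replacing values with nans on the usable sample size can be checked with a small sketch. The number and position of the nans are assumptions made for illustration (the thread does not state them), and NumPy's generator is used here in place of `stats.norm.rvs`.

```python
import numpy as np

# 30 draws, mirroring the enlarged sample size in the test.
x = np.random.default_rng(123).normal(loc=10, scale=2.5, size=30)

x[1:5] = np.nan                    # assumed: four values replaced by nan
masked = np.ma.masked_invalid(x)   # nan entries become masked

remaining = masked.count()         # number of unmasked observations
print(remaining)                   # 26
```

Starting from 30 draws leaves room to mask a handful of values while staying above the 20-sample threshold mentioned in the reply; starting from 20 would not.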


 Returns
 -------
-statistic : float
+statistic : array_like
Contributor

@mdhaber mdhaber Jun 3, 2021

More accurately, these would be "scalar or ndarray". (Could be scalar, and if it's array-like, it's going to be an ndarray)

Member Author

@tirthasheshpatel tirthasheshpatel Jun 3, 2021

More accurately, these would be "scalar or ndarray". (Could be scalar, and if it's array-like, it's going to be an ndarray)

I think array_like includes both scalars and array outputs. So, I thought it would be better to change it that way.

Contributor

@mdhaber mdhaber Jun 3, 2021

I'm not sure if array_like is formally defined anywhere, so I guess we can use it to mean whatever we want. But we could be more specific as we push toward better documentation. (Not required for this PR, though. Not really worth another CI run IMO.)

Member Author

@tirthasheshpatel tirthasheshpatel Jun 3, 2021

I'm not sure if array_like is formally defined anywhere, so I guess we can use it to mean whatever we want. But we could be more specific as we push toward better documentation. (Not required for this PR, though.)

I think NumPy does define array_like in https://numpydoc.readthedocs.io/en/latest/format.html#other-points-to-keep-in-mind. See gh-13621.

Contributor

@mdhaber mdhaber Jun 3, 2021

Oh, nice. Thanks for pointing that out. Note that it only specifically says it's for documenting arguments, though. We are flexible about arguments, but we know what the return types will be.

But even something as fundamental as e.g. np.mean doesn't document this perfectly. It says that the output type will be ndarray

but as we know:

isinstance(np.mean([[1, 2, 3]], axis=0), np.ndarray)  # True
isinstance(np.mean([1, 2, 3]), np.ndarray)  # False

I'll go ahead and merge this as-is, if it sounds good to you?
We can consider adding this to the huge list of things we'd like the documentation to be more consistent about.
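
The "scalar or ndarray" distinction discussed in this review can be checked directly. The sketch below uses only `np.mean`; the variable names are illustrative.

```python
import numpy as np

# Reducing over every axis yields a NumPy scalar, not an ndarray.
full = np.mean([[1, 2, 3]])

# Reducing over one axis of a 2-D input leaves a 1-D ndarray.
per_column = np.mean([[1, 2, 3]], axis=0)

print(type(full).__name__)        # float64
print(type(per_column).__name__)  # ndarray
```

A return-type description such as "scalar or ndarray" covers both cases, which is the more specific wording suggested in the review.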

Member Author

if it sounds good to you?

Yep, everything sounds good to me!

@mdhaber
Contributor

mdhaber commented Jun 3, 2021

This PR adds support for an alternative argument to the mstats versions of skewtest and kurtosistest; it also enables the stats versions of these tests to use alternative together with nan_policy='omit'. The behavior is what we'd get if we were to manually remove nans from the arrays, e.g.:

import numpy as np
from scipy import stats
from scipy.stats import mstats

np.random.seed(0)

for test_name in {"kurtosistest", "skewtest"}:
    for alternative in {'less', 'greater', 'two-sided'}:
        for i in range(100):
            sample = stats.norm.rvs(size=np.random.randint(30, i+40))
            p = np.random.rand()
            mask = np.random.rand(*sample.shape) > (0.5 + p/2)
            
            test = getattr(stats, test_name)
            mtest = getattr(mstats, test_name)

            compressed_sample = sample[~mask]

            nan_sample = sample.copy()
            nan_sample[mask] = np.nan

            masked_sample = np.ma.masked_array(sample, mask=mask)

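            # Pass the alternative under test to every variant
            res1 = test(compressed_sample, alternative=alternative)
            res2 = test(nan_sample, nan_policy='omit', alternative=alternative)
            res3 = mtest(masked_sample, alternative=alternative)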

            np.testing.assert_allclose(res2, res1)
            np.testing.assert_allclose(res3, res1)

Tests are fine (considering the existing tests of skewtest and kurtosistest) and documentation of the alternatives is thorough.

Thanks @tirthasheshpatel!

@mdhaber merged commit eaf5311 into scipy:master on Jun 3, 2021
@tirthasheshpatel deleted the gh12506-normal-masked branch on June 4, 2021, 01:34
@tylerjereddy added this to the 1.8.0 milestone on Jun 4, 2021