DOC: stats.bootstrap: improve documentation multidimensional input #20850

@aangelopoulos

Hello! I am writing to describe some errors in the bootstrap function. I am happy to help fix these errors if we agree that they are indeed bugs.

The issue arises when you give `bootstrap` a statistic that depends on both $X$ and $y$. Say $X \in \mathbb{R}^{n \times d}$ and $y \in \mathbb{R}^n$. (This follows standard scipy conventions, as in the logistic regression implementation). For example, let's say the statistic is $\mathbf{1}^{\top}X^{\top}y$. Then we would expect running `bootstrap([X,y], statistic, n_resamples=7, paired=True)` to give us an array of size 7; it should just resample $X$ and $y$ 7 times, and calculate the statistic for each resampling.

However, this is not what happens. In the MWE below, the function fails to run. Errors in the resampling protocol are causing this bug.

  1. Firstly, to be consistent with the rest of scipy, the first axis should index the observations in the sample, and the remaining axes should index the dimensions.
  2. Running `_bootstrap_resample(X, n_resamples=7)` returns an array with shape `(n, n_resamples, d)`. Running `_bootstrap_resample(y, n_resamples=7)` gives an array of shape `(n_resamples, n)`.
  3. Calculating `statistic(*resampled_data, axis=-1)` will fail due to a dimension mismatch. In some other examples, it will fail silently, which can be worse, causing errors that propagate into downstream statistical analyses.
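The shapes in point 2 can be reproduced with plain NumPy. This is only a sketch: `resample_last_axis` is a hypothetical stand-in that I assume mimics resampling along the last axis, not the actual private scipy helper. It also shows that transposing $X$ so that observations lie along the last axis makes the two resampled shapes line up.

```python
import numpy as np

def resample_last_axis(sample, n_resamples, rng):
    """Hypothetical helper: draw bootstrap resamples along the LAST axis."""
    n = sample.shape[-1]
    # One row of indices per resample, each row of length n
    i = rng.integers(0, n, size=(n_resamples, n))
    return sample[..., i]

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 2))   # n = 10 observations, d = 2 dimensions
y = rng.normal(size=(10,))

X_res = resample_last_axis(X, 7, rng)     # treats the d axis as observations
y_res = resample_last_axis(y, 7, rng)
Xt_res = resample_last_axis(X.T, 7, rng)  # observations on the last axis

print(X_res.shape, y_res.shape, Xt_res.shape)
```

With observations on the first axis, `X` is resampled to `(n, n_resamples, d)` while `y` becomes `(n_resamples, n)`, so the two no longer share a trailing observation axis. After transposing, both end in `n`.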

Thank you for your consideration!

 import numpy as np
 from scipy.stats import bootstrap
 
 def statistic(x, y):
     # Computes 1^T X^T y for x of shape (n, d) and y of shape (n,)
     v = np.ones((2,))
     vXt = v @ x.T
     return vXt @ y
 
 X = np.column_stack((0*np.ones((10,)), np.ones((10,))))
 Y = 100*np.ones((10,))
 data = [X,Y]
 n_resamples = 100
 statistic_original = statistic(*data)
 print("Original statistic (working shape): ", statistic_original)
 print("The following line fails due to bad shape!")
 bsd = bootstrap([X,Y], statistic=statistic, n_resamples=15, paired=True).bootstrap_distribution
 print(bsd, bsd.shape)
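One possible workaround, sketched here under my own assumptions rather than as the intended API usage, is to bootstrap the row indices instead of the arrays themselves. The wrapper `statistic_from_indices` and the random test data are hypothetical; `method='percentile'` and `vectorized=False` are chosen only to keep the example simple.

```python
import numpy as np
from scipy.stats import bootstrap

def statistic(x, y):
    # 1^T X^T y for x of shape (m, d) and y of shape (m,)
    v = np.ones((x.shape[1],))
    return v @ x.T @ y

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 2))
Y = rng.normal(size=(10,))

def statistic_from_indices(idx):
    # Hypothetical wrapper: resample rows of X and Y together via indices.
    # The cast guards against the indices arriving as floats.
    idx = np.asarray(idx).astype(int)
    return statistic(X[idx], Y[idx])

indices = np.arange(X.shape[0])
res = bootstrap((indices,), statistic_from_indices, n_resamples=7,
                vectorized=False, method='percentile')
print(res.bootstrap_distribution.shape)
```

Because each resample is a 1-D array of row indices, `X` and `Y` always stay paired and the statistic sees them in their original `(n, d)` and `(n,)` layouts.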

Labels: Documentation, scipy.stats
