
Some minor fixes for stats.wishart addition. #4313


Merged
8 commits merged on Dec 31, 2014

Conversation

rgommers
Member

  • fixes to docstrings (references, see also, added in 0.16.0)
  • use np.linalg.slogdet where appropriate
  • group tests for multivariate distributions

@rgommers rgommers added maintenance Items related to regular maintenance tasks scipy.stats labels Dec 26, 2014
@rgommers rgommers added this to the 0.16.0 milestone Dec 26, 2014
@ev-br
Member

ev-br commented Dec 26, 2014

Might also want to mention it in the release notes already.

@rgommers
Member Author

Good point, done.

@coveralls

Coverage Status

Coverage increased (+0.0%) when pulling ff92bb8 on rgommers:stats-cleanup into dc57092 on scipy:master.

rgommers pushed a commit to rgommers/scipy that referenced this pull request Dec 28, 2014
… logdet(x)

These are subtly different:

    In [46]: 2 * np.sum(np.log(linalg.cholesky(scale, lower=True).diagonal()))
    ...      - np.log(np.linalg.det(scale))
    Out[46]: 8.8817841970012523e-16

Use of ``np.linalg.slogdet`` is preferred.  Related change in gh-4313.
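As a side note (my own sketch, not part of the PR): beyond readability, `np.linalg.slogdet` also sidesteps overflow in the determinant itself, which `np.log(np.linalg.det(...))` cannot:

```python
import numpy as np

# For a large, well-conditioned matrix the determinant overflows float64,
# but the log-determinant stays finite.
a = np.eye(400) * 10.0                    # det(a) = 10**400 > float64 max
with np.errstate(over="ignore"):
    naive = np.log(np.linalg.det(a))      # det overflows to inf
sign, logdet = np.linalg.slogdet(a)       # computed without forming det
print(np.isinf(naive), np.isclose(logdet, 400 * np.log(10.0)))  # True True
```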
@rgommers
Member Author

Added a couple more bug fixes.

@argriffing
Contributor

The commit rgommers@e3150fe may be a performance regression. It adds extra slogdet calls which are expensive. One of the points of using Cholesky decomposition is that it is faster than the LU decomposition used by slogdet.

@rgommers
Member Author

It's indeed slower:

In [16]: dim = 4

In [17]: scale = np.diag(np.arange(dim)+1)

In [18]: scale[np.tril_indices(dim, k=-1)] = np.arange(dim * (dim-1) // 2)

In [19]: scale = np.dot(scale.T, scale)

In [20]: scale
Out[20]: 
array([[11, 14, 18, 12],
       [14, 24, 26, 16],
       [18, 26, 34, 20],
       [12, 16, 20, 16]])

In [21]: 2 * np.sum(np.log(linalg.cholesky(scale, lower=True).diagonal()))
Out[21]: 6.3561076606958906

In [22]: %timeit 2 * np.sum(np.log(linalg.cholesky(scale, lower=True).diagonal()))
10000 loops, best of 3: 28.7 µs per loop

In [23]: %timeit np.linalg.slogdet(scale)
10000 loops, best of 3: 55.4 µs per loop

I don't think 30 µs for something that is usually not called in inner loops is important though - I much prefer the maintainability/readability of slogdet. Note also that this particular computation is done on scale, which is never very large.
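For reference, the two routes being timed above compute the same quantity; a minimal self-contained sketch rebuilding the matrix from the session:

```python
import numpy as np
from scipy import linalg

# Rebuild the symmetric positive definite `scale` from the timings above.
dim = 4
scale = np.diag(np.arange(dim) + 1.0)
scale[np.tril_indices(dim, k=-1)] = np.arange(dim * (dim - 1) // 2)
scale = scale.T @ scale

# log|scale| from the diagonal of the Cholesky factor ...
chol_logdet = 2 * np.sum(np.log(linalg.cholesky(scale, lower=True).diagonal()))
# ... versus the general LU-based route.
sign, lu_logdet = np.linalg.slogdet(scale)

print(sign, np.allclose(chol_logdet, lu_logdet))  # 1.0 True
```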

@argriffing
Contributor

I agree that 30 microseconds wouldn't be important except in an inner loop. In addition to the nitpick that the slowdown would be more like 3x than 2x because the Cholesky decomposition is computed anyway, I'd vote for log likelihood functions as candidates in the "most likely to be used in an inner loop" competition. Regarding the largeness of scale I'm not sure what sizes would be reasonable, but for large sizes this would be a bottleneck with no loop required. I'll ping @ChadFulton for his views on the cost/benefit of adding an slogdet call vs. using the diagonal of an already-computed Cholesky matrix.

@ev-br
Member

ev-br commented Dec 29, 2014

FWIW, I'm also in favor of reusing the Cholesky decomposition here since it has been computed anyway. The readability cost is not too big IMO, and there is a cognitive burden to both "what is slogdet" and "why does it not use Cholesky, is there a reason for it?".

@ChadFulton
Contributor

In principle, both maximum likelihood models and Metropolis-Hastings Bayesian computation of models fitting a multivariate normal may require a potentially large number of logpdf calls, where the size of the scale matrix would vary with the dimension of the normal.

I haven't personally used such a model (the models I am using take advantage of Gibbs sampling methods, which instead require lots of calls to rvs), but PyMC has the Wishart distribution available, and they use the Metropolis-Hastings approach. I looked at their PDF calculation, and they use log(det(scale)), where det eventually resolves to np.linalg.det, so maybe the performance difference isn't too critical.

On the other hand, given that the Cholesky is already being computed, the determinant calculation is essentially a constant (or grows very slowly) with respect to the size of scale, whereas slogdet will grow in some non-linear way.
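The reuse argument can be sketched like this (illustrative helper name, not the scipy internals): once a logpdf routine has factored scale, the log-determinant is an O(n) byproduct of the factor's diagonal:

```python
import numpy as np
from scipy import linalg

def factor_and_logdet(scale):
    # Hypothetical helper: factor once (the expensive O(n^3) step),
    # then read log|scale| off the factor's diagonal essentially for free.
    C = linalg.cholesky(scale, lower=True)
    log_det_scale = 2 * np.sum(np.log(np.diag(C)))
    return C, log_det_scale

scale = np.array([[11., 14., 18., 12.],
                  [14., 24., 26., 16.],
                  [18., 26., 34., 20.],
                  [12., 16., 20., 16.]])
C, ld = factor_and_logdet(scale)
print(np.isclose(ld, np.linalg.slogdet(scale)[1]))  # True
```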

Thanks to both of you for your work getting this merged.

@rgommers
Member Author

OK, it seems you all agree. Then let's fix it by removing slogdet everywhere - the issue was that the test was doing assert_equal for something computed differently for the frozen dist than for the regular one. I'll update.

Ralf Gommers added 7 commits December 30, 2014 11:50
Fix docstring formatting and use np.linalg.slogdet where indicated by TODOs.
… funcs.

This raises DeprecationWarnings with recent numpy; these should clearly be ints.
… logdet(x)

These are subtly different:

    In [46]: 2 * np.sum(np.log(linalg.cholesky(scale, lower=True).diagonal()))
    ...      - np.log(np.linalg.det(scale))
    Out[46]: 8.8817841970012523e-16

Use of ``np.linalg.slogdet`` is preferred for readability/maintainability.
…gdet.

This is faster, see discussion on gh-4313.  Using it everywhere fixes
the discrepancy between the normal and frozen distribution.
@rgommers
Member Author

OK updated.

I think it would be useful at some point to either speed up np.linalg.slogdet or add an slogdet to scipy.linalg, because the last commit clearly doesn't help readability.

@coveralls

Coverage Status

Coverage increased (+0.0%) when pulling a2a1aa3 on rgommers:stats-cleanup into 08238a5 on scipy:master.

@argriffing
Contributor

Looks good to me.

I think it would be useful at some point to either speed up np.linalg.slogdet

This speedup applies only to positive semidefinite matrices (and given the lack of singular Cholesky in numpy/scipy, only to positive definite matrices in practice).

Stepping back, the larger issue is that although numpy and scipy are amazingly good at n-dimensional array hacking, interfacing with BLAS/LAPACK, and automatically dealing with dtypes and dispatching to the correct underlying functions (e.g. [sdcz]gemm), they have absolutely no dispatch scheme for matrix structure except for ad-hoc Python function name differences or possibly one-off keyword arguments. In other words, if LAPACK functions are named like XYYZZZ where X is s/d/c/z indicating dtype, YY indicates matrix structure and ZZZ indicates the computation, then numpy/scipy automatically handles X via dtype but handles YYZZZ through ad-hoc means. Using this notation, the numpy logdet can deal with YY == 'GE', but the Cholesky logdet would require a structure more like YY == 'PO'.
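In user code, that missing structure dispatch can only be approximated by hand; a hypothetical sketch in the spirit of the XYYZZZ naming (the `structure` parameter is invented for illustration, not an existing numpy/scipy API):

```python
import numpy as np
from scipy import linalg

def logdet(a, structure="GE"):
    # Hypothetical structure-aware log-determinant. "GE" (general) falls
    # back to the LU-based slogdet; "PO" (symmetric positive definite)
    # takes the cheaper Cholesky route.
    if structure == "PO":
        return 2 * np.sum(np.log(np.diag(linalg.cholesky(a, lower=True))))
    sign, ld = np.linalg.slogdet(a)
    if sign <= 0:
        raise ValueError("log-determinant undefined for this matrix")
    return ld

a = np.array([[4.0, 2.0], [2.0, 3.0]])   # SPD, det = 8
print(np.isclose(logdet(a, "GE"), logdet(a, "PO")))  # True
```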

@rgommers
Member Author

@argriffing thanks for the details, and interesting perspective on matrix structures.

@argriffing
Contributor

I guess this PR would conflict with #4318, I'm not sure the best way to deal with that.

@rgommers
Member Author

I'd merge this one first; it's good to go and the other one is easy to rebase.

>>> plt.contourf(x, y, rv.pdf(pos))
>>> fig2 = plt.figure()
>>> ax2 = fig2.add_subplot(111)
>>> ax2.contourf(x, y, rv.pdf(pos))
Member

Any reason not to add plt.show(), so that the plot actually renders in the docs?

Member

I get a plot in the HTML docs without including plt.show().

Member

Interesting. What is the complete (correct) way of building the HTML docs you're using, what version of sphinx etc?

Member

I use make html in the doc directory. The numpydoc extension in the numpy source code (i.e. numpy/doc/sphinxext) is in my python path. Sphinx version is 1.2.3.

Member Author

I do the same, make html is the easiest. I don't have numpy/doc/sphinxext in my path but have simply installed numpydoc.

Member Author

plt.show indeed isn't necessary on my local build, but for the ones on docs.scipy.org it does seem to be. May depend on the version of the plot-directive extension used.

plt.show() is used in most docstring examples, I'll add it.

Member

I do make html-scipyorg and I do seem to need plt.show(). Not sure what is different between the two.

This is surely a rather minor point, but all else being equal I'd rather have it implicit in docstrings (ditto for import matplotlib.pyplot as plt).

In any case, it's OT for this PR; my original comment is moot.

Member Author

Hmm, I just checked and for all multivariate distributions adding plt.show() breaks generating the plots. So I suggest to leave it as is in this PR, and open a new issue for reconciling the different versions of building the docs (and/or a check on minimum Sphinx and numpydoc versions).

Member

+1

@ev-br
Member

ev-br commented Dec 31, 2014

Merging, thanks Ralf, all.

ev-br added a commit that referenced this pull request Dec 31, 2014
Some minor fixes for stats.wishart addition.
@ev-br ev-br merged commit 75337bc into scipy:master Dec 31, 2014
@rgommers rgommers deleted the stats-cleanup branch December 31, 2014 14:41
@ev-br
Member

ev-br commented Dec 31, 2014

The issue for tracking the docs build: #4346
