ENH: stats.gaussian_kde: add method that returns marginal distribution #9932

Merged: 14 commits merged into scipy:main on Aug 18, 2022

Conversation

@h3jia (Contributor) commented Mar 10, 2019

Following the discussion at #9906, this adds pdf_marginal and logpdf_marginal to scipy.stats.kde, which are useful for visualizing high-dimensional distributions.

@rgommers added the scipy.stats and enhancement (A new feature or improvement) labels on Mar 11, 2019
h3jia added 3 commits March 10, 2019 18:40
Now it's not necessary to test the cases of more points / more data separately.
@h3jia (Contributor, Author) commented Mar 11, 2019

I just rewrote evaluate, logpdf, pdf_marginal and logpdf_marginal. All of them now use np.einsum instead of looping over the data or points. I don't think it's necessary for evaluate to do whitening. evaluate and logpdf still use self.inv_cov because it's available, while pdf_marginal and logpdf_marginal do not explicitly invert the covariance.

The einsum approach needs more memory, so it fails the 32-bit tests. I'll fix this later, probably by returning to the loop approach.
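
For readers following along, here is a minimal sketch of the vectorized evaluation idea described above (this is not the code from the PR; the arrays are placeholders following gaussian_kde's (d, n) layout):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
dataset = rng.standard_normal((2, 1000))  # (d, n) training data
points = rng.standard_normal((2, 500))    # (d, m) evaluation points
kde = stats.gaussian_kde(dataset)

# Pairwise differences between every data point and every input point:
# this (d, n, m) array is the memory bottleneck mentioned above.
diff = dataset[:, :, np.newaxis] - points[:, np.newaxis, :]
# Quadratic form (x_n - y_m)^T inv_cov (x_n - y_m) for every pair (n, m).
energy = 0.5 * np.einsum('ink,ij,jnk->nk', diff, kde.inv_cov, diff)
norm = np.sqrt(np.linalg.det(2 * np.pi * kde.covariance)) * kde.n
pdf = np.exp(-energy).sum(axis=0) / norm  # sum the kernels over the data

np.testing.assert_allclose(pdf, kde(points))  # agrees with the loop result
```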

@h3jia (Contributor, Author) commented Mar 11, 2019

@rgommers @ilayn

Now the problem is that, without techniques like tree-based computation, KDE can be expensive for large data sets.

Previously it looped over the data points or the input points. This needs less memory, but it is slow.

I just tried to use einsum or tensordot to vectorize it, but then it could not pass the tests because of the memory limit: for each input point we need to evaluate its distance to each data point, so we have to manipulate an array of shape (# of dims, # of data, # of inputs).

Could you tell me your opinions on this?

I think an intermediate but less clean way would be: when the # of data or # of inputs is too large, divide it into smaller blocks and do einsum on each block. This can be regarded as trading run time for memory; a sketch of the idea follows.
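
A minimal sketch of that chunking idea, assuming a hypothetical helper evaluate_chunked (not part of this PR):

```python
import numpy as np
from scipy import stats

def evaluate_chunked(kde, points, chunk_size=10_000):
    """Evaluate `kde` at `points` in blocks along the point axis; with a
    vectorized (einsum-style) kernel, peak memory is then bounded by the
    block size, roughly (d, n_data, chunk_size) instead of (d, n_data, m)."""
    points = np.atleast_2d(points)
    out = np.empty(points.shape[1])
    for start in range(0, points.shape[1], chunk_size):
        stop = start + chunk_size
        # An einsum-based evaluate would build only a (d, n, block)
        # intermediate here; we call the stock evaluator for illustration.
        out[start:stop] = kde(points[:, start:stop])
    return out

rng = np.random.default_rng(0)
kde = stats.gaussian_kde(rng.standard_normal((3, 2000)))
pts = rng.standard_normal((3, 50_000))
np.testing.assert_allclose(evaluate_chunked(kde, pts), kde(pts))
```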

@lzkelley (Contributor) commented

I was interested in this same behavior, but also numerous other features (e.g. different kernels, reflecting boundary conditions, some custom behavior in resampling... etc). I decided it really warranted a separate package, instead of trying to add all of this functionality to the existing scipy submodule. For anyone interested, that new package (highly beta) can be found here: kalepy.

That being said, I don't know what the optimal tradeoff is for putting large/complex submodules into the scipy package. Should the gaussian_kde functionality be kept as simple, general KDE functionality, leaving complex behavior for another package? Or should this scipy method be expanded (arbitrarily) to include all desired behavior? If anyone has suggestions/comments (e.g. @rgommers), I'm very curious what the consensus thoughts are. I'm also very happy to help incorporate all or some of the additional functionality into the scipy module, or, if the consensus is 'all', perhaps fold this new package into scipy.

@HerculesJack one technical comment: I ran into the same memory vs speed issue you describe above in the resample functionality, and found that implementing a 'chunking' procedure works extremely well (~10--100x speedup, with very small memory footprint). Resampling in chunks of ~1e5 optimized speed and memory requirements for dimensions (numbers of parameters) between 1 and 5. I did not have any issues in the evaluation functionality.

@mdhaber (Contributor) commented Jul 18, 2022

@HerculesJack thought I'd check in on this PR. Would you like to finish it up? What do you need from us?

@h3jia (Contributor, Author) commented Jul 18, 2022

@mdhaber This is quite an old thread. It did not get merged because my naive implementation could not pass the tests due to memory and/or run-time limits. I no longer need this feature for my own application, but I'm happy to help if you would like to take it over. I think you'd need to first review the code and resolve the conflicts, and then figure out a way to split the computation into batches so that it won't hit the memory cap.

@mdhaber (Contributor) commented Jul 18, 2022

Thanks for the summary @HerculesJack. In that case, I'll probably record the feature idea somewhere and close the PR.

@lzkelley I've thought about the idea of leaving KDE enhancements to another package, too. As a stats maintainer, I'm conflicted. On one hand, it's very interesting; e.g. adding back support for degenerate data and eliminating use of inv has gripped me. On the other, my attention is pulled in a lot of directions, so more complicated PRs tend to get proportionally less attention. In any case, I think KDE functionality would be able to grow a lot faster in a separate package; the downside is just that users would not already have it at their fingertips (without looking for it).

I may create a meta-issue summarizing KDE's several enhancement requests and inactive enhancement PRs. If so I'll let you know.

Looks like kalepy's last release was a while ago, but you created some issues a few months ago. How's the project coming? Are there any features in SciPy's KDE that it does not support?

@lzkelley (Contributor) commented Jul 18, 2022

@mdhaber

> Looks like kalepy's last release was a while ago, but you created some issues a few months ago. How's the project coming? Are there any features in SciPy's KDE that it does not support?

kalepy is quite functional, I would say. Unfortunately I haven't had much time to keep developing / improving it, but I do frequently use it in a number of different projects. Currently, its functionality is a superset of scipy's, but there are definitely many more features that could be implemented. The most pressing improvement would be additional optimization. There's a pending PR for multithreading, for example, that needs to be looked at (I'm more familiar with MPI parallelization, so I would need to learn more about multithreading per se).

@mdhaber (Contributor) commented Aug 10, 2022

Thanks again @HerculesJack. One last question: this PR adds methods for computing the pdf/logpdf of a marginal distribution of a KDE distribution. I'm thinking that rather than having separate methods for pdf/logpdf only, it would make sense to have a single method, e.g. marginalized, that simply returns a KDE object representing the marginalized distribution. Because the marginal of a multivariate normal distribution is itself a multivariate normal distribution, with mean and covariance given by a subset of indices of the original mean and covariance matrix, the implementation is quite simple at its core: just extract the relevant indices of the dataset and covariance matrix and instantiate a new KDE instance with that information. The pdf/logpdf code would not need to be copied/rewritten. Does this make sense?
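
As a rough illustration of that approach (the helper name marginal_kde is hypothetical, and this is a sketch of the idea rather than the code that was eventually merged): slicing the dataset and reusing the original bandwidth factor gives a KDE whose covariance is the corresponding sub-block of the original.

```python
import numpy as np
from scipy import stats

def marginal_kde(kde, dimensions):
    """Build the marginal KDE over a subset of dimensions by slicing the
    dataset and reusing the original bandwidth factor and weights."""
    dims = np.atleast_1d(dimensions)
    # A scalar bw_method is used directly as the bandwidth factor, so the
    # new KDE's covariance is the matching sub-block of the original's.
    return stats.gaussian_kde(kde.dataset[dims],
                              bw_method=kde.covariance_factor(),
                              weights=kde.weights)

rng = np.random.default_rng(0)
full = stats.gaussian_kde(rng.standard_normal((3, 500)))
marg = marginal_kde(full, [0, 2])          # keep dimensions 0 and 2
print(marg(rng.standard_normal((2, 5))))   # pdf of the 2-d marginal
```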

@h3jia (Contributor, Author) commented Aug 10, 2022

> I'm thinking that rather than having separate methods for pdf/logpdf only, it would make sense to have a single method, e.g. marginalized, that simply returns a KDE object representing the marginalized distribution. [...] Does this make sense?

Yes, that makes sense to me!

@mdhaber (Contributor) commented Aug 10, 2022

OK, I resolved merge conflicts and added a commit (0157c13) that shows the concept of the new approach. The tests (including one showing that old and new approaches produce the same PDF values) passed, so I'll remove the old approach.

I'm going to send an email about this to the mailing list with three questions.

  1. What should the method that returns the marginal distribution be called? marginalize is a verb (good), but may not be acceptable due to other meanings of the word. marginal, get_marginal, or get_marginal_distribution would also work. get_marginal_distribution seems clear, and the method names of gaussian_kde are not super concise already (e.g. integrate_box_1d, integrate_gaussian), so I think that's my suggestion.
  2. In the original PR, the axis argument specified the dimensions that get "marginalized out" (i.e. those one would integrate over to obtain the marginal distribution). I changed this already to the opposite - the ones corresponding to the "marginal variables", which are kept. This leads to two questions:
    a. Should the user specify the variables that get "marginalized out" or the "marginal variables" that are to be kept? Update: I think the user should specify the indices of the ones to be kept, the marginal variables.
    b. Either way, the name axis is not really appropriate because it is used in a different way throughout NumPy and SciPy to refer to the axis/axes of an array, whereas here we are using the argument as indices along the zeroth axis of the dataset array (improving efficiency, e.g. by avoiding re-calculation of the covariance matrix and/or its inverse, would be a separate PR). What should the name of this parameter be instead? I'd suggest marginal_variables.

This excerpt from Wikipedia might help those not already familiar with the terminology (like me):
[image: Wikipedia excerpt on marginal distributions]

Comment on lines 757 to 763

```python
def marginalize(self, axis):
    """Generate a marginal KDE distribution

    Parameters
    ----------
    axis : int or 1-d array_like
        Axis (axes) along which the margins are to be computed.
```
Contributor:

These are subject to discussion. See questions about function name, parameter name, and parameter meaning.

Member:

I would maybe go with d, dimensions or simply marginals. I also agree with you that this parameter should represent the dimensions we want to keep.

@mdhaber changed the title from "Add pdf_marginal and logpdf_marginal to kde" to "ENH: stats.gaussian_kde: add method that returns marginal distribution" on Aug 10, 2022
@tupui (Member) left a comment

Nice addition, I am +1 even if this is fairly immediate to do in user land.


@mdhaber (Contributor) commented Aug 18, 2022

Email to the mailing list sent 8/11. No response so far, but we have a contributor and two stats maintainers who think this is a good idea. Yes, the implementation is simple, but I think it is very useful to users who don't happen to know that the marginal distribution of a multivariate normal takes such a simple form.

@tupui I'm happy with this if tests pass and the latest changes look good to you.

@tupui (Member) left a comment

LGTM with the last updates. Thank you @HerculesJack for the PR and @mdhaber for pushing this to the finish line.

@tupui merged commit bf9e262 into scipy:main on Aug 18, 2022
@tupui added this to the 1.10.0 milestone on Aug 18, 2022