ENH: stats.gaussian_kde: add method that returns marginal distribution #9932

Merged: 14 commits merged into scipy:main on Aug 18, 2022

Conversation

@h3jia (Contributor) commented Mar 10, 2019

Following the discussion at #9906, this adds pdf_marginal and logpdf_marginal to scipy.stats.kde, which are useful for visualizing high-dimensional distributions.

@rgommers added the scipy.stats and enhancement (A new feature or improvement) labels on Mar 11, 2019
h3jia added 3 commits March 10, 2019 18:40
Now it's not necessary to test the cases of more points / more data separately.
@h3jia (Contributor, Author) commented Mar 11, 2019

I just rewrote evaluate, logpdf, pdf_marginal and logpdf_marginal. All of them now use np.einsum instead of looping over the data or points. I don't think it's necessary for evaluate to do whitening. evaluate and logpdf still use self.inv_cov because it's available, while pdf_marginal and logpdf_marginal do not explicitly invert the covariance.

The einsum approach needs more memory, so it fails the 32-bit tests. I'll fix this later, probably by returning to the loop approach.
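
For readers following along, here is a minimal sketch of the vectorized evaluation idea described above (this is not the code from the PR; the arrays are placeholders following gaussian_kde's (d, n) layout):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
dataset = rng.standard_normal((2, 1000))  # (d, n) training data
points = rng.standard_normal((2, 500))    # (d, m) evaluation points
kde = stats.gaussian_kde(dataset)

# Pairwise differences between every data point and every input point:
# this (d, n, m) array is the memory bottleneck mentioned above.
diff = dataset[:, :, np.newaxis] - points[:, np.newaxis, :]
# Quadratic form (x_n - y_m)^T inv_cov (x_n - y_m) for every pair (n, m).
energy = 0.5 * np.einsum('ink,ij,jnk->nk', diff, kde.inv_cov, diff)
norm = np.sqrt(np.linalg.det(2 * np.pi * kde.covariance)) * kde.n
pdf = np.exp(-energy).sum(axis=0) / norm  # sum the kernels over the data

np.testing.assert_allclose(pdf, kde(points))  # agrees with the loop result
```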

@h3jia (Contributor, Author) commented Mar 11, 2019

@rgommers @ilayn

Now the problem is that, without techniques like tree-based computation, KDE can be expensive for large data sets.

Previously it looped over the data points or the input points. This needs less memory, but it is slow.

I just tried to use einsum or tensordot to vectorize it, but then it could not pass the tests because of the memory limit: for each input point we need to evaluate its distance to each data point, so we have to manipulate an array of shape (# of dims, # of data, # of inputs).

Could you tell me your opinions on this?

I think an intermediate but less clean way would be: when the # of data or # of inputs is too large, divide it into smaller blocks and do einsum on each block. This can be regarded as trading run time for memory; a sketch of the idea follows.
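
A minimal sketch of that chunking idea, assuming a hypothetical helper evaluate_chunked (not part of this PR):

```python
import numpy as np
from scipy import stats

def evaluate_chunked(kde, points, chunk_size=10_000):
    """Evaluate `kde` at `points` in blocks along the point axis; with a
    vectorized (einsum-style) kernel, peak memory is then bounded by the
    block size, roughly (d, n_data, chunk_size) instead of (d, n_data, m)."""
    points = np.atleast_2d(points)
    out = np.empty(points.shape[1])
    for start in range(0, points.shape[1], chunk_size):
        stop = start + chunk_size
        # An einsum-based evaluate would build only a (d, n, block)
        # intermediate here; we call the stock evaluator for illustration.
        out[start:stop] = kde(points[:, start:stop])
    return out

rng = np.random.default_rng(0)
kde = stats.gaussian_kde(rng.standard_normal((3, 2000)))
pts = rng.standard_normal((3, 50_000))
np.testing.assert_allclose(evaluate_chunked(kde, pts), kde(pts))
```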

@lzkelley (Contributor) commented

I was interested in this same behavior, but also numerous other features (e.g. different kernels, reflecting boundary conditions, some custom behavior in resampling... etc). I decided it really warranted a separate package, instead of trying to add all of this functionality to the existing scipy submodule. For anyone interested, that new package (highly beta) can be found here: kalepy.

That being said, I don't know what the optimal tradeoff is for putting large/complex submodules into the scipy package. Should the gaussian_kde functionality be kept as simple, general KDE functionality, leaving complex behavior for another package? Or should this scipy method be expanded (arbitrarily) to include all desired behavior? If anyone has suggestions/comments (e.g. @rgommers), I'm very curious what the consensus thoughts are. I'm also very happy to help incorporate all or some of the additional functionality into the scipy module, or, if the consensus is 'all', perhaps fold this new package into scipy.

@HerculesJack one technical comment: I ran into the same memory vs speed issue you describe above in the resample functionality, and found that implementing a 'chunking' procedure works extremely well (~10--100x speedup, with very small memory footprint). Resampling in chunks of ~1e5 optimized speed and memory requirements for dimensions (numbers of parameters) between 1 and 5. I did not have any issues in the evaluation functionality.

@mdhaber (Contributor) commented Jul 18, 2022

@HerculesJack thought I'd check in on this PR. Would you like to finish it up? What do you need from us?

@h3jia (Contributor, Author) commented Jul 18, 2022

@mdhaber This is quite an old thread. It did not get merged because my naive implementation could not pass the tests due to memory and/or run-time limits. I no longer need this feature for my own application, but I'm happy to help if you would like to take it over. I think you'd need to first review the code and resolve the conflicts, and then figure out a way to split the computation into batches so that it won't hit the memory cap.

@mdhaber (Contributor) commented Jul 18, 2022

Thanks for the summary @HerculesJack. In that case, I'll probably record the feature idea somewhere and close the PR.

@lzkelley I've thought about the idea of leaving KDE enhancements to another package, too. As a stats maintainer, I'm conflicted. On one hand, it's very interesting; e.g. adding back support for degenerate data and eliminating use of inv has gripped me. On the other, my attention is pulled in a lot of directions, so more complicated PRs tend to get proportionally less attention. In any case, I think KDE functionality would be able to grow a lot faster in a separate package; the downside is just that users would not already have it at their fingertips (without looking for it).

I may create a meta-issue summarizing KDE's several enhancement requests and inactive enhancement PRs. If so I'll let you know.

Looks like kalepy's last release was a while ago, but you created some issues a few months ago. How's the project coming? Are there any features in SciPy's KDE that it does not support?

@lzkelley (Contributor) commented Jul 18, 2022

@mdhaber

> Looks like kalepy's last release was a while ago, but you created some issues a few months ago. How's the project coming? Are there any features in SciPy's KDE that it does not support?

kalepy is quite functional, I would say. Unfortunately I haven't had much time to keep developing / improving it, but I do frequently use it in a number of different projects. Currently, its functionality is a superset of scipy's, but there are definitely many more features that could be implemented. The most pressing improvement would be additional optimization. There's a pending PR for multithreading, for example, that needs to be looked at (I'm more familiar with MPI parallelization, so I would need to learn more about multithreading per se).

@mdhaber (Contributor) commented Aug 10, 2022

Thanks again @HerculesJack. One last question: this PR adds methods for computing the pdf/logpdf of a marginal distribution of a KDE distribution. I'm thinking that rather than having separate methods for pdf/logpdf only, it would make sense to have a single method, e.g. marginalized, that simply returns a KDE object representing the marginalized distribution. Because the marginal of a multivariate normal distribution is itself a multivariate normal distribution, with mean and covariance given by a subset of indices of the original mean and covariance matrix, the implementation is quite simple at its core: just extract the relevant indices of the dataset and covariance matrix and instantiate a new KDE instance with that information. The pdf/logpdf code would not need to be copied/rewritten. Does this make sense?
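
As a rough illustration of that approach (the helper name marginal_kde is hypothetical, and this is a sketch of the idea rather than the code that was eventually merged): slicing the dataset and reusing the original bandwidth factor gives a KDE whose covariance is the corresponding sub-block of the original.

```python
import numpy as np
from scipy import stats

def marginal_kde(kde, dimensions):
    """Build the marginal KDE over a subset of dimensions by slicing the
    dataset and reusing the original bandwidth factor and weights."""
    dims = np.atleast_1d(dimensions)
    # A scalar bw_method is used directly as the bandwidth factor, so the
    # new KDE's covariance is the matching sub-block of the original's.
    return stats.gaussian_kde(kde.dataset[dims],
                              bw_method=kde.covariance_factor(),
                              weights=kde.weights)

rng = np.random.default_rng(0)
full = stats.gaussian_kde(rng.standard_normal((3, 500)))
marg = marginal_kde(full, [0, 2])          # keep dimensions 0 and 2
print(marg(rng.standard_normal((2, 5))))   # pdf of the 2-d marginal
```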

@h3jia (Contributor, Author) commented Aug 10, 2022

> I'm thinking that rather than having separate methods for pdf/logpdf only, it would make sense to have a single method, e.g. marginalized, that simply returns a KDE object representing the marginalized distribution. [...] Does this make sense?

Yes, that makes sense to me!

@mdhaber (Contributor) commented Aug 10, 2022

OK, I resolved merge conflicts and added a commit (0157c13) that shows the concept of the new approach. The tests (including one showing that old and new approaches produce the same PDF values) passed, so I'll remove the old approach.

I'm going to send an email about this to the mailing list with three questions.

  1. What should the method that returns the marginal distribution be called? marginalize is a verb (good), but may not be acceptable due to other meanings of the word. marginal, get_marginal, or get_marginal_distribution would also work. get_marginal_distribution seems clear, and the method names of gaussian_kde are not super concise already (e.g. integrate_box_1d, integrate_gaussian), so I think that's my suggestion.
  2. In the original PR, the axis argument specified the dimensions that get "marginalized out" (i.e. those one would integrate over to obtain the marginal distribution). I changed this already to the opposite - the ones corresponding to the "marginal variables", which are kept. This leads to two questions:
    a. Should the user specify the variables that get "marginalized out" or the "marginal variables" that are to be kept? Update: I think the user should specify the indices of the ones to be kept, the marginal variables.
    b. Either way, the name axis is not really appropriate because it is used in a different way throughout NumPy and SciPy to refer to the axis/axes of an array, whereas here we are using the argument as indices along the zeroth axis of the dataset array (improving efficiency, e.g. by avoiding re-calculation of the covariance matrix and/or its inverse, would be a separate PR). What should the name of this parameter be instead? I'd suggest marginal_variables.

This excerpt from Wikipedia might help those not already familiar with the terminology (like me):
[image: Wikipedia excerpt on marginal distributions]

Comment on lines 757 to 763

```python
def marginalize(self, axis):
    """Generate a marginal KDE distribution

    Parameters
    ----------
    axis : int or 1-d array_like
        Axis (axes) along which the margins are to be computed.
```
Contributor:

These are subject to discussion. See questions about function name, parameter name, and parameter meaning.

Member:

I would maybe go with d, dimensions or simply marginals. I also agree with you that this parameter should represent the dimensions we want to keep.

@mdhaber changed the title from "Add pdf_marginal and logpdf_marginal to kde" to "ENH: stats.gaussian_kde: add method that returns marginal distribution" on Aug 10, 2022
@tupui (Member) left a comment

Nice addition, I am +1 even if this is fairly immediate to do in user land.


@mdhaber (Contributor) commented Aug 18, 2022

Email to the mailing list sent 8/11. No response so far, but we have a contributor and two stats maintainers who think this is a good idea. Yes, the implementation is simple, but I think it is very useful to users who don't happen to know that the marginal distribution of a multivariate normal takes such a simple form.

@tupui I'm happy with this if tests pass and the latest changes look good to you.

@tupui (Member) left a comment

LGTM with the last updates. Thank you @HerculesJack for the PR and @mdhaber for pushing this to the finish line.

@tupui merged commit bf9e262 into scipy:main on Aug 18, 2022
@tupui added this to the 1.10.0 milestone on Aug 18, 2022