ENH: stats.gaussian_kde: add method that returns marginal distribution #9932
Conversation
Now it's not necessary to test the cases of more points / more data separately.
Now the problem is, without things like tree-based computation, KDE can be expensive for large datasets. Previously it was looping over the data points or the input points; that needs less memory, but I think it should be slower. I just tried this out. Could you tell me your opinions on this? I think an intermediate but less clean way would be, when the number of data points or input points is too large, to divide them into smaller blocks and process each block in turn.
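The block-splitting idea described above can be sketched in user land. This is a hedged sketch, not the PR's actual code; the variable names (`block_size`, `splits`) are my own, and only public `scipy.stats.gaussian_kde` behavior is assumed:

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
data = rng.normal(size=(2, 1000))    # 2-d dataset with 1000 samples
kde = gaussian_kde(data)

points = rng.normal(size=(2, 5000))  # many evaluation points

# Evaluate in blocks: each kde() call builds an (n_data, block) sized
# intermediate, so smaller blocks cap peak memory at the cost of some
# Python-loop overhead.
block_size = 512
splits = list(range(block_size, points.shape[1], block_size))
pdf = np.concatenate(
    [kde(block) for block in np.array_split(points, splits, axis=1)]
)
```

The concatenated result matches a single full `kde(points)` call, so only the memory profile changes, not the output.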
I was interested in this same behavior, but also in numerous other features (e.g. different kernels, reflecting boundary conditions, some custom behavior in resampling, etc.). I decided it really warranted a separate package instead of trying to add all of this functionality to the existing gaussian_kde. That being said, I don't know what the optimal tradeoff is for putting large/complex submodules into SciPy. @HerculesJack, one technical comment: I ran into the same memory vs. speed issue you describe above in kalepy.
@HerculesJack, thought I'd check in on this PR. Would you like to finish it up? What do you need from us?
@mdhaber This is quite an old thread. It did not get merged because my naive implementation could not pass the tests due to memory and/or run time limits. I no longer need this feature for my own application, but am happy to help if you would like to take it over. I think you'd need to first review the code and resolve the conflicts, and then figure out a way to split the computation into batches so that it won't hit the memory cap.
Thanks for the summary @HerculesJack. In that case, I'll probably record the feature idea somewhere and close the PR. @lzkelley I've thought about the idea of leaving KDE enhancements to another package, too. As a stats maintainer, I'm conflicted. On one hand, it's very interesting, e.g. adding back support for degenerate data and eliminating some current workarounds. I may create a meta-issue summarizing KDE's several enhancement requests and inactive enhancement PRs; if so, I'll let you know. It looks like kalepy's last release was a while ago, but you created some issues a few months ago. How's the project coming? Are there any features in SciPy's KDE that it does not support?
Thanks again @HerculesJack. One last question: this PR adds methods for computing the pdf/logpdf of a marginal distribution of a KDE distribution. I'm thinking that, rather than having separate methods for the pdf/logpdf only, it would make sense to have a single method that returns the marginal distribution itself.
Yes, that makes sense to me!
OK, I resolved merge conflicts and added a commit (0157c13) that shows the concept of the new approach. The tests (including one showing that the old and new approaches produce the same PDF values) passed, so I'll remove the old approach. I'm going to send an email about this to the mailing list with three questions.
This excerpt from Wikipedia might help those not already familiar with the terminology (like me):
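For readers without the Wikipedia page at hand, a sketch of the standard definitions (my own summary, not text from the PR): marginalizing a joint density integrates out the unused coordinates, and for a Gaussian KDE the result has a simple closed form.

```latex
% Marginal of a joint density p(x, y): integrate out y
p_X(x) = \int p(x, y) \, dy

% A Gaussian KDE is a mixture of normals centered on the data points:
\hat{f}(\mathbf{x}) = \frac{1}{n} \sum_{i=1}^{n}
    \mathcal{N}\!\left(\mathbf{x};\ \mathbf{x}_i,\ \mathbf{H}\right)

% Marginalizing over a subset S of the dimensions keeps the
% corresponding coordinates of each data point and the S-by-S
% sub-block of the bandwidth matrix H:
\hat{f}_S(\mathbf{x}_S) = \frac{1}{n} \sum_{i=1}^{n}
    \mathcal{N}\!\left(\mathbf{x}_S;\ \mathbf{x}_{i,S},\ \mathbf{H}_{SS}\right)
```

In other words, the marginal of a Gaussian KDE is itself a Gaussian KDE on the retained coordinates, which is the "simple form" mentioned later in this thread.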
scipy/stats/_kde.py (outdated)

    def marginalize(self, axis):
        """Generate a marginal KDE distribution

        Parameters
        ----------
        axis : int or 1-d array_like
            Axis (axes) along which the margins are to be computed.
These are subject to discussion. See questions about function name, parameter name, and parameter meaning.
I would maybe go with d, dimensions, or simply marginals. I also agree with you that this parameter should represent the dimensions we want to keep.
Nice addition, I am +1 even if this is fairly immediate to do in user land.
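The "immediate to do in user land" substitute is simply refitting a KDE on the retained dimensions. A hedged sketch (my own names; note this re-estimates the bandwidth from the lower-dimensional data, so it is close to, but not identical to, the exact marginal of the original KDE, which would reuse the sub-block of its covariance):

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(1)
# 3-d dataset in gaussian_kde's (n_dims, n_samples) layout
data = rng.multivariate_normal([0, 0, 0], np.eye(3), size=2000).T

kde = gaussian_kde(data)

# User-land "marginal" over dimensions 0 and 2: fit a new KDE on the
# retained rows of the dataset.
dims = [0, 2]
marginal = gaussian_kde(data[dims])

# Evaluate the 2-d marginal density at the origin.
p = marginal(np.zeros((2, 1)))
```

The difference from an exact marginal is usually small, but it is exactly the kind of subtlety a built-in method can get right once instead of every user rediscovering it.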
Email to the mailing list sent 8/11. No response so far, but we have a contributor and two stats maintainers who think this is a good idea. Yes, the implementation is simple, but I think it is very useful to users who don't happen to know that the marginal distribution of a multivariate normal takes such a simple form. @tupui I'm happy with this if tests pass and the latest changes look good to you.
LGTM with the last updates. Thank you @HerculesJack for the PR and @mdhaber for pushing this to the finish line.
Following the discussion at #9906, added pdf_marginal and logpdf_marginal to scipy.stats.kde, which are useful for visualizing high dimensional distributions.