NOCATS: Categorical splits for tree-based learners (ctnd.) #12866
…causing all kinds of problems. Now safe_realloc requires the item size to be explicitly provided. Also, it can allocate arrays of pointers to any type by casting to void*.
…to categorical variables. Replaced the threshold attribute of SplitRecord and Node with SplitValue.
…hat defaults to -1 for each feature (indicating non-categorical).
…ediction with trees. Also introduced category caches for quick evaluation of categorical splits.
…he best categorical split.
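The commits above describe a split value that can hold either a numeric threshold or a categorical split, with -1 marking a feature as non-categorical. A minimal sketch of that idea in plain Python follows. It assumes a bitmask encoding in which bit `c` is set when category `c` goes to the left child; the names `make_category_mask` and `goes_left` are hypothetical, and this is not the PR's Cython code.

```python
def make_category_mask(left_categories):
    """Pack the categories that go left into one integer bitmask (assumed encoding)."""
    mask = 0
    for category in left_categories:
        mask |= 1 << category
    return mask


def goes_left(value, split_value, n_categories):
    """Decide the branch for one sample at one node.

    n_categories == -1 marks a non-categorical feature, so split_value is
    treated as a numeric threshold. Otherwise split_value is a bitmask and
    value is an integer category code.
    """
    if n_categories < 0:
        return value <= split_value
    return (split_value >> int(value)) & 1 == 1


# Categories 0 and 3 go left, 1 and 2 go right.
mask = make_category_mask([0, 3])
routed = [goes_left(c, mask, n_categories=4) for c in range(4)]
```

Testing membership is a shift and a bitwise AND, which is why a bitmask is a common choice for evaluating categorical splits quickly at prediction time.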
Wow. Good on you for taking this on! |
I̶ ̶a̶s̶s̶u̶m̶e̶ ̶t̶h̶e̶ ̶a̶p̶p̶v̶e̶y̶o̶r̶ ̶f̶a̶i̶l̶u̶r̶e̶ ̶i̶s̶ ̶u̶n̶r̶e̶l̶a̶t̶e̶d̶ ̶t̶o̶ ̶t̶h̶i̶s̶ ̶P̶R̶ ̶I̶ ̶s̶u̶p̶p̶o̶s̶e̶.̶ |
Hello, I would like to ask about the status of this branch. My team would really benefit from this and would no longer have to resort to |
@AndreaTrucchia have you checked |
@adrinjalali I am checking it; unfortunately most of my work concerns Random Forest. However, I think that I can give it a try for |
Out of curiosity, do the preprocessing techniques we have to handle categorical variables not satisfy your needs in a |
Dear @adrinjalali, in a scikit-learn environment I tend to one-hot-encode the categorical variables, with very good performance (see e.g. https://www.mdpi.com/2571-6255/5/1/30). However, with the R (randomForest package) style of treating categorical variables, I can use the partialPlot function, which can rank the variables from, say, 1 ("this category enhances the classification as label A") to -1 ("this category strongly disagrees with classification as label A"). |
Isn't You could pass a pipeline with the |
That would probably be a different thing. Our PDP support is only defined for regressors, not classifiers. The "partial dependence" as we support it is defined as the expectation of a continuous target. |
How can we use this one? If I may ask this dumb question? Is it a function in scikit-learn? Thanks a lot! |
@AliciaPython it's not included. You can check out this branch, compile the package, and install it locally, but I wouldn't recommend it since it's quite outdated compared to the main branch at this point. Getting this merged would require substantial work, and I'm not sure it will happen. You're probably better off using HistGradientBoosting models. |
If a simple tree is needed, would it be a good idea to use HistGradientBoosting with max_iter=1? Does this reduce to a single tree that is roughly equivalent to a simple tree model? |
@lcrmorin if you don't lose too much information by quantizing your features (which is what |
@lcrmorin cc @adrinjalali I think there still can be some significant differences: the first tree is fit to the pseudo-residual (gradient of the loss function, sometimes with hessian information too) from an initial prediction (see also the |
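The difference described above can be illustrated with a small pure-Python sketch. It assumes squared-error loss, a single numeric feature, and a one-split tree; `fit_stump` and `fit_one_boosting_round` are hypothetical helpers, not scikit-learn functions, and the sketch ignores binning of the features.

```python
from statistics import mean


def fit_stump(x, y):
    """Fit a one-split regression tree by minimizing the squared error."""
    best = None
    for threshold in sorted(set(x))[:-1]:
        left = [yi for xi, yi in zip(x, y) if xi <= threshold]
        right = [yi for xi, yi in zip(x, y) if xi > threshold]
        sse = sum((v - mean(left)) ** 2 for v in left)
        sse += sum((v - mean(right)) ** 2 for v in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, mean(left), mean(right))
    _, threshold, left_value, right_value = best
    return lambda xi: left_value if xi <= threshold else right_value


def fit_one_boosting_round(x, y, learning_rate):
    """One boosting iteration: fit the stump to residuals from a constant baseline."""
    baseline = mean(y)  # initial prediction under squared error
    residuals = [yi - baseline for yi in y]  # pseudo-residuals
    stump = fit_stump(x, residuals)
    return lambda xi: baseline + learning_rate * stump(xi)


x = [1, 2, 3, 4]
y = [1.0, 1.0, 5.0, 5.0]

plain = fit_stump(x, y)
boosted_full = fit_one_boosting_round(x, y, learning_rate=1.0)
boosted_shrunk = fit_one_boosting_round(x, y, learning_rate=0.1)
```

Under these assumptions, one boosting round with a learning rate of 1 reproduces the plain stump, while a smaller learning rate shrinks every prediction toward the baseline. With other losses the pseudo-residuals are no longer a simple shift of the target, so the two models can also choose different splits.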
Any chance anyone has the link to the original issue, or a summarizing comment for why this is stalled or difficult? My impression is that since R and other packages have a similar feature, maybe there's some friction here due to just the internals of the sklearn tree API or something? I realize this may just not get in, but I want to see what some of the ideas were to see if I can implement a robust soln into scikit-tree: as just a separate splitter. Thanks! |
At some point in the past this PR was in pretty good shape, but I was asked to provide more benchmarks and more evidence that this is good enough. There was also the issue of trying to simplify the existing codebase so that this could be introduced more easily, including for the sparse case. At some point I didn't have the time to spend on this anymore (after quite some time of working on this almost every day). So it was left behind. At this point, getting back to this would cost me more than I have to spare. At the same time, So, I'm not sure. |
I see. Thanks for the update @adrinjalali! If you don't mind, I will probably incorporate this into our sklearn fork and probably refactor this to account for the most up-to-date Cython changes in sklearn:main and also make the API more similar to the categorical API in It sounds like this kind of feature is okay w/ maintainers for inclusion. The main bottleneck is some significant benchmarking(?) (and I suppose the missing-value work being carried out by Thomas). If so, I'll post back here with a link to the commit from our fork that implements this, so a PR can more easily be carried out that's in line w/ sklearn:main. |
I really don't like the idea of the fork as I've mentioned before, but sure, you can have it there. |
Agreed... I'm trying to pipe most of the features possible downstream to Therefore it has to be done at the Python BaseDecisionTree and Cython splitter level unfortunately. I'm not a huge fan of codebases that hard-forked the sklearn Cython code. At the moment, I'll just have to eat the cost of rebasing a submodule or hard-forking and then figuring out how to re-align code. Thanks for the feedback and updates! |
self.n_nodes = 0
self.bits = NULL

def _dealloc__(self):
Suggested change:
- def _dealloc__(self):
+ def __dealloc__(self):
#### Reference Issues/PRs
Helps bring in fork wrt changes in scikit-learn#12866

Signed-off-by: Adam Li <[email protected]>
I would be interested in seeing if I can help push this towards completion. I think it's an important feature for one of the most used classes in scikit-learn, and would improve the performance of RF/ETs on tabular data where categorical data is common. However, I do not want to work on this feature if there is no chance for inclusion. Is this still of interest to the @sklearn-devs?
First off, to summarize what technical features changed wrt the trees on
Next, to summarize the work that needs to be done based on reading this thread:
For anyone that managed to read thru this and see anything I missed, please let me know and I can edit this post :). |
@adam2392 I would love it if you could take over this one ❤️ and I think it would be a nice addition. |
In Even though the categories are encoded as I wonder if doing the same as |
This PR continues the work of #4899. So far I've merged master into the PR, made it compile, and got the tests to run. There are several issues which need to be fixed. The list will be updated as I encounter them. Also, not all of these items are necessarily open; I have only collected them from the comments on the original PR, and need to make sure they're either already addressed or address them.
almost_equal
tree/tests: done
ensemble/tests: done
Closes #4899
Future Work: This is the possible future work we already know of (i.e. outside the scope of this PR):
[0, max(feature)]
P.S. I moved away from "task list" due to the extremely buggy interface when used in combination with editing the post, which I'm extensively doing to keep it easy for us to keep up with the status.