
NOCATS: Categorical splits for tree-based learners (ctnd.) #12866


Open · wants to merge 59 commits into base: main

Conversation

adrinjalali
Member

adrinjalali commented Dec 26, 2018

This PR continues the work of #4899. For now I've merged master into the PR, made it compile, and got the tests running. There are several issues which need to be fixed; the list will be updated as I encounter them. Also, not all of these items are necessarily open: I have only collected them from the comments on the original PR and still need to check whether each is already addressed, and address it if not.

  • merge master into the PR (done)
  • sparse tests pass (done)
    • The code is supposed to be the same as the status quo implementation if categories are not passed. But right now the tests related to sparse data fail.
    • EDIT: The tests pass if we compare floats with almost_equal
  • LabelEncoder -> CategoricalEncoder (done)
    • Preprocessing is not a part of NOCATS anymore.
  • Is maximum random generations 20 or 40 (done)
    • It's actually 60
  • Don't quantize features automatically (done)
  • check the category count limits for given data. (done)
  • add a benchmark
  • add tests (right now only invalid inputs are tested)
    • tree/tests done
    • ensemble/tests done
  • benchmark against master
  • add an example with plots
  • check numpy upgrade related issues (we've upgraded our numpy requirement in the meantime)
  • run some benchmarks with a simple integer coding of the features (with arbitrary ordering); a rough sketch of such a baseline follows this list
  • add cat_split to NODE_DTYPE once joblib.hash can handle it (padded struct)
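
As a rough illustration of the integer-coding baseline mentioned in the benchmark item above, a minimal sketch (the dataset, column indices, and estimator settings are placeholders, not part of this PR):

```python
# Hypothetical baseline: arbitrary integer coding of categorical features,
# used only as a point of comparison for native categorical splits.
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder

categorical_cols = [0, 2]  # placeholder: indices of the categorical columns

baseline = make_pipeline(
    ColumnTransformer(
        [("cats", OrdinalEncoder(), categorical_cols)],
        remainder="passthrough",
    ),
    RandomForestClassifier(n_estimators=100, random_state=0),
)
# baseline.fit(X_train, y_train)
# baseline.score(X_test, y_test)
```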

Closes #4899

Future Work: These are the possible future work we already know of (i.e. outside the scope of this PR):

  • Heuristic methods to allow fast Breiman-like training for multi-class classification
  • export to graphviz
  • One-hot emulation using the NOCATS machinery
  • support sparse input
  • handle categories as their unique values instead of [0, max(feature)]

P.S. I moved away from the "task list" format due to its extremely buggy interface when combined with editing the post, which I'm doing extensively to keep the status easy to follow.

jblackburne and others added 20 commits February 11, 2017 12:43
…causing all kinds of problems. Now safe_realloc requires the item size to be explicitly provided. Also, it can allocate arrays of pointers to any type by casting to void*.
…to categorical variables. Replaced the threshold attribute of SplitRecord and Node with SplitValue.
…hat defaults to -1 for each feature (indicating non-categorical).
…ediction with trees. Also introduced category caches for quick evaluation of categorical splits.
@jnothman
Member

Wow. Good on you for taking this on!

@adrinjalali
Member Author

adrinjalali commented Dec 26, 2018

~~I assume the appveyor failure is unrelated to this PR I suppose.~~

@AndreaTrucchia

Hello, I would like to inquire about the status of this branch. My team would really benefit from this and would be freed from having to resort to R every now and then.

@adrinjalali
Member Author

@AndreaTrucchia have you checked HistGradientBoosting* instead?

@AndreaTrucchia

@adrinjalali I am checking it; too bad most of my work concerns Random Forests. However, I think I can give it a try for studies that revolve just around the effect of different categories on the predicted label. Thanks a lot

@adrinjalali
Member Author

Out of curiosity, do the preprocessing techniques we have to handle categorical variables not satisfy your needs in a Pipeline?

@AndreaTrucchia

Dear @adrinjalali, in a scikit-learn environment I tend to one-hot-encode the categorical variables, with very good performance (see e.g. https://www.mdpi.com/2571-6255/5/1/30). However, with the R (randomForest package) style of treating categorical variables, I can use the partialPlot function, which can rank the variables from, say, 1 ("this category enhances the classification as label A") to -1 ("this category strongly disagrees with the classification as label A").
I hope I was clear enough :)

@adrinjalali
Member Author

Isn't partialPlot the partial dependence plots that we have? (https://scikit-learn.org/stable/modules/partial_dependence.html#partial-dependence)

You could pass a pipeline with the OneHotEncoder included in it and get the partial dependence (I think).
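
A minimal sketch of that suggestion (the column name "land_cover", X, and y are placeholders; the categorical_features argument of PartialDependenceDisplay.from_estimator is only available in recent scikit-learn versions):

```python
# Sketch: partial dependence computed through a pipeline that one-hot
# encodes a categorical column. "land_cover", X and y are placeholders.
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import PartialDependenceDisplay
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

pipe = make_pipeline(
    ColumnTransformer(
        [("onehot", OneHotEncoder(handle_unknown="ignore"), ["land_cover"])],
        remainder="passthrough",
    ),
    RandomForestClassifier(random_state=0),
)
# pipe.fit(X, y)
# Recent scikit-learn versions can treat the raw column as categorical:
# PartialDependenceDisplay.from_estimator(
#     pipe, X, features=["land_cover"], categorical_features=["land_cover"]
# )
```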

@NicolasHug
Member

NicolasHug commented Feb 24, 2022

That would probably be a different thing. Our PDP support is only defined for regressors, not classifiers. The "partial dependence" as we support it is defined as the expectation of a continuous target.
EDIT nvm I'm wrong, it can rely on the decision_function

@QianqianHan96

How can we use this one, if I may ask a dumb question? Is it a function in scikit-learn? Thanks a lot!

@adrinjalali
Member Author

@AliciaPython it's not included. You can check out this branch, compile the package, and install it locally, but I wouldn't recommend it since it's quite outdated compared to the main branch at this point. Getting this in would require some substantial work, if it happens at all, and I'm not sure it will. You're probably better off using HistGradientBoosting models.
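
For reference, a minimal sketch of the native categorical support in HistGradientBoosting* (the column indices here are placeholders; the exact accepted forms of categorical_features depend on the scikit-learn version):

```python
# Sketch: HistGradientBoosting* handles categorical splits natively via the
# `categorical_features` parameter. Column indices below are placeholders;
# depending on the scikit-learn version, boolean masks, column names, or
# dtype-based detection are also accepted.
from sklearn.ensemble import HistGradientBoostingClassifier

clf = HistGradientBoostingClassifier(
    categorical_features=[0, 3],  # placeholder: categorical column indices
    random_state=0,
)
# clf.fit(X, y)  # categorical columns encoded as small non-negative integers
```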

@lcrmorin

lcrmorin commented Apr 5, 2023

If a simple tree is needed, would it be a good idea to use HistGradientBoosting with max_iter=1? Would that reduce to something roughly equivalent to a single decision tree model?

@adrinjalali
Member Author

@lcrmorin if you don't lose too much information by quantizing your features (which is what HistGradientBoosting does), then they might be similar I think.

@bmreiniger
Contributor

@lcrmorin cc @adrinjalali
Set the learning rate to 1 and early stopping to False to prevent a validation set from being split out.

I think there can still be some significant differences: the first tree is fit to the pseudo-residuals (the gradient of the loss function, sometimes with Hessian information too) from an initial prediction (see also the init parameter of the vanilla GBMs here). The splits chosen might be the same as an ordinary tree's, but that might depend on the loss function chosen (??); and the leaf values will certainly be different.
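
A minimal sketch of the configuration discussed above (X and y are placeholders):

```python
# Sketch of the single-tree emulation discussed above. Note the caveat:
# the one tree is fit to gradients around a baseline prediction, so its
# leaf values differ from a plain DecisionTreeRegressor even when the
# split structure happens to match.
from sklearn.ensemble import HistGradientBoostingRegressor

one_tree = HistGradientBoostingRegressor(
    max_iter=1,            # a single boosting iteration, i.e. one tree
    learning_rate=1.0,     # do not shrink the tree's contribution
    early_stopping=False,  # do not hold out a validation split
)
# one_tree.fit(X, y)
```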

@adam2392
Member

Any chance anyone has the link to the original issue, or a summarizing comment for why this is stalled or difficult?

My impression is that since R and other packages have a similar feature, maybe there's some friction here due to just the internals of the sklearn tree API or something?

I realize this may just not get in, but I want to see what some of the ideas were, to see if I can implement a robust solution in scikit-tree as just a separate splitter.

Thanks!

@adrinjalali
Member Author

At some point in the past this PR was in pretty good shape, but I was asked to provide more benchmarks and more evidence that it was good enough. There was also the issue of trying to simplify the existing codebase so that this could be introduced more easily, and also cover the sparse case. At some point I didn't have the time to spend on this anymore (after quite some time of working on it almost every day), so it was left behind. At this point, getting back to it would cost me more time than I have to spare.

At the same time, HistGradientBoosting* now natively supports categorical features, which also deprioritized this work.

So, I'm not sure.

@adam2392
Member

adam2392 commented Jun 23, 2023

I see. Thanks for the update @adrinjalali!

If you don't mind, I will probably incorporate this into our sklearn fork, refactor it to account for the most up-to-date Cython changes in sklearn:main, and make the API more similar to the categorical API in HistGradientBoosting*. Is that fine w/ you, since you're the author of this work?

It sounds like this kind of feature is okay w/ maintainers for inclusion. The main bottleneck is some significant benchmarking(?) (and I suppose the missing-value work being carried out by Thomas). If so, I'll post back here with a link to the commit from our fork that implements this, so that a PR in line w/ sklearn:main can be carried out more easily.

@adrinjalali
Member Author

I really don't like the idea of the fork as I've mentioned before, but sure, you can have it there.

@adam2392
Member

adam2392 commented Jun 23, 2023

> I really don't like the idea of the fork as I've mentioned before, but sure, you can have it there.

Agreed... I'm trying to pipe most of the features possible downstream to scikit-tree, but for this specific one we want to enable categorical splits for all possible tree models w/o diverging from sklearn's Cython code.

Therefore it has to be done at the Python BaseDecisionTree and Cython splitter level, unfortunately. I'm not a huge fan of codebases that hard-fork the sklearn Cython code. At the moment, I'll just have to eat the cost of rebasing a submodule, or hard-forking and then figuring out how to re-align the code.

Thanks for the feedback and updates!

self.n_nodes = 0
self.bits = NULL

def _dealloc__(self):


Suggested change:
-def _dealloc__(self):
+def __dealloc__(self):

adam2392 added a commit to neurodata/scikit-learn that referenced this pull request Jul 20, 2023
#### Reference Issues/PRs
Helps bring in fork wrt changes in scikit-learn#12866


Signed-off-by: Adam Li <[email protected]>
@adam2392
Member

adam2392 commented Jul 2, 2024

I would be interested in seeing if I can help push this towards completion. I think it's an important feature for one of the most used classes in scikit-learn, and would improve the performance of RF/ETs on tabular data where categorical data is common. However, I do not want to work on this feature if there is no chance for inclusion. Is this still of interest to the @sklearn-devs?

First off, to summarize what technical features changed w.r.t. the trees on main:

  1. SplitRecord.threshold and any downstream use of this value will now hold a fused type, which is either the threshold on a numerical feature or the categorical split value.
  2. Bit encoding of categories: an unsigned 32- or 64-bit integer is used as a bitset recording which categories are selected. In practice this allows up to 32 or 64 categories per feature, i.e. 2^32 or 2^64 possible partitions, IIUC. This part I still need to understand better (a toy sketch of the bitset idea follows this list).
  3. A cache manager for applying categorical splits: I still have to understand why this is necessary/useful.
  4. Breiman shortcut for binary classification: here, we can have an efficient way to split by categories. For details, I looked at https://orbi.uliege.be/bitstream/2268/170309/1/thesis.pdf.
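
A toy Python sketch of the bitset idea from item 2; the names below are illustrative, not the PR's actual Cython identifiers:

```python
# Toy illustration of a bitset-encoded categorical split: bit i of
# `cat_split` says whether category i goes to the left child. A 64-bit
# field can therefore distinguish up to 64 categories per feature,
# i.e. 2**64 possible left/right assignments.
def goes_left(category: int, cat_split: int) -> bool:
    return (cat_split >> category) & 1 == 1

# Example: categories {0, 2, 5} go left, everything else goes right.
cat_split = (1 << 0) | (1 << 2) | (1 << 5)
assert goes_left(2, cat_split)
assert not goes_left(3, cat_split)
```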

Next to summarize the work that needs to be done based on reading this thread:

  1. integrating the above changes with the current tree implementation. The upgrade to Cython 3.0+, the addition of missing-value support, and some PRs improving the maintainability and readability of the decision tree codebase may help in this regard.
  2. benchmarking the results on the Amazon dataset as done here by Adrin
  3. benchmarking on openml datasets with categorical features
  4. seeing if we can encode the categorical stuff via pandas/polars
  5. unit tests

For anyone who managed to read through this and sees anything I missed, please let me know and I can edit this post :).

@adrinjalali
Member Author

@adam2392 I would love it if you could take over this one ❤️ and I think it would be a nice addition.

@caiohamamura

caiohamamura commented Sep 28, 2024

In R's randomForest the R code itself has minimal functionality; instead it is a wrapper around a modified version of the original Fortran code from Leo Breiman. In that code the decision tree handles categorical data by taking a vector argument that tells how many classes each predictor has, with 1 meaning that the predictor is numerical.

Even though the categories are encoded as integers, during calculation of the Gini impurity the splits are tested over all combinations of categories into two groups (if the number of categories is < 10); otherwise it tries a predefined number of random splits. Without this, traditional threshold splits would have no meaning (<= 3 vs > 3), because categories 1, 2, 3 probably have no inherent order at all.
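
An illustrative Python sketch of the exhaustive search described above (not the Fortran code itself, and feasible only for small numbers of categories):

```python
# Illustrative exhaustive categorical split search: every way of sending a
# non-empty proper subset of categories to the left child is scored with
# the weighted Gini impurity. Each bipartition is visited twice (once per
# side), which is harmless for illustration; feasible only for small k,
# since there are 2**(k-1) - 1 distinct bipartitions.
from itertools import combinations

import numpy as np


def gini(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)


def best_categorical_split(x_cat, y):
    cats = np.unique(x_cat)
    best_score, best_left = np.inf, None
    for r in range(1, len(cats)):
        for left in combinations(cats, r):
            mask = np.isin(x_cat, left)
            score = mask.mean() * gini(y[mask]) + (~mask).mean() * gini(y[~mask])
            if score < best_score:
                best_score, best_left = score, left
    return best_score, best_left  # weighted impurity, categories sent left
```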

I wonder if doing the same as R did would not be better after all, i.e. using the Python code only as a wrapper around the original Fortran function, since it would be faster than any implementation we come up with in Python.
