
NOCATS: Categorical splits for tree-based learners (ctnd.) #12866


Open · wants to merge 59 commits into base: main

Conversation

adrinjalali
Member

adrinjalali commented Dec 26, 2018

This PR continues the work of #4899. For now I've merged master into the PR, made it compile, and got the tests running. There are several issues which need to be fixed; the list will be updated as I encounter them. Also, not all of these items are necessarily open: I have only collected them from the comments on the original PR and still need to check whether each is already addressed, and address it if not.

  • merge master into the PR (done)
  • sparse tests pass (done)
    • The code is supposed to be the same as the status quo implementation if categories are not passed. But right now the tests related to sparse data fail.
    • EDIT: The tests pass if we compare floats with almost_equal
  • LabelEncoder -> CategoricalEncoder (done)
    • Preprocessing is not a part of NOCATS anymore.
  • Is maximum random generations 20 or 40 (done)
    • It's actually 60
  • Don't quantize features automatically (done)
  • check the category count limits for given data. (done)
  • add a benchmark
  • add tests (right now only invalid inputs are tested)
    • tree/tests done
    • ensemble/tests done
  • benchmark against master
  • add an example with plots
  • check numpy upgrade related issues (we've upgraded our numpy requirement in the meantime)
  • run some benchmarks with a simple integer coding of the features (with arbitrary ordering); a rough sketch of such a baseline follows this list
  • add cat_split to NODE_DTYPE once joblib.hash can handle it (padded struct)
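
As a rough illustration of the integer-coding baseline mentioned in the benchmark item above, a minimal sketch (the dataset, column indices, and estimator settings are placeholders, not part of this PR):

```python
# Hypothetical baseline: arbitrary integer coding of categorical features,
# used only as a point of comparison for native categorical splits.
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder

categorical_cols = [0, 2]  # placeholder: indices of the categorical columns

baseline = make_pipeline(
    ColumnTransformer(
        [("cats", OrdinalEncoder(), categorical_cols)],
        remainder="passthrough",
    ),
    RandomForestClassifier(n_estimators=100, random_state=0),
)
# baseline.fit(X_train, y_train)
# baseline.score(X_test, y_test)
```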

Closes #4899

Future Work: These are the possible future work we already know of (i.e. outside the scope of this PR):

  • Heuristic methods to allow fast Breiman-like training for multi-class classification
  • export to graphviz
  • One-hot emulation using the NOCATS machinery
  • support sparse input
  • handle categories as their unique values instead of [0, max(feature)]

P.S. I moved away from the "task list" format due to its extremely buggy interface when combined with editing the post, which I'm doing extensively to keep the status easy to follow.

jblackburne and others added 20 commits February 11, 2017 12:43
…causing all kinds of problems. Now safe_realloc requires the item size to be explicitly provided. Also, it can allocate arrays of pointers to any type by casting to void*.
…to categorical variables. Replaced the threshold attribute of SplitRecord and Node with SplitValue.
…hat defaults to -1 for each feature (indicating non-categorical).
…ediction with trees. Also introduced category caches for quick evaluation of categorical splits.
@jnothman
Member

Wow. Good on you for taking this on!

@adrinjalali
Member Author

adrinjalali commented Dec 26, 2018

~~I assume the appveyor failure is unrelated to this PR I suppose.~~

@AndreaTrucchia

Hello, I would like to inquire about the status of this branch. My team would really benefit from this and would be freed from having to resort to R every now and then.

@adrinjalali
Member Author

@AndreaTrucchia have you checked HistGradientBoosting* instead?

@AndreaTrucchia

@adrinjalali I am checking it; too bad most of my work concerns Random Forests. However, I think I can give it a try for studies that revolve just around the effect of different categories on the predicted label. Thanks a lot

@adrinjalali
Member Author

Out of curiosity, do the preprocessing techniques we have to handle categorical variables not satisfy your needs in a Pipeline?

@AndreaTrucchia

Dear @adrinjalali, in a scikit-learn environment I tend to one-hot-encode the categorical variables, with very good performance (see e.g. https://www.mdpi.com/2571-6255/5/1/30). However, with the R (randomForest package) style of treating categorical variables, I can use the partialPlot function, which can rank the variables from, say, 1 ("this category enhances the classification as label A") to -1 ("this category strongly disagrees with the classification as label A").
I hope I was clear enough :)

@adrinjalali
Member Author

Isn't partialPlot the partial dependence plots that we have? (https://scikit-learn.org/stable/modules/partial_dependence.html#partial-dependence)

You could pass a pipeline with the OneHotEncoder included in it and get the partial dependence (I think).
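
A minimal sketch of that suggestion (the column name "land_cover", X, and y are placeholders; the categorical_features argument of PartialDependenceDisplay.from_estimator is only available in recent scikit-learn versions):

```python
# Sketch: partial dependence computed through a pipeline that one-hot
# encodes a categorical column. "land_cover", X and y are placeholders.
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import PartialDependenceDisplay
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

pipe = make_pipeline(
    ColumnTransformer(
        [("onehot", OneHotEncoder(handle_unknown="ignore"), ["land_cover"])],
        remainder="passthrough",
    ),
    RandomForestClassifier(random_state=0),
)
# pipe.fit(X, y)
# Recent scikit-learn versions can treat the raw column as categorical:
# PartialDependenceDisplay.from_estimator(
#     pipe, X, features=["land_cover"], categorical_features=["land_cover"]
# )
```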

@NicolasHug
Member

NicolasHug commented Feb 24, 2022

That would probably be a different thing. Our PDP support is only defined for regressors, not classifiers. The "partial dependence" as we support it is defined as the expectation of a continuous target.
EDIT nvm I'm wrong, it can rely on the decision_function

@QianqianHan96

How can we use this one, if I may ask a dumb question? Is it a function in scikit-learn? Thanks a lot!

@adrinjalali
Member Author

@AliciaPython it's not included. You can check out this branch, compile the package, and install it locally, but I wouldn't recommend it since it's quite outdated compared to the main branch at this point. Getting this in would require some substantial work, if it happens at all, and I'm not sure it will. You're probably better off using HistGradientBoosting models.
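
For reference, a minimal sketch of the native categorical support in HistGradientBoosting* (the column indices here are placeholders; the exact accepted forms of categorical_features depend on the scikit-learn version):

```python
# Sketch: HistGradientBoosting* handles categorical splits natively via the
# `categorical_features` parameter. Column indices below are placeholders;
# depending on the scikit-learn version, boolean masks, column names, or
# dtype-based detection are also accepted.
from sklearn.ensemble import HistGradientBoostingClassifier

clf = HistGradientBoostingClassifier(
    categorical_features=[0, 3],  # placeholder: categorical column indices
    random_state=0,
)
# clf.fit(X, y)  # categorical columns encoded as small non-negative integers
```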

@lcrmorin

lcrmorin commented Apr 5, 2023

If a simple tree is needed, would it be a good idea to use HistGradientBoosting with max_iter=1? Would that reduce to something roughly equivalent to a single decision tree model?

@adrinjalali
Member Author

@lcrmorin if you don't lose too much information by quantizing your features (which is what HistGradientBoosting does), then they might be similar I think.

@bmreiniger
Contributor

@lcrmorin cc @adrinjalali
Set the learning rate to 1 and early stopping to False to prevent a validation set from being split out.

I think there can still be some significant differences: the first tree is fit to the pseudo-residuals (the gradient of the loss function, sometimes with Hessian information too) from an initial prediction (see also the init parameter of the vanilla GBMs here). The splits chosen might be the same as an ordinary tree's, but that might depend on the loss function chosen (??); and the leaf values will certainly be different.
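
A minimal sketch of the configuration discussed above (X and y are placeholders):

```python
# Sketch of the single-tree emulation discussed above. Note the caveat:
# the one tree is fit to gradients around a baseline prediction, so its
# leaf values differ from a plain DecisionTreeRegressor even when the
# split structure happens to match.
from sklearn.ensemble import HistGradientBoostingRegressor

one_tree = HistGradientBoostingRegressor(
    max_iter=1,            # a single boosting iteration, i.e. one tree
    learning_rate=1.0,     # do not shrink the tree's contribution
    early_stopping=False,  # do not hold out a validation split
)
# one_tree.fit(X, y)
```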

@adam2392
Member

Any chance anyone has the link to the original issue, or a summarizing comment for why this is stalled or difficult?

My impression is that since R and other packages have a similar feature, maybe there's some friction here due to just the internals of the sklearn tree API or something?

I realize this may just not get in, but I want to see what some of the ideas were, to see if I can implement a robust solution in scikit-tree as just a separate splitter.

Thanks!

@adrinjalali
Member Author

At some point in the past this PR was in pretty good shape, but I was asked to provide more benchmarks and more evidence that it was good enough. There was also the issue of trying to simplify the existing codebase so that this could be introduced more easily, and also cover the sparse case. At some point I didn't have the time to spend on this anymore (after quite some time of working on it almost every day), so it was left behind. At this point, getting back to it would cost me more time than I have to spare.

At the same time, HistGradientBoosting* now natively supports categorical features, which also deprioritized this work.

So, I'm not sure.

@adam2392
Member

adam2392 commented Jun 23, 2023

I see. Thanks for the update @adrinjalali!

If you don't mind, I will probably incorporate this into our sklearn fork, refactor it to account for the most up-to-date Cython changes in sklearn:main, and make the API more similar to the categorical API in HistGradientBoosting*. Is that fine w/ you, since you're the author of this work?

It sounds like this kind of feature is okay w/ maintainers for inclusion. The main bottleneck is some significant benchmarking(?) (and I suppose the missing-value work being carried out by Thomas). If so, I'll post back here with a link to the commit from our fork that implements this, so that a PR in line w/ sklearn:main can be carried out more easily.

@adrinjalali
Member Author

I really don't like the idea of the fork as I've mentioned before, but sure, you can have it there.

@adam2392
Member

adam2392 commented Jun 23, 2023

> I really don't like the idea of the fork as I've mentioned before, but sure, you can have it there.

Agreed... I'm trying to pipe most of the features possible downstream to scikit-tree, but for this specific one we want to enable categorical splits for all possible tree models w/o diverging from sklearn's Cython code.

Therefore it has to be done at the Python BaseDecisionTree and Cython splitter level, unfortunately. I'm not a huge fan of codebases that hard-fork the sklearn Cython code. At the moment, I'll just have to eat the cost of rebasing a submodule, or hard-forking and then figuring out how to re-align the code.

Thanks for the feedback and updates!

self.n_nodes = 0
self.bits = NULL

def _dealloc__(self):


Suggested change:
-def _dealloc__(self):
+def __dealloc__(self):

adam2392 added a commit to neurodata/scikit-learn that referenced this pull request Jul 20, 2023
#### Reference Issues/PRs
Helps bring in fork wrt changes in scikit-learn#12866


Signed-off-by: Adam Li <[email protected]>
@adam2392
Member

adam2392 commented Jul 2, 2024

I would be interested in seeing if I can help push this towards completion. I think it's an important feature for one of the most used classes in scikit-learn, and would improve the performance of RF/ETs on tabular data where categorical data is common. However, I do not want to work on this feature if there is no chance for inclusion. Is this still of interest to the @sklearn-devs?

First off, to summarize what technical features changed w.r.t. the trees on main:

  1. SplitRecord.threshold and any downstream use of this value will now hold a fused type, which is either the threshold on a numerical feature or the categorical split value.
  2. Bit encoding of categories: an unsigned 32- or 64-bit integer is used as a bitset recording which categories are selected. In practice this allows up to 32 or 64 categories per feature, i.e. 2^32 or 2^64 possible partitions, IIUC. This part I still need to understand better (a toy sketch of the bitset idea follows this list).
  3. A cache manager for applying categorical splits: I still have to understand why this is necessary/useful.
  4. Breiman shortcut for binary classification: here, we can have an efficient way to split by categories. For details, I looked at https://orbi.uliege.be/bitstream/2268/170309/1/thesis.pdf.
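
A toy Python sketch of the bitset idea from item 2; the names below are illustrative, not the PR's actual Cython identifiers:

```python
# Toy illustration of a bitset-encoded categorical split: bit i of
# `cat_split` says whether category i goes to the left child. A 64-bit
# field can therefore distinguish up to 64 categories per feature,
# i.e. 2**64 possible left/right assignments.
def goes_left(category: int, cat_split: int) -> bool:
    return (cat_split >> category) & 1 == 1

# Example: categories {0, 2, 5} go left, everything else goes right.
cat_split = (1 << 0) | (1 << 2) | (1 << 5)
assert goes_left(2, cat_split)
assert not goes_left(3, cat_split)
```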

Next to summarize the work that needs to be done based on reading this thread:

  1. integrating the above changes with the current tree implementation. The upgrade to Cython 3.0+, the addition of missing-value support, and some PRs improving the maintainability and readability of the decision tree codebase may help in this regard.
  2. benchmarking the results on the Amazon dataset as done here by Adrin
  3. benchmarking on openml datasets with categorical features
  4. seeing if we can encode the categorical stuff via pandas/polars
  5. unit tests

For anyone who managed to read through this and sees anything I missed, please let me know and I can edit this post :).

@adrinjalali
Member Author

@adam2392 I would love it if you could take over this one ❤️ and I think it would be a nice addition.

@caiohamamura

caiohamamura commented Sep 28, 2024

In R's randomForest the R code itself has minimal functionality; instead it is a wrapper around a modified version of the original Fortran code from Leo Breiman. In that code the decision tree handles categorical data by taking a vector argument that tells how many classes each predictor has, with 1 meaning that the predictor is numerical.

Even though the categories are encoded as integers, during calculation of the Gini impurity the splits are tested over all combinations of categories into two groups (if the number of categories is < 10); otherwise it tries a predefined number of random splits. Without this, traditional threshold splits would have no meaning (<= 3 vs > 3), because categories 1, 2, 3 probably have no inherent order at all.
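
An illustrative Python sketch of the exhaustive search described above (not the Fortran code itself, and feasible only for small numbers of categories):

```python
# Illustrative exhaustive categorical split search: every way of sending a
# non-empty proper subset of categories to the left child is scored with
# the weighted Gini impurity. Each bipartition is visited twice (once per
# side), which is harmless for illustration; feasible only for small k,
# since there are 2**(k-1) - 1 distinct bipartitions.
from itertools import combinations

import numpy as np


def gini(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)


def best_categorical_split(x_cat, y):
    cats = np.unique(x_cat)
    best_score, best_left = np.inf, None
    for r in range(1, len(cats)):
        for left in combinations(cats, r):
            mask = np.isin(x_cat, left)
            score = mask.mean() * gini(y[mask]) + (~mask).mean() * gini(y[~mask])
            if score < best_score:
                best_score, best_left = score, left
    return best_score, best_left  # weighted impurity, categories sent left
```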

I wonder if doing the same as R did would not be better after all, i.e. using the Python code only as a wrapper around the original Fortran function, since it would be faster than any implementation we come up with in Python.
