Support categorical features in HistGradientBoosting estimators #15550
Comments
Hi all, it's a fair amount of work, but it should be fun. The goal is to implement something similar to what LightGBM does (i.e. sorting the categories for fast split finding). Also pinging @lorentzenchr and @johannfaouzi in case you'd be interested. I'm happy to assist in any way I can (getting started with the codebase, etc.). Cheers
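For readers unfamiliar with the "sorting of categories" idea mentioned above, here is a minimal pure-Python sketch. It is not LightGBM's or scikit-learn's actual code: the function name `best_categorical_split`, the parameters `reg_lambda` and `cat_smooth`, and the gain formula are illustrative assumptions. The idea is to sum gradients and hessians per category, sort the categories by their gradient-to-hessian ratio, and then scan that ordering as if it were a numerical feature, so only `n_categories - 1` partitions are evaluated instead of all subsets.

```python
from collections import defaultdict


def best_categorical_split(categories, gradients, hessians,
                           reg_lambda=1.0, cat_smooth=10.0):
    """Return (gain, left_categories) for one categorical feature.

    Hypothetical helper for illustration only; names and defaults are assumptions.
    """
    # Accumulate per-category gradient and hessian sums (a tiny "histogram").
    grad_sum = defaultdict(float)
    hess_sum = defaultdict(float)
    for c, g, h in zip(categories, gradients, hessians):
        grad_sum[c] += g
        hess_sum[c] += h

    # Sort categories by a smoothed gradient/hessian ratio.
    ordered = sorted(
        grad_sum, key=lambda c: grad_sum[c] / (hess_sum[c] + cat_smooth)
    )

    total_g = sum(grad_sum.values())
    total_h = sum(hess_sum.values())

    def score(g, h):
        # Second-order leaf score with L2 regularization.
        return g * g / (h + reg_lambda)

    parent = score(total_g, total_h)
    best_gain, best_left = 0.0, frozenset()
    left_g = left_h = 0.0
    # Scan the sorted categories left to right; the last position is skipped
    # because it would leave the right child empty.
    for i, c in enumerate(ordered[:-1]):
        left_g += grad_sum[c]
        left_h += hess_sum[c]
        gain = (score(left_g, left_h)
                + score(total_g - left_g, total_h - left_h)
                - parent)
        if gain > best_gain:
            best_gain, best_left = gain, frozenset(ordered[: i + 1])
    return best_gain, best_left


gain, left = best_categorical_split(
    ["a", "b", "c", "a", "b", "c"],
    [-1.0, 1.0, -0.9, -1.0, 1.0, -0.8],
    [1.0] * 6,
)
print(sorted(left), round(gain, 3))
```

In this toy input, categories `a` and `c` both have negative gradient sums, so they end up next to each other in the sorted order and are sent to the same child.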
@NicolasHug Thanks for asking - sounds like fun and a very useful feature, too. But at the moment, I'd like to focus on some other features (e.g. different loss functions). I'm interested to see how you design the API for this. Tagging certain features with additional info is very useful for other estimators and transformers, too (a certain SLEP is ringing a bell :smirk:)
No worries! Thanks for the reply. In terms of API, we'll keep things simple and just add a … Ideally, categorical features would be automatically inferred in …
For information, the Paris team is currently slowed down a lot because those of us who are not sick are working on the Paris hospitals' databases for real-time statistics and reporting on Covid cases. Let's hope that this cools down soon.
I am interested in picking this up. Look out for a pull request soonish (within a week).
@NicolasHug Thanks for asking! Sorry for the delay, I saw the notification this morning but forgot to reply because I'm working on too many side projects at the moment. I'm glad to see that @thomasjpfan is interested in picking this up, and I would be happy to review the PR. Also, I didn't know how tree-based algorithms deal with categorical features, so TIL.
Having slept on the issue, here are my two cents. Feel free to ignore some or all of my remarks if irrelevant:
Not sure about this one: the gradients / hessians are updated at each iteration so the histograms are never the same, even at the root. For the rest I concur, this is a good list of the things we need to take care of.
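To illustrate the point about histograms changing between iterations, here is a small pure-Python sketch. It assumes a squared-error loss (gradient = prediction minus target, constant hessian) and a toy one-split "tree"; the helper `gradient_histogram` and all variable names are hypothetical and unrelated to the scikit-learn internals. The bin assignment of each sample stays fixed, but the gradients are recomputed from the updated predictions, so the per-bin sums differ from one iteration to the next.

```python
def gradient_histogram(bin_indices, gradients, n_bins):
    """Sum the gradients of the samples falling into each bin."""
    hist = [0.0] * n_bins
    for b, g in zip(bin_indices, gradients):
        hist[b] += g
    return hist


y = [1.0, 2.0, 3.0, 4.0]
bins = [0, 0, 1, 1]          # binned feature values, fixed across iterations
learning_rate = 0.5

# Iteration 0: start from the mean prediction.
pred = [sum(y) / len(y)] * len(y)
grad0 = [p - t for p, t in zip(pred, y)]
hist0 = gradient_histogram(bins, grad0, n_bins=2)

# One boosting step: each bin gets a leaf value equal to minus its mean gradient.
leaf = [-hist0[b] / bins.count(b) for b in range(2)]
pred = [p + learning_rate * leaf[b] for p, b in zip(pred, bins)]

# Iteration 1: same bins, new gradients, hence a different root histogram.
grad1 = [p - t for p, t in zip(pred, y)]
hist1 = gradient_histogram(bins, grad1, n_bins=2)

print(hist0, hist1)  # [2.0, -2.0] [1.0, -1.0]
```

Only the binned feature values can be computed once and reused; the gradient histograms have to be rebuilt at every boosting iteration.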
Yeah, I guess this refers to the so-called pseudo-residuals in the Wikipedia page. The LightGBM page also states that
I will remove that!
Native categorical support for HGBT was implemented in #18394.
Similar to #12866, HGBT-based estimators can also natively support categorical features.
This issue is a placeholder to keep track of the related discussions.