
IsolationForest(max_features=0.8).predict(X) fails input validation #5732


Closed
betatim opened this issue Nov 5, 2015 · 6 comments

@betatim
Member

betatim commented Nov 5, 2015

When subsampling features, IsolationForest fails input validation when calling predict().

from sklearn.ensemble import IsolationForest
from sklearn import datasets

iris = datasets.load_iris()
X = iris.data
y = iris.target

# subsample 80% of the features (3 of the 4 iris columns)
clf = IsolationForest(max_features=0.8)
clf.fit(X, y)
clf.predict(X)  # raises the ValueError shown below

gives the following:

scikit-learn/sklearn/tree/tree.pyc in _validate_X_predict(self, X, check_input)
    392                              " match the input. Model n_features is %s and "
    393                              " input n_features is %s "
--> 394                              % (self.n_features_, n_features))
    395
    396         return X

ValueError: Number of features of the model must  match the input. Model n_features is 3 and  input n_features is 4

In predict, one of the individual fitted estimators is used for input validation: self.estimators_[0]._validate_X_predict(X, check_input=True), but it is passed the full X, which still has all the features. After looking into it a bit: bagging.py sub-samples the features itself, whereas forest.py delegates that to the underlying DecisionTree.
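
For reference, a minimal sketch (using the clf and X from the snippet above, and the estimators_ / estimators_features_ attributes that IsolationForest inherits from BaseBagging) makes the mismatch visible:

print(X.shape[1])                      # 4: the input keeps all iris features
print(clf.estimators_features_[0])     # the 3 column indices the first tree was fit on
print(clf.estimators_[0].n_features_)  # 3: what _validate_X_predict compares against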

@agramfort
Member

agramfort commented Nov 5, 2015 via email

@ngoix
Contributor

ngoix commented Nov 6, 2015

Yes, we can't apply X_test to the base estimators self.estimators_ because X_test has the full number of features, whereas the base estimators are built on a smaller subset of features.

The original Isolation Forest algorithm does not consider sub-sampling features. We kept this argument because it was supposed to be an easy extension. It seems that is not the case. I vote for removing the max_features argument.

@glouppe
Contributor

glouppe commented Nov 6, 2015

To me the issue comes from the fact that the validity of X is checked through one of its sub-estimators
(see X = self.estimators_[0]._validate_X_predict(X, check_input=True) at line 194). This is not the correct way to do it.

@ngoix
Contributor

ngoix commented Nov 6, 2015

Yes, but even if you replace this line with just X = check_array(X), you get an error in predict when doing (at line 202):

for i, tree in enumerate(self.estimators_):
    leaves_index = tree.apply(X)
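
(To be explicit, a small sketch with the clf and X from the report above: apply() re-validates its input against the tree's own feature count, so the same ValueError comes back.)

tree = clf.estimators_[0]
tree.apply(X)  # ValueError again: model has 3 features, input has 4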

@glouppe
Contributor

glouppe commented Nov 6, 2015

Yes, because you should not do that either. Since you inherit from BaseBagging, you need to iterate over zip(self.estimators_, self.estimators_features_) and then compute leaves_index = tree.apply(X[:, features]). Sorry for not catching that in the initial review...
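
A rough sketch of what that loop could look like (assuming only the estimators_ and estimators_features_ attributes from BaseBagging; the real change belongs inside IsolationForest.predict):

import numpy as np

def apply_trees(iforest, X):
    # apply each tree to the feature subset it was actually fit on
    leaves = []
    for tree, features in zip(iforest.estimators_, iforest.estimators_features_):
        leaves.append(tree.apply(X[:, features]))  # slice the columns first
    return np.column_stack(leaves)

# e.g. apply_trees(clf, X) with the clf and X from the report above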

@ngoix
Contributor

ngoix commented Nov 6, 2015

Ok great! I didn't know about self.estimators_features_. So it is indeed an easy extension, and I withdraw my vote for removing max_features ;)

@amueller modified the milestone: 0.19 Sep 29, 2016