Commit 366a32c

Pushing the docs to dev/ for branch: main, commit 2c0cdd479a2d229e88afbb97cce91eef3d1b07c9
1 parent de1c906 commit 366a32c

File tree

1,541 files changed: +6146 -6122 lines changed

Diff for: dev/_downloads/3c3c738275484acc54821615bf72894a/plot_permutation_importance.py

+15-14
@@ -95,11 +95,15 @@
 # %%
 # Accuracy of the Model
 # ---------------------
-# Prior to inspecting the feature importances, it is important to check that
-# the model predictive performance is high enough. Indeed there would be little
-# interest of inspecting the important features of a non-predictive model.
-#
-# Here one can observe that the train accuracy is very high (the forest model
+# Before inspecting the feature importances, it is important to check that
+# the model predictive performance is high enough. Indeed, there would be little
+# interest in inspecting the important features of a non-predictive model.
+
+print(f"RF train accuracy: {rf.score(X_train, y_train):.3f}")
+print(f"RF test accuracy: {rf.score(X_test, y_test):.3f}")
+
+# %%
+# Here, one can observe that the train accuracy is very high (the forest model
 # has enough capacity to completely memorize the training set) but it can still
 # generalize well enough to the test set thanks to the built-in bagging of
 # random forests.
@@ -110,12 +114,9 @@
 # ``min_samples_leaf=10``) so as to limit overfitting while not introducing too
 # much underfitting.
 #
-# However let's keep our high capacity random forest model for now so as to
-# illustrate some pitfalls with feature importance on variables with many
+# However, let us keep our high capacity random forest model for now so that we can
+# illustrate some pitfalls about feature importance on variables with many
 # unique values.
-print(f"RF train accuracy: {rf.score(X_train, y_train):.3f}")
-print(f"RF test accuracy: {rf.score(X_test, y_test):.3f}")
-

 # %%
 # Tree's Feature Importance from Mean Decrease in Impurity (MDI)
@@ -135,7 +136,7 @@
 #
 # The bias towards high cardinality features explains why the `random_num` has
 # a really large importance in comparison with `random_cat` while we would
-# expect both random features to have a null importance.
+# expect that both random features have a null importance.
 #
 # The fact that we use training set statistics explains why both the
 # `random_num` and `random_cat` features have a non-null importance.
@@ -155,11 +156,11 @@
 # %%
 # As an alternative, the permutation importances of ``rf`` are computed on a
 # held out test set. This shows that the low cardinality categorical feature,
-# `sex` and `pclass` are the most important feature. Indeed, permuting the
-# values of these features will lead to most decrease in accuracy score of the
+# `sex` and `pclass` are the most important features. Indeed, permuting the
+# values of these features will lead to the most decrease in accuracy score of the
 # model on the test set.
 #
-# Also note that both random features have very low importances (close to 0) as
+# Also, note that both random features have very low importances (close to 0) as
 # expected.
 from sklearn.inspection import permutation_importance
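For reference, the permutation importance computation that the updated text describes (and that the added `from sklearn.inspection import permutation_importance` line prepares for) looks roughly like the sketch below. It uses a synthetic dataset as a stand-in; the actual example fits a pipeline on the Titanic data, so the `rf`, `X_test`, and `y_test` names here are illustrative only.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Stand-in data and model; the real example preprocesses the Titanic set.
X, y = make_classification(n_samples=500, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
rf = RandomForestClassifier(random_state=42).fit(X_train, y_train)

# Shuffle each feature n_repeats times on the held-out test set; the importance
# is the mean drop in test accuracy caused by breaking that feature's link to y.
result = permutation_importance(
    rf, X_test, y_test, n_repeats=10, random_state=42, n_jobs=2
)
for i in result.importances_mean.argsort()[::-1]:
    print(
        f"feature_{i}: {result.importances_mean[i]:.3f}"
        f" +/- {result.importances_std[i]:.3f}"
    )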


Diff for: dev/_downloads/f99bb35c32eb8028063a1428c3999b84/plot_permutation_importance.ipynb

+10-3
@@ -58,7 +58,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"## Accuracy of the Model\nPrior to inspecting the feature importances, it is important to check that\nthe model predictive performance is high enough. Indeed there would be little\ninterest of inspecting the important features of a non-predictive model.\n\nHere one can observe that the train accuracy is very high (the forest model\nhas enough capacity to completely memorize the training set) but it can still\ngeneralize well enough to the test set thanks to the built-in bagging of\nrandom forests.\n\nIt might be possible to trade some accuracy on the training set for a\nslightly better accuracy on the test set by limiting the capacity of the\ntrees (for instance by setting ``min_samples_leaf=5`` or\n``min_samples_leaf=10``) so as to limit overfitting while not introducing too\nmuch underfitting.\n\nHowever let's keep our high capacity random forest model for now so as to\nillustrate some pitfalls with feature importance on variables with many\nunique values.\n\n"
+"## Accuracy of the Model\nBefore inspecting the feature importances, it is important to check that\nthe model predictive performance is high enough. Indeed, there would be little\ninterest in inspecting the important features of a non-predictive model.\n\n"
 ]
 },
 {
@@ -76,7 +76,14 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"## Tree's Feature Importance from Mean Decrease in Impurity (MDI)\nThe impurity-based feature importance ranks the numerical features to be the\nmost important features. As a result, the non-predictive ``random_num``\nvariable is ranked as one of the most important features!\n\nThis problem stems from two limitations of impurity-based feature\nimportances:\n\n- impurity-based importances are biased towards high cardinality features;\n- impurity-based importances are computed on training set statistics and\n therefore do not reflect the ability of feature to be useful to make\n predictions that generalize to the test set (when the model has enough\n capacity).\n\nThe bias towards high cardinality features explains why the `random_num` has\na really large importance in comparison with `random_cat` while we would\nexpect both random features to have a null importance.\n\nThe fact that we use training set statistics explains why both the\n`random_num` and `random_cat` features have a non-null importance.\n\n"
+"Here, one can observe that the train accuracy is very high (the forest model\nhas enough capacity to completely memorize the training set) but it can still\ngeneralize well enough to the test set thanks to the built-in bagging of\nrandom forests.\n\nIt might be possible to trade some accuracy on the training set for a\nslightly better accuracy on the test set by limiting the capacity of the\ntrees (for instance by setting ``min_samples_leaf=5`` or\n``min_samples_leaf=10``) so as to limit overfitting while not introducing too\nmuch underfitting.\n\nHowever, let us keep our high capacity random forest model for now so that we can\nillustrate some pitfalls about feature importance on variables with many\nunique values.\n\n"
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"## Tree's Feature Importance from Mean Decrease in Impurity (MDI)\nThe impurity-based feature importance ranks the numerical features to be the\nmost important features. As a result, the non-predictive ``random_num``\nvariable is ranked as one of the most important features!\n\nThis problem stems from two limitations of impurity-based feature\nimportances:\n\n- impurity-based importances are biased towards high cardinality features;\n- impurity-based importances are computed on training set statistics and\n therefore do not reflect the ability of feature to be useful to make\n predictions that generalize to the test set (when the model has enough\n capacity).\n\nThe bias towards high cardinality features explains why the `random_num` has\na really large importance in comparison with `random_cat` while we would\nexpect that both random features have a null importance.\n\nThe fact that we use training set statistics explains why both the\n`random_num` and `random_cat` features have a non-null importance.\n\n"
 ]
 },
 {
@@ -105,7 +112,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"As an alternative, the permutation importances of ``rf`` are computed on a\nheld out test set. This shows that the low cardinality categorical feature,\n`sex` and `pclass` are the most important feature. Indeed, permuting the\nvalues of these features will lead to most decrease in accuracy score of the\nmodel on the test set.\n\nAlso note that both random features have very low importances (close to 0) as\nexpected.\n\n"
+"As an alternative, the permutation importances of ``rf`` are computed on a\nheld out test set. This shows that the low cardinality categorical feature,\n`sex` and `pclass` are the most important features. Indeed, permuting the\nvalues of these features will lead to the most decrease in accuracy score of the\nmodel on the test set.\n\nAlso, note that both random features have very low importances (close to 0) as\nexpected.\n\n"
 ]
 },
 {
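As context for the MDI cells edited above: impurity-based importances are read from the fitted forest's `feature_importances_` attribute, which is computed from training-set statistics only, hence the biases the changed text discusses. A minimal, self-contained sketch, with synthetic data and hypothetical feature names rather than the Titanic columns:

import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Stand-in data; the real example uses the Titanic dataset with added
# random_num / random_cat columns to expose the cardinality bias.
X, y = make_classification(n_samples=300, n_features=5, random_state=0)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]
forest = RandomForestClassifier(random_state=0).fit(X, y)

# feature_importances_ holds the normalized mean decrease in impurity (MDI),
# accumulated over all splits of all trees during training.
mdi = pd.Series(forest.feature_importances_, index=feature_names)
print(mdi.sort_values(ascending=False))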
