Hands-On Practical Examples of Sequential Feature Selection in Python
Example 1 - A simple Sequential Forward Selection example
Initializing a simple classifier from scikit-learn:
Code:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data
y = iris.target
knn = KNeighborsClassifier(n_neighbors=4)
We start by selecting the "best" 3 features from the Iris dataset via Sequential Forward
Selection (SFS). Here, we set forward=True and floating=False. By choosing cv=0, we don't
perform any cross-validation, therefore, the performance (here: 'accuracy') is computed
entirely on the training set.
Code:
from mlxtend.feature_selection import SequentialFeatureSelector as SFS

sfs1 = SFS(knn,
k_features=3,
forward=True,
floating=False,
verbose=2,
scoring='accuracy',
cv=0)
sfs1 = sfs1.fit(X, y)
Output:
Via the subsets_ attribute, we can take a look at the selected feature indices at each step:
Code:
sfs1.subsets_
Output:
Note that the 'feature_names' entry is simply a string representation of the 'feature_idx' in this
case. Optionally, we can provide custom feature names via the fit method's
custom_feature_names parameter:
Code:
Output:
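The cell above is not reproduced here; as a minimal sketch, passing custom names to the fit method could look as follows (the feature names below are illustrative, not necessarily the ones used in the original):

sfs1 = sfs1.fit(X, y, custom_feature_names=('sepal length', 'sepal width',
                                            'petal length', 'petal width'))
sfs1.subsets_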
Furthermore, we can access the indices of the 3 best features directly via the k_feature_idx_
attribute:
Code:
sfs1.k_feature_idx_
Output:
(1, 2, 3)
And similarly, to obtain the names of these features, given that we provided an argument to
the custom_feature_names parameter, we can refer to the
sfs1.k_feature_names_ attribute:
Code:
sfs1.k_feature_names_
Output:
Finally, the prediction score for these 3 features can be accessed via k_score_:
Code:
sfs1.k_score_
Output:
0.9733333333333334
Example 2 - Toggling between SFS, SBS, SFFS, and SBFS
Using the forward and floating parameters, we can toggle between SFS, SBS, SFFS, and
SBFS as shown below. Note that we are performing (stratified) 4-fold cross-validation for
more robust estimates in contrast to Example 1. Via n_jobs=-1, we choose to run the cross-
validation on all our available CPU cores.
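The full listing for this example is not reproduced below; the following is a minimal sketch of the four configurations (reusing the knn classifier and X, y from Example 1, and matching the sfs/sbs names referenced further down):

# Sequential Forward Selection (SFS)
sfs = SFS(knn, k_features=3, forward=True, floating=False,
          scoring='accuracy', cv=4, n_jobs=-1)
sfs = sfs.fit(X, y)

# Sequential Backward Selection (SBS)
sbs = SFS(knn, k_features=3, forward=False, floating=False,
          scoring='accuracy', cv=4, n_jobs=-1)
sbs = sbs.fit(X, y)

# Sequential Forward Floating Selection (SFFS)
sffs = SFS(knn, k_features=3, forward=True, floating=True,
           scoring='accuracy', cv=4, n_jobs=-1)
sffs = sffs.fit(X, y)

# Sequential Backward Floating Selection (SBFS)
sbfs = SFS(knn, k_features=3, forward=False, floating=True,
           scoring='accuracy', cv=4, n_jobs=-1)
sbfs = sbfs.fit(X, y)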
Code:
Output:
/Users/sebastianraschka/miniforge3/lib/python3.9/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.16.5 and <1.23.0 is required for this version of SciPy (detected version 1.23.1)
  warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
(this warning appears several more times in the original output)
In this simple scenario, selecting the best 3 out of the 4 available features in the Iris dataset, we end up with similar results regardless of which sequential selection algorithm we use.
Below, we see the DataFrame of the Sequential Forward Selector from Example 2:
Code:
import pandas as pd
pd.DataFrame.from_dict(sfs.get_metric_dict()).T
Output:
Code:
pd.DataFrame.from_dict(sbs.get_metric_dict()).T
Output:
We can see that both SFS and SBFS found the same "best" 3 features; however, the intermediate steps were obviously different.
The ci_bound column in the DataFrames above represents the confidence interval around the
computed cross-validation scores. By default, a confidence interval of 95% is used, but we
can use different confidence bounds via the confidence_interval parameter. E.g., the
confidence bounds for a 90% confidence interval can be obtained as follows:
Code:
pd.DataFrame.from_dict(sbs.get_metric_dict(confidence_interval=0.90)).T
Output:
Code:
from mlxtend.plotting import plot_sequential_feature_selection as plot_sfs
import matplotlib.pyplot as plt

sfs = SFS(knn,
k_features=4,
forward=True,
floating=False,
scoring='accuracy',
verbose=2,
cv=5)
sfs = sfs.fit(X, y)

fig1 = plot_sfs(sfs.get_metric_dict(), kind='std_dev')

plt.ylim([0.8, 1])
plt.title('Sequential Forward Selection (w. StdDev)')
plt.grid()
plt.show()
Output:
Code:
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression

data = fetch_california_housing()
X, y = data.data, data.target
lr = LinearRegression()
sfs = SFS(lr,
k_features=8,
forward=True,
floating=False,
scoring='neg_mean_squared_error',
cv=10)
sfs = sfs.fit(X, y)
fig = plot_sfs(sfs.get_metric_dict(), kind='std_err')
Code:
import numpy as np
from mlxtend.evaluate import PredefinedHoldoutSplit

iris = load_iris()
X = iris.data
y = iris.target
rng = np.random.RandomState(123)
my_validation_indices = rng.permutation(np.arange(150))[:30]
print(my_validation_indices)
Output:
Code:
knn = KNeighborsClassifier(n_neighbors=4)
piter = PredefinedHoldoutSplit(my_validation_indices)
sfs1 = SFS(knn,
k_features=3,
forward=True,
floating=False,
verbose=2,
scoring='accuracy',
cv=piter)
sfs1 = sfs1.fit(X, y)
Output:
Code:
# assumes X_train, y_train (and X_test, y_test) from a train/test split of the Iris data;
# the corresponding cell is not reproduced here
knn = KNeighborsClassifier(n_neighbors=4)
sfs1 = SFS(knn,
k_features=3,
forward=True,
floating=False,
scoring='accuracy',
cv=5)
sfs1 = sfs1.fit(X_train, y_train)
print('Selected features:', sfs1.k_feature_idx_)
Output:
Code:
X_train_sfs = sfs1.transform(X_train)
X_test_sfs = sfs1.transform(X_test)
Output:
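The follow-up step is not reproduced above; as a minimal sketch, the estimator can then be refit on the reduced training set and evaluated on the correspondingly reduced test set (this assumes the knn, X_train_sfs, X_test_sfs, y_train, and y_test objects from the cells above):

# refit on the selected feature subset and score on the held-out data
knn.fit(X_train_sfs, y_train)
y_pred = knn.predict(X_test_sfs)
acc = (y_test == y_pred).mean()
print('Test set accuracy: %.2f %%' % (acc * 100))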
Code:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=123)
Code:
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

knn1 = KNeighborsClassifier()
knn2 = KNeighborsClassifier()
sfs1 = SFS(estimator=knn1,
k_features=3,
forward=True,
floating=False,
scoring='accuracy',
cv=5)

# combine the feature selector and the final classifier in a pipeline
pipe = Pipeline([('sfs', sfs1),
                 ('knn2', knn2)])

param_grid = {
'sfs__k_features': [1, 2, 3],
'sfs__estimator__n_neighbors': [3, 4, 7], # inner knn
'knn2__n_neighbors': [3, 4, 7] # outer knn
}
gs = GridSearchCV(estimator=pipe,
param_grid=param_grid,
scoring='accuracy',
n_jobs=1,
cv=5,
refit=False)
# run gridsearch
gs = gs.fit(X_train, y_train)
Code:
Output:
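The cell that inspects the search results is not reproduced above; a minimal sketch of querying the fitted GridSearchCV object:

print('Best parameters via GridSearch:', gs.best_params_)
print('Best CV accuracy: %.3f' % gs.best_score_)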
Code:
pipe.set_params(**gs.best_params_).fit(X_train, y_train)
Output:
Pipeline(steps=[('sfs',
SequentialFeatureSelector(estimator=KNeighborsClassifier(n_neighbors=3),
k_features=3,
scoring='accuracy')),
('knn2', KNeighborsClassifier(n_neighbors=7))])
If k_features is set to a tuple (min_k, max_k) (new in 0.4.2), the SFS will now select the best
feature combination that it discovered by iterating from k=1 to max_k (forward), or from
max_k to min_k (backward). The size of the returned feature subset then lies between min_k
and max_k, depending on which combination scored best during cross-validation.
Code:
X.shape
Output:
(150, 4)
Code:
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.neighbors import KNeighborsClassifier
from mlxtend.data import wine_data
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
X, y = wine_data()
X_train, X_test, y_train, y_test= train_test_split(X, y,
stratify=y,
test_size=0.3,
random_state=1)
knn = KNeighborsClassifier(n_neighbors=2)
sfs1 = SFS(estimator=knn,
k_features=(3, 10),
forward=True,
floating=False,
scoring='accuracy',
cv=5)

# combine scaling and feature selection in a pipeline
pipe = make_pipeline(StandardScaler(), sfs1)
pipe.fit(X_train, y_train)

print('all subsets:\n', sfs1.subsets_)
Output:
all subsets:
{1: {'feature_idx': (6,), 'cv_scores': array([0.84 , 0.64 , 0.84 , 0.8 ,
0.875]), 'avg_score': 0.799, 'feature_names': ('6',)}, 2: {'feature_idx':
(6, 9), 'cv_scores': array([0.92 , 0.88 , 1. , 0.96 ,
0.91666667]), 'avg_score': 0.9353333333333333, 'feature_names': ('6',
'9')}, 3: {'feature_idx': (6, 9, 12), 'cv_scores': array([0.92 , 0.92
, 0.96 , 1. , 0.95833333]), 'avg_score': 0.9516666666666665,
'feature_names': ('6', '9', '12')}, 4: {'feature_idx': (3, 6, 9, 12),
'cv_scores': array([0.96 , 0.96 , 0.96 , 1. ,
0.95833333]), 'avg_score': 0.9676666666666666, 'feature_names': ('3', '6',
'9', '12')}, 5: {'feature_idx': (3, 6, 9, 10, 12), 'cv_scores':
array([0.92, 0.96, 1. , 1. , 1. ]), 'avg_score': 0.976, 'feature_names':
('3', '6', '9', '10', '12')}, 6: {'feature_idx': (2, 3, 6, 9, 10, 12),
'cv_scores': array([0.92, 0.96, 1. , 0.96, 1. ]), 'avg_score': 0.968,
'feature_names': ('2', '3', '6', '9', '10', '12')}, 7: {'feature_idx': (0,
2, 3, 6, 9, 10, 12), 'cv_scores': array([0.92, 0.92, 1. , 1. , 1. ]),
'avg_score': 0.968, 'feature_names': ('0', '2', '3', '6', '9', '10',
'12')}, 8: {'feature_idx': (0, 2, 3, 6, 8, 9, 10, 12), 'cv_scores':
array([1. , 0.92, 1. , 1. , 1. ]), 'avg_score': 0.984, 'feature_names':
('0', '2', '3', '6', '8', '9', '10', '12')}, 9: {'feature_idx': (0, 2, 3,
6, 8, 9, 10, 11, 12), 'cv_scores': array([1. , 0.92, 1. , 1. , 1. ]),
'avg_score': 0.984, 'feature_names': ('0', '2', '3', '6', '8', '9', '10',
'11', '12')}, 10: {'feature_idx': (0, 1, 2, 3, 6, 8, 9, 10, 11, 12),
'cv_scores': array([1. , 0.96, 1. , 1. , 1. ]), 'avg_score': 0.992,
'feature_names': ('0', '1', '2', '3', '6', '8', '9', '10', '11', '12')}}
Code:
import numpy as np
from mlxtend.data import iris_data

X, y = iris_data()
groups = np.arange(len(y)) // 10
print('groups: {}'.format(groups))
Output:
groups: [ 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 2 2 2 2
2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4
4 4 5 5 5 5 5 5 5 5 5 5 6 6 6 6 6 6 6 6 6 6 7 7
7 7 7 7 7 7 7 7 8 8 8 8 8 8 8 8 8 8 9 9 9 9 9 9
9 9 9 9 10 10 10 10 10 10 10 10 10 10 11 11 11 11 11 11 11 11 11 11
12 12 12 12 12 12 12 12 12 12 13 13 13 13 13 13 13 13 13 13 14 14 14 14
14 14 14 14 14 14]
Calling the split() method of a scikit-learn cross-validator object returns a generator that
yields the train and test indices for each split.
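The cell that constructs such a generator is not reproduced below; a minimal sketch, assuming scikit-learn's GroupKFold (the exact splitter used in the original is not shown):

from sklearn.model_selection import GroupKFold

# any splitter that accepts a groups argument would work here
cv_gen = GroupKFold(n_splits=2).split(X, y, groups)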
Code:
cv = list(cv_gen)
knn = KNeighborsClassifier(n_neighbors=2)
sfs = SFS(estimator=knn,
k_features=2,
scoring='accuracy',
cv=cv)
sfs.fit(X, y)
Output:
Toy dataset
Code:
from sklearn.datasets import make_classification

X, y = make_classification(
n_samples=20000,
n_features=500,
n_informative=10,
n_redundant=40,
n_repeated=25,
n_clusters_per_class=5,
flip_y=0.05,
class_sep=0.5,
random_state=123,
)
Output:
/Users/sebastianraschka/miniforge3/lib/python3.9/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.16.5 and <1.23.0 is required for this version of SciPy (detected version 1.23.3)
  warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
Code:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
sfs1 = SFS(model,
k_features=10,
forward=True,
floating=False,
verbose=2,
scoring='accuracy',
cv=5)

# in the original example this fit is started and then stopped early
# (e.g., via a KeyboardInterrupt) before all 10 features are selected
sfs1 = sfs1.fit(X, y)
Output:
Note that the feature selection run hasn't finished, so certain attributes may not be available
yet. In order to use the SFS instance, it is recommended to call finalize_fit, which makes the
SFS estimator appear as "fitted" and processes the temporary results:
Code:
sfs1.finalize_fit()
print(sfs1.k_feature_idx_)
print(sfs1.k_score_)
Output:
(128, 160)
0.6256875000000001
Optionally, we can also use pandas DataFrames and pandas Series as input to the fit
function. In this case, the column names of the pandas DataFrame will be used as feature
names. However, note that if custom_feature_names are provided in the fit function,
these custom_feature_names take precedence over the DataFrame column-based
feature names.
Code:
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
iris = load_iris()
X = iris.data
y = iris.target
knn = KNeighborsClassifier(n_neighbors=4)
sfs1 = SFS(knn,
k_features=3,
forward=True,
floating=False,
scoring='accuracy',
cv=0)
Code:
Output:
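The cell above is not reproduced here; as a minimal sketch, a DataFrame can be built from the feature matrix like this (the column labels via iris.feature_names are an assumption, not necessarily the names used in the original):

X_df = pd.DataFrame(X, columns=iris.feature_names)
X_df.head()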
Code:
y_series = pd.Series(y)
y_series.head()
Output:
0 0
1 0
2 0
3 0
4 0
dtype: int64
Code:
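The fit on the DataFrame and Series inputs is likewise not reproduced above; a one-line sketch using the X_df and y_series objects defined above:

sfs1 = sfs1.fit(X_df, y_series)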
Note that the only difference when passing a pandas DataFrame as input is that the
'feature_names' entries in sfs1.subsets_ now contain the actual column names instead of
string indices:
Code:
sfs1.subsets_
Output:
The fixed_features parameter allows certain features to be pre-selected so that they are
guaranteed to be part of every candidate subset. Note that this works for all options regarding
forward and backward selection, and whether or not floating selection is used.
The example below illustrates how we can set the features 0 and 2 in the dataset as fixed:
Code:
iris = load_iris()
X = iris.data
y = iris.target
knn = KNeighborsClassifier(n_neighbors=3)
sfs1 = SFS(knn,
k_features=4,
forward=True,
floating=False,
verbose=2,
scoring='accuracy',
fixed_features=(0, 2),
cv=3)
sfs1 = sfs1.fit(X, y)
Output:
Code:
sfs1.subsets_
Output:
If the input dataset is a pandas DataFrame, we can also use the column names directly:
Code:
import pandas as pd
X_df = pd.DataFrame(X, columns=['sepal len', 'petal len',
'sepal width', 'petal width'])
X_df.head()
Output:
Code:
sfs2 = SFS(knn,
k_features=4,
forward=True,
floating=False,
verbose=2,
scoring='accuracy',
fixed_features=('sepal len', 'petal len'),
cv=3)

sfs2 = sfs2.fit(X_df, y)
Output:
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.0s finished
Code:
sfs2.subsets_
Output:
Code:
iris = load_iris()
X = iris.data
y = iris.target
Output:
The feature_groups parameter lets us treat several columns as a single unit during selection,
so that the features within a group are always added or removed together. Below, 'sepal len'
and 'sepal wid' form one group, while the two petal measurements remain individual groups:
Code:
knn = KNeighborsClassifier(n_neighbors=3)
sfs1 = SFS(knn,
k_features=2,
scoring='accuracy',
feature_groups=(['sepal len', 'sepal wid'], ['petal len'],
['petal wid']),
cv=3)
Code:
sfs1 = sfs1.fit(X_df, y)
Equivalently, when working with the NumPy array instead of the DataFrame, the groups can
be specified via column indices:
Code:
sfs1 = SFS(knn,
k_features=2,
scoring='accuracy',
feature_groups=[[0, 2], [1], [3]],
cv=3)
sfs1 = sfs1.fit(X, y)
API
SequentialFeatureSelector(estimator, k_features=1, forward=True, floating=False,
verbose=0, scoring=None, cv=5, n_jobs=1, pre_dispatch='2*n_jobs', clone_estimator=True,
fixed_features=None, feature_groups=None)
Methods
get_metric_dict(confidence_interval=0.95)
get_params(deep=True)
set_params(**params)
Set the parameters of this estimator. Valid parameter keys can be listed with get_params().
Returns
self
transform(X)
Properties
named_estimators
Returns
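As a compact reference, a short usage sketch tying the calls listed above together (not part of the original API listing):

from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

sfs = SFS(KNeighborsClassifier(n_neighbors=4), k_features=3,
          forward=True, floating=False, scoring='accuracy', cv=5)
sfs = sfs.fit(X, y)

sfs.k_feature_idx_                 # indices of the selected features
sfs.k_score_                       # cross-validation score of that subset
X_selected = sfs.transform(X)      # reduce X to the selected columns
metrics = sfs.get_metric_dict(confidence_interval=0.95)  # per-step results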