.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/feature_selection/plot_feature_selection.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_auto_examples_feature_selection_plot_feature_selection.py>`
        to download the full example code. or to run this example in your browser via JupyterLite or Binder

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_feature_selection_plot_feature_selection.py:


============================
Univariate Feature Selection
============================

This notebook is an example of using univariate feature selection
to improve classification accuracy on a noisy dataset.

In this example, some noisy (non informative) features are added to
the iris dataset. Support vector machine (SVM) is used to classify the
dataset both before and after applying univariate feature selection.
For each feature, we plot the p-values for the univariate feature selection
and the corresponding weights of SVMs. With this, we will compare model
accuracy and examine the impact of univariate feature selection on model
weights.

.. GENERATED FROM PYTHON SOURCE LINES 18-22

.. code-block:: Python


    # Authors: The scikit-learn developers
    # SPDX-License-Identifier: BSD-3-Clause








.. GENERATED FROM PYTHON SOURCE LINES 23-26

Generate sample data
--------------------


.. GENERATED FROM PYTHON SOURCE LINES 26-43

.. code-block:: Python

    import numpy as np

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split

    # The iris dataset
    X, y = load_iris(return_X_y=True)

    # Some noisy data not correlated
    E = np.random.RandomState(42).uniform(0, 0.1, size=(X.shape[0], 20))

    # Add the noisy data to the informative features
    X = np.hstack((X, E))

    # Split dataset to select feature and evaluate the classifier
    X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)








.. GENERATED FROM PYTHON SOURCE LINES 44-50

Univariate feature selection
----------------------------

Univariate feature selection with F-test for feature scoring.
We use the default selection function to select
the four most significant features.

.. GENERATED FROM PYTHON SOURCE LINES 50-57

.. code-block:: Python

    from sklearn.feature_selection import SelectKBest, f_classif

    selector = SelectKBest(f_classif, k=4)
    selector.fit(X_train, y_train)
    scores = -np.log10(selector.pvalues_)
    scores /= scores.max()








.. GENERATED FROM PYTHON SOURCE LINES 58-69

.. code-block:: Python

    import matplotlib.pyplot as plt

    X_indices = np.arange(X.shape[-1])
    plt.figure(1)
    plt.clf()
    plt.bar(X_indices - 0.05, scores, width=0.2)
    plt.title("Feature univariate score")
    plt.xlabel("Feature number")
    plt.ylabel(r"Univariate score ($-Log(p_{value})$)")
    plt.show()




.. image-sg:: /auto_examples/feature_selection/images/sphx_glr_plot_feature_selection_001.png
   :alt: Feature univariate score
   :srcset: /auto_examples/feature_selection/images/sphx_glr_plot_feature_selection_001.png
   :class: sphx-glr-single-img





.. GENERATED FROM PYTHON SOURCE LINES 70-73

In the total set of features, only the 4 of the original features are significant.
We can see that they have the highest score with univariate feature
selection.

.. GENERATED FROM PYTHON SOURCE LINES 75-79

Compare with SVMs
-----------------

Without univariate feature selection

.. GENERATED FROM PYTHON SOURCE LINES 79-94

.. code-block:: Python

    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import MinMaxScaler
    from sklearn.svm import LinearSVC

    clf = make_pipeline(MinMaxScaler(), LinearSVC())
    clf.fit(X_train, y_train)
    print(
        "Classification accuracy without selecting features: {:.3f}".format(
            clf.score(X_test, y_test)
        )
    )

    svm_weights = np.abs(clf[-1].coef_).sum(axis=0)
    svm_weights /= svm_weights.sum()





.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Classification accuracy without selecting features: 0.789




.. GENERATED FROM PYTHON SOURCE LINES 95-96

After univariate feature selection

.. GENERATED FROM PYTHON SOURCE LINES 96-107

.. code-block:: Python

    clf_selected = make_pipeline(SelectKBest(f_classif, k=4), MinMaxScaler(), LinearSVC())
    clf_selected.fit(X_train, y_train)
    print(
        "Classification accuracy after univariate feature selection: {:.3f}".format(
            clf_selected.score(X_test, y_test)
        )
    )

    svm_weights_selected = np.abs(clf_selected[-1].coef_).sum(axis=0)
    svm_weights_selected /= svm_weights_selected.sum()





.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Classification accuracy after univariate feature selection: 0.868




.. GENERATED FROM PYTHON SOURCE LINES 108-128

.. code-block:: Python

    plt.bar(
        X_indices - 0.45, scores, width=0.2, label=r"Univariate score ($-Log(p_{value})$)"
    )

    plt.bar(X_indices - 0.25, svm_weights, width=0.2, label="SVM weight")

    plt.bar(
        X_indices[selector.get_support()] - 0.05,
        svm_weights_selected,
        width=0.2,
        label="SVM weights after selection",
    )

    plt.title("Comparing feature selection")
    plt.xlabel("Feature number")
    plt.yticks(())
    plt.axis("tight")
    plt.legend(loc="upper right")
    plt.show()




.. image-sg:: /auto_examples/feature_selection/images/sphx_glr_plot_feature_selection_002.png
   :alt: Comparing feature selection
   :srcset: /auto_examples/feature_selection/images/sphx_glr_plot_feature_selection_002.png
   :class: sphx-glr-single-img





.. GENERATED FROM PYTHON SOURCE LINES 129-134

Without univariate feature selection, the SVM assigns a large weight
to the first 4 original significant features, but also selects many of the
non-informative features. Applying univariate feature selection before
the SVM increases the SVM weight attributed to the significant features,
and will thus improve classification.


.. rst-class:: sphx-glr-timing

   **Total running time of the script:** (0 minutes 0.217 seconds)


.. _sphx_glr_download_auto_examples_feature_selection_plot_feature_selection.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: binder-badge

      .. image:: images/binder_badge_logo.svg
        :target: https://mybinder.org/v2/gh/scikit-learn/scikit-learn/1.6.X?urlpath=lab/tree/notebooks/auto_examples/feature_selection/plot_feature_selection.ipynb
        :alt: Launch binder
        :width: 150 px

    .. container:: lite-badge

      .. image:: images/jupyterlite_badge_logo.svg
        :target: ../../lite/lab/index.html?path=auto_examples/feature_selection/plot_feature_selection.ipynb
        :alt: Launch JupyterLite
        :width: 150 px

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: plot_feature_selection.ipynb <plot_feature_selection.ipynb>`

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: plot_feature_selection.py <plot_feature_selection.py>`

    .. container:: sphx-glr-download sphx-glr-download-zip

      :download:`Download zipped: plot_feature_selection.zip <plot_feature_selection.zip>`


.. include:: plot_feature_selection.recommendations


.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_