|
4 | 4 | "cell_type": "markdown",
|
5 | 5 | "metadata": {},
|
6 | 6 | "source": [
|
7 |
| - "\n# Post-tuning the decision threshold for cost-sensitive learning\n\nOnce a classifier is trained, the output of the :term:`predict` method outputs class\nlabel predictions corresponding to a thresholding of either the :term:`decision\nfunction` or the :term:`predict_proba` output. For a binary classifier, the default\nthreshold is defined as a posterior probability estimate of 0.5 or a decision score of\n0.0.\n\nHowever, this default strategy is most likely not optimal for the task at hand.\nHere, we use the \"Statlog\" German credit dataset [1]_ to illustrate a use case.\nIn this dataset, the task is to predict whether a person has a \"good\" or \"bad\" credit.\nIn addition, a cost-matrix is provided that specifies the cost of\nmisclassification. Specifically, misclassifying a \"bad\" credit as \"good\" is five\ntimes more costly on average than misclassifying a \"good\" credit as \"bad\".\n\nWe use the :class:`~sklearn.model_selection.TunedThresholdClassifierCV` to select the\ncut-off point of the decision function that minimizes the provided business\ncost.\n\nIn the second part of the example, we further extend this approach by\nconsidering the problem of fraud detection in credit card transactions: in this\ncase, the business metric depends on the amount of each individual transaction.\n.. topic:: References\n\n .. [1] \"Statlog (German Credit Data) Data Set\", UCI Machine Learning Repository,\n [Link](https://archive.ics.uci.edu/ml/datasets/Statlog+%28German+Credit+Data%29).\n\n .. [2] [Charles Elkan, \"The Foundations of Cost-Sensitive Learning\",\n International joint conference on artificial intelligence.\n Vol. 17. No. 1. Lawrence Erlbaum Associates Ltd, 2001.](https://cseweb.ucsd.edu/~elkan/rescale.pdf)\n" |
| 7 | + "\n# Post-tuning the decision threshold for cost-sensitive learning\n\nOnce a classifier is trained, the output of the :term:`predict` method outputs class\nlabel predictions corresponding to a thresholding of either the\n:term:`decision_function` or the :term:`predict_proba` output. For a binary classifier,\nthe default threshold is defined as a posterior probability estimate of 0.5 or a\ndecision score of 0.0.\n\nHowever, this default strategy is most likely not optimal for the task at hand.\nHere, we use the \"Statlog\" German credit dataset [1]_ to illustrate a use case.\nIn this dataset, the task is to predict whether a person has a \"good\" or \"bad\" credit.\nIn addition, a cost-matrix is provided that specifies the cost of\nmisclassification. Specifically, misclassifying a \"bad\" credit as \"good\" is five\ntimes more costly on average than misclassifying a \"good\" credit as \"bad\".\n\nWe use the :class:`~sklearn.model_selection.TunedThresholdClassifierCV` to select the\ncut-off point of the decision function that minimizes the provided business\ncost.\n\nIn the second part of the example, we further extend this approach by\nconsidering the problem of fraud detection in credit card transactions: in this\ncase, the business metric depends on the amount of each individual transaction.\n\n.. rubric :: References\n\n.. [1] \"Statlog (German Credit Data) Data Set\", UCI Machine Learning Repository,\n [Link](https://archive.ics.uci.edu/ml/datasets/Statlog+%28German+Credit+Data%29).\n\n.. [2] [Charles Elkan, \"The Foundations of Cost-Sensitive Learning\",\n International joint conference on artificial intelligence.\n Vol. 17. No. 1. Lawrence Erlbaum Associates Ltd, 2001.](https://cseweb.ucsd.edu/~elkan/rescale.pdf)\n" |
8 | 8 | ]
|
9 | 9 | },
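For context on the estimator introduced above: a minimal, self-contained sketch of how `TunedThresholdClassifierCV` (scikit-learn 1.5+) post-tunes the decision threshold, on synthetic data rather than the notebook's dataset:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import TunedThresholdClassifierCV

# Imbalanced toy problem where the default 0.5 cut-off is unlikely to be optimal.
X, y = make_classification(n_samples=1_000, weights=[0.9, 0.1], random_state=0)

# The wrapper cross-validates internally to find the cut-off that maximizes
# the given objective metric, then exposes it as `best_threshold_`.
tuned = TunedThresholdClassifierCV(
    LogisticRegression(), scoring="balanced_accuracy"
).fit(X, y)
print(f"Tuned cut-off: {tuned.best_threshold_:.3f}")  # instead of the default 0.5
```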
|
10 | 10 | {
|
|
422 | 422 | },
|
423 | 423 | "outputs": [],
|
424 | 424 | "source": [
|
425 |
| - "fraud = target == 1\namount_fraud = data[\"Amount\"][fraud]\n_, ax = plt.subplots()\nax.hist(amount_fraud, bins=100)\nax.set_title(\"Amount of fraud transaction\")\n_ = ax.set_xlabel(\"Amount ($)\")" |
| 425 | + "fraud = target == 1\namount_fraud = data[\"Amount\"][fraud]\n_, ax = plt.subplots()\nax.hist(amount_fraud, bins=100)\nax.set_title(\"Amount of fraud transaction\")\n_ = ax.set_xlabel(\"Amount (\u20ac)\")" |
426 | 426 | ]
|
427 | 427 | },
|
428 | 428 | {
|
429 | 429 | "cell_type": "markdown",
|
430 | 430 | "metadata": {},
|
431 | 431 | "source": [
|
432 |
| - "### Addressing the problem with a business metric\n\nNow, we create the business metric that depends on the amount of each transaction. We\ndefine the cost matrix similarly to [2]_. Accepting a legitimate transaction provides\na gain of 2% of the amount of the transaction. However, accepting a fraudulent\ntransaction result in a loss of the amount of the transaction. As stated in [2]_, the\ngain and loss related to refusals (of fraudulent and legitimate transactions) are not\ntrivial to define. Here, we define that a refusal of a legitimate transaction is\nestimated to a loss of $5 while the refusal of a fraudulent transaction is estimated\nto a gain of $50 dollars and the amount of the transaction. Therefore, we define the\nfollowing function to compute the total benefit of a given decision:\n\n" |
| 432 | + "### Addressing the problem with a business metric\n\nNow, we create the business metric that depends on the amount of each transaction. We\ndefine the cost matrix similarly to [2]_. Accepting a legitimate transaction provides\na gain of 2% of the amount of the transaction. However, accepting a fraudulent\ntransaction result in a loss of the amount of the transaction. As stated in [2]_, the\ngain and loss related to refusals (of fraudulent and legitimate transactions) are not\ntrivial to define. Here, we define that a refusal of a legitimate transaction is\nestimated to a loss of 5\u20ac while the refusal of a fraudulent transaction is estimated\nto a gain of 50\u20ac and the amount of the transaction. Therefore, we define the\nfollowing function to compute the total benefit of a given decision:\n\n" |
433 | 433 | ]
|
434 | 434 | },
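The cell defining the metric itself is not part of this hunk. Below is a sketch consistent with the rules stated above (class 1 flags fraud, and predicting 1 means refusing the transaction); the function body and the metadata-routing setup are reconstructions rather than the notebook's verbatim code, though the names `business_metric` and `business_scorer` match the later cells:

```python
import sklearn
from sklearn.metrics import make_scorer

def business_metric(y_true, y_pred, amount):
    """Total benefit in euro of the decisions in `y_pred`."""
    accepted_legit = (y_true == 0) & (y_pred == 0)
    accepted_fraud = (y_true == 1) & (y_pred == 0)
    refused_legit = (y_true == 0) & (y_pred == 1)
    refused_fraud = (y_true == 1) & (y_pred == 1)
    return (
        (amount[accepted_legit] * 0.02).sum()  # 2% gain on each accepted legitimate
        - amount[accepted_fraud].sum()         # full loss on each accepted fraud
        - 5 * refused_legit.sum()              # 5 euro loss per refused legitimate
        + (50 + amount[refused_fraud]).sum()   # 50 euro + amount per refused fraud
    )

# Metadata routing lets the scorer receive the per-sample `amount` at scoring
# time, which is how `business_scorer(model, X, y, amount=...)` works below.
sklearn.set_config(enable_metadata_routing=True)
business_scorer = make_scorer(business_metric).set_score_request(amount=True)
```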
|
435 | 435 | {
|
|
505 | 505 | },
|
506 | 506 | "outputs": [],
|
507 | 507 | "source": [
|
508 |
| - "from sklearn.dummy import DummyClassifier\n\neasy_going_classifier = DummyClassifier(strategy=\"constant\", constant=0)\neasy_going_classifier.fit(data_train, target_train)\nbenefit_cost = business_scorer(\n easy_going_classifier, data_test, target_test, amount=amount_test\n)\nprint(f\"Benefit/cost of our easy-going classifier: ${benefit_cost:,.2f}\")" |
| 508 | + "from sklearn.dummy import DummyClassifier\n\neasy_going_classifier = DummyClassifier(strategy=\"constant\", constant=0)\neasy_going_classifier.fit(data_train, target_train)\nbenefit_cost = business_scorer(\n easy_going_classifier, data_test, target_test, amount=amount_test\n)\nprint(f\"Benefit/cost of our easy-going classifier: {benefit_cost:,.2f}\u20ac\")" |
509 | 509 | ]
|
510 | 510 | },
|
511 | 511 | {
|
512 | 512 | "cell_type": "markdown",
|
513 | 513 | "metadata": {},
|
514 | 514 | "source": [
|
515 |
| - "A classifier that predict all transactions as legitimate would create a profit of\naround $220,000. We make the same evaluation for a classifier that predicts all\ntransactions as fraudulent.\n\n" |
| 515 | + "A classifier that predict all transactions as legitimate would create a profit of\naround 220,000.\u20ac We make the same evaluation for a classifier that predicts all\ntransactions as fraudulent.\n\n" |
516 | 516 | ]
|
517 | 517 | },
|
518 | 518 | {
|
|
523 | 523 | },
|
524 | 524 | "outputs": [],
|
525 | 525 | "source": [
|
526 |
| - "intolerant_classifier = DummyClassifier(strategy=\"constant\", constant=1)\nintolerant_classifier.fit(data_train, target_train)\nbenefit_cost = business_scorer(\n intolerant_classifier, data_test, target_test, amount=amount_test\n)\nprint(f\"Benefit/cost of our intolerant classifier: ${benefit_cost:,.2f}\")" |
| 526 | + "intolerant_classifier = DummyClassifier(strategy=\"constant\", constant=1)\nintolerant_classifier.fit(data_train, target_train)\nbenefit_cost = business_scorer(\n intolerant_classifier, data_test, target_test, amount=amount_test\n)\nprint(f\"Benefit/cost of our intolerant classifier: {benefit_cost:,.2f}\u20ac\")" |
527 | 527 | ]
|
528 | 528 | },
|
529 | 529 | {
|
530 | 530 | "cell_type": "markdown",
|
531 | 531 | "metadata": {},
|
532 | 532 | "source": [
|
533 |
| - "Such a classifier create a loss of around $670,000. A predictive model should allow\nus to make a profit larger than $220,000. It is interesting to compare this business\nmetric with another \"standard\" statistical metric such as the balanced accuracy.\n\n" |
| 533 | + "Such a classifier create a loss of around 670,000.\u20ac A predictive model should allow\nus to make a profit larger than 220,000.\u20ac It is interesting to compare this business\nmetric with another \"standard\" statistical metric such as the balanced accuracy.\n\n" |
534 | 534 | ]
|
535 | 535 | },
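`balanced_accuracy_scorer`, used in the next hunk, is also defined outside the lines shown; one plausible construction via scikit-learn's scorer registry:

```python
from sklearn.metrics import get_scorer

# Retrieve the prebuilt scorer so it can be called the same way as the
# business scorer, i.e. balanced_accuracy_scorer(model, data_test, target_test).
balanced_accuracy_scorer = get_scorer("balanced_accuracy")
```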
|
536 | 536 | {
|
|
559 | 559 | },
|
560 | 560 | "outputs": [],
|
561 | 561 | "source": [
|
562 |
| - "from sklearn.linear_model import LogisticRegression\nfrom sklearn.model_selection import GridSearchCV\nfrom sklearn.pipeline import make_pipeline\nfrom sklearn.preprocessing import StandardScaler\n\nlogistic_regression = make_pipeline(StandardScaler(), LogisticRegression())\nparam_grid = {\"logisticregression__C\": np.logspace(-6, 6, 13)}\nmodel = GridSearchCV(logistic_regression, param_grid, scoring=\"neg_log_loss\").fit(\n data_train, target_train\n)\n\nprint(\n \"Benefit/cost of our logistic regression: \"\n f\"${business_scorer(model, data_test, target_test, amount=amount_test):,.2f}\"\n)\nprint(\n \"Balanced accuracy of our logistic regression: \"\n f\"{balanced_accuracy_scorer(model, data_test, target_test):.3f}\"\n)" |
| 562 | + "from sklearn.linear_model import LogisticRegression\nfrom sklearn.model_selection import GridSearchCV\nfrom sklearn.pipeline import make_pipeline\nfrom sklearn.preprocessing import StandardScaler\n\nlogistic_regression = make_pipeline(StandardScaler(), LogisticRegression())\nparam_grid = {\"logisticregression__C\": np.logspace(-6, 6, 13)}\nmodel = GridSearchCV(logistic_regression, param_grid, scoring=\"neg_log_loss\").fit(\n data_train, target_train\n)\n\nprint(\n \"Benefit/cost of our logistic regression: \"\n f\"{business_scorer(model, data_test, target_test, amount=amount_test):,.2f}\u20ac\"\n)\nprint(\n \"Balanced accuracy of our logistic regression: \"\n f\"{balanced_accuracy_scorer(model, data_test, target_test):.3f}\"\n)" |
563 | 563 | ]
|
564 | 564 | },
|
565 | 565 | {
|
|
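The diff jumps from the grid-searched `model` to evaluating `tuned_model`; the cell constructing it is not shown. One way it could be built, assuming `amount_train` holds the transaction amounts of the training split:

```python
from sklearn.model_selection import TunedThresholdClassifierCV

# Tune the cut-off of the grid-searched pipeline against the business scorer;
# the routed `amount` reaches the scorer during the internal cross-validation.
tuned_model = TunedThresholdClassifierCV(
    estimator=model,
    scoring=business_scorer,
    thresholds=100,
    n_jobs=2,
)
tuned_model.fit(data_train, target_train, amount=amount_train)
```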
606 | 606 | },
|
607 | 607 | "outputs": [],
|
608 | 608 | "source": [
|
609 |
| - "print(\n \"Benefit/cost of our logistic regression: \"\n f\"${business_scorer(tuned_model, data_test, target_test, amount=amount_test):,.2f}\"\n)\nprint(\n \"Balanced accuracy of our logistic regression: \"\n f\"{balanced_accuracy_scorer(tuned_model, data_test, target_test):.3f}\"\n)" |
| 609 | + "print(\n \"Benefit/cost of our logistic regression: \"\n f\"{business_scorer(tuned_model, data_test, target_test, amount=amount_test):,.2f}\u20ac\"\n)\nprint(\n \"Balanced accuracy of our logistic regression: \"\n f\"{balanced_accuracy_scorer(tuned_model, data_test, target_test):.3f}\"\n)" |
610 | 610 | ]
|
611 | 611 | },
|
612 | 612 | {
|
|
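`model_fixed_threshold`, evaluated in the next hunk, is likewise defined outside the lines shown. A sketch using `FixedThresholdClassifier`; reusing the tuned cut-off here is purely illustrative, and the notebook may pin a different value:

```python
from sklearn.model_selection import FixedThresholdClassifier

# Apply a hand-picked cut-off without any tuning; fitting refits a clone of
# the wrapped pipeline and then thresholds its scores at `threshold`.
model_fixed_threshold = FixedThresholdClassifier(
    estimator=model, threshold=tuned_model.best_threshold_
).fit(data_train, target_train)
```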
635 | 635 | },
|
636 | 636 | "outputs": [],
|
637 | 637 | "source": [
|
638 |
| - "business_score = business_scorer(\n model_fixed_threshold, data_test, target_test, amount=amount_test\n)\nprint(f\"Benefit/cost of our logistic regression: ${business_score:,.2f}\")\nprint(\n \"Balanced accuracy of our logistic regression: \"\n f\"{balanced_accuracy_scorer(model_fixed_threshold, data_test, target_test):.3f}\"\n)" |
| 638 | + "business_score = business_scorer(\n model_fixed_threshold, data_test, target_test, amount=amount_test\n)\nprint(f\"Benefit/cost of our logistic regression: {business_score:,.2f}\u20ac\")\nprint(\n \"Balanced accuracy of our logistic regression: \"\n f\"{balanced_accuracy_scorer(model_fixed_threshold, data_test, target_test):.3f}\"\n)" |
639 | 639 | ]
|
640 | 640 | },
|
641 | 641 | {
|
|