Commit d077c52

Pushing the docs to 1.5/ for branch: 1.5.X, commit 5491dc695dbe2c9bec3452be5f3c409706ff7ee7
1 parent 26eafd6 commit d077c52

File tree

2,478 files changed

+1309246
-320971
lines changed

Diff for: 1.5/.buildinfo

+1 -1
@@ -1,4 +1,4 @@
 # Sphinx build info version 1
 # This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done.
-config: 040ab5c2be3ae4d05035263b4622b4bb
+config: 2907958f5ac8f0a65d5062369d2dcd04
 tags: 645f666f9bcd5a90fca523b33c5a78b7

Diff for: 1.5/_downloads/0785ea6d45bde062e5beedda88131215/plot_release_highlights_1_3_0.ipynb

+2 -2
@@ -72,7 +72,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"## New display `model_selection.ValidationCurveDisplay`\n:class:`model_selection.ValidationCurveDisplay` is now available to plot results\nfrom :func:`model_selection.validation_curve`.\n\n"
+"## New display :class:`~model_selection.ValidationCurveDisplay`\n:class:`model_selection.ValidationCurveDisplay` is now available to plot results\nfrom :func:`model_selection.validation_curve`.\n\n"
 ]
 },
 {
@@ -108,7 +108,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"## Grouping infrequent categories in :class:`preprocessing.OrdinalEncoder`\nSimilarly to :class:`preprocessing.OneHotEncoder`, the class\n:class:`preprocessing.OrdinalEncoder` now supports aggregating infrequent categories\ninto a single output for each feature. The parameters to enable the gathering of\ninfrequent categories are `min_frequency` and `max_categories`.\nSee the `User Guide <encoder_infrequent_categories>` for more details.\n\n"
+"## Grouping infrequent categories in :class:`~preprocessing.OrdinalEncoder`\nSimilarly to :class:`preprocessing.OneHotEncoder`, the class\n:class:`preprocessing.OrdinalEncoder` now supports aggregating infrequent categories\ninto a single output for each feature. The parameters to enable the gathering of\ninfrequent categories are `min_frequency` and `max_categories`.\nSee the `User Guide <encoder_infrequent_categories>` for more details.\n\n"
 ]
 },
 {
Binary file not shown.

Diff for: 1.5/_downloads/133f2198d3ab792c75b39a63b0a99872/plot_cost_sensitive_learning.ipynb

+10 -10
@@ -4,7 +4,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"\n# Post-tuning the decision threshold for cost-sensitive learning\n\nOnce a classifier is trained, the output of the :term:`predict` method outputs class\nlabel predictions corresponding to a thresholding of either the :term:`decision\nfunction` or the :term:`predict_proba` output. For a binary classifier, the default\nthreshold is defined as a posterior probability estimate of 0.5 or a decision score of\n0.0.\n\nHowever, this default strategy is most likely not optimal for the task at hand.\nHere, we use the \"Statlog\" German credit dataset [1]_ to illustrate a use case.\nIn this dataset, the task is to predict whether a person has a \"good\" or \"bad\" credit.\nIn addition, a cost-matrix is provided that specifies the cost of\nmisclassification. Specifically, misclassifying a \"bad\" credit as \"good\" is five\ntimes more costly on average than misclassifying a \"good\" credit as \"bad\".\n\nWe use the :class:`~sklearn.model_selection.TunedThresholdClassifierCV` to select the\ncut-off point of the decision function that minimizes the provided business\ncost.\n\nIn the second part of the example, we further extend this approach by\nconsidering the problem of fraud detection in credit card transactions: in this\ncase, the business metric depends on the amount of each individual transaction.\n.. topic:: References\n\n .. [1] \"Statlog (German Credit Data) Data Set\", UCI Machine Learning Repository,\n [Link](https://archive.ics.uci.edu/ml/datasets/Statlog+%28German+Credit+Data%29).\n\n .. [2] [Charles Elkan, \"The Foundations of Cost-Sensitive Learning\",\n International joint conference on artificial intelligence.\n Vol. 17. No. 1. Lawrence Erlbaum Associates Ltd, 2001.](https://cseweb.ucsd.edu/~elkan/rescale.pdf)\n"
+"\n# Post-tuning the decision threshold for cost-sensitive learning\n\nOnce a classifier is trained, the output of the :term:`predict` method outputs class\nlabel predictions corresponding to a thresholding of either the\n:term:`decision_function` or the :term:`predict_proba` output. For a binary classifier,\nthe default threshold is defined as a posterior probability estimate of 0.5 or a\ndecision score of 0.0.\n\nHowever, this default strategy is most likely not optimal for the task at hand.\nHere, we use the \"Statlog\" German credit dataset [1]_ to illustrate a use case.\nIn this dataset, the task is to predict whether a person has a \"good\" or \"bad\" credit.\nIn addition, a cost-matrix is provided that specifies the cost of\nmisclassification. Specifically, misclassifying a \"bad\" credit as \"good\" is five\ntimes more costly on average than misclassifying a \"good\" credit as \"bad\".\n\nWe use the :class:`~sklearn.model_selection.TunedThresholdClassifierCV` to select the\ncut-off point of the decision function that minimizes the provided business\ncost.\n\nIn the second part of the example, we further extend this approach by\nconsidering the problem of fraud detection in credit card transactions: in this\ncase, the business metric depends on the amount of each individual transaction.\n\n.. rubric :: References\n\n.. [1] \"Statlog (German Credit Data) Data Set\", UCI Machine Learning Repository,\n [Link](https://archive.ics.uci.edu/ml/datasets/Statlog+%28German+Credit+Data%29).\n\n.. [2] [Charles Elkan, \"The Foundations of Cost-Sensitive Learning\",\n International joint conference on artificial intelligence.\n Vol. 17. No. 1. Lawrence Erlbaum Associates Ltd, 2001.](https://cseweb.ucsd.edu/~elkan/rescale.pdf)\n"
 ]
 },
 {
@@ -422,14 +422,14 @@
 },
 "outputs": [],
 "source": [
-"fraud = target == 1\namount_fraud = data[\"Amount\"][fraud]\n_, ax = plt.subplots()\nax.hist(amount_fraud, bins=100)\nax.set_title(\"Amount of fraud transaction\")\n_ = ax.set_xlabel(\"Amount ($)\")"
+"fraud = target == 1\namount_fraud = data[\"Amount\"][fraud]\n_, ax = plt.subplots()\nax.hist(amount_fraud, bins=100)\nax.set_title(\"Amount of fraud transaction\")\n_ = ax.set_xlabel(\"Amount (\u20ac)\")"
 ]
 },
 {
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"### Addressing the problem with a business metric\n\nNow, we create the business metric that depends on the amount of each transaction. We\ndefine the cost matrix similarly to [2]_. Accepting a legitimate transaction provides\na gain of 2% of the amount of the transaction. However, accepting a fraudulent\ntransaction result in a loss of the amount of the transaction. As stated in [2]_, the\ngain and loss related to refusals (of fraudulent and legitimate transactions) are not\ntrivial to define. Here, we define that a refusal of a legitimate transaction is\nestimated to a loss of $5 while the refusal of a fraudulent transaction is estimated\nto a gain of $50 dollars and the amount of the transaction. Therefore, we define the\nfollowing function to compute the total benefit of a given decision:\n\n"
+"### Addressing the problem with a business metric\n\nNow, we create the business metric that depends on the amount of each transaction. We\ndefine the cost matrix similarly to [2]_. Accepting a legitimate transaction provides\na gain of 2% of the amount of the transaction. However, accepting a fraudulent\ntransaction result in a loss of the amount of the transaction. As stated in [2]_, the\ngain and loss related to refusals (of fraudulent and legitimate transactions) are not\ntrivial to define. Here, we define that a refusal of a legitimate transaction is\nestimated to a loss of 5\u20ac while the refusal of a fraudulent transaction is estimated\nto a gain of 50\u20ac and the amount of the transaction. Therefore, we define the\nfollowing function to compute the total benefit of a given decision:\n\n"
 ]
 },
 {
@@ -505,14 +505,14 @@
 },
 "outputs": [],
 "source": [
-"from sklearn.dummy import DummyClassifier\n\neasy_going_classifier = DummyClassifier(strategy=\"constant\", constant=0)\neasy_going_classifier.fit(data_train, target_train)\nbenefit_cost = business_scorer(\n easy_going_classifier, data_test, target_test, amount=amount_test\n)\nprint(f\"Benefit/cost of our easy-going classifier: ${benefit_cost:,.2f}\")"
+"from sklearn.dummy import DummyClassifier\n\neasy_going_classifier = DummyClassifier(strategy=\"constant\", constant=0)\neasy_going_classifier.fit(data_train, target_train)\nbenefit_cost = business_scorer(\n easy_going_classifier, data_test, target_test, amount=amount_test\n)\nprint(f\"Benefit/cost of our easy-going classifier: {benefit_cost:,.2f}\u20ac\")"
 ]
 },
 {
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"A classifier that predict all transactions as legitimate would create a profit of\naround $220,000. We make the same evaluation for a classifier that predicts all\ntransactions as fraudulent.\n\n"
+"A classifier that predict all transactions as legitimate would create a profit of\naround 220,000.\u20ac We make the same evaluation for a classifier that predicts all\ntransactions as fraudulent.\n\n"
 ]
 },
 {
@@ -523,14 +523,14 @@
 },
 "outputs": [],
 "source": [
-"intolerant_classifier = DummyClassifier(strategy=\"constant\", constant=1)\nintolerant_classifier.fit(data_train, target_train)\nbenefit_cost = business_scorer(\n intolerant_classifier, data_test, target_test, amount=amount_test\n)\nprint(f\"Benefit/cost of our intolerant classifier: ${benefit_cost:,.2f}\")"
+"intolerant_classifier = DummyClassifier(strategy=\"constant\", constant=1)\nintolerant_classifier.fit(data_train, target_train)\nbenefit_cost = business_scorer(\n intolerant_classifier, data_test, target_test, amount=amount_test\n)\nprint(f\"Benefit/cost of our intolerant classifier: {benefit_cost:,.2f}\u20ac\")"
 ]
 },
 {
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"Such a classifier create a loss of around $670,000. A predictive model should allow\nus to make a profit larger than $220,000. It is interesting to compare this business\nmetric with another \"standard\" statistical metric such as the balanced accuracy.\n\n"
+"Such a classifier create a loss of around 670,000.\u20ac A predictive model should allow\nus to make a profit larger than 220,000.\u20ac It is interesting to compare this business\nmetric with another \"standard\" statistical metric such as the balanced accuracy.\n\n"
 ]
 },
 {
@@ -559,7 +559,7 @@
 },
 "outputs": [],
 "source": [
-"from sklearn.linear_model import LogisticRegression\nfrom sklearn.model_selection import GridSearchCV\nfrom sklearn.pipeline import make_pipeline\nfrom sklearn.preprocessing import StandardScaler\n\nlogistic_regression = make_pipeline(StandardScaler(), LogisticRegression())\nparam_grid = {\"logisticregression__C\": np.logspace(-6, 6, 13)}\nmodel = GridSearchCV(logistic_regression, param_grid, scoring=\"neg_log_loss\").fit(\n data_train, target_train\n)\n\nprint(\n \"Benefit/cost of our logistic regression: \"\n f\"${business_scorer(model, data_test, target_test, amount=amount_test):,.2f}\"\n)\nprint(\n \"Balanced accuracy of our logistic regression: \"\n f\"{balanced_accuracy_scorer(model, data_test, target_test):.3f}\"\n)"
+"from sklearn.linear_model import LogisticRegression\nfrom sklearn.model_selection import GridSearchCV\nfrom sklearn.pipeline import make_pipeline\nfrom sklearn.preprocessing import StandardScaler\n\nlogistic_regression = make_pipeline(StandardScaler(), LogisticRegression())\nparam_grid = {\"logisticregression__C\": np.logspace(-6, 6, 13)}\nmodel = GridSearchCV(logistic_regression, param_grid, scoring=\"neg_log_loss\").fit(\n data_train, target_train\n)\n\nprint(\n \"Benefit/cost of our logistic regression: \"\n f\"{business_scorer(model, data_test, target_test, amount=amount_test):,.2f}\u20ac\"\n)\nprint(\n \"Balanced accuracy of our logistic regression: \"\n f\"{balanced_accuracy_scorer(model, data_test, target_test):.3f}\"\n)"
 ]
 },
 {
@@ -606,7 +606,7 @@
 },
 "outputs": [],
 "source": [
-"print(\n \"Benefit/cost of our logistic regression: \"\n f\"${business_scorer(tuned_model, data_test, target_test, amount=amount_test):,.2f}\"\n)\nprint(\n \"Balanced accuracy of our logistic regression: \"\n f\"{balanced_accuracy_scorer(tuned_model, data_test, target_test):.3f}\"\n)"
+"print(\n \"Benefit/cost of our logistic regression: \"\n f\"{business_scorer(tuned_model, data_test, target_test, amount=amount_test):,.2f}\u20ac\"\n)\nprint(\n \"Balanced accuracy of our logistic regression: \"\n f\"{balanced_accuracy_scorer(tuned_model, data_test, target_test):.3f}\"\n)"
 ]
 },
 {
@@ -635,7 +635,7 @@
 },
 "outputs": [],
 "source": [
-"business_score = business_scorer(\n model_fixed_threshold, data_test, target_test, amount=amount_test\n)\nprint(f\"Benefit/cost of our logistic regression: ${business_score:,.2f}\")\nprint(\n \"Balanced accuracy of our logistic regression: \"\n f\"{balanced_accuracy_scorer(model_fixed_threshold, data_test, target_test):.3f}\"\n)"
+"business_score = business_scorer(\n model_fixed_threshold, data_test, target_test, amount=amount_test\n)\nprint(f\"Benefit/cost of our logistic regression: {business_score:,.2f}\u20ac\")\nprint(\n \"Balanced accuracy of our logistic regression: \"\n f\"{balanced_accuracy_scorer(model_fixed_threshold, data_test, target_test):.3f}\"\n)"
 ]
 },
 {
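
The business metric described in the "Addressing the problem with a business metric" cell of this notebook can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the function defined in the example itself: the name `total_benefit` and the sample values are hypothetical, and the encoding assumes label 1 means fraudulent and a prediction of 1 means the transaction is refused (consistent with `fraud = target == 1` and the constant-0 classifier being described as accepting everything).

```python
def total_benefit(y_true, y_pred, amounts):
    """Sum the gain or loss of each accept/refuse decision.

    Hypothetical helper: 1 = fraudulent / refused, 0 = legitimate / accepted.
    Amounts are in the same currency unit as the returned total.
    """
    total = 0.0
    for truth, pred, amount in zip(y_true, y_pred, amounts):
        if truth == 0 and pred == 0:
            # Legitimate transaction accepted: gain of 2% of the amount.
            total += 0.02 * amount
        elif truth == 1 and pred == 0:
            # Fraudulent transaction accepted: the whole amount is lost.
            total -= amount
        elif truth == 0 and pred == 1:
            # Legitimate transaction refused: estimated loss of 5.
            total -= 5.0
        else:
            # Fraudulent transaction refused: estimated gain of 50 plus the amount.
            total += 50.0 + amount
    return total


# Made-up transactions covering the four cases once each.
print(total_benefit([0, 1, 0, 1], [0, 0, 1, 1], [100.0, 200.0, 300.0, 400.0]))
```

Because the total depends on the amount of each transaction, two classifiers with the same error counts can yield very different benefits, which is why the example compares this metric with balanced accuracy.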

Diff for: 1.5/_downloads/1b8827af01c9a70017a4739bcf2e21a8/plot_gpr_co2.py

+27 -25
@@ -4,20 +4,19 @@
 ====================================================================================

 This example is based on Section 5.4.3 of "Gaussian Processes for Machine
-Learning" [RW2006]_. It illustrates an example of complex kernel engineering
+Learning" [1]_. It illustrates an example of complex kernel engineering
 and hyperparameter optimization using gradient ascent on the
 log-marginal-likelihood. The data consists of the monthly average atmospheric
 CO2 concentrations (in parts per million by volume (ppm)) collected at the
 Mauna Loa Observatory in Hawaii, between 1958 and 2001. The objective is to
 model the CO2 concentration as a function of the time :math:`t` and extrapolate
 for years after 2001.

-.. topic: References
+.. rubric:: References

-.. [RW2006] `Rasmussen, Carl Edward.
-"Gaussian processes in machine learning."
-Summer school on machine learning. Springer, Berlin, Heidelberg, 2003
-<http://www.gaussianprocess.org/gpml/chapters/RW.pdf>`_.
+.. [1] `Rasmussen, Carl Edward. "Gaussian processes in machine learning."
+Summer school on machine learning. Springer, Berlin, Heidelberg, 2003
+<http://www.gaussianprocess.org/gpml/chapters/RW.pdf>`_.
 """

 print(__doc__)
@@ -33,32 +32,34 @@
 # We will derive a dataset from the Mauna Loa Observatory that collected air
 # samples. We are interested in estimating the concentration of CO2 and
 # extrapolate it for further year. First, we load the original dataset available
-# in OpenML.
+# in OpenML as a pandas dataframe. This will be replaced with Polars
+# once `fetch_openml` adds a native support for it.
 from sklearn.datasets import fetch_openml

 co2 = fetch_openml(data_id=41187, as_frame=True)
 co2.frame.head()

 # %%
-# First, we process the original dataframe to create a date index and select
-# only the CO2 column.
-import pandas as pd
+# First, we process the original dataframe to create a date column and select
+# it along with the CO2 column.
+import polars as pl

-co2_data = co2.frame
-co2_data["date"] = pd.to_datetime(co2_data[["year", "month", "day"]])
-co2_data = co2_data[["date", "co2"]].set_index("date")
+co2_data = pl.DataFrame(co2.frame[["year", "month", "day", "co2"]]).select(
+    pl.date("year", "month", "day"), "co2"
+)
 co2_data.head()

 # %%
-co2_data.index.min(), co2_data.index.max()
+co2_data["date"].min(), co2_data["date"].max()

 # %%
 # We see that we get CO2 concentration for some days from March, 1958 to
 # December, 2001. We can plot these raw information to have a better
 # understanding.
 import matplotlib.pyplot as plt

-co2_data.plot()
+plt.plot(co2_data["date"], co2_data["co2"])
+plt.xlabel("date")
 plt.ylabel("CO$_2$ concentration (ppm)")
 _ = plt.title("Raw air samples measurements from the Mauna Loa Observatory")

@@ -67,15 +68,14 @@
 # for which no measurements were collected. Such a processing will have an
 # smoothing effect on the data.

-try:
-    co2_data_resampled_monthly = co2_data.resample("ME")
-except ValueError:
-    # pandas < 2.2 uses M instead of ME
-    co2_data_resampled_monthly = co2_data.resample("M")
-
-
-co2_data = co2_data_resampled_monthly.mean().dropna(axis="index", how="any")
-co2_data.plot()
+co2_data = (
+    co2_data.sort(by="date")
+    .group_by_dynamic("date", every="1mo")
+    .agg(pl.col("co2").mean())
+    .drop_nulls()
+)
+plt.plot(co2_data["date"], co2_data["co2"])
+plt.xlabel("date")
 plt.ylabel("Monthly average of CO$_2$ concentration (ppm)")
 _ = plt.title(
     "Monthly average of air samples measurements\nfrom the Mauna Loa Observatory"
@@ -88,7 +88,9 @@
 #
 # As a first step, we will divide the data and the target to estimate. The data
 # being a date, we will convert it into a numeric.
-X = (co2_data.index.year + co2_data.index.month / 12).to_numpy().reshape(-1, 1)
+X = co2_data.select(
+    pl.col("date").dt.year() + pl.col("date").dt.month() / 12
+).to_numpy()
 y = co2_data["co2"].to_numpy()

 # %%
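
The two data-preparation steps touched by this file's diff, monthly averaging and the `year + month / 12` conversion, can be sketched with the standard library alone. This is an illustrative sketch under stated assumptions, not the example's code: the helper names `monthly_means` and `to_numeric` and the sample readings are hypothetical, and no claim is made that it matches the pandas or Polars results in every edge case.

```python
from collections import defaultdict
from datetime import date
from statistics import mean


def monthly_means(samples):
    """Average (date, value) samples per calendar month.

    Returns a list of (first day of month, mean value), sorted by month.
    Months without any sample simply do not appear in the result.
    """
    buckets = defaultdict(list)
    for day, value in samples:
        buckets[date(day.year, day.month, 1)].append(value)
    return [(month, mean(values)) for month, values in sorted(buckets.items())]


def to_numeric(month_start):
    """Map a date to a single number, mirroring `year + month / 12`."""
    return month_start.year + month_start.month / 12


# Made-up readings: two in March 1958, none in April, one in May.
readings = [
    (date(1958, 3, 29), 316.0),
    (date(1958, 3, 5), 318.0),
    (date(1958, 5, 17), 317.5),
]
for month, value in monthly_means(readings):
    print(to_numeric(month), value)
```

One property of the `year + month / 12` mapping worth keeping in mind: December of a given year maps to the next whole number, so the numeric axis is offset by one month relative to the calendar year.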

Diff for: 1.5/_downloads/23614d75e8327ef369659da7d2ed62db/plot_nested_cross_validation_iris.py

+6 -6
@@ -30,17 +30,17 @@
 performance of non-nested and nested CV strategies by taking the difference
 between their scores.

-.. topic:: See Also:
+.. seealso::

 - :ref:`cross_validation`
 - :ref:`grid_search`

-.. topic:: References:
+.. rubric:: References

-.. [1] `Cawley, G.C.; Talbot, N.L.C. On over-fitting in model selection and
-subsequent selection bias in performance evaluation.
-J. Mach. Learn. Res 2010,11, 2079-2107.
-<http://jmlr.csail.mit.edu/papers/volume11/cawley10a/cawley10a.pdf>`_
+.. [1] `Cawley, G.C.; Talbot, N.L.C. On over-fitting in model selection and
+subsequent selection bias in performance evaluation.
+J. Mach. Learn. Res 2010,11, 2079-2107.
+<http://jmlr.csail.mit.edu/papers/volume11/cawley10a/cawley10a.pdf>`_

 """
