Commit c83fad2

Pushing the docs to 1.2/ for branch: 1.2.X, commit 98cf537f5c538fdbc9d27b851cf03ce7611b8a48
1 parent f4a9f01 commit c83fad2

1,284 files changed: +6275 -5054 lines changed

Binary file not shown.

Diff for: 1.2/_downloads/1ed4d16a866c9fe4d86a05477e6d0664/plot_svm_scale_c.ipynb

+125 -2
@@ -15,7 +15,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"\n# Scaling the regularization parameter for SVCs\n\nThe following example illustrates the effect of scaling the\nregularization parameter when using `svm` for\n`classification <svm_classification>`.\nFor SVC classification, we are interested in a risk minimization for the\nequation:\n\n\n\\begin{align}C \\sum_{i=1, n} \\mathcal{L} (f(x_i), y_i) + \\Omega (w)\\end{align}\n\nwhere\n\n - $C$ is used to set the amount of regularization\n - $\\mathcal{L}$ is a `loss` function of our samples\n and our model parameters.\n - $\\Omega$ is a `penalty` function of our model parameters\n\nIf we consider the loss function to be the individual error per\nsample, then the data-fit term, or the sum of the error for each sample, will\nincrease as we add more samples. The penalization term, however, will not\nincrease.\n\nWhen using, for example, `cross validation <cross_validation>`, to\nset the amount of regularization with `C`, there will be a\ndifferent amount of samples between the main problem and the smaller problems\nwithin the folds of the cross validation.\n\nSince our loss function is dependent on the amount of samples, the latter\nwill influence the selected value of `C`.\nThe question that arises is `How do we optimally adjust C to\naccount for the different amount of training samples?`\n\nThe figures below are used to illustrate the effect of scaling our\n`C` to compensate for the change in the number of samples, in the\ncase of using an `l1` penalty, as well as the `l2` penalty.\n\n## l1-penalty case\nIn the `l1` case, theory says that prediction consistency\n(i.e. that under given hypothesis, the estimator\nlearned predicts as well as a model knowing the true distribution)\nis not possible because of the bias of the `l1`. It does say, however,\nthat model consistency, in terms of finding the right set of non-zero\nparameters as well as their signs, can be achieved by scaling\n`C1`.\n\n## l2-penalty case\nThe theory says that in order to achieve prediction consistency, the\npenalty parameter should be kept constant\nas the number of samples grow.\n\n## Simulations\n\nThe two figures below plot the values of `C` on the `x-axis` and the\ncorresponding cross-validation scores on the `y-axis`, for several different\nfractions of a generated data-set.\n\nIn the `l1` penalty case, the cross-validation-error correlates best with\nthe test-error, when scaling our `C` with the number of samples, `n`,\nwhich can be seen in the first figure.\n\nFor the `l2` penalty case, the best result comes from the case where `C`\nis not scaled.\n\n.. topic:: Note:\n\n Two separate datasets are used for the two different plots. The reason\n behind this is the `l1` case works better on sparse data, while `l2`\n is better suited to the non-sparse case.\n"
+"\n# Scaling the regularization parameter for SVCs\n\nThe following example illustrates the effect of scaling the\nregularization parameter when using `svm` for\n`classification <svm_classification>`.\nFor SVC classification, we are interested in a risk minimization for the\nequation:\n\n\n\\begin{align}C \\sum_{i=1, n} \\mathcal{L} (f(x_i), y_i) + \\Omega (w)\\end{align}\n\nwhere\n\n - $C$ is used to set the amount of regularization\n - $\\mathcal{L}$ is a `loss` function of our samples\n and our model parameters.\n - $\\Omega$ is a `penalty` function of our model parameters\n\nIf we consider the loss function to be the individual error per\nsample, then the data-fit term, or the sum of the error for each sample, will\nincrease as we add more samples. The penalization term, however, will not\nincrease.\n\nWhen using, for example, `cross validation <cross_validation>`, to\nset the amount of regularization with `C`, there will be a\ndifferent amount of samples between the main problem and the smaller problems\nwithin the folds of the cross validation.\n\nSince our loss function is dependent on the amount of samples, the latter\nwill influence the selected value of `C`.\nThe question that arises is \"How do we optimally adjust C to\naccount for the different amount of training samples?\"\n\nIn the remainder of this example, we will investigate the effect of scaling\nthe value of the regularization parameter `C` in regards to the number of\nsamples for both L1 and L2 penalty. We will generate some synthetic datasets\nthat are appropriate for each type of regularization.\n"
 ]
 },
 {
@@ -26,7 +26,130 @@
 },
 "outputs": [],
 "source": [
-"# Author: Andreas Mueller <[email protected]>\n# Jaques Grobler <[email protected]>\n# License: BSD 3 clause\n\nimport numpy as np\nimport matplotlib.pyplot as plt\n\nfrom sklearn.svm import LinearSVC\nfrom sklearn.model_selection import ShuffleSplit\nfrom sklearn.model_selection import GridSearchCV\nfrom sklearn.utils import check_random_state\nfrom sklearn import datasets\n\nrnd = check_random_state(1)\n\n# set up dataset\nn_samples = 100\nn_features = 300\n\n# l1 data (only 5 informative features)\nX_1, y_1 = datasets.make_classification(\n n_samples=n_samples, n_features=n_features, n_informative=5, random_state=1\n)\n\n# l2 data: non sparse, but less features\ny_2 = np.sign(0.5 - rnd.rand(n_samples))\nX_2 = rnd.randn(n_samples, n_features // 5) + y_2[:, np.newaxis]\nX_2 += 5 * rnd.randn(n_samples, n_features // 5)\n\nclf_sets = [\n (\n LinearSVC(penalty=\"l1\", loss=\"squared_hinge\", dual=False, tol=1e-3),\n np.logspace(-2.3, -1.3, 10),\n X_1,\n y_1,\n ),\n (\n LinearSVC(penalty=\"l2\", loss=\"squared_hinge\", dual=True),\n np.logspace(-4.5, -2, 10),\n X_2,\n y_2,\n ),\n]\n\ncolors = [\"navy\", \"cyan\", \"darkorange\"]\nlw = 2\n\nfor clf, cs, X, y in clf_sets:\n # set up the plot for each regressor\n fig, axes = plt.subplots(nrows=2, sharey=True, figsize=(9, 10))\n\n for k, train_size in enumerate(np.linspace(0.3, 0.7, 3)[::-1]):\n param_grid = dict(C=cs)\n # To get nice curve, we need a large number of iterations to\n # reduce the variance\n grid = GridSearchCV(\n clf,\n refit=False,\n param_grid=param_grid,\n cv=ShuffleSplit(\n train_size=train_size, test_size=0.3, n_splits=50, random_state=1\n ),\n )\n grid.fit(X, y)\n scores = grid.cv_results_[\"mean_test_score\"]\n\n scales = [\n (1, \"No scaling\"),\n ((n_samples * train_size), \"1/n_samples\"),\n ]\n\n for ax, (scaler, name) in zip(axes, scales):\n ax.set_xlabel(\"C\")\n ax.set_ylabel(\"CV Score\")\n grid_cs = cs * float(scaler) # scale the C's\n ax.semilogx(\n grid_cs,\n scores,\n label=\"fraction %.2f\" % train_size,\n color=colors[k],\n lw=lw,\n )\n ax.set_title(\n \"scaling=%s, penalty=%s, loss=%s\" % (name, clf.penalty, clf.loss)\n )\n\n plt.legend(loc=\"best\")\nplt.show()"
+"# Author: Andreas Mueller <[email protected]>\n# Jaques Grobler <[email protected]>\n# License: BSD 3 clause"
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"## L1-penalty case\nIn the L1 case, theory says that prediction consistency (i.e. that under\ngiven hypothesis, the estimator learned predicts as well as a model knowing\nthe true distribution) is not possible because of the bias of the L1. It\ndoes say, however, that model consistency, in terms of finding the right set\nof non-zero parameters as well as their signs, can be achieved by scaling\n`C`.\n\nWe will demonstrate this effect by using a synthetic dataset. This\ndataset will be sparse, meaning that only a few features will be informative\nand useful for the model.\n\n"
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {
+"collapsed": false
+},
+"outputs": [],
+"source": [
+"from sklearn.datasets import make_classification\n\nn_samples, n_features = 100, 300\nX, y = make_classification(\n n_samples=n_samples, n_features=n_features, n_informative=5, random_state=1\n)"
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"Now, we can define a linear SVC with the `l1` penalty.\n\n"
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {
+"collapsed": false
+},
+"outputs": [],
+"source": [
+"from sklearn.svm import LinearSVC\n\nmodel_l1 = LinearSVC(penalty=\"l1\", loss=\"squared_hinge\", dual=False, tol=1e-3)"
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"We will compute the mean test score for different values of `C`.\n\n"
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {
+"collapsed": false
+},
+"outputs": [],
+"source": [
+"import numpy as np\nimport pandas as pd\nfrom sklearn.model_selection import validation_curve, ShuffleSplit\n\nCs = np.logspace(-2.3, -1.3, 10)\ntrain_sizes = np.linspace(0.3, 0.7, 3)\nlabels = [f\"fraction: {train_size}\" for train_size in train_sizes]\n\nresults = {\"C\": Cs}\nfor label, train_size in zip(labels, train_sizes):\n cv = ShuffleSplit(train_size=train_size, test_size=0.3, n_splits=50, random_state=1)\n train_scores, test_scores = validation_curve(\n model_l1, X, y, param_name=\"C\", param_range=Cs, cv=cv\n )\n results[label] = test_scores.mean(axis=1)\nresults = pd.DataFrame(results)"
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {
+"collapsed": false
+},
+"outputs": [],
+"source": [
+"import matplotlib.pyplot as plt\n\nfig, axes = plt.subplots(nrows=1, ncols=2, sharey=True, figsize=(12, 6))\n\n# plot results without scaling C\nresults.plot(x=\"C\", ax=axes[0], logx=True)\naxes[0].set_ylabel(\"CV score\")\naxes[0].set_title(\"No scaling\")\n\n# plot results by scaling C\nfor train_size_idx, label in enumerate(labels):\n results_scaled = results[[label]].assign(\n C_scaled=Cs * float(n_samples * train_sizes[train_size_idx])\n )\n results_scaled.plot(x=\"C_scaled\", ax=axes[1], logx=True, label=label)\naxes[1].set_title(\"Scaling C by 1 / n_samples\")\n\n_ = fig.suptitle(\"Effect of scaling C with L1 penalty\")"
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"Here, we observe that the cross-validation-error correlates best with the\ntest-error, when scaling our `C` with the number of samples, `n`.\n\n## L2-penalty case\nWe can repeat a similar experiment with the `l2` penalty. In this case, we\ndon't need to use a sparse dataset.\n\nIn this case, the theory says that in order to achieve prediction\nconsistency, the penalty parameter should be kept constant as the number of\nsamples grow.\n\nSo we will repeat the same experiment by creating a linear SVC classifier\nwith the `l2` penalty and check the test score via cross-validation and\nplot the results with and without scaling the parameter `C`.\n\n"
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {
+"collapsed": false
+},
+"outputs": [],
+"source": [
+"rng = np.random.RandomState(1)\ny = np.sign(0.5 - rng.rand(n_samples))\nX = rng.randn(n_samples, n_features // 5) + y[:, np.newaxis]\nX += 5 * rng.randn(n_samples, n_features // 5)"
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {
+"collapsed": false
+},
+"outputs": [],
+"source": [
+"model_l2 = LinearSVC(penalty=\"l2\", loss=\"squared_hinge\", dual=True)\nCs = np.logspace(-4.5, -2, 10)\n\nlabels = [f\"fraction: {train_size}\" for train_size in train_sizes]\nresults = {\"C\": Cs}\nfor label, train_size in zip(labels, train_sizes):\n cv = ShuffleSplit(train_size=train_size, test_size=0.3, n_splits=50, random_state=1)\n train_scores, test_scores = validation_curve(\n model_l2, X, y, param_name=\"C\", param_range=Cs, cv=cv\n )\n results[label] = test_scores.mean(axis=1)\nresults = pd.DataFrame(results)"
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {
+"collapsed": false
+},
+"outputs": [],
+"source": [
+"import matplotlib.pyplot as plt\n\nfig, axes = plt.subplots(nrows=1, ncols=2, sharey=True, figsize=(12, 6))\n\n# plot results without scaling C\nresults.plot(x=\"C\", ax=axes[0], logx=True)\naxes[0].set_ylabel(\"CV score\")\naxes[0].set_title(\"No scaling\")\n\n# plot results by scaling C\nfor train_size_idx, label in enumerate(labels):\n results_scaled = results[[label]].assign(\n C_scaled=Cs * float(n_samples * train_sizes[train_size_idx])\n )\n results_scaled.plot(x=\"C_scaled\", ax=axes[1], logx=True, label=label)\naxes[1].set_title(\"Scaling C by 1 / n_samples\")\n\n_ = fig.suptitle(\"Effect of scaling C with L2 penalty\")"
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"So or the L2 penalty case, the best result comes from the case where `C` is\nnot scaled.\n\n"
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {
+"collapsed": false
+},
+"outputs": [],
+"source": [
+"plt.show()"
 ]
 }
 ],
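
The rescaling step used by both plotting cells in this notebook, `Cs * float(n_samples * train_sizes[train_size_idx])`, can be looked at in isolation. The sketch below is only an illustration: `scale_C` is a hypothetical helper name that does not appear in the commit, and the grid values are copied from the L1 cell. It multiplies each `C` by the number of training samples in a split, which is the quantity the right-hand panels put on the x-axis.

```python
import numpy as np


def scale_C(Cs, n_samples, train_size):
    """Multiply each C by the number of training samples in a split.

    Hypothetical helper (not part of the commit); it mirrors the expression
    `Cs * float(n_samples * train_size)` from the plotting cells.
    """
    n_train = n_samples * train_size  # samples actually used for fitting
    return np.asarray(Cs, dtype=float) * float(n_train)


# Values copied from the L1 cell of the notebook.
Cs = np.logspace(-2.3, -1.3, 10)
n_samples = 100
train_sizes = np.linspace(0.3, 0.7, 3)

# One rescaled grid per training fraction, in the same order as train_sizes.
scaled_grids = [scale_C(Cs, n_samples, ts) for ts in train_sizes]
for ts, grid in zip(train_sizes, scaled_grids):
    print(f"fraction {ts:.1f}: C * n_train spans {grid[0]:.3g} to {grid[-1]:.3g}")
```

Because the same scores are plotted against a shifted x-axis, the curves for different fractions line up only if the best `C` really does move with the training-set size.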

Diff for: 1.2/_downloads/348dd747b709a747e14c8bcdddf0a9b6/plot_gpr_on_structured_data.py

+15 -15
@@ -38,8 +38,8 @@
 
 """
 
+# %%
 import numpy as np
-import matplotlib.pyplot as plt
 from sklearn.gaussian_process.kernels import Kernel, Hyperparameter
 from sklearn.gaussian_process.kernels import GenericKernelMixin
 from sklearn.gaussian_process import GaussianProcessRegressor
@@ -102,10 +102,11 @@ def clone_with_theta(self, theta):
 
 kernel = SequenceKernel()
 
-"""
-Sequence similarity matrix under the kernel
-===========================================
-"""
+# %%
+# Sequence similarity matrix under the kernel
+# ===========================================
+
+import matplotlib.pyplot as plt
 
 X = np.array(["AGCT", "AGC", "AACT", "TAA", "AAA", "GAACA"])
 
@@ -117,11 +118,11 @@ def clone_with_theta(self, theta):
 plt.xticks(np.arange(len(X)), X)
 plt.yticks(np.arange(len(X)), X)
 plt.title("Sequence similarity under the kernel")
+plt.show()
 
-"""
-Regression
-==========
-"""
+# %%
+# Regression
+# ==========
 
 X = np.array(["AGCT", "AGC", "AACT", "TAA", "AAA", "GAACA"])
 Y = np.array([1.0, 1.0, 2.0, 2.0, 3.0, 3.0])
@@ -136,11 +137,11 @@ def clone_with_theta(self, theta):
 plt.xticks(np.arange(len(X)), X)
 plt.title("Regression on sequences")
 plt.legend()
+plt.show()
 
-"""
-Classification
-==============
-"""
+# %%
+# Classification
+# ==============
 
 X_train = np.array(["AGCT", "CGA", "TAAC", "TCG", "CTTT", "TGCT"])
 # whether there are 'A's in the sequence
@@ -176,13 +177,12 @@ def clone_with_theta(self, theta):
 [1.0 if c else -1.0 for c in gp.predict(X_test)],
 s=100,
 marker="x",
-edgecolor=(0, 1.0, 0.3),
+facecolor="b",
 linewidth=2,
 label="prediction",
 )
 plt.xticks(np.arange(len(X_train) + len(X_test)), np.concatenate((X_train, X_test)))
 plt.yticks([-1, 1], [False, True])
 plt.title("Classification on sequences")
 plt.legend()
-
 plt.show()
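
The hunks in this file replace triple-quoted section headers with `# %%` comment markers, a cell-separator convention recognised by sphinx-gallery and by several editors. As a rough illustration of what that marker buys (this is not sphinx-gallery's actual parser, and `split_cells` is a hypothetical helper), a script can be split into cells on those markers like this:

```python
def split_cells(source):
    """Split script text into cells at lines that consist only of `# %%`.

    Hypothetical helper for illustration; real tools apply more rules
    (for example, treating the comment block after a marker as prose).
    """
    cells = [[]]
    for line in source.splitlines():
        if line.strip() == "# %%":
            cells.append([])  # a marker line starts a new cell
        else:
            cells[-1].append(line)
    # Join the lines of each cell and drop cells that end up empty.
    joined = ("\n".join(cell).strip() for cell in cells)
    return [cell for cell in joined if cell]


# Shortened stand-in for the script touched by this diff.
example = """import numpy as np

# %%
# Regression
# ==========
X = np.array(["AGCT", "AGC"])

# %%
# Classification
# ==============
X_train = np.array(["AGCT", "CGA"])
"""

cells = split_cells(example)
for i, cell in enumerate(cells):
    print(f"--- cell {i} ---")
    print(cell)
```

With the old triple-quoted headers, a splitter of this kind would find no cell boundaries and the whole script would remain a single block.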
