Commit 1f626fa

ENH Refactor Robust Classification, Regression and Clustering (#70)
1 parent 9877318 commit 1f626fa

11 files changed (+1118 -186 lines)

Diff for: doc/modules/robust.rst (+10 -17)
@@ -1,8 +1,8 @@
 .. _robust:
 
-===================================================
-Robust algorithms for Regression and Classification
-===================================================
+===============================================================
+Robust algorithms for Regression, Classification and Clustering
+===============================================================
 
 .. currentmodule:: sklearn_extra.robust
 
@@ -59,7 +59,7 @@ that minimizes an estimation of the risk.
 
     \widehat{f} = \text{argmin}_{f\in F}\frac{1}{n}\sum_{i=1}^n\ell(f(X_i),y_i),
 
-where the :math:`ell` is a loss function (e.g. the squared distance in
+where the :math:`\ell` is a loss function (e.g. the squared distance in
 regression problems). Said in another way, we are trying to minimize an
 estimation of the expected risk and this estimation corresponds to an empirical
 mean. However, it is well known that the empirical mean is not robust to
@@ -90,10 +90,7 @@ The algorithm
 -------------
 
 The approach is implemented as a meta algorithm that takes as input a base
-estimator (e.g., SGDClassifier or SGDRegressor). To be compatible, the
-base estimator must support partial_fit and sample_weight.
-Refer to the KMeans example for a template
-to adapt the method to other estimators.
+estimator (e.g., SGDClassifier, SGDRegressor or MiniBatchKMeans).
 
 At each step, the algorithm estimates sample weights that are meant to be small
 for outliers and large for inliers and then we do one optimization step using
@@ -155,25 +152,21 @@ Hence, we will not talk about classification algorithms in this comparison.
 
 As such we only compare ourselves to TheilSenRegressor and RANSACRegressor as
 they both deal with outliers in X and in Y and are closer to
-RobustWeightedEstimator.
+RobustWeightedRegressor.
 
 **Warning:** Huber weights used in our algorithm should not be confused with
 HuberRegressor or other regression with "robust losses". Those types of
 regressions are robust only to outliers in the label Y but not in X.
 
 Pro: RANSACRegressor and TheilSenRegressor both use a hard rejection of
 outliers. This can be interpreted as though there was an outlier detection
-step and then a regression step whereas RobustWeightedEstimator is directly
+step and then a regression step whereas RobustWeightedRegressor is directly
 robust to outliers. This often increases the performance on moderately corrupted
 datasets.
 
 Con: In general, this algorithm is slower than both TheilSenRegressor and
 RANSACRegressor.
 
-One other advantage of RobustWeightedEstimator is that it can be used for a
-broad range of algorithms. For example, one can do robust unsupervised
-learning with RobustWeightedEstimator, see the example using KMeans algorithm.
-
 Speed and limits of the algorithm
 ---------------------------------
 
@@ -188,9 +181,9 @@ Complexity and limitation:
 
 * weighting="huber": the complexity is larger than that of base_estimator but
   it is still of the same order of magnitude.
-* weighting="mom": the larger k is the faster the algorithm will perform if
-  sample_size is large. This weighting scheme is advised only with
-  sufficiently large dataset (thumb rule sample_size > 500 the specifics
+* weighting="mom": the larger k is the faster the algorithm will perform if
+  sample_size is large. This weighting scheme is advised only with a
+  sufficiently large dataset (rule of thumb: sample_size > 500; the specifics
   depend on the dataset).
 
 **Warning:** On a real dataset, one should be aware that there can be outliers
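
The refactor replaces the single RobustWeightedEstimator meta-estimator with one class per task. As a hedged sketch of the new API, using only constructor arguments that appear in the example diffs below (full signatures and defaults may differ):

from sklearn_extra.robust import (
    RobustWeightedClassifier,
    RobustWeightedKMeans,
    RobustWeightedRegressor,
)

# Huber weighting: advised when the sample size is not very large
# (c and eta0 values as used in the diabetes example below).
clf = RobustWeightedClassifier(
    weighting="huber", loss="hinge", c=1.35, eta0=1e-3
)

# Median-of-means weighting: advised for sufficiently large datasets;
# k is the number of blocks (value from the toy regression example).
reg = RobustWeightedRegressor(weighting="mom", k=7)

# Robust K-means: n_clusters is the first positional argument,
# as in the clustering example below.
km = RobustWeightedKMeans(3, weighting="mom", k=6, eta0=0.01, max_iter=100)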

Diff for: examples/plot_clustering.py (+9 -24)
@@ -3,15 +3,15 @@
 ===================================================================
 A demo of several clustering algorithms on a corrupted dataset
 ===================================================================
-In this example we exhibit the results of various 
+In this example we exhibit the results of various
 scikit-learn and scikit-learn-extra clustering algorithms on
 a dataset with outliers.
-KMedoids is the most stable and efficient 
+KMedoids is the most stable and efficient
 algorithm for this application (change the seed to
-see different behavior for SpectralClustering and 
+see different behavior for SpectralClustering and
 the robust kmeans).
-The mean-shift algorithm, once correctly 
-parameterized, detects the outliers as a class of 
+The mean-shift algorithm, once correctly
+parameterized, detects the outliers as a class of
 their own.
 """
 print(__doc__)
@@ -22,11 +22,11 @@
 import matplotlib.pyplot as plt
 
 from sklearn import cluster, mixture
-from sklearn.cluster import MiniBatchKMeans, KMeans
+from sklearn.cluster import KMeans
 from sklearn.datasets import make_blobs
 from sklearn.utils import shuffle
 
-from sklearn_extra.robust import RobustWeightedEstimator
+from sklearn_extra.robust import RobustWeightedKMeans
 from sklearn_extra.cluster import KMedoids
 
 rng = np.random.RandomState(42)
@@ -37,16 +37,6 @@
 kmeans = KMeans(n_clusters=n_clusters, random_state=rng)
 kmedoid = KMedoids(n_clusters=n_clusters, random_state=rng)
 
-
-def kmeans_loss(X, pred):
-    return np.array(
-        [
-            np.linalg.norm(X[pred[i]] - np.mean(X[pred == pred[i]])) ** 2
-            for i in range(len(X))
-        ]
-    )
-
-
 two_means = cluster.MiniBatchKMeans(n_clusters=n_clusters, random_state=rng)
 spectral = cluster.SpectralClustering(
     n_clusters=n_clusters,
@@ -78,15 +68,10 @@ def kmeans_loss(X, pred):
 X = shuffle(X, random_state=rng)
 
 # Define two other clustering algorithms
-kmeans_rob = RobustWeightedEstimator(
-    MiniBatchKMeans(
-        n_clusters, batch_size=len(X), init="random", random_state=rng
-    ),
-    # in theory, init=kmeans++ is very non-robust
-    burn_in=0,
+kmeans_rob = RobustWeightedKMeans(
+    n_clusters,
     eta0=0.01,
     weighting="mom",
-    loss=kmeans_loss,
     max_iter=100,
     k=int(n_samples / 50),
     random_state=rng,
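
With the refactor, the per-sample K-means loss and the MiniBatchKMeans base estimator are handled internally, so the removed kmeans_loss helper is no longer needed. A minimal end-to-end sketch under that assumption (fit_predict is assumed to be available, as for other scikit-learn clusterers):

from sklearn.datasets import make_blobs
from sklearn_extra.robust import RobustWeightedKMeans

# Small outlier-free demo; the real example adds outliers to the blobs.
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

km = RobustWeightedKMeans(
    3, weighting="mom", k=6, eta0=0.01, max_iter=100, random_state=42
)
labels = km.fit_predict(X)  # fit_predict assumed available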

Diff for: examples/plot_robust_classification_diabete.py (+18 -7)
@@ -3,22 +3,21 @@
 ======================================================================
 A demo of Robust Classification on real dataset "diabetes" from OpenML
 ======================================================================
-In this example we compare the RobustWeightedEstimator using SGDClassifier
+In this example we compare the RobustWeightedClassifier
 for classification on the real dataset "diabetes".
 WARNING: running this example can take some time (<1 hour).
 We only compare the estimator with SGDClassifier as there is no robust
 classification estimator in scikit-learn.
 """
 import matplotlib.pyplot as plt
 import numpy as np
-from sklearn_extra.robust import RobustWeightedEstimator
+from sklearn_extra.robust import RobustWeightedClassifier
 from sklearn.linear_model import SGDClassifier
 from sklearn.datasets import fetch_openml
 from sklearn.metrics import roc_auc_score, make_scorer
 from sklearn.model_selection import cross_val_score
 from sklearn.preprocessing import RobustScaler
 
-
 X, y = fetch_openml(name="diabetes", return_X_y=True)
 
 # replace the label names with 0 or 1
@@ -36,8 +35,7 @@
 # Using GridSearchCV, we tuned the parameters c and eta0, with the
 # choice of "huber" weighting because the sample_size is not very large.
 
-clf_rob = RobustWeightedEstimator(
-    SGDClassifier(average=10, learning_rate="optimal", loss="hinge"),
+clf_rob = RobustWeightedClassifier(
     weighting="huber",
     loss="hinge",
     c=1.35,
@@ -50,8 +48,19 @@
 M = 10
 res = []
 for f in range(M):
+    rng = np.random.RandomState(f)
     print("\r Progress: %s / %s" % (f + 1, M), end="")
-    clf = SGDClassifier(average=10, learning_rate="optimal", loss="hinge")
+    clf = SGDClassifier(
+        average=10, learning_rate="optimal", loss="hinge", random_state=rng
+    )
+    clf_rob = RobustWeightedClassifier(
+        weighting="huber",
+        loss="hinge",
+        c=1.35,
+        eta0=1e-3,
+        max_iter=300,
+        random_state=rng,
+    )
 
     cv_not_rob = cross_val_score(
         clf_not_rob, X, y, cv=10, scoring=make_scorer(roc_auc_score)
@@ -64,7 +73,9 @@
     res += [[np.mean(cv_rob), np.mean(cv_not_rob)]]
 
 
-plt.boxplot(np.array(res), labels=["RobustWeightedEstimator", "SGDClassifier"])
+plt.boxplot(
+    np.array(res), labels=["RobustWeightedClassifier", "SGDClassifier"]
+)
 plt.ylabel("AUC")
 
 plt.show()
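
A detail worth noting in this diff: each repetition now builds one RandomState from the loop index and hands it to both estimators, which makes the Monte-Carlo comparison reproducible while randomness still varies across repetitions. A stripped-down sketch of the pattern (estimator arguments shortened from the diff above):

import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn_extra.robust import RobustWeightedClassifier

M = 10
for f in range(M):
    # One RandomState per repetition: re-running the script reproduces
    # the same M draws, yet each repetition uses different randomness.
    rng = np.random.RandomState(f)
    clf = SGDClassifier(loss="hinge", random_state=rng)
    clf_rob = RobustWeightedClassifier(
        weighting="huber", loss="hinge", random_state=rng
    )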

Diff for: examples/plot_robust_classification_toy.py (+4 -6)
@@ -3,12 +3,12 @@
 =============================================================
 A demo of Robust Classification on Simulated corrupted dataset
 =============================================================
-In this example we compare the RobustWeightedEstimator using SGDClassifier
+In this example we compare the RobustWeightedClassifier using SGDClassifier
 for classification with the vanilla SGDClassifier with various losses.
 """
 import matplotlib.pyplot as plt
 import numpy as np
-from sklearn_extra.robust import RobustWeightedEstimator
+from sklearn_extra.robust import RobustWeightedClassifier
 from sklearn.linear_model import SGDClassifier
 from sklearn.datasets import make_blobs
 from sklearn.utils import shuffle
@@ -40,10 +40,8 @@
         SGDClassifier(loss="modified_huber", random_state=rng),
     ),
     (
-        "RobustWeightedEstimator",
-        RobustWeightedEstimator(
-            base_estimator=SGDClassifier(),
-            loss="log",
+        "RobustWeightedClassifier",
+        RobustWeightedClassifier(
             max_iter=100,
             weighting="mom",
             k=6,

Diff for: examples/plot_robust_regression_california_houses.py (+21 -20)
@@ -3,26 +3,23 @@
 ================================================================
 A demo of Robust Regression on real dataset "california housing"
 ================================================================
-In this example we compare the RobustWeightedEstimator using SGDRegressor
-for regression on the real dataset california housing.
-WARNING: running this example can take some time (<1hour).
-
-We also compare with robust estimators from scikit-learn: TheilSenRegressor
-and RANSACRegressor
+In this example we compare the RobustWeightedRegressor to other scikit-learn
+regressors on the real dataset california housing.
+WARNING: running this example can take some time (<1 hour on a recent computer).
 
 One of the main points of this example is the importance of taking into account
 outliers in the test dataset when dealing with real datasets.
 
-For this example, we took a parameter so that RobustWeightedEstimator is better
+For this example, we took a parameter so that RobustWeightedRegressor is better
 than RANSAC and TheilSen when talking about the mean squared error and it
 is better than the SGDRegressor when talking about the median squared error.
 Depending on what criterion one wants to optimize, the parameter measuring
-robustness in RobustWeightedEstimator can change and this is not so
+robustness in RobustWeightedRegressor can change and this is not so
 straightforward when using RANSAC and TheilSenRegressor.
 """
 import matplotlib.pyplot as plt
 import numpy as np
-from sklearn_extra.robust import RobustWeightedEstimator
+from sklearn_extra.robust import RobustWeightedRegressor
 from sklearn.linear_model import (
     SGDRegressor,
     TheilSenRegressor,
@@ -57,19 +54,18 @@ def quadratic_loss(est, X, y, X_test, y_test):
         ),
     ),
     (
-        "RWE, Huber weights",
-        RobustWeightedEstimator(
-            SGDRegressor(
-                learning_rate="adaptive",
-                eta0=1e-6,
-                max_iter=1000,
-                n_iter_no_change=100,
-            ),
-            loss="squared_loss",
+        "RobustWeightedRegressor",
+        RobustWeightedRegressor(
             weighting="huber",
             c=0.5,
             eta0=1e-6,
             max_iter=500,
+            sgd_args={
+                "max_iter": 1000,
+                "n_iter_no_change": 100,
+                "learning_rate": "adaptive",
+                "eta0": 1e-6,
+            },
         ),
     ),
     ("RANSAC", RANSACRegressor()),
@@ -82,14 +78,19 @@ def quadratic_loss(est, X, y, X_test, y_test):
 for f in range(M):
     print("\r Progress: %s / %s" % (f + 1, M), end="")
 
+    rng = np.random.RandomState(f)
+
     # Split in a training set and a test set
-    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
+    X_train, X_test, y_train, y_test = train_test_split(
+        X, y, test_size=0.2, random_state=rng
+    )
 
     for i, (name, est) in enumerate(estimators):
         cv = quadratic_loss(est, X_train, y_train, X_test, y_test)
 
         # It is preferable to use the median of the validation losses
-        # because it is possible that some outliers are present in the test set.
+        # because it is possible that some outliers are present in the
+        # test set.
         # We compute both for comparison.
         res[i, f, 0] = np.mean(cv)
         res[i, f, 1] = np.median(cv)
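
The estimator hunk above also shows the new sgd_args parameter: options for the internal SGD solver now travel in a dict instead of a pre-configured SGDRegressor instance. A hedged sketch of my reading of that split (the exact forwarding semantics are an assumption; values taken from the diff):

from sklearn_extra.robust import RobustWeightedRegressor

reg = RobustWeightedRegressor(
    weighting="huber",
    c=0.5,
    eta0=1e-6,     # top-level options configure the robust wrapper
    max_iter=500,
    sgd_args={     # assumed to be forwarded to the internal SGD solver
        "learning_rate": "adaptive",
        "eta0": 1e-6,
        "max_iter": 1000,
        "n_iter_no_change": 100,
    },
)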

Diff for: examples/plot_robust_regression_toy.py (+7 -9)
@@ -3,13 +3,13 @@
 =============================================================
 Robust regression on simulated corrupted dataset
 =============================================================
-In this example we compare the RobustWeightedEstimator using SGDRegressor
-for regression with various robust regression algorithms from scikit-learn.
+In this example we compare the RobustWeightedRegressor
+with various robust regression algorithms from scikit-learn.
 """
 import matplotlib.pyplot as plt
 import numpy as np
 
-from sklearn_extra.robust import RobustWeightedEstimator
+from sklearn_extra.robust import RobustWeightedRegressor
 from sklearn.utils import shuffle
 from sklearn.linear_model import (
     SGDRegressor,
@@ -41,10 +41,8 @@
         SGDRegressor(loss="epsilon_insensitive", random_state=rng),
     ),
     (
-        "RobustWeightedEstimator",
-        RobustWeightedEstimator(
-            loss="squared_loss", weighting="mom", k=7, random_state=rng
-        ),
+        "RobustWeightedRegressor",
+        RobustWeightedRegressor(weighting="mom", k=7, random_state=rng),
         # The parameter k is set larger than the number of outliers
         # because here we know it.
     ),
@@ -56,7 +54,7 @@
     "Theil-Sen": "gold",
     "RANSAC": "lightgreen",
     "HuberRegressor": "black",
-    "RobustWeightedEstimator": "magenta",
+    "RobustWeightedRegressor": "magenta",
     "SGD epsilon loss": "purple",
 }
 linestyle = {
@@ -65,7 +63,7 @@
     "Theil-Sen": "-.",
     "RANSAC": "--",
     "HuberRegressor": "--",
-    "RobustWeightedEstimator": "--",
+    "RobustWeightedRegressor": "--",
 }
 lw = 3
 

Diff for: sklearn_extra/robust/__init__.py (+8 -2)
@@ -1,5 +1,11 @@
 from sklearn_extra.robust.robust_weighted_estimator import (
-    RobustWeightedEstimator,
+    RobustWeightedClassifier,
+    RobustWeightedKMeans,
+    RobustWeightedRegressor,
 )
 
-__all__ = ["RobustWeightedEstimator"]
+__all__ = [
+    "RobustWeightedClassifier",
+    "RobustWeightedKMeans",
+    "RobustWeightedRegressor",
+]
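
For downstream code, this is the change to the import surface. A before/after sketch based on the toy regression diff above (the removed call reproduced as a comment):

# Before this commit:
# from sklearn_extra.robust import RobustWeightedEstimator
# est = RobustWeightedEstimator(
#     loss="squared_loss", weighting="mom", k=7
# )

# After this commit:
from sklearn_extra.robust import RobustWeightedRegressor

est = RobustWeightedRegressor(weighting="mom", k=7)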
