[MRG+1] QuantileTransformer #8363

Merged
merged 107 commits into from
Jun 9, 2017

Conversation

glemaitre
Member

Reference Issue

Cont'd of #2176

What does this implement/fix? Explain your changes.

Implementation of quantile normalizer

Any other comments?

@glemaitre
Member Author

@tguillemot @dengemann @raghavrv @ogrisel here we go

X_trans = normalizer.fit_transform(X)
# FIXME: one of those will drive to precision error
# in the interpolation
# assert_array_almost_equal(np.min(X_trans, axis=0), 0.)
Contributor

@tguillemot tguillemot Feb 16, 2017

I'm working on it.

Member Author

I checked yesterday for a while and there is nothing wrong with our code.
f(min(X)) of the interpolated function does not return exactly 0.
The issue should come from numpy.interp.

This is working on the toy example :D
I will try to sort out the CI error, which I think comes from a different numpy version.

Contributor

@tguillemot tguillemot Feb 16, 2017

It's a problem of precision with numpy.interp indeed.

self : object
Returns self
"""
X = self._validate_X(X)
Contributor

Just a nitpick: is it necessary to create a specific function?
When there are only a few lines, I prefer not to create a function ;).

normalizer = QuantileNormalizer()
normalizer.fit(X)
X_trans = normalizer.fit_transform(X)
assert_array_almost_equal(np.min(X_trans, axis=0), 0.)
Member

You can use assert_almost_equal when you compare scalar values.

X_trans = normalizer.fit_transform(X)
assert_array_almost_equal(np.min(X_trans, axis=0), 0.)
assert_array_almost_equal(np.max(X_trans, axis=0), 1.)

Member

Can you please add a check that extreme values are mapped to 0 or 1, e.g.

X_test = np.array([
    [ -1,  1,  0],
    [101, 11, 10],
])
expected = np.array([
    [0, 0, 0],
    [1, 1, 1],
])
assert_array_almost_equal(normalizer.transform(X_test), expected)
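
The clipping behaviour requested in this check can be sketched with plain NumPy. This is only an illustration under assumptions: `X_train`, `to_uniform` and the five-quantile "fit" are made-up names and values, not the code under review. It relies on `np.interp` returning the first or last output value for inputs outside the fitted range.

```python
import numpy as np

# Toy training column; all names here are illustrative.
X_train = np.array([0.0, 25.0, 50.0, 75.0, 100.0])

# "Fit": evenly spaced reference levels and the matching empirical quantiles.
references = np.linspace(0, 1, 5)
quantiles = np.percentile(X_train, references * 100)


def to_uniform(x):
    # np.interp falls back to the first/last reference for x outside
    # the fitted range, which gives the 0/1 clipping.
    return np.interp(x, quantiles, references)


# One value below the fitted range, one inside, one above.
X_test = np.array([-1.0, 50.0, 101.0])
X_test_trans = to_uniform(X_test)
print(X_test_trans)
```

Values below the smallest fitted quantile end up at 0 and values above the largest at 1, which is what the proposed `expected` array encodes.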


for feat_idx, f in enumerate(func_transform):
    Xt.data[Xt.indptr[feat_idx]:Xt.indptr[feat_idx + 1]] = f(
        Xt.data[Xt.indptr[feat_idx]:Xt.indptr[feat_idx + 1]])
Member

Nice :)

Member

Actually you could factorize the slicing to make the code more readable:

column_slice = slice(Xt.indptr[feat_idx], Xt.indptr[feat_idx + 1])
Xt.data[column_slice] = f(Xt.data[column_slice])
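
To make the slicing concrete, here is a small sketch that builds the CSC arrays by hand, so that it needs only NumPy. The matrix values and the two per-column functions are hypothetical stand-ins for the fitted transforms.

```python
import numpy as np

# Hand-built CSC layout for the 3x2 matrix [[1, 0], [0, 4], [2, 5]]:
# `data` stores the non-zero values column by column, and
# indptr[j]:indptr[j + 1] delimits the stored entries of column j.
data = np.array([1.0, 2.0, 4.0, 5.0])
indptr = np.array([0, 2, 4])

# One hypothetical function per column, standing in for the fitted transforms.
func_transform = (lambda v: v / 2.0, lambda v: v / 5.0)

for feat_idx, f in enumerate(func_transform):
    column_slice = slice(indptr[feat_idx], indptr[feat_idx + 1])
    # Only the stored entries are rewritten; implicit zeros are untouched.
    data[column_slice] = f(data[column_slice])

print(data)
```

Because the loop touches only `data`, the sparsity pattern is preserved. That is also why the approach depends on the CSC layout, where each column's values are contiguous.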

----------
X : sparse matrix, shape (n_samples, n_features)
The data used to scale along the features axis. The sparse matrix
needs to be semi-positive.
Member

You should make it explicit that it only works for CSC sparse matrices (I know this is not public API but it makes it easier to understand how the code works).

# we only accept positive sparse matrix
if sparse.issparse(X) and X.min() < 0:
    raise ValueError('QuantileNormalizer only accepts semi-positive'
                     ' sparse matrices')
Member

Not "semi-positive sparse matrix" but "sparse matrices with all non-negative entries".

def test_quantile_normalizer_error_neg_sparse():
    X = np.array([[0, 25, 50, 75, 100],
                  [-2, 4, 6, 8, 10],
                  [2.6, 4.1, 2.3, 9.5, 0.1]]).T
Member

You should insert more zero values in this matrix to make it sparser.


X = np.array([[0, 25, 50, 75, 100],
              [2, 4, 6, 8, 10],
              [2.6, 4.1, 2.3, 9.5, 0.1]]).T
Member

You should insert more zero values in this matrix to make it sparser.

qn_ser = pickle.dumps(qn, pickle.HIGHEST_PROTOCOL)
qn2 = pickle.loads(qn_ser)
assert_array_almost_equal(qn.transform(iris.data),
                          qn2.transform(iris.data))
Member

You should also check that it can pickle correctly before fitting (even though it should trivially work).


The normalization is applied on each feature independently.
The cumulative density function of a feature is used to project the
original values.
Contributor

Add something like:

Features of new/unseen data that fall below or above the fitted range will be mapped to 0 and 1, respectively.
Note that this transform is non-linear. It may remove correlations between variables measured at the same scale but renders variables measured at different scales more directly comparable.
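
The effect of the non-linearity on correlations can be seen on a toy pair of features. In this sketch, `to_uniform` is a hypothetical helper that uses every sample as a quantile, and the data are invented so that one shared outlier dominates the Pearson correlation.

```python
import numpy as np


def to_uniform(col):
    # Hypothetical helper: use every sample as a quantile, then map the
    # column through its own empirical CDF.
    references = np.linspace(0, 1, len(col))
    quantiles = np.percentile(col, references * 100)
    return np.interp(col, quantiles, references)


# The shared outlier (100) drives the linear correlation of the raw features.
x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])
y = np.array([4.0, 3.0, 2.0, 1.0, 100.0])

corr_before = np.corrcoef(x, y)[0, 1]
corr_after = np.corrcoef(to_uniform(x), to_uniform(y))[0, 1]
print(corr_before, corr_after)
```

On this example the correlation goes from close to 1 to essentially 0, because each feature is transformed independently and only the ordering within a feature is retained.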

See also
--------
:class:`sklearn.preprocessing.StandardScaler` to perform standardization
that is faster, but less robust to outliers.
Contributor

Add maybe

:class:`sklearn.preprocessing.RobustScaler` to perform robust standardization that removes the influence of outliers but does not put outliers and inliers on the same scale.
     

bounds_error=False,
fill_value=(min(quantiles_feat),
max(quantiles_feat)))
for quantiles_feat in self.quantiles_.T]
Contributor

Is there any reason for these guys being lists, hence mutable?

Member Author

Good point.

self.references_ = np.linspace(0, 1, self.n_quantiles,
endpoint=True)
# FIXME: it does not take into account the zero in the computation
self.quantiles_ = np.array([np.percentile(
Member Author

@ogrisel @tguillemot Here I am not really sure what should be the right way.
Assuming that the sparse matrix has a lot of zeros for a given feature, it will have a bad influence on the normalisation, won't it?
In fact, it could also be the case for dense data. That was the reason for including a quantile_range.

Contributor

@tguillemot tguillemot Feb 16, 2017

Can we modify the reference values to take into account the number of zeros?
Not sure it's what we want.

Member Author

It is in np.percentile that we can do that. We know the size of X_col and we can know the number of non-zeros. Therefore, we can add the zeros to the data to compute the percentiles.

Member

Yes, we need to find a way to shift the percentile distribution efficiently. It is probably better to do the quantile computation ourselves: sort the subsampled non-zero data of the column, then work out the fraction of zeros that should be considered as prepended to that array (without actually materializing them), also taking the subsampling rate into account, and do the quantile lookups manually.

Member

But then we also need to handle the linear interpolation...

Member

@ogrisel ogrisel Feb 16, 2017

Actually, no need to do that, let's do:

column_nnz_data = X.data[X.indptr[feat]:X.indptr[feat + 1]]
column_subsample = subsample * len(column_nnz_data) // X.shape[0]
column_data = np.zeros(shape=subsample, dtype=X.dtype)
column_data[:column_subsample] = rng.choice(column_nnz_data, column_subsample,
                                            replace=False)

and then proceed to extract the quantiles from column_data as usual.

Member

Because subsample is going to be smallish and independent of X.shape[0] this is good enough and easier to maintain.
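
A runnable version of the snippet above, on made-up data, shows how the zero padding shifts the quantiles. The sizes (1000 rows, 100 non-zeros, `subsample = 50`, 11 quantiles) and the standalone `n_samples` variable replacing `X.shape[0]` are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.RandomState(0)

# Assumed toy setting: one sparse column with 1000 rows, 100 of them non-zero.
n_samples = 1000
column_nnz_data = rng.uniform(1.0, 2.0, size=100)
subsample = 50  # total number of values kept for the quantile estimate

# Keep the same non-zero/zero proportion in the subsample as in the column.
column_subsample = subsample * len(column_nnz_data) // n_samples
column_data = np.zeros(shape=subsample, dtype=column_nnz_data.dtype)
column_data[:column_subsample] = rng.choice(
    column_nnz_data, column_subsample, replace=False)

# Quantiles are then extracted from the zero-padded array as usual.
references = np.linspace(0, 1, 11)
quantiles = np.percentile(column_data, references * 100)
print(column_subsample, quantiles)
```

With 10% non-zero entries, only 5 of the 50 subsampled values are non-zero, so the quantiles up to the 80th percentile stay at 0. This is the effect that the FIXME about ignoring zeros was pointing at.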

# FIXME: it does not take into account the zero in the computation
self.quantiles_ = np.array([np.percentile(
    X.data[X.indptr[feat]:X.indptr[feat + 1]], self.references_ * 100)
    for feat in range(n_feat)]).T
Member

Cosmetics: please use n_features and feature_idx.

# assert_array_almost_equal(np.min(X_trans, axis=0), 0.)
# assert_array_almost_equal(np.max(X_trans, axis=0), 1.)
X_trans_inv = normalizer.inverse_transform(X_trans)
assert_array_almost_equal(X, X_trans_inv)
Member Author

Not directly related to this line but to transform followed by inverse_transform: the result will not be equal to X if X has out-of-bounds values, which are clipped during transform and then mapped to the minimum or maximum of the references_ during inverse transform.

Contributor

There is no problem in that case (and it's a way to be sure the normalizer works correctly).
But what you say is indeed true in the general case.
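
The two cases discussed here can be reproduced with a NumPy-only sketch. It assumes the forward and inverse maps are plain linear interpolations between the fitted quantiles and the references; `transform` and `inverse_transform` are illustrative functions, not the methods from the PR.

```python
import numpy as np

# Toy fit on a single column.
X_train = np.array([0.0, 25.0, 50.0, 75.0, 100.0])
references = np.linspace(0, 1, 5)
quantiles = np.percentile(X_train, references * 100)


def transform(x):
    # Forward map: data values -> [0, 1].
    return np.interp(x, quantiles, references)


def inverse_transform(u):
    # Inverse map: [0, 1] -> data values.
    return np.interp(u, references, quantiles)


inside = np.array([10.0, 60.0])    # within the fitted range
outside = np.array([-5.0, 120.0])  # outside the fitted range

inside_round_trip = inverse_transform(transform(inside))
outside_round_trip = inverse_transform(transform(outside))
print(inside_round_trip, outside_round_trip)
```

Values inside the fitted range come back unchanged, while -5 and 120 come back as 0 and 100, the ends of the fitted range. The round-trip assertion therefore only holds when the test data stay within that range.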

if direction:
    print(1)
Member Author

@tguillemot That looks like a debugging flag

    func_transform = self.f_transform_
else:
    print(2)
Member Author

@tguillemot That looks like a debugging flag

Contributor

Oops, indeed. Sorry.

@glemaitre
Member Author

@ogrisel I was checking the User guide for the preprocessing to see what to add.

I am having second thoughts about the naming of the class. From the description in the user guide, QuantileScaler would be more appropriate.

What is the reason to stick to normalizer?

@ogrisel
Member

ogrisel commented Feb 16, 2017

The problem is that (feature-wise) scaling stands for dividing each feature by a scalar value. This is the case for StandardScaler and RobustScaler but not in our case. I prefer QuantileNormalizer or QuantileTransformer.

@ogrisel
Member

ogrisel commented Feb 16, 2017

https://research.google.com/pubs/pub45530.html uses "quantile normalization" in the body of the article to describe what we do in this class. +1 for QuantileNormalizer.

The cumulative density function of a feature is used to project the
original values. Features values of new/unseen data that fall below
or above the fitted range will be mapped to 0 and 1, respectively.
Note that this transform is non-linear. It may remove correlations between
Member

"remove correlations" => "distort linear correlations"

This Normalizer scales the features between 0 and 1, equalizing the
distribution of each feature to a uniform distribution. Therefore,
for a given feature, this normalization tends to spread out the most
frequent values.
Member

It also reduces the impact of (marginal) outliers: this is therefore a robust preprocessing scheme.
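
As a rough illustration of this robustness, the sketch below compares a min-max rescaling with a quantile mapping on one column that contains a single large outlier. The data are invented, and the mapping uses every sample as a quantile, which is an assumption made to keep the example short.

```python
import numpy as np

# Nine inliers and one marginal outlier.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 1000.0])

# Min-max scaling: the outlier sets the range and squashes the inliers.
x_minmax = (x - x.min()) / (x.max() - x.min())

# Quantile mapping: only the ordering of the values matters.
references = np.linspace(0, 1, len(x))
quantiles = np.percentile(x, references * 100)
x_uniform = np.interp(x, quantiles, references)

print(x_minmax[:9].max(), x_uniform)
```

After min-max scaling, all inliers sit below 0.01. After the quantile mapping, they are spread evenly over [0, 8/9] and the outlier simply lands at 1.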

@glemaitre
Member Author

https://research.google.com/pubs/pub45530.html uses "quantile normalization" in the body of the article to describe what we do in this class. +1 for QuantileNormalizer.

Fair enough. The narration of the User guide needs to be changed to be coherent.


f_inverse_transform_ : list of callable, shape (n_quantiles,)
The inverse of the cumulative density function used to project the
data.
Member

I think we should keep the f_transform_ and f_inverse_transform_ attribute private (with a leading underscore).

@GaelVaroquaux
Member

I've removed the smoothing_noise

@jnothman : this should give us your 👍, no?

There is a failing test that I will address soon
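
The commit that replaced smoothing_noise is titled "2-ways interpolation to avoid smoothing_noise". One plausible reading of that title, sketched here as an assumption rather than as the merged code, is to interpolate once over the ascending quantiles and once over the mirrored (negated) quantiles, then average the two results, so that a value repeated across several quantiles maps to the middle of its block. The toy `quantiles` array and the function names are made up.

```python
import numpy as np

# Assumed toy fit: the value 1 is so frequent that three quantiles coincide.
quantiles = np.array([0.0, 1.0, 1.0, 1.0, 2.0])
references = np.linspace(0, 1, 5)


def one_way(x):
    # Plain ascending interpolation: a tied value lands on one extreme
    # of the block of references that share its quantile.
    return np.interp(x, quantiles, references)


def two_way(x):
    # Average an ascending pass with a pass over the mirrored curve.
    ascending = np.interp(x, quantiles, references)
    descending = -np.interp(-x, -quantiles[::-1], -references[::-1])
    return 0.5 * (ascending + descending)


x = np.array([0.0, 1.0, 2.0])
x_two_way = two_way(x)
print(one_way(x), x_two_way)
```

With this averaging, the repeated value 1 maps to 0.5, the centre of the references 0.25 to 0.75 that share it, without adding any noise to the data.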

@GaelVaroquaux GaelVaroquaux force-pushed the quantile_scaler branch 9 times, most recently from 05290f5 to 45a1548 Compare June 9, 2017 13:49
Simplifies also the code, examples, and documentation
@GaelVaroquaux GaelVaroquaux merged commit 26a1027 into scikit-learn:master Jun 9, 2017
@GaelVaroquaux
Member

Merged. Whoot!

This is based on a 4-year old PR by Joseph Turian :)

@dengemann
Contributor

dengemann commented Jun 9, 2017 via email

@agramfort
Member

🍻

@tguillemot
Contributor

👍

@raghavrv
Member

Yohoo :D Thanks for the patience @glemaitre

@jnothman
Member

Nicely resolved, @GaelVaroquaux, and well done all!

Sundrique pushed a commit to Sundrique/scikit-learn that referenced this pull request Jun 14, 2017
* resurrect quantile scaler

* move the code in the pre-processing module

* first draft

* Add tests.

* Fix bug in QuantileNormalizer.

* Add quantile_normalizer.

* Implement pickling

* create a specific function for dense transform

* Create a fit function for the dense case

* Create a toy examples

* First draft with sparse matrices

* remove useless functions and non-negative sparse compatibility

* fix slice call

* Fix tests of QuantileNormalizer.

* Fix estimator compatibility

* List of functions became tuple of functions
* Check X consistency at transform and inverse transform time

* fix doc

* Add negative ValueError tests for QuantileNormalizer.

* Fix cosmetics

* Fix compatibility numpy <= 1.8

* Add n_features tests and correct ValueError.

* PEP8

* fix fill_value for early scipy compatibility

* simplify sampling

* Fix tests.

* removing last pring

* Change choice for permutation

* cosmetics

* fix remove remaining choice

* DOC

* Fix inconsistencies

* pep8

* Add checker for init parameters.

* hack bounds and make a test

* FIX/TST bounds are provided by the fitting and not X at transform

* PEP8

* FIX/TST axis should be <= 1

* PEP8

* ENH Add parameter ignore_implicit_zeros

* ENH match output distribution

* ENH clip the data to avoid infinity due to output PDF

* FIX ENH restraint to uniform and norm

* [MRG] ENH Add example comparing the distribution of all scaling preprocessor (scikit-learn#2)

* ENH Add example comparing the distribution of all scaling preprocessor

* Remove Jupyter notebook convert

* FIX/ENH Select feat before not after; Plot interquantile data range for all

* Add heatmap legend

* Remove comment maybe?

* Move doc from robust_scaling to plot_all_scaling; Need to update doc

* Update the doc

* Better aesthetics; Better spacing and plot colormap only at end

* Shameless author re-ordering ;P

* Use env python for she-bang

* TST Validity of output_pdf

* EXA Use OrderedDict; Make it easier to add more transformations

* FIX PEP8 and replace scipy.stats by str in example

* FIX remove useless import

* COSMET change variable names

* FIX change output_pdf occurence to output_distribution

* FIX partial fixies from comments

* COMIT change class name and code structure

* COSMIT change direction to inverse

* FIX factorize transform in _transform_col

* PEP8

* FIX change the magic 10

* FIX add interp1d to fixes

* FIX/TST allow negative entries when ignore_implicit_zeros is True

* FIX use np.interp instead of sp.interpolate.interp1d

* FIX/TST fix tests

* DOC start checking doc

* TST add test to check the behaviour of interp numpy

* TST/EHN Add the possibility to add noise to compute quantile

* FIX factorize quantile computation

* FIX fixes issues

* PEP8

* FIX/DOC correct doc

* TST/DOC improve doc and add random state

* EXA add examples to illustrate the use of smoothing_noise

* FIX/DOC fix some grammar

* DOC fix example

* DOC/EXA make plot titles more succint

* EXA improve explanation

* EXA improve the docstring

* DOC add a bit more documentation

* FIX advance review

* TST add subsampling test

* DOC/TST better example for the docstring

* DOC add ellipsis to docstring

* FIX address olivier comments

* FIX remove random_state in sparse.rand

* FIX spelling doc

* FIX cite example in user guide and docstring

* FIX olivier comments

* EHN improve the example comparing all the pre-processing methods

* FIX/DOC remove title

* FIX change the scaling of the figure

* FIX plotting layout

* FIX ratio w/h

* Reorder and reword the plot_all_scaling example

* Fix aspect ratio and better explanations in the plot_all_scaling.py example

* Fix broken link and remove useless sentence

* FIX fix couples of spelling

* FIX comments joel

* FIX/DOC address documentation comments

* FIX address comments joel

* FIX inline sparse and dense transform

* PEP8

* TST/DOC temporary skipping test

* FIX raise an error if n_quantiles > subsample

* FIX wording in smoothing_noise example

* EXA Denis comments

* FIX rephrasing

* FIX make smoothing_noise to be a boolearn and change doc

* FIX address comments

* FIX verbose the doc slightly more

* PEP8/DOC

* ENH: 2-ways interpolation to avoid smoothing_noise

Simplifies also the code, examples, and documentation
dmohns pushed a commit to dmohns/scikit-learn that referenced this pull request Aug 7, 2017
dmohns pushed a commit to dmohns/scikit-learn that referenced this pull request Aug 7, 2017
NelleV pushed a commit to NelleV/scikit-learn that referenced this pull request Aug 11, 2017
* resurrect quantile scaler

* move the code in the pre-processing module

* first draft

* Add tests.

* Fix bug in QuantileNormalizer.

* Add quantile_normalizer.

* Implement pickling

* create a specific function for dense transform

* Create a fit function for the dense case

* Create a toy examples

* First draft with sparse matrices

* remove useless functions and non-negative sparse compatibility

* fix slice call

* Fix tests of QuantileNormalizer.

* Fix estimator compatibility

* List of functions became tuple of functions
* Check X consistency at transform and inverse transform time

* fix doc

* Add negative ValueError tests for QuantileNormalizer.

* Fix cosmetics

* Fix compatibility numpy <= 1.8

* Add n_features tests and correct ValueError.

* PEP8

* fix fill_value for early scipy compatibility

* simplify sampling

* Fix tests.

* removing last pring

* Change choice for permutation

* cosmetics

* fix remove remaining choice

* DOC

* Fix inconsistencies

* pep8

* Add checker for init parameters.

* hack bounds and make a test

* FIX/TST bounds are provided by the fitting and not X at transform

* PEP8

* FIX/TST axis should be <= 1

* PEP8

* ENH Add parameter ignore_implicit_zeros

* ENH match output distribution

* ENH clip the data to avoid infinity due to output PDF

* FIX ENH restraint to uniform and norm

* [MRG] ENH Add example comparing the distribution of all scaling preprocessor (#2)

* ENH Add example comparing the distribution of all scaling preprocessor

* Remove Jupyter notebook convert

* FIX/ENH Select feat before not after; Plot interquantile data range for all

* Add heatmap legend

* Remove comment maybe?

* Move doc from robust_scaling to plot_all_scaling; Need to update doc

* Update the doc

* Better aesthetics; Better spacing and plot colormap only at end

* Shameless author re-ordering ;P

* Use env python for she-bang

* TST Validity of output_pdf

* EXA Use OrderedDict; Make it easier to add more transformations

* FIX PEP8 and replace scipy.stats by str in example

* FIX remove useless import

* COSMET change variable names

* FIX change output_pdf occurence to output_distribution

* FIX partial fixies from comments

* COMIT change class name and code structure

* COSMIT change direction to inverse

* FIX factorize transform in _transform_col

* PEP8

* FIX change the magic 10

* FIX add interp1d to fixes

* FIX/TST allow negative entries when ignore_implicit_zeros is True

* FIX use np.interp instead of sp.interpolate.interp1d

* FIX/TST fix tests

* DOC start checking doc

* TST add test to check the behaviour of interp numpy

* TST/EHN Add the possibility to add noise to compute quantile

* FIX factorize quantile computation

* FIX fixes issues

* PEP8

* FIX/DOC correct doc

* TST/DOC improve doc and add random state

* EXA add examples to illustrate the use of smoothing_noise

* FIX/DOC fix some grammar

* DOC fix example

* DOC/EXA make plot titles more succint

* EXA improve explanation

* EXA improve the docstring

* DOC add a bit more documentation

* FIX advance review

* TST add subsampling test

* DOC/TST better example for the docstring

* DOC add ellipsis to docstring

* FIX address olivier comments

* FIX remove random_state in sparse.rand

* FIX spelling doc

* FIX cite example in user guide and docstring

* FIX olivier comments

* EHN improve the example comparing all the pre-processing methods

* FIX/DOC remove title

* FIX change the scaling of the figure

* FIX plotting layout

* FIX ratio w/h

* Reorder and reword the plot_all_scaling example

* Fix aspect ratio and better explanations in the plot_all_scaling.py example

* Fix broken link and remove useless sentence

* FIX fix couples of spelling

* FIX comments joel

* FIX/DOC address documentation comments

* FIX address comments joel

* FIX inline sparse and dense transform

* PEP8

* TST/DOC temporary skipping test

* FIX raise an error if n_quantiles > subsample

* FIX wording in smoothing_noise example

* EXA Denis comments

* FIX rephrasing

* FIX make smoothing_noise to be a boolearn and change doc

* FIX address comments

* FIX verbose the doc slightly more

* PEP8/DOC

* ENH: 2-ways interpolation to avoid smoothing_noise

Simplifies also the code, examples, and documentation
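
The last commit in the list above replaces the smoothing noise with an interpolation run in both directions. A minimal sketch of that idea with `np.interp` follows. The helper name `two_way_interp` and the toy arrays are hypothetical and not taken from the PR, and the result on tied values relies on how `np.interp` happens to treat repeated `xp` entries, which NumPy does not document as a guarantee.

```python
import numpy as np


def two_way_interp(x, quantiles, references):
    """Average an ascending and a reversed np.interp pass (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    # Ascending pass: on a run of tied quantiles this lands on the upper reference.
    forward = np.interp(x, quantiles, references)
    # Reversed pass on negated arrays: on the same run it lands on the lower reference.
    backward = -np.interp(-x, -quantiles[::-1], -references[::-1])
    return 0.5 * (forward + backward)


# Toy column where the value 1.0 occupies three consecutive quantiles.
quantiles = np.array([0.0, 1.0, 1.0, 1.0, 2.0])
references = np.linspace(0.0, 1.0, 5)  # 0, 0.25, 0.5, 0.75, 1

one_way = np.interp(1.0, quantiles, references)
two_way = two_way_interp(1.0, quantiles, references)
```

With a single pass, a value that is repeated in the data is mapped to one end of its run of quantiles; averaging the two passes maps it to the middle of the run, which is the effect the added noise was meant to approximate.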
paulha pushed a commit to paulha/scikit-learn that referenced this pull request Aug 19, 2017
* resurrect quantile scaler

* move the code in the pre-processing module

* first draft

* Add tests.

* Fix bug in QuantileNormalizer.

* Add quantile_normalizer.

* Implement pickling

* create a specific function for dense transform

* Create a fit function for the dense case

* Create a toy examples

* First draft with sparse matrices

* remove useless functions and non-negative sparse compatibility

* fix slice call

* Fix tests of QuantileNormalizer.

* Fix estimator compatibility

* List of functions became tuple of functions
* Check X consistency at transform and inverse transform time

* fix doc

* Add negative ValueError tests for QuantileNormalizer.

* Fix cosmetics

* Fix compatibility numpy <= 1.8

* Add n_features tests and correct ValueError.

* PEP8

* fix fill_value for early scipy compatibility

* simplify sampling

* Fix tests.

* removing last pring

* Change choice for permutation

* cosmetics

* fix remove remaining choice

* DOC

* Fix inconsistencies

* pep8

* Add checker for init parameters.

* hack bounds and make a test

* FIX/TST bounds are provided by the fitting and not X at transform

* PEP8

* FIX/TST axis should be <= 1

* PEP8

* ENH Add parameter ignore_implicit_zeros

* ENH match output distribution

* ENH clip the data to avoid infinity due to output PDF

* FIX ENH restraint to uniform and norm

* [MRG] ENH Add example comparing the distribution of all scaling preprocessor (scikit-learn#2)

* ENH Add example comparing the distribution of all scaling preprocessor

* Remove Jupyter notebook convert

* FIX/ENH Select feat before not after; Plot interquantile data range for all

* Add heatmap legend

* Remove comment maybe?

* Move doc from robust_scaling to plot_all_scaling; Need to update doc

* Update the doc

* Better aesthetics; Better spacing and plot colormap only at end

* Shameless author re-ordering ;P

* Use env python for she-bang

* TST Validity of output_pdf

* EXA Use OrderedDict; Make it easier to add more transformations

* FIX PEP8 and replace scipy.stats by str in example

* FIX remove useless import

* COSMET change variable names

* FIX change output_pdf occurence to output_distribution

* FIX partial fixies from comments

* COMIT change class name and code structure

* COSMIT change direction to inverse

* FIX factorize transform in _transform_col

* PEP8

* FIX change the magic 10

* FIX add interp1d to fixes

* FIX/TST allow negative entries when ignore_implicit_zeros is True

* FIX use np.interp instead of sp.interpolate.interp1d

* FIX/TST fix tests

* DOC start checking doc

* TST add test to check the behaviour of interp numpy

* TST/EHN Add the possibility to add noise to compute quantile

* FIX factorize quantile computation

* FIX fixes issues

* PEP8

* FIX/DOC correct doc

* TST/DOC improve doc and add random state

* EXA add examples to illustrate the use of smoothing_noise

* FIX/DOC fix some grammar

* DOC fix example

* DOC/EXA make plot titles more succint

* EXA improve explanation

* EXA improve the docstring

* DOC add a bit more documentation

* FIX advance review

* TST add subsampling test

* DOC/TST better example for the docstring

* DOC add ellipsis to docstring

* FIX address olivier comments

* FIX remove random_state in sparse.rand

* FIX spelling doc

* FIX cite example in user guide and docstring

* FIX olivier comments

* EHN improve the example comparing all the pre-processing methods

* FIX/DOC remove title

* FIX change the scaling of the figure

* FIX plotting layout

* FIX ratio w/h

* Reorder and reword the plot_all_scaling example

* Fix aspect ratio and better explanations in the plot_all_scaling.py example

* Fix broken link and remove useless sentence

* FIX fix couples of spelling

* FIX comments joel

* FIX/DOC address documentation comments

* FIX address comments joel

* FIX inline sparse and dense transform

* PEP8

* TST/DOC temporary skipping test

* FIX raise an error if n_quantiles > subsample

* FIX wording in smoothing_noise example

* EXA Denis comments

* FIX rephrasing

* FIX make smoothing_noise to be a boolearn and change doc

* FIX address comments

* FIX verbose the doc slightly more

* PEP8/DOC

* ENH: 2-ways interpolation to avoid smoothing_noise

Simplifies also the code, examples, and documentation
AishwaryaRK pushed a commit to AishwaryaRK/scikit-learn that referenced this pull request Aug 29, 2017
maskani-moh pushed a commit to maskani-moh/scikit-learn that referenced this pull request Nov 15, 2017
jwjohnson314 pushed a commit to jwjohnson314/scikit-learn that referenced this pull request Dec 18, 2017
output_distribution = self.output_distribution
output_distribution = getattr(stats, output_distribution)

# older version of scipy do not handle tuple as fill_value
Member

@glemaitre I bumped into this while trying to get rid of code related to old numpy/scipy versions that we no longer support. Do you remember what this is about? I could not figure it out by just looking at the code and searching the PR comments ...

Member Author

I think that it should have been removed. At first, I implemented the interpolation using scipy.interpolate.interp1d, which takes a fill_value parameter. In older versions, fill_value does not accept a tuple (min, max), which is what we need.

But right now we are using numpy.interp. We could move to the higher-level scipy interp function, but we would need to wait until fill_value accepts a tuple or array-like. I am also not sure it is worth spending time on it :)
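
To illustrate the out-of-range behavior discussed here, the following is a minimal sketch using only `numpy.interp`. The arrays `quantiles` and `references` are made-up values, not taken from the PR, and the `interp1d` variant with a tuple `fill_value` is deliberately not shown.

```python
import numpy as np

# Hypothetical fitted quantiles for one feature and their reference levels.
quantiles = np.array([1.0, 2.0, 4.0, 8.0])
references = np.linspace(0.0, 1.0, len(quantiles))

# Values below/above the fitted range fall back to fp[0] / fp[-1] by default.
x_new = np.array([-5.0, 1.0, 3.0, 8.0, 100.0])
clamped = np.interp(x_new, quantiles, references)

# The same bounds can be passed explicitly through `left` and `right`.
explicit = np.interp(x_new, quantiles, references,
                     left=references[0], right=references[-1])
```

Because `numpy.interp` already maps out-of-range inputs to the first and last reference values, no separate (min, max) fill value is needed in this sketch.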

Member

@lesteve lesteve Feb 28, 2018

OK thanks
