Add float32 compatibility to KMedoids #120

Merged

Conversation

TimotheeMathieu (Contributor)

I implemented the changes suggested by @rth in PR #83 for dtypes in KMedoids.
For CLARA, the dtype change can yield a considerable speedup, but for KMedoids there is, for reasons I don't understand, no speedup.
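For context, the change presumably follows the usual scikit-learn pattern for dtype handling (the diff itself isn't shown here, so this is only a sketch): passing a dtype list to check_array keeps float32 input as float32 and casts everything else to float64.

import numpy as np
from sklearn.utils import check_array

def _check_dtype(X):
    # Hypothetical helper illustrating the pattern: with a dtype list,
    # check_array keeps X's dtype if it matches an entry, and otherwise
    # casts to the first entry (float64 here).
    return check_array(X, dtype=[np.float64, np.float32])

assert _check_dtype(np.ones((5, 2), dtype=np.float32)).dtype == np.float32
assert _check_dtype(np.ones((5, 2), dtype=int)).dtype == np.float64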

Here is a small benchmark:

  • KMedoids (wall_time and cpu_time in seconds, peak_memory in MB)

                              wall_time  cpu_time  peak_memory
  ('build', 'float32')        0.178639   0.49873    1.14062
  ('build', 'float64')        0.166929   0.396058   0.0859375
  ('heuristic', 'float32')    0.0497368  0.255722   0.0117188
  ('heuristic', 'float64')    0.0783268  0.250505   60.8164
  ('k-medoids++', 'float32')  0.0602323  0.261578   0.00390625
  ('k-medoids++', 'float64')  0.0871284  0.284815   61.0117

  • CLARA (wall_time and cpu_time in seconds, peak_memory in MB)

                              wall_time  cpu_time  peak_memory
  ('build', 'float32')        0.944996   3.84636    0.160156
  ('build', 'float64')        0.794827   5.21157    1.21875
  ('heuristic', 'float32')    0.956299   4.54045    0
  ('heuristic', 'float64')    0.803466   5.22568    0
  ('k-medoids++', 'float32')  0.940615   6.23457    0
  ('k-medoids++', 'float64')  1.14663    6.11493    0.00390625

For CLARA we can design settings in which the difference between 64-bit and 32-bit is quite large. Here I used n_samples = 200,000 in dimension 100 with sampling_size = 200 for CLARA, and n_samples = 2,000 for KMedoids.

Code
import numpy as np
import neurtu
from sklearn_extra.cluster import KMedoids, CLARA

X = np.random.normal(size=(100_000, 100))
X_32 = X.astype(np.float32)


def make_experiment(dtype, init):
    X2 = X_32 if dtype == 'float32' else X
    # Swap in CLARA to benchmark it instead of KMedoids:
    # km = CLARA(init=init, sampling_size=50, n_clusters=9)
    km = KMedoids(n_clusters=9, init=init)
    km.fit(X2)


def cases():
    # One benchmark case per (init, dtype) combination, tagged for the report.
    for init in ['build', 'heuristic', 'k-medoids++']:
        for dtype in ['float32', 'float64']:
            tags = {'init': init, 'dtype': dtype}
            yield neurtu.delayed(make_experiment, tags=tags)(dtype, init)


bench = neurtu.Benchmark(wall_time=True, cpu_time=True, peak_memory=True)
df = bench(cases())
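With pandas installed, df comes back as a DataFrame indexed by the (init, dtype) tags, so printing it produces tables like the ones above.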

@rth (Contributor) left a comment:

Thanks! Could you parametrize an existing test to run on both float64 and float32 input, and maybe check that the output of transform has the same dtype? Otherwise LGTM.

Something like:

@pytest.mark.parametrize('dtype', [np.float64, np.float32])
def test_...(dtype):
    X_input = X_input.astype(dtype)
    ..
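Filled out, such a test might look like the sketch below; the test name, data, and estimator settings are illustrative rather than taken from the PR's actual diff:

import numpy as np
import pytest
from sklearn_extra.cluster import KMedoids

@pytest.mark.parametrize('dtype', [np.float64, np.float32])
def test_kmedoids_dtype(dtype):
    # Fit on input of the given dtype and check that transform()
    # returns distances of the same dtype.
    rng = np.random.RandomState(0)
    X_input = rng.rand(100, 4).astype(dtype)
    km = KMedoids(n_clusters=3, random_state=0).fit(X_input)
    assert km.transform(X_input).dtype == dtype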

@TimotheeMathieu merged commit 445aaf8 into scikit-learn-contrib:main on Jun 24, 2021.
@TimotheeMathieu deleted the clara_32bit branch on June 24, 2021 at 19:26.