GitHub - rcantini/BLEST-ML: BLEST-ML (BLock size ESTimation through Machine Learning) is a methodology for block size estimation that relies on supervised machine learning techniques.

BLEST-ML for dislib

This repository provides an implementation of BLEST-ML (BLock size ESTimation through Machine Learning) tailored to optimizing the execution of dislib's distributed ML algorithms.

Two modules are provided:

GridSearch: generates the training dataset for the Machine Learning model, starting from a log of executions of several algorithms on different datasets.
StackedClassifier: implements the Machine Learning model that can predict the most suitable value of the block-size parameter, given an algorithm to be executed and the dataset information.

An additional module, Main, is also provided to show how to use the two modules above.
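
As a rough illustration of the idea behind GridSearch, the sketch below turns a log of executions into training rows by keeping, for each algorithm and dataset pair, the partitioning with the lowest elapsed time. The record layout, the field names, the algorithm keys, and the TEST_FAILED value are all assumptions made for this example. They do not reproduce the repository's actual log format or code.

```python
from collections import defaultdict

# Hypothetical sentinel marking a failed run (it plays the role that
# test_failed has in config.json; the real value may differ).
TEST_FAILED = -1.0

# Hypothetical execution log: one record per (algorithm, dataset, partitioning).
execution_log = [
    {"algorithm": "kmeans", "dataset": "d1", "rows": 1000, "cols": 10,
     "row_parts": 2, "col_parts": 1, "elapsed": 12.5},
    {"algorithm": "kmeans", "dataset": "d1", "rows": 1000, "cols": 10,
     "row_parts": 4, "col_parts": 1, "elapsed": 8.0},
    {"algorithm": "kmeans", "dataset": "d1", "rows": 1000, "cols": 10,
     "row_parts": 8, "col_parts": 1, "elapsed": TEST_FAILED},
    {"algorithm": "rf", "dataset": "d2", "rows": 5000, "cols": 20,
     "row_parts": 2, "col_parts": 2, "elapsed": 30.0},
    {"algorithm": "rf", "dataset": "d2", "rows": 5000, "cols": 20,
     "row_parts": 4, "col_parts": 2, "elapsed": 35.5},
]


def build_training_set(log, failed=TEST_FAILED):
    """Keep, for each (algorithm, dataset), the fastest successful partitioning."""
    runs = defaultdict(list)
    for rec in log:
        if rec["elapsed"] != failed:  # skip runs flagged as failed
            runs[(rec["algorithm"], rec["dataset"])].append(rec)

    training = []
    for (algorithm, dataset), recs in sorted(runs.items()):
        best = min(recs, key=lambda r: r["elapsed"])
        training.append({
            # Features: the algorithm plus basic dataset information.
            "algorithm": algorithm,
            "rows": best["rows"],
            "cols": best["cols"],
            # Labels: the partitioning that gave the lowest elapsed time.
            "row_parts": best["row_parts"],
            "col_parts": best["col_parts"],
        })
    return training


training_set = build_training_set(execution_log)
for row in training_set:
    print(row)
```

Each resulting row pairs the inputs a classifier would see (algorithm and dataset information) with the partitioning it should learn to predict.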

The config.json file contains all the configuration parameters, described below:

algorithms: a JSON object containing the information about all the algorithms to be executed. Each algorithm is instantiated via reflection and requires the module and class name, along with the type of the task (supervised or unsupervised).
var_config: contains the general configuration.
number_of_row_partitions: a list containing all the partitions along rows to be tested during the grid search.
number_of_column_partitions: a list containing all the partitions along columns to be tested during the grid search.
model_path_name: the path where the trained Machine Learning model is stored.
training_dataset_path_name: the path where the training set, used for training the Machine Learning model, is stored.
execution_log_path_name: the path where the log of executions, used for creating the training dataset, is stored.
execution_info_path_name: the path where the file containing the information about the performed executions is stored.
classification_metrics_path_name: the path where the file containing the classification metrics is stored.
test_failed: constant to be used for the elapsed time in the case of a failed test.
exec_config: a JSON object containing the configuration of the different executions to be performed. For each algorithm (using the same key as in the "algorithms" section), a list must be provided that contains the JSON configuration of each execution of that algorithm. Each configuration contains the path to the dataset to be used and the algorithm parameters, which are set dynamically.
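
To make the parameters above more concrete, here is a minimal sketch of how such a configuration could be loaded, how a class can be instantiated by reflection from its module and class name, and how the row and column partition lists combine into a grid. The top-level keys follow the list above. The nested key names ("module", "class", "supervised", "dataset_path", "params"), the placement of the partition lists inside var_config, and all the values are assumptions. A standard-library class stands in for a dislib estimator so that the example runs without dislib installed.

```python
import importlib
import itertools
import json

# Illustrative configuration. Nested key names and values are assumptions;
# collections.Counter is only a stand-in for a real dislib estimator class.
config_text = """
{
  "algorithms": {
    "counter": {"module": "collections", "class": "Counter", "supervised": false}
  },
  "var_config": {
    "number_of_row_partitions": [2, 4],
    "number_of_column_partitions": [1, 2],
    "test_failed": -1
  },
  "exec_config": {
    "counter": [
      {"dataset_path": "data/example.csv", "params": {"a": 1, "b": 2}}
    ]
  }
}
"""
config = json.loads(config_text)


def instantiate(spec, params):
    """Build an object by reflection from a module name and a class name."""
    module = importlib.import_module(spec["module"])
    cls = getattr(module, spec["class"])
    return cls(**params)  # parameters are set dynamically from the config


def partition_grid(var_config):
    """Return every (row partitions, column partitions) pair to be tested."""
    return list(itertools.product(
        var_config["number_of_row_partitions"],
        var_config["number_of_column_partitions"],
    ))


# One planned run per algorithm, execution configuration, and grid point.
planned_runs = []
for name, executions in config["exec_config"].items():
    spec = config["algorithms"][name]  # same key as in "algorithms"
    for execution in executions:
        for row_parts, col_parts in partition_grid(config["var_config"]):
            estimator = instantiate(spec, execution["params"])
            planned_runs.append(
                (name, execution["dataset_path"], row_parts, col_parts, estimator)
            )

print(len(planned_runs), "runs planned")
```

Keeping the module and class names in the configuration means new algorithms can be added to the grid search by editing config.json only, without touching the code that launches the executions.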

How to cite

Riccardo Cantini, Fabrizio Marozzo, Alessio Orsino, Domenico Talia, Paolo Trunfio, Rosa M. Badia, Jorge Ejarque, Fernando V. Novoa (2024). Block size estimation for data partitioning in HPC applications using machine learning techniques. Journal of Big Data, vol. 11, no. 1.
