ML Parallelization
Abstract—We present CRAMP (Classifiers and Regressors Adaptability Mapping for Parallelism), a unified taxonomy and benchmarking framework that categorizes machine learning models based on their adaptability to four parallelization strategies: MapReduce, multicore processing, multithreading, and gradient-based optimization. Unlike prior work, CRAMP introduces a novel parallelization-aware taxonomy linking algorithmic structure to execution behavior, an underexplored area in the literature. Experiments on the MNIST classification dataset and a real-world Alberta climate regression dataset, conducted on Google Colab, show that MapReduce excels for ensemble classifiers, multicore processing and multithreading boost CPU-bound models such as SVMs and random forests, and gradient-based parallelism suits iterative models such as neural networks. Across both tasks, parallelization generally preserves or improves predictive performance, offering practical guidance for scalable model deployment across diverse computing architectures.

Index Terms—Machine Learning, Classifiers, Regressors, Parallel Computing, Distributed Systems, Multicore Processing, MapReduce, Multithreading, Gradient-Based Optimization.

This research is supported by NSERC Discovery Grant #RGPIN-2020-05422 and the Seed Grant Program at Concordia University of Edmonton.
I. INTRODUCTION

Modern machine learning (ML) workloads increasingly require high-throughput, low-latency execution across diverse computing environments, from shared-memory multicore systems to distributed cloud platforms and accelerator-based clusters. Although hardware capabilities continue to grow, the computational characteristics of ML classifiers remain a critical bottleneck to scalable performance. Different classifiers exhibit unique optimization dynamics, memory access patterns, and iterative dependencies, resulting in highly variable parallelization behavior across hardware and software execution models [1].

In this paper, we present CRAMP (Classifiers and Regressors Adaptability Mapping for Parallelism), a structured framework that categorizes ML classifiers and regressors according to their suitability for four dominant parallel computing paradigms:

1) MapReduce-based distributed batch processing (e.g., Hadoop, Apache Spark [2]),
2) shared-memory multicore execution,
3) multithreading (e.g., OpenMP, Intel MKL, joblib [3]), and
4) gradient-based optimization on accelerators (e.g., GPUs and TPUs via TensorFlow or PyTorch [4]).

Our motivation stems from the observation that traditional ML taxonomies emphasize algorithmic distinctions, such as linear versus nonlinear or generative versus discriminative models, but often neglect how efficiently these algorithms can be mapped onto parallel hardware. For instance, Random Forests enable embarrassingly parallel training, whereas kernel-based SVMs suffer from an O(n^2) kernel matrix bottleneck [5], limiting their scalability in distributed-memory systems unless approximation methods are applied [6].
CRAMP fills this gap by (i) analyzing classifier families based on their core computational graphs and synchronization requirements, (ii) assessing their compatibility with key parallelization strategies, and (iii) surveying implementation tools available in popular open-source libraries and high-performance frameworks [7]–[9]. Unlike prior work, CRAMP introduces a novel parallelization-aware taxonomy that links algorithmic structure to execution behavior, an underexplored area in the literature, providing a rare and unified perspective on the intersection of algorithm design and system-level performance.

Our contributions are twofold: (1) a practical taxonomy grounded in parallel system behavior rather than abstract algorithmic theory, and (2) actionable guidelines enabling ML practitioners to select, configure, and optimize models based on hardware architecture and workload scale.
II. PARALLELIZATION OF MACHINE LEARNING ALGORITHMS

Processing large datasets is computationally expensive for traditional machine learning (ML) algorithms. Parallelization techniques are essential for scaling ML models across processors, cores, and distributed nodes. This paper explores four key parallelization approaches: MapReduce, Multicore Processing, Multithreading, and Parallel Gradient Descent, illustrated using standard ML algorithms such as Linear Regression, Decision Trees, K-Nearest Neighbors (KNN), and K-Means Clustering.

A. MapReduce-Based Parallelization

MapReduce is a distributed programming model introduced by Google, well suited for batch learning and embarrassingly parallel problems on large datasets. It consists of two main functions: Map (task distribution) and Reduce (aggregation of results).

Working Principle
• Map: Split the input into key-value pairs and distribute tasks across nodes.
• Reduce: Aggregate intermediate results from each node into the final output.
Example 1: Linear Regression (Normal Equation)
To solve
\hat{\beta} = (X^\top X)^{-1} X^\top y,
split the data across m nodes.
Map Phase: each node j computes the partial products X_j^\top X_j and X_j^\top y_j on its partition.
Reduce Phase:
X^\top X = \sum_{j=1}^{m} X_j^\top X_j, \qquad X^\top y = \sum_{j=1}^{m} X_j^\top y_j
Then:
\hat{\beta} = \Big( \sum_{j} X_j^\top X_j \Big)^{-1} \sum_{j} X_j^\top y_j
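The same two-phase computation can be sketched with a generic map/reduce interface. The following is a minimal sketch in plain Python, using NumPy and the standard multiprocessing pool as a stand-in for a Hadoop/Spark runtime; the helper names map_partial and reduce_sums are illustrative, not part of any framework.

```python
import numpy as np
from multiprocessing import Pool

def map_partial(chunk):
    """Map phase: each node computes X_j^T X_j and X_j^T y_j for its partition."""
    X_j, y_j = chunk
    return X_j.T @ X_j, X_j.T @ y_j

def reduce_sums(partials):
    """Reduce phase: sum the partial products and solve the normal equations."""
    XtX = sum(p[0] for p in partials)
    Xty = sum(p[1] for p in partials)
    return np.linalg.solve(XtX, Xty)   # beta_hat = (X^T X)^{-1} X^T y

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(10_000, 5))
    y = X @ np.array([1.0, -2.0, 0.5, 3.0, 0.0]) + rng.normal(scale=0.1, size=10_000)

    m = 4  # number of simulated nodes
    chunks = list(zip(np.array_split(X, m), np.array_split(y, m)))
    with Pool(processes=m) as pool:
        partials = pool.map(map_partial, chunks)   # distributed "map"
    beta_hat = reduce_sums(partials)               # centralized "reduce"
    print(beta_hat)
```

The key property exploited here is that X^T X and X^T y are sums over rows, so the partial results from each partition can be combined without any further passes over the data.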
Example 2: K-Means Clustering
Map Phase: Assign each point to the nearest centroid:
\mathrm{Assign}(x_i) = \arg\min_{k} \lVert x_i - \mu_k \rVert^2
Reduce Phase: Recompute each centroid from its assigned points:
\mu_k = \frac{1}{|C_k|} \sum_{x_i \in C_k} x_i
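One K-Means iteration in the same map/reduce style might look as follows. This is an illustrative sketch only: the helpers kmeans_map and kmeans_reduce are our own names, and per-cluster partial sums and counts play the role of the emitted key-value pairs.

```python
import numpy as np

def kmeans_map(X_j, centroids):
    """Map: assign points in one partition to centroids; emit per-cluster sums and counts."""
    k = centroids.shape[0]
    dists = np.linalg.norm(X_j[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    sums = np.zeros_like(centroids)
    counts = np.zeros(k)
    for c in range(k):
        mask = labels == c
        sums[c] = X_j[mask].sum(axis=0)
        counts[c] = mask.sum()
    return sums, counts

def kmeans_reduce(partials, old_centroids):
    """Reduce: combine partial sums/counts and recompute centroids mu_k."""
    total_sums = sum(p[0] for p in partials)
    total_counts = sum(p[1] for p in partials)
    new_centroids = old_centroids.copy()
    nonempty = total_counts > 0
    new_centroids[nonempty] = total_sums[nonempty] / total_counts[nonempty, None]
    return new_centroids

# Example: one iteration over 3 simulated partitions
rng = np.random.default_rng(1)
X = rng.normal(size=(900, 2))
centroids = X[:3].copy()
partials = [kmeans_map(X_j, centroids) for X_j in np.array_split(X, 3)]
centroids = kmeans_reduce(partials, centroids)
```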
B. Multicore-Based Parallelization

Multicore systems enable multiple CPU cores to process data chunks concurrently, which is ideal for data-parallel operations. Results are merged after each core completes its computation.

Mathematical Example: Split X across m cores:
X = \bigcup_{j=1}^{m} X_j, \qquad \theta_j = f(X_j)
Aggregate:
\theta = \frac{1}{m} \sum_{j=1}^{m} \theta_j

Example 1: Decision Tree Split Selection
Cores evaluate different features or split points in parallel:
\mathrm{Gain}(f) = \mathrm{Impurity}(S) - \sum_{v} \frac{|S_v|}{|S|}\,\mathrm{Impurity}(S_v)
Each core returns the best local split; a final selection identifies the global best.
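A per-feature split search distributes naturally across workers, for example with joblib, which the introduction lists among the parallel execution tools [3]. The sketch below evaluates candidate thresholds for each feature in a separate job and keeps the global best; gini, gain, and best_split_for_feature are illustrative helpers under our own assumptions, not library functions.

```python
import numpy as np
from joblib import Parallel, delayed

def gini(y):
    """Gini impurity of a label vector."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def gain(y, mask):
    """Impurity reduction obtained by splitting y with a boolean mask."""
    n, n_l = len(y), int(mask.sum())
    if n_l == 0 or n_l == n:
        return -np.inf
    return gini(y) - (n_l / n) * gini(y[mask]) - ((n - n_l) / n) * gini(y[~mask])

def best_split_for_feature(X, y, f):
    """Best candidate threshold for feature f (evaluated by one worker)."""
    thresholds = np.quantile(X[:, f], np.linspace(0.05, 0.95, 32))
    best_gain, best_t = max((gain(y, X[:, f] <= t), t) for t in thresholds)
    return best_gain, f, best_t

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    X = rng.normal(size=(2_000, 8))
    y = (X[:, 0] + 0.5 * X[:, 3] > 0).astype(int)

    # One job per feature; the final max picks the global best split.
    results = Parallel(n_jobs=-1)(
        delayed(best_split_for_feature)(X, y, f) for f in range(X.shape[1])
    )
    best_gain, best_feature, best_threshold = max(results)
```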
Example 2: K-Nearest Neighbors (KNN)
For a query x, compute the distances
d(x, x_i) = \lVert x - x_i \rVert^2
Each core processes its chunk X_j:
D_j = \{ d(x, x_i) \mid x_i \in X_j \}
The partial distance sets are then merged, and the k smallest distances determine the neighbors.
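A chunked distance computation of this kind can be sketched with a process pool; the merge step simply concatenates the partial distance arrays and takes the k smallest. The chunking strategy and helper names here are illustrative choices, not a prescribed implementation.

```python
import numpy as np
from multiprocessing import Pool
from functools import partial

def chunk_distances(X_j, query):
    """Squared Euclidean distances from the query to every point in one chunk."""
    diff = X_j - query
    return np.einsum("ij,ij->i", diff, diff)

def parallel_knn(X, query, k=5, n_cores=4):
    chunks = np.array_split(X, n_cores)
    with Pool(processes=n_cores) as pool:
        partials = pool.map(partial(chunk_distances, query=query), chunks)
    d = np.concatenate(partials)   # merge the partial distance sets D_j
    return np.argsort(d)[:k]       # indices of the k nearest training points

if __name__ == "__main__":
    rng = np.random.default_rng(3)
    X = rng.normal(size=(50_000, 16))
    neighbors = parallel_knn(X, query=rng.normal(size=16), k=5)
    print(neighbors)
```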
C. Multithreading-Based Parallelization

Multithreading provides lightweight parallelism within a shared-memory process: each thread computes a partial result on its portion of the data, and the partial results are aggregated, for instance partial gradients
\nabla_j = \sum_{x_i \in X_j} \nabla l(x_i, y_i, \theta), \qquad \nabla L(\theta) = \sum_{j=1}^{m} \nabla_j

Example 1: Decision Tree (Impurity Computation)
Threads evaluate the impurity of candidate splits:
\mathrm{Impurity}(S) = -\sum_{c} p_c \log p_c, \qquad p_c = \frac{n_c}{|S|}
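Because NumPy releases the GIL inside many of its vectorized kernels, a standard thread pool can overlap candidate-split evaluations within one process. The sketch below uses ThreadPoolExecutor from the standard library; entropy_impurity and split_entropy are illustrative helpers, and the candidate thresholds are an arbitrary grid chosen for the example.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def entropy_impurity(y):
    """Impurity(S) = -sum_c p_c log p_c with p_c = n_c / |S|."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log(p)).sum())

def split_entropy(X, y, feature, threshold):
    """Weighted impurity of the two partitions induced by one candidate split."""
    mask = X[:, feature] <= threshold
    n, n_l = len(y), int(mask.sum())
    if n_l in (0, n):
        return np.inf
    return (n_l / n) * entropy_impurity(y[mask]) + ((n - n_l) / n) * entropy_impurity(y[~mask])

rng = np.random.default_rng(4)
X = rng.normal(size=(5_000, 6))
y = (X[:, 1] > 0.2).astype(int)

# Candidate (feature, threshold) pairs evaluated by a pool of threads.
candidates = [(f, t) for f in range(X.shape[1]) for t in np.linspace(-1, 1, 9)]
with ThreadPoolExecutor(max_workers=8) as pool:
    scores = list(pool.map(lambda c: split_entropy(X, y, *c), candidates))
best_feature, best_threshold = candidates[int(np.argmin(scores))]
```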
Example 2: K-Means
Each thread updates partial centroid sums and counts from its share of the data:
\mu_k^{(j)} = \sum_{x_i \in C_k^{(j)}} x_i, \qquad |C_k^{(j)}| = \text{count}
Aggregate:
\mu_k = \frac{\sum_{j} \mu_k^{(j)}}{\sum_{j} |C_k^{(j)}|}
D. Parallel Gradient Descent algorithms requiring frequent synchronization, such as
In large-scale settings, gradient descent is parallelized gradient-based optimization.
by partitioning data or using model averaging. Each Multicore processing suits CPU-bound tasks that ben-
worker updates a local model and periodically synchro- efit from parallel execution across physical cores on
nizes. a single machine, offering faster memory access and
lower latency than distributed frameworks. It is ideal for
Mathematical Representation decision tree construction, feature-wise computations,
Standard: and distance calculations in k-Nearest Neighbors (KNN)
on moderate-sized datasets fully loaded into memory.
θt+1 = θt − η∇L(θt ) Multithreading provides lightweight parallelism
Parallel (across m workers): Each computes: within shared memory, excelling in real-time inference,
data preprocessing, I/O-bound workloads, and operations
1 X like entropy or information gain in tree models and
∇j = ∇l(xi , yi , θt )
|Xj | feature selection. Shared memory enables fast context
xi ∈Xj
switching but requires careful concurrency management
Then: to avoid contention.
m
1 X Parallel Gradient Descent - whether distributed or
θt+1 = θt − η · ∇j
m j=1 mini-batch - is highly effective for iterative ML tasks like
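A minimal NumPy sketch of data-parallel SGD with model averaging for linear regression (squared loss) is shown below. The partitioning, learning rate, and the helper local_sgd_step are illustrative choices, and the workers are simulated sequentially; in practice this role is played by accelerator frameworks such as TensorFlow or PyTorch [4].

```python
import numpy as np

def local_sgd_step(X_j, y_j, beta, lr=0.1):
    """One local step on worker j: beta_j^(t+1) = beta_j^(t) - eta * grad_j."""
    grad = X_j.T @ (X_j @ beta - y_j) / len(y_j)   # grad_j = (1/|X_j|) sum_i grad l
    return beta - lr * grad

rng = np.random.default_rng(6)
X = rng.normal(size=(8_000, 10))
true_beta = rng.normal(size=10)
y = X @ true_beta + rng.normal(scale=0.05, size=8_000)

m = 4                      # number of (simulated) workers
beta = np.zeros(10)
for t in range(200):       # synchronous rounds
    # Each worker starts from the shared model and takes a local step on its shard.
    local_models = [
        local_sgd_step(X_j, y_j, beta)
        for X_j, y_j in zip(np.array_split(X, m), np.array_split(y, m))
    ]
    # Model averaging: beta^(t+1) = (1/m) sum_j beta_j^(t+1)
    beta = np.mean(local_models, axis=0)

print(np.linalg.norm(beta - true_beta))   # should shrink toward zero
```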
E. Comparison of Parallelization Techniques for Machine Learning Algorithms

MapReduce excels in distributed systems for large-scale batch tasks like matrix operations and K-Means clustering, efficiently distributing computation across multiple nodes. However, its overhead in job scheduling and data shuffling reduces its effectiveness for iterative algorithms requiring frequent synchronization, such as gradient-based optimization.

Multicore processing suits CPU-bound tasks that benefit from parallel execution across physical cores on a single machine, offering faster memory access and lower latency than distributed frameworks. It is ideal for decision tree construction, feature-wise computations, and distance calculations in k-Nearest Neighbors (KNN) on moderate-sized datasets fully loaded into memory.

Multithreading provides lightweight parallelism within shared memory, excelling in real-time inference, data preprocessing, I/O-bound workloads, and operations like entropy or information-gain computation in tree models and feature selection. Shared memory enables fast context switching but requires careful concurrency management to avoid contention.

Parallel Gradient Descent, whether distributed or mini-batch, is highly effective for iterative ML tasks like training linear models, neural networks, and regressors. It computes gradients independently across data partitions with synchronization for parameter updates, combining data parallelism and iterative convergence, making it a core technique in scalable deep learning and large-scale optimization.

The three tables respectively present: a comparative overview of parallelization techniques and their characteristics (Table I); the suitability of various machine learning algorithms for each parallelization method by meta algorithm category (Table II); and the detailed working principles of these methods across algorithm types, including MapReduce phases, core/thread-level parallelism, and gradient-based optimization (Table III).

TABLE I
COMPARISON OF PARALLELIZATION TECHNIQUES FOR MACHINE LEARNING ALGORITHMS

TABLE II
PARALLELIZATION SUITABILITY OF ALGORITHMS BY META ALGORITHM CATEGORY (BASED ON EXPERIMENTAL INSIGHTS)
TABLE III
PARALLELIZATION WORKING PRINCIPLES OF ALGORITHMS BY META ALGORITHM CATEGORY

Meta Algorithm | Algorithms | MapReduce: Map Phase | MapReduce: Reduce Phase | Multicore Parallelism | Multithreading | Gradient Descent / Loss Function
Ensemble - Boosting | AdaBoost, Gradient Boosting, HistGradientBoostingRegressor | Train weak learners on data partitions | Combine learners' outputs and update weights | Parallelize tree building per iteration or across boosting stages | Limited to tree split finding | Greedy optimization; loss: L^{(t+1)} = L^{(t)} - \eta \nabla L, e.g., \sum (y - F(x))^2
Ensemble - Bagging | Bagging, BaggingRegressor, Random Forest, ExtraTreeRegressor | Independently train base learners on bootstrapped samples | Aggregate predictions (majority vote or averaging) | Independent learners trained in parallel | Threaded training and prediction | Not applicable; non-iterative model
Ensemble - Voting | VotingRegressor | Train individual base regressors | Combine predictions via averaging | Parallel model execution | Parallel inference threads | Depends on base learners; typically not gradient-based
Tree-Based | Decision Tree, DecisionTreeRegressor, ExtraTreeRegressor | Calculate local best splits on subsets | Merge to find best global split | Find best split per feature in parallel | Parallelize candidate split search | Greedy optimization; surrogate loss: impurity functions such as Gini or entropy
Lazy Learners / Distance-Based | k-Nearest Neighbor | Compute distances between query and training points | Select k nearest based on aggregated results | Parallelize distance computations across cores | Threaded neighbor ranking and voting | No optimization; distance metric (e.g., Euclidean) used directly
Probabilistic Models | Multinomial Naive Bayes | Estimate probabilities per class and feature | Aggregate counts into conditional probabilities | Parallel class-wise probability computation | Thread-safe batch probability estimation | No gradient descent; uses Bayes' rule directly: P(C|X) \propto P(X|C) P(C)
Neural Networks (Deep and Shallow) | Neural Network, MLPRegressor, CNN | Compute layer-wise activations on batches | Aggregate gradients across batches and update weights | Parallel batch matrix operations | Threaded batch gradient updates | Gradient descent: w^{(t+1)} = w^{(t)} - \eta \nabla L(w), e.g., MSE or cross-entropy loss
Regularized Linear Models | Ridge, Ridge Classifier | Compute partial gradients per data partition | Aggregate gradients and update weights | Solve closed-form or use SGD per feature | Threaded linear algebra ops | Gradient descent: \min_w \|Xw - y\|^2 + \lambda \|w\|^2
Bayesian Models | BayesianRidge | Estimate posteriors for coefficients on data splits | Combine sufficient statistics for final posterior | Parallel sampling or matrix decomposition | Threaded matrix inversion | Closed-form solution or EM; \min_w \|Xw - y\|^2 + \lambda \|w\|^2 with Bayesian prior
Support Vector Machines | SVM, SVR | Compute kernel matrix and support vectors per block | Merge and solve QP problem | Parallel kernel matrix computation | Limited parallel solver threads | Dual optimization (non-GD); approximate via SGD for hinge loss: \max(0, 1 - y_i w^\top x_i)
Linear Models | Linear Regression | Compute X^\top X and X^\top y on chunks | Aggregate to solve normal equations | Parallel linear algebra computation | Threaded matrix operations | w = (X^\top X)^{-1} X^\top y, or SGD: w^{(t+1)} = w^{(t)} - \eta \nabla L(w)
III. EXPERIMENTAL RESULTS AND DISCUSSIONS

We conducted experiments on two real-world datasets to evaluate the effectiveness of different parallelization strategies for training machine learning models on the Google Colab cloud platform. The first dataset is the MNIST handwritten digit database, containing 60,000 training and 10,000 test grayscale images [10], with sample test images shown in Fig. 1. The second is a historical Alberta climate dataset [11], used to predict air temperature.

TABLE IV
COMPARISON OF SPEEDUP AND ACCURACY SPEEDUP ACROSS PARALLELIZATION TECHNIQUES AND CLASSIFIERS
Tree Regressor (1.52×), leveraging CPU scalability. Multithreading provides consistent gains for Bagging (1.51×), Decision Tree (1.49×), and Voting Regressors (1.06×) via shared-memory execution. Mean Squared Error (MSE) speedups stay near 1, with moderate gains for Extra Trees (1.4×, MapReduce) and Decision Trees (1.27×, Multicore), confirming that predictive performance is preserved or slightly enhanced.

Overall, the results show that appropriate parallelization can significantly accelerate classification and regression tasks. While speedup varies by algorithm and strategy, most approaches maintain, or slightly improve, predictive accuracy, confirming their practical value in large-scale ML workflows.
IV. CONCLUSION

In this work, we introduced CRAMP, a parallelization-aware framework for categorizing machine learning classifiers and regressors to support design decisions in high-performance ML systems. By analyzing models through execution semantics and hardware affinity, we developed a taxonomy linking algorithmic structures to their parallelization feasibility across four paradigms: distributed MapReduce, accelerator-based gradient descent, shared-memory multicore execution, and multithreading.

Unlike prior studies, CRAMP introduces a novel parallelization-aware taxonomy that connects algorithmic structure with system-level execution behavior, an underexplored dimension in the literature, offering a unified view of model performance across hardware backends.

Our study shows that models differ not only in accuracy and interpretability but also in their architectural compatibility. For instance, ensemble methods like bagging support high thread- and process-level parallelism, while boosting is limited by sequential dependencies. Deep neural networks benefit from data-parallel training on GPUs but can incur communication overhead in distributed settings without efficient gradient aggregation.

We identified key bottlenecks, namely kernel matrix explosion, greedy tree splitting, and sequential iteration, and proposed mitigation strategies such as histogram-based learning, feature hashing, and approximate kernel methods. These insights are distilled into compatibility tables and parallelization guidelines to aid in building efficient, scalable ML pipelines.

Future directions include extending CRAMP to federated and decentralized learning, where communication topologies and privacy constraints introduce new challenges. Integration with auto-parallelizing compilers, graph-based optimization frameworks (e.g., XLA, ONNX Runtime), and quantum-enhanced ML workflows offers promising paths toward scalable AI systems.

REFERENCES

[1] G. Eason, B. Noble, and I. N. Sneddon, "On certain integrals of Lipschitz-Hankel type involving products of Bessel functions," Phil. Trans. Roy. Soc. London, vol. A247, pp. 529–551, April 1955.
[2] J. Dean and S. Ghemawat, "MapReduce: Simplified data processing on large clusters," Communications of the ACM, vol. 51, no. 1, pp. 107–113, 2008.
[3] F. Pedregosa et al., "Scikit-learn: Machine learning in Python," Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.
[4] A. Paszke et al., "PyTorch: An imperative style, high-performance deep learning library," 2019, https://pytorch.org/.
[5] C.-C. Chang and C.-J. Lin, "LIBSVM: A library for support vector machines," ACM Transactions on Intelligent Systems and Technology, vol. 2, no. 3, pp. 27:1–27:27, 2011.
[6] A. Rahimi and B. Recht, "Random features for large-scale kernel machines," Advances in Neural Information Processing Systems (NIPS), vol. 20, 2007.
[7] Y. Freund and R. E. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," Journal of Computer and System Sciences, vol. 55, no. 1, pp. 119–139, 1997.
[8] G. Ke et al., "LightGBM: A highly efficient gradient boosting decision tree," Advances in Neural Information Processing Systems (NeurIPS), 2017.
[9] T. Chen and C. Guestrin, "XGBoost: A scalable tree boosting system," Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785–794, 2016.
[10] Y. LeCun and C. Cortes, "MNIST handwritten digit database," 2010. [Online]. Available: http://yann.lecun.com/exdb/mnist/
[11] Government of Canada, historical daily climate data, Jun. 2025. [Online]. Available: https://climate.weather.gc.ca/climate_data/daily_data_e.html