
CRAMP: Categorizing Classifiers and Regressors for Scalable Parallelism on Distributed and Multicore Systems

Parallelization-Aware Taxonomy Covering MapReduce, Multicore, Multithreading, and Gradient-Based Strategies

Baidya Nath Saha
Mathematics & Information Technology, Concordia University of Edmonton
Edmonton, Alberta, Canada
[email protected]

Pavan Sarvaiya
Mathematics & Information Technology, Concordia University of Edmonton
Edmonton, Alberta, Canada
[email protected]

Wali Mohammad Abdullah
Mathematics & Information Technology, Concordia University of Edmonton
Edmonton, Alberta, Canada
[email protected]

Md. Morshedul Islam
Mathematics & Information Technology, Concordia University of Edmonton
Edmonton, Alberta, Canada
[email protected]

Abstract—We present CRAMP (Classifiers and Regressors Adaptability Mapping for Parallelism), a unified taxonomy and benchmarking framework that categorizes machine learning models based on their adaptability to four parallelization strategies: MapReduce, multicore processing, multithreading, and gradient-based optimization. Unlike prior work, CRAMP introduces a novel parallelization-aware taxonomy linking algorithmic structure to execution behavior - an underexplored area in the literature. Experiments on the MNIST classification dataset and a real-world Alberta climate regression dataset, conducted on Google Colab, show that MapReduce excels for ensemble classifiers, multicore and multithreading boost CPU-bound models like SVMs and random forests, and gradient-based parallelism suits iterative models such as neural networks. Across both tasks, parallelization generally preserves or improves predictive performance, offering practical guidance for scalable model deployment across diverse computing architectures.

Index Terms—Machine Learning, Classifiers, Regressors, Parallel Computing, Distributed Systems, Multicore Processing, MapReduce, Multithreading, Gradient-Based Optimization.

I. INTRODUCTION

Modern machine learning (ML) workloads increasingly require high-throughput, low-latency execution across diverse computing environments - from shared-memory multicore systems to distributed cloud platforms and accelerator-based clusters. Although hardware capabilities continue to grow, the computational characteristics of ML classifiers remain a critical bottleneck to scalable performance. Different classifiers exhibit unique optimization dynamics, memory access patterns, and iterative dependencies, resulting in highly variable parallelization behavior across hardware and software execution models [1].

In this paper, we present CRAMP (Classifiers and Regressors Adaptability Mapping for Parallelism), a structured framework that categorizes ML classifiers and regressors according to their suitability for four dominant parallel computing paradigms:

1) MapReduce-based distributed batch processing (e.g., Hadoop, Apache Spark [2]),
2) Shared-memory multicore execution,
3) Multithreading (e.g., OpenMP, Intel MKL, joblib [3]), and
4) Gradient-based optimization on accelerators (e.g., GPUs, TPUs via TensorFlow or PyTorch [4]).

Our motivation stems from the observation that traditional ML taxonomies emphasize algorithmic distinctions - such as linear versus nonlinear or generative versus discriminative models - but often neglect how efficiently these algorithms can be mapped onto parallel hardware. For instance, Random Forests enable embarrassingly parallel training, whereas kernel-based SVMs suffer from an O(n^2) kernel matrix bottleneck [5], limiting their scalability in distributed memory systems unless approximation methods are applied [6].

CRAMP fills this gap by (i) analyzing classifier families based on their core computational graphs and synchronization requirements, (ii) assessing their compatibility with key parallelization strategies, and (iii) surveying implementation tools available in popular open-source libraries and high-performance frameworks [7]-[9]. Unlike prior work, CRAMP introduces a novel parallelization-aware taxonomy that links algorithmic structure to execution behavior - an under-explored area in the literature - providing a rare and unified perspective on the intersection of algorithm design and system-level performance.

Our contributions are twofold: (1) a practical taxonomy grounded in parallel system behavior rather than abstract algorithmic theory, and (2) actionable guidelines enabling ML practitioners to select, configure, and optimize models based on hardware architecture and workload scale.

(This research is supported by NSERC Discovery Grant # RGPIN-2020-05422 and the Seed Grant Program at Concordia University of Edmonton.)

II. PARALLELIZATION OF MACHINE LEARNING ALGORITHMS

Processing large datasets is computationally expensive for traditional machine learning (ML) algorithms. Parallelization techniques are essential for scaling ML models across processors, cores, and distributed nodes. This paper explores four key parallelization approaches: MapReduce, Multicore Processing, Multithreading, and Parallel Gradient Descent, illustrated using standard ML algorithms such as Linear Regression, Decision Trees, K-Nearest Neighbors (KNN), and K-Means Clustering.

A. MapReduce-Based Parallelization

MapReduce is a distributed programming model introduced by Google, well-suited for batch learning and embarrassingly parallel problems on large datasets. It consists of two main functions: Map (task distribution) and Reduce (aggregation of results).

Working Principle
• Map: Split input into key-value pairs and distribute tasks across nodes.
• Reduce: Aggregate intermediate results from each node into final output.

Example 1: Linear Regression (Normal Equation)
To solve:

β̂ = (X⊤X)⁻¹ X⊤y

split the data across m nodes.

Map Phase:

Map_j → (X_j⊤X_j, X_j⊤y_j)

Reduce Phase:

X⊤X = Σ_{j=1}^{m} X_j⊤X_j,   X⊤y = Σ_{j=1}^{m} X_j⊤y_j

Then:

β̂ = (Σ_j X_j⊤X_j)⁻¹ (Σ_j X_j⊤y_j)

Example 2: K-Means Clustering
Map Phase: Assign each point to the nearest centroid:

Assign(x_i) = arg min_k ∥x_i − µ_k∥²

Reduce Phase: Recompute centroids:

µ_k = (1/|C_k|) Σ_{x_i ∈ C_k} x_i
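To make the map and reduce phases of Example 1 concrete, the following minimal Python sketch emulates them with the standard multiprocessing module as a stand-in for a Hadoop or Spark cluster; the partition count and synthetic data are illustrative assumptions, not the paper's actual implementation:

import numpy as np
from multiprocessing import Pool

def map_phase(partition):
    """Map: each node computes its local sufficient statistics (Xj^T Xj, Xj^T yj)."""
    X_j, y_j = partition
    return X_j.T @ X_j, X_j.T @ y_j

def reduce_phase(mapped):
    """Reduce: sum the per-partition terms and solve the normal equations."""
    XtX = sum(part[0] for part in mapped)
    Xty = sum(part[1] for part in mapped)
    return np.linalg.solve(XtX, Xty)   # beta_hat = (X^T X)^{-1} X^T y

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(10_000, 5))
    y = X @ np.array([1.0, -2.0, 0.5, 3.0, 0.0]) + rng.normal(scale=0.1, size=10_000)

    m = 4                                                     # number of "nodes"
    partitions = list(zip(np.array_split(X, m), np.array_split(y, m)))

    with Pool(processes=m) as pool:
        mapped = pool.map(map_phase, partitions)              # Map phase
    beta_hat = reduce_phase(mapped)                           # Reduce phase
    print(beta_hat)

The same map/reduce decomposition carries over to the K-Means example: map emits per-cluster partial sums and counts, and reduce divides the aggregated sums by the aggregated counts.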
B. Multicore-Based Parallelization

Multicore systems enable multiple CPU cores to process data chunks concurrently, ideal for data-parallel operations. Results are merged after each core completes its computation.

Mathematical Example:
Split X across m cores:

X = ∪_{j=1}^{m} X_j,   θ_j = f(X_j)

Aggregate:

θ = (1/m) Σ_{j=1}^{m} θ_j

Example 1: Decision Tree Split Selection
Cores evaluate different features or split points in parallel:

Gain(f) = Impurity(S) − Σ_v (|S_v|/|S|) Impurity(S_v)

Each returns the best local split; a final selection identifies the global best.

Example 2: K-Nearest Neighbors (KNN)
For query x, compute distances:

d(x, x_i) = ∥x − x_i∥₂

Each core processes chunk X_j:

D_j = {d(x, x_i) | x_i ∈ X_j}

Merge distances, select k nearest.
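A minimal sketch of the per-core KNN decomposition in Example 2, using joblib [3] to fan the distance chunks out over worker processes; the chunking, query point, and k are illustrative assumptions rather than the paper's benchmark code:

import numpy as np
from joblib import Parallel, delayed

def chunk_distances(query, X_chunk):
    """Each core computes Euclidean distances for its chunk X_j."""
    return np.linalg.norm(X_chunk - query, axis=1)

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100_000, 16))
query = rng.normal(size=16)
k, n_cores = 5, 4

# Map-style step: one distance vector per chunk, computed on separate cores.
chunks = np.array_split(X_train, n_cores)
dists = Parallel(n_jobs=n_cores)(
    delayed(chunk_distances)(query, c) for c in chunks
)

# Merge step: concatenate partial results and select the k nearest points.
all_dists = np.concatenate(dists)
nearest_idx = np.argsort(all_dists)[:k]
print(nearest_idx, all_dists[nearest_idx])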


C. Multithreading-Based Parallelization

Multithreading enables concurrent lightweight threads sharing memory space, suited for feature-wise operations, model scoring, and shared-data tasks. It is less resource-intensive than multicore processing.

Mathematical Example:
Threads compute partial gradients for loss L(θ):

∇L(θ) = Σ_{i=1}^{n} ∇l(x_i, y_i, θ)

Each thread on X_j:

∇_j = Σ_{x_i ∈ X_j} ∇l(x_i, y_i, θ)

Aggregate:

∇L(θ) = Σ_{j=1}^{m} ∇_j

Example 1: Decision Tree (Impurity Computation)
Threads evaluate impurity for candidate splits:

Impurity(S) = − Σ_c p_c log p_c,   p_c = n_c / |S|

Example 2: K-Means
Each thread updates centroids from partial data:

µ_k^(j) = Σ_{x_i ∈ C_k^(j)} x_i,   |C_k^(j)| = count

Aggregate:

µ_k = Σ_j µ_k^(j) / Σ_j |C_k^(j)|
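A minimal sketch of the partial-gradient aggregation above using Python's ThreadPoolExecutor; in practice, thread-level speedups rely on libraries such as NumPy or Intel MKL releasing the GIL inside their vectorized kernels, and the loss, data, and thread count here are illustrative assumptions:

import numpy as np
from concurrent.futures import ThreadPoolExecutor

def partial_gradient(theta, X_j, y_j):
    """Least-squares partial gradient on one shard: sum_i x_i (x_i^T theta - y_i)."""
    return X_j.T @ (X_j @ theta - y_j)

rng = np.random.default_rng(0)
X = rng.normal(size=(50_000, 10))
y = X @ rng.normal(size=10)
theta = np.zeros(10)
m = 4  # number of threads / shards

X_shards = np.array_split(X, m)
y_shards = np.array_split(y, m)

# Each thread computes grad_j on its shard; shared memory avoids copying X.
with ThreadPoolExecutor(max_workers=m) as pool:
    grads = list(pool.map(partial_gradient, [theta] * m, X_shards, y_shards))

# Aggregate: grad L(theta) = sum_j grad_j
grad = sum(grads)
print(grad[:3])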
and data shuffling reduces effectiveness for iterative
D. Parallel Gradient Descent algorithms requiring frequent synchronization, such as
In large-scale settings, gradient descent is parallelized gradient-based optimization.
by partitioning data or using model averaging. Each Multicore processing suits CPU-bound tasks that ben-
worker updates a local model and periodically synchro- efit from parallel execution across physical cores on
nizes. a single machine, offering faster memory access and
lower latency than distributed frameworks. It is ideal for
Mathematical Representation decision tree construction, feature-wise computations,
Standard: and distance calculations in k-Nearest Neighbors (KNN)
on moderate-sized datasets fully loaded into memory.
θt+1 = θt − η∇L(θt ) Multithreading provides lightweight parallelism
Parallel (across m workers): Each computes: within shared memory, excelling in real-time inference,
data preprocessing, I/O-bound workloads, and operations
1 X like entropy or information gain in tree models and
∇j = ∇l(xi , yi , θt )
|Xj | feature selection. Shared memory enables fast context
xi ∈Xj
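The synchronous update and the model-averaging variant above can be sketched in a few lines of NumPy; for clarity the m workers are simulated in a loop, whereas a real deployment would run them as separate processes or devices, and the learning rate, data, and iteration counts are illustrative assumptions:

import numpy as np

def lr_gradient(beta, X_j, y_j):
    """Per-partition linear-regression gradient: -(1/|X_j|) sum x_i (y_i - x_i^T beta)."""
    return -X_j.T @ (y_j - X_j @ beta) / len(y_j)

rng = np.random.default_rng(0)
X = rng.normal(size=(20_000, 8))
beta_true = rng.normal(size=8)
y = X @ beta_true + rng.normal(scale=0.1, size=20_000)

m, eta, steps = 4, 0.1, 200
X_parts, y_parts = np.array_split(X, m), np.array_split(y, m)

# Synchronous parallel gradient descent: average the per-worker gradients at every step.
beta = np.zeros(8)
for _ in range(steps):
    grads = [lr_gradient(beta, Xp, yp) for Xp, yp in zip(X_parts, y_parts)]
    beta -= eta * np.mean(grads, axis=0)

# Model averaging: each worker runs local updates on its partition, then parameters are averaged.
local_betas = []
for Xp, yp in zip(X_parts, y_parts):
    b = np.zeros(8)
    for _ in range(steps):
        b -= eta * lr_gradient(b, Xp, yp)
    local_betas.append(b)
beta_avg = np.mean(local_betas, axis=0)

print(np.linalg.norm(beta - beta_true), np.linalg.norm(beta_avg - beta_true))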
E. Comparison of Parallelization Techniques for Machine Learning Algorithms

MapReduce excels in distributed systems for large-scale batch tasks like matrix operations and K-Means clustering, efficiently distributing computation across multiple nodes. However, its overhead in job scheduling and data shuffling reduces effectiveness for iterative algorithms requiring frequent synchronization, such as gradient-based optimization.

Multicore processing suits CPU-bound tasks that benefit from parallel execution across physical cores on a single machine, offering faster memory access and lower latency than distributed frameworks. It is ideal for decision tree construction, feature-wise computations, and distance calculations in k-Nearest Neighbors (KNN) on moderate-sized datasets fully loaded into memory.

Multithreading provides lightweight parallelism within shared memory, excelling in real-time inference, data preprocessing, I/O-bound workloads, and operations like entropy or information gain in tree models and feature selection. Shared memory enables fast context switching but requires careful concurrency management to avoid contention.

Parallel Gradient Descent - whether distributed or mini-batch - is highly effective for iterative ML tasks like training linear models, neural networks, and regressors. It computes gradients independently across data partitions with synchronization for parameter updates, combining data parallelism and iterative convergence, making it a core technique in scalable deep learning and large-scale optimization.

The three tables respectively present: a comparative overview of parallelization techniques and their characteristics (Table I); the suitability of various machine learning algorithms for each parallelization method by meta algorithm category (Table II); and the detailed working principles of these methods across algorithm types, including MapReduce phases, core/thread-level parallelism, and gradient-based optimization (Table III).
TABLE I
COMPARISON OF PARALLELIZATION TECHNIQUES FOR MACHINE LEARNING ALGORITHMS

Method | Applicable Models | Granularity (task decomposition level) | Strength | Limitation
MapReduce | Linear Regression, K-Means | Coarse | Distributed aggregation | High latency
Multicore | Decision Tree, KNN | Medium | Efficient CPU utilization and low latency | Memory overhead
Multithreading | K-Means, Trees | Fine | Lightweight tasks and very low latency | Shared memory issues
Gradient Descent | Linear Models, NNs and Deep Learning | Iterative | Scalable optimization and medium latency | Sync overhead

TABLE II
PARALLELIZATION SUITABILITY OF ALGORITHMS BY META ALGORITHM CATEGORY (BASED ON EXPERIMENTAL INSIGHTS)

Meta Algorithm | Algorithms | MapReduce | Multicore | Multithreading | Gradient Descent
Ensemble - Boosting | AdaBoost, Gradient Boosting, HistGradientBoosting | Highly suitable for MapReduce due to independent learners | Very effective; parallel tree construction boosts performance | Moderately effective in later stages of boosting | Not applicable; uses greedy additive strategies, not gradient-based
Ensemble - Bagging | Bagging, Random Forest, Extra Trees | Well-suited; learners train independently on subsets | Highly suitable; shows consistent improvement across CPU cores | Suitable; thread-level parallelism helps in training and inference | Not applicable
Ensemble - Voting | Voting Classifier/Regressor | Suitable; base models can train and infer independently | Effective; ensemble aggregation benefits from core parallelism | Suitable; especially for prediction phase | Depends on constituent models
Tree-Based | Decision Tree, Extra Trees | Limited suitability for MapReduce due to data dependencies | Effective; benefits from node-level and feature-level parallelism | Moderate benefit; helps during inference and shallow trees | Not applicable
Lazy Learners / Distance-Based | k Nearest Neighbors | Partially suitable; distance computations can be mapped | Good candidate; cores can compute distances in parallel | Effective; particularly useful during inference | Not applicable
Probabilistic Models | Multinomial Naive Bayes | Suitable; computations of probabilities are separable | Effective; independent feature contributions allow parallel updates | Suitable; thread-safe batch inference is common | Not applicable
Neural Networks | MLP, CNN | Not ideal; distributed gradient sync adds overhead | Highly effective; matrix operations accelerate training | Suitable; helps with batch-level ops | Core technique; training depends on gradient descent
Regularized Linear Models | Ridge, Ridge Classifier | Suitable; can parallelize gradient or closed-form steps | Highly effective; vectorized operations scale well | Limited; thread support varies across libraries | Applicable; uses gradient or analytical solutions
Bayesian Models | Bayesian Ridge | Not ideal for MapReduce due to matrix inversions | Moderate suitability for CPU parallelism | Partially useful; benefits depend on sampling method | Applicable; leverages gradient-based or closed-form estimation
Support Vector Machines | SVM, SVR | Less suitable; quadratic optimization is hard to scale | Effective; optimized libraries exploit multicore routines | Limited; high memory and kernel costs reduce gains | Partially applicable via SGD or kernel approximations
Linear Models | Linear Regression | Suitable; gradient or normal equation steps parallelize well | Highly effective; parallel matrix computation boosts performance | Limited; problem size often does not require threads | Applicable; solved using gradient or closed-form updates
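As a practical note on the multicore and multithreading rows for bagging ensembles in Table II, scikit-learn [3] exposes this parallelism through the n_jobs parameter; the sketch below is a minimal illustration on synthetic data, not the paper's benchmark script:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=20_000, n_features=20, random_state=0)

# Trees are independent, so training and prediction fan out across all available CPU cores.
clf = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=0)
clf.fit(X, y)
print(clf.score(X, y))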

III. EXPERIMENTAL RESULTS AND DISCUSSIONS

We conducted experiments on two real-world datasets to evaluate the effectiveness of different parallelization strategies for training machine learning models on the Google Colab cloud platform. The first dataset is the MNIST handwritten digit database, containing 60,000 training and 10,000 test grayscale images [10], with sample test images shown in Fig. 1. The second is a historical Alberta climate dataset, used to predict air temperature from parameters like humidity, precipitation, and wind speed [11].
TABLE III
PARALLELIZATION WORKING PRINCIPLES OF ALGORITHMS BY META ALGORITHM CATEGORY

Meta Algorithm | Algorithms | MapReduce: Map Phase | MapReduce: Reduce Phase | Multicore Parallelism | Multithreading | Gradient Descent / Loss Function
Ensemble - Boosting | AdaBoost, Gradient Boosting, HistGradientBoostingRegressor | Train weak learners on data partitions | Combine learners' outputs and update weights | Parallelize tree building per iteration or across boosting stages | Limited to tree split finding | Greedy optimization; loss: L^(t+1) = L^(t) − η∇L, e.g., Σ(y − F(x))²
Ensemble - Bagging | Bagging, BaggingRegressor, Random Forest, ExtraTreeRegressor | Independently train base learners on bootstrapped samples | Aggregate predictions (majority vote or averaging) | Independent learners trained in parallel | Threaded training and prediction | Not applicable; non-iterative model
Ensemble - Voting | VotingRegressor | Train individual base regressors | Combine predictions via averaging | Parallel model execution | Parallel inference threads | Depends on base learners; typically not gradient-based
Tree-Based | Decision Tree, DecisionTreeRegressor, ExtraTreeRegressor | Calculate local best splits on subsets | Merge to find best global split | Find best split per feature in parallel | Parallelize candidate split search | Greedy optimization; surrogate loss: impurity functions such as Gini or entropy
Lazy Learners / Distance-Based | k Nearest Neighbor | Compute distances between query and training points | Select k nearest based on aggregated results | Parallelize distance computations across cores | Threaded neighbor ranking and voting | No optimization; distance metric (e.g., Euclidean) used directly
Probabilistic Models | Multinomial Naive Bayes | Estimate probabilities per class and feature | Aggregate counts into conditional probabilities | Parallel class-wise probability computation | Thread-safe batch probability estimation | No gradient descent; uses Bayes rule directly: P(C|X) ∝ P(X|C)P(C)
Neural Networks (Deep and Shallow) | Neural Network, MLPRegressor, CNN | Compute layer-wise activations on batches | Aggregate gradients across batches and update weights | Parallel batch matrix operations | Threaded batch gradient updates | Gradient descent: w^(t+1) = w^(t) − η∇L(w), e.g., MSE or cross-entropy loss
Regularized Linear Models | Ridge, Ridge Classifier | Compute partial gradients per data partition | Aggregate gradients and update weights | Solve closed-form or use SGD per feature | Threaded linear algebra ops | Gradient descent: min_w ∥Xw − y∥² + λ∥w∥²
Bayesian Models | BayesianRidge | Estimate posteriors for coefficients on data splits | Combine sufficient statistics for final posterior | Parallel sampling or matrix decomposition | Threaded matrix inversion | Closed-form solution or EM; min_w ∥Xw − y∥² + λ∥w∥² with Bayesian prior
Support Vector Machines | SVM, SVR | Compute kernel matrix and support vectors per block | Merge and solve QP problem | Parallel kernel matrix computation | Limited parallel solver threads | Dual optimization (non-GD); approximate via SGD for hinge loss: max(0, 1 − y_i w⊤x_i)
Linear Models | Linear Regression | Compute X⊤X and X⊤y on chunks | Aggregate to solve normal equations | Parallel linear algebra computation | Threaded matrix operations | w = (X⊤X)⁻¹X⊤y, or SGD: w^(t+1) = w^(t) − η∇L(w)
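The gradient-descent column of Table III maps directly onto accelerator frameworks such as PyTorch [4]; the following minimal sketch shows one MSE-loss update step w ← w − η∇L(w), with the model shape, data, and learning rate chosen purely for illustration:

import torch

# Tiny linear model trained with the generic update w <- w - eta * grad L(w).
model = torch.nn.Linear(8, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.MSELoss()

X = torch.randn(256, 8)           # one mini-batch
y = torch.randn(256, 1)

opt.zero_grad()
loss = loss_fn(model(X), y)       # MSE loss, as in the table's last column
loss.backward()                   # autograd computes grad L(w)
opt.step()                        # gradient-descent parameter update
print(float(loss))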

TABLE IV
COMPARISON OF SPEEDUP AND ACCURACY SPEEDUP ACROSS PARALLELIZATION TECHNIQUES AND CLASSIFIERS

Classifier | Speedup: MapReduce | Speedup: Multicore | Speedup: Multithreading | Speedup: Gradient Descent | Accuracy Speedup: MapReduce | Accuracy Speedup: Multicore | Accuracy Speedup: Multithreading | Accuracy Speedup: Gradient Descent
AdaBoost | 8.59 | 0.99 | 1.02 | 1 | 0.8 | 1 | 0.89 | 0.9
Bagging | 1.66 | 1.21 | 1.24 | 1 | 1 | 1 | 0.98 | 1
Convolutional Neural Network | 1.5 | 2.52 | 1.02 | 1.03 | 1 | 1 | 1 | 1
Decision Tree | 1 | 1.5 | 1.33 | 0.84 | 1.06 | 1 | 1 | 1.89
Gradient Boosting | 4.8 | 7 | 1.8 | 1.2 | 1 | 1.01 | 0.97 | 1
k Nearest Neighbor | 1.2 | 1.03 | 2.59 | 5.24 | 1.08 | 1 | 1 | 0.9
Multinomial Naive Bayes | 0.8 | 0.7 | 0.9 | 0.85 | 0.9 | 1 | 1 | 0.9
Neural Network | 0.8 | 0.9 | 1.8 | 0.8 | 1 | 1 | 1 | 1
Random Forest | 1.2 | 1.5 | 2.7 | 1.1 | 1 | 1 | 1 | 1
Ridge Classifier | 1.32 | 1.6 | 1.2 | 0.9 | 0.9 | 1 | 1 | 0.9
Support Vector Machine | 2.3 | 3.53 | 1.2 | 1.1 | 1.1 | 1.01 | 1 | 1

Fig. 1. Sample images from MNIST test dataset.

Table IV compares four parallelization techniques - MapReduce, Multicore, Multithreading, and Gradient Descent - across multiple classifiers using two metrics: speedup, the ratio of non-parallel to parallel training time; and accuracy speedup, the ratio of parallel to baseline classification accuracy. Values greater than 1 indicate improvement.

From Table IV, MapReduce yields the highest speedup for AdaBoost (8.59×) and Gradient Boosting (4.8×), showcasing its suitability for ensemble models. Multicore proves most effective for Gradient Boosting (7×), SVM (3.53×), and CNNs (2.52×), benefiting CPU-heavy operations. Multithreading gives strong gains for Random Forest (2.7×) and kNN (2.59×), particularly for repeated data access or shallow computation. Gradient Descent, though less impactful for classifiers, offers notable gains for kNN (5.24×) and Decision Trees (accuracy speedup of 1.89×), reflecting benefits from iterative updates.
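For reproducibility, the two metrics above reduce to simple ratios; a minimal sketch, with a scikit-learn estimator and timings standing in for the actual benchmark runs, which are assumptions of this illustration:

import time
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=30_000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

def time_and_score(model):
    """Return (training time, test accuracy) for one configuration."""
    start = time.perf_counter()
    model.fit(X_tr, y_tr)
    return time.perf_counter() - start, model.score(X_te, y_te)

# Baseline (serial) vs. parallel (multicore) configuration of the same model.
t_serial, acc_serial = time_and_score(RandomForestClassifier(n_jobs=1, random_state=0))
t_parallel, acc_parallel = time_and_score(RandomForestClassifier(n_jobs=-1, random_state=0))

speedup = t_serial / t_parallel                # ratio of non-parallel to parallel training time
accuracy_speedup = acc_parallel / acc_serial   # ratio of parallel to baseline accuracy
print(f"speedup={speedup:.2f}, accuracy speedup={accuracy_speedup:.2f}")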
TABLE V
COMPARISON OF SPEEDUP AND MEAN SQUARED ERROR SPEEDUP ACROSS PARALLELIZATION TECHNIQUES AND REGRESSORS

Regressor | Speedup: MapReduce | Speedup: Multicore | Speedup: Multithreading | Speedup: Gradient Descent | MSE Speedup: MapReduce | MSE Speedup: Multicore | MSE Speedup: Multithreading | MSE Speedup: Gradient Descent
Linear Regression | 7.69 | 0.39 | 1.05 | 1 | 1 | 1 | 0.99 | 1
DecisionTree Regressor | 1.15 | 1.52 | 1.49 | 0.0 | 1 | 1.27 | 1.26 | 1
ExtraTreeRegressor | 1.16 | 0.74 | 0.79 | 1.02 | 1.4 | 1.26 | 1.2 | 1
HistGradientBoostingRegressor | 1.9 | 0.51 | 0.41 | 0.0 | 0.99 | 1 | 1.01 | 1
VotingRegressor | 1.70 | 1.82 | 1.06 | 0.14 | 1.01 | 1 | 1.03 | 0.99
MLPRegressor | 0.65 | 1.9 | 0.62 | 2.9 | 0.12 | 1.1 | 0.94 | 1.02
SVR | 0.44 | 1.06 | 1.83 | 0.58 | 1.29 | 1.02 | 1.01 | 1.01
BaggingRegressor | 0.89 | 1.48 | 1.51 | 1.6 | 0.99 | 1 | 1.10 | 1.3
Ridge | 0.43 | 0.2 | 0.5 | 0.4 | 0.21 | 1 | 1 | 1
BayesianRidge | 1.14 | 0.77 | 1.07 | 1.1 | 1 | 0.99 | 1 | 1.01

Table V shows results for regressors on the second dataset. Gradient Descent yields strong speedups for MLP Regressor (2.9×), Bagging Regressor (1.6×), and Bayesian Ridge (1.1×), which align well with iterative optimization. Multicore is effective for Voting Regressor (1.82×), MLP Regressor (1.9×), and Decision Tree Regressor (1.52×), leveraging CPU scalability. Multithreading provides consistent gains for Bagging (1.51×), Decision Tree (1.49×), and Voting Regressors (1.06×), via shared-memory execution. Mean Squared Error (MSE) speedups stay near 1, with moderate gains for Extra Trees (1.4×, MapReduce) and Decision Trees (1.27×, Multicore), confirming that predictive performance is preserved or slightly enhanced.

Overall, the results show that appropriate parallelization can significantly accelerate classification and regression tasks. While speedup varies by algorithm and strategy, most approaches maintain - or slightly improve - predictive accuracy, confirming their practical value in large-scale ML workflows.

IV. CONCLUSION

In this work, we introduced CRAMP, a parallelization-aware framework for categorizing machine learning classifiers and regressors to support design decisions in high-performance ML systems. By analyzing models through execution semantics and hardware affinity, we developed a taxonomy linking algorithmic structures to their parallelization feasibility across four paradigms: distributed MapReduce, accelerator-based gradient descent, shared-memory multicore execution, and multithreading.

Unlike prior studies, CRAMP introduces a novel parallelization-aware taxonomy that connects algorithmic structure with system-level execution behavior - an underexplored dimension in the literature - offering a unified view of model performance across hardware backends.

Our study shows that models differ not only in accuracy and interpretability but also in their architectural compatibility. For instance, ensemble methods like bagging support high thread- and process-level parallelism, while boosting is limited by sequential dependencies. Deep neural networks benefit from data-parallel training on GPUs but can incur communication overhead in distributed settings without efficient gradient aggregation.

We identified key bottlenecks - kernel matrix explosion, greedy tree splitting, and sequential iteration - and proposed mitigation strategies such as histogram-based learning, feature hashing, and approximate kernel methods. These insights are distilled into compatibility tables and parallelization guidelines to aid in building efficient, scalable ML pipelines.

Future directions include extending CRAMP to federated and decentralized learning, where communication topologies and privacy constraints introduce new challenges. Integration with auto-parallelizing compilers, graph-based optimization frameworks (e.g., XLA, ONNX Runtime), and quantum-enhanced ML workflows offers promising paths toward scalable AI systems.

REFERENCES

[1] G. Eason, B. Noble, and I. N. Sneddon, “On certain integrals of Lipschitz-Hankel type involving products of Bessel functions,” Phil. Trans. Roy. Soc. London, vol. A247, pp. 529–551, April 1955.
[2] J. Dean and S. Ghemawat, “MapReduce: Simplified data processing on large clusters,” Communications of the ACM, vol. 51, no. 1, pp. 107–113, 2008.
[3] F. Pedregosa et al., “Scikit-learn: Machine learning in Python,” Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.
[4] A. Paszke et al., “PyTorch: An imperative style, high-performance deep learning library,” 2019, https://pytorch.org/.
[5] C.-C. Chang and C.-J. Lin, “LIBSVM: A library for support vector machines,” ACM Transactions on Intelligent Systems and Technology, vol. 2, no. 3, pp. 27:1–27:27, 2011.
[6] A. Rahimi and B. Recht, “Random features for large-scale kernel machines,” Advances in Neural Information Processing Systems (NIPS), vol. 20, 2007.
[7] Y. Freund and R. E. Schapire, “A decision-theoretic generalization of on-line learning and an application to boosting,” Journal of Computer and System Sciences, vol. 55, no. 1, pp. 119–139, 1997.
[8] G. Ke et al., “LightGBM: A highly efficient gradient boosting decision tree,” Advances in Neural Information Processing Systems (NeurIPS), 2017.
[9] T. Chen and C. Guestrin, “XGBoost: A scalable tree boosting system,” Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785–794, 2016.
[10] Y. LeCun and C. Cortes, “MNIST handwritten digit database,” 2010. [Online]. Available: http://yann.lecun.com/exdb/mnist/
[11] Jun. 2025. [Online]. Available: https://climate.weather.gc.ca/climate_data/daily_data_e.html
