Decision Trees and MPI Collective Algorithm Selection Problem
Abstract
Selecting the close-to-optimal collective algorithm based on the parameters of the collective
call at run time is an important step in achieving good performance of MPI applications. In
this paper, we explore the applicability of C4.5 decision trees to the MPI collective algorithm
selection problem. We construct C4.5 decision trees from the measured algorithm performance
data and analyze the decision tree properties and expected run time performance penalty.
In the cases we considered, the results show that C4.5 decision trees can be used to generate
a reasonably small and very accurate decision function. For example, the Broadcast decision
tree with only 21 leaves was able to achieve a mean performance penalty of 2.08%. Similarly,
combining experimental data for Reduce and Broadcast and generating a decision function from
the combined decision trees resulted in less than 2.5% relative performance penalty. The results
indicate that C4.5 decision trees are applicable to this problem and should be more widely used
in this domain.
1 Introduction
The performance of MPI collective operations is crucial for good performance of MPI applications
that use them [1]. For this reason, significant efforts have gone into design and implementation
of efficient collective algorithms both for homogeneous and heterogeneous cluster environments
[2, 3, 4, 5, 6, 7, 8]. Performance of these algorithms varies with the total number of nodes involved
in communication, system and network characteristics, size of data being transferred, current load
and, if applicable, the operation that is being performed, as well as the segment size which is used
for operation pipelining. Thus, selecting the best possible algorithm and segment size combination
(method) for every instance of a collective operation is important.
To ensure good performance of MPI applications, collective operations can be tuned for the
particular system. The tuning process often involves detailed profiling of the system, possibly
combined with communication modeling, analyzing the collected data, and generating a decision
function. During run-time, the decision function selects the close-to-optimal method for a particular
collective instance. This approach relies on the ability of the decision function to accurately predict
the algorithm and segment size to be used for the particular collective instance. Alternatively, one
could construct an in-memory decision system that could be queried/searched at run-time to provide
the optimal method information. In order for either of these approaches to be feasible, the memory
footprint and the time it takes to make decisions need to be minimal.
This paper studies the applicability of C4.5 decision trees [9] to the MPI collective algorithm/method
selection problem. We assume that the system of interest has been benchmarked and that detailed
performance information exists for each of the available collective communication methods.¹ With
this information, we focus our efforts on investigating whether the C4.5 algorithm is a feasible way
to generate static decision functions.
The paper proceeds as follows: Section 2 discusses existing approaches to the decision mak-
ing/algorithm selection problem; Section 3 provides background information on the C4.5 algorithm;
Section 4 discusses the mapping of performance measurement data to C4.5 input; Section 5 presents
experimental results; and Section 6 concludes the paper with a discussion of the results and future work.
2 Related work
The MPI collective algorithm selection problem has been addressed in many MPI implementations.
In FT-MPI [10], the decision function is generated manually using a visual inspection method aug-
mented with Matlab scripts used for analysis of the experimentally collected performance data.
This approach results in a precise, albeit complex, decision function. In the MPICH-2 MPI imple-
mentation, the algorithm selection is based on bandwidth and latency requirements of an algorithm,
and the switching points are predetermined by the implementers [5]. In the tuned collective module
of Open MPI [11], the algorithm selection can be done in one of the following three ways: via a
compiled decision function; via user-specified command line flags; or using a rule-based run-length
encoding scheme that can be tuned for a particular system.
Another possibility is to view this problem as a data mining task in which the algorithm selection
problem is replaced by an equivalent classification problem. The new problem is to classify collective
parameters (collective operation, communicator size, message size) into the correct category, a
method in our case, to be used at run time. The major benefit of this approach is that the decision
making process is a well-studied topic in the engineering and machine learning fields, so the literature
is readily available. Decision trees are used extensively in pattern recognition, CAD design,
signal processing, medicine, and biology [12].
Vuduc et al. construct statistical learning models to build different decision functions for matrix-
matrix multiplication algorithm selection [13]. In their work, they consider three methods for
decision function construction: parametric modeling; parametric geometry modeling; and non-
parametric geometry modeling. The non-parametric geometry modeling uses statistical learning
methods to construct implicit models of the boundaries/switching points between the algorithms
based on the actual experimental data. To achieve this, Vuduc et al. use the support vector method
[14].
Conceptually, the work presented in this paper is close to the non-parametric geometry modeling
work done by Vuduc et al. However, our problem domain is different: MPI collective operations
instead of matrix-matrix multiplication, and we use the C4.5 algorithm instead of support vector
methods. To the best of our knowledge, we are the only group that has approached the MPI
collective tuning process in this way.

¹ Detailed benchmarking of all possible methods takes a significant amount of time. If this is not an option,
performance profiles can be generated using a limited set of performance measurements coupled with performance
modeling [8].
3 C4.5 algorithm
C4.5 is a supervised learning classification algorithm used to construct decision trees from data
[9]. C4.5 can be applied to data that fulfill the following requirements:
• Attribute-value description: information about a single entry in the data must be described
in terms of attributes. The attribute values can be discrete or continuous, and in some cases,
an attribute value may be missing or can be ignored (see the sketch after this list);
• Predefined classes: the training data has to be divided in predefined classes or categories.
This is a standard requirement for supervised learning algorithms;
• Discrete classes: the classes must be clearly separated and a single training case either belongs
to a class or it does not. C4.5 cannot be used to predict continuous class values such as the
cost of a transaction;
• Sufficient data: the C4.5 algorithm utilizes an inductive generalization process by searching
for patterns in data. For this approach to work, the patterns must be distinguishable from
random occurrences. What constitutes a “sufficient” amount of data depends on the particular
data set and its attribute and class values, but in general, the statistical methods used in C4.5
to generate tests require a reasonably large amount of data.
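To make these requirements concrete for our problem, a single benchmarked data point can be viewed as a training case described by three attribute values and labeled with a class drawn from the set of available methods. The C struct below is only an illustrative sketch under that assumption; the enum values are placeholders loosely based on the methods named later in the paper, not the encoding actually fed to C4.5:

```c
/* Illustrative sketch of one training case for the collective selection
 * problem: three attributes plus a predefined, discrete class (the method
 * that was experimentally fastest).  All names are placeholders. */
#include <stddef.h>

typedef enum { COLL_BCAST, COLL_REDUCE, COLL_ALLTOALL } collective_t;

typedef enum {                        /* classes: algorithm + segment size */
    LINEAR_NOSEG, BINOMIAL_NOSEG, BINOMIAL_1KB, BINARY_1KB,
    SPLITTED_BINARY_1KB, PIPELINE_1KB, PIPELINE_8KB
} method_t;

struct training_case {
    collective_t collective;    /* discrete attribute: which collective      */
    int          comm_size;     /* continuous attribute: number of processes */
    size_t       msg_size;      /* continuous attribute: message size, bytes */
    method_t     best_method;   /* class: experimentally optimal method      */
};
```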
In the C4.5 algorithm, the initial decision tree is constructed using a variation of Hunt’s method
for decision tree construction (Figure 1). The main difference between C4.5 and other similar
decision tree building algorithms is in the test selection and evaluation process (last case in Figure
1). C4.5 utilizes the information gain ratio criterion, which selects the test that maximizes the
normalized information gain obtained by partitioning T in accordance with that test [9].
To define the gain ratio we have to look at the information conveyed by classified cases. Consider
a set T of k training cases. If we select a single case t ∈ T and decide that it belongs to class C_j,
then the probability of this message is \( freq(C_j, T)/|T| \) and it conveys \( -\log_2(freq(C_j, T)/|T|) \)
bits of information. Then the average amount of information needed to identify the class of a case
in set T can be computed as a weighted sum of per-case information amounts [9]:

\[
\mathrm{info}(T) = -\sum_{j=1}^{k} \frac{freq(C_j, T)}{|T|} \times \log_2\!\left(\frac{freq(C_j, T)}{|T|}\right) \qquad (1)
\]
If the set T was partitioned into n subsets based on outcomes of test X, we can compute a similar
information requirement [9]:
\[
\mathrm{info}_X(T) = \sum_{i=1}^{n} \frac{|T_i|}{|T|} \times \mathrm{info}(T_i) \qquad (2)
\]
Figure 1: Hunt’s method for decision tree construction [9].
Given a set of training cases, T, and a set of classes C = {C_1, C_2, ..., C_k},
the tree is constructed recursively by testing for the following cases:

T contains one or more cases which all belong to the same class C_j:
    a leaf node is created for T and is labeled as belonging to class C_j;

T contains no cases:
    a leaf node is created for T, but the class it belongs to must be
    selected from an outside source; the C4.5 algorithm selects the most
    frequent class at the parent node;

T contains cases that belong to more than one class:
    a test is found which splits the set T into single-class collections of cases.
    This test is based on a single attribute value and is selected such that
    it results in one or more mutually exclusive outcomes {O_1, O_2, ..., O_n}.
    The set T is then split into subsets {T_1, T_2, ..., T_n} such that
    the subset T_i contains all cases in T with outcome O_i.
    The algorithm is then called recursively on all subsets of T.
Then, the information gained by partitioning T in accordance with the test X can be computed as:
\[
\mathrm{gain}(X) = \mathrm{info}(T) - \mathrm{info}_X(T) \qquad (3)
\]
The predecessor to the C4.5 method, the ID3 algorithm, used the gain criterion in Equation 3 to select
the test for partitioning. However, the gain criterion is biased towards tests with many outcomes. To
ameliorate this problem, C4.5 normalizes the information gain by the amount of potential
information generated by dividing T into n subsets:
\[
\mathrm{split\_info}(X) = -\sum_{i=1}^{n} \frac{|T_i|}{|T|} \times \log_2\!\left(\frac{|T_i|}{|T|}\right) \qquad (4)
\]
The criterion that C4.5 uses to select the test to partition the set of available cases is then defined as:

\[
\mathrm{gain\_ratio}(X) = \frac{\mathrm{gain}(X)}{\mathrm{split\_info}(X)} \qquad (5)
\]

C4.5 selects the test that maximizes the gain ratio value.
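To make Equations (1)-(5) concrete, the following C sketch, written for this discussion rather than taken from the C4.5 release, computes the gain ratio of a candidate test X from the per-class case counts of the subsets it induces (compile with -lm; the counts in main are hypothetical):

```c
/* Sketch of the C4.5 test-selection criterion (Equations 1-5): given the
 * per-class case counts of the n subsets induced by a candidate test X,
 * compute gain_ratio(X).  Illustrative code, not the C4.5 source. */
#include <math.h>
#include <stdio.h>

#define MAX_CLASSES 8
#define MAX_SUBSETS 16

/* Entropy in bits of a class-frequency vector (Equation 1). */
static double info(const double freq[], int k, double total)
{
    double bits = 0.0;
    for (int j = 0; j < k; j++)
        if (freq[j] > 0.0)
            bits -= (freq[j] / total) * log2(freq[j] / total);
    return bits;
}

/* counts[i][j] = number of cases of class j that fall into subset T_i. */
static double gain_ratio(int n, int k, const double counts[][MAX_CLASSES])
{
    double class_freq[MAX_CLASSES] = { 0.0 };
    double subset_size[MAX_SUBSETS] = { 0.0 };
    double total = 0.0;

    for (int i = 0; i < n; i++) {
        for (int j = 0; j < k; j++) {
            subset_size[i] += counts[i][j];
            class_freq[j]  += counts[i][j];
        }
        total += subset_size[i];
    }

    double info_T = info(class_freq, k, total);                 /* Eq. (1) */

    double info_X = 0.0, split_info = 0.0;
    for (int i = 0; i < n; i++) {
        if (subset_size[i] == 0.0)
            continue;
        double w = subset_size[i] / total;
        info_X     += w * info(counts[i], k, subset_size[i]);   /* Eq. (2) */
        split_info -= w * log2(w);                              /* Eq. (4) */
    }

    double gain = info_T - info_X;                              /* Eq. (3) */
    return split_info > 0.0 ? gain / split_info : 0.0;          /* Eq. (5) */
}

int main(void)
{
    /* Hypothetical test "message size <= 512" splitting 20 Alltoall cases
     * over three classes {linear, bruck, ring}. */
    const double counts[2][MAX_CLASSES] = {
        { 2.0, 8.0, 2.0 },   /* subset: message size <= 512 */
        { 7.0, 1.0, 0.0 },   /* subset: message size  > 512 */
    };
    printf("gain ratio = %.3f\n", gain_ratio(2, 3, counts));
    return 0;
}
```

For the hypothetical split shown in main, the routine evaluates Equations (1)-(5) in order: the entropy of the whole set, the weighted entropy after the split, the gain, the split information, and finally their ratio.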
Once the initial decision tree is constructed, a pruning procedure is initiated to decrease the
overall tree size and the estimated error rate of the tree [9].
Additional parameters that affect the resulting decision tree are:
• weight, which specifies the minimum number of cases required in at least two outcomes of a test.
By default this value is two, meaning that for a test to be accepted, at least two of its outcomes
must have two or more cases. This prevents near-trivial splits that would result in
almost flat and very wide trees;
• confidence level, which is used for prediction of tree error rates and affects the pruning process.
The lower the confidence level, the higher the amount of pruning that will take place;
• attribute grouping, which can be used to create attribute value groups for discrete attributes
and possibly infer patterns that occur in sets of cases with particular values of an attribute but
do not occur for other values of that attribute;
• windowing, which is used to enable construction of multiple trees based on a portion of the
training data, after which the best performing tree is selected [9].
Decision Tree:
message size <= 512 :
— communicator size <= 4 :
— — message size <= 32 : ring (12.0/1.3)
— — message size > 32 : linear (8.0/2.4)
— communicator size > 4 :
— — communicator size > 8 : bruck (100.0/1.4)
— — communicator size <= 8 :
— — — message size <= 128 : bruck (8.0/1.3)
— — — message size > 128 : linear (2.0/1.0)
message size > 512 :
— message size > 1024 : linear (78.0/1.4)
— message size <= 1024 :
— — communicator size > 56 : linear (5.0/1.2)
— — communicator size <= 56 :
— — — communicator size <= 8 : linear (3.0/1.1)
— — — communicator size > 8 : bruck (5.0/1.2)
Table 1: C4.5 decision tree for Alltoall on Nano cluster. The numbers in parentheses following
the leaves represent number of the training cases covered by each leaf and the number of cases
misclassified by that leaf.
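A tree such as the one in Table 1 can be transcribed directly into the kind of compiled decision function mentioned in Section 1. The C sketch below is a hand translation of the Alltoall tree above; the enum values are illustrative placeholders rather than actual FT-MPI or Open MPI identifiers:

```c
/* Hand transcription of the Alltoall decision tree in Table 1 into a static
 * decision function.  Method names are illustrative placeholders. */
#include <stddef.h>

typedef enum { ALLTOALL_LINEAR, ALLTOALL_BRUCK, ALLTOALL_RING } alltoall_method_t;

static alltoall_method_t alltoall_decision(int comm_size, size_t msg_size)
{
    if (msg_size <= 512) {
        if (comm_size <= 4)
            return (msg_size <= 32) ? ALLTOALL_RING : ALLTOALL_LINEAR;
        if (comm_size > 8)
            return ALLTOALL_BRUCK;
        return (msg_size <= 128) ? ALLTOALL_BRUCK : ALLTOALL_LINEAR;
    }
    if (msg_size > 1024)
        return ALLTOALL_LINEAR;
    if (comm_size > 56)
        return ALLTOALL_LINEAR;
    return (comm_size <= 8) ? ALLTOALL_LINEAR : ALLTOALL_BRUCK;
}
```

At run time such a function costs only a handful of comparisons per collective invocation, which is in line with the requirement that the decision time and memory footprint be minimal.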
index/name. The c4.5rules program is supplied with the software release, but we did not use it
for this purpose.
Table 2: Broadcast decision tree statistics corresponding to the data presented in Figure 2. Size
refers to the number of leaf nodes in the tree. Errors are in terms of misclassified training cases.
The data set had 1248 training cases.
² A decision map is a 2D representation of the decision tree output for particular communicator and message
size ranges.
³ For more details on these algorithms, refer to [8].
Figure 2: Broadcast decision maps from Grig cluster: (a) Measured (b) ’-m 2 -c 25’ (c) ’-m 4 -c
25’ (d) ’-m 6 -c 15’ (e) ’-m 8 -c 5’ (f) ’-m 40 -c 5’. Different colors correspond to different method
indices. The trees were generated using the specified command line parameters. The x-axis scale is
logarithmic. The bright red color represents a Linear algorithm with no segmentation, the shades
of green represent Binomial without segmentation and with 1KB segments, dark blue is a Binary
algorithm with 1KB segments, gray is a Splitted Binary algorithm with 1KB segments, and shades
of yellow are Pipeline algorithms with 1KB and 8KB segments.
In the upper left corner of Figure 2 we can find an exact decision map generated from experimen-
tal data. The subsequent maps were generated from C4.5 decision trees constructed by specifying
different values for weight (-m) and confidence level (-c) parameters. The statistics about these
trees can be found in Table 2.
The exact decision map in Figure 2 exhibits clear trends, but there is a considerable amount of
fine-grained variation for intermediate message sizes (between 1KB and 10KB) and small communicator
sizes. The decision maps generated from the different C4.5 trees capture the general trends very well. The
amount of captured detail depends on weight, which determines how the initial tree will be built,
and confidence level, which affects the tree pruning process. “Heavier” trees require that branches
contain more cases, thus limiting the number of fine-grained splits. A lower confidence level allows
for more aggressive pruning, which also results in coarser decisions.
Looking at the decision tree statistics in Table 2, we can see that the default C4.5 tree (’-m 2
-c 25’) has 127 leaves and a predicted misclassification error of 14.6%. Using a slightly “heavier”
tree ’-m 4 -c 25’ gives us a 25.20% decrease in tree size (95 leaves) and maintains almost the same
predicted misclassification error. As we increase tree weight and decrease the confidence level, we
produce the tree with only 21 leaves (83.46% reduction in size) with a 50% increase in predicted
misclassifications (21.9%).
In this work, the goal is to construct reasonably small decision trees that will provide good
run-time performance of an MPI collective of interest. Given this goal, the number of misclassified
training examples is not the main figure of merit we need to consider. To determine the “quality”
of the resulting tree in terms of collective operation performance, we consider the performance
penalty of the tree.
The performance penalty is the relative difference between the performance obtained using the
method predicted by the decision tree and the performance of the experimentally optimal method:

\[
\mathrm{penalty}_X(comm, msg) = \frac{t_X(comm, msg) - t(comm, msg)}{t(comm, msg)} \times 100\%
\]

where \( t_X(comm, msg) \) is the operation duration using the method predicted by tree X for communicator
size comm and message size msg, and \( t(comm, msg) \) is the operation duration using the
experimentally optimal method. The mean performance penalty of a decision tree is the mean
value of the performance penalties over all communicator and message sizes of interest.
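A straightforward way to evaluate a tree offline is to replay the measured data through it. The sketch below is our own illustration with hypothetical timing arrays (not our measured data); it computes the mean penalty over a set of (communicator size, message size) points:

```c
/* Illustrative sketch: mean performance penalty (in percent) of a decision
 * tree over a set of (communicator size, message size) points.  t_tree[i]
 * and t_best[i] hold the measured durations of the tree-selected and the
 * experimentally optimal method for point i; both arrays are hypothetical. */
#include <stddef.h>

static double mean_penalty_percent(const double *t_tree, const double *t_best,
                                   size_t npoints)
{
    double sum = 0.0;
    for (size_t i = 0; i < npoints; i++)
        sum += 100.0 * (t_tree[i] - t_best[i]) / t_best[i];
    return sum / (double)npoints;
}
```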
Table 3: Performance penalty of Broadcast decision trees corresponding to the data presented in
Figure 2 and Table 2.
Table 3 provides performance penalty statistics for the Broadcast decision trees we are consid-
ering. We can see that the minimum, mean, and median performance penalty values are rather low
(less than 4%, and as low as 0.66%), indicating that even the simplest tree we considered should
provide good run-time performance. Moreover, the simplest tree, “-m 40 -c 5”, had a lower per-
formance penalty than the “-m 6 -c 15” tree, which indicates that the percentage of misclassified training
cases does not translate directly into the performance penalty of the tree.
It is also interesting to consider the case with the maximum performance penalty. Most of the trees
would incur a 316.97% penalty at communicator size 25 and message size 480. For this data point, the exact
tree selects the Binary algorithm without segmentation (1.12 ms) while the C4.5 decision trees
would select the Binomial algorithm without segmentation (4.69 ms). Additionally, in the “-m 40
-c 5” tree, only six data points had a performance penalty above 50%.
Splitted Binary algorithm, for which we do not have an equivalent in the Reduce implementation, but
we expected that C4.5 would be able to handle these cases correctly.
The training data for this experiment contained three attributes (collective name, communicator
size, and message size) and the set of predetermined classes was the same as in the Broadcast-only
case.
Figure 3: Combined Broadcast and Reduce decision maps from the Grig cluster: (a) Reduce, Exact
(b) Reduce, ’-m 2 -c 25’ (c) Reduce, ’-m 20 -c 5’ (d) Broadcast, Exact (e) Broadcast, ’-m 2 -c 25’
(f) Broadcast, ’-m 20 -c 5’. Color has the same meaning as in Figure 2.
Figure 3 shows the decision maps generated from the combined broadcast and reduce decision
tree. The left most maps in both rows are the exact decisions for each of the collectives based on
experimental data. The remaining maps are generated by querying the combined decision tree.
Figures 3 (b) and (e) were generated using a “-m 2 -c 25” decision tree with 221 leaves and a 12.6%
predicted misclassification error, while (c) and (f) were generated by a “-m 20 -c 5” decision tree with
55 leaves and a 20.6% predicted misclassification error. Table 4 provides the detailed information
about the combined Broadcast and Reduce trees we considered. The mean performance penalty of
the combined tree for each of the collectives is less than 2.5% (Figure 4).
The structure of the combined Broadcast and Reduce decision trees reveals that the test on the
collective type occurs for the first time at the third level of the tree. Even then, the subtrees have
a somewhat similar structure, as we can see in Table 5 (a transcription of this branch into code is
sketched after Table 5). Considering message sizes in the range 1,408 B to 6,144 B, we can see that
both Broadcast and Reduce would use a variant of the Binomial algorithm. However, the switching
points are shifted, and for larger communicator sizes and smaller messages, Reduce would use a
non-segmented version of the Binomial algorithm.
Command Line    Before Pruning          After Pruning
                Size    Errors          Size    Errors          Predicted Error
-m 2 -c 25      239     137 (6.0%)      221     142 (6.2%)      12.6%
-m 6 -c 25      149     205 (9.0%)      115     220 (9.6%)      14.0%
-m 8 -c 25      127     225 (9.8%)      103     235 (10.3%)     14.4%
-m 20 -c 5      63      310 (13.6%)     55      316 (13.8%)     20.6%
-m 40 -c 25     33      392 (17.1%)     33      392 (17.1%)     19.6%
Table 4: Statistics for combined Broadcast and Reduce decision trees corresponding to the data
presented in Figure 3. Size refers to the number of leaf nodes in the tree. Errors are in terms of
misclassified training cases. The data set had 2286 training cases.
Figure 4: Mean performance penalty of the combined decision tree for each of the collectives. Tree
index denotes the decision tree: (1) ’-m 2 -c 25’, (2) ’-m 6 -c 25’, (3) ’-m 8 -c 25’, (4) ’-m 20 -c 5’, (5)
’-m 40 -c 25’.
...
message size > 1408 :
— message size <= 6144 :
— — collective = Broadcast:
— — — message size <= 2560 : Binomial 1K (81.0/35.0)
— — — message size > 2560 : Binary 1K (135.0/36.0)
— — collective = Reduce:
— — — message size > 1920 : Binomial 1K (132.0/16.0)
— — — message size <= 1920 :
— — — — communicator size <= 15 : Binomial 1K (40.0/16.0)
— — — — communicator size > 15 : Binomial 0K (48.0/24.0)
...
Table 5: Segment of combined Broadcast and Reduce decision tree ’-m 40 -c 25’.
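For illustration, the branch shown in Table 5 corresponds to the following nested tests once transcribed into code. This is a sketch covering only the message-size range shown above, with placeholder enum values for the Binomial and Binary methods with no segmentation and 1KB segments:

```c
/* Sketch of the combined-tree branch shown in Table 5, covering only the
 * message-size range 1408 B - 6144 B.  Enum values are placeholders. */
#include <stddef.h>

typedef enum { COLL_BCAST, COLL_REDUCE, COLL_ALLTOALL } collective_t;
typedef enum { BINOMIAL_0KB, BINOMIAL_1KB, BINARY_1KB } combined_method_t;

static combined_method_t combined_branch(collective_t coll, int comm_size,
                                          size_t msg_size)
{
    /* ... outer tests (msg_size > 1408 and msg_size <= 6144) omitted ... */
    if (coll == COLL_BCAST)
        return (msg_size <= 2560) ? BINOMIAL_1KB : BINARY_1KB;
    /* coll == COLL_REDUCE */
    if (msg_size > 1920)
        return BINOMIAL_1KB;
    return (comm_size <= 15) ? BINOMIAL_1KB : BINOMIAL_0KB;
}
```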
performance penalty. Similar results were obtained for Reduce and Alltoall.
Additionally, we combined the experimental data for Reduce and Broadcast to generate the
combined decision trees. These trees were also able to produce decision functions with less than
a 2.5% relative performance penalty for both collectives. This indicates that it is possible to use
information about one MPI collective operation to generate a reasonably good decision function for
another collective.
Our findings demonstrate that the C4.5 algorithm and decision trees are applicable to this
problem and should be more widely used in this domain. In the future, we plan to use C4.5
decision trees to reevaluate decision functions in FT-MPI and the tuned collective module of Open
MPI, as well as integrate C4.5 decision trees with our MPI collective testing and performance
measurement framework, OCC.
Acknowledgments
This work was supported by Los Alamos Computer Science Institute (LACSI), funded by Rice
University Subcontract #R7B127 under Regents of the University Subcontract #12783-001-05 49.
References
[1] R. Rabenseifner, “Automatic MPI counter profiling of all users: First results on a CRAY T3E
900-512,” in Proceedings of the Message Passing Interface Developer’s and User’s Conference,
pp. 77–85, 1999.
[2] J. Worringen, “Pipelining and overlapping for MPI collective operations,” in 28th Annual IEEE
Conference on Local Computer Networks, (Bonn/Königswinter, Germany), pp. 548–557, IEEE
Computer Society, October 2003.
[3] R. Rabenseifner and J. L. Träff, “More efficient reduction algorithms for non-power-of-two
number of processors in message-passing parallel systems,” in Proceedings of EuroPVM/MPI,
Lecture Notes in Computer Science, Springer-Verlag, 2004.
[4] E. W. Chan, M. F. Heimlich, A. Purkayastha, and R. M. van de Geijn, “On optimizing
collective communication,” in Proceedings of the IEEE International Conference on Cluster
Computing, pp. 145–155, 2004.
[9] J. R. Quinlan, C4.5: Programs for Machine Learning. San Mateo, California: Morgan Kauf-
mann Publishers, 1993.
[12] S. K. Murthy, “Automatic construction of decision trees from data: A multi-disciplinary sur-
vey,” Data Mining and Knowledge Discovery, vol. 2, no. 4, pp. 345–389, 1998.
[13] R. Vuduc, J. W. Demmel, and J. A. Bilmes, “Statistical Models for Empirical Search-Based
Performance Tuning,” International Journal of High Performance Computing Applications,
vol. 18, no. 1, pp. 65–94, 2004.
[14] V. N. Vapnik, Statistical Learning Theory. New York, NY: Wiley, 1998.