Job Runtime Prediction of HPC Cluster Based On PC-Transformer
https://doi.org/10.1007/s11227-023-05470-2
Fengxian Chen
Supercomputing Center, Lanzhou University, Tianshui Road, Lanzhou 730000, Gansu, China
[email protected]
Abstract
Job scheduling on high-performance clusters is a crucial task that affects the efficiency and performance of the system, and the accuracy of job runtime prediction is one of the key factors that influence the quality of job scheduling. In this paper, we propose a novel method for job runtime prediction based on a Transformer with plain connection and an attention mechanism. The proposed method utilizes job category information obtained by clustering the historical log datasets and selects six-dimensional features that are highly correlated with job runtime. We divide the datasets into multiple job sets according to the length of job runtime, and train and predict on each job set separately. We evaluate the proposed method on the HPC2N dataset and compare it with several existing methods. The results show that the proposed method achieves an average accuracy of 0.892 with a MAPE of 15.2%, and outperforms other methods in terms of prediction performance and training time. The proposed method can be applied to improve the efficiency and quality of job scheduling on high-performance clusters.
1 Introduction
A wide variety of job types run on supercomputing clusters. Moreover, different jobs require different resources and runtimes, which makes job scheduling difficult [1]. Job scheduling is generally implemented by the job scheduling system, which monitors the characteristics of job resource consumption and runtime and then determines the execution sequence of jobs on the cluster using default scheduling algorithms.
Commonly used scheduling policies include First-Come, First-Served (FCFS), Round Robin, Shortest Job First (SJF), etc. These policies execute jobs in order according to predetermined rules. For instance, FCFS executes queued requests and processes in order of their arrival [2]. When there are many time-consuming tasks on the cluster, FCFS has difficulty making use of the cluster's fragmented resources, which results in wasted resources. Backfilling is a scheduling optimization that allows jobs to run out of order in order to use available resources more effectively. When a large job that consumes many resources waits at the front of the queue, it must wait until there are sufficient resources to execute, and other small jobs queue up behind it to obtain computing resources even if some small pockets of resources are available. Backfilling moves small jobs forward in the queue to use free computing resources [3]. Because the FCFS algorithm is stable and fair, most existing job scheduling systems use FCFS with backfilling, and the work in this paper is also based on this method.
Accurate estimation of job runtime is the key to using cluster resources effectively without affecting the original scheduling order. However, runtimes estimated by users are often inaccurate and therefore difficult to use in actual scheduling. Prediction methods based on the historical data of previously executed jobs have been proposed and have shown better performance [4, 5]. As a result, a user's historical job information should be used to make accurate predictions of the runtime of cluster jobs [6].
Most job runtime predictions have been based on historical log data. This approach assumes that jobs with similar computing modes and scales on the same cluster have similar runtimes. However, it is challenging to achieve accurate job runtime prediction because historical data are often missing or noisy [7, 8]. One prediction strategy predicts job runtime using the same user's historical job information. For example, for a given user, the average runtime of the last two jobs is used as the predicted runtime of the next job [4, 9]. This algorithm improves the performance of a scheduling system with EASY backfilling, but it is difficult to deploy to large-scale high-performance clusters and cannot predict new users' jobs. Ramírez-Alcaraz et al. [10] measured the similarity of different users and jobs based on the characteristics of historical jobs, screened out the jobs most similar in characteristics to the job to be predicted, and took the average runtime of those jobs as the predicted value. This method does not depend on a single user's log and is robust against noisy data thanks to an optimized job similarity measure. However, due to deviations in the similarity calculation, the algorithm is difficult to apply to complex HPC cluster systems.
Rauschmayr [11] estimated job runtime through linear regression and maximum likelihood estimation. The results showed that, compared with the maximum likelihood estimation method, designing linear regression models for the various characteristic factors related to runtime improved the prediction results by 22%.
• Time-serialize the log data by their submission time. In a First-Come, First-Served scheduling system, the submission order of jobs is important information. Selecting an appropriate time step to sample the ordered job log is essential to achieving valid time encoding.
• Similar users usually submit similar jobs repeatedly on the HPC cluster. Mining users' operational characteristics facilitates feature engineering. In this study, a clustering model is adopted to divide users into different categories according to their jobs, and these categories serve as one of the features for time encoding.
• After the data are processed, vector embeddings are adopted to represent the time series, inspired by word2vec [20], to better automate the feature engineering process. In this study, a linear neural network with a periodic activation function effectively captures the characteristic information of the time series.
• We propose a simplified version of the Transformer model, PC-Transformer, which uses several layers of multi-head attention in the encoder and a linear network as the decoder.
• Extensive experiments are performed on log data from several real supercomputing clusters to verify the PC-Transformer model. Sequential neural network models, including the Recurrent Neural Network (RNN), Long Short-Term Memory (LSTM) and PC-Transformer, are employed in the experiments. Experimental results reveal that the proposed method outperforms the other neural networks.
The job logs generated by an HPC cluster contain various features (e.g., user ID, number of CPUs used and job wait time). It is difficult to exploit all job features when training the model. First, using too many features makes the model overly complex, increases the training time and makes it prone to overfitting. Second, since some features contain considerable missing and noisy data, it is difficult for the model to converge when these features are used for training. Accordingly, the data features should be screened and cleaned before training the model.
Public datasets commonly used for high-performance cluster logs are adopted to verify the effectiveness of the proposed model. Five real job log sets are adopted in this study, including ANL-2009, HPC2N, KIT FH2, LLNL Thunder and SDSC.
2.1 Feature screening
As listed in Table 1, all datasets originate from HPC clusters with multiple users, and different datasets have different numbers of users and jobs. Owing to its large number of jobs and high data quality, HPC2N is one of the most widely used HPC job logs. Accordingly, HPC2N is employed as an example for partial feature visualization and model effect analysis in some experiments.
SWF data contain 18-dimensional (18D) features, comprising time-based features (e.g., job submission time, waiting time and running time), resource-based features (e.g., the number of CPUs occupied and memory size), and user-based features (e.g., user ID and user group). Job features are represented by actual values, and missing data are represented by -1. Although SWF data have been preliminarily cleaned and sorted, many missing values remain in some features, so further cleaning and processing are required.
The numbers of missing values per feature are shown in Fig. 1. The number of features with considerable missing data in some datasets is 9, and these features are excluded from the dataset due to lack of information. Among the remaining features, only one dimension is selected from features that contain duplicate information. For instance, the job number, which records the order of job submission, overlaps with the job submission time, so we only keep the submission time. Although runtime is the prediction target, it is an important feature of historical job logs, so we keep it when training the models. Lastly, 6D features are screened for model training and prediction. The feature names and some sample values are listed in Table 2.
The retained job features still comprise the time class, resource class and user class, i.e., feature filtering does not cause information loss. The filtered data features still have a few missing values, which we fill in using mean value interpolation [22].
Moreover, it is imperative to remove invalid data and outliers during data preprocessing. Some records in HPC job logs are generated by failed jobs that terminate prematurely because of incorrect job parameters or program errors, and they are removed as noise. In this study, jobs whose runtime is less than 600 s, as well as jobs whose actual runtime is less than 1% of the requested runtime, are also removed from the dataset as noise.
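As a concrete illustration of this cleaning step, the following sketch removes failed jobs and the two classes of noise records described above. It assumes the SWF log has been loaded into a pandas DataFrame; the column names (status, runtime, requested_time) and the treatment of the two runtime conditions as separate removal criteria are assumptions for illustration, not the paper's code.

```python
import pandas as pd

def clean_job_log(df: pd.DataFrame) -> pd.DataFrame:
    """Remove failed jobs and noise records from an SWF-style job log.

    Assumed columns: 'status' (1 = completed normally), 'runtime' and
    'requested_time' in seconds. Column names are illustrative.
    """
    # Drop jobs that terminated prematurely (bad parameters / program errors).
    df = df[df["status"] == 1]

    # Drop very short jobs (< 600 s), treated as noise in this study.
    df = df[df["runtime"] >= 600]

    # Drop jobs whose actual runtime is below 1% of the requested runtime.
    df = df[df["runtime"] >= 0.01 * df["requested_time"]]

    return df.reset_index(drop=True)
```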
2.2 Data normalization
The base units of the data features in the logs differ, leading to significant differences in their orders of magnitude. Data normalization is a valid method for eliminating this difference, and learning algorithms generally benefit from normalization of the dataset. Z-score, Min-Max and nonlinear normalization are frequently used methods. With HPC2N as an example, the time-class features are compared with the other classes in terms of order-of-magnitude distribution, specifically the number of CPUs and the runtime.
As depicted in Fig. 2, the horizontal axis represents the order of the intercepted data points, and the vertical axis is the amplitude of the features. In the original data, the runtime values are significantly larger than the other characteristics (e.g., the number of CPUs). After normalization, the values of the two features are compressed into similar intervals; in particular, the data processed by the z-score method are smoother, and the effect of feature scale is eliminated to a certain extent. Thus, z-score normalization is adopted to process the original data in the user clustering and prediction models.
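For reference, a minimal sketch of per-feature z-score standardization (zero mean, unit variance); the use of scikit-learn's StandardScaler and the example values are implementation choices, not something specified in the paper.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# X: job features as an (n_jobs, n_features) array, e.g. CPU count and runtime.
X = np.array([[64, 12000.0],
              [128, 340.0],
              [32, 86000.0]])

scaler = StandardScaler()            # z-score: (x - mean) / std, per feature
X_scaled = scaler.fit_transform(X)   # fit on training data, reuse the scaler on test data
```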
2.3 User clustering
After screening and standardization, the data can be used to mine information about users' similar behaviors, from which user categories can be derived. Recent research suggests that similar users usually submit similar jobs repeatedly on the HPC cluster [23, 24]. As shown in Table 1, the number of users in the datasets is relatively large. Using the user ID directly as a discrete feature in training may make the model difficult to converge because its values are scattered. Consequently, users are clustered based on the computing mode and scale of their jobs in the historical logs, and the resulting cluster serves as a user category feature that replaces the user ID. Since the computing modes of users in the same cluster are similar, this substitution does not lose information from the user's features. We compare the impact of using the user ID and the clustering results on the accuracy of the model in the experimental section.
K-Means is a commonly used clustering method characterized by a simple principle and fast computation, and it is efficient when clustering large amounts of data. To test the performance of the algorithm, the Silhouette Coefficient (SC) and Principal Component Analysis (PCA) serve as evaluation indices [25]. Before clustering, the main characteristics of users are compiled, including the number of jobs, average waiting time, average runtime, and average number of CPUs used by each user on the platform. These features are employed for user clustering. The K-Means algorithm requires the number of clusters K to be specified in advance. To compare the clustering performance of different K values, a range of K values is preset to train K-Means models, and SC is then adopted to evaluate the resulting models. The SC is calculated as follows:
$$S(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}} \tag{1}$$
$$SC = \frac{1}{N_{data}} \sum_{i=1}^{N_{data}} S(i) \tag{2}$$
where a(i) denotes the average distance between sample i and the other samples in its cluster; b(i) represents the average distance from sample i to the samples in other clusters; S(i) is the silhouette coefficient of sample i; N_data is the number of samples in the dataset; and SC expresses the overall silhouette coefficient of the entire dataset. An SC value closer to 1 suggests that the instances within each cluster are compact and the distance between clusters is large; otherwise, the overlap between clusters is large and the clustering effect is poor. Thus, the K value with the maximum SC in the pre-selected interval is selected as the final value. The preset K values range from 2 to 17 in this study. These K values are adopted to train the clustering model on each dataset, and the SC under different K values is calculated to assess the performance.
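A minimal sketch of this K selection procedure, assuming the per-user statistics (job count, average wait time, average runtime, average CPUs) have already been aggregated into a matrix; the scikit-learn calls and the column layout are implementation choices rather than the paper's own code.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

def cluster_users(user_features: np.ndarray, k_range=range(2, 18)):
    """Pick K by silhouette coefficient and return one category label per user.

    user_features: (n_users, 4) array of job count, mean wait time, mean runtime
    and mean CPU count per user (column layout is an assumption).
    """
    X = StandardScaler().fit_transform(user_features)      # z-score, as in Sect. 2.2
    best_k, best_sc = None, -1.0
    for k in k_range:                                       # preset K values 2..17
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        sc = silhouette_score(X, labels)                    # overall SC, Eq. (2)
        if sc > best_sc:
            best_k, best_sc = k, sc
    model = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit(X)
    return model.labels_, best_k, best_sc
```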
As depicted in Fig. 3, the overall SC tends to decrease as the number of clusters increases, and the K values with the best clustering effect range from 2 to 8. Specifically, the KIT dataset has fewer users than the other datasets and has its highest SC when clustered into two groups, while SDSC has the largest number of users and the largest number of categories at its highest SC among the datasets. Lastly, the selected K values for ANL, LLNL, HPC2N, KIT and SDSC are 8, 2, 5, 3 and 2, respectively, and these K values are retained for the actual clustering.
Different clusters are marked with different colors in Fig. 4. As depicted in this figure, there is a clear distance between users in different clusters after clustering, which means the clustering preliminarily achieves the division of user categories. The user category feature obtained by clustering already contains information at the user level. Accordingly, the user category is adopted instead of the user ID. For a new user, once five job records have been generated, the trained clustering model uses these records to predict the user's category.
From the feature description of the dataset, several features in the job log record the sequential information of job submission and execution, such as waiting time and submission time. This information plays an important role in runtime forecasting. In an actual job system, the status of new jobs is often related to the jobs running in the current cluster; for instance, the runtime of jobs on the current system directly affects the waiting time of new jobs. However, most studies do not consider these factors when analyzing job logs. Xiao et al. [26] filtered out timing features and only used user and job resource features when training their model. Chen et al. [17] used the job submission time as job feature information for training the model, but did not consider the association between jobs. We sample and encode the data to use the timing information between jobs in the dataset.
Jobs with large differences in runtime often have large differences in characteristics. If the model is trained on the data without distinguishing runtimes, it is difficult for the model to converge to the optimal point. Several studies have suggested that separating long and short jobs can bring a large improvement in prediction performance [16, 26]. Therefore, this study first determines the runtime intervals according to the data distribution characteristics of each dataset, and then samples by time step within each runtime interval.
As shown in Fig. 5, on the above datasets the number of short job samples with runtime between 0 and 3600 s is the largest, while there are fewer long job samples with runtime greater than 45,000 s. In this study, according to the runtime intervals and the number of samples in each interval, each dataset is divided into three types of job sets: long jobs, medium jobs and short jobs.
The division of the job sets is listed in Table 3. To make the job sets as uniform as possible within each dataset, 3600 s and 10,600 s serve as demarcation points for dividing job sets with different runtimes, and the number of samples in each interval differs.
3.2 Data sampling
Jobs in the logs are sorted by submission time; this ordering is useful time series information, and sampling in order preserves it. We sample the dataset sequentially with a sampling window of length T, sliding the window one step at a time. After sampling, the log data are packaged into data groups of size L; at the same time, we separate the datasets into the corresponding job sets for the different runtime intervals. The specific sampling method is shown in Fig. 6.
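A small sketch of this sliding-window sampling, under the assumption that each job set has already been cleaned, normalized and sorted by submission time; the group length L = 20 follows the experimental settings in Sect. 5.1, while the function and variable names are illustrative only.

```python
import numpy as np

def sliding_window_groups(jobs: np.ndarray, group_len: int = 20) -> np.ndarray:
    """Package a time-ordered job log into overlapping groups of length L.

    jobs: (n_jobs, n_features) array sorted by submission time.
    Returns an array of shape (n_jobs - L + 1, L, n_features); the window
    advances one job at a time, so the submission order is preserved.
    """
    n = len(jobs)
    groups = [jobs[i:i + group_len] for i in range(n - group_len + 1)]
    return np.stack(groups)
```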
4 Related methodologies
This section introduces neural networks related to sequences, discusses their advantages and disadvantages, and presents the PC-Transformer in detail.
When processing sequential data, a linear neural network cannot retain historical information during training, which results in the loss of context information; consequently, its performance on sequential data is poor. To solve this problem, the recurrent neural network (RNN) was proposed. The directed cyclic connections between the layers of an RNN give it a memory of time series data [27].
An RNN is a type of feedback neural network in which the hidden layer input contains both the input value at the current moment and the output value from the previous moment. The structure of an RNN is illustrated in Fig. 7: the left side of the arrow shows the network structure of the RNN, and the right side shows the same structure unrolled along the time axis. The neural network unit A reads the input x_t at the current moment and outputs a hidden state value h_t, which is passed to the unit at the next moment together with the next input.
An RNN can be considered as multiple copies of one neural network, where each module passes the information it has acquired to the next unit. This chain-like structure shows that the recurrent neural network is inherently sequence-related. In theory, an RNN can use previous information in the current task, so it is suitable for sequence data (e.g., speech, natural language, and stock sequences). In practice, however, the RNN is affected by the sequence length: in sequences with long-term dependencies, fatal problems such as vanishing gradients arise, so it has difficulty learning the complete information [28].
4.2 LSTM
To overcome the major limitations of the RNN, the LSTM adopts a gate structure to avoid the exploding and vanishing gradient problems, so it can learn long-term information in sequence data [29]. Compared with the single tanh layer in the RNN, the LSTM uses multiple nonlinear gates to control the output and state of the neurons.
The internal structure of an LSTM cell is presented in Fig. 8. It consists of a forget gate f_t, an input gate i_t and an output gate o_t, which are used to update the cell state c_t. f_t selects the parts to discard from the input x_t and the hidden state h_{t-1} of the previous moment. i_t determines the information stored in the cell state and comprises a sigmoid layer and a tanh layer. The final output of the cell is produced by o_t: the cell state is filtered through the sigmoid layer and then mapped between -1 and 1 by tanh, and the final output is obtained by multiplying these two results. The specific expressions are presented as follows [30].
$$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f) \tag{3}$$
4.3 Transformer
$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{model}}}\right) \tag{6}$$
$$PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{model}}}\right) \tag{7}$$
where PE represents the positional encoding value of dimension i at position t in the input vector; pos denotes the value of the current encoding position; and d_model represents the dimension of the input vector. The vector after positional encoding is used as the input of the encoder and decoder. The core module of the encoder and decoder is multi-head attention, a novel attention calculation algorithm that computes attention over the input multiple times through scaled dot products and dimensionality-reducing projections and then concatenates the results. The calculation formula is expressed as follows:
$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(head_1, head_2, \ldots, head_h)\,W^{O} \tag{8}$$
where Q, K and V are the query, key and value vectors respectively, which are used in the attention calculation; h is the number of heads, i.e., the number of attention calculations; and head_i is calculated by scaled dot-product attention:
calculated by scaling the dot product:
� �
T
Q Q, K
headi = Attention(QWi , KWiK , VWiV ) = softmax √ (9)
dk
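To make Eqs. (8)–(9) concrete, here is a minimal single-head scaled dot-product attention in PyTorch; multi-head attention applies it h times on projected Q, K and V and concatenates the results. This is a generic textbook sketch of the standard mechanism, not the paper's implementation.

```python
import math
import torch

def scaled_dot_product_attention(Q, K, V):
    """Eq. (9): softmax(Q K^T / sqrt(d_k)) V for one attention head."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)   # (batch, seq, seq)
    weights = torch.softmax(scores, dim=-1)              # attention weights
    return weights @ V                                    # (batch, seq, d_v)

# Multi-head attention projects Q, K, V with per-head matrices W_i^Q, W_i^K, W_i^V,
# applies the function above for each head, and concatenates the outputs (Eq. 8).
```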
In natural language processing, positional encoding provides the model with information about sentence structure and word interdependency. Likewise, the proposed model requires a representation of time when processing job logs. In this study, the input data are encoded as a time vector with positional information through an embedding layer. To capture different kinds of information in the time series data simultaneously, we divide the embedding layer into two parts: a linear layer and a linear layer with a periodic activation function. The final encoding result is formed by adding the outputs of the two layers.
The structure of PC-Transformer is presented in Fig. 10. The first module is the time embedding layer, which encodes the input data separately in the periodic and non-periodic parts and then combines them into one vector. The vector carrying the time series information serves as the input to the encoder, and the output of the encoder is mapped by a linear layer to obtain the final predicted value.
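The following PyTorch sketch illustrates this architecture under the hyperparameters reported in Sect. 5.1 (three encoder layers, eight heads, a linear decoder, sine as the periodic function). The embedding dimension, the exact form of the periodic branch, the last-position readout and all layer names are assumptions made for illustration; the paper does not publish its code.

```python
import torch
import torch.nn as nn

class TimeEmbedding(nn.Module):
    """Linear branch plus a sine-activated linear branch, summed (time embedding layer)."""
    def __init__(self, in_dim: int, d_model: int):
        super().__init__()
        self.linear = nn.Linear(in_dim, d_model)     # non-periodic part
        self.periodic = nn.Linear(in_dim, d_model)   # periodic part (sine activation)

    def forward(self, x):                             # x: (batch, L, in_dim)
        return self.linear(x) + torch.sin(self.periodic(x))

class PCTransformer(nn.Module):
    """Sketch: time embedding -> Transformer encoder -> linear decoder."""
    def __init__(self, in_dim=6, d_model=64, n_heads=8, n_layers=3, dropout=0.1):
        super().__init__()
        self.embed = TimeEmbedding(in_dim, d_model)   # d_model = 64 is an assumption
        layer = nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=4 * d_model,
                                           dropout=dropout, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.decoder = nn.Linear(d_model, 1)          # linear network as decoder

    def forward(self, x):                             # x: (batch, L, in_dim)
        h = self.encoder(self.embed(x))
        return self.decoder(h[:, -1, :])              # read out the last position (assumption)
```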
5 Experiment
5.1 Experimental settings
The framework of the experiment is illustrated in Fig. 11. The model is trained on the preprocessed data, and runtime is predicted on the test datasets. The model employs the RNN, LSTM and PC-Transformer neural network structures, respectively. For RNN and LSTM, the model comprises a 3-layer network, each layer containing 64 neurons, and a linear layer is adopted to adjust the dimension of the final predicted value. As for activation functions, the RNN uses ReLU to avoid gradient problems, and the LSTM uses sigmoid and tanh according to the characteristics of each gate structure. In PC-Transformer, the activation function is ReLU, the periodic function in the position encoding is sine, the number of encoder layers N is 3, and the number of heads h in multi-head attention is 8. The optimizer is Adam, the initial learning rate is 0.001, the batch size is 128, and the length of the data group L is 20. During training, dropout is adopted to prevent overfitting. The model is implemented in PyTorch, and the calculations are performed on a single Nvidia Tesla V100 graphics card.
5.2 Evaluation metrics
This study uses the Huber function as the loss function. Compared with the Mean Absolute Error (MAE) and Mean Square Error (MSE), the Huber loss combines their strengths, so it is more robust to outliers and avoids gradient explosion [33].
$$\mathrm{loss} = \frac{1}{n} \sum_{i=1}^{n} \begin{cases} 0.5\,(y_i - f(x_i))^2, & |y_i - f(x_i)| < 1 \\ |y_i - f(x_i)| - 0.5, & \text{otherwise} \end{cases} \tag{10}$$
where f(x_i) denotes the result predicted by the model for input x_i, y_i is the label of x_i, and n is the batch size.
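In PyTorch, Eq. (10) corresponds (for a threshold of 1) to the built-in Huber / smooth-L1 loss; the snippet below is a usage sketch with illustrative values, not code taken from the paper.

```python
import torch
import torch.nn as nn

criterion = nn.HuberLoss(delta=1.0)   # same form as Eq. (10); equivalent to SmoothL1Loss(beta=1.0)

pred = torch.tensor([3500.0, 120.0, 9000.0])     # predicted runtimes (illustrative values)
target = torch.tensor([3600.0, 100.0, 12000.0])  # actual runtimes
loss = criterion(pred, target)
```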
Mean Absolute Percentage Error (MAPE) and Average Predictive Accuracy (APA) are adopted to evaluate the efficiency and accuracy of the prediction model. The datasets have different runtime lengths, so it is difficult to use the MAE directly to measure the deviation of predictions on job sets of different lengths. MAPE better measures the deviation of the predicted value from the actual value; the smaller the value, the better the performance of the model. The number of samples in the test set is denoted as N_test, and the calculation formula is as follows:
$$\mathrm{MAPE} = \frac{1}{N_{test}} \sum_{i=1}^{N_{test}} \left| \frac{y_i - f(x_i)}{y_i} \right| \times 100\% \tag{11}$$
APA denotes the average of the prediction accuracy over all jobs in the test set, and the prediction accuracy of a single job is calculated as follows:
$$\mathrm{APA}_i = \begin{cases} \dfrac{f(x_i)}{y_i}, & f(x_i) \le y_i \\[6pt] \dfrac{y_i}{f(x_i)}, & f(x_i) > y_i \end{cases} \tag{12}$$
The value of APA is between 0 and 1; the closer the value is to 1, the closer the prediction is to the actual value.
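A compact sketch of both metrics as defined in Eqs. (11)–(12), written with NumPy; the function and variable names are illustrative.

```python
import numpy as np

def mape(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Mean absolute percentage error, Eq. (11)."""
    return float(np.mean(np.abs((y_true - y_pred) / y_true)) * 100.0)

def apa(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Average predictive accuracy, Eq. (12): per job, min(pred, true) / max(pred, true)."""
    per_job = np.where(y_pred <= y_true, y_pred / y_true, y_true / y_pred)
    return float(np.mean(per_job))
```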
5.3 Experiment results
In this section, the results of the different models on the data are presented, with the job sets divided as listed in Table 3. Each dataset is divided into training, validation and test sets at an 8:1:1 ratio, and the preset number of training iterations is 100. Early stopping is adopted to prevent overfitting during training [34]: if the loss on the validation set does not decrease over 5 consecutive training epochs and the parameter updates no longer yield an improvement, training is stopped and the last best parameters are kept. The optimal model is then employed to predict the runtime on the test datasets.
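Putting the pieces together, the following sketch shows a training loop with the settings from Sect. 5.1 (Adam, learning rate 0.001, Huber loss) and the patience-based early stopping described above. It assumes PyTorch DataLoaders named train_loader and val_loader and a model such as the PC-Transformer sketch earlier; it is illustrative rather than the paper's actual code.

```python
import copy
import torch

def train_with_early_stopping(model, train_loader, val_loader,
                              max_epochs=100, patience=5, lr=1e-3):
    """Adam + Huber loss with patience-based early stopping (illustrative sketch)."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = torch.nn.HuberLoss(delta=1.0)
    best_val, best_state, bad_epochs = float("inf"), None, 0

    for epoch in range(max_epochs):
        model.train()
        for x, y in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(x).squeeze(-1), y)
            loss.backward()
            optimizer.step()

        model.eval()
        with torch.no_grad():
            val_loss = sum(criterion(model(x).squeeze(-1), y).item()
                           for x, y in val_loader) / len(val_loader)

        if val_loss < best_val:                  # keep the best parameters so far
            best_val = val_loss
            best_state = copy.deepcopy(model.state_dict())
            bad_epochs = 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:           # stop after 5 epochs without improvement
                break

    model.load_state_dict(best_state)
    return model
```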
The overall results are presented in Figs. 12 and 13, where the performance of the different models on the datasets is reported and the values of the proposed model are labeled. Figure 12 presents the APA of the proposed PC-Transformer model and the other neural networks. Notably, PC-Transformer exhibits higher APA on most datasets compared with RNN and LSTM, especially on the ANL long jobs. PC-Transformer improves APA by 3.4% and 10.6% over RNN and LSTM, respectively. In contrast, RNN performs poorly on the datasets due to its simple, single structure. This result also reveals that the RNN struggles to capture long-term dependency information, whereas a long job sequence should be captured when modeling job logs. Multi-head attention and time embedding provide information about the relationship between the different jobs in one group. Multiple independent attention computations and the concatenation of their results enable PC-Transformer to learn more comprehensive data features, which makes it perform better than the other sequence models. MAPE is another performance metric and focuses on the margin of error. Similar to the APA results, PC-Transformer achieves the lowest MAPE values, suggesting that it achieves lower errors on the same data. Regarding the datasets, Table 1 shows that the data volume of HPC2N is larger than the others, and its performance is also better than the others. This result also suggests that, in a multi-user cluster, the more historical data there is, the better the trained model will perform. Moreover, short job sets are more predictable than long job sets.
5.3.2 Comparative analysis
As shown in Table 4, the simple historical data combination method has the worst accuracy. Although traditional machine learning methods can improve the prediction accuracy, there is still a gap compared with PC-Transformer on each job set. This experiment shows that the performance of the proposed model is improved compared with existing techniques.
5.3.3 Complexity analysis
The number of parameters refers to the total number of weights and biases in the model. FLOPs refers to the number of floating-point operations required by the model during execution, which reflects the computational resources and time required. We calculated these two metrics based on the network structure and input-output size of each model; the results are shown in Table 5. RNN has the lowest complexity but also the worst performance; LSTM has the most parameters, and PC-Transformer has the highest FLOPs.
Figure 14 compares the training time required for a single epoch; the time is related to the model and the dataset size. Notably, the long job set has more jobs than the other job sets, so the training time for an epoch on the long job set is the longest for each model. Regarding model size, RNN has fewer parameters than LSTM, so its training time is also shorter than that of LSTM. For PC-Transformer, although its number of parameters and FLOPs are large, its training time is the shortest, thanks to a structure that enables parallelization.
In Sect. 2.3, we use user clustering categories instead of user IDs. The results of experiments using these two types of user information separately are shown in Table 7; they show that models trained with the user clustering information perform better than those using the user ID directly.
Error bars help indicate estimated error or uncertainty and give a general sense of how precise a measurement is; we use the error bars shown in Fig. 15 to describe the prediction errors of the different models [36]. In the figure, 100 data samples are selected from the HPC2N short job set. The blue dots represent the actual runtime, and the light yellow line segments represent the deviation between the predicted value and the true value. Compared with RNN and LSTM, the PC-Transformer model shows obvious advantages in both the number of error points and the margin of error at a single point. Moreover, a small number of data points have large prediction errors for every model, which also means that it is difficult to predict the runtime of certain individual jobs running on the cluster using statistical or neural network methods. If these points are excluded, the runtime predicted by the PC-Transformer model can be combined with the user's job information to support job scheduling.
This section presented the experimental results of the different models on the datasets. The results show that the proposed PC-Transformer model achieves the best performance on most datasets in terms of accuracy and MAPE, especially on long job sets. RNN performs poorly due to its simple structure; PC-Transformer outperforms RNN and LSTM by 3.4% and 10.6% in accuracy. Compared with existing techniques such as MA, MLKF, SVR and DNN, PC-Transformer also gives the most accurate predictions. In addition, its training speed has an advantage over the other temporal neural networks, which is important in real-time scheduling. In summary, the proposed PC-Transformer model achieves the best runtime prediction performance on the collected job logs compared with the baseline models and existing techniques, and has great potential for job scheduling in HPC clusters.
Based on the model design and the experimental results, we can draw the following conclusions:
In this study, the K-Means algorithm is adopted to cluster users on the HPC platform, and the optimal number of clusters is determined by the silhouette coefficient score. Using user categories to represent user identity information not only retains the user features but also significantly reduces the amplitude range of the feature.
In the data sampling process, the data are separated according to runtime. The partition intervals are determined by analyzing the characteristics of the datasets so that each job set contains roughly the same number of samples. This division allows the models to capture the characteristics of similar jobs and reduces the interference of outlier data.
The two most popular sequential neural networks and the proposed model have been evaluated on each dataset. Experimental results demonstrate that sequential neural networks have better predictive performance than other machine learning methods; the proposed model achieves an accuracy of 0.892 on the HPC2N dataset with a MAPE of 15.2%. Furthermore, compared with the original time coding, the proposed time embedding method has obvious advantages in training time and prediction performance, which also suggests an embedding direction for time series analysis.
Comparing the error bars of each model on the test set shows that the error amplitude and the number of error points of the proposed model are smaller than those of the other models, suggesting that the runtime predicted by the PC-Transformer model can be applied in an actual scheduling environment. Despite these benefits, the proposed model has a limitation: on some outliers, although the model reduces the margin of error, the error remains.
At present, this work has only been tested on public datasets. In future research, we will focus on two aspects. On the one hand, the runtime predicted by the proposed model will be combined with the scheduling system to assist scheduling in a real high-performance environment. On the other hand, the predicted runtime will be combined with deep reinforcement learning to further explore efficient scheduling strategies.
Acknowledgements This work is supported by Supercomputing Center of Lanzhou University.
Author Contributions Fengxian Chen completed all the work of this paper.
Availability of data and materials The datasets used in this paper are all public datasets, which can be obtained openly.
Declarations
Conflict of interest The authors declare that they have no known competing financial interests or personal
relationships that could have appeared to influence the work reported in this paper.
References
1. Molka D, Hackenberg D, Schöne R, Minartz T, Nagel WE (2012) Flexible workload generation for HPC cluster efficiency benchmarking. Comput Sci Res Dev 27(4):235–243
2. Grosof I, Yang K, Scully Z, Harchol-Balter M (2021) Nudge: stochastically improving upon FCFS. SIGMETRICS Perform Eval Rev 49(1):11–12. https://doi.org/10.1145/3543516.3460102
3. Wong AKL, Goscinski AM (2007) Evaluating the easy-backfill job scheduling of static workloads on clusters. In: 2007 IEEE International Conference on Cluster Computing, pp 64–73. https://doi.org/10.1109/CLUSTR.2007.4629218
4. Tsafrir D, Etsion Y, Feitelson DG (2007) Backfilling using system-generated predictions rather than user runtime estimates. IEEE Trans Parallel Distrib Syst 18(6):789–803. https://doi.org/10.1109/TPDS.2007.70606
5. Fan Y, Rich P, Allcock WE, Papka ME, Lan Z (2017) Trade-off between prediction accuracy and underestimation rate in job runtime estimates. In: 2017 IEEE International Conference on Cluster Computing (CLUSTER), pp 530–540. https://doi.org/10.1109/CLUSTER.2017.11
6. Gaussier E, Glesser D, Reis V, Trystram D (2015) Improving backfilling by using machine learning to predict running times. In: SC '15: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp 1–10. https://doi.org/10.1145/2807591.2807646
7. Škrjanc I, Iglesias JA, Sanchis A, Leite D, Lughofer E, Gomide F (2019) Evolving fuzzy and neuro-fuzzy approaches in clustering, regression, identification, and classification: a survey. Inf Sci 490:344–368. https://doi.org/10.1016/j.ins.2019.03.060
8. Gama J, Aguilar-Ruiz J, Klinkenberg R (2008) Knowledge discovery from data streams. Intell Data Anal 12(3):251–252
9. Tsafrir D, Etsion Y, Feitelson DG (2005) Modeling user runtime estimates. In: Workshop on Job Scheduling Strategies for Parallel Processing. Springer, pp 1–35. https://doi.org/10.1007/11605300_1
10. Ramírez-Alcaraz JM, Tchernykh A, Yahyapour R, Schwiegelshohn U, Quezada-Pina A, González-García JL, Hirales-Carbajal A (2011) Job allocation strategies with user run time estimates for online scheduling in hierarchical grids. J Grid Comput 9(1):95–116. https://doi.org/10.1007/s10723-011-9179-y
11. Rauschmayr N (2015) A history-based estimation for LHCb job requirements. J Phys Conf Ser 664:062050. https://doi.org/10.1088/1742-6596/664/6/062050
12. Park J-W, Kim E (2017) Runtime prediction of parallel applications with workload-aware clustering. J Supercomput 73(11):4635–4651. https://doi.org/10.1007/s11227-017-2038-2
13. Cunha RLF, Rodrigues ER, Tizzei LP, Netto MAS (2017) Job placement advisor based on turnaround predictions for HPC hybrid clouds. Futur Gener Comput Syst 67:35–46. https://doi.org/10.1016/j.future.2016.08.010
14. McKenna R, Herbein S, Moody A, Gamblin T, Taufer M (2016) Machine learning predictions of runtime and IO traffic on high-end clusters. In: 2016 IEEE International Conference on Cluster Computing (CLUSTER), pp 255–258. https://doi.org/10.1109/CLUSTER.2016.58
15. Xiujuan S, Xinxiu L, Fasheng L et al (2018) Research on combination prediction model of traffic flow based on entropy weight method. J Shandong Univ Sci Technol (Nat Sci) 37(4):111–117
16. Wang Q, Li J, Wang S, Wu G (2019) A novel two-step job runtime estimation method based on input parameters in HPC system. In: 2019 IEEE 4th International Conference on Cloud Computing and Big Data Analysis (ICCCBDA), pp 311–316. https://doi.org/10.1109/ICCCBDA.2019.8725643
17. Chen X, Zhang H, Bai H, Yang C, Zhao X, Li B (2020) Runtime prediction of high-performance computing jobs based on ensemble learning. In: HP3C 2020. Association for Computing Machinery, pp 56–62. https://doi.org/10.1145/3407947.3407968
18. Naghshnejad M, Singhal M (2020) A hybrid scheduling platform: a runtime prediction reliability aware scheduling platform to improve HPC scheduling performance. J Supercomput 76(1):122–149. https://doi.org/10.1007/s11227-019-03004-3
19. Cheon H, Ryu J, Ryou J, Park CY, Han Y-S (2021) ARED: automata-based runtime estimation for distributed systems using deep learning. Clust Comput. https://doi.org/10.1007/s10586-021-03272-w
20. Grohe M (2020) word2vec, node2vec, graph2vec, x2vec: towards a theory of vector embeddings of structured data. In: Proceedings of the 39th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, PODS'20. Association for Computing Machinery, pp 1–16. https://doi.org/10.1145/3375395.3387641
21. Feitelson DG, Tsafrir D, Krakov D (2014) Experience with using the parallel workloads archive. J Parallel Distrib Comput 74(10):2967–2982. https://doi.org/10.1016/j.jpdc.2014.06.013
22. Jiang L, Ma M, Wang G (2021) Application of interpolation method in data processing of dangerous cargo transportation in the Yangtze river. In: International Conference on Smart Transportation and City Engineering 2021, vol 12050, pp 445–452. SPIE. https://doi.org/10.1117/12.2613731
23. Carvalho M, Brasileiro F (2012) A user-based model of grid computing workloads. In: 2012 ACM/IEEE 13th International Conference on Grid Computing, pp 40–48. https://doi.org/10.1109/Grid.2012.13
24. Iosup A, Epema D (2011) Grid computing workloads. IEEE Internet Comput 15(2):19–26. https://doi.org/10.1109/MIC.2010.130
25. Roul RK (2018) An effective approach for semantic-based clustering and topic-based ranking of web documents. Int J Data Sci Anal 5(4):269–284
26. Xiao YH et al (2019) GA-Sim: a job running time prediction algorithm based on categorization and instance learning. Comput Eng Sci 41(6):6. https://doi.org/10.3969/j.issn.1007-130X.2019.06.005
27. Zhang X-M, Han Q-L, Ge X, Ding D (2018) An overview of recent developments in Lyapunov–Krasovskii functionals and stability criteria for recurrent neural networks with time-varying delays. Neurocomputing 313:392–401. https://doi.org/10.1016/j.neucom.2018.06.038
28. Bengio Y, Simard P, Frasconi P (1994) Learning long-term dependencies with gradient descent is difficult. IEEE Trans Neural Netw 5(2):157–166. https://doi.org/10.1109/72.279181
29. Balaji E, Brindha D, Elumalai VK, Vikrama R (2021) Automatic and non-invasive Parkinson's disease diagnosis and severity rating using LSTM network. Appl Soft Comput 108:107463. https://doi.org/10.1016/j.asoc.2021.107463
30. Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9(8):1735–1780. https://doi.org/10.1162/neco.1997.9.8.1735
31. Niu Z, Zhong G, Yu H (2021) A review on the attention mechanism of deep learning. Neurocomputing 452:48–62. https://doi.org/10.1016/j.neucom.2021.03.091
32. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I (2017) Attention is all you need. In: Advances in Neural Information Processing Systems, vol 30. Curran Associates, Inc. https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf
33. Esmaeili A, Marvasti F (2019) A novel approach to quantized matrix completion using Huber loss measure. IEEE Signal Process Lett 26(2):337–341. https://doi.org/10.1109/LSP.2019.2891134
34. Li M, Soltanolkotabi M, Oymak S (2020) Gradient descent with early stopping is provably robust to label noise for overparameterized neural networks. In: Chiappa S, Calandra R (eds) Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics. Proceedings of Machine Learning Research, vol 108, pp 4313–4324. PMLR. https://proceedings.mlr.press/v108/li20j.html
35. Naghshnejad M, Singhal M (2018) Adaptive online runtime prediction to improve HPC applications latency in cloud. In: 2018 IEEE 11th International Conference on Cloud Computing (CLOUD). IEEE, pp 762–769
36. Zhang S, Lin G (2018) Robust data-driven discovery of governing physical laws with error bars. Proc R Soc A Math Phys Eng Sci 474(2217):20180305. https://doi.org/10.1098/rspa.2018.0305
Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps
and institutional affiliations.
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under
a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted
manuscript version of this article is solely governed by the terms of such publishing agreement and
applicable law.