Job Runtime Prediction of HPC Cluster Based On PC-Transformer
https://doi.org/10.1007/s11227-023-05470-2
Fengxian Chen
Supercomputing Center, Lanzhou University, Tianshui Road, Lanzhou 730000, Gansu, China
[email protected]
Abstract
Job scheduling on high-performance clusters is a crucial task that affects the efficiency and performance of the system, and the accuracy of job runtime prediction is one of the key factors that influence the quality of job scheduling. In this paper, we propose a novel method for job runtime prediction based on a Transformer with plain connection and an attention mechanism. The proposed method utilizes job category information obtained by clustering the historical log datasets and selects six-dimensional features that are highly correlated with job runtime. We divide the datasets into multiple job sets according to the length of job runtime, and train and predict on each job set separately. We evaluate the proposed method on the HPC2N dataset and compare it with several existing methods. The results show that the proposed method achieves an average accuracy of 0.892 with a MAPE of 15.2%, and outperforms other methods in terms of prediction performance and training time. The proposed method can be applied to improve the efficiency and quality of job scheduling on high-performance clusters.
1 Introduction
A wide variety of job types run on supercomputing clusters. Moreover, different jobs require different resources and runtimes, which makes job scheduling difficult [1]. Job scheduling is generally implemented by the job scheduling system, which monitors the characteristics of job resource consumption and runtime and then determines the execution sequence of jobs on the cluster using default scheduling algorithms.
Commonly used scheduling policies include First-Come, First-Served (FCFS), Round Robin, Shortest Job First (SJF), etc. These policies execute jobs in order according to predetermined rules. For instance, FCFS executes queued requests and processes in order of their arrival [2]. When there are many time-consuming tasks on the cluster, FCFS has difficulty making use of the cluster's fragmented resources, which results in wasted resources. Backfilling is a scheduling optimization that allows jobs to run out of order in order to use available resources more effectively. When a large job that consumes many resources waits at the front of the queue, it must wait until there are sufficient resources to execute, and other small jobs queue up behind it to obtain computing resources even if some small pockets of resources are available. Backfilling moves small jobs forward in the queue to use free computing resources [3]. Because the FCFS algorithm is stable and fair, most existing job scheduling systems use FCFS with backfilling, and the work in this paper is also based on this method.
Accurate estimation of job runtime is the key to using cluster resources effectively without affecting the original scheduling order. However, runtimes estimated by users are often inaccurate and therefore difficult to use in actual scheduling. Prediction methods based on the historical data of previously executed jobs have been proposed and have shown better performance [4, 5]. As a result, a user's historical job information should be used to make accurate predictions of the runtime of cluster jobs [6].
Most job runtime predictions have been based on historical log data. This approach assumes that jobs with similar computing modes and scales on the same cluster have similar runtimes. However, it is challenging to achieve accurate job runtime prediction because historical data are often missing or noisy [7, 8]. One prediction strategy predicts job runtime using the same user's historical job information. For example, for a given user, the average runtime of the last two jobs is used as the predicted runtime of the next job [4, 9]. This algorithm improves the performance of a scheduling system with EASY backfilling, but it is difficult to deploy to large-scale high-performance clusters and cannot predict new users' jobs. Ramírez-Alcaraz et al. [10] measured the similarity of different users and jobs based on the characteristics of historical jobs, screened out the jobs most similar in characteristics to the job to be predicted, and took the average runtime of those jobs as the predicted value. This method does not depend on a single user's log and is robust against noisy data thanks to an optimized job similarity measure. However, due to deviations in the similarity calculation, the algorithm is difficult to apply to complex HPC cluster systems.
Rauschmayr [11] estimated job runtime through linear regression and maximum likelihood estimation. The results showed that, compared with the maximum likelihood estimation method, designing linear regression models for the various characteristic factors related to runtime improved the prediction results by 22%.
• Time-serialize the log data by their submission time. In a First-Come, First-Served scheduling system, the submission order of jobs is important information. Selecting an appropriate time step to sample the ordered job log is essential to achieving valid time encoding.
• Similar users usually submit similar jobs repeatedly on the HPC cluster. Mining users' operational characteristics facilitates feature engineering. In this study, a clustering model is adopted to divide users into different categories according to their jobs, and these categories serve as one of the features for time encoding.
• After the data are processed, vector embeddings are adopted to represent the time series, inspired by word2vec [20], to better automate the feature engineering process. In this study, a linear neural network with a periodic activation function effectively captures the characteristic information of the time series.
• We propose a simplified version of the Transformer model, PC-Transformer, which uses several layers of multi-head attention in the encoder and a linear network as the decoder.
• Extensive experiments are performed on log data from several real supercomputing clusters to verify the PC-Transformer model. Sequential neural network models, including the Recurrent Neural Network (RNN), Long Short-Term Memory (LSTM) and PC-Transformer, are employed in the experiments. Experimental results reveal that the proposed method outperforms the other neural networks.
The job logs generated by an HPC cluster contain various features (e.g., user ID, number of CPUs used and job wait time). It is difficult to exploit all job features when training the model. First, using too many features makes the model overly complex, increases the training time and makes it prone to overfitting. Second, since some features contain considerable missing and noisy data, it is difficult for the model to converge when these features are used for training. Accordingly, the data features should be screened and cleaned before training the model.
Public datasets commonly used for high-performance cluster logs are adopted to verify the effectiveness of the proposed model. Five real job log sets are adopted in this study, including ANL-2009, HPC2N, KIT FH2, LLNL Thunder and SDSC.
2.1 Feature screening
As listed in Table 1, all datasets originate from HPC clusters with multiple users, and different datasets have different numbers of users and jobs. Owing to its large number of jobs and high data quality, HPC2N is one of the most widely used HPC job logs. Accordingly, HPC2N is employed as an example for partial feature visualization and model effect analysis in some experiments.
SWF data contain 18-dimensional (18D) features, comprising time-based features (e.g., job submission time, waiting time and running time), resource-based features (e.g., the number of CPUs occupied and memory size), and user-based features (e.g., user ID and user group). Job features are represented by actual values, and missing data are represented by -1. Although SWF data have been preliminarily cleaned and sorted, many missing values remain in some features, so further cleaning and processing are required.
The numbers of missing values per feature are shown in Fig. 1. The number of features with considerable missing data in some datasets is 9, and these features are excluded from the dataset due to lack of information. Among the remaining features, only one dimension is selected from features that contain duplicate information. For instance, the job number, which records the order of job submission, overlaps with the job submission time, so we only keep the submission time. Although runtime is the prediction target, it is an important feature of historical job logs, so we keep it when training the models. Lastly, 6D features are screened for model training and prediction. The feature names and some sample values are listed in Table 2.
The retained job features still comprise the time class, resource class and user class, i.e., feature filtering does not cause information loss. The filtered data features still have a few missing values, which we fill in using mean value interpolation [22].
Moreover, it is imperative to remove invalid data and outliers during data preprocessing. Some records in HPC job logs are generated by failed jobs that terminate prematurely because of incorrect job parameters or program errors, and they are removed as noise. In this study, jobs whose runtime is less than 600 s, as well as jobs whose actual runtime is less than 1% of the requested runtime, are also removed from the dataset as noise.
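As a concrete illustration of this cleaning step, the following sketch removes failed jobs and the two classes of noise records described above. It assumes the SWF log has been loaded into a pandas DataFrame; the column names (status, runtime, requested_time) and the treatment of the two runtime conditions as separate removal criteria are assumptions for illustration, not the paper's code.

```python
import pandas as pd

def clean_job_log(df: pd.DataFrame) -> pd.DataFrame:
    """Remove failed jobs and noise records from an SWF-style job log.

    Assumed columns: 'status' (1 = completed normally), 'runtime' and
    'requested_time' in seconds. Column names are illustrative.
    """
    # Drop jobs that terminated prematurely (bad parameters / program errors).
    df = df[df["status"] == 1]

    # Drop very short jobs (< 600 s), treated as noise in this study.
    df = df[df["runtime"] >= 600]

    # Drop jobs whose actual runtime is below 1% of the requested runtime.
    df = df[df["runtime"] >= 0.01 * df["requested_time"]]

    return df.reset_index(drop=True)
```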
2.2 Data normalization
The base units of the data features in the logs differ, leading to significant differences in their orders of magnitude. Data normalization is a valid method for eliminating this difference, and learning algorithms generally benefit from normalization of the dataset. Z-score, Min-Max and nonlinear normalization are frequently used methods. With HPC2N as an example, the time-class features are compared with the other classes in terms of order-of-magnitude distribution, specifically the number of CPUs and the runtime.
As depicted in Fig. 2, the horizontal axis represents the order of the intercepted data points, and the vertical axis is the amplitude of the features. In the original data, the runtime values are significantly larger than the other characteristics (e.g., the number of CPUs). After normalization, the values of the two features are compressed into similar intervals; in particular, the data processed by the z-score method are smoother, and the effect of feature scale is eliminated to a certain extent. Thus, z-score normalization is adopted to process the original data in the user clustering and prediction models.
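For reference, a minimal sketch of per-feature z-score standardization (zero mean, unit variance); the use of scikit-learn's StandardScaler and the example values are implementation choices, not something specified in the paper.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# X: job features as an (n_jobs, n_features) array, e.g. CPU count and runtime.
X = np.array([[64, 12000.0],
              [128, 340.0],
              [32, 86000.0]])

scaler = StandardScaler()            # z-score: (x - mean) / std, per feature
X_scaled = scaler.fit_transform(X)   # fit on training data, reuse the scaler on test data
```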
2.3 User clustering
After screening and standardization, the data can be used to mine information about users' similar behaviors, from which user categories can be derived. Recent research suggests that similar users usually submit similar jobs repeatedly on the HPC cluster [23, 24]. As shown in Table 1, the number of users in the datasets is relatively large. Using the user ID directly as a discrete feature in training may make the model difficult to converge because its values are scattered. Consequently, users are clustered based on the computing mode and scale of their jobs in the historical logs, and the resulting cluster serves as a user category feature that replaces the user ID. Since the computing modes of users in the same cluster are similar, this substitution does not lose information from the user's features. We compare the impact of using the user ID and the clustering results on the accuracy of the model in the experimental section.
K-Means is a commonly used clustering method characterized by a simple principle and fast computation, and it is efficient when clustering large amounts of data. To test the performance of the algorithm, the Silhouette Coefficient (SC) and Principal Component Analysis (PCA) serve as evaluation indices [25]. Before clustering, the main characteristics of users are compiled, including the number of jobs, average waiting time, average runtime, and average number of CPUs used by each user on the platform. These features are employed for user clustering. The K-Means algorithm requires the number of clusters K to be specified in advance. To compare the clustering performance of different K values, a range of K values is preset to train K-Means models, and SC is then adopted to evaluate the resulting models. The SC is calculated as follows:
$$S(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}} \tag{1}$$
$$SC = \frac{1}{N_{data}} \sum_{i=1}^{N_{data}} S(i) \tag{2}$$
where a(i) denotes the average distance between sample i and the other samples in its cluster; b(i) represents the average distance from sample i to the samples in other clusters; S(i) is the silhouette coefficient of sample i; N_data is the number of samples in the dataset; and SC expresses the overall silhouette coefficient of the entire dataset. An SC value closer to 1 suggests that the instances within each cluster are compact and the distance between clusters is large; otherwise, the overlap between clusters is large and the clustering effect is poor. Thus, the K value with the maximum SC in the pre-selected interval is selected as the final value. The preset K values range from 2 to 17 in this study. These K values are adopted to train the clustering model on each dataset, and the SC under different K values is calculated to assess the performance.
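A minimal sketch of this K selection procedure, assuming the per-user statistics (job count, average wait time, average runtime, average CPUs) have already been aggregated into a matrix; the scikit-learn calls and the column layout are implementation choices rather than the paper's own code.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

def cluster_users(user_features: np.ndarray, k_range=range(2, 18)):
    """Pick K by silhouette coefficient and return one category label per user.

    user_features: (n_users, 4) array of job count, mean wait time, mean runtime
    and mean CPU count per user (column layout is an assumption).
    """
    X = StandardScaler().fit_transform(user_features)      # z-score, as in Sect. 2.2
    best_k, best_sc = None, -1.0
    for k in k_range:                                       # preset K values 2..17
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        sc = silhouette_score(X, labels)                    # overall SC, Eq. (2)
        if sc > best_sc:
            best_k, best_sc = k, sc
    model = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit(X)
    return model.labels_, best_k, best_sc
```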
As depicted in Fig. 3, the overall SC tends to decrease as the number of clusters increases, and the K values with the best clustering effect range from 2 to 8. Specifically, the KIT dataset has fewer users than the other datasets and has its highest SC when clustered into two groups, while SDSC has the largest number of users and the largest number of categories at its highest SC among the datasets. Lastly, the selected K values for ANL, LLNL, HPC2N, KIT and SDSC are 8, 2, 5, 3 and 2, respectively, and these K values are retained for the actual clustering.
Different clusters are marked with different colors in Fig. 4. As depicted in this figure, there is a clear distance between users in different clusters after clustering, which means the clustering preliminarily achieves the division of user categories. The user category feature obtained by clustering already contains information at the user level. Accordingly, the user category is adopted instead of the user ID. For a new user, once five job records have been generated, the trained clustering model uses these records to predict the user's category.
From the feature description of the dataset, several features in the job log record the sequential information of job submission and execution, such as waiting time and submission time. This information plays an important role in runtime forecasting. In an actual job system, the status of new jobs is often related to the jobs running in the current cluster; for instance, the runtime of jobs on the current system directly affects the waiting time of new jobs. However, most studies do not consider these factors when analyzing job logs. Xiao et al. [26] filtered out timing features and only used user and job resource features when training their model. Chen et al. [17] used the job submission time as job feature information for training the model, but did not consider the association between jobs. We sample and encode the data to use the timing information between jobs in the dataset.
Jobs with large differences in runtime often have large differences in characteristics. If the model is trained on the data without distinguishing runtimes, it is difficult for the model to converge to the optimal point. Several studies have suggested that separating long and short jobs can bring a large improvement in prediction performance [16, 26]. Therefore, this study first determines the runtime intervals according to the data distribution characteristics of each dataset, and then samples by time step within each runtime interval.
As shown in Fig. 5, on the above datasets the number of short job samples with runtime between 0 and 3600 s is the largest, while there are fewer long job samples with runtime greater than 45,000 s. In this study, according to the runtime intervals and the number of samples in each interval, each dataset is divided into three types of job sets: long jobs, medium jobs and short jobs.
The division of the job sets is listed in Table 3. To make the job sets as uniform as possible within each dataset, 3600 s and 10,600 s serve as demarcation points for dividing job sets with different runtimes, and the number of samples in each interval differs.
3.2 Data sampling
Jobs in the logs are sorted by submission time; this ordering is useful time series information, and sampling in order preserves it. We sample the dataset sequentially with a sampling window of length T, sliding the window one step at a time. After sampling, the log data are packaged into data groups of size L; at the same time, we separate the datasets into the corresponding job sets for the different runtime intervals. The specific sampling method is shown in Fig. 6.
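A small sketch of this sliding-window sampling, under the assumption that each job set has already been cleaned, normalized and sorted by submission time; the group length L = 20 follows the experimental settings in Sect. 5.1, while the function and variable names are illustrative only.

```python
import numpy as np

def sliding_window_groups(jobs: np.ndarray, group_len: int = 20) -> np.ndarray:
    """Package a time-ordered job log into overlapping groups of length L.

    jobs: (n_jobs, n_features) array sorted by submission time.
    Returns an array of shape (n_jobs - L + 1, L, n_features); the window
    advances one job at a time, so the submission order is preserved.
    """
    n = len(jobs)
    groups = [jobs[i:i + group_len] for i in range(n - group_len + 1)]
    return np.stack(groups)
```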
4 Related methodologies
This section introduces neural networks related to sequences, discusses their advantages and disadvantages, and presents the PC-Transformer in detail.
When processing sequential data, a linear neural network cannot retain historical information during training, which results in the loss of context information; consequently, its performance on sequential data is poor. To solve this problem, the recurrent neural network (RNN) was proposed. The directed cyclic connections between the layers of an RNN give it a memory of time series data [27].
An RNN is a type of feedback neural network in which the hidden layer input contains both the input value at the current moment and the output value from the previous moment. The structure of an RNN is illustrated in Fig. 7: the left side of the arrow shows the network structure of the RNN, and the right side shows the same structure unrolled along the time axis. The neural network unit A reads the input x_t at the current moment and outputs a hidden state value h_t, which is passed to the unit at the next moment together with the next input.
An RNN can be considered as multiple copies of one neural network, where each module passes the information it has acquired to the next unit. This chain-like structure shows that the recurrent neural network is inherently sequence-related. In theory, an RNN can use previous information in the current task, so it is suitable for sequence data (e.g., speech, natural language, and stock sequences). In practice, however, the RNN is affected by the sequence length: in sequences with long-term dependencies, fatal problems such as vanishing gradients arise, so it has difficulty learning the complete information [28].
4.2 LSTM
To overcome the major limitations of the RNN, the LSTM adopts a gate structure to avoid the exploding and vanishing gradient problems, so it can learn long-term information in sequence data [29]. Compared with the single tanh layer in the RNN, the LSTM uses multiple nonlinear gates to control the output and state of the neurons.
The internal structure of an LSTM cell is presented in Fig. 8. It consists of a forget gate f_t, an input gate i_t and an output gate o_t, which are used to update the cell state c_t. f_t selects the parts to discard from the input x_t and the hidden state h_{t-1} of the previous moment. i_t determines the information stored in the cell state and comprises a sigmoid layer and a tanh layer. The final output of the cell is produced by o_t: the cell state is filtered through the sigmoid layer and then mapped between -1 and 1 by tanh, and the final output is obtained by multiplying these two results. The specific expressions are presented as follows [30].
$$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f) \tag{3}$$
4.3 Transformer
$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{model}}}\right) \tag{6}$$
$$PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{model}}}\right) \tag{7}$$
where PE represents the positional encoding value of dimension i at position t in the input vector; pos denotes the value of the current encoding position; and d_model represents the dimension of the input vector. The vector after positional encoding is used as the input of the encoder and decoder. The core module of the encoder and decoder is multi-head attention, a novel attention calculation algorithm that computes attention over the input multiple times through scaled dot products and dimensionality-reducing projections and then concatenates the results. The calculation formula is expressed as follows:
$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(head_1, head_2, \ldots, head_h)\,W^{O} \tag{8}$$
where Q, K and V are the query, key and value vectors respectively, which are used in the attention calculation; h is the number of heads, i.e., the number of attention calculations; and head_i is calculated by scaled dot-product attention:
calculated by scaling the dot product:
� �
T
Q Q, K
headi = Attention(QWi , KWiK , VWiV ) = softmax √ (9)
dk
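To make Eqs. (8)–(9) concrete, here is a minimal single-head scaled dot-product attention in PyTorch; multi-head attention applies it h times on projected Q, K and V and concatenates the results. This is a generic textbook sketch of the standard mechanism, not the paper's implementation.

```python
import math
import torch

def scaled_dot_product_attention(Q, K, V):
    """Eq. (9): softmax(Q K^T / sqrt(d_k)) V for one attention head."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)   # (batch, seq, seq)
    weights = torch.softmax(scores, dim=-1)              # attention weights
    return weights @ V                                    # (batch, seq, d_v)

# Multi-head attention projects Q, K, V with per-head matrices W_i^Q, W_i^K, W_i^V,
# applies the function above for each head, and concatenates the outputs (Eq. 8).
```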
In natural language processing, positional encoding provides the model with information about sentence structure and word interdependency. Likewise, the proposed model requires a representation of time when processing job logs. In this study, the input data are encoded as a time vector with positional information through an embedding layer. To capture different kinds of information in the time series data simultaneously, we divide the embedding layer into two parts: a linear layer and a linear layer with a periodic activation function. The final encoding result is formed by adding the outputs of the two layers.
The structure of PC-Transformer is presented in Fig. 10. The first module is the time embedding layer, which encodes the input data separately in the periodic and non-periodic parts and then combines them into one vector. The vector carrying the time series information serves as the input to the encoder, and the output of the encoder is mapped by a linear layer to obtain the final predicted value.
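The following PyTorch sketch illustrates this architecture under the hyperparameters reported in Sect. 5.1 (three encoder layers, eight heads, a linear decoder, sine as the periodic function). The embedding dimension, the exact form of the periodic branch, the last-position readout and all layer names are assumptions made for illustration; the paper does not publish its code.

```python
import torch
import torch.nn as nn

class TimeEmbedding(nn.Module):
    """Linear branch plus a sine-activated linear branch, summed (time embedding layer)."""
    def __init__(self, in_dim: int, d_model: int):
        super().__init__()
        self.linear = nn.Linear(in_dim, d_model)     # non-periodic part
        self.periodic = nn.Linear(in_dim, d_model)   # periodic part (sine activation)

    def forward(self, x):                             # x: (batch, L, in_dim)
        return self.linear(x) + torch.sin(self.periodic(x))

class PCTransformer(nn.Module):
    """Sketch: time embedding -> Transformer encoder -> linear decoder."""
    def __init__(self, in_dim=6, d_model=64, n_heads=8, n_layers=3, dropout=0.1):
        super().__init__()
        self.embed = TimeEmbedding(in_dim, d_model)   # d_model = 64 is an assumption
        layer = nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=4 * d_model,
                                           dropout=dropout, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.decoder = nn.Linear(d_model, 1)          # linear network as decoder

    def forward(self, x):                             # x: (batch, L, in_dim)
        h = self.encoder(self.embed(x))
        return self.decoder(h[:, -1, :])              # read out the last position (assumption)
```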
5 Experiment
5.1 Experimental settings
The framework of the experiment is illustrated in Fig. 11. The model is trained on the preprocessed data, and runtime is predicted on the test datasets. The model employs the RNN, LSTM and PC-Transformer neural network structures, respectively. For RNN and LSTM, the model comprises a 3-layer network, each layer containing 64 neurons, and a linear layer is adopted to adjust the dimension of the final predicted value. As for activation functions, the RNN uses ReLU to avoid gradient problems, and the LSTM uses sigmoid and tanh according to the characteristics of each gate structure. In PC-Transformer, the activation function is ReLU, the periodic function in the position encoding is sine, the number of encoder layers N is 3, and the number of heads h in multi-head attention is 8. The optimizer is Adam, the initial learning rate is 0.001, the batch size is 128, and the length of the data group L is 20. During training, dropout is adopted to prevent overfitting. The model is implemented in PyTorch, and the calculations are performed on a single Nvidia Tesla V100 graphics card.
5.2 Evaluation metrics
This study uses the Huber function as the loss function. Compared with the Mean Absolute Error (MAE) and Mean Square Error (MSE), the Huber loss combines their strengths, so it is more robust to outliers and avoids gradient explosion [33].
$$\mathrm{loss} = \frac{1}{n} \sum_{i=1}^{n} \begin{cases} 0.5\,(y_i - f(x_i))^2, & |y_i - f(x_i)| < 1 \\ |y_i - f(x_i)| - 0.5, & \text{otherwise} \end{cases} \tag{10}$$
where f(x_i) denotes the result predicted by the model for input x_i, y_i is the label of x_i, and n is the batch size.
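In PyTorch, Eq. (10) corresponds (for a threshold of 1) to the built-in Huber / smooth-L1 loss; the snippet below is a usage sketch with illustrative values, not code taken from the paper.

```python
import torch
import torch.nn as nn

criterion = nn.HuberLoss(delta=1.0)   # same form as Eq. (10); equivalent to SmoothL1Loss(beta=1.0)

pred = torch.tensor([3500.0, 120.0, 9000.0])     # predicted runtimes (illustrative values)
target = torch.tensor([3600.0, 100.0, 12000.0])  # actual runtimes
loss = criterion(pred, target)
```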
Mean Absolute Percentage Error (MAPE) and Average Predictive Accuracy (APA) are adopted to evaluate the efficiency and accuracy of the prediction model. The datasets have different runtime lengths, so it is difficult to use the MAE directly to measure the deviation of predictions on job sets of different lengths. MAPE better measures the deviation of the predicted value from the actual value; the smaller the value, the better the performance of the model. The number of samples in the test set is denoted as N_test, and the calculation formula is as follows:
$$\mathrm{MAPE} = \frac{1}{N_{test}} \sum_{i=1}^{N_{test}} \left| \frac{y_i - f(x_i)}{y_i} \right| \times 100\% \tag{11}$$
APA denotes the average of the prediction accuracy over all jobs in the test set, and the prediction accuracy of a single job is calculated as follows:
$$\mathrm{APA}_i = \begin{cases} \dfrac{f(x_i)}{y_i}, & f(x_i) \le y_i \\[6pt] \dfrac{y_i}{f(x_i)}, & f(x_i) > y_i \end{cases} \tag{12}$$
The value of APA is between 0 and 1; the closer the value is to 1, the closer the prediction is to the actual value.
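A compact sketch of both metrics as defined in Eqs. (11)–(12), written with NumPy; the function and variable names are illustrative.

```python
import numpy as np

def mape(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Mean absolute percentage error, Eq. (11)."""
    return float(np.mean(np.abs((y_true - y_pred) / y_true)) * 100.0)

def apa(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Average predictive accuracy, Eq. (12): per job, min(pred, true) / max(pred, true)."""
    per_job = np.where(y_pred <= y_true, y_pred / y_true, y_true / y_pred)
    return float(np.mean(per_job))
```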
5.3 Experiment results
In this section, the results of the different models on the data are presented, with the job sets divided as listed in Table 3. Each dataset is divided into training, validation and test sets at an 8:1:1 ratio, and the preset number of training iterations is 100. Early stopping is adopted to prevent overfitting during training [34]: if the loss on the validation set does not decrease over 5 consecutive training epochs and the parameter updates no longer yield an improvement, training is stopped and the last best parameters are kept. The optimal model is then employed to predict the runtime on the test datasets.
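Putting the pieces together, the following sketch shows a training loop with the settings from Sect. 5.1 (Adam, learning rate 0.001, Huber loss) and the patience-based early stopping described above. It assumes PyTorch DataLoaders named train_loader and val_loader and a model such as the PC-Transformer sketch earlier; it is illustrative rather than the paper's actual code.

```python
import copy
import torch

def train_with_early_stopping(model, train_loader, val_loader,
                              max_epochs=100, patience=5, lr=1e-3):
    """Adam + Huber loss with patience-based early stopping (illustrative sketch)."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = torch.nn.HuberLoss(delta=1.0)
    best_val, best_state, bad_epochs = float("inf"), None, 0

    for epoch in range(max_epochs):
        model.train()
        for x, y in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(x).squeeze(-1), y)
            loss.backward()
            optimizer.step()

        model.eval()
        with torch.no_grad():
            val_loss = sum(criterion(model(x).squeeze(-1), y).item()
                           for x, y in val_loader) / len(val_loader)

        if val_loss < best_val:                  # keep the best parameters so far
            best_val = val_loss
            best_state = copy.deepcopy(model.state_dict())
            bad_epochs = 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:           # stop after 5 epochs without improvement
                break

    model.load_state_dict(best_state)
    return model
```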
The overall results are presented in Figs. 12 and 13, where the performance of the different models on the datasets is reported and the values of the proposed model are labeled. Figure 12 presents the APA of the proposed PC-Transformer model and the other neural networks. Notably, PC-Transformer exhibits higher APA on most datasets compared with RNN and LSTM, especially on the ANL long jobs. PC-Transformer improves APA by 3.4% and 10.6% over RNN and LSTM, respectively. In contrast, RNN performs poorly on the datasets due to its simple, single structure. This result also reveals that the RNN struggles to capture long-term dependency information, whereas a long job sequence should be captured when modeling job logs. Multi-head attention and time embedding provide information about the relationship between the different jobs in one group. Multiple independent attention computations and the concatenation of their results enable PC-Transformer to learn more comprehensive data features, which makes it perform better than the other sequence models. MAPE is another performance metric and focuses on the margin of error. Similar to the APA results, PC-Transformer achieves the lowest MAPE values, suggesting that it achieves lower errors on the same data. Regarding the datasets, Table 1 shows that the data volume of HPC2N is larger than the others, and its performance is also better than the others. This result also suggests that, in a multi-user cluster, the more historical data there is, the better the trained model will perform. Moreover, short job sets are more predictable than long job sets.
5.3.2 Comparative analysis
As shown in Table 4, the simple historical data combination method has the worst accuracy. Although traditional machine learning methods can improve the prediction accuracy, there is still a gap compared with PC-Transformer on each job set. This experiment shows that the performance of the proposed model is improved compared with existing techniques.
5.3.3 Complexity analysis
The number of parameters refers to the total number of weights and biases in the model. FLOPs refers to the number of floating-point operations required by the model during execution, which reflects the computational resources and time required. We calculated these two metrics based on the network structure and input-output size of each model; the results are shown in Table 5. RNN has the lowest complexity but also the worst performance; LSTM has the most parameters, and PC-Transformer has the highest FLOPs.
Figure 14 compares the training time required for a single epoch; the time is related to the model and the dataset size. Notably, the long job set has more jobs than the other job sets, so the training time for an epoch on the long job set is the longest for each model. Regarding model size, RNN has fewer parameters than LSTM, so its training time is also shorter than that of LSTM. For PC-Transformer, although its number of parameters and FLOPs are large, its training time is the shortest, thanks to a structure that enables parallelization.
In Sect. 2.3, we use user clustering categories instead of user IDs. The results of experiments using these two types of user information separately are shown in Table 7; they show that models trained with the user clustering information perform better than those using the user ID directly.
Error bars help indicate estimated error or uncertainty and give a general sense of how precise a measurement is; we use the error bars shown in Fig. 15 to describe the prediction errors of the different models [36]. In the figure, 100 data samples are selected from the HPC2N short job set. The blue dots represent the actual runtime, and the light yellow line segments represent the deviation between the predicted value and the true value. Compared with RNN and LSTM, the PC-Transformer model shows obvious advantages in both the number of error points and the margin of error at a single point. Moreover, a small number of data points have large prediction errors for every model, which also means that it is difficult to predict the runtime of certain individual jobs running on the cluster using statistical or neural network methods. If these points are excluded, the runtime predicted by the PC-Transformer model can be combined with the user's job information to support job scheduling.
This section presented the experimental results of the different models on the datasets. The results show that the proposed PC-Transformer model achieves the best performance on most datasets in terms of accuracy and MAPE, especially on long job sets. RNN performs poorly due to its simple structure; PC-Transformer outperforms RNN and LSTM by 3.4% and 10.6% in accuracy. Compared with existing techniques such as MA, MLKF, SVR and DNN, PC-Transformer also gives the most accurate predictions. In addition, its training speed has an advantage over the other temporal neural networks, which is important in real-time scheduling. In summary, the proposed PC-Transformer model achieves the best runtime prediction performance on the collected job logs compared with the baseline models and existing techniques, and has great potential for job scheduling in HPC clusters.
Based on the model design and the experimental results, we can draw the following conclusions:
In this study, the K-Means algorithm is adopted to cluster users on the HPC platform, and the optimal number of clusters is determined by the silhouette coefficient score. Using user categories to represent user identity information not only retains the user features but also significantly reduces the amplitude range of the feature.
In the data sampling process, the data are separated according to runtime. The partition intervals are determined by analyzing the characteristics of the datasets so that each job set contains roughly the same number of samples. This division allows the models to capture the characteristics of similar jobs and reduces the interference of outlier data.
The two most popular sequential neural networks and the proposed model have been evaluated on each dataset. Experimental results demonstrate that sequential neural networks have better predictive performance than other machine learning methods; the proposed model achieves an accuracy of 0.892 on the HPC2N dataset with a MAPE of 15.2%. Furthermore, compared with the original time coding, the proposed time embedding method has obvious advantages in training time and prediction performance, which also suggests an embedding direction for time series analysis.
Comparing the error bars of each model on the test set shows that the error amplitude and the number of error points of the proposed model are smaller than those of the other models, suggesting that the runtime predicted by the PC-Transformer model can be applied in an actual scheduling environment. Despite these benefits, the proposed model has a limitation: on some outliers, although the model reduces the margin of error, the error remains.
At present, this work has only been tested on public datasets. In future research, we will focus on two aspects. On the one hand, the runtime predicted by the proposed model will be combined with the scheduling system to assist scheduling in a real high-performance environment. On the other hand, the predicted runtime will be combined with deep reinforcement learning to further explore efficient scheduling strategies.
Acknowledgements This work is supported by Supercomputing Center of Lanzhou University.
Author Contributions Fengxian Chen completed all the work of this paper.
Availability of data and materials The datasets used in this paper are all public datasets, which can be obtained openly.
Declarations
Conflict of interest The authors declare that they have no known competing financial interests or personal
relationships that could have appeared to influence the work reported in this paper.
References
1. Molka D, Hackenberg D, Schöne R, Minartz T, Nagel WE (2012) Flexible workload generation for HPC cluster efficiency benchmarking. Comput Sci Res Dev 27(4):235–243
2. Grosof I, Yang K, Scully Z, Harchol-Balter M (2021) Nudge: stochastically improving upon FCFS. SIGMETRICS Perform Eval Rev 49(1):11–12. https://doi.org/10.1145/3543516.3460102
3. Wong AKL, Goscinski AM (2007) Evaluating the easy-backfill job scheduling of static workloads on clusters. In: 2007 IEEE International Conference on Cluster Computing, pp 64–73. https://doi.org/10.1109/CLUSTR.2007.4629218
4. Tsafrir D, Etsion Y, Feitelson DG (2007) Backfilling using system-generated predictions rather than user runtime estimates. IEEE Trans Parallel Distrib Syst 18(6):789–803. https://doi.org/10.1109/TPDS.2007.70606
5. Fan Y, Rich P, Allcock WE, Papka ME, Lan Z (2017) Trade-off between prediction accuracy and underestimation rate in job runtime estimates. In: 2017 IEEE International Conference on Cluster Computing (CLUSTER), pp 530–540. https://doi.org/10.1109/CLUSTER.2017.11
6. Gaussier E, Glesser D, Reis V, Trystram D (2015) Improving backfilling by using machine learning to predict running times. In: SC '15: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp 1–10. https://doi.org/10.1145/2807591.2807646
7. Škrjanc I, Iglesias JA, Sanchis A, Leite D, Lughofer E, Gomide F (2019) Evolving fuzzy and neuro-fuzzy approaches in clustering, regression, identification, and classification: a survey. Inf Sci 490:344–368. https://doi.org/10.1016/j.ins.2019.03.060
8. Gama J, Aguilar-Ruiz J, Klinkenberg R (2008) Knowledge discovery from data streams. Intell Data Anal 12(3):251–252
9. Tsafrir D, Etsion Y, Feitelson DG (2005) Modeling user runtime estimates. In: Workshop on Job Scheduling Strategies for Parallel Processing. Springer, pp 1–35. https://doi.org/10.1007/11605300_1
10. Ramírez-Alcaraz JM, Tchernykh A, Yahyapour R, Schwiegelshohn U, Quezada-Pina A, González-García JL, Hirales-Carbajal A (2011) Job allocation strategies with user run time estimates for online scheduling in hierarchical grids. J Grid Comput 9(1):95–116. https://doi.org/10.1007/s10723-011-9179-y
11. Rauschmayr N (2015) A history-based estimation for LHCb job requirements. J Phys Conf Ser 664:062050. https://doi.org/10.1088/1742-6596/664/6/062050
12. Park J-W, Kim E (2017) Runtime prediction of parallel applications with workload-aware clustering. J Supercomput 73(11):4635–4651. https://doi.org/10.1007/s11227-017-2038-2
13. Cunha RLF, Rodrigues ER, Tizzei LP, Netto MAS (2017) Job placement advisor based on turnaround predictions for HPC hybrid clouds. Futur Gener Comput Syst 67:35–46. https://doi.org/10.1016/j.future.2016.08.010
14. McKenna R, Herbein S, Moody A, Gamblin T, Taufer M (2016) Machine learning predictions of runtime and IO traffic on high-end clusters. In: 2016 IEEE International Conference on Cluster Computing (CLUSTER), pp 255–258. https://doi.org/10.1109/CLUSTER.2016.58
15. Xiujuan S, Xinxiu L, Fasheng L et al (2018) Research on combination prediction model of traffic flow based on entropy weight method. J Shandong Univ Sci Technol (Nat Sci) 37(4):111–117
16. Wang Q, Li J, Wang S, Wu G (2019) A novel two-step job runtime estimation method based on input parameters in HPC system. In: 2019 IEEE 4th International Conference on Cloud Computing and Big Data Analysis (ICCCBDA), pp 311–316. https://doi.org/10.1109/ICCCBDA.2019.8725643
17. Chen X, Zhang H, Bai H, Yang C, Zhao X, Li B (2020) Runtime prediction of high-performance computing jobs based on ensemble learning. In: HP3C 2020. Association for Computing Machinery, pp 56–62. https://doi.org/10.1145/3407947.3407968
18. Naghshnejad M, Singhal M (2020) A hybrid scheduling platform: a runtime prediction reliability aware scheduling platform to improve HPC scheduling performance. J Supercomput 76(1):122–149. https://doi.org/10.1007/s11227-019-03004-3
19. Cheon H, Ryu J, Ryou J, Park CY, Han Y-S (2021) ARED: automata-based runtime estimation for distributed systems using deep learning. Clust Comput. https://doi.org/10.1007/s10586-021-03272-w
20. Grohe M (2020) word2vec, node2vec, graph2vec, x2vec: towards a theory of vector embeddings of structured data. In: Proceedings of the 39th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, PODS'20. Association for Computing Machinery, pp 1–16. https://doi.org/10.1145/3375395.3387641
21. Feitelson DG, Tsafrir D, Krakov D (2014) Experience with using the parallel workloads archive. J Parallel Distrib Comput 74(10):2967–2982. https://doi.org/10.1016/j.jpdc.2014.06.013
22. Jiang L, Ma M, Wang G (2021) Application of interpolation method in data processing of dangerous cargo transportation in the Yangtze river. In: International Conference on Smart Transportation and City Engineering 2021, vol 12050, pp 445–452. SPIE. https://doi.org/10.1117/12.2613731
23. Carvalho M, Brasileiro F (2012) A user-based model of grid computing workloads. In: 2012 ACM/IEEE 13th International Conference on Grid Computing, pp 40–48. https://doi.org/10.1109/Grid.2012.13
24. Iosup A, Epema D (2011) Grid computing workloads. IEEE Internet Comput 15(2):19–26. https://doi.org/10.1109/MIC.2010.130
25. Roul RK (2018) An effective approach for semantic-based clustering and topic-based ranking of web documents. Int J Data Sci Anal 5(4):269–284
26. Xiao YH et al (2019) GA-Sim: a job running time prediction algorithm based on categorization and instance learning. Comput Eng Sci 41(6):6. https://doi.org/10.3969/j.issn.1007-130X.2019.06.005
27. Zhang X-M, Han Q-L, Ge X, Ding D (2018) An overview of recent developments in Lyapunov–Krasovskii functionals and stability criteria for recurrent neural networks with time-varying delays. Neurocomputing 313:392–401. https://doi.org/10.1016/j.neucom.2018.06.038
28. Bengio Y, Simard P, Frasconi P (1994) Learning long-term dependencies with gradient descent is difficult. IEEE Trans Neural Netw 5(2):157–166. https://doi.org/10.1109/72.279181
29. Balaji E, Brindha D, Elumalai VK, Vikrama R (2021) Automatic and non-invasive Parkinson's disease diagnosis and severity rating using LSTM network. Appl Soft Comput 108:107463. https://doi.org/10.1016/j.asoc.2021.107463
30. Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9(8):1735–1780. https://doi.org/10.1162/neco.1997.9.8.1735
31. Niu Z, Zhong G, Yu H (2021) A review on the attention mechanism of deep learning. Neurocomputing 452:48–62. https://doi.org/10.1016/j.neucom.2021.03.091
32. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I (2017) Attention is all you need. In: Advances in Neural Information Processing Systems, vol 30. Curran Associates, Inc. https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf
33. Esmaeili A, Marvasti F (2019) A novel approach to quantized matrix completion using Huber loss measure. IEEE Signal Process Lett 26(2):337–341. https://doi.org/10.1109/LSP.2019.2891134
34. Li M, Soltanolkotabi M, Oymak S (2020) Gradient descent with early stopping is provably robust to label noise for overparameterized neural networks. In: Chiappa S, Calandra R (eds) Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics. Proceedings of Machine Learning Research, vol 108, pp 4313–4324. PMLR. https://proceedings.mlr.press/v108/li20j.html
35. Naghshnejad M, Singhal M (2018) Adaptive online runtime prediction to improve HPC applications latency in cloud. In: 2018 IEEE 11th International Conference on Cloud Computing (CLOUD). IEEE, pp 762–769
36. Zhang S, Lin G (2018) Robust data-driven discovery of governing physical laws with error bars. Proc R Soc A Math Phys Eng Sci 474(2217):20180305. https://doi.org/10.1098/rspa.2018.0305
Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps
and institutional affiliations.
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under
a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted
manuscript version of this article is solely governed by the terms of such publishing agreement and
applicable law.