
Exploring Transfer Learning to Reduce Training Overhead of HPC Data in Machine Learning


Tong Liu¹, Shakeel Alibhai¹, Jinzhen Wang², Qing Liu², Xubin He¹, and Chentao Wu³
¹Department of Computer and Information Sciences, Temple University
²Department of Electrical and Computer Engineering, New Jersey Institute of Technology
³Department of Computer Science and Engineering, Shanghai Jiao Tong University

Abstract—Nowadays, scientific simulations on high-performance computing (HPC) systems can generate large amounts of data (on the scale of terabytes or petabytes) per run. When this huge amount of HPC data is processed by machine learning applications, the training overhead will be significant. Typically, the training process for a neural network can take several hours to complete, if not longer. When machine learning is applied to HPC scientific data, the training can take several days or even weeks. Transfer learning, an optimization usually used to save training time or achieve better performance, has potential for reducing this large training overhead. In this paper, we apply transfer learning to a machine learning HPC application. We find that transfer learning can reduce training time without, in most cases, significantly increasing the error. This indicates transfer learning can be very useful for working with HPC datasets in machine learning applications.

Index Terms—HPC data, machine learning, transfer learning

I. INTRODUCTION

The training process is an essential step in machine learning. In order to make a system learn specific features or patterns, a large training dataset is usually required. In addition, in order to achieve high prediction accuracies when testing, training generally runs through multiple epochs over the dataset. Having a large training dataset and many training iterations, however, can put significant pressure on a system's computational resources. As a result, it is not uncommon for the training process of a machine learning task to take hours or even days [1]–[3], if not longer. To resolve this issue, there has been much research into various methods of accelerating the training process [4], [5]. While these approaches may reduce the training time by certain amounts, they are not enough when machine learning is applied to HPC scientific applications because of the huge amount of data generated through the simulation.

HPC scientific simulations are extremely important, as they allow domain scientists to verify theories and investigate new phenomena on an unprecedented scale [21]. Output from such simulations, however, can frequently reach into the terabytes or even petabytes per run, as HPC simulations record large amounts of data in multiple dimensions (such as spatial and temporal) [6], [7]. When the generated data is analyzed, the huge amount of data processing is likely to be a challenge, especially in machine learning applications. In the HPC data compression application used in this work, for example, the training process for a binary file of only several megabytes can take hours.

Transfer learning, a machine learning technique, takes a model developed for one task and reuses it as the starting point of a model for a second task [20]. Such an approach aims to reduce the training overhead by reducing the amount of data required for effective training. Transfer learning is commonly used in deep learning applications, wherein pre-trained models are used as the starting point of computer vision [8], [9] and natural language processing [10], [11] tasks, as these problems all require vast computational resources and lengthy times to develop neural network models [20]. These are the same challenges experienced in HPC scientific machine learning applications, so transfer learning may also be useful for reducing the lengthy training times on HPC systems. In this paper, we apply transfer learning to a compression autoencoder for lossy data compression as a case study and evaluate its performance with real-world HPC scientific datasets. We find that implementing transfer learning can dramatically reduce the training time without, in most cases, negatively impacting performance. For datasets in the same domain, it is very common to see roughly equal, if not improved, performance when using transfer learning, while for cross-domain datasets, we see such results in at least some of the cases.

II. BACKGROUND

A. Transfer Learning

Traditionally, machine learning algorithms [12]–[14] make predictions on future data by utilizing models trained on previously collected data. In order to have an accurate predictor, it is important for a model to be well-trained. However, there exist several issues relating to the training step, including not having enough labelled data for training and having very high training overhead. In order to alleviate these problems, semi-supervised classification [15]–[18] has been proposed. This technique makes use of a large amount of unlabelled data and a small amount of labelled data for training. Nevertheless, most of these works assume that the distributions of the labelled and unlabelled data are the same. In contrast, transfer learning allows the domains, tasks, and distributions used in training and testing to be different, which makes some otherwise impossible training processes practical.
In general, transfer learning aims to extract knowledge from one or more source tasks and apply it to a target task, which is usually in a different domain than the source tasks. As a formal definition [19], assume that we have a source domain Ds, a learning task Ts, a target domain Dt, and a learning task Tt. By implementing transfer learning, we aim to improve the learning of the target predictive function ft(·) in Dt using the knowledge in Ds and Ts, where Ds ≠ Dt or Ts ≠ Tt.

In the above definition, the condition Ds ≠ Dt implies that either the training data domains or the data distributions of the source and target should be different. Similarly, Ts ≠ Tt implies that either the testing datasets or the prediction functions of the source and target should differ. It should be noted that when Ds = Dt and Ts = Tt (meaning the domains and learning tasks of the source and target are the same), the learning problem becomes a traditional machine learning problem.

B. HPC lossy data compression with a compression autoencoder

In this paper, we use an HPC lossy compressor application as a case study for transfer learning with HPC data. This application uses a compression autoencoder (CAE) to conduct dimension reduction and thus achieve data compression.

The autoencoder compression prototype is shown in Figure 1. The autoencoder has seven layers, with three layers L1e, L2e, and L3e in the encoder part, three layers L1d, L2d, and L3d in the decoder part, and one code layer Z. Before the input file I (which contains the scientific data) enters the neural network, it is divided into several batches bi ∈ I. Each batch bi contains a part of the original scientific data and goes into the input layer L1e for training. When a batch enters the input layer, each element of the original scientific data becomes one neuron in the layer. In the encoder part, the number of neurons decreases as the layers go from the input layer L1e to the code layer Z. Each layer in the encoder part has a weight matrix and bias vector that are used to accomplish dimension reduction. After three layers of compression, the information stored in L1e is represented in layer Z with significantly fewer neurons. The decoder is similar to the encoder in that each layer has a weight matrix and bias vector. The information in the Z layer goes through the three decoder layers and is then written to the output file. If the whole autoencoder is regarded as a compressor, then layer L1e is the original file, layer Z is the compressed file, layer L3d is the decompressed file, and the encoder and decoder represent the compression and decompression, respectively.

Fig. 1: The seven-layer compression autoencoder prototype. The input file is divided into batches that feed the encoder layers L1_e, L2_e, and L3_e, the code layer Z, and the decoder layers L1_d, L2_d, and L3_d, which produce the output file.

III. OBSERVATION AND EXPERIMENT SETUP

A. Training time observation

In our preliminary experiments, we find that a file with around 400,000 double-precision floating-point numbers (around 3.2 MB) needs approximately two hours for training with 25,000 epochs. In order to see how the training time increases as the size of the training dataset increases, we perform experiments on a series of datasets.¹ The datasets we use for testing come from open-source HPC scientific data benchmarks, such as NEK5000 [38], MCERI [39], GROMACS [40], and FLASH [41]. These experiments are conducted with 2,500 epochs on a system running Ubuntu 16.04.5 LTS with an Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz and 32 GB of memory (clock: 2133 MHz). The results are presented in Table I.

TABLE I: Training time for various datasets with 2,500 epochs.

Dataset                        Number of Elements   Approximate Training Time
bump training.bin              25,000               94.185 seconds
yf17 pres training.bin         48,552               170.118 seconds
eddy training.bin              135,000              467.876 seconds
md training.bin                300,000              1057.617 seconds
Randomly-Generated Dataset A   500,000              1703.223 seconds
Randomly-Generated Dataset B   1,000,000            3479.543 seconds
Randomly-Generated Dataset C   5,000,000            17175.178 seconds
Randomly-Generated Dataset D   10,000,000           34684.248 seconds

As seen from the results, the training time increases roughly linearly with the size of the training dataset (roughly 3.4-3.8 ms per element for 2,500 epochs across the listed datasets). In addition, as noted above, these results are for 2,500 epochs. To achieve greater accuracy, a higher number of training epochs may be needed. During the experiments, we find that 25,000 is a suitable number of epochs, and using ten times the number of epochs increases the training time by approximately ten times. The problem of extensive training time is further intensified by the fact that HPC datasets can be terabytes or even petabytes in size. Training times for those datasets would be much higher than even the training time for Randomly-Generated Dataset D, which has 10 million data points and yet is only around 77 MB.

¹ With the exception of the randomly-generated datasets, all datasets used in this paper are provided at the link in Section VI.
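To make the CAE structure of Section II-B and Figure 1 concrete, the sketch below builds a seven-layer autoencoder in TensorFlow/Keras. The framework, layer widths, activations, optimizer, and helper names (ELEMENTS_PER_BATCH, build_cae, to_batches) are illustrative assumptions, not details taken from the paper's implementation.

import numpy as np
import tensorflow as tf

ELEMENTS_PER_BATCH = 1024   # neurons in the input layer L1_e (assumed width)

def build_cae():
    # Encoder: L1_e (input) -> L2_e -> L3_e -> code layer Z
    inp = tf.keras.Input(shape=(ELEMENTS_PER_BATCH,))           # L1_e
    x = tf.keras.layers.Dense(512, activation="tanh")(inp)      # L2_e
    x = tf.keras.layers.Dense(256, activation="tanh")(x)        # L3_e
    z = tf.keras.layers.Dense(64, activation="tanh")(x)         # Z (compressed form)
    # Decoder: L1_d -> L2_d -> L3_d (reconstruction of the batch)
    x = tf.keras.layers.Dense(256, activation="tanh")(z)        # L1_d
    x = tf.keras.layers.Dense(512, activation="tanh")(x)        # L2_d
    out = tf.keras.layers.Dense(ELEMENTS_PER_BATCH)(x)          # L3_d
    model = tf.keras.Model(inp, out)
    model.compile(optimizer="adam", loss="mse")
    return model

def to_batches(values):
    # Split a 1-D array of doubles into fixed-size rows (zero-padded at the end),
    # so that each element becomes one neuron of the input layer.
    n = int(np.ceil(len(values) / ELEMENTS_PER_BATCH)) * ELEMENTS_PER_BATCH
    padded = np.zeros(n)
    padded[:len(values)] = values
    return padded.reshape(-1, ELEMENTS_PER_BATCH)

# Example usage (hypothetical file name): train the CAE to reconstruct its own input.
# data = np.fromfile("bump_training.bin", dtype=np.float64)
# cae = build_cae()
# batches = to_batches(data)
# cae.fit(batches, batches, epochs=2500, batch_size=32, verbose=0)

Training the model to reconstruct its own input corresponds to the compression/decompression view described above, with the narrow code layer Z playing the role of the compressed file.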
Fig. 2: Preliminary series of experiments utilizing the concept of transfer learning: mean testing errors upon compressing (a) Maclaurin-Temp, (b) Maclaurin-Pres, (c) Sedov-Temp, and (d) Sedov-Pres. The x-axis indicates which dataset, if any, was used as the original training dataset. Note that each simulation (Maclaurin-Temp, Maclaurin-Pres, Sedov-Temp, and Sedov-Pres) has two separate datasets, one that is used for training and another that is used for testing (i.e., compression).

B. Experiment setup

In order to attempt to reduce training time, we develop a basic implementation of transfer learning in a compression autoencoder. The purpose of training is to generate weights and biases that can be utilized later. In the compression autoencoder, when training occurs without transfer learning, the weights are initially set to random values from a normal distribution and the biases are initialized to zero. These weights and biases are then optimized while the autoencoder runs through consecutive epochs. Assume that we train one dataset, dataset A, and obtain weights and biases wA and bA. These weights and biases are optimized for dataset A; they can be used on any datasets similar to A, such as A', but simply using them on a different dataset, dataset B, will not necessarily yield acceptable results.

Now assume we want to train dataset B. Without transfer learning, we would again start with randomly-initialized weights and zero-initialized biases. We would then go through multiple epochs, and in each epoch, we would go through every data point in dataset B. As previously noted, this could be very time-consuming if many epochs are needed or if dataset B is very large, and it is very common for both these conditions to be met. Thus, in order to reduce the time to train dataset B, we set the initial values of wB and bB to be based on wA and bA instead of setting them to random values and zeros, respectively. However, it is possible that the features of dataset B could be very different from those of dataset A, meaning that wA and bA may not be very applicable to dataset B. Thus, in order to reduce the impact of dataset A on wB and bB, we multiply the values of wA and bA by 0.7.² This means that wB is initialized to wA ∗ 0.7 and bB is initialized to bA ∗ 0.7. Since the possible features of the HPC dataset are already represented by wA and bA, we do not need to use the entirety of dataset B for training; instead, we can save valuable time by training on only a portion of dataset B. As a result, we take p data points of dataset B for training, rather than the entire dataset. In our simple implementation, we generally set p as 30% of the size of dataset A. (If the size of dataset B is smaller than p, then we use the entirety of dataset B for training, but this situation is relatively uncommon.) In our experiments, we then optimize wB and bB on the first p data points of dataset B for 2,500 epochs. Table I indicates that smaller datasets need less time for training, and since only a portion of dataset B (rather than the full dataset) is used for training, we can save a significant amount of time.

² In this case, 0.7 serves as a starting point for our basic implementation of transfer learning. Future work could determine the optimal number to multiply the weights and biases by, or how to find such a number.
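The following sketch illustrates the transfer learning procedure just described, reusing the hypothetical build_cae() and to_batches() helpers from the earlier sketch. The 0.7 scaling factor, the 30% portion, and the 2,500 fine-tuning epochs follow the text; the function name transfer_train and all other details are assumptions rather than the paper's actual code.

def transfer_train(cae_a, data_a, data_b, scale=0.7, portion=0.3, epochs=2500):
    # Start dataset B's weights and biases from 0.7 * w_A and 0.7 * b_A
    # rather than from random values and zeros.
    cae_b = build_cae()
    cae_b.set_weights([scale * w for w in cae_a.get_weights()])
    # Train on only p data points of B, where p is 30% of the size of A
    # (fall back to all of B if B has fewer than p points).
    p = min(int(portion * len(data_a)), len(data_b))
    subset = to_batches(data_b[:p])
    cae_b.fit(subset, subset, epochs=epochs, batch_size=32, verbose=0)
    return cae_b

Because each epoch now passes over only about 30% as many data points, and training time grows roughly linearly with dataset size (Table I), this fine-tuning step is where the bulk of the time savings comes from.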
IV. EVALUATION

One major question that can arise from the implementation discussed in Section III-B is whether the resulting weights and biases, wB and bB, are still effective or whether they perform poorly when they are used. In order to gain insight into whether the weights and biases are still usable (without a significant degradation of accuracy) after this implementation, we performed a series of tests. We take four datasets (Maclaurin-Pres, Maclaurin-Temp, Sedov-Pres, and Sedov-Temp) and run them in the compression autoencoder. The baseline performance is the testing error when transfer learning is not used (i.e., dataset X is trained, weights and biases wx and bx are generated, and those weights and biases are then tested by being used to compress dataset X'; the compression error (i.e., the testing error) is recorded); these are the leftmost bars in each graph (in light green). For each of the four datasets, we choose two datasets to compare it with using transfer learning. For example, one of the datasets compared with Sedov-Pres is Sedov-Temp (Figure 2d). To apply transfer learning, we train Sedov-Temp to obtain weights and biases wt and bt. We then multiply these weights and biases by 0.7 and optimize them using a portion of Sedov-Pres, as explained in Section III-B. This test is performed three times in order to ensure that the results are not simply due to randomness. The results are in Figure 2. Note that lower bars (i.e., lower errors) signify better performance.

From these four tests, we see that this method is promising. For each of the four datasets in this experiment, not only is it possible to obtain weights and biases with similar performance to those generated without transfer learning, but it is possible to obtain weights and biases that have better performance (i.e., lower errors) than those generated previously. (It is important to note, however, that the performance does not improve in all cases.)

Finding 1: Using transfer learning to train a dataset based on previously-generated weights and biases can decrease the training time and frequently has acceptable performance, thus indicating that transfer learning is a promising technique to reduce the training overhead for HPC machine learning applications.

We observe that, in these experiments, whenever Sedov-Temp is used as the original dataset, the training error is reduced. As a result, we conduct another series of experiments wherein we take Sedov-Temp as the original training dataset to generate weights and biases wt and bt. We use these weights and biases as the starting point for training a variety of other datasets. We then use those weights and biases to compress similar datasets. (For example, we may generate weights and biases from Sedov-Temp and then use them to generate new weights and biases, wp and bp, on a portion of Sedov-Pres. There are two Sedov-Pres datasets, one for training and one for testing. A portion of the training dataset is used for obtaining wp and bp, and the testing dataset is compressed using these weights and biases.) We compare the compression error using this method to the error without using transfer learning. The results, indicating the performance improvement, are shown in Table II.

TABLE II: Transfer learning results with Sedov-Temp as the original training dataset.

Dataset          Number of Elements   Approximate Performance Improvement When Sedov-Temp is Used
Sedov-Pres       39,072               442.9%
Maclaurin-Temp   133,376              153.3%
YF17-Temp        48,552               140.9%
S3DP             97,020               132.5%
Astro            32,768               123.2%
YF17-Pres        48,552               107.4%
Blast2           289,440              98.4%
Fish             32,768               93.2%
Eddy             135,000              84.8%
Md-Seg           300,000              76.1%
2D-Annulus       181,890              50.4%
Swept            77,180               40.7%

From this table, we can see that Sedov-Pres experiences vastly improved performance when Sedov-Temp is used as the original training dataset. Several other datasets also see performance improvements, and two of them, Blast2 and Fish, see only slight reductions in performance. Four datasets, however, experience more significant performance reductions. From these results, we can conclude that using transfer learning is promising. We are able to significantly reduce the training time, which can be extremely important for large datasets. In many cases, we obtain the added benefit of improved accuracy. While the accuracy does decrease in some cases, this could be addressed by using a different dataset as the original training dataset or by using different parameters instead of the aforementioned 0.7 and 30%. Further research into these cross-domain trends is potential future work.

While using cross-domain datasets for transfer learning may involve some ambiguity, we find that using transfer learning with datasets within the same domain may have more consistent performance. To explore transfer learning performance within the same simulation, we choose three simulations, 2d annulus, maclaurin, and sedov, and collect different metrics within each simulation. For 2d annulus, we collect temperature data (temp), velocity data on the x-axis (velx), and velocity data on the y-axis (vely), each for 121,260 timesteps. For maclaurin, we collect temperature (temp) and pressure data (pres) for 266,752 timesteps. For sedov, we collect temperature (temp) and pressure data (pres) for 78,144 timesteps. Each of these metrics is divided into two equal parts, one for training and another for testing.
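As a sketch of how this baseline-versus-transfer comparison could be driven, the code below trains a baseline CAE on a metric's training half, trains a source CAE on another metric, fine-tunes it with the transfer_train() helper sketched earlier, and averages the testing error over three runs. The mean-absolute-error measure, the epoch counts, and all helper names are assumptions; the paper does not specify the exact error metric or driver.

import numpy as np

def mean_testing_error(cae, test_values):
    # Assumed metric: mean absolute reconstruction error on the testing half.
    batches = to_batches(test_values)
    recon = cae.predict(batches, verbose=0)
    return float(np.mean(np.abs(recon - batches)))

def compare(source_train, target_train, target_test, runs=3):
    # Baseline: train on the full training half of the target metric (no transfer).
    baseline = build_cae()
    tb = to_batches(target_train)
    baseline.fit(tb, tb, epochs=2500, batch_size=32, verbose=0)   # epoch count assumed
    baseline_err = mean_testing_error(baseline, target_test)
    # Transfer: train the source metric once, then fine-tune on 30% of the
    # target metric, averaging the testing error over three runs.
    source = build_cae()
    sb = to_batches(source_train)
    source.fit(sb, sb, epochs=2500, batch_size=32, verbose=0)
    errs = [mean_testing_error(transfer_train(source, source_train, target_train),
                               target_test) for _ in range(runs)]
    return baseline_err, float(np.mean(errs))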

Fig. 3: Mean training and testing errors for three metrics, temp, velx, and vely, from the 2d annulus HPC simulation. Panel (a) shows mean training errors and panel (b) shows mean testing errors; for each metric, the full-data baseline (e.g., temp1 full only) is compared against transfer learning from each of the other two metrics (e.g., velx1 (full) + 30% temp1, averaged over three runs).

Fig. 4: Mean training and testing errors for two metrics, temp and pres, from the maclaurin HPC simulation. Panel (a) shows mean training errors and panel (b) shows mean testing errors, comparing each metric's full-data baseline against transfer learning from the other metric (full source metric + 30% of the target metric, averaged over three runs).

Figures 3, 4, and 5 show the results for both training and testing on these metrics.³ For each simulation, we first conduct traditional machine learning; this serves as the baseline performance. We then use transfer learning from the other metrics in the same simulation on each metric to compare the performance of the baseline and transfer learning. Transfer learning was implemented in the same way as in the previous experiments. From these results, we can see that if we use transfer learning within one simulation, then in most of the cases, the testing results with transfer learning are similar to the baseline performance, if not better. In certain cases, such as with velx, the transfer learning results are slightly worse than the baseline, but they still remain roughly around the same level as the baseline, with low error (less than 0.02 in that case).

³ The testing results are the results used to determine performance. Having better or worse training results does not necessarily indicate that transfer learning performs better or worse than the baseline; this determination is made using the testing results.

Fig. 5: Mean training and testing errors for two metrics, temp and pres, from the sedov HPC simulation. Panel (a) shows mean training errors and panel (b) shows mean testing errors, comparing each metric's full-data baseline against transfer learning from the other metric (full source metric + 30% of the target metric, averaged over three runs).

Note that worse training performance does not necessarily mean that transfer learning performs worse, as the testing performance is the metric that determines how transfer learning compares to the baseline. In these experiments, the training time is reduced by around 70% while still maintaining roughly the same (if not better) level of prediction error; this reduction is consistent with only about 30% of the target data being used for training, given the roughly linear relationship between training time and dataset size in Table I. The results match the intuition that metrics within the same simulation usually share similar distributions, even though they may be different. For example, when running astrometeorology simulations, locations with higher temperatures will also have higher pressures; this indicates that metrics inside one simulation may frequently be correlated.

Finding 2: When implementing transfer learning, if the original training dataset and the second training dataset are from different simulations, then the resulting weights and biases often perform well, but occasionally perform very poorly. When both training datasets are from the same simulation, however, using transfer learning is very likely to produce results similar to, if not better than, the baseline performance.

In addition, we see another interesting observation: in some of the cases, using transfer learning leads to higher training error but lower testing error. This could be because, in those cases, the model experiences overfitting, a common problem where training leads to the weights and biases being too well-fit to the training data and not generalized enough to work well on similar datasets [22]. This could indicate that transfer learning can, in some cases, serve as a safeguard against overfitting.

V. RELATED WORK

Much research has been performed on reducing training time by minimizing the training overhead. For example, Osuna et al. [23] present a decomposition algorithm that is guaranteed to solve the quadratic programming problem and improve training performance. Joachims [24] analyzes particular properties of learning with text data and explores the use of support vector machines to reduce training time. Work has also been performed by Microsoft [25] that uses sequential minimal optimization to avoid using a time-consuming numerical QP optimization as an inner loop.

Transfer learning, as a popular technique that aims to extract knowledge from one domain and apply it to another, has recently been applied to many machine learning scenarios. Transfer learning research can generally be divided into four categories based on the way it is implemented [19]: 1) instance transfer [26]–[28], in which some labelled data in the source domain is re-weighted before being used in the target domain; 2) feature-representation transfer [29]–[31], which finds a feature representation that reduces differences between the source and target domains; 3) parameter transfer [32]–[34], which discovers shared parameters between the source and target domain models that can benefit from transfer learning; and 4) relational-knowledge transfer [35]–[37], which builds a mapping of relational knowledge between the source and target domains. Our transfer learning scheme in this paper is similar to the first category, as we take advantage of external knowledge from the source domain data's weights and biases by applying a weighted version of them to the target domain that we want to test.

VI. CONCLUSION

In this paper, we explore the possibility of using transfer learning on HPC data by using a compression autoencoder (CAE) as a case study. We implement transfer learning in a CAE; this generally leads to a vast reduction in training time. We find that when transfer learning is applied to datasets within the same simulation, it is highly likely to achieve similar or even better performance than traditional machine learning. In addition, we run a series of experiments with a fixed dataset for the original transfer learning and 12 datasets that use the results of that transfer learning to generate their own weights and biases. In the majority of these 12 cases, the performance is either very similar to or much better than the performance of traditional machine learning, indicating that transfer learning can even be possible for cross-domain datasets.

Our future work could include exploring ways of determining which data features make datasets suitable for cross-domain transfer learning and looking into ways of further improving the performance of transfer learning. As transfer learning can reduce training time without, in most cases, significantly increasing the error, it can be very useful for working with HPC datasets in machine learning applications.

The source code for the compression autoencoder (both the original version and the version with transfer learning) and the HPC datasets used in this paper are available at https://github.com/tobivcu/autoencoder.

The work performed is partially sponsored by the U.S. National Science Foundation grants CCF-1813081, CNS-1702474, CCF-1718297, and CCF-1812861. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the funding agencies.

REFERENCES

[1] S. Shen, et al. Minimum risk training for neural machine translation. arXiv preprint arXiv:1512.02433, 2015.
[2] P. Stone, et al. Keepaway soccer: From machine learning testbed to benchmark. Robot Soccer World Cup. Springer, Berlin, Heidelberg, 2005.
[3] J. Snoek, H. Larochelle, and R. P. Adams. Practical Bayesian optimization of machine learning algorithms. Advances in Neural Information Processing Systems, 2012.
[4] M. Di and M. Er. A survey of machine learning in wireless sensor networks from networking and application perspectives. 2007 6th International Conference on Information, Communications & Signal Processing. IEEE, 2007.
[5] B. Catanzaro, N. Sundaram, and K. Keutzer. Fast support vector machine training and classification on graphics processors. Proceedings of the 25th International Conference on Machine Learning. ACM, 2008.
[6] T. Lu, et al. Canopus: A paradigm shift towards elastic extreme-scale data analytics on HPC storage. 2017 IEEE International Conference on Cluster Computing (CLUSTER). IEEE, 2017.
[7] D. Tao, et al. Significantly improving lossy compression for scientific data sets based on multidimensional prediction and error-controlled quantization. 2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS). IEEE, 2017.
[8] C. Lampert, H. Nickisch, and S. Harmeling. Learning to detect unseen object classes by between-class attribute transfer. 2009 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2009.
[9] H. Shin, et al. Deep convolutional neural networks for computer-aided detection: CNN architectures, dataset characteristics and transfer learning. IEEE Transactions on Medical Imaging, 35.5 (2016): 1285-1298.
[10] A. Conneau, et al. Supervised learning of universal sentence representations from natural language inference data. arXiv preprint arXiv:1705.02364, 2017.
[11] R. Collobert and J. Weston. A unified architecture for natural language processing: Deep neural networks with multitask learning. Proceedings of the 25th International Conference on Machine Learning. ACM, 2008.
[12] X. Yin, J. Han, J. Yang, and P. S. Yu, Efficient classification across multiple database relations: A crossmine approach, IEEE Transactions on Knowledge and Data Engineering, vol. 18, no. 6, pp. 770-783, 2006.
[13] L. I. Kuncheva and J. J. Rodríguez, Classifier ensembles with a random linear oracle, IEEE Transactions on Knowledge and Data Engineering, vol. 19, no. 4, pp. 500-508, 2007.
[14] E. Baralis, S. Chiusano, and P. Garza, A lazy approach to associative classification, IEEE Transactions on Knowledge and Data Engineering, vol. 20, no. 2, pp. 156-171, 2008.
[15] X. Zhu, Semi-supervised learning literature survey, University of Wisconsin-Madison, Tech. Rep. 1530, 2006.
[16] K. Nigam, A. K. McCallum, S. Thrun, and T. Mitchell, Text classification from labeled and unlabeled documents using EM, Machine Learning, vol. 39, no. 2-3, pp. 103-134, 2000.
[17] A. Blum and T. Mitchell, Combining labeled and unlabeled data with co-training, in Proceedings of the Eleventh Annual Conference on Computational Learning Theory, 1998, pp. 92-100.
[18] T. Joachims, Transductive inference for text classification using support vector machines, in Proceedings of the Sixteenth International Conference on Machine Learning, 1999, pp. 825-830.
[19] S. J. Pan and Q. Yang, A survey on transfer learning, IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 10, pp. 1345-1359, 2010.
[20] A Gentle Introduction to Transfer Learning for Deep Learning. https://securityglobal24h.com/a-gentle-introduction-to-transfer-learning-for-deep-learning/machine-learning/Information-Security-latest-Hacking-News-Cyber-Security-Network-Security. Accessed May 7, 2019.
[21] T. Lu, et al. Understanding and modeling lossy compression schemes on HPC scientific data, in IEEE International Parallel and Distributed Processing Symposium, 2018.
[22] T. Dietterich, "Overfitting and undercomputing in machine learning," ACM Computing Surveys, 1995.
[23] E. Osuna, R. Freund, and F. Girosi. An improved training algorithm for support vector machines, IEEE Workshop on Neural Networks and Signal Processing, Amelia Island, IEEE Press, 1997: 276-285.
[24] T. Joachims. "Text categorization with support vector machines: Learning with many relevant features." European Conference on Machine Learning. Springer, Berlin, Heidelberg, 1998.
[25] J. Platt. "Sequential minimal optimization: A fast algorithm for training support vector machines." 1998.
[26] W. Dai, Q. Yang, G. Xue, and Y. Yu, Boosting for transfer learning, in Proceedings of the 24th International Conference on Machine Learning, Corvallis, Oregon, USA, June 2007, pp. 193-200.
[27] W. Dai, G. Xue, Q. Yang, and Y. Yu, Transferring naive Bayes classifiers for text classification, in Proceedings of the 22nd AAAI Conference on Artificial Intelligence, Vancouver, British Columbia, Canada, July 2007, pp. 540-545.
[28] J. Quionero-Candela, M. Sugiyama, A. Schwaighofer, and N. D. Lawrence, Dataset Shift in Machine Learning. The MIT Press, 2009.
[29] R. Raina, A. Battle, H. Lee, B. Packer, and A. Y. Ng, Self-taught learning: Transfer learning from unlabeled data, in Proceedings of the 24th International Conference on Machine Learning, Corvallis, Oregon, USA, June 2007, pp. 759-766.
[30] W. Dai, G. Xue, Q. Yang, and Y. Yu, Co-clustering based classification for out-of-domain documents, in Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Jose, California, USA, August 2007.
[31] R. K. Ando and T. Zhang, A high-performance semi-supervised learning method for text chunking, in Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics. Morristown, NJ, USA: Association for Computational Linguistics, 2005, pp. 1-9.
[32] N. D. Lawrence and J. C. Platt, Learning to learn with the informative vector machine, in Proceedings of the 21st International Conference on Machine Learning. Banff, Alberta, Canada: ACM, July 2004.
[33] E. Bonilla, K. M. Chai, and C. Williams, Multi-task Gaussian process prediction, in Proceedings of the 20th Annual Conference on Neural Information Processing Systems. Cambridge, MA: MIT Press, 2008, pp. 153-160.
[34] A. Schwaighofer, V. Tresp, and K. Yu, Learning Gaussian process kernels via hierarchical Bayes, in Proceedings of the 17th Annual Conference on Neural Information Processing Systems. Cambridge, MA: MIT Press, 2005, pp. 1209-1216.
[35] L. Mihalkova, T. Huynh, and R. J. Mooney, Mapping and revising Markov logic networks for transfer learning, in Proceedings of the 22nd AAAI Conference on Artificial Intelligence, Vancouver, British Columbia, Canada, July 2007, pp. 608-614.
[36] L. Mihalkova and R. J. Mooney, Transfer learning by mapping with minimal target data, in Proceedings of the AAAI-2008 Workshop on Transfer Learning for Complex Tasks, Chicago, Illinois, USA, July 2008.
[37] J. Davis and P. Domingos, Deep transfer via second-order Markov logic, in Proceedings of the AAAI-2008 Workshop on Transfer Learning for Complex Tasks, Chicago, Illinois, USA, July 2008.
[38] Nek5000 User Guide. https://nek5000.mcs.anl.gov/files/2015/09/NEK doc.pdf.
[39] Trace3D, 3D streamline simulator. http://www.pe.tamu.edu/mceri/software.html.
[40] GROMACS, molecular dynamics performance simulator. http://www.gromacs.org/About Gromacs/Benchmarks.
[41] ASCF Center. FLASH Users Guide. http://flash.uchicago.edu/site/flashcode/user support/.

