Possibility of HPC Application On Cloud Infrastructure by Container Cluster
Authorized licensed use limited to: Nirma University Institute of Technology. Downloaded on October 03,2024 at 07:14:58 UTC from IEEE Xplore. Restrictions apply.
computation can be deployed to one or more CPUs or GPGPUs in any environment without rewriting the code. It was originally developed by researchers and engineers working on the Google Brain team. It is now open source and widely used for machine learning applications such as neural networks.

G. Horovod
Horovod [12] is an open-source distributed deep learning framework by Uber, built on top of TensorFlow. It was developed to make it easy to take a single-GPGPU TensorFlow program and train it on many GPGPUs in parallel, to enhance its speed. Its developers found the MPI model to be more straightforward, and to require fewer code changes, than distributed TensorFlow. Horovod is based on MPI concepts such as size, rank, local rank, all-reduce, all-gather, and broadcast.

III. EXPERIMENTS AND EVALUATION DETAILS
We conducted a number of experiments to check the possibility of running HPC applications on cloud infrastructure using containers. We evaluated the performance of a parallel Poisson's equation solver and the InfiniBand bandwidth of the native and container environments. The native environment means that there is no software stacked over the OS. The container environment is an environment built using container technology. In the Poisson's equation solver evaluation, both environments communicate between nodes by MPI. We also measure cache misses in both environments. The other experiments are machine learning training applications that heavily offload computation to GPGPUs. For the machine learning training applications, we evaluated two types of training algorithms: ResNet50 [19] and Recurrent Neural Networks (RNNs) with Long Short-Term Memory (LSTM) [20]. For the evaluation of RNNs with LSTM, the PTB [22] dataset is used. Every evaluation is repeated 10 times and the mean value is used as the result. Each evaluation is performed on the native environment and the container environment. Kubernetes is used as the orchestration tool for the container environment.

IV. RESULTS AND DISCUSSION

A. MPI application scalability
Fig. 2 and Table 1 show the Poisson's equation solver scalability on the native and container environments. To show the optimal result, Fig. 2 and Table 1 illustrate only the asynchronous communication manner for scalability. The number of degrees of freedom (DOF) of the matrix used in solving Poisson's equation was 40,401. The maximum number of conjugate gradient iterations was 10,000. The tolerance for the solution is 1.0e-10. We only measure strong-scaling performance. Overhead represents the overhead of the container environment against the native environment.

TABLE I. WALL TIME FOR POISSON'S EQUATION SOLVER (SEC)

Node Count     1         10       20      30      40      44
Native         3203.19   103.41   28.10   12.57   8.39    6.26
Container      3205.04   103.43   28.14   16.14   10.15   10.27
Overhead (%)   0.06      0.02     0.13    22.15   17.27   39.07

Fig. 2. Wall time of Poisson's equation solver on native and container environments (Time (sec), log scale) – Lower is better
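The solver configuration above (a 40,401-DOF system, at most 10,000 conjugate gradient iterations, tolerance 1.0e-10) can be illustrated with a serial sketch of the conjugate gradient iteration on a 5-point Poisson stencil. This is a minimal single-process illustration, not the paper's MPI-parallel implementation, and the unit right-hand side is an assumption:

```python
import numpy as np

def apply_laplacian(u):
    """Apply the 5-point finite-difference Laplacian (zero Dirichlet
    boundaries) to an n x n grid of unknowns."""
    out = 4.0 * u
    out[1:, :] -= u[:-1, :]   # north neighbor
    out[:-1, :] -= u[1:, :]   # south neighbor
    out[:, 1:] -= u[:, :-1]   # west neighbor
    out[:, :-1] -= u[:, 1:]   # east neighbor
    return out

def poisson_cg(n, tol=1e-10, max_iter=10_000):
    """Solve A u = b with conjugate gradient, where A is the 5-point
    Laplacian on an n x n interior grid (n * n degrees of freedom)."""
    b = np.ones((n, n))            # unit right-hand side (assumption)
    x = np.zeros((n, n))
    r = b - apply_laplacian(x)
    p = r.copy()
    rs = float(np.vdot(r, r))
    for it in range(1, max_iter + 1):
        Ap = apply_laplacian(p)
        alpha = rs / float(np.vdot(p, Ap))
        x += alpha * p
        r -= alpha * Ap
        rs_new = float(np.vdot(r, r))
        if np.sqrt(rs_new) < tol:  # residual norm below tolerance
            return x, it
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x, max_iter

# n = 201 gives 201 * 201 = 40,401 degrees of freedom, as in Table 1.
```

In the paper's setting the grid is distributed across MPI ranks, and each stencil application requires halo exchanges between neighboring ranks; that exchange is where the synchronous versus asynchronous communication choice discussed above applies.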
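The strong-scaling behavior in Table 1 can be summarized as the speedup S(n) = T(1)/T(n). A small helper over the table's wall times:

```python
# Wall times (sec) from Table 1, indexed by node count.
NATIVE = {1: 3203.19, 10: 103.41, 20: 28.10, 30: 12.57, 40: 8.39, 44: 6.26}
CONTAINER = {1: 3205.04, 10: 103.43, 20: 28.14, 30: 16.14, 40: 10.15, 44: 10.27}

def speedup(times):
    """Strong-scaling speedup relative to the single-node wall time."""
    t1 = times[1]
    return {n: round(t1 / t, 1) for n, t in times.items()}

native_speedup = speedup(NATIVE)
container_speedup = speedup(CONTAINER)
# The gap between the two curves widens beyond 20 nodes, matching the
# jump in the Overhead row of Table 1.
```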
This indicates that the container environment cannot fully utilize CPU computing resources. Fig. 3 and Fig. 4 show the efficiency of communication optimization in both environments. In Fig. 3 and Fig. 4, the efficiency of communication optimization represents the wall time of each evaluation as a ratio against the evaluation result on node count 1. We confirmed that MPI communication optimization methods can also be applied to the container environment.

Fig. 3. Communication optimization efficiency on the native environment – Lower is better
(Ratio for node counts 1, 10, 20, 30, 40, 44 — Collective: 1, 1, 1, 1, 1, 1; Synchronous: 1.00, 0.31, 0.16, 0.11, 0.09, 0.08; Asynchronous: 1.00, 0.31, 0.16, 0.11, 0.09, 0.08)

Fig. 4. Communication optimization efficiency on the container environment – Lower is better
(Ratio for node counts 1, 10, 20, 30, 40, 44 — Collective: 1, 1, 1, 1, 1, 1; Synchronous: 1.00, 0.31, 0.16, 0.11, 0.09, 0.09; Asynchronous: 1.00, 0.31, 0.16, 0.11, 0.09, 0.09)

B. Cache miss
The cache miss rate is one of the most important optimization points in an HPC application. Based on this result, we confirm that the cache miss rate in the container environment is acceptable for HPC applications.

TABLE II. CACHE MISS EVALUATION RESULT OF MPI APPLICATION (%)

Type        I1      LLi     D1      LLd     LL
Native      0.31%   0.17%   1.40%   0.80%   0.30%
Container   0.28%   0.15%   1.30%   0.70%   0.30%

Fig. 5. Cache miss rate evaluation result (I1, LLi, D1, LLd, and LL miss rates; Native vs. Container) – Lower is better

C. InfiniBand latency
To measure communication performance via InfiniBand in both environments, we measure InfiniBand bandwidth and latency using the OpenFabrics Enterprise Distribution [24]. The InfiniBand bandwidth is measured as hardware performance, so there is no difference between the two environments. A latency comparison shows very little difference between the two environments. Table 3 and Fig. 6 show the InfiniBand latency evaluation results. When we measure InfiniBand performance, we change the Kubernetes CNI from Calico to Flannel, because Calico does not support VXLAN, which is required for using InfiniBand in the container environment. The test application sends 8 MB (8,388,608 bytes) of data 1,000 times. The evaluation results are shown as minimum, maximum, typical (Typ), average, 99th percentile, and 99.9th percentile values.

TABLE III. INFINIBAND LATENCY EVALUATION RESULT (µs)

Evaluation   Min      Max      Typ      Avg      99%      99.9%
Native       715.39   751.18   718.78   719.68   735.58   751.18
Container    714.87   745.01   717.66   718.38   735.70   745.01

Fig. 6. InfiniBand latency evaluation result – Lower is better

There is no significant difference between the two environments in the average and typical values. However, at the maximum and the 99.9th percentile, there is a difference in performance.
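Summary statistics such as those in Table 3 are reductions over the per-message latencies of the 1,000 transfers. A sketch of that reduction step with NumPy (the sample array here is synthetic, not the paper's measurements; the benchmark's "typical" value is the most frequent latency, which has no one-line NumPy equivalent and is omitted):

```python
import numpy as np

def latency_summary(samples_us):
    """Reduce raw per-message latencies (microseconds) to the columns
    reported in Table 3: Min, Max, Avg, 99th and 99.9th percentile."""
    s = np.asarray(samples_us, dtype=float)
    return {
        "min": float(s.min()),
        "max": float(s.max()),
        "avg": float(s.mean()),
        "p99": float(np.percentile(s, 99)),
        "p99.9": float(np.percentile(s, 99.9)),
    }

# Synthetic example: 1,000 transfers clustered around ~719 us, with a
# right-skewed tail, loosely mimicking the shape of Table 3.
rng = np.random.default_rng(0)
stats = latency_summary(719.0 + rng.gamma(2.0, 1.5, size=1000))
```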
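The distributed training evaluated in the next subsection depends on averaging gradients across workers, the all-reduce operation that Horovod (Section G) expresses with MPI concepts. A pure-Python sketch of a ring all-reduce over equal-length gradient vectors, simulated sequentially; this illustrates the idea only and is not Horovod's implementation:

```python
def ring_allreduce(grads):
    """Sum equal-length gradient vectors across workers using a ring
    all-reduce: a reduce-scatter pass, then an all-gather pass.
    Each element of `grads` plays the role of one worker."""
    workers = len(grads)
    length = len(grads[0])
    # Chunk boundaries: worker i "owns" index range bounds[i]:bounds[i+1].
    bounds = [(i * length) // workers for i in range(workers + 1)]
    data = [list(g) for g in grads]  # working copies

    # Reduce-scatter: at step s, worker i sends chunk (i - s) mod p to
    # its right neighbor, which accumulates it. After p - 1 steps,
    # worker i holds the fully summed chunk (i + 1) mod p.
    for step in range(workers - 1):
        for i in range(workers):
            dst = (i + 1) % workers
            chunk = (i - step) % workers
            lo, hi = bounds[chunk], bounds[chunk + 1]
            for k in range(lo, hi):
                data[dst][k] += data[i][k]

    # All-gather: circulate the completed chunks around the ring so
    # every worker ends up with the full summed vector.
    for step in range(workers - 1):
        for i in range(workers):
            dst = (i + 1) % workers
            chunk = (i + 1 - step) % workers
            lo, hi = bounds[chunk], bounds[chunk + 1]
            data[dst][lo:hi] = data[i][lo:hi]
    return data
```

Horovod additionally divides the summed gradients by the worker count to obtain the average; the ring schedule keeps each worker's per-step traffic constant regardless of the number of workers.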
D. Machine learning training application performance
To measure the performance overhead of machine learning training applications in the container environment, we evaluate the two machine learning training applications in both environments. We measure the performance of the distributed machine learning applications ResNet50 and RNNs with LSTM. This measurement shows the performance gap of GPGPU-oriented HPC applications between the two environments. Both evaluations use a total of 64 GPGPU cards on 8 physical servers, with each server containing 8 GPGPU cards. One worker of ResNet50 or RNNs with LSTM runs on one GPGPU card. The ResNet50 performance evaluation result is presented as images processed per second. The RNNs with LSTM evaluation result is presented as words per second. Table 4 shows the evaluation results of the two applications as raw performance values. Fig. 7 represents the evaluation results as ratios against the native evaluation result.

TABLE IV. MACHINE LEARNING APPLICATION EVALUATION RESULT – 64 WORKERS ARE USED

Application   ResNet50                RNNs with LSTM
Native        3,622.08 (images/sec)   49,296.24 (words/sec)
Container     3,604.11 (images/sec)   47,782.82 (words/sec)

For ResNet50, there is only 0.5% performance overhead. ResNet50 has many code blocks for GPGPU processing and only sends the parameter values needed to compute the neural network weights and biases. RNNs with LSTM has 3% performance overhead. RNNs with LSTM has more performance overhead than ResNet50. However, it has less performance overhead than the previous measurement of other applications by Zheng Li et al. [4]. This can be attributed to the characteristics of machine learning training applications. Both training applications perform a larger part of their computation on the GPGPU than on the CPU. This means that machine learning training applications are less affected by the computation performance overhead of the container environment.

Fig. 7. Machine learning application performance comparison (ratio against native — Native: 1, 1; Container: 0.995 for ResNet50, 0.97 for RNNs with LSTM)

V. CONCLUSIONS AND FUTURE WORK
We presented the possibility of running HPC applications on cloud infrastructure built on a container cluster. We observed that there was performance overhead for CPU-oriented applications. Even though container technology has many benefits, this level of performance overhead cannot be accepted by HPC applications. However, we found that communication optimization methods can be applied in container technology. We also did not find cache miss rate overhead in the container environment. We observed that machine learning training applications have very small overhead in the container environment. We submit that there is no performance loss in InfiniBand usage either. We can therefore confirm the possibility of using HPC applications on cloud infrastructure conditionally: if an HPC application is GPGPU-oriented or works on heterogeneous computing, it can be run in a container cluster environment with the numerous benefits of cloud computing.

A cloud environment built with container technology has limitations in several aspects. It needs extra computation resources for management and orchestration, so some computing resources must be assigned to them. That overhead should be reduced in the future. It also has network overhead. In our next research, we will investigate and find the most suitable network configuration for HPC applications in a container environment and improve the network performance. We shall also study the best-fit optimization methods for HPC applications in a container-based environment.

REFERENCES
[1] "Amazon Web Services." [Online]. Available: https://aws.amazon.com/
[2] "Microsoft Azure." [Online]. Available: https://azure.microsoft.com/en-us/
[3] "Google Cloud Platform." [Online]. Available: https://cloud.google.com/
[4] Z. Li, M. Kihl, Q. Lu, and J. A. Andersson, "Performance Overhead Comparison between Hypervisor and Container Based Virtualization," in 2017 IEEE 31st International Conference on Advanced Information Networking and Applications (AINA), 2017, pp. 955–962.
[5] R. Morabito, J. Kjallman, and M. Komu, "Hypervisors vs. Lightweight Virtualization: A Performance Comparison," in 2015 IEEE International Conference on Cloud Engineering, 2015, pp. 386–393.
[6] "Linux Containers." [Online]. Available: http://linuxcontainers.org
[7] "Docker." [Online]. Available: https://www.docker.com/
[8] "Kubernetes." [Online]. Available: https://kubernetes.io/
[9] W. Felter, A. Ferreira, R. Rajamony, and J. Rubio, "An updated performance comparison of virtual machines and Linux containers," in 2015 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2015, pp. 171–172.
[10] C. Ruiz, E. Jeanvoine, and L. Nussbaum, "Performance evaluation of containers for HPC," in European Conference on Parallel Processing, 2015, pp. 813–824.
[11] M. Abadi et al., "TensorFlow: a system for large-scale machine learning," 2016.
[12] A. Sergeev and M. Del Balso, "Horovod: fast and easy distributed deep learning in TensorFlow," Feb. 2018.
[13] "MPICH Multinode Cluster." [Online]. Available: https://github.com/ContinUSE/kubernetes-coreos-cluster/tree/master/examples/mpich
[14] A. M. Joy, "Performance comparison between Linux containers and virtual machines," in 2015 International Conference on Advances in Computer Engineering and Applications, 2015, pp. 342–346.
[15] P. R. Luszczek et al., "S12 — The HPC Challenge (HPCC) benchmark suite," in Proceedings of the 2006 ACM/IEEE Conference on Supercomputing - SC '06, 2006, p. 213.
[16] M. de Bayser and R. Cerqueira, "Integrating MPI with Docker for HPC," in 2017 IEEE International Conference on Cloud Engineering (IC2E), 2017, pp. 259–265.
[17] D. M. Jacobsen and R. S. Canon, "Contain This, Unleashing Docker for HPC," 2015.
[18] R. Priedhorsky and T. Randles, "Charliecloud," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis - SC '17, 2017, pp. 1–10.
[19] K. He, X. Zhang, S. Ren, and J. Sun, "Deep Residual Learning for Image Recognition," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[20] W. Zaremba, I. Sutskever, and O. Vinyals, "Recurrent Neural Network Regularization," 2014.
[21] J. Deng, W. Dong, R. Socher, L.-J. Li, Kai Li, and Li Fei-Fei, "ImageNet: A large-scale hierarchical image database," in 2009 IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 248–255.
[22] M. P. Marcus, B. Santorini, M. A. Marcinkiewicz, and A. Taylor, "Treebank-3 LDC99T42." Philadelphia: Linguistic Data Consortium, 1999.
[23] N. Nethercote, "Dynamic binary analysis and instrumentation," University of Cambridge, Computer Laboratory, 2004.
[24] "OpenFabrics Enterprise Distribution." [Online]. Available: https://www.openfabrics.org/