
2018 6th International Conference on Control Engineering & Information Technology (CEIT), 25-27 October 2018, Istanbul, Turkey

Performance Analysis and CPU vs GPU Comparison for Deep Learning

Ebubekir BUBER, Banu DIRI
Computer Engineering Department, Yildiz Technical University, Istanbul, Turkey
[email protected] [email protected]

Abstract—Deep learning approaches are machine learning methods used in many application fields today. Some core mathematical operations performed in deep learning are suitable for parallelization. Parallel processing increases the operating speed. Graphical Processing Units (GPU) are used frequently for parallel processing. The parallelization capacities of GPUs are higher than those of CPUs, because GPUs have far more cores than Central Processing Units (CPUs). In this study, benchmarking tests were performed between CPU and GPU. A Tesla K80 GPU and an Intel Xeon Gold 6126 CPU were used during the tests. A system for classifying web pages with a Recurrent Neural Network (RNN) architecture was used to compare performance during testing. CPUs and GPUs running on the cloud were used in the tests because the amount of hardware needed for the tests was high. During the tests, some hyperparameters were adjusted and the performance values were compared between CPU and GPU. It has been observed that the GPU runs faster than the CPU in all tests performed. In some cases, the GPU is 4-5 times faster than the CPU, according to the tests performed on the GPU server and the CPU server. These values can be further increased by using a GPU server with more features.

Keywords—cloud computing; gpu; cpu; performance analysis; deep learning.

I. INTRODUCTION

Deep learning applications are used in many applications that we use in everyday life. Deep learning approaches provide very successful results for many machine learning problems.

It is important for a system not only to work with high accuracy but also to operate fast. GPUs are widely used today for rapid training and testing with deep learning architectures. The mathematical operations computed for deep learning are parallelizable operations. Thanks to the large number of cores on GPUs, computational operations can be distributed over many cores to accelerate them. In this study, the performance difference between CPU and GPU is analyzed on a specific problem.

The number of processing units (cores) that can operate independently of each other is very important for the amount of parallelization of a process. Parallelization on the four or eight cores of a PC with average hardware features and parallelization on a GPU with thousands of cores will not be at the same level. A large number of cores increases the amount of parallelization. CPU cores have a higher processor frequency, whereas GPUs have a large number of cores.

Some hyperparameters are important for the parallelization of mathematical operations. If these parameters are not adjusted properly, the desired performance gains cannot be achieved with parallelization. In this study, it has been investigated how performance gains can be achieved by optimizing these parameters.

For testing, a personal computer, a MacBook Pro with a 2.3 GHz Intel Core i5, 8 GB DDR3 RAM, and a 128 GB SSD disk, was available. However, these computer features are insufficient for the tests applied. For this reason, cloud-based services were used for the CPU and GPU tests.

In the tests performed for comparison, a deep learning-based system for classifying web pages was used. A web page is a document on the World Wide Web that can be viewed using a web browser. Web pages are mostly encoded in HTML format, using CSS, scripts, visual and other auxiliary resources to get the final look and functionality. Web page classification is an Information Retrieval application that provides useful information that can be a basis for many different application domains. An example application area is the creation of a user's internet usage profile for network anomaly detection. The web page classification project used during the tests runs on deep learning approaches based on natural language processing.

Detailed information on the definition of the web page classification problem is given in Section II; the deep learning architectures used during the tests and the experimental results are explained in Section III. The results and future work are explained in Section IV.

978-1-5386-7641-7/18/$31.00 ©2018 IEEE

II. WEB PAGE CLASSIFICATION

The content-based classification system used is designed only for the classification of web pages prepared in the English language; it is aimed to increase the number of languages that can be classified in the following stages.

There are many data sets published for use in projects developed for the classification of web pages. One of the data sets that can be used to classify web pages is Roksit's web classification database [1]. Millions of websites are categorized in this database. During the classification process, web page contents are used as well as some network-based features. The feature set used is large, allowing for the creation of a successful web page classifier. The Roksit firm claims that the web page classification system they have developed has a success rate of 99.9%.

In this study, only the comparison of the running times of the CPU and the GPU has been examined; the optimizations for increasing the success rate are not explained. Detailed information on the data set is given in subsection A. A deep learning based RNN approach has been used to classify web pages in the developed project. Information on the deep learning architecture used in the web page classification project is given in subsection B. Information about word embeddings is given in subsection C.

In some of the tests performed, transfer learning was used. Transfer learning allows the use of pre-learned parameters in the deep learning architecture. The parameters used are obtained as a result of long training periods on very large data sets carried out by researchers. The use of these pre-learned parameters allows the training phase to be completed more quickly. Detailed information on transfer learning is explained in subsection D.

A. Dataset

Textual information on the target page is used in web page classification. The datasets used to classify web pages should be considered in two stages. These are:

• First Stage: Only web pages and categories are included.

• Second Stage: The crawled data used to classify web pages.

In the first stage, a system was developed to recognize 23 categories of English web pages. These categories are given in Table I.

TABLE I. CATEGORY LIST

Business    Food        Game        Alchocol
Porn        Sport       News        Dating
Education   Reales      Goverment   Abort
Travel      Animal      Gambling    Hate
Techno      Reference   Adult       Drug
Health      Finance     Politic

The system was developed using 1,210,967 domains. The distribution of the obtained data according to the categories is given in Fig. 1.

Fig. 1. Distribution of Data According to Categories

In order to classify web pages based on content, it is necessary to first determine what kind of data to use. The data to be used can be obtained from the content of the analyzed page as well as from neighboring pages [2-3]. In this study, the information obtained from the analyzed pages is used for the classification of web pages.

The web page content must be well-structured so that a web page can successfully reach its target audience. It is very important for web pages to be indexed by search engines and to be visible in the top rows of search queries to the target users.

Meta tags are one of the methods used to index the content of a web page properly and to make the web page appear high in search engines. A well-structured web page contains meta tags. These tags contain summary information entered by the web page administrator indicating the purpose / function of the web page. The most important meta tags include the following labels:

• Title
• Description
• Keywords

Meta tags are included in the source code of the web page; they may not be displayed on the screen by the browser for the user to view.

B. RNN Architecture

The RNN (Recurrent Neural Network) allows us to produce models with the ability to instantly use information obtained from past words. In this deep learning approach, the states before t are given as input for the evaluation of state t. The RNN architecture consists of cells that are repeated one after the other. A cell's information is given as input to the next cell. In some sources, the recursive representation is plotted over a single cell, while in others the cells are displayed separately. The structure of the RNN architecture is given in Fig. 2.

In natural language processing problems, the amount of text contained in each data instance is not standardized. The sequence dimensions are reduced to a certain value so that all text can be processed.
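The fixed-length sequence step described above can be sketched in a few lines of Python (an illustrative sketch, not the authors' code; the padding token id 0 is an assumption, while the maximum sequence size of 100 matches the hyperparameter reported in Section III):

```python
# Minimal sketch of fixed-length sequence preparation: instances shorter
# than the maximum size are padded, longer ones have the excess discarded.
MAX_SEQUENCE_SIZE = 100  # the study uses a maximum sequence size of 100

def fix_length(token_ids, max_len=MAX_SEQUENCE_SIZE, pad_id=0):
    """Pad with pad_id or truncate so every sequence has length max_len."""
    if len(token_ids) >= max_len:
        return token_ids[:max_len]  # excess is discarded
    return token_ids + [pad_id] * (max_len - len(token_ids))

print(fix_length([5, 12, 7], max_len=6))        # [5, 12, 7, 0, 0, 0]
print(fix_length(list(range(10)), max_len=6))   # [0, 1, 2, 3, 4, 5]
```

Padding to a common length is what lets a whole batch of texts be processed as one matrix, which is also what makes the batch-size effects discussed in Section III possible.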
Fig. 2. RNN Architecture [4]

In the deep learning architecture to be designed, all sequences are brought to a specified length. A sequence is padded up to the specified value if the instance has a dimension smaller than the specified value. If the sequence size exceeds it, the excess is discarded.

C. Word Embeddings

Word embeddings are types of word representation that are made meaningful to the computer. Word embeddings have features that can capture semantic relationships. Learning word embeddings is an unsupervised learning problem. Successful word embeddings can be extracted by processing textual data in very large quantities. The feature values of the given words are calculated by the learning process performed on the large texts. The learned features represent the semantic relationships determined by the system. However, these features are not meaningful to humans on their own, although they can be understood by the computer.

Natural language processing problems are processed in a certain word space. The structure in which all the words in the word space take place is called a dictionary. The words in the dictionary are in the form of a list. A special token (Unknown - UNK) is added to the dictionary to represent words that are not in the dictionary. The words are represented by the order in which they appear in the dictionary. Only words in the dictionary can be analyzed in problem solving. Word vectors are used for each word in the operations to be performed in the algorithm. The structure in which all the word vectors contained in the dictionary are stored is called the Embedding Matrix.

Each word vector is in a column. The Embedding Matrix dimension is (d x m) for a word vector dimension d and a dictionary word count m.

The best-known methods for learning word embeddings are Word2Vec [5] and GloVe [6]. The GloVe method has some performance improvements over the Word2Vec method.

D. Transfer Learning

In some machine learning problem types, a great deal of processing power and system resources are needed to train deep learning architectures. If the problem complexity is high, the required processing power also increases. At the same time, the time required for the training phase increases. For this reason, researchers who work on problems requiring high processing power and time benefit from transfer learning. Thus, they can build large deep learning architectures with less hassle.

Transfer learning is based on the use of parameters learned by someone else. For example, if you are working on a project that can detect objects in a given image, the number of objects you can detect is a very important criterion for the system. However, data centers consisting of powerful computers may be needed to train systems that will recognize thousands of objects. In such a case, we can have a system that has the ability to recognize thousands of objects by using ready models and parameters that have been trained and recorded by someone else. It is also possible to add a newly identified object to the objects recognized. Transfer learning allows many researchers to save time and system resources in many fields of application.

Transfer learning is also widely used in natural language processing problems. It is a very costly process to learn word embeddings. In order to learn word embeddings successfully, it is necessary to learn on very large texts. While collecting large amounts of data is a very difficult problem in itself, processing and understanding this data is also a difficult and time-consuming problem. Some leading technology firms, universities, and researchers in the field share the parameters they have obtained through these processes, which require a lot of time and effort, for use by other researchers. In this way, researchers who do not have sufficient hardware resources can work on well-learned models.

The effect of the use of transfer learning on the processing time was also analyzed in this study.

III. EXPERIMENTAL RESULTS

In this study, only the comparison of the running times of the CPU and the GPU has been examined; the optimizations for increasing the success rate are not explained. Tests have been run on the web page classification project.

CPU tests were performed on the servers provided by the Roksit Company [7]. For GPU testing, a cloud-based GPU server from FloydHub [8] was leased.

The CPU model used in Roksit's data center is the Intel® Xeon® Gold 6126 [9]. There are a large number of these processors in the data center where the tests were performed. Designed for use in servers, each processor has 12 cores. Each of the cores has a 2.6 GHz (3.7 GHz with Turbo) processor frequency. The resources allocated for use in the tests can be arranged; for example, it is possible to perform tests on 12 cores as well as on 50 cores or even more. In addition, the operating frequency of the cores is also one of the adjustable parameters. Tests can be done on cores with a 1300 MHz working frequency as well as with a 2600 MHz working frequency. A limit can also be placed on the maximum operating frequency assigned to the tests so that the tests run on the company's server do not affect other running systems.

A server from FloydHub [8] was leased for GPU testing. The leased server has an Nvidia Tesla K80 [10] GPU. The Tesla K80 has 2 different processing units. Each processing unit can be considered a separate GPU, so the Tesla K80 consists of two GPUs. The Tesla K80 has 2x2496 cores and 2x12 GB of VRAM. The operating frequency of each core is 562 MHz, and the boost mode is 875 MHz [12].

Only 1 GPU unit of the Tesla K80 could be allocated to us on the FloydHub leased servers. For this reason, 2496 cores and 12 GB of video memory were used during the tests.
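The dictionary, UNK token, and embedding-matrix lookup described in Section II.C above can be illustrated with a toy example (the five-word dictionary and the random 4-dimensional embedding matrix below are invented for illustration; the study's dictionary has 1,515,634 unique words and 100- or 300-dimensional embeddings):

```python
import numpy as np

# Toy dictionary: each word is represented by its index in the list;
# index 0 is the special UNK token for words not in the dictionary.
words = ["UNK", "web", "page", "classification", "news"]
word_to_id = {w: i for i, w in enumerate(words)}

d, m = 4, len(words)  # word vector dimension d, dictionary word count m
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(d, m))  # one d-dim column per word

def lookup(word):
    """Return the embedding column of a word, falling back to UNK."""
    return embedding_matrix[:, word_to_id.get(word, word_to_id["UNK"])]

known = lookup("news")       # column for a dictionary word
unk = lookup("blockchain")   # out-of-dictionary word maps to the UNK column
```

In transfer learning, the columns of `embedding_matrix` would be the pre-trained GloVe vectors rather than random values, and any corpus word missing from that matrix falls back to UNK, which is exactly the coverage limitation discussed in the transfer learning experiment below.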
The detailed hardware features of the personal computer used for the tests and the servers used on the cloud are given in Table II. The CPU cases have been tested with different hardware features. The different hardware parameters tested on the CPU are the number of cores and the operating frequency of a single core.

TABLE II. HARDWARE FEATURES OF DEVICES

Features             PC        CPU Server          GPU Server
Number of Cores      2         16/24               2496
Core Clock           2.3 GHz   1.3 GHz / 2.6 GHz   562 MHz
Boost Clock          --        3.7 GHz             875 MHz
RAM                  8 GB      24 GB               61 GB
vRAM (GPU)           1.5 GB    --                  12 GB
Storage              128 GB    2 TB                110 GB

PC: Personal Computer

Since the tests performed were costly, they were only run for 3 epochs and the average time was taken. The common hyperparameters used for all tests are as follows:

• Epoch count: 3
• Loss function: Categorical Cross Entropy
• Maximum sequence size: 100
• Parameter update function: ADAM, learning rate: 0.01, Beta_1: 0.9, Beta_2: 0.999
• Validation set split ratio: 0.2
• Unique word count in dictionary: 1,515,634
• Overall sample count: 1,210,954
• All layers were created with 128 neurons.

Two different deep learning architectures were tested during the tests. One of the architectures tested was a single-layered structure and the other was a 5-layered structure.

Several performance analysis cases were tested in the following experiments. CPU and GPU comparisons have been made in addition to the performance analysis case tested in each experiment. Optimizations for increasing the success rate are out of scope for this study. The durations given in the experiments are the values obtained for the training phase of the tests. Time durations are given in minutes.

A. The Effect of Different CPU Specifications

Various CPU specifications can be tried on the servers in Roksit's data center. The CPU specifications tested are as follows:

• Spec 1: 16-core processor, each core at 1300 MHz
• Spec 2: 16-core processor, each core at 2600 MHz
• Spec 3: 24-core processor, each core at 2600 MHz

For the comparison of core frequency:

• Transfer Learning was used
• Word embedding dimension was 300
• A single-layered deep learning architecture was tested.

For the comparison of core count:

• Transfer Learning was not used
• A 5-layered deep learning architecture was tested.
• Core frequency was 2600 MHz.

Core frequency comparison results are given in Fig. 3; core count comparison results are given in Fig. 4.

Fig. 3. CPU Core Frequency Comparison

ed: Embedding Dimension

Fig. 4. Core Count Comparison

There is a decrease in the running time as a result of doubling the core frequency, according to Fig. 3. As a result of increasing the number of cores, it was observed that the running time decreased, according to Fig. 4.

B. The Effect of Batch Size

The effect of the batch size on performance was examined in this experiment. For the comparison of batch sizes:

• Transfer Learning was used
• Word embedding dimension was 300
• A single-layered deep learning architecture was tested.
• A 16-core 2600 MHz CPU was tested.

Increasing the batch size causes large matrix multiplications. Since matrix multiplications are parallelizable operations, increasing the batch size shortened the running time on the GPU. Increasing the batch size increased the parallelization rate of the process. Memory errors are detected when the batch size
is more than 1000 because the memory usage on the GPU is high in the tests performed. The obtained results are given in Fig. 5.

Fig. 5. The Effect of Batch Size

If the batch size is further increased in machine learning problems where the memory usage is lower, the working speed can be increased further.

The maximum acceleration of the running time in the tests for web page classification on the given hardware specification is obtained where the batch size is 5000.

C. The Effect of Transfer Learning

Learning word embeddings is a process that requires high processing power. In this experiment, transfer learning was performed using Stanford University's word vectors [11], which were calculated using the GloVe method. The word embeddings used in transfer learning were created by analyzing 6 billion words in a corpus of Wikipedia 2014 + Gigaword 5. There are 400,000 unique words in the embedding matrix used. The word embedding dimension used during the transfer learning tests is 100 [11]. The same test was also examined without using transfer learning, and the results obtained were evaluated.

The effect of transfer learning on performance was examined in this experiment. During the experiment:

• Word embedding dimension was 100.
• A 24-core 2600 MHz CPU was tested.
• Batch size was 1000; a 5-layered architecture was tested.

Tests without transfer learning took longer than tests using transfer learning. However, the difference in running time is not very high. The obtained results are given in Fig. 6. Validation accuracies for the two cases are given in Fig. 7.

TL: Transfer Learning was used. - Non-TL: Transfer Learning was not used.

Fig. 6. Running Times for the Comparison of TL and Non-TL

Fig. 7. Validation Accuracies for the Comparison of TL and Non-TL

Learning word embeddings is very important for the performance of the system. The test in which transfer learning was not used took longer than the test in which transfer learning was used.

The disadvantage of using transfer learning is the limited coverage of the word index. For example, there are 400,000 unique words in the matrix used, but the corpus used contains only 176,280 of these words.

While the total number of unique words in our corpus is 1,515,634, only 176,280 words can be processed with the deep learning architecture. The other words are not included in the embedding matrix, which contains 400,000 unique words. In this case, very few of the words in the dictionary can be processed.

If the word embeddings are learned by the system itself, all words in the dictionary can be used in the algorithm. High success rates can be obtained by learning the word embeddings in the system. According to Fig. 7, although the initial success rate was lower, the test in which the word embeddings were learned by the system was more successful as the epochs progressed. If a more extensive embedding matrix is used in transfer learning, more successful values can be obtained with transfer learning. The results obtained are specific to the word matrix used.

D. The Effect of Hidden Layer Size

The effect of the hidden layer size on running time was examined in this experiment. For the comparison of hidden layer sizes:

• Transfer Learning was used
• Word embedding dimension was 300.
• A 16-core 2600 MHz CPU was tested.
• Batch size was 1000.
• All layers were created with 128 neurons.

In a deep learning application, the number of layers is a very important hyperparameter. While multi-layered deep learning architectures are needed to solve complex problems, high success rates can be achieved with low-layered architectures for some problems. Increasing the number of layers increases the running times. For this reason, it is important to determine the optimum number of layers at which high performance values can be obtained.
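The link between layer count and the number of learned parameters can be made concrete with a rough count for a stack of simple RNN layers (a sketch assuming plain RNN cells with 128 units on 300-dimensional embeddings, matching the hyperparameters above; the paper does not state the exact recurrent cell type used):

```python
# A simple RNN layer with `units` neurons on an input of size `input_dim`
# learns units * (input_dim + units + 1) weights: input weights,
# recurrent weights, and biases.
def rnn_layer_params(input_dim, units):
    return units * (input_dim + units + 1)

def stacked_rnn_params(n_layers, embedding_dim=300, units=128):
    total, in_dim = 0, embedding_dim
    for _ in range(n_layers):
        total += rnn_layer_params(in_dim, units)
        in_dim = units  # deeper layers consume the previous layer's output
    return total

print(stacked_rnn_params(1))  # 54912
print(stacked_rnn_params(5))  # 186496 -- roughly 3.4x the parameters
```

Under these assumptions, going from one layer to five more than triples the parameters to be learned, which is consistent with the longer training times reported for the 5-layered architecture.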
Fig. 8. The Effect of Hidden Layer Size on Running Time

Fig. 8 shows the comparison of running times for the 1- and 5-layer architectures. According to the obtained results, an increase in the number of layers causes an increase in the running time. It has been seen that the GPU runs faster in the tests performed.

The results of the effect of the increase in the number of layers on the success rates are given in Fig. 9.

Fig. 9. Validation Accuracies for the Effect of Hidden Layer Size

According to the results in Fig. 9, it is observed that the increase in the number of layers does not increase the success rate for the problem we are working on. However, it is observed in Fig. 8 that the increase in the number of layers causes an increase in the running times.

IV. CONCLUSION AND FUTURE WORK

In this study, performance analysis tests were conducted through a deep learning application to classify web pages. Some hyperparameters affecting performance have been investigated. In addition, the tests were repeated on CPU and GPU servers running on the cloud. Thus, CPU and GPU comparison evaluations were carried out for deep learning applications. The test cases are:

• The effect of different CPU specifications
• The effect of batch size
• The effect of hidden layer size
• The effect of transfer learning

The test cases were examined separately with the results obtained on the CPU and GPU. The tests performed on the GPU in all tested cases were completed faster. In some tests the acceleration rate is up to 4-5 times.

According to the results obtained, the increase in the number of cores shortens the running time. Similarly, the increase in core operating frequency has also been seen to increase the operating speed of the system. Increasing the batch size increases the parallelizability of the process; on this count, it is seen that the system is accelerated in the tests performed on the GPU with a high batch size. Increasing the number of hidden layers increases the number of parameters to be learned. This leads to an increase in the duration of the training. Although the increase in the number of layers increases the training period for the web page classification problem, it does not increase the success rates. Transfer learning for natural language processing problems is the creation of a model by the ready use of word embeddings. Transfer learning can be used to obtain a model with a high success rate even in few epochs. The success rates increase slowly in the tests in which the word vectors are learned in the system. For this reason, a large number of training epochs may be needed in order to achieve the desired success levels. It is planned to make optimizations in order to increase the success rates obtained in following studies.

ACKNOWLEDGMENT

Thanks to Roksit for contributing to the development of this study.

REFERENCES

[1] Domain Categorization, https://www.roksit.com/domain-categorization/, accessed October 2017.
[2] Wongkot Sriurai, Phayung Meesad, Choochart Haruechaiyasak (2010), "Hierarchical Web Page Classification Based on a Topic Model and Neighboring Pages Integration", International Journal of Computer Science and Information Security, Vol. 7, No. 2.
[3] Sara Meshkizadeh, Amir Masoud Rahmani, Mashallah Abassi Dezfuli (2010), "Web Page Classification Based on URL Features and Features of Sibling Pages", IJCSIS, Vol. 8, No. 2.
[4] Understanding LSTM Networks, http://colah.github.io/posts/2015-08-Understanding-LSTMs/, accessed May 2018.
[5] Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
[6] Pennington, J., Socher, R., & Manning, C. (2014). GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 1532-1543).
[7] Roksit | DNS Firewall & Malware Detection | The Easiest Way to be Secure, https://www.roksit.com, accessed April 2018.
[8] FloydHub is a zero setup Deep Learning platform for productive data science teams, https://www.floydhub.com/, accessed April 2018.
[9] Intel® Xeon® Gold 6126 Processor, https://ark.intel.com/products/120483/Intel-Xeon-Gold-6126-Processor-19_25M-Cache-2_60-GHz, accessed April 2018.
[10] NVIDIA TESLA K80, https://www.nvidia.com/en-us/data-center/tesla-k80/, accessed April 2018.
[11] GloVe: Global Vectors for Word Representation, https://nlp.stanford.edu/projects/glove/, accessed April 2018.
[12] NVIDIA Launches Tesla K80, GK210 GPU, https://www.anandtech.com/show/8729/nvidia-launches-tesla-k80-gk210-gpu, accessed April 2018.
