Buber 2018
Abstract—Deep learning approaches are machine learning methods used in many application fields today. Some core mathematical operations performed in deep learning are suitable for parallelization, and parallel processing increases operating speed. Graphical Processing Units (GPUs) are frequently used for parallel processing because their parallelization capacity is higher than that of CPUs: GPUs have far more cores than Central Processing Units (CPUs). In this study, benchmarking tests were performed between a CPU and a GPU. A Tesla K80 GPU and an Intel Xeon Gold 6126 CPU were used during the tests. A system for classifying web pages with a Recurrent Neural Network (RNN) architecture was used to compare performance. CPUs and GPUs running on the cloud were used in the tests because the amount of hardware needed was high. During the tests, some hyperparameters were adjusted and the performance values were compared between the CPU and the GPU. The GPU ran faster than the CPU in all tests performed; in some cases the GPU was 4-5 times faster than the CPU. These values can be further increased by using a GPU server with more features.

Keywords—cloud computing; gpu; cpu; performance analysis; deep learning.

I. INTRODUCTION

Deep learning applications are used in many applications that we use in everyday life, and deep learning approaches provide very successful results for many machine learning problems.

It is important for such a system to operate fast as well as with high accuracy. GPUs are widely used today for rapid training and testing with deep learning architectures. The mathematical operations computed in deep learning are parallelizable; thanks to the large number of cores on a GPU, these operations can be distributed over many cores to accelerate them. In this study, the performance difference between CPU and GPU is analyzed on a specific problem.

The number of processing units (cores) that can operate independently of each other is decisive for how much of a process can be parallelized. Parallelization on the four or eight cores of a PC with average hardware features and parallelization on a GPU with thousands of cores are not at the same level: a large number of cores increases the amount of parallelization. CPU cores, on the other hand, have a higher processor frequency, whereas GPUs have a large number of cores.

Some hyperparameters are important for the parallelization of mathematical operations. If these parameters are not adjusted properly, the desired performance gains cannot be achieved with parallelization. In this study, it has been investigated how performance gains can be achieved by optimizing these parameters.

For initial testing, a personal computer (MacBook Pro, 2.3 GHz Intel Core i5, 8 GB DDR3 RAM, 128 GB SSD) was used. However, this hardware is insufficient for the tests applied; for this reason, cloud-based services were used for the CPU and GPU tests.

In the tests performed for comparison, a deep learning-based system for classifying web pages was used. A web page is a document on the World Wide Web that can be viewed using a web browser. Web pages are mostly encoded in HTML, using CSS, scripts, images and other auxiliary resources to get their final look and functionality. Web page classification is an Information Retrieval application that provides useful information that can be a basis for many different application domains; an example is the creation of a user's internet usage profile for network anomaly detection. The web page classification project used during the tests is built on deep learning approaches based on natural language processing.

The definition of the web page classification problem is detailed in Section II, the deep learning architectures used during the tests and the experimental results are explained in Section III, and the results and future work are explained in Section IV.

II. WEB PAGE CLASSIFICATION

The content-based classification system used here is designed only for the classification of web pages prepared in English; it is aimed to increase the number of languages that can be classified in later stages.

There are many data sets published for use in web page classification projects. One of the data sets that can be used to classify web pages is Roksit's web classification database [1]. Millions of websites are categorized in this database. During the classification process, web page contents are used as well as some network-based features.
The feature set used is large, allowing for the creation of a successful web page classifier. Roksit claims that the web page classification system they have developed has a success rate of 99.9%.

In this study, only the running times on the CPU and the GPU are compared; optimizations for increasing classification accuracy are not explained. Detailed information on the data set is given in subsection A. A deep learning based RNN approach has been used to classify web pages in the developed project; information on the deep learning architecture used is given in subsection B. Information about word embedding is given in subsection C.

In some of the tests performed, transfer learning was used. Transfer learning allows the use of pre-learned parameters in the deep learning architecture. These parameters are obtained as the result of long training runs on very large data sets carried out by other researchers, and using them allows the training phase to be completed more quickly. Detailed information on transfer learning is given in subsection D.

A. Dataset

Textual information on the target page is used in web page classification. The datasets used to classify web pages should be considered in two stages:

• First Stage: Only web pages and their categories are included.

• Second Stage: The crawled data used to classify web pages.

In the first stage, a system was developed to recognize 23 categories of English web pages. These categories are given in Table I.

TABLE I. CATEGORY LIST

Business  | Food   | Game       | Alcohol
Porn      | Sport  | News       | Dating
Education | Reales | Government | Abort
Travel    | Animal | Gambling   | Hate

Fig. 1. Distribution of Data According to Categories

The content of a web page must be well-structured so that the page can successfully reach its target audience. It is very important for web pages to be indexed by search engines and to appear near the top of the results for the search queries of the target users.

Meta tags are one of the methods used to index the content of a web page well and to make the page rank high in search engines; a well-structured web page contains meta tags. These tags contain summary information, entered by the web page administrator, indicating the purpose/function of the page. The most important meta tags are the following:

• Title

• Description

• Keywords

Meta tags are included in the source code of the web page but may not be displayed on the screen by the browser for the user to view.
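As a sketch of how the title and meta tags described above can be collected from a page's source, the following uses only Python's standard library; the class and function names and the sample HTML are illustrative, not part of the system described in this paper.

```python
from html.parser import HTMLParser

class MetaExtractor(HTMLParser):
    """Collects the <title> text and the content of name-based <meta> tags."""
    def __init__(self):
        super().__init__()
        self.meta = {}          # e.g. {"description": "...", "keywords": "..."}
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and "name" in attrs:
            self.meta[attrs["name"].lower()] = attrs.get("content", "")
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data

def extract_meta(html: str):
    parser = MetaExtractor()
    parser.feed(html)
    return parser.title, parser.meta

# Illustrative page source (not taken from the dataset):
page = """<html><head>
<title>Cheap Flights</title>
<meta name="description" content="Compare flight prices.">
<meta name="keywords" content="travel, flights, tickets">
</head><body>...</body></html>"""

title, meta = extract_meta(page)
```

The extracted title, description and keywords strings are exactly the kind of textual features the classifier can consume.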
Transfer learning is based on the use of parameters learned by someone else. For example, if you are working on a project that can detect objects in a given image, the number of objects …

Only 1 GPU unit of the Tesla K80 could be allocated to us on the leased FloydHub servers; for this reason, 2496 cores and 12 GB of vMemory were used during the tests.
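The transfer-learning idea above, applied to word vectors as in the later experiments, amounts to copying pre-trained rows into the model's embedding matrix. A minimal sketch in pure Python, assuming lines in the GloVe text format of [11]; the function name and the toy vocabulary and vectors are illustrative, not the real 400,000-word file.

```python
def build_embedding_matrix(vocab, glove_lines, dim):
    """vocab: word -> row index; glove_lines: iterable of 'word v1 v2 ...' lines.
    Words absent from the pre-trained file keep a zero row, mirroring how
    corpus words outside the embedding matrix are left uncovered."""
    matrix = [[0.0] * dim for _ in vocab]
    covered = 0
    for line in glove_lines:
        parts = line.split()
        word, values = parts[0], parts[1:]
        if word in vocab and len(values) == dim:
            matrix[vocab[word]] = [float(v) for v in values]
            covered += 1
    return matrix, covered

# Toy stand-in for the real pre-trained vectors:
lines = ["the 0.1 0.2 0.3", "page 0.4 0.5 0.6"]
vocab = {"the": 0, "page": 1, "roksit": 2}   # "roksit" has no pre-trained vector
matrix, covered = build_embedding_matrix(vocab, lines, dim=3)
```

The `covered` count is the quantity discussed in subsection C: here 2 of 3 vocabulary words are covered, just as only 176,280 of the corpus words are covered by the 400,000-word GloVe matrix.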
2018 6th International Conference on Control Engineering & Information Technology (CEIT), 25-27 October 2018, Istanbul, Turkey
The detailed hardware features of the personal computer used for the tests and of the servers used on the cloud are given in Table II. The CPU cases were tested with different hardware features; the hardware parameters varied on the CPU are the number of cores and the operating frequency of a single core.

TABLE II. HARDWARE FEATURES OF DEVICES

Features        | PC      | CPU Server        | GPU Server
Number of Cores | 2       | 16/24             | 2496
Core Clock      | 2.3 GHz | 1.3 GHz / 2.6 GHz | 562 MHz
Boost Clock     | --      | 3.7 GHz           | 875 MHz
RAM             | 8 GB    | 24 GB             | 61 GB

For the comparison of core frequency:

• Transfer learning was used.

• The word embedding dimension was 300.

• A single-layered deep learning architecture was tested.

For the comparison of core count:

• Transfer learning was not used.

• A 5-layered deep learning architecture was tested.

• The core frequency was 2600 MHz.

Core frequency comparison results are given in Fig. 3; core count comparison results are given in Fig. 4.

… is more than 1000, because the memory usage on the GPU is high in the tests performed. The obtained results are given in Fig. 5.

TL: Transfer Learning is used. Non-TL: Transfer Learning was not used.
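The running-time comparisons above (core frequency, core count, batch size) all follow the same pattern: fix a workload, vary one parameter, and record wall-clock time. A minimal, hypothetical harness for that pattern is sketched below; the workload is an illustrative pure-Python stand-in, not the authors' RNN training step.

```python
import time

def time_workload(fn, repeats=3):
    """Run fn `repeats` times and return the mean wall-clock seconds."""
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        times.append(time.perf_counter() - start)
    return sum(times) / len(times)

def make_workload(n):
    """Stand-in workload: a matrix-vector product whose size n plays the
    role of the varied hyperparameter (batch size, layer count, ...)."""
    a = [[1.0] * n for _ in range(n)]
    x = [1.0] * n
    def step():
        return [sum(row[i] * x[i] for i in range(n)) for row in a]
    return step

for n in (50, 100):
    mean_s = time_workload(make_workload(n))
    print(f"n={n}: {mean_s:.6f} s")
```

Repeating each measurement and averaging, as here, smooths out scheduling noise on shared cloud servers.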
C. The Effect of Transfer Learning

Learning word embeddings is a process that requires high processing power. In this experiment, transfer learning was performed using Stanford University's word vectors [11], which were calculated with the GloVe method. The word embeddings used in transfer learning were created by analyzing the 6 billion words of the Wikipedia 2014 + Gigaword 5 corpus. There are 400,000 unique words in the embedding matrix used, and the word embedding dimension used during the transfer learning tests is 100 [11]. The same test was also run without transfer learning, and the results obtained were compared.

The effect of transfer learning on performance was examined in this experiment. During the experiment:

• The word embedding dimension was 100.

• A 24-core 2600 MHz CPU was tested.

• The batch size was 1000, and a 5-layered architecture was tested.

Tests without transfer learning took longer than tests using transfer learning, but the difference in running time is not very high. The obtained results are given in Fig. 6; validation accuracies for the two cases are given in Fig. 7.

Fig. 7. Validation Accuracies for the Comparison of TL and Non-TL

Learning word embeddings is very important for the performance of the system. The test in which transfer learning was not used took longer than the test in which it was used.

The disadvantage of using transfer learning is the limited coverage of the word index. While the total number of unique words in our corpus is 1,515,634, only 176,280 of them can be processed by the deep learning architecture; the other words are not included in the embedding matrix, which contains 400,000 unique words. In this case, very few of the words in our dictionary can be processed.

If the word embeddings are learned by the system itself, all words in the dictionary can be used in the algorithm, and high success rates can be obtained by learning the word embeddings within the system. According to Fig. 7, although its initial success rate was lower, the test in which the word embeddings were learned by the system itself became more successful over the epochs. If a more extensive embedding matrix is used, more successful values can be obtained with transfer learning; the results obtained are specific to the word matrix used.

D. The Effect of Hidden Layer Size

The effect of the hidden layer size on running time was examined in this experiment. For the comparison of hidden layer sizes:

• Transfer learning was used.

• The word embedding dimension was 300.

• A 16-core 2600 MHz CPU was tested.

• The batch size was 1000.

• All layers were created with 128 neurons.

In a deep learning application, the number of layers is a very important hyperparameter. While multi-layered deep learning architectures are needed to solve complex problems, high success rates can be achieved with few layers for some problems. Increasing the number of layers increases running times; for this reason, it is important to determine the optimum number of layers at which high performance values can be obtained.
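One way to see why the number of layers and the layer width drive running time is to count trainable parameters. The sketch below assumes standard LSTM cells (as the LSTM reference [4] suggests; the exact RNN cell type is not stated here) and uses the 300-dimensional embeddings and 128-neuron layers of the experiment above; the function names are illustrative.

```python
def lstm_layer_params(input_dim, hidden):
    # Each of the 4 LSTM gates has a weight matrix over the concatenated
    # [input, previous hidden state] vector, plus a bias vector.
    return 4 * ((input_dim + hidden) * hidden + hidden)

def stacked_lstm_params(embedding_dim, hidden, n_layers):
    total = lstm_layer_params(embedding_dim, hidden)
    for _ in range(n_layers - 1):
        # Deeper layers take the previous layer's output as input.
        total += lstm_layer_params(hidden, hidden)
    return total

for n_layers in (1, 3, 5):
    print(n_layers, stacked_lstm_params(300, 128, n_layers))
```

Each added 128-neuron layer contributes a fixed extra block of parameters (131,584 here), so both memory use and per-step computation grow roughly linearly with depth, consistent with the running-time growth observed in Fig. 8.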
Fig. 9. Validation Accuracies for the Effect of Hidden Layer Size

According to the results in Fig. 9, increasing the number of layers does not increase the success rate for the problem we are working on. However, Fig. 8 shows that increasing the number of layers does increase the running times.

IV. CONCLUSION AND FUTURE WORK

In this study, performance analysis tests were conducted on a deep learning application that classifies web pages, and some hyperparameters affecting performance were investigated. In addition, the tests were repeated on CPU and GPU servers running on the cloud; thus, CPU and GPU comparisons were carried out for deep learning applications. The test cases are:

• The effect of different CPU specifications

• The effect of batch size

• The effect of hidden layer size

• The effect of transfer learning

The test cases were examined separately with the results obtained on the CPU and the GPU. In all tested cases, the tests performed on the GPU completed faster; in some tests the speed-up is as much as 4-5 times.

ACKNOWLEDGMENT

Thanks to Roksit for contributing to the development of this study.

REFERENCES

[1] Domain Categorization, https://www.roksit.com/domain-categorization/, accessed October 2017.
[2] W. Sriurai, P. Meesad, C. Haruechaiyasak (2010), "Hierarchical Web Page Classification Based on a Topic Model and Neighboring Pages Integration", International Journal of Computer Science and Information Security, Vol. 7, No. 2.
[3] S. Meshkizadeh, A. M. Rahmani, M. Abassi Dezfuli (2010), "Web Page Classification Based on URL Features and Features of Sibling Pages", IJCSIS, Vol. 8, No. 2.
[4] Understanding LSTM Networks, http://colah.github.io/posts/2015-08-Understanding-LSTMs/, accessed May 2018.
[5] T. Mikolov, K. Chen, G. Corrado, J. Dean (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
[6] J. Pennington, R. Socher, C. Manning (2014). GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 1532-1543).
[7] Roksit | DNS Firewall & Malware Detection | The Easiest Way to be Secure, https://www.roksit.com, accessed April 2018.
[8] FloydHub is a zero setup Deep Learning platform for productive data science teams, https://www.floydhub.com/, accessed April 2018.
[9] Intel® Xeon® Gold 6126 Processor, https://ark.intel.com/products/120483/Intel-Xeon-Gold-6126-Processor-19_25M-Cache-2_60-GHz, accessed April 2018.
[10] NVIDIA TESLA K80, https://www.nvidia.com/en-us/data-center/tesla-k80/, accessed April 2018.
[11] GloVe: Global Vectors for Word Representation, https://nlp.stanford.edu/projects/glove/, accessed April 2018.
[12] NVIDIA Launches Tesla K80, GK210 GPU, https://www.anandtech.com/show/8729/nvidia-launches-tesla-k80-gk210-gpu, accessed April 2018.