
Available online at www.sciencedirect.com

ScienceDirect

Procedia Computer Science 160 (2019) 561–566

www.elsevier.com/locate/procedia

The 6th International Symposium on Emerging Information, Communication and Networks (EICN 2019)
November 4-7, 2019, Coimbra, Portugal
Big Data Processing Technologies in Distributed Information Systems
Nataliya Shakhovska a,*, Nataliya Boyko a, Yevgen Zasoba a, Eleonora Benova b

a Lviv Polytechnic National University, 12 Bandera Street, 79000, Lviv, Ukraine
b Faculty of Management, Comenius University in Bratislava, Odbojárov 10, Bratislava, Slovak Republic

Abstract

This paper analyzes Big data technologies and provides an example of applying the MapReduce paradigm: uploading large volumes of data, processing and analyzing unstructured information, and distributing it into a clustered database. The article summarizes the concept of "big data" and gives examples of methods for working with arrays of unstructured data. A parallel system based on Resilient Distributed Datasets (RDD) is organized. A class of basic database operations is implemented: database connection, table creation, retrieving a line by id, returning all elements of the database, and updating, deleting, and creating a line.
© 2019 The Authors. Published by Elsevier B.V.
This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/)
Peer-review under responsibility of the Conference Program Chairs.
Keywords: Big data; Web application; Modeling; Processing; Analytics.

1. Introduction

The information technology (IT) field is a promising area of research. Until recently, big systems consisted of several servers and terabytes of information. Nowadays, systems use a cloud cluster model, which includes thousands of multicore processors and petabytes of data. This is why a new research area, Big data, was created. This paradigm is already reflected in academic programs. Examples of the Big data branch are the structured and

* Corresponding author. Tel.: +380322582404; fax: +380322582404.
E-mail address: [email protected]

1877-0509 © 2019 The Authors. Published by Elsevier B.V.
This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/)
Peer-review under responsibility of the Conference Program Chairs.
10.1016/j.procs.2019.11.047

unstructured data, media, or random processes, as they practically cannot be processed by traditional means. Traditional monolithic systems are being replaced with new asynchronous and parallel solutions. These new solutions provide the ability to work with Big data [1].
Big Data information technology is a set of methods and means for processing different types of structured (databases) and unstructured (text, stream) dynamic large amounts of data for analysis and use in decision support. This technology is an alternative to traditional database management systems and the Business Intelligence class of solutions. Besides, Big data technology can be used for parallel (distributed) data processing [2, 3]. The system consists of several independent blocks that efficiently process information under conditions of continuous growth and distribution throughout multiple cluster nodes. In such systems, the volume of information increases exponentially, and unstructured data makes up the most significant part of the whole. Therefore, the issues of proper interpretation of data flow in systems of this type become more and more urgent [1].
The subject of research is the methods and tools for building, editing and adapting the information flow in
distributed information systems.

2. State of the art

In [1], the concept of Big data and the criteria for its classification are given. The paper [3] considered Big data as a revolutionary technology of innovation, competition, and productivity of the economy, and a new resource for business. The architecture, informational value for business, and the impact of Big Data are given in [4]. The possibilities of involving innovative Big Data in developing a business strategy are analyzed in [5].
The analysis of data consolidation methods is given in [6]. In [7] the authenticity, integration, scalability, and confidentiality of "open" structured (databases) and unstructured (text) data from social networks are described. The technical aspects of Big data realization are given in [8]. The method of intelligent data analysis is described in [9]. The analysis of the possibility of implementing Big data in medicine is given in [10, 11]. The information model of a cloud data warehouse and the possibility of implementing it as part of Big data technology is provided in [12 – 14]. Big data usage for information analysis in a social network is given in [14]. Methods of deep learning and machine learning can process Big data consisting of different sources such as images, video, and audio [13 – 17]. Business and e-libraries are examples of Big data technologies usage too [18 – 23].
Thus, the partly solved tasks in Big data processing are the following: the biggest part of the sources is unstructured data, and there are time-complexity requirements, so parallel data processing should be used.

3. Problem statement

The research task is to develop the model of Big data and information technology of distributed unstructured data
processing. To gain such a result, the following tasks must be solved in the paper:
1. To analyze the methods and principles of Big data processing;
2. To analyze existing technologies of Big data processing;
3. To carry out a comparative analysis of productivity of Hadoop and Spark platforms for unstructured data
processing;
4. To test the parallelized system in Scala.

4. Materials and Methods

Clustering is one of the ways to decrease the time complexity of Big data processing. Two variants of scaling, i.e., horizontal and vertical scaling, should be taken into account.
Horizontal scaling divides the data set and distributes the data over multiple servers, or shards. So, one can create ten instances, each with a 1 TB database. Each shard is an independent database, and collectively, the shards make up a single logical database. The system should rely on asynchronous message communication to delimit the components. The controlling of the loads, flows, and message queues should be provided in the system [2 – 4].

When systems of this type are used, several problems arise with the interworking of all the clustered system nodes. For example, different applications require data access from different nodes. This makes clustered system operation more complicated, but vertical data scaling provides access to the data of all system nodes.
Hinchcliffe divides the approaches to Big data into three groups depending on the volume:
VolBD = { VolFD, VolBA, VolDI }, (1)
where VolFD is Fast Data, whose volume is measured in terabytes; VolBA is Big Analytics, petabytes of data; VolDI is Deep Insight, measured in exabytes and zettabytes.
The groups differ among themselves not only in the operating volumes of data but also in the quality of their processing solutions. Processing information from sources of different expressive power, namely structured, semi-structured, and unstructured, is necessary for Big data technology. A set of information products is divided into three blocks:
Ip = { St, SemS, UnS }, (2)

where St = { DB, DW } is structured data (databases, data warehouses); SemS = { Wb, Tb } is semi-structured data (XML, electronic worksheets); UnS = { Nd } is unstructured data (text) [10, 14].
The following technologies are used for Big data processing:
TBD = { TNoSQL, TSQL, THadoop, TV }, (3)
where TNoSQL is the technology of NoSQL databases; THadoop is the technology that ensures massively parallel processing; TSQL is the technology of structured data processing (SQL databases); TV is the technology of Big data visualization [8, 11].
The main technologies of Big data processing are: NoSQL; MapReduce; Apache Hadoop; Apache Spark.
The problem of increasing information volume cannot be solved using classical relational architectures. The followers of the NoSQL concept emphasize that it is not a complete negation of SQL and the relational model; rather, the project proceeds from the fact that SQL is an essential and handy tool that nevertheless cannot be considered universal. One problem point for a classical relational database is dealing with massive data and projects under a high load. The first objective approach is to extend the database where SQL is flexible enough, and not to displace it wherever it performs its tasks. Also, the relational approach does not support both types of scaling (vertical and horizontal).
There are classical approaches and paradigms for the development of data processing facilities. The MapReduce paradigm is one of them [5]. This model of distributed data processing was suggested by Google to process significant volumes of data on computing clusters. A cluster is several independent computers used together and working as a single system.
MapReduce provides for organizing data in the form of lists that pass through three stages of processing:
1. Map stage. At this stage, the data are processed with the help of the map() function defined by the user. The operation is similar to the map() method in functional programming languages. The map function accepts a list at the input and returns several key-value pairs.
2. Shuffle stage. At this stage, the output of the map function is divided into "buckets", where each bucket corresponds to one key from the map stage. These buckets then serve as input for the reduce() function.
3. Reduce stage. The reduce function computes the result for each separate "bucket".
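The three stages can be illustrated with a plain-Scala word count, in which standard collections stand in for the distributed lists. This is a sketch of the paradigm, not the paper's implementation.

```scala
// Map stage: each input line yields (word, 1) key-value pairs.
def mapStage(lines: Seq[String]): Seq[(String, Int)] =
  lines.flatMap(_.split("\\s+")).filter(_.nonEmpty).map(w => (w, 1))

// Shuffle stage: pairs are grouped into "buckets", one bucket per key.
def shuffleStage(pairs: Seq[(String, Int)]): Map[String, Seq[Int]] =
  pairs.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2)) }

// Reduce stage: each bucket is folded into a single result per key.
def reduceStage(buckets: Map[String, Seq[Int]]): Map[String, Int] =
  buckets.map { case (k, vs) => (k, vs.sum) }

val counts = reduceStage(shuffleStage(mapStage(Seq("big data", "big systems"))))
// counts: Map(big -> 2, data -> 1, systems -> 1)
```

In a real cluster the three functions run on different nodes and the shuffle moves data over the network; the composition of the stages, however, is exactly the one shown.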

At present, the Apache Hadoop MapReduce and Apache Spark technologies are the leaders in the use of the MapReduce paradigm and in the creation of software platforms for arranging the distributed processing of large data volumes [8, 16 – 18].
Apache Hadoop MapReduce is a free platform for arranging the processing of large data volumes (measured in petabytes) using the MapReduce paradigm. This paradigm makes it possible to distribute the processing into separate fragments, each of which can be run on a separate cluster node. Hadoop includes an implementation of the distributed Hadoop HDFS file system, which automatically provides data backup and is optimized for work with MapReduce. To simplify access to the data in the Hadoop store, the SQL-like Hive language, a kind of SQL for MapReduce, was developed. Requests in this language can be parallelized and processed by several Hadoop platforms.

Compared to Hadoop MapReduce, Spark provides up to 100 times higher performance when the data is processed in memory and up to 10 times higher performance when the data is located on disks. This mechanism runs at Hadoop cluster nodes with the help of Hadoop YARN or in a standalone mode. It supports data processing in HDFS, Cassandra [13], and Hive [11] stores and in any Hadoop input format [6, 8].
The main difference between Spark and Hadoop MapReduce is that Spark stores information in computer memory, providing in such a way higher platform productivity, while Hadoop stores it on disk, providing a higher security level [18 – 19]. In addition to the traditional features of Apache Hadoop MapReduce, namely the processing of unstructured data, the Apache Spark platform includes Spark Streaming for working with asynchronous streams, the MLlib library for machine learning, and GraphX for graph processing.

5. Experiment

Let us provide a comparative analysis of the productivity of both platforms as the ratio of execution time to the number of iterations (Fig. 1).

Fig. 1. Comparative analysis of productivity of Hadoop and Spark platforms

Spark provides an API (Application Program Interface) in the Scala, Java, Python, and R programming languages. At first, a Spark program creates the SparkContext object, which represents Spark's method of access to the cluster. A SparkConf object with information about the application should be built to create the SparkContext.
The concept of Resilient Distributed Datasets (RDD) is the basis of Spark. An RDD is a fault-tolerant collection (list) of elements that is processed in parallel. There are two ways to create an RDD: parallelization of a collection (list) transmitted in the program, or a reference to an external file system, such as HDFS (Hadoop Distributed File System), or any other data source in Hadoop [5].
Let us divide the service structure into two parts. The first one is a Web page including the UI (User Interface)
with a form for document transmittal to the server and interfaces with data analysis after receiving the processed
data from the server. The second one is the API (Application Program Interface) of our system that will represent a
library of methods for acceptance, processing, analysis, and transmittal of data to the client.
We focus attention on the API of the system when Apache Spark is used. The example is provided in the Scala language. To begin with, we set the cluster configuration and create the SparkContext. In the master code, the URL is a cluster configuration setting: setMaster("local[*]") means running Spark locally with the number of worker threads determined by the number of cores on the given computer, while setMaster("spark://HOST:PORT") is the configuration for connecting to an external cluster.
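The configuration step above can be sketched as follows. This is a minimal illustration, not the paper's code: it assumes the Spark dependency is on the classpath, and the application name and file path are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Cluster configuration: application name plus the master URL.
val conf = new SparkConf()
  .setAppName("BigDataService")  // hypothetical application name
  .setMaster("local[*]")         // run locally, one thread per core;
                                 // use "spark://HOST:PORT" for an external cluster
val sc = new SparkContext(conf)

// The two ways of creating an RDD described earlier:
val listRdd = sc.parallelize(Seq(1, 2, 3, 4))       // from an in-program collection
val fileRdd = sc.textFile("hdfs:///data/input.csv") // from HDFS or another Hadoop source
```

Only one SparkContext may be active per JVM, so in a service this object is usually created once at startup and shared.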
We develop a method for receiving a file from the client and checking the file type (csv or xlsx). If the type matches, the file is uploaded to the server and its name is transmitted to the method parseAttachment(inputFile: String). Otherwise, the method returns a warning.
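A minimal sketch of such a check might look as follows; checkAttachment and the Either-based result are hypothetical names assumed for illustration, not the paper's API.

```scala
// Accept only csv/xlsx uploads; anything else produces a warning message.
def checkAttachment(fileName: String): Either[String, String] = {
  val allowed = Set("csv", "xlsx")
  val ext = fileName.split('.').lastOption.map(_.toLowerCase).getOrElse("")
  if (allowed.contains(ext)) Right(fileName)  // ok: pass the name on for parsing
  else Left(s"Unsupported file type: '$ext' (expected csv or xlsx)")
}
```

Returning Either keeps the happy path (the file name forwarded to the parser) and the warning in one typed result, which composes well with the pipeline that follows.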
During the next step, each file element is transmitted to the CSVReader constructor, parsed, and the raw content is returned as a Spark RDD. This process allows the parallelization of data processing. After that, the resulting collection (list) is transmitted to the toTransactions(data) constructor, which returns a collection of transactions. After completion of this process, each element of the collection is

transmitted to DAO.create(), i.e. it is stored in the database.
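The whole step can be imitated in plain Scala. CSVReader, Spark RDDs, and DAO.create are replaced here by standard collections, and all names are hypothetical; the sketch only mirrors the parse–convert–store shape of the pipeline.

```scala
// Parse csv rows into transaction tuples: (id, account, desc, code, amount).
// Rows that do not have exactly five fields are silently skipped by collect.
def toTransactions(rows: Seq[Array[String]]): Seq[(Int, String, String, String, Double)] =
  rows.collect { case Array(id, account, desc, code, amount) =>
    (id.toInt, account, desc, code, amount.toDouble)
  }

// In-memory stand-in for DAO.create(): each parsed transaction is appended.
val store = scala.collection.mutable.ListBuffer.empty[(Int, String, String, String, Double)]
val rawLines = Seq("1,ACC-1,coffee,11,4.50", "2,ACC-2,books,12,30.00")
toTransactions(rawLines.map(_.split(','))).foreach(store += _)
```

In the real system the map over rows runs in parallel on the RDD's partitions; the per-element logic is the same.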


Let us provide the following Scala object for basic operations with the database: connect to the database, create a table, delete a table, return a line by its id, return all elements of the database, update a line, remove a line, and create a line.
At the next stage, we create the transaction class, which includes five fields: id, account, description (desc), code, and amount. At the last step, an actor is created for asynchronous messaging to decouple the components.
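An in-memory sketch of such an object and transaction class might look as follows. The paper's version talks to a real database; this mock only mirrors the operation set, and the field names follow the five fields listed above.

```scala
// The transaction class with its five fields.
case class Transaction(id: Int, account: String, desc: String, code: String, amount: Double)

// Singleton object mirroring the basic database operations; a mutable map
// stands in for the real table.
object DAO {
  private val table = scala.collection.mutable.Map.empty[Int, Transaction]

  def create(t: Transaction): Unit = table += (t.id -> t)  // create the line
  def get(id: Int): Option[Transaction] = table.get(id)    // return a line by id
  def all: Seq[Transaction] = table.values.toSeq           // all elements
  def update(t: Transaction): Unit = table += (t.id -> t)  // update the line
  def delete(id: Int): Unit = table -= id                  // remove the line
  def dropTable(): Unit = table.clear()                    // delete the table
}
```

A Scala object is a singleton, so every component sees the same store, which is the property the paper relies on when the actor system delegates writes to it.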

6. Results

At the last stage, all the elements are united to create the main class that runs the application. The Scala library spray is used; it is necessary to run the server and deploy the application. Our task is to create the configuration, combine it with the database, create the service and the actor system, and run the HTTP server. The application's operating interface is shown in Fig. 2; the content of the database after uploading a csv document with data to the server is shown on the left and right sides.

Fig. 2. Application operating results

7. Discussion

The parallel method for receiving a file from the client and checking its type (csv or xlsx) is developed. Each file element is transmitted to the CSVReader constructor, and the raw content after parsing is returned as a Spark RDD. A Scala object for basic operations with the database is developed. It guarantees a loosely coupled interface, isolation, and location transparency, and provides means of delegating errors and messages.

8. Conclusions

The information technology for Big data parallel processing is developed. The analysis of the methods and principles of Big data processing is given. A comparative analysis of the productivity of the Hadoop and Spark platforms for unstructured data processing is provided. An example of the application of the MapReduce paradigm, loading large volumes of data, processing and analysis of unstructured information, and its distribution into a cluster database is given. Examples of methods for working with unstructured data arrays are given. A parallel RDD system is organized. A working Scala object for basic database operations is proposed, covering database connection, table creation, reading a line by id, returning all database elements, and updating, deleting, and creating a line.
The parallelized system in Scala is developed and tested. This information technology allows the processing of

structured, semi-structured, and unstructured data, combining vertical and horizontal data scaling.

References

[1] Janssen, M., van der Voort, H., & Wahyudi, A. (2017). “Factors influencing big data decision-making quality”. Journal of Business Research,
70: 338-345.
[2] Shaw, J. (2014). “Why Big Data is a big deal”. Harvard Magazine, 3: 30-35.
[3] Daas, P. J., Puts, M. J., Buelens, B., & van den Hurk, P. A. (2015). “Big data as a source for official statistics”. Journal of Official Statistics,
31(2): 249-262.
[4] Shakhovska, N., Vovk, O., Hasko, R., Kryvenchuk, Y. (2018). “The Method of Big Data Processing for Distance Educational System”. In:
Shakhovska N., Stepashko V. (eds) Advances in Intelligent Systems and Computing II. 689: 461-473.
[5] De Mauro, A., Greco, M., & Grimaldi, M. (2016). “A formal definition of Big Data based on its essential features”. Library Review, 65(3):
122-135.
[6] Melnykova, N., Marikutsa, U., Kryvenchuk, U. (2018). “The New Approaches of Heterogeneous Data Consolidation”. Proceedings of the
13th International Scientific and Technical Conference on Computer Sciences and Information Technologies (CSIT), Lviv, September 2018
(1): 408-411.
[7] Ediger, D., Jiang, K., Riedy, J., Bader, D. A., Corley, C., Farber, R., Reynolds, W. N. (2010). “Massive social network analysis: Mining
twitter for social good”. Proceedings of the 39th International Conference on Parallel Processing (2010, September): 583-593.
[8] Chen, H., Chiang, R. H., Storey, V. C. (2012). “Business intelligence and analytics: from big data to big impact”. MIS quarterly: 1165-1188.
[9] Boyko, N. (2016). “A look trough methods of intellectual data analysis and their applying in informational systems”. Proceedings of the XIth
International Scientific and Technical Conference “Computer Sciences and Information Technologies (CSIT), Lviv, September 2016: 183-185.
[10] Das, N., Das, L., Rautaray, S. S., Pandey, M. (2018). “Big Data Analytics for Medical Applications”. International Journal of Modern
Education and Computer Science, 10(2): 35
[11] Thusoo, A., Sarma, J. S., Jain, N., Shao, Z., Chakka, P., Anthony, S., ... & Murthy, R. (2009). “Hive: a warehousing solution over a map-
reduce framework”. Proceedings of the VLDB Endowment, 2(2): 1626-1629.
[12] Wang, C., Ren, K., Lou, W., & Li, J. (2010). “Toward publicly auditable secure cloud data storage services”. IEEE network, 24(4).
[13] Fedushko S., Shakhovska N., Syerov Yu. (2018) “Verifying the medical specialty from user profile of online community for health-related
advices”. Proceedings of the 1st International workshop on informatics & Data-driven medicine (IDDM 2018) Lviv, November 28–30, 2018.
2255: 301–310.
[14] Maass, W., Natschläger, T., & Markram, H. (2002). “Real-time computing without stable states: A new framework for neural computation
based on perturbations”. Neural computation, 14(11): 2531-2560
[15] Vitynskyi, P., Tkachenko, R., Izonin, I., Kutucu H. (2018) “Hybridization of the SGTM Neural-like Structure through Inputs Polynomial
Extension”. In Proceedings of the Second International Conference on Data Stream Mining Processing (DSMP), 386-391.
[16] Wang, G., & Tang, J. (2012, August). “The nosql principles and basic application of cassandra model”. In Proceedings of the 2012
International Conference Computer Science & Service System (CSSS), 1332-1335.
[17] Zaharia, M., Xin, R. S., Wendell, P., Das, T., Armbrust, M., Dave, A., & Ghodsi, A. (2016). “Apache spark: a unified engine for big data
processing”. Communications of the ACM, 59(11): 56-65.
[18] Molnár, E., Molnár, R., Kryvinska, N., Greguš M. (2014) “Web Intelligence in practice”. The Society of Service Science, Journal of Service
Science Research, Springer, 6(1):149-172.
[19] Kryvinska, N. (2012) “Building Consistent Formal Specification for the Service Enterprise Agility Foundation”. The Society of Service
Science, Journal of Service Science Research, Springer, Vol. 4, No. 2, 2012, pp. 235-269.
[20] Gregus, M. Kryvinska, N. (2015) “Service Orientation of Enterprises - Aspects, Dimensions, Technologies”. Comenius University in
Bratislava, ISBN: 9788022339780.
[21] Kaczor, S., Kryvinska, N. (2013) “It is all about Services - Fundamentals, Drivers, and Business Models”. The Society of Service Science,
Journal of Service Science Research, Springer, 5(2): 125-154.
[22] Kryvinska, N., Gregus, M. (2014) “SOA and it's Business Value in Requirements, Features, Practices and Methodologies”. Comenius
University in Bratislava, ISBN: 9788022337649.
[23] Rusyn, B., Vysotska, V., Pohreliuk, L. (2018). “Model and architecture for virtual library information system”. Proceedings of the 13th International Scientific and Technical Conference on Computer Sciences and Information Technologies (CSIT), Lviv, September 2018 (1): 37-41.
