
Analysis of Frameworks and Technologies for Solving Big Data Storage and Processing Problems in Distributed Systems

Olga Blinova Oleg Kuprikov Mais Farkhadov


2023 7th International Conference on Information, Control, and Communication Technologies (ICCT) | 979-8-3503-4094-5/23/$31.00 ©2023 IEEE | DOI: 10.1109/ICCT58878.2023.10347069

lab.17, Trapeznikov Institute of Control Sciences of the Russian Academy of Sciences,
Moscow, Russia
[email protected]

Abstract—The article discusses the main metrics for evaluating the performance of big data warehouses and compares the characteristics of the most popular modern frameworks. When considering distributed storage of large amounts of information, it is impossible to use the same indicators of speed and efficiency that are used for local databases. Qualitative and quantitative evaluation is complicated by the inability to reproduce the same queries under the same conditions in distributed storage, which is caused by the constantly changing state of the transmission environment and computing resources. When choosing a framework, technology stack, file system, or distributed database architecture for a specific task, it is necessary to take into account a large number of factors, as well as the set and frequency of operations that will be required in the system. The features and advantages of using containers and various file systems for storing data in distributed systems are presented, as well as other technologies that increase the speed and efficiency of processing large arrays of heterogeneous data. At the moment, there is no single solution that provides the best way to organize the storage of large volumes of heterogeneous data, but there are trends in the development of technologies for processing it. Many of them are related to the ability to apply to big data the same methods that are used in relational databases, while taking specific features into account.*

* This study was conducted within the framework of the scientific program of the National Center for Physics and Mathematics, section №9 «Artificial intelligence and big data in technical, industrial, natural and social systems».

Keywords—big data, control system architectures, heterogeneous data, metrics of distributed systems, distributed systems

I. INTRODUCTION

Technologies and distributed big data storage systems are rapidly developing and gaining popularity, as they are optimally suited for storing data under a huge flow of new, most often poorly structured, information. New technologies for organizing data storage and access, new processing methods, technologies for ensuring and maintaining data integrity, and support for remote access and multi-user work are solutions dictated by the realities of the modern world. There are a large number of big data storage solutions on the market; most often these are complex technology stacks, ranging from the file system to the user interface. How to determine which technologies will be optimal for solving a specific task? The old evaluation methods no longer work. All modern distributed systems are in a state of dynamic equilibrium: data can move between different machines, and user requests are redirected depending on the load in the system. The same query made to the same database can be executed in a different time, with a different result, use different data channels, and run on different hardware. How then to compare the proposed solutions and choose the best one for the current task? What do you need to know about the specifics of the task before choosing a technology stack? How to evaluate the effectiveness of an information system? Let us consider the basic principles of building distributed information systems for storing big data and evaluating their effectiveness.

II. BASIC PRINCIPLES OF THE ORGANIZATION OF BIG DATA SYSTEMS

A. The Concept of Big Data and Common Frameworks

Before proceeding to evaluating the effectiveness of systems, it is necessary to determine what allows a system to be attributed to big data. Despite the great prevalence of the term "big data", it does not have an unambiguous definition. In 2013 the term was added to the Oxford Dictionary, where it is defined as follows: "sets of information that are too large or too complex to handle, analyze or use with standard methods" [1]. Wikipedia provides the following definition: big data is the designation of structured and unstructured data of huge volume and significant diversity, efficiently processed by horizontally scalable software tools that appeared in the late 2000s as an alternative to traditional database management systems and Business Intelligence solutions [2, 3]. That is, big data includes arrays of information that cannot be decomposed and processed on a single computer, so special hardware and software solutions are used to place the data on multiple devices while preserving its integrity and availability [4, 5]. Most often, such storages are built on the basis of cloud technologies, since it is cloud technologies that provide the necessary virtualization layer, which allows working with distributed data storage as a whole. This means that when evaluating big data solutions, the experience of evaluating cloud data warehouses can be used.

Of the most widely used frameworks and platforms, a huge ecosystem of technologies developed by Apache should


Authorized licensed use limited to: INDIAN INSTITUTE OF TECHNOLOGY GUWAHATI. Downloaded on September 05,2024 at 10:33:18 UTC from IEEE Xplore. Restrictions apply.
be highlighted. Apache Hadoop is a platform for parallel processing and distributed data storage; Apache Spark is a general-purpose distributed data processing environment; Apache Kafka is a streaming processing platform; Apache Cassandra is a distributed NoSQL database management system. Storm is another prominent solution, focused on working with a large real-time data flow, as are Flink, Heron, Presto, Samza and Kudu [6, 7]. Another of the most popular frameworks is Greenplum, an open source massively parallel relational DBMS for data warehouses with flexible horizontal scalability and columnar data storage based on PostgreSQL [8]. MongoDB cannot be called a framework, but it is a flexible and scalable solution based on a document-oriented NoSQL DBMS.

B. Big Data File Systems

Special file systems are needed to store data and maintain its integrity in distributed systems. HDFS (Hadoop Distributed File System) is a distributed file system designed to store large volumes of data divided block by block between the nodes of a computing cluster. Other file formats are built on top of HDFS. The main features of HDFS are its replication mechanism, aggregation of data into blocks, hierarchical file system, and pipeline replication. Replication is necessary to maintain data integrity so that when one node fails, data is not permanently lost; that is, all data is necessarily duplicated. When new data is written, a list of nodes is determined on which copies of the blocks will be stored, and the data is then transmitted from one node to another in a pipeline. Hierarchical file organization is used as in most file systems: the user can create file directories in the same way as on a local machine.

HDFS is a write-once system: a record cannot be updated, only deleted completely. HDFS allows storing completely unstructured binary files, but they are inconvenient for users and database management systems to work with, so there are a number of file formats based on HDFS that allow partially structuring information. Some of them, such as Avro, Parquet and ORC, allow working with the data using SQL queries (a special technology, Spark SQL, is used for this [9]).

All the variety of big data file formats (Avro, Sequence, Parquet, ORC, RCFile) can be divided into two categories: row-oriented and column-oriented. Row formats are stored more compactly on the hard disk, but they are harder to process: to find the right value, the entire row must be read. Column formats take up more space due to the storage of indexes, but are much faster to process, since the search goes directly to the desired column [10]. When choosing a data format, it is necessary to pay attention to the complexity and frequency of data processing requests, as well as the degree of structuring of the information.

S3 (Simple Storage Service) is an object storage and data transfer protocol developed by Amazon. Its distinctive feature is storing a huge amount of data in its original format without hierarchy or splitting into separate directories, as well as the ability to modify data, which is impossible in HDFS. S3 storage has no scaling restrictions, and the most modern frameworks are starting to use S3 as a replacement for HDFS.

III. BIG DATA STORAGE PERFORMANCE

When working with any framework and any data source, three main stages can be distinguished:

• Data collection. Data can be collected from the network for certain requests or received through other channels, for example, video surveillance cameras. The data source can be local or global. It is worth estimating the expected database volumes and the speed of collection from the sources.

• Data integration. Special systems convert the data into a format suitable for storage and processing; they can also continuously monitor it for certain requests. At this stage, when choosing technologies and data formats, a compromise is necessary: either spend more resources on structuring and transforming the data and storing meta-information, or store poorly structured information and pay a large overhead during processing.

• Processing and analysis. There may be real-time processing, or preparation of data for processing that will be performed in the future. Popular methods of data analysis include association rule learning, classification, cluster and regression analysis, mixing and integration of data, machine learning, pattern recognition and others. At this stage, it is important to understand the amount of information being processed and the frequency of requests.

The basic principles of organizing distributed repositories of large amounts of information include, firstly, extensibility. For distributed storage this means horizontal extensibility, that is, the ability to increase the required storage volumes flexibly, without having to rebuild the entire system. Secondly, fault tolerance: any device can fail, yet the system must continue to work smoothly. This is usually ensured by redistributing the load to other machines instead of the failed one, and by organizing the data so that all information is duplicated in parts on different machines; then, when one of the devices drops out due to a failure, the information is not lost. Thirdly, localization: information should be processed on the same dedicated servers where it is stored. This reduces the load on data channels and the system overhead.

Two main metrics are used to numerically evaluate the performance of big data systems: memory consumption and CPU seconds during a test query to the database. It should be understood that in distributed systems it is not possible to get the same result when repeating a test query at another time, so averages are usually used. In addition to queries, report preparation, analysis, data collection, etc. can be tested. For these purposes, it is worth choosing test tasks that are as close to real ones as possible.

There are several other indicators worth paying attention to: the frequency of data collection, the time it takes for data to become available for analysis, the time it takes to transform the data into KPIs, etc.

IV. CRITERIA FOR CHOOSING A FRAMEWORK FOR THE TASK OF PROCESSING BIG DATA

When choosing among frameworks and ready-made solutions, it is necessary to pre-evaluate the signs of big data and study the features of the selected task. Big data is often described using a set of "V" attributes:

Volume. The data sets must be large. The best criterion is the need to distribute the data for storage and processing, but both 100 TB and 10 000 TB can be attributed to big data. It is necessary to assess the current volume of data and its growth prospects.

Variety. Heterogeneity is considered one of the key features of big data, but heterogeneity can be physical or semantic and can differ in severity. It is worth evaluating the properties of the incoming data stream as well as the complexity of its processing and transformation.

Velocity is the rate at which information enters the system. The incoming flow of information can be huge, or it can be relatively scarce with a large amount of stored information. Another important point is the uniformity of the flow: for example, there may be a huge flow of data twice a year, while the rest of the time the flow is quite scarce.

Veracity. This property is important when processing the data. For example, if you plan to train a neural network on the data and it turns out that the data is unreliable, additional verification steps may be needed.

Viability. This can be understood as both content obsolescence and technical obsolescence. When evaluating this property, pay attention to whether the number of requests depends on how long the data has been stored in the database.

Value should be taken into account when organizing access to the data and when evaluating frameworks by their level of security.

Variability. In many big data solutions, overwriting data can be either difficult or impossible. If there is a need to overwrite data frequently, this must be taken into account when choosing a DBMS.

Visualization. Different solutions have different tools for data visualization; it is important that the chosen framework makes it possible to use these tools.

In addition to the listed "V" properties, there are several more, with names starting with other letters. Exhausting: the presence or absence of pre-filtering of the data. Fine-grained and uniquely lexical: detail and lexical uniqueness, showing to what extent an element and its characteristics can be correctly indexed or identified. Relational: whether the collected data contains common fields and whether they overlap with each other or with previously obtained data, which would allow combining them or carrying out a meta-analysis of different data sets. Extensional: not the number of new records, but the ability of each specific record to be supplemented and increased in volume. Scalability: how quickly the amount of data stored in a particular system can increase.

After a detailed analysis of the information planned for storage has been carried out, attention should be paid to the following properties of the solutions: the file system, the data model used, additional means of improving performance, adequacy to the task, cost, etc.

V. TECHNOLOGIES TO IMPROVE PRODUCTIVITY

A. In-Memory Technologies

Large data arrays require not only a lot of free disk space but also a large amount of RAM for processing. Computer memory is organized hierarchically, and operations in RAM are several times faster than accesses to the hard disk. However, big data processing is most often performed iteratively, with intermediate results written to the hard disk, and considerable time is lost as a result. In-memory technologies are used to speed up computing operations [11]. With in-memory technology, data compression algorithms such as DWARF are used. These technologies made it possible to create databases that use RAM as the main storage: the In-Memory Data Base (IMDB) and the In-Memory Data Grid (IMDG). An IMDB is closer in architecture to traditional relational databases, although it uses generational storage. An IMDG is an object storage more similar to a multithreaded hash table; its main advantage is the ability to work with objects from a relational data model. Among the most popular in-memory database offerings are SAP with the relational IMDB HANA, Oracle with the IMDB TimesTen, as well as the IMDB from MemSQL and the IMDG from GridGain.

B. Containerization

When cloud and big data technologies are used, the virtualization layer is most often deployed using virtual machines. On the basis of a physical server, a virtual environment is created that works as a separate computer but uses a predetermined amount of resources. The virtual machine runs on an isolated sector of the server's hard disk, with its own operating system and all necessary applications installed on it [12]. Virtual machines have many advantages, but also a serious drawback: it takes quite a long time to install and configure an additional virtual machine. This is inconvenient for systems with a very uneven load: either the installed and configured virtual machines will sit idle, or there will be insufficient computing power during peak hours. A good solution in this situation may be the use of containers.

Containerization is a technology in which program code is packaged into a single executable package together with its libraries and dependencies to ensure that it launches correctly. The result is an analog of a virtual machine, but one that uses the resources of the operating system on which it is deployed. The most well-known container platforms are Docker, Kubernetes, and Porto; the first two are available to a wide range of users. To choose the right container system for a project, one needs to understand the main differences between Docker and Kubernetes beyond the surface definition. Docker creates containers, and its management capabilities are relatively modest. Kubernetes cannot create a container by itself; it is in fact an orchestrator that allows a large number of containers to be conveniently managed through a single interface, but it requires a third-party tool to create them. Docker clusters are more difficult to create and manage than Kubernetes clusters, but they are more robust and much more stable. Kubernetes is designed for automatic scaling of Docker containers. There are system containers and application containers. System containers combine all the necessary applications and are practically

variants of virtual machines. Application containers are designed to run a separate application in isolation [13, 14].

VI. CONCLUSION

Big data technologies offer users a wide range of solutions, including both full technology stacks and many variations within these stacks. When choosing a technological solution for a specific project, it is necessary first of all to evaluate not the solution itself, but the features of the data to be stored and processed in the system under consideration: the most common tasks, processing speed requirements, data structuring and variability, etc. After evaluating the stored data, it is possible to consider candidate solutions, taking into account not only the metrics listed above, but also the cost of the entire solution, the necessary information transformations, the required processing speed, etc. Highlighting the key qualities of a specific practical task is the key to understanding the requirements for the system. For example, archived information that is accessed extremely rarely can be stored in unstructured form. If there is a need for frequent processing of the information, especially with automated queries or strict requirements on response time, more structured databases, such as key-value stores, will be suitable.

REFERENCES

[1] M. S. Kornev, "The history of the concept of 'Big data': dictionaries, scientific and business periodicals", Bulletin of the Russian State University. Series: Literary Studies. Linguistics. Cultural Studies, 2018, no. 1 (34). (in Russian)
[2] https://ru.wikipedia.org/
[3] M. Chen, S. Mao, Y. Zhang, V. C. M. Leung, Big Data: Related Technologies, Challenges, and Future Prospects, Springer, 2014, 100 p.
[4] S. V. Mkrtychev, "Big data: approaches to definition and classification", Information technologies in modeling and management: approaches, methods, solutions, 2021, pp. 253-258. (in Russian)
[5] A. O. Shcherbinina, "The study of architecture and the main criteria for choosing a DBMS focused on big data processing", I International Scientific and Technical Conference "Topical issues of the use of data analysis technologies and artificial intelligence", 2018, pp. 196-200. (in Russian)
[6] V. I. Khakhanov, V. I. Obrizan, A. S. Mishchenko, B. A. Tamer, "Metric for big data analysis", Radioelectronics and Informatics, 2014, no. 2 (65). (in Russian)
[7] https://jelvix.com/blog/top-5-big-data-frameorks
[8] Z. Lyu, "Greenplum: a hybrid database for transactional and analytical workloads", Proceedings of the 2021 International Conference on Management of Data, 2021, pp. 2530-2542.
[9] V. S. Yakovlev, "Big data", Technique and technology: role in the development of modern society, 2015, no. 6, pp. 83-90. (in Russian)
[10] E. A. Artyushina, I. I. Salnikov, "In-memory technologies for storing, processing and analyzing large volumes of structured and weakly structured data", XXI century: results of the past and problems of the present plus, 2018, vol. 7, no. 4, pp. 147-152. (in Russian)
[11] A. V. Gordeev, "Virtual machines and networks", Information and Control Systems, 2006, no. 2, pp. 21-26. (in Russian)
[12] M. K. Gupta, V. Verma, M. S. Verma, "In-memory database systems - a paradigm shift", arXiv preprint arXiv:1402.1258, 2014.
[13] D. A. Kozintsev, A. A. Shiyan, "Containerization for big data analysis by the example of Kubernetes and Docker", Actual problems of infotelecommunications in science and education (APINO 2020), 2020, pp. 393-396. (in Russian)

