Analysis of Frameworks and Technologies For Solving Big Data Storage and Processing Problems in Distributed Systems
Abstract—The article discusses the main metrics for evaluating the performance of big data warehouses and compares the characteristics of the most popular modern frameworks. When considering distributed storage of large amounts of information, it is impossible to use the same indicators of speed and efficiency that are used for local databases. Qualitative and quantitative evaluation is complicated by the inability to reproduce the same queries under the same conditions when using distributed storage; this is caused by the constantly changing state of the transmission environment and of the computing resources. When choosing a framework, technology stack, file system, or distributed database architecture for a specific task, it is necessary to take into account a large number of factors, as well as the set and frequency of operations and accesses that will be required in the system. The features and advantages of using containers and various file systems for storing data in distributed systems are presented, as well as other technologies that increase the speed and efficiency of processing large arrays of heterogeneous data. At the moment there is no single solution that organizes the storage of large volumes of heterogeneous data in the best possible way, but there are clear trends in the development of technologies for processing such data. Many of them are related to the ability to apply to big data the same methods that are used in relational databases, while taking their specific features into account.*

Keywords—big data, control system architectures, heterogeneous data, metrics of distributed systems, distributed systems

* This study was conducted within the framework of the scientific program of the National Center for Physics and Mathematics, section №9 «Artificial intelligence and big data in technical, industrial, natural and social systems».

I. INTRODUCTION

Technologies and distributed big data storage systems are rapidly developing and gaining popularity, as they are optimally suited for storing data under a huge inflow of new, most often poorly structured, information. New technologies for organizing data storage and access, new processing methods, technologies for ensuring and maintaining data integrity, and support for remote access and multi-user work are solutions dictated by the realities of the modern world. There are a large number of big data storage solutions on the market; most often these are complex technology stacks, ranging from the file system to the user interface. How can one determine which technologies will be optimal for solving a specific task? The old evaluation methods no longer work. All modern distributed systems are in a state of dynamic equilibrium: data can move between different machines, and user requests are redirected depending on the load on the system. The same query made to the same database can take a different amount of time, return a different result, use different data channels, and run on different hardware. How, then, should one compare the proposed solutions and choose the best one for the task at hand? What do you need to know about the specifics of the task before moving on to choosing a technology stack? How should the effectiveness of the information system be evaluated? Let us consider the basic principles of building distributed information systems for storing big data and of evaluating their effectiveness.

II. BASIC PRINCIPLES OF THE ORGANIZATION OF BIG DATA SYSTEMS

A. The Concept of Big Data and Common Frameworks

Before proceeding to evaluating the effectiveness of systems, it is necessary to determine what allows a system to be classified as a big data system. Despite the great prevalence of the term "big data", it does not have an unambiguous definition. In 2013 the term was added to the Oxford Dictionary, where it is defined as "sets of information that are too large or too complex to handle, analyze or use with standard methods" [1]. Wikipedia provides the following definition: big data is the designation of structured and unstructured data of huge volume and significant diversity, efficiently processed by horizontally scalable software tools that appeared in the late 2000s as alternatives to traditional database management systems and Business Intelligence class solutions [2, 3]. That is, big data can include arrays of information that cannot be placed and processed on a single computer, so special hardware and software solutions are used to distribute the data array across multiple devices while preserving its integrity and availability [4, 5]. Most often, such storages are built on the basis of cloud technologies, since it is, in fact, cloud technologies that provide the necessary virtualization layer, which makes it possible to work with a distributed data storage as a single whole. This means that when evaluating big data solutions, one can draw on the experience of evaluating cloud data warehouses.

Of the most widely used frameworks and platforms, the huge ecosystem of technologies developed by Apache should be noted.
Volume. The data sets must be large. The best criterion is the need to distribute the data array for storage and processing. But both 100 TB and 10,000 TB can be attributed to big data, so it is necessary to assess both the current volume of data and the prospects for its growth.

Variety. Heterogeneity is considered one of the key features of big data, but heterogeneity can be physical or semantic and can be more or less pronounced. It is worth evaluating the properties of the incoming data stream, as well as the complexity of its processing and transformation.

Velocity is the rate at which information enters the system. The incoming flow of information can be huge, or it can be relatively scarce against a large amount of already stored information. Another important point is the uniformity of this flow. For example, a huge flow of data may arrive twice a year, while the rest of the time the flow of information is quite scarce.

Veracity. This property matters when the data is processed further. For example, if you plan to train a neural network on the data and it turns out that the data is unreliable, additional verification steps may be needed.

Viability. This can be understood as both the obsolescence of the data itself and technical obsolescence. When evaluating this property, pay attention to whether the number of requests depends on how long the data has been stored in the database.

Value. The value of the data should be taken into account when organizing access to it and when evaluating frameworks by their level of security.

Variability. In many big data solutions, overwriting data is either difficult or impossible. If data needs to be overwritten frequently, this must be taken into account when choosing a DBMS.

Visualization. Different solutions provide different tools for data visualization; it is important that the chosen framework makes it possible to use the required tools.

In addition to the listed "V" properties, there are several more whose names start with other letters.

Exhaustivity: the presence or absence of pre-filtering of the data. Fine-grained and uniquely indexical: detail and lexical uniqueness; shows to what extent an element and its characteristics can be correctly indexed or identified. Relational: whether the collected data contains common fields, and whether these overlap with each other or with previously obtained data, which would allow combining them or carrying out a meta-analysis of different data sets. Extensional: not the number of new records, but the ability of each specific record to be supplemented and to grow in volume. Scalability: how quickly the amount of data stored in a particular system can increase.

After a detailed analysis of the information planned for storage has been carried out, attention should be paid to the following properties of candidate solutions: the file system, the data model used, additional means of improving performance, adequacy to the task, cost, etc. Such an assessment can be recorded in a simple structured form, as in the sketch below.
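Purely as an illustration (this is not part of any framework discussed here, and all names are hypothetical), the properties listed above can be collected into a profile that is filled in before a technology stack is chosen; a minimal Python sketch:

```python
from dataclasses import dataclass

@dataclass
class DataProfile:
    """Hypothetical checklist of the 'V' properties discussed above."""
    volume_tb: float                # current data volume (Volume)
    growth_tb_per_year: float       # growth prospects (Volume)
    heterogeneity: str              # "physical" or "semantic" (Variety)
    ingest_gb_per_day: float        # rate of incoming data (Velocity)
    uniform_flow: bool              # is the incoming flow even over time?
    verified: bool                  # is the data reliable? (Veracity)
    access_decays_with_age: bool    # do requests fall off with data age? (Viability)
    frequent_overwrites: bool       # must data be overwritten often? (Variability)

def needs_distribution(p: DataProfile, node_capacity_tb: float = 100.0) -> bool:
    """The basic 'big data' criterion: the array will not fit on one machine."""
    return p.volume_tb + p.growth_tb_per_year > node_capacity_tb

profile = DataProfile(volume_tb=120, growth_tb_per_year=40,
                      heterogeneity="semantic", ingest_gb_per_day=500,
                      uniform_flow=False, verified=False,
                      access_decays_with_age=True, frequent_overwrites=False)
print(needs_distribution(profile))  # True: a distributed solution is required
```

Such a profile makes the subsequent comparison concrete: each candidate stack is checked against the recorded requirements rather than against generic benchmarks.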
V. TECHNOLOGIES TO IMPROVE PRODUCTIVITY

A. In-Memory Technologies

Large data arrays require not only a lot of free space on the hard disk, but also a large amount of RAM for processing. Computer memory is organized as a hierarchy, and operations in RAM are several times faster than accesses to the hard disk. However, big data processing is most often performed iteratively, with intermediate results written to the hard disk, and considerable time is lost as a result. In-memory technologies are used to speed up such computations [11]. In-memory technology is used together with data compression algorithms such as DWARF. These technologies made it possible to create databases that use RAM as the main storage: the In-Memory Database (IMDB) and the In-Memory Data Grid (IMDG). An IMDB is closer in architecture to traditional relational databases, although it uses generational storage. An IMDG is an object storage that resembles a multithreaded hash table; its main advantage is the ability to work with objects rather than with a relational data model. Among the most popular in-memory database offerings are SAP with the relational IMDB HANA, Oracle with the IMDB TimesTen, as well as the IMDB from MemSQL and the IMDG from GridGain.
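To make the hash-table analogy concrete, here is a minimal sketch of an IMDG-style object store: a hash table split into partitions, each guarded by its own lock so that many threads can read and write concurrently. It illustrates the general idea only and is not the API of any of the products named above.

```python
import threading

class MiniDataGrid:
    """Toy in-memory object store: a partitioned, thread-safe hash table."""

    def __init__(self, partitions: int = 16):
        self._parts = [dict() for _ in range(partitions)]
        self._locks = [threading.Lock() for _ in range(partitions)]

    def _slot(self, key) -> int:
        # Pick a partition by hashing the key.
        return hash(key) % len(self._parts)

    def put(self, key, obj) -> None:
        i = self._slot(key)
        with self._locks[i]:
            self._parts[i][key] = obj   # the object stays in RAM, no disk round trip

    def get(self, key):
        i = self._slot(key)
        with self._locks[i]:
            return self._parts[i].get(key)

grid = MiniDataGrid()
grid.put("user:42", {"name": "Alice", "visits": 17})
print(grid.get("user:42"))
```

A production IMDG additionally replicates partitions across machines; the per-partition locking above is only the single-node core of that design.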
B. Containerization

When cloud and big data technologies are used, the virtualization layer is most often deployed using virtual machines. On the basis of a physical server, a virtual environment is created that works as a separate computer but uses a predetermined amount of resources. The virtual machine runs on an isolated sector of the server's hard disk, with its own operating system and all the necessary applications installed on it [12]. Virtual machines have many advantages, but there is also a serious drawback: it takes quite a long time to install and configure an additional virtual machine. This is inconvenient for systems whose load is very uneven: either the installed and configured virtual machines stand idle, or there is insufficient computing power during peak load hours. A good solution in this situation may be the use of containers.

Containerization is a technology in which program code is packaged into a single executable package together with its libraries and dependencies to ensure its correct launch. The result is an analog of a virtual machine, but one that uses the resources of the operating system on which it is deployed. The most well-known container platforms include Docker, Kubernetes, and Porto; the first two are available to a wide range of users. What is the difference between these technologies? To choose the right container system for a project, one needs to understand the main differences between Docker and Kubernetes beyond their surface definitions. Docker creates containers, and its management capabilities are relatively modest. Kubernetes cannot create a container by itself; it is in fact an orchestrator that allows a large number of containers to be conveniently managed through a single interface, but it requires a third-party tool to create the containers. Docker clusters are more difficult to create and manage than Kubernetes clusters, but they are more robust and much more stable. Kubernetes is designed for automatic scaling of Docker containers.
There are system containers and application containers. System containers combine all the necessary applications and are practically variants of virtual machines. Application containers are designed to run a separate application in isolation [13, 14].
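As a small illustration of these roles, the sketch below starts an application container programmatically through the Docker SDK for Python. The image name and command are arbitrary examples, and a running Docker daemon plus the `docker` package are assumed.

```python
import docker  # Docker SDK for Python: pip install docker

# Connect to the local Docker daemon.
client = docker.from_env()

# Run an application container: a single isolated process whose
# dependencies are packaged in the image (here a stock Python image).
output = client.containers.run(
    "python:3.12-slim",                                   # image: code + dependencies
    ["python", "-c", "print('hello from a container')"],  # the isolated application
    remove=True,                                          # delete the container on exit
)
print(output.decode())
```

In a Kubernetes deployment the same containers would be described declaratively, and the orchestrator would take over their scheduling, scaling, and restarting.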
VI. CONCLUSION

Big data technologies offer users a wide range of solutions, including both full technology stacks and many variations within those stacks. When choosing a technological solution for the implementation of a specific project, it is necessary first of all to evaluate not the solution itself, but the features of the data that must be stored and processed in the system under consideration: the most common tasks, processing speed requirements, the degree of data structuring and variability, and so on. After the stored data has been evaluated, it becomes possible to consider candidate solutions, taking into account not only the metrics listed above, but also the cost of the entire solution, the necessary data transformations, information processing speed requirements, etc. Highlighting the key qualities of a specific practical task is the key to understanding the requirements for the system. For example, archived information that is accessed extremely rarely can be stored in an unstructured form. If there is a need for frequent processing of information, especially automated queries, or if there are strict requirements for the response time to a request, more structured databases, such as key-value stores, will be suitable.

REFERENCES

[1] M. S. Kornev, "The history of the concept of 'big data' (Big Data): dictionaries, scientific and business periodicals", Bulletin of the Russian State University, Series: Literary Studies, Linguistics, Cultural Studies, 2018, no. 1 (34). (in Russian)
[2] https://ru.wikipedia.org/
[3] M. Chen, S. Mao, Y. Zhang, V. C. M. Leung, Big Data: Related Technologies, Challenges, and Future Prospects, Springer, 2014, 100 p.
[4] S. V. Mkrtychev, "Big data: approaches to definition and classification", Information technologies in modeling and management: approaches, methods, solutions, 2021, pp. 253-258. (in Russian)
[5] A. O. Shcherbinina, "The study of architecture and the main criteria for choosing a DBMS focused on big data processing", I International Scientific and Technical Conference "Topical issues of the use of data analysis technologies and artificial intelligence", 2018, pp. 196-200. (in Russian)
[6] V. I. Khakhanov, V. I. Obrizan, A. S. Mishchenko, B. A. Tamer, "Metric for big data analysis", Radioelectronics and Informatics, 2014, no. 2 (65). (in Russian)
[7] https://jelvix.com/blog/top-5-big-data-frameorks
[8] Z. Lyu, "Greenplum: a hybrid database for transactional and analytical workloads", Proceedings of the 2021 International Conference on Management of Data, 2021, pp. 2530-2542.
[9] V. S. Yakovlev, "Big data", Technique and technology: role in the development of modern society, 2015, no. 6, pp. 83-90. (in Russian)
[10] E. A. Artyushina, I. I. Salnikov, "In-memory technologies for storing, processing and analyzing large volumes of structured and weakly structured data", XXI century: results of the past and problems of the present plus, 2018, vol. 7, no. 4, pp. 147-152. (in Russian)
[11] A. V. Gordeev, "Virtual machines and networks", Information and control systems, 2006, no. 2, pp. 21-26. (in Russian)
[12] M. K. Gupta, V. Verma, M. S. Verma, "In-memory database systems - a paradigm shift", arXiv preprint arXiv:1402.1258, 2014.
[13] D. A. Kozintsev, A. A. Shiyan, "Containerization for big data analysis by example Kubernetes and Docker", Actual problems of infotelecommunications in science and education (APINO 2020), 2020, pp. 393-396. (in Russian)