Big Data Architecture

This document presents a security architecture design for a Big Data environment. It defines concepts such as Big Data, Hadoop, HDFS, Cloudera Navigator and Spark. It then discusses security topics such as Kerberos, Apache Sentry, auditing, data lineage, and data encryption. Finally, it proposes a security design and architecture that includes communications security and its different security components.

MASTER IN COMPUTER SECURITY

Security in Mobile Environments and Virtualization

Prof. Rafael Sada

DESIGN OF AN ARCHITECTURE WITH SECURITY IN A BIG DATA ENVIRONMENT

JOSÉ ESPEJEL VILLAFUERTE

DAVID MARTINEZ GARCIA

JUAN FRANCISCO BELMAREZ AMADOR

SONNY FABIÁN MORALES CÁRDENAS

26-July-2022
Index

Introduction
Big Data Definition
Hadoop ecosystem
HDFS
Cloudera Navigator
Spark
Big Data security and data protection
Kerberos
Apache Sentry
Audit
Data lineage
Data encryption
Big data security architecture design
Communications security
Security architecture
Conclusions
Bibliography
Introduction

In this work, the team members named above take on the task of researching Big Data concepts, placing them in context, and proposing a Hadoop-based system whose security solution follows the requirements set out in the assignment.

Big Data Definition

Big Data is the name given to the process of analyzing large volumes of data. These analyses generate all kinds of valuable information that can be applied in different business areas such as marketing, population statistics and health, among others.

(Douglas da Silva, Web Content & SEO Associate, LATAM, 2021)

It operates on user behavior: from the stored data, Big Data extracts observation patterns and formulates predictions based on them. It is a discipline dedicated to massive data within the IT and communications sector. The term has been in use since the 1990s, and credit for it goes to computer scientist John Mashey, who discussed the term at that time. (Susanne Lindholm, 2017)

Currently, Big Data is described in terms of the "5 Vs": Volume, Variety, Velocity, Veracity and Value.

 Volume refers to the amount of data the company generates every second, for example from social networks, emails, electronic devices, etc.
 Variety concerns all the places where data can be stored and from which it can be extracted.
 Velocity is the speed at which data must be analyzed, which is very high. This is where the third "V" comes in: the data is analyzed at the moment it is created, for example Facebook virals or credit-card transactions.
 Veracity refers to the reliability and quality of the data being analyzed.
 Finally there is Value: having a great deal of information is only worthwhile if it generates value for the business. Here a precise analysis of all that data is carried out, producing valuable insights for the managers who will use them.

Hadoop ecosystem

Fig. 1 Hadoop Ecosystem Diagram.

Apache Hadoop is an ecosystem of open-source components that radically changes how companies store, process and analyze data. Hadoop enables multiple types of analytics workloads to be run on the same data at once, at scale, and on industry-standard hardware. (cloudera, 2022)

Hadoop is also a framework with the ability to store and process data on a large scale. It has a quick learning curve, and it is not the only solution out there.

Hadoop is a technology with the following properties:

- It is automated: jobs executed on the cluster are designed to require no manual intervention during their processing.
- Its programming model is simple.
- It allows scalability by adding compute nodes to the cluster in which it runs, which further improves execution performance as the data sources grow.
- It is a simple framework: jobs, once submitted, run effectively without the client who ships them having to intervene, even if there is a failure in some compute node.
- Hadoop is a flexible framework that adapts to whatever business case it is given.
- Ideally, it is a framework in which each of its parts is recoverable.
- Failures in the framework do not create disorder or invalid data conditions, since it is a consistent system.

The Hadoop core provides a distributed file system to store the data, a resource manager to schedule the jobs that run on it, and a data-processing engine, a sketch of which follows below. It is also made up of numerous components that cover the needs to build a platform able to integrate with the Big Data of a specific organization, entity, group or client.
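
As an illustration of that processing engine, here is a minimal Hadoop Streaming sketch in Python: a word count in the classic MapReduce style. It is only a sketch, assuming a cluster where the streaming jar is available (its path varies by distribution); the script name and the input/output paths are hypothetical.

# Minimal Hadoop Streaming word count. Submitted roughly as:
#   hadoop jar hadoop-streaming.jar \
#     -files wordcount.py \
#     -mapper "python3 wordcount.py map" \
#     -reducer "python3 wordcount.py reduce" \
#     -input /data/text -output /data/counts
import sys

def mapper():
    # Emit "word<TAB>1" for every word on stdin; Hadoop sorts mapper
    # output by key before the reduce phase (locally, pipe through sort).
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

def reducer():
    # Sum consecutive counts for each word of the sorted mapper output.
    current, total = None, 0
    for line in sys.stdin:
        word, count = line.rsplit("\t", 1)
        if word != current and current is not None:
            print(f"{current}\t{total}")
            total = 0
        current = word
        total += int(count)
    if current is not None:
        print(f"{current}\t{total}")

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()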

HDFS

The Hadoop Distributed File System (HDFS) is a distributed file system inspired by the Google File System. Its fundamental aspects are flexibility, robustness, scalability, multi-node operation and portability. It is written in Java and is responsible for managing files across the different machines of a Big Data environment, mostly within the Hadoop ecosystem.

In typical file systems data can be written and rewritten continuously, whereas in HDFS the data is written only once and replicated as many times as configured. Reads, on the other hand, can be performed as many times as needed once the data is stored in the system.
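
That write-once, read-many model can be seen from the command line. The following is a minimal sketch, assuming a configured Hadoop client on the PATH and an existing /data directory in HDFS; the file names are hypothetical.

import subprocess

def hdfs(*args):
    """Run an 'hdfs dfs' subcommand and return its standard output."""
    result = subprocess.run(["hdfs", "dfs", *args],
                            capture_output=True, text=True, check=True)
    return result.stdout

hdfs("-put", "events.log", "/data/events.log")  # data is written only once
hdfs("-setrep", "3", "/data/events.log")        # replicated as configured
print(hdfs("-cat", "/data/events.log"))         # and read as often as needed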

Cloudera Navigator

Cloudera Navigator is the tool responsible for the data auditing and discovery capabilities in a Hadoop ecosystem distributed by Cloudera. In particular, its value lies in data governance, where the aim is to have complete and absolute control over all the data stored within the cluster.
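
As a sketch of how such oversight can be consumed programmatically, the example below queries Navigator's audit REST API. The host, port, API version and credentials are assumptions that depend on the deployment; the exact endpoint should be checked against the documentation of the Navigator version in use.

import requests

# Hypothetical host, port, API version and credentials.
NAVIGATOR_AUDITS = "http://navigator.example.com:7187/api/v9/audits"

resp = requests.get(
    NAVIGATOR_AUDITS,
    params={"query": "service==hdfs", "limit": 10, "offset": 0},
    auth=("nav_admin", "nav_password"),  # hypothetical credentials
)
resp.raise_for_status()
for event in resp.json():  # each event records who touched which data, when
    print(event)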

Spark

Spark is an engine used to work with large amounts of data and to carry out specific computations or operations on it: to transform, move or delete data, or to produce new data in the result. It is used to produce tests, metrics or new data sets that can then be used to drive a series of decisions.

At first, Hadoop relied on MapReduce as its processing engine; however, alternatives emerged, and the one currently ruling the Big Data scene in terms of analytics is Apache Spark.

Spark is an extremely fast engine for handling large amounts of data on one or several machines. It covers Map and Reduce tasks as well as cluster management, and it offers several libraries for processing streaming data together with machine-learning capabilities. Its execution model allows processing across multiple cores with in-memory access to data, and it is fault-tolerant. It scales very well horizontally; that is, as the request/processing load increases, performance can be improved by adding more nodes to the system.

Some features Spark presents are an interactive console from which to use the engine on a data set, using the Python or Scala programming languages, and the execution of jobs composed of several tasks that perform specific functions. One of Spark's key aspects is that it does not need a resource manager (such as YARN) to execute work. Even so, it is advisable to combine YARN + Spark when working on a cluster in order to make the best possible use of resources and jobs. It is integrated into Hadoop in a very similar way to MapReduce. A short example follows below.
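
The following minimal PySpark sketch illustrates the in-memory, fault-tolerant model just described: a data set is filtered once, cached, and then reused by several actions without rereading the source. It assumes pyspark is installed; the input file name is hypothetical, and on a cluster the master would be "yarn" rather than local.

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("log-analysis-sketch")
         .master("local[*]")  # on a YARN cluster this would be "yarn"
         .getOrCreate())

lines = spark.sparkContext.textFile("app.log")         # hypothetical input
errors = lines.filter(lambda l: "ERROR" in l).cache()  # keep in memory

# Both actions reuse the cached data instead of rereading the file.
print("total errors:", errors.count())
print("hdfs errors:", errors.filter(lambda l: "hdfs" in l).count())

spark.stop()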

Big Data security and data protection

For many organizations it is critical to keep their information well safeguarded, whether the data belongs to customers, partners or third parties. Moreover, it is no longer simply a matter of securing the data itself, but also of complying with government regulations, of controlling access to specific data regions within the organization for audit purposes, and of properly anonymizing the reports, visualizations or results that are to be published externally. Having a master plan for data governance is vital. Data governance is understood as the set of processes responsible for managing the data, whether by means of policies, rules or tools.

These processes are used to identify the owner of the data, who accesses it and how it is used. The combination of security and data governance is a critical component: together they facilitate the protection tasks in Big Data and allow audit procedures to be executed more easily, since there is a high degree of control over all the accessible information. Security starts with the smallest unit there is, which in this case is the data, and from that point it tries to safeguard all the parts, layers and components that are built on top of it.

Kerberos

Kerberos is a network authentication protocol created by the Massachusetts Institute of Technology (MIT), based on symmetric-key cryptography and a trusted third party. Its main purpose is to provide authentication at the user level and to keep malicious actors from performing impersonation or identity-usurpation maneuvers. With Kerberos enabled on a network of machines, it is very easy for client programs running on those machines to confirm that a user is who they claim to be.

Kerberos authentication, combined with Hadoop authorization systems such as Sentry, provides a strong security model to safeguard access to the information in a Big Data cluster.
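
As a sketch of how a service obtains a ticket non-interactively, the example below drives the MIT Kerberos client tools. The principal, keytab path and realm are hypothetical and depend entirely on the deployment.

import subprocess

PRINCIPAL = "etl-user@EXAMPLE.COM"  # hypothetical principal and realm
KEYTAB = "/etc/security/keytabs/etl-user.keytab"  # hypothetical keytab path

# Request a ticket-granting ticket from the KDC without typing a password.
subprocess.run(["kinit", "-kt", KEYTAB, PRINCIPAL], check=True)

# List the cached tickets to confirm that authentication succeeded.
subprocess.run(["klist"], check=True)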

Apache Sentry

The function of Apache Sentry is to offer a security layer at a specific level based on assigned roles and privileges. This technology is included in the Cloudera distribution ("For access security, which controls what users and applications can do with data, [Sentry is] the standard, open-source, unified authorization engine to enforce Role-Based Access Control (RBAC) across the Hadoop ecosystem. It provides unified access control for data and metadata stored in Hadoop. Enterprises can define privileges for data sets that are enforced across multiple access paths, including HDFS, Apache Hive, Impala, Cloudera Search, as well as Apache Pig and Apache MapReduce/YARN via HCatalog.") (Cloudera, 2022)

Specifically, it consists of a running service and a series of metadata configuration files required for the correct operation of the module.

In practice, it provides the ability to control and apply various levels of privileges on data and metadata for authenticated users and applications in a cluster. Likewise, it allows authorization rules to be defined that validate access requests from a user or an application to Hadoop resources, so it is integrated as a module into several components of the ecosystem. It is feasible to grant several groups access to the same data at different privilege levels. For example, for a given data set, the security team may be granted privileges to view all columns, the analysts only the non-sensitive columns, and so on, with any combination that may exist; a sketch of this follows below.

For the authorization cycle to work, Sentry relies on the authentication mechanism provided by the combination of Kerberos and LDAP/Active Directory.
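
The example below sketches the scenario just described (full access for the security team, column-restricted access for analysts) as Sentry-enforced Hive SQL, issued here through the impyla client; beeline would work equally well. The host, table, column, role and group names are hypothetical, and the groups are assumed to exist in LDAP/Active Directory.

from impala.dbapi import connect  # impyla client

# Hypothetical Kerberized HiveServer2 endpoint.
conn = connect(host="hive.example.com", port=10000,
               auth_mechanism="GSSAPI", kerberos_service_name="hive")
cur = conn.cursor()

for stmt in [
    "CREATE ROLE security_full",
    "CREATE ROLE analyst_restricted",
    # The security team may view every column of the data set...
    "GRANT SELECT ON TABLE customers TO ROLE security_full",
    # ...while analysts may view only the non-sensitive columns.
    "GRANT SELECT(name, city) ON TABLE customers TO ROLE analyst_restricted",
    # Roles map to the LDAP/Active Directory groups Sentry trusts.
    "GRANT ROLE security_full TO GROUP security",
    "GRANT ROLE analyst_restricted TO GROUP analysts",
]:
    cur.execute(stmt)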

Audit

An audit is a mechanism used in computing that allows us to manage and analyze a certain number of systems with the sole purpose of detecting threats or misuse of those systems. When an audit is carried out, not only the equipment is analyzed; the network it belongs to, including servers, workstations and user accounts, must also be analyzed to achieve complete coverage. In the context of Big Data the audit is more thorough: apart from reviewing the equipment in the data cluster, an exhaustive study is carried out on the type of data and its content, focusing on the data already stored, on who uses it or has accessed it, and on what type of user has accessed it.
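
As a small illustration, the sketch below scans an HDFS-style audit log for access to a sensitive path by users outside an approved set. The log field names (ugi=, cmd=, src=), the path and the user names are assumptions that vary by distribution.

import re

APPROVED = {"etl-user", "security-admin"}  # hypothetical allow-list
SENSITIVE = "/data/pii"                    # hypothetical sensitive path

# HDFS audit lines commonly carry ugi= (user), cmd= and src= fields.
pattern = re.compile(r"ugi=(\S+).*?cmd=(\S+).*?src=(\S+)")

with open("hdfs-audit.log") as log:
    for line in log:
        match = pattern.search(line)
        if not match:
            continue
        user, cmd, src = match.groups()
        if src.startswith(SENSITIVE) and user not in APPROVED:
            print(f"ALERT: {user} ran {cmd} on {src}")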

Data lineage

It is a security process that arises with data analysis and consists of obtaining the traceability of the data: from the moment it is generated, indicating its origin, through to its final destination, describing where it is stored, who consults it, what work they do with it and whether they modify, replicate or delete it, with the entire process kept in registration logs.
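
The sketch below illustrates the kind of record a lineage log keeps for each operation performed on a data set. It is not any particular product's API (Cloudera Navigator collects this automatically); all names are illustrative.

import json
import time

def record_lineage(logfile, dataset, operation, user, source, destination):
    """Append one traceability entry: what happened to the data and by whom."""
    entry = {
        "timestamp": time.time(),
        "dataset": dataset,
        "operation": operation,  # e.g. create, read, modify, replicate, delete
        "user": user,
        "source": source,
        "destination": destination,
    }
    with open(logfile, "a") as f:
        f.write(json.dumps(entry) + "\n")

record_lineage("lineage.log", "customers", "replicate",
               "etl-user", "/data/raw/customers", "/data/curated/customers")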

Data encryption

Safeguarding data through encryption is a useful and valuable task from several perspectives. Above all, it protects against external threats that may occasionally infiltrate the system and try to grab specific sensitive data. In addition, the transmission of data between different parties is always more secure when it is in encrypted form. Finally, even though the cluster is already set up to restrict the data regions to authorized people, it is always better to maintain a stronger guarantee of integrity in case some kind of erratic failure occurs.

Although encrypting information is considered good practice, it is a computationally expensive activity. Encryption algorithms use strong mathematical operations and large keys to protect the information, which, added to the volumes processed in Big Data, makes the task of cryptographically encoding such collections slow and computationally expensive.

In Hadoop environments, data encryption is performed at the disk level, typically with the AES algorithm and a key size of 256 bits. To overcome the high computational load, encryption in Big Data bypasses the Java software stack (specifically the application layer) and talks directly to the operating system, which sends the encryption operation for data from disks, HDFS or any other storage directly to the processor, resulting in a much faster encryption operation.

A big positive point is that all this interaction is transparent to users. That is, any user who is authenticated and authorized to access the data will be able to view it as if it were not encrypted.
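
As a minimal illustration of AES-256 encryption like that described above, the sketch below uses the pyca/cryptography library. In a real cluster, HDFS transparent encryption performs this work, and keys come from a key management service rather than being generated in place.

import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

key = AESGCM.generate_key(bit_length=256)  # a 256-bit AES key
nonce = os.urandom(12)                     # must be unique per message
aesgcm = AESGCM(key)

plaintext = b"sensitive customer record"
ciphertext = aesgcm.encrypt(nonce, plaintext, None)

# An authorized reader holding the key sees the data as if unencrypted.
assert aesgcm.decrypt(nonce, ciphertext, None) == plaintext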

Big data security architecture design

In this section we present the security architecture that has been designed based on the research carried out. Below, a series of subsections establishes the configuration prior to the description of the proposed design.

Communications security

To guarantee that the network of machines on which the work will be carried out is safeguarded at the level of communications between the nodes of the cluster, it is essential to have several components, such as firewalls, certificates and protocols, that establish the security of the network.

First, the organization must have a series of firewalls responsible for promptly dealing with any danger coming from outside that could affect the security of the computer network. Additionally, a series of certificates should be used to establish a secure route to the web from the machines that make up the cluster. This is essential, since many of the services used in Hadoop have a web front end (such as Cloudera Manager or the HUE UI) and are accessed through a web endpoint. The certificate must be issued by a trusted third party (a certificate authority).
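
The sketch below checks, from a client machine, that one of those web endpoints presents a certificate signed by the trusted certificate authority. The host, port and CA-bundle path are hypothetical and deployment-specific.

import socket
import ssl

HOST, PORT = "cm.example.com", 7183  # hypothetical Cloudera Manager TLS port
CA_BUNDLE = "/etc/pki/tls/certs/internal-ca.pem"  # hypothetical CA bundle

context = ssl.create_default_context(cafile=CA_BUNDLE)
with socket.create_connection((HOST, PORT)) as sock:
    # wrap_socket verifies the chain against the CA and checks the hostname.
    with context.wrap_socket(sock, server_hostname=HOST) as tls:
        cert = tls.getpeercert()
        print("Issuer:", dict(x[0] for x in cert["issuer"]))
        print("Valid until:", cert["notAfter"])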

Security architecture

This architecture is designed to meet the objective of having a controlled and confined environment for a group of users within an organization. Users will access it from several locations far removed from the perimeter to be safeguarded, through a front-end network layer that lets them interact with the processing engine for Big Data tasks.

The other two networks, in turn, must be safeguarded by various network security mechanisms. One of them is the back-end layer, which contains the machines intended to offer the Big Data services we are working with, as well as other systems that can store sensitive information. The third and final network proposed in the architecture is intended to connect the machines dedicated to security administration, thereby keeping the data and its associated security systems apart. The three networks, despite being completely independent, have visibility at the IP level in order to be able to allow connections (but also to discard them for safety).

The fundamental parts of the architecture are labeled with letters to make them easier to identify in this description:

Part A is the Big Data cluster that needs to be secured. It contains each of the nodes, both workers and masters, which provide the services supported within Hadoop. Furthermore, on this network there may be other clusters, components or systems dedicated to different purposes. The Big Data cluster should be Kerberized, and each of the machines should belong to a specific realm that will be used to issue the tickets. Only tickets belonging to that realm can be used within it, and only users registered on the authentication server will be able to authenticate against the machines that host the Hadoop services.

The machines must have integration configured at the SSSD level, to talk to the KDC consistently and securely and obtain tickets without manual steps. Additionally, it is important that the IP addresses appear in the host nodes' configuration files to establish intra- and inter-cluster communication. Their addresses must also be declared in the Kerberos-related records so that the environment is properly safeguarded at the authentication level.

The Hadoop distribution used to carry out Big Data projects relies on the key Sentry modules for authorization and on Navigator for auditing.

One node that stands out from the rest is the Edge node, denoted by the letter B. The purpose of this node is to serve as a "gateway" to the whole: in no case should it be possible to reach a component inside the back-end network of the organization directly.

In this way, access to the cluster is centralized and must pass through a single entry point safeguarded by various network components. It is important to clarify that, although the edge node sits as a boundary between the back-end network and the front end, at the logical level it is a component that belongs to the cluster and must therefore have the SSSD and Kerberos configuration. Users, administrators or application clients interact with the Edge node via SSH. Jobs are executed from this node, including authentication and authorization. Once a user has a ticket, they will be able to run jobs from the Edge node against the cluster and to view the cluster's results, without ever touching the nodes responsible for running the Hadoop services. The sketch below illustrates this access pattern.
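
In this minimal sketch the user only ever touches the Edge node, obtaining a ticket there and submitting work to the cluster from it. The hostnames, keytab, principal and job are hypothetical.

import subprocess

EDGE = "edge.example.com"  # hypothetical edge-node hostname

# Everything runs on the Edge node; the Hadoop service nodes are never
# contacted directly.
for remote_cmd in [
    "kinit -kt ~/analyst.keytab analyst@EXAMPLE.COM",       # authenticate
    "spark-submit --master yarn analysis.py /data/events",  # run the job
    "hdfs dfs -cat /user/analyst/output/part-*",            # view results
]:
    subprocess.run(["ssh", EDGE, remote_cmd], check=True)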

Flow C is probably the most significant in the entire design. Its purpose is to automatically request, through the SSSD of the Edge node, a Kerberos ticket from the KDC.

At the KDC, the user's credentials are checked and a ticket is issued if authentication is valid. Likewise, the user's permissions, derived from the groups they belong to in the Active Directory/LDAP registry, are also retrieved. In this way, when a job is executed on the cluster, the Sentry modules apply the relevant granular authorization based on those permissions, matching the roles defined on the Sentry server against the Active Directory/LDAP groups.

With Sentry enabled, a group of administrators can be defined for users, which checks whether they may access the environment and what tasks they can perform. For granular access to the information, tables or columns of the different storage systems the platform has, a series of groups can be established that are responsible for determining which users can do specific things and which cannot, by means of permissions, roles and privileges.

Within the network there is the KDC (D). This is a group of machines dedicated solely and exclusively to ensuring user authentication and authorization. The KDC contains the Kerberos components required for issuing tickets and for protocol compliance. The configuration of this set of machines is basic, since they are responsible for integrating the Active Directory or LDAP technology (depending on whether they are Windows or UNIX machines) with the Kerberos and SSSD configuration, which performs all the system's authentication.

This environment must include an authentication server that uses Active Directory or LDAP as a store for querying users and their related groups.

Returning to the front-end network from which potential users interact with the cluster, an authentication server must be incorporated. Several security components have already been mentioned to ensure that access to the back end is done securely. Firewalls, proxy servers and load balancers (E) are systems that make it possible to block threats and prevent harmful access to the edge node, which acts as the line between the two networks. It should not be forgotten that the trustworthiness of the front end cannot necessarily be guaranteed in all cases, which is why it is important to give connections an extra layer of security at the network level so that the point-by-point assurance of this work is complete.

There is a component whose incorporation gives the design a platform on which to dump any log (F) produced in the cluster and, through the appropriate processing, to perform audit and verification tasks (G).

This is the SIEM, a vault in which to store security events and apply a series of security policies, in order to know the state of the cluster at all times and act preventively to safeguard the whole environment.

As an alternative to the SIEM, an external database can be used to hold the logs as a reporting database, for example a dedicated NoSQL store. Likewise, the Navigator server can be used in conjunction with this system to gain power in the review of activity.

The last component presented in the design is located, once again, in the administration network: the group of security concentrators (H).

The purpose of this group is to bring together those security components that it is advisable to keep away from the data. The Sentry and Navigator servers can be placed in this group, and they will remain in communication with the KDC in order to run their services securely. This design allows all flows to be presented as if the environment were not secured, while gaining integrity in the design.

Conclusions

The enthusiasm of individuals, and consequently of the organizations made up of those individuals, for examining information and consuming data is evident and keeps growing.

Advances in communications and in the variety of applications mean that the amount of open data available is enormous. For organizations, the usefulness of analyzing that information is clear, and they base their business decisions on the data they obtain.

Bibliography
Cloudera. (July 26, 2022). Apache Sentry. Retrieved from https://www.cloudera.com/content/dam/www/marketing/resources/datasheets/sentry-datasheet.pdf.landing.html
Cloudera. (July 26, 2022). Products: Hadoop. Retrieved from https://es.cloudera.com/products/open-source/apache-hadoop.html
Douglas da Silva, Web Content & SEO Associate, LATAM. (February 19, 2021). ¿Qué es el Big Data y para qué sirve? Retrieved from the Zendesk blog: https://www.zendesk.com.mx/blog/big-data-que-es/
Susanne Lindholm. (April 11, 2017). Macrodatos e inteligencia de datos, alternativas a big data. Retrieved from https://www.fundeu.es/recomendacion/macrodatosalternativa-abig-data-1582/
