
White Paper

BlueData Software Architecture


BlueData EPIC™ version 3.2

The BlueData EPIC software platform introduces patent-pending innovations that leverage Docker container technology to deliver the agility and efficiency benefits of Big-Data-as-a-Service (BDaaS) either on-premises, in the public cloud, or in a hybrid environment.

www.bluedata.com
Table of Contents
1. BlueData EPIC 1
Key Features and Benefits 1
App Store 2
2. BlueData EPIC Architecture 3
Software Components 3
3. Tenants 5
Users and Roles 6
User Authentication 7
4. Virtual Compute and Memory 9
Virtual Cores and RAM 9
Virtual Node Flavors 10
5. Storage 12
About DataTaps 12
On-Premises Tenant Storage 13
Node Storage 13
Application Path Inputs 13
6. Networking 14
Networks and Subnets 14
Gateway Hosts 17
7. Cluster Management 19
Cluster Creation 19
Isolated Mode 19
Host Tags 21
High Availability 22
Platform High Availability 22
Virtual Cluster High Availability 23
Gateway Host High Availability 23
Appendix 24
Definitions 24
Hadoop and Spark Support 25

1. BlueData EPIC
The BlueData Elastic Private Instant Clusters platform (referred to as "EPIC" throughout this white paper) allows users to provide Big-Data-as-a-Service either on-premises, in the public cloud, or in a hybrid environment. This white paper provides an in-depth overview of the main components and architecture for version 3.2 of the BlueData EPIC software platform.

Virtualization and container technologies have already introduced flexibility, agility, and reduced costs to most applications in the enterprise data center. BlueData EPIC uses Docker containers to extend these benefits to Big Data by allowing enterprises to create, modify, and remove virtual clusters on demand without sacrificing performance.

With the BlueData EPIC software platform, enterprises can simultaneously run hundreds of Big Data workloads with automated policy-based scheduling and self-provisioning. Distributed applications are efficient and elastic, thanks to EPIC's proprietary application-sensitive caching, data path, network optimization, and policy-based automation and management. IT administrators use a single interface to monitor Hadoop or Spark clusters, jobs, and infrastructure status. EPIC also automates routine tasks such as provisioning, updates, and monitoring.

BlueData EPIC dramatically reduces Big Data deployment complexity while improving business agility by providing an elastic self-service infrastructure that reduces time-to-value from months to days, while reducing overall costs by 50%-75% compared to traditional, non-virtualized Hadoop and Spark deployments. Users create virtual clusters on demand and execute jobs without ever having to worry about the underlying infrastructure.

Key Features and Benefits
The key features and benefits of BlueData EPIC include:
• Integrated platform for Big Data analytics and data science: EPIC is an infrastructure platform purpose-built for Big Data, with enterprise-grade security, networking, and support for a variety of local and remote storage options.
• Runs on-premises and/or on Amazon Web Services (AWS): EPIC can be deployed on-premises, on AWS, or in a hybrid environment that includes both AWS and on-premises resources.
Note: BlueData EPIC is not yet generally available for Microsoft Azure and Google Cloud Platform. Please inquire about early access by visiting www.bluedata.com/multi-cloud.
• Create virtual clusters: EPIC duplicates the functionality of physical clusters while adding flexibility and scalability at reduced cost. For example, each virtual cluster can run a different Hadoop or Spark distribution. You may create, modify, re-prioritize, and remove virtual clusters on demand to respond to ever-changing needs within individual business units/departments. EPIC reduces time-to-value from months to hours.
• Multi-tenancy and enterprise-grade security model: EPIC integrates with enterprise LDAP and Active Directory authentication systems. Administrators can then create groupings of users and resources that restrict access to jobs, data, or clusters based on departments and/or roles. The result is an integrated, secure, multi-tenant infrastructure.
• Self-service portal: EPIC includes a self-service web portal that allows users to create and manage clusters, create and manage nodes, run Big Data jobs, and view monitoring statistics. User visibility and available functionality vary based on each user's role and tenant, in accordance with existing security policies. For example, department administrators can use the portal to provision nodes/clusters to run their own Big Data applications without impacting nodes/clusters that are assigned to different departments and without having to manage the physical infrastructure.
• RESTful API: EPIC supports a RESTful API that surfaces programmable access to the same capabilities as the self-service portal.
• Superior performance: EPIC provides I/O optimization to deliver high performance for Big Data applications, without the penalties commonly associated with virtualization or containerization. The CPU cores and RAM in each host are pooled and then partitioned into virtual resource groups based on tenants. When using AWS, EPIC runs on a curated set of Amazon EC2 instance flavors that allows you to get just the right amount of computing power needed to run jobs. You only pay for the number and type(s) of EC2 instances required at any given time, while ensuring high performance.
• Works with existing infrastructure: EPIC leverages your data center investments by allowing your enterprise to repurpose its existing infrastructure to run Big Data deployments. EPIC supports your physical or virtualized infrastructure, including commodity x86 servers and virtual machines, as well as AWS. Existing storage protocols are also supported (HDFS, HDFS with Kerberos, and NFS).

• Reduced IT overhead: EPIC streamlines operations and reduces IT costs by automating provisioning, unifying management, and supporting push-button upgrades.
• Increases utilization while lowering costs: EPIC delivers hardware and operational cost savings while simultaneously eliminating the complexity of managing multiple physical clusters. When running on AWS, EPIC allows clusters to be turned on and off at will, meaning that you only pay for the AWS instances that you need when you need them.
• High Availability: EPIC supports three levels of High Availability to provide redundancy and protection, as described in "High Availability" on page 22.
• Compute and storage separation for Big Data: EPIC decouples analytical processing from data storage, giving you the ability to independently scale compute and storage instantly on an as-needed basis. This enables more effective utilization of infrastructure resources and reduces overall costs.
• In-place access to on-premises enterprise storage and Amazon S3: EPIC allows you to access and run Big Data jobs directly against both existing enterprise-class storage systems and S3 in the AWS cloud. The separation of compute and storage provided by EPIC means that you don't need to move or duplicate data before running analytics.

App Store
The BlueData EPIC platform includes an App Store with one-click deployment for common Big Data applications, such as:
• Open-source Apache Hadoop distributions from Cloudera, Hortonworks, and MapR.
• Open-source Apache Spark with support for the standalone resource manager and YARN.
• The open-source Apache Kafka messaging system.
• The open-source Cassandra NoSQL database from DataStax.
• Integration with JupyterHub, RStudio Server, and Zeppelin notebooks, as well as software applications such as AtScale, Datameer, H2O, Splunk, and others.
The App Store contains Docker container images of each available product, allowing fully automated self-service deployment. Each image in the App Store provides a particular version, is pre-configured, and is ready to run on the EPIC platform. EPIC also supports a "bring your own app" model that allows users to quickly add images for any Big Data application or data processing platform to the App Store.
The App Store contains three classes of images:
• Major Hadoop, Spark, and other Big Data products provided out-of-the-box by BlueData. These images contain open-source bits of major distributions that are both unmodified and supported by the distribution vendors.
• ETL, business intelligence, and data science tools supported out-of-the-box by BlueData. These include images for open-source web-based notebooks as well as commercial Big Data analytics tools. The images for commercial applications have been validated by BlueData and are available in the App Store via partnerships with these companies.
• Custom distributions and applications added specifically by individual customers. BlueData provides an Application Workbench that allows customers to create and add their own images to the App Store. Users can then deploy these images and use them in a similar way as any of the out-of-the-box images described above.
Note: App Store images are independent of BlueData EPIC software. Any distribution or application can be added to or removed from your EPIC platform to suit your specific needs.
The Platform Administrator may install or uninstall images. Installed images are available for use by Tenant Members when creating jobs and clusters.
BlueData and/or application vendors may provide new images or new versions of existing images. If the EPIC Controller host can access the internet and a new version becomes available for an image that is currently installed, the image will be marked in the App Store with an Upgrade Available banner, and its tile will provide a button for upgrading to the new version. Other new images or new versions of currently uninstalled images will display a New banner.
See "Hadoop and Spark Support" on page 25 for examples of supported Big Data application services and Hadoop/Spark ecosystem products.

2. BlueData EPIC Architecture
This section describes the BlueData EPIC architecture, user hierarchy, DataTaps, and other important EPIC concepts.

Software Components
BlueData EPIC is an enterprise-grade software platform that forms a layer between the underlying infrastructure and Big Data applications, transforming that infrastructure into an agile and flexible platform for virtual clusters running on Docker containers.
EPIC consists of three key capabilities:
• ElasticPlane™ is a self-service web portal interface that spins up virtual Hadoop or Spark clusters on demand in a secure, multi-tenant environment.
• IOBoost™ provides application-aware data caching and tiering to ensure high performance for virtual clusters running Big Data workloads.
• DataTap™ accelerates time-to-value for Big Data by allowing in-place access to any storage environment, thereby eliminating time-consuming data movement.

Figure 1: BlueData EPIC software architecture. (The figure shows Tenant Administrators, data scientists, developers, data engineers, and data analysts using BI/analytics and bring-your-own tools on top of the BlueData EPIC platform (ElasticPlane, IOBoost, and DataTap), which runs on on-premises and public cloud compute (including EC2) and storage (NFS, HDFS, S3) and is managed by the Platform Administrator; the numbered callouts 1-11 are described below.)

The high-level EPIC architecture is as follows (numbers correspond to the callouts in Figure 1, above):
• Data Source (1): This is where EPIC reads and writes persistent job data required by the tenants and virtual clusters. A data source is typically a DataTap: a shortcut within EPIC that points to existing remote data storage locations on your network. A special instance of DataTap, constructed from storage local to the EPIC hosts, is known as tenant storage. The use of DataTaps reduces or even eliminates the need to copy large volumes of data to and from the virtual clusters before and after running jobs, thus saving time and reducing network traffic. Please see "About DataTaps" on page 12 for more about how EPIC handles data storage.
• Cluster file system (2): This is where EPIC reads and writes temporary data that is generated while running jobs. The cluster file system is built within the virtual cluster, on storage taken from the node storage space of the underlying EPIC host (on-premises) and/or an Amazon S3 bucket (on AWS).

• Unique file directories for each tenant (3): EPIC automatically creates a sandboxed shared-storage area for each tenant within the tenant storage space of the EPIC platform, whether on-premises or on AWS in S3. This per-tenant storage can be used to isolate data that should be accessible by only one tenant. Optionally, it can also be used to enforce a quota on the tenant's use of that space.
• Platform Administrator (4): One or more Platform Administrator(s) handle overall EPIC administration, including managing hosts and creating tenants.
• Controller host (5): The Controller host is the host where you initially install EPIC. This host controls the rest of the hosts in the EPIC platform.
- If you are installing EPIC 3.2 on-premises, then the Controller host will be located on your own infrastructure (on a physical server or virtual machine). In this case, EPIC runs on either the Red Hat Enterprise Linux or CentOS operating system, version 6.8 (minimum), version 7.3, or version 7.4 (recommended). You must have the operating system installed on all of the nodes that you will be using for EPIC before beginning the installation process.
- If you are installing EPIC on AWS, then the Controller host will be located on an Amazon EC2 instance.
• Worker hosts (6): Worker hosts are under the direct control of the Controller host. EPIC dynamically allocates the resources on these hosts to the virtual clusters and jobs within each tenant as needed, based on user settings and resource availability. This dynamic resource allocation means that EPIC achieves a much higher host utilization rate than traditional Hadoop and Spark deployments. If you have EPIC installed on-premises with High Availability enabled, then two of these Worker hosts will have additional roles, as described in "High Availability" on page 22.
- If you are installing EPIC on-premises, then the Worker host(s) will be located either on your infrastructure (physical server or virtual machine) or on Amazon EC2 instances (for a hybrid deployment). Each Worker host that is located on your infrastructure can manage multiple containers/virtual nodes. For example, a cluster that consists of four nodes could potentially be located on a single Worker host. Worker hosts that are located on Amazon EC2 instances can support a single virtual node per host, as described below.
- If you are installing EPIC on AWS, then the Worker host(s) will be located on Amazon EC2 instances, where each instance will appear as a virtual node in EPIC. In this case, there is no distinction between a "Worker host" and a "virtual node." In this example, a cluster that consists of four nodes will be located on four EC2 instances.
• EPIC Platform (7): The EPIC platform consists of the EPIC services that are installed on each of the hosts. EPIC handles all of the back-end virtual cluster management for you, thereby eliminating the need for complex, time-consuming IT support. Platform and Tenant Administrator users can perform all of these tasks in moments using the EPIC web portal.
• Role-based security (8): EPIC includes three user roles (Platform Administrator, Tenant Administrator, and Member) that allow you to control who can see certain data and perform specific functions. Roles are granted on a per-tenant basis, meaning that you can either restrict users to a single tenant or grant access to multiple tenants. Each user may have at most one role per tenant.
• Tenant Administrators (9): A Tenant Administrator manages the resources assigned to that tenant. Each tenant must have at least one user with the Tenant Administrator role.
• Clusters (10): Clusters can be created to run a wide variety of Big Data applications, services, and jobs.
• Tenants (11): Tenants allow you to restrict EPIC access as needed, such as by department. Each tenant has its own unique sets of authorized users, DataTaps, applications, and virtual clusters that are never shared with any other tenant. Users with access to one tenant cannot access or modify any aspect of another tenant unless they have also been assigned a role (Tenant Administrator or Member) on that tenant.

3. Tenants
Tenants are created by the Platform Administrator after installing BlueData EPIC. The infrastructure resources (e.g. CPU, memory, storage) available on the BlueData EPIC platform are split among the tenants on the platform. Each tenant can be configured to use either on-premises resources or AWS resources. Each tenant is allocated a set of resources, and access to that tenant's data is restricted to only those users authorized to access the tenant. Resources used by one tenant cannot be used by another tenant. All users who are members of a tenant can access the resources and data objects available to that tenant.
You will need to decide how to create tenants to best suit your organizational needs, such as by:
• Office location: If your organization has multiple office locations, you could choose to create one or more tenants per location. For example, you could create a tenant for the San Francisco office and one for the New York office. EPIC does not take location into account; this is just an example of how you could use a tenant.
• Department: You could choose to create one or more tenants for each department. For example, you could create one tenant each for the Manufacturing, Marketing, Research & Development, and Sales departments.
• Use cases, application lifecycle, or tools: Different use cases for Big Data analytics and data science may have different image/resource requirements.
• Combination: You could choose to create one tenant per department for each location. For example, you could create a tenant for the Marketing department in San Francisco and another tenant for the Marketing department in New York.
Some of the factors to consider when planning how to create tenants include:
• Structure of your organization: This may include such considerations as the department(s), team(s), and/or function(s) that need to be able to run jobs.
• Location of data: If the data to be accessed by the tenant resides in Amazon S3 storage on AWS, then the tenant should be configured to use Amazon EC2 compute resources. If the data to be accessed by the tenant resides on-premises, then the tenant can be configured to use either on-premises or Amazon EC2 compute resources.
• Use cases/tool requirements: Different use cases for Big Data analytics and data science may have different image/resource requirements.
• Seasonal needs: Some parts of your organization may have varying needs depending on the time of year. For example, your Accounting department may need to run jobs between January 1 and April 15 each year but have little to no need at other times of the year.
• Amount and location(s) of hosts: The number and location(s) of the hosts that you will use to deploy an EPIC platform may also be a factor. If your hosts are physically distant from the users who need to run jobs, then network bandwidth may become an important factor as well.
• Personnel who need EPIC access: The locations, titles, and job functions of the people who will need to be able to access EPIC at any level (Platform Administrator, Tenant Administrator, or Member) may influence how you plan and create tenants.
• IT policies: Your organization's IT policies may play a role in determining how you create tenants and who may access them.
• Regulatory needs: If your organization deals with regulated products or services (such as pharmaceuticals or financial products), then you may need to create additional tenants to safeguard regulated data and keep it separate from non-regulated data.
These are just a few of the possible criteria to evaluate when planning how to create tenants. EPIC has the power and flexibility to support the tenants you create regardless of the schema you use. You may create, edit, and delete tenants at any time. However, careful planning of how you will use your EPIC platform, including the specific tenant(s) your organization will need now and in the future, will help you better plan your entire EPIC installation, from the number and type of hosts to the tenants you create once EPIC is installed on those nodes.

Users and Roles
A user consists of the following components:
• Login credentials (username and password)
• Role
Some of the user-related things you must consider when planning and maintaining your EPIC installation include:
• Tenants: The number of tenants and the function(s) each tenant performs will determine how many Tenant Administrator users you will need and, by extension, the number of Member users you will need for each tenant. The reverse is also true, because the number and functions of users needing to run jobs can influence how you create tenants. For example, different levels of confidentiality might require separate tenants.
• Job functions: The specific work performed by each user will directly impact the EPIC role they receive. For example, a small organization may designate a single user as the Tenant Administrator for multiple tenants, while a large organization may designate multiple Tenant Administrators per tenant.
• Security clearances: You may need to restrict access to information based on each user's security clearance. This can impact both the tenant(s) a user has access to and the role that user has within the tenant(s).
Each user may have a maximum of one role per tenant. For example, if your EPIC installation consists of 20 tenants, then each user may have up to 20 separate roles. A user with more than one role may be a Member of some tenants and a Tenant Administrator of other tenants.
Figure 10 lists the specific functions that can be performed within EPIC and the role(s) that can perform each of those actions. In this table:
• Permission stands for the right to perform a given action. Users with specific roles receive specific permissions within EPIC.
• PA stands for the Platform Administrator role.
• TA stands for the Tenant Administrator role.
• M stands for the Member (non-administrative) role.
• An X in a column means that a user with the indicated role can perform the indicated action.
• A blank entry means that a user with the indicated role cannot perform the indicated action.

Figure 10: EPIC permissions by platform/tenant role

Permission                                    PA   TA   M
View a tenant                                 X    X
Create a tenant                               X
Edit a tenant                                 X
Delete a tenant                               X
Add an EPIC user                              X
Remove an EPIC user                           X
Grant a user platform role                    X
Revoke a user platform role                   X
Grant a user tenant role                      X    X
Revoke a user tenant role                     X    X
View detailed host information                X
Manage user authentication                    X
View job information                          X    X    X
Enter/exit Lockdown mode                      X
Add a job                                          X    X
Delete a job                                       X    X
View virtual cluster information              X    X    X
Add a virtual cluster (no constraints)        X    X
Add a virtual cluster w/constraints                X    X
Access a cluster in Isolated Mode*                 X    X
Edit a virtual cluster                             X    X
Delete a virtual cluster                           X    X
Create, edit, delete cluster templates             X    X
Run ActionScripts in a cluster                     X    X
View detailed DataTap information             X    X
Add a DataTap                                      X
Edit a DataTap                                     X
Remove a DataTap                                   X
View summary DataTap information                   X    X
View virtual node information                      X    X
Manage EPIC platform configuration            X
Create, edit, delete flavor definitions       X
Install/uninstall/upgrade App Store images    X
Add and manage EPIC Worker hosts              X
View global usage/health/metrics              X
View tenant resource usage                    X    X

*The ability to access isolated clusters depends on the tenant Cluster Superuser Privilege setting.

User Authentication
Each user has a unique username and password that they must provide in order to log in to BlueData EPIC. Authentication is the process by which EPIC matches the user-supplied username and password against the list of authorized users and determines both:
- whether to grant access (stored either in the local user database server or in the remote LDAP/Active Directory server), and
- what exact access to allow, in terms of the specific role(s) granted to that user (stored on the EPIC Controller node).
User authentication information is stored on a secure server. EPIC can authenticate users using any of the following methods:
• Internal user database.
• Your existing LDAP or Active Directory server, which you can connect to using Direct Bind or Search Bind.

Accessing EPIC (Non-SSO)
The non-SSO user authentication process is identical when using either the internal EPIC user database or an external LDAP/AD server:
1. A user accesses the EPIC Login screen using a Web browser pointed to the IP address of the Controller host.
2. The user enters her or his username and password in the appropriate fields and attempts to log in.
3. EPIC passes the user-supplied username and password to the authentication server.
4. The authentication server returns a response that indicates either a valid (allow the user to log in) or invalid (prevent the user from logging in) login attempt.
5. If the login attempt is valid, EPIC will match the user with the role(s) granted to that user and allow the proper access.
Using the internal user database included with EPIC is fast and convenient from an IT perspective. However, it may complicate user administration for various reasons, such as:
• A user may be required to change their password on the rest of the network, but this change will not be reflected in EPIC.
• A user who is removed from the network (such as when they leave the organization) must be removed from EPIC separately.
Connecting EPIC to your existing user authentication server requires you to supply some information about that server when installing EPIC. Contact your user administrator for the following information:
• LDAP: LDAP Host, User Attribute, User Subtree DN
• Active Directory: AD Host, User Attribute, User Subtree DN

Accessing EPIC (SSO)
Single Sign-On (SSO) allows users to supply login credentials once, and then gain access to all authorized resources and applications without having to log in every time. When BlueData EPIC is configured for SSO, authorized users will proceed directly to the Dashboard screen without having to log in. From there, users can access cluster services (such as Hue), if needed.
Configuring EPIC for SSO requires both of the following:
• A metadata XML file that is provided by the Identity Provider (IdP).
• An XPath to a location in the SAML response that will contain the LDAP/AD username of the user, such as //saml:AttributeStatement/saml:Attribute[@Name="PersonImmutableID"]/saml:AttributeValue/text()
You can then use LDAP/AD groups to assign roles.
If platform HA is not enabled for the EPIC platform, then the hostname of the EPIC Controller host must be mapped to the IP address of the Controller host via a DNS server that can be accessed by the user. This allows a user-initiated browser GET request to correctly resolve to the Controller host. For EPIC platforms with platform HA enabled, this will be a hostname that resolves to the cluster IP address. Figure 2 shows the DNS name resolution process.

Figure 2: DNS name resolution process. (The flowchart resolves a user request as follows: if platform HA is not enabled, the request resolves to the Controller hostname; if platform HA is enabled, it resolves to the cluster IP address.)

The IdP must be configured with the following information:

• Audience: This field is not required; however, providing the base URL of the SAML server is more secure than a blank entry. If you do enter a URL, then this URL must exactly match the SAML Application Name that you will specify in EPIC.
• Recipient: [<hostname>|<ip_address>]/bdswebui/login, where <hostname> is the name of either the Controller host or HA cluster, as appropriate, and <ip_address> is either the Controller or cluster IP address. Use either a hostname or an IP address, but not both. For example, controllername/bdswebui/login or 10.32.1.10/bdswebui/login.
• Consumer URL Validator: Enter <epic_info>/bdswebui/login/, where <epic_info> is either:
- .* - This is a valid generic entry, but is less secure. For example, .*/bdswebui/login/.
- <name or ip>, which will be either the FQDN or IP address of either the Controller host or cluster, as described above. This entry is more secure than the generic entry. For example, 10.32.0.75/bdswebui/login/ or epic-01.organization.com/bdswebui/login/.
• Consumer URL: <epic_info>/bdswebui/saml_login, where <epic_info> is either a generic or specific entry, as described above.
The IdP will provide a SAML IdP XML metadata file that you will use when configuring EPIC for SSO.
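For illustration, the following minimal sketch (not part of EPIC) shows how an XPath like the one above pulls the LDAP/AD username out of a SAML assertion; the attribute name PersonImmutableID is just the example used earlier, and your IdP may emit a different one.

```python
# Illustrative only: extract the username attribute from a SAML assertion
# using the XPath shown above. "PersonImmutableID" is the example attribute
# name from this paper; substitute whatever your IdP actually sends.
from lxml import etree

NSMAP = {"saml": "urn:oasis:names:tc:SAML:2.0:assertion"}

def extract_username(saml_assertion_xml: bytes) -> str:
    root = etree.fromstring(saml_assertion_xml)
    values = root.xpath(
        '//saml:AttributeStatement/'
        'saml:Attribute[@Name="PersonImmutableID"]/'
        'saml:AttributeValue/text()',
        namespaces=NSMAP,
    )
    if not values:
        raise ValueError("username attribute not found in SAML assertion")
    return str(values[0]).strip()
```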

4. Virtual Compute and Memory
The BlueData EPIC software platform allows you to deploy multiple virtual clusters using fully managed and embedded Docker containers. A virtual cluster is a collection of virtual nodes (Docker containers) that are available to a specific tenant. Each virtual node is allocated virtual CPU and memory, as described in this section. This section also describes how EPIC uses virtual node flavors.

Virtual Cores and RAM
BlueData EPIC models virtual CPU cores and memory differently based on whether your hosts are located on-premises in your datacenter, or on AWS.

On-Premises Hosts
For EPIC hosts that reside on your premises, each virtual node within a virtual cluster consumes system CPU and memory resources, according to the virtual CPU (vCPU) core count and amount of RAM specified for that virtual node. The EPIC platform license is based on the maximum number of CPU cores managed. EPIC manages the total amount of available vCPU cores and RAM aggregated across all of the hosts. Each tenant may be assigned a quota that limits the total number of vCPU cores, the amount of RAM, and the amount of storage that are available for use by the clusters within that tenant. A tenant's ability to utilize its entire quota of resources is limited by both the EPIC license and the availability of physical resources.
EPIC grants vCPU cores and RAM to clusters on a first-come, first-served basis. Each cluster requires CPU and RAM resources in order to run, based on the number and size of its component nodes. Cluster creation can only proceed if the total resources assigned to that cluster will not exceed the tenant quota and if a corresponding amount of resources is free in the system.
EPIC models vCPU cores as follows:
• The number of available vCPU cores is the number of physical CPU cores multiplied by the CPU allocation ratio specified by the Platform Administrator. For example, if the hosts have 40 physical CPU cores and the Platform Administrator specifies a CPU allocation ratio of 3, then EPIC will display a total of 120 available cores. EPIC allows an unlimited number of vCPU cores to be allocated to each tenant. The collective core usage for all nodes within a tenant will be constrained by either the tenant's assigned quota or the available cores in the system, whichever limit is reached first. The tenant quotas and the CPU allocation ratio act together to prevent tenant members from overloading the system's CPU resources.
• When two nodes are assigned to the same host and contend for the same physical CPU cores, EPIC allocates resources to those nodes in a ratio determined by their vCPU core count. For example, a node with 8 cores will receive twice as much CPU time as a node with 4 cores.
• The Platform Administrator can also specify a Quality of Service (QOS) multiplier for each tenant. In the case of CPU resource contention, the node core count is multiplied by the tenant QOS multiplier when determining the CPU time it will be granted. For example, a node with 8 cores in a tenant with a QOS multiplier of 1 will receive the same CPU time as a node with 4 cores in a tenant with a QOS multiplier of 2. The QOS multiplier is used to describe relative tenant priorities when CPU resource contention occurs; it does not affect the overall cap on CPU load established by the CPU allocation ratio and tenant quotas.
EPIC models RAM as follows:
• The total amount of available RAM is equal to the amount of unreserved RAM in the EPIC platform. Unreserved RAM is the amount of RAM remaining after reserving some memory in each host for EPIC services. For example, if your EPIC platform consists of four hosts that each have 128GB of physical RAM with 110GB of unreserved RAM, the total amount of RAM available to share among EPIC tenants will be 440GB.
• EPIC allows an unlimited amount of RAM to be allocated to each tenant. The collective RAM usage for all nodes within a tenant will be constrained by either the tenant's assigned quota or the available RAM in the system, whichever limit is reached first.
During tenant creation, EPIC will suggest some default values. These default values will be 25% of the total system vCPU cores or RAM, respectively. The total available amount of that resource is displayed for comparison below each field. You may edit these quota values or delete a value and leave the field blank to indicate that the tenant does not have a quota defined for that resource.
Assigning a quota of vCPU cores and/or RAM to a tenant does not reserve those resources for that tenant when that tenant is idle (not running one or more clusters). This means that a tenant may not actually be able to acquire system resources up to its configured maximum CPU cores and RAM.
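The arithmetic described above can be summarized in a short sketch; the numbers are the ones used in this section's examples, and the functions are illustrative only (they are not EPIC code):

```python
# Illustrative only: the on-premises resource arithmetic described above.

def available_vcpus(physical_cores: int, allocation_ratio: int) -> int:
    """vCPU cores EPIC advertises: physical cores times the allocation ratio."""
    return physical_cores * allocation_ratio

def available_ram_gb(host_count: int, unreserved_gb_per_host: int) -> int:
    """RAM shareable among tenants after EPIC reserves memory on each host."""
    return host_count * unreserved_gb_per_host

def contention_weight(vcpu_cores: int, qos_multiplier: int) -> int:
    """Relative CPU-time weight of a node when physical cores are contended."""
    return vcpu_cores * qos_multiplier

print(available_vcpus(40, 3))      # 120 cores displayed, per the example above
print(available_ram_gb(4, 110))    # 440 GB shareable RAM, per the example above
# An 8-core node at QOS 1 gets the same CPU time as a 4-core node at QOS 2:
print(contention_weight(8, 1) == contention_weight(4, 2))   # True
```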

You may assign a quota for any amount of vCPU cores and/or GB of RAM to any tenant(s) regardless of the actual number of available system resources. A configuration where the total allowed tenant resources exceed the current amount of system resources is called over-provisioning. Over-provisioning occurs when one or more of the following conditions are met:
• You only have one tenant, which has quotas that either exceed the system resources or are undefined. This tenant will only be able to use the resources that are actually available to the EPIC platform; this arrangement is just a convenience to make sure that the one tenant is always able to fully utilize the platform, even if you add more hosts in the future.
• You have multiple tenants where none have overly large or undefined quotas, but where the sum of their quotas exceeds the resources available to the EPIC platform. In this case, you are not expecting all tenants to attempt to use all their allocated resources simultaneously. Still, you have given each tenant the ability to claim more than its "fair share" of the EPIC platform's resources when these extra resources are available. In this case, you must balance the need for occasional bursts of usage against the need to restrict how much a "greedy" tenant can consume. A larger quota gives more freedom for burst consumption of unused resources while also expanding the potential for one tenant to prevent other tenants from fully utilizing their quotas.
• You have multiple tenants where one or more has overly large and/or undefined quotas. Such tenants are trusted or prioritized to be able to claim any free resources; however, they cannot consume resources being used by other tenants.
Note: Over-provisioning is useful in certain situations; however, avoiding over-provisioning prevents potential resource conflicts by ensuring that all tenants are guaranteed the maximum number of allocated virtual CPU cores and RAM.

AWS Instances
Each virtual node within a cluster is built on top of an Amazon EC2 instance and consumes CPU and memory resources, according to the virtual CPU (vCPU) core count and amount of RAM specified for that node. The EPIC platform is licensed for a maximum number of EC2 instances. The flavor(s) used by these instances determine the total number of vCPU cores and GB of RAM that are available to the EPIC platform. Each tenant may also be assigned a quota that limits the total number of vCPU cores, RAM, and/or storage that are available to the clusters within that tenant.
EPIC grants EC2 instances (and the vCPU cores and RAM within those instances) to clusters on a first-come, first-served basis. Each cluster requires a specified number of instances in order to run, based on the number of virtual nodes added when that cluster was created (one virtual node equals one instance). Cluster creation can only proceed if the total resources assigned to that cluster will not exceed the tenant quota, if a corresponding amount of resources is free in the system, and if creating the cluster will not exceed the licensed maximum number of instances for the EPIC platform.
The number of available vCPU cores is the number of CPU cores per instance multiplied by the number of instances assigned to the cluster. For example, if each instance has 16 vCPU cores and a Member creates a cluster with five instances, then that cluster will have a total of 80 vCPU cores (16*5). The collective core usage for all nodes within a tenant will be constrained by either the tenant's assigned quota or by the total number of licensed instances available to the EPIC platform, whichever limit is reached first.
The total amount of available RAM is equal to the amount of RAM per instance multiplied by the number of instances assigned to the cluster. For example, if each instance has 48GB of RAM and a Member creates a cluster with five instances, then that cluster will use a total of 240GB of RAM (48*5) from the EPIC platform. EPIC allows an unlimited amount of RAM to be allocated to each tenant, up to the licensed maximum for the entire EPIC platform. The collective GB of RAM for all nodes within a tenant will be constrained by either the tenant's assigned quota or by the total GB of RAM of all of the EC2 instances in the EPIC platform, whichever limit is reached first.
Assigning a quota of vCPU cores and/or RAM to a tenant does not reserve those resources for that tenant when that tenant is idle (not running one or more clusters). This means that a tenant may not actually be able to acquire system resources up to its configured maximum CPU cores and RAM.

Virtual Node Flavors
BlueData EPIC models virtual node flavors based on where hosts are located, as follows:
• On-premises hosts: If the hosts are located on your premises, then a virtual node flavor defines the number of vCPU cores and the amount of RAM used by a virtual node. For example, if the flavor "small" specifies a single vCPU core and 3GB of RAM, then all virtual nodes created with the "small" flavor will have those specifications, subject to the rules described in Virtual Cores and RAM. EPIC creates a default set of flavors (such as Small, Medium, and Large) during installation.
• AWS instances: If the hosts are located on AWS, then a virtual node flavor defines the Amazon template and storage

capacity used by a virtual node. For example, if the flavor
uses the Amazon c4.4xlarge template, then each instance
created using that flavor will have 16 CPU cores and 30GB of
RAM, with either the default or user-specified amount of
storage. EPIC creates a default set of flavors (such as Small,
Medium, Large, Extra Large, and Extra Extra Large) during
installation.
Note: Flavors are tenant-specific. A flavor that is created or edited in
one tenant will not affect a flavor with the same name in another
tenant.
The Platform Administrator should create flavors with virtual
hardware specifications appropriate to the clusters that tenant
members will create. Application characteristics will guide these
choices, particularly the minimum virtual hardware requirements
per node. Using nodes with excessively large specifications will
waste resources (and count toward a tenant's quota). It is
therefore important to define a range of flavor choices that closely
match user requirements.
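As a rough illustration of what a set of flavor definitions might look like (the names and sizes below are assumptions based on the examples in this section, not EPIC's actual defaults), consider the following sketch:

```python
# Illustrative flavor definitions; names and sizes are assumptions, not EPIC defaults.
# On-premises flavors pin vCPU cores and RAM directly, while AWS flavors map to
# an EC2 instance type plus an optional root disk size.
ON_PREM_FLAVORS = {
    "Small":  {"vcpu_cores": 1, "ram_gb": 3},
    "Medium": {"vcpu_cores": 4, "ram_gb": 16},
    "Large":  {"vcpu_cores": 8, "ram_gb": 32},
}

AWS_FLAVORS = {
    # c4.4xlarge provides 16 CPU cores and 30 GB RAM, per the example above.
    "Large": {"ec2_instance_type": "c4.4xlarge", "root_disk_gb": 100},
}

def cluster_fits_quota(flavor: dict, node_count: int,
                       quota_cores: int, quota_ram_gb: int) -> bool:
    """Check whether node_count nodes of this flavor fit within a tenant quota."""
    return (flavor["vcpu_cores"] * node_count <= quota_cores
            and flavor["ram_gb"] * node_count <= quota_ram_gb)

# Five Medium nodes (20 cores, 80 GB) fit within a 24-core / 96 GB quota:
print(cluster_fits_quota(ON_PREM_FLAVORS["Medium"], 5, 24, 96))   # True
```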
The Platform Administrator may freely edit or delete these
flavors. When editing or deleting a flavor:
• If you edit or delete an existing flavor, then all virtual nodes
using that flavor will continue using the flavor as specified
before the change or deletion. EPIC displays the flavor
definition being used by clusters.
• You may delete all of the flavors defined within your EPIC
installation; however, if you do this, then you will be unable to
create any clusters until you create at least one new flavor.
• You may specify a root disk size when creating or editing a
flavor. This size overrides the size specified by the image in
the App Store and/or the default Amazon template storage
size, as appropriate. Specifying a root disk size that is smaller
than the minimum size indicated by a given image will
prevent you from being able to instantiate that image on a
cluster that uses that flavor. Creating a larger root disk size
will slow down cluster creation. This may be necessary in
situations where you are using the cluster to run an
application that uses a local file system.
Note: Consider using DataTaps (see “About DataTaps” on page 12)
where possible for optimal performance.

5. Storage
This section describes how BlueData EPIC uses DataTaps, tenant storage, and node storage.

About DataTaps
DataTaps expand access to shared data by specifying a named path to a specified storage resource. Big Data jobs within EPIC virtual clusters can then access paths within that resource using that name. This allows you to run jobs using your existing data systems without the need to make time-consuming copies or transfers of your data. Tenant Administrator users can quickly and easily build, edit, and remove DataTaps using the DataTaps screen. Tenant Member users can use DataTaps by name.
Each DataTap consists of some of the following properties, depending on its type (NFS, HDFS, or HDFS with Kerberos):
• Name: Unique name for each DataTap. This name may contain letters (A-Z or a-z), digits (0-9), and hyphens (-), but may not contain spaces. You can use the name of a valid DataTap to compose DataTap URIs that you give to applications as arguments. Each such URI maps to some path on the storage system that the DataTap points to. The path indicated by a URI might or might not exist at the time you start a job, depending on what the application wants to do with that path. Sometimes the path must indicate a directory or file that already exists, because the application intends to use it as input. Sometimes the path must NOT be something that currently exists, because the application intends to create it. The semantics of these paths are entirely application-dependent, and are identical to their behavior when running the application on a physical Hadoop or Spark platform.
• Description: Brief description of the DataTap, such as the type of data or the purpose of the DataTap.
• Type: Type of file system used by the shared storage resource associated with the DataTap (HDFS or NFS). This is completely transparent to the end job or other process using the DataTap. In addition, you can use a DataTap to connect to a MapR FS; you can also configure a cluster to access an Amazon S3 bucket on AWS.
• Host: DNS name or address of the service providing access to the storage resource. For example, this could be the NameNode of an HDFS cluster.
• Share: For NFS DataTaps, this is the exported share on the selected host.
• Port: For HDFS DataTaps, this is the port for the NameNode server on the host used to access the data.
• Path: Complete path to the directory containing the data within the specified NFS share or HDFS file system. You can leave this field blank if you intend the DataTap to point at the root of the specified share/volume/file system.
• Standby NameNode: DNS name or IP address of a standby NameNode that an HDFS DataTap will try to reach if it cannot contact the primary host. This field is optional; when used, it provides high-availability access to the specified HDFS DataTap.
• Kerberos parameters: If the HDFS DataTap has Kerberos enabled, then you will need to specify additional parameters. EPIC supports two modes of user access/authentication.
- Passthrough mode passes the individual user's (client principal) Kerberos credentials from the compute Hadoop cluster to the remote HDFS cluster for authentication.
- Proxy mode permits a "proxy user" to be configured to have access to the remote HDFS cluster. Individual users are granted access to the remote HDFS cluster by the proxy user configuration. Mixing and matching distributions is permitted between the compute Hadoop cluster and the remote HDFS. This functionality duplicates the Kerberos interactions between compute and HDFS services in a co-located Hadoop deployment.
The storage pointed to by a BlueData DataTap can be accessed by a Map/Reduce job (or by any other Hadoop- or Spark-based activity in an EPIC virtual node) by using a URI that includes the name of the DataTap.
A DataTap points at the top of some directory hierarchy in the storage container. A URI of the following form can be used to access the root of that hierarchy:
dtap://data_connector_name
In this example, data_connector_name is the name of the DataTap that you wish to use. You can access files and directories further in the hierarchy by appending path components to the URI:
dtap://datatap_name/some_subdirectory/another_subdirectory/some_file
For example, the URI dtap://mydataconnector/home/mydirectory means that the data is located within the /home/mydirectory directory in the storage that the DataTap named mydataconnector points to.
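As a brief illustration (assuming a Spark cluster provisioned by EPIC and a DataTap named mydataconnector; the file and column names below are made up), a PySpark job can consume a DataTap URI exactly as it would an HDFS path:

```python
# Illustrative PySpark usage of a DataTap URI (assumes an EPIC-provisioned Spark
# cluster where the dtap:// scheme is available; file and column names are made up).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("datatap-example").getOrCreate()

# Read input that lives behind the DataTap, as if it were an ordinary HDFS path.
df = spark.read.csv("dtap://mydataconnector/home/mydirectory/input.csv", header=True)

# Write results back through the same DataTap.
df.groupBy("category").count() \
  .write.mode("overwrite") \
  .parquet("dtap://mydataconnector/home/mydirectory/output/")
```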

DataTaps exist on a per-tenant basis. This means that a DataTap created for Tenant A cannot be used by Tenant B. You may, however, create a DataTap for Tenant B with the exact same properties as its counterpart for Tenant A, thus allowing both tenants to use the same shared network resource. This allows jobs in different tenants to access the same storage simultaneously. Further, multiple jobs within a tenant may use a given DataTap simultaneously. While such sharing can be useful, be aware that the same cautions and restrictions apply to these use cases as for other types of shared storage: multiple jobs modifying files at the same location may lead to file access errors and/or unexpected job results.
Users who have a Tenant Administrator role may view and modify detailed DataTap information. Members may only view general DataTap information and are unable to create, edit, or remove a DataTap.
Note: Data conflicts may occur if more than one DataTap points to a location being used by multiple jobs at once.

On-Premises Tenant Storage
For tenants that reside on on-premises hosts, tenant storage is a storage location that is shared by all nodes within a given tenant. The Platform Administrator configures tenant storage while installing EPIC and can change it at any time thereafter. Tenant storage can be configured to use either a local HDFS installation (configured by BlueData EPIC on the host storage) or a remote HDFS or NFS system. Alternatively, you can create a tenant without dedicated storage.
Note: If all tenants are created using the same tenant storage service settings, then no tenant can access the storage space of any other tenant.
When a new tenant is created, that tenant automatically receives a DataTap called "TenantStorage" that points at a unique directory within the tenant storage space. This DataTap can be used in the same manner as other DataTaps, but it cannot be edited or deleted. This does not apply if no tenant storage has been defined (meaning that you selected None for tenant storage when installing EPIC).
The TenantStorage DataTap points at the top-level directory that a tenant can access within the tenant storage service. The Tenant Administrator can create or edit additional DataTaps that point at or below that directory; however, one cannot create or edit a DataTap that points outside the tenant storage on that particular storage service. (It is of course possible to create and edit DataTaps that point to external services.)
If the tenant storage is based on a local HDFS, then the Platform Administrator can specify a storage quota for each tenant. EPIC uses the HDFS back-end to enforce this quota, meaning that the quota applies to storage operations that originate from either the EPIC DataTap browser or the nodes within that tenant.

Node Storage
Node storage is based on each host in the EPIC platform and is used for the volumes that back the local storage for each virtual node. Using SEDs (Self-Encrypting Drives) will ensure that any data written to node storage is encrypted on write and decrypted on read by the OS. A tenant can optionally be assigned a quota for how much storage the nodes in that tenant can consume.
BlueData instances running on AWS utilize Amazon EBS (Elastic Block Storage) as node storage. The size of this storage can be specified in the flavor definition.

Application Path Inputs
Applications running on EPIC virtual clusters will typically accept input arguments that specify one or more storage paths. These paths identify files or directories that either currently exist or are intended to be created by the application. Each path can be specified in one of three ways:
• You can use a UNIX-style file path, such as /home/directory. This path refers to a location within the cluster file system. Remember that the cluster file system and all data stored therein will be deleted when the cluster is deleted.
• You can use a DataTap URI, such as dtap://mydatatap/home/mydirectory. This path refers to a location within the storage system pointed to by the named DataTap within the same tenant. A Hadoop or Spark application will "mount" a given DataTap before using any paths within that DataTap. If the DataTap changes or is deleted, a running application will not see the effects until the next time it attempts to mount that DataTap.
• You can use an s3a://<Bucket Name>/<folder name> path, such as s3a://logdata/2017/. This path refers to the location of the files in Amazon S3 storage that can be used as the input or output location for Hadoop or Spark jobs. External tables may also be created against files stored in Amazon S3 storage. BlueData auto-configures the S3A client in all Hadoop and Spark clusters. Access controls to specific S3 buckets are controlled by the AWS Instance Profile specified in the tenant definition.
Please see the technical white paper BlueData EPIC Storage Options for additional details about the storage design and configuration options for the BlueData EPIC software platform.
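To make the three path styles concrete, the short sketch below simply prints how the same hypothetical job could be pointed at each kind of location; the paths are the examples used above, and my_job.py is an assumed application name, not part of EPIC.

```python
# Illustrative only: the same job input argument can use any of the three
# path styles described above (paths taken from the examples in the text).
INPUT_PATHS = [
    "/home/directory",                     # cluster file system (removed when the cluster is deleted)
    "dtap://mydatatap/home/mydirectory",   # DataTap within the same tenant
    "s3a://logdata/2017/",                 # Amazon S3 via the auto-configured S3A client
]

for path in INPUT_PATHS:
    # "my_job.py" is a hypothetical application used only for this example.
    print(f"spark-submit my_job.py --input {path}")
```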

6. Networking
This section describes how BlueData EPIC manages networking, including public and private container networks, subnets, and Gateway hosts.

Networks and Subnets
EPIC operates on two networks, as shown in Figure 3:
• Network for the physical EPIC Controller, Worker, and Gateway hosts: This network must be both routable and part of the organization's network that is managed by that organization's IT department. All of the physical hosts must be routable. If the EPIC Platform HA feature is enabled (see "High Availability" on page 22), then both the Primary Controller and Shadow Controller must be in the same subnet. If different subnets are used for the Worker hosts, then all subnets used must share the same path MTU setting; Gateway hosts can use subnets with different MTU settings. See "Multiple Subnets" on page 17.
• Network for virtual nodes (Docker containers): EPIC creates and manages this network, which can be either public (routable) or private (non-routable). See "Private (Non-Routable) Virtual Node Network" on page 15 and "Public (Routable) Virtual Node Network" on page 16.

Figure 3: Dual networks on the BlueData EPIC platform. (The figure shows a Top of Rack (TOR) switch at 10.16.1.1 connecting the EPIC Gateway hosts (10.16.1.2 and .3), the EPIC Controller (.10), and the EPIC Worker hosts (.11 through .51) on the network for EPIC hosts, with a separate 192.168.x.x/16 network for the virtual nodes.)

Private (Non-Routable) Virtual Node Network
Private, non-routable virtual node networks keep the virtual node IP addresses private and hidden within the private network. EPIC replaces the IP addresses of outgoing and incoming packets, as shown in Figure 4.

Figure 4: BlueData EPIC platform configured to use a private, non-routable virtual node network. (The figure shows a Gateway host running the HAProxy service in front of the Controller and Worker hosts; EPIC assigns IP addresses to the embedded Docker containers for the Hadoop/Spark nodes from a private pool that provides out-of-the-box network isolation, and each host runs a DataTap service. Access to services on the containers is proxied via Gateway host port numbers; the DataTap access path runs from a container via the Worker host interface to remote storage such as HDFS; access to remote systems such as AD, CA, and SSO goes via the Worker hosts using IP masquerading.)

BlueData EPIC software is deployed on a set of hosts. Each host has an IP address and FQDN, shown in the figure as Host IP1, Host IP2, and so on. Hosts are typically deployed as one or more rack(s) of servers that are connected to an external switch for access to other networks in the organization (e.g. the end-user network). BlueData EPIC provisions clusters of embedded, fully-managed Docker containers. Each cluster spins up within a tenant and receives distinct assigned IP addresses and FQDNs from a private IP range (or optionally from a routable IP range provided by the network teams), which appear in the figure as BD IP1, BD IP2, and so on. Containers in one tenant do not have network access to containers in a different tenant despite potentially being placed on the same host.
End user access to services in the containers (such as SSH or web applications) is routed through a Gateway host that runs the HAProxy service. This access is purely for control traffic. All other traffic, including access to remote HDFS or other enterprise systems such as Active Directory (AD), MIT KDC (Kerberos provider), SSO (Identity Providers), and Certificate Authority (CA), is performed via the host network interface, as opposed to the Gateway host.
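As a simplified illustration of the Gateway-host proxying described above (this is not EPIC's implementation; all names, addresses, and ports are made up), each user-facing container service is exposed on a distinct Gateway host port:

```python
# Simplified illustration of Gateway-host proxying; not EPIC's implementation.
# End users connect to gateway_fqdn:port, and HAProxy on the Gateway host
# forwards the connection to the container's private address. All names,
# addresses, and ports below are made up.
GATEWAY_FQDN = "epic-gw.example.com"

PROXY_MAP = {
    # gateway port: (private container IP, service port inside the container)
    10001: ("192.168.12.5", 22),    # SSH to a cluster node in Tenant 1
    10002: ("192.168.12.5", 8888),  # Hue web UI on the same container
    10003: ("192.168.14.9", 7180),  # Cloudera Manager in another cluster
}

for gw_port, (container_ip, svc_port) in PROXY_MAP.items():
    print(f"{GATEWAY_FQDN}:{gw_port} -> {container_ip}:{svc_port}")
```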

Public (Routable) Virtual Node Network
Network traffic in a public EPIC virtual node network flows as shown in Figure 5, below.

Figure 5: BlueData EPIC platform configured to use a public, routable virtual node network. (The diagram mirrors Figure 4, except that there is no Gateway host: containers are assigned addresses from a routable IP range and have network access to and from the external network directly via the host NIC, while access to remote systems such as AD, CA, and SSO still goes via the Worker hosts.)

The above diagram depicts a deployment where a routable IP range is used for containers. In this case, the BlueData EPIC platform does not require Gateway hosts/roles. All the key functionality is identical to the recommended approach for a non-routable private IP range (see "Private (Non-Routable) Virtual Node Network" on page 15). This approach allows the Docker containers to directly access the external network via the network interface card (NIC) on the hosts where they reside.

Multiple Subnets
See Figure 6, below, for a sample EPIC deployment using multiple subnets.
Figure 6: BlueData EPIC platform that spans multiple subnets. (Each rack/subnet sits behind its own Top of Rack (TOR) switch and connects via an external switch to end-user clients such as the Hue console and Datameer, a remote data lake, and AD/LDAP/KDC services; the EPIC hosts on all three racks — including the Controller, Shadow Controller, and Arbiter/Worker hosts — form a single BlueData EPIC cluster.)

When multiple subnets are used:
• The EPIC hosts may be located on-premises and/or in a public cloud. For example, EPIC hosts can reside on multiple racks and/or can be virtual machines residing on cloud-based services, such as AWS.
• If the EPIC platform includes cloud-based hosts, then the container network must be private and non-routable.
• All of the subnets used by EPIC Worker hosts must share the same path MTU setting.
• The subnet(s) used by EPIC Gateway hosts may have different path MTU settings.

Gateway Hosts
A Gateway host is an optional first-class role that is managed by EPIC in a manner similar to the Controller and Compute Worker hosts. One or more Gateway host(s) are required when the IP addresses used by the virtual nodes in the EPIC platform are private and non-routable, meaning that the virtual nodes cannot be accessed via the corporate network. Gateway hosts must conform to all applicable requirements listed in the EPIC documentation. Unlike Compute Worker hosts, Gateway hosts do not run containers/virtual nodes. Instead, they enable access to user-facing services such as the Hue console, Cloudera Manager, Ambari, and/or SSH running on containers via High Availability Proxy (HAProxy). You can configure multiple Gateway hosts with a common Fully Qualified Domain Name (FQDN) for round-robin load balancing and High Availability. You may also use a hardware load balancer in front of the multiple Gateway hosts.

The ability to run EPIC virtual nodes in a private, non-routable network can drastically reduce the routable IP address pool requirement. For example, a /16 private network can support thousands of containers, while the corporate network need only manage the physical host addresses (for example, 10.16.1.2-51); EPIC Gateway hosts provide access to the virtual nodes running on those hosts (the short sketch below makes this arithmetic concrete). All control traffic to the virtual nodes/Docker containers from end-user devices (browsers and command line), such as https, SSH, and/or AD/KDC, goes through the Gateway host(s), while all traffic from the virtual nodes/Docker containers is routed through the EPIC hosts on which those nodes/containers reside.
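The address-pool arithmetic can be illustrated with Python's standard ipaddress module. The specific networks below are examples only (a 192.168.0.0/16 container network and the 10.16.1.2-51 host range mentioned above); the point is simply that a single /16 yields tens of thousands of container addresses while the corporate network manages only a few dozen host addresses.

    import ipaddress

    # Illustrative values only: a private /16 for containers and the small
    # routable range (10.16.1.2-51) that the corporate network must manage.
    container_network = ipaddress.ip_network("192.168.0.0/16")
    host_addresses = [ipaddress.ip_address("10.16.1.2") + i for i in range(50)]

    print(container_network.num_addresses)   # 65536 possible container addresses
    print(container_network.is_private)      # True: not exposed to the corporate network
    print(len(host_addresses))               # 50 routable host addresses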

Every container spawned within the EPIC platform is assigned a private IP address that is managed by the EPIC private network. Only the Gateway host(s) and Controller need be exposed to the end-user network in the firewall.

Support for multiple subnets increases Gateway host flexibility. For example, you can use "small" virtual machines that meet all Gateway host requirements located on different racks or in different areas of your network, instead of having to place these hosts on the same rack as Compute hosts. This can help optimize resource usage.

Figure 7 (below) displays the physical architecture of an EPIC platform with Gateway hosts, and Figure 8 displays the logical architecture of an EPIC platform with Gateway hosts.

Figure 7: Physical EPIC platform architecture with Gateway hosts. (Two EPIC Gateway hosts at 10.16.1.2 and .3, the EPIC Controller at .10, and EPIC Workers at .11 through .51 connect to a Top of Rack (TOR) switch; the containers use the private 192.168.x.x/16 network, and the external switch(es) connect the platform to AD/LDAP/KDC, a secure data zone containing the data lake, and the corporate data center network.)

Figure 8: Logical EPIC platform architecture with Gateway hosts. (The EPIC Controller host at 172.16.1.10 serves the EPIC web interface. Two virtual clusters run on the private 192.168.10.x network — one managed by Ambari with a Hue service, the other managed by Cloudera Manager with its own Hue service — and the EPIC Gateway at 172.16.1.2 proxies them on dedicated ports: Ambari at http://172.16.1.2:12001, Hue at http://172.16.1.2:12002, Cloudera Manager at http://172.16.1.2:14001, and the second Hue at http://172.16.1.2:14002. The Platform Administrator and data scientists reach these services over the corporate data center network.)
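The mapping shown in Figure 8 can be summarized as a small lookup table. The sketch below is illustrative only: in the actual platform this mapping is maintained by EPIC and served by HAProxy on the Gateway host, and the cluster labels here are inferred from the figure.

    # Gateway endpoints taken from Figure 8; the backing services run in
    # containers on the private 192.168.10.x network and are reachable only
    # through the Gateway host.
    GATEWAY = "172.16.1.2"

    service_ports = {
        "Ambari (cluster A)": 12001,
        "Hue (cluster A)": 12002,
        "Cloudera Manager (cluster B)": 14001,
        "Hue (cluster B)": 14002,
    }

    for service, port in service_ports.items():
        print(f"{service}: http://{GATEWAY}:{port}")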

Please see the technical white paper BlueData EPIC Network Design
for additional details about the networking design and network
configurations for the BlueData EPIC software platform.

7. Cluster Management
This section describes how BlueData EPIC creates on-premises and AWS clusters, cluster Isolated Mode, and host tags.

Cluster Creation
EPIC performs the following tasks when you create a cluster:
1. The EPIC Controller checks whether cluster creation would violate any tenant quotas or licensing limits.
2. If EPIC is deployed on-premises, then the Controller checks whether enough total free system resources are available for cluster creation.
3. The Controller chooses the host(s) in the EPIC platform on which to deploy virtual nodes. For on-premises EPIC platforms, these are physical hosts managed as EPIC Workers, and they are selected based on node affinity, scheduling constraints, matching host tag(s) (see "Host Tags" on page 21), and the per-host free resources. For AWS EPIC platforms, these hosts are EC2 instances that are generated on demand.
4. The Controller copies the appropriate application image onto each chosen host if it is not already available there.
5. The EPIC Host Agent service runs on each host and manages the lifecycle of the virtual nodes (containers) deployed on that host. This service creates a Docker container for each virtual node and then configures the networking and storage for the node(s).
6. The Controller copies the Node Agent package into each virtual node, and then logs into those containers via SSH to install and start that agent.
7. The Node Agent in each virtual node registers the node with its host's DataTap service and requests the application setup package and cluster metadata from either the Controller or the specified Docker-compatible registry, as appropriate.
8. The Node Agent unpacks the application setup package, and then runs the extracted startscript.
9. The Node Agent begins reporting cluster configuration status and services status to the Controller.
10. The startscript in each virtual node queries bd_vcli namespaces to obtain information from the cluster metadata, which includes the IP addresses and DNS names of nodes in the cluster, node roles, service definitions, etc.
11. The startscript performs final runtime configuration appropriate for the application.
12. The startscript registers application services with the Node Agent to identify services that must be started every time the node is (re)started.
13. After the startscript has succeeded on all nodes, the Node Agent reports a "ready" cluster status to the Controller, and the startscript output from each node is uploaded to the Controller to be displayed as the cluster setup logs in the EPIC interface and API.
14. If the EPIC platform is configured to use LDAP/AD user authentication, then an additional post-config setup takes place to allow users to access the containers.
The two figures below illustrate how this works. Figure 9 shows the container orchestration process for an on-premises EPIC platform. Figure 10 shows the container orchestration process for an AWS-based EPIC platform. A hybrid EPIC platform will also use this process for AWS resources. A simplified sketch of the common flow follows the figures.
Figure 9: On-premises container orchestration. (The Controller verifies resource availability and quotas, imports the image to the physical node(s) from the Controller/registry if needed, and the Host Agent creates each container and sets up its storage and networking; EPIC Management copies the Node Agent into the Docker containers and installs it over SSH; on the containers, the Node Agent requests a metadata copy from the Controller, extracts the configuration metadata and application setup package, and runs the startscript, which queries bd_vcli namespaces for host IP addresses, DNS names, service definitions, running services, and status.)

Figure 10: AWS container orchestration. (The flow is the same, except that the Controller checks against tenant quotas, creates Worker EC2 instances, and copies the image from an S3 image bucket or registry to each instance before the Host Agent creates the containers.)
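The sketch below condenses the numbered steps into a short, hypothetical Python outline. The class and method names are illustrative stand-ins rather than actual EPIC interfaces, and the host-selection logic is reduced to a simple round-robin in place of the real affinity, constraint, and host-tag scheduling.

    # Hypothetical outline of the cluster-creation flow described above.
    class VirtualNode:
        def __init__(self, host, name):
            self.host, self.name = host, name

        def run_startscript(self):
            # Steps 8-12: unpack the setup package, query bd_vcli namespaces for
            # cluster metadata (IPs, DNS names, roles, services), then configure
            # and register the application services.
            print(f"{self.name}: startscript complete")

    class WorkerHost:
        def __init__(self, name, free_cores):
            self.name, self.free_cores = name, free_cores

        def ensure_image(self, image):
            print(f"{self.name}: image '{image}' present")          # step 4

        def create_container(self, node_name):
            # Step 5: the Host Agent creates the container and sets up its
            # storage and networking.
            print(f"{self.name}: container {node_name} created")
            return VirtualNode(self, node_name)

    def create_cluster(hosts, image, node_count, cores_per_node):
        # Steps 1-2: quota and free-resource checks (reduced to a core count here).
        if sum(h.free_cores for h in hosts) < node_count * cores_per_node:
            raise RuntimeError("insufficient resources for cluster creation")
        # Step 3: choose hosts (round-robin stands in for the real placement logic).
        chosen = [hosts[i % len(hosts)] for i in range(node_count)]
        nodes = []
        for i, host in enumerate(chosen):
            host.ensure_image(image)                                 # step 4
            node = host.create_container(f"vnode-{i}")               # step 5
            # Step 6: the Controller copies in the Node Agent and installs it via SSH.
            nodes.append(node)
        for node in nodes:
            node.run_startscript()                                   # steps 7-12
        return "ready"                                               # step 13

    workers = [WorkerHost("worker1", 16), WorkerHost("worker2", 16)]
    print(create_cluster(workers, "hadoop-image", node_count=3, cores_per_node=4))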

Isolated Mode
There may be times when you need to limit access to a cluster in order to perform configuration or maintenance. Clusters in on-premises tenants on an EPIC platform running RHEL/CentOS 7.x can be placed into Isolated Mode. Running a cluster in Isolated Mode allows authorized users to access clusters for maintenance or updates using either automated scripts or manual commands, while preventing other users from accessing the services running in that cluster. Clusters can be placed into Isolated Mode at any time. For example, you can create a new cluster in Isolated Mode and then configure Kerberos protection before making that cluster available for tenant-wide access, thereby avoiding any security issues that may arise from running an unprotected cluster. You may also specify a "bootstrap" ActionScript that will automatically run as soon as a new cluster comes up.

Placing a cluster into Isolated Mode limits access to only those users who have access to the keypair based on the Cluster Superuser Privilege setting for the tenant. If LDAP/AD is used for authenticating EPIC users, then these users may also log in with their LDAP/AD credentials instead of using the tenant keypair.

Placing a cluster into Isolated Mode has the following effects:
• None of the cluster ports except SSH can be accessed from outside the cluster.
• Only authorized users (see above) can access the cluster via SSH or make changes via the EPIC interface or API, such as configuration, power, ActionScripts, and/or cluster deletion.

• Non-authorized users can place a cluster into Isolated Mode, but only authorized users can exit the cluster from this mode.
• Any SSH connections that are in progress when the cluster is placed into Isolated Mode will be terminated.
• Users cannot access cluster services via the Node(s) Info tab of the Cluster Details screen.
• A cluster can be configured to require two-step deletion. If this flag is set, then an attempt to delete a cluster will place that cluster into Isolated Mode. An authorized user can then delete that cluster once they determine that it is proper to do so.
• Any jobs being run by the cluster when placed into Isolated Mode will continue; however, no new jobs can be run in this cluster until it has been removed from Isolated Mode.
• Isolated Mode events (enabling and disabling this mode) will appear in the Cluster History tab of the Cluster Details screen.

Host Tags
Tags allow you to control the physical host(s) on which EPIC will place a newly created or edited cluster by forcing EPIC to place the cluster on only those host(s) that meet the criteria that you specify for that cluster. For example, you could use tags to:
• Place GPU-centric clusters such as TensorFlow on only those hosts that are tagged with a GPU identifier, such as gpu=yes.
• Place specific workload clusters that require SSD-based virtual node storage, such as ssd=true or storage=ssd-only.
• Isolate virtual clusters of a specific tenant onto specific EPIC hosts for physical isolation in addition to network isolation, such as tenant=marketing or tenant=finance.
• Differentiate between cloud-based EPIC hosts and on-premises EPIC hosts in a single unified EPIC deployment, such as onprem=true or location=onprem.

The general process of creating and using tags is:
1. Create one or more tag(s). Tag names are arbitrary and may include special characters and/or spaces; however, best practice is to assign consistent tag names.
2. Assign the tag(s) to one or more Compute Worker host(s), and then specify the value to assign to each tag assigned to each host. Values are arbitrary; however, best practice is to assign consistent values to each tag.
3. Create or edit a placement constraint for the cluster to force EPIC to place that cluster on a host where one or more tag(s) do(es) or do(es) not equal a value.
A minimal sketch of how such a constraint might be evaluated follows this list.
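The following sketch restates the tag-matching rule in Python. The host_matches function and the example tag values (taken from the bullets above) are illustrative only; EPIC's actual placement engine is internal to the Controller.

    def host_matches(host_tags, constraints):
        # constraints: list of (tag, operator, value), where operator is "==" or "!=".
        for tag, op, value in constraints:
            actual = host_tags.get(tag)
            if op == "==" and actual != value:
                return False
            if op == "!=" and actual == value:
                return False
        return True

    hosts = {
        "worker1": {"gpu": "yes", "tenant": "marketing"},
        "worker2": {"storage": "ssd-only", "tenant": "finance"},
    }

    # Place a GPU-centric TensorFlow cluster only on hosts tagged gpu=yes.
    constraints = [("gpu", "==", "yes")]
    eligible = [name for name, tags in hosts.items() if host_matches(tags, constraints)]
    print(eligible)   # ['worker1']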

8. High Availability
BlueData EPIC supports three levels of High Availability protection:
• Platform High Availability: This level of protection only applies to EPIC platform resources that are located on your premises. AWS resources are not covered.
• Virtual Cluster High Availability: This level of protection applies to clusters that reside on both on-premises and AWS resources.
• Gateway Host High Availability: This level of protection applies to Gateway host(s) for on-premises EPIC resources.

Platform High Availability
For on-premises installations, EPIC supports platform-level High Availability functionality that protects your EPIC platform against the failure of the Controller host. Your EPIC platform must include at least three hosts that conform to all of the requirements listed in the EPIC documentation in order to support this feature, as shown in Figure 11.

Platform-level High Availability requires designating two Worker hosts as the Shadow Controller and Arbiter, respectively. If the Controller host fails, then EPIC will fail over to the Shadow Controller host within approximately two or three minutes, and a warning will appear at the top of the EPIC screens. You may need to log back in to EPIC if this occurs.

Figure 11: BlueData EPIC platform configured for High Availability. (The minimum configuration is a Controller host, Shadow Controller, and Arbiter sharing a cluster IP address, plus any additional Worker hosts.)

Each on-premises host in the EPIC platform has its own IP address. If the Controller host fails, then attempting to access the Shadow Controller host using the same IP address will fail. Similarly, accessing the Shadow Controller host using that node's IP address will fail once the Controller host recovers. To avoid this problem, you must specify a cluster IP address that is bonded to the node acting as the Controller host, and then log into EPIC using that cluster IP address. EPIC will automatically connect you to the Controller host (if the EPIC platform is running normally) or to the Shadow Controller host with a warning message (if the Controller host has failed and triggered the High Availability protection).

Platform-level EPIC High Availability protects against the failure of any one of the three hosts being used to provide this protection. The warning message will therefore appear if either the Shadow Controller or Arbiter host fails, even if the Controller host is functioning properly.

When platform High Availability is enabled:
• The Controller host manages the EPIC platform, and the Shadow Controller and Arbiter hosts function as Worker hosts.
• If the Controller host fails, then the Arbiter host switches management of the EPIC platform to the Shadow Controller host, and the EPIC platform is no longer protected against any further host failure.

When a failure of a High Availability host occurs, EPIC takes the following actions:
• If the Controller host has failed, then EPIC fails over to the Shadow Controller host and begins running in a degraded state. This process usually takes 2-3 minutes, during which you will not be able to log in to EPIC.
• If the Shadow Controller or Arbiter host fails, then EPIC keeps running on the Controller host in a degraded state.
• A message appears in the upper right corner of the EPIC interface warning you that the system is running in a degraded state. Use the Service Status tab of the Platform Administrator Dashboard to see which host has failed and which services are down.

• EPIC analyzes the root cause of the host failure and attempts to recover the failed host automatically. If recovery is possible, then the failed host will come back up and EPIC will resume normal operation.
• If EPIC cannot resolve the problem, then the affected host will be left in an error state and you will need to manually diagnose and repair the problem (if possible) and then reboot that host. If rebooting solves the problem, then the failed host will come back up and EPIC will resume normal operation. If this does not solve the problem, then you will need to contact BlueData Technical Support for assistance.

Enabling platform High Availability protection in EPIC is a two-stage process: First, you must designate one Worker host as the Shadow Controller host in the Installation tab of the EPIC Installation screen. This is a one-time assignment; you cannot transfer the Shadow Controller role to another Worker host. Second, you must enable High Availability protection and then assign the Arbiter role to a third Worker host in the HA Settings tab of the Settings screen.

EPIC does not support enabling platform High Availability if any virtual clusters already exist. If you create one or more virtual clusters before deciding to enable High Availability, then you should delete those clusters before proceeding with High Availability. In general, you should implement High Availability just after first installing EPIC.

Virtual Cluster High Availability
Some Big Data applications allow you to create clusters with High Availability protection. This is separate and distinct from the EPIC platform High Availability described above, as follows:

A cluster with High Availability enabled for that virtual cluster is still dependent on the EPIC platform. If the Controller host on an EPIC platform without High Availability fails, then the virtual cluster will also fail.

If the EPIC platform has High Availability enabled and the Master node of a virtual cluster fails, then that virtual cluster will fail if High Availability for that virtual cluster was not enabled when that virtual cluster was created.

The following table displays the relationship between virtual cluster and EPIC platform High Availability protection under a variety of scenarios.

Note: In Figure 12, below, the asterisk (*) denotes platform-level High Availability protection that is only available to resources located on-premises. AWS resources only support High Availability for virtual clusters.

FAILURE TYPE                                                  NO HA        CLUSTER HA ONLY   EPIC HA ONLY*   BOTH HA
EPIC Controller host down; virtual cluster Master node up     X            X                 protected       protected
EPIC Controller host up; virtual cluster Master node down     X            protected         X               protected
EPIC Controller host up; other virtual cluster node down      protected    protected         protected       protected

Figure 12: BlueData EPIC High Availability protection under various scenarios
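For readers who prefer code to a matrix, the same relationships can be expressed as a small lookup. This is simply a restatement of Figure 12 (the on-premises-only caveat for platform High Availability still applies); the failure-type labels are invented for the example.

    def protected(failure, platform_ha, cluster_ha):
        if failure == "controller_down_master_up":
            return platform_ha          # only platform HA covers this case
        if failure == "controller_up_master_down":
            return cluster_ha           # only virtual cluster HA covers this case
        if failure == "other_cluster_node_down":
            return True                 # protected in every column of the table
        raise ValueError(f"unknown failure type: {failure}")

    print(protected("controller_down_master_up", platform_ha=False, cluster_ha=True))   # False ("X")
    print(protected("controller_up_master_down", platform_ha=True, cluster_ha=True))    # True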

Gateway Host High Availability


You can add redundancy for Gateway hosts by mapping multiple
Gateway host IP addresses to a single hostname. When this is
done, then either the DNS server or an external load balancer will
load-balance requests to the hostname among all of the Gateway
hosts on a round-robin basis. This ensures that there is no single
point of failure for the Gateway host.
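As a rough, non-BlueData-specific illustration of round-robin name resolution, the snippet below prints every address returned for a hostname; when several Gateway host addresses are mapped to one FQDN, clients are distributed across them. The hostname shown is a placeholder.

    import socket

    # Placeholder name; in practice this would be the shared Gateway FQDN that
    # has one address record per Gateway host.
    gateway_fqdn = "epic-gateway.example.com"

    try:
        addresses = sorted({info[4][0] for info in socket.getaddrinfo(gateway_fqdn, 443)})
        print(addresses)   # every Gateway IP address mapped to the shared name
    except socket.gaierror as exc:
        print(f"lookup failed: {exc}")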

Appendix
This Appendix defines the key terms used in this white paper and describes how BlueData EPIC supports Hadoop and Spark.

Definitions
This white paper uses the following terms:
• Host: A host is a physical or virtual server that is available to the EPIC platform. Nodes and clusters reside on hosts.
• Node: A node (also called a virtual node or instance) is a Docker container that is created when creating a persistent cluster. This white paper may also use the terms container and/or Docker container.
• Microservice: A microservice is a method of developing software applications as a suite of small, modular, and independently deployable services in which each service runs a unique process and communicates through a well-defined, lightweight mechanism to serve a business goal.
• Big Data application: A Big Data application generally refers to a distributed, multi-node, inter-related service that can process large amounts of data by computing on several nodes. Some examples of Big Data applications include Hadoop, Spark, Kafka, Cassandra, HBase, and others. Big Data applications should not be confused with microservices.
• Docker container: A Docker container is a lightweight, standalone, executable software package that runs specific services. This software package includes the code, runtime, system libraries, configurations, etc. that run as an isolated process in user space. A Docker container is typically used to deploy scalable and repeatable microservices.
• EPIC platform: An EPIC platform consists of the hosts that comprise the overall infrastructure available to a virtual cluster.
• Controller: A Controller is a host that manages the other nodes while also serving as a Worker host in the EPIC platform.
• Worker host: A Worker host is a host that is managed by a Controller.
• Compute host (or Compute Worker): A Compute host or Compute Worker is a Worker host that runs the virtual nodes (Docker containers) used by persistent clusters.
• Gateway host (or Gateway Worker): A Gateway host or Gateway Worker is a Worker host that maps services running on virtual nodes to ports in order to allow users to access those services when the virtual node IP addresses are non-routable and thus cannot be accessed directly.
• Master node: A Master node is a container that manages a cluster.
• Worker node: A Worker node is a container that is managed by a Master node in a cluster.
• Shadow Controller: A Shadow Controller is a designated Worker host that assumes the Controller host role if the primary Controller host fails. Your EPIC platform must meet all applicable requirements and have High Availability enabled for this protection to be active.
• Arbiter: An Arbiter is a designated Worker host that triggers the Shadow Controller host to assume the Controller role if the primary Controller host fails.
• Hadoop job tracker: Within a virtual cluster running MapReduce version 1, a Hadoop job tracker is the service that is responsible for coordinating the MapReduce application efforts of the remaining virtual nodes within that virtual cluster.
• Hadoop task tracker: Within a virtual cluster running MapReduce version 1, a Hadoop task tracker is a service that accepts tasks (Map, Reduce, and Shuffle operations) from the Hadoop job tracker. Each task tracker configuration specifies the number of "slots" or tasks that it can accept. Task trackers notify the job tracker when tasks complete. They also send out periodic "heartbeat" messages with the number of available slots, to keep the job tracker updated.
• Resource manager: Within a virtual cluster running YARN, a resource manager is the service that is responsible for coordinating the MapReduce application efforts of the remaining virtual nodes within that virtual cluster.
• Node manager: Within a virtual cluster running YARN, a node manager is a service that accepts tasks (Map, Reduce, and Shuffle operations) from the resource manager. Each node manager configuration specifies the number of "slots" or tasks that it can accept. Node managers notify the resource manager when tasks complete. They also send out periodic "heartbeat" messages with the number of available slots, to keep the resource manager updated.

• Platform: A platform includes all of the tenants, nodes, virtual clusters, and users that exist on a given EPIC deployment.
• Tenant: A tenant is a unit of resource partitioning and data/user access control in a given deployment. The resources of an EPIC platform are shared among the tenants on that platform. Resources used by one tenant cannot be used by another tenant. All users who are a member of a tenant can access the resources and data objects available to that tenant.
• Edge node: An edge node is a container running a business intelligence tool that interacts with the cluster's Hadoop or Spark framework.
• Virtual cluster: A virtual cluster is a collection of virtual nodes that is available to a specific tenant.
• cnode: cnode is the BlueData caching node service. The transfer of storage I/O requests from the BlueData implementation of the HDFS Java client to this caching node service is optimized to reduce latency.
• Node flavor: A node flavor describes the amounts and sizes of resources (virtual CPU cores, memory, and hard disk size) allotted to a virtual node.
• Platform Administrator: The Platform Administrator (or Platform Admin) is a role granted to an EPIC user. A user with this role has the ability to create/delete tenants. This user will typically also be responsible for managing the hosts in the EPIC platform.
• Tenant Administrator: A Tenant Administrator (or Tenant Admin) is a role granted to an EPIC user. A user with this role has the ability to manage the specific tenant(s) for which they have been granted this role, including creating DataTaps for that tenant.
• Tenant Member: A Tenant Member (or Member) is a role granted to an EPIC user. A user with this role has non-administrative access to the specific tenant(s) for which they have been granted this role. Members may use existing DataTaps for reading and writing data.
• User: A user is the set of information associated with each person accessing the EPIC platform, including the authentication and site roles.
• DataTap: A DataTap is a shortcut that points to a storage resource on the network. A Tenant Administrator creates a DataTap within a tenant and defines the storage namespace that the DataTap represents (such as a directory tree in a file system). A Tenant Member may then access paths within that resource for data input and/or output. Creating and editing DataTaps allows Tenant Administrators to control which storage areas are available to the members of each tenant, including any specific sharing or isolation of data between tenants.
• Node storage: Node storage is storage space available for backing the root filesystems of containers. Each host in the EPIC platform contributes node storage space that is used by the virtual nodes (Docker containers) assigned to that host. The Platform Administrator may optionally specify a quota limiting how much node storage a tenant's virtual nodes may consume.
• Tenant storage: Tenant storage is a shared storage space that may be provided by either a local HDFS installation on the EPIC platform or a remote storage service. Every tenant is assigned a sandbox area within this space that is accessible by a special, non-editable TenantStorage DataTap. All virtual nodes within the tenant can access this DataTap and use it for persisting data that is not tied to the life cycle of a given cluster. Tenant storage differs from other DataTap-accessible storage as follows:
   - A tenant may not access tenant storage outside of its sandbox.
   - The Platform Administrator can choose to impose a space quota on the sandbox.
• Cluster file system: Many types of EPIC clusters set up cluster-specific shared storage that consists of services and storage resources inside the cluster nodes. For example, an EPIC cluster running Hadoop will set up an in-cluster HDFS instance. This shared storage is referred to as the cluster file system. The cluster file system is often used for logs and temporary files. It can also be used for job input and output; however, the cluster file system and all data therein will be deleted when the cluster is deleted.

Hadoop and Spark Support
The BlueData EPIC platform includes pre-configured, ready-to-run versions of major Hadoop distributions, such as Cloudera (CDH), Hortonworks (HDP), and MapR (CDP). It also includes recent versions of Spark standalone as well as Kafka and Cassandra. Other Big Data distributions, services, commercial applications, and custom applications can be easily added to a BlueData EPIC deployment, as described in "App Store" on page 2.

BlueData EPIC supports a wide range of Big Data application services and Hadoop/Spark ecosystem products out-of-the-box, such as:
• Ambari (for HDP): Ambari is an open framework that allows system administrators to provision, manage, and monitor Apache Hadoop clusters and integrate Hadoop with the existing network infrastructure.

• Cassandra (from Datastax): Cassandra is a free, open-source, distributed NoSQL database management system for processing large datasets on commodity servers with no single points of failure.
• Cloudera Manager (for CDH): Cloudera Manager provides a real-time view of the entire cluster, including a real-time view of the nodes and services running, in a single console. It also includes a full range of reporting and diagnostic tools to help optimize performance and utilization.
• Cloudera Navigator (for CDH): Cloudera Navigator is a fully integrated Hadoop data management tool that provides critical data governance capabilities for regulated enterprises or enterprises with strict compliance requirements. These capabilities include verifying access privileges and auditing all Hadoop data access.
• Flume: Flume-NG is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of server log data. It is robust and fault tolerant with many failover and recovery mechanisms. It uses a simple extensible data model that allows one to build online analytic applications.
• GraphX (for Spark): GraphX works seamlessly with graphs and collections by combining ETL, exploratory analysis, and iterative graph computation within a single system. The Pregel API allows you to write custom iterative graph algorithms.
• Hadoop Streaming: Hadoop Streaming is a utility that allows you to create and run MapReduce jobs with any script as the mapper and/or the reducer, taking its input from stdin and writing its output to stdout.
• HAWQ: Apache HAWQ is a native Hadoop application that combines high-performance Massively Parallel Processing (MPP)-based analytics performance with robust ANSI SQL compliance, integration, and manageability within the Hadoop ecosystem, as well as support for a variety of datastore formats with no connectors required.
• HBase: HBase is a distributed, column-oriented data store that provides random, real-time read/write access to very large data tables (billions of rows and millions of columns) on a Hadoop cluster. It is modeled after Google's BigTable system.
• Hive: Hive facilitates querying and managing large amounts of data stored on distributed storage. This application provides a means for applying structure to this data and then running queries using the HiveQL language. HiveQL is similar to SQL and supports using custom mappers when needed.
• Hue: Hue is an open-source Web interface that integrates the most common Hadoop components into a single interface that simplifies accessing and using a Hadoop cluster.
• Impala: Impala by Cloudera is an open-source MPP query engine that allows users to query data without moving or transforming that data, thus enabling real-time analytics without the need to migrate data sets. EPIC supports Impala on CDH.
• JupyterHub: JupyterHub is a multi-user server that provides a dedicated single-user Jupyter Notebook server for each user in a group.
• Kafka: Kafka allows a single cluster to act as a centralized data repository that can be expanded with zero down time. It partitions and spreads data streams across a cluster of machines to deliver data streams beyond the capability of any single machine.
• MapReduce: MapReduce assigns segments of an overall job to each Worker, and then reduces the results from each back into a single unified set.
• MLlib: MLlib is Spark's scalable machine learning library that contains common learning algorithms, utilities, and underlying optimization primitives.
• Oozie: Oozie is a workflow scheduler system for managing Hadoop jobs that specializes in running workflow jobs with actions that run Hadoop MapReduce and Pig jobs.
• Pig: Pig is a language developed by Yahoo that allows for data flow and transformation operations on a Hadoop cluster.
• Ranger: Apache Ranger is a central security framework that allows granular access control to Hadoop data access components, such as Hive and HBase. Security administrators manage policies that govern access to a wide array of resources on an individual and/or group basis. Additional capabilities include auditing, policy analytics, and encryption.
• RStudio: RStudio is a free, open-source integrated development environment (IDE) for the R programming language that is used for statistical computing and graphics.
• Spark SQL: Spark SQL is a Spark module designed for processing structured data. It includes the DataFrames programming abstraction and can also act as a distributed SQL query engine. This module can also read data from an existing Hive installation.
• SparkR: This R package provides a lightweight front end for using Spark from R.

• Spark Streaming: Spark Streaming is an extension of the
core Spark API that enables fast, scalable, and fault-tolerant
processing of live data streams.
• Sqoop: Sqoop is a tool designed for efficiently transferring
bulk data between Hadoop and structured datastores, such as
relational databases. It facilitates importing data from a
relational database, such as MySQL or Oracle DB, into a
distributed filesystem like HDFS, transforming the data with
Hadoop MapReduce, and then exporting the result back into
an RDBMS.
• Sqoop 2: Sqoop 2 is a server-based tool designed to transfer
data between Hadoop and relational databases. You can use
Sqoop 2 to import data from a relational database
management system (RDBMS) such as MySQL or Oracle into
the Hadoop Distributed File System (HDFS), transform the
data with Hadoop MapReduce, and then export it back into an
RDBMS.
• Zeppelin: Web-based Zeppelin notebooks allow interactive
data analytics by bringing data ingestion, exploration,
visualization, sharing, and collaboration features to Spark as
well as Hadoop.

