
Open source software reference guide

Resource guide for the course titled Open source tools and HPE Ezmeral Software.
From HPE Learn On-Demand.

HPE EZMERAL OPEN SOURCE REFERENCE GUIDE


Open source software reference guide contents
Administration and management tools
Data engineering tools
Data analytics tools
Data science tools
Data security tools

Administration and management tools


These tools include standard frameworks for collecting and storing data, in tables or databases,
as well as coordinating and processing data tasks. Many cloud platforms also offer their own
managed database services, including Amazon Relational Database Service (RDS) on AWS,
Azure SQL Database, Google Cloud SQL, and more.

Apache Hadoop/YARN – Apache Hadoop provides an open source framework that enables
the distributed processing of large data sets across clusters of compute resources. Its design
can scale from single servers to thousands, each offering local compute and storage
capabilities.

Apache Hadoop is a storage and processing standard, and is commonly associated with
MapReduce, which is a programming model that runs on Hadoop and writes applications that
run in parallel to process large volumes of data stored on clusters. With the advent of Hadoop
2.0, YARN was introduced. YARN stands for Yet Another Resource Negotiator. It allows other
applications to run on a Hadoop cluster. Most Hadoop clusters use YARN to manage the many
components of the Hadoop ecosystem, including MapReduce and other tools like Apache
Spark.
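The MapReduce model described above can be sketched in plain Python; Hadoop's contribution is running these same two phases in parallel across a cluster, while this toy version runs on one machine:

```python
from collections import defaultdict

def map_phase(documents):
    # Map: emit a (word, 1) pair for every word, as a Hadoop mapper would
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    # Shuffle + reduce: group pairs by key and sum the counts per word
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

docs = ["the quick brown fox", "the lazy dog", "the fox"]
word_counts = reduce_phase(map_phase(docs))
```

In a real Hadoop job, YARN would schedule many mapper and reducer tasks across the cluster and shuffle the intermediate pairs between them.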

Apache Iceberg – Apache Iceberg is a high-performance format for big data tables. It uses
flexible SQL commands to merge data, update existing table rows, and perform targeted
deletes. Iceberg can rewrite files to improve read performance, or use delete deltas to achieve
faster updates. Apache Iceberg also supports full schema evolution, hidden partitioning, time
travel, rollbacks, and data compaction. Other open source engines like Apache Spark, Trino,
Hive, Presto, Impala, and more, can safely work with the same tables, all at the same time.
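As a sketch of the SQL-level operations described above (the table names are hypothetical; the syntax shown is the flavor supported by engines such as Spark SQL on Iceberg tables):

```sql
-- Merge updates into an Iceberg table: update matched rows, insert the rest
MERGE INTO db.events t
USING db.events_updates u
ON t.event_id = u.event_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;

-- Time travel: query the table as it existed at an earlier snapshot
SELECT * FROM db.events VERSION AS OF 4348287921093123465;
```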

HPE EZMERAL OPEN SOURCE REFERENCE GUIDE 1


Apache HBase – Apache HBase is a distributed, versioned, NoSQL database modeled after
Google's Bigtable. HBase is designed to handle enormous amounts of data and is especially
suited for: capturing millions or even billions of rows and columns of data from sources such as
system metrics or user clicks; storing sparse data such as chats or emails that have inconsistent
values across columns; and serving continuously captured data where random read/write
access is needed, such as a web application back end or search index. Developers primarily
write HBase applications in Java.
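HBase's data model can be sketched as a sparse, versioned map: each cell is addressed by row key and column, and keeps timestamped versions. This is a toy single-machine illustration, not the HBase API:

```python
from collections import defaultdict

class SparseVersionedTable:
    def __init__(self):
        # {row_key: {column: [(timestamp, value), ...] newest first}}
        self._rows = defaultdict(dict)

    def put(self, row, column, value, ts):
        versions = self._rows[row].setdefault(column, [])
        versions.append((ts, value))
        versions.sort(reverse=True)  # keep the newest version first

    def get(self, row, column):
        # Like HBase, a read returns the newest version by default
        versions = self._rows[row].get(column)
        return versions[0][1] if versions else None

t = SparseVersionedTable()
t.put("user42", "info:email", "a@example.com", ts=1)
t.put("user42", "info:email", "b@example.com", ts=2)
latest = t.get("user42", "info:email")
```

Rows that never set a column simply store nothing for it, which is why sparse data with inconsistent columns is cheap to keep.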

Hue – Hue, which stands for Hadoop User Experience, provides a user-friendly, web-based
interface for managing files, running jobs, and querying data using a variety of Hadoop
components, including Hive, Pig, and MapReduce. Hue is a SQL editor and assistant for
databases and data warehouses that provides a simplified SQL querying experience and can
find and connect to all types of databases.

Hue is a flexible tool for interacting with Hadoop clusters, making it a popular choice
for data analysts, developers, and other users who need to work with data stored in Hadoop. Its
goal is to make self-service data querying easy and widespread in organizations.

Apache ZooKeeper – Apache ZooKeeper provides a robust and flexible platform for managing
distributed systems, making it a popular choice for use cases such as Hadoop clusters,
distributed databases, and messaging systems.

As a resource coordinator for the HPE Ezmeral Data Fabric, Apache ZooKeeper is a centralized
service that manages the actions of distributed systems. It maintains configuration information,
naming, distributed synchronization, heartbeats, and group services for all servers. Hadoop
uses ZooKeeper to coordinate between services and shared resources running across multiple
nodes. Key benefits include a simple API, its ability to provide a highly available and reliable
service, to detect and recover from failures, and its support for dynamic reconfiguration.
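ZooKeeper's coordination model can be illustrated with a toy znode store: persistent nodes hold configuration, while ephemeral nodes vanish when their owning session dies, which is how failed servers are detected. This is a simplified sketch, not the ZooKeeper API:

```python
class ZNodeStore:
    def __init__(self):
        self._nodes = {}  # path -> (data, owner_session or None)

    def create(self, path, data, session=None):
        # A non-None session marks the node ephemeral, tied to that session
        self._nodes[path] = (data, session)

    def session_expired(self, session):
        # A dead session (missed heartbeats) drops all its ephemeral nodes
        self._nodes = {p: (d, s) for p, (d, s) in self._nodes.items()
                       if s != session}

    def children(self, prefix):
        return sorted(p for p in self._nodes if p.startswith(prefix + "/"))

zk = ZNodeStore()
zk.create("/config", b"cluster-settings")               # persistent
zk.create("/workers/w1", b"host1", session="sess-1")    # ephemeral
zk.create("/workers/w2", b"host2", session="sess-2")
zk.session_expired("sess-1")                            # w1's server dies
alive = zk.children("/workers")
```

Watching the `/workers` subtree is how other services learn, in real time, which members of the cluster are still alive.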

Kubernetes (K8s) – From the CNCF and Linux Foundation, Kubernetes, also referred to as K8s,
is an extensible open source platform that helps manage containerized workloads and services
on a cluster. It reduces complexity and automates numerous processes for scaling, managing,
and deploying applications and services. These applications run in containers and remain
isolated, which eliminates manual processes and helps ensure secure deployment and
development. Kubernetes also automates service discovery and load balancing, scaling
workloads depending on service requirements and availability.

Kubernetes supervises the operations of multiple clusters and manages various containerized
programs, like Apache Spark.
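A minimal Deployment manifest illustrates the declarative model: you state the desired number of replicas and Kubernetes keeps them running. The names and image below are hypothetical:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
spec:
  replicas: 3              # Kubernetes keeps three pods running at all times
  selector:
    matchLabels:
      app: web-app
  template:
    metadata:
      labels:
        app: web-app
    spec:
      containers:
      - name: web-app
        image: example/web-app:1.0
        ports:
        - containerPort: 8080
```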



Data engineering tools

These tools include those that support the entire ETL process, or flow of data as it proceeds
through a distributed data pipeline. This includes support for data processing, tracking, and
real time streaming frameworks.

Apache Kafka – Apache Kafka is a distributed event-streaming platform that supports high
performing data pipelines, streaming analytics, data integrations, and the processing of
mission-critical applications.

Apache Kafka handles large volumes of data, supports horizontal scaling, and has a decoupled
architecture that allows the processing of data streams to be separated from the
production of data. Its core capabilities include scalable, permanent storage with high
throughput and high availability. All these features make it a popular choice for use cases such
as data streaming, event processing, and messaging.
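Kafka's decoupled architecture rests on a partitioned, append-only log that consumers read from their own offsets; a toy sketch of that abstraction (not the Kafka API):

```python
class TopicLog:
    def __init__(self, partitions=2):
        self._partitions = [[] for _ in range(partitions)]

    def produce(self, key, value):
        # Records with the same key always land in the same partition,
        # preserving per-key ordering
        p = hash(key) % len(self._partitions)
        self._partitions[p].append(value)
        return p

    def consume(self, partition, offset):
        # Consumers read from an offset they track themselves;
        # the log itself is never mutated by reads
        return self._partitions[partition][offset:]

log = TopicLog()
p = log.produce("sensor-1", {"temp": 21})
log.produce("sensor-1", {"temp": 22})
records = log.consume(p, offset=0)
```

Because producers only append and consumers only track offsets, either side can scale, fail, or fall behind without affecting the other.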

Apache Hive – Apache Hive is a data warehouse infrastructure built on top of Hadoop. Hive
stores data in tables using HCatalog and tracks this data using the Hive Metastore. Hive uses a
SQL-like query language called HiveQL, which can be translated into MapReduce or Tez jobs,
allowing for efficient processing of large datasets. HiveQL makes it easy for analysts already
familiar with SQL to process and execute queries to perform data analysis on Hadoop clusters.
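A short HiveQL example of the kind of SQL-like query described above (the table is hypothetical):

```sql
-- A partitioned Hive table; queries compile to MapReduce or Tez jobs
CREATE TABLE page_views (user_id STRING, url STRING, view_time TIMESTAMP)
PARTITIONED BY (view_date DATE)
STORED AS ORC;

-- Top pages for one day: partition pruning reads only that day's data
SELECT url, COUNT(*) AS views
FROM page_views
WHERE view_date = '2024-01-01'
GROUP BY url
ORDER BY views DESC
LIMIT 10;
```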

Apache NiFi – Apache NiFi supports powerful and scalable directed graphs of logic, like data
routing, system mediation, or transformation. It is easy to use through a browser-based web UI
and provides complete data provenance tracking, from beginning to end. Main features of
Apache NiFi include secure communication over HTTPS, configurable authentication, and
encryption with TLS and SSH. It has extensive configuration options for low latency, high
throughput, and dynamic prioritization, and an extensible design to allow for iterative testing
and rapid development.

Apache Airflow – Apache Airflow is an open source workflow management platform for data
engineering pipelines. Airflow uses directed acyclic graphs, or DAGs, to manage workflow
orchestration. Tasks and dependencies are defined in Python, and Airflow manages their
scheduling and execution. DAGs can be run either on a defined time-based schedule or
based on external event triggers, like a file upload. Previous DAG-based schedulers like Apache
Oozie and Azkaban tended to rely on multiple configuration files and file system trees to create
a DAG, whereas in Airflow, DAGs can often be written in one Python file for simplified
administration.
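What a DAG scheduler fundamentally does can be sketched in a few lines: run each task only after all of its upstream dependencies have finished. Real Airflow expresses the same idea with its DAG and operator APIs, typically in one Python file:

```python
def run_dag(dependencies):
    # dependencies: {task: set of upstream tasks it waits for}
    done, order = set(), []
    while len(done) < len(dependencies):
        ready = [t for t, ups in dependencies.items()
                 if t not in done and ups <= done]
        if not ready:
            raise ValueError("cycle detected: not a valid DAG")
        for task in sorted(ready):   # deterministic order for this sketch
            order.append(task)
            done.add(task)
    return order

# A small ETL pipeline: extract first, two transforms, then load
pipeline = {
    "extract": set(),
    "transform_a": {"extract"},
    "transform_b": {"extract"},
    "load": {"transform_a", "transform_b"},
}
execution_order = run_dag(pipeline)
```

In Airflow the "ready" tasks would run in parallel on workers; the cycle check is why the graphs must be acyclic.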



Istio – Istio is a popular service mesh, originally developed at Google, that serves as an open
platform to connect, manage, and secure microservices in modern, distributed environments.
Istio manages the
traffic flow between microservices, enforces access policies, and aggregates telemetry data, all
without requiring changes to microservice code. Main features include automatic load
balancing, fine-grained traffic behavior control, a pluggable policy layer, secure service-to-
service communications, and automatic metrics, logs, and traces.

Istio provides a unified platform for managing microservices in large-scale, cloud and
distributed environments, making it a powerful tool for cloud-native application development
and deployment.
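A sketch of Istio's fine-grained traffic control: a VirtualService that splits traffic between two versions of a service without touching application code (the service names here are hypothetical):

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: reviews
spec:
  hosts:
  - reviews
  http:
  - route:
    - destination:
        host: reviews
        subset: v1
      weight: 90      # 90% of traffic stays on the stable version
    - destination:
        host: reviews
        subset: v2
      weight: 10      # 10% canaries the new version
```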

Argo – Argo is a Kubernetes-native workflow engine that supports DAGs and step-based
workflows. It provides an easy to use, fully loaded UI, making advanced rollouts and
deployment strategies easy to execute. Argo runs workflows, manages clusters, and provides
simple, event-based dependency management for Kubernetes.

Great Expectations – Great Expectations is a data validation tool, typically used alongside a
workflow orchestration service, that helps accelerate data projects by monitoring for and
catching data issues. It profiles, documents, and validates data to maintain high data quality
throughout a data pipeline. By catching data issues early, data engineers are notified and can
fix the problems as quickly as possible.
Not only does this catch issues before insights are created or shared, but it also saves time
during data cleaning, accelerates the ETL and data normalization process, and streamlines
engineer-to-analyst handoffs.
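The profile-and-validate workflow can be sketched with toy expectation checks; Great Expectations' real API declares these through its own expectation classes rather than plain functions like these:

```python
def expect_not_null(rows, column):
    # Fail if any row is missing a value for the column
    bad = [i for i, r in enumerate(rows) if r.get(column) is None]
    return {"expectation": f"{column} not null",
            "success": not bad, "failed_rows": bad}

def expect_between(rows, column, lo, hi):
    # Fail if any present value falls outside the expected range
    bad = [i for i, r in enumerate(rows)
           if r.get(column) is not None and not (lo <= r[column] <= hi)]
    return {"expectation": f"{column} in [{lo}, {hi}]",
            "success": not bad, "failed_rows": bad}

rows = [{"age": 34}, {"age": None}, {"age": 212}]
results = [expect_not_null(rows, "age"),
           expect_between(rows, "age", 0, 120)]
all_passed = all(r["success"] for r in results)
```

Run inside a pipeline step, a failing result like this would halt the load and notify the engineer before bad rows reach analysts.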

Elasticsearch – Elasticsearch is a distributed, JSON-based, RESTful search and analytics
engine that allows you to store, search, and analyze data with speed at scale. It can perform
and combine many types of searches on any type of data, from structured to unstructured,
geo, and metric data. Monitoring and logging allow for complete observability by unifying logs,
metrics, and traces across applications, infrastructure, and users. As part of the Elastic Stack,
it works well with the visualization tool Kibana.
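The speed at scale comes from an inverted index; a toy sketch of the idea (Elasticsearch's real implementation, built on Lucene, is far more sophisticated):

```python
from collections import defaultdict

def build_index(docs):
    # Map each term to the set of document IDs containing it
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    # AND semantics: a hit must contain every query term,
    # so a query is a fast set intersection, not a document scan
    sets = [index.get(t, set()) for t in query.lower().split()]
    return set.intersection(*sets) if sets else set()

docs = {1: "error in payment service",
        2: "payment completed",
        3: "error in auth"}
hits = search(build_index(docs), "payment error")
```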



Data analytics tools

Open source tools supporting data analytics include a wide range of data analyzers, visualizers,
dashboard creators, and more. Some of these will also provide support for machine learning
tools.

Apache Superset – Apache Superset is a data exploration and visualization platform used by
data analysts and business intelligence professionals to create rich, customized, interactive
dashboards. It is lightweight and highly scalable. With an intuitive UI, Apache Superset is
accessible to all users, even those without coding experience. Superset makes it easy to create
dashboards and visualizations from a variety of data sources, integrating with any SQL-based
database as well as Hadoop and NoSQL databases.

Apache Spark – Apache Spark is a powerful framework for performing general data analytics
on distributed computing clusters like Hadoop. Spark caches datasets to provide fast, in-
memory computation. These in-memory, iterative jobs mean faster processing, which supports
complex analysis like machine learning.

Spark can also process structured and streaming data from Hive, Flume, and other sources,
including your own custom data sources, through Spark SQL and Spark Streaming. Spark
provides high-level APIs in Java, Scala, Python,
and R, and is an optimized engine that supports general execution graphs.

Apache Livy – Apache Livy is a service that enables easy interactions with Apache Spark
clusters over a REST interface. Livy simplifies interactions between Spark and application
servers, which enables the use of Spark for interactive mobile or web applications. Other
features include easy submission of Spark jobs or code snippets, synchronous or asynchronous
result retrieval, and Spark context management for multiple or long-running jobs.
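For example, a batch job can be submitted by POSTing a JSON body like the following to Livy's /batches endpoint (the paths and sizing values here are hypothetical):

```json
{
  "file": "hdfs:///jobs/wordcount.py",
  "args": ["hdfs:///data/input", "hdfs:///data/output"],
  "executorCores": 2,
  "numExecutors": 4
}
```

Livy replies with a batch ID that the application can poll over the same REST interface to track job state and retrieve results.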

Presto – Presto is an open source distributed SQL query engine that runs interactive analytic
queries. Designed for efficiency, Presto uses distributed execution to process and run
interactive, ad hoc queries against data sources of various sizes, from gigabytes to petabytes,
at sub-second speeds. Presto leverages in-memory parallel processing and a multithreaded
execution model to keep all available CPU cores busy. It federates and queries data right where
the data lives.

Apache Drill – Apache Drill is a query engine for big data exploration. Drill can perform
dynamic queries on structured, semi-structured, and unstructured data on HDFS or Hive tables
using ANSI-SQL. Drill is very flexible and can query a variety of data from multiple data sources.
This makes it an excellent tool for exploring unknown data. Unlike Hive, Drill does not use
MapReduce and is ANSI-SQL compliant, making it faster and easier to use.



Grafana – Grafana is a popular data visualization and dashboard platform that allows you to
query, monitor, and visualize data anywhere, pulling metrics in from wherever they are stored.
Users create, explore, and share customized dashboards to foster a data-driven culture, with
the ability to set up custom, unified alerts with fine-grained access controls.

Kibana – Kibana creates interactive data visualizations through a simple, intuitive UI that
allows you to build a variety of charts, stacks, and graphs for customized dashboards. These
dashboards allow drill-downs for deeper analysis and alert triggers. Dashboards can be shared
and embedded in any web application or URL. As part of the Elastic Stack, Kibana works well with
other open source tools like Elasticsearch.

Data science tools

Data science is a rapidly evolving field and there are many available open source tools for these
types of use cases. Many tools now focus on analysis that support specific machine learning
workflows and tasks, across a variety of platforms and environments.

TensorFlow – Developed by Google, TensorFlow is a popular machine learning platform that
provides a complete, end-to-end solution, from research to production, for every stage of an ML
pipeline. TensorFlow is highly flexible and scalable, which enables users to create complex
neural networks to train large models on the huge datasets of a distributed system. It allows
users to build their own custom models or leverage pre-trained models with a wide range of
algorithms, and includes a large toolkit for processing data, and building and testing models.
TensorFlow supports model deployment and implementation in a variety of environments,
either on-prem, on-device, in a browser, or in the cloud.
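At its core, training a model is iterative gradient descent over a loss function. This toy one-feature linear model sketches the loop that frameworks like TensorFlow automate, with automatic differentiation, hardware acceleration, and distribution:

```python
def train(data, lr=0.05, epochs=200):
    # Fit y = w*x + b by gradient descent on mean squared error
    w, b = 0.0, 0.0
    n = len(data)
    for _ in range(epochs):
        # Hand-derived gradients of MSE; autodiff computes these for you
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Points generated from y = 2x + 1; training should recover w≈2, b≈1
data = [(0, 1), (1, 3), (2, 5), (3, 7)]
w, b = train(data)
```

A deep neural network is the same loop with millions of parameters, which is why the framework's scalability and device support matter.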

PyTorch – PyTorch, now a part of the Linux Foundation, is another open source machine
learning library that is gaining popularity, particularly in the deep learning community. It
provides a flexible platform for building and training deep neural networks and is very effective
with applications like computer vision and natural language processing. PyTorch handles such
complex networks through the use of tensor computing and strong acceleration obtained with
GPU processors. PyTorch supports all major cloud platforms, providing frictionless cloud
development and easy scaling. It offers Python and C++ interfaces and is known for its ease of
use and intuitive, Pythonic API.

Ray – Ray is used to easily scale complex AI and Python workloads. It supports general Python
apps and provides scalable data processing for loading, writing, converting, and transforming
datasets. Ray accelerates and improves model training by providing hyperparameter tuning
capabilities, and supports model deployment with a framework-agnostic, Python-based model
serving library.



Ray works with reinforcement learning libraries like RLlib and deep learning models, providing
a flexible, efficient, distributed execution framework for both PyTorch and TensorFlow
workloads.
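Ray's core pattern is fanning Python functions out as parallel tasks and gathering the results. This standard-library sketch shows the same pattern on one machine; Ray's actual API decorates functions with @ray.remote and scales the idea across a whole cluster:

```python
from concurrent.futures import ThreadPoolExecutor

def score_model(params):
    # Stand-in for an expensive step, such as one hyperparameter trial
    return params["lr"] * 100

trials = [{"lr": 0.01}, {"lr": 0.1}, {"lr": 0.5}]

# Fan the trials out in parallel and gather all the scores
with ThreadPoolExecutor() as pool:
    scores = list(pool.map(score_model, trials))

best = max(zip(scores, trials))  # pick the best-scoring configuration
```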

Kubeflow – Kubeflow provides an easy-to-use toolkit for performing machine learning on
Kubernetes infrastructure. Its aim is to create simple, portable, and scalable ML solutions that
support the entire workflow. It includes services that support Jupyter notebooks for creating
and managing interactive model development, along with a customized operator that supports
model training through TensorFlow, and more third-party integrations are planned.

Feast – Feast is a stand-alone feature store that allows an organization to build and manage
data pipelines with their existing tools, all while using Feast to reliably store and serve feature
values for machine learning projects. A feature store acts as a central data hub of input features
and metadata throughout a project’s lifecycle.

Feast is highly scalable and supports low latency online serving. Using a feature store provides
consistent, quality feature data for model training, for both offline training and online inference.
Feast extends stacks to work with existing infrastructure and tools already used with your
existing machine learning workflows.
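The central-hub idea can be sketched as a keyed store that serves identical feature values to both training and online inference (a toy illustration, not Feast's API, which works with feature views and entities):

```python
class FeatureStore:
    def __init__(self):
        self._features = {}  # (entity_id, feature_name) -> value

    def ingest(self, entity_id, features):
        # Pipelines write computed feature values into the central store
        for name, value in features.items():
            self._features[(entity_id, name)] = value

    def get_online_features(self, entity_id, names):
        # The same lookup serves training-set generation and live inference,
        # so both see consistent feature values
        return {n: self._features.get((entity_id, n)) for n in names}

store = FeatureStore()
store.ingest("user-7", {"avg_order_value": 42.5, "orders_30d": 3})
feats = store.get_online_features("user-7",
                                  ["avg_order_value", "orders_30d"])
```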

MLflow – MLflow is a Linux Foundation project that supports the entire machine learning
lifecycle. MLflow manages the flow of model development, from experimentation and
reproducibility through deployment, with a central model registry. It records and
queries model experiments for tracking and packages code in multiple formats. This allows
teams to reproduce runs on multiple platforms to compare results or deploy models in diverse
serving environments.

Data security tools

Security tools can specifically impact encryption or authentication verification processes, access
control policies, or auditing tasks, and may also range between a wide variety of security
purposes like intrusion detection, securing communications or services, penetration testing, and
more.

SPIFFE – As part of the Linux Foundation's CNCF, the open source SPIFFE project stands for
Secure Production Identity Framework for Everyone. It is a standard specification and toolchain
for service identity. SPIFFE supports enterprise level zero trust architectures and works with
SPIRE.



SPIRE – SPIRE is the SPIFFE Runtime Environment, which is an extensible system that
implements the principles embodied in the SPIFFE standards. Together, these two tools
combine to provide a cryptographic, platform agnostic identity to individual services across
heterogeneous environments and organizational boundaries. This provides a secure, universal
identity control plane for distributed systems.
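A SPIFFE identity is a URI of the form spiffe://&lt;trust-domain&gt;/&lt;workload-path&gt;; a minimal shape check illustrates the format (real SPIFFE validation enforces additional rules):

```python
from urllib.parse import urlparse

def parse_spiffe_id(uri):
    # Minimal check of the spiffe://<trust-domain>/<path> shape
    parts = urlparse(uri)
    if parts.scheme != "spiffe" or not parts.netloc:
        raise ValueError(f"not a SPIFFE ID: {uri}")
    return {"trust_domain": parts.netloc, "path": parts.path}

sid = parse_spiffe_id("spiffe://example.org/payments/api")
```

In a SPIRE deployment, each workload is issued a cryptographically verifiable document (an SVID) carrying such an ID, which services then use to authenticate one another.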

Falco – Falco is a cloud-native runtime security project for threat detection in Kubernetes
environments. It detects threats at runtime by observing the behavior of applications and
containers. Falco
essentially acts as a security camera to find unexpected behavior, intrusions, and data theft in
real time. Through available plugins, Falco can also be extended across cloud environments.
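Falco detects behavior through declarative rules; the sketch below follows Falco's rule-file syntax, but the condition is illustrative and would be tuned per environment:

```yaml
# Alert whenever an interactive shell starts inside a container
- rule: Shell spawned in container
  desc: A shell was spawned inside a container
  condition: container.id != host and proc.name in (bash, sh)
  output: "Shell in container (user=%user.name container=%container.name)"
  priority: WARNING
```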

Apache Sentry – Apache Sentry provides fine-grained access control to data and metadata
stored in Hadoop clusters. The tool enables administrators to control and restrict access to
sensitive data through role-based access policies, ensuring that only authorized users can
access and modify it. Apache Sentry provides a centralized authorization solution for Apache
Hadoop and
integrates with a variety of other Hadoop components like Hive, Impala, and HBase. This
simplifies security administration and reduces the risk of errors or inconsistencies.
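The role-based model can be sketched simply: privileges attach to roles, users hold roles, and every access is checked against them (a toy illustration, not Sentry's API):

```python
# Privileges attach to roles, not directly to users
ROLE_PRIVILEGES = {
    "analyst": {("sales_db", "SELECT")},
    "etl": {("sales_db", "SELECT"), ("sales_db", "INSERT")},
}
USER_ROLES = {"alice": {"analyst"}, "bob": {"etl"}}

def is_authorized(user, database, action):
    # An access is allowed if any of the user's roles grants it
    return any((database, action) in ROLE_PRIVILEGES.get(role, set())
               for role in USER_ROLES.get(user, set()))

alice_can_insert = is_authorized("alice", "sales_db", "INSERT")
bob_can_insert = is_authorized("bob", "sales_db", "INSERT")
```

Centralizing these policies in one place, as Sentry does across Hive, Impala, and HBase, is what keeps them consistent.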

Apache Ranger – Apache Ranger provides overall data security across the Apache Hadoop
ecosystem. It is a central framework used to manage and administer security policies and
access. With Apache Ranger, authorization methods can be standardized with fine-grained
controls. This tool also makes it easy to centralize auditing activities for all Hadoop
components.

