Open Source Software Reference Guide
Resource guide for the course "Open Source Tools and HPE Ezmeral Software," from HPE Learn On-Demand.
Apache Hadoop/YARN – Apache Hadoop provides an open source framework that enables
the distributed processing of large data sets across clusters of compute resources. Its design
can scale from single servers to thousands, each offering local compute and storage
capabilities.
Apache Hadoop is a storage and processing standard and is commonly associated with MapReduce, a programming model that runs on Hadoop and is used to write applications that process large volumes of data stored on clusters in parallel. With the advent of Hadoop 2.0, YARN was introduced. YARN stands for Yet Another Resource Negotiator and allows other applications to run on a Hadoop cluster. Most Hadoop clusters use YARN to manage the many components of the Hadoop ecosystem, including MapReduce and other tools like Apache Spark.
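For illustration, here is a minimal word-count sketch of the MapReduce model using Hadoop Streaming, which lets the map and reduce steps be written as ordinary scripts that read stdin and write stdout; the script names, input/output paths, and submission command are placeholder assumptions, not part of the course material.

    # mapper.py - illustrative word-count mapper for Hadoop Streaming.
    # Hadoop pipes input splits to this script and collects "key<TAB>value" lines.
    import sys

    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word}\t1")

    # reducer.py - sums the counts per word emitted by the mappers.
    import sys
    from collections import defaultdict

    counts = defaultdict(int)
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t")
        counts[word] += int(count)
    for word, total in counts.items():
        print(f"{word}\t{total}")

    # Typically submitted with something like:
    #   hadoop jar hadoop-streaming-*.jar -files mapper.py,reducer.py \
    #       -mapper mapper.py -reducer reducer.py -input /data/in -output /data/out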
Apache Iceberg – Apache Iceberg is a high-performance format for big data tables. It uses flexible SQL commands to merge data, update existing table rows, and perform targeted deletes. Iceberg can rewrite files to improve read performance, or use delete deltas to achieve faster updates. Apache Iceberg also supports full schema evolution, hidden partitioning, time travel, rollbacks, and data compaction. Other open source engines like Apache Spark, Trino, Hive, Presto, Impala, and more can safely work with the same tables, all at the same time.
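As a brief, hedged illustration of those SQL capabilities, the PySpark sketch below assumes a SparkSession already configured with an Iceberg catalog named demo and an existing updates view containing new rows; the catalog, table, column, and snapshot names are placeholders.

    # Illustrative sketch only: assumes a Spark session configured with an Iceberg
    # catalog named "demo" and a view named "updates"; names are placeholders.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("iceberg-merge-sketch").getOrCreate()

    # Upsert new rows into an Iceberg table with a single MERGE statement.
    spark.sql("""
        MERGE INTO demo.db.events AS t
        USING updates AS u
        ON t.id = u.id
        WHEN MATCHED THEN UPDATE SET *
        WHEN NOT MATCHED THEN INSERT *
    """)

    # Time travel: read the table as of an earlier snapshot (snapshot ID is a placeholder).
    spark.read.option("snapshot-id", 1234567890).format("iceberg").load("demo.db.events")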
Apache Hue – Apache Hue, which stands for Hadoop User Experience, provides a user-
friendly, web-based interface for managing files, running jobs, and querying data using a
variety of Hadoop components, including Hive, Pig, and MapReduce. Hue is also a SQL editor and assistant for databases and data warehouses that provides a simplified SQL querying experience and can find and connect to many types of databases.
Apache Hue is a flexible tool for interacting with Hadoop clusters, making it a popular choice
for data analysts, developers, and other users who need to work with data stored in Hadoop. Its
goal is to make self-service data querying easy and widespread in organizations.
Apache ZooKeeper – Apache ZooKeeper provides a robust and flexible platform for managing
distributed systems, making it a popular choice for use cases such as Hadoop clusters,
distributed databases, and messaging systems.
As a resource coordinator for the HPE Ezmeral Data Fabric, Apache ZooKeeper is a centralized
service that manages the actions of distributed systems. It maintains configuration information,
naming, distributed synchronization, heartbeats, and group services for all servers. Hadoop
uses ZooKeeper to coordinate between services and shared resources running across multiple
nodes. Key benefits include a simple API, a highly available and reliable service, the ability to detect and recover from failures, and support for dynamic reconfiguration.
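For illustration, the sketch below uses kazoo, a third-party Python client for ZooKeeper; the host, znode paths, and data are placeholder assumptions.

    # Illustrative sketch using the third-party kazoo client; the ZooKeeper host
    # and znode paths below are placeholders.
    from kazoo.client import KazooClient

    zk = KazooClient(hosts="zk-host:2181")
    zk.start()

    # Store shared configuration in a znode.
    zk.ensure_path("/app/config")
    zk.set("/app/config", b"max_workers=8")

    # Register this process as an ephemeral znode; it disappears if the session
    # dies, which is how peers detect failures (a heartbeat-style liveness signal).
    zk.create("/app/workers/worker-1", b"alive", ephemeral=True, makepath=True)

    # Watch the worker group for membership changes.
    @zk.ChildrenWatch("/app/workers")
    def on_members_change(children):
        print("current workers:", children)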
Kubernetes (K8s) – From the CNCF and Linux Foundation, Kubernetes, also referred to as K8s,
is an extensible open source platform that helps manage containerized workloads and services
on a cluster. It reduces complexity and automates numerous processes for scaling, managing, and deploying applications and services. These applications run in isolated containers, which eliminates manual processes and helps ensure secure deployment and development. Kubernetes also manages network traffic and provides load balancing, adding or removing service instances depending on requirements and availability.
Kubernetes supervises the operations of multiple clusters and manages various containerized
programs, like Apache Spark.
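As a brief illustration, the sketch below uses the official Kubernetes Python client; the kubeconfig location, deployment name, namespace, and replica count are placeholder assumptions.

    # Illustrative sketch using the official Kubernetes Python client; the
    # deployment name, namespace, and replica count are placeholders.
    from kubernetes import client, config

    config.load_kube_config()   # reads ~/.kube/config; use load_incluster_config() inside a pod

    core = client.CoreV1Api()
    apps = client.AppsV1Api()

    # List the pods currently scheduled across the cluster.
    for pod in core.list_pod_for_all_namespaces().items:
        print(pod.metadata.namespace, pod.metadata.name, pod.status.phase)

    # Scale a deployment by patching its desired replica count; Kubernetes
    # then adds or removes pods to match the declared state.
    apps.patch_namespaced_deployment_scale(
        name="web", namespace="default",
        body={"spec": {"replicas": 5}},
    )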
The following tools support the ETL process, or flow of data, as it proceeds through a distributed data pipeline. This includes support for data processing, tracking, and real-time streaming frameworks.
Apache Kafka – Apache Kafka is a distributed event-streaming platform that supports high-performing data pipelines, streaming analytics, data integration, and mission-critical applications.
Apache Kafka handles large volumes of data, supports horizontal scaling, and has a decoupled architecture that allows the processing of data streams to be separated from the production of data. Its core capabilities include scalable, permanent storage with high throughput and high availability. All these features make it a popular choice for use cases such as data streaming, event processing, and messaging.
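For illustration, the sketch below uses kafka-python, a third-party client, to show how production and consumption of a stream are decoupled; the broker address, topic, and consumer group are placeholder assumptions.

    # Illustrative sketch using the third-party kafka-python client; broker
    # address and topic name are placeholders.
    import json
    from kafka import KafkaProducer, KafkaConsumer

    # Producer: publishes events to a topic, independent of any consumer.
    producer = KafkaProducer(
        bootstrap_servers="broker:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    producer.send("clickstream", {"user": "u123", "action": "page_view"})
    producer.flush()

    # Consumer: a separate process reads the same stream at its own pace.
    consumer = KafkaConsumer(
        "clickstream",
        bootstrap_servers="broker:9092",
        group_id="analytics",
        auto_offset_reset="earliest",
        value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    )
    for message in consumer:
        print(message.offset, message.value)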
Apache Hive – Apache Hive is a data warehouse infrastructure built on top of Hadoop. Hive stores data in tables using HCatalog and tracks this data using the Hive Metastore. Hive uses a SQL-like query language called HiveQL, which can be translated into MapReduce or Tez jobs, allowing for efficient processing of large datasets. HiveQL makes it easy for analysts already familiar with SQL to write and execute queries for data analysis on Hadoop clusters.
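As a brief illustration of HiveQL, the sketch below submits a query through PyHive, a third-party Python client for HiveServer2; the host, database, table, and column names are placeholder assumptions.

    # Illustrative sketch using the third-party PyHive client; host, database,
    # table, and column names are placeholders.
    from pyhive import hive

    conn = hive.Connection(host="hiveserver2-host", port=10000, database="sales")
    cursor = conn.cursor()

    # HiveQL looks like SQL; Hive compiles it into MapReduce or Tez jobs.
    cursor.execute("""
        SELECT region, SUM(amount) AS total
        FROM orders
        WHERE order_date >= '2024-01-01'
        GROUP BY region
    """)
    for region, total in cursor.fetchall():
        print(region, total)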
Apache NiFi – Apache NiFi supports powerful and scalable directed graphs of data routing, transformation, and system mediation logic. It is easy to use with a browser-based web UI and provides complete data provenance tracking, from beginning to end. Some main features of Apache NiFi include secure communication over HTTPS, configurable authentication, and encryption with TLS and SSH. It has extensive configuration options for low latency, high throughput, and dynamic prioritization, and an extensible design that allows for iterative testing and rapid development.
Apache Airflow – Apache Airflow is an open source workflow management platform for data
engineering pipelines. Airflow uses directed acyclic graphs, or DAGs, to manage workflow orchestration. Tasks and dependencies are defined in Python, and Airflow manages their scheduling and execution. DAGs can be run either on a defined time-based schedule or
based on external event triggers, like a file upload. Previous DAG-based schedulers like Apache
Oozie and Azkaban tended to rely on multiple configuration files and file system trees to create
a DAG, whereas in Airflow, DAGs can often be written in one Python file for simplified
administration.
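For illustration, here is a minimal single-file DAG sketch; the task logic, DAG ID, and schedule are placeholders, and the schedule argument is named schedule_interval in older Airflow releases.

    # Illustrative single-file Airflow DAG sketch; task bodies are placeholders.
    from datetime import datetime
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def extract():
        print("pull data from the source system")

    def load():
        print("write data to the warehouse")

    with DAG(
        dag_id="example_etl",
        start_date=datetime(2024, 1, 1),
        schedule="@daily",      # time-based trigger; event-based triggers are also possible
        catchup=False,
    ) as dag:
        extract_task = PythonOperator(task_id="extract", python_callable=extract)
        load_task = PythonOperator(task_id="load", python_callable=load)

        # The >> operator defines the dependency edge in the DAG.
        extract_task >> load_task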
Istio – Istio provides a unified platform for managing microservices in large-scale cloud and distributed environments, making it a powerful tool for cloud-native application development
and deployment.
Argo – Argo is a Kubernetes-native workflow engine that supports DAGs and step-based
workflows. It provides an easy-to-use, full-featured UI, making advanced rollouts and
deployment strategies easy to execute. Argo runs workflows, manages clusters, and provides
simple, event-based dependency management for Kubernetes.
Great Expectations – Great Expectations is used alongside a workflow orchestration service and helps accelerate data projects by monitoring for and catching data issues. It's a tool that profiles,
documents, and validates data to maintain high data quality throughout a data pipeline. By
catching data issues early, data engineers are notified to fix the problems as quickly as possible.
Not only does this catch issues before insights are created or shared, but it also saves time
during data cleaning, accelerates the ETL and data normalization process, and streamlines
engineer-to-analyst handoffs.
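As a hedged illustration, the sketch below uses the classic pandas-style Great Expectations API, which varies between releases; the DataFrame contents and column names are placeholders.

    # Illustrative sketch using the classic pandas-style Great Expectations API
    # (the exact API differs between releases); data and columns are placeholders.
    import pandas as pd
    import great_expectations as ge

    df = ge.from_pandas(pd.DataFrame({
        "order_id": [1, 2, 3],
        "amount": [19.99, 5.00, 42.50],
    }))

    # Declare expectations about the data; in a pipeline these are saved into an
    # expectation suite and validated on every run.
    df.expect_column_values_to_not_be_null("order_id")
    df.expect_column_values_to_be_between("amount", min_value=0)

    results = df.validate()
    print(results.success)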
Open source tools supporting data analytics include a wide range of data analyzers, visualizers,
dashboard creators, and more. Some of these will also provide support for machine learning
tools.
Apache Superset – Apache Superset is a data exploration and visualization platform used by data analysts and business intelligence professionals to create rich, customized, interactive dashboards. It is extremely lightweight and highly scalable. With an
intuitive UI, Apache Superset is accessible to all users, even those without coding experience.
Superset makes it easy to create dashboards and visualizations from a variety of data sources,
integrating with any SQL-based database, Hadoop, and NoSQL databases.
Apache Spark – Apache Spark is a powerful framework for performing general data analytics
on distributed computing clusters like Hadoop. Spark caches datasets to provide fast, in-
memory computation. These in-memory, iterative jobs mean faster processing, which supports
complex analysis like machine learning.
Spark can also process structured data from Hive, Flume, or your own custom data sources,
through Spark Streaming or Spark SQL. Spark provides high-level APIs in Java, Scala, Python,
and R, and is an optimized engine that supports general execution graphs.
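For illustration, here is a minimal PySpark sketch showing caching and both DataFrame and SQL queries; the input path and column names are placeholder assumptions.

    # Illustrative PySpark sketch; input path and column names are placeholders.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("spark-sketch").getOrCreate()

    # Read a structured dataset and cache it in memory so repeated, iterative
    # passes (typical of machine learning workloads) avoid re-reading from disk.
    events = spark.read.parquet("/data/events.parquet").cache()

    daily_counts = (
        events.groupBy("event_date", "event_type")
              .agg(F.count("*").alias("events"))
              .orderBy("event_date")
    )
    daily_counts.show()

    # The same data can also be queried with Spark SQL.
    events.createOrReplaceTempView("events")
    spark.sql("SELECT event_type, COUNT(*) FROM events GROUP BY event_type").show()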
Apache Livy – Apache Livy is a service that enables easy interactions with Apache Spark
clusters over a REST interface. Livy simplifies interactions between Spark and application
servers, which enables the use of Spark for interactive mobile or web applications. Other
features include easy submission of Spark jobs or code snippets, synchronous or asynchronous result retrieval, and Spark context management for multiple or long-running jobs.
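As a brief illustration of the REST interface, the sketch below creates a session and submits a code snippet with the requests library; the Livy host, port, and submitted code are placeholder assumptions.

    # Illustrative sketch of Livy's REST interface using the requests library;
    # host, port, and the submitted snippet are placeholders.
    import time
    import requests

    livy = "http://livy-host:8998"

    # Create an interactive PySpark session.
    session = requests.post(f"{livy}/sessions", json={"kind": "pyspark"}).json()
    session_id = session["id"]

    # Wait for the session to become idle (simplified polling for the sketch).
    while requests.get(f"{livy}/sessions/{session_id}").json()["state"] != "idle":
        time.sleep(2)

    # Submit a code snippet as a statement; a real client would poll the
    # statement until its state is "available" before reading the output.
    stmt = requests.post(
        f"{livy}/sessions/{session_id}/statements",
        json={"code": "sc.parallelize(range(100)).sum()"},
    ).json()
    result = requests.get(f"{livy}/sessions/{session_id}/statements/{stmt['id']}").json()
    print(result["output"])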
Presto – Presto is an open source distributed SQL query engine that runs interactive analytic
queries. Designed for efficiency, Presto uses distributed execution to run interactive, ad hoc queries against data sources of various sizes, from gigabytes to petabytes, at sub-second speeds. Presto leverages in-memory parallel processing and a multithreaded execution model to keep all available CPU cores busy. It federates queries and runs them right where
the data lives.
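For illustration, the sketch below submits a query through the Presto support in PyHive, a third-party Python client; the coordinator host, catalog, schema, and table are placeholder assumptions.

    # Illustrative sketch using the third-party PyHive client for Presto;
    # host, catalog, schema, and table names are placeholders.
    from pyhive import presto

    conn = presto.connect(host="presto-coordinator", port=8080, catalog="hive", schema="web")
    cursor = conn.cursor()

    # Presto federates the query to wherever the data lives and executes it
    # in memory across the worker nodes.
    cursor.execute("""
        SELECT status, COUNT(*) AS hits
        FROM access_logs
        GROUP BY status
        ORDER BY hits DESC
    """)
    print(cursor.fetchall())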
Apache Drill – Apache Drill is a query engine for big data exploration. Drill can perform
dynamic queries on structured, semi-structured, and unstructured data on HDFS or Hive tables
using ANSI-SQL. Drill is very flexible and can query a variety of data from multiple data sources.
This makes it an excellent tool for exploring unknown data. Unlike Hive, Drill does not use
MapReduce and is ANSI-SQL compliant, making it faster and easier to use.
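As a hedged illustration, the sketch below submits an ANSI-SQL query over a raw file through Drill's REST API with the requests library; the host, port, file path, and response handling are placeholder assumptions.

    # Illustrative sketch of Drill's REST API using the requests library;
    # the Drill host and the file path being queried are placeholders.
    import requests

    response = requests.post(
        "http://drill-host:8047/query.json",
        json={
            "queryType": "SQL",
            # Drill can query raw files in place with ANSI SQL, without an
            # up-front schema definition or a MapReduce job.
            "query": "SELECT t.user_id, t.action FROM dfs.`/data/clickstream.json` AS t LIMIT 10",
        },
    )
    print(response.json()["rows"])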
Kibana – Kibana creates interactive data visualizations through a simple, intuitive UI that
allows you to build a variety of charts, stacks, and graphs for customized dashboards. These
dashboards allow drill-downs for deeper analysis and alert triggers. Dashboards can be shared and embedded in any web application or URL. As part of the Elastic Stack, Kibana works well with
other open source tools like Elasticsearch.
Data science is a rapidly evolving field and there are many available open source tools for these
types of use cases. Many tools now focus on analysis that supports specific machine learning
workflows and tasks, across a variety of platforms and environments.
PyTorch – PyTorch, now a part of the Linux Foundation, is another open source machine
learning library that is gaining popularity, particularly in the deep learning community. It
provides a flexible platform for building and training deep neural networks and is very effective
with applications like computer vision and natural language processing. PyTorch handles such complex networks through tensor computing with strong GPU acceleration. PyTorch supports all major cloud platforms, providing frictionless cloud development and easy scaling. It offers Python and C++ interfaces and is known for its ease of use and intuitive, Pythonic API.
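For illustration, here is a minimal training sketch with a tiny feed-forward network; the layer sizes, data, and hyperparameters are placeholders.

    # Illustrative PyTorch sketch: a tiny network trained on random data;
    # shapes and hyperparameters are placeholders.
    import torch
    from torch import nn

    device = "cuda" if torch.cuda.is_available() else "cpu"   # GPU acceleration when available

    model = nn.Sequential(
        nn.Linear(16, 32),
        nn.ReLU(),
        nn.Linear(32, 2),
    ).to(device)

    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.CrossEntropyLoss()

    # Tensors are the core data structure; placing them on the device enables GPU compute.
    inputs = torch.randn(64, 16, device=device)
    labels = torch.randint(0, 2, (64,), device=device)

    for _ in range(10):
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), labels)
        loss.backward()        # autograd computes gradients through the network
        optimizer.step()
    print(loss.item())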
Ray – Ray, an open source project released under the Apache 2.0 license, is used to easily scale complex AI and Python workloads. It supports general Python apps and provides scalable data processing for data loading, writing, converting, and transforming datasets. Ray accelerates and improves model training by providing hyperparameter tuning capabilities, and it supports model deployment with a framework-agnostic, Python-based model serving library.
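As a brief illustration, the sketch below scales an ordinary Python function across a cluster with Ray tasks; the function body and workload are placeholders.

    # Illustrative Ray sketch: scaling a plain Python function into parallel tasks;
    # the function body and workload are placeholders.
    import ray

    ray.init()   # connects to an existing cluster if configured, else starts locally

    @ray.remote
    def score(batch):
        # Placeholder for real work (feature computation, model scoring, etc.).
        return sum(batch)

    # Launch the tasks in parallel across the cluster and gather their results.
    futures = [score.remote(list(range(i, i + 1000))) for i in range(0, 10000, 1000)]
    print(ray.get(futures))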
Kubeflow – Kubeflow provides an easy to use toolkit for performing machine learning with
Kubernetes infrastructures. Its aim is to create simple, portable, and scalable ML solutions that support the entire workflow. It includes services to support Jupyter notebooks to create and manage interactive model development, along with a customized operator to support model training through TensorFlow. More third-party integrations are planned.
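As a hedged illustration, the sketch below uses the Kubeflow Pipelines SDK (kfp, v2-style API, which differs between releases); the component logic, pipeline name, and output path are placeholder assumptions.

    # Illustrative sketch using the Kubeflow Pipelines SDK (kfp v2-style API);
    # component logic and the output path are placeholders.
    from kfp import dsl, compiler

    @dsl.component
    def train(learning_rate: float) -> str:
        # Placeholder training step; runs inside its own container on Kubernetes.
        return f"model trained with lr={learning_rate}"

    @dsl.pipeline(name="example-training-pipeline")
    def pipeline(learning_rate: float = 0.01):
        train(learning_rate=learning_rate)

    # Compile the pipeline to a definition that can be uploaded to a Kubeflow cluster.
    compiler.Compiler().compile(pipeline, package_path="pipeline.yaml")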
Feast – Feast is a stand-alone feature store that allows an organization to build and manage
data pipelines with their existing tools, all while using Feast to reliably store and serve feature
values for machine learning projects. A feature store acts as a central data hub of input features
and metadata throughout a project’s lifecycle.
Feast is highly scalable and supports low latency online serving. Using a feature store provides
consistent, quality feature data for model training, for both offline training and online inference.
Feast extends your existing stack, working with the infrastructure and tools already used in your machine learning workflows.
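For illustration, the sketch below fetches online features with the Feast Python SDK; the repository path, feature view, feature names, and entity key are placeholder assumptions from a hypothetical feature repository.

    # Illustrative sketch of the Feast Python SDK; repo path, feature names,
    # and entity key are placeholders from a hypothetical feature repository.
    from feast import FeatureStore

    store = FeatureStore(repo_path=".")

    # Low-latency online serving: fetch the latest feature values for one entity,
    # for example at model inference time.
    features = store.get_online_features(
        features=[
            "driver_stats:trips_today",
            "driver_stats:avg_rating",
        ],
        entity_rows=[{"driver_id": 1001}],
    ).to_dict()
    print(features)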
MLflow – MLflow is another Linux Foundation project that supports the entire machine learning lifecycle. MLflow manages model development from experimentation through reproducibility and deployment, with a central model registry. It records and
queries model experiments for tracking and packages code in multiple formats. This allows
teams to reproduce runs on multiple platforms to compare results or deploy models in diverse
serving environments.
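As a brief illustration, the sketch below logs parameters, a metric, and an artifact for a tracked run; the experiment name, values, and artifact path are placeholders.

    # Illustrative MLflow tracking sketch; names and values are placeholders.
    import mlflow

    mlflow.set_experiment("churn-model")

    with mlflow.start_run():
        # Record the run's inputs and results so it can be reproduced and
        # compared later in the MLflow UI or promoted through the model registry.
        mlflow.log_param("learning_rate", 0.01)
        mlflow.log_param("max_depth", 6)
        mlflow.log_metric("accuracy", 0.93)
        mlflow.log_artifact("model_card.md")  # attach a supporting file (placeholder path)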
Security tools can address encryption, authentication and verification processes, access control policies, or auditing tasks, and they cover a wide variety of security purposes such as intrusion detection, securing communications and services, penetration testing, and more.
SPIFFE – As part of the Linux Foundation's CNCF, the open source SPIFFE project stands for Secure Production Identity Framework for Everyone. It is a standard specification and toolchain for service identity. SPIFFE supports enterprise-level zero trust architectures and works with SPIRE, the SPIFFE Runtime Environment and its reference implementation.
Falco – Falco is a cloud-native runtime security project for threat detection on Kubernetes. It
detects threats at runtime by observing the behavior of applications and containers. Falco
essentially acts as a security camera to find unexpected behavior, intrusions, and data theft in
real time. Through available plugins, Falco can also be extended across cloud environments.
Apache Sentry – Apache Sentry provides fine-grained access control to data and metadata
stored in Hadoop clusters. The tool enables users to control and restrict access to sensitive
data and defines role-based access policies, ensuring that only authorized users can access and
modify it. Apache Sentry provides a centralized authorization solution for Apache Hadoop and
integrates with a variety of other Hadoop components like Hive, Impala, and HBase. This
simplifies security administration and reduces the risk of errors or inconsistencies.
Apache Ranger – Apache Ranger provides overall data security across the Apache Hadoop
ecosystem. It is a central framework used to manage and administer security policies and
access. With Apache Ranger, authorization methods can be standardized with fine-grained
controls. This tool also makes it easy to centralize auditing activities for all Hadoop
components.