Data Pipeline Essentials
SUDIP SENGUPTA
TECHNICAL WRITER AT JAVELYNN
Modern data-driven applications are built on diverse data sources and complex data stacks that require well-designed frameworks to deliver operational efficiency and business insights. The result is a flexible, dynamic, and scalable application that enables businesses to predict, influence, and optimize their business outcomes based on real-time recommendations. Given the numerous benefits that data-driven applications offer to businesses, Gartner predicts that the role of data in achieving agility and collaboration will grow further over the next five years.

In this Refcard, we delve into the fundamentals of a data pipeline and the problems it solves for modern enterprises, along with its benefits and challenges.

WHAT IS A DATA PIPELINE?
A data pipeline comprises a collection of tools and processes for the efficient transfer, storage, and processing of data across multiple systems. With data pipelines, organizations can automate information extraction from distributed sources while consolidating data into high-performance storage for centralized access. A data pipeline essentially forms the foundation for building and managing analytical tools that deliver critical insights and support strategic business decisions.

By building reliable pipelines for the consolidation and management of data flows, development and DataOps teams can also efficiently train, analyze, and deploy machine learning models.

BATCH PIPELINES
In batch pipelines, data sets are collected over time in batches and then fed into storage clusters for future use. These pipelines are mostly applicable to legacy systems that cannot deliver data in streams, or to use cases that deal with colossal amounts of data. Batch pipelines are usually deployed when there is no need for real-time analytics, and they are popular for use cases such as billing, payroll processing, and customer order management.

STREAMING PIPELINES
In contrast to batch pipelines, streaming data pipelines continuously ingest data, processing it as soon as it reaches the storage layer. Such pipelines rely on highly efficient frameworks that support the ingestion and processing of a continuous stream of data within a sub-second time frame.
As a result, streaming data pipelines are best suited for operations that require quicker analysis and real-time insights on smaller data sets. Typical use cases include social media engagement analysis, log monitoring, traffic management, user experience analysis, and real-time fraud detection.
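To make the contrast concrete, here is a minimal sketch of the same processing logic written in both styles, assuming hypothetical load_batch() and event_stream() sources; it illustrates the timing difference only, not a production framework:

```python
import time
from typing import Iterator, List

def load_batch() -> List[dict]:
    """Hypothetical batch source: returns every record collected so far."""
    return [{"order_id": i, "amount": 10.0 * i} for i in range(1, 6)]

def event_stream() -> Iterator[dict]:
    """Hypothetical streaming source: yields records one by one as they arrive."""
    for i in range(1, 6):
        time.sleep(0.1)                      # simulate sub-second arrival
        yield {"order_id": i, "amount": 10.0 * i}

def process(record: dict) -> dict:
    return {**record, "amount_with_tax": round(record["amount"] * 1.2, 2)}

# Batch pipeline: collect everything first, then process in one scheduled run
batch_results = [process(r) for r in load_batch()]

# Streaming pipeline: process each record the moment it is ingested
for record in event_stream():
    print(process(record))                   # real-time checks would happen here
```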
DATA PIPELINE PROCESSES
Though the underlying framework of a data pipeline differs based on the use case, most pipelines rely on a number of common processes and elements for efficient data flow. Some key processes of data pipelines include:

SOURCES
In a data pipeline, a source acts as the first point of the framework that feeds information into the pipeline. Common sources include NoSQL databases, application APIs, cloud sources, Apache Hadoop, relational databases, and many more.

JOINS
A join is an operation that establishes a connection between disparate data sets by combining tables. In doing so, the join specifies the criteria and logic for combining data from different sources into a single pipeline.

Joins in data processing are categorized as follows (see the sketch after this list):
• INNER Join — Retrieves records whose values match in both tables
• LEFT (OUTER) Join — Retrieves all records from the left table plus matching records from the right table
• RIGHT (OUTER) Join — Retrieves all records from the right table plus matching records from the left table
• FULL (OUTER) Join — Retrieves all records, whether or not there is a match in either table. In a star schema, full joins are typically implemented through conformed dimensions that link fact tables, creating fact-to-fact joins.
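As a quick illustration of the four join types, the sketch below uses pandas merges on two invented tables, orders and customers (the tables and key column are assumptions for the example, not part of this Refcard):

```python
import pandas as pd

# Two hypothetical tables sharing the key column "customer_id"
orders = pd.DataFrame({"customer_id": [1, 2, 4], "amount": [120.0, 80.5, 42.0]})
customers = pd.DataFrame({"customer_id": [1, 2, 3], "name": ["Ada", "Bo", "Cy"]})

inner = orders.merge(customers, on="customer_id", how="inner")  # matching rows only
left  = orders.merge(customers, on="customer_id", how="left")   # all orders
right = orders.merge(customers, on="customer_id", how="right")  # all customers
full  = orders.merge(customers, on="customer_id", how="outer")  # everything

print(full)
# customer_id 3 has no order and customer_id 4 has no customer record,
# so the full (outer) join keeps both, with NaN in the missing columns.
```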
EXTRACTION
Source data remains in a raw format that requires processing before further analysis. Extraction is the first step of data ingestion, where the data is crawled and analyzed to ensure information relevancy before it is passed to the storage layer for transformation.
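A minimal extraction sketch might simply pull raw records from a source and land them, untouched, in a staging area for later processing; the source file and the raw_zone directory below are hypothetical:

```python
import csv
import json
from pathlib import Path

def extract_csv(source_file: str, raw_zone: str = "raw_zone") -> Path:
    """Read raw rows from a CSV source and land them unchanged as JSON lines."""
    Path(raw_zone).mkdir(exist_ok=True)
    target = Path(raw_zone) / (Path(source_file).stem + ".jsonl")
    with open(source_file, newline="") as src, open(target, "w") as dst:
        for row in csv.DictReader(src):
            dst.write(json.dumps(row) + "\n")   # no transformation at this stage
    return target

# Example: extract_csv("exports/orders_2024-05-01.csv")   # hypothetical source file
```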
STANDARDIZATION
Once data has been extracted, it is converted into a uniform format that enables efficient analysis, research, and utilization. Standardization is the process of bringing data with disparate variables onto the same scale to enable easier comparison and trend analysis. Data standardization is commonly applied to attributes such as dates, units of measure, color, size, etc.
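For example, a standardization step might normalize date formats and units of measure before data is compared or aggregated. The column names and formats in this pandas sketch are illustrative assumptions:

```python
import pandas as pd

raw = pd.DataFrame({
    "order_date": ["01/05/2024", "02/05/2024", "03/05/2024"],   # day/month/year strings
    "weight": [2.0, 1500.0, 0.75],
    "weight_unit": ["kg", "g", "kg"],
})

standardized = raw.assign(
    # Dates parsed into a single, unambiguous representation
    order_date=pd.to_datetime(raw["order_date"], format="%d/%m/%Y").dt.date,
    # All weights expressed in kilograms
    weight_kg=raw["weight"].where(raw["weight_unit"] == "kg", raw["weight"] / 1000),
).drop(columns=["weight", "weight_unit"])

print(standardized)
```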
CORRECTION
This process involves cleansing the data to eliminate errors and pattern anomalies. When performing correction, data engineers typically use rules to identify violations of data expectations and then modify the data to meet the organization's needs. Unaccepted values can then be ignored, reported, or cleansed according to pre-defined business or technical rules.
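A correction step can be sketched as a small rule set applied to each record, with violations cleansed, reported, or dropped; the rules and field names below are invented for illustration:

```python
records = [
    {"customer_id": 1, "age": 34, "country": "US"},
    {"customer_id": 2, "age": -5, "country": "US"},     # violates the age rule
    {"customer_id": 3, "age": 51, "country": "ZZ"},     # unknown country code
]

VALID_COUNTRIES = {"US", "GB", "DE", "IN"}
violations = []
cleansed = []

for rec in records:
    if rec["age"] < 0 or rec["age"] > 120:
        violations.append((rec["customer_id"], "age out of range"))  # report it
        rec = {**rec, "age": None}                                   # cleanse to null
    if rec["country"] not in VALID_COUNTRIES:
        violations.append((rec["customer_id"], "unknown country"))
        continue                                                     # ignore the record
    cleansed.append(rec)

print(cleansed)
print(violations)
```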
LOADS
Once data has been extracted, standardized, and cleansed, it is loaded into the destination system, such as a data warehouse or relational database, for storage or analysis.
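The load step can be as simple as writing the cleansed records into the destination store. The sketch below uses SQLite purely as a stand-in for a data warehouse or relational database:

```python
import sqlite3

# (customer_id, age, country) tuples produced by the earlier example steps
cleansed = [(1, 34, "US"), (3, 51, "GB")]

conn = sqlite3.connect("warehouse.db")      # hypothetical destination database
conn.execute(
    "CREATE TABLE IF NOT EXISTS customers "
    "(customer_id INTEGER PRIMARY KEY, age INTEGER, country TEXT)"
)
conn.executemany(
    "INSERT OR REPLACE INTO customers (customer_id, age, country) VALUES (?, ?, ?)",
    cleansed,
)
conn.commit()
conn.close()
```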
AUTOMATION
Data pipelines often involve multiple iterations of administrative and executive tasks. Automation involves monitoring workflows to identify patterns for scheduling tasks and executing them with minimal human intervention. Comprehensive automation of a data pipeline also involves error detection and notification mechanisms to maintain consistent data sanity.
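Orchestrators such as Apache Airflow provide scheduling, retries, and alerting out of the box; the plain-Python sketch below only illustrates the idea of automated retries plus a notification hook, with notify() standing in for a hypothetical email or chat alert:

```python
import logging
import time

logging.basicConfig(level=logging.INFO)

def notify(message: str) -> None:
    """Hypothetical alerting hook (e.g., email, Slack, or a pager)."""
    logging.error("ALERT: %s", message)

def run_with_retries(task, name: str, retries: int = 3, delay_seconds: float = 5.0):
    """Run a pipeline task, retrying on failure and notifying if it keeps failing."""
    for attempt in range(1, retries + 1):
        try:
            result = task()
            logging.info("%s succeeded on attempt %d", name, attempt)
            return result
        except Exception as exc:
            logging.warning("%s failed on attempt %d: %s", name, attempt, exc)
            time.sleep(delay_seconds)
    notify(f"{name} failed after {retries} attempts")
    raise RuntimeError(f"{name} did not complete")

# Example: run_with_retries(lambda: print("extracting..."), name="extract_orders")
```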
DEPLOYING A DATA PIPELINE
Considered one of the most crucial components of modern data-driven applications, a data pipeline automates the extraction, correlation, and analysis of data for seamless decision-making. Building a data pipeline that is production-ready, consistent, and reproducible involves plenty of factors that make it a highly technical affair. This section explores the key considerations, components, and options available when building a data pipeline in production.

COMPONENTS OF A DATA PIPELINE
A data pipeline relies on a combination of tools and methodologies to enable efficient extraction and transformation of data. These include:

Figure: Common components of a data pipeline

EVENT FRAMEWORKS
Event processing encompasses analysis and decision-making based on data streamed continuously from applications. These systems extract information from data points that respond to tasks performed by users or by various application services. Any identifiable task or process that causes a change in the system is marked as an event, which is recorded in an event log for processing and analysis.
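As a minimal sketch of this idea, the snippet below records user- and service-generated changes as timestamped entries in an append-only event log; the event names and log file are hypothetical:

```python
import json
import time
from pathlib import Path

EVENT_LOG = Path("events.jsonl")   # hypothetical append-only event log

def record_event(event_type: str, payload: dict) -> None:
    """Append one event with a timestamp so it can be processed or replayed later."""
    event = {"ts": time.time(), "type": event_type, "payload": payload}
    with EVENT_LOG.open("a") as log:
        log.write(json.dumps(event) + "\n")

record_event("order_placed", {"order_id": 42, "amount": 99.0})
record_event("profile_updated", {"user_id": 7, "field": "email"})

# Downstream consumers read the log and react to each recorded change
for line in EVENT_LOG.read_text().splitlines():
    print(json.loads(line)["type"])
```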
MESSAGE BUS
A message bus is a combination of a messaging infrastructure and a data model that receives and queues data sent between different systems. Leveraging an asynchronous messaging mechanism, applications use a message bus to exchange data between systems instantaneously, without having to wait for an acknowledgement. A well-architected message bus also allows disparate systems to communicate using their own protocols without worrying about system inaccessibility, errors, or conflicts.
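Production systems typically rely on a broker such as Apache Kafka or RabbitMQ for this role; the sketch below only mimics the asynchronous, queue-backed exchange in-process using Python's standard library, with invented producer and consumer roles:

```python
import queue
import threading

bus = queue.Queue()   # in-memory stand-in for a message bus

def producer() -> None:
    """Publishes messages without waiting for the consumer to acknowledge them."""
    for order_id in range(3):
        bus.put({"type": "order_placed", "order_id": order_id})
    bus.put({"type": "shutdown"})

def consumer() -> None:
    """Receives queued messages asynchronously and processes them at its own pace."""
    while True:
        message = bus.get()
        if message["type"] == "shutdown":
            break
        print("processing", message)

threading.Thread(target=producer).start()
consumer_thread = threading.Thread(target=consumer)
consumer_thread.start()
consumer_thread.join()
```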
Beyond the individual components, organizations must also decide how to host and orchestrate their data pipelines. The traditional approach is to build them in-house, provisioning infrastructure in a self-managed, private data center setup. This offers various benefits, including flexible customization and complete control over how data is handled.

However, self-managed orchestration frameworks rely on a wide range of tools and niche skills, and such platforms are also considered less flexible for pipelines that require constant scaling or high availability. Unified data orchestration platforms, on the other hand, are backed by the right tools and expertise, offering the higher computing power and replication that enable organizations to scale quickly while maintaining minimal latency.
ONLINE TRANSACTION PROCESSING (OLTP) VS. ONLINE ANALYTICAL PROCESSING (OLAP)
Data pipelines commonly move data out of transactional (OLTP) systems and into analytical (OLAP) stores such as data warehouses; ETL and ELT describe two approaches for doing so.

In ETL (Extract-Transform-Load), data is first transformed on a staging server before it is loaded into the destination storage or data warehouse. ETL is easier to implement and is suited to on-premises data pipelines running mostly structured, relational data.

In ELT (Extract-Load-Transform), on the other hand, data is loaded directly into the destination system before processing or transformation. Compared to ETL, ELT is more flexible and scalable, making it suitable for both structured and unstructured cloud workloads.
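The difference is easiest to see side by side. In this sketch, which uses SQLite as a stand-in warehouse and invented table names, ETL transforms records in the staging code before loading, while ELT loads the raw rows first and pushes the transformation down to the destination as SQL:

```python
import sqlite3

raw_rows = [("2024-05-01", "99.50 USD"), ("2024-05-02", "12.00 USD")]
conn = sqlite3.connect(":memory:")          # stand-in for the destination warehouse

# --- ETL: transform in the staging layer, then load the clean result ---
transformed = [(date, float(amount.split()[0])) for date, amount in raw_rows]
conn.execute("CREATE TABLE sales_etl (sale_date TEXT, amount_usd REAL)")
conn.executemany("INSERT INTO sales_etl VALUES (?, ?)", transformed)

# --- ELT: load the raw data as-is, then transform inside the destination ---
conn.execute("CREATE TABLE sales_raw (sale_date TEXT, amount_text TEXT)")
conn.executemany("INSERT INTO sales_raw VALUES (?, ?)", raw_rows)
conn.execute("""
    CREATE TABLE sales_elt AS
    SELECT sale_date,
           CAST(replace(amount_text, ' USD', '') AS REAL) AS amount_usd
    FROM sales_raw
""")

print(conn.execute("SELECT * FROM sales_elt").fetchall())
```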
CHALLENGES OF IMPLEMENTING A DATA PIPELINE
A data pipeline includes a series of steps that are executed sequentially on each dataset in order to generate a final output. The entire process usually involves complex stages of extraction, processing, storage, and analysis. As a result, each stage, as well as the overall framework, requires diligent management and the adoption of best practices. Some common challenges when implementing a data pipeline include:

COMPLEXITY IN SECURING SENSITIVE DATA
Organizations host petabytes of data for multiple users with different data requirements. Each of these users has different access permissions for different services, requiring restrictions on how data can be accessed, shared, or modified. Assigning access rights to every individual manually is often a herculean task, which, if not done right, may expose sensitive information to malicious actors.

Another common challenge involves joins. Joins allow data teams to combine data from two separate tables and extract insights, and given the number of sources involved, modern data pipelines use multiple joins for end-to-end orchestration. These joins consume computing resources, thereby slowing down data operations.

GROWING TALENT GAP
With the growth of emerging disciplines such as data science and deep learning, companies require more personnel and expertise than job markets can offer. Added to this, a typical data pipeline implementation involves a steep learning curve, requiring organizations to dedicate resources to either upskill existing staff or hire skilled experts.

SLOW DEVELOPMENT OF RUNNABLE DATA TRANSFORMATIONS
With modern data pipelines, organizations are able to build functional data models based on recorded data definitions. However, developing functional transformations from these models comes with its own challenges, as the process is expensive, slow, and error-prone. Developers are often required to manually create executable code and runtimes for data models, resulting in ad hoc, unstable transformations.

ADVANCED STRATEGIES FOR MODERN DATA PIPELINES
Some best practices for implementing useful data pipelines include:

GRADUAL BUILD USING MINIMUM VIABLE PRODUCT PRINCIPLES
When developing a lean data pipeline, it is important to implement an architecture that scales to meet growing needs while still being easy to manage. As a recommended best practice, organizations should apply a modular approach while incrementally developing the pipeline.
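One way to read the MVP advice is to keep each stage a small, replaceable function and grow the pipeline incrementally. The sketch below, with invented step names, composes such stages so that new modules can be slotted in without reworking the rest:

```python
from typing import Callable, Iterable, List

Record = dict
Step = Callable[[Iterable[Record]], Iterable[Record]]

def extract(_: Iterable[Record]) -> Iterable[Record]:
    return [{"amount": "10.5"}, {"amount": "3.0"}]

def standardize(records: Iterable[Record]) -> Iterable[Record]:
    return [{**r, "amount": float(r["amount"])} for r in records]

def load(records: Iterable[Record]) -> Iterable[Record]:
    records = list(records)
    print("loading", records)
    return records

def run_pipeline(steps: List[Step]) -> Iterable[Record]:
    data: Iterable[Record] = []
    for step in steps:              # each stage is small, independent, and replaceable
        data = step(data)
    return data

# Start with a minimum viable pipeline...
run_pipeline([extract, load])
# ...then grow it incrementally by slotting in new modules
run_pipeline([extract, standardize, load])
```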
In addition, DataOps teams should leverage auto-provisioning, auto-scaling, and auto-tuning to reduce design time and simplify routing. Autoscaling is crucial since big data workloads have data intake requirements that vary dramatically within short durations.
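As a toy illustration of why autoscaling matters for bursty intake, the function below derives a worker count from the current ingestion rate; the thresholds and the notion of "workers" are purely illustrative, and real deployments would rely on their platform's autoscaler:

```python
import math

def workers_needed(events_per_second: float,
                   events_per_worker: float = 500.0,
                   min_workers: int = 1,
                   max_workers: int = 20) -> int:
    """Scale the worker count with the current intake rate, within fixed bounds."""
    desired = math.ceil(events_per_second / events_per_worker)
    return max(min_workers, min(max_workers, desired))

for rate in (50, 800, 12_000, 300):   # intake that varies sharply over short windows
    print(f"{rate:>6} events/s -> {workers_needed(rate)} worker(s)")
```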