
DATA PIPELINE
Week 8 Classroom Exercise

By Min Zaw, May Khine


WHAT IS DATA PIPELINE?
A data pipeline is the process of collecting data from its original
source and transferring it to a new destination, optimizing,
consolidating, and modifying the data along the way. On its own, this
definition looks simplistic and misses the key point of a data pipeline.

For example, moving data from Point A to Point B also happens in data
replication, but replication alone does not qualify as a data pipeline.
The key differentiator lies in the transformational steps of a data
pipeline.
GENERAL IDEAS

Data Ingestion
Collect data from sources like APIs, databases, logs, and IoT devices.
Types: Streaming (Google Pub/Sub), Batching (Airbyte)

Data Processing
The process of converting the ingested data from one format, structure, or set of values to another by joining, filtering, etc.

Data Destination & Sharing
The process of making data available for analysis and utilization (feeding BI, analytics, and machine learning models).
A minimal end-to-end sketch of these three stages follows.
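To make the three stages concrete, here is a minimal sketch in Python that ingests from a hypothetical JSON API, filters and reshapes the records, and lands them in a local SQLite file. The endpoint, field names, and table are illustrative assumptions, not taken from the slides.

```python
# Minimal sketch of the three stages: ingest -> process -> destination.
# EXAMPLE_URL, the record fields, and the table name are hypothetical.
import json
import sqlite3
import urllib.request

EXAMPLE_URL = "https://example.com/api/orders"  # assumed source endpoint

def ingest(url: str) -> list[dict]:
    """Data ingestion: pull raw records from an API."""
    with urllib.request.urlopen(url) as resp:
        return json.loads(resp.read())

def process(records: list[dict]) -> list[tuple]:
    """Data processing: filter and reshape the ingested records."""
    return [
        (r["order_id"], r["amount"])
        for r in records
        if r.get("status") == "completed"   # filtering step
    ]

def load(rows: list[tuple], db_path: str = "pipeline.db") -> None:
    """Data destination: store results where analysis tools can read them."""
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS orders (order_id TEXT, amount REAL)")
    con.executemany("INSERT INTO orders VALUES (?, ?)", rows)
    con.commit()
    con.close()

if __name__ == "__main__":
    load(process(ingest(EXAMPLE_URL)))
```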
INDUSTRIAL VS HOBBY REQUIREMENTS

Industrial: require high scalability, compliance, security, and automation.
Hobby: focus on cost-efficiency, simplicity, and easy deployment.
ARCHITECTURAL OPTIONS

Batch Processing Pipeline
Use case: large-scale ETL jobs for historical data
Common tools: Apache Spark, Hadoop
Example: nightly ETL job that updates a data warehouse
Hobby consideration: use lightweight tools like Python scripts and SQLite (see the sketch below)

Streaming Data Pipeline
Use case: real-time analytics and event-driven architectures
Common tools: Apache Kafka, Spark Streaming
Example: processing real-time sensor data from IoT devices
Hobby consideration: use MQTT and a lightweight framework
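As a sketch of the hobby-scale batch option, the script below aggregates one day's CSV export and appends the totals to a SQLite "warehouse" table. The file name pattern, column names, and table are assumptions for illustration only.

```python
# A hobby-scale version of the nightly batch job: read one day's CSV export
# and update an aggregate table in SQLite. File and column names are assumed.
import csv
import sqlite3
from collections import defaultdict

def nightly_batch(csv_path: str = "sales_2024-01-01.csv",
                  db_path: str = "warehouse.db") -> None:
    # Aggregate the day's sales per product before loading.
    totals = defaultdict(float)
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            totals[row["product"]] += float(row["amount"])

    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS daily_sales (day TEXT, product TEXT, total REAL)")
    day = csv_path.removesuffix(".csv").split("_")[-1]  # date taken from the file name
    con.executemany(
        "INSERT INTO daily_sales VALUES (?, ?, ?)",
        [(day, product, total) for product, total in totals.items()],
    )
    con.commit()
    con.close()
```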
KEY ENVIRONMENTAL CONSIDERATIONS

Cloud vs. On-Premises: cloud offers scalability, while on-premises gives control.
Scalability Needs: consider auto-scaling and distributed computing.
Security & Compliance: data encryption, access control, regulatory requirements.
STEPS TO BUILD A DATA PIPELINE

THE GOAL AND THE DESIGN (ARCHITECTURE)
The foundation of a successful data pipeline is a clear understanding of its goal and of the architectural framework that supports it. Two actions for that:

1. Identify the primary goal of your data pipeline, such as automating reporting for monthly sales data.
2. Choose the right tools and technologies and design the data models that support your pipeline, such as a star schema (a sketch follows).
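As one illustration of the design step, a minimal star schema for the monthly-sales example could look like this in SQLite. Table and column names are hypothetical and only show the fact-plus-dimensions shape.

```python
# A minimal star schema for the monthly sales example (hypothetical names):
# one fact table referencing two dimension tables, created in SQLite.
import sqlite3

schema = """
CREATE TABLE IF NOT EXISTS dim_date (
    date_id    INTEGER PRIMARY KEY,
    day        TEXT,
    month      TEXT,
    year       INTEGER
);
CREATE TABLE IF NOT EXISTS dim_product (
    product_id INTEGER PRIMARY KEY,
    name       TEXT,
    category   TEXT
);
CREATE TABLE IF NOT EXISTS fact_sales (
    sale_id    INTEGER PRIMARY KEY,
    date_id    INTEGER REFERENCES dim_date(date_id),
    product_id INTEGER REFERENCES dim_product(product_id),
    amount     REAL
);
"""

con = sqlite3.connect("sales.db")
con.executescript(schema)   # creates the fact and dimension tables
con.close()
```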
CHOOSING DATA SOURCES, INGESTING, AND VALIDATING DATA

Create a system for collecting your data from various sources and checking its accuracy. (Actions)

Create connections to your data sources, such as a Customer Relationship Management system or social media platforms (see the sketch below).
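A small ingestion-and-validation sketch, assuming the CRM can export a CSV file with a few required columns; the column names are invented for the example.

```python
# Read a (hypothetical) CRM export in CSV form and keep only rows that
# pass basic accuracy checks; rejected rows are kept for inspection.
import csv

REQUIRED = ("customer_id", "email", "signup_date")  # assumed columns

def ingest_and_validate(path: str) -> tuple[list[dict], list[dict]]:
    valid, rejected = [], []
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            # Validation: every required field must be present and non-empty.
            if all(row.get(col) for col in REQUIRED):
                valid.append(row)
            else:
                rejected.append(row)
    return valid, rejected

# Usage (assuming the export exists on disk):
# good_rows, bad_rows = ingest_and_validate("crm_export.csv")
```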
DESIGNING THE DATA PROCESSING PLAN
After data ingestion, the main focus is processing, which means converting the data into a form that is readable and editable.

Define and apply data transformations such as filtering or aggregating (see the sketch below).
Code and configuration tools can also carry out these transformations.
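A minimal transformation step combining the two operations the slide names, filtering and aggregating. The record fields are illustrative.

```python
# Filter out incomplete records, then aggregate sales amounts per product.
from collections import defaultdict

def transform(records: list[dict]) -> dict[str, float]:
    totals: dict[str, float] = defaultdict(float)
    for r in records:
        if r.get("amount") is None:                # filtering: drop incomplete rows
            continue
        totals[r["product"]] += float(r["amount"])  # aggregation per product
    return dict(totals)

# Usage:
# transform([{"product": "A", "amount": 3.0}, {"product": "A", "amount": 2.5}])
# -> {"A": 5.5}
```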
SETTING THE DATA STORAGE AND ORCHESTRATING THE FLOW OF DATA
Once data is processed and validated, the next important step is determining where the data will be stored and how the data flow will be controlled and maintained efficiently.

1. Choose the relevant storage and design the schemas for the data.
2. Set up the data orchestration, for instance the schedule for data flows, protocols, and dependencies (a scheduling sketch follows).
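The slides do not name an orchestrator; in practice a cron entry or a tool such as Airflow would schedule the runs. The loop below is only a sketch of the idea of a fixed daily schedule with steps run in dependency order.

```python
# A very small orchestration sketch: run the pipeline once per day at a
# fixed hour. run_pipeline() stands in for ingest -> transform -> load.
import time
from datetime import datetime, timedelta

def run_pipeline() -> None:
    # Steps would run in dependency order here.
    print(f"[{datetime.now():%Y-%m-%d %H:%M}] pipeline run")

def schedule_daily(hour: int = 2) -> None:
    while True:
        now = datetime.now()
        nxt = now.replace(hour=hour, minute=0, second=0, microsecond=0)
        if nxt <= now:
            nxt += timedelta(days=1)
        time.sleep((nxt - now).total_seconds())   # wait until the next run
        run_pipeline()

# schedule_daily()  # uncomment to start the loop
```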
ORGANIZE THE DATA PIPELINE AND SET UP THE MONITORING AND MAINTENANCE PROCESS

Assemble the data pipeline so that it operates smoothly, then monitor its routines and maintain the data.

1. Choose the suitable environment (cloud or local).
2. Track pipeline performance, and update and maintain the pipeline (a logging sketch follows).
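A minimal monitoring sketch using the standard logging module: each step logs its duration and row count so failures and slowdowns become visible. The wrapper and step names are placeholders.

```python
# Log run duration and row counts for each pipeline step.
import logging
import time

logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("pipeline")

def monitored_run(step_name, func, *args):
    start = time.monotonic()
    try:
        result = func(*args)
        rows = len(result) if hasattr(result, "__len__") else "?"
        log.info("%s succeeded in %.1fs (%s rows)",
                 step_name, time.monotonic() - start, rows)
        return result
    except Exception:
        log.exception("%s failed", step_name)   # alerting could hook in here
        raise
```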
PLANNING THE CONSUMPTION LAYER FOR DATA

The final step is to consider how the processed data will be consumed and used.

Set up the delivery process through which the data will be made available to analytics tools (see the sketch below).
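One simple consumption layer, assuming the SQLite table from the earlier ingestion sketch: export the processed results as a CSV that BI or analytics tools can pick up. Paths and table names are assumptions.

```python
# Expose the processed table as a CSV export for analytics tools.
import csv
import sqlite3

def export_for_analytics(db_path: str = "pipeline.db",
                         out_path: str = "sales_report.csv") -> None:
    con = sqlite3.connect(db_path)
    rows = con.execute("SELECT order_id, amount FROM orders").fetchall()
    con.close()
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["order_id", "amount"])   # header row for the BI tool
        writer.writerows(rows)
```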
TOOLS AND THEIR PROS & CONS

Google Pub/Sub - Event messaging
Pros: scalable, managed
Cons: Google Cloud only

Python Scripts - Simple file-based ingestion
Pros: easy for small-scale projects
Cons: not scalable

SQLite - Lightweight database
Pros: easy setup, local storage
Cons: not for distributed use

Apache Spark - Batch & real-time processing
Pros: fast, scalable
Cons: high memory usage

Apache Kafka - Event-driven streaming
Pros: scalable, real-time
Cons: complex setup
GUIDELINES
✅ Modularity – Keep components independent for flexibility.
✅ Scalability – Choose tools that support growing data needs.
✅ Observability – Implement logging, monitoring, and alerting.
✅ Security Best Practices – Encrypt data in transit & at rest, role-based access control.
✅ Error Handling & Retry Mechanisms – Implement auto-retries for failures (see the sketch below).
✅ Cost Awareness – Optimize costs for cloud usage, tools, and storage.
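A small retry sketch for the error-handling guideline: retry a flaky step a few times with exponential backoff before giving up. The wrapper name is illustrative.

```python
# Retry a callable with exponential backoff: 1s, 2s, 4s, ...
import time

def with_retries(func, attempts: int = 3, base_delay: float = 1.0):
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == attempts:
                raise                                   # exhausted: surface the error
            time.sleep(base_delay * 2 ** (attempt - 1))  # back off before retrying

# Usage, e.g. with the ingest() function from the earlier sketch:
# with_retries(lambda: ingest(EXAMPLE_URL))
```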
WHAT TO AVOID
❌ Tightly Coupled Systems – Hard to scale and maintain.
❌ Ignoring Data Quality – Always clean and validate data.
❌ Hardcoding Dependencies – Use environment variables/config management (see the sketch below).
❌ Neglecting Cost Optimization – Optimize storage, processing frequency, and cloud resources.
❌ Overengineering for Hobby Projects – Keep it simple and manageable.
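A tiny example of avoiding hardcoded dependencies: read connection details from environment variables with safe defaults. The variable names are invented for illustration.

```python
# Configuration via environment variables instead of hardcoded literals.
import os

DB_PATH = os.environ.get("PIPELINE_DB_PATH", "pipeline.db")
SOURCE_URL = os.environ.get("PIPELINE_SOURCE_URL", "https://example.com/api/orders")

# The rest of the pipeline imports these settings instead of literal strings,
# so switching environments requires no code change, e.g.:
#   PIPELINE_DB_PATH=/data/prod.db python run_pipeline.py
```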
REFERENCE
How to Build a Data Pipeline in 6 Steps
https://www.ascend.io/blog/how-to-build-a-data-pipeline-in-six-steps/
