
DATA PIPELINE
Week 8 Classroom Exercise

By Min Zaw, May Khine


WHAT IS DATA PIPELINE?
A data pipeline is the process of collecting data from its original
source and transferring it to a new destination, optimizing,
consolidating, and modifying the data along the way. On its own, this
definition looks simplistic and misses the key point of a data pipeline.

For example, moving data from Point A to Point B also happens in data
replication, but replication alone does not qualify as a data pipeline.
The key differentiator lies in the transformational steps of a data
pipeline.
GENERAL IDEAS

Data Ingestion
Collect data from sources like APIs, databases, logs, and IoT devices.
Types: Streaming (Google Pub/Sub), Batching (Airbyte)

Data Processing
The process of converting the ingested data from one format, structure, or set of values to another by joining, filtering, etc.

Data Destination & Sharing
The process of making data available for analysis and utilization (feeding BI, analytics, and machine learning models).
A minimal end-to-end sketch of these three stages follows.
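To make the three stages concrete, here is a minimal sketch in Python that ingests from a hypothetical JSON API, filters and reshapes the records, and lands them in a local SQLite file. The endpoint, field names, and table are illustrative assumptions, not taken from the slides.

```python
# Minimal sketch of the three stages: ingest -> process -> destination.
# EXAMPLE_URL, the record fields, and the table name are hypothetical.
import json
import sqlite3
import urllib.request

EXAMPLE_URL = "https://example.com/api/orders"  # assumed source endpoint

def ingest(url: str) -> list[dict]:
    """Data ingestion: pull raw records from an API."""
    with urllib.request.urlopen(url) as resp:
        return json.loads(resp.read())

def process(records: list[dict]) -> list[tuple]:
    """Data processing: filter and reshape the ingested records."""
    return [
        (r["order_id"], r["amount"])
        for r in records
        if r.get("status") == "completed"   # filtering step
    ]

def load(rows: list[tuple], db_path: str = "pipeline.db") -> None:
    """Data destination: store results where analysis tools can read them."""
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS orders (order_id TEXT, amount REAL)")
    con.executemany("INSERT INTO orders VALUES (?, ?)", rows)
    con.commit()
    con.close()

if __name__ == "__main__":
    load(process(ingest(EXAMPLE_URL)))
```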
INDUSTRIAL VS HOBBY REQUIREMENTS

Industrial: require high scalability, compliance, security, and automation.
Hobby: focus on cost-efficiency, simplicity, and easy deployment.
ARCHITECTURAL OPTIONS

Batch Processing Pipeline
Use case: large-scale ETL jobs for historical data
Common tools: Apache Spark, Hadoop
Example: nightly ETL job that updates a data warehouse
Hobby consideration: use lightweight tools like Python scripts and SQLite (see the sketch below)

Streaming Data Pipeline
Use case: real-time analytics and event-driven architectures
Common tools: Apache Kafka, Spark Streaming
Example: processing real-time sensor data from IoT devices
Hobby consideration: use MQTT and a lightweight framework
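As a sketch of the hobby-scale batch option, the script below aggregates one day's CSV export and appends the totals to a SQLite "warehouse" table. The file name pattern, column names, and table are assumptions for illustration only.

```python
# A hobby-scale version of the nightly batch job: read one day's CSV export
# and update an aggregate table in SQLite. File and column names are assumed.
import csv
import sqlite3
from collections import defaultdict

def nightly_batch(csv_path: str = "sales_2024-01-01.csv",
                  db_path: str = "warehouse.db") -> None:
    # Aggregate the day's sales per product before loading.
    totals = defaultdict(float)
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            totals[row["product"]] += float(row["amount"])

    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS daily_sales (day TEXT, product TEXT, total REAL)")
    day = csv_path.removesuffix(".csv").split("_")[-1]  # date taken from the file name
    con.executemany(
        "INSERT INTO daily_sales VALUES (?, ?, ?)",
        [(day, product, total) for product, total in totals.items()],
    )
    con.commit()
    con.close()
```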
KEY ENVIRONMENTAL CONSIDERATIONS

Cloud vs. On-Premises: cloud offers scalability, while on-premises gives control.
Scalability Needs: consider auto-scaling and distributed computing.
Security & Compliance: data encryption, access control, regulatory requirements.
STEPS TO BUILD A DATA PIPELINE

THE GOAL AND THE DESIGN (ARCHITECTURE)
The foundation of a successful data pipeline is a clear understanding of its goal and of the architectural framework that supports it. Two actions for that:

1. Identify the primary goal of your data pipeline, such as automating reporting for monthly sales data.
2. Choose the right tools and technologies and design the data models that support your pipeline, such as a star schema (a sketch follows).
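As one illustration of the design step, a minimal star schema for the monthly-sales example could look like this in SQLite. Table and column names are hypothetical and only show the fact-plus-dimensions shape.

```python
# A minimal star schema for the monthly sales example (hypothetical names):
# one fact table referencing two dimension tables, created in SQLite.
import sqlite3

schema = """
CREATE TABLE IF NOT EXISTS dim_date (
    date_id    INTEGER PRIMARY KEY,
    day        TEXT,
    month      TEXT,
    year       INTEGER
);
CREATE TABLE IF NOT EXISTS dim_product (
    product_id INTEGER PRIMARY KEY,
    name       TEXT,
    category   TEXT
);
CREATE TABLE IF NOT EXISTS fact_sales (
    sale_id    INTEGER PRIMARY KEY,
    date_id    INTEGER REFERENCES dim_date(date_id),
    product_id INTEGER REFERENCES dim_product(product_id),
    amount     REAL
);
"""

con = sqlite3.connect("sales.db")
con.executescript(schema)   # creates the fact and dimension tables
con.close()
```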
CHOOSING DATA SOURCES, INGESTING, AND VALIDATING DATA

Create a system for collecting your data from various sources and checking its accuracy. (Actions)

Create connections to your data sources, such as a Customer Relationship Management system or social media platforms (see the sketch below).
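A small ingestion-and-validation sketch, assuming the CRM can export a CSV file with a few required columns; the column names are invented for the example.

```python
# Read a (hypothetical) CRM export in CSV form and keep only rows that
# pass basic accuracy checks; rejected rows are kept for inspection.
import csv

REQUIRED = ("customer_id", "email", "signup_date")  # assumed columns

def ingest_and_validate(path: str) -> tuple[list[dict], list[dict]]:
    valid, rejected = [], []
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            # Validation: every required field must be present and non-empty.
            if all(row.get(col) for col in REQUIRED):
                valid.append(row)
            else:
                rejected.append(row)
    return valid, rejected

# Usage (assuming the export exists on disk):
# good_rows, bad_rows = ingest_and_validate("crm_export.csv")
```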
DESIGNING THE DATA PROCESSING PLAN
After data ingestion, the main focus is processing, which means converting the data into a form that is readable and editable.

Define and apply data transformations such as filtering or aggregating (see the sketch below).
Code and configuration tools can also carry out these transformations.
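A minimal transformation step combining the two operations the slide names, filtering and aggregating. The record fields are illustrative.

```python
# Filter out incomplete records, then aggregate sales amounts per product.
from collections import defaultdict

def transform(records: list[dict]) -> dict[str, float]:
    totals: dict[str, float] = defaultdict(float)
    for r in records:
        if r.get("amount") is None:                # filtering: drop incomplete rows
            continue
        totals[r["product"]] += float(r["amount"])  # aggregation per product
    return dict(totals)

# Usage:
# transform([{"product": "A", "amount": 3.0}, {"product": "A", "amount": 2.5}])
# -> {"A": 5.5}
```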
SETTING THE DATA STORAGE AND ORCHESTRATING THE FLOW OF DATA
Once data is processed and validated, the next important step is determining where the data will be stored and how the data flow will be controlled and maintained efficiently.

1. Choose the relevant storage and design the schemas for the data.
2. Set up the data orchestration, for instance the schedule for data flows, protocols, and dependencies (a scheduling sketch follows).
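The slides do not name an orchestrator; in practice a cron entry or a tool such as Airflow would schedule the runs. The loop below is only a sketch of the idea of a fixed daily schedule with steps run in dependency order.

```python
# A very small orchestration sketch: run the pipeline once per day at a
# fixed hour. run_pipeline() stands in for ingest -> transform -> load.
import time
from datetime import datetime, timedelta

def run_pipeline() -> None:
    # Steps would run in dependency order here.
    print(f"[{datetime.now():%Y-%m-%d %H:%M}] pipeline run")

def schedule_daily(hour: int = 2) -> None:
    while True:
        now = datetime.now()
        nxt = now.replace(hour=hour, minute=0, second=0, microsecond=0)
        if nxt <= now:
            nxt += timedelta(days=1)
        time.sleep((nxt - now).total_seconds())   # wait until the next run
        run_pipeline()

# schedule_daily()  # uncomment to start the loop
```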
ORGANIZE THE DATA PIPELINE AND SET UP THE MONITORING AND MAINTENANCE PROCESS

Assemble the data pipeline so that it operates smoothly, then monitor its routines and maintain the data.

1. Choose the suitable environment (cloud or local).
2. Track pipeline performance, and update and maintain the pipeline (a logging sketch follows).
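A minimal monitoring sketch using the standard logging module: each step logs its duration and row count so failures and slowdowns become visible. The wrapper and step names are placeholders.

```python
# Log run duration and row counts for each pipeline step.
import logging
import time

logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("pipeline")

def monitored_run(step_name, func, *args):
    start = time.monotonic()
    try:
        result = func(*args)
        rows = len(result) if hasattr(result, "__len__") else "?"
        log.info("%s succeeded in %.1fs (%s rows)",
                 step_name, time.monotonic() - start, rows)
        return result
    except Exception:
        log.exception("%s failed", step_name)   # alerting could hook in here
        raise
```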
PLANNING THE CONSUMPTION LAYER FOR DATA

The final step is to consider how the processed data will be consumed and used.

Set up the delivery process through which the data will be made available to analytics tools (see the sketch below).
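One simple consumption layer, assuming the SQLite table from the earlier ingestion sketch: export the processed results as a CSV that BI or analytics tools can pick up. Paths and table names are assumptions.

```python
# Expose the processed table as a CSV export for analytics tools.
import csv
import sqlite3

def export_for_analytics(db_path: str = "pipeline.db",
                         out_path: str = "sales_report.csv") -> None:
    con = sqlite3.connect(db_path)
    rows = con.execute("SELECT order_id, amount FROM orders").fetchall()
    con.close()
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["order_id", "amount"])   # header row for the BI tool
        writer.writerows(rows)
```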
TOOLS AND THEIR PROS & CONS

Google Pub/Sub - Event messaging
Pros: scalable, managed
Cons: Google Cloud only

Python Scripts - Simple file-based ingestion
Pros: easy for small-scale projects
Cons: not scalable

SQLite - Lightweight database
Pros: easy setup, local storage
Cons: not for distributed use

Apache Spark - Batch & real-time processing
Pros: fast, scalable
Cons: high memory usage

Apache Kafka - Event-driven streaming
Pros: scalable, real-time
Cons: complex setup
GUIDELINES
✅ Modularity – Keep components independent for flexibility.
✅ Scalability – Choose tools that support growing data needs.
✅ Observability – Implement logging, monitoring, and alerting.
✅ Security Best Practices – Encrypt data in transit & at rest, role-based access control.
✅ Error Handling & Retry Mechanisms – Implement auto-retries for failures (see the sketch below).
✅ Cost Awareness – Optimize costs for cloud usage, tools, and storage.
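A small retry sketch for the error-handling guideline: retry a flaky step a few times with exponential backoff before giving up. The wrapper name is illustrative.

```python
# Retry a callable with exponential backoff: 1s, 2s, 4s, ...
import time

def with_retries(func, attempts: int = 3, base_delay: float = 1.0):
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == attempts:
                raise                                   # exhausted: surface the error
            time.sleep(base_delay * 2 ** (attempt - 1))  # back off before retrying

# Usage, e.g. with the ingest() function from the earlier sketch:
# with_retries(lambda: ingest(EXAMPLE_URL))
```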
WHAT TO AVOID
❌ Tightly Coupled Systems – Hard to scale and maintain.
❌ Ignoring Data Quality – Always clean and validate data.
❌ Hardcoding Dependencies – Use environment variables/config management (see the sketch below).
❌ Neglecting Cost Optimization – Optimize storage, processing frequency, and cloud resources.
❌ Overengineering for Hobby Projects – Keep it simple and manageable.
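A tiny example of avoiding hardcoded dependencies: read connection details from environment variables with safe defaults. The variable names are invented for illustration.

```python
# Configuration via environment variables instead of hardcoded literals.
import os

DB_PATH = os.environ.get("PIPELINE_DB_PATH", "pipeline.db")
SOURCE_URL = os.environ.get("PIPELINE_SOURCE_URL", "https://example.com/api/orders")

# The rest of the pipeline imports these settings instead of literal strings,
# so switching environments requires no code change, e.g.:
#   PIPELINE_DB_PATH=/data/prod.db python run_pipeline.py
```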
REFERENCE
How to Build a Data Pipeline in 6 Steps
https://www.ascend.io/blog/how-to-build-a-data-pipeline-in-six-steps/
