CH 05 Data Engineering

Module 5 covers data engineering, focusing on the design, building, and management of infrastructure for data collection, storage, and analysis. It distinguishes between data engineers, data scientists, and data analysts, detailing their roles and responsibilities. The module also discusses data ingestion techniques, storage solutions like data lakes and warehouses, and best practices for managing data quality and governance.


MODULE 5

Data Engineering

Compiled by Dr. Rohini Temkar


Module 5 - Contents

5.1 • Introduction to Data Engineering, Data Ingestion: Techniques and Best Practices, Data Storage and Management: Data Lakes, Data Warehouses, Data Processing Pipelines.

5.2 • Lambda Architecture, Batch Processing, Stream Processing, Data Quality and Governance


Data engineering
• Data engineering is a discipline focused on designing, building, and managing the infrastructure required to collect, store, and analyze large volumes of data.

• It enables organizations to transform raw data into useful insights and is the backbone of data science, machine learning, and business intelligence.


Data engineering
• Data engineering is a set of operations to make data available and usable to data scientists, data analysts, business intelligence (BI) developers, and other specialists within an organization.

• It takes dedicated experts – data engineers – to design and build systems for gathering and storing data at scale, as well as preparing it for further analysis.


Data Scientist Vs Data Engineer Vs Data Analyst

Data Engineer
■ A data engineer is a professional who prepares and manages big data that is then analyzed by data analysts and scientists.

■ They are responsible for designing, building, integrating, and maintaining data from several sources, thus designing the infrastructure of the data that is collected in the database.
Data Scientist
■ Data Scientists make use of Machine Learning, Deep Learning techniques, and Inferential Modeling to find correlations between data and create predictive models, on the basis of which they can develop recommendation systems useful for the business.
Data Analyst
■ Data Analyst is responsible for:

■ screening and cleaning/polishing of the raw data collected;
■ data preparation;
■ understanding of business metrics and problems;
■ visualization of data through reports and graphs;
■ identification of trends and useful suggestions to aid in strategic business decisions.
Roles and Responsibilities

Data Engineering Process

Data Engineering Process
■ The data engineering process covers a sequence of tasks that turn a large amount of raw data into a practical product meeting the needs of analysts, data scientists, machine learning engineers, and others.
Data Engineering Process
Data ingestion (acquisition) moves data from multiple sources — SQL and NoSQL databases, IoT devices, websites, streaming services, etc. — to a target system to be transformed for further analysis.

Data comes in various forms and can be both structured and unstructured.
Data Engineering Process
■ Data transformation adjusts disparate data to the needs of end users. It involves removing errors and duplicates from data, normalizing it, and converting it into the needed format.

■ Data serving delivers transformed data to end users — a BI platform, dashboard, or data science team.
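A minimal sketch of the transformation and serving steps using pandas; the file names and columns (order_id, amount, country) are assumptions for illustration only.

```python
# Transform a raw extract and serve the curated result for BI consumption.
# Assumes pandas and pyarrow are installed; names and paths are illustrative.
import pandas as pd

raw = pd.read_csv("raw_orders.csv")                      # data as ingested

clean = (
    raw.drop_duplicates(subset="order_id")               # remove duplicate records
       .dropna(subset=["order_id", "amount"])            # drop rows missing key fields
       .assign(
           country=lambda df: df["country"].str.strip().str.upper(),   # normalize text
           amount=lambda df: pd.to_numeric(df["amount"], errors="coerce"),
       )
       .dropna(subset=["amount"])                        # drop rows that failed conversion
)

# "Serving": write the curated data where a BI tool or dashboard can read it.
clean.to_parquet("curated_orders.parquet", index=False)
```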
Data Ingestion Techniques
1) Batch Data Ingestion:

● Involves collecting large amounts of raw data from various sources into one place and then processing it later.

● Data is collected and processed in intervals (e.g., hourly, daily).

● Use Cases: Suitable for historical data processing, reporting, and use cases where real-time analysis is not required.

● Tools: Apache Sqoop, AWS Glue, Google Dataflow, Talend.
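A minimal batch-ingestion sketch; the landing/archive directories, file-naming pattern, and daily schedule are assumptions for illustration, not any specific tool's behaviour.

```python
# Collect the previous day's raw files from a landing directory into one batch
# folder for later processing. A scheduler (cron, Airflow, etc.) would run this
# once per interval.
import glob
import shutil
from datetime import date, timedelta
from pathlib import Path

def run_daily_batch(landing_dir: str, archive_dir: str) -> list[str]:
    """Gather yesterday's files in one place; a loader would then process them."""
    day = (date.today() - timedelta(days=1)).isoformat()
    batch_dir = Path(archive_dir) / day
    batch_dir.mkdir(parents=True, exist_ok=True)

    collected = []
    for src in glob.glob(f"{landing_dir}/*_{day}.csv"):   # e.g. orders_2024-01-31.csv
        shutil.move(src, str(batch_dir / Path(src).name))
        collected.append(src)
    return collected

# Example invocation (paths are hypothetical):
# run_daily_batch("/data/landing", "/data/batches")
```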
Data Ingestion Techniques
2) Real-Time (Stream) Data Ingestion:

● Involves streaming data into a data warehouse in real time, often using cloud-based systems that can ingest the data quickly, store it in the cloud, and then release it to users almost immediately.

● Use Cases: Real-time analytics, fraud detection, IoT sensor data monitoring, and alerting.

● Tools: Apache Kafka, Apache Flink, Amazon Kinesis, Apache NiFi, Google Pub/Sub.
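A minimal stream-ingestion sketch, assuming the kafka-python client is installed and a Kafka broker is reachable at localhost:9092; the topic name and event fields are made up.

```python
# Push each event as it happens instead of waiting for a batch window.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda event: json.dumps(event).encode("utf-8"),
)

event = {"sensor_id": "s-17", "temperature": 21.4, "ts": "2024-01-31T10:15:00Z"}
producer.send("iot-readings", value=event)   # "iot-readings" is an example topic
producer.flush()                             # ensure the event has left the client buffer
```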
Data Ingestion Techniques
3) Lambda Architecture (Hybrid):

● Combines batch and real-time processing to get the benefits of both techniques. The real-time layer provides immediate data processing, while the batch layer ensures data accuracy and completeness by processing larger volumes periodically.

● Use Cases: When both real-time insights and historical data analysis are required (e.g., in recommendation systems, social media analytics, and fraud detection).

● Tools: Hadoop (batch processing) + Apache Kafka (stream processing).
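A conceptual sketch of the serving side of a Lambda architecture: a query merges the accurate, periodically recomputed batch view with the fresh real-time view. The in-memory dictionaries, user IDs, and counts are stand-ins for whatever stores the two layers actually use.

```python
# Batch layer gives completeness; the speed (real-time) layer fills in recent data.
batch_view = {"user_42": 120, "user_7": 35}      # page views up to the last batch run
realtime_view = {"user_42": 3, "user_99": 1}     # page views since the last batch run

def query_page_views(user: str) -> int:
    """Merge the batch view and the real-time view at query time."""
    return batch_view.get(user, 0) + realtime_view.get(user, 0)

print(query_page_views("user_42"))  # 123: historical plus fresh counts
```

The batch view is periodically rebuilt from the full history, at which point the real-time view is reset, which is how the architecture keeps long-term accuracy while still answering with up-to-date numbers.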
Best Practices for Data Ingestion
Understand the Data Sources:
● Identify all data sources, their structure (structured, semi-structured,
or unstructured), and their frequency of updates.
● Ensure the ability to handle various types of data (e.g., relational
databases, IoT devices, logs, APIs).

Data Schema Management:

● Ensure schema consistency across datasets. When ingesting data, it is important to account for evolving schemas (e.g., adding new fields) without breaking the system.

● Use schema registries for real-time data, together with schema-based serialization formats such as Apache Avro, to enforce structure during ingestion.
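A minimal sketch of enforcing structure during ingestion with an Avro schema, assuming the fastavro package; the record type, field names, and values are illustrative.

```python
# Validate incoming records against an Avro schema before accepting them.
from fastavro import parse_schema
from fastavro.validation import validate

schema = parse_schema({
    "type": "record",
    "name": "Order",
    "fields": [
        {"name": "order_id", "type": "string"},
        {"name": "amount",   "type": "double"},
        # A field added later: making it nullable with a default keeps old producers compatible.
        {"name": "channel",  "type": ["null", "string"], "default": None},
    ],
})

record = {"order_id": "o-1001", "amount": 49.9, "channel": None}
validate(record, schema)   # raises if the record violates the declared structure
```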
Best Practices for Data Ingestion
Data Validation and Cleansing:
● Apply checks to ensure the quality and validity of incoming data.
Common issues such as missing values, duplicates, or incorrect data
formats should be addressed during ingestion.
● Tools like Apache NiFi and Talend can help automate validation and transformation during ingestion (see the validation sketch below).

Scalability:

● Use scalable solutions that can handle data volume growth over time,
especially with increasing data sources and higher data velocity.
● Consider using cloud-based storage solutions (e.g., Amazon S3,
Google Cloud Storage) for dynamic scaling capabilities.

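For the Data Validation and Cleansing practice above, a minimal pandas sketch; the column names and rules are assumptions for illustration, not a specific tool's API.

```python
# Reject an incoming batch if it has missing keys, duplicates, or malformed amounts.
import pandas as pd

def validate_batch(df: pd.DataFrame) -> pd.DataFrame:
    issues = []
    if df["order_id"].isna().any():
        issues.append("missing order_id values")
    if df.duplicated(subset="order_id").any():
        issues.append("duplicate order_id values")
    amounts = pd.to_numeric(df["amount"], errors="coerce")   # format check
    if amounts.isna().any() or (amounts < 0).any():
        issues.append("malformed or negative amounts")
    if issues:
        raise ValueError("rejected batch: " + "; ".join(issues))
    return df.assign(amount=amounts)
```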
Best Practices for Data Ingestion

Data Deduplication:
● Duplicate data can distort analytics and increase storage costs. Ensure that your ingestion system includes mechanisms to identify and remove duplicate records (see the deduplication sketch below).

Optimize Throughput and Latency:

● For real-time ingestion, reduce latency by using efficient, low-latency transport layers like Apache Kafka or AWS Kinesis.

● In batch ingestion, ensure throughput is maximized by tuning data transfer rates and scheduling ingestion during off-peak hours to optimize system performance.
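For the Data Deduplication practice above, a minimal in-memory sketch; the record shape is an assumption, and a production system would bound this state (e.g., with a TTL, a Bloom filter, or a keyed store).

```python
# Drop records whose ID has already been seen during ingestion.
seen_ids: set[str] = set()

def ingest(record: dict) -> bool:
    """Return True if the record was accepted, False if it was a duplicate."""
    record_id = record["order_id"]
    if record_id in seen_ids:
        return False          # duplicate: skip to avoid skewed analytics and extra storage
    seen_ids.add(record_id)
    # ... hand the record to the rest of the pipeline here ...
    return True

assert ingest({"order_id": "o-1", "amount": 10.0}) is True
assert ingest({"order_id": "o-1", "amount": 10.0}) is False   # second copy is dropped
```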
Best Practices for Data Ingestion

Data Compression and Serialization:


● To optimize storage and transmission, use efficient serialization formats (e.g., Parquet, Avro, ORC) and compress data where possible. These formats are especially useful for handling large datasets efficiently (see the Parquet sketch below).

Error Handling and Monitoring:

● Implement proper logging, error handling, and retry mechanisms for failed ingestion attempts. Tools like Datadog and Prometheus can help monitor the ingestion process.

● Ensure that you have alerting systems in place in case of data pipeline failures or bottlenecks.
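For the Data Compression and Serialization practice above, a minimal sketch that writes a columnar, compressed file, assuming pandas and pyarrow are installed; the data and file name are illustrative.

```python
# Serialize a small dataset to Parquet with compression.
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({
    "order_id": ["o-1", "o-2", "o-3"],
    "amount":   [10.0, 25.5, 7.25],
})

table = pa.Table.from_pandas(df)
# Snappy is a common default; gzip or zstd trade more CPU for smaller files.
pq.write_table(table, "orders.parquet", compression="snappy")
```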
Best Practices for Data Ingestion

Secure Data Transfers:


● Ensure encryption during data transit using HTTPS or other
secure protocols. Additionally, secure access to data sources by
enforcing authentication and access controls.
● For sensitive data, ensure compliance with regulations such as
GDPR or HIPAA by masking or anonymizing sensitive fields.

Data Partitioning and Load Balancing:

● Partition large datasets to improve ingestion speed and scalability. Tools like Apache Kafka allow for partitioned topic structures for distributed ingestion.

● Load-balance the ingestion workloads across multiple nodes or systems to avoid bottlenecks.
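For the Data Partitioning point above, a minimal sketch of key-based partition assignment; it mimics the idea behind partitioned Kafka topics rather than Kafka's actual partitioner, and the key field and partition count are assumptions.

```python
# Records with the same key always land in the same partition, so separate
# workers can ingest separate partitions in parallel without a single bottleneck.
import hashlib

NUM_PARTITIONS = 4

def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

partitions: dict[int, list[dict]] = {p: [] for p in range(NUM_PARTITIONS)}
for record in [{"customer": "alice"}, {"customer": "bob"}, {"customer": "alice"}]:
    partitions[partition_for(record["customer"])].append(record)
# Each partition's records can now be handled by a different node or worker.
```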
Best Practices for Data Ingestion

Incremental Ingestion:
● Rather than ingesting entire datasets repeatedly, use
techniques that ingest only the new or updated data (delta
loads). This is particularly useful for batch ingestion and
significantly reduces resource usage.
Metadata Management:
● Maintain clear metadata around the ingestion process, such
as data source details, ingestion timestamps, and
transformations applied. This makes the ingestion pipeline
more transparent and easier to troubleshoot.

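A minimal delta-load sketch covering the incremental-ingestion and metadata points above; the SQLite source, orders table, updated_at column, and state-file name are all assumptions for illustration.

```python
# Pull only rows updated since the last recorded watermark, and keep basic
# ingestion metadata (source, timestamps, row counts) alongside it.
import json
import sqlite3
from datetime import datetime, timezone
from pathlib import Path

STATE_FILE = Path("ingestion_state.json")

def load_new_rows(db_path: str) -> list[tuple]:
    state = json.loads(STATE_FILE.read_text()) if STATE_FILE.exists() else {}
    watermark = state.get("last_updated_at", "1970-01-01T00:00:00")

    with sqlite3.connect(db_path) as conn:
        rows = conn.execute(
            "SELECT id, amount, updated_at FROM orders "
            "WHERE updated_at > ? ORDER BY updated_at",
            (watermark,),
        ).fetchall()

    if rows:
        state["last_updated_at"] = rows[-1][2]              # advance the watermark
    state["last_run"] = datetime.now(timezone.utc).isoformat()
    state["source"] = db_path                               # metadata: where the data came from
    state["rows_ingested"] = len(rows)
    STATE_FILE.write_text(json.dumps(state, indent=2))      # metadata makes reruns traceable
    return rows
```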
Data Storage Management
■ Data storage is an essential component of any data
architecture, and two of the most common solutions for
storing and managing large volumes of data are data
lakes and data warehouses.

■ Although both serve to store data, they differ significantly in terms of structure, use cases, and functionality.
Data lake
■ A data lake is a centralized repository that allows you to store vast
amounts of raw data in its original format, whether structured,
semi-structured, or unstructured.
■ The idea behind a data lake is to provide a flexible environment for
storing data without requiring upfront structuring or processing.

Data lake
■ A data lake uses the ELT approach and starts data loading immediately after extracting it, handling raw — often unstructured — data (see the loading sketch below).

■ A data lake is worth building in those projects that will scale and
need a more advanced architecture.

■ Besides, it’s very convenient when the purpose of the data hasn’t
been determined yet. In this case, you can load data quickly, store
it, and modify it as necessary.

■ Data lakes are also a powerful tool for data scientists and ML engineers, who use the raw data to prepare it for predictive analytics and machine learning.
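A minimal sketch of the ELT-style load described above: raw events are written to a date-partitioned lake path as-is, with no upfront structuring, and transformed later once their purpose is known. The directory layout and event shape are assumptions.

```python
# Land raw events in the data lake in their original form.
import json
from datetime import date
from pathlib import Path
from uuid import uuid4

def load_raw(event: dict, lake_root: str = "datalake/raw/events") -> Path:
    partition = Path(lake_root) / f"ingest_date={date.today().isoformat()}"
    partition.mkdir(parents=True, exist_ok=True)
    target = partition / f"{uuid4()}.json"
    target.write_text(json.dumps(event))     # stored as-is; structure is applied later
    return target

load_raw({"clickstream": {"page": "/pricing", "user": "u-9"}, "source": "web"})
```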
Data Warehouse
■ A data warehouse is a highly structured, centralized repository
designed for storing processed and structured data, usually to
support reporting, business intelligence (BI), and analytics.
■ It is designed to optimize query performance for large datasets
and supports OLAP (Online Analytical Processing) workloads.

Data Warehouse
OLAP and OLAP cubes
■ OLAP or Online Analytical Processing refers to the computing
approach allowing users to analyze multidimensional data.
■ It’s contrasted with OLTP or Online Transactional Processing, a
simpler method of interacting with databases, not designed for
analyzing massive amounts of data from different
perspectives.
■ Traditional databases resemble spreadsheets, using the
two-dimensional structure of rows and columns.
■ However, in OLAP, datasets are presented in multidimensional
structures -- OLAP cubes.
■ Such structures enable efficient processing and advanced
analysis of vast amounts of varied data.
■ For example, a sales department report would include such
dimensions as product, region, sales representative, sales
amount, month, and so on.
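A minimal OLAP-style sketch with pandas, aggregating a small sales table across the product, region, and month dimensions mentioned above; the data is made up for illustration.

```python
# Build a small "cube": sales amount summed for every product x region combination per month.
import pandas as pd

sales = pd.DataFrame({
    "product": ["laptop", "laptop", "phone", "phone"],
    "region":  ["north",  "south",  "north", "south"],
    "month":   ["Jan",    "Jan",    "Feb",   "Feb"],
    "amount":  [1200,     800,      600,     650],
})

cube = sales.pivot_table(
    index=["product", "region"],
    columns="month",
    values="amount",
    aggfunc="sum",
    fill_value=0,
    margins=True,          # adds roll-up totals, similar to drilling up a dimension
)
print(cube)
```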
